Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev...
author Linus Torvalds <torvalds@linux-foundation.org>
Wed, 28 Jun 2023 23:43:10 +0000 (16:43 -0700)
committer Linus Torvalds <torvalds@linux-foundation.org>
Wed, 28 Jun 2023 23:43:10 +0000 (16:43 -0700)
Pull networking changes from Jakub Kicinski:
 "WiFi 7 and sendpage changes are the biggest pieces of work for this
  release. The latter will definitely require fixes but I think that we
  got it to a reasonable point.

  Core:

   - Rework the sendpage & splice implementations

     Instead of feeding data into sockets page by page, extend sendmsg
     handlers to support taking a reference on the data, controlled by a
     new flag called MSG_SPLICE_PAGES

     Rework the handling of unexpected-end-of-file to invoke an
     additional callback instead of trying to predict what the right
     combination of MORE/NOTLAST flags is

     Remove the MSG_SENDPAGE_NOTLAST flag completely

   - Implement SCM_PIDFD, a new CMSG type analogous to SCM_CREDENTIALS,
     but carrying a pidfd instead of a plain pid (a small usage sketch
     appears at the end of this list)

   - Enable socket busy polling with CONFIG_PREEMPT_RT

   - Improve reliability and efficiency of reporting for ref_tracker

   - Auto-generate a user space C library for various Netlink families
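
   A minimal user space sketch of the new SCM_PIDFD flow over an AF_UNIX
   socket, assuming 6.5-era headers; the fallback constants below mirror
   the asm-generic uapi values and should be double-checked locally:

      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <sys/uio.h>
      #include <unistd.h>

      #ifndef SO_PASSPIDFD
      #define SO_PASSPIDFD 76      /* assumed asm-generic value */
      #endif
      #ifndef SCM_PIDFD
      #define SCM_PIDFD 0x04       /* assumed linux/socket.h value */
      #endif

      /* Receive one message and pick up the sender's pidfd, if any. */
      int recv_with_pidfd(int sock)
      {
          union {
              char buf[CMSG_SPACE(sizeof(int))];
              struct cmsghdr align;
          } ctrl;
          char data[128];
          struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
          struct msghdr msg = {
              .msg_iov = &iov, .msg_iovlen = 1,
              .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
          };
          struct cmsghdr *cmsg;
          int on = 1, pidfd = -1;

          /* Ask the kernel to attach a pidfd of the sender to messages. */
          if (setsockopt(sock, SOL_SOCKET, SO_PASSPIDFD, &on, sizeof(on)))
              return -1;
          if (recvmsg(sock, &msg, 0) < 0)
              return -1;

          for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
               cmsg = CMSG_NXTHDR(&msg, cmsg)) {
              if (cmsg->cmsg_level == SOL_SOCKET &&
                  cmsg->cmsg_type == SCM_PIDFD) {
                  memcpy(&pidfd, CMSG_DATA(cmsg), sizeof(pidfd));
                  printf("sender pidfd: %d\n", pidfd);
                  close(pidfd);    /* usable with the pidfd_* syscalls */
              }
          }
          return 0;
      }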

  Protocols:

   - Allow TCP to shrink the advertised window when necessary, prevent
     sk_rcvbuf auto-tuning from growing the window all the way up to
     tcp_rmem[2]

   - Use per-VMA locking for "page-flipping" TCP receive zerocopy

   - Prepare TCP for device-to-device data transfers, by making sure
     that payloads are always attached to skbs as page frags

   - Make the backoff time for the first N TCP SYN retransmissions
     linear; exponential backoff is unnecessarily conservative (a rough
     worked comparison appears at the end of this list)

   - Create a new MPTCP getsockopt to retrieve all info
     (MPTCP_FULL_INFO)

   - Avoid waking up applications using TLS sockets until we have a full
     record

   - Allow using kernel memory for protocol ioctl callbacks, paving the
     way to issuing ioctls over io_uring

   - Add nolocalbypass option to VxLAN, forcing packets to be fully
     encapsulated even if they are destined for a local IP address

   - Make TCPv4 use a consistent hash in TIME_WAIT and SYN_RECV. Ensure
     in-kernel ECMP implementations (e.g. Open vSwitch) select the same
     link for all packets of a flow. Support L4 symmetric hashing in
     Open vSwitch

   - PPPoE: make number of hash bits configurable

   - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
     (ipconfig)

   - Add layer 2 miss indication and filtering, allowing higher layers
     (e.g. ACL filters) to make forwarding decisions based on whether
     packet matched forwarding state in lower devices (bridge)

   - Support matching on Connectivity Fault Management (CFM) packets

   - Hide the "link becomes ready" IPv6 messages by demoting their
     printk level to debug

   - HSR: don't enable promiscuous mode if device offloads the proto

   - Support active scanning in IEEE 802.15.4

   - Continue work on Multi-Link Operation for WiFi 7
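
   A rough illustration of why the linear-first policy is less
   conservative; this is a toy model, not the kernel's exact retransmit
   schedule, and it assumes a 1 second initial RTO with the first 4
   retransmissions kept linear (the presumed default of the new
   tcp_syn_linear_timeouts sysctl):

      #include <stdio.h>

      int main(void)
      {
          const int retries = 6, linear_n = 4;
          double cum_exp = 0, cum_lin = 0, rto_exp = 1, rto_lin = 1;

          for (int i = 1; i <= retries; i++) {
              cum_exp += rto_exp;
              rto_exp *= 2;            /* classic: always double */

              cum_lin += rto_lin;
              if (i >= linear_n)
                  rto_lin *= 2;        /* new: double only after N */

              printf("SYN retransmit %d: %3.0fs vs %3.0fs\n",
                     i, cum_exp, cum_lin);
          }
          return 0;
      }

   Under this model the sixth retransmission goes out after roughly 10
   seconds instead of 63.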

  BPF:

   - Add precision propagation for subprogs and callbacks. This allows
     maintaining verification efficiency when subprograms are used, or
     in fact passing the verifier at all for complex programs,
     especially those using open-coded iterators

   - Improve BPF's {g,s}etsockopt() length handling. Previously BPF
     assumed the length was always equal to the amount of data written,
     but some protocols allow passing a NULL buffer to discover how
     large the output buffer *should* be, without writing anything

   - Accept dynptr memory as memory arguments passed to helpers

   - Add routing table ID to bpf_fib_lookup BPF helper

   - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands

   - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
     maps as read-only)

   - Show target_{obj,btf}_id in tracing link fdinfo

   - Addition of several new kfuncs (most of the names are
     self-explanatory):
      - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
        bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
        and bpf_dynptr_clone() (a short sketch appears at the end of
        this list)
      - bpf_task_under_cgroup()
      - bpf_sock_destroy() - force closing sockets
      - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
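
   A minimal BPF-side sketch using a few of the kfuncs above; the extern
   prototypes are assumptions about the 6.5 signatures (programs normally
   pull them from the selftests' bpf_kfuncs.h):

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      /* Assumed prototypes for the new dynptr kfuncs. */
      extern __u32 bpf_dynptr_size(const struct bpf_dynptr *p) __ksym;
      extern int bpf_dynptr_adjust(const struct bpf_dynptr *p,
                                   __u32 start, __u32 end) __ksym;
      extern int bpf_dynptr_clone(const struct bpf_dynptr *p,
                                  struct bpf_dynptr *clone) __ksym;

      struct {
          __uint(type, BPF_MAP_TYPE_RINGBUF);
          __uint(max_entries, 4096);
      } rb SEC(".maps");

      SEC("tp/syscalls/sys_enter_getpid")
      int dynptr_demo(void *ctx)
      {
          struct bpf_dynptr rec, view;

          /* Reserve 64 bytes of ringbuf space, exposed as a dynptr. */
          if (bpf_ringbuf_reserve_dynptr(&rb, 64, 0, &rec))
              goto out;

          /* Clone the dynptr and narrow the clone to bytes [16, 48)
           * without copying; the view is then 32 bytes long. */
          if (!bpf_dynptr_clone(&rec, &view) &&
              !bpf_dynptr_adjust(&view, 16, 48))
              bpf_printk("view size: %u", bpf_dynptr_size(&view));
      out:
          bpf_ringbuf_discard_dynptr(&rec, 0);
          return 0;
      }

      char _license[] SEC("license") = "GPL";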

  Netfilter:

   - Relax set/map validation checks in nf_tables. Allow checking
     presence of an entry in a map without using the value

   - Increase ip_vs_conn_tab_bits range for 64BIT builds

   - Allow updating size of a set

   - Improve NAT tuple selection when connection is closing

  Driver API:

   - Integrate netdev with LED subsystem, to allow configuring HW
     "offloaded" blinking of LEDs based on link state and activity
     (i.e. packets coming in and out)

   - Support configuring rate selection pins of SFP modules

   - Factor Clause 73 auto-negotiation code out of the drivers, provide
     common helper routines

   - Add more fool-proof helpers for managing lifetime of MDIO devices
     associated with the PCS layer

   - Allow drivers to report advanced statistics related to Time Aware
     scheduler offload (taprio)

   - Allow opting out of VF statistics in link dump, to allow more VFs
     to fit into the message

   - Split devlink instance and devlink port operations

  New hardware / drivers:

   - Ethernet:
      - Synopsys EMAC4 IP support (stmmac)
      - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
      - Marvell 88E6250 7 port switches
      - Microchip LAN8650/1 Rev.B0 PHYs
      - MediaTek MT7981/MT7988 built-in 1GE PHY driver

   - WiFi:
      - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
      - Realtek RTL8723DS (SDIO variant)
      - Realtek RTL8851BE

   - CAN:
      - Fintek F81604

  Drivers:

   - Ethernet NICs:
      - Intel (100G, ice):
         - support dynamic interrupt allocation
         - use metadata match instead of VF MAC addr on the slow path
      - nVidia/Mellanox:
         - extend link aggregation to handle 4, rather than just 2 ports
         - spawn sub-functions without any features by default
      - OcteonTX2:
         - support HTB (Tx scheduling/QoS) offload
         - make RSS hash generation configurable
         - support selecting Rx queue using TC filters
      - Wangxun (ngbe/txgbe):
         - add basic Tx/Rx packet offloads
         - add phylink support (SFP/PCS control)
      - Freescale/NXP (enetc):
         - report TAPRIO packet statistics
      - Solarflare/AMD:
         - support matching on IP ToS and UDP source port of outer
           header
         - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
         - add devlink dev info support for EF10

   - Virtual NICs:
      - Microsoft vNIC:
         - size the Rx indirection table based on requested
           configuration
         - support VLAN tagging
      - Amazon vNIC:
         - try to reuse Rx buffers if not fully consumed, useful for ARM
           servers running with 16kB pages
      - Google vNIC:
         - support TCP segmentation of >64kB frames

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - enable USXGMII (88E6191X)
      - Microchip:
         - lan966x: add support for Egress Stage 0 ACL engine
         - lan966x: support mapping packet priority to internal switch
           priority (based on PCP or DSCP)

   - Ethernet PHYs:
      - Broadcom PHYs:
         - support for Wake-on-LAN for BCM54210E/B50212E
         - report LPI counter
      - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
      - Micrel PHYs: receive timestamp in the frame (LAN8841)
      - Realtek PHYs: support optional external PHY clock
      - Altera TSE PCS: merge the driver into the Lynx PCS driver, of
        which it is a variant

   - CAN: Kvaser PCIEcan:
      - support packet timestamping

   - WiFi:
      - Intel (iwlwifi):
         - major update for new firmware and Multi-Link Operation (MLO)
         - configuration rework to drop test devices and split the
           different families
         - support for segmented PNVM images and power tables
         - new vendor entries for PPAG (platform antenna gain) feature
      - Qualcomm 802.11ax (ath11k):
         - Multiple Basic Service Set Identifier (MBSSID) and Enhanced
           MBSSID Advertisement (EMA) support in AP mode
         - support factory test mode
      - RealTek (rtw89):
         - add RSSI based antenna diversity
         - support U-NII-4 channels on 5 GHz band
      - RealTek (rtl8xxxu):
         - AP mode support for 8188f
         - support USB RX aggregation for the newer chips"

* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
  net: scm: introduce and use scm_recv_unix helper
  af_unix: Skip SCM_PIDFD if scm->pid is NULL.
  net: lan743x: Simplify comparison
  netlink: Add __sock_i_ino() for __netlink_diag_dump().
  net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
  Revert "af_unix: Call scm_recv() only after scm_set_cred()."
  phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
  libceph: Partially revert changes to support MSG_SPLICE_PAGES
  net: phy: mscc: fix packet loss due to RGMII delays
  net: mana: use vmalloc_array and vcalloc
  net: enetc: use vmalloc_array and vcalloc
  ionic: use vmalloc_array and vcalloc
  pds_core: use vmalloc_array and vcalloc
  gve: use vmalloc_array and vcalloc
  octeon_ep: use vmalloc_array and vcalloc
  net: usb: qmi_wwan: add u-blox 0x1312 composition
  perf trace: fix MSG_SPLICE_PAGES build error
  ipvlan: Fix return value of ipvlan_queue_xmit()
  netfilter: nf_tables: fix underflow in chain reference counter
  netfilter: nf_tables: unbind non-anonymous set if rule construction fails
  ...

2346 files changed:
.gitattributes
.mailmap
CREDITS
Documentation/ABI/testing/sysfs-devices-system-cpu
Documentation/RCU/Design/Requirements/Requirements.rst
Documentation/RCU/whatisRCU.rst
Documentation/admin-guide/bcache.rst
Documentation/admin-guide/cgroup-v1/memory.rst
Documentation/admin-guide/cgroup-v2.rst
Documentation/admin-guide/kernel-parameters.txt
Documentation/admin-guide/mm/damon/start.rst
Documentation/admin-guide/mm/damon/usage.rst
Documentation/admin-guide/perf/hisi-pmu.rst
Documentation/admin-guide/sysctl/kernel.rst
Documentation/arch/arm/arm.rst [moved from Documentation/arm/arm.rst with 100% similarity]
Documentation/arch/arm/booting.rst [moved from Documentation/arm/booting.rst with 100% similarity]
Documentation/arch/arm/cluster-pm-race-avoidance.rst [moved from Documentation/arm/cluster-pm-race-avoidance.rst with 100% similarity]
Documentation/arch/arm/features.rst [moved from Documentation/arm/features.rst with 100% similarity]
Documentation/arch/arm/firmware.rst [moved from Documentation/arm/firmware.rst with 100% similarity]
Documentation/arch/arm/google/chromebook-boot-flow.rst [moved from Documentation/arm/google/chromebook-boot-flow.rst with 100% similarity]
Documentation/arch/arm/index.rst [moved from Documentation/arm/index.rst with 100% similarity]
Documentation/arch/arm/interrupts.rst [moved from Documentation/arm/interrupts.rst with 100% similarity]
Documentation/arch/arm/ixp4xx.rst [moved from Documentation/arm/ixp4xx.rst with 100% similarity]
Documentation/arch/arm/kernel_mode_neon.rst [moved from Documentation/arm/kernel_mode_neon.rst with 100% similarity]
Documentation/arch/arm/kernel_user_helpers.rst [moved from Documentation/arm/kernel_user_helpers.rst with 100% similarity]
Documentation/arch/arm/keystone/knav-qmss.rst [moved from Documentation/arm/keystone/knav-qmss.rst with 100% similarity]
Documentation/arch/arm/keystone/overview.rst [moved from Documentation/arm/keystone/overview.rst with 100% similarity]
Documentation/arch/arm/marvell.rst [moved from Documentation/arm/marvell.rst with 100% similarity]
Documentation/arch/arm/mem_alignment.rst [moved from Documentation/arm/mem_alignment.rst with 100% similarity]
Documentation/arch/arm/memory.rst [moved from Documentation/arm/memory.rst with 100% similarity]
Documentation/arch/arm/microchip.rst [moved from Documentation/arm/microchip.rst with 100% similarity]
Documentation/arch/arm/netwinder.rst [moved from Documentation/arm/netwinder.rst with 100% similarity]
Documentation/arch/arm/nwfpe/index.rst [moved from Documentation/arm/nwfpe/index.rst with 100% similarity]
Documentation/arch/arm/nwfpe/netwinder-fpe.rst [moved from Documentation/arm/nwfpe/netwinder-fpe.rst with 100% similarity]
Documentation/arch/arm/nwfpe/notes.rst [moved from Documentation/arm/nwfpe/notes.rst with 100% similarity]
Documentation/arch/arm/nwfpe/nwfpe.rst [moved from Documentation/arm/nwfpe/nwfpe.rst with 100% similarity]
Documentation/arch/arm/nwfpe/todo.rst [moved from Documentation/arm/nwfpe/todo.rst with 100% similarity]
Documentation/arch/arm/omap/dss.rst [moved from Documentation/arm/omap/dss.rst with 100% similarity]
Documentation/arch/arm/omap/index.rst [moved from Documentation/arm/omap/index.rst with 100% similarity]
Documentation/arch/arm/omap/omap.rst [moved from Documentation/arm/omap/omap.rst with 100% similarity]
Documentation/arch/arm/omap/omap_pm.rst [moved from Documentation/arm/omap/omap_pm.rst with 100% similarity]
Documentation/arch/arm/porting.rst [moved from Documentation/arm/porting.rst with 100% similarity]
Documentation/arch/arm/pxa/mfp.rst [moved from Documentation/arm/pxa/mfp.rst with 100% similarity]
Documentation/arch/arm/sa1100/assabet.rst [moved from Documentation/arm/sa1100/assabet.rst with 100% similarity]
Documentation/arch/arm/sa1100/cerf.rst [moved from Documentation/arm/sa1100/cerf.rst with 100% similarity]
Documentation/arch/arm/sa1100/index.rst [moved from Documentation/arm/sa1100/index.rst with 100% similarity]
Documentation/arch/arm/sa1100/lart.rst [moved from Documentation/arm/sa1100/lart.rst with 100% similarity]
Documentation/arch/arm/sa1100/serial_uart.rst [moved from Documentation/arm/sa1100/serial_uart.rst with 100% similarity]
Documentation/arch/arm/samsung/bootloader-interface.rst [moved from Documentation/arm/samsung/bootloader-interface.rst with 100% similarity]
Documentation/arch/arm/samsung/clksrc-change-registers.awk [moved from Documentation/arm/samsung/clksrc-change-registers.awk with 100% similarity]
Documentation/arch/arm/samsung/gpio.rst [moved from Documentation/arm/samsung/gpio.rst with 100% similarity]
Documentation/arch/arm/samsung/index.rst [moved from Documentation/arm/samsung/index.rst with 100% similarity]
Documentation/arch/arm/samsung/overview.rst [moved from Documentation/arm/samsung/overview.rst with 100% similarity]
Documentation/arch/arm/setup.rst [moved from Documentation/arm/setup.rst with 100% similarity]
Documentation/arch/arm/spear/overview.rst [moved from Documentation/arm/spear/overview.rst with 100% similarity]
Documentation/arch/arm/sti/overview.rst [moved from Documentation/arm/sti/overview.rst with 100% similarity]
Documentation/arch/arm/sti/stih407-overview.rst [moved from Documentation/arm/sti/stih407-overview.rst with 100% similarity]
Documentation/arch/arm/sti/stih418-overview.rst [moved from Documentation/arm/sti/stih418-overview.rst with 100% similarity]
Documentation/arch/arm/stm32/overview.rst [moved from Documentation/arm/stm32/overview.rst with 100% similarity]
Documentation/arch/arm/stm32/stm32-dma-mdma-chaining.rst [moved from Documentation/arm/stm32/stm32-dma-mdma-chaining.rst with 100% similarity]
Documentation/arch/arm/stm32/stm32f429-overview.rst [moved from Documentation/arm/stm32/stm32f429-overview.rst with 100% similarity]
Documentation/arch/arm/stm32/stm32f746-overview.rst [moved from Documentation/arm/stm32/stm32f746-overview.rst with 100% similarity]
Documentation/arch/arm/stm32/stm32f769-overview.rst [moved from Documentation/arm/stm32/stm32f769-overview.rst with 100% similarity]
Documentation/arch/arm/stm32/stm32h743-overview.rst [moved from Documentation/arm/stm32/stm32h743-overview.rst with 100% similarity]
Documentation/arch/arm/stm32/stm32h750-overview.rst [moved from Documentation/arm/stm32/stm32h750-overview.rst with 100% similarity]
Documentation/arch/arm/stm32/stm32mp13-overview.rst [moved from Documentation/arm/stm32/stm32mp13-overview.rst with 100% similarity]
Documentation/arch/arm/stm32/stm32mp151-overview.rst [moved from Documentation/arm/stm32/stm32mp151-overview.rst with 100% similarity]
Documentation/arch/arm/stm32/stm32mp157-overview.rst [moved from Documentation/arm/stm32/stm32mp157-overview.rst with 100% similarity]
Documentation/arch/arm/sunxi.rst [moved from Documentation/arm/sunxi.rst with 100% similarity]
Documentation/arch/arm/sunxi/clocks.rst [moved from Documentation/arm/sunxi/clocks.rst with 100% similarity]
Documentation/arch/arm/swp_emulation.rst [moved from Documentation/arm/swp_emulation.rst with 100% similarity]
Documentation/arch/arm/tcm.rst [moved from Documentation/arm/tcm.rst with 100% similarity]
Documentation/arch/arm/uefi.rst [moved from Documentation/arm/uefi.rst with 100% similarity]
Documentation/arch/arm/vfp/release-notes.rst [moved from Documentation/arm/vfp/release-notes.rst with 100% similarity]
Documentation/arch/arm/vlocks.rst [moved from Documentation/arm/vlocks.rst with 100% similarity]
Documentation/arch/arm64/acpi_object_usage.rst [moved from Documentation/arm64/acpi_object_usage.rst with 91% similarity]
Documentation/arch/arm64/amu.rst [moved from Documentation/arm64/amu.rst with 100% similarity]
Documentation/arch/arm64/arm-acpi.rst [moved from Documentation/arm64/arm-acpi.rst with 82% similarity]
Documentation/arch/arm64/asymmetric-32bit.rst [moved from Documentation/arm64/asymmetric-32bit.rst with 100% similarity]
Documentation/arch/arm64/booting.rst [moved from Documentation/arm64/booting.rst with 94% similarity]
Documentation/arch/arm64/cpu-feature-registers.rst [moved from Documentation/arm64/cpu-feature-registers.rst with 99% similarity]
Documentation/arch/arm64/elf_hwcaps.rst [moved from Documentation/arm64/elf_hwcaps.rst with 95% similarity]
Documentation/arch/arm64/features.rst [moved from Documentation/arm64/features.rst with 100% similarity]
Documentation/arch/arm64/hugetlbpage.rst [moved from Documentation/arm64/hugetlbpage.rst with 100% similarity]
Documentation/arch/arm64/index.rst [moved from Documentation/arm64/index.rst with 96% similarity]
Documentation/arch/arm64/kasan-offsets.sh [moved from Documentation/arm64/kasan-offsets.sh with 100% similarity]
Documentation/arch/arm64/kdump.rst [new file with mode: 0644]
Documentation/arch/arm64/legacy_instructions.rst [moved from Documentation/arm64/legacy_instructions.rst with 100% similarity]
Documentation/arch/arm64/memory-tagging-extension.rst [moved from Documentation/arm64/memory-tagging-extension.rst with 99% similarity]
Documentation/arch/arm64/memory.rst [moved from Documentation/arm64/memory.rst with 97% similarity]
Documentation/arch/arm64/perf.rst [moved from Documentation/arm64/perf.rst with 100% similarity]
Documentation/arch/arm64/pointer-authentication.rst [moved from Documentation/arm64/pointer-authentication.rst with 100% similarity]
Documentation/arch/arm64/ptdump.rst [new file with mode: 0644]
Documentation/arch/arm64/silicon-errata.rst [moved from Documentation/arm64/silicon-errata.rst with 98% similarity]
Documentation/arch/arm64/sme.rst [moved from Documentation/arm64/sme.rst with 99% similarity]
Documentation/arch/arm64/sve.rst [moved from Documentation/arm64/sve.rst with 99% similarity]
Documentation/arch/arm64/tagged-address-abi.rst [moved from Documentation/arm64/tagged-address-abi.rst with 99% similarity]
Documentation/arch/arm64/tagged-pointers.rst [moved from Documentation/arm64/tagged-pointers.rst with 98% similarity]
Documentation/arch/index.rst
Documentation/arch/x86/resctrl.rst
Documentation/conf.py
Documentation/core-api/cpu_hotplug.rst
Documentation/core-api/kernel-api.rst
Documentation/core-api/pin_user_pages.rst
Documentation/core-api/this_cpu_ops.rst
Documentation/core-api/workqueue.rst
Documentation/crypto/async-tx-api.rst
Documentation/dev-tools/kasan.rst
Documentation/dev-tools/kselftest.rst
Documentation/dev-tools/kunit/architecture.rst
Documentation/dev-tools/kunit/start.rst
Documentation/dev-tools/kunit/usage.rst
Documentation/devicetree/bindings/arm/xen.txt
Documentation/devicetree/bindings/cpu/idle-states.yaml
Documentation/devicetree/bindings/firmware/qcom,scm.yaml
Documentation/devicetree/bindings/i2c/opencores,i2c-ocores.yaml
Documentation/devicetree/bindings/interrupt-controller/loongson,eiointc.yaml [new file with mode: 0644]
Documentation/devicetree/bindings/memory-controllers/nuvoton,npcm-memory-controller.yaml [new file with mode: 0644]
Documentation/devicetree/bindings/mfd/rockchip,rk806.yaml [new file with mode: 0644]
Documentation/devicetree/bindings/mmc/arm,pl18x.yaml
Documentation/devicetree/bindings/mmc/brcm,bcm2835-sdhost.txt [deleted file]
Documentation/devicetree/bindings/mmc/brcm,bcm2835-sdhost.yaml [new file with mode: 0644]
Documentation/devicetree/bindings/mmc/brcm,kona-sdhci.txt [deleted file]
Documentation/devicetree/bindings/mmc/brcm,kona-sdhci.yaml [new file with mode: 0644]
Documentation/devicetree/bindings/mmc/fsl-imx-esdhc.yaml
Documentation/devicetree/bindings/mmc/sdhci-msm.yaml
Documentation/devicetree/bindings/mtd/allwinner,sun4i-a10-nand.yaml
Documentation/devicetree/bindings/mtd/amlogic,meson-nand.yaml
Documentation/devicetree/bindings/mtd/brcm,brcmnand.yaml
Documentation/devicetree/bindings/mtd/denali,nand.yaml
Documentation/devicetree/bindings/mtd/ingenic,nand.yaml
Documentation/devicetree/bindings/mtd/intel,lgm-ebunand.yaml
Documentation/devicetree/bindings/mtd/marvell,nand-controller.yaml [new file with mode: 0644]
Documentation/devicetree/bindings/mtd/marvell-nand.txt [deleted file]
Documentation/devicetree/bindings/mtd/mediatek,mtk-nfc.yaml
Documentation/devicetree/bindings/mtd/mtd.yaml
Documentation/devicetree/bindings/mtd/nand-controller.yaml
Documentation/devicetree/bindings/mtd/partitions/partition.yaml
Documentation/devicetree/bindings/mtd/partitions/partitions.yaml
Documentation/devicetree/bindings/mtd/qcom,nandc.yaml
Documentation/devicetree/bindings/mtd/raw-nand-chip.yaml [new file with mode: 0644]
Documentation/devicetree/bindings/mtd/rockchip,nand-controller.yaml
Documentation/devicetree/bindings/mtd/st,stm32-fmc2-nand.yaml
Documentation/devicetree/bindings/mtd/ti,am654-hbmc.yaml
Documentation/devicetree/bindings/perf/fsl-imx-ddr.yaml
Documentation/devicetree/bindings/regulator/mt6358-regulator.txt
Documentation/devicetree/bindings/regulator/pfuze100.yaml
Documentation/devicetree/bindings/regulator/pwm-regulator.yaml
Documentation/devicetree/bindings/regulator/renesas,raa215300.yaml [new file with mode: 0644]
Documentation/devicetree/bindings/regulator/ti,tps62870.yaml [new file with mode: 0644]
Documentation/devicetree/bindings/spi/allwinner,sun4i-a10-spi.yaml
Documentation/devicetree/bindings/spi/allwinner,sun6i-a31-spi.yaml
Documentation/devicetree/bindings/spi/atmel,at91rm9200-spi.yaml
Documentation/devicetree/bindings/spi/cdns,qspi-nor.yaml
Documentation/devicetree/bindings/spi/qcom,spi-qcom-qspi.yaml
Documentation/devicetree/bindings/spi/renesas,rzv2m-csi.yaml [new file with mode: 0644]
Documentation/devicetree/bindings/spi/samsung,spi.yaml
Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.yaml
Documentation/devicetree/bindings/spi/socionext,uniphier-spi.yaml
Documentation/devicetree/bindings/spi/spi-controller.yaml
Documentation/devicetree/bindings/spi/spi-zynqmp-qspi.yaml
Documentation/devicetree/bindings/thermal/armada-thermal.txt
Documentation/devicetree/bindings/thermal/brcm,bcm2835-thermal.txt [deleted file]
Documentation/devicetree/bindings/thermal/brcm,bcm2835-thermal.yaml [new file with mode: 0644]
Documentation/devicetree/bindings/thermal/qcom-tsens.yaml
Documentation/devicetree/bindings/timer/brcm,kona-timer.txt [deleted file]
Documentation/devicetree/bindings/timer/brcm,kona-timer.yaml [new file with mode: 0644]
Documentation/devicetree/bindings/timer/loongson,ls1x-pwmtimer.yaml [new file with mode: 0644]
Documentation/devicetree/bindings/timer/ralink,rt2880-timer.yaml [new file with mode: 0644]
Documentation/doc-guide/sphinx.rst
Documentation/driver-api/basics.rst
Documentation/driver-api/edac.rst
Documentation/filesystems/autofs-mount-control.rst
Documentation/filesystems/autofs.rst
Documentation/filesystems/directory-locking.rst
Documentation/filesystems/fsverity.rst
Documentation/maintainer/configure-git.rst
Documentation/mm/damon/design.rst
Documentation/mm/damon/faq.rst
Documentation/mm/damon/maintainer-profile.rst
Documentation/mm/page_migration.rst
Documentation/mm/page_tables.rst
Documentation/mm/split_page_table_lock.rst
Documentation/process/2.Process.rst
Documentation/process/changes.rst
Documentation/process/handling-regressions.rst
Documentation/process/maintainer-tip.rst
Documentation/process/submitting-patches.rst
Documentation/rust/quick-start.rst
Documentation/scheduler/sched-deadline.rst
Documentation/subsystem-apis.rst
Documentation/translations/zh_CN/arch/arm/Booting [moved from Documentation/translations/zh_CN/arm/Booting with 98% similarity]
Documentation/translations/zh_CN/arch/arm/kernel_user_helpers.txt [moved from Documentation/translations/zh_CN/arm/kernel_user_helpers.txt with 98% similarity]
Documentation/translations/zh_CN/arch/arm64/amu.rst [moved from Documentation/translations/zh_CN/arm64/amu.rst with 97% similarity]
Documentation/translations/zh_CN/arch/arm64/booting.txt [moved from Documentation/translations/zh_CN/arm64/booting.txt with 98% similarity]
Documentation/translations/zh_CN/arch/arm64/elf_hwcaps.rst [moved from Documentation/translations/zh_CN/arm64/elf_hwcaps.rst with 94% similarity]
Documentation/translations/zh_CN/arch/arm64/hugetlbpage.rst [moved from Documentation/translations/zh_CN/arm64/hugetlbpage.rst with 91% similarity]
Documentation/translations/zh_CN/arch/arm64/index.rst [moved from Documentation/translations/zh_CN/arm64/index.rst with 63% similarity]
Documentation/translations/zh_CN/arch/arm64/legacy_instructions.txt [moved from Documentation/translations/zh_CN/arm64/legacy_instructions.txt with 95% similarity]
Documentation/translations/zh_CN/arch/arm64/memory.txt [moved from Documentation/translations/zh_CN/arm64/memory.txt with 97% similarity]
Documentation/translations/zh_CN/arch/arm64/perf.rst [moved from Documentation/translations/zh_CN/arm64/perf.rst with 96% similarity]
Documentation/translations/zh_CN/arch/arm64/silicon-errata.txt [moved from Documentation/translations/zh_CN/arm64/silicon-errata.txt with 97% similarity]
Documentation/translations/zh_CN/arch/arm64/tagged-pointers.txt [moved from Documentation/translations/zh_CN/arm64/tagged-pointers.txt with 94% similarity]
Documentation/translations/zh_CN/arch/index.rst
Documentation/translations/zh_CN/mm/page_migration.rst
Documentation/translations/zh_TW/arch/arm64/amu.rst [moved from Documentation/translations/zh_TW/arm64/amu.rst with 97% similarity]
Documentation/translations/zh_TW/arch/arm64/booting.txt [moved from Documentation/translations/zh_TW/arm64/booting.txt with 98% similarity]
Documentation/translations/zh_TW/arch/arm64/elf_hwcaps.rst [moved from Documentation/translations/zh_TW/arm64/elf_hwcaps.rst with 94% similarity]
Documentation/translations/zh_TW/arch/arm64/hugetlbpage.rst [moved from Documentation/translations/zh_TW/arm64/hugetlbpage.rst with 91% similarity]
Documentation/translations/zh_TW/arch/arm64/index.rst [moved from Documentation/translations/zh_TW/arm64/index.rst with 71% similarity]
Documentation/translations/zh_TW/arch/arm64/legacy_instructions.txt [moved from Documentation/translations/zh_TW/arm64/legacy_instructions.txt with 96% similarity]
Documentation/translations/zh_TW/arch/arm64/memory.txt [moved from Documentation/translations/zh_TW/arm64/memory.txt with 97% similarity]
Documentation/translations/zh_TW/arch/arm64/perf.rst [moved from Documentation/translations/zh_TW/arm64/perf.rst with 96% similarity]
Documentation/translations/zh_TW/arch/arm64/silicon-errata.txt [moved from Documentation/translations/zh_TW/arm64/silicon-errata.txt with 97% similarity]
Documentation/translations/zh_TW/arch/arm64/tagged-pointers.txt [moved from Documentation/translations/zh_TW/arm64/tagged-pointers.txt with 95% similarity]
Documentation/translations/zh_TW/index.rst
Documentation/virt/guest-halt-polling.rst
Documentation/virt/kvm/api.rst
Documentation/virt/kvm/halt-polling.rst
Documentation/virt/kvm/locking.rst
Documentation/virt/kvm/ppc-pv.rst
Documentation/virt/kvm/vcpu-requests.rst
Documentation/virt/paravirt_ops.rst
MAINTAINERS
Makefile
arch/Kconfig
arch/alpha/include/asm/atomic.h
arch/alpha/include/asm/bugs.h [deleted file]
arch/alpha/kernel/osf_sys.c
arch/alpha/kernel/setup.c
arch/alpha/kernel/syscalls/syscall.tbl
arch/arc/include/asm/atomic-spinlock.h
arch/arc/include/asm/atomic.h
arch/arc/include/asm/atomic64-arcv2.h
arch/arm/Kconfig
arch/arm/boot/compressed/atags_to_fdt.c
arch/arm/boot/compressed/fdt_check_mem_start.c
arch/arm/boot/compressed/misc.c
arch/arm/boot/compressed/misc.h
arch/arm/common/mcpm_entry.c
arch/arm/common/mcpm_head.S
arch/arm/common/vlock.S
arch/arm/include/asm/assembler.h
arch/arm/include/asm/atomic.h
arch/arm/include/asm/bugs.h
arch/arm/include/asm/ftrace.h
arch/arm/include/asm/irq.h
arch/arm/include/asm/mach/arch.h
arch/arm/include/asm/page.h
arch/arm/include/asm/ptrace.h
arch/arm/include/asm/setup.h
arch/arm/include/asm/signal.h
arch/arm/include/asm/smp.h
arch/arm/include/asm/spectre.h
arch/arm/include/asm/suspend.h
arch/arm/include/asm/sync_bitops.h
arch/arm/include/asm/syscalls.h [new file with mode: 0644]
arch/arm/include/asm/tcm.h
arch/arm/include/asm/traps.h
arch/arm/include/asm/unwind.h
arch/arm/include/asm/vdso.h
arch/arm/include/asm/vfp.h
arch/arm/include/uapi/asm/setup.h
arch/arm/kernel/atags_parse.c
arch/arm/kernel/bugs.c
arch/arm/kernel/entry-armv.S
arch/arm/kernel/fiq.c
arch/arm/kernel/head-inflate-data.c
arch/arm/kernel/head.h [new file with mode: 0644]
arch/arm/kernel/module.c
arch/arm/kernel/setup.c
arch/arm/kernel/signal.c
arch/arm/kernel/smp.c
arch/arm/kernel/sys_arm.c
arch/arm/kernel/sys_oabi-compat.c
arch/arm/kernel/traps.c
arch/arm/kernel/vdso.c
arch/arm/lib/bitops.h
arch/arm/lib/testchangebit.S
arch/arm/lib/testclearbit.S
arch/arm/lib/testsetbit.S
arch/arm/lib/uaccess_with_memcpy.c
arch/arm/mach-exynos/common.h
arch/arm/mach-mxs/mach-mxs.c
arch/arm/mach-omap1/board-ams-delta.c
arch/arm/mach-omap1/board-nokia770.c
arch/arm/mach-omap1/board-osk.c
arch/arm/mach-omap1/board-palmte.c
arch/arm/mach-omap1/board-sx1.c
arch/arm/mach-omap1/irq.c
arch/arm/mach-pxa/gumstix.c
arch/arm/mach-pxa/pxa25x.c
arch/arm/mach-pxa/pxa27x.c
arch/arm/mach-pxa/spitz.c
arch/arm/mach-sti/Kconfig
arch/arm/mm/Kconfig
arch/arm/mm/dma-mapping.c
arch/arm/mm/fault-armv.c
arch/arm/mm/fault.c
arch/arm/mm/fault.h
arch/arm/mm/flush.c
arch/arm/mm/mmu.c
arch/arm/mm/nommu.c
arch/arm/mm/tcm.h [deleted file]
arch/arm/probes/kprobes/checkers-common.c
arch/arm/probes/kprobes/core.c
arch/arm/probes/kprobes/opt-arm.c
arch/arm/probes/kprobes/test-core.c
arch/arm/probes/kprobes/test-core.h
arch/arm/tools/mach-types
arch/arm/tools/syscall.tbl
arch/arm/vdso/vgettimeofday.c
arch/arm/vfp/vfpmodule.c
arch/arm64/Kconfig
arch/arm64/boot/dts/qcom/sc7180-idp.dts
arch/arm64/boot/dts/qcom/sc7180-trogdor.dtsi
arch/arm64/boot/dts/qcom/sc7180.dtsi
arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
arch/arm64/boot/dts/qcom/sc7280.dtsi
arch/arm64/boot/dts/rockchip/rk3308.dtsi
arch/arm64/boot/dts/rockchip/rk3328-rock64.dts
arch/arm64/boot/dts/rockchip/rk3328.dtsi
arch/arm64/boot/dts/rockchip/rk3566-soquartz-cm4.dts
arch/arm64/boot/dts/rockchip/rk3566-soquartz.dtsi
arch/arm64/boot/dts/rockchip/rk3568-nanopi-r5c.dts
arch/arm64/boot/dts/rockchip/rk3568-nanopi-r5s.dts
arch/arm64/boot/dts/rockchip/rk3568.dtsi
arch/arm64/boot/dts/rockchip/rk356x.dtsi
arch/arm64/boot/dts/rockchip/rk3588s.dtsi
arch/arm64/include/asm/alternative-macros.h
arch/arm64/include/asm/alternative.h
arch/arm64/include/asm/arch_timer.h
arch/arm64/include/asm/archrandom.h
arch/arm64/include/asm/asm-uaccess.h
arch/arm64/include/asm/atomic.h
arch/arm64/include/asm/atomic_ll_sc.h
arch/arm64/include/asm/atomic_lse.h
arch/arm64/include/asm/cache.h
arch/arm64/include/asm/cmpxchg.h
arch/arm64/include/asm/compat.h
arch/arm64/include/asm/cpu.h
arch/arm64/include/asm/cpufeature.h
arch/arm64/include/asm/efi.h
arch/arm64/include/asm/el2_setup.h
arch/arm64/include/asm/esr.h
arch/arm64/include/asm/exception.h
arch/arm64/include/asm/hw_breakpoint.h
arch/arm64/include/asm/hwcap.h
arch/arm64/include/asm/image.h
arch/arm64/include/asm/io.h
arch/arm64/include/asm/irqflags.h
arch/arm64/include/asm/kernel-pgtable.h
arch/arm64/include/asm/kvm_arm.h
arch/arm64/include/asm/kvm_asm.h
arch/arm64/include/asm/kvm_host.h
arch/arm64/include/asm/lse.h
arch/arm64/include/asm/memory.h
arch/arm64/include/asm/mmu_context.h
arch/arm64/include/asm/module.h
arch/arm64/include/asm/module.lds.h
arch/arm64/include/asm/percpu.h
arch/arm64/include/asm/pgtable-hwdef.h
arch/arm64/include/asm/pgtable-prot.h
arch/arm64/include/asm/scs.h
arch/arm64/include/asm/smp.h
arch/arm64/include/asm/spectre.h
arch/arm64/include/asm/syscall_wrapper.h
arch/arm64/include/asm/sysreg.h
arch/arm64/include/asm/thread_info.h
arch/arm64/include/asm/traps.h
arch/arm64/include/asm/uaccess.h
arch/arm64/include/asm/unistd.h
arch/arm64/include/asm/unistd32.h
arch/arm64/include/uapi/asm/hwcap.h
arch/arm64/include/uapi/asm/sigcontext.h
arch/arm64/kernel/Makefile
arch/arm64/kernel/alternative.c
arch/arm64/kernel/cpufeature.c
arch/arm64/kernel/cpuidle.c
arch/arm64/kernel/cpuinfo.c
arch/arm64/kernel/entry-common.c
arch/arm64/kernel/entry.S
arch/arm64/kernel/fpsimd.c
arch/arm64/kernel/ftrace.c
arch/arm64/kernel/head.S
arch/arm64/kernel/hibernate.c
arch/arm64/kernel/hw_breakpoint.c
arch/arm64/kernel/hyp-stub.S
arch/arm64/kernel/idreg-override.c
arch/arm64/kernel/kaslr.c
arch/arm64/kernel/kexec_image.c
arch/arm64/kernel/kuser32.S
arch/arm64/kernel/module-plts.c
arch/arm64/kernel/module.c
arch/arm64/kernel/mte.c
arch/arm64/kernel/setup.c
arch/arm64/kernel/signal.c
arch/arm64/kernel/smp.c
arch/arm64/kernel/syscall.c
arch/arm64/kernel/traps.c
arch/arm64/kernel/watchdog_hld.c [new file with mode: 0644]
arch/arm64/kvm/debug.c
arch/arm64/kvm/hyp/include/hyp/switch.h
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
arch/arm64/kvm/hyp/nvhe/debug-sr.c
arch/arm64/kvm/sys_regs.c
arch/arm64/lib/xor-neon.c
arch/arm64/mm/context.c
arch/arm64/mm/fault.c
arch/arm64/mm/flush.c
arch/arm64/mm/hugetlbpage.c
arch/arm64/mm/init.c
arch/arm64/mm/kasan_init.c
arch/arm64/mm/mmu.c
arch/arm64/mm/proc.S
arch/arm64/tools/cpucaps
arch/arm64/tools/gen-cpucaps.awk
arch/arm64/tools/sysreg
arch/csky/Kconfig
arch/csky/include/asm/atomic.h
arch/csky/include/asm/smp.h
arch/csky/kernel/smp.c
arch/hexagon/include/asm/atomic.h
arch/hexagon/kernel/setup.c
arch/ia64/Kconfig
arch/ia64/include/asm/atomic.h
arch/ia64/include/asm/bugs.h [deleted file]
arch/ia64/kernel/setup.c
arch/ia64/kernel/syscalls/syscall.tbl
arch/ia64/mm/hugetlbpage.c
arch/loongarch/Kconfig
arch/loongarch/include/asm/atomic.h
arch/loongarch/include/asm/bugs.h [deleted file]
arch/loongarch/include/asm/loongarch.h
arch/loongarch/kernel/setup.c
arch/loongarch/kernel/time.c
arch/m68k/Kconfig
arch/m68k/configs/amiga_defconfig
arch/m68k/configs/apollo_defconfig
arch/m68k/configs/atari_defconfig
arch/m68k/configs/bvme6000_defconfig
arch/m68k/configs/hp300_defconfig
arch/m68k/configs/mac_defconfig
arch/m68k/configs/multi_defconfig
arch/m68k/configs/mvme147_defconfig
arch/m68k/configs/mvme16x_defconfig
arch/m68k/configs/q40_defconfig
arch/m68k/configs/sun3_defconfig
arch/m68k/configs/sun3x_defconfig
arch/m68k/configs/virt_defconfig
arch/m68k/include/asm/atomic.h
arch/m68k/include/asm/bugs.h [deleted file]
arch/m68k/include/asm/mmu_context.h
arch/m68k/kernel/setup_mm.c
arch/m68k/kernel/sys_m68k.c
arch/m68k/kernel/syscalls/syscall.tbl
arch/m68k/mm/mcfmmu.c
arch/microblaze/include/asm/cache.h
arch/microblaze/include/asm/page.h
arch/microblaze/include/asm/setup.h
arch/microblaze/kernel/prom.c
arch/microblaze/kernel/signal.c
arch/microblaze/kernel/syscalls/syscall.tbl
arch/mips/Kconfig
arch/mips/bmips/setup.c
arch/mips/cavium-octeon/smp.c
arch/mips/include/asm/atomic.h
arch/mips/include/asm/bugs.h
arch/mips/include/asm/fw/cfe/cfe_api.h
arch/mips/include/asm/irq.h
arch/mips/include/asm/mach-loongson32/loongson1.h
arch/mips/include/asm/mach-loongson32/regs-pwm.h [deleted file]
arch/mips/include/asm/smp-ops.h
arch/mips/kernel/setup.c
arch/mips/kernel/smp-bmips.c
arch/mips/kernel/smp-cps.c
arch/mips/kernel/smp.c
arch/mips/kernel/syscalls/syscall_n32.tbl
arch/mips/kernel/syscalls/syscall_n64.tbl
arch/mips/kernel/syscalls/syscall_o32.tbl
arch/mips/loongson32/Kconfig
arch/mips/loongson32/common/time.c
arch/mips/loongson64/smp.c
arch/mips/mm/tlb-r4k.c
arch/nios2/kernel/cpuinfo.c
arch/nios2/kernel/setup.c
arch/openrisc/include/asm/atomic.h
arch/parisc/Kconfig
arch/parisc/include/asm/atomic.h
arch/parisc/include/asm/bugs.h [deleted file]
arch/parisc/include/asm/pgtable.h
arch/parisc/kernel/cache.c
arch/parisc/kernel/pci-dma.c
arch/parisc/kernel/process.c
arch/parisc/kernel/smp.c
arch/parisc/kernel/syscalls/syscall.tbl
arch/parisc/mm/hugetlbpage.c
arch/powerpc/Kconfig
arch/powerpc/include/asm/atomic.h
arch/powerpc/include/asm/bugs.h [deleted file]
arch/powerpc/include/asm/cache.h
arch/powerpc/include/asm/irq.h
arch/powerpc/include/asm/nmi.h
arch/powerpc/include/asm/page_32.h
arch/powerpc/include/asm/pgtable.h
arch/powerpc/kernel/smp.c
arch/powerpc/kernel/syscalls/syscall.tbl
arch/powerpc/kernel/tau_6xx.c
arch/powerpc/kernel/watchdog.c
arch/powerpc/kvm/book3s_64_mmu_radix.c
arch/powerpc/mm/book3s64/hash_tlb.c
arch/powerpc/mm/book3s64/iommu_api.c
arch/powerpc/mm/book3s64/subpage_prot.c
arch/powerpc/mm/hugetlbpage.c
arch/powerpc/platforms/powermac/setup.c
arch/powerpc/platforms/pseries/dlpar.c
arch/powerpc/platforms/pseries/mobility.c
arch/powerpc/xmon/xmon.c
arch/riscv/Kconfig
arch/riscv/include/asm/atomic.h
arch/riscv/include/asm/irq.h
arch/riscv/include/asm/smp.h
arch/riscv/include/asm/timex.h
arch/riscv/kernel/cpu-hotplug.c
arch/riscv/mm/hugetlbpage.c
arch/riscv/purgatory/Makefile
arch/s390/Kconfig
arch/s390/boot/vmem.c
arch/s390/configs/debug_defconfig
arch/s390/configs/defconfig
arch/s390/crypto/paes_s390.c
arch/s390/include/asm/asm-prototypes.h
arch/s390/include/asm/cmpxchg.h
arch/s390/include/asm/cpacf.h
arch/s390/include/asm/cpu_mf.h
arch/s390/include/asm/os_info.h
arch/s390/include/asm/percpu.h
arch/s390/include/asm/pgtable.h
arch/s390/include/asm/physmem_info.h
arch/s390/include/asm/pkey.h
arch/s390/include/asm/thread_info.h
arch/s390/include/asm/timex.h
arch/s390/include/uapi/asm/pkey.h
arch/s390/kernel/crash_dump.c
arch/s390/kernel/entry.h
arch/s390/kernel/ipl.c
arch/s390/kernel/module.c
arch/s390/kernel/perf_cpum_cf.c
arch/s390/kernel/perf_cpum_sf.c
arch/s390/kernel/perf_pai_crypto.c
arch/s390/kernel/perf_pai_ext.c
arch/s390/kernel/syscalls/syscall.tbl
arch/s390/kernel/time.c
arch/s390/kernel/uv.c
arch/s390/kvm/interrupt.c
arch/s390/lib/Makefile
arch/s390/lib/tishift.S [new file with mode: 0644]
arch/s390/mm/gmap.c
arch/s390/mm/pageattr.c
arch/s390/mm/pgtable.c
arch/s390/mm/vmem.c
arch/s390/purgatory/Makefile
arch/sh/Kconfig
arch/sh/drivers/dma/dma-api.c
arch/sh/include/asm/atomic-grb.h
arch/sh/include/asm/atomic-irq.h
arch/sh/include/asm/atomic-llsc.h
arch/sh/include/asm/atomic.h
arch/sh/include/asm/bugs.h [deleted file]
arch/sh/include/asm/cache.h
arch/sh/include/asm/irq.h
arch/sh/include/asm/page.h
arch/sh/include/asm/processor.h
arch/sh/include/asm/rtc.h
arch/sh/include/asm/thread_info.h
arch/sh/kernel/idle.c
arch/sh/kernel/setup.c
arch/sh/kernel/syscalls/syscall.tbl
arch/sh/mm/hugetlbpage.c
arch/sparc/Kconfig
arch/sparc/Kconfig.debug
arch/sparc/include/asm/atomic_32.h
arch/sparc/include/asm/atomic_64.h
arch/sparc/include/asm/bugs.h [deleted file]
arch/sparc/include/asm/irq_32.h
arch/sparc/include/asm/irq_64.h
arch/sparc/include/asm/nmi.h
arch/sparc/include/asm/timer_64.h
arch/sparc/kernel/ioport.c
arch/sparc/kernel/kernel.h
arch/sparc/kernel/nmi.c
arch/sparc/kernel/setup_32.c
arch/sparc/kernel/setup_64.c
arch/sparc/kernel/signal32.c
arch/sparc/kernel/syscalls/syscall.tbl
arch/sparc/mm/fault_64.c
arch/sparc/mm/hugetlbpage.c
arch/sparc/mm/io-unit.c
arch/sparc/mm/iommu.c
arch/sparc/mm/tlb.c
arch/sparc/prom/bootstr_32.c
arch/um/Kconfig
arch/um/Makefile
arch/um/drivers/ubd_kern.c
arch/um/include/asm/bugs.h [deleted file]
arch/um/include/shared/user.h
arch/um/kernel/um_arch.c
arch/um/os-Linux/drivers/tuntap_user.c
arch/x86/Kconfig
arch/x86/Kconfig.cpu
arch/x86/Makefile
arch/x86/Makefile.postlink [new file with mode: 0644]
arch/x86/boot/Makefile
arch/x86/boot/compressed/Makefile
arch/x86/boot/compressed/efi.h
arch/x86/boot/compressed/error.c
arch/x86/boot/compressed/error.h
arch/x86/boot/compressed/kaslr.c
arch/x86/boot/compressed/mem.c [new file with mode: 0644]
arch/x86/boot/compressed/misc.c
arch/x86/boot/compressed/misc.h
arch/x86/boot/compressed/sev.c
arch/x86/boot/compressed/sev.h [new file with mode: 0644]
arch/x86/boot/compressed/tdx-shared.c [new file with mode: 0644]
arch/x86/boot/compressed/tdx.c
arch/x86/boot/cpu.c
arch/x86/coco/core.c
arch/x86/coco/tdx/Makefile
arch/x86/coco/tdx/tdx-shared.c [new file with mode: 0644]
arch/x86/coco/tdx/tdx.c
arch/x86/entry/syscalls/syscall_32.tbl
arch/x86/entry/syscalls/syscall_64.tbl
arch/x86/entry/thunk_64.S
arch/x86/entry/vdso/vgetcpu.c
arch/x86/events/amd/core.c
arch/x86/events/amd/ibs.c
arch/x86/events/intel/core.c
arch/x86/hyperv/ivm.c
arch/x86/include/asm/Kbuild
arch/x86/include/asm/alternative.h
arch/x86/include/asm/apic.h
arch/x86/include/asm/apicdef.h
arch/x86/include/asm/atomic.h
arch/x86/include/asm/atomic64_32.h
arch/x86/include/asm/atomic64_64.h
arch/x86/include/asm/bugs.h
arch/x86/include/asm/cmpxchg.h
arch/x86/include/asm/cmpxchg_32.h
arch/x86/include/asm/cmpxchg_64.h
arch/x86/include/asm/coco.h
arch/x86/include/asm/cpu.h
arch/x86/include/asm/cpufeature.h
arch/x86/include/asm/cpumask.h
arch/x86/include/asm/doublefault.h
arch/x86/include/asm/efi.h
arch/x86/include/asm/fpu/api.h
arch/x86/include/asm/ftrace.h
arch/x86/include/asm/irq.h
arch/x86/include/asm/mce.h
arch/x86/include/asm/mem_encrypt.h
arch/x86/include/asm/mshyperv.h
arch/x86/include/asm/mtrr.h
arch/x86/include/asm/nops.h
arch/x86/include/asm/nospec-branch.h
arch/x86/include/asm/orc_header.h [new file with mode: 0644]
arch/x86/include/asm/percpu.h
arch/x86/include/asm/perf_event.h
arch/x86/include/asm/pgtable.h
arch/x86/include/asm/pgtable_64.h
arch/x86/include/asm/pgtable_types.h
arch/x86/include/asm/processor.h
arch/x86/include/asm/realmode.h
arch/x86/include/asm/sev-common.h
arch/x86/include/asm/sev.h
arch/x86/include/asm/shared/tdx.h
arch/x86/include/asm/sigframe.h
arch/x86/include/asm/smp.h
arch/x86/include/asm/syscall.h
arch/x86/include/asm/tdx.h
arch/x86/include/asm/thread_info.h
arch/x86/include/asm/time.h
arch/x86/include/asm/tlbflush.h
arch/x86/include/asm/topology.h
arch/x86/include/asm/tsc.h
arch/x86/include/asm/unaccepted_memory.h [new file with mode: 0644]
arch/x86/include/asm/unwind_hints.h
arch/x86/include/asm/uv/uv_hub.h
arch/x86/include/asm/uv/uv_mmrs.h
arch/x86/include/asm/vdso/gettimeofday.h
arch/x86/include/asm/x86_init.h
arch/x86/include/uapi/asm/mtrr.h
arch/x86/kernel/acpi/sleep.c
arch/x86/kernel/acpi/sleep.h
arch/x86/kernel/alternative.c
arch/x86/kernel/amd_nb.c
arch/x86/kernel/apic/apic.c
arch/x86/kernel/apic/x2apic_phys.c
arch/x86/kernel/apic/x2apic_uv_x.c
arch/x86/kernel/callthunks.c
arch/x86/kernel/cpu/Makefile
arch/x86/kernel/cpu/bugs.c
arch/x86/kernel/cpu/cacheinfo.c
arch/x86/kernel/cpu/common.c
arch/x86/kernel/cpu/cpu.h
arch/x86/kernel/cpu/mce/amd.c
arch/x86/kernel/cpu/mce/core.c
arch/x86/kernel/cpu/microcode/amd.c
arch/x86/kernel/cpu/mtrr/Makefile
arch/x86/kernel/cpu/mtrr/amd.c
arch/x86/kernel/cpu/mtrr/centaur.c
arch/x86/kernel/cpu/mtrr/cleanup.c
arch/x86/kernel/cpu/mtrr/cyrix.c
arch/x86/kernel/cpu/mtrr/generic.c
arch/x86/kernel/cpu/mtrr/legacy.c [new file with mode: 0644]
arch/x86/kernel/cpu/mtrr/mtrr.c
arch/x86/kernel/cpu/mtrr/mtrr.h
arch/x86/kernel/cpu/resctrl/rdtgroup.c
arch/x86/kernel/cpu/sgx/encl.c
arch/x86/kernel/cpu/sgx/ioctl.c
arch/x86/kernel/doublefault_32.c
arch/x86/kernel/fpu/init.c
arch/x86/kernel/ftrace.c
arch/x86/kernel/head32.c
arch/x86/kernel/head_32.S
arch/x86/kernel/head_64.S
arch/x86/kernel/irq.c
arch/x86/kernel/itmt.c
arch/x86/kernel/kvmclock.c
arch/x86/kernel/ldt.c
arch/x86/kernel/nmi.c
arch/x86/kernel/platform-quirks.c
arch/x86/kernel/process.c
arch/x86/kernel/pvclock.c
arch/x86/kernel/setup.c
arch/x86/kernel/sev-shared.c
arch/x86/kernel/sev.c
arch/x86/kernel/signal.c
arch/x86/kernel/smp.c
arch/x86/kernel/smpboot.c
arch/x86/kernel/topology.c
arch/x86/kernel/tsc.c
arch/x86/kernel/tsc_sync.c
arch/x86/kernel/unwind_orc.c
arch/x86/kernel/vmlinux.lds.S
arch/x86/kernel/x86_init.c
arch/x86/kvm/x86.c
arch/x86/lib/Makefile
arch/x86/lib/cmpxchg16b_emu.S
arch/x86/lib/cmpxchg8b_emu.S
arch/x86/lib/csum-partial_64.c
arch/x86/lib/getuser.S
arch/x86/lib/memmove_64.S
arch/x86/lib/msr.c
arch/x86/lib/putuser.S
arch/x86/lib/retpoline.S
arch/x86/lib/usercopy_64.c
arch/x86/math-emu/fpu_entry.c
arch/x86/mm/highmem_32.c
arch/x86/mm/init_32.c
arch/x86/mm/kaslr.c
arch/x86/mm/mem_encrypt_amd.c
arch/x86/mm/mem_encrypt_identity.c
arch/x86/mm/pat/set_memory.c
arch/x86/mm/pgtable.c
arch/x86/pci/ce4100.c
arch/x86/platform/efi/efi.c
arch/x86/platform/olpc/olpc_dt.c
arch/x86/power/cpu.c
arch/x86/purgatory/Makefile
arch/x86/realmode/init.c
arch/x86/realmode/rm/trampoline_64.S
arch/x86/video/fbdev.c
arch/x86/xen/efi.c
arch/x86/xen/enlighten_hvm.c
arch/x86/xen/enlighten_pv.c
arch/x86/xen/mmu_pv.c
arch/x86/xen/setup.c
arch/x86/xen/smp.h
arch/x86/xen/smp_hvm.c
arch/x86/xen/smp_pv.c
arch/x86/xen/time.c
arch/x86/xen/xen-ops.h
arch/xtensa/Kconfig
arch/xtensa/Kconfig.debug
arch/xtensa/boot/boot-redboot/Makefile
arch/xtensa/include/asm/asm-prototypes.h [new file with mode: 0644]
arch/xtensa/include/asm/asmmacro.h
arch/xtensa/include/asm/atomic.h
arch/xtensa/include/asm/bugs.h [deleted file]
arch/xtensa/include/asm/core.h
arch/xtensa/include/asm/ftrace.h
arch/xtensa/include/asm/platform.h
arch/xtensa/include/asm/string.h
arch/xtensa/include/asm/traps.h
arch/xtensa/kernel/align.S
arch/xtensa/kernel/mcount.S
arch/xtensa/kernel/platform.c
arch/xtensa/kernel/setup.c
arch/xtensa/kernel/stacktrace.c
arch/xtensa/kernel/syscalls/syscall.tbl
arch/xtensa/kernel/time.c
arch/xtensa/kernel/traps.c
arch/xtensa/kernel/xtensa_ksyms.c
arch/xtensa/lib/Makefile
arch/xtensa/lib/ashldi3.S
arch/xtensa/lib/ashrdi3.S
arch/xtensa/lib/bswapdi2.S
arch/xtensa/lib/bswapsi2.S
arch/xtensa/lib/checksum.S
arch/xtensa/lib/divsi3.S
arch/xtensa/lib/lshrdi3.S
arch/xtensa/lib/memcopy.S
arch/xtensa/lib/memset.S
arch/xtensa/lib/modsi3.S
arch/xtensa/lib/mulsi3.S
arch/xtensa/lib/strncpy_user.S
arch/xtensa/lib/strnlen_user.S
arch/xtensa/lib/udivsi3.S
arch/xtensa/lib/umodsi3.S
arch/xtensa/lib/umulsidi3.S
arch/xtensa/lib/usercopy.S
arch/xtensa/mm/kasan_init.c
arch/xtensa/mm/misc.S
arch/xtensa/mm/tlb.c
arch/xtensa/platforms/iss/setup.c
arch/xtensa/platforms/iss/simdisk.c
arch/xtensa/platforms/xt2000/setup.c
arch/xtensa/platforms/xtfpga/setup.c
block/Makefile
block/bdev.c
block/bfq-iosched.c
block/bio.c
block/blk-cgroup-fc-appid.c
block/blk-cgroup.c
block/blk-core.c
block/blk-flush.c
block/blk-ioc.c
block/blk-iocost.c
block/blk-ioprio.c
block/blk-map.c
block/blk-mq-debugfs.c
block/blk-mq-sched.h
block/blk-mq-tag.c
block/blk-mq.c
block/blk-mq.h
block/blk-rq-qos.c
block/blk-wbt.c
block/blk-zoned.c
block/blk.h
block/bsg-lib.c
block/bsg.c
block/disk-events.c
block/early-lookup.c [new file with mode: 0644]
block/elevator.c
block/fops.c
block/genhd.c
block/ioctl.c
block/mq-deadline.c
block/partitions/amiga.c
block/partitions/core.c
drivers/accel/qaic/qaic_data.c
drivers/acpi/acpi_ffh.c
drivers/acpi/acpi_lpss.c
drivers/acpi/acpi_pad.c
drivers/acpi/apei/bert.c
drivers/acpi/apei/ghes.c
drivers/acpi/arm64/Makefile
drivers/acpi/arm64/agdi.c
drivers/acpi/arm64/apmt.c
drivers/acpi/arm64/init.c [new file with mode: 0644]
drivers/acpi/arm64/init.h [new file with mode: 0644]
drivers/acpi/arm64/iort.c
drivers/acpi/bus.c
drivers/acpi/button.c
drivers/acpi/ec.c
drivers/acpi/nfit/nfit.h
drivers/acpi/processor_idle.c
drivers/acpi/resource.c
drivers/acpi/scan.c
drivers/acpi/sleep.c
drivers/acpi/thermal.c
drivers/acpi/tiny-power-button.c
drivers/acpi/video_detect.c
drivers/acpi/x86/s2idle.c
drivers/acpi/x86/utils.c
drivers/auxdisplay/ht16k33.c
drivers/auxdisplay/lcd2s.c
drivers/base/dd.c
drivers/base/devres.c
drivers/base/node.c
drivers/base/power/domain.c
drivers/base/power/wakeup.c
drivers/base/regmap/Makefile
drivers/base/regmap/internal.h
drivers/base/regmap/regcache-maple.c
drivers/base/regmap/regcache.c
drivers/base/regmap/regmap-debugfs.c
drivers/base/regmap/regmap-irq.c
drivers/base/regmap/regmap-kunit.c
drivers/base/regmap/regmap-mmio.c
drivers/base/regmap/regmap-raw-ram.c [new file with mode: 0644]
drivers/base/regmap/regmap.c
drivers/block/amiflop.c
drivers/block/aoe/aoeblk.c
drivers/block/aoe/aoechr.c
drivers/block/ataflop.c
drivers/block/brd.c
drivers/block/drbd/drbd_bitmap.c
drivers/block/drbd/drbd_main.c
drivers/block/drbd/drbd_nl.c
drivers/block/drbd/drbd_receiver.c
drivers/block/floppy.c
drivers/block/loop.c
drivers/block/mtip32xx/mtip32xx.c
drivers/block/nbd.c
drivers/block/pktcdvd.c
drivers/block/rbd.c
drivers/block/rnbd/Makefile
drivers/block/rnbd/rnbd-clt-sysfs.c
drivers/block/rnbd/rnbd-clt.c
drivers/block/rnbd/rnbd-common.c [deleted file]
drivers/block/rnbd/rnbd-proto.h
drivers/block/rnbd/rnbd-srv-sysfs.c
drivers/block/rnbd/rnbd-srv.c
drivers/block/rnbd/rnbd-srv.h
drivers/block/sunvdc.c
drivers/block/swim.c
drivers/block/swim3.c
drivers/block/ublk_drv.c
drivers/block/xen-blkback/xenbus.c
drivers/block/xen-blkfront.c
drivers/block/z2ram.c
drivers/block/zram/zram_drv.c
drivers/cdrom/cdrom.c
drivers/cdrom/gdrom.c
drivers/char/random.c
drivers/clk/Kconfig
drivers/clk/clk-rk808.c
drivers/clk/imx/clk-imx1.c
drivers/clk/imx/clk-imx27.c
drivers/clk/imx/clk-imx31.c
drivers/clk/imx/clk-imx35.c
drivers/clocksource/Kconfig
drivers/clocksource/Makefile
drivers/clocksource/arm_arch_timer.c
drivers/clocksource/hyperv_timer.c
drivers/clocksource/ingenic-timer.c
drivers/clocksource/timer-cadence-ttc.c
drivers/clocksource/timer-imx-gpt.c
drivers/clocksource/timer-loongson1-pwm.c [new file with mode: 0644]
drivers/cpufreq/Kconfig
drivers/cpufreq/Kconfig.x86
drivers/cpufreq/amd-pstate.c
drivers/cpufreq/cpufreq.c
drivers/cpufreq/intel_pstate.c
drivers/cpuidle/cpuidle.c
drivers/cpuidle/poll_state.c
drivers/crypto/allwinner/sun4i-ss/sun4i-ss-cipher.c
drivers/crypto/allwinner/sun4i-ss/sun4i-ss-core.c
drivers/crypto/allwinner/sun4i-ss/sun4i-ss-hash.c
drivers/crypto/allwinner/sun4i-ss/sun4i-ss.h
drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
drivers/crypto/allwinner/sun8i-ce/sun8i-ce-core.c
drivers/crypto/allwinner/sun8i-ce/sun8i-ce-hash.c
drivers/crypto/allwinner/sun8i-ce/sun8i-ce-prng.c
drivers/crypto/allwinner/sun8i-ce/sun8i-ce-trng.c
drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c
drivers/crypto/allwinner/sun8i-ss/sun8i-ss-core.c
drivers/crypto/allwinner/sun8i-ss/sun8i-ss-hash.c
drivers/crypto/allwinner/sun8i-ss/sun8i-ss-prng.c
drivers/crypto/marvell/octeontx2/otx2_cptpf_main.c
drivers/crypto/marvell/octeontx2/otx2_cptvf_main.c
drivers/devfreq/exynos-bus.c
drivers/devfreq/mtk-cci-devfreq.c
drivers/edac/Kconfig
drivers/edac/Makefile
drivers/edac/amd64_edac.c
drivers/edac/amd64_edac.h
drivers/edac/mce_amd.c
drivers/edac/npcm_edac.c [new file with mode: 0644]
drivers/edac/thunderx_edac.c
drivers/firmware/efi/Kconfig
drivers/firmware/efi/Makefile
drivers/firmware/efi/efi.c
drivers/firmware/efi/libstub/Makefile
drivers/firmware/efi/libstub/bitmap.c [new file with mode: 0644]
drivers/firmware/efi/libstub/efistub.h
drivers/firmware/efi/libstub/find.c [new file with mode: 0644]
drivers/firmware/efi/libstub/unaccepted_memory.c [new file with mode: 0644]
drivers/firmware/efi/libstub/x86-stub.c
drivers/firmware/efi/unaccepted_memory.c [new file with mode: 0644]
drivers/firmware/iscsi_ibft_find.c
drivers/gpio/gpio-104-dio-48e.c
drivers/gpio/gpio-sifive.c
drivers/gpio/gpiolib.c
drivers/gpu/drm/amd/amdgpu/atom.c
drivers/gpu/drm/amd/pm/legacy-dpm/legacy_dpm.c
drivers/gpu/drm/display/drm_dp_helper.c
drivers/gpu/drm/display/drm_dp_mst_topology.c
drivers/gpu/drm/drm_gem.c
drivers/gpu/drm/drm_managed.c
drivers/gpu/drm/drm_mipi_dsi.c
drivers/gpu/drm/i2c/tda998x_drv.c
drivers/gpu/drm/i915/gem/i915_gem_shmem.c
drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
drivers/gpu/drm/i915/i915_gpu_error.c
drivers/gpu/drm/mediatek/mtk_hdmi_ddc.c
drivers/gpu/drm/radeon/radeon_atombios.c
drivers/gpu/drm/radeon/radeon_combios.c
drivers/gpu/drm/radeon/radeon_ttm.c
drivers/gpu/drm/rockchip/inno_hdmi.c
drivers/gpu/drm/rockchip/rk3066_hdmi.c
drivers/gpu/drm/sun4i/sun4i_hdmi_i2c.c
drivers/gpu/drm/vmwgfx/vmwgfx_msg_x86.h
drivers/greybus/connection.c
drivers/greybus/svc.c
drivers/hwtracing/coresight/coresight-trbe.c
drivers/hwtracing/coresight/coresight-trbe.h
drivers/i2c/busses/i2c-imx-lpi2c.c
drivers/i2c/busses/i2c-qup.c
drivers/idle/intel_idle.c
drivers/infiniband/hw/qib/qib_user_pages.c
drivers/infiniband/hw/usnic/usnic_uiom.c
drivers/infiniband/sw/rxe/rxe_verbs.c
drivers/infiniband/sw/siw/siw_mem.c
drivers/input/misc/Kconfig
drivers/input/touchscreen/sun4i-ts.c
drivers/iommu/Kconfig
drivers/iommu/amd/amd_iommu_types.h
drivers/iommu/amd/iommu.c
drivers/iommu/dma-iommu.c
drivers/iommu/intel/irq_remapping.c
drivers/iommu/iommu.c
drivers/iommu/iommufd/pages.c
drivers/irqchip/irq-clps711x.c
drivers/irqchip/irq-ftintc010.c
drivers/irqchip/irq-gic-v3-its.c
drivers/irqchip/irq-gic-v3.c
drivers/irqchip/irq-jcore-aic.c
drivers/irqchip/irq-loongson-eiointc.c
drivers/irqchip/irq-loongson-liointc.c
drivers/irqchip/irq-loongson-pch-pic.c
drivers/irqchip/irq-mmp.c
drivers/irqchip/irq-mxs.c
drivers/irqchip/irq-stm32-exti.c
drivers/md/bcache/bcache.h
drivers/md/bcache/btree.c
drivers/md/bcache/btree.h
drivers/md/bcache/request.c
drivers/md/bcache/stats.h
drivers/md/bcache/super.c
drivers/md/bcache/sysfs.c
drivers/md/bcache/sysfs.h
drivers/md/bcache/writeback.c
drivers/md/dm-cache-target.c
drivers/md/dm-clone-target.c
drivers/md/dm-core.h
drivers/md/dm-crypt.c
drivers/md/dm-era-target.c
drivers/md/dm-init.c
drivers/md/dm-integrity.c
drivers/md/dm-ioctl.c
drivers/md/dm-raid.c
drivers/md/dm-snap.c
drivers/md/dm-table.c
drivers/md/dm-thin.c
drivers/md/dm-verity-fec.c
drivers/md/dm-verity-target.c
drivers/md/dm-zoned-metadata.c
drivers/md/dm.c
drivers/md/dm.h
drivers/md/md-autodetect.c
drivers/md/md-bitmap.c
drivers/md/md-bitmap.h
drivers/md/md-cluster.c
drivers/md/md-multipath.c
drivers/md/md.c
drivers/md/md.h
drivers/md/raid1-10.c
drivers/md/raid1.c
drivers/md/raid1.h
drivers/md/raid10.c
drivers/md/raid10.h
drivers/md/raid5-cache.c
drivers/md/raid5-ppl.c
drivers/md/raid5.c
drivers/md/raid5.h
drivers/media/platform/amphion/vpu_core.c
drivers/media/platform/amphion/vpu_v4l2.c
drivers/media/platform/chips-media/coda-common.c
drivers/media/v4l2-core/videobuf-dma-sg.c
drivers/memstick/host/r592.c
drivers/mfd/Kconfig
drivers/mfd/Makefile
drivers/mfd/axp20x-i2c.c
drivers/mfd/axp20x.c
drivers/mfd/rk8xx-core.c [moved from drivers/mfd/rk808.c with 71% similarity]
drivers/mfd/rk8xx-i2c.c [new file with mode: 0644]
drivers/mfd/rk8xx-spi.c [new file with mode: 0644]
drivers/mfd/tps6594-core.c [new file with mode: 0644]
drivers/mfd/tps6594-i2c.c [new file with mode: 0644]
drivers/mfd/tps6594-spi.c [new file with mode: 0644]
drivers/misc/lkdtm/bugs.c
drivers/misc/sgi-gru/grufault.c
drivers/mmc/core/block.c
drivers/mmc/core/card.h
drivers/mmc/core/core.c
drivers/mmc/core/quirks.h
drivers/mmc/core/sd.c
drivers/mmc/host/Kconfig
drivers/mmc/host/cqhci.h
drivers/mmc/host/dw_mmc-bluefield.c
drivers/mmc/host/dw_mmc-k3.c
drivers/mmc/host/dw_mmc-pltfm.c
drivers/mmc/host/dw_mmc-pltfm.h
drivers/mmc/host/dw_mmc-starfive.c
drivers/mmc/host/meson-mx-sdhc-mmc.c
drivers/mmc/host/mmci.c
drivers/mmc/host/mmci.h
drivers/mmc/host/mmci_stm32_sdmmc.c
drivers/mmc/host/mtk-sd.c
drivers/mmc/host/sdhci-msm.c
drivers/mmc/host/sdhci-pci-core.c
drivers/mmc/host/sdhci-pci-gli.c
drivers/mmc/host/sdhci-pci.h
drivers/mmc/host/sdhci.c
drivers/mmc/host/sdhci.h
drivers/most/configfs.c
drivers/mtd/chips/cfi_cmdset_0001.c
drivers/mtd/chips/cfi_cmdset_0002.c
drivers/mtd/chips/cfi_cmdset_0020.c
drivers/mtd/chips/cfi_probe.c
drivers/mtd/chips/cfi_util.c
drivers/mtd/chips/gen_probe.c
drivers/mtd/chips/jedec_probe.c
drivers/mtd/chips/map_ram.c
drivers/mtd/chips/map_rom.c
drivers/mtd/devices/block2mtd.c
drivers/mtd/devices/st_spi_fsm.c
drivers/mtd/maps/pismo.c
drivers/mtd/mtd_blkdevs.c
drivers/mtd/mtdblock.c
drivers/mtd/mtdcore.c
drivers/mtd/mtdpart.c
drivers/mtd/nand/raw/Makefile
drivers/mtd/nand/raw/arasan-nand-controller.c
drivers/mtd/nand/raw/internals.h
drivers/mtd/nand/raw/meson_nand.c
drivers/mtd/nand/raw/nand_ids.c
drivers/mtd/nand/raw/nand_macronix.c
drivers/mtd/nand/raw/nand_sandisk.c [new file with mode: 0644]
drivers/mtd/nand/spi/gigadevice.c
drivers/mtd/nand/spi/macronix.c
drivers/mtd/sm_ftl.c
drivers/mtd/ubi/block.c
drivers/net/ethernet/cavium/thunder/thunder_bgx.c
drivers/net/ethernet/intel/ice/ice_ddp.h
drivers/net/ethernet/marvell/octeontx2/af/rvu.c
drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c
drivers/net/ethernet/mellanox/mlx5/core/thermal.c
drivers/net/wireless/ath/ath10k/qmi.c
drivers/net/wireless/ath/ath11k/qmi.c
drivers/net/wireless/ath/ath12k/qmi.c
drivers/net/wireless/intel/iwlwifi/pcie/trans.c
drivers/net/wireless/marvell/mwifiex/cfg80211.c
drivers/net/wireless/marvell/mwifiex/main.c
drivers/net/wwan/t7xx/t7xx_hif_cldma.c
drivers/net/wwan/t7xx/t7xx_hif_dpmaif_tx.c
drivers/nubus/nubus.c
drivers/nubus/proc.c
drivers/nvme/host/Makefile
drivers/nvme/host/auth.c
drivers/nvme/host/core.c
drivers/nvme/host/fabrics.c
drivers/nvme/host/fabrics.h
drivers/nvme/host/fc.c
drivers/nvme/host/ioctl.c
drivers/nvme/host/multipath.c
drivers/nvme/host/nvme.h
drivers/nvme/host/pci.c
drivers/nvme/host/rdma.c
drivers/nvme/host/sysfs.c [new file with mode: 0644]
drivers/nvme/host/tcp.c
drivers/nvme/target/fabrics-cmd-auth.c
drivers/nvme/target/fcloop.c
drivers/nvme/target/io-cmd-bdev.c
drivers/nvme/target/nvmet.h
drivers/parport/procfs.c
drivers/parport/share.c
drivers/pci/Kconfig
drivers/perf/Kconfig
drivers/perf/Makefile
drivers/perf/apple_m1_cpu_pmu.c
drivers/perf/arm-cci.c
drivers/perf/arm-cmn.c
drivers/perf/arm_cspmu/Kconfig
drivers/perf/arm_cspmu/arm_cspmu.c
drivers/perf/arm_cspmu/arm_cspmu.h
drivers/perf/arm_dmc620_pmu.c
drivers/perf/arm_pmu.c
drivers/perf/arm_pmuv3.c
drivers/perf/fsl_imx9_ddr_perf.c [new file with mode: 0644]
drivers/perf/hisilicon/Makefile
drivers/perf/hisilicon/hisi_pcie_pmu.c
drivers/perf/hisilicon/hisi_uncore_pa_pmu.c
drivers/perf/hisilicon/hisi_uncore_pmu.c
drivers/perf/hisilicon/hisi_uncore_pmu.h
drivers/perf/hisilicon/hisi_uncore_uc_pmu.c [new file with mode: 0644]
drivers/perf/qcom_l2_pmu.c
drivers/pinctrl/Kconfig
drivers/pinctrl/pinctrl-amd.c
drivers/pinctrl/pinctrl-rk805.c
drivers/platform/chrome/cros_ec_i2c.c
drivers/platform/chrome/cros_ec_lpc.c
drivers/platform/chrome/cros_ec_spi.c
drivers/platform/chrome/cros_hps_i2c.c
drivers/platform/chrome/cros_typec_switch.c
drivers/platform/x86/amd/pmc.c
drivers/power/supply/Kconfig
drivers/powercap/Kconfig
drivers/powercap/Makefile
drivers/powercap/intel_rapl_common.c
drivers/powercap/intel_rapl_msr.c
drivers/powercap/intel_rapl_tpmi.c [new file with mode: 0644]
drivers/pwm/pwm-atmel.c
drivers/pwm/pwm-pxa.c
drivers/ras/debugfs.c
drivers/regulator/88pg86x.c
drivers/regulator/Kconfig
drivers/regulator/Makefile
drivers/regulator/act8865-regulator.c
drivers/regulator/ad5398.c
drivers/regulator/axp20x-regulator.c
drivers/regulator/core.c
drivers/regulator/da9121-regulator.c
drivers/regulator/da9210-regulator.c
drivers/regulator/da9211-regulator.c
drivers/regulator/fan53555.c
drivers/regulator/fan53880.c
drivers/regulator/helpers.c
drivers/regulator/isl6271a-regulator.c
drivers/regulator/isl9305.c
drivers/regulator/lp3971.c
drivers/regulator/lp3972.c
drivers/regulator/lp872x.c
drivers/regulator/lp8755.c
drivers/regulator/ltc3589.c
drivers/regulator/ltc3676.c
drivers/regulator/max1586.c
drivers/regulator/max20086-regulator.c
drivers/regulator/max20411-regulator.c
drivers/regulator/max77826-regulator.c
drivers/regulator/max8649.c
drivers/regulator/max8660.c
drivers/regulator/max8893.c
drivers/regulator/max8952.c
drivers/regulator/max8973-regulator.c
drivers/regulator/mcp16502.c
drivers/regulator/mp5416.c
drivers/regulator/mp8859.c
drivers/regulator/mp886x.c
drivers/regulator/mpq7920.c
drivers/regulator/mt6311-regulator.c
drivers/regulator/mt6358-regulator.c
drivers/regulator/pca9450-regulator.c
drivers/regulator/pf8x00-regulator.c
drivers/regulator/pfuze100-regulator.c
drivers/regulator/pv88060-regulator.c
drivers/regulator/pv88080-regulator.c
drivers/regulator/pv88090-regulator.c
drivers/regulator/raa215300.c [new file with mode: 0644]
drivers/regulator/rk808-regulator.c
drivers/regulator/rpi-panel-attiny-regulator.c
drivers/regulator/rt4801-regulator.c
drivers/regulator/rt5190a-regulator.c
drivers/regulator/rt5739.c
drivers/regulator/rt5759-regulator.c
drivers/regulator/rt6160-regulator.c
drivers/regulator/rt6190-regulator.c
drivers/regulator/rt6245-regulator.c
drivers/regulator/rtmv20-regulator.c
drivers/regulator/rtq2134-regulator.c
drivers/regulator/rtq6752-regulator.c
drivers/regulator/slg51000-regulator.c
drivers/regulator/stm32-pwr.c
drivers/regulator/sy8106a-regulator.c
drivers/regulator/sy8824x.c
drivers/regulator/sy8827n.c
drivers/regulator/tps51632-regulator.c
drivers/regulator/tps62360-regulator.c
drivers/regulator/tps6286x-regulator.c
drivers/regulator/tps6287x-regulator.c [new file with mode: 0644]
drivers/regulator/tps65023-regulator.c
drivers/regulator/tps65132-regulator.c
drivers/regulator/tps65219-regulator.c
drivers/regulator/tps6594-regulator.c [new file with mode: 0644]
drivers/rtc/Kconfig
drivers/s390/block/dasd.c
drivers/s390/block/dasd_genhd.c
drivers/s390/block/dasd_int.h
drivers/s390/block/dasd_ioctl.c
drivers/s390/block/dcssblk.c
drivers/s390/char/zcore.c
drivers/s390/cio/vfio_ccw_drv.c
drivers/s390/cio/vfio_ccw_private.h
drivers/s390/crypto/pkey_api.c
drivers/s390/crypto/vfio_ap_ops.c
drivers/s390/crypto/vfio_ap_private.h
drivers/scsi/3w-9xxx.c
drivers/scsi/NCR5380.c
drivers/scsi/aacraid/aachba.c
drivers/scsi/bnx2i/bnx2i_init.c
drivers/scsi/ch.c
drivers/scsi/hptiop.c
drivers/scsi/ibmvscsi/ibmvscsi.c
drivers/scsi/megaraid/megaraid_sas_base.c
drivers/scsi/megaraid/megaraid_sas_fp.c
drivers/scsi/qedi/qedi_main.c
drivers/scsi/scsi_bsg.c
drivers/scsi/scsi_ioctl.c
drivers/scsi/sd.c
drivers/scsi/sg.c
drivers/scsi/smartpqi/smartpqi_init.c
drivers/scsi/sr.c
drivers/scsi/st.c
drivers/soc/qcom/qcom-geni-se.c
drivers/spi/Kconfig
drivers/spi/Makefile
drivers/spi/spi-atmel.c
drivers/spi/spi-cadence-quadspi.c
drivers/spi/spi-cadence.c
drivers/spi/spi-dw-core.c
drivers/spi/spi-dw-dma.c
drivers/spi/spi-dw-mmio.c
drivers/spi/spi-dw.h
drivers/spi/spi-fsl-lpspi.c
drivers/spi/spi-geni-qcom.c
drivers/spi/spi-hisi-kunpeng.c
drivers/spi/spi-imx.c
drivers/spi/spi-mt65xx.c
drivers/spi/spi-pl022.c
drivers/spi/spi-qcom-qspi.c
drivers/spi/spi-rzv2m-csi.c [new file with mode: 0644]
drivers/spi/spi-s3c64xx.c
drivers/spi/spi-sc18is602.c
drivers/spi/spi-sn-f-ospi.c
drivers/spi/spi-stm32.c
drivers/spi/spi-sun6i.c
drivers/spi/spi-xcomm.c
drivers/spi/spidev.c
drivers/target/target_core_iblock.c
drivers/target/target_core_pscsi.c
drivers/thermal/Kconfig
drivers/thermal/amlogic_thermal.c
drivers/thermal/armada_thermal.c
drivers/thermal/imx8mm_thermal.c
drivers/thermal/imx_sc_thermal.c
drivers/thermal/intel/int340x_thermal/acpi_thermal_rel.c
drivers/thermal/intel/int340x_thermal/acpi_thermal_rel.h
drivers/thermal/intel/int340x_thermal/processor_thermal_rapl.c
drivers/thermal/k3_bandgap.c
drivers/thermal/mediatek/auxadc_thermal.c
drivers/thermal/mediatek/lvts_thermal.c
drivers/thermal/qcom/qcom-spmi-adc-tm5.c
drivers/thermal/qcom/qcom-spmi-temp-alarm.c
drivers/thermal/qcom/tsens-v0_1.c
drivers/thermal/qcom/tsens-v1.c
drivers/thermal/qcom/tsens.c
drivers/thermal/qcom/tsens.h
drivers/thermal/qoriq_thermal.c
drivers/thermal/rcar_gen3_thermal.c
drivers/thermal/st/st_thermal.c
drivers/thermal/st/st_thermal.h
drivers/thermal/st/st_thermal_memmap.c
drivers/thermal/sun8i_thermal.c
drivers/thermal/tegra/tegra30-tsensor.c
drivers/thermal/thermal-generic-adc.c
drivers/thermal/thermal_core.h
drivers/thermal/thermal_hwmon.c
drivers/thermal/ti-soc-thermal/ti-thermal-common.c
drivers/tty/serial/Kconfig
drivers/tty/tty_io.c
drivers/usb/core/buffer.c
drivers/vdpa/vdpa_user/vduse_dev.c
drivers/vfio/vfio_iommu_type1.c
drivers/vhost/vdpa.c
drivers/virt/acrn/ioreq.c
drivers/virt/coco/sev-guest/Kconfig
drivers/xen/privcmd.c
drivers/xen/pvcalls-back.c
fs/9p/vfs_file.c
fs/Makefile
fs/adfs/file.c
fs/affs/file.c
fs/afs/file.c
fs/afs/write.c
fs/aio.c
fs/autofs/root.c
fs/befs/btree.c
fs/befs/linuxvfs.c
fs/bfs/file.c
fs/binfmt_elf.c
fs/binfmt_elf_fdpic.c
fs/btrfs/async-thread.c
fs/btrfs/async-thread.h
fs/btrfs/bio.c
fs/btrfs/bio.h
fs/btrfs/block-group.c
fs/btrfs/block-group.h
fs/btrfs/block-rsv.c
fs/btrfs/block-rsv.h
fs/btrfs/btrfs_inode.h
fs/btrfs/check-integrity.c
fs/btrfs/compression.c
fs/btrfs/compression.h
fs/btrfs/ctree.c
fs/btrfs/ctree.h
fs/btrfs/defrag.c
fs/btrfs/delayed-ref.c
fs/btrfs/delayed-ref.h
fs/btrfs/dev-replace.c
fs/btrfs/discard.c
fs/btrfs/discard.h
fs/btrfs/disk-io.c
fs/btrfs/disk-io.h
fs/btrfs/extent-io-tree.c
fs/btrfs/extent-io-tree.h
fs/btrfs/extent-tree.c
fs/btrfs/extent-tree.h
fs/btrfs/extent_io.c
fs/btrfs/extent_io.h
fs/btrfs/extent_map.c
fs/btrfs/extent_map.h
fs/btrfs/file-item.c
fs/btrfs/file-item.h
fs/btrfs/file.c
fs/btrfs/free-space-cache.c
fs/btrfs/free-space-cache.h
fs/btrfs/free-space-tree.c
fs/btrfs/fs.h
fs/btrfs/inode-item.h
fs/btrfs/inode.c
fs/btrfs/ioctl.c
fs/btrfs/locking.c
fs/btrfs/lzo.c
fs/btrfs/messages.c
fs/btrfs/messages.h
fs/btrfs/misc.h
fs/btrfs/ordered-data.c
fs/btrfs/ordered-data.h
fs/btrfs/print-tree.c
fs/btrfs/print-tree.h
fs/btrfs/qgroup.c
fs/btrfs/raid56.c
fs/btrfs/raid56.h
fs/btrfs/relocation.c
fs/btrfs/relocation.h
fs/btrfs/scrub.c
fs/btrfs/send.c
fs/btrfs/subpage.c
fs/btrfs/subpage.h
fs/btrfs/super.c
fs/btrfs/tests/extent-io-tests.c
fs/btrfs/transaction.c
fs/btrfs/transaction.h
fs/btrfs/tree-checker.c
fs/btrfs/tree-checker.h
fs/btrfs/tree-log.c
fs/btrfs/tree-log.h
fs/btrfs/tree-mod-log.c
fs/btrfs/volumes.c
fs/btrfs/volumes.h
fs/btrfs/zlib.c
fs/btrfs/zoned.c
fs/btrfs/zoned.h
fs/btrfs/zstd.c
fs/buffer.c
fs/cachefiles/namei.c
fs/ceph/file.c
fs/char_dev.c
fs/coda/file.c
fs/coredump.c
fs/cramfs/inode.c
fs/crypto/fscrypt_private.h
fs/crypto/hooks.c
fs/d_path.c
fs/direct-io.c
fs/dlm/config.c
fs/ecryptfs/file.c
fs/erofs/compress.h
fs/erofs/data.c
fs/erofs/decompressor.c
fs/erofs/internal.h
fs/erofs/super.c
fs/erofs/utils.c
fs/erofs/xattr.c
fs/erofs/zdata.c
fs/erofs/zmap.c
fs/eventfd.c
fs/eventpoll.c
fs/exec.c
fs/exfat/file.c
fs/ext2/file.c
fs/ext4/ext4.h
fs/ext4/file.c
fs/ext4/inode.c
fs/ext4/ioctl.c
fs/ext4/namei.c
fs/ext4/super.c
fs/f2fs/file.c
fs/f2fs/namei.c
fs/f2fs/super.c
fs/fat/file.c
fs/file_table.c
fs/fs-writeback.c
fs/fs_context.c
fs/fuse/file.c
fs/gfs2/aops.c
fs/gfs2/aops.h
fs/gfs2/file.c
fs/gfs2/ops_fstype.c
fs/hfs/inode.c
fs/hfsplus/inode.c
fs/hostfs/hostfs.h
fs/hostfs/hostfs_kern.c
fs/hostfs/hostfs_user.c
fs/hpfs/file.c
fs/hugetlbfs/inode.c
fs/inode.c
fs/internal.h
fs/iomap/buffered-io.c
fs/iomap/direct-io.c
fs/jbd2/journal.c
fs/jffs2/build.c
fs/jffs2/file.c
fs/jffs2/xattr.c
fs/jffs2/xattr.h
fs/jfs/file.c
fs/jfs/jfs_logmgr.c
fs/jfs/namei.c
fs/kernfs/file.c
fs/libfs.c
fs/lockd/svc.c
fs/minix/file.c
fs/namei.c
fs/namespace.c
fs/nfs/blocklayout/dev.c
fs/nfs/file.c
fs/nfs/internal.h
fs/nfs/nfs4file.c
fs/nfs/nfsroot.c
fs/nfsd/cache.h
fs/nfsd/export.c
fs/nfsd/nfs3proc.c
fs/nfsd/nfs3xdr.c
fs/nfsd/nfs4xdr.c
fs/nfsd/nfscache.c
fs/nfsd/nfsctl.c
fs/nfsd/nfsfh.c
fs/nfsd/nfsproc.c
fs/nfsd/nfssvc.c
fs/nfsd/nfsxdr.c
fs/nfsd/trace.h
fs/nfsd/vfs.c
fs/nfsd/vfs.h
fs/nilfs2/file.c
fs/nilfs2/super.c
fs/no-block.c [deleted file]
fs/ntfs/aops.c
fs/ntfs/attrib.c
fs/ntfs/compress.c
fs/ntfs/file.c
fs/ntfs/mft.c
fs/ntfs/super.c
fs/ntfs3/file.c
fs/ocfs2/cluster/heartbeat.c
fs/ocfs2/file.c
fs/ocfs2/localalloc.c
fs/ocfs2/ocfs2_trace.h
fs/ocfs2/quota_local.c
fs/omfs/file.c
fs/open.c
fs/orangefs/file.c
fs/overlayfs/file.c
fs/overlayfs/overlayfs.h
fs/pnode.c
fs/pnode.h
fs/proc/inode.c
fs/proc/kcore.c
fs/proc/meminfo.c
fs/proc/proc_sysctl.c
fs/proc/task_mmu.c
fs/proc/task_nommu.c
fs/proc/vmcore.c
fs/proc_namespace.c
fs/pstore/blk.c
fs/pstore/ram.c
fs/pstore/ram_core.c
fs/ramfs/file-mmu.c
fs/ramfs/file-nommu.c
fs/ramfs/inode.c
fs/read_write.c
fs/readdir.c
fs/reiserfs/file.c
fs/reiserfs/inode.c
fs/reiserfs/journal.c
fs/reiserfs/reiserfs.h
fs/reiserfs/xattr_security.c
fs/remap_range.c
fs/romfs/mmap-nommu.c
fs/smb/client/cifsfs.c
fs/smb/client/cifsfs.h
fs/smb/client/file.c
fs/splice.c
fs/squashfs/block.c
fs/squashfs/decompressor.c
fs/squashfs/decompressor_multi_percpu.c
fs/squashfs/squashfs_fs_sb.h
fs/squashfs/super.c
fs/super.c
fs/sysctls.c
fs/sysv/dir.c
fs/sysv/file.c
fs/sysv/itree.c
fs/sysv/namei.c
fs/ubifs/file.c
fs/udf/file.c
fs/udf/namei.c
fs/ufs/file.c
fs/userfaultfd.c
fs/vboxsf/file.c
fs/vboxsf/super.c
fs/verity/Kconfig
fs/verity/enable.c
fs/verity/fsverity_private.h
fs/verity/hash_algs.c
fs/verity/measure.c
fs/verity/open.c
fs/verity/read_metadata.c
fs/verity/signature.c
fs/verity/verify.c
fs/xfs/libxfs/xfs_btree.h
fs/xfs/scrub/btree.h
fs/xfs/xfs_file.c
fs/xfs/xfs_fsops.c
fs/xfs/xfs_mount.h
fs/xfs/xfs_super.c
fs/xfs/xfs_trace.h
fs/zonefs/file.c
fs/zonefs/super.c
fs/zonefs/zonefs.h
include/acpi/acpi_bus.h
include/acpi/actbl.h
include/acpi/actbl3.h
include/asm-generic/atomic.h
include/asm-generic/bitops/atomic.h
include/asm-generic/bitops/lock.h
include/asm-generic/bug.h
include/asm-generic/bugs.h [deleted file]
include/asm-generic/percpu.h
include/asm-generic/vmlinux.lds.h
include/clocksource/hyperv_timer.h
include/crypto/b128ops.h
include/kunit/resource.h
include/kunit/test.h
include/linux/acpi.h
include/linux/acpi_agdi.h [deleted file]
include/linux/acpi_apmt.h [deleted file]
include/linux/acpi_iort.h
include/linux/amd-pstate.h
include/linux/atomic/atomic-arch-fallback.h
include/linux/atomic/atomic-instrumented.h
include/linux/atomic/atomic-long.h
include/linux/audit.h
include/linux/audit_arch.h
include/linux/bio.h
include/linux/blk-mq.h
include/linux/blk_types.h
include/linux/blkdev.h
include/linux/blktrace_api.h
include/linux/bsg.h
include/linux/buffer_head.h
include/linux/cache.h
include/linux/cdrom.h
include/linux/cgroup.h
include/linux/compaction.h
include/linux/compiler_attributes.h
include/linux/context_tracking.h
include/linux/context_tracking_state.h
include/linux/cpu.h
include/linux/cpufreq.h
include/linux/cpuhotplug.h
include/linux/cpumask.h
include/linux/cpuset.h
include/linux/delay.h
include/linux/devfreq.h
include/linux/device-mapper.h
include/linux/device/driver.h
include/linux/dma-map-ops.h
include/linux/dma-mapping.h
include/linux/dmar.h
include/linux/efi.h
include/linux/err.h
include/linux/eventfd.h
include/linux/fault-inject.h
include/linux/fortify-string.h
include/linux/frontswap.h
include/linux/fs.h
include/linux/fsnotify.h
include/linux/fsverity.h
include/linux/gfp.h
include/linux/gpio/driver.h
include/linux/highmem.h
include/linux/hugetlb.h
include/linux/iio/iio.h
include/linux/init.h
include/linux/intel_rapl.h
include/linux/io.h
include/linux/io_uring.h
include/linux/io_uring_types.h
include/linux/irq.h
include/linux/irqchip/mmp.h [deleted file]
include/linux/irqchip/mxs.h [deleted file]
include/linux/irqdesc.h
include/linux/iscsi_ibft.h
include/linux/jump_label.h
include/linux/kallsyms.h
include/linux/kasan.h
include/linux/kcov.h
include/linux/key.h
include/linux/kthread.h
include/linux/lockdep.h
include/linux/lockdep_types.h
include/linux/lsm_hook_defs.h
include/linux/maple_tree.h
include/linux/math.h
include/linux/math64.h
include/linux/memblock.h
include/linux/memcontrol.h
include/linux/memory_hotplug.h
include/linux/mfd/axp20x.h
include/linux/mfd/rk808.h
include/linux/mfd/tps6594.h [new file with mode: 0644]
include/linux/migrate.h
include/linux/mm.h
include/linux/mm_inline.h
include/linux/mm_types.h
include/linux/mmc/card.h
include/linux/mmdebug.h
include/linux/mmzone.h
include/linux/module.h
include/linux/mount.h
include/linux/mtd/blktrans.h
include/linux/nmi.h
include/linux/nubus.h
include/linux/nvme-fc-driver.h
include/linux/olpc-ec.h
include/linux/overflow.h
include/linux/page-isolation.h
include/linux/pagemap.h
include/linux/pagevec.h
include/linux/panic.h
include/linux/parport.h
include/linux/pci_ids.h
include/linux/percpu-defs.h
include/linux/percpu.h
include/linux/perf/arm_pmu.h
include/linux/perf_event.h
include/linux/pgtable.h
include/linux/pipe_fs_i.h
include/linux/pktcdvd.h
include/linux/platform_data/spi-s3c64xx.h
include/linux/proc_fs.h
include/linux/ramfs.h
include/linux/rbtree_latch.h
include/linux/rcupdate.h
include/linux/regmap.h
include/linux/regulator/driver.h
include/linux/regulator/mt6358-regulator.h
include/linux/root_dev.h
include/linux/scatterlist.h
include/linux/sched.h
include/linux/sched/clock.h
include/linux/sched/sd_flags.h
include/linux/sched/signal.h
include/linux/sched/topology.h
include/linux/security.h
include/linux/seqlock.h
include/linux/slab.h
include/linux/slub_def.h
include/linux/soc/qcom/geni-se.h
include/linux/spi/spi.h
include/linux/splice.h
include/linux/srcu.h
include/linux/string.h
include/linux/sunrpc/svc.h
include/linux/sunrpc/svc_rdma.h
include/linux/sunrpc/xdr.h
include/linux/suspend.h
include/linux/swap.h
include/linux/swapops.h
include/linux/syscalls.h
include/linux/sysctl.h
include/linux/thread_info.h
include/linux/time_namespace.h
include/linux/types.h
include/linux/uio.h
include/linux/umh.h
include/linux/userfaultfd_k.h
include/linux/watch_queue.h
include/linux/workqueue.h
include/linux/zpool.h
include/scsi/scsi_ioctl.h
include/soc/imx/timer.h [deleted file]
include/trace/events/block.h
include/trace/events/btrfs.h
include/trace/events/compaction.h
include/trace/events/csd.h [new file with mode: 0644]
include/trace/events/mmflags.h
include/trace/events/rpcrdma.h
include/trace/events/sunrpc.h
include/trace/events/timer.h
include/uapi/asm-generic/unistd.h
include/uapi/linux/affs_hardblocks.h
include/uapi/linux/auto_dev-ioctl.h
include/uapi/linux/capability.h
include/uapi/linux/elf.h
include/uapi/linux/eventfd.h [new file with mode: 0644]
include/uapi/linux/io_uring.h
include/uapi/linux/mman.h
include/uapi/linux/mount.h
include/uapi/linux/pktcdvd.h
include/uapi/linux/spi/spi.h
include/uapi/linux/types.h
include/uapi/linux/ublk_cmd.h
include/uapi/linux/vfio.h
include/xen/events.h
include/xen/xen.h
init/Kconfig
init/do_mounts.c
init/do_mounts.h
init/do_mounts_initrd.c
init/main.c
io_uring/cancel.c
io_uring/filetable.c
io_uring/filetable.h
io_uring/io_uring.c
io_uring/io_uring.h
io_uring/msg_ring.c
io_uring/net.c
io_uring/poll.c
io_uring/poll.h
io_uring/rsrc.c
io_uring/rw.c
io_uring/rw.h
io_uring/tctx.c
io_uring/timeout.c
io_uring/uring_cmd.c
kernel/Makefile
kernel/audit.h
kernel/capability.c
kernel/cgroup/cgroup-internal.h
kernel/cgroup/cgroup-v1.c
kernel/cgroup/cgroup.c
kernel/cgroup/cpuset.c
kernel/cgroup/misc.c
kernel/cgroup/rdma.c
kernel/cgroup/rstat.c
kernel/context_tracking.c
kernel/cpu.c
kernel/dma/Kconfig
kernel/dma/direct.c
kernel/dma/direct.h
kernel/events/core.c
kernel/events/uprobes.c
kernel/fork.c
kernel/irq/chip.c
kernel/irq/debugfs.c
kernel/irq/internals.h
kernel/irq/irqdesc.c
kernel/irq/irqdomain.c
kernel/irq/resend.c
kernel/kallsyms.c
kernel/kcov.c
kernel/kexec_core.c
kernel/kexec_file.c
kernel/ksyms_common.c [new file with mode: 0644]
kernel/kthread.c
kernel/locking/lock_events.h
kernel/locking/lockdep.c
kernel/locking/locktorture.c
kernel/module/kallsyms.c
kernel/module/main.c
kernel/panic.c
kernel/params.c
kernel/pid_sysctl.h
kernel/power/hibernate.c
kernel/power/main.c
kernel/power/power.h
kernel/power/snapshot.c
kernel/power/swap.c
kernel/printk/printk.c
kernel/rcu/Kconfig
kernel/rcu/rcu.h
kernel/rcu/rcuscale.c
kernel/rcu/tasks.h
kernel/rcu/tree.c
kernel/rcu/tree_exp.h
kernel/rcu/tree_nocb.h
kernel/rcu/tree_plugin.h
kernel/sched/clock.c
kernel/sched/core.c
kernel/sched/cpufreq_schedutil.c
kernel/sched/deadline.c
kernel/sched/debug.c
kernel/sched/fair.c
kernel/sched/psi.c
kernel/sched/sched.h
kernel/sched/topology.c
kernel/sched/wait.c
kernel/signal.c
kernel/smp.c
kernel/smpboot.c
kernel/softirq.c
kernel/sys_ni.c
kernel/sysctl.c
kernel/time/alarmtimer.c
kernel/time/clocksource.c
kernel/time/hrtimer.c
kernel/time/posix-timers.c
kernel/time/sched_clock.c
kernel/time/tick-sched.c
kernel/time/timekeeping.c
kernel/trace/ftrace.c
kernel/trace/trace.c
kernel/trace/trace_events.c
kernel/trace/trace_events_inject.c
kernel/trace/trace_events_user.c
kernel/trace/trace_kprobe.c
kernel/trace/trace_probe.c
kernel/umh.c
kernel/watch_queue.c
kernel/watchdog.c
kernel/watchdog_buddy.c [new file with mode: 0644]
kernel/watchdog_perf.c [moved from kernel/watchdog_hld.c with 72% similarity]
kernel/workqueue.c
kernel/workqueue_internal.h
lib/Kconfig.debug
lib/Kconfig.ubsan
lib/Makefile
lib/checksum_kunit.c [new file with mode: 0644]
lib/crypto/curve25519-hacl64.c
lib/crypto/poly1305-donna64.c
lib/debugobjects.c
lib/decompress_inflate.c
lib/decompress_unxz.c
lib/decompress_unzstd.c
lib/devmem_is_allowed.c
lib/devres.c
lib/fortify_kunit.c
lib/iov_iter.c
lib/kobject.c
lib/kunit/debugfs.c
lib/kunit/executor_test.c
lib/kunit/kunit-example-test.c
lib/kunit/kunit-test.c
lib/kunit/resource.c
lib/kunit/test.c
lib/maple_tree.c
lib/overflow_kunit.c
lib/raid6/neon.h [new file with mode: 0644]
lib/raid6/neon.uc
lib/raid6/recov_neon.c
lib/raid6/recov_neon_inner.c
lib/show_mem.c [deleted file]
lib/strcat_kunit.c [new file with mode: 0644]
lib/string.c
lib/string_helpers.c
lib/test_maple_tree.c
lib/test_sysctl.c
lib/ubsan.c
lib/ubsan.h
lib/zstd/common/zstd_deps.h
mm/Kconfig
mm/Makefile
mm/backing-dev.c
mm/cma.c
mm/compaction.c
mm/damon/core-test.h
mm/damon/ops-common.c
mm/damon/ops-common.h
mm/damon/paddr.c
mm/damon/vaddr.c
mm/debug.c
mm/debug_page_alloc.c [new file with mode: 0644]
mm/debug_vm_pgtable.c
mm/dmapool.c
mm/early_ioremap.c
mm/fadvise.c
mm/fail_page_alloc.c [new file with mode: 0644]
mm/filemap.c
mm/frontswap.c
mm/gup.c
mm/gup_test.c
mm/highmem.c
mm/hmm.c
mm/huge_memory.c
mm/hugetlb.c
mm/hugetlb_vmemmap.c
mm/internal.h
mm/kasan/common.c
mm/kasan/generic.c
mm/kasan/init.c
mm/kasan/kasan.h
mm/kasan/report.c
mm/kasan/report_generic.c
mm/kasan/report_hw_tags.c
mm/kasan/report_sw_tags.c
mm/kasan/shadow.c
mm/kasan/sw_tags.c
mm/kasan/tags.c
mm/khugepaged.c
mm/kmsan/core.c
mm/kmsan/instrumentation.c
mm/ksm.c
mm/madvise.c
mm/mapping_dirty_helpers.c
mm/memblock.c
mm/memcontrol.c
mm/memory-failure.c
mm/memory-tiers.c
mm/memory.c
mm/memory_hotplug.c
mm/mempolicy.c
mm/migrate.c
mm/migrate_device.c
mm/mincore.c
mm/mlock.c
mm/mm_init.c
mm/mmap.c
mm/mprotect.c
mm/mremap.c
mm/oom_kill.c
mm/page-writeback.c
mm/page_alloc.c
mm/page_io.c
mm/page_isolation.c
mm/page_owner.c
mm/page_table_check.c
mm/page_vma_mapped.c
mm/pagewalk.c
mm/percpu-internal.h
mm/pgtable-generic.c
mm/process_vm_access.c
mm/ptdump.c
mm/readahead.c
mm/rmap.c
mm/secretmem.c
mm/shmem.c
mm/show_mem.c [new file with mode: 0644]
mm/slab.c
mm/slab.h
mm/slab_common.c
mm/slub.c
mm/sparse-vmemmap.c
mm/sparse.c
mm/swap.c
mm/swap_state.c
mm/swapfile.c
mm/truncate.c
mm/userfaultfd.c
mm/vmalloc.c
mm/vmscan.c
mm/vmstat.c
mm/workingset.c
mm/z3fold.c
mm/zbud.c
mm/zpool.c
mm/zsmalloc.c
mm/zswap.c
net/mptcp/subflow.c
net/netfilter/ipset/ip_set_hash_netiface.c
net/qrtr/ns.c
net/rxrpc/af_rxrpc.c
net/socket.c
net/sunrpc/svc.c
net/sunrpc/svc_xprt.c
net/sunrpc/svcsock.c
net/sunrpc/xdr.c
net/sunrpc/xprtrdma/svc_rdma_backchannel.c
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
net/sunrpc/xprtrdma/svc_rdma_rw.c
net/sunrpc/xprtrdma/svc_rdma_sendto.c
net/sunrpc/xprtrdma/svc_rdma_transport.c
net/xdp/xdp_umem.c
rust/alloc/README.md
rust/alloc/alloc.rs
rust/alloc/boxed.rs
rust/alloc/collections/mod.rs
rust/alloc/lib.rs
rust/alloc/raw_vec.rs
rust/alloc/slice.rs
rust/alloc/vec/drain.rs
rust/alloc/vec/drain_filter.rs
rust/alloc/vec/into_iter.rs
rust/alloc/vec/is_zero.rs
rust/alloc/vec/mod.rs
rust/alloc/vec/set_len_on_drop.rs
rust/alloc/vec/spec_extend.rs
rust/bindings/bindings_helper.h
rust/bindings/lib.rs
rust/helpers.c
rust/kernel/build_assert.rs
rust/kernel/error.rs
rust/kernel/init.rs
rust/kernel/init/macros.rs
rust/kernel/lib.rs
rust/kernel/std_vendor.rs
rust/kernel/str.rs
rust/kernel/sync/arc.rs
rust/kernel/task.rs
rust/kernel/types.rs
rust/macros/helpers.rs
rust/macros/pin_data.rs
rust/macros/quote.rs
rust/uapi/lib.rs
samples/kmemleak/kmemleak-test.c
scripts/Makefile.build
scripts/Makefile.ubsan
scripts/atomic/atomic-tbl.sh
scripts/atomic/atomics.tbl
scripts/atomic/fallbacks/acquire
scripts/atomic/fallbacks/add_negative
scripts/atomic/fallbacks/add_unless
scripts/atomic/fallbacks/andnot
scripts/atomic/fallbacks/cmpxchg [new file with mode: 0644]
scripts/atomic/fallbacks/dec
scripts/atomic/fallbacks/dec_and_test
scripts/atomic/fallbacks/dec_if_positive
scripts/atomic/fallbacks/dec_unless_positive
scripts/atomic/fallbacks/fence
scripts/atomic/fallbacks/fetch_add_unless
scripts/atomic/fallbacks/inc
scripts/atomic/fallbacks/inc_and_test
scripts/atomic/fallbacks/inc_not_zero
scripts/atomic/fallbacks/inc_unless_negative
scripts/atomic/fallbacks/read_acquire
scripts/atomic/fallbacks/release
scripts/atomic/fallbacks/set_release
scripts/atomic/fallbacks/sub_and_test
scripts/atomic/fallbacks/try_cmpxchg
scripts/atomic/fallbacks/xchg [new file with mode: 0644]
scripts/atomic/gen-atomic-fallback.sh
scripts/atomic/gen-atomic-instrumented.sh
scripts/atomic/gen-atomic-long.sh
scripts/atomic/kerneldoc/add [new file with mode: 0644]
scripts/atomic/kerneldoc/add_negative [new file with mode: 0644]
scripts/atomic/kerneldoc/add_unless [new file with mode: 0644]
scripts/atomic/kerneldoc/and [new file with mode: 0644]
scripts/atomic/kerneldoc/andnot [new file with mode: 0644]
scripts/atomic/kerneldoc/cmpxchg [new file with mode: 0644]
scripts/atomic/kerneldoc/dec [new file with mode: 0644]
scripts/atomic/kerneldoc/dec_and_test [new file with mode: 0644]
scripts/atomic/kerneldoc/dec_if_positive [new file with mode: 0644]
scripts/atomic/kerneldoc/dec_unless_positive [new file with mode: 0644]
scripts/atomic/kerneldoc/inc [new file with mode: 0644]
scripts/atomic/kerneldoc/inc_and_test [new file with mode: 0644]
scripts/atomic/kerneldoc/inc_not_zero [new file with mode: 0644]
scripts/atomic/kerneldoc/inc_unless_negative [new file with mode: 0644]
scripts/atomic/kerneldoc/or [new file with mode: 0644]
scripts/atomic/kerneldoc/read [new file with mode: 0644]
scripts/atomic/kerneldoc/set [new file with mode: 0644]
scripts/atomic/kerneldoc/sub [new file with mode: 0644]
scripts/atomic/kerneldoc/sub_and_test [new file with mode: 0644]
scripts/atomic/kerneldoc/try_cmpxchg [new file with mode: 0644]
scripts/atomic/kerneldoc/xchg [new file with mode: 0644]
scripts/atomic/kerneldoc/xor [new file with mode: 0644]
scripts/check-sysctl-docs
scripts/checkpatch.pl
scripts/kernel-doc
scripts/min-tool-version.sh
scripts/mod/modpost.c
scripts/orc_hash.sh [new file with mode: 0644]
scripts/spelling.txt
security/commoncap.c
security/device_cgroup.c
security/integrity/evm/evm_crypto.c
security/integrity/evm/evm_main.c
security/integrity/iint.c
security/integrity/ima/ima_api.c
security/integrity/ima/ima_main.c
security/integrity/ima/ima_modsig.c
security/integrity/ima/ima_policy.c
security/keys/sysctl.c
security/landlock/Kconfig
security/lsm_audit.c
security/safesetid/lsm.c
security/security.c
security/selinux/Makefile
security/selinux/avc.c
security/selinux/hooks.c
security/selinux/ima.c
security/selinux/include/audit.h
security/selinux/include/avc.h
security/selinux/include/ibpkey.h
security/selinux/include/ima.h
security/selinux/include/initial_sid_to_string.h
security/selinux/include/security.h
security/selinux/netlabel.c
security/selinux/selinuxfs.c
security/selinux/ss/avtab.c
security/selinux/ss/avtab.h
security/selinux/ss/conditional.c
security/selinux/ss/conditional.h
security/selinux/ss/context.h
security/selinux/ss/policydb.c
security/selinux/ss/policydb.h
security/selinux/ss/services.c
security/smack/smack.h
security/smack/smack_lsm.c
security/tomoyo/domain.c
sound/pci/hda/patch_realtek.c
sound/soc/codecs/Kconfig
sound/soc/intel/boards/sof_sdw.c
tools/arch/x86/include/asm/nops.h
tools/arch/x86/kcpuid/.gitignore [new file with mode: 0644]
tools/arch/x86/kcpuid/kcpuid.c
tools/include/nolibc/Makefile
tools/include/nolibc/arch-aarch64.h
tools/include/nolibc/arch-arm.h
tools/include/nolibc/arch-i386.h
tools/include/nolibc/arch-loongarch.h
tools/include/nolibc/arch-mips.h
tools/include/nolibc/arch-riscv.h
tools/include/nolibc/arch-s390.h
tools/include/nolibc/arch-x86_64.h
tools/include/nolibc/arch.h
tools/include/nolibc/compiler.h [new file with mode: 0644]
tools/include/nolibc/nolibc.h
tools/include/nolibc/stackprotector.h
tools/include/nolibc/stdint.h
tools/include/nolibc/stdio.h
tools/include/nolibc/stdlib.h
tools/include/nolibc/string.h
tools/include/nolibc/sys.h
tools/include/nolibc/types.h
tools/include/nolibc/unistd.h
tools/lib/subcmd/parse-options.h
tools/lib/subcmd/subcmd-util.h
tools/objtool/Documentation/objtool.txt
tools/objtool/arch/powerpc/include/arch/elf.h
tools/objtool/arch/x86/decode.c
tools/objtool/arch/x86/include/arch/elf.h
tools/objtool/arch/x86/special.c
tools/objtool/builtin-check.c
tools/objtool/check.c
tools/objtool/elf.c
tools/objtool/include/objtool/builtin.h
tools/objtool/include/objtool/cfi.h
tools/objtool/include/objtool/elf.h
tools/objtool/include/objtool/warn.h
tools/objtool/noreturns.h [new file with mode: 0644]
tools/objtool/orc_gen.c
tools/objtool/special.c
tools/perf/arch/x86/include/arch-tests.h
tools/perf/arch/x86/tests/Build
tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c [new file with mode: 0644]
tools/perf/arch/x86/tests/arch-tests.c
tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
tools/spi/spidev_test.c
tools/testing/kunit/configs/all_tests.config
tools/testing/kunit/configs/arch_uml.config
tools/testing/kunit/kunit_kernel.py
tools/testing/kunit/mypy.ini [new file with mode: 0644]
tools/testing/kunit/run_checks.py
tools/testing/radix-tree/linux/init.h
tools/testing/radix-tree/maple.c
tools/testing/selftests/Makefile
tools/testing/selftests/arm64/abi/hwcap.c
tools/testing/selftests/arm64/abi/ptrace.c
tools/testing/selftests/arm64/signal/.gitignore
tools/testing/selftests/arm64/signal/test_signals_utils.c
tools/testing/selftests/arm64/signal/testcases/tpidr2_restore.c [new file with mode: 0644]
tools/testing/selftests/bpf/progs/bpf_iter_ksym.c
tools/testing/selftests/cachestat/.gitignore [new file with mode: 0644]
tools/testing/selftests/cachestat/Makefile [new file with mode: 0644]
tools/testing/selftests/cachestat/test_cachestat.c [new file with mode: 0644]
tools/testing/selftests/cgroup/test_memcontrol.c
tools/testing/selftests/clone3/clone3.c
tools/testing/selftests/cpufreq/config
tools/testing/selftests/damon/config [new file with mode: 0644]
tools/testing/selftests/ftrace/ftracetest
tools/testing/selftests/ftrace/test.d/kprobe/kprobe_opt_types.tc [new file with mode: 0644]
tools/testing/selftests/kselftest/runner.sh
tools/testing/selftests/kvm/aarch64/get-reg-list.c
tools/testing/selftests/landlock/config
tools/testing/selftests/landlock/config.um [new file with mode: 0644]
tools/testing/selftests/landlock/fs_test.c
tools/testing/selftests/lib.mk
tools/testing/selftests/media_tests/video_device_test.c
tools/testing/selftests/mm/.gitignore
tools/testing/selftests/mm/Makefile
tools/testing/selftests/mm/cow.c
tools/testing/selftests/mm/gup_longterm.c [new file with mode: 0644]
tools/testing/selftests/mm/hugepage-shm.c
tools/testing/selftests/mm/hugepage-vmemmap.c
tools/testing/selftests/mm/hugetlb-madvise.c
tools/testing/selftests/mm/khugepaged.c
tools/testing/selftests/mm/madv_populate.c
tools/testing/selftests/mm/map_fixed_noreplace.c
tools/testing/selftests/mm/map_hugetlb.c
tools/testing/selftests/mm/map_populate.c
tools/testing/selftests/mm/migration.c
tools/testing/selftests/mm/mlock-random-test.c
tools/testing/selftests/mm/mlock2-tests.c
tools/testing/selftests/mm/mlock2.h
tools/testing/selftests/mm/mrelease_test.c
tools/testing/selftests/mm/mremap_dontunmap.c
tools/testing/selftests/mm/on-fault-limit.c
tools/testing/selftests/mm/pkey-powerpc.h
tools/testing/selftests/mm/pkey-x86.h
tools/testing/selftests/mm/protection_keys.c
tools/testing/selftests/mm/run_vmtests.sh
tools/testing/selftests/mm/uffd-common.c
tools/testing/selftests/mm/uffd-common.h
tools/testing/selftests/mm/uffd-stress.c
tools/testing/selftests/mm/uffd-unit-tests.c
tools/testing/selftests/mm/vm_util.c
tools/testing/selftests/mm/vm_util.h
tools/testing/selftests/nolibc/.gitignore
tools/testing/selftests/nolibc/Makefile
tools/testing/selftests/nolibc/nolibc-test.c
tools/testing/selftests/pidfd/pidfd.h
tools/testing/selftests/pidfd/pidfd_fdinfo_test.c
tools/testing/selftests/pidfd/pidfd_test.c
tools/testing/selftests/prctl/set-anon-vma-name-test.c
tools/testing/selftests/rcutorture/bin/functions.sh
tools/testing/selftests/rcutorture/configs/rcu/BUSTED-BOOST.boot
tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot
tools/testing/selftests/run_kselftest.sh
tools/testing/selftests/sysctl/sysctl.sh
tools/testing/selftests/vDSO/vdso_test_clock_getres.c
tools/workqueue/wq_monitor.py [new file with mode: 0644]
virt/kvm/async_pf.c
virt/kvm/kvm_main.c

index c9ba5bf..2325c52 100644 (file)
@@ -2,3 +2,4 @@
 *.[ch] diff=cpp
 *.dts diff=dts
 *.dts[io] diff=dts
+*.rs diff=rust
index c94da2a..4d71480 100644 (file)
--- a/.mailmap
+++ b/.mailmap
@@ -183,6 +183,8 @@ Henrik Rydberg <rydberg@bitmath.org>
 Herbert Xu <herbert@gondor.apana.org.au>
 Huacai Chen <chenhuacai@kernel.org> <chenhc@lemote.com>
 Huacai Chen <chenhuacai@kernel.org> <chenhuacai@loongson.cn>
+J. Bruce Fields <bfields@fieldses.org> <bfields@redhat.com>
+J. Bruce Fields <bfields@fieldses.org> <bfields@citi.umich.edu>
 Jacob Shin <Jacob.Shin@amd.com>
 Jaegeuk Kim <jaegeuk@kernel.org> <jaegeuk@google.com>
 Jaegeuk Kim <jaegeuk@kernel.org> <jaegeuk.kim@samsung.com>
diff --git a/CREDITS b/CREDITS
index de7e4db..8b48820 100644 (file)
--- a/CREDITS
+++ b/CREDITS
@@ -383,6 +383,12 @@ E: tomas@nocrew.org
 W: http://tomas.nocrew.org/
 D: dsp56k device driver
 
+N: Srivatsa S. Bhat
+E: srivatsa@csail.mit.edu
+D: Maintainer of Generic Paravirt-Ops subsystem
+D: Maintainer of VMware hypervisor interface
+D: Maintainer of VMware virtual PTP clock driver (ptp_vmw)
+
 N: Ross Biro
 E: ross.biro@gmail.com
 D: Original author of the Linux networking code
index f54867c..ecd585c 100644 (file)
@@ -670,7 +670,7 @@ Description:        Preferred MTE tag checking mode
                "async"           Prefer asynchronous mode
                ================  ==============================================
 
-               See also: Documentation/arm64/memory-tagging-extension.rst
+               See also: Documentation/arch/arm64/memory-tagging-extension.rst
 
 What:          /sys/devices/system/cpu/nohz_full
 Date:          Apr 2015
index 49387d8..f3b6052 100644 (file)
@@ -2071,41 +2071,7 @@ call.
 
 Because RCU avoids interrupting idle CPUs, it is illegal to execute an
 RCU read-side critical section on an idle CPU. (Kernels built with
-``CONFIG_PROVE_RCU=y`` will splat if you try it.) The RCU_NONIDLE()
-macro and ``_rcuidle`` event tracing is provided to work around this
-restriction. In addition, rcu_is_watching() may be used to test
-whether or not it is currently legal to run RCU read-side critical
-sections on this CPU. I learned of the need for diagnostics on the one
-hand and RCU_NONIDLE() on the other while inspecting idle-loop code.
-Steven Rostedt supplied ``_rcuidle`` event tracing, which is used quite
-heavily in the idle loop. However, there are some restrictions on the
-code placed within RCU_NONIDLE():
-
-#. Blocking is prohibited. In practice, this is not a serious
-   restriction given that idle tasks are prohibited from blocking to
-   begin with.
-#. Although nesting RCU_NONIDLE() is permitted, they cannot nest
-   indefinitely deeply. However, given that they can be nested on the
-   order of a million deep, even on 32-bit systems, this should not be a
-   serious restriction. This nesting limit would probably be reached
-   long after the compiler OOMed or the stack overflowed.
-#. Any code path that enters RCU_NONIDLE() must sequence out of that
-   same RCU_NONIDLE(). For example, the following is grossly
-   illegal:
-
-      ::
-
-         1     RCU_NONIDLE({
-         2       do_something();
-         3       goto bad_idea;  /* BUG!!! */
-         4       do_something_else();});
-         5   bad_idea:
-
-
-   It is just as illegal to transfer control into the middle of
-   RCU_NONIDLE()'s argument. Yes, in theory, you could transfer in
-   as long as you also transferred out, but in practice you could also
-   expect to get sharply worded review comments.
+``CONFIG_PROVE_RCU=y`` will splat if you try it.)
 
 It is similarly socially unacceptable to interrupt an ``nohz_full`` CPU
 running in userspace. RCU must therefore track ``nohz_full`` userspace
index 8eddef2..e488c8e 100644 (file)
@@ -1117,7 +1117,6 @@ All: lockdep-checked RCU utility APIs::
 
        RCU_LOCKDEP_WARN
        rcu_sleep_check
-       RCU_NONIDLE
 
 All: Unchecked RCU-protected pointer access::
 
index bb5032a..6fdb495 100644 (file)
@@ -508,9 +508,6 @@ cache_miss_collisions
   cache miss, but raced with a write and data was already present (usually 0
   since the synchronization for cache misses was rewritten)
 
-cache_readaheads
-  Count of times readahead occurred.
-
 Sysfs - cache set
 ~~~~~~~~~~~~~~~~~
 
index 47d1d7d..fabaad3 100644 (file)
@@ -297,7 +297,7 @@ Lock order is as follows::
 
   Page lock (PG_locked bit of page->flags)
     mm->page_table_lock or split pte_lock
-      lock_page_memcg (memcg->move_lock)
+      folio_memcg_lock (memcg->move_lock)
         mapping->i_pages lock
           lruvec->lru_lock.
 
index e592a93..4ef8901 100644 (file)
@@ -1580,6 +1580,13 @@ PAGE_SIZE multiple when read back.
 
        Healthy workloads are not expected to reach this limit.
 
+  memory.swap.peak
+       A read-only single value file which exists on non-root
+       cgroups.
+
+       The max swap usage recorded for the cgroup and its
+       descendants since the creation of the cgroup.
+
   memory.swap.max
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".
@@ -2022,31 +2029,33 @@ that attribute:
   no-change
        Do not modify the I/O priority class.
 
-  none-to-rt
-       For requests that do not have an I/O priority class (NONE),
-       change the I/O priority class into RT. Do not modify
-       the I/O priority class of other requests.
+  promote-to-rt
+       For requests that have a non-RT I/O priority class, change it into RT.
+       Also change the priority level of these requests to 4. Do not modify
+       the I/O priority of requests that have priority class RT.
 
   restrict-to-be
        For requests that do not have an I/O priority class or that have I/O
-       priority class RT, change it into BE. Do not modify the I/O priority
-       class of requests that have priority class IDLE.
+       priority class RT, change it into BE. Also change the priority level
+       of these requests to 0. Do not modify the I/O priority class of
+       requests that have priority class IDLE.
 
   idle
        Change the I/O priority class of all requests into IDLE, the lowest
        I/O priority class.
 
+  none-to-rt
+       Deprecated. Just an alias for promote-to-rt.
+
 The following numerical values are associated with the I/O priority policies:
 
-+-------------+---+
-| no-change   | 0 |
-+-------------+---+
-| none-to-rt  | 1 |
-+-------------+---+
-| rt-to-be    | 2 |
-+-------------+---+
-| all-to-idle | 3 |
-+-------------+---+
++----------------+---+
+| no-change      | 0 |
++----------------+---+
+| rt-to-be       | 2 |
++----------------+---+
+| all-to-idle    | 3 |
++----------------+---+
 
 The numerical value that corresponds to each I/O priority class is as follows:
 
@@ -2062,9 +2071,13 @@ The numerical value that corresponds to each I/O priority class is as follows:
 
 The algorithm to set the I/O priority class for a request is as follows:
 
-- Translate the I/O priority class policy into a number.
-- Change the request I/O priority class into the maximum of the I/O priority
-  class policy number and the numerical I/O priority class.
+- If I/O priority class policy is promote-to-rt, change the request I/O
+  priority class to IOPRIO_CLASS_RT and change the request I/O priority
+  level to 4.
+- If the I/O priority class policy is not promote-to-rt, translate the I/O priority
+  class policy into a number, then change the request I/O priority class
+  into the maximum of the I/O priority class policy number and the numerical
+  I/O priority class.
 
 PID
 ---
@@ -2437,7 +2450,7 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_
          res_b 10
 
   misc.current
-        A read-only flat-keyed file shown in the non-root cgroups.  It shows
+        A read-only flat-keyed file shown in all cgroups.  It shows
         the current usage of the resources in the cgroup and its children.::
 
          $ cat misc.current
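
As a rough illustration of the interfaces touched above, a minimal shell sketch
(assuming a unified cgroup hierarchy mounted at /sys/fs/cgroup and an existing
child cgroup named "grp"; the io.prio.class file name follows the existing
cgroup-v2 I/O priority documentation):

    # Promote I/O issued from "grp" to the RT class (priority level 4),
    # per the promote-to-rt policy described above.
    $ echo promote-to-rt > /sys/fs/cgroup/grp/io.prio.class

    # Read the new memory.swap.peak counter: max swap usage, in bytes,
    # recorded since the cgroup was created.
    $ cat /sys/fs/cgroup/grp/memory.swap.peak
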
index 9e5bab2..d172651 100644 (file)
                        EL0 is indicated by /sys/devices/system/cpu/aarch32_el0
                        and hot-unplug operations may be restricted.
 
-                       See Documentation/arm64/asymmetric-32bit.rst for more
+                       See Documentation/arch/arm64/asymmetric-32bit.rst for more
                        information.
 
        amd_iommu=      [HW,X86-64]
        arm64.nosme     [ARM64] Unconditionally disable Scalable Matrix
                        Extension support
 
+       arm64.nomops    [ARM64] Unconditionally disable Memory Copy and Memory
+                       Set instructions support
+
        ataflop=        [HW,M68k]
 
        atarimouse=     [HW,MOUSE] Atari Mouse
                        Format:
                        <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
 
-       cpu0_hotplug    [X86] Turn on CPU0 hotplug feature when
-                       CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
-                       Some features depend on CPU0. Known dependencies are:
-                       1. Resume from suspend/hibernate depends on CPU0.
-                       Suspend/hibernate will fail if CPU0 is offline and you
-                       need to online CPU0 before suspend/hibernate.
-                       2. PIC interrupts also depend on CPU0. CPU0 can't be
-                       removed if a PIC interrupt is detected.
-                       It's said poweroff/reboot may depend on CPU0 on some
-                       machines although I haven't seen such issues so far
-                       after CPU0 is offline on a few tested machines.
-                       If the dependencies are under your control, you can
-                       turn on cpu0_hotplug.
-
        cpuidle.off=1   [CPU_IDLE]
                        disable the cpuidle sub-system
 
                        on every CPU online, such as boot, and resume from suspend.
                        Default: 10000
 
+       cpuhp.parallel=
+                       [SMP] Enable/disable parallel bringup of secondary CPUs
+                       Format: <bool>
+                       Default is enabled if CONFIG_HOTPLUG_PARALLEL=y. Otherwise
+                       the parameter has no effect.
+
        crash_kexec_post_notifiers
                        Run kdump after running panic-notifiers and dumping
                        kmsg. This only for the users who doubt kdump always
                        disable
                          Do not enable intel_pstate as the default
                          scaling driver for the supported processors
+                        active
+                          Use the intel_pstate driver to bypass the scaling
+                          governors layer of cpufreq and provide its own
+                          algorithms for P-state selection. There are two
+                          P-state selection algorithms provided by
+                          intel_pstate in the active mode: powersave and
+                          performance.  The way they both operate depends
+                          on whether or not the hardware managed P-states
+                          (HWP) feature has been enabled in the processor
+                          and possibly on the processor model.
                        passive
                          Use intel_pstate as a scaling driver, but configure it
                          to work with generic cpufreq governors (instead of
                        If the value is 0 (the default), KVM will pick a period based
                        on the ratio, such that a page is zapped after 1 hour on average.
 
-       kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM.
-                       Default is 1 (enabled)
+       kvm-amd.nested= [KVM,AMD] Control nested virtualization feature in
+                       KVM/SVM. Default is 1 (enabled).
 
-       kvm-amd.npt=    [KVM,AMD] Disable nested paging (virtualized MMU)
-                       for all guests.
-                       Default is 1 (enabled) if in 64-bit or 32-bit PAE mode.
+       kvm-amd.npt=    [KVM,AMD] Control KVM's use of Nested Page Tables,
+                       a.k.a. Two-Dimensional Page Tables. Default is 1
+                       (enabled). Disabled by KVM if hardware lacks support
+                       for NPT.
 
        kvm-arm.mode=
                        [KVM,ARM] Select one of KVM/arm64's modes of operation.
                        Format: <integer>
                        Default: 5
 
-       kvm-intel.ept=  [KVM,Intel] Disable extended page tables
-                       (virtualized MMU) support on capable Intel chips.
-                       Default is 1 (enabled)
+       kvm-intel.ept=  [KVM,Intel] Control KVM's use of Extended Page Tables,
+                       a.k.a. Two-Dimensional Page Tables.  Default is 1
+                       (enabled). Disabled by KVM if hardware lacks support
+                       for EPT.
 
        kvm-intel.emulate_invalid_guest_state=
-                       [KVM,Intel] Disable emulation of invalid guest state.
-                       Ignored if kvm-intel.enable_unrestricted_guest=1, as
-                       guest state is never invalid for unrestricted guests.
-                       This param doesn't apply to nested guests (L2), as KVM
-                       never emulates invalid L2 guest state.
-                       Default is 1 (enabled)
+                       [KVM,Intel] Control whether to emulate invalid guest
+                       state. Ignored if kvm-intel.enable_unrestricted_guest=1,
+                       as guest state is never invalid for unrestricted
+                       guests. This param doesn't apply to nested guests (L2),
+                       as KVM never emulates invalid L2 guest state.
+                       Default is 1 (enabled).
 
        kvm-intel.flexpriority=
-                       [KVM,Intel] Disable FlexPriority feature (TPR shadow).
-                       Default is 1 (enabled)
+                       [KVM,Intel] Control KVM's use of FlexPriority feature
+                       (TPR shadow). Default is 1 (enabled). Disabled by KVM if
+                       hardware lacks support for it.
 
        kvm-intel.nested=
-                       [KVM,Intel] Enable VMX nesting (nVMX).
-                       Default is 0 (disabled)
+                       [KVM,Intel] Control nested virtualization feature in
+                       KVM/VMX. Default is 1 (enabled).
 
        kvm-intel.unrestricted_guest=
-                       [KVM,Intel] Disable unrestricted guest feature
-                       (virtualized real and unpaged mode) on capable
-                       Intel chips. Default is 1 (enabled)
+                       [KVM,Intel] Control KVM's use of unrestricted guest
+                       feature (virtualized real and unpaged mode). Default
+                       is 1 (enabled). Disabled by KVM if EPT is disabled or
+                       hardware lacks support for it.
 
        kvm-intel.vmentry_l1d_flush=[KVM,Intel] Mitigation for L1 Terminal Fault
                        CVE-2018-3620.
 
                        Default is cond (do L1 cache flush in specific instances)
 
-       kvm-intel.vpid= [KVM,Intel] Disable Virtual Processor Identification
-                       feature (tagged TLBs) on capable Intel chips.
-                       Default is 1 (enabled)
+       kvm-intel.vpid= [KVM,Intel] Control KVM's use of Virtual Processor
+                       Identification feature (tagged TLBs). Default is 1
+                       (enabled). Disabled by KVM if hardware lacks support
+                       for it.
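
For reference, a hedged sketch of how the KVM module parameters documented
above are typically set (parameter names as in the entries above; the
modprobe.d file path is only an example location):

    # On the kernel command line:
    kvm-intel.nested=0 kvm-intel.ept=1

    # Or via a modprobe configuration file, e.g. /etc/modprobe.d/kvm.conf:
    options kvm_intel nested=0 ept=1
    options kvm_amd nested=1 npt=1
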
 
        l1d_flush=      [X86,INTEL]
                        Control mitigation for L1D based snooping vulnerability.
                        [HW] Make the MicroTouch USB driver use raw coordinates
                        ('y', default) or cooked coordinates ('n')
 
+       mtrr=debug      [X86]
+                       Enable printing debug information related to MTRR
+                       registers at boot time.
+
        mtrr_chunk_size=nn[KMG] [X86]
                        used for mtrr cleanup. It is largest continuous chunk
                        that could hold holes aka. UC entries.
                        the propagation of recent CPU-hotplug changes up
                        the rcu_node combining tree.
 
-       rcutree.use_softirq=    [KNL]
-                       If set to zero, move all RCU_SOFTIRQ processing to
-                       per-CPU rcuc kthreads.  Defaults to a non-zero
-                       value, meaning that RCU_SOFTIRQ is used by default.
-                       Specify rcutree.use_softirq=0 to use rcuc kthreads.
-
-                       But note that CONFIG_PREEMPT_RT=y kernels disable
-                       this kernel boot parameter, forcibly setting it
-                       to zero.
-
-       rcutree.rcu_fanout_exact= [KNL]
-                       Disable autobalancing of the rcu_node combining
-                       tree.  This is used by rcutorture, and might
-                       possibly be useful for architectures having high
-                       cache-to-cache transfer latencies.
-
-       rcutree.rcu_fanout_leaf= [KNL]
-                       Change the number of CPUs assigned to each
-                       leaf rcu_node structure.  Useful for very
-                       large systems, which will choose the value 64,
-                       and for NUMA systems with large remote-access
-                       latencies, which will choose a value aligned
-                       with the appropriate hardware boundaries.
-
-       rcutree.rcu_min_cached_objs= [KNL]
-                       Minimum number of objects which are cached and
-                       maintained per one CPU. Object size is equal
-                       to PAGE_SIZE. The cache allows to reduce the
-                       pressure to page allocator, also it makes the
-                       whole algorithm to behave better in low memory
-                       condition.
-
-       rcutree.rcu_delay_page_cache_fill_msec= [KNL]
-                       Set the page-cache refill delay (in milliseconds)
-                       in response to low-memory conditions.  The range
-                       of permitted values is in the range 0:100000.
-
        rcutree.jiffies_till_first_fqs= [KNL]
                        Set delay from grace-period initialization to
                        first attempt to force quiescent states.
                        When RCU_NOCB_CPU is set, also adjust the
                        priority of NOCB callback kthreads.
 
-       rcutree.rcu_divisor= [KNL]
-                       Set the shift-right count to use to compute
-                       the callback-invocation batch limit bl from
-                       the number of callbacks queued on this CPU.
-                       The result will be bounded below by the value of
-                       the rcutree.blimit kernel parameter.  Every bl
-                       callbacks, the softirq handler will exit in
-                       order to allow the CPU to do other work.
-
-                       Please note that this callback-invocation batch
-                       limit applies only to non-offloaded callback
-                       invocation.  Offloaded callbacks are instead
-                       invoked in the context of an rcuoc kthread, which
-                       scheduler will preempt as it does any other task.
-
        rcutree.nocb_nobypass_lim_per_jiffy= [KNL]
                        On callback-offloaded (rcu_nocbs) CPUs,
                        RCU reduces the lock contention that would
                        the ->nocb_bypass queue.  The definition of "too
                        many" is supplied by this kernel boot parameter.
 
-       rcutree.rcu_nocb_gp_stride= [KNL]
-                       Set the number of NOCB callback kthreads in
-                       each group, which defaults to the square root
-                       of the number of CPUs.  Larger numbers reduce
-                       the wakeup overhead on the global grace-period
-                       kthread, but increases that same overhead on
-                       each group's NOCB grace-period kthread.
-
        rcutree.qhimark= [KNL]
                        Set threshold of queued RCU callbacks beyond which
                        batch limiting is disabled.
                        on rcutree.qhimark at boot time and to zero to
                        disable more aggressive help enlistment.
 
+       rcutree.rcu_delay_page_cache_fill_msec= [KNL]
+                       Set the page-cache refill delay (in milliseconds)
+                       in response to low-memory conditions.  The range
+                       of permitted values is in the range 0:100000.
+
+       rcutree.rcu_divisor= [KNL]
+                       Set the shift-right count to use to compute
+                       the callback-invocation batch limit bl from
+                       the number of callbacks queued on this CPU.
+                       The result will be bounded below by the value of
+                       the rcutree.blimit kernel parameter.  Every bl
+                       callbacks, the softirq handler will exit in
+                       order to allow the CPU to do other work.
+
+                       Please note that this callback-invocation batch
+                       limit applies only to non-offloaded callback
+                       invocation.  Offloaded callbacks are instead
+                       invoked in the context of an rcuoc kthread, which
+                       the scheduler will preempt as it does any other task.
+
+       rcutree.rcu_fanout_exact= [KNL]
+                       Disable autobalancing of the rcu_node combining
+                       tree.  This is used by rcutorture, and might
+                       possibly be useful for architectures having high
+                       cache-to-cache transfer latencies.
+
+       rcutree.rcu_fanout_leaf= [KNL]
+                       Change the number of CPUs assigned to each
+                       leaf rcu_node structure.  Useful for very
+                       large systems, which will choose the value 64,
+                       and for NUMA systems with large remote-access
+                       latencies, which will choose a value aligned
+                       with the appropriate hardware boundaries.
+
+       rcutree.rcu_min_cached_objs= [KNL]
+                       Minimum number of objects which are cached and
+                       maintained per CPU.  Object size is equal
+                       to PAGE_SIZE.  The cache reduces pressure on
+                       the page allocator and also helps the whole
+                       algorithm behave better under low-memory
+                       conditions.
+
+       rcutree.rcu_nocb_gp_stride= [KNL]
+                       Set the number of NOCB callback kthreads in
+                       each group, which defaults to the square root
+                       of the number of CPUs.  Larger numbers reduce
+                       the wakeup overhead on the global grace-period
+                       kthread, but increase that same overhead on
+                       each group's NOCB grace-period kthread.
+
        rcutree.rcu_kick_kthreads= [KNL]
                        Cause the grace-period kthread to get an extra
                        wake_up() if it sleeps three times longer than
                        This wake_up() will be accompanied by a
                        WARN_ONCE() splat and an ftrace_dump().
 
+       rcutree.rcu_resched_ns= [KNL]
+                       Limit the time spent invoking a batch of RCU
+                       callbacks to the specified number of nanoseconds.
+                       By default, this limit is checked only once
+                       every 32 callbacks in order to limit the pain
+                       inflicted by local_clock() overhead.
+
        rcutree.rcu_unlock_delay= [KNL]
                        In CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels,
                        this specifies an rcu_read_unlock()-time delay
                        rcu_node tree with an eye towards determining
                        why a new grace period has not yet started.
 
+       rcutree.use_softirq=    [KNL]
+                       If set to zero, move all RCU_SOFTIRQ processing to
+                       per-CPU rcuc kthreads.  Defaults to a non-zero
+                       value, meaning that RCU_SOFTIRQ is used by default.
+                       Specify rcutree.use_softirq=0 to use rcuc kthreads.
+
+                       But note that CONFIG_PREEMPT_RT=y kernels disable
+                       this kernel boot parameter, forcibly setting it
+                       to zero.
+
        rcuscale.gp_async= [KNL]
                        Measure performance of asynchronous
                        grace-period primitives such as call_rcu().
 
        rcutorture.stall_cpu_block= [KNL]
                        Sleep while stalling if set.  This will result
-                       in warnings from preemptible RCU in addition
-                       to any other stall-related activity.
+                       in warnings from preemptible RCU in addition to
+                       any other stall-related activity.  Note that
+                       in kernels built with CONFIG_PREEMPTION=n and
+                       CONFIG_PREEMPT_COUNT=y, this parameter will
+                       cause the CPU to pass through a quiescent state.
+                       Given CONFIG_PREEMPTION=n, this will suppress
+                       RCU CPU stall warnings, but will instead result
+                       in scheduling-while-atomic splats.
+
+                       Use of this module parameter results in splats.
+
 
        rcutorture.stall_cpu_holdoff= [KNL]
                        Time to wait (s) after boot before inducing stall.
                        port and the regular usb controller gets disabled.
 
        root=           [KNL] Root filesystem
-                       See name_to_dev_t comment in init/do_mounts.c.
+                       Usually this is a block device specifier of some kind;
+                       see the early_lookup_bdev comment in
+                       block/early-lookup.c for details.
+                       Alternatively this can be "ram" for the legacy initial
+                       ramdisk, "nfs" and "cifs" for root on a network file
+                       system, or "mtd" and "ubi" for mounting from raw flash.
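
                        A few illustrative forms (the device names, UUID and
                        server address below are examples only)::

                          root=/dev/sda2
                          root=PARTUUID=deadbeef-02
                          root=/dev/nfs nfsroot=192.168.1.1:/export/rootfs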
 
        rootdelay=      [KNL] Delay (in seconds) to pause before attempting to
                        mount the root filesystem
        unknown_nmi_panic
                        [X86] Cause panic on unknown NMI.
 
+       unwind_debug    [X86-64]
+                       Enable unwinder debug output.  This can be
+                       useful for debugging certain unwinder error
+                       conditions, including corrupt stacks and
+                       bad/missing unwinder metadata.
+
        usbcore.authorized_default=
                        [USB] Default USB device authorization:
                        (default -1 = authorized except for wireless USB,
                        it can be updated at runtime by writing to the
                        corresponding sysfs file.
 
+       workqueue.cpu_intensive_thresh_us=
+                       Per-cpu work items which run for longer than this
+                       threshold are automatically considered CPU intensive
+                       and excluded from concurrency management to prevent
+                       them from noticeably delaying other per-cpu work
+                       items. Default is 10000 (10ms).
+
+                       If CONFIG_WQ_CPU_INTENSIVE_REPORT is set, the kernel
+                       will report the work functions which violate this
+                       threshold repeatedly. They are likely good
+                       candidates for using WQ_UNBOUND workqueues instead.
+
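                        For example, to raise the threshold to 20ms, boot
                        with::

                          workqueue.cpu_intensive_thresh_us=20000
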
        workqueue.disable_numa
                        By default, all work items queued to unbound
                        workqueues are affine to the NUMA nodes they're
index 9f88afc..7aa0071 100644 (file)
@@ -119,9 +119,9 @@ set size has chronologically changed.::
 Data Access Pattern Aware Memory Management
 ===========================================
 
-Below three commands make every memory region of size >=4K that doesn't
-accessed for >=60 seconds in your workload to be swapped out. ::
+The command below makes every memory region of size >=4K that has not been
+accessed for >=60 seconds in your workload be swapped out. ::
 
-    $ echo "#min-size max-size min-acc max-acc min-age max-age action" > test_scheme
-    $ echo "4K        max      0       0       60s     max     pageout" >> test_scheme
-    $ damo schemes -c test_scheme <pid of your workload>
+    $ sudo damo schemes --damos_access_rate 0 0 --damos_sz_region 4K max \
+                        --damos_age 60s max --damos_action pageout \
+                        <pid of your workload>
index 9b823fe..2d495fa 100644 (file)
@@ -10,9 +10,8 @@ DAMON provides below interfaces for different users.
   `This <https://github.com/awslabs/damo>`_ is for privileged people such as
   system administrators who want a just-working human-friendly interface.
   Using this, users can use the DAMON’s major features in a human-friendly way.
-  It may not be highly tuned for special cases, though.  It supports both
-  virtual and physical address spaces monitoring.  For more detail, please
-  refer to its `usage document
+  It may not be highly tuned for special cases, though.  For more detail,
+  please refer to its `usage document
   <https://github.com/awslabs/damo/blob/next/USAGE.md>`_.
 - *sysfs interface.*
   :ref:`This <sysfs_interface>` is for privileged user space programmers who
@@ -20,11 +19,7 @@ DAMON provides below interfaces for different users.
   features by reading from and writing to special sysfs files.  Therefore,
   you can write and use your personalized DAMON sysfs wrapper programs that
   reads/writes the sysfs files instead of you.  The `DAMON user space tool
-  <https://github.com/awslabs/damo>`_ is one example of such programs.  It
-  supports both virtual and physical address spaces monitoring.  Note that this
-  interface provides only simple :ref:`statistics <damos_stats>` for the
-  monitoring results.  For detailed monitoring results, DAMON provides a
-  :ref:`tracepoint <tracepoint>`.
+  <https://github.com/awslabs/damo>`_ is one example of such programs.
 - *debugfs interface. (DEPRECATED!)*
   :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
   <sysfs_interface>`.  This is deprecated, so users should move to the
@@ -139,7 +134,7 @@ scheme of the kdamond.  Writing ``clear_schemes_tried_regions`` to ``state``
 file clears the DAMON-based operating scheme action tried regions directory for
 each DAMON-based operation scheme of the kdamond.  For details of the
 DAMON-based operation scheme action tried regions directory, please refer to
-:ref:tried_regions section <sysfs_schemes_tried_regions>`.
+:ref:`tried_regions section <sysfs_schemes_tried_regions>`.
 
 If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
 
@@ -259,12 +254,9 @@ be equal or smaller than ``start`` of directory ``N+1``.
 contexts/<N>/schemes/
 ---------------------
 
-For usual DAMON-based data access aware memory management optimizations, users
-would normally want the system to apply a memory management action to a memory
-region of a specific access pattern.  DAMON receives such formalized operation
-schemes from the user and applies those to the target memory regions.  Users
-can get and set the schemes by reading from and writing to files under this
-directory.
+The directory for DAMON-based Operation Schemes (:ref:`DAMOS
+<damon_design_damos>`).  Users can get and set the schemes by reading from and
+writing to files under this directory.
 
 In the beginning, this directory has only one file, ``nr_schemes``.  Writing a
 number (``N``) to the file creates the number of child directories named ``0``
@@ -277,12 +269,12 @@ In each scheme directory, five directories (``access_pattern``, ``quotas``,
 ``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and one file
 (``action``) exist.
 
-The ``action`` file is for setting and getting what action you want to apply to
-memory regions having specific access pattern of the interest.  The keywords
-that can be written to and read from the file and their meaning are as below.
+The ``action`` file is for setting and getting the scheme's :ref:`action
+<damon_design_damos_action>`.  The keywords that can be written to and read
+from the file and their meaning are as below.
 
 Note that support of each action depends on the running DAMON operations set
-`implementation <sysfs_contexts>`.
+:ref:`implementation <sysfs_contexts>`.
 
  - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
    Supported by ``vaddr`` and ``fvaddr`` operations set.
@@ -304,32 +296,21 @@ Note that support of each action depends on the running DAMON operations set
 schemes/<N>/access_pattern/
 ---------------------------
 
-The target access pattern of each DAMON-based operation scheme is constructed
-with three ranges including the size of the region in bytes, number of
-monitored accesses per aggregate interval, and number of aggregated intervals
-for the age of the region.
+The directory for the target access :ref:`pattern
+<damon_design_damos_access_pattern>` of the given DAMON-based operation scheme.
 
 Under the ``access_pattern`` directory, three directories (``sz``,
 ``nr_accesses``, and ``age``) each having two files (``min`` and ``max``)
 exist.  You can set and get the access pattern for the given scheme by writing
 to and reading from the ``min`` and ``max`` files under ``sz``,
-``nr_accesses``, and ``age`` directories, respectively.
+``nr_accesses``, and ``age`` directories, respectively.  Note that the ``min``
+and the ``max`` form a closed interval.
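
A minimal sketch, assuming the DAMON sysfs interface at its usual location
(``/sys/kernel/mm/damon/admin``) and that kdamond ``0``, context ``0`` and
scheme ``0`` already exist; this targets regions of 4 KiB to 16 MiB that
showed no accesses::

    # cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0
    # echo 4096 > access_pattern/sz/min
    # echo $((16 * 1024 * 1024)) > access_pattern/sz/max
    # echo 0 > access_pattern/nr_accesses/min
    # echo 0 > access_pattern/nr_accesses/max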
 
 schemes/<N>/quotas/
 -------------------
 
-Optimal ``target access pattern`` for each ``action`` is workload dependent, so
-not easy to find.  Worse yet, setting a scheme of some action too aggressive
-can cause severe overhead.  To avoid such overhead, users can limit time and
-size quota for each scheme.  In detail, users can ask DAMON to try to use only
-up to specific time (``time quota``) for applying the action, and to apply the
-action to only up to specific amount (``size quota``) of memory regions having
-the target access pattern within a given time interval (``reset interval``).
-
-When the quota limit is expected to be exceeded, DAMON prioritizes found memory
-regions of the ``target access pattern`` based on their size, access frequency,
-and age.  For personalized prioritization, users can set the weights for the
-three properties.
+The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given
+DAMON-based operation scheme.
 
 Under ``quotas`` directory, three files (``ms``, ``bytes``,
 ``reset_interval_ms``) and one directory (``weights``) having three files
@@ -337,23 +318,26 @@ Under ``quotas`` directory, three files (``ms``, ``bytes``,
 
 You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
 ``reset interval`` in milliseconds by writing the values to the three files,
-respectively.  You can also set the prioritization weights for size, access
-frequency, and age in per-thousand unit by writing the values to the three
-files under the ``weights`` directory.
+respectively.  Then, DAMON tries to use only up to ``time quota`` milliseconds
+for applying the ``action`` to memory regions of the ``access_pattern``, and to
+apply the action to only up to ``bytes`` bytes of memory regions within the
+``reset_interval_ms``.  Setting both ``ms`` and ``bytes`` zero disables the
+quota limits.
+
+You can also set the :ref:`prioritization weights
+<damon_design_damos_quotas_prioritization>` for size, access frequency, and age
+in per-thousand unit by writing the values to the three files under the
+``weights`` directory.
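
A minimal sketch under the same assumptions as above (the usual sysfs location
and an already-created scheme), limiting the scheme to 10ms of action-applying
time and 100 MiB of applied memory per second::

    # cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/quotas
    # echo 10 > ms
    # echo $((100 * 1024 * 1024)) > bytes
    # echo 1000 > reset_interval_ms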
 
 schemes/<N>/watermarks/
 -----------------------
 
-To allow easy activation and deactivation of each scheme based on system
-status, DAMON provides a feature called watermarks.  The feature receives five
-values called ``metric``, ``interval``, ``high``, ``mid``, and ``low``.  The
-``metric`` is the system metric such as free memory ratio that can be measured.
-If the metric value of the system is higher than the value in ``high`` or lower
-than ``low`` at the memoent, the scheme is deactivated.  If the value is lower
-than ``mid``, the scheme is activated.
+The directory for the :ref:`watermarks <damon_design_damos_watermarks>` of the
+given DAMON-based operation scheme.
 
 Under the watermarks directory, five files (``metric``, ``interval_us``,
-``high``, ``mid``, and ``low``) for setting each value exist.  You can set and
+``high``, ``mid``, and ``low``) for setting the metric, the time interval
+between checks of the metric, and the three watermarks exist.  You can set and
 get the five values by writing to the files, respectively.
 
 Keywords and meanings of those that can be written to the ``metric`` file are
@@ -367,12 +351,8 @@ The ``interval`` should written in microseconds unit.
 schemes/<N>/filters/
 --------------------
 
-Users could know something more than the kernel for specific types of memory.
-In the case, users could do their own management for the memory and hence
-doesn't want DAMOS bothers that.  Users could limit DAMOS by setting the access
-pattern of the scheme and/or the monitoring regions for the purpose, but that
-can be inefficient in some cases.  In such cases, users could set non-access
-pattern driven filters using files in this directory.
+The directory for the :ref:`filters <damon_design_damos_filters>` of the given
+DAMON-based operation scheme.
 
 In the beginning, this directory has only one file, ``nr_filters``.  Writing a
 number (``N``) to the file creates the number of child directories named ``0``
@@ -432,13 +412,17 @@ starting from ``0`` under this directory.  Each directory contains files
 exposing detailed information about each of the memory region that the
 corresponding scheme's ``action`` has tried to be applied under this directory,
 during next :ref:`aggregation interval <sysfs_monitoring_attrs>`.  The
-information includes address range, ``nr_accesses``, , and ``age`` of the
-region.
+information includes address range, ``nr_accesses``, and ``age`` of the region.
 
 The directories will be removed when another special keyword,
 ``clear_schemes_tried_regions``, is written to the relevant
 ``kdamonds/<N>/state`` file.
 
+The expected usage of this directory is investigation of schemes' behaviors
+and query-like, efficient retrieval of data access monitoring results.  For
+the latter use case in particular, users can set the ``action`` to ``stat``
+and set the ``access pattern`` to the pattern that they want to query.
+
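A minimal sketch under the same sysfs assumptions as above: use the no-op
``stat`` action so that nothing is actually applied, set the
``access_pattern`` files to the pattern of interest, and read back the
per-region results once the kdamond has run::

    # echo stat > /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/action
    # grep . /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/tried_regions/0/*
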
 tried_regions/<N>/
 ------------------
 
@@ -600,15 +584,10 @@ update.
 Schemes
 -------
 
-For usual DAMON-based data access aware memory management optimizations, users
-would simply want the system to apply a memory management action to a memory
-region of a specific access pattern.  DAMON receives such formalized operation
-schemes from the user and applies those to the target processes.
-
-Users can get and set the schemes by reading from and writing to ``schemes``
-debugfs file.  Reading the file also shows the statistics of each scheme.  To
-the file, each of the schemes should be represented in each line in below
-form::
+Users can get and set the DAMON-based operation :ref:`schemes
+<damon_design_damos>` by reading from and writing to ``schemes`` debugfs file.
+Reading the file also shows the statistics of each scheme.  To the file, each
+of the schemes should be represented in each line in below form::
 
     <target access pattern> <action> <quota> <watermarks>
 
@@ -617,8 +596,9 @@ You can disable schemes by simply writing an empty string to the file.
 Target Access Pattern
 ~~~~~~~~~~~~~~~~~~~~~
 
-The ``<target access pattern>`` is constructed with three ranges in below
-form::
+The target access :ref:`pattern <damon_design_damos_access_pattern>` of the
+scheme.  The ``<target access pattern>`` is constructed with three ranges in
+below form::
 
     min-size max-size min-acc max-acc min-age max-age
 
@@ -631,9 +611,9 @@ closed interval.
 Action
 ~~~~~~
 
-The ``<action>`` is a predefined integer for memory management actions, which
-DAMON will apply to the regions having the target access pattern.  The
-supported numbers and their meanings are as below.
+The ``<action>`` is a predefined integer for memory management :ref:`actions
+<damon_design_damos_action>`.  The supported numbers and their meanings are as
+below.
 
  - 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``.  Ignored if
    ``target`` is ``paddr``.
@@ -649,10 +629,8 @@ supported numbers and their meanings are as below.
 Quota
 ~~~~~
 
-Optimal ``target access pattern`` for each ``action`` is workload dependent, so
-not easy to find.  Worse yet, setting a scheme of some action too aggressive
-can cause severe overhead.  To avoid such overhead, users can limit time and
-size quota for the scheme via the ``<quota>`` in below form::
+Users can set the :ref:`quotas <damon_design_damos_quotas>` of the given scheme
+via the ``<quota>`` in below form::
 
     <ms> <sz> <reset interval> <priority weights>
 
@@ -662,19 +640,17 @@ the action to memory regions of the ``target access pattern`` within the
 ``<sz>`` bytes of memory regions within the ``<reset interval>``.  Setting both
 ``<ms>`` and ``<sz>`` zero disables the quota limits.
 
-When the quota limit is expected to be exceeded, DAMON prioritizes found memory
-regions of the ``target access pattern`` based on their size, access frequency,
-and age.  For personalized prioritization, users can set the weights for the
-three properties in ``<priority weights>`` in below form::
+For the :ref:`prioritization <damon_design_damos_quotas_prioritization>`, users
+can set the weights for the three properties in ``<priority weights>`` in below
+form::
 
     <size weight> <access frequency weight> <age weight>
 
 Watermarks
 ~~~~~~~~~~
 
-Some schemes would need to run based on current value of the system's specific
-metrics like free memory ratio.  For such cases, users can specify watermarks
-for the condition.::
+Users can specify :ref:`watermarks <damon_design_damos_watermarks>` of the
+given scheme via ``<watermarks>`` in below form::
 
     <metric> <check interval> <high mark> <middle mark> <low mark>
 
@@ -797,10 +773,12 @@ root directory only.
 Tracepoint for Monitoring Results
 =================================
 
-DAMON provides the monitoring results via a tracepoint,
-``damon:damon_aggregated``.  While the monitoring is turned on, you could
-record the tracepoint events and show results using tracepoint supporting tools
-like ``perf``.  For example::
+Users can get the monitoring results via the :ref:`tried_regions
+<sysfs_schemes_tried_regions>` or a tracepoint, ``damon:damon_aggregated``.
+While the tried regions directory is useful for getting a snapshot, the
+tracepoint is useful for getting a full record of the results.  While the
+monitoring is turned on, you could record the tracepoint events and show
+results using tracepoint supporting tools like ``perf``.  For example::
 
     # echo on > monitor_on
     # perf record -e damon:damon_aggregated &
index 5469793..e0174d2 100644 (file)
@@ -56,14 +56,14 @@ Example usage of perf::
 For HiSilicon uncore PMU v2 whose identifier is 0x30, the topology is the same
 as PMU v1, but some new functions are added to the hardware.
 
-(a) L3C PMU supports filtering by core/thread within the cluster which can be
+1. L3C PMU supports filtering by core/thread within the cluster which can be
 specified as a bitmap::
 
   $# perf stat -a -e hisi_sccl3_l3c0/config=0x02,tt_core=0x3/ sleep 5
 
 This will only count the operations from core/thread 0 and 1 in this cluster.
 
-(b) Tracetag allow the user to chose to count only read, write or atomic
+2. Tracetag allows the user to choose to count only read, write or atomic
 operations via the tt_req parameeter in perf. The default value counts all
 operations. tt_req is 3bits, 3'b100 represents read operations, 3'b101
 represents write operations, 3'b110 represents atomic store operations and
@@ -73,14 +73,16 @@ represents write operations, 3'b110 represents atomic store operations and
 
 This will only count the read operations in this cluster.
 
-(c) Datasrc allows the user to check where the data comes from. It is 5 bits.
+3. Datasrc allows the user to check where the data comes from. It is 5 bits.
 Some important codes are as follows:
-5'b00001: comes from L3C in this die;
-5'b01000: comes from L3C in the cross-die;
-5'b01001: comes from L3C which is in another socket;
-5'b01110: comes from the local DDR;
-5'b01111: comes from the cross-die DDR;
-5'b10000: comes from cross-socket DDR;
+
+- 5'b00001: comes from L3C in this die;
+- 5'b01000: comes from L3C in the cross-die;
+- 5'b01001: comes from L3C which is in another socket;
+- 5'b01110: comes from the local DDR;
+- 5'b01111: comes from the cross-die DDR;
+- 5'b10000: comes from cross-socket DDR;
+
 etc, it is mainly helpful to find that the data source is nearest from the CPU
 cores. If datasrc_cfg is used in the multi-chips, the datasrc_skt shall be
 configured in perf command::
@@ -88,15 +90,25 @@ configured in perf command::
   $# perf stat -a -e hisi_sccl3_l3c0/config=0xb9,datasrc_cfg=0xE/,
   hisi_sccl3_l3c0/config=0xb9,datasrc_cfg=0xF/ sleep 5
 
-(d)Some HiSilicon SoCs encapsulate multiple CPU and IO dies. Each CPU die
+4. Some HiSilicon SoCs encapsulate multiple CPU and IO dies. Each CPU die
 contains several Compute Clusters (CCLs). The I/O dies are called Super I/O
 clusters (SICL) containing multiple I/O clusters (ICLs). Each CCL/ICL in the
 SoC has a unique ID. Each ID is 11bits, include a 6-bit SCCL-ID and 5-bit
 CCL/ICL-ID. For I/O die, the ICL-ID is followed by:
-5'b00000: I/O_MGMT_ICL;
-5'b00001: Network_ICL;
-5'b00011: HAC_ICL;
-5'b10000: PCIe_ICL;
+
+- 5'b00000: I/O_MGMT_ICL;
+- 5'b00001: Network_ICL;
+- 5'b00011: HAC_ICL;
+- 5'b10000: PCIe_ICL;
+
+5. uring_channel: UC PMU events 0x47~0x59 support filtering by tx request
+uring channel. It is 2 bits. Some important codes are as follows:
+
+- 2'b11: count the events which are sent to the uring_ext (MATA) channel;
+- 2'b01: the same as 2'b11;
+- 2'b10: count the events which are sent to the uring (non-MATA) channel;
+- 2'b00: default value, count the events which are sent to both the uring
+  and uring_ext channels;
 
 Users could configure IDs to count data come from specific CCL/ICL, by setting
 srcid_cmd & srcid_msk, and data desitined for specific CCL/ICL by setting
index d85d90f..3800fab 100644 (file)
@@ -949,7 +949,7 @@ user space can read performance monitor counter registers directly.
 
 The default value is 0 (access disabled).
 
-See Documentation/arm64/perf.rst for more information.
+See Documentation/arch/arm64/perf.rst for more information.
 
 
 pid_max
similarity index 91%
rename from Documentation/arm64/acpi_object_usage.rst
rename to Documentation/arch/arm64/acpi_object_usage.rst
index 484ef96..1da2220 100644 (file)
@@ -17,16 +17,37 @@ For ACPI on arm64, tables also fall into the following categories:
 
        -  Recommended: BERT, EINJ, ERST, HEST, PCCT, SSDT
 
-       -  Optional: BGRT, CPEP, CSRT, DBG2, DRTM, ECDT, FACS, FPDT, IBFT,
-          IORT, MCHI, MPST, MSCT, NFIT, PMTT, RASF, SBST, SLIT, SPMI, SRAT,
-          STAO, TCPA, TPM2, UEFI, XENV
+       -  Optional: AGDI, BGRT, CEDT, CPEP, CSRT, DBG2, DRTM, ECDT, FACS, FPDT,
+          HMAT, IBFT, IORT, MCHI, MPAM, MPST, MSCT, NFIT, PMTT, PPTT, RASF, SBST,
+          SDEI, SLIT, SPMI, SRAT, STAO, TCPA, TPM2, UEFI, XENV
 
-       -  Not supported: BOOT, DBGP, DMAR, ETDT, HPET, IVRS, LPIT, MSDM, OEMx,
-          PSDT, RSDT, SLIC, WAET, WDAT, WDRT, WPBT
+       -  Not supported: AEST, APMT, BOOT, DBGP, DMAR, ETDT, HPET, IVRS, LPIT,
+          MSDM, OEMx, PDTT, PSDT, RAS2, RSDT, SLIC, WAET, WDAT, WDRT, WPBT
 
 ====== ========================================================================
 Table  Usage for ARMv8 Linux
 ====== ========================================================================
+AEST   Signature Reserved (signature == "AEST")
+
+       **Arm Error Source Table**
+
+       This table informs the OS of any error nodes in the system that are
+       compliant with the Arm RAS architecture.
+
+AGDI   Signature Reserved (signature == "AGDI")
+
+       **Arm Generic diagnostic Dump and Reset Device Interface Table**
+
+       This table describes a non-maskable event that is used by the platform
+       firmware to request the OS to generate a diagnostic dump and reset the device.
+
+APMT   Signature Reserved (signature == "APMT")
+
+       **Arm Performance Monitoring Table**
+
+       This table describes the properties of PMU support implemented by
+       components in the system.
+
 BERT   Section 18.3 (signature == "BERT")
 
        **Boot Error Record Table**
@@ -47,6 +68,13 @@ BGRT   Section 5.2.22 (signature == "BGRT")
        Optional, not currently supported, with no real use-case for an
        ARM server.
 
+CEDT   Signature Reserved (signature == "CEDT")
+
+       **CXL Early Discovery Table**
+
+       This table allows the OS to discover any CXL Host Bridges and the Host
+       Bridge registers.
+
 CPEP   Section 5.2.18 (signature == "CPEP")
 
        **Corrected Platform Error Polling table**
@@ -184,6 +212,15 @@ HEST   Section 18.3.2 (signature == "HEST")
        Must be supplied if RAS support is provided by the platform.  It
        is recommended this table be supplied.
 
+HMAT   Section 5.2.28 (signature == "HMAT")
+
+       **Heterogeneous Memory Attribute Table**
+
+       This table describes the memory attributes, such as memory side cache
+       attributes and bandwidth and latency details, related to Memory Proximity
+       Domains. The OS uses this information to optimize the system memory
+       configuration.
+
 HPET   Signature Reserved (signature == "HPET")
 
        **High Precision Event timer Table**
@@ -241,6 +278,13 @@ MCHI   Signature Reserved (signature == "MCHI")
 
        Optional, not currently supported.
 
+MPAM   Signature Reserved (signature == "MPAM")
+
+       **Memory Partitioning And Monitoring table**
+
+       This table allows the OS to discover the MPAM controls implemented by
+       the subsystems.
+
 MPST   Section 5.2.21 (signature == "MPST")
 
        **Memory Power State Table**
@@ -281,18 +325,39 @@ PCCT   Section 14.1 (signature == "PCCT)
        Recommend for use on arm64; use of PCC is recommended when using CPPC
        to control performance and power for platform processors.
 
+PDTT   Section 5.2.29 (signature == "PDTT")
+
+       **Platform Debug Trigger Table**
+
+       This table describes PCC channels used to gather debug logs of
+       non-architectural features.
+
+
 PMTT   Section 5.2.21.12 (signature == "PMTT")
 
        **Platform Memory Topology Table**
 
        Optional, not currently supported.
 
+PPTT   Section 5.2.30 (signature == "PPTT")
+
+       **Processor Properties Topology Table**
+
+       This table provides the processor and cache topology.
+
 PSDT   Section 5.2.11.3 (signature == "PSDT")
 
        **Persistent System Description Table**
 
        Obsolete table, will not be supported.
 
+RAS2   Section 5.2.21 (signature == "RAS2")
+
+       **RAS Features 2 table**
+
+       This table provides interfaces for the RAS capabilities implemented in
+       the platform.
+
 RASF   Section 5.2.20 (signature == "RASF")
 
        **RAS Feature table**
@@ -318,6 +383,12 @@ SBST   Section 5.2.14 (signature == "SBST")
 
        Optional, not currently supported.
 
+SDEI   Signature Reserved (signature == "SDEI")
+
+       **Software Delegated Exception Interface table**
+
+       This table advertises the presence of the SDEI interface.
+
 SLIC   Signature Reserved (signature == "SLIC")
 
        **Software LIcensing table**
similarity index 82%
rename from Documentation/arm64/arm-acpi.rst
rename to Documentation/arch/arm64/arm-acpi.rst
index 47ecb99..94274a8 100644 (file)
@@ -1,40 +1,41 @@
-=====================
-ACPI on ARMv8 Servers
-=====================
-
-ACPI can be used for ARMv8 general purpose servers designed to follow
-the ARM SBSA (Server Base System Architecture) [0] and SBBR (Server
-Base Boot Requirements) [1] specifications.  Please note that the SBBR
-can be retrieved simply by visiting [1], but the SBSA is currently only
-available to those with an ARM login due to ARM IP licensing concerns.
-
-The ARMv8 kernel implements the reduced hardware model of ACPI version
+===================
+ACPI on Arm systems
+===================
+
+ACPI can be used for Armv8 and Armv9 systems designed to follow
+the BSA (Arm Base System Architecture) [0] and BBR (Arm
+Base Boot Requirements) [1] specifications.  Both BSA and BBR are publicly
+accessible documents.
+Arm Servers, in addition to being BSA compliant, comply with a set
+of rules defined in SBSA (Server Base System Architecture) [2].
+
+The Arm kernel implements the reduced hardware model of ACPI version
 5.1 or later.  Links to the specification and all external documents
 it refers to are managed by the UEFI Forum.  The specification is
 available at http://www.uefi.org/specifications and documents referenced
 by the specification can be found via http://www.uefi.org/acpi.
 
-If an ARMv8 system does not meet the requirements of the SBSA and SBBR,
+If an Arm system does not meet the requirements of the BSA and BBR,
 or cannot be described using the mechanisms defined in the required ACPI
 specifications, then ACPI may not be a good fit for the hardware.
 
 While the documents mentioned above set out the requirements for building
-industry-standard ARMv8 servers, they also apply to more than one operating
+industry-standard Arm systems, they also apply to more than one operating
 system.  The purpose of this document is to describe the interaction between
-ACPI and Linux only, on an ARMv8 system -- that is, what Linux expects of
+ACPI and Linux only, on an Arm system -- that is, what Linux expects of
 ACPI and what ACPI can expect of Linux.
 
 
-Why ACPI on ARM?
+Why ACPI on Arm?
 ----------------
 Before examining the details of the interface between ACPI and Linux, it is
 useful to understand why ACPI is being used.  Several technologies already
 exist in Linux for describing non-enumerable hardware, after all.  In this
-section we summarize a blog post [2] from Grant Likely that outlines the
-reasoning behind ACPI on ARMv8 servers.  Actually, we snitch a good portion
+section we summarize a blog post [3] from Grant Likely that outlines the
+reasoning behind ACPI on Arm systems.  Actually, we snitch a good portion
 of the summary text almost directly, to be honest.
 
-The short form of the rationale for ACPI on ARM is:
+The short form of the rationale for ACPI on Arm is:
 
 -  ACPI’s byte code (AML) allows the platform to encode hardware behavior,
    while DT explicitly does not support this.  For hardware vendors, being
@@ -47,7 +48,7 @@ The short form of the rationale for ACPI on ARM is:
 
 -  In the enterprise server environment, ACPI has established bindings (such
    as for RAS) which are currently used in production systems.  DT does not.
-   Such bindings could be defined in DT at some point, but doing so means ARM
+   Such bindings could be defined in DT at some point, but doing so means Arm
    and x86 would end up using completely different code paths in both firmware
    and the kernel.
 
@@ -108,7 +109,7 @@ recent version of the kernel.
 
 Relationship with Device Tree
 -----------------------------
-ACPI support in drivers and subsystems for ARMv8 should never be mutually
+ACPI support in drivers and subsystems for Arm should never be mutually
 exclusive with DT support at compile time.
 
 At boot time the kernel will only use one description method depending on
@@ -121,11 +122,11 @@ time).
 
 Booting using ACPI tables
 -------------------------
-The only defined method for passing ACPI tables to the kernel on ARMv8
+The only defined method for passing ACPI tables to the kernel on Arm
 is via the UEFI system configuration table.  Just so it is explicit, this
 means that ACPI is only supported on platforms that boot via UEFI.
 
-When an ARMv8 system boots, it can either have DT information, ACPI tables,
+When an Arm system boots, it can either have DT information, ACPI tables,
 or in some very unusual cases, both.  If no command line parameters are used,
 the kernel will try to use DT for device enumeration; if there is no DT
 present, the kernel will try to use ACPI tables, but only if they are present.
@@ -169,7 +170,7 @@ hardware reduced mode must be set to zero.
 
 For the ACPI core to operate properly, and in turn provide the information
 the kernel needs to configure devices, it expects to find the following
-tables (all section numbers refer to the ACPI 6.1 specification):
+tables (all section numbers refer to the ACPI 6.5 specification):
 
     -  RSDP (Root System Description Pointer), section 5.2.5
 
@@ -184,20 +185,76 @@ tables (all section numbers refer to the ACPI 6.1 specification):
 
     -  GTDT (Generic Timer Description Table), section 5.2.24
 
+    -  PPTT (Processor Properties Topology Table), section 5.2.30
+
+    -  DBG2 (DeBuG port table 2), section 5.2.6, specifically Table 5-6.
+
+    -  APMT (Arm Performance Monitoring unit Table), section 5.2.6, specifically Table 5-6.
+
+    -  AGDI (Arm Generic diagnostic Dump and Reset Device Interface Table), section 5.2.6, specifically Table 5-6.
+
     -  If PCI is supported, the MCFG (Memory mapped ConFiGuration
-       Table), section 5.2.6, specifically Table 5-31.
+       Table), section 5.2.6, specifically Table 5-6.
 
     -  If booting without a console=<device> kernel parameter is
        supported, the SPCR (Serial Port Console Redirection table),
-       section 5.2.6, specifically Table 5-31.
+       section 5.2.6, specifically Table 5-6.
 
     -  If necessary to describe the I/O topology, SMMUs and GIC ITSs,
        the IORT (Input Output Remapping Table, section 5.2.6, specifically
-       Table 5-31).
+       Table 5-6).
+
+    -  If NUMA is supported, the following tables are required:
+
+       - SRAT (System Resource Affinity Table), section 5.2.16
+
+       - SLIT (System Locality distance Information Table), section 5.2.17
+
+    -  If NUMA is supported, and the system contains heterogeneous memory,
+       the HMAT (Heterogeneous Memory Attribute Table), section 5.2.28.
+
+    -  If the ACPI Platform Error Interfaces are required, the following
+       tables are conditionally required:
+
+       - BERT (Boot Error Record Table, section 18.3.1)
+
+       - EINJ (Error INJection table, section 18.6.1)
+
+       - ERST (Error Record Serialization Table, section 18.5)
+
+       - HEST (Hardware Error Source Table, section 18.3.2)
+
+       - SDEI (Software Delegated Exception Interface table, section 5.2.6,
+         specifically Table 5-6)
+
+       - AEST (Arm Error Source Table, section 5.2.6,
+         specifically Table 5-6)
+
+       - RAS2 (ACPI RAS2 feature table, section 5.2.21)
+
+    -  If the system contains controllers using PCC channel, the
+       PCCT (Platform Communications Channel Table), section 14.1
+
+    -  If the system contains a controller to capture board-level system state,
+       and communicates with the host via PCC, the PDTT (Platform Debug Trigger
+       Table), section 5.2.29.
+
+    -  If NVDIMM is supported, the NFIT (NVDIMM Firmware Interface Table), section 5.2.26
+
+    -  If video framebuffer is present, the BGRT (Boot Graphics Resource Table), section 5.2.23
+
+    -  If IPMI is implemented, the SPMI (Server Platform Management Interface),
+       section 5.2.6, specifically Table 5-6.
+
+    -  If the system contains a CXL Host Bridge, the CEDT (CXL Early Discovery
+       Table), section 5.2.6, specifically Table 5-6.
+
+    -  If the system supports MPAM, the MPAM (Memory Partitioning And Monitoring table), section 5.2.6,
+       specifically Table 5-6.
+
+    -  If the system lacks persistent storage, the IBFT (ISCSI Boot Firmware
+       Table), section 5.2.6, specifically Table 5-6.
 
-    -  If NUMA is supported, the SRAT (System Resource Affinity Table)
-       and SLIT (System Locality distance Information Table), sections
-       5.2.16 and 5.2.17, respectively.
 
 If the above tables are not all present, the kernel may or may not be
 able to boot properly since it may not be able to configure all of the
@@ -269,16 +326,14 @@ Drivers should look for device properties in the _DSD object ONLY; the _DSD
 object is described in the ACPI specification section 6.2.5, but this only
 describes how to define the structure of an object returned via _DSD, and
 how specific data structures are defined by specific UUIDs.  Linux should
-only use the _DSD Device Properties UUID [5]:
+only use the _DSD Device Properties UUID [4]:
 
    - UUID: daffd814-6eba-4d8c-8a91-bc9bbf4aa301
 
-   - https://www.uefi.org/sites/default/files/resources/_DSD-device-properties-UUID.pdf
-
-The UEFI Forum provides a mechanism for registering device properties [4]
-so that they may be used across all operating systems supporting ACPI.
-Device properties that have not been registered with the UEFI Forum should
-not be used.
+Common device properties can be registered by creating a pull request to [4] so
+that they may be used across all operating systems supporting ACPI.
+Device properties that have not been registered with the UEFI Forum can be used
+but not as "uefi-" common properties.
 
 Before creating new device properties, check to be sure that they have not
 been defined before and either registered in the Linux kernel documentation
@@ -306,7 +361,7 @@ process.
 
 Once registration and review have been completed, the kernel provides an
 interface for looking up device properties in a manner independent of
-whether DT or ACPI is being used.  This API should be used [6]; it can
+whether DT or ACPI is being used.  This API should be used [5]; it can
 eliminate some duplication of code paths in driver probing functions and
 discourage divergence between DT bindings and ACPI device properties.
 
@@ -448,15 +503,15 @@ ASWG
 ----
 The ACPI specification changes regularly.  During the year 2014, for instance,
 version 5.1 was released and version 6.0 substantially completed, with most of
-the changes being driven by ARM-specific requirements.  Proposed changes are
+the changes being driven by Arm-specific requirements.  Proposed changes are
 presented and discussed in the ASWG (ACPI Specification Working Group) which
 is a part of the UEFI Forum.  The current version of the ACPI specification
-is 6.1 release in January 2016.
+is 6.5 release in August 2022.
 
 Participation in this group is open to all UEFI members.  Please see
 http://www.uefi.org/workinggroup for details on group membership.
 
-It is the intent of the ARMv8 ACPI kernel code to follow the ACPI specification
+It is the intent of the Arm ACPI kernel code to follow the ACPI specification
 as closely as possible, and to only implement functionality that complies with
 the released standards from UEFI ASWG.  As a practical matter, there will be
 vendors that provide bad ACPI tables or violate the standards in some way.
@@ -470,12 +525,12 @@ likely be willing to assist in submitting ECRs.
 
 Linux Code
 ----------
-Individual items specific to Linux on ARM, contained in the Linux
+Individual items specific to Linux on Arm, contained in the Linux
 source code, are in the list that follows:
 
 ACPI_OS_NAME
                        This macro defines the string to be returned when
-                       an ACPI method invokes the _OS method.  On ARM64
+                       an ACPI method invokes the _OS method.  On Arm
                        systems, this macro will be "Linux" by default.
                        The command line parameter acpi_os=<string>
                        can be used to set it to some other value.  The
@@ -485,36 +540,28 @@ ACPI_OS_NAME
 ACPI Objects
 ------------
 Detailed expectations for ACPI tables and object are listed in the file
-Documentation/arm64/acpi_object_usage.rst.
+Documentation/arch/arm64/acpi_object_usage.rst.
 
 
 References
 ----------
-[0] http://silver.arm.com
-    document ARM-DEN-0029, or newer:
-    "Server Base System Architecture", version 2.3, dated 27 Mar 2014
+[0] https://developer.arm.com/documentation/den0094/latest
+    document Arm-DEN-0094: "Arm Base System Architecture", version 1.0C, dated 6 Oct 2022
+
+[1] https://developer.arm.com/documentation/den0044/latest
+    Document Arm-DEN-0044: "Arm Base Boot Requirements", version 2.0G, dated 15 Apr 2022
 
-[1] http://infocenter.arm.com/help/topic/com.arm.doc.den0044a/Server_Base_Boot_Requirements.pdf
-    Document ARM-DEN-0044A, or newer: "Server Base Boot Requirements, System
-    Software on ARM Platforms", dated 16 Aug 2014
+[2] https://developer.arm.com/documentation/den0029/latest
+    Document Arm-DEN-0029: "Arm Server Base System Architecture", version 7.1, dated 06 Oct 2022
 
-[2] http://www.secretlab.ca/archives/151,
+[3] http://www.secretlab.ca/archives/151,
     10 Jan 2015, Copyright (c) 2015,
     Linaro Ltd., written by Grant Likely.
 
-[3] AMD ACPI for Seattle platform documentation
-    http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Seattle_ACPI_Guide.pdf
-
-
-[4] http://www.uefi.org/acpi
-    please see the link for the "ACPI _DSD Device
-    Property Registry Instructions"
-
-[5] http://www.uefi.org/acpi
-    please see the link for the "_DSD (Device
-    Specific Data) Implementation Guide"
+[4] _DSD (Device Specific Data) Implementation Guide
+    https://github.com/UEFI/DSD-Guide/blob/main/dsd-guide.pdf
 
-[6] Kernel code for the unified device
+[5] Kernel code for the unified device
     property interface can be found in
     include/linux/property.h and drivers/base/property.c.
 
similarity index 94%
rename from Documentation/arm64/booting.rst
rename to Documentation/arch/arm64/booting.rst
index ffeccdd..b57776a 100644 (file)
@@ -379,6 +379,38 @@ Before jumping into the kernel, the following conditions must be met:
 
     - SMCR_EL2.EZT0 (bit 30) must be initialised to 0b1.
 
+  For CPUs with Memory Copy and Memory Set instructions (FEAT_MOPS):
+
+  - If the kernel is entered at EL1 and EL2 is present:
+
+    - HCRX_EL2.MSCEn (bit 11) must be initialised to 0b1.
+
+  For CPUs with the Extended Translation Control Register feature (FEAT_TCR2):
+
+  - If EL3 is present:
+
+    - SCR_EL3.TCR2En (bit 43) must be initialised to 0b1.
+
+  - If the kernel is entered at EL1 and EL2 is present:
+
+    - HCRX_EL2.TCR2En (bit 14) must be initialised to 0b1.
+
+  For CPUs with the Stage 1 Permission Indirection Extension feature (FEAT_S1PIE):
+
+  - If EL3 is present:
+
+    - SCR_EL3.PIEn (bit 45) must be initialised to 0b1.
+
+  - If the kernel is entered at EL1 and EL2 is present:
+
+    - HFGRTR_EL2.nPIR_EL1 (bit 58) must be initialised to 0b1.
+
+    - HFGWTR_EL2.nPIR_EL1 (bit 58) must be initialised to 0b1.
+
+    - HFGRTR_EL2.nPIRE0_EL1 (bit 57) must be initialised to 0b1.
+
+    - HFGWTR_EL2.nPIRE0_EL1 (bit 57) must be initialised to 0b1.
+
 The requirements described above for CPU mode, caches, MMUs, architected
 timers, coherency and system registers apply to all CPUs.  All CPUs must
 enter the kernel in the same exception level.  Where the values documented
@@ -288,6 +288,8 @@ infrastructure:
      +------------------------------+---------+---------+
      | Name                         |  bits   | visible |
      +------------------------------+---------+---------+
+     | MOPS                         | [19-16] |    y    |
+     +------------------------------+---------+---------+
      | RPRES                        | [7-4]   |    y    |
      +------------------------------+---------+---------+
      | WFXT                         | [3-0]   |    y    |
similarity index 95%
rename from Documentation/arm64/elf_hwcaps.rst
rename to Documentation/arch/arm64/elf_hwcaps.rst
index 83e57e4..8c8addb 100644 (file)
@@ -102,7 +102,7 @@ HWCAP_ASIMDHP
 
 HWCAP_CPUID
     EL0 access to certain ID registers is available, to the extent
-    described by Documentation/arm64/cpu-feature-registers.rst.
+    described by Documentation/arch/arm64/cpu-feature-registers.rst.
 
     These ID registers may imply the availability of features.
 
@@ -163,12 +163,12 @@ HWCAP_SB
 HWCAP_PACA
     Functionality implied by ID_AA64ISAR1_EL1.APA == 0b0001 or
     ID_AA64ISAR1_EL1.API == 0b0001, as described by
-    Documentation/arm64/pointer-authentication.rst.
+    Documentation/arch/arm64/pointer-authentication.rst.
 
 HWCAP_PACG
     Functionality implied by ID_AA64ISAR1_EL1.GPA == 0b0001 or
     ID_AA64ISAR1_EL1.GPI == 0b0001, as described by
-    Documentation/arm64/pointer-authentication.rst.
+    Documentation/arch/arm64/pointer-authentication.rst.
 
 HWCAP2_DCPODP
     Functionality implied by ID_AA64ISAR1_EL1.DPB == 0b0010.
@@ -226,7 +226,7 @@ HWCAP2_BTI
 
 HWCAP2_MTE
     Functionality implied by ID_AA64PFR1_EL1.MTE == 0b0010, as described
-    by Documentation/arm64/memory-tagging-extension.rst.
+    by Documentation/arch/arm64/memory-tagging-extension.rst.
 
 HWCAP2_ECV
     Functionality implied by ID_AA64MMFR0_EL1.ECV == 0b0001.
@@ -239,11 +239,11 @@ HWCAP2_RPRES
 
 HWCAP2_MTE3
     Functionality implied by ID_AA64PFR1_EL1.MTE == 0b0011, as described
-    by Documentation/arm64/memory-tagging-extension.rst.
+    by Documentation/arch/arm64/memory-tagging-extension.rst.
 
 HWCAP2_SME
     Functionality implied by ID_AA64PFR1_EL1.SME == 0b0001, as described
-    by Documentation/arm64/sme.rst.
+    by Documentation/arch/arm64/sme.rst.
 
 HWCAP2_SME_I16I64
     Functionality implied by ID_AA64SMFR0_EL1.I16I64 == 0b1111.
@@ -302,6 +302,9 @@ HWCAP2_SMEB16B16
 HWCAP2_SMEF16F16
     Functionality implied by ID_AA64SMFR0_EL1.F16F16 == 0b1
 
+HWCAP2_MOPS
+    Functionality implied by ID_AA64ISAR2_EL1.MOPS == 0b0001.
+
 4. Unused AT_HWCAP bits
 -----------------------
 
similarity index 96%
rename from Documentation/arm64/index.rst
rename to Documentation/arch/arm64/index.rst
index ae21f81..d08e924 100644 (file)
@@ -15,11 +15,13 @@ ARM64 Architecture
     cpu-feature-registers
     elf_hwcaps
     hugetlbpage
+    kdump
     legacy_instructions
     memory
     memory-tagging-extension
     perf
     pointer-authentication
+    ptdump
     silicon-errata
     sme
     sve
diff --git a/Documentation/arch/arm64/kdump.rst b/Documentation/arch/arm64/kdump.rst
new file mode 100644 (file)
index 0000000..56a89f4
--- /dev/null
@@ -0,0 +1,92 @@
+=======================================
+crashkernel memory reservation on arm64
+=======================================
+
+Author: Baoquan He <bhe@redhat.com>
+
+The kdump mechanism is used to capture a corrupted kernel's vmcore so
+that it can be subsequently analyzed. In order to do this, memory must
+be reserved in advance so that the kdump kernel can be pre-loaded and
+booted if the first kernel becomes corrupted.
+
+That reserved memory for kdump is sized to minimally accommodate the
+kdump kernel and the user space programs needed for the vmcore
+collection.
+
+Kernel parameter
+================
+
+Through the kernel parameters below, memory can be reserved accordingly
+during the early stage of the first kernel booting so that a continuous
+large chunk of memory can be found. The low memory reservation needs to
+be considered if the crashkernel is reserved from the high memory area.
+
+- crashkernel=size@offset
+- crashkernel=size
+- crashkernel=size,high crashkernel=size,low
+
+Low memory and high memory
+==========================
+
+For kdump reservations, low memory is the memory area under a specific
+limit, usually decided by the accessible address bits of the DMA-capable
+devices needed by the kdump kernel to run. Those devices not related to
+vmcore dumping can be ignored. On arm64, the low memory upper bound is
+not fixed: it is 1G on the RPi4 platform but 4G on most other systems.
+On special kernels built with CONFIG_ZONE_(DMA|DMA32) disabled, the
+whole system RAM is low memory. Outside of the low memory described
+above, the rest of system RAM is considered high memory.
+
+Implementation
+==============
+
+1) crashkernel=size@offset
+--------------------------
+
+The crashkernel memory must be reserved at the user-specified region or
+fail if already occupied.
+
+
+2) crashkernel=size
+-------------------
+
+The crashkernel memory region will be reserved in any available position
+according to the search order:
+
+Firstly, the kernel searches the low memory area for an available region
+with the specified size.
+
+If searching for low memory fails, the kernel falls back to searching
+the high memory area for an available region of the specified size. If
+the reservation in high memory succeeds, a default size reservation in
+the low memory will be done. Currently the default size is 128M,
+sufficient for the low memory needs of the kdump kernel.
+
+Note: crashkernel=size is the recommended option for crashkernel memory
+reservations. The user does not need to know the system memory layout
+for a specific platform.
+
+3) crashkernel=size,high crashkernel=size,low
+---------------------------------------------
+
+crashkernel=size,(high|low) are an important supplement to
+crashkernel=size. They allow the user to specify how much memory needs
+to be allocated from the high memory and low memory respectively. On
+many systems the low memory is precious and crashkernel reservations
+from this area should be kept to a minimum.
+
+To reserve memory for crashkernel=size,high, searching is first
+attempted from the high memory region. If the reservation succeeds, the
+low memory reservation will be done subsequently.
+
+If reservation from the high memory fails, the kernel falls back to
+searching the low memory for the size specified in crashkernel=,high.
+If it succeeds, no further reservation for low memory is needed.
+
+Notes:
+
+- If crashkernel=,low is not specified, the default low memory
+  reservation will be done automatically.
+
+- if crashkernel=0,low is specified, it means that the low memory
+  reservation is omitted intentionally.
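
As purely illustrative command lines (the sizes and the offset below are
arbitrary examples, not recommendations)::

    crashkernel=256M
    crashkernel=512M@0x80000000
    crashkernel=2G,high crashkernel=256M,low
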
@@ -221,7 +221,7 @@ programs should not retry in case of a non-zero system call return.
 ``NT_ARM_TAGGED_ADDR_CTRL`` allow ``ptrace()`` access to the tagged
 address ABI control and MTE configuration of a process as per the
 ``prctl()`` options described in
-Documentation/arm64/tagged-address-abi.rst and above. The corresponding
+Documentation/arch/arm64/tagged-address-abi.rst and above. The corresponding
 ``regset`` is 1 element of 8 bytes (``sizeof(long))``).
 
 Core dump support
similarity index 97%
rename from Documentation/arm64/memory.rst
rename to Documentation/arch/arm64/memory.rst
index 2a641ba..55a55f3 100644 (file)
@@ -33,8 +33,8 @@ AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit)::
   0000000000000000     0000ffffffffffff         256TB          user
   ffff000000000000     ffff7fffffffffff         128TB          kernel logical memory map
  [ffff600000000000     ffff7fffffffffff]         32TB          [kasan shadow region]
-  ffff800000000000     ffff800007ffffff         128MB          modules
-  ffff800008000000     fffffbffefffffff         124TB          vmalloc
+  ffff800000000000     ffff80007fffffff           2GB          modules
+  ffff800080000000     fffffbffefffffff         124TB          vmalloc
   fffffbfff0000000     fffffbfffdffffff         224MB          fixed mappings (top down)
   fffffbfffe000000     fffffbfffe7fffff           8MB          [guard region]
   fffffbfffe800000     fffffbffff7fffff          16MB          PCI I/O space
@@ -50,8 +50,8 @@ AArch64 Linux memory layout with 64KB pages + 3 levels (52-bit with HW support):
   0000000000000000     000fffffffffffff           4PB          user
   fff0000000000000     ffff7fffffffffff          ~4PB          kernel logical memory map
  [fffd800000000000     ffff7fffffffffff]        512TB          [kasan shadow region]
-  ffff800000000000     ffff800007ffffff         128MB          modules
-  ffff800008000000     fffffbffefffffff         124TB          vmalloc
+  ffff800000000000     ffff80007fffffff           2GB          modules
+  ffff800080000000     fffffbffefffffff         124TB          vmalloc
   fffffbfff0000000     fffffbfffdffffff         224MB          fixed mappings (top down)
   fffffbfffe000000     fffffbfffe7fffff           8MB          [guard region]
   fffffbfffe800000     fffffbffff7fffff          16MB          PCI I/O space
diff --git a/Documentation/arch/arm64/ptdump.rst b/Documentation/arch/arm64/ptdump.rst
new file mode 100644 (file)
index 0000000..5dcfc5d
--- /dev/null
@@ -0,0 +1,96 @@
+======================
+Kernel page table dump
+======================
+
+ptdump is a debugfs interface that provides a detailed dump of the
+kernel page tables. It offers a comprehensive overview of the kernel
+virtual memory layout as well as the attributes associated with the
+various regions in a human-readable format. It is useful to dump the
+kernel page tables to verify permissions and memory types. Examining the
+page table entries and permissions helps identify potential security
+vulnerabilities such as mappings with overly permissive access rights or
+improper memory protections.
+
+Memory hotplug allows dynamic expansion or contraction of available
+memory without requiring a system reboot. To maintain the consistency
+and integrity of the memory management data structures, arm64 makes use
+of the ``mem_hotplug_lock`` semaphore in write mode. Additionally, in
+read mode, ``mem_hotplug_lock`` supports an efficient implementation of
+``get_online_mems()`` and ``put_online_mems()``. These protect the
+offlining of memory being accessed by the ptdump code.
+
+In order to dump the kernel page tables, enable the following
+configurations and mount debugfs::
+
+ CONFIG_GENERIC_PTDUMP=y
+ CONFIG_PTDUMP_CORE=y
+ CONFIG_PTDUMP_DEBUGFS=y
+
+ mount -t debugfs nodev /sys/kernel/debug
+ cat /sys/kernel/debug/kernel_page_tables
+
+On analysing the output of ``cat /sys/kernel/debug/kernel_page_tables``
+one can derive information about the virtual address range of the entry,
+followed by the size of the memory region covered by this entry, the
+hierarchical structure of the page tables and finally the attributes
+associated with each page. The page attributes provide information about
+access permissions, execution capability, type of mapping such as leaf
+level PTE or block level PGD, PMD and PUD, and access status of a page
+within the kernel memory. Assessing these attributes can assist in
+understanding the memory layout, access patterns and security
+characteristics of the kernel pages.
+
+Kernel virtual memory layout example::
+
+ start address        end address         size             attributes
+ +---------------------------------------------------------------------------------------+
+ | ---[ Linear Mapping start ]---------------------------------------------------------- |
+ | ..................                                                                    |
+ | 0xfff0000000000000-0xfff0000000210000  2112K PTE RW NX SHD AF  UXN  MEM/NORMAL-TAGGED |
+ | 0xfff0000000210000-0xfff0000001c00000 26560K PTE ro NX SHD AF  UXN  MEM/NORMAL        |
+ | ..................                                                                    |
+ | ---[ Linear Mapping end ]------------------------------------------------------------ |
+ +---------------------------------------------------------------------------------------+
+ | ---[ Modules start ]----------------------------------------------------------------- |
+ | ..................                                                                    |
+ | 0xffff800000000000-0xffff800008000000   128M PTE                                      |
+ | ..................                                                                    |
+ | ---[ Modules end ]------------------------------------------------------------------- |
+ +---------------------------------------------------------------------------------------+
+ | ---[ vmalloc() area ]---------------------------------------------------------------- |
+ | ..................                                                                    |
+ | 0xffff800008010000-0xffff800008200000  1984K PTE ro x  SHD AF       UXN  MEM/NORMAL   |
+ | 0xffff800008200000-0xffff800008e00000    12M PTE ro x  SHD AF  CON  UXN  MEM/NORMAL   |
+ | ..................                                                                    |
+ | ---[ vmalloc() end ]----------------------------------------------------------------- |
+ +---------------------------------------------------------------------------------------+
+ | ---[ Fixmap start ]------------------------------------------------------------------ |
+ | ..................                                                                    |
+ | 0xfffffbfffdb80000-0xfffffbfffdb90000    64K PTE ro x  SHD AF  UXN  MEM/NORMAL        |
+ | 0xfffffbfffdb90000-0xfffffbfffdba0000    64K PTE ro NX SHD AF  UXN  MEM/NORMAL        |
+ | ..................                                                                    |
+ | ---[ Fixmap end ]-------------------------------------------------------------------- |
+ +---------------------------------------------------------------------------------------+
+ | ---[ PCI I/O start ]----------------------------------------------------------------- |
+ | ..................                                                                    |
+ | 0xfffffbfffe800000-0xfffffbffff800000    16M PTE                                      |
+ | ..................                                                                    |
+ | ---[ PCI I/O end ]------------------------------------------------------------------- |
+ +---------------------------------------------------------------------------------------+
+ | ---[ vmemmap start ]----------------------------------------------------------------- |
+ | ..................                                                                    |
+ | 0xfffffc0002000000-0xfffffc0002200000     2M PTE RW NX SHD AF  UXN  MEM/NORMAL        |
+ | 0xfffffc0002200000-0xfffffc0020000000   478M PTE                                      |
+ | ..................                                                                    |
+ | ---[ vmemmap end ]------------------------------------------------------------------- |
+ +---------------------------------------------------------------------------------------+
+
+``cat /sys/kernel/debug/kernel_page_tables`` output::
+
+ 0xfff0000001c00000-0xfff0000080000000     2020M PTE  RW NX SHD AF   UXN    MEM/NORMAL-TAGGED
+ 0xfff0000080000000-0xfff0000800000000       30G PMD
+ 0xfff0000800000000-0xfff0000800700000        7M PTE  RW NX SHD AF   UXN    MEM/NORMAL-TAGGED
+ 0xfff0000800700000-0xfff0000800710000       64K PTE  ro NX SHD AF   UXN    MEM/NORMAL-TAGGED
+ 0xfff0000800710000-0xfff0000880000000  2089920K PTE  RW NX SHD AF   UXN    MEM/NORMAL-TAGGED
+ 0xfff0000880000000-0xfff0040000000000     4062G PMD
+ 0xfff0040000000000-0xffff800000000000     3964T PGD
similarity index 98%
rename from Documentation/arm64/silicon-errata.rst
rename to Documentation/arch/arm64/silicon-errata.rst
index 9e311bc..d6430ad 100644 (file)
@@ -214,3 +214,7 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | Fujitsu        | A64FX           | E#010001        | FUJITSU_ERRATUM_010001      |
 +----------------+-----------------+-----------------+-----------------------------+
+
++----------------+-----------------+-----------------+-----------------------------+
+| ASR            | ASR8601         | #8601001        | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
similarity index 99%
rename from Documentation/arm64/sme.rst
rename to Documentation/arch/arm64/sme.rst
index 1c43ea1..ba529a1 100644 (file)
@@ -465,4 +465,4 @@ References
 [2] arch/arm64/include/uapi/asm/ptrace.h
     AArch64 Linux ptrace ABI definitions
 
-[3] Documentation/arm64/cpu-feature-registers.rst
+[3] Documentation/arch/arm64/cpu-feature-registers.rst
similarity index 99%
rename from Documentation/arm64/sve.rst
rename to Documentation/arch/arm64/sve.rst
index 1b90a30..0d9a426 100644 (file)
@@ -606,7 +606,7 @@ References
 [2] arch/arm64/include/uapi/asm/ptrace.h
     AArch64 Linux ptrace ABI definitions
 
-[3] Documentation/arm64/cpu-feature-registers.rst
+[3] Documentation/arch/arm64/cpu-feature-registers.rst
 
 [4] ARM IHI0055C
     http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055c/IHI0055C_beta_aapcs64.pdf
similarity index 99%
rename from Documentation/arm64/tagged-address-abi.rst
rename to Documentation/arch/arm64/tagged-address-abi.rst
index 540a1d4..fe24a3f 100644 (file)
@@ -107,7 +107,7 @@ following behaviours are guaranteed:
 
 
 A definition of the meaning of tagged pointers on AArch64 can be found
-in Documentation/arm64/tagged-pointers.rst.
+in Documentation/arch/arm64/tagged-pointers.rst.
 
 3. AArch64 Tagged Address ABI Exceptions
 -----------------------------------------
similarity index 98%
rename from Documentation/arm64/tagged-pointers.rst
rename to Documentation/arch/arm64/tagged-pointers.rst
index 19d284b..81b6c2a 100644 (file)
@@ -22,7 +22,7 @@ Passing tagged addresses to the kernel
 All interpretation of userspace memory addresses by the kernel assumes
 an address tag of 0x00, unless the application enables the AArch64
 Tagged Address ABI explicitly
-(Documentation/arm64/tagged-address-abi.rst).
+(Documentation/arch/arm64/tagged-address-abi.rst).
 
 This includes, but is not limited to, addresses found in:
 
index 80ee310..8458b88 100644 (file)
@@ -10,8 +10,8 @@ implementation.
    :maxdepth: 2
 
    arc/index
-   ../arm/index
-   ../arm64/index
+   arm/index
+   arm64/index
    ia64/index
    ../loongarch/index
    m68k/index
index 387ccbc..cb05d90 100644 (file)
@@ -287,6 +287,13 @@ Removing a directory will move all tasks and cpus owned by the group it
 represents to the parent. Removing one of the created CTRL_MON groups
 will automatically remove all MON groups below it.
 
+Moving MON group directories to a new parent CTRL_MON group is supported
+for the purpose of changing the resource allocations of a MON group
+without impacting its monitoring data or assigned tasks. This operation
+is not allowed for MON groups which monitor CPUs. No other move
+operation is currently allowed other than simply renaming a CTRL_MON or
+MON group.
+
 All groups contain the following files:
 
 "tasks":
index 37314af..d4fdf6a 100644 (file)
@@ -74,6 +74,7 @@ if major >= 3:
             "__percpu",
             "__rcu",
             "__user",
+            "__force",
 
             # include/linux/compiler_attributes.h:
             "__alias",
index f75778d..e6f5bc3 100644 (file)
@@ -127,17 +127,8 @@ bring CPU4 back online::
  $ echo 1 > /sys/devices/system/cpu/cpu4/online
  smpboot: Booting Node 0 Processor 4 APIC 0x1
 
-The CPU is usable again. This should work on all CPUs. CPU0 is often special
-and excluded from CPU hotplug. On X86 the kernel option
-*CONFIG_BOOTPARAM_HOTPLUG_CPU0* has to be enabled in order to be able to
-shutdown CPU0. Alternatively the kernel command option *cpu0_hotplug* can be
-used. Some known dependencies of CPU0:
-
-* Resume from hibernate/suspend. Hibernate/suspend will fail if CPU0 is offline.
-* PIC interrupts. CPU0 can't be removed if a PIC interrupt is detected.
-
-Please let Fenghua Yu <fenghua.yu@intel.com> know if you find any dependencies
-on CPU0.
+The CPU is usable again. This should work on all CPUs, but CPU0 is often special
+and excluded from CPU hotplug.
 
 The CPU hotplug coordination
 ============================
index 9b3f3e5..f2bcc5a 100644 (file)
@@ -96,6 +96,12 @@ Command-line Parsing
 .. kernel-doc:: lib/cmdline.c
    :export:
 
+Error Pointers
+--------------
+
+.. kernel-doc:: include/linux/err.h
+   :internal:
+
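As a quick orientation for the helpers documented from ``include/linux/err.h``, here is a minimal sketch of the usual pattern (``struct my_ctx`` and ``example_open()`` are illustrative names, not kernel APIs):

.. code-block:: c

        #include <linux/err.h>
        #include <linux/slab.h>

        struct my_ctx {
                int value;
        };

        /* Returns a valid pointer on success or an errno encoded via ERR_PTR(). */
        static struct my_ctx *example_open(void)
        {
                struct my_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

                if (!ctx)
                        return ERR_PTR(-ENOMEM);
                return ctx;
        }

        static int example_user(void)
        {
                struct my_ctx *ctx = example_open();

                if (IS_ERR(ctx))
                        return PTR_ERR(ctx);   /* decode back to -ENOMEM etc. */

                /* ... use ctx ... */
                kfree(ctx);
                return 0;
        }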
 Sorting
 -------
 
@@ -412,3 +418,15 @@ Read-Copy Update (RCU)
 .. kernel-doc:: include/linux/rcu_sync.h
 
 .. kernel-doc:: kernel/rcu/sync.c
+
+.. kernel-doc:: kernel/rcu/tasks.h
+
+.. kernel-doc:: kernel/rcu/tree_stall.h
+
+.. kernel-doc:: include/linux/rcupdate_trace.h
+
+.. kernel-doc:: include/linux/rcupdate_wait.h
+
+.. kernel-doc:: include/linux/rcuref.h
+
+.. kernel-doc:: include/linux/rcutree.h
index 9fb0b10..d3c1f6d 100644 (file)
@@ -112,6 +112,12 @@ pages:
 This also leads to limitations: there are only 31-10==21 bits available for a
 counter that increments 10 bits at a time.
 
+* Because of that limitation, special handling is applied to the zero pages
+  when using FOLL_PIN.  We only pretend to pin a zero page - we don't alter its
+  refcount or pincount at all (it is permanent, so there's no need).  The
+  unpinning functions also don't do anything to a zero page.  This is
+  transparent to the caller.
+
 * Callers must specifically request "dma-pinned tracking of pages". In other
   words, just calling get_user_pages() will not suffice; a new set of functions,
   pin_user_page() and related, must be used.
index 5cb8b88..91acbcf 100644 (file)
@@ -53,7 +53,6 @@ preemption and interrupts::
        this_cpu_add_return(pcp, val)
        this_cpu_xchg(pcp, nval)
        this_cpu_cmpxchg(pcp, oval, nval)
-       this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
        this_cpu_sub(pcp, val)
        this_cpu_inc(pcp)
        this_cpu_dec(pcp)
@@ -242,7 +241,6 @@ safe::
        __this_cpu_add_return(pcp, val)
        __this_cpu_xchg(pcp, nval)
        __this_cpu_cmpxchg(pcp, oval, nval)
-       __this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
        __this_cpu_sub(pcp, val)
        __this_cpu_inc(pcp)
        __this_cpu_dec(pcp)
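To make the distinction above concrete, a minimal sketch follows (the per-CPU variable and function names are illustrative): the ``this_cpu_*()`` operations are safe against preemption and interrupts on their own, while the ``__this_cpu_*()`` variants rely on the caller having already disabled preemption.

.. code-block:: c

        #include <linux/percpu.h>
        #include <linux/preempt.h>

        static DEFINE_PER_CPU(unsigned long, hit_count);

        static void count_hit(void)
        {
                /* Safe against preemption and interrupts by itself. */
                this_cpu_inc(hit_count);
        }

        static void count_hit_cheaper(void)
        {
                /* __this_cpu_*() requires the caller to keep preemption off. */
                preempt_disable();
                __this_cpu_inc(hit_count);
                preempt_enable();
        }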
index 8ec4d62..a4c9b9d 100644 (file)
@@ -348,6 +348,37 @@ Guidelines
   level of locality in wq operations and work item execution.
 
 
+Monitoring
+==========
+
+Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::
+
+  $ tools/workqueue/wq_monitor.py events
+                              total  infl  CPUtime  CPUhog  CMwake  mayday rescued
+  events                      18545     0      6.1       0       5       -       -
+  events_highpri                  8     0      0.0       0       0       -       -
+  events_long                     3     0      0.0       0       0       -       -
+  events_unbound              38306     0      0.1       -       -       -       -
+  events_freezable                0     0      0.0       0       0       -       -
+  events_power_efficient      29598     0      0.2       0       0       -       -
+  events_freezable_power_        10     0      0.0       0       0       -       -
+  sock_diag_events                0     0      0.0       0       0       -       -
+
+                              total  infl  CPUtime  CPUhog  CMwake  mayday rescued
+  events                      18548     0      6.1       0       5       -       -
+  events_highpri                  8     0      0.0       0       0       -       -
+  events_long                     3     0      0.0       0       0       -       -
+  events_unbound              38322     0      0.1       -       -       -       -
+  events_freezable                0     0      0.0       0       0       -       -
+  events_power_efficient      29603     0      0.2       0       0       -       -
+  events_freezable_power_        10     0      0.0       0       0       -       -
+  sock_diag_events                0     0      0.0       0       0       -       -
+
+  ...
+
+See the command's help message for more info.
+
+
 Debugging
 =========
 
@@ -387,6 +418,7 @@ the stack trace of the offending worker thread. ::
 The work item's function should be trivially visible in the stack
 trace.
 
+
 Non-reentrance Conditions
 =========================
 
index bfc7739..27c146b 100644 (file)
@@ -66,7 +66,7 @@ features surfaced as a result:
 ::
 
   struct dma_async_tx_descriptor *
-  async_<operation>(<op specific parameters>, struct async_submit ctl *submit)
+  async_<operation>(<op specific parameters>, struct async_submit_ctl *submit)
 
 3.2 Supported operations
 ------------------------
index e66916a..f4acf9c 100644 (file)
@@ -107,9 +107,12 @@ effectively disables ``panic_on_warn`` for KASAN reports.
 Alternatively, independent of ``panic_on_warn``, the ``kasan.fault=`` boot
 parameter can be used to control panic and reporting behaviour:
 
-- ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN
-  report or also panic the kernel (default: ``report``). The panic happens even
-  if ``kasan_multi_shot`` is enabled.
+- ``kasan.fault=report``, ``=panic``, or ``=panic_on_write`` controls whether
+  to only print a KASAN report, panic the kernel, or panic the kernel on
+  invalid writes only (default: ``report``). The panic happens even if
+  ``kasan_multi_shot`` is enabled. Note that when using asynchronous mode of
+  Hardware Tag-Based KASAN, ``kasan.fault=panic_on_write`` always panics on
+  asynchronously checked accesses (including reads).
 
 Software and Hardware Tag-Based KASAN modes (see the section about various
 modes below) support altering stack trace collection behavior:
index 12b575b..deede97 100644 (file)
@@ -36,6 +36,7 @@ Running the selftests (hotplug tests are run in limited mode)
 
 To build the tests::
 
+  $ make headers
   $ make -C tools/testing/selftests
 
 To run the tests::
@@ -168,6 +169,28 @@ the `-t` option for specific single tests. Either can be used multiple times::
 
 For other features see the script usage output, seen with the `-h` option.
 
+Timeout for selftests
+=====================
+
+Selftests are designed to be quick, so a default timeout of 45 seconds is
+used for each test. A test can override the default by adding a settings
+file in its directory and setting a ``timeout`` variable there to the
+desired upper bound in seconds. Only a few tests override the timeout with
+a value higher than 45 seconds, and selftests strives to keep it that way.
+Timeouts in selftests are not considered fatal because the system on which
+a test runs may change, and this can also change the expected time it takes
+to run a test. If you have control over the systems which will run the
+tests, you can configure a test runner on those systems to use a greater or
+lower timeout on the command line with the `-o` or `--override-timeout`
+argument. For example, to use 165 seconds instead one would use::
+
+   $ ./run_kselftest.sh --override-timeout 165
+
+You can look at the TAP output to see if you ran into the timeout. Test
+runners which know a test must run within a specific time can then
+optionally treat these timeouts as fatal.
+
 Packaging selftests
 ===================
 
index e95ab05..f335f88 100644 (file)
@@ -119,9 +119,9 @@ All expectations/assertions are formatted as:
          terminated immediately.
 
                - Assertions call the function:
-                 ``void __noreturn kunit_abort(struct kunit *)``.
+                 ``void __noreturn __kunit_abort(struct kunit *)``.
 
-               - ``kunit_abort`` calls the function:
+               - ``__kunit_abort`` calls the function:
                  ``void __noreturn kunit_try_catch_throw(struct kunit_try_catch *try_catch)``.
 
                - ``kunit_try_catch_throw`` calls the function:
index c736613..a982353 100644 (file)
@@ -250,15 +250,20 @@ Now we are ready to write the test cases.
        };
        kunit_test_suite(misc_example_test_suite);
 
+       MODULE_LICENSE("GPL");
+
 2. Add the following lines to ``drivers/misc/Kconfig``:
 
 .. code-block:: kconfig
 
        config MISC_EXAMPLE_TEST
                tristate "Test for my example" if !KUNIT_ALL_TESTS
-               depends on MISC_EXAMPLE && KUNIT=y
+               depends on MISC_EXAMPLE && KUNIT
                default KUNIT_ALL_TESTS
 
+Note: If your test does not support being built as a loadable module (which is
+discouraged), replace ``tristate`` with ``bool``, and depend on ``KUNIT=y``
+instead of ``KUNIT``.
+
 3. Add the following lines to ``drivers/misc/Makefile``:
 
 .. code-block:: make
index 9faf2b4..c27e164 100644 (file)
@@ -121,6 +121,12 @@ there's an allocation error.
    ``return`` so they only work from the test function. In KUnit, we stop the
    current kthread on failure, so you can call them from anywhere.
 
+.. note::
+   Warning: There is an exception to the above rule. You shouldn't use assertions
+   in the suite's exit() function, or in the free function for a resource. These
+   run when a test is shutting down, and an assertion here prevents further
+   cleanup code from running, potentially leading to a memory leak.
+
 Customizing error messages
 --------------------------
 
@@ -160,7 +166,12 @@ many similar tests. In order to reduce duplication in these closely related
 tests, most unit testing frameworks (including KUnit) provide the concept of a
 *test suite*. A test suite is a collection of test cases for a unit of code
 with optional setup and teardown functions that run before/after the whole
-suite and/or every test case. For example:
+suite and/or every test case.
+
+.. note::
+   A test case will only run if it is associated with a test suite.
+
+For example:
 
 .. code-block:: c
 
@@ -190,7 +201,10 @@ after everything else. ``kunit_test_suite(example_test_suite)`` registers the
 test suite with the KUnit test framework.
 
 .. note::
-   A test case will only run if it is associated with a test suite.
+   The ``exit`` and ``suite_exit`` functions will run even if ``init`` or
+   ``suite_init`` fail. Make sure that they can handle any inconsistent
+   state which may result from ``init`` or ``suite_init`` encountering errors
+   or exiting early.
 
 ``kunit_test_suite(...)`` is a macro which tells the linker to put the
 specified test suite in a special linker section so that it can be run by KUnit
@@ -601,6 +615,57 @@ For example:
                KUNIT_ASSERT_STREQ(test, buffer, "");
        }
 
+Registering Cleanup Actions
+---------------------------
+
+If you need to perform some cleanup beyond simple use of ``kunit_kzalloc``,
+you can register a custom "deferred action", which is a cleanup function
+run when the test exits (whether cleanly, or via a failed assertion).
+
+Actions are simple functions with no return value, and a single ``void*``
+context argument, and fulfill the same role as "cleanup" functions in Python
+and Go tests, "defer" statements in languages which support them, and
+(in some cases) destructors in RAII languages.
+
+These are very useful for unregistering things from global lists, closing
+files or other resources, or freeing resources.
+
+For example:
+
+.. code-block:: C
+
+       static void cleanup_device(void *ctx)
+       {
+               struct device *dev = (struct device *)ctx;
+
+               device_unregister(dev);
+       }
+
+       void example_device_test(struct kunit *test)
+       {
+               struct my_device dev;
+
+               device_register(&dev);
+
+               kunit_add_action(test, &cleanup_device, &dev);
+       }
+
+Note that, for functions like device_unregister which only accept a single
+pointer-sized argument, it's possible to directly cast that function to
+a ``kunit_action_t`` rather than writing a wrapper function, for example:
+
+.. code-block:: C
+
+       kunit_add_action(test, (kunit_action_t *)&device_unregister, &dev);
+
+``kunit_add_action`` can fail if, for example, the system is out of memory.
+You can use ``kunit_add_action_or_reset`` instead, which runs the action
+immediately if it cannot be deferred.
+
+If you need more control over when the cleanup function is called, you
+can trigger it early using ``kunit_release_action``, or cancel it entirely
+with ``kunit_remove_action``.
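A minimal sketch tying these together (the buffer and function names are illustrative; it assumes the deferred-action helpers behave as documented above):

.. code-block:: c

        #include <kunit/test.h>
        #include <kunit/resource.h>
        #include <linux/slab.h>

        static void free_buffer(void *ctx)
        {
                kfree(ctx);
        }

        static void example_action_test(struct kunit *test)
        {
                char *buf = kmalloc(16, GFP_KERNEL);

                KUNIT_ASSERT_NOT_NULL(test, buf);

                /* If the action cannot be deferred, free_buffer() runs right away. */
                KUNIT_ASSERT_EQ(test, 0,
                                kunit_add_action_or_reset(test, free_buffer, buf));

                /* ... exercise buf ... */

                /* Trigger the cleanup early instead of waiting for test exit. */
                kunit_release_action(test, free_buffer, buf);
        }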
+
 
 Testing Static Functions
 ------------------------
index 61d77ac..f925290 100644 (file)
@@ -56,7 +56,7 @@ hypervisor {
 };
 
 The format and meaning of the "xen,uefi-*" parameters are similar to those in
-Documentation/arm/uefi.rst, which are provided by the regular UEFI stub. However
+Documentation/arch/arm/uefi.rst, which are provided by the regular UEFI stub. However
 they differ because they are provided by the Xen hypervisor, together with a set
 of UEFI runtime services implemented via hypercalls, see
 http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,platform.h.html.
index b8cc826..b3a5356 100644 (file)
@@ -259,7 +259,7 @@ description: |+
       http://infocenter.arm.com/help/index.jsp
 
   [5] ARM Linux Kernel documentation - Booting AArch64 Linux
-      Documentation/arm64/booting.rst
+      Documentation/arch/arm64/booting.rst
 
   [6] RISC-V Linux Kernel documentation - CPUs bindings
       Documentation/devicetree/bindings/riscv/cpus.yaml
index 367d04a..83381f3 100644 (file)
@@ -71,6 +71,8 @@ properties:
     minItems: 1
     maxItems: 3
 
+  dma-coherent: true
+
   interconnects:
     maxItems: 1
 
index 85d9efb..d9ef867 100644 (file)
@@ -60,6 +60,7 @@ properties:
     default: 0
 
   regstep:
+    $ref: /schemas/types.yaml#/definitions/uint32
     description: |
       deprecated, use reg-shift above
     deprecated: true
diff --git a/Documentation/devicetree/bindings/interrupt-controller/loongson,eiointc.yaml b/Documentation/devicetree/bindings/interrupt-controller/loongson,eiointc.yaml
new file mode 100644 (file)
index 0000000..393c128
--- /dev/null
@@ -0,0 +1,59 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/interrupt-controller/loongson,eiointc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Loongson Extended I/O Interrupt Controller
+
+maintainers:
+  - Binbin Zhou <zhoubinbin@loongson.cn>
+
+description: |
+  This interrupt controller is found on the Loongson-3 family chips and
+  Loongson-2K series chips and is used to distribute interrupts directly to
+  individual cores without forwarding them through the HT's interrupt line.
+
+allOf:
+  - $ref: /schemas/interrupt-controller.yaml#
+
+properties:
+  compatible:
+    enum:
+      - loongson,ls2k0500-eiointc
+      - loongson,ls2k2000-eiointc
+
+  reg:
+    maxItems: 1
+
+  interrupts:
+    maxItems: 1
+
+  interrupt-controller: true
+
+  '#interrupt-cells':
+    const: 1
+
+required:
+  - compatible
+  - reg
+  - interrupts
+  - interrupt-controller
+  - '#interrupt-cells'
+
+unevaluatedProperties: false
+
+examples:
+  - |
+    eiointc: interrupt-controller@1fe11600 {
+      compatible = "loongson,ls2k0500-eiointc";
+      reg = <0x1fe10000 0x10000>;
+
+      interrupt-controller;
+      #interrupt-cells = <1>;
+
+      interrupt-parent = <&cpuintc>;
+      interrupts = <3>;
+    };
+
+...
diff --git a/Documentation/devicetree/bindings/memory-controllers/nuvoton,npcm-memory-controller.yaml b/Documentation/devicetree/bindings/memory-controllers/nuvoton,npcm-memory-controller.yaml
new file mode 100644 (file)
index 0000000..ac1a5a1
--- /dev/null
@@ -0,0 +1,50 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/memory-controllers/nuvoton,npcm-memory-controller.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Nuvoton NPCM Memory Controller
+
+maintainers:
+  - Marvin Lin <kflin@nuvoton.com>
+  - Stanley Chu <yschu@nuvoton.com>
+
+description: |
+  The Nuvoton BMC SoC supports DDR4 memory with or without ECC (error-correcting
+  code).
+
+  The memory controller supports single-bit error correction and double-bit error
+  detection (in-line ECC, in which a section (1/8th) of the memory device is
+  reserved for ECC storage).
+
+  Note, the bootloader must configure ECC mode for the memory controller.
+
+properties:
+  compatible:
+    enum:
+      - nuvoton,npcm750-memory-controller
+      - nuvoton,npcm845-memory-controller
+
+  reg:
+    maxItems: 1
+
+  interrupts:
+    maxItems: 1
+
+required:
+  - compatible
+  - reg
+  - interrupts
+
+additionalProperties: false
+
+examples:
+  - |
+    #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+    mc: memory-controller@f0824000 {
+        compatible = "nuvoton,npcm750-memory-controller";
+        reg = <0xf0824000 0x1000>;
+        interrupts = <GIC_SPI 25 IRQ_TYPE_LEVEL_HIGH>;
+    };
diff --git a/Documentation/devicetree/bindings/mfd/rockchip,rk806.yaml b/Documentation/devicetree/bindings/mfd/rockchip,rk806.yaml
new file mode 100644 (file)
index 0000000..cf2500f
--- /dev/null
@@ -0,0 +1,406 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/mfd/rockchip,rk806.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: RK806 Power Management Integrated Circuit
+
+maintainers:
+  - Sebastian Reichel <sebastian.reichel@collabora.com>
+
+description:
+  Rockchip RK806 series PMIC. This device consists of an SPI- or
+  I2C-controlled MFD that includes multiple switchable regulators.
+
+properties:
+  compatible:
+    enum:
+      - rockchip,rk806
+
+  reg:
+    maxItems: 1
+
+  interrupts:
+    maxItems: 1
+
+  gpio-controller: true
+
+  '#gpio-cells':
+    const: 2
+
+  vcc1-supply:
+    description:
+      The input supply for dcdc-reg1.
+
+  vcc2-supply:
+    description:
+      The input supply for dcdc-reg2.
+
+  vcc3-supply:
+    description:
+      The input supply for dcdc-reg3.
+
+  vcc4-supply:
+    description:
+      The input supply for dcdc-reg4.
+
+  vcc5-supply:
+    description:
+      The input supply for dcdc-reg5.
+
+  vcc6-supply:
+    description:
+      The input supply for dcdc-reg6.
+
+  vcc7-supply:
+    description:
+      The input supply for dcdc-reg7.
+
+  vcc8-supply:
+    description:
+      The input supply for dcdc-reg8.
+
+  vcc9-supply:
+    description:
+      The input supply for dcdc-reg9.
+
+  vcc10-supply:
+    description:
+      The input supply for dcdc-reg10.
+
+  vcc11-supply:
+    description:
+      The input supply for pldo-reg1, pldo-reg2 and pldo-reg3.
+
+  vcc12-supply:
+    description:
+      The input supply for pldo-reg4 and pldo-reg5.
+
+  vcc13-supply:
+    description:
+      The input supply for nldo-reg1, nldo-reg2 and nldo-reg3.
+
+  vcc14-supply:
+    description:
+      The input supply for nldo-reg4 and nldo-reg5.
+
+  vcca-supply:
+    description:
+      The input supply for pldo-reg6.
+
+  regulators:
+    type: object
+    additionalProperties: false
+    patternProperties:
+      "^(dcdc-reg([1-9]|10)|pldo-reg[1-6]|nldo-reg[1-5])$":
+        type: object
+        $ref: /schemas/regulator/regulator.yaml#
+        unevaluatedProperties: false
+
+patternProperties:
+  '-pins$':
+    type: object
+    additionalProperties: false
+    $ref: /schemas/pinctrl/pinmux-node.yaml
+
+    properties:
+      function:
+        enum: [pin_fun0, pin_fun1, pin_fun2, pin_fun3, pin_fun4, pin_fun5]
+
+      pins:
+        $ref: /schemas/types.yaml#/definitions/string
+        enum: [gpio_pwrctrl1, gpio_pwrctrl2, gpio_pwrctrl3]
+
+allOf:
+  - $ref: /schemas/spi/spi-peripheral-props.yaml
+
+required:
+  - compatible
+  - reg
+  - interrupts
+
+unevaluatedProperties: false
+
+examples:
+  - |
+    #include <dt-bindings/pinctrl/rockchip.h>
+    #include <dt-bindings/interrupt-controller/irq.h>
+    #include <dt-bindings/gpio/gpio.h>
+    spi {
+        #address-cells = <1>;
+        #size-cells = <0>;
+
+        pmic@0 {
+            compatible = "rockchip,rk806";
+            reg = <0x0>;
+
+            interrupts = <7 IRQ_TYPE_LEVEL_LOW>;
+
+            vcc1-supply = <&vcc5v0_sys>;
+            vcc2-supply = <&vcc5v0_sys>;
+            vcc3-supply = <&vcc5v0_sys>;
+            vcc4-supply = <&vcc5v0_sys>;
+            vcc5-supply = <&vcc5v0_sys>;
+            vcc6-supply = <&vcc5v0_sys>;
+            vcc7-supply = <&vcc5v0_sys>;
+            vcc8-supply = <&vcc5v0_sys>;
+            vcc9-supply = <&vcc5v0_sys>;
+            vcc10-supply = <&vcc5v0_sys>;
+            vcc11-supply = <&vcc_2v0_pldo_s3>;
+            vcc12-supply = <&vcc5v0_sys>;
+            vcc13-supply = <&vcc5v0_sys>;
+            vcc14-supply = <&vcc_1v1_nldo_s3>;
+            vcca-supply = <&vcc5v0_sys>;
+
+            regulators {
+                vdd_gpu_s0: dcdc-reg1 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <550000>;
+                    regulator-max-microvolt = <950000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vdd_gpu_s0";
+                    regulator-state-mem {
+                        regulator-off-in-suspend;
+                    };
+                };
+
+                vdd_npu_s0: dcdc-reg2 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <550000>;
+                    regulator-max-microvolt = <950000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vdd_npu_s0";
+                    regulator-state-mem {
+                        regulator-off-in-suspend;
+                    };
+                };
+
+                vdd_log_s0: dcdc-reg3 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <750000>;
+                    regulator-max-microvolt = <750000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vdd_log_s0";
+                    regulator-state-mem {
+                        regulator-on-in-suspend;
+                        regulator-suspend-microvolt = <750000>;
+                    };
+                };
+
+                vdd_vdenc_s0: dcdc-reg4 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <550000>;
+                    regulator-max-microvolt = <950000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vdd_vdenc_s0";
+                    regulator-state-mem {
+                        regulator-off-in-suspend;
+                    };
+                };
+
+                vdd_gpu_mem_s0: dcdc-reg5 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <675000>;
+                    regulator-max-microvolt = <950000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vdd_gpu_mem_s0";
+                    regulator-state-mem {
+                        regulator-off-in-suspend;
+                    };
+                };
+
+                vdd_npu_mem_s0: dcdc-reg6 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <675000>;
+                    regulator-max-microvolt = <950000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vdd_npu_mem_s0";
+                    regulator-state-mem {
+                        regulator-off-in-suspend;
+                    };
+                };
+
+                vcc_2v0_pldo_s3: dcdc-reg7 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <2000000>;
+                    regulator-max-microvolt = <2000000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vdd_2v0_pldo_s3";
+                    regulator-state-mem {
+                        regulator-on-in-suspend;
+                        regulator-suspend-microvolt = <2000000>;
+                    };
+                };
+
+                vdd_vdenc_mem_s0: dcdc-reg8 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <675000>;
+                    regulator-max-microvolt = <950000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vdd_vdenc_mem_s0";
+                    regulator-state-mem {
+                        regulator-off-in-suspend;
+                    };
+                };
+
+                vdd2_ddr_s3: dcdc-reg9 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-name = "vdd2_ddr_s3";
+                    regulator-state-mem {
+                        regulator-on-in-suspend;
+                    };
+                };
+
+                vcc_1v1_nldo_s3: dcdc-reg10 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <1100000>;
+                    regulator-max-microvolt = <1100000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vcc_1v1_nldo_s3";
+                    regulator-state-mem {
+                        regulator-on-in-suspend;
+                        regulator-suspend-microvolt = <1100000>;
+                    };
+                };
+
+                avcc_1v8_s0: pldo-reg1 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <1800000>;
+                    regulator-max-microvolt = <1800000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "avcc_1v8_s0";
+                    regulator-state-mem {
+                        regulator-off-in-suspend;
+                    };
+                };
+
+                vdd1_1v8_ddr_s3: pldo-reg2 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <1800000>;
+                    regulator-max-microvolt = <1800000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vdd1_1v8_ddr_s3";
+                    regulator-state-mem {
+                        regulator-on-in-suspend;
+                        regulator-suspend-microvolt = <1800000>;
+                    };
+                };
+
+                vcc_1v8_s3: pldo-reg3 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <1800000>;
+                    regulator-max-microvolt = <1800000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vcc_1v8_s3";
+                    regulator-state-mem {
+                        regulator-on-in-suspend;
+                        regulator-suspend-microvolt = <1800000>;
+                    };
+                };
+
+                vcc_3v3_s0: pldo-reg4 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <3300000>;
+                    regulator-max-microvolt = <3300000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vcc_3v3_s0";
+                    regulator-state-mem {
+                        regulator-off-in-suspend;
+                    };
+                };
+
+                vccio_sd_s0: pldo-reg5 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <1800000>;
+                    regulator-max-microvolt = <3300000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vccio_sd_s0";
+                    regulator-state-mem {
+                        regulator-off-in-suspend;
+                    };
+                };
+
+                master_pldo6_s3: pldo-reg6 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <1800000>;
+                    regulator-max-microvolt = <1800000>;
+                    regulator-name = "master_pldo6_s3";
+                    regulator-state-mem {
+                        regulator-on-in-suspend;
+                        regulator-suspend-microvolt = <1800000>;
+                    };
+                };
+
+                vdd_0v75_s3: nldo-reg1 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <750000>;
+                    regulator-max-microvolt = <750000>;
+                    regulator-ramp-delay = <12500>;
+                    regulator-name = "vdd_0v75_s3";
+                    regulator-state-mem {
+                        regulator-on-in-suspend;
+                        regulator-suspend-microvolt = <750000>;
+                    };
+                };
+
+                vdd2l_0v9_ddr_s3: nldo-reg2 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <900000>;
+                    regulator-max-microvolt = <900000>;
+                    regulator-name = "vdd2l_0v9_ddr_s3";
+                    regulator-state-mem {
+                        regulator-on-in-suspend;
+                        regulator-suspend-microvolt = <900000>;
+                    };
+                };
+
+                master_nldo3: nldo-reg3 {
+                    regulator-name = "master_nldo3";
+                    regulator-state-mem {
+                        regulator-off-in-suspend;
+                    };
+                };
+
+                avdd_0v75_s0: nldo-reg4 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <750000>;
+                    regulator-max-microvolt = <750000>;
+                    regulator-name = "avdd_0v75_s0";
+                    regulator-state-mem {
+                        regulator-off-in-suspend;
+                    };
+                };
+
+                vdd_0v85_s0: nldo-reg5 {
+                    regulator-always-on;
+                    regulator-boot-on;
+                    regulator-min-microvolt = <850000>;
+                    regulator-max-microvolt = <850000>;
+                    regulator-name = "vdd_0v85_s0";
+                    regulator-state-mem {
+                        regulator-off-in-suspend;
+                    };
+                };
+            };
+        };
+    };
index 1c96da0..2459a55 100644 (file)
@@ -53,10 +53,11 @@ properties:
         items:
           - const: arm,pl18x
           - const: arm,primecell
-      - description: Entry for STMicroelectronics variant of PL18x.
-          This dedicated compatible is used by bootloaders.
+      - description: Entries for STMicroelectronics variant of PL18x.
         items:
-          - const: st,stm32-sdmmc2
+          - enum:
+              - st,stm32-sdmmc2
+              - st,stm32mp25-sdmmc2
           - const: arm,pl18x
           - const: arm,primecell
 
diff --git a/Documentation/devicetree/bindings/mmc/brcm,bcm2835-sdhost.txt b/Documentation/devicetree/bindings/mmc/brcm,bcm2835-sdhost.txt
deleted file mode 100644 (file)
index d876580..0000000
+++ /dev/null
@@ -1,23 +0,0 @@
-Broadcom BCM2835 SDHOST controller
-
-This file documents differences between the core properties described
-by mmc.txt and the properties that represent the BCM2835 controller.
-
-Required properties:
-- compatible: Should be "brcm,bcm2835-sdhost".
-- clocks: The clock feeding the SDHOST controller.
-
-Optional properties:
-- dmas: DMA channel for read and write.
-          See Documentation/devicetree/bindings/dma/dma.txt for details
-
-Example:
-
-sdhost: mmc@7e202000 {
-       compatible = "brcm,bcm2835-sdhost";
-       reg = <0x7e202000 0x100>;
-       interrupts = <2 24>;
-       clocks = <&clocks BCM2835_CLOCK_VPU>;
-       dmas = <&dma 13>;
-       dma-names = "rx-tx";
-};
diff --git a/Documentation/devicetree/bindings/mmc/brcm,bcm2835-sdhost.yaml b/Documentation/devicetree/bindings/mmc/brcm,bcm2835-sdhost.yaml
new file mode 100644 (file)
index 0000000..3a5a448
--- /dev/null
@@ -0,0 +1,54 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/mmc/brcm,bcm2835-sdhost.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Broadcom BCM2835 SDHOST controller
+
+maintainers:
+  - Stefan Wahren <stefan.wahren@i2se.com>
+
+allOf:
+  - $ref: mmc-controller.yaml
+
+properties:
+  compatible:
+    const: brcm,bcm2835-sdhost
+
+  reg:
+    maxItems: 1
+
+  interrupts:
+    maxItems: 1
+
+  clocks:
+    maxItems: 1
+
+  dmas:
+    maxItems: 1
+
+  dma-names:
+    const: rx-tx
+
+required:
+  - compatible
+  - reg
+  - interrupts
+  - clocks
+
+unevaluatedProperties: false
+
+examples:
+  - |
+    #include <dt-bindings/clock/bcm2835.h>
+
+    sdhost: mmc@7e202000 {
+      compatible = "brcm,bcm2835-sdhost";
+      reg = <0x7e202000 0x100>;
+      interrupts = <2 24>;
+      clocks = <&clocks BCM2835_CLOCK_VPU>;
+      dmas = <&dma 13>;
+      dma-names = "rx-tx";
+      bus-width = <4>;
+    };
diff --git a/Documentation/devicetree/bindings/mmc/brcm,kona-sdhci.txt b/Documentation/devicetree/bindings/mmc/brcm,kona-sdhci.txt
deleted file mode 100644 (file)
index 7f5dd83..0000000
+++ /dev/null
@@ -1,21 +0,0 @@
-Broadcom BCM281xx SDHCI
-
-This file documents differences between the core properties in mmc.txt
-and the properties present in the bcm281xx SDHCI
-
-Required properties:
-- compatible : Should be "brcm,kona-sdhci"
-- DEPRECATED: compatible : Should be "bcm,kona-sdhci"
-- clocks: phandle + clock specifier pair of the external clock
-
-Refer to clocks/clock-bindings.txt for generic clock consumer properties.
-
-Example:
-
-sdio2: sdio@3f1a0000 {
-       compatible = "brcm,kona-sdhci";
-       reg = <0x3f1a0000 0x10000>;
-       clocks = <&sdio3_clk>;
-       interrupts = <0x0 74 0x4>;
-};
-
diff --git a/Documentation/devicetree/bindings/mmc/brcm,kona-sdhci.yaml b/Documentation/devicetree/bindings/mmc/brcm,kona-sdhci.yaml
new file mode 100644 (file)
index 0000000..12eb398
--- /dev/null
@@ -0,0 +1,48 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/mmc/brcm,kona-sdhci.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Broadcom Kona family SDHCI controller
+
+maintainers:
+  - Florian Fainelli <f.fainelli@gmail.com>
+
+allOf:
+  - $ref: sdhci-common.yaml#
+
+properties:
+  compatible:
+    const: brcm,kona-sdhci
+
+  reg:
+    maxItems: 1
+
+  clocks:
+    maxItems: 1
+
+  interrupts:
+    maxItems: 1
+
+required:
+  - compatible
+  - reg
+  - clocks
+  - interrupts
+
+unevaluatedProperties: false
+
+examples:
+  - |
+    #include <dt-bindings/interrupt-controller/arm-gic.h>
+    #include <dt-bindings/interrupt-controller/irq.h>
+    #include <dt-bindings/clock/bcm281xx.h>
+
+    mmc@3f1a0000 {
+        compatible = "brcm,kona-sdhci";
+        reg = <0x3f1a0000 0x10000>;
+        clocks = <&master_ccu BCM281XX_MASTER_CCU_SDIO3>;
+        interrupts = <GIC_SPI 74 IRQ_TYPE_LEVEL_HIGH>;
+    };
+...
index fbfd822..82eb7a2 100644 (file)
@@ -42,6 +42,7 @@ properties:
           - enum:
               - fsl,imx6sll-usdhc
               - fsl,imx6ull-usdhc
+              - fsl,imx6ul-usdhc
           - const: fsl,imx6sx-usdhc
       - items:
           - const: fsl,imx7d-usdhc
index 4f2d9e8..6da28e6 100644 (file)
@@ -36,11 +36,14 @@ properties:
           - enum:
               - qcom,ipq5018-sdhci
               - qcom,ipq5332-sdhci
+              - qcom,ipq6018-sdhci
               - qcom,ipq9574-sdhci
               - qcom,qcm2290-sdhci
               - qcom,qcs404-sdhci
+              - qcom,qdu1000-sdhci
               - qcom,sc7180-sdhci
               - qcom,sc7280-sdhci
+              - qcom,sc8280xp-sdhci
               - qcom,sdm630-sdhci
               - qcom,sdm670-sdhci
               - qcom,sdm845-sdhci
index 9a88870..054b6b8 100644 (file)
@@ -49,13 +49,12 @@ properties:
 patternProperties:
   "^nand@[a-f0-9]$":
     type: object
+    $ref: raw-nand-chip.yaml
     properties:
       reg:
         minimum: 0
         maximum: 7
 
-      nand-ecc-mode: true
-
       nand-ecc-algo:
         const: bch
 
@@ -75,7 +74,7 @@ patternProperties:
           minimum: 0
           maximum: 1
 
-    additionalProperties: false
+    unevaluatedProperties: false
 
 required:
   - compatible
index 28fb9a7..787ef48 100644 (file)
@@ -40,6 +40,7 @@ properties:
 patternProperties:
   "^nand@[0-7]$":
     type: object
+    $ref: raw-nand-chip.yaml
     properties:
       reg:
         minimum: 0
@@ -58,6 +59,14 @@ patternProperties:
             meson-gxl-nfc 8, 16, 24, 30, 40, 50, 60
             meson-axg-nfc 8
 
+      nand-rb:
+        maxItems: 1
+        items:
+          maximum: 0
+
+    unevaluatedProperties: false
+
+
 required:
   - compatible
   - reg
@@ -87,6 +96,7 @@ examples:
 
       nand@0 {
         reg = <0>;
+        nand-rb = <0>;
       };
     };
 
index 1571024..f57e963 100644 (file)
@@ -114,6 +114,7 @@ properties:
 patternProperties:
   "^nand@[a-f0-9]$":
     type: object
+    $ref: raw-nand-chip.yaml
     properties:
       compatible:
         const: brcm,nandcs
@@ -136,6 +137,8 @@ patternProperties:
           layout.
         $ref: /schemas/types.yaml#/definitions/uint32
 
+    unevaluatedProperties: false
+
 allOf:
   - $ref: nand-controller.yaml#
   - if:
index 0be83ad..81f9553 100644 (file)
@@ -63,6 +63,12 @@ properties:
     minItems: 1
     maxItems: 2
 
+patternProperties:
+  "^nand@[a-f0-9]$":
+    type: object
+    $ref: raw-nand-chip.yaml
+    unevaluatedProperties: false
+
 allOf:
   - $ref: nand-controller.yaml
 
@@ -74,7 +80,6 @@ allOf:
     then:
       patternProperties:
         "^nand@[a-f0-9]$":
-          type: object
           properties:
             nand-ecc-strength:
               enum:
@@ -92,7 +97,6 @@ allOf:
     then:
       patternProperties:
         "^nand@[a-f0-9]$":
-          type: object
           properties:
             nand-ecc-strength:
               enum:
@@ -111,7 +115,6 @@ allOf:
     then:
       patternProperties:
         "^nand@[a-f0-9]$":
-          type: object
           properties:
             nand-ecc-strength:
               enum:
index a7bdb5d..b9312eb 100644 (file)
@@ -39,7 +39,9 @@ properties:
 patternProperties:
   "^nand@[a-f0-9]$":
     type: object
+    $ref: raw-nand-chip.yaml
     properties:
+
       rb-gpios:
         description: GPIO specifier for the busy pin.
         maxItems: 1
@@ -48,6 +50,8 @@ patternProperties:
         description: GPIO specifier for the write-protect pin.
         maxItems: 1
 
+    unevaluatedProperties: false
+
 required:
   - compatible
   - reg
index cc3def7..07bc7e3 100644 (file)
@@ -42,17 +42,16 @@ properties:
 patternProperties:
   "^nand@[a-f0-9]$":
     type: object
+    $ref: raw-nand-chip.yaml
     properties:
       reg:
         minimum: 0
         maximum: 1
 
-      nand-ecc-mode: true
-
       nand-ecc-algo:
         const: hw
 
-    additionalProperties: false
+    unevaluatedProperties: false
 
 required:
   - compatible
diff --git a/Documentation/devicetree/bindings/mtd/marvell,nand-controller.yaml b/Documentation/devicetree/bindings/mtd/marvell,nand-controller.yaml
new file mode 100644 (file)
index 0000000..a10729b
--- /dev/null
@@ -0,0 +1,226 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/mtd/marvell,nand-controller.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Marvell NAND Flash Controller (NFC)
+
+maintainers:
+  - Miquel Raynal <miquel.raynal@bootlin.com>
+
+properties:
+  compatible:
+    oneOf:
+      - items:
+          - const: marvell,armada-8k-nand-controller
+          - const: marvell,armada370-nand-controller
+      - enum:
+          - marvell,armada370-nand-controller
+          - marvell,pxa3xx-nand-controller
+      - description: legacy bindings
+        deprecated: true
+        enum:
+          - marvell,armada-8k-nand
+          - marvell,armada370-nand
+          - marvell,pxa3xx-nand
+
+  reg:
+    maxItems: 1
+
+  interrupts:
+    maxItems: 1
+
+  clocks:
+    description:
+      Shall reference the NAND controller clocks; the second one is
+      only needed for the Armada 7K/8K SoCs.
+    minItems: 1
+    maxItems: 2
+
+  clock-names:
+    minItems: 1
+    items:
+      - const: core
+      - const: reg
+
+  dmas:
+    maxItems: 1
+
+  dma-names:
+    items:
+      - const: data
+
+  marvell,system-controller:
+    $ref: /schemas/types.yaml#/definitions/phandle
+    description: Syscon node that handles NAND controller related registers
+
+patternProperties:
+  "^nand@[a-f0-9]$":
+    type: object
+    $ref: raw-nand-chip.yaml
+
+    properties:
+      reg:
+        minimum: 0
+        maximum: 3
+
+      nand-rb:
+        items:
+          - minimum: 0
+            maximum: 1
+
+      nand-ecc-step-size:
+        const: 512
+
+      nand-ecc-strength:
+        enum: [1, 4, 8, 12, 16]
+
+      nand-ecc-mode:
+        const: hw
+
+      marvell,nand-keep-config:
+        $ref: /schemas/types.yaml#/definitions/flag
+        description:
+          Orders the driver not to take the timings from the core and
+          leaving them completely untouched. Bootloader timings will then
+          be used.
+
+      marvell,nand-enable-arbiter:
+        $ref: /schemas/types.yaml#/definitions/flag
+        description:
+          Enables the arbiter. All boards blindly used it: the bootloader
+          set this bit for many boards, and even though it is marked
+          reserved in several datasheets it might need to be set
+          (setting it is otherwise harmless).
+        deprecated: true
+
+    required:
+      - reg
+      - nand-rb
+
+    unevaluatedProperties: false
+
+required:
+  - compatible
+  - reg
+  - interrupts
+  - clocks
+
+allOf:
+  - $ref: nand-controller.yaml#
+
+  - if:
+      properties:
+        compatible:
+          contains:
+            const: marvell,pxa3xx-nand-controller
+    then:
+      required:
+        - dmas
+        - dma-names
+
+  - if:
+      properties:
+        compatible:
+          contains:
+            const: marvell,armada-8k-nand-controller
+    then:
+      properties:
+        clocks:
+          minItems: 2
+
+        clock-names:
+          minItems: 2
+
+      required:
+        - marvell,system-controller
+
+    else:
+      properties:
+        clocks:
+          minItems: 1
+
+        clock-names:
+          minItems: 1
+
+
+unevaluatedProperties: false
+
+examples:
+  - |
+    #include <dt-bindings/interrupt-controller/arm-gic.h>
+    nand_controller: nand-controller@d0000 {
+        compatible = "marvell,armada370-nand-controller";
+        reg = <0xd0000 0x54>;
+        #address-cells = <1>;
+        #size-cells = <0>;
+        interrupts = <GIC_SPI 84 IRQ_TYPE_LEVEL_HIGH>;
+        clocks = <&coredivclk 0>;
+
+        nand@0 {
+            reg = <0>;
+            label = "main-storage";
+            nand-rb = <0>;
+            nand-ecc-mode = "hw";
+            marvell,nand-keep-config;
+            nand-on-flash-bbt;
+            nand-ecc-strength = <4>;
+            nand-ecc-step-size = <512>;
+
+            partitions {
+                compatible = "fixed-partitions";
+                #address-cells = <1>;
+                #size-cells = <1>;
+
+                partition@0 {
+                    label = "Rootfs";
+                    reg = <0x00000000 0x40000000>;
+                };
+            };
+        };
+    };
+
+  - |
+    cp0_nand_controller: nand-controller@720000 {
+        compatible = "marvell,armada-8k-nand-controller",
+                "marvell,armada370-nand-controller";
+        reg = <0x720000 0x54>;
+        #address-cells = <1>;
+        #size-cells = <0>;
+        interrupts = <115 IRQ_TYPE_LEVEL_HIGH>;
+        clock-names = "core", "reg";
+        clocks = <&cp0_clk 1 2>,
+                 <&cp0_clk 1 17>;
+        marvell,system-controller = <&cp0_syscon0>;
+
+        nand@0 {
+            reg = <0>;
+            label = "main-storage";
+            nand-rb = <0>;
+            nand-ecc-mode = "hw";
+            nand-ecc-strength = <8>;
+            nand-ecc-step-size = <512>;
+        };
+    };
+
+  - |
+    nand-controller@43100000 {
+        compatible = "marvell,pxa3xx-nand-controller";
+        reg = <0x43100000 90>;
+        interrupts = <45>;
+        clocks = <&clks 1>;
+        clock-names = "core";
+        dmas = <&pdma 97 3>;
+        dma-names = "data";
+        #address-cells = <1>;
+        #size-cells = <0>;
+        nand@0 {
+            reg = <0>;
+            nand-rb = <0>;
+            nand-ecc-mode = "hw";
+            marvell,nand-keep-config;
+        };
+    };
+
+...
diff --git a/Documentation/devicetree/bindings/mtd/marvell-nand.txt b/Documentation/devicetree/bindings/mtd/marvell-nand.txt
deleted file mode 100644 (file)
index a2d9a0f..0000000
+++ /dev/null
@@ -1,126 +0,0 @@
-Marvell NAND Flash Controller (NFC)
-
-Required properties:
-- compatible: can be one of the following:
-    * "marvell,armada-8k-nand-controller"
-    * "marvell,armada370-nand-controller"
-    * "marvell,pxa3xx-nand-controller"
-    * "marvell,armada-8k-nand" (deprecated)
-    * "marvell,armada370-nand" (deprecated)
-    * "marvell,pxa3xx-nand" (deprecated)
-  Compatibles marked deprecated support only the old bindings described
-  at the bottom.
-- reg: NAND flash controller memory area.
-- #address-cells: shall be set to 1. Encode the NAND CS.
-- #size-cells: shall be set to 0.
-- interrupts: shall define the NAND controller interrupt.
-- clocks: shall reference the NAND controller clocks, the second one is
-  is only needed for the Armada 7K/8K SoCs
-- clock-names: mandatory if there is a second clock, in this case there
-  should be one clock named "core" and another one named "reg"
-- marvell,system-controller: Set to retrieve the syscon node that handles
-  NAND controller related registers (only required with the
-  "marvell,armada-8k-nand[-controller]" compatibles).
-
-Optional properties:
-- label: see partition.txt. New platforms shall omit this property.
-- dmas: shall reference DMA channel associated to the NAND controller.
-  This property is only used with "marvell,pxa3xx-nand[-controller]"
-  compatible strings.
-- dma-names: shall be "rxtx".
-  This property is only used with "marvell,pxa3xx-nand[-controller]"
-  compatible strings.
-
-Optional children nodes:
-Children nodes represent the available NAND chips.
-
-Required properties:
-- reg: shall contain the native Chip Select ids (0-3).
-- nand-rb: see nand-controller.yaml (0-1).
-
-Optional properties:
-- marvell,nand-keep-config: orders the driver not to take the timings
-  from the core and leaving them completely untouched. Bootloader
-  timings will then be used.
-- label: MTD name.
-- nand-on-flash-bbt: see nand-controller.yaml.
-- nand-ecc-mode: see nand-controller.yaml. Will use hardware ECC if not specified.
-- nand-ecc-algo: see nand-controller.yaml. This property is essentially useful when
-  not using hardware ECC. Howerver, it may be added when using hardware
-  ECC for clarification but will be ignored by the driver because ECC
-  mode is chosen depending on the page size and the strength required by
-  the NAND chip. This value may be overwritten with nand-ecc-strength
-  property.
-- nand-ecc-strength: see nand-controller.yaml.
-- nand-ecc-step-size: see nand-controller.yaml. Marvell's NAND flash controller does
-  use fixed strength (1-bit for Hamming, 16-bit for BCH), so the actual
-  step size will shrink or grow in order to fit the required strength.
-  Step sizes are not completely random for all and follow certain
-  patterns described in AN-379, "Marvell SoC NFC ECC".
-
-See Documentation/devicetree/bindings/mtd/nand-controller.yaml for more details on
-generic bindings.
-
-
-Example:
-nand_controller: nand-controller@d0000 {
-       compatible = "marvell,armada370-nand-controller";
-       reg = <0xd0000 0x54>;
-       #address-cells = <1>;
-       #size-cells = <0>;
-       interrupts = <GIC_SPI 84 IRQ_TYPE_LEVEL_HIGH>;
-       clocks = <&coredivclk 0>;
-
-       nand@0 {
-               reg = <0>;
-               label = "main-storage";
-               nand-rb = <0>;
-               nand-ecc-mode = "hw";
-               marvell,nand-keep-config;
-               nand-on-flash-bbt;
-               nand-ecc-strength = <4>;
-               nand-ecc-step-size = <512>;
-
-               partitions {
-                       compatible = "fixed-partitions";
-                       #address-cells = <1>;
-                       #size-cells = <1>;
-
-                       partition@0 {
-                               label = "Rootfs";
-                               reg = <0x00000000 0x40000000>;
-                       };
-               };
-       };
-};
-
-
-Note on legacy bindings: One can find, in not-updated device trees,
-bindings slightly different than described above with other properties
-described below as well as the partitions node at the root of a so
-called "nand" node (without clear controller/chip separation).
-
-Legacy properties:
-- marvell,nand-enable-arbiter: To enable the arbiter, all boards blindly
-  used it, this bit was set by the bootloader for many boards and even if
-  it is marked reserved in several datasheets, it might be needed to set
-  it (otherwise it is harmless) so whether or not this property is set,
-  the bit is selected by the driver.
-- num-cs: Number of chip-select lines to use, all boards blindly set 1
-  to this and for a reason, other values would have failed. The value of
-  this property is ignored.
-
-Example:
-
-       nand0: nand@43100000 {
-               compatible = "marvell,pxa3xx-nand";
-               reg = <0x43100000 90>;
-               interrupts = <45>;
-               dmas = <&pdma 97 0>;
-               dma-names = "rxtx";
-               #address-cells = <1>;
-               marvell,nand-keep-config;
-               marvell,nand-enable-arbiter;
-               num-cs = <1>;
-               /* Partitions (optional) */
-       };
index a6e7f12..ab503a3 100644 (file)
@@ -40,12 +40,11 @@ properties:
 
 patternProperties:
   "^nand@[a-f0-9]$":
-    $ref: nand-chip.yaml#
+    $ref: raw-nand-chip.yaml#
     unevaluatedProperties: false
     properties:
       reg:
         maximum: 1
-      nand-on-flash-bbt: true
       nand-ecc-mode:
         const: hw
 
index da3d488..b82ca03 100644 (file)
@@ -12,7 +12,7 @@ maintainers:
 
 properties:
   $nodename:
-    pattern: "^(flash|.*sram)(@.*)?$"
+    pattern: "^(flash|.*sram|nand)(@.*)?$"
 
   label:
     description:
index f70a32d..83a4fe4 100644 (file)
@@ -16,16 +16,6 @@ description: |
   children nodes of the NAND controller. This representation should be
   enforced even for simple controllers supporting only one chip.
 
-  The ECC strength and ECC step size properties define the user
-  desires in terms of correction capability of a controller. Together,
-  they request the ECC engine to correct {strength} bit errors per
-  {size} bytes.
-
-  The interpretation of these parameters is implementation-defined, so
-  not all implementations must support all possible
-  combinations. However, implementations are encouraged to further
-  specify the value(s) they support.
-
 properties:
   $nodename:
     pattern: "^nand-controller(@.*)?"
@@ -51,79 +41,8 @@ properties:
 
 patternProperties:
   "^nand@[a-f0-9]$":
-    $ref: nand-chip.yaml#
-
-    properties:
-      reg:
-        description:
-          Contains the chip-select IDs.
-
-      nand-ecc-placement:
-        description:
-          Location of the ECC bytes. This location is unknown by default
-          but can be explicitly set to "oob", if all ECC bytes are
-          known to be stored in the OOB area, or "interleaved" if ECC
-          bytes will be interleaved with regular data in the main area.
-        $ref: /schemas/types.yaml#/definitions/string
-        enum: [ oob, interleaved ]
-
-      nand-bus-width:
-        description:
-          Bus width to the NAND chip
-        $ref: /schemas/types.yaml#/definitions/uint32
-        enum: [8, 16]
-        default: 8
-
-      nand-on-flash-bbt:
-        description:
-          With this property, the OS will search the device for a Bad
-          Block Table (BBT). If not found, it will create one, reserve
-          a few blocks at the end of the device to store it and update
-          it as the device ages. Otherwise, the out-of-band area of a
-          few pages of all the blocks will be scanned at boot time to
-          find Bad Block Markers (BBM). These markers will help to
-          build a volatile BBT in RAM.
-        $ref: /schemas/types.yaml#/definitions/flag
-
-      nand-ecc-maximize:
-        description:
-          Whether or not the ECC strength should be maximized. The
-          maximum ECC strength is both controller and chip
-          dependent. The ECC engine has to select the ECC config
-          providing the best strength and taking the OOB area size
-          constraint into account. This is particularly useful when
-          only the in-band area is used by the upper layers, and you
-          want to make your NAND as reliable as possible.
-        $ref: /schemas/types.yaml#/definitions/flag
-
-      nand-is-boot-medium:
-        description:
-          Whether or not the NAND chip is a boot medium. Drivers might
-          use this information to select ECC algorithms supported by
-          the boot ROM or similar restrictions.
-        $ref: /schemas/types.yaml#/definitions/flag
-
-      nand-rb:
-        description:
-          Contains the native Ready/Busy IDs.
-        $ref: /schemas/types.yaml#/definitions/uint32-array
-
-      rb-gpios:
-        description:
-          Contains one or more GPIO descriptors (the number of descriptors
-          depends on the number of R/B pins exposed by the flash) for the
-          Ready/Busy pins. Active state refers to the NAND ready state and
-          should be set to GPIOD_ACTIVE_HIGH unless the signal is inverted.
-
-      wp-gpios:
-        description:
-          Contains one GPIO descriptor for the Write Protect pin.
-          Active state refers to the NAND Write Protect state and should be
-          set to GPIOD_ACTIVE_LOW unless the signal is inverted.
-        maxItems: 1
-
-    required:
-      - reg
+    type: object
+    $ref: raw-nand-chip.yaml#
 
 required:
   - "#address-cells"
index cdffbb9..1ebe9e2 100644 (file)
@@ -55,6 +55,7 @@ properties:
   linux,rootfs:
     description: Marks partition that contains root filesystem to mount and boot
       user space from
+    type: boolean
 
 if:
   not:
index 2edc65e..1dda2c8 100644 (file)
@@ -21,6 +21,7 @@ oneOf:
   - $ref: linksys,ns-partitions.yaml
   - $ref: qcom,smem-part.yaml
   - $ref: redboot-fis.yaml
+  - $ref: tplink,safeloader-partitions.yaml
 
 properties:
   compatible: true
index 00c991f..4ada60f 100644 (file)
@@ -34,7 +34,9 @@ properties:
 patternProperties:
   "^nand@[a-f0-9]$":
     type: object
+    $ref: raw-nand-chip.yaml
     properties:
+
       nand-bus-width:
         const: 8
 
@@ -45,6 +47,24 @@ patternProperties:
         enum:
           - 512
 
+      qcom,boot-partitions:
+        $ref: /schemas/types.yaml#/definitions/uint32-matrix
+        items:
+          items:
+            - description: offset
+            - description: size
+        description:
+          Boot partitions use a different layout where the 4 bytes of spare
+          data are not protected by ECC. Use this to declare these special
+          partitions by defining first the offset and then the size.
+
+          It's in the form of <offset1 size1 offset2 size2 offset3 ...>
+          and should be declared in ascending order.
+
+          Refer to the ipq8064 example on how to use this special binding.
+
+    unevaluatedProperties: false
+
 allOf:
   - $ref: nand-controller.yaml#
 
@@ -107,22 +127,15 @@ allOf:
               - qcom,ipq806x-nand
 
     then:
-      properties:
-        qcom,boot-partitions:
-          $ref: /schemas/types.yaml#/definitions/uint32-matrix
-          items:
-            items:
-              - description: offset
-              - description: size
-          description:
-            Boot partition use a different layout where the 4 bytes of spare
-            data are not protected by ECC. Use this to declare these special
-            partitions by defining first the offset and then the size.
-
-            It's in the form of <offset1 size1 offset2 size2 offset3 ...>
-            and should be declared in ascending order.
-
-            Refer to the ipq8064 example on how to use this special binding.
+      patternProperties:
+        "^nand@[a-f0-9]$":
+          properties:
+            qcom,boot-partitions: true
+    else:
+      patternProperties:
+        "^nand@[a-f0-9]$":
+          properties:
+            qcom,boot-partitions: false
 
 required:
   - compatible
diff --git a/Documentation/devicetree/bindings/mtd/raw-nand-chip.yaml b/Documentation/devicetree/bindings/mtd/raw-nand-chip.yaml
new file mode 100644 (file)
index 0000000..092448d
--- /dev/null
@@ -0,0 +1,111 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/mtd/raw-nand-chip.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Raw NAND Chip Common Properties
+
+maintainers:
+  - Miquel Raynal <miquel.raynal@bootlin.com>
+
+allOf:
+  - $ref: nand-chip.yaml#
+
+description: |
+  The ECC strength and ECC step size properties define the user
+  desires in terms of correction capability of a controller. Together,
+  they request the ECC engine to correct {strength} bit errors per
+  {size} bytes for a particular raw NAND chip.
+
+  The interpretation of these parameters is implementation-defined, so
+  not all implementations must support all possible
+  combinations. However, implementations are encouraged to further
+  specify the value(s) they support.
+
+properties:
+  $nodename:
+    pattern: "^nand@[a-f0-9]$"
+
+  reg:
+    description:
+      Contains the chip-select IDs.
+
+  nand-ecc-placement:
+    description:
+      Location of the ECC bytes. This location is unknown by default
+      but can be explicitly set to "oob", if all ECC bytes are
+      known to be stored in the OOB area, or "interleaved" if ECC
+      bytes will be interleaved with regular data in the main area.
+    $ref: /schemas/types.yaml#/definitions/string
+    enum: [ oob, interleaved ]
+    deprecated: true
+
+  nand-ecc-mode:
+    description:
+      Legacy ECC configuration mixing the ECC engine choice and
+      configuration.
+    $ref: /schemas/types.yaml#/definitions/string
+    enum: [none, soft, soft_bch, hw, hw_syndrome, on-die]
+    deprecated: true
+
+  nand-bus-width:
+    description:
+      Bus width to the NAND chip
+    $ref: /schemas/types.yaml#/definitions/uint32
+    enum: [8, 16]
+    default: 8
+
+  nand-on-flash-bbt:
+    description:
+      With this property, the OS will search the device for a Bad
+      Block Table (BBT). If not found, it will create one, reserve
+      a few blocks at the end of the device to store it and update
+      it as the device ages. Otherwise, the out-of-band area of a
+      few pages of all the blocks will be scanned at boot time to
+      find Bad Block Markers (BBM). These markers will help to
+      build a volatile BBT in RAM.
+    $ref: /schemas/types.yaml#/definitions/flag
+
+  nand-ecc-maximize:
+    description:
+      Whether or not the ECC strength should be maximized. The
+      maximum ECC strength is both controller and chip
+      dependent. The ECC engine has to select the ECC config
+      providing the best strength and taking the OOB area size
+      constraint into account. This is particularly useful when
+      only the in-band area is used by the upper layers, and you
+      want to make your NAND as reliable as possible.
+    $ref: /schemas/types.yaml#/definitions/flag
+
+  nand-is-boot-medium:
+    description:
+      Whether or not the NAND chip is a boot medium. Drivers might
+      use this information to select ECC algorithms supported by
+      the boot ROM or similar restrictions.
+    $ref: /schemas/types.yaml#/definitions/flag
+
+  nand-rb:
+    description:
+      Contains the native Ready/Busy IDs.
+    $ref: /schemas/types.yaml#/definitions/uint32-array
+
+  rb-gpios:
+    description:
+      Contains one or more GPIO descriptors (the number of descriptors
+      depends on the number of R/B pins exposed by the flash) for the
+      Ready/Busy pins. Active state refers to the NAND ready state and
+      should be set to GPIOD_ACTIVE_HIGH unless the signal is inverted.
+
+  wp-gpios:
+    description:
+      Contains one GPIO descriptor for the Write Protect pin.
+      Active state refers to the NAND Write Protect state and should be
+      set to GPIOD_ACTIVE_LOW unless the signal is inverted.
+    maxItems: 1
+
+required:
+  - reg
+
+# This is a generic file other binding inherit from and extend
+additionalProperties: true
index 7eb1d0a..ee53715 100644 (file)
@@ -57,6 +57,7 @@ properties:
 patternProperties:
   "^nand@[0-7]$":
     type: object
+    $ref: raw-nand-chip.yaml
     properties:
       reg:
         minimum: 0
@@ -116,6 +117,8 @@ patternProperties:
 
           Only used in combination with 'nand-is-boot-medium'.
 
+    unevaluatedProperties: false
+
 required:
   - compatible
   - reg
index 986e85c..e72cb5b 100644 (file)
@@ -37,6 +37,7 @@ properties:
 patternProperties:
   "^nand@[a-f0-9]$":
     type: object
+    $ref: raw-nand-chip.yaml
     properties:
       nand-ecc-step-size:
         const: 512
@@ -44,6 +45,8 @@ patternProperties:
       nand-ecc-strength:
         enum: [1, 4, 8]
 
+    unevaluatedProperties: false
+
 allOf:
   - $ref: nand-controller.yaml#
 
index 4774c92..df4fdc0 100644 (file)
@@ -30,6 +30,8 @@ properties:
 patternProperties:
   "^flash@[0-1],[0-9a-f]+$":
     type: object
+    $ref: mtd-physmap.yaml
+    unevaluatedProperties: false
 
 required:
   - compatible
index 80a9238..e9fad4b 100644 (file)
@@ -4,7 +4,7 @@
 $id: http://devicetree.org/schemas/perf/fsl-imx-ddr.yaml#
 $schema: http://devicetree.org/meta-schemas/core.yaml#
 
-title: Freescale(NXP) IMX8 DDR performance monitor
+title: Freescale(NXP) IMX8/9 DDR performance monitor
 
 maintainers:
   - Frank Li <frank.li@nxp.com>
@@ -19,6 +19,7 @@ properties:
           - fsl,imx8mm-ddr-pmu
           - fsl,imx8mn-ddr-pmu
           - fsl,imx8mp-ddr-pmu
+          - fsl,imx93-ddr-pmu
       - items:
           - enum:
               - fsl,imx8mm-ddr-pmu
index 7034cdc..b638430 100644 (file)
@@ -8,15 +8,14 @@ Documentation/devicetree/bindings/regulator/regulator.txt.
 
 The valid names for regulators are::
 BUCK:
-  buck_vdram1, buck_vcore, buck_vcore_sshub, buck_vpa, buck_vproc11,
-  buck_vproc12, buck_vgpu, buck_vs2, buck_vmodem, buck_vs1
+  buck_vdram1, buck_vcore, buck_vpa, buck_vproc11, buck_vproc12, buck_vgpu,
+  buck_vs2, buck_vmodem, buck_vs1
 LDO:
   ldo_vdram2, ldo_vsim1, ldo_vibr, ldo_vrf12, ldo_vio18, ldo_vusb, ldo_vcamio,
   ldo_vcamd, ldo_vcn18, ldo_vfe28, ldo_vsram_proc11, ldo_vcn28, ldo_vsram_others,
-  ldo_vsram_others_sshub, ldo_vsram_gpu, ldo_vxo22, ldo_vefuse, ldo_vaux18,
-  ldo_vmch, ldo_vbif28, ldo_vsram_proc12, ldo_vcama1, ldo_vemc, ldo_vio28, ldo_va12,
-  ldo_vrf18, ldo_vcn33_bt, ldo_vcn33_wifi, ldo_vcama2, ldo_vmc, ldo_vldo28, ldo_vaud28,
-  ldo_vsim2
+  ldo_vsram_gpu, ldo_vxo22, ldo_vefuse, ldo_vaux18, ldo_vmch, ldo_vbif28,
+  ldo_vsram_proc12, ldo_vcama1, ldo_vemc, ldo_vio28, ldo_va12, ldo_vrf18,
+  ldo_vcn33, ldo_vcama2, ldo_vmc, ldo_vldo28, ldo_vaud28, ldo_vsim2
 
 Example:
 
@@ -305,15 +304,8 @@ Example:
                                regulator-enable-ramp-delay = <120>;
                        };
 
-                       mt6358_vcn33_bt_reg: ldo_vcn33_bt {
-                               regulator-name = "vcn33_bt";
-                               regulator-min-microvolt = <3300000>;
-                               regulator-max-microvolt = <3500000>;
-                               regulator-enable-ramp-delay = <270>;
-                       };
-
-                       mt6358_vcn33_wifi_reg: ldo_vcn33_wifi {
-                               regulator-name = "vcn33_wifi";
+                       mt6358_vcn33_reg: ldo_vcn33 {
+                               regulator-name = "vcn33";
                                regulator-min-microvolt = <3300000>;
                                regulator-max-microvolt = <3500000>;
                                regulator-enable-ramp-delay = <270>;
@@ -354,17 +346,5 @@ Example:
                                regulator-max-microvolt = <3100000>;
                                regulator-enable-ramp-delay = <540>;
                        };
-
-                       mt6358_vcore_sshub_reg: buck_vcore_sshub {
-                               regulator-name = "vcore_sshub";
-                               regulator-min-microvolt = <500000>;
-                               regulator-max-microvolt = <1293750>;
-                       };
-
-                       mt6358_vsram_others_sshub_reg: ldo_vsram_others_sshub {
-                               regulator-name = "vsram_others_sshub";
-                               regulator-min-microvolt = <500000>;
-                               regulator-max-microvolt = <1293750>;
-                       };
                };
        };
index 67a30b2..e384e49 100644 (file)
@@ -36,6 +36,9 @@ properties:
   reg:
     maxItems: 1
 
+  interrupts:
+    maxItems: 1
+
   fsl,pfuze-support-disable-sw:
     $ref: /schemas/types.yaml#/definitions/flag
     description: |
index 7e58471..80ecf93 100644 (file)
@@ -64,6 +64,7 @@ properties:
         defined, <100> is assumed, meaning that
         pwm-dutycycle-range contains values expressed in
         percent.
+    $ref: /schemas/types.yaml#/definitions/uint32
     default: 100
 
   pwm-dutycycle-range:
diff --git a/Documentation/devicetree/bindings/regulator/renesas,raa215300.yaml b/Documentation/devicetree/bindings/regulator/renesas,raa215300.yaml
new file mode 100644 (file)
index 0000000..97cff71
--- /dev/null
@@ -0,0 +1,85 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/regulator/renesas,raa215300.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Renesas RAA215300 Power Management Integrated Circuit (PMIC)
+
+maintainers:
+  - Biju Das <biju.das.jz@bp.renesas.com>
+
+description: |
+  The RAA215300 is a high-performance, low-cost 9-channel PMIC designed for
+  32-bit and 64-bit MCU and MPU applications. It supports DDR3, DDR3L, DDR4,
+  and LPDDR4 memory power requirements. The internally compensated regulators,
+  built-in Real-Time Clock (RTC), 32kHz crystal oscillator, and coin cell
+  battery charger provide a highly integrated, small footprint power solution
+  ideal for System-On-Module (SOM) applications. A spread spectrum feature
+  provides an ease-of-use solution for noise-sensitive audio or RF applications.
+
+  This device exposes two devices via I2C: one for the integrated RTC IP, and
+  one for everything else.
+
+  Link to datasheet:
+  https://www.renesas.com/in/en/products/power-power-management/multi-channel-power-management-ics-pmics/ssdsoc-power-management-ics-pmic-and-pmus/raa215300-high-performance-9-channel-pmic-supporting-ddr-memory-built-charger-and-rtc
+
+properties:
+  compatible:
+    enum:
+      - renesas,raa215300
+
+  reg:
+    maxItems: 2
+
+  reg-names:
+    items:
+      - const: main
+      - const: rtc
+
+  interrupts:
+    maxItems: 1
+
+  clocks:
+    description: |
+      The clocks are optional. The RTC is disabled if no clock (either xin
+      or clkin) is provided.
+    maxItems: 1
+
+  clock-names:
+    description: |
+      Use xin, if connected to an external crystal.
+      Use clkin, if connected to an external clock signal.
+    enum:
+      - xin
+      - clkin
+
+required:
+  - compatible
+  - reg
+  - reg-names
+
+additionalProperties: false
+
+examples:
+  - |
+    /* 32.768kHz crystal */
+    x2: x2-clock {
+        compatible = "fixed-clock";
+        #clock-cells = <0>;
+        clock-frequency = <32768>;
+    };
+
+    i2c {
+        #address-cells = <1>;
+        #size-cells = <0>;
+
+        raa215300: pmic@12 {
+            compatible = "renesas,raa215300";
+            reg = <0x12>, <0x6f>;
+            reg-names = "main", "rtc";
+
+            clocks = <&x2>;
+            clock-names = "xin";
+        };
+    };
diff --git a/Documentation/devicetree/bindings/regulator/ti,tps62870.yaml b/Documentation/devicetree/bindings/regulator/ti,tps62870.yaml
new file mode 100644 (file)
index 0000000..3869895
--- /dev/null
@@ -0,0 +1,52 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/regulator/ti,tps62870.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: TI TPS62870/TPS62871/TPS62872/TPS62873 voltage regulator
+
+maintainers:
+  - Mårten Lindahl <marten.lindahl@axis.com>
+
+allOf:
+  - $ref: regulator.yaml#
+
+properties:
+  compatible:
+    enum:
+      - ti,tps62870
+      - ti,tps62871
+      - ti,tps62872
+      - ti,tps62873
+
+  reg:
+    maxItems: 1
+
+  regulator-initial-mode:
+    enum: [ 1, 2 ]
+    description: 1 - Forced PWM mode, 2 - Low power mode
+
+required:
+  - compatible
+  - reg
+
+unevaluatedProperties: false
+
+examples:
+  - |
+    i2c {
+      #address-cells = <1>;
+      #size-cells = <0>;
+
+      regulator@41 {
+        compatible = "ti,tps62873";
+        reg = <0x41>;
+        regulator-name = "+0.75V";
+        regulator-min-microvolt = <400000>;
+        regulator-max-microvolt = <1675000>;
+        regulator-initial-mode = <1>;
+      };
+    };
+
+...
index 2155478..a6f34bd 100644 (file)
@@ -14,9 +14,6 @@ maintainers:
   - Maxime Ripard <mripard@kernel.org>
 
 properties:
-  "#address-cells": true
-  "#size-cells": true
-
   compatible:
     const: allwinner,sun4i-a10-spi
 
@@ -46,12 +43,9 @@ properties:
       - const: rx
       - const: tx
 
-  num-cs: true
-
 patternProperties:
   "^.*@[0-9a-f]+":
     type: object
-    additionalProperties: true
     properties:
       reg:
         items:
@@ -71,7 +65,7 @@ required:
   - clocks
   - clock-names
 
-additionalProperties: false
+unevaluatedProperties: false
 
 examples:
   - |
index de36c6a..28b8ace 100644 (file)
@@ -14,11 +14,9 @@ maintainers:
   - Maxime Ripard <mripard@kernel.org>
 
 properties:
-  "#address-cells": true
-  "#size-cells": true
-
   compatible:
     oneOf:
+      - const: allwinner,sun50i-r329-spi
       - const: allwinner,sun6i-a31-spi
       - const: allwinner,sun8i-h3-spi
       - items:
@@ -28,6 +26,15 @@ properties:
               - allwinner,sun50i-h616-spi
               - allwinner,suniv-f1c100s-spi
           - const: allwinner,sun8i-h3-spi
+      - items:
+          - enum:
+              - allwinner,sun20i-d1-spi
+              - allwinner,sun50i-r329-spi-dbi
+          - const: allwinner,sun50i-r329-spi
+      - items:
+          - const: allwinner,sun20i-d1-spi-dbi
+          - const: allwinner,sun50i-r329-spi-dbi
+          - const: allwinner,sun50i-r329-spi
 
   reg:
     maxItems: 1
@@ -58,12 +65,9 @@ properties:
       - const: rx
       - const: tx
 
-  num-cs: true
-
 patternProperties:
   "^.*@[0-9a-f]+":
     type: object
-    additionalProperties: true
     properties:
       reg:
         items:
@@ -83,7 +87,7 @@ required:
   - clocks
   - clock-names
 
-additionalProperties: false
+unevaluatedProperties: false
 
 examples:
   - |
index 6c57dd6..5836758 100644 (file)
@@ -20,6 +20,10 @@ properties:
       - items:
           - const: microchip,sam9x60-spi
           - const: atmel,at91rm9200-spi
+      - items:
+          - const: microchip,sam9x7-spi
+          - const: microchip,sam9x60-spi
+          - const: atmel,at91rm9200-spi
 
   reg:
     maxItems: 1
index b310069..4f15f9a 100644 (file)
@@ -46,12 +46,28 @@ allOf:
           maxItems: 2
           items:
             enum: [ qspi, qspi-ocp ]
+  - if:
+      properties:
+        compatible:
+          contains:
+            const: amd,pensando-elba-qspi
+    then:
+      properties:
+        cdns,fifo-depth:
+          enum: [ 128, 256, 1024 ]
+          default: 1024
+    else:
+      properties:
+        cdns,fifo-depth:
+          enum: [ 128, 256 ]
+          default: 128
 
 properties:
   compatible:
     oneOf:
       - items:
           - enum:
+              - amd,pensando-elba-qspi
               - ti,k2g-qspi
               - ti,am654-ospi
               - intel,lgm-qspi
@@ -76,8 +92,6 @@ properties:
     description:
       Size of the data FIFO in words.
     $ref: /schemas/types.yaml#/definitions/uint32
-    enum: [ 128, 256 ]
-    default: 128
 
   cdns,fifo-width:
     $ref: /schemas/types.yaml#/definitions/uint32
index ee8f7ea..1696ac4 100644 (file)
@@ -29,6 +29,9 @@ properties:
   reg:
     maxItems: 1
 
+  iommus:
+    maxItems: 1
+
   interrupts:
     maxItems: 1
 
diff --git a/Documentation/devicetree/bindings/spi/renesas,rzv2m-csi.yaml b/Documentation/devicetree/bindings/spi/renesas,rzv2m-csi.yaml
new file mode 100644 (file)
index 0000000..e59183e
--- /dev/null
@@ -0,0 +1,70 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/spi/renesas,rzv2m-csi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Renesas RZ/V2M Clocked Serial Interface (CSI)
+
+maintainers:
+  - Fabrizio Castro <fabrizio.castro.jz@renesas.com>
+  - Geert Uytterhoeven <geert+renesas@glider.be>
+
+allOf:
+  - $ref: spi-controller.yaml#
+
+properties:
+  compatible:
+    const: renesas,rzv2m-csi
+
+  reg:
+    maxItems: 1
+
+  interrupts:
+    maxItems: 1
+
+  clocks:
+    items:
+      - description: The clock used to generate the output clock (CSICLK)
+      - description: Internal clock to access the registers (PCLK)
+
+  clock-names:
+    items:
+      - const: csiclk
+      - const: pclk
+
+  resets:
+    maxItems: 1
+
+  power-domains:
+    maxItems: 1
+
+required:
+  - compatible
+  - reg
+  - interrupts
+  - clocks
+  - clock-names
+  - resets
+  - power-domains
+  - '#address-cells'
+  - '#size-cells'
+
+unevaluatedProperties: false
+
+examples:
+  - |
+    #include <dt-bindings/interrupt-controller/arm-gic.h>
+    #include <dt-bindings/clock/r9a09g011-cpg.h>
+    csi4: spi@a4020200 {
+        compatible = "renesas,rzv2m-csi";
+        reg = <0xa4020200 0x80>;
+        interrupts = <GIC_SPI 230 IRQ_TYPE_LEVEL_HIGH>;
+        clocks = <&cpg CPG_MOD R9A09G011_CSI4_CLK>,
+                 <&cpg CPG_MOD R9A09G011_CPERI_GRPH_PCLK>;
+        clock-names = "csiclk", "pclk";
+        resets = <&cpg R9A09G011_CSI_GPH_PRESETN>;
+        power-domains = <&cpg>;
+        #address-cells = <1>;
+        #size-cells = <0>;
+    };
index e0a465d..79da99c 100644 (file)
@@ -35,8 +35,6 @@ properties:
     minItems: 2
     maxItems: 3
 
-  cs-gpios: true
-
   dmas:
     minItems: 2
     maxItems: 2
index 12ca108..a47cb14 100644 (file)
@@ -74,6 +74,8 @@ properties:
         const: intel,keembay-ssi
       - description: Intel Thunder Bay SPI Controller
         const: intel,thunderbay-ssi
+      - description: Intel Mount Evans Integrated Management Complex SPI Controller
+        const: intel,mountevans-imc-ssi
       - description: AMD Pensando Elba SoC SPI Controller
         const: amd,pensando-elba-spi
       - description: Baikal-T1 SPI Controller
index 597fc4e..c96131e 100644 (file)
@@ -17,9 +17,6 @@ allOf:
   - $ref: spi-controller.yaml#
 
 properties:
-  "#address-cells": true
-  "#size-cells": true
-
   compatible:
     const: socionext,uniphier-scssi
 
index 90945f5..524f6fe 100644 (file)
@@ -17,7 +17,7 @@ description: |
 
 properties:
   $nodename:
-    pattern: "^spi(@.*|-[0-9a-f])*$"
+    pattern: "^spi(@.*|-([0-9]|[1-9][0-9]+))?$"
 
   "#address-cells":
     enum: [0, 1]
index 20f7724..226d8b4 100644 (file)
@@ -32,6 +32,12 @@ properties:
   clocks:
     maxItems: 2
 
+  iommus:
+    maxItems: 1
+
+  power-domains:
+    maxItems: 1
+
 required:
   - compatible
   - reg
index b0bee7e..ab8b8fc 100644 (file)
@@ -8,6 +8,7 @@ Required properties:
     * marvell,armada380-thermal
     * marvell,armadaxp-thermal
     * marvell,armada-ap806-thermal
+    * marvell,armada-ap807-thermal
     * marvell,armada-cp110-thermal
 
 Note: these bindings are deprecated for AP806/CP110 and should instead
diff --git a/Documentation/devicetree/bindings/thermal/brcm,bcm2835-thermal.txt b/Documentation/devicetree/bindings/thermal/brcm,bcm2835-thermal.txt
deleted file mode 100644 (file)
index a3e9ec5..0000000
+++ /dev/null
@@ -1,41 +0,0 @@
-Binding for Thermal Sensor driver for BCM2835 SoCs.
-
-Required parameters:
--------------------
-
-compatible:            should be one of: "brcm,bcm2835-thermal",
-                       "brcm,bcm2836-thermal" or "brcm,bcm2837-thermal"
-reg:                   Address range of the thermal registers.
-clocks:                Phandle of the clock used by the thermal sensor.
-#thermal-sensor-cells: should be 0 (see Documentation/devicetree/bindings/thermal/thermal-sensor.yaml)
-
-Example:
-
-thermal-zones {
-       cpu_thermal: cpu-thermal {
-               polling-delay-passive = <0>;
-               polling-delay = <1000>;
-
-               thermal-sensors = <&thermal>;
-
-               trips {
-                       cpu-crit {
-                               temperature     = <80000>;
-                               hysteresis      = <0>;
-                               type            = "critical";
-                       };
-               };
-
-               coefficients = <(-538)  407000>;
-
-               cooling-maps {
-               };
-       };
-};
-
-thermal: thermal@7e212000 {
-       compatible = "brcm,bcm2835-thermal";
-       reg = <0x7e212000 0x8>;
-       clocks = <&clocks BCM2835_CLOCK_TSENS>;
-       #thermal-sensor-cells = <0>;
-};
diff --git a/Documentation/devicetree/bindings/thermal/brcm,bcm2835-thermal.yaml b/Documentation/devicetree/bindings/thermal/brcm,bcm2835-thermal.yaml
new file mode 100644 (file)
index 0000000..2b6026d
--- /dev/null
@@ -0,0 +1,48 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/thermal/brcm,bcm2835-thermal.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Broadcom BCM2835 thermal sensor
+
+maintainers:
+  - Stefan Wahren <stefan.wahren@i2se.com>
+
+allOf:
+  - $ref: thermal-sensor.yaml#
+
+properties:
+  compatible:
+    enum:
+      - brcm,bcm2835-thermal
+      - brcm,bcm2836-thermal
+      - brcm,bcm2837-thermal
+
+  reg:
+    maxItems: 1
+
+  clocks:
+    maxItems: 1
+
+  "#thermal-sensor-cells":
+    const: 0
+
+unevaluatedProperties: false
+
+required:
+  - compatible
+  - reg
+  - clocks
+  - '#thermal-sensor-cells'
+
+examples:
+  - |
+    #include <dt-bindings/clock/bcm2835.h>
+
+    thermal@7e212000 {
+      compatible = "brcm,bcm2835-thermal";
+      reg = <0x7e212000 0x8>;
+      clocks = <&clocks BCM2835_CLOCK_TSENS>;
+      #thermal-sensor-cells = <0>;
+    };
index d1ec963..27e9e16 100644 (file)
@@ -29,6 +29,8 @@ properties:
         items:
           - enum:
               - qcom,mdm9607-tsens
+              - qcom,msm8226-tsens
+              - qcom,msm8909-tsens
               - qcom,msm8916-tsens
               - qcom,msm8939-tsens
               - qcom,msm8974-tsens
@@ -48,6 +50,7 @@ properties:
               - qcom,msm8953-tsens
               - qcom,msm8996-tsens
               - qcom,msm8998-tsens
+              - qcom,qcm2290-tsens
               - qcom,sc7180-tsens
               - qcom,sc7280-tsens
               - qcom,sc8180x-tsens
@@ -56,6 +59,7 @@ properties:
               - qcom,sdm845-tsens
               - qcom,sm6115-tsens
               - qcom,sm6350-tsens
+              - qcom,sm6375-tsens
               - qcom,sm8150-tsens
               - qcom,sm8250-tsens
               - qcom,sm8350-tsens
@@ -67,6 +71,12 @@ properties:
         enum:
           - qcom,ipq8074-tsens
 
+      - description: v2 of TSENS with combined interrupt
+        items:
+          - enum:
+              - qcom,ipq9574-tsens
+          - const: qcom,ipq8074-tsens
+
   reg:
     items:
       - description: TM registers
@@ -223,12 +233,7 @@ allOf:
           contains:
             enum:
               - qcom,ipq8064-tsens
-              - qcom,mdm9607-tsens
-              - qcom,msm8916-tsens
               - qcom,msm8960-tsens
-              - qcom,msm8974-tsens
-              - qcom,msm8976-tsens
-              - qcom,qcs404-tsens
               - qcom,tsens-v0_1
               - qcom,tsens-v1
     then:
@@ -244,22 +249,7 @@ allOf:
       properties:
         compatible:
           contains:
-            enum:
-              - qcom,msm8953-tsens
-              - qcom,msm8996-tsens
-              - qcom,msm8998-tsens
-              - qcom,sc7180-tsens
-              - qcom,sc7280-tsens
-              - qcom,sc8180x-tsens
-              - qcom,sc8280xp-tsens
-              - qcom,sdm630-tsens
-              - qcom,sdm845-tsens
-              - qcom,sm6350-tsens
-              - qcom,sm8150-tsens
-              - qcom,sm8250-tsens
-              - qcom,sm8350-tsens
-              - qcom,sm8450-tsens
-              - qcom,tsens-v2
+            const: qcom,tsens-v2
     then:
       properties:
         interrupts:
diff --git a/Documentation/devicetree/bindings/timer/brcm,kona-timer.txt b/Documentation/devicetree/bindings/timer/brcm,kona-timer.txt
deleted file mode 100644 (file)
index 39adf54..0000000
+++ /dev/null
@@ -1,25 +0,0 @@
-Broadcom Kona Family timer
------------------------------------------------------
-This timer is used in the following Broadcom SoCs:
- BCM11130, BCM11140, BCM11351, BCM28145, BCM28155
-
-Required properties:
-- compatible : "brcm,kona-timer"
-- DEPRECATED: compatible : "bcm,kona-timer"
-- reg : Register range for the timer
-- interrupts : interrupt for the timer
-- clocks: phandle + clock specifier pair of the external clock
-- clock-frequency: frequency that the clock operates
-
-Only one of clocks or clock-frequency should be specified.
-
-Refer to clocks/clock-bindings.txt for generic clock consumer properties.
-
-Example:
-       timer@35006000 {
-               compatible = "brcm,kona-timer";
-               reg = <0x35006000 0x1000>;
-               interrupts = <0x0 7 0x4>;
-               clocks = <&hub_timer_clk>;
-       };
-
diff --git a/Documentation/devicetree/bindings/timer/brcm,kona-timer.yaml b/Documentation/devicetree/bindings/timer/brcm,kona-timer.yaml
new file mode 100644 (file)
index 0000000..d6af838
--- /dev/null
@@ -0,0 +1,52 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/timer/brcm,kona-timer.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Broadcom Kona family timer
+
+maintainers:
+  - Florian Fainelli <f.fainelli@gmail.com>
+
+properties:
+  compatible:
+    const: brcm,kona-timer
+
+  reg:
+    maxItems: 1
+
+  interrupts:
+    maxItems: 1
+
+  clocks:
+    maxItems: 1
+
+  clock-frequency: true
+
+oneOf:
+  - required:
+      - clocks
+  - required:
+      - clock-frequency
+
+required:
+  - compatible
+  - reg
+  - interrupts
+
+additionalProperties: false
+
+examples:
+  - |
+    #include <dt-bindings/clock/bcm281xx.h>
+    #include <dt-bindings/interrupt-controller/arm-gic.h>
+    #include <dt-bindings/interrupt-controller/irq.h>
+
+    timer@35006000 {
+        compatible = "brcm,kona-timer";
+        reg = <0x35006000 0x1000>;
+        interrupts = <GIC_SPI 7 IRQ_TYPE_LEVEL_HIGH>;
+        clocks = <&aon_ccu BCM281XX_AON_CCU_HUB_TIMER>;
+    };
+...
diff --git a/Documentation/devicetree/bindings/timer/loongson,ls1x-pwmtimer.yaml b/Documentation/devicetree/bindings/timer/loongson,ls1x-pwmtimer.yaml
new file mode 100644 (file)
index 0000000..ad61ae5
--- /dev/null
@@ -0,0 +1,48 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/timer/loongson,ls1x-pwmtimer.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Loongson-1 PWM timer
+
+maintainers:
+  - Keguang Zhang <keguang.zhang@gmail.com>
+
+description:
+  Loongson-1 PWM timer can be used for system clock source
+  and clock event timers.
+
+properties:
+  compatible:
+    const: loongson,ls1b-pwmtimer
+
+  reg:
+    maxItems: 1
+
+  clocks:
+    maxItems: 1
+
+  interrupts:
+    maxItems: 1
+
+required:
+  - compatible
+  - reg
+  - clocks
+  - interrupts
+
+additionalProperties: false
+
+examples:
+  - |
+    #include <dt-bindings/clock/loongson,ls1x-clk.h>
+    #include <dt-bindings/interrupt-controller/irq.h>
+    clocksource: timer@1fe5c030 {
+        compatible = "loongson,ls1b-pwmtimer";
+        reg = <0x1fe5c030 0x10>;
+
+        clocks = <&clkc LS1X_CLKID_APB>;
+        interrupt-parent = <&intc0>;
+        interrupts = <20 IRQ_TYPE_LEVEL_HIGH>;
+    };
diff --git a/Documentation/devicetree/bindings/timer/ralink,rt2880-timer.yaml b/Documentation/devicetree/bindings/timer/ralink,rt2880-timer.yaml
new file mode 100644 (file)
index 0000000..daa7832
--- /dev/null
@@ -0,0 +1,44 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/timer/ralink,rt2880-timer.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Timer present in Ralink family SoCs
+
+maintainers:
+  - Sergio Paracuellos <sergio.paracuellos@gmail.com>
+
+properties:
+  compatible:
+    const: ralink,rt2880-timer
+
+  reg:
+    maxItems: 1
+
+  clocks:
+    maxItems: 1
+
+  interrupts:
+    maxItems: 1
+
+required:
+  - compatible
+  - reg
+  - clocks
+  - interrupts
+
+additionalProperties: false
+
+examples:
+  - |
+    timer@100 {
+        compatible = "ralink,rt2880-timer";
+        reg = <0x100 0x20>;
+
+        clocks = <&sysc 3>;
+
+        interrupt-parent = <&intc>;
+        interrupts = <1>;
+    };
+...
index 23edb42..cd8ad79 100644 (file)
@@ -313,9 +313,18 @@ the documentation build system will automatically turn a reference to
 function name exists.  If you see ``c:func:`` use in a kernel document,
 please feel free to remove it.
 
+Tables
+------
+
+ReStructuredText provides several options for table syntax. Kernel style for
+tables is to prefer *simple table* syntax or *grid table* syntax. See the
+`reStructuredText user reference for table syntax`_ for more details.
+
+.. _reStructuredText user reference for table syntax:
+   https://docutils.sourceforge.io/docs/user/rst/quickref.html#tables
 
 list tables
------------
+~~~~~~~~~~~
 
 The list-table formats can be useful for tables that are not easily laid
 out in the usual Sphinx ASCII-art formats.  These formats are nearly
index 4b4d8e2..7671b53 100644 (file)
@@ -84,7 +84,13 @@ Reference counting
 Atomics
 -------
 
-.. kernel-doc:: arch/x86/include/asm/atomic.h
+.. kernel-doc:: include/linux/atomic/atomic-instrumented.h
+   :internal:
+
+.. kernel-doc:: include/linux/atomic/atomic-arch-fallback.h
+   :internal:
+
+.. kernel-doc:: include/linux/atomic/atomic-long.h
    :internal:
 
 Kernel objects manipulation
index b8c742a..f4f044b 100644 (file)
@@ -106,6 +106,16 @@ will occupy those chip-select rows.
 This term is avoided because it is unclear when needing to distinguish
 between chip-select rows and socket sets.
 
+* High Bandwidth Memory (HBM)
+
+HBM is a new memory type with low power consumption and ultra-wide
+communication lanes. It uses vertically stacked memory chips (DRAM dies)
+interconnected by microscopic wires called "through-silicon vias," or
+TSVs.
+
+Several stacks of HBM chips connect to the CPU or GPU through an ultra-fast
+interconnect called the "interposer". Therefore, HBM's characteristics
+are nearly indistinguishable from on-chip integrated RAM.
 
 Memory Controllers
 ------------------
@@ -176,3 +186,113 @@ nodes::
        the L1 and L2 directories would be "edac_device_block's"
 
 .. kernel-doc:: drivers/edac/edac_device.h
+
+
+Heterogeneous system support
+----------------------------
+
+An AMD heterogeneous system is built by connecting the data fabrics of
+both CPUs and GPUs via custom xGMI links. Thus, the data fabric on the
+GPU nodes can be accessed the same way as the data fabric on CPU nodes.
+
+The MI200 accelerators are data center GPUs. They have 2 data fabrics,
+and each GPU data fabric contains four Unified Memory Controllers (UMC).
+Each UMC contains eight channels. Each UMC channel controls one 128-bit
+HBM2e (2GB) channel (equivalent to 8 X 2GB ranks).  This creates a total
+of 4096-bits of DRAM data bus.
+
+While each UMC interfaces with a 16GB (8-high X 2GB DRAM) HBM stack, each UMC
+channel interfaces with 2GB of DRAM (represented as a rank).
+
+Memory controllers on AMD GPU nodes can be represented in EDAC as follows:
+
+       GPU DF / GPU Node -> EDAC MC
+       GPU UMC           -> EDAC CSROW
+       GPU UMC channel   -> EDAC CHANNEL
+
+For example: a heterogeneous system with 1 AMD CPU is connected to
+4 MI200 (Aldebaran) GPUs using xGMI.
+
+Some more heterogeneous hardware details:
+
+- The CPU UMC (Unified Memory Controller) is mostly the same as the GPU UMC.
+  They have chip selects (csrows) and channels. However, the layouts are different
+  for performance, physical layout, or other reasons.
+- CPU UMCs use 1 channel. In this case, UMC = EDAC channel. This follows the
+  marketing convention that a CPU has X memory channels, etc.
+- CPU UMCs use up to 4 chip selects, so UMC chip select = EDAC CSROW.
+- GPU UMCs use 1 chip select, so UMC = EDAC CSROW.
+- GPU UMCs use 8 channels, so UMC channel = EDAC channel.
+
+The EDAC subsystem provides a mechanism to handle AMD heterogeneous
+systems by calling system specific ops for both CPUs and GPUs.
+
+AMD GPU nodes are enumerated in sequential order based on the PCI
+hierarchy, and the first GPU node is assumed to have a Node ID value
+following those of the CPU nodes after the latter are fully populated::
+
+       $ ls /sys/devices/system/edac/mc/
+               mc0   - CPU MC node 0
+               mc1  |
+               mc2  |- GPU card[0] => node 0(mc1), node 1(mc2)
+               mc3  |
+               mc4  |- GPU card[1] => node 0(mc3), node 1(mc4)
+               mc5  |
+               mc6  |- GPU card[2] => node 0(mc5), node 1(mc6)
+               mc7  |
+               mc8  |- GPU card[3] => node 0(mc7), node 1(mc8)
+
+For example, a heterogeneous system with one AMD CPU is connected to
+four MI200 (Aldebaran) GPUs using xGMI. This topology can be represented
+via the following sysfs entries::
+
+       /sys/devices/system/edac/mc/..
+
+       CPU                     # CPU node
+       ├── mc 0
+
+       GPU Nodes are enumerated sequentially after CPU nodes have been populated
+       GPU card 1              # Each MI200 GPU has 2 nodes/mcs
+       ├── mc 1          # GPU node 0 == mc1, Each MC node has 4 UMCs/CSROWs
+       │   ├── csrow 0               # UMC 0
+       │   │   ├── channel 0     # Each UMC has 8 channels
+       │   │   ├── channel 1   # size of each channel is 2 GB, so each UMC has 16 GB
+       │   │   ├── channel 2
+       │   │   ├── channel 3
+       │   │   ├── channel 4
+       │   │   ├── channel 5
+       │   │   ├── channel 6
+       │   │   ├── channel 7
+       │   ├── csrow 1               # UMC 1
+       │   │   ├── channel 0
+       │   │   ├── ..
+       │   │   ├── channel 7
+       │   ├── ..            ..
+       │   ├── csrow 3               # UMC 3
+       │   │   ├── channel 0
+       │   │   ├── ..
+       │   │   ├── channel 7
+       │   ├── rank 0
+       │   ├── ..            ..
+       │   ├── rank 31               # total 32 ranks/dimms from 4 UMCs
+       ├
+       ├── mc 2          # GPU node 1 == mc2
+       │   ├── ..            # each GPU has total 64 GB
+
+       GPU card 2
+       ├── mc 3
+       │   ├── ..
+       ├── mc 4
+       │   ├── ..
+
+       GPU card 3
+       ├── mc 5
+       │   ├── ..
+       ├── mc 6
+       │   ├── ..
+
+       GPU card 4
+       ├── mc 7
+       │   ├── ..
+       ├── mc 8
+       │   ├── ..
index bf4b511..b5a379d 100644 (file)
@@ -196,7 +196,7 @@ information and return operation results::
                    struct args_ismountpoint    ismountpoint;
            };
 
-           char path[0];
+           char path[];
     };
 
 The ioctlfd field is a mount point file descriptor of an autofs mount
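The change from ``path[0]`` to ``path[]`` above is a flexible-array-member
cleanup; callers still allocate the structure with the NUL-terminated path
appended and account for it in the ``size`` field. A minimal sketch, assuming
the usual uapi definitions from linux/auto_dev-ioctl.h::

    #include <stdlib.h>
    #include <string.h>
    #include <linux/auto_dev-ioctl.h>

    /*
     * Sketch only: allocate an autofs_dev_ioctl with the path appended
     * after the fixed part and record the total size in 'size'.
     */
    static struct autofs_dev_ioctl *alloc_dev_ioctl(const char *path)
    {
            size_t total = sizeof(struct autofs_dev_ioctl) + strlen(path) + 1;
            struct autofs_dev_ioctl *param = calloc(1, total);

            if (!param)
                    return NULL;
            param->ver_major = AUTOFS_DEV_IOCTL_VERSION_MAJOR;
            param->ver_minor = AUTOFS_DEV_IOCTL_VERSION_MINOR;
            param->size = total;
            strcpy(param->path, path);
            return param;
    }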
index 4f49027..3b6e38e 100644 (file)
@@ -467,7 +467,7 @@ Each ioctl is passed a pointer to an `autofs_dev_ioctl` structure::
                        struct args_ismountpoint        ismountpoint;
                };
 
-                char path[0];
+                char path[];
         };
 
 For the **OPEN_MOUNT** and **IS_MOUNTPOINT** commands, the target
index 504ba94..dccd61c 100644 (file)
@@ -22,12 +22,11 @@ exclusive.
 3) object removal.  Locking rules: caller locks parent, finds victim,
 locks victim and calls the method.  Locks are exclusive.
 
-4) rename() that is _not_ cross-directory.  Locking rules: caller locks
-the parent and finds source and target.  In case of exchange (with
-RENAME_EXCHANGE in flags argument) lock both.  In any case,
-if the target already exists, lock it.  If the source is a non-directory,
-lock it.  If we need to lock both, lock them in inode pointer order.
-Then call the method.  All locks are exclusive.
+4) rename() that is _not_ cross-directory.  Locking rules: caller locks the
+parent and finds source and target.  We lock both (provided they exist).  If we
+need to lock two inodes of different type (dir vs non-dir), we lock directory
+first.  If we need to lock two inodes of the same type, lock them in inode
+pointer order.  Then call the method.  All locks are exclusive.
 NB: we might get away with locking the source (and target in exchange
 case) shared.
 
@@ -44,15 +43,17 @@ All locks are exclusive.
 rules:
 
        * lock the filesystem
-       * lock parents in "ancestors first" order.
+       * lock parents in "ancestors first" order. If one is not ancestor of
+         the other, lock them in inode pointer order.
        * find source and target.
        * if old parent is equal to or is a descendent of target
          fail with -ENOTEMPTY
        * if new parent is equal to or is a descendent of source
          fail with -ELOOP
-       * If it's an exchange, lock both the source and the target.
-       * If the target exists, lock it.  If the source is a non-directory,
-         lock it.  If we need to lock both, do so in inode pointer order.
+       * Lock both the source and the target provided they exist. If we
+         need to lock two inodes of different type (dir vs non-dir), we lock
+         the directory first. If we need to lock two inodes of the same type,
+         lock them in inode pointer order.
        * call the method.
 
 All ->i_rwsem are taken exclusive.  Again, we might get away with locking
@@ -66,8 +67,9 @@ If no directory is its own ancestor, the scheme above is deadlock-free.
 
 Proof:
 
-       First of all, at any moment we have a partial ordering of the
-       objects - A < B iff A is an ancestor of B.
+       First of all, at any moment we have a linear ordering of the
+       objects - A < B iff (A is an ancestor of B) or (B is not an ancestor
+        of A and ptr(A) < ptr(B)).
 
        That ordering can change.  However, the following is true:
 
index ede672d..cb845e8 100644 (file)
@@ -38,20 +38,14 @@ fail at runtime.
 Use cases
 =========
 
-By itself, the base fs-verity feature only provides integrity
-protection, i.e. detection of accidental (non-malicious) corruption.
+By itself, fs-verity only provides integrity protection, i.e.
+detection of accidental (non-malicious) corruption.
 
 However, because fs-verity makes retrieving the file hash extremely
 efficient, it's primarily meant to be used as a tool to support
 authentication (detection of malicious modifications) or auditing
 (logging file hashes before use).
 
-Trusted userspace code (e.g. operating system code running on a
-read-only partition that is itself authenticated by dm-verity) can
-authenticate the contents of an fs-verity file by using the
-`FS_IOC_MEASURE_VERITY`_ ioctl to retrieve its hash, then verifying a
-digital signature of it.
-
 A standard file hash could be used instead of fs-verity.  However,
 this is inefficient if the file is large and only a small portion may
 be accessed.  This is often the case for Android application package
@@ -69,24 +63,31 @@ still be used on read-only filesystems.  fs-verity is for files that
 must live on a read-write filesystem because they are independently
 updated and potentially user-installed, so dm-verity cannot be used.
 
-The base fs-verity feature is a hashing mechanism only; actually
-authenticating the files may be done by:
-
-* Userspace-only
-
-* Builtin signature verification + userspace policy
-
-  fs-verity optionally supports a simple signature verification
-  mechanism where users can configure the kernel to require that
-  all fs-verity files be signed by a key loaded into a keyring;
-  see `Built-in signature verification`_.
-
-* Integrity Measurement Architecture (IMA)
-
-  IMA supports including fs-verity file digests and signatures in the
-  IMA measurement list and verifying fs-verity based file signatures
-  stored as security.ima xattrs, based on policy.
-
+fs-verity does not mandate a particular scheme for authenticating its
+file hashes.  (Similarly, dm-verity does not mandate a particular
+scheme for authenticating its block device root hashes.)  Options for
+authenticating fs-verity file hashes include:
+
+- Trusted userspace code.  Often, the userspace code that accesses
+  files can be trusted to authenticate them.  Consider e.g. an
+  application that wants to authenticate data files before using them,
+  or an application loader that is part of the operating system (which
+  is already authenticated in a different way, such as by being loaded
+  from a read-only partition that uses dm-verity) and that wants to
+  authenticate applications before loading them.  In these cases, this
+  trusted userspace code can authenticate a file's contents by
+  retrieving its fs-verity digest using `FS_IOC_MEASURE_VERITY`_, then
+  verifying a signature of it using any userspace cryptographic
+  library that supports digital signatures.
+
+- Integrity Measurement Architecture (IMA).  IMA supports fs-verity
+  file digests as an alternative to its traditional full file digests.
+  "IMA appraisal" enforces that files contain a valid, matching
+  signature in their "security.ima" extended attribute, as controlled
+  by the IMA policy.  For more information, see the IMA documentation.
+
+- Trusted userspace code in combination with `Built-in signature
+  verification`_.  This approach should be used only with great care.
 
 User API
 ========
@@ -111,8 +112,7 @@ follows::
     };
 
 This structure contains the parameters of the Merkle tree to build for
-the file, and optionally contains a signature.  It must be initialized
-as follows:
+the file.  It must be initialized as follows:
 
 - ``version`` must be 1.
 - ``hash_algorithm`` must be the identifier for the hash algorithm to
@@ -129,12 +129,14 @@ as follows:
   file or device.  Currently the maximum salt size is 32 bytes.
 - ``salt_ptr`` is the pointer to the salt, or NULL if no salt is
   provided.
-- ``sig_size`` is the size of the signature in bytes, or 0 if no
-  signature is provided.  Currently the signature is (somewhat
-  arbitrarily) limited to 16128 bytes.  See `Built-in signature
-  verification`_ for more information.
-- ``sig_ptr``  is the pointer to the signature, or NULL if no
-  signature is provided.
+- ``sig_size`` is the size of the builtin signature in bytes, or 0 if no
+  builtin signature is provided.  Currently the builtin signature is
+  (somewhat arbitrarily) limited to 16128 bytes.
+- ``sig_ptr``  is the pointer to the builtin signature, or NULL if no
+  builtin signature is provided.  A builtin signature is only needed
+  if the `Built-in signature verification`_ feature is being used.  It
+  is not needed for IMA appraisal, and it is not needed if the file
+  signature is being handled entirely in userspace.
 - All reserved fields must be zeroed.
 
 FS_IOC_ENABLE_VERITY causes the filesystem to build a Merkle tree for
@@ -158,7 +160,7 @@ fatal signal), no changes are made to the file.
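For illustration, a minimal userspace sketch of initializing this structure
with default parameters (SHA-256, 4096-byte Merkle tree blocks, no salt, no
builtin signature) and invoking the ioctl, assuming the uapi definitions from
linux/fsverity.h::

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fsverity.h>

    /* Sketch: enable fs-verity on 'path' with default parameters. */
    int enable_verity(const char *path)
    {
            struct fsverity_enable_arg arg = { 0 };
            int fd, ret = 0;

            /* The file must not be open for writing anywhere (ETXTBSY). */
            fd = open(path, O_RDONLY);
            if (fd < 0)
                    return -1;

            arg.version = 1;
            arg.hash_algorithm = FS_VERITY_HASH_ALG_SHA256;
            arg.block_size = 4096;
            /* salt_size/salt_ptr and sig_size/sig_ptr stay zero/NULL */

            if (ioctl(fd, FS_IOC_ENABLE_VERITY, &arg) != 0) {
                    perror("FS_IOC_ENABLE_VERITY");
                    ret = -1;
            }
            close(fd);
            return ret;
    }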
 FS_IOC_ENABLE_VERITY can fail with the following errors:
 
 - ``EACCES``: the process does not have write access to the file
-- ``EBADMSG``: the signature is malformed
+- ``EBADMSG``: the builtin signature is malformed
 - ``EBUSY``: this ioctl is already running on the file
 - ``EEXIST``: the file already has verity enabled
 - ``EFAULT``: the caller provided inaccessible memory
@@ -168,10 +170,10 @@ FS_IOC_ENABLE_VERITY can fail with the following errors:
   reserved bits are set; or the file descriptor refers to neither a
   regular file nor a directory.
 - ``EISDIR``: the file descriptor refers to a directory
-- ``EKEYREJECTED``: the signature doesn't match the file
-- ``EMSGSIZE``: the salt or signature is too long
-- ``ENOKEY``: the fs-verity keyring doesn't contain the certificate
-  needed to verify the signature
+- ``EKEYREJECTED``: the builtin signature doesn't match the file
+- ``EMSGSIZE``: the salt or builtin signature is too long
+- ``ENOKEY``: the ".fs-verity" keyring doesn't contain the certificate
+  needed to verify the builtin signature
 - ``ENOPKG``: fs-verity recognizes the hash algorithm, but it's not
   available in the kernel's crypto API as currently configured (e.g.
   for SHA-512, missing CONFIG_CRYPTO_SHA512).
@@ -180,8 +182,8 @@ FS_IOC_ENABLE_VERITY can fail with the following errors:
   support; or the filesystem superblock has not had the 'verity'
   feature enabled on it; or the filesystem does not support fs-verity
   on this file.  (See `Filesystem support`_.)
-- ``EPERM``: the file is append-only; or, a signature is required and
-  one was not provided.
+- ``EPERM``: the file is append-only; or, a builtin signature is
+  required and one was not provided.
 - ``EROFS``: the filesystem is read-only
 - ``ETXTBSY``: someone has the file open for writing.  This can be the
   caller's file descriptor, another open file descriptor, or the file
@@ -270,9 +272,9 @@ This ioctl takes in a pointer to the following structure::
 - ``FS_VERITY_METADATA_TYPE_DESCRIPTOR`` reads the fs-verity
   descriptor.  See `fs-verity descriptor`_.
 
-- ``FS_VERITY_METADATA_TYPE_SIGNATURE`` reads the signature which was
-  passed to FS_IOC_ENABLE_VERITY, if any.  See `Built-in signature
-  verification`_.
+- ``FS_VERITY_METADATA_TYPE_SIGNATURE`` reads the builtin signature
+  which was passed to FS_IOC_ENABLE_VERITY, if any.  See `Built-in
+  signature verification`_.
 
 The semantics are similar to those of ``pread()``.  ``offset``
 specifies the offset in bytes into the metadata item to read from, and
@@ -299,7 +301,7 @@ FS_IOC_READ_VERITY_METADATA can fail with the following errors:
   overflowed
 - ``ENODATA``: the file is not a verity file, or
   FS_VERITY_METADATA_TYPE_SIGNATURE was requested but the file doesn't
-  have a built-in signature
+  have a builtin signature
 - ``ENOTTY``: this type of filesystem does not implement fs-verity, or
   this ioctl is not yet implemented on it
 - ``EOPNOTSUPP``: the kernel was not configured with fs-verity
@@ -347,8 +349,8 @@ non-verity one, with the following exceptions:
   with EIO (for read()) or SIGBUS (for mmap() reads).
 
 - If the sysctl "fs.verity.require_signatures" is set to 1 and the
-  file is not signed by a key in the fs-verity keyring, then opening
-  the file will fail.  See `Built-in signature verification`_.
+  file is not signed by a key in the ".fs-verity" keyring, then
+  opening the file will fail.  See `Built-in signature verification`_.
 
 Direct access to the Merkle tree is not supported.  Therefore, if a
 verity file is copied, or is backed up and restored, then it will lose
@@ -433,20 +435,25 @@ root hash as well as other fields such as the file size::
 Built-in signature verification
 ===============================
 
-With CONFIG_FS_VERITY_BUILTIN_SIGNATURES=y, fs-verity supports putting
-a portion of an authentication policy (see `Use cases`_) in the
-kernel.  Specifically, it adds support for:
+CONFIG_FS_VERITY_BUILTIN_SIGNATURES=y adds support for in-kernel
+verification of fs-verity builtin signatures.
+
+**IMPORTANT**!  Please take great care before using this feature.
+It is not the only way to do signatures with fs-verity, and the
+alternatives (such as userspace signature verification, and IMA
+appraisal) can be much better.  It's also easy to fall into a trap
+of thinking this feature solves more problems than it actually does.
+
+Enabling this option adds the following:
 
-1. At fs-verity module initialization time, a keyring ".fs-verity" is
-   created.  The root user can add trusted X.509 certificates to this
-   keyring using the add_key() system call, then (when done)
-   optionally use keyctl_restrict_keyring() to prevent additional
-   certificates from being added.
+1. At boot time, the kernel creates a keyring named ".fs-verity".  The
+   root user can add trusted X.509 certificates to this keyring using
+   the add_key() system call.
 
 2. `FS_IOC_ENABLE_VERITY`_ accepts a pointer to a PKCS#7 formatted
    detached signature in DER format of the file's fs-verity digest.
-   On success, this signature is persisted alongside the Merkle tree.
-   Then, any time the file is opened, the kernel will verify the
+   On success, the ioctl persists the signature alongside the Merkle
+   tree.  Then, any time the file is opened, the kernel verifies the
    file's actual digest against this signature, using the certificates
    in the ".fs-verity" keyring.
 
@@ -454,8 +461,8 @@ kernel.  Specifically, it adds support for:
    When set to 1, the kernel requires that all verity files have a
    correctly signed digest as described in (2).
 
-fs-verity file digests must be signed in the following format, which
-is similar to the structure used by `FS_IOC_MEASURE_VERITY`_::
+The signature described in (2) must be a signature of the fs-verity
+file digest in the following format::
 
     struct fsverity_formatted_digest {
             char magic[8];                  /* must be "FSVerity" */
@@ -464,13 +471,66 @@ is similar to the structure used by `FS_IOC_MEASURE_VERITY`_::
             __u8 digest[];
     };
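+
+For illustration only, a userspace signing tool might assemble the buffer to
+be signed roughly as in the following sketch.  It obtains the digest with
+`FS_IOC_MEASURE_VERITY`_ from a copy of the file that already has fs-verity
+enabled (real signing tools usually compute the Merkle tree and digest in
+userspace instead), duplicates the documented struct layout locally, and
+omits error handling; it is a sketch, not part of the kernel interface::
+
+    #include <stdint.h>
+    #include <stdlib.h>
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <linux/fsverity.h>
+
+    #define MAX_DIGEST_SIZE 64          /* big enough for SHA-512 */
+
+    struct formatted_digest {           /* same layout as documented above */
+            char magic[8];              /* "FSVerity" */
+            uint16_t digest_algorithm;  /* __le16 on a little-endian host */
+            uint16_t digest_size;       /* __le16 on a little-endian host */
+            uint8_t digest[];
+    };
+
+    static void *build_payload_to_sign(int fd, size_t *size_ret)
+    {
+            struct fsverity_digest *d;
+            struct formatted_digest *f;
+
+            d = calloc(1, sizeof(*d) + MAX_DIGEST_SIZE);
+            d->digest_size = MAX_DIGEST_SIZE;  /* buffer size for the kernel */
+            if (ioctl(fd, FS_IOC_MEASURE_VERITY, d) != 0)
+                    return NULL;
+
+            *size_ret = sizeof(*f) + d->digest_size;
+            f = calloc(1, *size_ret);
+            memcpy(f->magic, "FSVerity", 8);
+            f->digest_algorithm = d->digest_algorithm;
+            f->digest_size = d->digest_size;
+            memcpy(f->digest, d->digest, d->digest_size);
+            free(d);
+            return f;   /* the PKCS#7 signature must cover this buffer */
+    }
+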
 
-fs-verity's built-in signature verification support is meant as a
-relatively simple mechanism that can be used to provide some level of
-authenticity protection for verity files, as an alternative to doing
-the signature verification in userspace or using IMA-appraisal.
-However, with this mechanism, userspace programs still need to check
-that the verity bit is set, and there is no protection against verity
-files being swapped around.
+That's it.  It should be emphasized again that fs-verity builtin
+signatures are not the only way to do signatures with fs-verity.  See
+`Use cases`_ for an overview of ways in which fs-verity can be used.
+fs-verity builtin signatures have some major limitations that should
+be carefully considered before using them:
+
+- Builtin signature verification does *not* make the kernel enforce
+  that any files actually have fs-verity enabled.  Thus, it is not a
+  complete authentication policy.  Currently, if it is used, the only
+  way to complete the authentication policy is for trusted userspace
+  code to explicitly check whether files have fs-verity enabled with a
+  signature before they are accessed.  (With
+  fs.verity.require_signatures=1, just checking whether fs-verity is
+  enabled suffices.)  But, in this case the trusted userspace code
+  could just store the signature alongside the file and verify it
+  itself using a cryptographic library, instead of using this feature.
+
+- A file's builtin signature can only be set at the same time that
+  fs-verity is being enabled on the file.  Changing or deleting the
+  builtin signature later requires re-creating the file.
+
+- Builtin signature verification uses the same set of public keys for
+  all fs-verity enabled files on the system.  Different keys cannot be
+  trusted for different files; each key is all or nothing.
+
+- The sysctl fs.verity.require_signatures applies system-wide.
+  Setting it to 1 only works when all users of fs-verity on the system
+  agree that it should be set to 1.  This limitation can prevent
+  fs-verity from being used in cases where it would be helpful.
+
+- Builtin signature verification can only use signature algorithms
+  that are supported by the kernel.  For example, the kernel does not
+  yet support Ed25519, even though this is often the signature
+  algorithm that is recommended for new cryptographic designs.
+
+- fs-verity builtin signatures are in PKCS#7 format, and the public
+  keys are in X.509 format.  These formats are commonly used,
+  including by some other kernel features (which is why the fs-verity
+  builtin signatures use them), and are very feature rich.
+  Unfortunately, history has shown that code that parses and handles
+  these formats (which are from the 1990s and are based on ASN.1)
+  often has vulnerabilities as a result of their complexity.  This
+  complexity is not inherent to the cryptography itself.
+
+  fs-verity users who do not need advanced features of X.509 and
+  PKCS#7 should strongly consider using simpler formats, such as plain
+  Ed25519 keys and signatures, and verifying signatures in userspace (see
+  the sketch after this list).
+
+  fs-verity users who choose to use X.509 and PKCS#7 anyway should
+  still consider that verifying those signatures in userspace is more
+  flexible (for other reasons mentioned earlier in this document) and
+  eliminates the need to enable CONFIG_FS_VERITY_BUILTIN_SIGNATURES
+  and its associated increase in kernel attack surface.  In some cases
+  it can even be necessary, since advanced X.509 and PKCS#7 features
+  do not always work as intended with the kernel.  For example, the
+  kernel does not check X.509 certificate validity times.
+
+  Note: IMA appraisal, which supports fs-verity, does not use PKCS#7
+  for its signatures, so it partially avoids the issues discussed
+  here.  IMA appraisal does use X.509.
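+
+As a concrete illustration of the userspace alternative recommended above
+(see the item on simpler formats), verifying a detached Ed25519 signature
+needs only a few lines.  This is a minimal sketch, assuming libsodium is
+available and that the signature, the public key, and the data to verify
+(e.g. the file's fs-verity digest) have already been read from wherever the
+integrator chose to store them::
+
+    #include <sodium.h>
+
+    /* Returns 0 if `sig` is a valid detached Ed25519 signature of the
+     * `len` bytes at `msg` under `pubkey`, non-zero otherwise. */
+    static int verify_ed25519(const unsigned char sig[crypto_sign_BYTES],
+                              const unsigned char *msg, size_t len,
+                              const unsigned char pubkey[crypto_sign_PUBLICKEYBYTES])
+    {
+            if (sodium_init() < 0)
+                    return -1;
+            return crypto_sign_verify_detached(sig, msg, len, pubkey);
+    }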
 
 Filesystem support
 ==================
index 80ae503..ec0ddfb 100644 (file)
@@ -56,7 +56,7 @@ by adding the following hook into your git:
        $ cat >.git/hooks/applypatch-msg <<'EOF'
        #!/bin/sh
        . git-sh-setup
-       perl -pi -e 's|^Message-Id:\s*<?([^>]+)>?$|Link: https://lore.kernel.org/r/$1|g;' "$1"
+       perl -pi -e 's|^Message-I[dD]:\s*<?([^>]+)>?$|Link: https://lore.kernel.org/r/$1|g;' "$1"
        test -x "$GIT_DIR/hooks/commit-msg" &&
                exec "$GIT_DIR/hooks/commit-msg" ${1+"$@"}
        :
index 0cff6fa..4bfdf1d 100644 (file)
@@ -4,31 +4,55 @@
 Design
 ======
 
-Configurable Layers
-===================
-
-DAMON provides data access monitoring functionality while making the accuracy
-and the overhead controllable.  The fundamental access monitorings require
-primitives that dependent on and optimized for the target address space.  On
-the other hand, the accuracy and overhead tradeoff mechanism, which is the core
-of DAMON, is in the pure logic space.  DAMON separates the two parts in
-different layers and defines its interface to allow various low level
-primitives implementations configurable with the core logic.  We call the low
-level primitives implementations monitoring operations.
-
-Due to this separated design and the configurable interface, users can extend
-DAMON for any address space by configuring the core logics with appropriate
-monitoring operations.  If appropriate one is not provided, users can implement
-the operations on their own.
+
+Overall Architecture
+====================
+
+The DAMON subsystem is configured with three layers:
+
+- Operations Set: Implements fundamental operations for DAMON that depend on
+  the given monitoring target address space and the available set of
+  software/hardware primitives,
+- Core: Implements the core logic, including monitoring overhead/accuracy
+  control and access-aware system operations, on top of the operations set
+  layer, and
+- Modules: Implements kernel modules for various purposes that provide
+  interfaces for the user space, on top of the core layer.
+
+
+Configurable Operations Set
+---------------------------
+
+For data access monitoring and additional low level work, DAMON needs a set of
+implementations for specific operations that are dependent on and optimized for
+the given target address space.  On the other hand, the accuracy and overhead
+tradeoff mechanism, which is the core logic of DAMON, is in the pure logic
+space.  DAMON separates the two parts into different layers, namely the
+DAMON Operations Set and the DAMON Core Logic layers.  It further defines
+the interface between the layers to allow various operations sets to be
+configured with the core logic.
+
+Due to this design, users can extend DAMON for any address space by configuring
+the core logic to use the appropriate operations set.  If any appropriate set
+is unavailable, users can implement one on their own.
 
 For example, physical memory, virtual memory, swap space, those for specific
 processes, NUMA nodes, files, and backing memory devices would be supportable.
-Also, if some architectures or devices support special optimized access check
-primitives, those will be easily configurable.
+Also, if some architectures or devices support special optimized access
+check primitives, those will be easily configurable.
 
 
-Reference Implementations of Address Space Specific Monitoring Operations
-=========================================================================
+Programmable Modules
+--------------------
+
+The core layer of DAMON is implemented as a framework and exposes its
+application programming interface to all kernel space components such as
+subsystems and modules.  For common use cases of DAMON, the DAMON subsystem
+provides kernel modules that are built on top of the core layer using the
+API and that can easily be used by user space end users.
+
+
+Operations Set Layer
+====================
 
 The monitoring operations are defined in two parts:
 
@@ -90,8 +114,12 @@ conflict with the reclaim logic using ``PG_idle`` and ``PG_young`` page flags,
 as Idle page tracking does.
 
 
-Address Space Independent Core Mechanisms
-=========================================
+Core Logics
+===========
+
+
+Monitoring
+----------
 
 Below four sections describe each of the DAMON core mechanisms and the five
 monitoring attributes, ``sampling interval``, ``aggregation interval``,
@@ -100,7 +128,7 @@ regions``.
 
 
 Access Frequency Monitoring
----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The output of DAMON says what pages are how frequently accessed for a given
 duration.  The resolution of the access frequency is controlled by setting
@@ -127,7 +155,7 @@ size of the target workload grows.
 
 
 Region Based Sampling
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 To avoid the unbounded increase of the overhead, DAMON groups adjacent pages
 that assumed to have the same access frequencies into a region.  As long as the
@@ -144,7 +172,7 @@ assumption is not guaranteed.
 
 
 Adaptive Regions Adjustment
----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Even somehow the initial monitoring target regions are well constructed to
 fulfill the assumption (pages in same region have similar access frequencies),
@@ -162,8 +190,22 @@ In this way, DAMON provides its best-effort quality and minimal overhead while
 keeping the bounds users set for their trade-off.
 
 
+Age Tracking
+~~~~~~~~~~~~
+
+By analyzing the monitoring results, users can also find how long the
+current access pattern of a region has been maintained.  That could be
+useful for a better understanding of the access pattern.  For example, a
+page placement algorithm utilizing both the frequency and the recency could
+be implemented using it.  To make this analysis of how long an access
+pattern has been maintained easier, DAMON maintains yet another counter
+called ``age`` in each region.  For each ``aggregation interval``, DAMON
+checks whether the region's size and access frequency (``nr_accesses``) have
+significantly changed.  If so, the counter is reset to zero.  Otherwise, the
+counter is increased.
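+
+The following sketch captures the idea; it is illustrative only, not the
+in-kernel code, and the structure, the significantly_changed() helper, and
+its threshold are invented for this example::
+
+    struct region {
+            unsigned long sz;           /* size of the region in bytes */
+            unsigned int nr_accesses;   /* access frequency of the last interval */
+            unsigned int age;           /* intervals since the last big change */
+    };
+
+    /* Called once per aggregation interval for each region. */
+    static void update_age(struct region *r, unsigned long old_sz,
+                           unsigned int old_nr_accesses)
+    {
+            if (significantly_changed(r->sz, old_sz) ||
+                significantly_changed(r->nr_accesses, old_nr_accesses))
+                    r->age = 0;         /* the access pattern changed */
+            else
+                    r->age++;           /* the pattern held for one more interval */
+    }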
+
+
 Dynamic Target Space Updates Handling
--------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The monitoring target address range could dynamically changed.  For example,
 virtual memory could be dynamically mapped and unmapped.  Physical memory could
@@ -174,3 +216,246 @@ monitoring operations to check dynamic changes including memory mapping changes
 and applies it to monitoring operations-related data structures such as the
 abstracted monitoring target memory area only for each of a user-specified time
 interval (``update interval``).
+
+
+.. _damon_design_damos:
+
+Operation Schemes
+-----------------
+
+One common purpose of data access monitoring is access-aware system efficiency
+optimizations.  For example,
+
+    paging out memory regions that are not accessed for more than two minutes
+
+or
+
+    using THP for memory regions that are larger than 2 MiB and showing a high
+    access frequency for more than one minute.
+
+One straightforward approach for such schemes would be profile-guided
+optimizations.  That is, getting data access monitoring results of the
+workloads or the system using DAMON, finding memory regions of special
+characteristics by profiling the monitoring results, and making system
+operation changes for the regions.  The changes could be made by modifying or
+providing advice to the software (the application and/or the kernel), or
+reconfiguring the hardware.  Both offline and online approaches could be
+available.
+
+Among those, providing advice to the kernel at runtime would be flexible and
+effective, and therefore be widely used.  However, implementing such schemes
+could impose unnecessary redundancy and inefficiency.  The profiling could be
+redundant if the type of interest is common.  Exchanging the information
+including monitoring results and operation advice between kernel and user
+spaces could be inefficient.
+
+To allow users to reduce such redundancy and inefficiency by offloading the
+work, DAMON provides a feature called Data Access Monitoring-based Operation
+Schemes (DAMOS).  It lets users specify their desired schemes at a high
+level.  Given such a specification, DAMON starts monitoring, finds regions
+having the access pattern of interest, and applies the user-desired
+operation actions to those regions as soon as they are found.
+
+
+.. _damon_design_damos_action:
+
+Operation Action
+~~~~~~~~~~~~~~~~
+
+The management action that the users desire to apply to the regions of their
+interest.  For example, paging out, prioritizing for next reclamation victim
+selection, advising ``khugepaged`` to collapse or split, or doing nothing but
+collecting statistics of the regions.
+
+The list of supported actions is defined in DAMOS, but the implementation of
+each action is in the DAMON operations set layer because the implementation
+normally depends on the monitoring target address space.  For example, the code
+for paging specific virtual address ranges out would be different from that for
+physical address ranges.  Also, the monitoring operations sets are not
+mandated to support all actions on the list.  Hence, the availability of a
+specific DAMOS action depends on which operations set is selected to be used
+together with it.
+
+Applying an action to a region is considered to change the region's
+characteristics.  Hence, DAMOS resets the age of a region when an action is
+applied to it.
+
+
+.. _damon_design_damos_access_pattern:
+
+Target Access Pattern
+~~~~~~~~~~~~~~~~~~~~~
+
+The access pattern of the schemes' interest.  The patterns are constructed with
+the properties that DAMON's monitoring results provide, specifically the size,
+the access frequency, and the age.  Users can describe their access pattern of
+interest by setting minimum and maximum values of the three properties.  If a
+region's three properties are in the ranges, DAMOS classifies it as one of
+the regions that the scheme is interested in.
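+
+The classification is essentially a range check over the three properties,
+as in the following illustrative sketch (the type and field names are
+invented here and reuse the region sketch above; they are not DAMON's)::
+
+    struct access_pattern {
+            unsigned long min_sz, max_sz;                   /* bytes */
+            unsigned int min_nr_accesses, max_nr_accesses;  /* frequency */
+            unsigned int min_age, max_age;                  /* intervals */
+    };
+
+    static int region_matches(const struct region *r,
+                              const struct access_pattern *p)
+    {
+            return r->sz >= p->min_sz && r->sz <= p->max_sz &&
+                   r->nr_accesses >= p->min_nr_accesses &&
+                   r->nr_accesses <= p->max_nr_accesses &&
+                   r->age >= p->min_age && r->age <= p->max_age;
+    }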
+
+
+.. _damon_design_damos_quotas:
+
+Quotas
+~~~~~~
+
+DAMOS upper-bound overhead control feature.  DAMOS could incur high overhead if
+the target access pattern is not properly tuned.  For example, if a huge memory
+region having the access pattern of interest is found, applying the scheme's
+action to all pages of the huge region could consume unacceptably large system
+resources.  Preventing such issues by tuning the access pattern could be
+challenging, especially if the access patterns of the workloads are highly
+dynamic.
+
+To mitigate that situation, DAMOS provides an upper-bound overhead control
+feature called quotas.  It lets users specify an upper limit of time that DAMOS
+can use for applying the action, and/or a maximum number of bytes of memory
+regions to which the action can be applied within a user-specified time
+duration.
+
+
+.. _damon_design_damos_quotas_prioritization:
+
+Prioritization
+^^^^^^^^^^^^^^
+
+A mechanism for making a good decision under the quotas.  When the action
+cannot be applied to all regions of interest due to the quotas, DAMOS
+prioritizes regions and applies the action only to regions with high enough
+priority that the quotas will not be exceeded.
+
+The prioritization mechanism should be different for each action.  For example,
+rarely accessed (colder) memory regions would be prioritized for a page-out
+scheme action.  In contrast, the colder regions would be deprioritized for a
+huge page collapse scheme action.  Hence, the prioritization mechanisms for each
+action are implemented in each DAMON operations set, together with the actions.
+
+Though the implementation is up to the DAMON operations set, it would be common
+to calculate the priority using the access pattern properties of the regions.
+Some users would want the mechanisms to be personalized for their specific
+case.  For example, some users would want the mechanism to weigh the recency
+(``age``) more than the access frequency (``nr_accesses``).  DAMOS allows users
+to specify the weight of each access pattern property and passes the
+information to the underlying mechanism.  Nevertheless, how and even whether
+the weight will be respected are up to the underlying prioritization mechanism
+implementation.
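+
+For instance, a generic score under user-provided weights could look like
+the following sketch; this is purely illustrative, and whether a higher
+``nr_accesses`` or ``age`` should raise or lower the priority depends on the
+action, so a real mechanism may invert some terms::
+
+    static unsigned long region_priority(const struct region *r,
+                                         unsigned int sz_weight,
+                                         unsigned int nr_accesses_weight,
+                                         unsigned int age_weight)
+    {
+            return r->sz * sz_weight +
+                   r->nr_accesses * nr_accesses_weight +
+                   r->age * age_weight;
+    }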
+
+
+.. _damon_design_damos_watermarks:
+
+Watermarks
+~~~~~~~~~~
+
+Conditional DAMOS (de)activation automation.  Users might want DAMOS to run
+only in certain situations.  For example, when a sufficient amount of free
+memory is guaranteed, running a scheme for proactive reclamation would only
+consume unnecessary system resources.  To avoid such consumption, the user would
+need to manually monitor some metrics such as free memory ratio, and turn
+DAMON/DAMOS on or off.
+
+DAMOS allows users to offload such work using three watermarks.  It allows
+the users to configure the metric of their interest and three watermark
+values, namely high, middle, and low.  If the value of the metric becomes
+above the high watermark or below the low watermark, the scheme is
+deactivated.  If the metric becomes below the middle watermark but above the
+low watermark, the scheme is activated.  If all schemes are deactivated by
+the watermarks, the monitoring
+is also deactivated.  In this case, the DAMON worker thread only periodically
+checks the watermarks and therefore incurs nearly zero overhead.
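+
+The resulting (de)activation decision is a simple comparison against the
+three thresholds, roughly as in this sketch (names are invented for
+illustration; between the middle and high watermarks the previous state is
+assumed to be kept)::
+
+    /* `active` is the scheme's state from the previous check. */
+    static int scheme_active(int active, unsigned long metric,
+                             unsigned long high, unsigned long mid,
+                             unsigned long low)
+    {
+            if (metric > high || metric < low)
+                    return 0;           /* outside the window: deactivate */
+            if (metric < mid)
+                    return 1;           /* between low and mid: activate */
+            return active;              /* keep the previous state */
+    }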
+
+
+.. _damon_design_damos_filters:
+
+Filters
+~~~~~~~
+
+Non-access pattern-based target memory regions filtering.  If users run
+self-written programs or have good profiling tools, they could know something
+more than the kernel, such as future access patterns or some special
+requirements for specific types of memory.  For example, some users may know
+that only anonymous pages can impact their program's performance.  They can also
+have a list of latency-critical processes.
+
+To let users optimize DAMOS schemes with such special knowledge, DAMOS provides
+a feature called DAMOS filters.  The feature allows users to set an arbitrary
+number of filters for each scheme.  Each filter specifies the type of target
+memory, and whether it should exclude the memory of the type (filter-out), or
+all except the memory of the type (filter-in).
+
+As of this writing, anonymous page type and memory cgroup type are supported by
+the feature.  Some filter target types can require additional arguments.  For
+example, the memory cgroup filter type asks users to specify the file path of
+the memory cgroup for the filter.  Hence, users can apply specific schemes to
+only anonymous pages, non-anonymous pages, pages of specific cgroups, all pages
+excluding those of specific cgroups, and any combination of those.
+
+
+Application Programming Interface
+---------------------------------
+
+The programming interface for kernel space data access-aware applications.
+DAMON is a framework, so it does nothing by itself.  Instead, it only helps
+other kernel components such as subsystems and modules build their data
+access-aware applications using DAMON's core features.  For this, DAMON exposes
+all of its features to other kernel components via its application programming
+interface, namely ``include/linux/damon.h``.  Please refer to the API
+:doc:`document </mm/damon/api>` for details of the interface.
+
+
+Modules
+=======
+
+Because the core of DAMON is a framework for kernel components, it doesn't
+provide any direct interface for the user space.  Such interfaces should
+instead be implemented by each DAMON API user kernel component.  The DAMON
+subsystem itself implements such DAMON API user modules, which are intended
+for general purpose DAMON control and special purpose data access-aware
+system operations, and which provide stable application binary interfaces
+(ABIs) for the user space.  User space can build efficient data access-aware
+applications using these interfaces.
+
+
+General Purpose User Interface Modules
+--------------------------------------
+
+DAMON modules that provide user space ABIs for general purpose DAMON usage at
+runtime.
+
+DAMON user interface modules, namely the 'DAMON sysfs interface' and the
+'DAMON debugfs interface', are DAMON API user kernel modules that provide
+ABIs to the user space.  Please note that the DAMON debugfs interface is
+currently deprecated.
+
+Like many other ABIs, the modules create files on sysfs and debugfs,
+allowing users to specify their requests to DAMON and get the answers from
+it by writing to and reading from those files.  In response to such I/O, the
+DAMON user interface modules control DAMON and retrieve the requested
+results via the DAMON API, and return them to the user space.
+
+The ABIs are designed to be used by user space application developers,
+rather than to be driven directly by human fingers.  Human users are
+therefore recommended to use such dedicated user space tools.  One such
+Python-written user space tool is available at GitHub
+(https://github.com/awslabs/damo), PyPI
+(https://pypistats.org/packages/damo), and Fedora
+(https://packages.fedoraproject.org/pkgs/python-damo/damo/).
+
+Please refer to the ABI :doc:`document </admin-guide/mm/damon/usage>` for
+details of the interfaces.
+
+
+Special-Purpose Access-aware Kernel Modules
+-------------------------------------------
+
+DAMON modules that provide user space ABIs for specific-purpose DAMON usage.
+
+DAMON sysfs/debugfs user interfaces are for full control of all DAMON
+features at runtime.  For special-purpose system-wide data access-aware
+operations such as proactive reclamation or LRU lists balancing, the
+interfaces could be simplified by removing knobs that are unnecessary for
+the specific purpose, and extended for boot-time and even compile-time
+control.  Default values of DAMON control parameters for such usage would
+also need to be optimized for the purpose.
+
+To support such cases, further DAMON API user kernel modules that provide
+simpler and more optimized user space interfaces are available.  Currently, two
+modules for proactive reclamation and LRU lists manipulation are provided.  For
+more detail, please read the usage documents for those
+(:doc:`/admin-guide/mm/damon/reclaim` and
+:doc:`/admin-guide/mm/damon/lru_sort`).
index dde7e24..3279dc7 100644 (file)
@@ -4,29 +4,6 @@
 Frequently Asked Questions
 ==========================
 
-Why a new subsystem, instead of extending perf or other user space tools?
-=========================================================================
-
-First, because it needs to be lightweight as much as possible so that it can be
-used online, any unnecessary overhead such as kernel - user space context
-switching cost should be avoided.  Second, DAMON aims to be used by other
-programs including the kernel.  Therefore, having a dependency on specific
-tools like perf is not desirable.  These are the two biggest reasons why DAMON
-is implemented in the kernel space.
-
-
-Can 'idle pages tracking' or 'perf mem' substitute DAMON?
-=========================================================
-
-Idle page tracking is a low level primitive for access check of the physical
-address space.  'perf mem' is similar, though it can use sampling to minimize
-the overhead.  On the other hand, DAMON is a higher-level framework for the
-monitoring of various address spaces.  It is focused on memory management
-optimization and provides sophisticated accuracy/overhead handling mechanisms.
-Therefore, 'idle pages tracking' and 'perf mem' could provide a subset of
-DAMON's output, but cannot substitute DAMON.
-
-
 Does DAMON support virtual memory only?
 =======================================
 
index 24a202f..a84c14e 100644 (file)
@@ -3,7 +3,7 @@
 DAMON Maintainer Entry Profile
 ==============================
 
-The DAMON subsystem covers the files that listed in 'DATA ACCESS MONITOR'
+The DAMON subsystem covers the files that are listed in 'DATA ACCESS MONITOR'
 section of 'MAINTAINERS' file.
 
 The mailing lists for the subsystem are damon@lists.linux.dev and
@@ -15,7 +15,7 @@ SCM Trees
 
 There are multiple Linux trees for DAMON development.  Patches under
 development or testing are queued in damon/next [2]_ by the DAMON maintainer.
-Suffieicntly reviewed patches will be queued in mm-unstable [1]_ by the memory
+Sufficiently reviewed patches will be queued in mm-unstable [1]_ by the memory
 management subsystem maintainer.  After more sufficient tests, the patches will
 be queued in mm-stable [3]_ , and finally pull-requested to the mainline by the
 memory management subsystem maintainer.
index 313dce1..e35af78 100644 (file)
@@ -73,14 +73,13 @@ In kernel use of migrate_pages()
    It also prevents the swapper or other scans from encountering
    the page.
 
-2. We need to have a function of type new_page_t that can be
+2. We need to have a function of type new_folio_t that can be
    passed to migrate_pages(). This function should figure out
-   how to allocate the correct new page given the old page.
+   how to allocate the correct new folio given the old folio.
 
 3. The migrate_pages() function is called which attempts
    to do the migration. It will call the function to allocate
-   the new page for each page that is considered for
-   moving.
+   the new folio for each folio that is considered for moving.
 
 How migrate_pages() works
 =========================
index 9693957..7840c18 100644 (file)
@@ -3,3 +3,152 @@
 ===========
 Page Tables
 ===========
+
+Paged virtual memory was invented along with virtual memory as a concept in
+1962 on the Ferranti Atlas Computer, which was the first computer with paged
+virtual memory. The feature migrated to newer computers and became a de facto
+feature of all Unix-like systems as time went by. In 1985 the feature was
+included in the Intel 80386, which was the CPU Linux 1.0 was developed on.
+
+Page tables map virtual addresses as seen by the CPU into physical addresses
+as seen on the external memory bus.
+
+Linux defines page tables as a hierarchy which is currently five levels in
+height. The architecture code for each supported architecture will then
+map this to the restrictions of the hardware.
+
+The physical address corresponding to the virtual address is often referenced
+by the underlying physical page frame. The **page frame number** or **pfn**
+is the physical address of the page (as seen on the external memory bus)
+divided by `PAGE_SIZE`.
+
+Physical memory address 0 will be *pfn 0* and the highest pfn will be
+the last page of physical memory the external address bus of the CPU can
+address.
+
+With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at
+address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000
+and so on until we reach pfn 0xfffff at 0xfffff000. With 16KB pages the pages
+are at 0x00004000, 0x00008000 ... 0xffffc000 and the pfn goes from 0 to 0x3ffff.
+
+As you can see, with 4KB pages the page base address uses bits 12-31 of the
+address, and this is why `PAGE_SHIFT` in this case is defined as 12 and
+`PAGE_SIZE` is usually defined in terms of the page shift as `(1 << PAGE_SHIFT)`.
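+
+In code, the relationship between an address, `PAGE_SHIFT`, `PAGE_SIZE` and
+the pfn is just shifting, as in this small illustration with a hard-coded
+4KB page size::
+
+    #define PAGE_SHIFT      12                      /* 4KB pages */
+    #define PAGE_SIZE       (1UL << PAGE_SHIFT)     /* 0x1000 */
+
+    static unsigned long phys_to_pfn(unsigned long phys_addr)
+    {
+            return phys_addr >> PAGE_SHIFT;         /* 0x00002000 -> pfn 2 */
+    }
+
+    static unsigned long pfn_to_phys(unsigned long pfn)
+    {
+            return pfn << PAGE_SHIFT;               /* pfn 0xfffff -> 0xfffff000 */
+    }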
+
+Over time a deeper hierarchy has been developed in response to increasing memory
+sizes. When Linux was created, 4KB pages and a single page table called
+`swapper_pg_dir` with 1024 entries were used, covering 4MB, which coincided with
+the fact that Torvalds' first computer had 4MB of physical memory. Entries in
+this single table were referred to as *PTE*:s - page table entries.
+
+The software page table hierarchy reflects the fact that page table hardware has
+become hierarchical and that in turn is done to save page table memory and
+speed up mapping.
+
+One could of course imagine a single, linear page table with enormous amounts
+of entries, breaking down the whole memory into single pages. Such a page table
+would be very sparse, because large portions of the virtual memory usually
+remain unused. By using hierarchical page tables large holes in the virtual
+address space do not waste valuable page table memory, because it will suffice
+to mark large areas as unmapped at a higher level in the page table hierarchy.
+
+Additionally, on modern CPUs, a higher level page table entry can point directly
+to a physical memory range, which allows mapping a contiguous range of several
+megabytes or even gigabytes in a single high-level page table entry, taking
+shortcuts in mapping virtual memory to physical memory: there is no need to
+traverse deeper in the hierarchy when you find a large mapped range like this.
+
+The page table hierarchy has now developed into this::
+
+  +-----+
+  | PGD |
+  +-----+
+     |
+     |   +-----+
+     +-->| P4D |
+         +-----+
+            |
+            |   +-----+
+            +-->| PUD |
+                +-----+
+                   |
+                   |   +-----+
+                   +-->| PMD |
+                       +-----+
+                          |
+                          |   +-----+
+                          +-->| PTE |
+                              +-----+
+
+
+Symbols on the different levels of the page table hierarchy have the following
+meaning beginning from the bottom:
+
+- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
+  The *pte* is an array of `PTRS_PER_PTE` elements of the `pteval_t` type, each
+  mapping a single page of virtual memory to a single page of physical memory.
+  The architecture defines the size and contents of `pteval_t`.
+
+  A typical example is that the `pteval_t` is a 32- or 64-bit value with the
+  upper bits being a **pfn** (page frame number), and the lower bits being some
+  architecture-specific bits such as memory protection.
+
+  The **entry** part of the name is a bit confusing because while in Linux 1.0
+  this did refer to a single page table entry in the single top level page
+  table, it was retrofitted to be an array of mapping elements when two-level
+  page tables were first introduced, so the *pte* is the lowermost page
+  *table*, not a page table *entry*.
+
+- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**, the hierarchy right
+  above the *pte*, with `PTRS_PER_PMD` references to the *pte*:s.
+
+- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
+  the other levels to handle 4-level page tables. It is potentially unused,
+  or *folded* as we will discuss later.
+
+- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
+  handle 5-level page tables after the *pud* was introduced. Now it was clear
+  that we needed to replace *pgd*, *pmd*, *pud* etc with a figure indicating the
+  directory level and that we cannot go on with ad hoc names any more. This
+  is only used on systems which actually have 5 levels of page tables, otherwise
+  it is folded.
+
+- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel's
+  main page table.  The PGD for the kernel memory is still found in
+  `swapper_pg_dir`, but each userspace process in the system also has its own
+  memory context and thus its own *pgd*, found in `struct mm_struct` which
+  in turn is referenced in each `struct task_struct`. So tasks have a memory
+  context in the form of a `struct mm_struct` and this in turn has a
+  `pgd_t *pgd` pointer to the corresponding page global directory.
+
+To repeat: each level in the page table hierarchy is an *array of pointers*, so
+the **pgd** contains `PTRS_PER_PGD` pointers to the next level below, **p4d**
+contains `PTRS_PER_P4D` pointers to **pud** items and so on. The number of
+pointers on each level is architecture-defined::
+
+        PMD
+  --> +-----+           PTE
+      | ptr |-------> +-----+
+      | ptr |-        | ptr |-------> PAGE
+      | ptr | \       | ptr |
+      | ptr |  \        ...
+      | ... |   \
+      | ptr |    \         PTE
+      +-----+     +----> +-----+
+                         | ptr |-------> PAGE
+                         | ptr |
+                           ...
+
+
+Page Table Folding
+==================
+
+If the architecture does not use all the page table levels, they can be *folded*
+which means skipped, and all operations performed on page tables will be
+compile-time augmented to just skip a level when accessing the next lower
+level.
+
+Page table handling code that wishes to be architecture-neutral, such as the
+virtual memory manager, will need to be written so that it traverses all of
+the (currently) five levels. This style should also be preferred for
+architecture-specific code, so as to be robust to future changes.
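+
+For example, a software walk down the hierarchy in architecture-neutral
+kernel code looks roughly like the sketch below; on architectures with fewer
+levels the folded helpers simply pass the walk through, so the same code
+works everywhere.  This is a simplified sketch with no huge page handling
+and no locking, and the returned pointer must be released with pte_unmap()::
+
+    #include <linux/mm.h>
+
+    /* Find the PTE mapping `addr` in `mm`, or NULL if a level is absent. */
+    static pte_t *walk_to_pte(struct mm_struct *mm, unsigned long addr)
+    {
+            pgd_t *pgd = pgd_offset(mm, addr);
+            p4d_t *p4d;
+            pud_t *pud;
+            pmd_t *pmd;
+
+            if (pgd_none(*pgd) || pgd_bad(*pgd))
+                    return NULL;
+            p4d = p4d_offset(pgd, addr);
+            if (p4d_none(*p4d) || p4d_bad(*p4d))
+                    return NULL;
+            pud = pud_offset(p4d, addr);
+            if (pud_none(*pud) || pud_bad(*pud))
+                    return NULL;
+            pmd = pmd_offset(pud, addr);
+            if (pmd_none(*pmd) || pmd_bad(*pmd))
+                    return NULL;
+            return pte_offset_map(pmd, addr);
+    }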
index 50ee0df..a834fad 100644 (file)
@@ -14,15 +14,20 @@ tables. Access to higher level tables protected by mm->page_table_lock.
 There are helpers to lock/unlock a table and other accessor functions:
 
  - pte_offset_map_lock()
-       maps pte and takes PTE table lock, returns pointer to the taken
-       lock;
+       maps PTE and takes PTE table lock, returns pointer to PTE with
+       pointer to its PTE table lock, or returns NULL if no PTE table;
+ - pte_offset_map_nolock()
+       maps PTE, returns pointer to PTE with pointer to its PTE table
+       lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map()
+       maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
+ - pte_unmap()
+       unmaps PTE table;
  - pte_unmap_unlock()
        unlocks and unmaps PTE table;
  - pte_alloc_map_lock()
-       allocates PTE table if needed and take the lock, returns pointer
-       to taken lock or NULL if allocation failed;
- - pte_lockptr()
-       returns pointer to PTE table lock;
+       allocates PTE table if needed and takes its lock, returns pointer to
+       PTE with pointer to its lock, or returns NULL if allocation failed;
  - pmd_lock()
        takes PMD table lock, returns pointer to taken lock;
  - pmd_lockptr()
index 6a919cf..613a01d 100644 (file)
@@ -434,9 +434,10 @@ There are a few hints which can help with linux-kernel survival:
   questions.  Some developers can get impatient with people who clearly
   have not done their homework.
 
-- Avoid top-posting (the practice of putting your answer above the quoted
-  text you are responding to).  It makes your response harder to read and
-  makes a poor impression.
+- Use interleaved ("inline") replies, which make your response easier to
+  read; i.e. avoid top-posting -- the practice of putting your answer above
+  the quoted text you are responding to.  For more details, see
+  :ref:`Documentation/process/submitting-patches.rst <interleaved_replies>`.
 
 - Ask on the correct mailing list.  Linux-kernel may be the general meeting
   point, but it is not the best place to find developers from all
index ef54086..5cf6a5f 100644 (file)
@@ -31,7 +31,7 @@ you probably needn't concern yourself with pcmciautils.
 ====================== ===============  ========================================
 GNU C                  5.1              gcc --version
 Clang/LLVM (optional)  11.0.0           clang --version
-Rust (optional)        1.62.0           rustc --version
+Rust (optional)        1.68.2           rustc --version
 bindgen (optional)     0.56.0           bindgen --version
 GNU make               3.82             make --version
 bash                   4.2              bash --version
index abb741b..5d3c3de 100644 (file)
@@ -129,88 +129,132 @@ tools and scripts used by other kernel developers or Linux distributions; one of
 these tools is regzbot, which heavily relies on the "Link:" tags to associate
 reports for regression with changes resolving them.
 
-Prioritize work on fixing regressions
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-You should fix any reported regression as quickly as possible, to provide
-affected users with a solution in a timely manner and prevent more users from
-running into the issue; nevertheless developers need to take enough time and
-care to ensure regression fixes do not cause additional damage.
-
-In the end though, developers should give their best to prevent users from
-running into situations where a regression leaves them only three options: "run
-a kernel with a regression that seriously impacts usage", "continue running an
-outdated and thus potentially insecure kernel version for more than two weeks
-after a regression's culprit was identified", and "downgrade to a still
-supported kernel series that lack required features".
-
-How to realize this depends a lot on the situation. Here are a few rules of
-thumb for you, in order or importance:
-
- * Prioritize work on handling regression reports and fixing regression over all
-   other Linux kernel work, unless the latter concerns acute security issues or
-   bugs causing data loss or damage.
-
- * Always consider reverting the culprit commits and reapplying them later
-   together with necessary fixes, as this might be the least dangerous and
-   quickest way to fix a regression.
-
- * Developers should handle regressions in all supported kernel series, but are
-   free to delegate the work to the stable team, if the issue probably at no
-   point in time occurred with mainline.
-
- * Try to resolve any regressions introduced in the current development before
-   its end. If you fear a fix might be too risky to apply only days before a new
-   mainline release, let Linus decide: submit the fix separately to him as soon
-   as possible with the explanation of the situation. He then can make a call
-   and postpone the release if necessary, for example if multiple such changes
-   show up in his inbox.
-
- * Address regressions in stable, longterm, or proper mainline releases with
-   more urgency than regressions in mainline pre-releases. That changes after
-   the release of the fifth pre-release, aka "-rc5": mainline then becomes as
-   important, to ensure all the improvements and fixes are ideally tested
-   together for at least one week before Linus releases a new mainline version.
-
- * Fix regressions within two or three days, if they are critical for some
-   reason -- for example, if the issue is likely to affect many users of the
-   kernel series in question on all or certain architectures. Note, this
-   includes mainline, as issues like compile errors otherwise might prevent many
-   testers or continuous integration systems from testing the series.
-
- * Aim to fix regressions within one week after the culprit was identified, if
-   the issue was introduced in either:
-
-    * a recent stable/longterm release
-
-    * the development cycle of the latest proper mainline release
-
-   In the latter case (say Linux v5.14), try to address regressions even
-   quicker, if the stable series for the predecessor (v5.13) will be abandoned
-   soon or already was stamped "End-of-Life" (EOL) -- this usually happens about
-   three to four weeks after a new mainline release.
-
- * Try to fix all other regressions within two weeks after the culprit was
-   found. Two or three additional weeks are acceptable for performance
-   regressions and other issues which are annoying, but don't prevent anyone
-   from running Linux (unless it's an issue in the current development cycle,
-   as those should ideally be addressed before the release). A few weeks in
-   total are acceptable if a regression can only be fixed with a risky change
-   and at the same time is affecting only a few users; as much time is
-   also okay if the regression is already present in the second newest longterm
-   kernel series.
-
-Note: The aforementioned time frames for resolving regressions are meant to
-include getting the fix tested, reviewed, and merged into mainline, ideally with
-the fix being in linux-next at least briefly. This leads to delays you need to
-account for.
-
-Subsystem maintainers are expected to assist in reaching those periods by doing
-timely reviews and quick handling of accepted patches. They thus might have to
-send git-pull requests earlier or more often than usual; depending on the fix,
-it might even be acceptable to skip testing in linux-next. Especially fixes for
-regressions in stable and longterm kernels need to be handled quickly, as fixes
-need to be merged in mainline before they can be backported to older series.
+Expectations and best practices for fixing regressions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As a Linux kernel developer, you are expected to give your best to prevent
+situations where a regression caused by a recent change of yours leaves users
+only these options:
+
+ * Run a kernel with a regression that impacts usage.
+
+ * Switch to an older or newer kernel series.
+
+ * Continue running an outdated and thus potentially insecure kernel for more
+   than three weeks after the regression's culprit was identified. Ideally it
+   should be less than two. And it ought to be just a few days, if the issue is
+   severe or affects many users -- either in general or in prevalent
+   environments.
+
+How to realize that in practice depends on various factors. Use the following
+rules of thumb as a guide.
+
+In general:
+
+ * Prioritize work on regressions over all other Linux kernel work, unless the
+   latter concerns a severe issue (e.g. acute security vulnerability, data loss,
+   bricked hardware, ...).
+
+ * Expedite fixing mainline regressions that recently made it into a proper
+   mainline, stable, or longterm release (either directly or via backport).
+
+ * Do not consider regressions from the current cycle as something that can wait
+   till the end of the cycle, as the issue might discourage or prevent users and
+   CI systems from testing mainline now or generally.
+
+ * Work with the required care to avoid additional or bigger damage, even if
+   resolving an issue then might take longer than outlined below.
+
+On timing once the culprit of a regression is known:
+
+ * Aim to mainline a fix within two or three days, if the issue is severe or
+   bothering many users -- either in general or in prevalent conditions like a
+   particular hardware environment, distribution, or stable/longterm series.
+
+ * Aim to mainline a fix by the Sunday after next, if the culprit made it
+   into a recent mainline, stable, or longterm release (either directly or via
+   backport); if the culprit became known early during a week and is simple to
+   resolve, try to mainline the fix within the same week.
+
+ * For other regressions, aim to mainline fixes before the hindmost Sunday
+   within the next three weeks. One or two Sundays later are acceptable, if the
+   regression is something people can live with easily for a while -- like a
+   mild performance regression.
+
+ * It's strongly discouraged to delay mainlining regression fixes till the next
+   merge window, except when the fix is extraordinarily risky or when the
+   culprit was mainlined more than a year ago.
+
+On procedure:
+
+ * Always consider reverting the culprit, as it's often the quickest and least
+   dangerous way to fix a regression. Don't worry about mainlining a fixed
+   variant later: that should be straightforward, as most of the code went
+   through review once already.
+
+ * Try to resolve any regressions introduced in mainline during the past
+   twelve months before the current development cycle ends: Linus wants such
+   regressions to be handled like those from the current cycle, unless fixing
+   bears unusual risks.
+
+ * Consider CCing Linus on discussions or patch review, if a regression seems
+   tangly. Do the same in precarious or urgent cases -- especially if the
+   subsystem maintainer might be unavailable. Also CC the stable team, when you
+   know such a regression made it into a mainline, stable, or longterm release.
+
+ * For urgent regressions, consider asking Linus to pick up the fix straight
+   from the mailing list: he is totally fine with that for uncontroversial
+   fixes. Ideally though such requests should happen in accordance with the
+   subsystem maintainers or come directly from them.
+
+ * In case you are unsure if a fix is worth the risk of applying it just days before
+   a new mainline release, send Linus a mail with the usual lists and people in
+   CC; in it, summarize the situation while asking him to consider picking up
+   the fix straight from the list. He then himself can make the call and when
+   needed even postpone the release. Such requests again should ideally happen
+   in accordance with the subsystem maintainers or come directly from them.
+
+Regarding stable and longterm kernels:
+
+ * You are free to leave regressions to the stable team, if they at no point in
+   time occurred with mainline or were fixed there already.
+
+ * If a regression made it into a proper mainline release during the past
+   twelve months, make sure to tag the fix with "Cc: stable@vger.kernel.org", as a
+   "Fixes:" tag alone does not guarantee a backport. Please add the same tag,
+   in case you know the culprit was backported to stable or longterm kernels.
+
+ * When receiving reports about regressions in recent stable or longterm kernel
+   series, please evaluate at least briefly if the issue might happen in current
+   mainline as well -- and if that seems likely, take hold of the report. If in
+   doubt, ask the reporter to check mainline.
+
+ * Whenever you want to swiftly resolve a regression that recently also made it
+   into a proper mainline, stable, or longterm release, fix it quickly in
+   mainline; when appropriate thus involve Linus to fast-track the fix (see
+   above). That's because the stable team normally neither reverts nor fixes
+   any changes that cause the same problems in mainline.
+
+ * In case of urgent regression fixes you might want to ensure prompt
+   backporting by dropping the stable team a note once the fix was mainlined;
+   this is especially advisable during merge windows and shortly thereafter, as
+   the fix otherwise might land at the end of a huge patch queue.
+
+On patch flow:
+
+ * Developers, when trying to reach the time periods mentioned above, remember
+   to account for the time it takes to get fixes tested, reviewed, and merged by
+   Linus, ideally with them being in linux-next at least briefly. Hence, if a
+   fix is urgent, make it obvious to ensure others handle it appropriately.
+
+ * Reviewers, you are kindly asked to assist developers in reaching the time
+   periods mentioned above by reviewing regression fixes in a timely manner.
+
+ * Subsystem maintainers, you likewise are encouraged to expedite the handling
+   of regression fixes. Thus evaluate if skipping linux-next is an option for
+   the particular fix. Also consider sending git pull requests more often than
+   usual when needed. And try to avoid holding onto regression fixes over
+   weekends -- especially when the fix is marked for backporting.
 
 
 More aspects regarding regressions developers should be aware of
index 178c95f..93d8a79 100644 (file)
@@ -421,6 +421,9 @@ allowing themselves a breath. Please respect that.
 The release candidate -rc1 is the starting point for new patches to be
 applied which are targeted for the next merge window.
 
+So-called _urgent_ branches will be merged into mainline during the
+stabilization phase of each release.
+
 
 Git
 ^^^
index 486875f..efac910 100644 (file)
@@ -331,6 +331,31 @@ explaining difference against previous submission (see
 See Documentation/process/email-clients.rst for recommendations on email
 clients and mailing list etiquette.
 
+.. _interleaved_replies:
+
+Use trimmed interleaved replies in email discussions
+----------------------------------------------------
+Top-posting is strongly discouraged in Linux kernel development
+discussions. Interleaved (or "inline") replies make conversations much
+easier to follow. For more details see:
+https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
+
+As is frequently quoted on the mailing list::
+
+  A: http://en.wikipedia.org/wiki/Top_post
+  Q: Were do I find info about this thing called top-posting?
+  A: Because it messes up the order in which people normally read text.
+  Q: Why is top-posting such a bad thing?
+  A: Top-posting.
+  Q: What is the most annoying thing in e-mail?
+
+Similarly, please trim all unneeded quotations that aren't relevant
+to your reply. This makes responses easier to find, and saves time and
+space. For more details see: http://daringfireball.net/2007/07/on_top ::
+
+  A: No.
+  Q: Should I include quotations after my reply?
+
 .. _resend_reminders:
 
 Don't get discouraged - or impatient
index 13b7744..a893151 100644 (file)
@@ -38,9 +38,9 @@ and run::
 
        rustup override set $(scripts/min-tool-version.sh rustc)
 
-Otherwise, fetch a standalone installer or install ``rustup`` from:
+Otherwise, fetch a standalone installer from:
 
-       https://www.rust-lang.org
+       https://forge.rust-lang.org/infra/other-installation-methods.html#standalone
 
 
 Rust standard library source
index 9d9be52..9fe4846 100644 (file)
@@ -203,12 +203,15 @@ Deadline Task Scheduling
   - Total bandwidth (this_bw): this is the sum of all tasks "belonging" to the
     runqueue, including the tasks in Inactive state.
 
+  - Maximum usable bandwidth (max_bw): This is the maximum bandwidth usable by
+    deadline tasks and is currently set to the RT capacity.
+
 
  The algorithm reclaims the bandwidth of the tasks in Inactive state.
  It does so by decrementing the runtime of the executing task Ti at a pace equal
  to
 
-           dq = -max{ Ui / Umax, (1 - Uinact - Uextra) } dt
+           dq = -(max{ Ui, (Umax - Uinact - Uextra) } / Umax) dt
 
  where:
 
index b51f385..02d6dc3 100644 (file)
@@ -10,6 +10,30 @@ is taken directly from the kernel source, with supplemental material added
 as needed (or at least as we managed to add it — probably *not* all that is
 needed).
 
+Human interfaces
+----------------
+
+.. toctree::
+   :maxdepth: 1
+
+   input/index
+   hid/index
+   sound/index
+   gpu/index
+   fb/index
+
+Storage interfaces
+------------------
+
+.. toctree::
+   :maxdepth: 1
+
+   filesystems/index
+   block/index
+   cdrom/index
+   scsi/index
+   target/index
+
 **Fixme**: much more organizational work is needed here.
 
 .. toctree::
@@ -19,12 +43,8 @@ needed).
    core-api/index
    locking/index
    accounting/index
-   block/index
-   cdrom/index
    cpu-freq/index
-   fb/index
    fpga/index
-   hid/index
    i2c/index
    iio/index
    isdn/index
@@ -34,25 +54,19 @@ needed).
    networking/index
    pcmcia/index
    power/index
-   target/index
    timers/index
    spi/index
    w1/index
    watchdog/index
    virt/index
-   input/index
    hwmon/index
-   gpu/index
    accel/index
    security/index
-   sound/index
    crypto/index
-   filesystems/index
    mm/index
    bpf/index
    usb/index
    PCI/index
-   scsi/index
    misc-devices/index
    scheduler/index
    mhi/index
@@ -1,4 +1,4 @@
-Chinese translated version of Documentation/arm/booting.rst
+Chinese translated version of Documentation/arch/arm/booting.rst
 
 If you have any comment or update to the content, please contact the
 original document maintainer directly.  However, if you have a problem
@@ -9,7 +9,7 @@ or if there is a problem with the translation.
 Maintainer: Russell King <linux@arm.linux.org.uk>
 Chinese maintainer: Fu Wei <tekkamanninja@gmail.com>
 ---------------------------------------------------------------------
-Documentation/arm/booting.rst 的中文翻译
+Documentation/arch/arm/booting.rst 的中文翻译
 
 如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文
 交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻
@@ -1,4 +1,4 @@
-Chinese translated version of Documentation/arm/kernel_user_helpers.rst
+Chinese translated version of Documentation/arch/arm/kernel_user_helpers.rst
 
 If you have any comment or update to the content, please contact the
 original document maintainer directly.  However, if you have a problem
@@ -10,7 +10,7 @@ Maintainer: Nicolas Pitre <nicolas.pitre@linaro.org>
                Dave Martin <dave.martin@linaro.org>
 Chinese maintainer: Fu Wei <tekkamanninja@gmail.com>
 ---------------------------------------------------------------------
-Documentation/arm/kernel_user_helpers.rst 的中文翻译
+Documentation/arch/arm/kernel_user_helpers.rst 的中文翻译
 
 如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文
 交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻
@@ -1,6 +1,6 @@
-.. include:: ../disclaimer-zh_CN.rst
+.. include:: ../../disclaimer-zh_CN.rst
 
-:Original: :ref:`Documentation/arm64/amu.rst <amu_index>`
+:Original: :ref:`Documentation/arch/arm64/amu.rst <amu_index>`
 
 Translator: Bailu Lin <bailu.lin@vivo.com>
 
@@ -1,4 +1,4 @@
-Chinese translated version of Documentation/arm64/booting.rst
+Chinese translated version of Documentation/arch/arm64/booting.rst
 
 If you have any comment or update to the content, please contact the
 original document maintainer directly.  However, if you have a problem
@@ -10,7 +10,7 @@ M:    Will Deacon <will.deacon@arm.com>
 zh_CN: Fu Wei <wefu@redhat.com>
 C:     55f058e7574c3615dea4615573a19bdb258696c6
 ---------------------------------------------------------------------
-Documentation/arm64/booting.rst 的中文翻译
+Documentation/arch/arm64/booting.rst 的中文翻译
 
 如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文
 交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻
@@ -1,6 +1,6 @@
-.. include:: ../disclaimer-zh_CN.rst
+.. include:: ../../disclaimer-zh_CN.rst
 
-:Original: :ref:`Documentation/arm64/elf_hwcaps.rst <elf_hwcaps_index>`
+:Original: :ref:`Documentation/arch/arm64/elf_hwcaps.rst <elf_hwcaps_index>`
 
 Translator: Bailu Lin <bailu.lin@vivo.com>
 
@@ -92,7 +92,7 @@ HWCAP_ASIMDHP
     ID_AA64PFR0_EL1.AdvSIMD == 0b0001 表示有此功能。
 
 HWCAP_CPUID
-    根据 Documentation/arm64/cpu-feature-registers.rst 描述,EL0 可以访问
+    根据 Documentation/arch/arm64/cpu-feature-registers.rst 描述,EL0 可以访问
     某些 ID 寄存器。
 
     这些 ID 寄存器可能表示功能的可用性。
@@ -152,12 +152,12 @@ HWCAP_SB
     ID_AA64ISAR1_EL1.SB == 0b0001 表示有此功能。
 
 HWCAP_PACA
-    如 Documentation/arm64/pointer-authentication.rst 所描述,
+    如 Documentation/arch/arm64/pointer-authentication.rst 所描述,
     ID_AA64ISAR1_EL1.APA == 0b0001 或 ID_AA64ISAR1_EL1.API == 0b0001
     表示有此功能。
 
 HWCAP_PACG
-    如 Documentation/arm64/pointer-authentication.rst 所描述,
+    如 Documentation/arch/arm64/pointer-authentication.rst 所描述,
     ID_AA64ISAR1_EL1.GPA == 0b0001 或 ID_AA64ISAR1_EL1.GPI == 0b0001
     表示有此功能。
 
@@ -1,6 +1,6 @@
-.. include:: ../disclaimer-zh_CN.rst
+.. include:: ../../disclaimer-zh_CN.rst
 
-:Original: :ref:`Documentation/arm64/hugetlbpage.rst <hugetlbpage_index>`
+:Original: :ref:`Documentation/arch/arm64/hugetlbpage.rst <hugetlbpage_index>`
 
 Translator: Bailu Lin <bailu.lin@vivo.com>
 
@@ -1,6 +1,6 @@
-.. include:: ../disclaimer-zh_CN.rst
+.. include:: ../../disclaimer-zh_CN.rst
 
-:Original: :ref:`Documentation/arm64/index.rst <arm64_index>`
+:Original: :ref:`Documentation/arch/arm64/index.rst <arm64_index>`
 :Translator: Bailu Lin <bailu.lin@vivo.com>
 
 .. _cn_arm64_index:
@@ -1,4 +1,4 @@
-Chinese translated version of Documentation/arm64/legacy_instructions.rst
+Chinese translated version of Documentation/arch/arm64/legacy_instructions.rst
 
 If you have any comment or update to the content, please contact the
 original document maintainer directly.  However, if you have a problem
@@ -10,7 +10,7 @@ Maintainer: Punit Agrawal <punit.agrawal@arm.com>
             Suzuki K. Poulose <suzuki.poulose@arm.com>
 Chinese maintainer: Fu Wei <wefu@redhat.com>
 ---------------------------------------------------------------------
-Documentation/arm64/legacy_instructions.rst 的中文翻译
+Documentation/arch/arm64/legacy_instructions.rst 的中文翻译
 
 如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文
 交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻
@@ -1,4 +1,4 @@
-Chinese translated version of Documentation/arm64/memory.rst
+Chinese translated version of Documentation/arch/arm64/memory.rst
 
 If you have any comment or update to the content, please contact the
 original document maintainer directly.  However, if you have a problem
@@ -9,7 +9,7 @@ or if there is a problem with the translation.
 Maintainer: Catalin Marinas <catalin.marinas@arm.com>
 Chinese maintainer: Fu Wei <wefu@redhat.com>
 ---------------------------------------------------------------------
-Documentation/arm64/memory.rst 的中文翻译
+Documentation/arch/arm64/memory.rst 的中文翻译
 
 如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文
 交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻
@@ -1,8 +1,8 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-.. include:: ../disclaimer-zh_CN.rst
+.. include:: ../../disclaimer-zh_CN.rst
 
-:Original: :ref:`Documentation/arm64/perf.rst <perf_index>`
+:Original: :ref:`Documentation/arch/arm64/perf.rst <perf_index>`
 
 Translator: Bailu Lin <bailu.lin@vivo.com>
 
@@ -1,4 +1,4 @@
-Chinese translated version of Documentation/arm64/silicon-errata.rst
+Chinese translated version of Documentation/arch/arm64/silicon-errata.rst
 
 If you have any comment or update to the content, please contact the
 original document maintainer directly.  However, if you have a problem
@@ -10,7 +10,7 @@ M:    Will Deacon <will.deacon@arm.com>
 zh_CN: Fu Wei <wefu@redhat.com>
 C:     1926e54f115725a9248d0c4c65c22acaf94de4c4
 ---------------------------------------------------------------------
-Documentation/arm64/silicon-errata.rst 的中文翻译
+Documentation/arch/arm64/silicon-errata.rst 的中文翻译
 
 如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文
 交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻
@@ -1,4 +1,4 @@
-Chinese translated version of Documentation/arm64/tagged-pointers.rst
+Chinese translated version of Documentation/arch/arm64/tagged-pointers.rst
 
 If you have any comment or update to the content, please contact the
 original document maintainer directly.  However, if you have a problem
@@ -9,7 +9,7 @@ or if there is a problem with the translation.
 Maintainer: Will Deacon <will.deacon@arm.com>
 Chinese maintainer: Fu Wei <wefu@redhat.com>
 ---------------------------------------------------------------------
-Documentation/arm64/tagged-pointers.rst 的中文翻译
+Documentation/arch/arm64/tagged-pointers.rst 的中文翻译
 
 如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文
 交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻
index 908ea13..6fa0cb6 100644 (file)
@@ -9,7 +9,7 @@
    :maxdepth: 2
 
    ../mips/index
-   ../arm64/index
+   arm64/index
    ../riscv/index
    openrisc/index
    parisc/index
index 076081d..f950638 100644 (file)
@@ -55,7 +55,7 @@ mbind()设置一个新的内存策略。一个进程的页面也可以通过sys_
    消失。它还可以防止交换器或其他扫描器遇到该页。
 
 
-2. 我们需要有一个new_page_t类型的函数,可以传递给migrate_pages()。这个函数应该计算
+2. 我们需要有一个new_folio_t类型的函数,可以传递给migrate_pages()。这个函数应该计算
    出如何在给定的旧页面中分配正确的新页面。
 
 3. migrate_pages()函数被调用,它试图进行迁移。它将调用该函数为每个被考虑迁移的页面分
@@ -1,8 +1,8 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-.. include:: ../disclaimer-zh_TW.rst
+.. include:: ../../disclaimer-zh_TW.rst
 
-:Original: :ref:`Documentation/arm64/amu.rst <amu_index>`
+:Original: :ref:`Documentation/arch/arm64/amu.rst <amu_index>`
 
 Translator: Bailu Lin <bailu.lin@vivo.com>
             Hu Haowen <src.res@email.cn>
@@ -1,6 +1,6 @@
 SPDX-License-Identifier: GPL-2.0
 
-Chinese translated version of Documentation/arm64/booting.rst
+Chinese translated version of Documentation/arch/arm64/booting.rst
 
 If you have any comment or update to the content, please contact the
 original document maintainer directly.  However, if you have a problem
@@ -13,7 +13,7 @@ zh_CN:        Fu Wei <wefu@redhat.com>
 zh_TW: Hu Haowen <src.res@email.cn>
 C:     55f058e7574c3615dea4615573a19bdb258696c6
 ---------------------------------------------------------------------
-Documentation/arm64/booting.rst 的中文翻譯
+Documentation/arch/arm64/booting.rst 的中文翻譯
 
 如果想評論或更新本文的內容,請直接聯繫原文檔的維護者。如果你使用英文
 交流有困難的話,也可以向中文版維護者求助。如果本翻譯更新不及時或者翻
@@ -1,8 +1,8 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-.. include:: ../disclaimer-zh_TW.rst
+.. include:: ../../disclaimer-zh_TW.rst
 
-:Original: :ref:`Documentation/arm64/elf_hwcaps.rst <elf_hwcaps_index>`
+:Original: :ref:`Documentation/arch/arm64/elf_hwcaps.rst <elf_hwcaps_index>`
 
 Translator: Bailu Lin <bailu.lin@vivo.com>
             Hu Haowen <src.res@email.cn>
@@ -95,7 +95,7 @@ HWCAP_ASIMDHP
     ID_AA64PFR0_EL1.AdvSIMD == 0b0001 表示有此功能。
 
 HWCAP_CPUID
-    根據 Documentation/arm64/cpu-feature-registers.rst 描述,EL0 可以訪問
+    根據 Documentation/arch/arm64/cpu-feature-registers.rst 描述,EL0 可以訪問
     某些 ID 寄存器。
 
     這些 ID 寄存器可能表示功能的可用性。
@@ -155,12 +155,12 @@ HWCAP_SB
     ID_AA64ISAR1_EL1.SB == 0b0001 表示有此功能。
 
 HWCAP_PACA
-    如 Documentation/arm64/pointer-authentication.rst 所描述,
+    如 Documentation/arch/arm64/pointer-authentication.rst 所描述,
     ID_AA64ISAR1_EL1.APA == 0b0001 或 ID_AA64ISAR1_EL1.API == 0b0001
     表示有此功能。
 
 HWCAP_PACG
-    如 Documentation/arm64/pointer-authentication.rst 所描述,
+    如 Documentation/arch/arm64/pointer-authentication.rst 所描述,
     ID_AA64ISAR1_EL1.GPA == 0b0001 或 ID_AA64ISAR1_EL1.GPI == 0b0001
     表示有此功能。
 
@@ -1,8 +1,8 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-.. include:: ../disclaimer-zh_TW.rst
+.. include:: ../../disclaimer-zh_TW.rst
 
-:Original: :ref:`Documentation/arm64/hugetlbpage.rst <hugetlbpage_index>`
+:Original: :ref:`Documentation/arch/arm64/hugetlbpage.rst <hugetlbpage_index>`
 
 Translator: Bailu Lin <bailu.lin@vivo.com>
             Hu Haowen <src.res@email.cn>
@@ -1,8 +1,8 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-.. include:: ../disclaimer-zh_TW.rst
+.. include:: ../../disclaimer-zh_TW.rst
 
-:Original: :ref:`Documentation/arm64/index.rst <arm64_index>`
+:Original: :ref:`Documentation/arch/arm64/index.rst <arm64_index>`
 :Translator: Bailu Lin <bailu.lin@vivo.com>
              Hu Haowen <src.res@email.cn>
 
@@ -1,6 +1,6 @@
 SPDX-License-Identifier: GPL-2.0
 
-Chinese translated version of Documentation/arm64/legacy_instructions.rst
+Chinese translated version of Documentation/arch/arm64/legacy_instructions.rst
 
 If you have any comment or update to the content, please contact the
 original document maintainer directly.  However, if you have a problem
@@ -13,7 +13,7 @@ Maintainer: Punit Agrawal <punit.agrawal@arm.com>
 Chinese maintainer: Fu Wei <wefu@redhat.com>
 Traditional Chinese maintainer: Hu Haowen <src.res@email.cn>
 ---------------------------------------------------------------------
-Documentation/arm64/legacy_instructions.rst 的中文翻譯
+Documentation/arch/arm64/legacy_instructions.rst 的中文翻譯
 
 如果想評論或更新本文的內容,請直接聯繫原文檔的維護者。如果你使用英文
 交流有困難的話,也可以向中文版維護者求助。如果本翻譯更新不及時或者翻
@@ -1,6 +1,6 @@
 SPDX-License-Identifier: GPL-2.0
 
-Chinese translated version of Documentation/arm64/memory.rst
+Chinese translated version of Documentation/arch/arm64/memory.rst
 
 If you have any comment or update to the content, please contact the
 original document maintainer directly.  However, if you have a problem
@@ -12,7 +12,7 @@ Maintainer: Catalin Marinas <catalin.marinas@arm.com>
 Chinese maintainer: Fu Wei <wefu@redhat.com>
 Traditional Chinese maintainer: Hu Haowen <src.res@email.cn>
 ---------------------------------------------------------------------
-Documentation/arm64/memory.rst 的中文翻譯
+Documentation/arch/arm64/memory.rst 的中文翻譯
 
 如果想評論或更新本文的內容,請直接聯繫原文檔的維護者。如果你使用英文
 交流有困難的話,也可以向中文版維護者求助。如果本翻譯更新不及時或者翻
@@ -1,8 +1,8 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-.. include:: ../disclaimer-zh_TW.rst
+.. include:: ../../disclaimer-zh_TW.rst
 
-:Original: :ref:`Documentation/arm64/perf.rst <perf_index>`
+:Original: :ref:`Documentation/arch/arm64/perf.rst <perf_index>`
 
 Translator: Bailu Lin <bailu.lin@vivo.com>
             Hu Haowen <src.res@email.cn>
@@ -1,6 +1,6 @@
 SPDX-License-Identifier: GPL-2.0
 
-Chinese translated version of Documentation/arm64/silicon-errata.rst
+Chinese translated version of Documentation/arch/arm64/silicon-errata.rst
 
 If you have any comment or update to the content, please contact the
 original document maintainer directly.  However, if you have a problem
@@ -13,7 +13,7 @@ zh_CN:        Fu Wei <wefu@redhat.com>
 zh_TW: Hu Haowen <src.res@email.cn>
 C:     1926e54f115725a9248d0c4c65c22acaf94de4c4
 ---------------------------------------------------------------------
-Documentation/arm64/silicon-errata.rst 的中文翻譯
+Documentation/arch/arm64/silicon-errata.rst 的中文翻譯
 
 如果想評論或更新本文的內容,請直接聯繫原文檔的維護者。如果你使用英文
 交流有困難的話,也可以向中文版維護者求助。如果本翻譯更新不及時或者翻
@@ -1,6 +1,6 @@
 SPDX-License-Identifier: GPL-2.0
 
-Chinese translated version of Documentation/arm64/tagged-pointers.rst
+Chinese translated version of Documentation/arch/arm64/tagged-pointers.rst
 
 If you have any comment or update to the content, please contact the
 original document maintainer directly.  However, if you have a problem
@@ -12,7 +12,7 @@ Maintainer: Will Deacon <will.deacon@arm.com>
 Chinese maintainer: Fu Wei <wefu@redhat.com>
 Traditional Chinese maintainer: Hu Haowen <src.res@email.cn>
 ---------------------------------------------------------------------
-Documentation/arm64/tagged-pointers.rst 的中文翻譯
+Documentation/arch/arm64/tagged-pointers.rst 的中文翻譯
 
 如果想評論或更新本文的內容,請直接聯繫原文檔的維護者。如果你使用英文
 交流有困難的話,也可以向中文版維護者求助。如果本翻譯更新不及時或者翻
index e97d7d5..e7c8386 100644 (file)
@@ -150,7 +150,7 @@ TODOList:
 .. toctree::
    :maxdepth: 2
 
-   arm64/index
+   arch/arm64/index
 
 TODOList:
 
index b4e7479..922291d 100644 (file)
@@ -72,7 +72,7 @@ high once achieves global guest_halt_poll_ns value).
 
 Default: Y
 
-The module parameters can be set from the debugfs files in::
+The module parameters can be set from the sysfs files in::
 
        /sys/module/haltpoll/parameters/
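
As a concrete illustration of the sysfs interface named above, the user-space
sketch below reads the guest_halt_poll_ns parameter mentioned earlier in this
file. The path and parameter name are taken from the surrounding text; error
handling is minimal, and writing a new value works the same way but needs root.

#include <stdio.h>

int main(void)
{
        const char *path =
                "/sys/module/haltpoll/parameters/guest_halt_poll_ns";
        char buf[64];
        FILE *f = fopen(path, "r");

        if (!f) {
                perror(path);
                return 1;
        }
        if (fgets(buf, sizeof(buf), f))
                printf("guest_halt_poll_ns = %s", buf);
        fclose(f);
        return 0;
}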
 
index add0677..96c4475 100644 (file)
@@ -2613,7 +2613,7 @@ follows::
        this vcpu, and determines which register slices are visible through
        this ioctl interface.
 
-(See Documentation/arm64/sve.rst for an explanation of the "vq"
+(See Documentation/arch/arm64/sve.rst for an explanation of the "vq"
 nomenclature.)
 
 KVM_REG_ARM64_SVE_VLS is only accessible after KVM_ARM_VCPU_INIT.
index 3fae39b..4f1a1b2 100644 (file)
@@ -112,11 +112,11 @@ powerpc kvm-hv case.
 |                      | function.                 |                         |
 +-----------------------+---------------------------+-------------------------+
 
-These module parameters can be set from the debugfs files in:
+These module parameters can be set from the sysfs files in:
 
        /sys/module/kvm/parameters/
 
-Note: that these module parameters are system wide values and are not able to
+Note: these module parameters are system-wide values and are not able to
       be tuned on a per vm basis.
 
 Any changes to these parameters will be picked up by new and existing vCPUs the
@@ -142,12 +142,12 @@ Further Notes
   global max polling interval (halt_poll_ns) then the host will always poll for the
   entire block time and thus cpu utilisation will go to 100%.
 
-- Halt polling essentially presents a trade off between power usage and latency and
+- Halt polling essentially presents a trade-off between power usage and latency and
   the module parameters should be used to tune the affinity for this. Idle cpu time is
   essentially converted to host kernel time with the aim of decreasing latency when
   entering the guest.
 
 - Halt polling will only be conducted by the host when no other tasks are runnable on
   that cpu, otherwise the polling will cease immediately and schedule will be invoked to
-  allow that other task to run. Thus this doesn't allow a guest to denial of service the
-  cpu.
+  allow that other task to run. Thus this doesn't allow a guest to cause denial of service
+  of the cpu.
index 8c77554..3a034db 100644 (file)
@@ -67,7 +67,7 @@ following two cases:
 2. Write-Protection: The SPTE is present and the fault is caused by
    write-protect. That means we just need to change the W bit of the spte.
 
-What we use to avoid all the race is the Host-writable bit and MMU-writable bit
+What we use to avoid all the races is the Host-writable bit and MMU-writable bit
 on the spte:
 
 - Host-writable means the gfn is writable in the host kernel page tables and in
@@ -130,7 +130,7 @@ to gfn.  For indirect sp, we disabled fast page fault for simplicity.
 A solution for indirect sp could be to pin the gfn, for example via
 kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg.  After the pinning:
 
-- We have held the refcount of pfn that means the pfn can not be freed and
+- We have held the refcount of pfn; that means the pfn can not be freed and
   be reused for another gfn.
 - The pfn is writable and therefore it cannot be shared between different gfns
   by KSM.
@@ -186,22 +186,22 @@ writable between reading spte and updating spte. Like below case:
 The Dirty bit is lost in this case.
 
 In order to avoid this kind of issue, we always treat the spte as "volatile"
-if it can be updated out of mmu-lock, see spte_has_volatile_bits(), it means,
+if it can be updated out of mmu-lock [see spte_has_volatile_bits()]; it means
 the spte is always atomically updated in this case.
 
 3) flush tlbs due to spte updated
 
-If the spte is updated from writable to readonly, we should flush all TLBs,
+If the spte is updated from writable to read-only, we should flush all TLBs,
 otherwise rmap_write_protect will find a read-only spte, even though the
 writable spte might be cached on a CPU's TLB.
 
 As mentioned before, the spte can be updated to writable out of mmu-lock on
-fast page fault path, in order to easily audit the path, we see if TLBs need
-be flushed caused by this reason in mmu_spte_update() since this is a common
+fast page fault path. In order to easily audit the path, we see if TLBs needing
+to be flushed caused this reason in mmu_spte_update() since this is a common
 function to update spte (present -> present).
 
 Since the spte is "volatile" if it can be updated out of mmu-lock, we always
-atomically update the spte, the race caused by fast page fault can be avoided,
+atomically update the spte and the race caused by fast page fault can be avoided.
 See the comments in spte_has_volatile_bits() and mmu_spte_update().
 
 Lockless Access Tracking:
@@ -283,9 +283,9 @@ time it will be set using the Dirty tracking mechanism described above.
 :Arch:         x86
 :Protects:     wakeup_vcpus_on_cpu
 :Comment:      This is a per-CPU lock and it is used for VT-d posted-interrupts.
-               When VT-d posted-interrupts is supported and the VM has assigned
+               When VT-d posted-interrupts are supported and the VM has assigned
                devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
-               protected by blocked_vcpu_on_cpu_lock, when VT-d hardware issues
+               protected by blocked_vcpu_on_cpu_lock. When VT-d hardware issues
                wakeup notification event since external interrupts from the
                assigned devices happens, we will find the vCPU on the list to
                wakeup.
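
For the fast page fault path covered in the hunks above, the key idea is a
lockless compare-and-exchange on the spte: re-read the entry, check the
Host-writable and MMU-writable style bits, and only publish the new value if
nothing changed in between. A minimal user-space sketch of that pattern
follows; the bit positions, names and use of C11 atomics are illustrative
assumptions and do not reflect the real spte layout or the kernel's atomic
helpers.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPTE_W          (1ull << 1)     /* hardware-writable bit, assumed position */
#define SPTE_HOST_W     (1ull << 57)    /* "Host-writable" style bit, illustrative */
#define SPTE_MMU_W      (1ull << 58)    /* "MMU-writable" style bit, illustrative  */

/* Lockless write-protect fix-up: succeed only if the entry still holds the
 * value we read. Returns true when this caller performed the update. */
bool fast_pf_fix_write(_Atomic uint64_t *sptep)
{
        uint64_t old = atomic_load(sptep);

        if (!(old & SPTE_HOST_W) || !(old & SPTE_MMU_W))
                return false;           /* not eligible, fall back to the slow path */

        return atomic_compare_exchange_strong(sptep, &old, old | SPTE_W);
}

If another CPU modified the entry between the read and the exchange, the
exchange fails and the fault is handled again, which is why the text above
insists on treating such sptes as volatile and updating them atomically.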
index 5fdb907..740d03d 100644 (file)
@@ -89,7 +89,7 @@ also define a new hypercall feature to indicate that the host can give you more
 registers. Only if the host supports the additional features, make use of them.
 
 The magic page layout is described by struct kvm_vcpu_arch_shared
-in arch/powerpc/include/asm/kvm_para.h.
+in arch/powerpc/include/uapi/asm/kvm_para.h.
 
 Magic page features
 ===================
@@ -112,7 +112,7 @@ Magic page flags
 ================
 
 In addition to features that indicate whether a host is capable of a particular
-feature we also have a channel for a guest to tell the guest whether it's capable
+feature we also have a channel for a guest to tell the host whether it's capable
 of something. This is what we call "flags".
 
 Flags are passed to the host in the low 12 bits of the Effective Address.
@@ -139,7 +139,7 @@ Patched instructions
 ====================
 
 The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
-respectively on 32 bit systems with an added offset of 4 to accommodate for big
+respectively on 32-bit systems with an added offset of 4 to accommodate for big
 endianness.
 
 The following is a list of mapping the Linux kernel performs when running as
@@ -210,7 +210,7 @@ available on all targets.
 2) PAPR hypercalls
 
 PAPR hypercalls are needed to run server PowerPC PAPR guests (-M pseries in QEMU).
-These are the same hypercalls that pHyp, the POWER hypervisor implements. Some of
+These are the same hypercalls that pHyp, the POWER hypervisor, implements. Some of
 them are handled in the kernel, some are handled in user space. This is only
 available on book3s_64.
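
Since the flags are said to travel in the low 12 bits of the effective
address, the packing and unpacking can be pictured with a short sketch. The
helper names and the example flag below are invented for illustration and are
not part of the real kvm_para interface.

#include <stdint.h>

#define MAGIC_PAGE_FLAG_MASK    0xfffULL        /* low 12 bits carry the flags */
#define MAGIC_PAGE_FLAG_DEMO    (1ULL << 0)     /* placeholder, not a real flag */

/* Guest side: combine a page-aligned address with the flag bits. */
static inline uint64_t magic_page_ea(uint64_t addr, uint64_t flags)
{
        return (addr & ~MAGIC_PAGE_FLAG_MASK) | (flags & MAGIC_PAGE_FLAG_MASK);
}

/* Host side: split the effective address back into its two parts. */
static inline uint64_t magic_page_addr(uint64_t ea)
{
        return ea & ~MAGIC_PAGE_FLAG_MASK;
}

static inline uint64_t magic_page_flags(uint64_t ea)
{
        return ea & MAGIC_PAGE_FLAG_MASK;
}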
 
index 87f04c1..06718b9 100644 (file)
@@ -101,7 +101,7 @@ also be used, e.g. ::
 
 However, VCPU request users should refrain from doing so, as it would
 break the abstraction.  The first 8 bits are reserved for architecture
-independent requests, all additional bits are available for architecture
+independent requests; all additional bits are available for architecture
 dependent requests.
 
 Architecture Independent Requests
@@ -151,8 +151,8 @@ KVM_REQUEST_NO_WAKEUP
 
   This flag is applied to requests that only need immediate attention
   from VCPUs running in guest mode.  That is, sleeping VCPUs do not need
-  to be awaken for these requests.  Sleeping VCPUs will handle the
-  requests when they are awaken later for some other reason.
+  to be awakened for these requests.  Sleeping VCPUs will handle the
+  requests when they are awakened later for some other reason.
 
 KVM_REQUEST_WAIT
 
index 6b789d2..62d867e 100644 (file)
@@ -5,31 +5,31 @@ Paravirt_ops
 ============
 
 Linux provides support for different hypervisor virtualization technologies.
-Historically different binary kernels would be required in order to support
-different hypervisors, this restriction was removed with pv_ops.
+Historically, different binary kernels would be required in order to support
+different hypervisors; this restriction was removed with pv_ops.
 Linux pv_ops is a virtualization API which enables support for different
 hypervisors. It allows each hypervisor to override critical operations and
 allows a single kernel binary to run on all supported execution environments
 including native machine -- without any hypervisors.
 
 pv_ops provides a set of function pointers which represent operations
-corresponding to low level critical instructions and high level
-functionalities in various areas. pv-ops allows for optimizations at run
-time by enabling binary patching of the low-ops critical operations
+corresponding to low-level critical instructions and high-level
+functionalities in various areas. pv_ops allows for optimizations at run
+time by enabling binary patching of the low-level critical operations
 at boot time.
 
 pv_ops operations are classified into three categories:
 
 - simple indirect call
-   These operations correspond to high level functionality where it is
+   These operations correspond to high-level functionality where it is
    known that the overhead of indirect call isn't very important.
 
 - indirect call which allows optimization with binary patch
-   Usually these operations correspond to low level critical instructions. They
+   Usually these operations correspond to low-level critical instructions. They
    are called frequently and are performance critical. The overhead is
    very important.
 
 - a set of macros for hand written assembly code
    Hand written assembly codes (.S files) also need paravirtualization
-   because they include sensitive instructions or some of code paths in
+   because they include sensitive instructions or some code paths in
    them are very performance critical.
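
The "set of function pointers" model described above can be pictured with a
self-contained sketch: an ops table filled with native implementations that a
hypervisor backend replaces at boot. All names here (pv_cpu_ops_demo,
native_*, hyper_*) are invented for illustration and are not the real pv_ops
structures.

#include <stdio.h>

struct pv_cpu_ops_demo {
        void (*write_cr3)(unsigned long val);
        unsigned long (*read_cr3)(void);
};

static unsigned long fake_cr3;

static void native_write_cr3(unsigned long val) { fake_cr3 = val; }
static unsigned long native_read_cr3(void)      { return fake_cr3; }

/* A hypervisor backend would install hooks like this one during early boot. */
static void hyper_write_cr3(unsigned long val)
{
        printf("hypercall: set cr3 = %#lx\n", val);
        fake_cr3 = val;
}

static struct pv_cpu_ops_demo pv_ops_demo = {
        .write_cr3 = native_write_cr3,
        .read_cr3  = native_read_cr3,
};

int main(void)
{
        pv_ops_demo.write_cr3 = hyper_write_cr3;   /* "paravirtualized" boot */
        pv_ops_demo.write_cr3(0x1000);
        printf("cr3 = %#lx\n", pv_ops_demo.read_cr3());
        return 0;
}

As the text notes, the kernel additionally binary-patches the hottest of these
indirect calls at boot time, so the native case does not pay the
indirect-call cost.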
index 095c427..acbe540 100644 (file)
@@ -2703,7 +2703,7 @@ Q:        https://patchwork.kernel.org/project/linux-samsung-soc/list/
 B:     mailto:linux-samsung-soc@vger.kernel.org
 C:     irc://irc.libera.chat/linux-exynos
 T:     git git://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux.git
-F:     Documentation/arm/samsung/
+F:     Documentation/arch/arm/samsung/
 F:     Documentation/devicetree/bindings/arm/samsung/
 F:     Documentation/devicetree/bindings/hwinfo/samsung,*
 F:     Documentation/devicetree/bindings/power/pd-samsung.yaml
@@ -3055,7 +3055,7 @@ M:        Will Deacon <will@kernel.org>
 L:     linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:     Maintained
 T:     git git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
-F:     Documentation/arm64/
+F:     Documentation/arch/arm64/
 F:     arch/arm64/
 F:     tools/testing/selftests/arm64/
 X:     arch/arm64/boot/dts/
@@ -4481,6 +4481,13 @@ S:       Supported
 F:     Documentation/filesystems/caching/cachefiles.rst
 F:     fs/cachefiles/
 
+CACHESTAT: PAGE CACHE STATS FOR A FILE
+M:     Nhat Pham <nphamcs@gmail.com>
+M:     Johannes Weiner <hannes@cmpxchg.org>
+L:     linux-mm@kvack.org
+S:     Maintained
+F:     tools/testing/selftests/cachestat/test_cachestat.c
+
 CADENCE MIPI-CSI2 BRIDGES
 M:     Maxime Ripard <mripard@kernel.org>
 L:     linux-media@vger.kernel.org
@@ -5338,6 +5345,18 @@ F:       include/linux/sched/cpufreq.h
 F:     kernel/sched/cpufreq*.c
 F:     tools/testing/selftests/cpufreq/
 
+CPU HOTPLUG
+M:     Thomas Gleixner <tglx@linutronix.de>
+M:     Peter Zijlstra <peterz@infradead.org>
+L:     linux-kernel@vger.kernel.org
+S:     Maintained
+T:     git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/core
+F:     kernel/cpu.c
+F:     kernel/smpboot.*
+F:     include/linux/cpu.h
+F:     include/linux/cpuhotplug.h
+F:     include/linux/smpboot.h
+
 CPU IDLE TIME MANAGEMENT FRAMEWORK
 M:     "Rafael J. Wysocki" <rafael@kernel.org>
 M:     Daniel Lezcano <daniel.lezcano@linaro.org>
@@ -6221,6 +6240,12 @@ X:       Documentation/power/
 X:     Documentation/spi/
 X:     Documentation/userspace-api/media/
 
+DOCUMENTATION PROCESS
+M:     Jonathan Corbet <corbet@lwn.net>
+S:     Maintained
+F:     Documentation/process/
+L:     workflows@vger.kernel.org
+
 DOCUMENTATION REPORTING ISSUES
 M:     Thorsten Leemhuis <linux@leemhuis.info>
 L:     linux-doc@vger.kernel.org
@@ -7476,6 +7501,14 @@ L:       linux-edac@vger.kernel.org
 S:     Maintained
 F:     drivers/edac/mpc85xx_edac.[ch]
 
+EDAC-NPCM
+M:     Marvin Lin <kflin@nuvoton.com>
+M:     Stanley Chu <yschu@nuvoton.com>
+L:     linux-edac@vger.kernel.org
+S:     Maintained
+F:     Documentation/devicetree/bindings/memory-controllers/nuvoton,npcm-memory-controller.yaml
+F:     drivers/edac/npcm_edac.c
+
 EDAC-PASEMI
 M:     Egor Martovetsky <egor@pasemi.com>
 L:     linux-edac@vger.kernel.org
@@ -8073,6 +8106,7 @@ T:        git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/har
 F:     include/linux/fortify-string.h
 F:     lib/fortify_kunit.c
 F:     lib/memcpy_kunit.c
+F:     lib/strcat_kunit.c
 F:     lib/strscpy_kunit.c
 F:     lib/test_fortify/*
 F:     scripts/test_fortify.sh
@@ -11274,6 +11308,10 @@ W:     http://kernelnewbies.org/KernelJanitors
 KERNEL NFSD, SUNRPC, AND LOCKD SERVERS
 M:     Chuck Lever <chuck.lever@oracle.com>
 M:     Jeff Layton <jlayton@kernel.org>
+R:     Neil Brown <neilb@suse.de>
+R:     Olga Kornievskaia <kolga@netapp.com>
+R:     Dai Ngo <Dai.Ngo@oracle.com>
+R:     Tom Talpey <tom@talpey.com>
 L:     linux-nfs@vger.kernel.org
 S:     Supported
 W:     http://nfs.sourceforge.net/
@@ -11331,6 +11369,8 @@ L:      linux-kselftest@vger.kernel.org
 L:     kunit-dev@googlegroups.com
 S:     Maintained
 W:     https://google.github.io/kunit-docs/third_party/kernel/docs/
+T:     git git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git kunit
+T:     git git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git kunit-fixes
 F:     Documentation/dev-tools/kunit/
 F:     include/kunit/
 F:     lib/kunit/
@@ -12545,7 +12585,6 @@ MARVELL NAND CONTROLLER DRIVER
 M:     Miquel Raynal <miquel.raynal@bootlin.com>
 L:     linux-mtd@lists.infradead.org
 S:     Maintained
-F:     Documentation/devicetree/bindings/mtd/marvell-nand.txt
 F:     drivers/mtd/nand/raw/marvell_nand.c
 
 MARVELL OCTEON ENDPOINT DRIVER
@@ -14711,7 +14750,7 @@ NETWORKING [LABELED] (NetLabel, Labeled IPsec, SECMARK)
 M:     Paul Moore <paul@paul-moore.com>
 L:     netdev@vger.kernel.org
 L:     linux-security-module@vger.kernel.org
-S:     Maintained
+S:     Supported
 W:     https://github.com/netlabel
 F:     Documentation/netlabel/
 F:     include/net/calipso.h
@@ -15316,7 +15355,7 @@ OMAP DISPLAY SUBSYSTEM and FRAMEBUFFER SUPPORT (DSS2)
 L:     linux-omap@vger.kernel.org
 L:     linux-fbdev@vger.kernel.org
 S:     Orphan
-F:     Documentation/arm/omap/dss.rst
+F:     Documentation/arch/arm/omap/dss.rst
 F:     drivers/video/fbdev/omap2/
 
 OMAP FRAMEBUFFER SUPPORT
@@ -15967,7 +16006,7 @@ F:      include/uapi/linux/ppdev.h
 
 PARAVIRT_OPS INTERFACE
 M:     Juergen Gross <jgross@suse.com>
-M:     Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
+R:     Ajay Kaher <akaher@vmware.com>
 R:     Alexey Makhalov <amakhalov@vmware.com>
 R:     VMware PV-Drivers Reviewers <pv-drivers@vmware.com>
 L:     virtualization@lists.linux-foundation.org
@@ -17818,7 +17857,7 @@ M:      Boqun Feng <boqun.feng@gmail.com>
 R:     Steven Rostedt <rostedt@goodmis.org>
 R:     Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
 R:     Lai Jiangshan <jiangshanlai@gmail.com>
-R:     Zqiang <qiang1.zhang@intel.com>
+R:     Zqiang <qiang.zhang1211@gmail.com>
 L:     rcu@vger.kernel.org
 S:     Supported
 W:     http://www.rdrop.com/users/paulmck/RCU/
@@ -22542,7 +22581,7 @@ S:      Supported
 F:     drivers/misc/vmw_balloon.c
 
 VMWARE HYPERVISOR INTERFACE
-M:     Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
+M:     Ajay Kaher <akaher@vmware.com>
 M:     Alexey Makhalov <amakhalov@vmware.com>
 R:     VMware PV-Drivers Reviewers <pv-drivers@vmware.com>
 L:     virtualization@lists.linux-foundation.org
@@ -22569,8 +22608,8 @@ F:      drivers/scsi/vmw_pvscsi.c
 F:     drivers/scsi/vmw_pvscsi.h
 
 VMWARE VIRTUAL PTP CLOCK DRIVER
-M:     Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
 M:     Deep Shah <sdeep@vmware.com>
+R:     Ajay Kaher <akaher@vmware.com>
 R:     Alexey Makhalov <amakhalov@vmware.com>
 R:     VMware PV-Drivers Reviewers <pv-drivers@vmware.com>
 L:     netdev@vger.kernel.org
index b68b43c..48a044b 100644 (file)
--- a/Makefile
+++ b/Makefile
@@ -2,7 +2,7 @@
 VERSION = 6
 PATCHLEVEL = 4
 SUBLEVEL = 0
-EXTRAVERSION = -rc7
+EXTRAVERSION =
 NAME = Hurr durr I'ma ninja sloth
 
 # *DOCUMENTATION*
@@ -1026,6 +1026,12 @@ KBUILD_CFLAGS += -Wno-pointer-sign
 # globally built with -Wcast-function-type.
 KBUILD_CFLAGS += $(call cc-option, -Wcast-function-type)
 
+# To gain proper coverage for CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE,
+# the kernel uses only C99 flexible arrays for dynamically sized trailing
+# arrays. Enforce this for everything that may examine structure sizes and
+# perform bounds checking.
+KBUILD_CFLAGS += $(call cc-option, -fstrict-flex-arrays=3)
+
 # disable stringop warnings in gcc 8+
 KBUILD_CFLAGS += $(call cc-disable-warning, stringop-truncation)
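
The -fstrict-flex-arrays=3 option added above hinges on the difference between
old-style trailing arrays and C99 flexible array members. A minimal
illustration (not kernel code) of why the distinction matters for bounds
checking:

#include <stdlib.h>
#include <string.h>

struct old_style {
        int  len;
        char data[1];   /* with -fstrict-flex-arrays=3 this is exactly one byte */
};

struct c99_style {
        int  len;
        char data[];    /* flexible array member: size comes from the allocation */
};

struct c99_style *make_msg(const char *src, size_t len)
{
        struct c99_style *m = malloc(sizeof(*m) + len);

        if (!m)
                return NULL;
        m->len = (int)len;
        memcpy(m->data, src, len);      /* checkers can reason about this bound */
        return m;
}

With the flag in place, writing more than one byte past old_style's data[1]
can be caught by FORTIFY_SOURCE and UBSAN_BOUNDS, while the flexible array
member keeps its dynamically sized semantics.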
 
index 205fd23..aff2746 100644 (file)
@@ -34,6 +34,29 @@ config ARCH_HAS_SUBPAGE_FAULTS
 config HOTPLUG_SMT
        bool
 
+# Selected by HOTPLUG_CORE_SYNC_DEAD or HOTPLUG_CORE_SYNC_FULL
+config HOTPLUG_CORE_SYNC
+       bool
+
+# Basic CPU dead synchronization selected by architecture
+config HOTPLUG_CORE_SYNC_DEAD
+       bool
+       select HOTPLUG_CORE_SYNC
+
+# Full CPU synchronization with alive state selected by architecture
+config HOTPLUG_CORE_SYNC_FULL
+       bool
+       select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
+       select HOTPLUG_CORE_SYNC
+
+config HOTPLUG_SPLIT_STARTUP
+       bool
+       select HOTPLUG_CORE_SYNC_FULL
+
+config HOTPLUG_PARALLEL
+       bool
+       select HOTPLUG_SPLIT_STARTUP
+
 config GENERIC_ENTRY
        bool
 
@@ -285,6 +308,9 @@ config ARCH_HAS_DMA_SET_UNCACHED
 config ARCH_HAS_DMA_CLEAR_UNCACHED
        bool
 
+config ARCH_HAS_CPU_FINALIZE_INIT
+       bool
+
 # Select if arch init_task must go in the __init_task_data section
 config ARCH_TASK_STRUCT_ON_STACK
        bool
@@ -400,20 +426,14 @@ config HAVE_HARDLOCKUP_DETECTOR_PERF
          The arch chooses to use the generic perf-NMI-based hardlockup
          detector. Must define HAVE_PERF_EVENTS_NMI.
 
-config HAVE_NMI_WATCHDOG
-       depends on HAVE_NMI
-       bool
-       help
-         The arch provides a low level NMI watchdog. It provides
-         asm/nmi.h, and defines its own arch_touch_nmi_watchdog().
-
 config HAVE_HARDLOCKUP_DETECTOR_ARCH
        bool
-       select HAVE_NMI_WATCHDOG
        help
-         The arch chooses to provide its own hardlockup detector, which is
-         a superset of the HAVE_NMI_WATCHDOG. It also conforms to config
-         interfaces and parameters provided by hardlockup detector subsystem.
+         The arch provides its own hardlockup detector implementation instead
+         of the generic ones.
+
+         It uses the same command line parameters, and sysctl interface,
+         as the generic hardlockup detectors.
 
 config HAVE_PERF_REGS
        bool
@@ -1188,13 +1208,6 @@ config COMPAT_32BIT_TIME
 config ARCH_NO_PREEMPT
        bool
 
-config ARCH_EPHEMERAL_INODES
-       def_bool n
-       help
-         An arch should select this symbol if it doesn't keep track of inode
-         instances on its own, but instead relies on something else (e.g. the
-         host kernel for an UML kernel).
-
 config ARCH_SUPPORTS_RT
        bool
 
index f2861a4..cbd9244 100644 (file)
@@ -200,25 +200,6 @@ ATOMIC_OPS(xor, xor)
 #undef ATOMIC_OP_RETURN
 #undef ATOMIC_OP
 
-#define arch_atomic64_cmpxchg(v, old, new) \
-       (arch_cmpxchg(&((v)->counter), old, new))
-#define arch_atomic64_xchg(v, new) \
-       (arch_xchg(&((v)->counter), new))
-
-#define arch_atomic_cmpxchg(v, old, new) \
-       (arch_cmpxchg(&((v)->counter), old, new))
-#define arch_atomic_xchg(v, new) \
-       (arch_xchg(&((v)->counter), new))
-
-/**
- * arch_atomic_fetch_add_unless - add unless the number is a given value
- * @v: pointer of type atomic_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
- *
- * Atomically adds @a to @v, so long as it was not @u.
- * Returns the old value of @v.
- */
 static __inline__ int arch_atomic_fetch_add_unless(atomic_t *v, int a, int u)
 {
        int c, new, old;
@@ -242,15 +223,6 @@ static __inline__ int arch_atomic_fetch_add_unless(atomic_t *v, int a, int u)
 }
 #define arch_atomic_fetch_add_unless arch_atomic_fetch_add_unless
 
-/**
- * arch_atomic64_fetch_add_unless - add unless the number is a given value
- * @v: pointer of type atomic64_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
- *
- * Atomically adds @a to @v, so long as it was not @u.
- * Returns the old value of @v.
- */
 static __inline__ s64 arch_atomic64_fetch_add_unless(atomic64_t *v, s64 a, s64 u)
 {
        s64 c, new, old;
@@ -274,13 +246,6 @@ static __inline__ s64 arch_atomic64_fetch_add_unless(atomic64_t *v, s64 a, s64 u
 }
 #define arch_atomic64_fetch_add_unless arch_atomic64_fetch_add_unless
 
-/*
- * arch_atomic64_dec_if_positive - decrement by 1 if old value positive
- * @v: pointer of type atomic_t
- *
- * The function returns the old value of *v minus 1, even if
- * the atomic variable, v, was not decremented.
- */
 static inline s64 arch_atomic64_dec_if_positive(atomic64_t *v)
 {
        s64 old, tmp;
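
The kerneldoc blocks removed above described the semantics of
fetch_add_unless and dec_if_positive. For reference, the same behaviour can be
sketched with portable C11 atomics; this is not the Alpha implementation, just
an equivalent compare-exchange loop for each operation.

#include <stdatomic.h>
#include <stdint.h>

/* Add 'a' to '*v' unless the current value is 'u'; return the old value. */
int64_t fetch_add_unless(_Atomic int64_t *v, int64_t a, int64_t u)
{
        int64_t old = atomic_load(v);

        while (old != u &&
               !atomic_compare_exchange_weak(v, &old, old + a))
                ;
        return old;
}

/* Decrement '*v' if that keeps it non-negative; always return the old value
 * minus one, even when no decrement was performed. */
int64_t dec_if_positive(_Atomic int64_t *v)
{
        int64_t old = atomic_load(v);

        do {
                if (old - 1 < 0)
                        break;
        } while (!atomic_compare_exchange_weak(v, &old, old - 1));
        return old - 1;
}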
diff --git a/arch/alpha/include/asm/bugs.h b/arch/alpha/include/asm/bugs.h
deleted file mode 100644 (file)
index 78030d1..0000000
+++ /dev/null
@@ -1,20 +0,0 @@
-/*
- *  include/asm-alpha/bugs.h
- *
- *  Copyright (C) 1994  Linus Torvalds
- */
-
-/*
- * This is included by init/main.c to check for architecture-dependent bugs.
- *
- * Needs:
- *     void check_bugs(void);
- */
-
-/*
- * I don't know of any alpha bugs yet.. Nice chip
- */
-
-static void check_bugs(void)
-{
-}
index 2a9a877..d98701e 100644 (file)
@@ -1014,8 +1014,6 @@ SYSCALL_DEFINE2(osf_settimeofday, struct timeval32 __user *, tv,
        return do_sys_settimeofday64(tv ? &kts : NULL, tz ? &ktz : NULL);
 }
 
-asmlinkage long sys_ni_posix_timers(void);
-
 SYSCALL_DEFINE2(osf_utimes, const char __user *, filename,
                struct timeval32 __user *, tvs)
 {
index 33bf3a6..b650ff1 100644 (file)
@@ -658,7 +658,7 @@ setup_arch(char **cmdline_p)
 #endif
 
        /* Default root filesystem to sda2.  */
-       ROOT_DEV = Root_SDA2;
+       ROOT_DEV = MKDEV(SCSI_DISK0_MAJOR, 2);
 
 #ifdef CONFIG_EISA
        /* FIXME:  only set this when we actually have EISA in this box? */
index 8ebacf3..1f13995 100644 (file)
 558    common  process_mrelease                sys_process_mrelease
 559    common  futex_waitv                     sys_futex_waitv
 560    common  set_mempolicy_home_node         sys_ni_syscall
+561    common  cachestat                       sys_cachestat
index 2c83034..89d12a6 100644 (file)
@@ -81,6 +81,11 @@ static inline int arch_atomic_fetch_##op(int i, atomic_t *v)         \
 ATOMIC_OPS(add, +=, add)
 ATOMIC_OPS(sub, -=, sub)
 
+#define arch_atomic_fetch_add          arch_atomic_fetch_add
+#define arch_atomic_fetch_sub          arch_atomic_fetch_sub
+#define arch_atomic_add_return         arch_atomic_add_return
+#define arch_atomic_sub_return         arch_atomic_sub_return
+
 #undef ATOMIC_OPS
 #define ATOMIC_OPS(op, c_op, asm_op)                                   \
        ATOMIC_OP(op, c_op, asm_op)                                     \
@@ -92,7 +97,11 @@ ATOMIC_OPS(or, |=, or)
 ATOMIC_OPS(xor, ^=, xor)
 
 #define arch_atomic_andnot             arch_atomic_andnot
+
+#define arch_atomic_fetch_and          arch_atomic_fetch_and
 #define arch_atomic_fetch_andnot       arch_atomic_fetch_andnot
+#define arch_atomic_fetch_or           arch_atomic_fetch_or
+#define arch_atomic_fetch_xor          arch_atomic_fetch_xor
 
 #undef ATOMIC_OPS
 #undef ATOMIC_FETCH_OP
index 52ee51e..592d7ff 100644 (file)
 #include <asm/atomic-spinlock.h>
 #endif
 
-#define arch_atomic_cmpxchg(v, o, n)                                   \
-({                                                                     \
-       arch_cmpxchg(&((v)->counter), (o), (n));                        \
-})
-
-#ifdef arch_cmpxchg_relaxed
-#define arch_atomic_cmpxchg_relaxed(v, o, n)                           \
-({                                                                     \
-       arch_cmpxchg_relaxed(&((v)->counter), (o), (n));                \
-})
-#endif
-
-#define arch_atomic_xchg(v, n)                                         \
-({                                                                     \
-       arch_xchg(&((v)->counter), (n));                                \
-})
-
-#ifdef arch_xchg_relaxed
-#define arch_atomic_xchg_relaxed(v, n)                                 \
-({                                                                     \
-       arch_xchg_relaxed(&((v)->counter), (n));                        \
-})
-#endif
-
 /*
  * 64-bit atomics
  */
index c5a8010..6b6db98 100644 (file)
@@ -159,6 +159,7 @@ arch_atomic64_cmpxchg(atomic64_t *ptr, s64 expected, s64 new)
 
        return prev;
 }
+#define arch_atomic64_cmpxchg arch_atomic64_cmpxchg
 
 static inline s64 arch_atomic64_xchg(atomic64_t *ptr, s64 new)
 {
@@ -179,14 +180,7 @@ static inline s64 arch_atomic64_xchg(atomic64_t *ptr, s64 new)
 
        return prev;
 }
-
-/**
- * arch_atomic64_dec_if_positive - decrement by 1 if old value positive
- * @v: pointer of type atomic64_t
- *
- * The function returns the old value of *v minus 1, even if
- * the atomic variable, v, was not decremented.
- */
+#define arch_atomic64_xchg arch_atomic64_xchg
 
 static inline s64 arch_atomic64_dec_if_positive(atomic64_t *v)
 {
@@ -212,15 +206,6 @@ static inline s64 arch_atomic64_dec_if_positive(atomic64_t *v)
 }
 #define arch_atomic64_dec_if_positive arch_atomic64_dec_if_positive
 
-/**
- * arch_atomic64_fetch_add_unless - add unless the number is a given value
- * @v: pointer of type atomic64_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
- *
- * Atomically adds @a to @v, if it was not @u.
- * Returns the old value of @v
- */
 static inline s64 arch_atomic64_fetch_add_unless(atomic64_t *v, s64 a, s64 u)
 {
        s64 old, temp;
index 0fb4b21..cef741b 100644 (file)
@@ -5,6 +5,7 @@ config ARM
        select ARCH_32BIT_OFF_T
        select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND
        select ARCH_HAS_BINFMT_FLAT
+       select ARCH_HAS_CPU_FINALIZE_INIT if MMU
        select ARCH_HAS_CURRENT_STACK_POINTER
        select ARCH_HAS_DEBUG_VIRTUAL if MMU
        select ARCH_HAS_DMA_WRITE_COMBINE if !ARM_DMA_MEM_BUFFERABLE
@@ -124,6 +125,7 @@ config ARM
        select HAVE_SYSCALL_TRACEPOINTS
        select HAVE_UID16
        select HAVE_VIRT_CPU_ACCOUNTING_GEN
+       select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
        select IRQ_FORCED_THREADING
        select MODULES_USE_ELF_REL
        select NEED_DMA_MAP_STATE
@@ -1780,7 +1782,7 @@ config VFP
          Say Y to include VFP support code in the kernel. This is needed
          if your hardware includes a VFP unit.
 
-         Please see <file:Documentation/arm/vfp/release-notes.rst> for
+         Please see <file:Documentation/arch/arm/vfp/release-notes.rst> for
          release notes and additional status information.
 
          Say N if your target does not have VFP hardware.
index 1feb6b0..627752f 100644 (file)
@@ -2,6 +2,7 @@
 #include <linux/libfdt_env.h>
 #include <asm/setup.h>
 #include <libfdt.h>
+#include "misc.h"
 
 #if defined(CONFIG_ARM_ATAG_DTB_COMPAT_CMDLINE_EXTEND)
 #define do_extend_cmdline 1
index 9291a26..aa85656 100644 (file)
@@ -3,6 +3,7 @@
 #include <linux/kernel.h>
 #include <linux/libfdt.h>
 #include <linux/sizes.h>
+#include "misc.h"
 
 static const void *get_prop(const void *fdt, const char *node_path,
                            const char *property, int minlen)
index abfed1a..6b4baa6 100644 (file)
@@ -103,9 +103,6 @@ static void putstr(const char *ptr)
 /*
  * gzip declarations
  */
-extern char input_data[];
-extern char input_data_end[];
-
 unsigned char *output_data;
 
 unsigned long free_mem_ptr;
@@ -131,9 +128,6 @@ asmlinkage void __div0(void)
        error("Attempting division by 0!");
 }
 
-extern int do_decompress(u8 *input, int len, u8 *output, void (*error)(char *x));
-
-
 void
 decompress_kernel(unsigned long output_start, unsigned long free_mem_ptr_p,
                unsigned long free_mem_ptr_end_p,
index c958dcc..6da00a2 100644 (file)
@@ -6,5 +6,16 @@
 void error(char *x) __noreturn;
 extern unsigned long free_mem_ptr;
 extern unsigned long free_mem_end_ptr;
+void __div0(void);
+void
+decompress_kernel(unsigned long output_start, unsigned long free_mem_ptr_p,
+                 unsigned long free_mem_ptr_end_p, int arch_id);
+void fortify_panic(const char *name);
+int atags_to_fdt(void *atag_list, void *fdt, int total_space);
+uint32_t fdt_check_mem_start(uint32_t mem_start, const void *fdt);
+int do_decompress(u8 *input, int len, u8 *output, void (*error)(char *x));
+
+extern char input_data[];
+extern char input_data_end[];
 
 #endif
index 8a9aeeb..e013ff1 100644 (file)
@@ -21,7 +21,7 @@
 /*
  * The public API for this code is documented in arch/arm/include/asm/mcpm.h.
  * For a comprehensive description of the main algorithm used here, please
- * see Documentation/arm/cluster-pm-race-avoidance.rst.
+ * see Documentation/arch/arm/cluster-pm-race-avoidance.rst.
  */
 
 struct sync_struct mcpm_sync;
index 299495c..f590e80 100644 (file)
@@ -5,7 +5,7 @@
  * Created by:  Nicolas Pitre, March 2012
  * Copyright:   (C) 2012-2013  Linaro Limited
  *
- * Refer to Documentation/arm/cluster-pm-race-avoidance.rst
+ * Refer to Documentation/arch/arm/cluster-pm-race-avoidance.rst
  * for details of the synchronisation algorithms used here.
  */
 
index 1fa09c4..c5eaed5 100644 (file)
@@ -6,7 +6,7 @@
  * Copyright:  (C) 2012-2013  Linaro Limited
  *
  * This algorithm is described in more detail in
- * Documentation/arm/vlocks.rst.
+ * Documentation/arch/arm/vlocks.rst.
  */
 
 #include <linux/linkage.h>
index 505a306..aebe2c8 100644 (file)
@@ -394,6 +394,23 @@ ALT_UP_B(.L0_\@)
 #endif
        .endm
 
+/*
+ * Raw SMP data memory barrier
+ */
+       .macro  __smp_dmb mode
+#if __LINUX_ARM_ARCH__ >= 7
+       .ifeqs "\mode","arm"
+       dmb     ish
+       .else
+       W(dmb)  ish
+       .endif
+#elif __LINUX_ARM_ARCH__ == 6
+       mcr     p15, 0, r0, c7, c10, 5  @ dmb
+#else
+       .error "Incompatible SMP platform"
+#endif
+       .endm
+
 #if defined(CONFIG_CPU_V7M)
        /*
         * setmode is used to assert to be in svc mode during boot. For v7-M
index db8512d..f0e3b01 100644 (file)
@@ -197,6 +197,16 @@ static inline int arch_atomic_fetch_##op(int i, atomic_t *v)               \
        return val;                                                     \
 }
 
+#define arch_atomic_add_return                 arch_atomic_add_return
+#define arch_atomic_sub_return                 arch_atomic_sub_return
+#define arch_atomic_fetch_add                  arch_atomic_fetch_add
+#define arch_atomic_fetch_sub                  arch_atomic_fetch_sub
+
+#define arch_atomic_fetch_and                  arch_atomic_fetch_and
+#define arch_atomic_fetch_andnot               arch_atomic_fetch_andnot
+#define arch_atomic_fetch_or                   arch_atomic_fetch_or
+#define arch_atomic_fetch_xor                  arch_atomic_fetch_xor
+
 static inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
 {
        int ret;
@@ -210,8 +220,7 @@ static inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
 
        return ret;
 }
-
-#define arch_atomic_fetch_andnot               arch_atomic_fetch_andnot
+#define arch_atomic_cmpxchg arch_atomic_cmpxchg
 
 #endif /* __LINUX_ARM_ARCH__ */
 
@@ -240,8 +249,6 @@ ATOMIC_OPS(xor, ^=, eor)
 #undef ATOMIC_OP_RETURN
 #undef ATOMIC_OP
 
-#define arch_atomic_xchg(v, new) (arch_xchg(&((v)->counter), new))
-
 #ifndef CONFIG_GENERIC_ATOMIC64
 typedef struct {
        s64 counter;
index 97a312b..fe38555 100644 (file)
@@ -1,7 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 /*
- *  arch/arm/include/asm/bugs.h
- *
  *  Copyright (C) 1995-2003 Russell King
  */
 #ifndef __ASM_BUGS_H
@@ -10,10 +8,8 @@
 extern void check_writebuffer_bugs(void);
 
 #ifdef CONFIG_MMU
-extern void check_bugs(void);
 extern void check_other_bugs(void);
 #else
-#define check_bugs() do { } while (0)
 #define check_other_bugs() do { } while (0)
 #endif
 
index 7e9251c..5be3ddc 100644 (file)
@@ -75,6 +75,10 @@ static inline bool arch_syscall_match_sym_name(const char *sym,
        return !strcasecmp(sym, name);
 }
 
+void prepare_ftrace_return(unsigned long *parent, unsigned long self,
+                          unsigned long frame_pointer,
+                          unsigned long stack_pointer);
+
 #endif /* ifndef __ASSEMBLY__ */
 
 #endif /* _ASM_ARM_FTRACE */
index a7c2337..18605f1 100644 (file)
@@ -27,7 +27,6 @@ struct irqaction;
 struct pt_regs;
 
 void handle_IRQ(unsigned int, struct pt_regs *);
-void init_IRQ(void);
 
 #ifdef CONFIG_SMP
 #include <linux/cpumask.h>
index 9349e7a..2b18a25 100644 (file)
@@ -56,7 +56,6 @@ struct machine_desc {
        void                    (*init_time)(void);
        void                    (*init_machine)(void);
        void                    (*init_late)(void);
-       void                    (*handle_irq)(struct pt_regs *);
        void                    (*restart)(enum reboot_mode, const char *);
 };
 
index 74bb594..28c63d1 100644 (file)
@@ -113,6 +113,28 @@ struct cpu_user_fns {
                        unsigned long vaddr, struct vm_area_struct *vma);
 };
 
+void fa_copy_user_highpage(struct page *to, struct page *from,
+       unsigned long vaddr, struct vm_area_struct *vma);
+void fa_clear_user_highpage(struct page *page, unsigned long vaddr);
+void feroceon_copy_user_highpage(struct page *to, struct page *from,
+       unsigned long vaddr, struct vm_area_struct *vma);
+void feroceon_clear_user_highpage(struct page *page, unsigned long vaddr);
+void v4_mc_copy_user_highpage(struct page *to, struct page *from,
+       unsigned long vaddr, struct vm_area_struct *vma);
+void v4_mc_clear_user_highpage(struct page *page, unsigned long vaddr);
+void v4wb_copy_user_highpage(struct page *to, struct page *from,
+       unsigned long vaddr, struct vm_area_struct *vma);
+void v4wb_clear_user_highpage(struct page *page, unsigned long vaddr);
+void v4wt_copy_user_highpage(struct page *to, struct page *from,
+       unsigned long vaddr, struct vm_area_struct *vma);
+void v4wt_clear_user_highpage(struct page *page, unsigned long vaddr);
+void xsc3_mc_copy_user_highpage(struct page *to, struct page *from,
+       unsigned long vaddr, struct vm_area_struct *vma);
+void xsc3_mc_clear_user_highpage(struct page *page, unsigned long vaddr);
+void xscale_mc_copy_user_highpage(struct page *to, struct page *from,
+       unsigned long vaddr, struct vm_area_struct *vma);
+void xscale_mc_clear_user_highpage(struct page *page, unsigned long vaddr);
+
 #ifdef MULTI_USER
 extern struct cpu_user_fns cpu_user;
 
index 483b8dd..7f44e88 100644 (file)
@@ -193,5 +193,8 @@ static inline unsigned long it_advance(unsigned long cpsr)
        return cpsr;
 }
 
+int syscall_trace_enter(struct pt_regs *regs);
+void syscall_trace_exit(struct pt_regs *regs);
+
 #endif /* __ASSEMBLY__ */
 #endif
index ba0872a..546af8b 100644 (file)
@@ -5,7 +5,7 @@
  *  Copyright (C) 1997-1999 Russell King
  *
  *  Structure passed to kernel to tell it about the
- *  hardware it's running on.  See Documentation/arm/setup.rst
+ *  hardware it's running on.  See Documentation/arch/arm/setup.rst
  *  for more info.
  */
 #ifndef __ASMARM_SETUP_H
@@ -28,4 +28,11 @@ extern void save_atags(const struct tag *tags);
 static inline void save_atags(const struct tag *tags) { }
 #endif
 
+struct machine_desc;
+void init_default_cache_policy(unsigned long);
+void paging_init(const struct machine_desc *desc);
+void early_mm_init(const struct machine_desc *);
+void adjust_lowmem_bounds(void);
+void setup_dma_zone(const struct machine_desc *desc);
+
 #endif
index 430be77..8b84092 100644 (file)
@@ -22,4 +22,9 @@ typedef struct {
 #define __ARCH_HAS_SA_RESTORER
 
 #include <asm/sigcontext.h>
+
+void do_rseq_syscall(struct pt_regs *regs);
+int do_work_pending(struct pt_regs *regs, unsigned int thread_flags,
+                   int syscall);
+
 #endif
index 7c1c90d..8c05a7f 100644 (file)
@@ -64,7 +64,7 @@ extern void secondary_startup_arm(void);
 
 extern int __cpu_disable(void);
 
-extern void __cpu_die(unsigned int cpu);
+static inline void __cpu_die(unsigned int cpu) { }
 
 extern void arch_send_call_function_single_ipi(int cpu);
 extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
index 85f9e53..d9c28b3 100644 (file)
@@ -35,4 +35,8 @@ static inline void spectre_v2_update_state(unsigned int state,
 
 int spectre_bhb_update_vectors(unsigned int method);
 
+void cpu_v7_ca8_ibe(void);
+void cpu_v7_ca15_ibe(void);
+void cpu_v7_bugs_init(void);
+
 #endif
index 5063142..be81b9c 100644 (file)
@@ -13,5 +13,6 @@ extern void cpu_resume(void);
 extern void cpu_resume_no_hyp(void);
 extern void cpu_resume_arm(void);
 extern int cpu_suspend(unsigned long, int (*)(unsigned long));
+extern void __cpu_suspend_save(u32 *ptr, u32 ptrsz, u32 sp, u32 *save_ptr);
 
 #endif
index 6f5d627..f46b3c5 100644 (file)
  * ops which are SMP safe even on a UP kernel.
  */
 
+/*
+ * Unordered
+ */
+
 #define sync_set_bit(nr, p)            _set_bit(nr, p)
 #define sync_clear_bit(nr, p)          _clear_bit(nr, p)
 #define sync_change_bit(nr, p)         _change_bit(nr, p)
-#define sync_test_and_set_bit(nr, p)   _test_and_set_bit(nr, p)
-#define sync_test_and_clear_bit(nr, p) _test_and_clear_bit(nr, p)
-#define sync_test_and_change_bit(nr, p)        _test_and_change_bit(nr, p)
 #define sync_test_bit(nr, addr)                test_bit(nr, addr)
-#define arch_sync_cmpxchg              arch_cmpxchg
 
+/*
+ * Fully ordered
+ */
+
+int _sync_test_and_set_bit(int nr, volatile unsigned long * p);
+#define sync_test_and_set_bit(nr, p)   _sync_test_and_set_bit(nr, p)
+
+int _sync_test_and_clear_bit(int nr, volatile unsigned long * p);
+#define sync_test_and_clear_bit(nr, p) _sync_test_and_clear_bit(nr, p)
+
+int _sync_test_and_change_bit(int nr, volatile unsigned long * p);
+#define sync_test_and_change_bit(nr, p)        _sync_test_and_change_bit(nr, p)
+
+#define arch_sync_cmpxchg(ptr, old, new)                               \
+({                                                                     \
+       __typeof__(*(ptr)) __ret;                                       \
+       __smp_mb__before_atomic();                                      \
+       __ret = arch_cmpxchg_relaxed((ptr), (old), (new));              \
+       __smp_mb__after_atomic();                                       \
+       __ret;                                                          \
+})
 
 #endif
diff --git a/arch/arm/include/asm/syscalls.h b/arch/arm/include/asm/syscalls.h
new file mode 100644 (file)
index 0000000..5912e7c
--- /dev/null
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_SYSCALLS_H
+#define __ASM_SYSCALLS_H
+
+#include <linux/linkage.h>
+#include <linux/types.h>
+
+struct pt_regs;
+asmlinkage int sys_sigreturn(struct pt_regs *regs);
+asmlinkage int sys_rt_sigreturn(struct pt_regs *regs);
+asmlinkage long sys_arm_fadvise64_64(int fd, int advice,
+                                    loff_t offset, loff_t len);
+
+struct oldabi_stat64;
+asmlinkage long sys_oabi_stat64(const char __user * filename,
+                               struct oldabi_stat64 __user * statbuf);
+asmlinkage long sys_oabi_lstat64(const char __user * filename,
+                                struct oldabi_stat64 __user * statbuf);
+asmlinkage long sys_oabi_fstat64(unsigned long fd,
+                                struct oldabi_stat64 __user * statbuf);
+asmlinkage long sys_oabi_fstatat64(int dfd,
+                                  const char __user *filename,
+                                  struct oldabi_stat64  __user *statbuf,
+                                  int flag);
+asmlinkage long sys_oabi_fcntl64(unsigned int fd, unsigned int cmd,
+                                unsigned long arg);
+struct oabi_epoll_event;
+asmlinkage long sys_oabi_epoll_ctl(int epfd, int op, int fd,
+                                  struct oabi_epoll_event __user *event);
+struct oabi_sembuf;
+struct old_timespec32;
+asmlinkage long sys_oabi_semtimedop(int semid,
+                                   struct oabi_sembuf __user *tsops,
+                                   unsigned nsops,
+                                   const struct old_timespec32 __user *timeout);
+asmlinkage long sys_oabi_semop(int semid, struct oabi_sembuf __user *tsops,
+                              unsigned nsops);
+asmlinkage int sys_oabi_ipc(uint call, int first, int second, int third,
+                           void __user *ptr, long fifth);
+struct sockaddr;
+asmlinkage long sys_oabi_bind(int fd, struct sockaddr __user *addr, int addrlen);
+asmlinkage long sys_oabi_connect(int fd, struct sockaddr __user *addr, int addrlen);
+asmlinkage long sys_oabi_sendto(int fd, void __user *buff,
+                               size_t len, unsigned flags,
+                               struct sockaddr __user *addr,
+                               int addrlen);
+struct user_msghdr;
+asmlinkage long sys_oabi_sendmsg(int fd, struct user_msghdr __user *msg, unsigned flags);
+asmlinkage long sys_oabi_socketcall(int call, unsigned long __user *args);
+
+#endif
index d8bd8a4..e1f7dca 100644 (file)
@@ -9,9 +9,7 @@
 #ifndef __ASMARM_TCM_H
 #define __ASMARM_TCM_H
 
-#ifndef CONFIG_HAVE_TCM
-#error "You should not be including tcm.h unless you have a TCM!"
-#endif
+#ifdef CONFIG_HAVE_TCM
 
 #include <linux/compiler.h>
 
@@ -29,4 +27,11 @@ void tcm_free(void *addr, size_t len);
 bool tcm_dtcm_present(void);
 bool tcm_itcm_present(void);
 
+void __init tcm_init(void);
+#else
+/* No TCM support, just blank inlines to be optimized out */
+static inline void tcm_init(void)
+{
+}
+#endif
 #endif
index 987fefb..0aaefe3 100644 (file)
@@ -35,4 +35,13 @@ extern void ptrace_break(struct pt_regs *regs);
 
 extern void *vectors_page;
 
+asmlinkage void dump_backtrace_stm(u32 *stack, u32 instruction, const char *loglvl);
+asmlinkage void do_undefinstr(struct pt_regs *regs);
+asmlinkage void handle_fiq_as_nmi(struct pt_regs *regs);
+asmlinkage void bad_mode(struct pt_regs *regs, int reason);
+asmlinkage int arm_syscall(int no, struct pt_regs *regs);
+asmlinkage void baddataabort(int code, unsigned long instr, struct pt_regs *regs);
+asmlinkage void __div0(void);
+asmlinkage void handle_bad_stack(struct pt_regs *regs);
+
 #endif
index b51f854..d60b09a 100644 (file)
@@ -40,6 +40,10 @@ extern void unwind_table_del(struct unwind_table *tab);
 extern void unwind_backtrace(struct pt_regs *regs, struct task_struct *tsk,
                             const char *loglvl);
 
+void __aeabi_unwind_cpp_pr0(void);
+void __aeabi_unwind_cpp_pr1(void);
+void __aeabi_unwind_cpp_pr2(void);
+
 #endif /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_ARM_UNWIND
index 5b85889..422c3af 100644 (file)
@@ -24,6 +24,11 @@ static inline void arm_install_vdso(struct mm_struct *mm, unsigned long addr)
 
 #endif /* CONFIG_VDSO */
 
+int __vdso_clock_gettime(clockid_t clock, struct old_timespec32 *ts);
+int __vdso_clock_gettime64(clockid_t clock, struct __kernel_timespec *ts);
+int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
+int __vdso_clock_getres(clockid_t clock_id, struct old_timespec32 *res);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
index 157ea34..5b57b87 100644 (file)
 
 #ifndef __ASSEMBLY__
 void vfp_disable(void);
+void VFP_bounce(u32 trigger, u32 fpexc, struct pt_regs *regs);
 #endif
 
 #endif /* __ASM_VFP_H */
index 25ceda6..8e50e03 100644 (file)
@@ -9,7 +9,7 @@
  * published by the Free Software Foundation.
  *
  *  Structure passed to kernel to tell it about the
- *  hardware it's running on.  See Documentation/arm/setup.rst
+ *  hardware it's running on.  See Documentation/arch/arm/setup.rst
  *  for more info.
  */
 #ifndef _UAPI__ASMARM_SETUP_H
index 373b61f..33f6eb5 100644 (file)
@@ -127,7 +127,7 @@ static int __init parse_tag_cmdline(const struct tag *tag)
 #elif defined(CONFIG_CMDLINE_FORCE)
        pr_warn("Ignoring tag cmdline (using the default kernel command line)\n");
 #else
-       strlcpy(default_command_line, tag->u.cmdline.cmdline,
+       strscpy(default_command_line, tag->u.cmdline.cmdline,
                COMMAND_LINE_SIZE);
 #endif
        return 0;
@@ -224,7 +224,7 @@ setup_machine_tags(void *atags_vaddr, unsigned int machine_nr)
        }
 
        /* parse_early_param needs a boot_command_line */
-       strlcpy(boot_command_line, from, COMMAND_LINE_SIZE);
+       strscpy(boot_command_line, from, COMMAND_LINE_SIZE);
 
        return mdesc;
 }
index 14c8dbb..087bce6 100644 (file)
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/init.h>
+#include <linux/cpu.h>
 #include <asm/bugs.h>
 #include <asm/proc-fns.h>
 
@@ -11,7 +12,7 @@ void check_other_bugs(void)
 #endif
 }
 
-void __init check_bugs(void)
+void __init arch_cpu_finalize_init(void)
 {
        check_writebuffer_bugs();
        check_other_bugs();
index c39303e..291dc48 100644 (file)
@@ -875,7 +875,7 @@ ENDPROC(__bad_stack)
  * existing ones.  This mechanism should be used only for things that are
  * really small and justified, and not be abused freely.
  *
- * See Documentation/arm/kernel_user_helpers.rst for formal definitions.
+ * See Documentation/arch/arm/kernel_user_helpers.rst for formal definitions.
  */
  THUMB(        .arm    )
 
index 98ca3e3..d2c8e53 100644 (file)
@@ -45,6 +45,7 @@
 #include <asm/cacheflush.h>
 #include <asm/cp15.h>
 #include <asm/fiq.h>
+#include <asm/mach/irq.h>
 #include <asm/irq.h>
 #include <asm/traps.h>
 
index 89a5210..225c069 100644 (file)
@@ -8,16 +8,13 @@
 
 #include <linux/init.h>
 #include <linux/zutil.h>
+#include "head.h"
 
 /* for struct inflate_state */
 #include "../../../lib/zlib_inflate/inftrees.h"
 #include "../../../lib/zlib_inflate/inflate.h"
 #include "../../../lib/zlib_inflate/infutil.h"
 
-extern char __data_loc[];
-extern char _edata_loc[];
-extern char _sdata[];
-
 /*
  * This code is called very early during the boot process to decompress
  * the .data segment stored compressed in ROM. Therefore none of the global
diff --git a/arch/arm/kernel/head.h b/arch/arm/kernel/head.h
new file mode 100644 (file)
index 0000000..0eb5acc
--- /dev/null
@@ -0,0 +1,7 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+extern char __data_loc[];
+extern char _edata_loc[];
+extern char _sdata[];
+
+int __init __inflate_kernel_data(void);
index d59c36d..e74d84f 100644 (file)
@@ -169,8 +169,7 @@ apply_relocate(Elf32_Shdr *sechdrs, const char *strtab, unsigned int symindex,
 
                        offset = __mem_to_opcode_arm(*(u32 *)loc);
                        offset = (offset & 0x00ffffff) << 2;
-                       if (offset & 0x02000000)
-                               offset -= 0x04000000;
+                       offset = sign_extend32(offset, 25);
 
                        offset += sym->st_value - loc;
 
@@ -236,7 +235,7 @@ apply_relocate(Elf32_Shdr *sechdrs, const char *strtab, unsigned int symindex,
                case R_ARM_MOVT_PREL:
                        offset = tmp = __mem_to_opcode_arm(*(u32 *)loc);
                        offset = ((offset & 0xf0000) >> 4) | (offset & 0xfff);
-                       offset = (offset ^ 0x8000) - 0x8000;
+                       offset = sign_extend32(offset, 15);
 
                        offset += sym->st_value;
                        if (ELF32_R_TYPE(rel->r_info) == R_ARM_MOVT_PREL ||
@@ -344,8 +343,7 @@ apply_relocate(Elf32_Shdr *sechdrs, const char *strtab, unsigned int symindex,
                                ((~(j2 ^ sign) & 1) << 22) |
                                ((upper & 0x03ff) << 12) |
                                ((lower & 0x07ff) << 1);
-                       if (offset & 0x01000000)
-                               offset -= 0x02000000;
+                       offset = sign_extend32(offset, 24);
                        offset += sym->st_value - loc;
 
                        /*
@@ -401,7 +399,7 @@ apply_relocate(Elf32_Shdr *sechdrs, const char *strtab, unsigned int symindex,
                        offset = ((upper & 0x000f) << 12) |
                                ((upper & 0x0400) << 1) |
                                ((lower & 0x7000) >> 4) | (lower & 0x00ff);
-                       offset = (offset ^ 0x8000) - 0x8000;
+                       offset = sign_extend32(offset, 15);
                        offset += sym->st_value;
 
                        if (ELF32_R_TYPE(rel->r_info) == R_ARM_THM_MOVT_PREL ||
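
The open-coded sign extensions replaced in these relocation hunks all follow one pattern; a small sketch (not from the patch) showing that sign_extend32() from include/linux/bitops.h computes the same value, here for the 26-bit branch offset whose sign bit is bit 25:

	#include <linux/bitops.h>	/* sign_extend32() */
	#include <linux/types.h>

	/* Illustrative check: the removed "if (offset & 0x02000000)
	 * offset -= 0x04000000;" form and sign_extend32(offset, 25)
	 * agree for every 26-bit field value. */
	static bool sign_extend_matches(u32 offset)
	{
		s32 old_way = offset & 0x03ffffff;	/* 26-bit field */
		s32 new_way = sign_extend32(offset & 0x03ffffff, 25);

		if (old_way & 0x02000000)
			old_way -= 0x04000000;

		return old_way == new_way;
	}

The MOVW/MOVT cases use the same helper with sign bit 15 for their 16-bit immediates.
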
index 75cd469..c66b560 100644 (file)
@@ -76,13 +76,6 @@ static int __init fpe_setup(char *line)
 __setup("fpe=", fpe_setup);
 #endif
 
-extern void init_default_cache_policy(unsigned long);
-extern void paging_init(const struct machine_desc *desc);
-extern void early_mm_init(const struct machine_desc *);
-extern void adjust_lowmem_bounds(void);
-extern enum reboot_mode reboot_mode;
-extern void setup_dma_zone(const struct machine_desc *desc);
-
 unsigned int processor_id;
 EXPORT_SYMBOL(processor_id);
 unsigned int __machine_arch_type __read_mostly;
@@ -1142,7 +1135,7 @@ void __init setup_arch(char **cmdline_p)
        setup_initial_init_mm(_text, _etext, _edata, _end);
 
        /* populate cmd_line too for later use, preserving boot_command_line */
-       strlcpy(cmd_line, boot_command_line, COMMAND_LINE_SIZE);
+       strscpy(cmd_line, boot_command_line, COMMAND_LINE_SIZE);
        *cmdline_p = cmd_line;
 
        early_fixmap_init();
@@ -1198,10 +1191,6 @@ void __init setup_arch(char **cmdline_p)
 
        reserve_crashkernel();
 
-#ifdef CONFIG_GENERIC_IRQ_MULTI_HANDLER
-       handle_arch_irq = mdesc->handle_irq;
-#endif
-
 #ifdef CONFIG_VT
 #if defined(CONFIG_VGA_CONSOLE)
        conswitchp = &vga_con;
index e07f359..8d0afa1 100644 (file)
@@ -18,6 +18,7 @@
 #include <asm/traps.h>
 #include <asm/unistd.h>
 #include <asm/vfp.h>
+#include <asm/syscalls.h>
 
 #include "signal.h"
 
index 87f8d0e..6756203 100644 (file)
@@ -288,15 +288,11 @@ int __cpu_disable(void)
 }
 
 /*
- * called on the thread which is asking for a CPU to be shutdown -
- * waits until shutdown has completed, or it is timed out.
+ * called on the thread which is asking for a CPU to be shutdown after the
+ * shutdown completed.
  */
-void __cpu_die(unsigned int cpu)
+void arch_cpuhp_cleanup_dead_cpu(unsigned int cpu)
 {
-       if (!cpu_wait_death(cpu, 5)) {
-               pr_err("CPU%u: cpu didn't die\n", cpu);
-               return;
-       }
        pr_debug("CPU%u: shutdown\n", cpu);
 
        clear_tasks_mm_cpumask(cpu);
@@ -336,11 +332,11 @@ void __noreturn arch_cpu_idle_dead(void)
        flush_cache_louis();
 
        /*
-        * Tell __cpu_die() that this CPU is now safe to dispose of.  Once
-        * this returns, power and/or clocks can be removed at any point
-        * from this CPU and its cache by platform_cpu_kill().
+        * Tell cpuhp_bp_sync_dead() that this CPU is now safe to dispose
+        * of. Once this returns, power and/or clocks can be removed at
+        * any point from this CPU and its cache by platform_cpu_kill().
         */
-       (void)cpu_report_death();
+       cpuhp_ap_report_dead();
 
        /*
         * Ensure that the cache lines associated with that completion are
index a5f183c..0141e9b 100644 (file)
@@ -24,6 +24,7 @@
 #include <linux/ipc.h>
 #include <linux/uaccess.h>
 #include <linux/slab.h>
+#include <asm/syscalls.h>
 
 /*
  * Since loff_t is a 64 bit type we avoid a lot of ABI hassle
index 0061631..d00f404 100644 (file)
@@ -10,6 +10,8 @@
  *  Copyright: MontaVista Software, Inc.
  */
 
+#include <asm/syscalls.h>
+
 /*
  * The legacy ABI and the new ARM EABI have different rules making some
  * syscalls incompatible especially with structure arguments.
index 40c7c80..3bad79d 100644 (file)
@@ -756,6 +756,7 @@ void __readwrite_bug(const char *fn)
 }
 EXPORT_SYMBOL(__readwrite_bug);
 
+#ifdef CONFIG_MMU
 void __pte_error(const char *file, int line, pte_t pte)
 {
        pr_err("%s:%d: bad pte %08llx.\n", file, line, (long long)pte_val(pte));
@@ -770,6 +771,7 @@ void __pgd_error(const char *file, int line, pgd_t pgd)
 {
        pr_err("%s:%d: bad pgd %08llx.\n", file, line, (long long)pgd_val(pgd));
 }
+#endif
 
 asmlinkage void __div0(void)
 {
index 3408269..f297d66 100644 (file)
@@ -135,7 +135,7 @@ static Elf32_Sym * __init find_symbol(struct elfinfo *lib, const char *symname)
 
                if (lib->dynsym[i].st_name == 0)
                        continue;
-               strlcpy(name, lib->dynstr + lib->dynsym[i].st_name,
+               strscpy(name, lib->dynstr + lib->dynsym[i].st_name,
                        MAX_SYMNAME);
                c = strchr(name, '@');
                if (c)
index 95bd359..f069d1b 100644 (file)
@@ -28,7 +28,7 @@ UNWIND(       .fnend          )
 ENDPROC(\name          )
        .endm
 
-       .macro  testop, name, instr, store
+       .macro  __testop, name, instr, store, barrier
 ENTRY( \name           )
 UNWIND(        .fnstart        )
        ands    ip, r1, #3
@@ -38,7 +38,7 @@ UNWIND(       .fnstart        )
        mov     r0, r0, lsr #5
        add     r1, r1, r0, lsl #2      @ Get word offset
        mov     r3, r2, lsl r3          @ create mask
-       smp_dmb
+       \barrier
 #if __LINUX_ARM_ARCH__ >= 7 && defined(CONFIG_SMP)
        .arch_extension mp
        ALT_SMP(W(pldw) [r1])
@@ -50,13 +50,21 @@ UNWIND(     .fnstart        )
        strex   ip, r2, [r1]
        cmp     ip, #0
        bne     1b
-       smp_dmb
+       \barrier
        cmp     r0, #0
        movne   r0, #1
 2:     bx      lr
 UNWIND(        .fnend          )
 ENDPROC(\name          )
        .endm
+
+       .macro  testop, name, instr, store
+       __testop \name, \instr, \store, smp_dmb
+       .endm
+
+       .macro  sync_testop, name, instr, store
+       __testop \name, \instr, \store, __smp_dmb
+       .endm
 #else
        .macro  bitop, name, instr
 ENTRY( \name           )
index 4ebecc6..f13fe9b 100644 (file)
@@ -10,3 +10,7 @@
                 .text
 
 testop _test_and_change_bit, eor, str
+
+#if __LINUX_ARM_ARCH__ >= 6
+sync_testop    _sync_test_and_change_bit, eor, str
+#endif
index 009afa0..4d2c5ca 100644 (file)
@@ -10,3 +10,7 @@
                 .text
 
 testop _test_and_clear_bit, bicne, strne
+
+#if __LINUX_ARM_ARCH__ >= 6
+sync_testop    _sync_test_and_clear_bit, bicne, strne
+#endif
index f3192e5..649dbab 100644 (file)
@@ -10,3 +10,7 @@
                 .text
 
 testop _test_and_set_bit, orreq, streq
+
+#if __LINUX_ARM_ARCH__ >= 6
+sync_testop    _sync_test_and_set_bit, orreq, streq
+#endif
index e4c2677..2f6163f 100644 (file)
@@ -74,6 +74,9 @@ pin_page_for_write(const void __user *_addr, pte_t **ptep, spinlock_t **ptlp)
                return 0;
 
        pte = pte_offset_map_lock(current->mm, pmd, addr, &ptl);
+       if (unlikely(!pte))
+               return 0;
+
        if (unlikely(!pte_present(*pte) || !pte_young(*pte) ||
            !pte_write(*pte) || !pte_dirty(*pte))) {
                pte_unmap_unlock(pte, ptl);
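
This NULL check, like the ones added below in fault-armv.c and fault.c, follows this cycle's mm rework in which pte_offset_map() and pte_offset_map_lock() may fail instead of always returning a mapped PTE. A minimal sketch of the calling pattern the fixups adopt; the function and variable names are illustrative, not from the patch:

	#include <linux/mm.h>

	/* Illustrative walker: if pte_offset_map_lock() returns NULL (e.g. the
	 * PMD changed underneath us), bail out and let the caller retry or
	 * fall back, rather than dereferencing a NULL pte. */
	static int touch_one_pte(struct mm_struct *mm, pmd_t *pmd,
				 unsigned long addr)
	{
		spinlock_t *ptl;
		pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

		if (!pte)
			return 0;	/* treated as "nothing to do" */

		/* ... inspect or update *pte here ... */

		pte_unmap_unlock(pte, ptl);
		return 1;
	}
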
index 29eb075..b5287ff 100644 (file)
@@ -106,7 +106,7 @@ void exynos_firmware_init(void);
 #define C2_STATE       (1 << 3)
 /*
  * Magic values for bootloader indicating chosen low power mode.
- * See also Documentation/arm/samsung/bootloader-interface.rst
+ * See also Documentation/arch/arm/samsung/bootloader-interface.rst
  */
 #define EXYNOS_SLEEP_MAGIC     0x00000bad
 #define EXYNOS_AFTR_MAGIC      0xfcba0d10
index 51e4705..3faf9a1 100644 (file)
@@ -11,7 +11,6 @@
 #include <linux/err.h>
 #include <linux/gpio.h>
 #include <linux/init.h>
-#include <linux/irqchip/mxs.h>
 #include <linux/reboot.h>
 #include <linux/micrel_phy.h>
 #include <linux/of_address.h>
@@ -472,7 +471,6 @@ static const char *const mxs_dt_compat[] __initconst = {
 };
 
 DT_MACHINE_START(MXS, "Freescale MXS (Device Tree)")
-       .handle_irq     = icoll_handle_irq,
        .init_machine   = mxs_machine_init,
        .init_late      = mxs_pm_init,
        .dt_compat      = mxs_dt_compat,
index 9108c87..8813920 100644 (file)
@@ -877,7 +877,6 @@ MACHINE_START(AMS_DELTA, "Amstrad E3 (Delta)")
        .map_io         = ams_delta_map_io,
        .init_early     = omap1_init_early,
        .init_irq       = omap1_init_irq,
-       .handle_irq     = omap1_handle_irq,
        .init_machine   = ams_delta_init,
        .init_late      = ams_delta_init_late,
        .init_time      = omap1_timer_init,
index a501a47..b56cea9 100644 (file)
@@ -291,7 +291,6 @@ MACHINE_START(NOKIA770, "Nokia 770")
        .map_io         = omap1_map_io,
        .init_early     = omap1_init_early,
        .init_irq       = omap1_init_irq,
-       .handle_irq     = omap1_handle_irq,
        .init_machine   = omap_nokia770_init,
        .init_late      = omap1_init_late,
        .init_time      = omap1_timer_init,
index df758c1..46eda4f 100644 (file)
@@ -389,7 +389,6 @@ MACHINE_START(OMAP_OSK, "TI-OSK")
        .map_io         = omap1_map_io,
        .init_early     = omap1_init_early,
        .init_irq       = omap1_init_irq,
-       .handle_irq     = omap1_handle_irq,
        .init_machine   = osk_init,
        .init_late      = omap1_init_late,
        .init_time      = omap1_timer_init,
index f79c497..91df3dc 100644 (file)
@@ -259,7 +259,6 @@ MACHINE_START(OMAP_PALMTE, "OMAP310 based Palm Tungsten E")
        .map_io         = omap1_map_io,
        .init_early     = omap1_init_early,
        .init_irq       = omap1_init_irq,
-       .handle_irq     = omap1_handle_irq,
        .init_machine   = omap_palmte_init,
        .init_late      = omap1_init_late,
        .init_time      = omap1_timer_init,
index 0c0cdd5..3ae295a 100644 (file)
@@ -338,7 +338,6 @@ MACHINE_START(SX1, "OMAP310 based Siemens SX1")
        .map_io         = omap1_map_io,
        .init_early     = omap1_init_early,
        .init_irq       = omap1_init_irq,
-       .handle_irq     = omap1_handle_irq,
        .init_machine   = omap_sx1_init,
        .init_late      = omap1_init_late,
        .init_time      = omap1_timer_init,
index bfc7ab0..3d9e72e 100644 (file)
@@ -37,6 +37,7 @@
  */
 #include <linux/gpio.h>
 #include <linux/init.h>
+#include <linux/irq.h>
 #include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/interrupt.h>
@@ -254,4 +255,6 @@ void __init omap1_init_irq(void)
                ct = irq_data_get_chip_type(d);
                ct->chip.irq_unmask(d);
        }
+
+       set_handle_irq(omap1_handle_irq);
 }
index 72b08a9..6b7197a 100644 (file)
@@ -233,7 +233,6 @@ MACHINE_START(GUMSTIX, "Gumstix")
        .map_io         = pxa25x_map_io,
        .nr_irqs        = PXA_NR_IRQS,
        .init_irq       = pxa25x_init_irq,
-       .handle_irq     = pxa25x_handle_irq,
        .init_time      = pxa_timer_init,
        .init_machine   = gumstix_init,
        .restart        = pxa_restart,
index 1b83be1..032dc89 100644 (file)
@@ -143,6 +143,7 @@ set_pwer:
 void __init pxa25x_init_irq(void)
 {
        pxa_init_irq(32, pxa25x_set_wake);
+       set_handle_irq(pxa25x_handle_irq);
 }
 
 static int __init __init
index 4135ba2..c9b5642 100644 (file)
@@ -228,6 +228,7 @@ static int pxa27x_set_wake(struct irq_data *d, unsigned int on)
 void __init pxa27x_init_irq(void)
 {
        pxa_init_irq(34, pxa27x_set_wake);
+       set_handle_irq(pxa27x_handle_irq);
 }
 
 static int __init
index 4325bdc..042922a 100644 (file)
@@ -1043,7 +1043,6 @@ MACHINE_START(SPITZ, "SHARP Spitz")
        .map_io         = pxa27x_map_io,
        .nr_irqs        = PXA_NR_IRQS,
        .init_irq       = pxa27x_init_irq,
-       .handle_irq     = pxa27x_handle_irq,
        .init_machine   = spitz_init,
        .init_time      = pxa_timer_init,
        .restart        = spitz_restart,
@@ -1056,7 +1055,6 @@ MACHINE_START(BORZOI, "SHARP Borzoi")
        .map_io         = pxa27x_map_io,
        .nr_irqs        = PXA_NR_IRQS,
        .init_irq       = pxa27x_init_irq,
-       .handle_irq     = pxa27x_handle_irq,
        .init_machine   = spitz_init,
        .init_time      = pxa_timer_init,
        .restart        = spitz_restart,
@@ -1069,7 +1067,6 @@ MACHINE_START(AKITA, "SHARP Akita")
        .map_io         = pxa27x_map_io,
        .nr_irqs        = PXA_NR_IRQS,
        .init_irq       = pxa27x_init_irq,
-       .handle_irq     = pxa27x_handle_irq,
        .init_machine   = spitz_init,
        .init_time      = pxa_timer_init,
        .restart        = spitz_restart,
index b2d45cf..b3842c9 100644 (file)
@@ -21,7 +21,7 @@ menuconfig ARCH_STI
        help
          Include support for STMicroelectronics' STiH415/416, STiH407/10 and
          STiH418 family SoCs using the Device Tree for discovery.  More
-         information can be found in Documentation/arm/sti/ and
+         information can be found in Documentation/arch/arm/sti/ and
          Documentation/devicetree.
 
 if ARCH_STI
index be183ed..c164cde 100644 (file)
@@ -712,7 +712,7 @@ config ARM_VIRT_EXT
          assistance.
 
          A compliant bootloader is required in order to make maximum
-         use of this feature.  Refer to Documentation/arm/booting.rst for
+         use of this feature.  Refer to Documentation/arch/arm/booting.rst for
          details.
 
 config SWP_EMULATE
@@ -904,7 +904,7 @@ config KUSER_HELPERS
          the CPU type fitted to the system.  This permits binaries to be
          run on ARMv4 through to ARMv7 without modification.
 
-         See Documentation/arm/kernel_user_helpers.rst for details.
+         See Documentation/arch/arm/kernel_user_helpers.rst for details.
 
          However, the fixed address nature of these helpers can be used
          by ROP (return orientated programming) authors when creating
index b4a3335..bc4ed5c 100644 (file)
@@ -258,12 +258,14 @@ static struct dma_contig_early_reserve dma_mmu_remap[MAX_CMA_AREAS] __initdata;
 
 static int dma_mmu_remap_num __initdata;
 
+#ifdef CONFIG_DMA_CMA
 void __init dma_contiguous_early_fixup(phys_addr_t base, unsigned long size)
 {
        dma_mmu_remap[dma_mmu_remap_num].base = base;
        dma_mmu_remap[dma_mmu_remap_num].size = size;
        dma_mmu_remap_num++;
 }
+#endif
 
 void __init dma_contiguous_remap(void)
 {
index 0e49154..ca5302b 100644 (file)
@@ -117,8 +117,11 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
         * must use the nested version.  This also means we need to
         * open-code the spin-locking.
         */
-       ptl = pte_lockptr(vma->vm_mm, pmd);
        pte = pte_offset_map(pmd, address);
+       if (!pte)
+               return 0;
+
+       ptl = pte_lockptr(vma->vm_mm, pmd);
        do_pte_lock(ptl);
 
        ret = do_adjust_pte(vma, address, pfn, pte);
index 2418f1e..8359864 100644 (file)
@@ -85,6 +85,9 @@ void show_pte(const char *lvl, struct mm_struct *mm, unsigned long addr)
                        break;
 
                pte = pte_offset_map(pmd, addr);
+               if (!pte)
+                       break;
+
                pr_cont(", *pte=%08llx", (long long)pte_val(*pte));
 #ifndef CONFIG_ARM_LPAE
                pr_cont(", *ppte=%08llx",
index 54927ba..e8f8c19 100644 (file)
@@ -37,5 +37,9 @@ static inline int fsr_fs(unsigned int fsr)
 
 void do_bad_area(unsigned long addr, unsigned int fsr, struct pt_regs *regs);
 void early_abt_enable(void);
+asmlinkage void do_DataAbort(unsigned long addr, unsigned int fsr,
+                            struct pt_regs *regs);
+asmlinkage void do_PrefetchAbort(unsigned long addr, unsigned int ifsr,
+                                struct pt_regs *regs);
 
 #endif /* __ARCH_ARM_FAULT_H */
index 7ff9fee..2508be9 100644 (file)
@@ -354,6 +354,7 @@ EXPORT_SYMBOL(flush_dcache_page);
  *  memcpy() to/from page
  *  if written to page, flush_dcache_page()
  */
+void __flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned long vmaddr);
 void __flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned long vmaddr)
 {
        unsigned long pfn;
index 463fc2a..f3a52c0 100644 (file)
@@ -21,6 +21,7 @@
 #include <asm/sections.h>
 #include <asm/setup.h>
 #include <asm/smp_plat.h>
+#include <asm/tcm.h>
 #include <asm/tlb.h>
 #include <asm/highmem.h>
 #include <asm/system_info.h>
@@ -37,7 +38,6 @@
 
 #include "fault.h"
 #include "mm.h"
-#include "tcm.h"
 
 extern unsigned long __atags_pointer;
 
index 53f2d87..43cfd06 100644 (file)
@@ -21,6 +21,7 @@
 #include <asm/cputype.h>
 #include <asm/mpu.h>
 #include <asm/procinfo.h>
+#include <asm/idmap.h>
 
 #include "mm.h"
 
diff --git a/arch/arm/mm/tcm.h b/arch/arm/mm/tcm.h
deleted file mode 100644 (file)
index 6b80a76..0000000
+++ /dev/null
@@ -1,17 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2008-2009 ST-Ericsson AB
- * TCM memory handling for ARM systems
- *
- * Author: Linus Walleij <linus.walleij@stericsson.com>
- * Author: Rickard Andersson <rickard.andersson@stericsson.com>
- */
-
-#ifdef CONFIG_HAVE_TCM
-void __init tcm_init(void);
-#else
-/* No TCM support, just blank inlines to be optimized out */
-static inline void tcm_init(void)
-{
-}
-#endif
index 4d72099..eba7ac4 100644 (file)
@@ -40,7 +40,7 @@ enum probes_insn checker_stack_use_imm_0xx(probes_opcode_t insn,
  * Different from other insn uses imm8, the real addressing offset of
  * STRD in T32 encoding should be imm8 * 4. See ARMARM description.
  */
-enum probes_insn checker_stack_use_t32strd(probes_opcode_t insn,
+static enum probes_insn checker_stack_use_t32strd(probes_opcode_t insn,
                struct arch_probes_insn *asi,
                const struct decode_header *h)
 {
index 9090c3a..d8238da 100644 (file)
@@ -233,7 +233,7 @@ singlestep(struct kprobe *p, struct pt_regs *regs, struct kprobe_ctlblk *kcb)
  * kprobe, and that level is reserved for user kprobe handlers, so we can't
  * risk encountering a new kprobe in an interrupt handler.
  */
-void __kprobes kprobe_handler(struct pt_regs *regs)
+static void __kprobes kprobe_handler(struct pt_regs *regs)
 {
        struct kprobe *p, *cur;
        struct kprobe_ctlblk *kcb;
index dbef34e..7f65048 100644 (file)
@@ -145,8 +145,6 @@ __arch_remove_optimized_kprobe(struct optimized_kprobe *op, int dirty)
        }
 }
 
-extern void kprobe_handler(struct pt_regs *regs);
-
 static void
 optimized_callback(struct optimized_kprobe *op, struct pt_regs *regs)
 {
index c562832..171c707 100644 (file)
@@ -720,7 +720,7 @@ static const char coverage_register_lookup[16] = {
        [REG_TYPE_NOSPPCX]      = COVERAGE_ANY_REG | COVERAGE_SP,
 };
 
-unsigned coverage_start_registers(const struct decode_header *h)
+static unsigned coverage_start_registers(const struct decode_header *h)
 {
        unsigned regs = 0;
        int i;
index 56ad3c0..c729703 100644 (file)
@@ -454,3 +454,7 @@ void kprobe_thumb32_test_cases(void);
 #else
 void kprobe_arm_test_cases(void);
 #endif
+
+void __kprobes_test_case_start(void);
+void __kprobes_test_case_end_16(void);
+void __kprobes_test_case_end_32(void);
index 9e74c7f..97e2bfa 100644 (file)
@@ -7,7 +7,7 @@
 #   http://www.arm.linux.org.uk/developer/machines/download.php
 #
 # Please do not send patches to this file; it is automatically generated!
-# To add an entry into this database, please see Documentation/arm/arm.rst,
+# To add an entry into this database, please see Documentation/arch/arm/arm.rst,
 # or visit:
 #
 #   http://www.arm.linux.org.uk/developer/machines/?action=new
index ac96461..8ebed8a 100644 (file)
 448    common  process_mrelease                sys_process_mrelease
 449    common  futex_waitv                     sys_futex_waitv
 450    common  set_mempolicy_home_node         sys_set_mempolicy_home_node
+451    common  cachestat                       sys_cachestat
index 1976c6f..a003bea 100644 (file)
@@ -6,6 +6,8 @@
  */
 #include <linux/time.h>
 #include <linux/types.h>
+#include <asm/vdso.h>
+#include <asm/unwind.h>
 
 int __vdso_clock_gettime(clockid_t clock,
                         struct old_timespec32 *ts)
index 349dcb9..1ba5078 100644 (file)
@@ -25,6 +25,7 @@
 #include <asm/thread_notify.h>
 #include <asm/traps.h>
 #include <asm/vfp.h>
+#include <asm/neon.h>
 
 #include "vfpinstr.h"
 #include "vfp.h"
index 343e1e1..3ae5c03 100644 (file)
@@ -120,6 +120,7 @@ config ARM64
        select CRC32
        select DCACHE_WORD_ACCESS
        select DYNAMIC_FTRACE if FUNCTION_TRACER
+       select DMA_BOUNCE_UNALIGNED_KMALLOC
        select DMA_DIRECT_REMAP
        select EDAC_SUPPORT
        select FRAME_POINTER
@@ -203,12 +204,16 @@ config ARM64
        select HAVE_FUNCTION_ERROR_INJECTION
        select HAVE_FUNCTION_GRAPH_TRACER
        select HAVE_GCC_PLUGINS
+       select HAVE_HARDLOCKUP_DETECTOR_PERF if PERF_EVENTS && \
+               HW_PERF_EVENTS && HAVE_PERF_EVENTS_NMI
        select HAVE_HW_BREAKPOINT if PERF_EVENTS
        select HAVE_IOREMAP_PROT
        select HAVE_IRQ_TIME_ACCOUNTING
        select HAVE_KVM
+       select HAVE_MOD_ARCH_SPECIFIC
        select HAVE_NMI
        select HAVE_PERF_EVENTS
+       select HAVE_PERF_EVENTS_NMI if ARM64_PSEUDO_NMI
        select HAVE_PERF_REGS
        select HAVE_PERF_USER_STACK_DUMP
        select HAVE_PREEMPT_DYNAMIC_KEY
@@ -222,6 +227,7 @@ config ARM64
        select HAVE_KPROBES
        select HAVE_KRETPROBES
        select HAVE_GENERIC_VDSO
+       select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
        select IRQ_DOMAIN
        select IRQ_FORCED_THREADING
        select KASAN_VMALLOC if KASAN
@@ -577,7 +583,6 @@ config ARM64_ERRATUM_845719
 config ARM64_ERRATUM_843419
        bool "Cortex-A53: 843419: A load or store might access an incorrect address"
        default y
-       select ARM64_MODULE_PLTS if MODULES
        help
          This option links the kernel with '--fix-cortex-a53-843419' and
          enables PLT support to replace certain ADRP instructions, which can
@@ -1585,7 +1590,7 @@ config ARM64_TAGGED_ADDR_ABI
          When this option is enabled, user applications can opt in to a
          relaxed ABI via prctl() allowing tagged addresses to be passed
          to system calls as pointer arguments. For details, see
-         Documentation/arm64/tagged-address-abi.rst.
+         Documentation/arch/arm64/tagged-address-abi.rst.
 
 menuconfig COMPAT
        bool "Kernel support for 32-bit EL0"
@@ -1619,7 +1624,7 @@ config KUSER_HELPERS
          the system. This permits binaries to be run on ARMv4 through
          to ARMv8 without modification.
 
-         See Documentation/arm/kernel_user_helpers.rst for details.
+         See Documentation/arch/arm/kernel_user_helpers.rst for details.
 
          However, the fixed address nature of these helpers can be used
          by ROP (return orientated programming) authors when creating
@@ -2047,7 +2052,7 @@ config ARM64_MTE
          explicitly opt in. The mechanism for the userspace is
          described in:
 
-         Documentation/arm64/memory-tagging-extension.rst.
+         Documentation/arch/arm64/memory-tagging-extension.rst.
 
 endmenu # "ARMv8.5 architectural features"
 
@@ -2107,26 +2112,6 @@ config ARM64_SME
          register state capable of holding two dimensional matrix tiles to
          enable various matrix operations.
 
-config ARM64_MODULE_PLTS
-       bool "Use PLTs to allow module memory to spill over into vmalloc area"
-       depends on MODULES
-       select HAVE_MOD_ARCH_SPECIFIC
-       help
-         Allocate PLTs when loading modules so that jumps and calls whose
-         targets are too far away for their relative offsets to be encoded
-         in the instructions themselves can be bounced via veneers in the
-         module's PLT. This allows modules to be allocated in the generic
-         vmalloc area after the dedicated module memory area has been
-         exhausted.
-
-         When running with address space randomization (KASLR), the module
-         region itself may be too far away for ordinary relative jumps and
-         calls, and so in that case, module PLTs are required and cannot be
-         disabled.
-
-         Specific errata workaround(s) might also force module PLTs to be
-         enabled (ARM64_ERRATUM_843419).
-
 config ARM64_PSEUDO_NMI
        bool "Support for NMI-like interrupts"
        select ARM_GIC_V3
@@ -2167,7 +2152,6 @@ config RELOCATABLE
 
 config RANDOMIZE_BASE
        bool "Randomize the address of the kernel image"
-       select ARM64_MODULE_PLTS if MODULES
        select RELOCATABLE
        help
          Randomizes the virtual address at which the kernel image is
@@ -2198,9 +2182,8 @@ config RANDOMIZE_MODULE_REGION_FULL
          When this option is not set, the module region will be randomized over
          a limited range that contains the [_stext, _etext] interval of the
          core kernel, so branch relocations are almost always in range unless
-         ARM64_MODULE_PLTS is enabled and the region is exhausted. In this
-         particular case of region exhaustion, modules might be able to fall
-         back to a larger 2GB area.
+         the region is exhausted. In this particular case of region
+         exhaustion, modules might be able to fall back to a larger 2GB area.
 
 config CC_HAVE_STACKPROTECTOR_SYSREG
        def_bool $(cc-option,-mstack-protector-guard=sysreg -mstack-protector-guard-reg=sp_el0 -mstack-protector-guard-offset=0)
index 9f05227..299ef5d 100644 (file)
        qcom,spare-regs = <&tcsr_regs_2 0xb3e4>;
 };
 
+&scm {
+       /* TF-A firmware maps memory cached so mark dma-coherent to match. */
+       dma-coherent;
+};
+
 &sdhc_1 {
        status = "okay";
 
index ca6920d..1472e7f 100644 (file)
@@ -892,6 +892,11 @@ hp_i2c: &i2c9 {
        qcom,spare-regs = <&tcsr_regs_2 0xb3e4>;
 };
 
+&scm {
+       /* TF-A firmware maps memory cached so mark dma-coherent to match. */
+       dma-coherent;
+};
+
 &sdhc_1 {
        status = "okay";
 
index f479cab..a65be76 100644 (file)
        };
 
        firmware {
-               scm {
+               scm: scm {
                        compatible = "qcom,scm-sc7180", "qcom,scm";
                };
        };
index f562e4d..2e1cd21 100644 (file)
        firmware-name = "ath11k/WCN6750/hw1.0/wpss.mdt";
 };
 
+&scm {
+       /* TF-A firmware maps memory cached so mark dma-coherent to match. */
+       dma-coherent;
+};
+
 &wifi {
        status = "okay";
 
index 2fd1d3c..36f0bb9 100644 (file)
        };
 
        firmware {
-               scm {
+               scm: scm {
                        compatible = "qcom,scm-sc7280", "qcom,scm";
                };
        };
index dd228a2..2ae4bb7 100644 (file)
@@ -97,6 +97,7 @@
                l2: l2-cache {
                        compatible = "cache";
                        cache-level = <2>;
+                       cache-unified;
                };
        };
 
index f69a38f..0a27fa5 100644 (file)
@@ -37,7 +37,8 @@
                vin-supply = <&vcc_io>;
        };
 
-       vcc_host_5v: vcc-host-5v-regulator {
+       /* Common enable line for all of the rails mentioned in the labels */
+       vcc_host_5v: vcc_host1_5v: vcc_otg_5v: vcc-host-5v-regulator {
                compatible = "regulator-fixed";
                gpio = <&gpio0 RK_PA2 GPIO_ACTIVE_LOW>;
                pinctrl-names = "default";
                vin-supply = <&vcc_sys>;
        };
 
-       vcc_host1_5v: vcc_otg_5v: vcc-host1-5v-regulator {
-               compatible = "regulator-fixed";
-               gpio = <&gpio0 RK_PA2 GPIO_ACTIVE_LOW>;
-               pinctrl-names = "default";
-               pinctrl-0 = <&usb20_host_drv>;
-               regulator-name = "vcc_host1_5v";
-               regulator-always-on;
-               regulator-boot-on;
-               vin-supply = <&vcc_sys>;
-       };
-
        vcc_sys: vcc-sys {
                compatible = "regulator-fixed";
                regulator-name = "vcc_sys";
index 6d7a7bf..e729e7a 100644 (file)
                l2: l2-cache0 {
                        compatible = "cache";
                        cache-level = <2>;
+                       cache-unified;
                };
        };
 
index 263ce40..cddf6cd 100644 (file)
                regulator-max-microvolt = <5000000>;
                vin-supply = <&vcc12v_dcin>;
        };
+
+       vcc_sd_pwr: vcc-sd-pwr-regulator {
+               compatible = "regulator-fixed";
+               regulator-name = "vcc_sd_pwr";
+               regulator-always-on;
+               regulator-boot-on;
+               regulator-min-microvolt = <3300000>;
+               regulator-max-microvolt = <3300000>;
+               vin-supply = <&vcc3v3_sys>;
+       };
 };
 
 /* phy for pcie */
 };
 
 &sdmmc0 {
-       vmmc-supply = <&sdmmc_pwr>;
-       status = "okay";
-};
-
-&sdmmc_pwr {
-       regulator-min-microvolt = <3300000>;
-       regulator-max-microvolt = <3300000>;
+       vmmc-supply = <&vcc_sd_pwr>;
        status = "okay";
 };
 
index 102e448..31aa2b8 100644 (file)
                regulator-max-microvolt = <3300000>;
                vin-supply = <&vcc5v0_sys>;
        };
-
-       sdmmc_pwr: sdmmc-pwr-regulator {
-               compatible = "regulator-fixed";
-               enable-active-high;
-               gpio = <&gpio0 RK_PA5 GPIO_ACTIVE_HIGH>;
-               pinctrl-names = "default";
-               pinctrl-0 = <&sdmmc_pwr_h>;
-               regulator-name = "sdmmc_pwr";
-               status = "disabled";
-       };
 };
 
 &cpu0 {
        status = "disabled";
 };
 
+&gpio0 {
+       nextrst-hog {
+               gpio-hog;
+               /*
+                * GPIO_ACTIVE_LOW + output-low here means that the pin is set
+                * to high, because output-low decides the value pre-inversion.
+                */
+               gpios = <RK_PA5 GPIO_ACTIVE_LOW>;
+               line-name = "nEXTRST";
+               output-low;
+       };
+};
+
 &gpu {
        mali-supply = <&vdd_gpu>;
        status = "okay";
                        rockchip,pins = <2 RK_PC2 RK_FUNC_GPIO &pcfg_pull_none>;
                };
        };
-
-       sdmmc-pwr {
-               sdmmc_pwr_h: sdmmc-pwr-h {
-                       rockchip,pins = <0 RK_PA5 RK_FUNC_GPIO &pcfg_pull_none>;
-               };
-       };
 };
 
 &pmu_io_domains {
index f70ca9f..c718b8d 100644 (file)
 
        rockchip-key {
                reset_button_pin: reset-button-pin {
-                       rockchip,pins = <4 RK_PA0 RK_FUNC_GPIO &pcfg_pull_up>;
+                       rockchip,pins = <0 RK_PB7 RK_FUNC_GPIO &pcfg_pull_up>;
                };
        };
 };
index ba67b58..f1be76a 100644 (file)
                power-domains = <&power RK3568_PD_PIPE>;
                reg = <0x3 0xc0400000 0x0 0x00400000>,
                      <0x0 0xfe270000 0x0 0x00010000>,
-                     <0x3 0x7f000000 0x0 0x01000000>;
-               ranges = <0x01000000 0x0 0x3ef00000 0x3 0x7ef00000 0x0 0x00100000>,
-                        <0x02000000 0x0 0x00000000 0x3 0x40000000 0x0 0x3ef00000>;
+                     <0x0 0xf2000000 0x0 0x00100000>;
+               ranges = <0x01000000 0x0 0xf2100000 0x0 0xf2100000 0x0 0x00100000>,
+                        <0x02000000 0x0 0xf2200000 0x0 0xf2200000 0x0 0x01e00000>,
+                        <0x03000000 0x0 0x40000000 0x3 0x40000000 0x0 0x40000000>;
                reg-names = "dbi", "apb", "config";
                resets = <&cru SRST_PCIE30X1_POWERUP>;
                reset-names = "pipe";
                power-domains = <&power RK3568_PD_PIPE>;
                reg = <0x3 0xc0800000 0x0 0x00400000>,
                      <0x0 0xfe280000 0x0 0x00010000>,
-                     <0x3 0xbf000000 0x0 0x01000000>;
-               ranges = <0x01000000 0x0 0x3ef00000 0x3 0xbef00000 0x0 0x00100000>,
-                        <0x02000000 0x0 0x00000000 0x3 0x80000000 0x0 0x3ef00000>;
+                     <0x0 0xf0000000 0x0 0x00100000>;
+               ranges = <0x01000000 0x0 0xf0100000 0x0 0xf0100000 0x0 0x00100000>,
+                        <0x02000000 0x0 0xf0200000 0x0 0xf0200000 0x0 0x01e00000>,
+                        <0x03000000 0x0 0x40000000 0x3 0x80000000 0x0 0x40000000>;
                reg-names = "dbi", "apb", "config";
                resets = <&cru SRST_PCIE30X2_POWERUP>;
                reset-names = "pipe";
index f62e0fd..61680c7 100644 (file)
                compatible = "rockchip,rk3568-pcie";
                reg = <0x3 0xc0000000 0x0 0x00400000>,
                      <0x0 0xfe260000 0x0 0x00010000>,
-                     <0x3 0x3f000000 0x0 0x01000000>;
+                     <0x0 0xf4000000 0x0 0x00100000>;
                reg-names = "dbi", "apb", "config";
                interrupts = <GIC_SPI 75 IRQ_TYPE_LEVEL_HIGH>,
                             <GIC_SPI 74 IRQ_TYPE_LEVEL_HIGH>,
                phys = <&combphy2 PHY_TYPE_PCIE>;
                phy-names = "pcie-phy";
                power-domains = <&power RK3568_PD_PIPE>;
-               ranges = <0x01000000 0x0 0x3ef00000 0x3 0x3ef00000 0x0 0x00100000
-                         0x02000000 0x0 0x00000000 0x3 0x00000000 0x0 0x3ef00000>;
+               ranges = <0x01000000 0x0 0xf4100000 0x0 0xf4100000 0x0 0x00100000>,
+                        <0x02000000 0x0 0xf4200000 0x0 0xf4200000 0x0 0x01e00000>,
+                        <0x03000000 0x0 0x40000000 0x3 0x00000000 0x0 0x40000000>;
                resets = <&cru SRST_PCIE20_POWERUP>;
                reset-names = "pipe";
                #address-cells = <3>;
index 657c019..a3124bd 100644 (file)
                        cache-line-size = <64>;
                        cache-sets = <512>;
                        cache-level = <2>;
+                       cache-unified;
                        next-level-cache = <&l3_cache>;
                };
 
                        cache-line-size = <64>;
                        cache-sets = <512>;
                        cache-level = <2>;
+                       cache-unified;
                        next-level-cache = <&l3_cache>;
                };
 
                        cache-line-size = <64>;
                        cache-sets = <512>;
                        cache-level = <2>;
+                       cache-unified;
                        next-level-cache = <&l3_cache>;
                };
 
                        cache-line-size = <64>;
                        cache-sets = <512>;
                        cache-level = <2>;
+                       cache-unified;
                        next-level-cache = <&l3_cache>;
                };
 
                        cache-line-size = <64>;
                        cache-sets = <1024>;
                        cache-level = <2>;
+                       cache-unified;
                        next-level-cache = <&l3_cache>;
                };
 
                        cache-line-size = <64>;
                        cache-sets = <1024>;
                        cache-level = <2>;
+                       cache-unified;
                        next-level-cache = <&l3_cache>;
                };
 
                        cache-line-size = <64>;
                        cache-sets = <1024>;
                        cache-level = <2>;
+                       cache-unified;
                        next-level-cache = <&l3_cache>;
                };
 
                        cache-line-size = <64>;
                        cache-sets = <1024>;
                        cache-level = <2>;
+                       cache-unified;
                        next-level-cache = <&l3_cache>;
                };
 
                        cache-line-size = <64>;
                        cache-sets = <4096>;
                        cache-level = <3>;
+                       cache-unified;
                };
        };
 
index bdf1f6b..94b4861 100644 (file)
 
 #include <linux/stringify.h>
 
-#define ALTINSTR_ENTRY(feature)                                                      \
+#define ALTINSTR_ENTRY(cpucap)                                               \
        " .word 661b - .\n"                             /* label           */ \
        " .word 663f - .\n"                             /* new instruction */ \
-       " .hword " __stringify(feature) "\n"            /* feature bit     */ \
+       " .hword " __stringify(cpucap) "\n"             /* cpucap          */ \
        " .byte 662b-661b\n"                            /* source len      */ \
        " .byte 664f-663f\n"                            /* replacement len */
 
-#define ALTINSTR_ENTRY_CB(feature, cb)                                       \
+#define ALTINSTR_ENTRY_CB(cpucap, cb)                                        \
        " .word 661b - .\n"                             /* label           */ \
-       " .word " __stringify(cb) "- .\n"               /* callback */        \
-       " .hword " __stringify(feature) "\n"            /* feature bit     */ \
+       " .word " __stringify(cb) "- .\n"               /* callback        */ \
+       " .hword " __stringify(cpucap) "\n"             /* cpucap          */ \
        " .byte 662b-661b\n"                            /* source len      */ \
        " .byte 664f-663f\n"                            /* replacement len */
 
  *
  * Alternatives with callbacks do not generate replacement instructions.
  */
-#define __ALTERNATIVE_CFG(oldinstr, newinstr, feature, cfg_enabled)    \
+#define __ALTERNATIVE_CFG(oldinstr, newinstr, cpucap, cfg_enabled)     \
        ".if "__stringify(cfg_enabled)" == 1\n"                         \
        "661:\n\t"                                                      \
        oldinstr "\n"                                                   \
        "662:\n"                                                        \
        ".pushsection .altinstructions,\"a\"\n"                         \
-       ALTINSTR_ENTRY(feature)                                         \
+       ALTINSTR_ENTRY(cpucap)                                          \
        ".popsection\n"                                                 \
        ".subsection 1\n"                                               \
        "663:\n\t"                                                      \
        ".previous\n"                                                   \
        ".endif\n"
 
-#define __ALTERNATIVE_CFG_CB(oldinstr, feature, cfg_enabled, cb)       \
+#define __ALTERNATIVE_CFG_CB(oldinstr, cpucap, cfg_enabled, cb)        \
        ".if "__stringify(cfg_enabled)" == 1\n"                         \
        "661:\n\t"                                                      \
        oldinstr "\n"                                                   \
        "662:\n"                                                        \
        ".pushsection .altinstructions,\"a\"\n"                         \
-       ALTINSTR_ENTRY_CB(feature, cb)                                  \
+       ALTINSTR_ENTRY_CB(cpucap, cb)                                   \
        ".popsection\n"                                                 \
        "663:\n\t"                                                      \
        "664:\n\t"                                                      \
        ".endif\n"
 
-#define _ALTERNATIVE_CFG(oldinstr, newinstr, feature, cfg, ...)        \
-       __ALTERNATIVE_CFG(oldinstr, newinstr, feature, IS_ENABLED(cfg))
+#define _ALTERNATIVE_CFG(oldinstr, newinstr, cpucap, cfg, ...) \
+       __ALTERNATIVE_CFG(oldinstr, newinstr, cpucap, IS_ENABLED(cfg))
 
-#define ALTERNATIVE_CB(oldinstr, feature, cb) \
-       __ALTERNATIVE_CFG_CB(oldinstr, (1 << ARM64_CB_SHIFT) | (feature), 1, cb)
+#define ALTERNATIVE_CB(oldinstr, cpucap, cb) \
+       __ALTERNATIVE_CFG_CB(oldinstr, (1 << ARM64_CB_SHIFT) | (cpucap), 1, cb)
 #else
 
 #include <asm/assembler.h>
 
-.macro altinstruction_entry orig_offset alt_offset feature orig_len alt_len
+.macro altinstruction_entry orig_offset alt_offset cpucap orig_len alt_len
        .word \orig_offset - .
        .word \alt_offset - .
-       .hword (\feature)
+       .hword (\cpucap)
        .byte \orig_len
        .byte \alt_len
 .endm
@@ -210,9 +210,9 @@ alternative_endif
 #endif  /*  __ASSEMBLY__  */
 
 /*
- * Usage: asm(ALTERNATIVE(oldinstr, newinstr, feature));
+ * Usage: asm(ALTERNATIVE(oldinstr, newinstr, cpucap));
  *
- * Usage: asm(ALTERNATIVE(oldinstr, newinstr, feature, CONFIG_FOO));
+ * Usage: asm(ALTERNATIVE(oldinstr, newinstr, cpucap, CONFIG_FOO));
  * N.B. If CONFIG_FOO is specified, but not selected, the whole block
  *      will be omitted, including oldinstr.
  */
@@ -224,15 +224,15 @@ alternative_endif
 #include <linux/types.h>
 
 static __always_inline bool
-alternative_has_feature_likely(const unsigned long feature)
+alternative_has_cap_likely(const unsigned long cpucap)
 {
-       compiletime_assert(feature < ARM64_NCAPS,
-                          "feature must be < ARM64_NCAPS");
+       compiletime_assert(cpucap < ARM64_NCAPS,
+                          "cpucap must be < ARM64_NCAPS");
 
        asm_volatile_goto(
-       ALTERNATIVE_CB("b       %l[l_no]", %[feature], alt_cb_patch_nops)
+       ALTERNATIVE_CB("b       %l[l_no]", %[cpucap], alt_cb_patch_nops)
        :
-       : [feature] "i" (feature)
+       : [cpucap] "i" (cpucap)
        :
        : l_no);
 
@@ -242,15 +242,15 @@ l_no:
 }
 
 static __always_inline bool
-alternative_has_feature_unlikely(const unsigned long feature)
+alternative_has_cap_unlikely(const unsigned long cpucap)
 {
-       compiletime_assert(feature < ARM64_NCAPS,
-                          "feature must be < ARM64_NCAPS");
+       compiletime_assert(cpucap < ARM64_NCAPS,
+                          "cpucap must be < ARM64_NCAPS");
 
        asm_volatile_goto(
-       ALTERNATIVE("nop", "b   %l[l_yes]", %[feature])
+       ALTERNATIVE("nop", "b   %l[l_yes]", %[cpucap])
        :
-       : [feature] "i" (feature)
+       : [cpucap] "i" (cpucap)
        :
        : l_yes);
 
index a38b92e..00d97b8 100644 (file)
@@ -13,7 +13,7 @@
 struct alt_instr {
        s32 orig_offset;        /* offset to original instruction */
        s32 alt_offset;         /* offset to replacement instruction */
-       u16 cpufeature;         /* cpufeature bit set for replacement */
+       u16 cpucap;             /* cpucap bit set for replacement */
        u8  orig_len;           /* size of original instruction(s) */
        u8  alt_len;            /* size of new instruction(s), <= orig_len */
 };
@@ -23,7 +23,7 @@ typedef void (*alternative_cb_t)(struct alt_instr *alt,
 
 void __init apply_boot_alternatives(void);
 void __init apply_alternatives_all(void);
-bool alternative_is_applied(u16 cpufeature);
+bool alternative_is_applied(u16 cpucap);
 
 #ifdef CONFIG_MODULES
 void apply_alternatives_module(void *start, size_t length);
@@ -31,5 +31,8 @@ void apply_alternatives_module(void *start, size_t length);
 static inline void apply_alternatives_module(void *start, size_t length) { }
 #endif
 
+void alt_cb_patch_nops(struct alt_instr *alt, __le32 *origptr,
+                      __le32 *updptr, int nr_inst);
+
 #endif /* __ASSEMBLY__ */
 #endif /* __ASM_ALTERNATIVE_H */
index af1fafb..934c658 100644 (file)
@@ -88,13 +88,7 @@ static inline notrace u64 arch_timer_read_cntvct_el0(void)
 
 #define arch_timer_reg_read_stable(reg)                                        \
        ({                                                              \
-               u64 _val;                                               \
-                                                                       \
-               preempt_disable_notrace();                              \
-               _val = erratum_handler(read_ ## reg)();                 \
-               preempt_enable_notrace();                               \
-                                                                       \
-               _val;                                                   \
+               erratum_handler(read_ ## reg)();                        \
        })
 
 /*
index 2f5f3da..b0abc64 100644 (file)
@@ -129,4 +129,6 @@ static inline bool __init __early_cpu_has_rndr(void)
        return (ftr >> ID_AA64ISAR0_EL1_RNDR_SHIFT) & 0xf;
 }
 
+u64 kaslr_early_init(void *fdt);
+
 #endif /* _ASM_ARCHRANDOM_H */
index 75b211c..5b6efe8 100644 (file)
@@ -18,7 +18,6 @@
        bic     \tmp1, \tmp1, #TTBR_ASID_MASK
        sub     \tmp1, \tmp1, #RESERVED_SWAPPER_OFFSET  // reserved_pg_dir
        msr     ttbr0_el1, \tmp1                        // set reserved TTBR0_EL1
-       isb
        add     \tmp1, \tmp1, #RESERVED_SWAPPER_OFFSET
        msr     ttbr1_el1, \tmp1                // set reserved ASID
        isb
@@ -31,7 +30,6 @@
        extr    \tmp2, \tmp2, \tmp1, #48
        ror     \tmp2, \tmp2, #16
        msr     ttbr1_el1, \tmp2                // set the active ASID
-       isb
        msr     ttbr0_el1, \tmp1                // set the non-PAN TTBR0_EL1
        isb
        .endm
index c997927..400d279 100644 (file)
@@ -142,24 +142,6 @@ static __always_inline long arch_atomic64_dec_if_positive(atomic64_t *v)
 #define arch_atomic_fetch_xor_release          arch_atomic_fetch_xor_release
 #define arch_atomic_fetch_xor                  arch_atomic_fetch_xor
 
-#define arch_atomic_xchg_relaxed(v, new) \
-       arch_xchg_relaxed(&((v)->counter), (new))
-#define arch_atomic_xchg_acquire(v, new) \
-       arch_xchg_acquire(&((v)->counter), (new))
-#define arch_atomic_xchg_release(v, new) \
-       arch_xchg_release(&((v)->counter), (new))
-#define arch_atomic_xchg(v, new) \
-       arch_xchg(&((v)->counter), (new))
-
-#define arch_atomic_cmpxchg_relaxed(v, old, new) \
-       arch_cmpxchg_relaxed(&((v)->counter), (old), (new))
-#define arch_atomic_cmpxchg_acquire(v, old, new) \
-       arch_cmpxchg_acquire(&((v)->counter), (old), (new))
-#define arch_atomic_cmpxchg_release(v, old, new) \
-       arch_cmpxchg_release(&((v)->counter), (old), (new))
-#define arch_atomic_cmpxchg(v, old, new) \
-       arch_cmpxchg(&((v)->counter), (old), (new))
-
 #define arch_atomic_andnot                     arch_atomic_andnot
 
 /*
@@ -209,16 +191,6 @@ static __always_inline long arch_atomic64_dec_if_positive(atomic64_t *v)
 #define arch_atomic64_fetch_xor_release                arch_atomic64_fetch_xor_release
 #define arch_atomic64_fetch_xor                        arch_atomic64_fetch_xor
 
-#define arch_atomic64_xchg_relaxed             arch_atomic_xchg_relaxed
-#define arch_atomic64_xchg_acquire             arch_atomic_xchg_acquire
-#define arch_atomic64_xchg_release             arch_atomic_xchg_release
-#define arch_atomic64_xchg                     arch_atomic_xchg
-
-#define arch_atomic64_cmpxchg_relaxed          arch_atomic_cmpxchg_relaxed
-#define arch_atomic64_cmpxchg_acquire          arch_atomic_cmpxchg_acquire
-#define arch_atomic64_cmpxchg_release          arch_atomic_cmpxchg_release
-#define arch_atomic64_cmpxchg                  arch_atomic_cmpxchg
-
 #define arch_atomic64_andnot                   arch_atomic64_andnot
 
 #define arch_atomic64_dec_if_positive          arch_atomic64_dec_if_positive
index cbb3d96..89d2ba2 100644 (file)
@@ -294,38 +294,46 @@ __CMPXCHG_CASE( ,  ,  mb_, 64, dmb ish,  , l, "memory", L)
 
 #undef __CMPXCHG_CASE
 
-#define __CMPXCHG_DBL(name, mb, rel, cl)                               \
-static __always_inline long                                            \
-__ll_sc__cmpxchg_double##name(unsigned long old1,                      \
-                                     unsigned long old2,               \
-                                     unsigned long new1,               \
-                                     unsigned long new2,               \
-                                     volatile void *ptr)               \
+union __u128_halves {
+       u128 full;
+       struct {
+               u64 low, high;
+       };
+};
+
+#define __CMPXCHG128(name, mb, rel, cl...)                             \
+static __always_inline u128                                            \
+__ll_sc__cmpxchg128##name(volatile u128 *ptr, u128 old, u128 new)      \
 {                                                                      \
-       unsigned long tmp, ret;                                         \
+       union __u128_halves r, o = { .full = (old) },                   \
+                              n = { .full = (new) };                   \
+       unsigned int tmp;                                               \
                                                                        \
-       asm volatile("// __cmpxchg_double" #name "\n"                   \
-       "       prfm    pstl1strm, %2\n"                                \
-       "1:     ldxp    %0, %1, %2\n"                                   \
-       "       eor     %0, %0, %3\n"                                   \
-       "       eor     %1, %1, %4\n"                                   \
-       "       orr     %1, %0, %1\n"                                   \
-       "       cbnz    %1, 2f\n"                                       \
-       "       st" #rel "xp    %w0, %5, %6, %2\n"                      \
-       "       cbnz    %w0, 1b\n"                                      \
+       asm volatile("// __cmpxchg128" #name "\n"                       \
+       "       prfm    pstl1strm, %[v]\n"                              \
+       "1:     ldxp    %[rl], %[rh], %[v]\n"                           \
+       "       cmp     %[rl], %[ol]\n"                                 \
+       "       ccmp    %[rh], %[oh], 0, eq\n"                          \
+       "       b.ne    2f\n"                                           \
+       "       st" #rel "xp    %w[tmp], %[nl], %[nh], %[v]\n"          \
+       "       cbnz    %w[tmp], 1b\n"                                  \
        "       " #mb "\n"                                              \
        "2:"                                                            \
-       : "=&r" (tmp), "=&r" (ret), "+Q" (*(__uint128_t *)ptr)          \
-       : "r" (old1), "r" (old2), "r" (new1), "r" (new2)                \
-       : cl);                                                          \
+       : [v] "+Q" (*(u128 *)ptr),                                      \
+         [rl] "=&r" (r.low), [rh] "=&r" (r.high),                      \
+         [tmp] "=&r" (tmp)                                             \
+       : [ol] "r" (o.low), [oh] "r" (o.high),                          \
+         [nl] "r" (n.low), [nh] "r" (n.high)                           \
+       : "cc", ##cl);                                                  \
                                                                        \
-       return ret;                                                     \
+       return r.full;                                                  \
 }
 
-__CMPXCHG_DBL(   ,        ,  ,         )
-__CMPXCHG_DBL(_mb, dmb ish, l, "memory")
+__CMPXCHG128(   ,        ,  )
+__CMPXCHG128(_mb, dmb ish, l, "memory")
+
+#undef __CMPXCHG128
 
-#undef __CMPXCHG_DBL
 #undef K
 
 #endif /* __ASM_ATOMIC_LL_SC_H */
index 319958b..87f568a 100644 (file)
@@ -281,40 +281,35 @@ __CMPXCHG_CASE(x,  ,  mb_, 64, al, "memory")
 
 #undef __CMPXCHG_CASE
 
-#define __CMPXCHG_DBL(name, mb, cl...)                                 \
-static __always_inline long                                            \
-__lse__cmpxchg_double##name(unsigned long old1,                                \
-                                        unsigned long old2,            \
-                                        unsigned long new1,            \
-                                        unsigned long new2,            \
-                                        volatile void *ptr)            \
+#define __CMPXCHG128(name, mb, cl...)                                  \
+static __always_inline u128                                            \
+__lse__cmpxchg128##name(volatile u128 *ptr, u128 old, u128 new)                \
 {                                                                      \
-       unsigned long oldval1 = old1;                                   \
-       unsigned long oldval2 = old2;                                   \
-       register unsigned long x0 asm ("x0") = old1;                    \
-       register unsigned long x1 asm ("x1") = old2;                    \
-       register unsigned long x2 asm ("x2") = new1;                    \
-       register unsigned long x3 asm ("x3") = new2;                    \
+       union __u128_halves r, o = { .full = (old) },                   \
+                              n = { .full = (new) };                   \
+       register unsigned long x0 asm ("x0") = o.low;                   \
+       register unsigned long x1 asm ("x1") = o.high;                  \
+       register unsigned long x2 asm ("x2") = n.low;                   \
+       register unsigned long x3 asm ("x3") = n.high;                  \
        register unsigned long x4 asm ("x4") = (unsigned long)ptr;      \
                                                                        \
        asm volatile(                                                   \
        __LSE_PREAMBLE                                                  \
        "       casp" #mb "\t%[old1], %[old2], %[new1], %[new2], %[v]\n"\
-       "       eor     %[old1], %[old1], %[oldval1]\n"                 \
-       "       eor     %[old2], %[old2], %[oldval2]\n"                 \
-       "       orr     %[old1], %[old1], %[old2]"                      \
        : [old1] "+&r" (x0), [old2] "+&r" (x1),                         \
-         [v] "+Q" (*(__uint128_t *)ptr)                                \
+         [v] "+Q" (*(u128 *)ptr)                                       \
        : [new1] "r" (x2), [new2] "r" (x3), [ptr] "r" (x4),             \
-         [oldval1] "r" (oldval1), [oldval2] "r" (oldval2)              \
+         [oldval1] "r" (o.low), [oldval2] "r" (o.high)                 \
        : cl);                                                          \
                                                                        \
-       return x0;                                                      \
+       r.low = x0; r.high = x1;                                        \
+                                                                       \
+       return r.full;                                                  \
 }
 
-__CMPXCHG_DBL(   ,   )
-__CMPXCHG_DBL(_mb, al, "memory")
+__CMPXCHG128(   ,   )
+__CMPXCHG128(_mb, al, "memory")
 
-#undef __CMPXCHG_DBL
+#undef __CMPXCHG128
 
 #endif /* __ASM_ATOMIC_LSE_H */
index a51e6e8..ceb368d 100644 (file)
@@ -33,6 +33,7 @@
  * the CPU.
  */
 #define ARCH_DMA_MINALIGN      (128)
+#define ARCH_KMALLOC_MINALIGN  (8)
 
 #ifndef __ASSEMBLY__
 
@@ -90,6 +91,8 @@ static inline int cache_line_size_of_cpu(void)
 
 int cache_line_size(void);
 
+#define dma_get_cache_alignment        cache_line_size
+
 /*
  * Read the effective value of CTR_EL0.
  *
index c6bc5d8..d7a5407 100644 (file)
@@ -130,21 +130,18 @@ __CMPXCHG_CASE(mb_, 64)
 
 #undef __CMPXCHG_CASE
 
-#define __CMPXCHG_DBL(name)                                            \
-static inline long __cmpxchg_double##name(unsigned long old1,          \
-                                        unsigned long old2,            \
-                                        unsigned long new1,            \
-                                        unsigned long new2,            \
-                                        volatile void *ptr)            \
+#define __CMPXCHG128(name)                                             \
+static inline u128 __cmpxchg128##name(volatile u128 *ptr,              \
+                                     u128 old, u128 new)               \
 {                                                                      \
-       return __lse_ll_sc_body(_cmpxchg_double##name,                  \
-                               old1, old2, new1, new2, ptr);           \
+       return __lse_ll_sc_body(_cmpxchg128##name,                      \
+                               ptr, old, new);                         \
 }
 
-__CMPXCHG_DBL(   )
-__CMPXCHG_DBL(_mb)
+__CMPXCHG128(   )
+__CMPXCHG128(_mb)
 
-#undef __CMPXCHG_DBL
+#undef __CMPXCHG128
 
 #define __CMPXCHG_GEN(sfx)                                             \
 static __always_inline unsigned long __cmpxchg##sfx(volatile void *ptr,        \
@@ -198,34 +195,17 @@ __CMPXCHG_GEN(_mb)
 #define arch_cmpxchg64                 arch_cmpxchg
 #define arch_cmpxchg64_local           arch_cmpxchg_local
 
-/* cmpxchg_double */
-#define system_has_cmpxchg_double()     1
-
-#define __cmpxchg_double_check(ptr1, ptr2)                                     \
-({                                                                             \
-       if (sizeof(*(ptr1)) != 8)                                               \
-               BUILD_BUG();                                                    \
-       VM_BUG_ON((unsigned long *)(ptr2) - (unsigned long *)(ptr1) != 1);      \
-})
+/* cmpxchg128 */
+#define system_has_cmpxchg128()                1
 
-#define arch_cmpxchg_double(ptr1, ptr2, o1, o2, n1, n2)                                \
+#define arch_cmpxchg128(ptr, o, n)                                             \
 ({                                                                             \
-       int __ret;                                                              \
-       __cmpxchg_double_check(ptr1, ptr2);                                     \
-       __ret = !__cmpxchg_double_mb((unsigned long)(o1), (unsigned long)(o2),  \
-                                    (unsigned long)(n1), (unsigned long)(n2),  \
-                                    ptr1);                                     \
-       __ret;                                                                  \
+       __cmpxchg128_mb((ptr), (o), (n));                                       \
 })
 
-#define arch_cmpxchg_double_local(ptr1, ptr2, o1, o2, n1, n2)                  \
+#define arch_cmpxchg128_local(ptr, o, n)                                       \
 ({                                                                             \
-       int __ret;                                                              \
-       __cmpxchg_double_check(ptr1, ptr2);                                     \
-       __ret = !__cmpxchg_double((unsigned long)(o1), (unsigned long)(o2),     \
-                                 (unsigned long)(n1), (unsigned long)(n2),     \
-                                 ptr1);                                        \
-       __ret;                                                                  \
+       __cmpxchg128((ptr), (o), (n));                                          \
 })
 
 #define __CMPWAIT_CASE(w, sfx, sz)                                     \
index 74575c3..ae904a1 100644 (file)
@@ -96,6 +96,8 @@ static inline int is_compat_thread(struct thread_info *thread)
        return test_ti_thread_flag(thread, TIF_32BIT);
 }
 
+long compat_arm_syscall(struct pt_regs *regs, int scno);
+
 #else /* !CONFIG_COMPAT */
 
 static inline int is_compat_thread(struct thread_info *thread)
index fd7a922..e749838 100644 (file)
@@ -56,6 +56,7 @@ struct cpuinfo_arm64 {
        u64             reg_id_aa64mmfr0;
        u64             reg_id_aa64mmfr1;
        u64             reg_id_aa64mmfr2;
+       u64             reg_id_aa64mmfr3;
        u64             reg_id_aa64pfr0;
        u64             reg_id_aa64pfr1;
        u64             reg_id_aa64zfr0;
index 6bf013f..7a95c32 100644 (file)
@@ -107,7 +107,7 @@ extern struct arm64_ftr_reg arm64_ftr_reg_ctrel0;
  * CPU capabilities:
  *
  * We use arm64_cpu_capabilities to represent system features, errata work
- * arounds (both used internally by kernel and tracked in cpu_hwcaps) and
+ * arounds (both used internally by kernel and tracked in system_cpucaps) and
  * ELF HWCAPs (which are exposed to user).
  *
  * To support systems with heterogeneous CPUs, we need to make sure that we
@@ -419,12 +419,12 @@ static __always_inline bool is_hyp_code(void)
        return is_vhe_hyp_code() || is_nvhe_hyp_code();
 }
 
-extern DECLARE_BITMAP(cpu_hwcaps, ARM64_NCAPS);
+extern DECLARE_BITMAP(system_cpucaps, ARM64_NCAPS);
 
-extern DECLARE_BITMAP(boot_capabilities, ARM64_NCAPS);
+extern DECLARE_BITMAP(boot_cpucaps, ARM64_NCAPS);
 
 #define for_each_available_cap(cap)            \
-       for_each_set_bit(cap, cpu_hwcaps, ARM64_NCAPS)
+       for_each_set_bit(cap, system_cpucaps, ARM64_NCAPS)
 
 bool this_cpu_has_cap(unsigned int cap);
 void cpu_set_feature(unsigned int num);
@@ -437,7 +437,7 @@ unsigned long cpu_get_elf_hwcap2(void);
 
 static __always_inline bool system_capabilities_finalized(void)
 {
-       return alternative_has_feature_likely(ARM64_ALWAYS_SYSTEM);
+       return alternative_has_cap_likely(ARM64_ALWAYS_SYSTEM);
 }
 
 /*
@@ -449,7 +449,7 @@ static __always_inline bool cpus_have_cap(unsigned int num)
 {
        if (num >= ARM64_NCAPS)
                return false;
-       return arch_test_bit(num, cpu_hwcaps);
+       return arch_test_bit(num, system_cpucaps);
 }
 
 /*
@@ -464,7 +464,7 @@ static __always_inline bool __cpus_have_const_cap(int num)
 {
        if (num >= ARM64_NCAPS)
                return false;
-       return alternative_has_feature_unlikely(num);
+       return alternative_has_cap_unlikely(num);
 }
 
 /*
@@ -504,16 +504,6 @@ static __always_inline bool cpus_have_const_cap(int num)
                return cpus_have_cap(num);
 }
 
-static inline void cpus_set_cap(unsigned int num)
-{
-       if (num >= ARM64_NCAPS) {
-               pr_warn("Attempt to set an illegal CPU capability (%d >= %d)\n",
-                       num, ARM64_NCAPS);
-       } else {
-               __set_bit(num, cpu_hwcaps);
-       }
-}
-
 static inline int __attribute_const__
 cpuid_feature_extract_signed_field_width(u64 features, int field, int width)
 {
index f86b157..4cf2cb0 100644 (file)
@@ -88,7 +88,7 @@ efi_status_t __efi_rt_asm_wrapper(void *, const char *, ...);
  * guaranteed to cover the kernel Image.
  *
  * Since the EFI stub is part of the kernel Image, we can relax the
- * usual requirements in Documentation/arm64/booting.rst, which still
+ * usual requirements in Documentation/arch/arm64/booting.rst, which still
  * apply to other bootloaders, and are required for some kernel
  * configurations.
  */
@@ -166,4 +166,6 @@ static inline void efi_capsule_flush_cache_range(void *addr, int size)
        dcache_clean_inval_poc((unsigned long)addr, (unsigned long)addr + size);
 }
 
+efi_status_t efi_handle_corrupted_x18(efi_status_t s, const char *f);
+
 #endif /* _ASM_EFI_H */
index 037724b..f4c3d30 100644 (file)
        isb
 .endm
 
+.macro __init_el2_hcrx
+       mrs     x0, id_aa64mmfr1_el1
+       ubfx    x0, x0, #ID_AA64MMFR1_EL1_HCX_SHIFT, #4
+       cbz     x0, .Lskip_hcrx_\@
+       mov_q   x0, HCRX_HOST_FLAGS
+       msr_s   SYS_HCRX_EL2, x0
+.Lskip_hcrx_\@:
+.endm
+
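The new __init_el2_hcrx macro only touches HCRX_EL2 after confirming FEAT_HCX via ID_AA64MMFR1_EL1, since the register is undefined otherwise. The ubfx check is a plain shift-and-mask field extract; a hedged stand-alone rendering of the same decision (the HCX field position of 40 is my reading of the Arm ARM, not taken from this diff, and the sample register value is fabricated):

#include <stdint.h>
#include <stdio.h>

#define HCX_SHIFT       40      /* ID_AA64MMFR1_EL1.HCX, bits [43:40] (assumed) */

static unsigned int field4(uint64_t reg, unsigned int shift)
{
        return (reg >> shift) & 0xf;    /* same effect as "ubfx x0, x0, #shift, #4" */
}

int main(void)
{
        uint64_t id_aa64mmfr1 = 1ULL << HCX_SHIFT;      /* fabricated sample value */

        if (field4(id_aa64mmfr1, HCX_SHIFT))
                puts("FEAT_HCX present: program HCRX_EL2 with the host flags");
        else
                puts("FEAT_HCX absent: leave HCRX_EL2 alone");
        return 0;
}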
 /*
  * Allow Non-secure EL1 and EL0 to access physical timer and counter.
  * This is not necessary for VHE, since the host kernel runs in EL2,
@@ -69,7 +78,7 @@
        cbz     x0, .Lskip_trace_\@             // Skip if TraceBuffer is not present
 
        mrs_s   x0, SYS_TRBIDR_EL1
-       and     x0, x0, TRBIDR_PROG
+       and     x0, x0, TRBIDR_EL1_P
        cbnz    x0, .Lskip_trace_\@             // If TRBE is available at EL2
 
        mov     x0, #(MDCR_EL2_E2TB_MASK << MDCR_EL2_E2TB_SHIFT)
        mov     x0, xzr
        mrs     x1, id_aa64pfr1_el1
        ubfx    x1, x1, #ID_AA64PFR1_EL1_SME_SHIFT, #4
-       cbz     x1, .Lset_fgt_\@
+       cbz     x1, .Lset_pie_fgt_\@
 
        /* Disable nVHE traps of TPIDR2 and SMPRI */
        orr     x0, x0, #HFGxTR_EL2_nSMPRI_EL1_MASK
        orr     x0, x0, #HFGxTR_EL2_nTPIDR2_EL0_MASK
 
+.Lset_pie_fgt_\@:
+       mrs_s   x1, SYS_ID_AA64MMFR3_EL1
+       ubfx    x1, x1, #ID_AA64MMFR3_EL1_S1PIE_SHIFT, #4
+       cbz     x1, .Lset_fgt_\@
+
+       /* Disable trapping of PIR_EL1 / PIRE0_EL1 */
+       orr     x0, x0, #HFGxTR_EL2_nPIR_EL1
+       orr     x0, x0, #HFGxTR_EL2_nPIRE0_EL1
+
 .Lset_fgt_\@:
        msr_s   SYS_HFGRTR_EL2, x0
        msr_s   SYS_HFGWTR_EL2, x0
  */
 .macro init_el2_state
        __init_el2_sctlr
+       __init_el2_hcrx
        __init_el2_timers
        __init_el2_debug
        __init_el2_lor
        cbz     x1, .Lskip_sme_\@
 
        msr_s   SYS_SMPRIMAP_EL2, xzr           // Make all priorities equal
-
-       mrs     x1, id_aa64mmfr1_el1            // HCRX_EL2 present?
-       ubfx    x1, x1, #ID_AA64MMFR1_EL1_HCX_SHIFT, #4
-       cbz     x1, .Lskip_sme_\@
-
-       mrs_s   x1, SYS_HCRX_EL2
-       orr     x1, x1, #HCRX_EL2_SMPME_MASK    // Enable priority mapping
-       msr_s   SYS_HCRX_EL2, x1
 .Lskip_sme_\@:
 .endm
 
index 8487aec..ae35939 100644 (file)
@@ -47,7 +47,7 @@
 #define ESR_ELx_EC_DABT_LOW    (0x24)
 #define ESR_ELx_EC_DABT_CUR    (0x25)
 #define ESR_ELx_EC_SP_ALIGN    (0x26)
-/* Unallocated EC: 0x27 */
+#define ESR_ELx_EC_MOPS                (0x27)
 #define ESR_ELx_EC_FP_EXC32    (0x28)
 /* Unallocated EC: 0x29 - 0x2B */
 #define ESR_ELx_EC_FP_EXC64    (0x2C)
 
 #define ESR_ELx_IL_SHIFT       (25)
 #define ESR_ELx_IL             (UL(1) << ESR_ELx_IL_SHIFT)
-#define ESR_ELx_ISS_MASK       (ESR_ELx_IL - 1)
+#define ESR_ELx_ISS_MASK       (GENMASK(24, 0))
 #define ESR_ELx_ISS(esr)       ((esr) & ESR_ELx_ISS_MASK)
+#define ESR_ELx_ISS2_SHIFT     (32)
+#define ESR_ELx_ISS2_MASK      (GENMASK_ULL(55, 32))
+#define ESR_ELx_ISS2(esr)      (((esr) & ESR_ELx_ISS2_MASK) >> ESR_ELx_ISS2_SHIFT)
 
 /* ISS field definitions shared by different classes */
 #define ESR_ELx_WNR_SHIFT      (6)
 #define ESR_ELx_CM_SHIFT       (8)
 #define ESR_ELx_CM             (UL(1) << ESR_ELx_CM_SHIFT)
 
+/* ISS2 field definitions for Data Aborts */
+#define ESR_ELx_TnD_SHIFT      (10)
+#define ESR_ELx_TnD            (UL(1) << ESR_ELx_TnD_SHIFT)
+#define ESR_ELx_TagAccess_SHIFT        (9)
+#define ESR_ELx_TagAccess      (UL(1) << ESR_ELx_TagAccess_SHIFT)
+#define ESR_ELx_GCS_SHIFT      (8)
+#define ESR_ELx_GCS            (UL(1) << ESR_ELx_GCS_SHIFT)
+#define ESR_ELx_Overlay_SHIFT  (6)
+#define ESR_ELx_Overlay                (UL(1) << ESR_ELx_Overlay_SHIFT)
+#define ESR_ELx_DirtyBit_SHIFT (5)
+#define ESR_ELx_DirtyBit       (UL(1) << ESR_ELx_DirtyBit_SHIFT)
+#define ESR_ELx_Xs_SHIFT       (0)
+#define ESR_ELx_Xs_MASK                (GENMASK_ULL(4, 0))
+
 /* ISS field definitions for exceptions taken in to Hyp */
 #define ESR_ELx_CV             (UL(1) << 24)
 #define ESR_ELx_COND_SHIFT     (20)
 #define ESR_ELx_SME_ISS_ZA_DISABLED    3
 #define ESR_ELx_SME_ISS_ZT_DISABLED    4
 
+/* ISS field definitions for MOPS exceptions */
+#define ESR_ELx_MOPS_ISS_MEM_INST      (UL(1) << 24)
+#define ESR_ELx_MOPS_ISS_FROM_EPILOGUE (UL(1) << 18)
+#define ESR_ELx_MOPS_ISS_WRONG_OPTION  (UL(1) << 17)
+#define ESR_ELx_MOPS_ISS_OPTION_A      (UL(1) << 16)
+#define ESR_ELx_MOPS_ISS_DESTREG(esr)  (((esr) & (UL(0x1f) << 10)) >> 10)
+#define ESR_ELx_MOPS_ISS_SRCREG(esr)   (((esr) & (UL(0x1f) << 5)) >> 5)
+#define ESR_ELx_MOPS_ISS_SIZEREG(esr)  (((esr) & (UL(0x1f) << 0)) >> 0)
+
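Together with the new ISS2 definitions above, these macros carve a MOPS syndrome into its option/epilogue flags and the three operand register numbers. A small stand-alone decoder using the same bit positions (the sample syndrome value is fabricated for illustration; only the field layout comes from the header):

#include <stdint.h>
#include <stdio.h>

#define ISS_MASK                ((1ULL << 25) - 1)              /* ESR_ELx bits [24:0]  */
#define ISS2_SHIFT              32
#define ISS2_MASK               (((1ULL << 24) - 1) << 32)      /* ESR_ELx bits [55:32] */

#define MOPS_FROM_EPILOGUE      (1UL << 18)
#define MOPS_WRONG_OPTION       (1UL << 17)
#define MOPS_OPTION_A           (1UL << 16)
#define MOPS_DESTREG(iss)       (((iss) >> 10) & 0x1f)
#define MOPS_SRCREG(iss)        (((iss) >> 5) & 0x1f)
#define MOPS_SIZEREG(iss)       ((iss) & 0x1f)

int main(void)
{
        /* EC = 0x27 (MOPS), IL = 1, option A, dest = x3, src = x4, size = x5 */
        uint64_t esr = (0x27ULL << 26) | (1ULL << 25) |
                       MOPS_OPTION_A | (3UL << 10) | (4UL << 5) | 5;
        uint64_t iss = esr & ISS_MASK;

        printf("ISS = %#llx, ISS2 = %#llx\n", (unsigned long long)iss,
               (unsigned long long)((esr & ISS2_MASK) >> ISS2_SHIFT));
        printf("option A: %d, wrong option: %d, from epilogue: %d\n",
               !!(iss & MOPS_OPTION_A), !!(iss & MOPS_WRONG_OPTION),
               !!(iss & MOPS_FROM_EPILOGUE));
        printf("dest reg x%u, src reg x%u, size reg x%u\n",
               (unsigned)MOPS_DESTREG(iss), (unsigned)MOPS_SRCREG(iss),
               (unsigned)MOPS_SIZEREG(iss));
        return 0;
}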
 #ifndef __ASSEMBLY__
 #include <asm/types.h>
 
index e73af70..ad688e1 100644 (file)
@@ -8,16 +8,11 @@
 #define __ASM_EXCEPTION_H
 
 #include <asm/esr.h>
-#include <asm/kprobes.h>
 #include <asm/ptrace.h>
 
 #include <linux/interrupt.h>
 
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
 #define __exception_irq_entry  __irq_entry
-#else
-#define __exception_irq_entry  __kprobes
-#endif
 
 static inline unsigned long disr_to_esr(u64 disr)
 {
@@ -77,6 +72,7 @@ void do_el0_svc(struct pt_regs *regs);
 void do_el0_svc_compat(struct pt_regs *regs);
 void do_el0_fpac(struct pt_regs *regs, unsigned long esr);
 void do_el1_fpac(struct pt_regs *regs, unsigned long esr);
+void do_el0_mops(struct pt_regs *regs, unsigned long esr);
 void do_serror(struct pt_regs *regs, unsigned long esr);
 void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags);
 
index fa4c6ff..8405532 100644 (file)
@@ -154,4 +154,12 @@ static inline int get_num_wrps(void)
                                                ID_AA64DFR0_EL1_WRPs_SHIFT);
 }
 
+#ifdef CONFIG_CPU_PM
+extern void cpu_suspend_set_dbg_restorer(int (*hw_bp_restore)(unsigned int));
+#else
+static inline void cpu_suspend_set_dbg_restorer(int (*hw_bp_restore)(unsigned int))
+{
+}
+#endif
+
 #endif /* __ASM_BREAKPOINT_H */
index 5d45f19..692b1ec 100644 (file)
 #define KERNEL_HWCAP_SME_BI32I32       __khwcap2_feature(SME_BI32I32)
 #define KERNEL_HWCAP_SME_B16B16                __khwcap2_feature(SME_B16B16)
 #define KERNEL_HWCAP_SME_F16F16                __khwcap2_feature(SME_F16F16)
+#define KERNEL_HWCAP_MOPS              __khwcap2_feature(MOPS)
 
 /*
  * This yields a mask that user programs can use to figure out what
index c2b1321..c09cf94 100644 (file)
@@ -27,7 +27,7 @@
 
 /*
  * struct arm64_image_header - arm64 kernel image header
- * See Documentation/arm64/booting.rst for details
+ * See Documentation/arch/arm64/booting.rst for details
  *
  * @code0:             Executable code, or
  *   @mz_header                  alternatively used for part of MZ header
index 877495a..51d92ab 100644 (file)
  * Generic IO read/write.  These perform native-endian accesses.
  */
 #define __raw_writeb __raw_writeb
-static inline void __raw_writeb(u8 val, volatile void __iomem *addr)
+static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
 {
        asm volatile("strb %w0, [%1]" : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_writew __raw_writew
-static inline void __raw_writew(u16 val, volatile void __iomem *addr)
+static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
 {
        asm volatile("strh %w0, [%1]" : : "rZ" (val), "r" (addr));
 }
@@ -40,13 +40,13 @@ static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
 }
 
 #define __raw_writeq __raw_writeq
-static inline void __raw_writeq(u64 val, volatile void __iomem *addr)
+static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
 {
        asm volatile("str %x0, [%1]" : : "rZ" (val), "r" (addr));
 }
 
 #define __raw_readb __raw_readb
-static inline u8 __raw_readb(const volatile void __iomem *addr)
+static __always_inline u8 __raw_readb(const volatile void __iomem *addr)
 {
        u8 val;
        asm volatile(ALTERNATIVE("ldrb %w0, [%1]",
@@ -57,7 +57,7 @@ static inline u8 __raw_readb(const volatile void __iomem *addr)
 }
 
 #define __raw_readw __raw_readw
-static inline u16 __raw_readw(const volatile void __iomem *addr)
+static __always_inline u16 __raw_readw(const volatile void __iomem *addr)
 {
        u16 val;
 
@@ -80,7 +80,7 @@ static __always_inline u32 __raw_readl(const volatile void __iomem *addr)
 }
 
 #define __raw_readq __raw_readq
-static inline u64 __raw_readq(const volatile void __iomem *addr)
+static __always_inline u64 __raw_readq(const volatile void __iomem *addr)
 {
        u64 val;
        asm volatile(ALTERNATIVE("ldr %0, [%1]",
index e0f5f6b..1f31ec1 100644 (file)
@@ -24,7 +24,7 @@
 static __always_inline bool __irqflags_uses_pmr(void)
 {
        return IS_ENABLED(CONFIG_ARM64_PSEUDO_NMI) &&
-              alternative_has_feature_unlikely(ARM64_HAS_GIC_PRIO_MASKING);
+              alternative_has_cap_unlikely(ARM64_HAS_GIC_PRIO_MASKING);
 }
 
 static __always_inline void __daif_local_irq_enable(void)
index 186dd7f..5777738 100644 (file)
 /*
  * Initial memory map attributes.
  */
-#define SWAPPER_PTE_FLAGS      (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
-#define SWAPPER_PMD_FLAGS      (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
+#define SWAPPER_PTE_FLAGS      (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED | PTE_UXN)
+#define SWAPPER_PMD_FLAGS      (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S | PTE_UXN)
 
 #ifdef CONFIG_ARM64_4K_PAGES
-#define SWAPPER_RW_MMUFLAGS    (PMD_ATTRINDX(MT_NORMAL) | SWAPPER_PMD_FLAGS)
+#define SWAPPER_RW_MMUFLAGS    (PMD_ATTRINDX(MT_NORMAL) | SWAPPER_PMD_FLAGS | PTE_WRITE)
 #define SWAPPER_RX_MMUFLAGS    (SWAPPER_RW_MMUFLAGS | PMD_SECT_RDONLY)
 #else
-#define SWAPPER_RW_MMUFLAGS    (PTE_ATTRINDX(MT_NORMAL) | SWAPPER_PTE_FLAGS)
+#define SWAPPER_RW_MMUFLAGS    (PTE_ATTRINDX(MT_NORMAL) | SWAPPER_PTE_FLAGS | PTE_WRITE)
 #define SWAPPER_RX_MMUFLAGS    (SWAPPER_RW_MMUFLAGS | PTE_RDONLY)
 #endif
 
index baef29f..c6e12e8 100644 (file)
@@ -9,6 +9,7 @@
 
 #include <asm/esr.h>
 #include <asm/memory.h>
+#include <asm/sysreg.h>
 #include <asm/types.h>
 
 /* Hyp Configuration Register (HCR) bits */
@@ -92,6 +93,9 @@
 #define HCR_HOST_NVHE_PROTECTED_FLAGS (HCR_HOST_NVHE_FLAGS | HCR_TSC)
 #define HCR_HOST_VHE_FLAGS (HCR_RW | HCR_TGE | HCR_E2H)
 
+#define HCRX_GUEST_FLAGS (HCRX_EL2_SMPME | HCRX_EL2_TCR2En)
+#define HCRX_HOST_FLAGS (HCRX_EL2_MSCEn | HCRX_EL2_TCR2En)
+
 /* TCR_EL2 Registers bits */
 #define TCR_EL2_RES1           ((1U << 31) | (1 << 23))
 #define TCR_EL2_TBI            (1 << 20)
index 43c3bc0..86042af 100644 (file)
@@ -267,6 +267,24 @@ extern u64 __kvm_get_mdcr_el2(void);
        __kvm_at_err;                                                   \
 } )
 
+asmlinkage void __noreturn hyp_panic(void);
+asmlinkage void __noreturn hyp_panic_bad_stack(void);
+asmlinkage void kvm_unexpected_el2_exception(void);
+struct kvm_cpu_context;
+void handle_trap(struct kvm_cpu_context *host_ctxt);
+asmlinkage void __noreturn kvm_host_psci_cpu_entry(bool is_cpu_on);
+void __noreturn __pkvm_init_finalise(void);
+void kvm_nvhe_prepare_backtrace(unsigned long fp, unsigned long pc);
+void kvm_patch_vector_branch(struct alt_instr *alt,
+       __le32 *origptr, __le32 *updptr, int nr_inst);
+void kvm_get_kimage_voffset(struct alt_instr *alt,
+       __le32 *origptr, __le32 *updptr, int nr_inst);
+void kvm_compute_final_ctr_el0(struct alt_instr *alt,
+       __le32 *origptr, __le32 *updptr, int nr_inst);
+void __noreturn __cold nvhe_hyp_panic_handler(u64 esr, u64 spsr, u64 elr_virt,
+       u64 elr_phys, u64 par, uintptr_t vcpu, u64 far, u64 hpfar);
 
 #else /* __ASSEMBLY__ */
 
index 9787503..d48609d 100644 (file)
@@ -279,6 +279,7 @@ enum vcpu_sysreg {
        TTBR0_EL1,      /* Translation Table Base Register 0 */
        TTBR1_EL1,      /* Translation Table Base Register 1 */
        TCR_EL1,        /* Translation Control Register */
+       TCR2_EL1,       /* Extended Translation Control Register */
        ESR_EL1,        /* Exception Syndrome Register */
        AFSR0_EL1,      /* Auxiliary Fault Status Register 0 */
        AFSR1_EL1,      /* Auxiliary Fault Status Register 1 */
@@ -339,6 +340,10 @@ enum vcpu_sysreg {
        TFSR_EL1,       /* Tag Fault Status Register (EL1) */
        TFSRE0_EL1,     /* Tag Fault Status Register (EL0) */
 
+       /* Permission Indirection Extension registers */
+       PIR_EL1,       /* Permission Indirection Register 1 (EL1) */
+       PIRE0_EL1,     /*  Permission Indirection Register 0 (EL1) */
+
        /* 32bit specific registers. */
        DACR32_EL2,     /* Domain Access Control Register */
        IFSR32_EL2,     /* Instruction Fault Status Register */
@@ -1033,7 +1038,7 @@ void kvm_arm_clear_debug(struct kvm_vcpu *vcpu);
 void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu);
 
 #define kvm_vcpu_os_lock_enabled(vcpu)         \
-       (!!(__vcpu_sys_reg(vcpu, OSLSR_EL1) & SYS_OSLSR_OSLK))
+       (!!(__vcpu_sys_reg(vcpu, OSLSR_EL1) & OSLSR_EL1_OSLK))
 
 int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
                               struct kvm_device_attr *attr);
index f99d748..cbbcdc3 100644 (file)
@@ -18,7 +18,7 @@
 
 static __always_inline bool system_uses_lse_atomics(void)
 {
-       return alternative_has_feature_likely(ARM64_HAS_LSE_ATOMICS);
+       return alternative_has_cap_likely(ARM64_HAS_LSE_ATOMICS);
 }
 
 #define __lse_ll_sc_body(op, ...)                                      \
index c735afd..6e0e572 100644 (file)
@@ -46,7 +46,7 @@
 #define KIMAGE_VADDR           (MODULES_END)
 #define MODULES_END            (MODULES_VADDR + MODULES_VSIZE)
 #define MODULES_VADDR          (_PAGE_END(VA_BITS_MIN))
-#define MODULES_VSIZE          (SZ_128M)
+#define MODULES_VSIZE          (SZ_2G)
 #define VMEMMAP_START          (-(UL(1) << (VA_BITS - VMEMMAP_SHIFT)))
 #define VMEMMAP_END            (VMEMMAP_START + VMEMMAP_SIZE)
 #define PCI_IO_END             (VMEMMAP_START - SZ_8M)
@@ -204,15 +204,17 @@ static inline unsigned long kaslr_offset(void)
        return kimage_vaddr - KIMAGE_VADDR;
 }
 
+#ifdef CONFIG_RANDOMIZE_BASE
+void kaslr_init(void);
 static inline bool kaslr_enabled(void)
 {
-       /*
-        * The KASLR offset modulo MIN_KIMG_ALIGN is taken from the physical
-        * placement of the image rather than from the seed, so a displacement
-        * of less than MIN_KIMG_ALIGN means that no seed was provided.
-        */
-       return kaslr_offset() >= MIN_KIMG_ALIGN;
+       extern bool __kaslr_is_enabled;
+       return __kaslr_is_enabled;
 }
+#else
+static inline void kaslr_init(void) { }
+static inline bool kaslr_enabled(void) { return false; }
+#endif
 
 /*
  * Allow all memory at the discovery stage. We will clip it later.
index 5691169..a6fb325 100644 (file)
@@ -39,11 +39,16 @@ static inline void contextidr_thread_switch(struct task_struct *next)
 /*
  * Set TTBR0 to reserved_pg_dir. No translations will be possible via TTBR0.
  */
-static inline void cpu_set_reserved_ttbr0(void)
+static inline void cpu_set_reserved_ttbr0_nosync(void)
 {
        unsigned long ttbr = phys_to_ttbr(__pa_symbol(reserved_pg_dir));
 
        write_sysreg(ttbr, ttbr0_el1);
+}
+
+static inline void cpu_set_reserved_ttbr0(void)
+{
+       cpu_set_reserved_ttbr0_nosync();
        isb();
 }
 
@@ -52,7 +57,6 @@ void cpu_do_switch_mm(phys_addr_t pgd_phys, struct mm_struct *mm);
 static inline void cpu_switch_mm(pgd_t *pgd, struct mm_struct *mm)
 {
        BUG_ON(pgd == swapper_pg_dir);
-       cpu_set_reserved_ttbr0();
        cpu_do_switch_mm(virt_to_phys(pgd),mm);
 }
 
@@ -164,7 +168,7 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp, pgd_t *idmap)
                 * up (i.e. cpufeature framework is not up yet) and
                 * latter only when we enable CNP via cpufeature's
                 * enable() callback.
-                * Also we rely on the cpu_hwcap bit being set before
+                * Also we rely on the system_cpucaps bit being set before
                 * calling the enable() function.
                 */
                ttbr1 |= TTBR_CNP_BIT;
index 18734fe..bfa6638 100644 (file)
@@ -7,7 +7,6 @@
 
 #include <asm-generic/module.h>
 
-#ifdef CONFIG_ARM64_MODULE_PLTS
 struct mod_plt_sec {
        int                     plt_shndx;
        int                     plt_num_entries;
@@ -21,7 +20,6 @@ struct mod_arch_specific {
        /* for CONFIG_DYNAMIC_FTRACE */
        struct plt_entry        *ftrace_trampolines;
 };
-#endif
 
 u64 module_emit_plt_entry(struct module *mod, Elf64_Shdr *sechdrs,
                          void *loc, const Elf64_Rela *rela,
@@ -30,12 +28,6 @@ u64 module_emit_plt_entry(struct module *mod, Elf64_Shdr *sechdrs,
 u64 module_emit_veneer_for_adrp(struct module *mod, Elf64_Shdr *sechdrs,
                                void *loc, u64 val);
 
-#ifdef CONFIG_RANDOMIZE_BASE
-extern u64 module_alloc_base;
-#else
-#define module_alloc_base      ((u64)_etext - MODULES_VSIZE)
-#endif
-
 struct plt_entry {
        /*
         * A program that conforms to the AArch64 Procedure Call Standard
index dbba4b7..b9ae834 100644 (file)
@@ -1,9 +1,7 @@
 SECTIONS {
-#ifdef CONFIG_ARM64_MODULE_PLTS
        .plt 0 : { BYTE(0) }
        .init.plt 0 : { BYTE(0) }
        .text.ftrace_trampoline 0 : { BYTE(0) }
-#endif
 
 #ifdef CONFIG_KASAN_SW_TAGS
        /*
index b9ba19d..9abcc8e 100644 (file)
@@ -140,17 +140,11 @@ PERCPU_RET_OP(add, add, ldadd)
  * re-enabling preemption for preemptible kernels, but doing that in a way
  * which builds inside a module would mean messing directly with the preempt
  * count. If you do this, peterz and tglx will hunt you down.
+ *
+ * Not to mention it'll break the actual preemption model by missing a
+ * preemption point when TIF_NEED_RESCHED gets set while preemption is
+ * disabled.
  */
-#define this_cpu_cmpxchg_double_8(ptr1, ptr2, o1, o2, n1, n2)          \
-({                                                                     \
-       int __ret;                                                      \
-       preempt_disable_notrace();                                      \
-       __ret = cmpxchg_double_local(   raw_cpu_ptr(&(ptr1)),           \
-                                       raw_cpu_ptr(&(ptr2)),           \
-                                       o1, o2, n1, n2);                \
-       preempt_enable_notrace();                                       \
-       __ret;                                                          \
-})
 
 #define _pcp_protect(op, pcp, ...)                                     \
 ({                                                                     \
@@ -240,6 +234,22 @@ PERCPU_RET_OP(add, add, ldadd)
 #define this_cpu_cmpxchg_8(pcp, o, n)  \
        _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
 
+#define this_cpu_cmpxchg64(pcp, o, n)  this_cpu_cmpxchg_8(pcp, o, n)
+
+#define this_cpu_cmpxchg128(pcp, o, n)                                 \
+({                                                                     \
+       typedef typeof(pcp) pcp_op_T__;                                 \
+       u128 old__, new__, ret__;                                       \
+       pcp_op_T__ *ptr__;                                              \
+       old__ = o;                                                      \
+       new__ = n;                                                      \
+       preempt_disable_notrace();                                      \
+       ptr__ = raw_cpu_ptr(&(pcp));                                    \
+       ret__ = cmpxchg128_local((void *)ptr__, old__, new__);          \
+       preempt_enable_notrace();                                       \
+       ret__;                                                          \
+})
+
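this_cpu_cmpxchg128 disables preemption around cmpxchg128_local so a 128-bit per-CPU slot is updated atomically with respect to the local CPU. The usual reason for wanting 128 bits is packing two 64-bit fields, for instance a pointer plus a generation counter, so stale updates are caught. A stand-alone C11 sketch of that packing pattern (ordinary atomics stand in for the per-CPU primitive, may need -latomic depending on the toolchain; not kernel code):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 u128;

struct pair {
        uint64_t ptr;   /* e.g. a freelist head */
        uint64_t gen;   /* bumped on every update so reuse of ptr is detected */
};

static u128 pack(struct pair p)
{
        return ((u128)p.gen << 64) | p.ptr;
}

static struct pair unpack(u128 v)
{
        return (struct pair){ .ptr = (uint64_t)v, .gen = (uint64_t)(v >> 64) };
}

int main(void)
{
        _Atomic u128 slot = 0;  /* the 128-bit word a cmpxchg128 user updates */
        struct pair cur, next;
        u128 expected;

        do {
                expected = atomic_load(&slot);
                cur = unpack(expected);
                next.ptr = 0xdead0000;          /* pretend we popped an object */
                next.gen = cur.gen + 1;
        } while (!atomic_compare_exchange_weak(&slot, &expected, pack(next)));

        cur = unpack(atomic_load(&slot));
        printf("ptr=%#llx gen=%llu\n",
               (unsigned long long)cur.ptr, (unsigned long long)cur.gen);
        return 0;
}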
 #ifdef __KVM_NVHE_HYPERVISOR__
 extern unsigned long __hyp_per_cpu_offset(unsigned int cpu);
 #define __per_cpu_offset
index f658aaf..e4944d5 100644 (file)
 #define PTE_ATTRINDX_MASK      (_AT(pteval_t, 7) << 2)
 
 /*
+ * PIIndex[3:0] encoding (Permission Indirection Extension)
+ */
+#define PTE_PI_IDX_0   6       /* AP[1], USER */
+#define PTE_PI_IDX_1   51      /* DBM */
+#define PTE_PI_IDX_2   53      /* PXN */
+#define PTE_PI_IDX_3   54      /* UXN */
+
+/*
  * Memory Attribute override for Stage-2 (MemAttr[3:0])
  */
 #define PTE_S2_MEMATTR(t)      (_AT(pteval_t, (t)) << 2)
index 9b16511..eed814b 100644 (file)
  */
 #define PMD_PRESENT_INVALID    (_AT(pteval_t, 1) << 59) /* only when !PMD_SECT_VALID */
 
+#define _PROT_DEFAULT          (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
+#define _PROT_SECT_DEFAULT     (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
+
+#define PROT_DEFAULT           (_PROT_DEFAULT | PTE_MAYBE_NG)
+#define PROT_SECT_DEFAULT      (_PROT_SECT_DEFAULT | PMD_MAYBE_NG)
+
+#define PROT_DEVICE_nGnRnE     (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_WRITE | PTE_ATTRINDX(MT_DEVICE_nGnRnE))
+#define PROT_DEVICE_nGnRE      (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_WRITE | PTE_ATTRINDX(MT_DEVICE_nGnRE))
+#define PROT_NORMAL_NC         (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL_NC))
+#define PROT_NORMAL            (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL))
+#define PROT_NORMAL_TAGGED     (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL_TAGGED))
+
+#define PROT_SECT_DEVICE_nGnRE (PROT_SECT_DEFAULT | PMD_SECT_PXN | PMD_SECT_UXN | PMD_ATTRINDX(MT_DEVICE_nGnRE))
+#define PROT_SECT_NORMAL       (PROT_SECT_DEFAULT | PMD_SECT_PXN | PMD_SECT_UXN | PTE_WRITE | PMD_ATTRINDX(MT_NORMAL))
+#define PROT_SECT_NORMAL_EXEC  (PROT_SECT_DEFAULT | PMD_SECT_UXN | PMD_ATTRINDX(MT_NORMAL))
+
+#define _PAGE_DEFAULT          (_PROT_DEFAULT | PTE_ATTRINDX(MT_NORMAL))
+
+#define _PAGE_KERNEL           (PROT_NORMAL)
+#define _PAGE_KERNEL_RO                ((PROT_NORMAL & ~PTE_WRITE) | PTE_RDONLY)
+#define _PAGE_KERNEL_ROX       ((PROT_NORMAL & ~(PTE_WRITE | PTE_PXN)) | PTE_RDONLY)
+#define _PAGE_KERNEL_EXEC      (PROT_NORMAL & ~PTE_PXN)
+#define _PAGE_KERNEL_EXEC_CONT ((PROT_NORMAL & ~PTE_PXN) | PTE_CONT)
+
+#define _PAGE_SHARED           (_PAGE_DEFAULT | PTE_USER | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN | PTE_WRITE)
+#define _PAGE_SHARED_EXEC      (_PAGE_DEFAULT | PTE_USER | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_WRITE)
+#define _PAGE_READONLY         (_PAGE_DEFAULT | PTE_USER | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
+#define _PAGE_READONLY_EXEC    (_PAGE_DEFAULT | PTE_USER | PTE_RDONLY | PTE_NG | PTE_PXN)
+#define _PAGE_EXECONLY         (_PAGE_DEFAULT | PTE_RDONLY | PTE_NG | PTE_PXN)
+
+#ifdef __ASSEMBLY__
+#define PTE_MAYBE_NG   0
+#endif
+
 #ifndef __ASSEMBLY__
 
 #include <asm/cpufeature.h>
@@ -34,9 +68,6 @@
 
 extern bool arm64_use_ng_mappings;
 
-#define _PROT_DEFAULT          (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
-#define _PROT_SECT_DEFAULT     (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
-
 #define PTE_MAYBE_NG           (arm64_use_ng_mappings ? PTE_NG : 0)
 #define PMD_MAYBE_NG           (arm64_use_ng_mappings ? PMD_SECT_NG : 0)
 
@@ -50,26 +81,11 @@ extern bool arm64_use_ng_mappings;
 #define PTE_MAYBE_GP           0
 #endif
 
-#define PROT_DEFAULT           (_PROT_DEFAULT | PTE_MAYBE_NG)
-#define PROT_SECT_DEFAULT      (_PROT_SECT_DEFAULT | PMD_MAYBE_NG)
-
-#define PROT_DEVICE_nGnRnE     (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_WRITE | PTE_ATTRINDX(MT_DEVICE_nGnRnE))
-#define PROT_DEVICE_nGnRE      (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_WRITE | PTE_ATTRINDX(MT_DEVICE_nGnRE))
-#define PROT_NORMAL_NC         (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL_NC))
-#define PROT_NORMAL            (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL))
-#define PROT_NORMAL_TAGGED     (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL_TAGGED))
-
-#define PROT_SECT_DEVICE_nGnRE (PROT_SECT_DEFAULT | PMD_SECT_PXN | PMD_SECT_UXN | PMD_ATTRINDX(MT_DEVICE_nGnRE))
-#define PROT_SECT_NORMAL       (PROT_SECT_DEFAULT | PMD_SECT_PXN | PMD_SECT_UXN | PMD_ATTRINDX(MT_NORMAL))
-#define PROT_SECT_NORMAL_EXEC  (PROT_SECT_DEFAULT | PMD_SECT_UXN | PMD_ATTRINDX(MT_NORMAL))
-
-#define _PAGE_DEFAULT          (_PROT_DEFAULT | PTE_ATTRINDX(MT_NORMAL))
-
-#define PAGE_KERNEL            __pgprot(PROT_NORMAL)
-#define PAGE_KERNEL_RO         __pgprot((PROT_NORMAL & ~PTE_WRITE) | PTE_RDONLY)
-#define PAGE_KERNEL_ROX                __pgprot((PROT_NORMAL & ~(PTE_WRITE | PTE_PXN)) | PTE_RDONLY)
-#define PAGE_KERNEL_EXEC       __pgprot(PROT_NORMAL & ~PTE_PXN)
-#define PAGE_KERNEL_EXEC_CONT  __pgprot((PROT_NORMAL & ~PTE_PXN) | PTE_CONT)
+#define PAGE_KERNEL            __pgprot(_PAGE_KERNEL)
+#define PAGE_KERNEL_RO         __pgprot(_PAGE_KERNEL_RO)
+#define PAGE_KERNEL_ROX                __pgprot(_PAGE_KERNEL_ROX)
+#define PAGE_KERNEL_EXEC       __pgprot(_PAGE_KERNEL_EXEC)
+#define PAGE_KERNEL_EXEC_CONT  __pgprot(_PAGE_KERNEL_EXEC_CONT)
 
 #define PAGE_S2_MEMATTR(attr, has_fwb)                                 \
        ({                                                              \
@@ -83,12 +99,62 @@ extern bool arm64_use_ng_mappings;
 
 #define PAGE_NONE              __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
 /* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */
-#define PAGE_SHARED            __pgprot(_PAGE_DEFAULT | PTE_USER | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN | PTE_WRITE)
-#define PAGE_SHARED_EXEC       __pgprot(_PAGE_DEFAULT | PTE_USER | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_WRITE)
-#define PAGE_READONLY          __pgprot(_PAGE_DEFAULT | PTE_USER | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
-#define PAGE_READONLY_EXEC     __pgprot(_PAGE_DEFAULT | PTE_USER | PTE_RDONLY | PTE_NG | PTE_PXN)
-#define PAGE_EXECONLY          __pgprot(_PAGE_DEFAULT | PTE_RDONLY | PTE_NG | PTE_PXN)
+#define PAGE_SHARED            __pgprot(_PAGE_SHARED)
+#define PAGE_SHARED_EXEC       __pgprot(_PAGE_SHARED_EXEC)
+#define PAGE_READONLY          __pgprot(_PAGE_READONLY)
+#define PAGE_READONLY_EXEC     __pgprot(_PAGE_READONLY_EXEC)
+#define PAGE_EXECONLY          __pgprot(_PAGE_EXECONLY)
 
 #endif /* __ASSEMBLY__ */
 
+#define pte_pi_index(pte) ( \
+       ((pte & BIT(PTE_PI_IDX_3)) >> (PTE_PI_IDX_3 - 3)) | \
+       ((pte & BIT(PTE_PI_IDX_2)) >> (PTE_PI_IDX_2 - 2)) | \
+       ((pte & BIT(PTE_PI_IDX_1)) >> (PTE_PI_IDX_1 - 1)) | \
+       ((pte & BIT(PTE_PI_IDX_0)) >> (PTE_PI_IDX_0 - 0)))
+
+/*
+ * Page types used via Permission Indirection Extension (PIE). PIE uses
+ * the USER, DBM, PXN and UXN bits to generate an index which is used
+ * to look up the actual permission in PIR_ELx and PIRE0_EL1. We define
+ * combinations we use on non-PIE systems with the same encoding; for
+ * convenience these are listed here as comments, as are the unallocated
+ * encodings.
+ */
+
+/* 0: PAGE_DEFAULT                                                  */
+/* 1:                                                      PTE_USER */
+/* 2:                                          PTE_WRITE            */
+/* 3:                                          PTE_WRITE | PTE_USER */
+/* 4: PAGE_EXECONLY                  PTE_PXN                        */
+/* 5: PAGE_READONLY_EXEC             PTE_PXN |             PTE_USER */
+/* 6:                                PTE_PXN | PTE_WRITE            */
+/* 7: PAGE_SHARED_EXEC               PTE_PXN | PTE_WRITE | PTE_USER */
+/* 8: PAGE_KERNEL_ROX      PTE_UXN                                  */
+/* 9:                      PTE_UXN |                       PTE_USER */
+/* a: PAGE_KERNEL_EXEC     PTE_UXN |           PTE_WRITE            */
+/* b:                      PTE_UXN |           PTE_WRITE | PTE_USER */
+/* c: PAGE_KERNEL_RO       PTE_UXN | PTE_PXN                        */
+/* d: PAGE_READONLY        PTE_UXN | PTE_PXN |             PTE_USER */
+/* e: PAGE_KERNEL          PTE_UXN | PTE_PXN | PTE_WRITE            */
+/* f: PAGE_SHARED          PTE_UXN | PTE_PXN | PTE_WRITE | PTE_USER */
+
+#define PIE_E0 ( \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_EXECONLY),      PIE_X_O) | \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_READONLY_EXEC), PIE_RX)  | \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_SHARED_EXEC),   PIE_RWX) | \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_READONLY),      PIE_R)   | \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_SHARED),        PIE_RW))
+
+#define PIE_E1 ( \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_EXECONLY),      PIE_NONE_O) | \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_READONLY_EXEC), PIE_R)      | \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_SHARED_EXEC),   PIE_RW)     | \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_READONLY),      PIE_R)      | \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_SHARED),        PIE_RW)     | \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_KERNEL_ROX),    PIE_RX)     | \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_KERNEL_EXEC),   PIE_RWX)    | \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_KERNEL_RO),     PIE_R)      | \
+       PIRx_ELx_PERM(pte_pi_index(_PAGE_KERNEL),        PIE_RW))
+
 #endif /* __ASM_PGTABLE_PROT_H */
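
pte_pi_index() gathers the UXN, PXN, DBM and USER bits of a pte into a 4-bit index, and PIRx_ELx_PERM() then places a 4-bit permission nibble at index * 4 inside the PIR_EL1/PIRE0_EL1 image that PIE_E0/PIE_E1 build up. A stand-alone worked example with the bit positions copied from the defines above (the sample pte corresponds to the "d: PAGE_READONLY" row of the comment table; PIE_R = 0x8 matches the sysreg header later in this diff):

#include <stdint.h>
#include <stdio.h>

#define BIT(n)          (1ULL << (n))

#define PTE_PI_IDX_0    6       /* AP[1], USER */
#define PTE_PI_IDX_1    51      /* DBM */
#define PTE_PI_IDX_2    53      /* PXN */
#define PTE_PI_IDX_3    54      /* UXN */

#define PIE_R           0x8
#define PIRX_ELX_PERM(idx, perm)        ((uint64_t)(perm) << ((idx) * 4))

static unsigned int pte_pi_index(uint64_t pte)
{
        return ((pte & BIT(PTE_PI_IDX_3)) >> (PTE_PI_IDX_3 - 3)) |
               ((pte & BIT(PTE_PI_IDX_2)) >> (PTE_PI_IDX_2 - 2)) |
               ((pte & BIT(PTE_PI_IDX_1)) >> (PTE_PI_IDX_1 - 1)) |
               ((pte & BIT(PTE_PI_IDX_0)) >> (PTE_PI_IDX_0 - 0));
}

int main(void)
{
        /* UXN | PXN | USER: the layout the comment table labels "d: PAGE_READONLY" */
        uint64_t pte = BIT(PTE_PI_IDX_3) | BIT(PTE_PI_IDX_2) | BIT(PTE_PI_IDX_0);
        unsigned int idx = pte_pi_index(pte);

        printf("index = %#x (nibble at bits [%u:%u])\n", idx, idx * 4 + 3, idx * 4);
        printf("PIR contribution for read-only = %#llx\n",
               (unsigned long long)PIRX_ELX_PERM(idx, PIE_R));
        return 0;
}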
index 13df982..3fdae5f 100644 (file)
@@ -73,6 +73,7 @@ static inline void dynamic_scs_init(void) {}
 #endif
 
 int scs_patch(const u8 eh_frame[], int size);
+asmlinkage void scs_patch_vmlinux(void);
 
 #endif /* __ASSEMBLY __ */
 
index f2d2623..9b31e6d 100644 (file)
@@ -99,7 +99,7 @@ static inline void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
 
 extern int __cpu_disable(void);
 
-extern void __cpu_die(unsigned int cpu);
+static inline void __cpu_die(unsigned int cpu) { }
 extern void __noreturn cpu_die(void);
 extern void __noreturn cpu_die_early(void);
 
index db7b371..9cc5014 100644 (file)
@@ -100,5 +100,21 @@ bool is_spectre_bhb_affected(const struct arm64_cpu_capabilities *entry, int sco
 u8 spectre_bhb_loop_affected(int scope);
 void spectre_bhb_enable_mitigation(const struct arm64_cpu_capabilities *__unused);
 bool try_emulate_el1_ssbs(struct pt_regs *regs, u32 instr);
+
+void spectre_v4_patch_fw_mitigation_enable(struct alt_instr *alt, __le32 *origptr,
+                                          __le32 *updptr, int nr_inst);
+void smccc_patch_fw_mitigation_conduit(struct alt_instr *alt, __le32 *origptr,
+                                      __le32 *updptr, int nr_inst);
+void spectre_bhb_patch_loop_mitigation_enable(struct alt_instr *alt, __le32 *origptr,
+                                             __le32 *updptr, int nr_inst);
+void spectre_bhb_patch_fw_mitigation_enabled(struct alt_instr *alt, __le32 *origptr,
+                                            __le32 *updptr, int nr_inst);
+void spectre_bhb_patch_loop_iter(struct alt_instr *alt,
+                                __le32 *origptr, __le32 *updptr, int nr_inst);
+void spectre_bhb_patch_wa3(struct alt_instr *alt,
+                          __le32 *origptr, __le32 *updptr, int nr_inst);
+void spectre_bhb_patch_clearbhb(struct alt_instr *alt,
+                               __le32 *origptr, __le32 *updptr, int nr_inst);
+
 #endif /* __ASSEMBLY__ */
 #endif /* __ASM_SPECTRE_H */
index d30217c..17f6875 100644 (file)
@@ -38,6 +38,7 @@
        asmlinkage long __arm64_compat_sys_##sname(const struct pt_regs *__unused)
 
 #define COND_SYSCALL_COMPAT(name)                                                      \
+       asmlinkage long __arm64_compat_sys_##name(const struct pt_regs *regs);          \
        asmlinkage long __weak __arm64_compat_sys_##name(const struct pt_regs *regs)    \
        {                                                                               \
                return sys_ni_syscall();                                                \
@@ -53,6 +54,7 @@
        ALLOW_ERROR_INJECTION(__arm64_sys##name, ERRNO);                        \
        static long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__));             \
        static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__));      \
+       asmlinkage long __arm64_sys##name(const struct pt_regs *regs);          \
        asmlinkage long __arm64_sys##name(const struct pt_regs *regs)           \
        {                                                                       \
                return __se_sys##name(SC_ARM64_REGS_TO_ARGS(x,__VA_ARGS__));    \
        asmlinkage long __arm64_sys_##sname(const struct pt_regs *__unused)
 
 #define COND_SYSCALL(name)                                                     \
+       asmlinkage long __arm64_sys_##name(const struct pt_regs *regs);         \
        asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
        {                                                                       \
                return sys_ni_syscall();                                        \
        }
 
+asmlinkage long __arm64_sys_ni_syscall(const struct pt_regs *__unused);
 #define SYS_NI(name) SYSCALL_ALIAS(__arm64_sys_##name, sys_ni_posix_timers);
 
 #endif /* __ASM_SYSCALL_WRAPPER_H */
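
The extra asmlinkage declarations give each weak stub and wrapper a visible prototype, which appears to be aimed at keeping -Wmissing-prototypes quiet now that nothing else declares them. A hedged sketch of roughly what the updated COND_SYSCALL() expands to, using a made-up syscall name and stubbed kernel types so it compiles on its own:

struct pt_regs;                         /* stand-in for the kernel's struct */

static long sys_ni_syscall(void)        /* stand-in; the real one returns -ENOSYS */
{
        return -38;
}

/* Roughly what COND_SYSCALL(foo) now expands to ("foo" is hypothetical): */
long __arm64_sys_foo(const struct pt_regs *regs);      /* new: prior prototype */
__attribute__((weak)) long __arm64_sys_foo(const struct pt_regs *regs)
{
        (void)regs;
        return sys_ni_syscall();        /* overridden if a real definition exists */
}

int main(void)
{
        return (int)__arm64_sys_foo(0);
}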
index eefd712..7a1e626 100644 (file)
 #define SYS_SVCR_SMSTART_SM_EL0                sys_reg(0, 3, 4, 3, 3)
 #define SYS_SVCR_SMSTOP_SMZA_EL0       sys_reg(0, 3, 4, 6, 3)
 
-#define SYS_OSDTRRX_EL1                        sys_reg(2, 0, 0, 0, 2)
-#define SYS_MDCCINT_EL1                        sys_reg(2, 0, 0, 2, 0)
-#define SYS_MDSCR_EL1                  sys_reg(2, 0, 0, 2, 2)
-#define SYS_OSDTRTX_EL1                        sys_reg(2, 0, 0, 3, 2)
-#define SYS_OSECCR_EL1                 sys_reg(2, 0, 0, 6, 2)
 #define SYS_DBGBVRn_EL1(n)             sys_reg(2, 0, 0, n, 4)
 #define SYS_DBGBCRn_EL1(n)             sys_reg(2, 0, 0, n, 5)
 #define SYS_DBGWVRn_EL1(n)             sys_reg(2, 0, 0, n, 6)
 #define SYS_DBGWCRn_EL1(n)             sys_reg(2, 0, 0, n, 7)
 #define SYS_MDRAR_EL1                  sys_reg(2, 0, 1, 0, 0)
 
-#define SYS_OSLAR_EL1                  sys_reg(2, 0, 1, 0, 4)
-#define SYS_OSLAR_OSLK                 BIT(0)
-
 #define SYS_OSLSR_EL1                  sys_reg(2, 0, 1, 1, 4)
-#define SYS_OSLSR_OSLM_MASK            (BIT(3) | BIT(0))
-#define SYS_OSLSR_OSLM_NI              0
-#define SYS_OSLSR_OSLM_IMPLEMENTED     BIT(3)
-#define SYS_OSLSR_OSLK                 BIT(1)
+#define OSLSR_EL1_OSLM_MASK            (BIT(3) | BIT(0))
+#define OSLSR_EL1_OSLM_NI              0
+#define OSLSR_EL1_OSLM_IMPLEMENTED     BIT(3)
+#define OSLSR_EL1_OSLK                 BIT(1)
 
 #define SYS_OSDLR_EL1                  sys_reg(2, 0, 1, 3, 4)
 #define SYS_DBGPRCR_EL1                        sys_reg(2, 0, 1, 4, 4)
 
 /*** End of Statistical Profiling Extension ***/
 
-/*
- * TRBE Registers
- */
-#define SYS_TRBLIMITR_EL1              sys_reg(3, 0, 9, 11, 0)
-#define SYS_TRBPTR_EL1                 sys_reg(3, 0, 9, 11, 1)
-#define SYS_TRBBASER_EL1               sys_reg(3, 0, 9, 11, 2)
-#define SYS_TRBSR_EL1                  sys_reg(3, 0, 9, 11, 3)
-#define SYS_TRBMAR_EL1                 sys_reg(3, 0, 9, 11, 4)
-#define SYS_TRBTRG_EL1                 sys_reg(3, 0, 9, 11, 6)
-#define SYS_TRBIDR_EL1                 sys_reg(3, 0, 9, 11, 7)
-
-#define TRBLIMITR_LIMIT_MASK           GENMASK_ULL(51, 0)
-#define TRBLIMITR_LIMIT_SHIFT          12
-#define TRBLIMITR_NVM                  BIT(5)
-#define TRBLIMITR_TRIG_MODE_MASK       GENMASK(1, 0)
-#define TRBLIMITR_TRIG_MODE_SHIFT      3
-#define TRBLIMITR_FILL_MODE_MASK       GENMASK(1, 0)
-#define TRBLIMITR_FILL_MODE_SHIFT      1
-#define TRBLIMITR_ENABLE               BIT(0)
-#define TRBPTR_PTR_MASK                        GENMASK_ULL(63, 0)
-#define TRBPTR_PTR_SHIFT               0
-#define TRBBASER_BASE_MASK             GENMASK_ULL(51, 0)
-#define TRBBASER_BASE_SHIFT            12
-#define TRBSR_EC_MASK                  GENMASK(5, 0)
-#define TRBSR_EC_SHIFT                 26
-#define TRBSR_IRQ                      BIT(22)
-#define TRBSR_TRG                      BIT(21)
-#define TRBSR_WRAP                     BIT(20)
-#define TRBSR_ABORT                    BIT(18)
-#define TRBSR_STOP                     BIT(17)
-#define TRBSR_MSS_MASK                 GENMASK(15, 0)
-#define TRBSR_MSS_SHIFT                        0
-#define TRBSR_BSC_MASK                 GENMASK(5, 0)
-#define TRBSR_BSC_SHIFT                        0
-#define TRBSR_FSC_MASK                 GENMASK(5, 0)
-#define TRBSR_FSC_SHIFT                        0
-#define TRBMAR_SHARE_MASK              GENMASK(1, 0)
-#define TRBMAR_SHARE_SHIFT             8
-#define TRBMAR_OUTER_MASK              GENMASK(3, 0)
-#define TRBMAR_OUTER_SHIFT             4
-#define TRBMAR_INNER_MASK              GENMASK(3, 0)
-#define TRBMAR_INNER_SHIFT             0
-#define TRBTRG_TRG_MASK                        GENMASK(31, 0)
-#define TRBTRG_TRG_SHIFT               0
-#define TRBIDR_FLAG                    BIT(5)
-#define TRBIDR_PROG                    BIT(4)
-#define TRBIDR_ALIGN_MASK              GENMASK(3, 0)
-#define TRBIDR_ALIGN_SHIFT             0
+#define TRBSR_EL1_BSC_MASK             GENMASK(5, 0)
+#define TRBSR_EL1_BSC_SHIFT            0
 
 #define SYS_PMINTENSET_EL1             sys_reg(3, 0, 9, 14, 1)
 #define SYS_PMINTENCLR_EL1             sys_reg(3, 0, 9, 14, 2)
 #define ICH_VTR_TDS_SHIFT      19
 #define ICH_VTR_TDS_MASK       (1 << ICH_VTR_TDS_SHIFT)
 
+/*
+ * Permission Indirection Extension (PIE) permission encodings.
+ * Encodings with the _O suffix have overlays applied (Permission Overlay Extension).
+ */
+#define PIE_NONE_O     0x0
+#define PIE_R_O                0x1
+#define PIE_X_O                0x2
+#define PIE_RX_O       0x3
+#define PIE_RW_O       0x5
+#define PIE_RWnX_O     0x6
+#define PIE_RWX_O      0x7
+#define PIE_R          0x8
+#define PIE_GCS                0x9
+#define PIE_RX         0xa
+#define PIE_RW         0xc
+#define PIE_RWX                0xe
+
+#define PIRx_ELx_PERM(idx, perm)       ((perm) << ((idx) * 4))
+
 #define ARM64_FEATURE_FIELD_BITS       4
 
 /* Defined for compatibility only, do not add new users. */
index 848739c..553d1bc 100644 (file)
@@ -55,10 +55,6 @@ struct thread_info {
 void arch_setup_new_exec(void);
 #define arch_setup_new_exec     arch_setup_new_exec
 
-void arch_release_task_struct(struct task_struct *tsk);
-int arch_dup_task_struct(struct task_struct *dst,
-                               struct task_struct *src);
-
 #endif
 
 #define TIF_SIGPENDING         0       /* signal pending */
index 1f361e2..d66dfb3 100644 (file)
@@ -29,6 +29,8 @@ void arm64_force_sig_fault(int signo, int code, unsigned long far, const char *s
 void arm64_force_sig_mceerr(int code, unsigned long far, short lsb, const char *str);
 void arm64_force_sig_ptrace_errno_trap(int errno, unsigned long far, const char *str);
 
+int early_brk64(unsigned long addr, unsigned long esr, struct pt_regs *regs);
+
 /*
  * Move regs->pc to next instruction and do necessary setup before it
  * is executed.
index 05f4fc2..14be500 100644 (file)
@@ -65,7 +65,6 @@ static inline void __uaccess_ttbr0_disable(void)
        ttbr &= ~TTBR_ASID_MASK;
        /* reserved_pg_dir placed before swapper_pg_dir */
        write_sysreg(ttbr - RESERVED_SWAPPER_OFFSET, ttbr0_el1);
-       isb();
        /* Set reserved ASID */
        write_sysreg(ttbr, ttbr1_el1);
        isb();
@@ -89,7 +88,6 @@ static inline void __uaccess_ttbr0_enable(void)
        ttbr1 &= ~TTBR_ASID_MASK;               /* safety measure */
        ttbr1 |= ttbr0 & TTBR_ASID_MASK;
        write_sysreg(ttbr1, ttbr1_el1);
-       isb();
 
        /* Restore user page table */
        write_sysreg(ttbr0, ttbr0_el1);
index 037feba..64a514f 100644 (file)
@@ -39,7 +39,7 @@
 #define __ARM_NR_compat_set_tls                (__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END            (__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls           451
+#define __NR_compat_syscalls           452
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
index 604a205..d952a28 100644 (file)
@@ -907,6 +907,8 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
+#define __NR_cachestat 451
+__SYSCALL(__NR_cachestat, sys_cachestat)
 
 /*
  * Please add new compat syscalls above this comment and update
index 69a4fb7..a2cac43 100644 (file)
 #define HWCAP2_SME_BI32I32     (1UL << 40)
 #define HWCAP2_SME_B16B16      (1UL << 41)
 #define HWCAP2_SME_F16F16      (1UL << 42)
+#define HWCAP2_MOPS            (1UL << 43)
 
 #endif /* _UAPI__ASM_HWCAP_H */
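
User space can probe for the new capability through the AT_HWCAP2 auxiliary vector entry; the bit value below matches the define above, and getauxval() is the usual glibc/bionic interface:

#include <stdio.h>
#include <sys/auxv.h>

#ifndef HWCAP2_MOPS
#define HWCAP2_MOPS     (1UL << 43)     /* matches the define added above */
#endif

int main(void)
{
        if (getauxval(AT_HWCAP2) & HWCAP2_MOPS)
                puts("FEAT_MOPS: CPY*/SET* instructions advertised");
        else
                puts("no FEAT_MOPS");
        return 0;
}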
index 656a10e..f23c1dc 100644 (file)
@@ -177,7 +177,7 @@ struct zt_context {
  * vector length beyond its initial architectural limit of 2048 bits
  * (16 quadwords).
  *
- * See linux/Documentation/arm64/sve.rst for a description of the VL/VQ
+ * See linux/Documentation/arch/arm64/sve.rst for a description of the VL/VQ
  * terminology.
  */
 #define SVE_VQ_BYTES           __SVE_VQ_BYTES  /* bytes per quadword */
index 7c2bb4e..d95b3d6 100644 (file)
@@ -42,9 +42,9 @@ obj-$(CONFIG_COMPAT)                  += sigreturn32.o
 obj-$(CONFIG_COMPAT_ALIGNMENT_FIXUPS)  += compat_alignment.o
 obj-$(CONFIG_KUSER_HELPERS)            += kuser32.o
 obj-$(CONFIG_FUNCTION_TRACER)          += ftrace.o entry-ftrace.o
-obj-$(CONFIG_MODULES)                  += module.o
-obj-$(CONFIG_ARM64_MODULE_PLTS)                += module-plts.o
+obj-$(CONFIG_MODULES)                  += module.o module-plts.o
 obj-$(CONFIG_PERF_EVENTS)              += perf_regs.o perf_callchain.o
+obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
 obj-$(CONFIG_HAVE_HW_BREAKPOINT)       += hw_breakpoint.o
 obj-$(CONFIG_CPU_PM)                   += sleep.o suspend.o
 obj-$(CONFIG_CPU_IDLE)                 += cpuidle.o
index d32d4ed..8ff6610 100644 (file)
@@ -24,8 +24,8 @@
 #define ALT_ORIG_PTR(a)                __ALT_PTR(a, orig_offset)
 #define ALT_REPL_PTR(a)                __ALT_PTR(a, alt_offset)
 
-#define ALT_CAP(a)             ((a)->cpufeature & ~ARM64_CB_BIT)
-#define ALT_HAS_CB(a)          ((a)->cpufeature & ARM64_CB_BIT)
+#define ALT_CAP(a)             ((a)->cpucap & ~ARM64_CB_BIT)
+#define ALT_HAS_CB(a)          ((a)->cpucap & ARM64_CB_BIT)
 
 /* Volatile, as we may be patching the guts of READ_ONCE() */
 static volatile int all_alternatives_applied;
@@ -37,12 +37,12 @@ struct alt_region {
        struct alt_instr *end;
 };
 
-bool alternative_is_applied(u16 cpufeature)
+bool alternative_is_applied(u16 cpucap)
 {
-       if (WARN_ON(cpufeature >= ARM64_NCAPS))
+       if (WARN_ON(cpucap >= ARM64_NCAPS))
                return false;
 
-       return test_bit(cpufeature, applied_alternatives);
+       return test_bit(cpucap, applied_alternatives);
 }
 
 /*
@@ -121,11 +121,11 @@ static noinstr void patch_alternative(struct alt_instr *alt,
  * accidentally call into the cache.S code, which is patched by us at
  * runtime.
  */
-static void clean_dcache_range_nopatch(u64 start, u64 end)
+static noinstr void clean_dcache_range_nopatch(u64 start, u64 end)
 {
        u64 cur, d_size, ctr_el0;
 
-       ctr_el0 = read_sanitised_ftr_reg(SYS_CTR_EL0);
+       ctr_el0 = arm64_ftr_reg_ctrel0.sys_val;
        d_size = 4 << cpuid_feature_extract_unsigned_field(ctr_el0,
                                                           CTR_EL0_DminLine_SHIFT);
        cur = start & ~(d_size - 1);
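
The function now reads the sanitised CTR_EL0 value straight from arm64_ftr_reg_ctrel0, and the line-size arithmetic is unchanged: DminLine is log2 of the D-cache line size in 4-byte words, so the byte size is 4 << DminLine. A stand-alone sketch of that computation and the aligned walk (the DminLine field position and the sample register value are assumptions for illustration):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t ctr_el0 = 0x8444c004;                  /* fabricated CTR_EL0 image */
        unsigned int dminline = (ctr_el0 >> 16) & 0xf;  /* assumed field position */
        unsigned int line = 4u << dminline;             /* words -> bytes */
        uint64_t start = 0x1234, end = 0x2000, cur;

        printf("D-cache line: %u bytes\n", line);

        /* Same shape as the kernel loop: align down, then step a line at a time. */
        for (cur = start & ~(uint64_t)(line - 1); cur < end; cur += line)
                ;       /* the kernel issues a "dc" maintenance op per line here */

        printf("cleaned [%#llx, %#llx)\n",
               (unsigned long long)(start & ~(uint64_t)(line - 1)),
               (unsigned long long)end);
        return 0;
}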
@@ -141,7 +141,7 @@ static void clean_dcache_range_nopatch(u64 start, u64 end)
 
 static void __apply_alternatives(const struct alt_region *region,
                                 bool is_module,
-                                unsigned long *feature_mask)
+                                unsigned long *cpucap_mask)
 {
        struct alt_instr *alt;
        __le32 *origptr, *updptr;
@@ -151,7 +151,7 @@ static void __apply_alternatives(const struct alt_region *region,
                int nr_inst;
                int cap = ALT_CAP(alt);
 
-               if (!test_bit(cap, feature_mask))
+               if (!test_bit(cap, cpucap_mask))
                        continue;
 
                if (!cpus_have_cap(cap))
@@ -188,11 +188,10 @@ static void __apply_alternatives(const struct alt_region *region,
                icache_inval_all_pou();
                isb();
 
-               /* Ignore ARM64_CB bit from feature mask */
                bitmap_or(applied_alternatives, applied_alternatives,
-                         feature_mask, ARM64_NCAPS);
+                         cpucap_mask, ARM64_NCAPS);
                bitmap_and(applied_alternatives, applied_alternatives,
-                          cpu_hwcaps, ARM64_NCAPS);
+                          system_cpucaps, ARM64_NCAPS);
        }
 }
 
@@ -239,7 +238,7 @@ static int __init __apply_alternatives_multi_stop(void *unused)
        } else {
                DECLARE_BITMAP(remaining_capabilities, ARM64_NCAPS);
 
-               bitmap_complement(remaining_capabilities, boot_capabilities,
+               bitmap_complement(remaining_capabilities, boot_cpucaps,
                                  ARM64_NCAPS);
 
                BUG_ON(all_alternatives_applied);
@@ -274,7 +273,7 @@ void __init apply_boot_alternatives(void)
        pr_info("applying boot alternatives\n");
 
        __apply_alternatives(&kernel_alternatives, false,
-                            &boot_capabilities[0]);
+                            &boot_cpucaps[0]);
 }
 
 #ifdef CONFIG_MODULES
index 7d7128c..6ea7f23 100644 (file)
@@ -105,11 +105,11 @@ unsigned int compat_elf_hwcap __read_mostly = COMPAT_ELF_HWCAP_DEFAULT;
 unsigned int compat_elf_hwcap2 __read_mostly;
 #endif
 
-DECLARE_BITMAP(cpu_hwcaps, ARM64_NCAPS);
-EXPORT_SYMBOL(cpu_hwcaps);
-static struct arm64_cpu_capabilities const __ro_after_init *cpu_hwcaps_ptrs[ARM64_NCAPS];
+DECLARE_BITMAP(system_cpucaps, ARM64_NCAPS);
+EXPORT_SYMBOL(system_cpucaps);
+static struct arm64_cpu_capabilities const __ro_after_init *cpucap_ptrs[ARM64_NCAPS];
 
-DECLARE_BITMAP(boot_capabilities, ARM64_NCAPS);
+DECLARE_BITMAP(boot_cpucaps, ARM64_NCAPS);
 
 bool arm64_use_ng_mappings = false;
 EXPORT_SYMBOL(arm64_use_ng_mappings);
@@ -137,7 +137,7 @@ static cpumask_var_t cpu_32bit_el0_mask __cpumask_var_read_mostly;
 void dump_cpu_features(void)
 {
        /* file-wide pr_fmt adds "CPU features: " prefix */
-       pr_emerg("0x%*pb\n", ARM64_NCAPS, &cpu_hwcaps);
+       pr_emerg("0x%*pb\n", ARM64_NCAPS, &system_cpucaps);
 }
 
 #define ARM64_CPUID_FIELDS(reg, field, min_value)                      \
@@ -223,6 +223,7 @@ static const struct arm64_ftr_bits ftr_id_aa64isar2[] = {
        ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR2_EL1_CSSC_SHIFT, 4, 0),
        ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR2_EL1_RPRFM_SHIFT, 4, 0),
        ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_HIGHER_SAFE, ID_AA64ISAR2_EL1_BC_SHIFT, 4, 0),
+       ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64ISAR2_EL1_MOPS_SHIFT, 4, 0),
        ARM64_FTR_BITS(FTR_VISIBLE_IF_IS_ENABLED(CONFIG_ARM64_PTR_AUTH),
                       FTR_STRICT, FTR_EXACT, ID_AA64ISAR2_EL1_APA3_SHIFT, 4, 0),
        ARM64_FTR_BITS(FTR_VISIBLE_IF_IS_ENABLED(CONFIG_ARM64_PTR_AUTH),
@@ -364,6 +365,7 @@ static const struct arm64_ftr_bits ftr_id_aa64mmfr0[] = {
 static const struct arm64_ftr_bits ftr_id_aa64mmfr1[] = {
        ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64MMFR1_EL1_TIDCP1_SHIFT, 4, 0),
        ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64MMFR1_EL1_AFP_SHIFT, 4, 0),
+       ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64MMFR1_EL1_HCX_SHIFT, 4, 0),
        ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64MMFR1_EL1_ETS_SHIFT, 4, 0),
        ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64MMFR1_EL1_TWED_SHIFT, 4, 0),
        ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64MMFR1_EL1_XNX_SHIFT, 4, 0),
@@ -396,6 +398,12 @@ static const struct arm64_ftr_bits ftr_id_aa64mmfr2[] = {
        ARM64_FTR_END,
 };
 
+static const struct arm64_ftr_bits ftr_id_aa64mmfr3[] = {
+       ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64MMFR3_EL1_S1PIE_SHIFT, 4, 0),
+       ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64MMFR3_EL1_TCRX_SHIFT, 4, 0),
+       ARM64_FTR_END,
+};
+
 static const struct arm64_ftr_bits ftr_ctr[] = {
        ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_EXACT, 31, 1, 1), /* RES1 */
        ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, CTR_EL0_DIC_SHIFT, 1, 1),
@@ -722,6 +730,7 @@ static const struct __ftr_reg_entry {
        ARM64_FTR_REG_OVERRIDE(SYS_ID_AA64MMFR1_EL1, ftr_id_aa64mmfr1,
                               &id_aa64mmfr1_override),
        ARM64_FTR_REG(SYS_ID_AA64MMFR2_EL1, ftr_id_aa64mmfr2),
+       ARM64_FTR_REG(SYS_ID_AA64MMFR3_EL1, ftr_id_aa64mmfr3),
 
        /* Op1 = 0, CRn = 1, CRm = 2 */
        ARM64_FTR_REG(SYS_ZCR_EL1, ftr_zcr),
@@ -954,24 +963,24 @@ extern const struct arm64_cpu_capabilities arm64_errata[];
 static const struct arm64_cpu_capabilities arm64_features[];
 
 static void __init
-init_cpu_hwcaps_indirect_list_from_array(const struct arm64_cpu_capabilities *caps)
+init_cpucap_indirect_list_from_array(const struct arm64_cpu_capabilities *caps)
 {
        for (; caps->matches; caps++) {
                if (WARN(caps->capability >= ARM64_NCAPS,
                        "Invalid capability %d\n", caps->capability))
                        continue;
-               if (WARN(cpu_hwcaps_ptrs[caps->capability],
+               if (WARN(cpucap_ptrs[caps->capability],
                        "Duplicate entry for capability %d\n",
                        caps->capability))
                        continue;
-               cpu_hwcaps_ptrs[caps->capability] = caps;
+               cpucap_ptrs[caps->capability] = caps;
        }
 }
 
-static void __init init_cpu_hwcaps_indirect_list(void)
+static void __init init_cpucap_indirect_list(void)
 {
-       init_cpu_hwcaps_indirect_list_from_array(arm64_features);
-       init_cpu_hwcaps_indirect_list_from_array(arm64_errata);
+       init_cpucap_indirect_list_from_array(arm64_features);
+       init_cpucap_indirect_list_from_array(arm64_errata);
 }
 
 static void __init setup_boot_cpu_capabilities(void);
@@ -1017,6 +1026,7 @@ void __init init_cpu_features(struct cpuinfo_arm64 *info)
        init_cpu_ftr_reg(SYS_ID_AA64MMFR0_EL1, info->reg_id_aa64mmfr0);
        init_cpu_ftr_reg(SYS_ID_AA64MMFR1_EL1, info->reg_id_aa64mmfr1);
        init_cpu_ftr_reg(SYS_ID_AA64MMFR2_EL1, info->reg_id_aa64mmfr2);
+       init_cpu_ftr_reg(SYS_ID_AA64MMFR3_EL1, info->reg_id_aa64mmfr3);
        init_cpu_ftr_reg(SYS_ID_AA64PFR0_EL1, info->reg_id_aa64pfr0);
        init_cpu_ftr_reg(SYS_ID_AA64PFR1_EL1, info->reg_id_aa64pfr1);
        init_cpu_ftr_reg(SYS_ID_AA64ZFR0_EL1, info->reg_id_aa64zfr0);
@@ -1049,10 +1059,10 @@ void __init init_cpu_features(struct cpuinfo_arm64 *info)
                init_cpu_ftr_reg(SYS_GMID_EL1, info->reg_gmid);
 
        /*
-        * Initialize the indirect array of CPU hwcaps capabilities pointers
-        * before we handle the boot CPU below.
+        * Initialize the indirect array of CPU capabilities pointers before we
+        * handle the boot CPU below.
         */
-       init_cpu_hwcaps_indirect_list();
+       init_cpucap_indirect_list();
 
        /*
         * Detect and enable early CPU capabilities based on the boot CPU,
@@ -1262,6 +1272,8 @@ void update_cpu_features(int cpu,
                                      info->reg_id_aa64mmfr1, boot->reg_id_aa64mmfr1);
        taint |= check_update_ftr_reg(SYS_ID_AA64MMFR2_EL1, cpu,
                                      info->reg_id_aa64mmfr2, boot->reg_id_aa64mmfr2);
+       taint |= check_update_ftr_reg(SYS_ID_AA64MMFR3_EL1, cpu,
+                                     info->reg_id_aa64mmfr3, boot->reg_id_aa64mmfr3);
 
        taint |= check_update_ftr_reg(SYS_ID_AA64PFR0_EL1, cpu,
                                      info->reg_id_aa64pfr0, boot->reg_id_aa64pfr0);
@@ -1391,6 +1403,7 @@ u64 __read_sysreg_by_encoding(u32 sys_id)
        read_sysreg_case(SYS_ID_AA64MMFR0_EL1);
        read_sysreg_case(SYS_ID_AA64MMFR1_EL1);
        read_sysreg_case(SYS_ID_AA64MMFR2_EL1);
+       read_sysreg_case(SYS_ID_AA64MMFR3_EL1);
        read_sysreg_case(SYS_ID_AA64ISAR0_EL1);
        read_sysreg_case(SYS_ID_AA64ISAR1_EL1);
        read_sysreg_case(SYS_ID_AA64ISAR2_EL1);
@@ -2048,9 +2061,9 @@ static bool has_address_auth_cpucap(const struct arm64_cpu_capabilities *entry,
 static bool has_address_auth_metacap(const struct arm64_cpu_capabilities *entry,
                                     int scope)
 {
-       bool api = has_address_auth_cpucap(cpu_hwcaps_ptrs[ARM64_HAS_ADDRESS_AUTH_IMP_DEF], scope);
-       bool apa = has_address_auth_cpucap(cpu_hwcaps_ptrs[ARM64_HAS_ADDRESS_AUTH_ARCH_QARMA5], scope);
-       bool apa3 = has_address_auth_cpucap(cpu_hwcaps_ptrs[ARM64_HAS_ADDRESS_AUTH_ARCH_QARMA3], scope);
+       bool api = has_address_auth_cpucap(cpucap_ptrs[ARM64_HAS_ADDRESS_AUTH_IMP_DEF], scope);
+       bool apa = has_address_auth_cpucap(cpucap_ptrs[ARM64_HAS_ADDRESS_AUTH_ARCH_QARMA5], scope);
+       bool apa3 = has_address_auth_cpucap(cpucap_ptrs[ARM64_HAS_ADDRESS_AUTH_ARCH_QARMA3], scope);
 
        return apa || apa3 || api;
 }
@@ -2186,6 +2199,11 @@ static void cpu_enable_dit(const struct arm64_cpu_capabilities *__unused)
        set_pstate_dit(1);
 }
 
+static void cpu_enable_mops(const struct arm64_cpu_capabilities *__unused)
+{
+	sysreg_clear_set(sctlr_el1, 0, SCTLR_EL1_MSCEn_MASK);
+}
+
 /* Internal helper functions to match cpu capability type */
 static bool
 cpucap_late_cpu_optional(const struct arm64_cpu_capabilities *cap)
@@ -2235,11 +2253,7 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
                .capability = ARM64_HAS_ECV_CNTPOFF,
                .type = ARM64_CPUCAP_SYSTEM_FEATURE,
                .matches = has_cpuid_feature,
-               .sys_reg = SYS_ID_AA64MMFR0_EL1,
-               .field_pos = ID_AA64MMFR0_EL1_ECV_SHIFT,
-               .field_width = 4,
-               .sign = FTR_UNSIGNED,
-               .min_field_value = ID_AA64MMFR0_EL1_ECV_CNTPOFF,
+               ARM64_CPUID_FIELDS(ID_AA64MMFR0_EL1, ECV, CNTPOFF)
        },
 #ifdef CONFIG_ARM64_PAN
        {
@@ -2309,6 +2323,13 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
                .type = ARM64_CPUCAP_SYSTEM_FEATURE,
                .matches = is_kvm_protected_mode,
        },
+       {
+               .desc = "HCRX_EL2 register",
+               .capability = ARM64_HAS_HCX,
+               .type = ARM64_CPUCAP_STRICT_BOOT_CPU_FEATURE,
+               .matches = has_cpuid_feature,
+               ARM64_CPUID_FIELDS(ID_AA64MMFR1_EL1, HCX, IMP)
+       },
 #endif
        {
                .desc = "Kernel page table isolation (KPTI)",
@@ -2641,6 +2662,27 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
                .cpu_enable = cpu_enable_dit,
                ARM64_CPUID_FIELDS(ID_AA64PFR0_EL1, DIT, IMP)
        },
+       {
+               .desc = "Memory Copy and Memory Set instructions",
+               .capability = ARM64_HAS_MOPS,
+               .type = ARM64_CPUCAP_SYSTEM_FEATURE,
+               .matches = has_cpuid_feature,
+               .cpu_enable = cpu_enable_mops,
+               ARM64_CPUID_FIELDS(ID_AA64ISAR2_EL1, MOPS, IMP)
+       },
+       {
+               .capability = ARM64_HAS_TCR2,
+               .type = ARM64_CPUCAP_SYSTEM_FEATURE,
+               .matches = has_cpuid_feature,
+               ARM64_CPUID_FIELDS(ID_AA64MMFR3_EL1, TCRX, IMP)
+       },
+       {
+               .desc = "Stage-1 Permission Indirection Extension (S1PIE)",
+               .capability = ARM64_HAS_S1PIE,
+               .type = ARM64_CPUCAP_BOOT_CPU_FEATURE,
+               .matches = has_cpuid_feature,
+               ARM64_CPUID_FIELDS(ID_AA64MMFR3_EL1, S1PIE, IMP)
+       },
        {},
 };
 
@@ -2769,6 +2811,7 @@ static const struct arm64_cpu_capabilities arm64_elf_hwcaps[] = {
        HWCAP_CAP(ID_AA64ISAR2_EL1, RPRFM, IMP, CAP_HWCAP, KERNEL_HWCAP_RPRFM),
        HWCAP_CAP(ID_AA64ISAR2_EL1, RPRES, IMP, CAP_HWCAP, KERNEL_HWCAP_RPRES),
        HWCAP_CAP(ID_AA64ISAR2_EL1, WFxT, IMP, CAP_HWCAP, KERNEL_HWCAP_WFXT),
+       HWCAP_CAP(ID_AA64ISAR2_EL1, MOPS, IMP, CAP_HWCAP, KERNEL_HWCAP_MOPS),
 #ifdef CONFIG_ARM64_SME
        HWCAP_CAP(ID_AA64PFR1_EL1, SME, IMP, CAP_HWCAP, KERNEL_HWCAP_SME),
        HWCAP_CAP(ID_AA64SMFR0_EL1, FA64, IMP, CAP_HWCAP, KERNEL_HWCAP_SME_FA64),
@@ -2895,7 +2938,7 @@ static void update_cpu_capabilities(u16 scope_mask)
 
        scope_mask &= ARM64_CPUCAP_SCOPE_MASK;
        for (i = 0; i < ARM64_NCAPS; i++) {
-               caps = cpu_hwcaps_ptrs[i];
+               caps = cpucap_ptrs[i];
                if (!caps || !(caps->type & scope_mask) ||
                    cpus_have_cap(caps->capability) ||
                    !caps->matches(caps, cpucap_default_scope(caps)))
@@ -2903,10 +2946,11 @@ static void update_cpu_capabilities(u16 scope_mask)
 
                if (caps->desc)
                        pr_info("detected: %s\n", caps->desc);
-               cpus_set_cap(caps->capability);
+
+               __set_bit(caps->capability, system_cpucaps);
 
                if ((scope_mask & SCOPE_BOOT_CPU) && (caps->type & SCOPE_BOOT_CPU))
-                       set_bit(caps->capability, boot_capabilities);
+                       set_bit(caps->capability, boot_cpucaps);
        }
 }
 
@@ -2920,7 +2964,7 @@ static int cpu_enable_non_boot_scope_capabilities(void *__unused)
        u16 non_boot_scope = SCOPE_ALL & ~SCOPE_BOOT_CPU;
 
        for_each_available_cap(i) {
-               const struct arm64_cpu_capabilities *cap = cpu_hwcaps_ptrs[i];
+               const struct arm64_cpu_capabilities *cap = cpucap_ptrs[i];
 
                if (WARN_ON(!cap))
                        continue;
@@ -2950,7 +2994,7 @@ static void __init enable_cpu_capabilities(u16 scope_mask)
        for (i = 0; i < ARM64_NCAPS; i++) {
                unsigned int num;
 
-               caps = cpu_hwcaps_ptrs[i];
+               caps = cpucap_ptrs[i];
                if (!caps || !(caps->type & scope_mask))
                        continue;
                num = caps->capability;
@@ -2995,7 +3039,7 @@ static void verify_local_cpu_caps(u16 scope_mask)
        scope_mask &= ARM64_CPUCAP_SCOPE_MASK;
 
        for (i = 0; i < ARM64_NCAPS; i++) {
-               caps = cpu_hwcaps_ptrs[i];
+               caps = cpucap_ptrs[i];
                if (!caps || !(caps->type & scope_mask))
                        continue;
 
@@ -3194,7 +3238,7 @@ static void __init setup_boot_cpu_capabilities(void)
 bool this_cpu_has_cap(unsigned int n)
 {
        if (!WARN_ON(preemptible()) && n < ARM64_NCAPS) {
-               const struct arm64_cpu_capabilities *cap = cpu_hwcaps_ptrs[n];
+               const struct arm64_cpu_capabilities *cap = cpucap_ptrs[n];
 
                if (cap)
                        return cap->matches(cap, SCOPE_LOCAL_CPU);
@@ -3207,13 +3251,13 @@ EXPORT_SYMBOL_GPL(this_cpu_has_cap);
 /*
  * This helper function is used in a narrow window when,
  * - The system wide safe registers are set with all the SMP CPUs and,
- * - The SYSTEM_FEATURE cpu_hwcaps may not have been set.
+ * - The SYSTEM_FEATURE system_cpucaps may not have been set.
  * In all other cases cpus_have_{const_}cap() should be used.
  */
 static bool __maybe_unused __system_matches_cap(unsigned int n)
 {
        if (n < ARM64_NCAPS) {
-               const struct arm64_cpu_capabilities *cap = cpu_hwcaps_ptrs[n];
+               const struct arm64_cpu_capabilities *cap = cpucap_ptrs[n];
 
                if (cap)
                        return cap->matches(cap, SCOPE_SYSTEM);
index 42e19ff..d1f6859 100644 (file)
@@ -13,7 +13,7 @@
 #include <linux/of_device.h>
 #include <linux/psci.h>
 
-#ifdef CONFIG_ACPI
+#ifdef CONFIG_ACPI_PROCESSOR_IDLE
 
 #include <acpi/processor.h>
 
index eb4378c..58622dc 100644 (file)
@@ -125,6 +125,7 @@ static const char *const hwcap_str[] = {
        [KERNEL_HWCAP_SME_BI32I32]      = "smebi32i32",
        [KERNEL_HWCAP_SME_B16B16]       = "smeb16b16",
        [KERNEL_HWCAP_SME_F16F16]       = "smef16f16",
+       [KERNEL_HWCAP_MOPS]             = "mops",
 };
 
 #ifdef CONFIG_COMPAT
@@ -446,6 +447,7 @@ static void __cpuinfo_store_cpu(struct cpuinfo_arm64 *info)
        info->reg_id_aa64mmfr0 = read_cpuid(ID_AA64MMFR0_EL1);
        info->reg_id_aa64mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);
        info->reg_id_aa64mmfr2 = read_cpuid(ID_AA64MMFR2_EL1);
+       info->reg_id_aa64mmfr3 = read_cpuid(ID_AA64MMFR3_EL1);
        info->reg_id_aa64pfr0 = read_cpuid(ID_AA64PFR0_EL1);
        info->reg_id_aa64pfr1 = read_cpuid(ID_AA64PFR1_EL1);
        info->reg_id_aa64zfr0 = read_cpuid(ID_AA64ZFR0_EL1);
index 3af3c01..6b2e0c3 100644 (file)
@@ -126,7 +126,7 @@ static __always_inline void __exit_to_user_mode(void)
        lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
-static __always_inline void prepare_exit_to_user_mode(struct pt_regs *regs)
+static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
 {
        unsigned long flags;
 
@@ -135,11 +135,13 @@ static __always_inline void prepare_exit_to_user_mode(struct pt_regs *regs)
        flags = read_thread_flags();
        if (unlikely(flags & _TIF_WORK_MASK))
                do_notify_resume(regs, flags);
+
+       lockdep_sys_exit();
 }
 
 static __always_inline void exit_to_user_mode(struct pt_regs *regs)
 {
-       prepare_exit_to_user_mode(regs);
+       exit_to_user_mode_prepare(regs);
        mte_check_tfsr_exit();
        __exit_to_user_mode();
 }
@@ -611,6 +613,14 @@ static void noinstr el0_bti(struct pt_regs *regs)
        exit_to_user_mode(regs);
 }
 
+static void noinstr el0_mops(struct pt_regs *regs, unsigned long esr)
+{
+       enter_from_user_mode(regs);
+       local_daif_restore(DAIF_PROCCTX);
+       do_el0_mops(regs, esr);
+       exit_to_user_mode(regs);
+}
+
 static void noinstr el0_inv(struct pt_regs *regs, unsigned long esr)
 {
        enter_from_user_mode(regs);
@@ -688,6 +698,9 @@ asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs)
        case ESR_ELx_EC_BTI:
                el0_bti(regs);
                break;
+       case ESR_ELx_EC_MOPS:
+               el0_mops(regs, esr);
+               break;
        case ESR_ELx_EC_BREAKPT_LOW:
        case ESR_ELx_EC_SOFTSTP_LOW:
        case ESR_ELx_EC_WATCHPT_LOW:
index ab2a6e3..a40e5e5 100644 (file)
 .org .Lventry_start\@ + 128    // Did we overflow the ventry slot?
        .endm
 
-       .macro tramp_alias, dst, sym, tmp
-       mov_q   \dst, TRAMP_VALIAS
-       adr_l   \tmp, \sym
-       add     \dst, \dst, \tmp
-       adr_l   \tmp, .entry.tramp.text
-       sub     \dst, \dst, \tmp
+       .macro  tramp_alias, dst, sym
+       .set    .Lalias\@, TRAMP_VALIAS + \sym - .entry.tramp.text
+       movz    \dst, :abs_g2_s:.Lalias\@
+       movk    \dst, :abs_g1_nc:.Lalias\@
+       movk    \dst, :abs_g0_nc:.Lalias\@
        .endm
 
        /*
@@ -435,13 +434,14 @@ alternative_if_not ARM64_UNMAP_KERNEL_AT_EL0
        eret
 alternative_else_nop_endif
 #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
-       bne     4f
        msr     far_el1, x29
-       tramp_alias     x30, tramp_exit_native, x29
-       br      x30
-4:
-       tramp_alias     x30, tramp_exit_compat, x29
-       br      x30
+
+       ldr_this_cpu    x30, this_cpu_vector, x29
+       tramp_alias     x29, tramp_exit
+       msr             vbar_el1, x30           // install vector table
+       ldr             lr, [sp, #S_LR]         // restore x30
+       add             sp, sp, #PT_REGS_SIZE   // restore sp
+       br              x29
 #endif
        .else
        ldr     lr, [sp, #S_LR]
@@ -732,22 +732,6 @@ alternative_else_nop_endif
 .org 1b + 128  // Did we overflow the ventry slot?
        .endm
 
-       .macro tramp_exit, regsize = 64
-       tramp_data_read_var     x30, this_cpu_vector
-       get_this_cpu_offset x29
-       ldr     x30, [x30, x29]
-
-       msr     vbar_el1, x30
-       ldr     lr, [sp, #S_LR]
-       tramp_unmap_kernel      x29
-       .if     \regsize == 64
-       mrs     x29, far_el1
-       .endif
-       add     sp, sp, #PT_REGS_SIZE           // restore sp
-       eret
-       sb
-       .endm
-
        .macro  generate_tramp_vector,  kpti, bhb
 .Lvector_start\@:
        .space  0x400
@@ -768,7 +752,7 @@ alternative_else_nop_endif
  */
        .pushsection ".entry.tramp.text", "ax"
        .align  11
-SYM_CODE_START_NOALIGN(tramp_vectors)
+SYM_CODE_START_LOCAL_NOALIGN(tramp_vectors)
 #ifdef CONFIG_MITIGATE_SPECTRE_BRANCH_HISTORY
        generate_tramp_vector   kpti=1, bhb=BHB_MITIGATION_LOOP
        generate_tramp_vector   kpti=1, bhb=BHB_MITIGATION_FW
@@ -777,13 +761,12 @@ SYM_CODE_START_NOALIGN(tramp_vectors)
        generate_tramp_vector   kpti=1, bhb=BHB_MITIGATION_NONE
 SYM_CODE_END(tramp_vectors)
 
-SYM_CODE_START(tramp_exit_native)
-       tramp_exit
-SYM_CODE_END(tramp_exit_native)
-
-SYM_CODE_START(tramp_exit_compat)
-       tramp_exit      32
-SYM_CODE_END(tramp_exit_compat)
+SYM_CODE_START_LOCAL(tramp_exit)
+       tramp_unmap_kernel      x29
+       mrs             x29, far_el1            // restore x29
+       eret
+       sb
+SYM_CODE_END(tramp_exit)
        .popsection                             // .entry.tramp.text
 #endif /* CONFIG_UNMAP_KERNEL_AT_EL0 */
 
@@ -1077,7 +1060,7 @@ alternative_if_not ARM64_UNMAP_KERNEL_AT_EL0
 alternative_else_nop_endif
 
 #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
-       tramp_alias     dst=x5, sym=__sdei_asm_exit_trampoline, tmp=x3
+       tramp_alias     dst=x5, sym=__sdei_asm_exit_trampoline
        br      x5
 #endif
 SYM_CODE_END(__sdei_asm_handler)
index 2fbafa5..7a1aeb9 100644 (file)
@@ -1649,6 +1649,7 @@ void fpsimd_flush_thread(void)
 
                fpsimd_flush_thread_vl(ARM64_VEC_SME);
                current->thread.svcr = 0;
+               sme_smstop();
        }
 
        current->thread.fp_type = FP_STATE_FPSIMD;
index 432626c..a650f5e 100644 (file)
@@ -197,7 +197,7 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 
 static struct plt_entry *get_ftrace_plt(struct module *mod)
 {
-#ifdef CONFIG_ARM64_MODULE_PLTS
+#ifdef CONFIG_MODULES
        struct plt_entry *plt = mod->arch.ftrace_trampolines;
 
        return &plt[FTRACE_PLT_IDX];
@@ -249,7 +249,7 @@ static bool ftrace_find_callable_addr(struct dyn_ftrace *rec,
         * must use a PLT to reach it. We can only place PLTs for modules, and
         * only when module PLT support is built-in.
         */
-       if (!IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
+       if (!IS_ENABLED(CONFIG_MODULES))
                return false;
 
        /*
@@ -431,10 +431,8 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
         *
         * Note: 'mod' is only set at module load time.
         */
-       if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_ARGS) &&
-           IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) && mod) {
+       if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_ARGS) && mod)
                return aarch64_insn_patch_text_nosync((void *)pc, new);
-       }
 
        if (!ftrace_find_callable_addr(rec, mod, &addr))
                return -EINVAL;
index e92caeb..0f5a30f 100644 (file)
@@ -382,7 +382,7 @@ SYM_FUNC_START_LOCAL(create_idmap)
        adrp    x0, init_idmap_pg_dir
        adrp    x3, _text
        adrp    x6, _end + MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE
-       mov     x7, SWAPPER_RX_MMUFLAGS
+       mov_q   x7, SWAPPER_RX_MMUFLAGS
 
        map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14, EXTRA_SHIFT
 
@@ -391,7 +391,7 @@ SYM_FUNC_START_LOCAL(create_idmap)
        adrp    x2, init_pg_dir
        adrp    x3, init_pg_end
        bic     x4, x2, #SWAPPER_BLOCK_SIZE - 1
-       mov     x5, SWAPPER_RW_MMUFLAGS
+       mov_q   x5, SWAPPER_RW_MMUFLAGS
        mov     x6, #SWAPPER_BLOCK_SHIFT
        bl      remap_region
 
@@ -402,7 +402,7 @@ SYM_FUNC_START_LOCAL(create_idmap)
        bfi     x22, x21, #0, #SWAPPER_BLOCK_SHIFT              // remapped FDT address
        add     x3, x2, #MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE
        bic     x4, x21, #SWAPPER_BLOCK_SIZE - 1
-       mov     x5, SWAPPER_RW_MMUFLAGS
+       mov_q   x5, SWAPPER_RW_MMUFLAGS
        mov     x6, #SWAPPER_BLOCK_SHIFT
        bl      remap_region
 
@@ -430,7 +430,7 @@ SYM_FUNC_START_LOCAL(create_kernel_mapping)
        adrp    x3, _text                       // runtime __pa(_text)
        sub     x6, x6, x3                      // _end - _text
        add     x6, x6, x5                      // runtime __va(_end)
-       mov     x7, SWAPPER_RW_MMUFLAGS
+       mov_q   x7, SWAPPER_RW_MMUFLAGS
 
        map_memory x0, x1, x5, x6, x7, x3, (VA_BITS - PGDIR_SHIFT), x10, x11, x12, x13, x14
 
index 788597a..02870be 100644 (file)
@@ -99,7 +99,6 @@ int pfn_is_nosave(unsigned long pfn)
 
 void notrace save_processor_state(void)
 {
-       WARN_ON(num_online_cpus() != 1);
 }
 
 void notrace restore_processor_state(void)
index b29a311..db2a186 100644 (file)
@@ -973,14 +973,6 @@ static int hw_breakpoint_reset(unsigned int cpu)
        return 0;
 }
 
-#ifdef CONFIG_CPU_PM
-extern void cpu_suspend_set_dbg_restorer(int (*hw_bp_restore)(unsigned int));
-#else
-static inline void cpu_suspend_set_dbg_restorer(int (*hw_bp_restore)(unsigned int))
-{
-}
-#endif
-
 /*
  * One-time initialisation.
  */
index 9439240..d63de19 100644 (file)
@@ -119,6 +119,24 @@ SYM_CODE_START_LOCAL(__finalise_el2)
        msr     ttbr1_el1, x0
        mrs_s   x0, SYS_MAIR_EL12
        msr     mair_el1, x0
+	mrs_s	x1, SYS_ID_AA64MMFR3_EL1
+	ubfx	x1, x1, #ID_AA64MMFR3_EL1_TCRX_SHIFT, #4
+	cbz	x1, .Lskip_tcr2
+	mrs_s	x0, SYS_TCR2_EL12
+	msr_s	SYS_TCR2_EL1, x0
+
+	// Transfer permission indirection state
+	mrs_s	x1, SYS_ID_AA64MMFR3_EL1
+	ubfx	x1, x1, #ID_AA64MMFR3_EL1_S1PIE_SHIFT, #4
+	cbz	x1, .Lskip_indirection
+	mrs_s	x0, SYS_PIRE0_EL12
+	msr_s	SYS_PIRE0_EL1, x0
+	mrs_s	x0, SYS_PIR_EL12
+	msr_s	SYS_PIR_EL1, x0
+
+.Lskip_indirection:
+.Lskip_tcr2:
+
        isb
 
        // Hack the exception return to stay at EL2
index 370ab84..8439248 100644 (file)
@@ -123,6 +123,7 @@ static const struct ftr_set_desc isar2 __initconst = {
        .fields         = {
                FIELD("gpa3", ID_AA64ISAR2_EL1_GPA3_SHIFT, NULL),
                FIELD("apa3", ID_AA64ISAR2_EL1_APA3_SHIFT, NULL),
+               FIELD("mops", ID_AA64ISAR2_EL1_MOPS_SHIFT, NULL),
                {}
        },
 };
@@ -174,6 +175,7 @@ static const struct {
          "id_aa64isar1.gpi=0 id_aa64isar1.gpa=0 "
          "id_aa64isar1.api=0 id_aa64isar1.apa=0 "
          "id_aa64isar2.gpa3=0 id_aa64isar2.apa3=0"        },
+       { "arm64.nomops",               "id_aa64isar2.mops=0" },
        { "arm64.nomte",                "id_aa64pfr1.mte=0" },
        { "nokaslr",                    "kaslr.disabled=1" },
 };
index e7477f2..17f96a1 100644 (file)
@@ -4,90 +4,35 @@
  */
 
 #include <linux/cache.h>
-#include <linux/crc32.h>
 #include <linux/init.h>
-#include <linux/libfdt.h>
-#include <linux/mm_types.h>
-#include <linux/sched.h>
-#include <linux/types.h>
-#include <linux/pgtable.h>
-#include <linux/random.h>
+#include <linux/printk.h>
 
-#include <asm/fixmap.h>
-#include <asm/kernel-pgtable.h>
+#include <asm/cpufeature.h>
 #include <asm/memory.h>
-#include <asm/mmu.h>
-#include <asm/sections.h>
-#include <asm/setup.h>
 
-u64 __ro_after_init module_alloc_base;
 u16 __initdata memstart_offset_seed;
 
 struct arm64_ftr_override kaslr_feature_override __initdata;
 
-static int __init kaslr_init(void)
-{
-       u64 module_range;
-       u32 seed;
-
-       /*
-        * Set a reasonable default for module_alloc_base in case
-        * we end up running with module randomization disabled.
-        */
-       module_alloc_base = (u64)_etext - MODULES_VSIZE;
+bool __ro_after_init __kaslr_is_enabled = false;
 
+void __init kaslr_init(void)
+{
        if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
                pr_info("KASLR disabled on command line\n");
-               return 0;
-       }
-
-       if (!kaslr_enabled()) {
-               pr_warn("KASLR disabled due to lack of seed\n");
-               return 0;
+               return;
        }
 
-       pr_info("KASLR enabled\n");
-
        /*
-        * KASAN without KASAN_VMALLOC does not expect the module region to
-        * intersect the vmalloc region, since shadow memory is allocated for
-        * each module at load time, whereas the vmalloc region will already be
-        * shadowed by KASAN zero pages.
+        * The KASLR offset modulo MIN_KIMG_ALIGN is taken from the physical
+        * placement of the image rather than from the seed, so a displacement
+        * of less than MIN_KIMG_ALIGN means that no seed was provided.
         */
-       BUILD_BUG_ON((IS_ENABLED(CONFIG_KASAN_GENERIC) ||
-                     IS_ENABLED(CONFIG_KASAN_SW_TAGS)) &&
-                    !IS_ENABLED(CONFIG_KASAN_VMALLOC));
-
-       seed = get_random_u32();
-
-       if (IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) {
-               /*
-                * Randomize the module region over a 2 GB window covering the
-                * kernel. This reduces the risk of modules leaking information
-                * about the address of the kernel itself, but results in
-                * branches between modules and the core kernel that are
-                * resolved via PLTs. (Branches between modules will be
-                * resolved normally.)
-                */
-               module_range = SZ_2G - (u64)(_end - _stext);
-               module_alloc_base = max((u64)_end - SZ_2G, (u64)MODULES_VADDR);
-       } else {
-               /*
-                * Randomize the module region by setting module_alloc_base to
-                * a PAGE_SIZE multiple in the range [_etext - MODULES_VSIZE,
-                * _stext) . This guarantees that the resulting region still
-                * covers [_stext, _etext], and that all relative branches can
-                * be resolved without veneers unless this region is exhausted
-                * and we fall back to a larger 2GB window in module_alloc()
-                * when ARM64_MODULE_PLTS is enabled.
-                */
-               module_range = MODULES_VSIZE - (u64)(_etext - _stext);
+       if (kaslr_offset() < MIN_KIMG_ALIGN) {
+               pr_warn("KASLR disabled due to lack of seed\n");
+               return;
        }
 
-       /* use the lower 21 bits to randomize the base of the module region */
-       module_alloc_base += (module_range * (seed & ((1 << 21) - 1))) >> 21;
-       module_alloc_base &= PAGE_MASK;
-
-       return 0;
+       pr_info("KASLR enabled\n");
+       __kaslr_is_enabled = true;
 }
-subsys_initcall(kaslr_init)
index 5ed6a58..636be67 100644 (file)
@@ -48,7 +48,7 @@ static void *image_load(struct kimage *image,
 
        /*
         * We require a kernel with an unambiguous Image header. Per
-        * Documentation/arm64/booting.rst, this is the case when image_size
+        * Documentation/arch/arm64/booting.rst, this is the case when image_size
         * is non-zero (practically speaking, since v3.17).
         */
        h = (struct arm64_image_header *)kernel;
index 692e9d2..af046ce 100644 (file)
@@ -10,7 +10,7 @@
  * aarch32_setup_additional_pages() and are provided for compatibility
  * reasons with 32 bit (aarch32) applications that need them.
  *
- * See Documentation/arm/kernel_user_helpers.rst for formal definitions.
+ * See Documentation/arch/arm/kernel_user_helpers.rst for formal definitions.
  */
 
 #include <asm/unistd.h>
index 543493b..ad02058 100644 (file)
@@ -7,6 +7,7 @@
 #include <linux/ftrace.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/moduleloader.h>
 #include <linux/sort.h>
 
 static struct plt_entry __get_adrp_add_pair(u64 dst, u64 pc,
index 5af4975..dd85129 100644 (file)
@@ -7,6 +7,8 @@
  * Author: Will Deacon <will.deacon@arm.com>
  */
 
+#define pr_fmt(fmt) "Modules: " fmt
+
 #include <linux/bitops.h>
 #include <linux/elf.h>
 #include <linux/ftrace.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/moduleloader.h>
+#include <linux/random.h>
 #include <linux/scs.h>
 #include <linux/vmalloc.h>
+
 #include <asm/alternative.h>
 #include <asm/insn.h>
 #include <asm/scs.h>
 #include <asm/sections.h>
 
+static u64 module_direct_base __ro_after_init = 0;
+static u64 module_plt_base __ro_after_init = 0;
+
+/*
+ * Choose a random page-aligned base address for a window of 'size' bytes which
+ * entirely contains the interval [start, end - 1].
+ */
+static u64 __init random_bounding_box(u64 size, u64 start, u64 end)
+{
+       u64 max_pgoff, pgoff;
+
+       if ((end - start) >= size)
+               return 0;
+
+       max_pgoff = (size - (end - start)) / PAGE_SIZE;
+       pgoff = get_random_u32_inclusive(0, max_pgoff);
+
+       return start - pgoff * PAGE_SIZE;
+}
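
For illustration only (not part of this patch), here is a minimal user-space sketch of the same bounding-box arithmetic, with a fixed page offset standing in for the get_random_u32_inclusive() draw and made-up addresses:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE	4096ULL

/* pgoff stands in for the random draw made by the kernel helper above */
static uint64_t bounding_box_base(uint64_t size, uint64_t start, uint64_t end,
				  uint64_t pgoff)
{
	uint64_t max_pgoff;

	if ((end - start) >= size)
		return 0;

	max_pgoff = (size - (end - start)) / PAGE_SIZE;
	if (pgoff > max_pgoff)
		pgoff = max_pgoff;

	return start - pgoff * PAGE_SIZE;
}

int main(void)
{
	uint64_t start = 0xffff800008000000ULL;		/* made-up image start */
	uint64_t end   = start + (40ULL << 20);		/* 40 MiB image */
	uint64_t base  = bounding_box_base(128ULL << 20, start, end, 1000);

	/* the chosen window always satisfies base <= start && base + size >= end */
	printf("window base: %#llx, covers image: %d\n",
	       (unsigned long long)base,
	       base <= start && (base + (128ULL << 20)) >= end);
	return 0;
}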
+
+/*
+ * Modules may directly reference data and text anywhere within the kernel
+ * image and other modules. References using PREL32 relocations have a +/-2G
+ * range, and so we need to ensure that the entire kernel image and all modules
+ * fall within a 2G window such that these are always within range.
+ *
+ * Modules may directly branch to functions and code within the kernel text,
+ * and to functions and code within other modules. These branches will use
+ * CALL26/JUMP26 relocations with a +/-128M range. Without PLTs, we must ensure
+ * that the entire kernel text and all module text falls within a 128M window
+ * such that these are always within range. With PLTs, we can expand this to a
+ * 2G window.
+ *
+ * We chose the 128M region to surround the entire kernel image (rather than
+ * just the text) as using the same bounds for the 128M and 2G regions ensures
+ * by construction that we never select a 128M region that is not a subset of
+ * the 2G region. For very large and unusual kernel configurations this means
+ * we may fall back to PLTs where they could have been avoided, but this keeps
+ * the logic significantly simpler.
+ */
+static int __init module_init_limits(void)
+{
+       u64 kernel_end = (u64)_end;
+       u64 kernel_start = (u64)_text;
+       u64 kernel_size = kernel_end - kernel_start;
+
+       /*
+        * The default modules region is placed immediately below the kernel
+        * image, and is large enough to use the full 2G relocation range.
+        */
+       BUILD_BUG_ON(KIMAGE_VADDR != MODULES_END);
+       BUILD_BUG_ON(MODULES_VSIZE < SZ_2G);
+
+       if (!kaslr_enabled()) {
+               if (kernel_size < SZ_128M)
+                       module_direct_base = kernel_end - SZ_128M;
+               if (kernel_size < SZ_2G)
+                       module_plt_base = kernel_end - SZ_2G;
+       } else {
+               u64 min = kernel_start;
+               u64 max = kernel_end;
+
+               if (IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) {
+                       pr_info("2G module region forced by RANDOMIZE_MODULE_REGION_FULL\n");
+               } else {
+                       module_direct_base = random_bounding_box(SZ_128M, min, max);
+                       if (module_direct_base) {
+                               min = module_direct_base;
+                               max = module_direct_base + SZ_128M;
+                       }
+               }
+
+               module_plt_base = random_bounding_box(SZ_2G, min, max);
+       }
+
+       pr_info("%llu pages in range for non-PLT usage",
+               module_direct_base ? (SZ_128M - kernel_size) / PAGE_SIZE : 0);
+       pr_info("%llu pages in range for PLT usage",
+               module_plt_base ? (SZ_2G - kernel_size) / PAGE_SIZE : 0);
+
+       return 0;
+}
+subsys_initcall(module_init_limits);
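
As a hypothetical worked example of the pr_info() figures above (assuming a 48 MiB kernel image and 4 KiB pages; purely illustrative, not taken from this patch):

#include <stdint.h>
#include <stdio.h>

#define SZ_128M		(128ULL << 20)
#define SZ_2G		(2ULL << 30)
#define PAGE_SZ		4096ULL

int main(void)
{
	uint64_t kernel_size = 48ULL << 20;	/* assumed image size */

	/* mirrors the two pr_info() lines above when both bases were found */
	printf("%llu pages in range for non-PLT usage\n",
	       (unsigned long long)((SZ_128M - kernel_size) / PAGE_SZ));	/* 20480 */
	printf("%llu pages in range for PLT usage\n",
	       (unsigned long long)((SZ_2G - kernel_size) / PAGE_SZ));		/* 512000 */
	return 0;
}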
+
 void *module_alloc(unsigned long size)
 {
-       u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
-       gfp_t gfp_mask = GFP_KERNEL;
-       void *p;
-
-       /* Silence the initial allocation */
-       if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
-               gfp_mask |= __GFP_NOWARN;
-
-       if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
-           IS_ENABLED(CONFIG_KASAN_SW_TAGS))
-               /* don't exceed the static module region - see below */
-               module_alloc_end = MODULES_END;
-
-       p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
-                               module_alloc_end, gfp_mask, PAGE_KERNEL, VM_DEFER_KMEMLEAK,
-                               NUMA_NO_NODE, __builtin_return_address(0));
-
-       if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
-           (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
-            (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
-             !IS_ENABLED(CONFIG_KASAN_SW_TAGS))))
-               /*
-                * KASAN without KASAN_VMALLOC can only deal with module
-                * allocations being served from the reserved module region,
-                * since the remainder of the vmalloc region is already
-                * backed by zero shadow pages, and punching holes into it
-                * is non-trivial. Since the module region is not randomized
-                * when KASAN is enabled without KASAN_VMALLOC, it is even
-                * less likely that the module region gets exhausted, so we
-                * can simply omit this fallback in that case.
-                */
-               p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
-                               module_alloc_base + SZ_2G, GFP_KERNEL,
-                               PAGE_KERNEL, 0, NUMA_NO_NODE,
-                               __builtin_return_address(0));
+       void *p = NULL;
+
+       /*
+        * Where possible, prefer to allocate within direct branch range of the
+        * kernel such that no PLTs are necessary.
+        */
+       if (module_direct_base) {
+               p = __vmalloc_node_range(size, MODULE_ALIGN,
+                                        module_direct_base,
+                                        module_direct_base + SZ_128M,
+                                        GFP_KERNEL | __GFP_NOWARN,
+                                        PAGE_KERNEL, 0, NUMA_NO_NODE,
+                                        __builtin_return_address(0));
+       }
 
-       if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
+       if (!p && module_plt_base) {
+               p = __vmalloc_node_range(size, MODULE_ALIGN,
+                                        module_plt_base,
+                                        module_plt_base + SZ_2G,
+                                        GFP_KERNEL | __GFP_NOWARN,
+                                        PAGE_KERNEL, 0, NUMA_NO_NODE,
+                                        __builtin_return_address(0));
+       }
+
+       if (!p) {
+               pr_warn_ratelimited("%s: unable to allocate memory\n",
+                                   __func__);
+       }
+
+       if (p && (kasan_alloc_module_shadow(p, size, GFP_KERNEL) < 0)) {
                vfree(p);
                return NULL;
        }
@@ -448,9 +529,7 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
                case R_AARCH64_CALL26:
                        ovf = reloc_insn_imm(RELOC_OP_PREL, loc, val, 2, 26,
                                             AARCH64_INSN_IMM_26);
-
-                       if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
-                           ovf == -ERANGE) {
+                       if (ovf == -ERANGE) {
                                val = module_emit_plt_entry(me, sechdrs, loc, &rel[i], sym);
                                if (!val)
                                        return -ENOEXEC;
@@ -487,7 +566,7 @@ static int module_init_ftrace_plt(const Elf_Ehdr *hdr,
                                  const Elf_Shdr *sechdrs,
                                  struct module *mod)
 {
-#if defined(CONFIG_ARM64_MODULE_PLTS) && defined(CONFIG_DYNAMIC_FTRACE)
+#if defined(CONFIG_DYNAMIC_FTRACE)
        const Elf_Shdr *s;
        struct plt_entry *plts;
 
index 7e89968..4c5ef9b 100644 (file)
@@ -416,10 +416,9 @@ long get_mte_ctrl(struct task_struct *task)
 static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
                                struct iovec *kiov, unsigned int gup_flags)
 {
-       struct vm_area_struct *vma;
        void __user *buf = kiov->iov_base;
        size_t len = kiov->iov_len;
-       int ret;
+       int err = 0;
        int write = gup_flags & FOLL_WRITE;
 
        if (!access_ok(buf, len))
@@ -429,14 +428,16 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
                return -EIO;
 
        while (len) {
+               struct vm_area_struct *vma;
                unsigned long tags, offset;
                void *maddr;
-               struct page *page = NULL;
+               struct page *page = get_user_page_vma_remote(mm, addr,
+                                                            gup_flags, &vma);
 
-               ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page,
-                                           &vma, NULL);
-               if (ret <= 0)
+               if (IS_ERR_OR_NULL(page)) {
+                       err = page == NULL ? -EIO : PTR_ERR(page);
                        break;
+               }
 
                /*
                 * Only copy tags if the page has been mapped as PROT_MTE
@@ -446,7 +447,7 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
                 * was never mapped with PROT_MTE.
                 */
                if (!(vma->vm_flags & VM_MTE)) {
-                       ret = -EOPNOTSUPP;
+                       err = -EOPNOTSUPP;
                        put_page(page);
                        break;
                }
@@ -479,7 +480,7 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
        kiov->iov_len = buf - kiov->iov_base;
        if (!kiov->iov_len) {
                /* check for error accessing the tracee's address space */
-               if (ret <= 0)
+               if (err)
                        return -EIO;
                else
                        return -EFAULT;
index b8ec7b3..417a8a8 100644 (file)
@@ -296,6 +296,8 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
 
        *cmdline_p = boot_command_line;
 
+       kaslr_init();
+
        /*
         * If know now we are going to need KPTI then use non-global
         * mappings from the start, avoiding the cost of rewriting
index 2cfc810..e304f7e 100644 (file)
@@ -23,6 +23,7 @@
 #include <asm/daifflags.h>
 #include <asm/debug-monitors.h>
 #include <asm/elf.h>
+#include <asm/exception.h>
 #include <asm/cacheflush.h>
 #include <asm/ucontext.h>
 #include <asm/unistd.h>
@@ -398,7 +399,7 @@ static int restore_tpidr2_context(struct user_ctxs *user)
 
        __get_user_error(tpidr2_el0, &user->tpidr2->tpidr2, err);
        if (!err)
-               current->thread.tpidr2_el0 = tpidr2_el0;
+               write_sysreg_s(tpidr2_el0, SYS_TPIDR2_EL0);
 
        return err;
 }
index d00d4cb..edd6389 100644 (file)
@@ -332,17 +332,13 @@ static int op_cpu_kill(unsigned int cpu)
 }
 
 /*
- * called on the thread which is asking for a CPU to be shutdown -
- * waits until shutdown has completed, or it is timed out.
+ * Called on the thread which asked for a CPU to be shut down, after that
+ * shutdown has completed.
  */
-void __cpu_die(unsigned int cpu)
+void arch_cpuhp_cleanup_dead_cpu(unsigned int cpu)
 {
        int err;
 
-       if (!cpu_wait_death(cpu, 5)) {
-               pr_crit("CPU%u: cpu didn't die\n", cpu);
-               return;
-       }
        pr_debug("CPU%u: shutdown\n", cpu);
 
        /*
@@ -369,8 +365,8 @@ void __noreturn cpu_die(void)
 
        local_daif_mask();
 
-       /* Tell __cpu_die() that this CPU is now safe to dispose of */
-       (void)cpu_report_death();
+       /* Tell cpuhp_bp_sync_dead() that this CPU is now safe to dispose of */
+       cpuhp_ap_report_dead();
 
        /*
         * Actually shutdown the CPU. This must never fail. The specific hotplug
index da84cf8..5a668d7 100644 (file)
@@ -147,11 +147,9 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
         * exit regardless, as the old entry assembly did.
         */
        if (!has_syscall_work(flags) && !IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
-               local_daif_mask();
                flags = read_thread_flags();
                if (!has_syscall_work(flags) && !(flags & _TIF_SINGLESTEP))
                        return;
-               local_daif_restore(DAIF_PROCCTX);
        }
 
 trace_exit:
index 4bb1b8f..8b70759 100644 (file)
@@ -514,6 +514,63 @@ void do_el1_fpac(struct pt_regs *regs, unsigned long esr)
        die("Oops - FPAC", regs, esr);
 }
 
+void do_el0_mops(struct pt_regs *regs, unsigned long esr)
+{
+       bool wrong_option = esr & ESR_ELx_MOPS_ISS_WRONG_OPTION;
+       bool option_a = esr & ESR_ELx_MOPS_ISS_OPTION_A;
+       int dstreg = ESR_ELx_MOPS_ISS_DESTREG(esr);
+       int srcreg = ESR_ELx_MOPS_ISS_SRCREG(esr);
+       int sizereg = ESR_ELx_MOPS_ISS_SIZEREG(esr);
+       unsigned long dst, src, size;
+
+       dst = pt_regs_read_reg(regs, dstreg);
+       src = pt_regs_read_reg(regs, srcreg);
+       size = pt_regs_read_reg(regs, sizereg);
+
+       /*
+        * Put the registers back in the original format suitable for a
+        * prologue instruction, using the generic return routine from the
+        * Arm ARM (DDI 0487I.a) rules CNTMJ and MWFQH.
+        */
+       if (esr & ESR_ELx_MOPS_ISS_MEM_INST) {
+               /* SET* instruction */
+               if (option_a ^ wrong_option) {
+                       /* Format is from Option A; forward set */
+                       pt_regs_write_reg(regs, dstreg, dst + size);
+                       pt_regs_write_reg(regs, sizereg, -size);
+               }
+       } else {
+               /* CPY* instruction */
+               if (!(option_a ^ wrong_option)) {
+                       /* Format is from Option B */
+                       if (regs->pstate & PSR_N_BIT) {
+                               /* Backward copy */
+                               pt_regs_write_reg(regs, dstreg, dst - size);
+                               pt_regs_write_reg(regs, srcreg, src - size);
+                       }
+               } else {
+                       /* Format is from Option A */
+                       if (size & BIT(63)) {
+                               /* Forward copy */
+                               pt_regs_write_reg(regs, dstreg, dst + size);
+                               pt_regs_write_reg(regs, srcreg, src + size);
+                               pt_regs_write_reg(regs, sizereg, -size);
+                       }
+               }
+       }
+
+       if (esr & ESR_ELx_MOPS_ISS_FROM_EPILOGUE)
+               regs->pc -= 8;
+       else
+               regs->pc -= 4;
+
+       /*
+        * If single stepping then finish the step before executing the
+        * prologue instruction.
+        */
+       user_fastforward_single_step(current);
+}
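
A minimal sketch of the PC rewind above (illustrative only; it assumes the prologue, main and epilogue instructions of a CPY*/SET* sequence are consecutive 4-byte instructions, which is what the -4/-8 adjustments rely on):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* step back to the prologue of a consecutive prologue/main/epilogue triple */
static uint64_t rewind_to_prologue(uint64_t pc, bool from_epilogue)
{
	return pc - (from_epilogue ? 8 : 4);
}

int main(void)
{
	uint64_t prologue = 0x400000;	/* made-up address of the prologue insn */

	/* exception taken on the main instruction (prologue + 4) */
	printf("%#llx\n", (unsigned long long)rewind_to_prologue(prologue + 4, false));
	/* exception taken on the epilogue instruction (prologue + 8) */
	printf("%#llx\n", (unsigned long long)rewind_to_prologue(prologue + 8, true));
	return 0;
}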
+
 #define __user_cache_maint(insn, address, res)                 \
        if (address >= TASK_SIZE_MAX) {                         \
                res = -EFAULT;                                  \
@@ -824,6 +881,7 @@ static const char *esr_class_str[] = {
        [ESR_ELx_EC_DABT_LOW]           = "DABT (lower EL)",
        [ESR_ELx_EC_DABT_CUR]           = "DABT (current EL)",
        [ESR_ELx_EC_SP_ALIGN]           = "SP Alignment",
+       [ESR_ELx_EC_MOPS]               = "MOPS",
        [ESR_ELx_EC_FP_EXC32]           = "FP (AArch32)",
        [ESR_ELx_EC_FP_EXC64]           = "FP (AArch64)",
        [ESR_ELx_EC_SERROR]             = "SError",
@@ -947,7 +1005,7 @@ void do_serror(struct pt_regs *regs, unsigned long esr)
 }
 
 /* GENERIC_BUG traps */
-
+#ifdef CONFIG_GENERIC_BUG
 int is_valid_bugaddr(unsigned long addr)
 {
        /*
@@ -959,6 +1017,7 @@ int is_valid_bugaddr(unsigned long addr)
         */
        return 1;
 }
+#endif
 
 static int bug_handler(struct pt_regs *regs, unsigned long esr)
 {
@@ -1044,7 +1103,7 @@ static int kasan_handler(struct pt_regs *regs, unsigned long esr)
        bool recover = esr & KASAN_ESR_RECOVER;
        bool write = esr & KASAN_ESR_WRITE;
        size_t size = KASAN_ESR_SIZE(esr);
-       u64 addr = regs->regs[0];
+       void *addr = (void *)regs->regs[0];
        u64 pc = regs->pc;
 
        kasan_report(addr, size, write, pc);
diff --git a/arch/arm64/kernel/watchdog_hld.c b/arch/arm64/kernel/watchdog_hld.c
new file mode 100644 (file)
index 0000000..dcd2532
--- /dev/null
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/nmi.h>
+#include <linux/cpufreq.h>
+#include <linux/perf/arm_pmu.h>
+
+/*
+ * Safe maximum CPU frequency in case a particular platform doesn't implement
+ * a cpufreq driver. Although the architecture doesn't place any restriction
+ * on the maximum frequency, 5 GHz seems a safe maximum given that available
+ * Arm CPUs are clocked well below 5 GHz. On the other hand, we can't make it
+ * much higher, as that would lead to a large hard-lockup detection timeout on
+ * parts which run slower (e.g. 1 GHz on Developerbox) and don't have a
+ * cpufreq driver.
+ */
+#define SAFE_MAX_CPU_FREQ      5000000000UL // 5 GHz
+u64 hw_nmi_get_sample_period(int watchdog_thresh)
+{
+       unsigned int cpu = smp_processor_id();
+       unsigned long max_cpu_freq;
+
+       max_cpu_freq = cpufreq_get_hw_max_freq(cpu) * 1000UL;
+       if (!max_cpu_freq)
+               max_cpu_freq = SAFE_MAX_CPU_FREQ;
+
+       return (u64)max_cpu_freq * watchdog_thresh;
+}
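
To make the arithmetic concrete, a small stand-alone sketch (hypothetical numbers; the real threshold comes from the watchdog_thresh sysctl, typically 10 seconds):

#include <stdint.h>
#include <stdio.h>

#define SAFE_MAX_CPU_FREQ	5000000000ULL	/* 5 GHz fallback, as above */

/* sample period in cycles = max frequency (Hz) * threshold (seconds) */
static uint64_t sample_period(uint64_t max_freq_hz, int watchdog_thresh)
{
	return max_freq_hz * (uint64_t)watchdog_thresh;
}

int main(void)
{
	/* 5 GHz fallback with a 10 s threshold -> 50,000,000,000 cycles */
	printf("%llu\n", (unsigned long long)sample_period(SAFE_MAX_CPU_FREQ, 10));
	return 0;
}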
+
+bool __init arch_perf_nmi_is_available(void)
+{
+       /*
+        * hardlockup_detector_perf_init() will success even if Pseudo-NMI turns off,
+        * however, the pmu interrupts will act like a normal interrupt instead of
+        * NMI and the hardlockup detector would be broken.
+        */
+       return arm_pmu_irq_is_nmi();
+}
index 55f80fb..8725291 100644 (file)
@@ -333,7 +333,7 @@ void kvm_arch_vcpu_load_debug_state_flags(struct kvm_vcpu *vcpu)
 
        /* Check if we have TRBE implemented and available at the host */
        if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_TraceBuffer_SHIFT) &&
-           !(read_sysreg_s(SYS_TRBIDR_EL1) & TRBIDR_PROG))
+           !(read_sysreg_s(SYS_TRBIDR_EL1) & TRBIDR_EL1_P))
                vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
 }
 
index 4fe217e..2f6e0b3 100644 (file)
@@ -141,6 +141,9 @@ static inline void ___activate_traps(struct kvm_vcpu *vcpu)
 
        if (cpus_have_final_cap(ARM64_HAS_RAS_EXTN) && (hcr & HCR_VSE))
                write_sysreg_s(vcpu->arch.vsesr_el2, SYS_VSESR_EL2);
+
+       if (cpus_have_final_cap(ARM64_HAS_HCX))
+               write_sysreg_s(HCRX_GUEST_FLAGS, SYS_HCRX_EL2);
 }
 
 static inline void ___deactivate_traps(struct kvm_vcpu *vcpu)
@@ -155,6 +158,9 @@ static inline void ___deactivate_traps(struct kvm_vcpu *vcpu)
                vcpu->arch.hcr_el2 &= ~HCR_VSE;
                vcpu->arch.hcr_el2 |= read_sysreg(hcr_el2) & HCR_VSE;
        }
+
+       if (cpus_have_final_cap(ARM64_HAS_HCX))
+               write_sysreg_s(HCRX_HOST_FLAGS, SYS_HCRX_EL2);
 }
 
 static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
index 699ea1f..bb6b571 100644 (file)
@@ -44,6 +44,8 @@ static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
        ctxt_sys_reg(ctxt, TTBR0_EL1)   = read_sysreg_el1(SYS_TTBR0);
        ctxt_sys_reg(ctxt, TTBR1_EL1)   = read_sysreg_el1(SYS_TTBR1);
        ctxt_sys_reg(ctxt, TCR_EL1)     = read_sysreg_el1(SYS_TCR);
+       if (cpus_have_final_cap(ARM64_HAS_TCR2))
+               ctxt_sys_reg(ctxt, TCR2_EL1)    = read_sysreg_el1(SYS_TCR2);
        ctxt_sys_reg(ctxt, ESR_EL1)     = read_sysreg_el1(SYS_ESR);
        ctxt_sys_reg(ctxt, AFSR0_EL1)   = read_sysreg_el1(SYS_AFSR0);
        ctxt_sys_reg(ctxt, AFSR1_EL1)   = read_sysreg_el1(SYS_AFSR1);
@@ -53,6 +55,10 @@ static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
        ctxt_sys_reg(ctxt, CONTEXTIDR_EL1) = read_sysreg_el1(SYS_CONTEXTIDR);
        ctxt_sys_reg(ctxt, AMAIR_EL1)   = read_sysreg_el1(SYS_AMAIR);
        ctxt_sys_reg(ctxt, CNTKCTL_EL1) = read_sysreg_el1(SYS_CNTKCTL);
+       if (cpus_have_final_cap(ARM64_HAS_S1PIE)) {
+               ctxt_sys_reg(ctxt, PIR_EL1)     = read_sysreg_el1(SYS_PIR);
+               ctxt_sys_reg(ctxt, PIRE0_EL1)   = read_sysreg_el1(SYS_PIRE0);
+       }
        ctxt_sys_reg(ctxt, PAR_EL1)     = read_sysreg_par();
        ctxt_sys_reg(ctxt, TPIDR_EL1)   = read_sysreg(tpidr_el1);
 
@@ -114,6 +120,8 @@ static inline void __sysreg_restore_el1_state(struct kvm_cpu_context *ctxt)
        write_sysreg_el1(ctxt_sys_reg(ctxt, CPACR_EL1), SYS_CPACR);
        write_sysreg_el1(ctxt_sys_reg(ctxt, TTBR0_EL1), SYS_TTBR0);
        write_sysreg_el1(ctxt_sys_reg(ctxt, TTBR1_EL1), SYS_TTBR1);
+       if (cpus_have_final_cap(ARM64_HAS_TCR2))
+               write_sysreg_el1(ctxt_sys_reg(ctxt, TCR2_EL1),  SYS_TCR2);
        write_sysreg_el1(ctxt_sys_reg(ctxt, ESR_EL1),   SYS_ESR);
        write_sysreg_el1(ctxt_sys_reg(ctxt, AFSR0_EL1), SYS_AFSR0);
        write_sysreg_el1(ctxt_sys_reg(ctxt, AFSR1_EL1), SYS_AFSR1);
@@ -123,6 +131,10 @@ static inline void __sysreg_restore_el1_state(struct kvm_cpu_context *ctxt)
        write_sysreg_el1(ctxt_sys_reg(ctxt, CONTEXTIDR_EL1), SYS_CONTEXTIDR);
        write_sysreg_el1(ctxt_sys_reg(ctxt, AMAIR_EL1), SYS_AMAIR);
        write_sysreg_el1(ctxt_sys_reg(ctxt, CNTKCTL_EL1), SYS_CNTKCTL);
+       if (cpus_have_final_cap(ARM64_HAS_S1PIE)) {
+               write_sysreg_el1(ctxt_sys_reg(ctxt, PIR_EL1),   SYS_PIR);
+               write_sysreg_el1(ctxt_sys_reg(ctxt, PIRE0_EL1), SYS_PIRE0);
+       }
        write_sysreg(ctxt_sys_reg(ctxt, PAR_EL1),       par_el1);
        write_sysreg(ctxt_sys_reg(ctxt, TPIDR_EL1),     tpidr_el1);
 
index d756b93..4558c02 100644 (file)
@@ -56,7 +56,7 @@ static void __debug_save_trace(u64 *trfcr_el1)
        *trfcr_el1 = 0;
 
        /* Check if the TRBE is enabled */
-       if (!(read_sysreg_s(SYS_TRBLIMITR_EL1) & TRBLIMITR_ENABLE))
+       if (!(read_sysreg_s(SYS_TRBLIMITR_EL1) & TRBLIMITR_EL1_E))
                return;
        /*
         * Prohibit trace generation while we are in guest.
index 753aa74..5b5d5e5 100644 (file)
@@ -401,9 +401,9 @@ static bool trap_oslar_el1(struct kvm_vcpu *vcpu,
                return read_from_write_only(vcpu, p, r);
 
        /* Forward the OSLK bit to OSLSR */
-       oslsr = __vcpu_sys_reg(vcpu, OSLSR_EL1) & ~SYS_OSLSR_OSLK;
-       if (p->regval & SYS_OSLAR_OSLK)
-               oslsr |= SYS_OSLSR_OSLK;
+       oslsr = __vcpu_sys_reg(vcpu, OSLSR_EL1) & ~OSLSR_EL1_OSLK;
+       if (p->regval & OSLAR_EL1_OSLK)
+               oslsr |= OSLSR_EL1_OSLK;
 
        __vcpu_sys_reg(vcpu, OSLSR_EL1) = oslsr;
        return true;
@@ -427,7 +427,7 @@ static int set_oslsr_el1(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd,
         * The only modifiable bit is the OSLK bit. Refuse the write if
         * userspace attempts to change any other bit in the register.
         */
-       if ((val ^ rd->val) & ~SYS_OSLSR_OSLK)
+       if ((val ^ rd->val) & ~OSLSR_EL1_OSLK)
                return -EINVAL;
 
        __vcpu_sys_reg(vcpu, rd->reg) = val;
@@ -1265,6 +1265,7 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu, struct sys_reg_desc const *r
                                 ARM64_FEATURE_MASK(ID_AA64ISAR2_EL1_GPA3));
                if (!cpus_have_final_cap(ARM64_HAS_WFXT))
                        val &= ~ARM64_FEATURE_MASK(ID_AA64ISAR2_EL1_WFxT);
+               val &= ~ARM64_FEATURE_MASK(ID_AA64ISAR2_EL1_MOPS);
                break;
        case SYS_ID_AA64DFR0_EL1:
                /* Limit debug to ARMv8.0 */
@@ -1800,7 +1801,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
        { SYS_DESC(SYS_MDRAR_EL1), trap_raz_wi },
        { SYS_DESC(SYS_OSLAR_EL1), trap_oslar_el1 },
        { SYS_DESC(SYS_OSLSR_EL1), trap_oslsr_el1, reset_val, OSLSR_EL1,
-               SYS_OSLSR_OSLM_IMPLEMENTED, .set_user = set_oslsr_el1, },
+               OSLSR_EL1_OSLM_IMPLEMENTED, .set_user = set_oslsr_el1, },
        { SYS_DESC(SYS_OSDLR_EL1), trap_raz_wi },
        { SYS_DESC(SYS_DBGPRCR_EL1), trap_raz_wi },
        { SYS_DESC(SYS_DBGCLAIMSET_EL1), trap_raz_wi },
@@ -1891,7 +1892,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
        ID_SANITISED(ID_AA64MMFR0_EL1),
        ID_SANITISED(ID_AA64MMFR1_EL1),
        ID_SANITISED(ID_AA64MMFR2_EL1),
-       ID_UNALLOCATED(7,3),
+       ID_SANITISED(ID_AA64MMFR3_EL1),
        ID_UNALLOCATED(7,4),
        ID_UNALLOCATED(7,5),
        ID_UNALLOCATED(7,6),
@@ -1911,6 +1912,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
        { SYS_DESC(SYS_TTBR0_EL1), access_vm_reg, reset_unknown, TTBR0_EL1 },
        { SYS_DESC(SYS_TTBR1_EL1), access_vm_reg, reset_unknown, TTBR1_EL1 },
        { SYS_DESC(SYS_TCR_EL1), access_vm_reg, reset_val, TCR_EL1, 0 },
+       { SYS_DESC(SYS_TCR2_EL1), access_vm_reg, reset_val, TCR2_EL1, 0 },
 
        PTRAUTH_KEY(APIA),
        PTRAUTH_KEY(APIB),
@@ -1960,6 +1962,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
        { SYS_DESC(SYS_PMMIR_EL1), trap_raz_wi },
 
        { SYS_DESC(SYS_MAIR_EL1), access_vm_reg, reset_unknown, MAIR_EL1 },
+       { SYS_DESC(SYS_PIRE0_EL1), access_vm_reg, reset_unknown, PIRE0_EL1 },
+       { SYS_DESC(SYS_PIR_EL1), access_vm_reg, reset_unknown, PIR_EL1 },
        { SYS_DESC(SYS_AMAIR_EL1), access_vm_reg, reset_amair_el1, AMAIR_EL1 },
 
        { SYS_DESC(SYS_LORSA_EL1), trap_loregion },
index 96b1719..f9a53b7 100644 (file)
@@ -10,7 +10,7 @@
 #include <linux/module.h>
 #include <asm/neon-intrinsics.h>
 
-void xor_arm64_neon_2(unsigned long bytes, unsigned long * __restrict p1,
+static void xor_arm64_neon_2(unsigned long bytes, unsigned long * __restrict p1,
        const unsigned long * __restrict p2)
 {
        uint64_t *dp1 = (uint64_t *)p1;
@@ -37,7 +37,7 @@ void xor_arm64_neon_2(unsigned long bytes, unsigned long * __restrict p1,
        } while (--lines > 0);
 }
 
-void xor_arm64_neon_3(unsigned long bytes, unsigned long * __restrict p1,
+static void xor_arm64_neon_3(unsigned long bytes, unsigned long * __restrict p1,
        const unsigned long * __restrict p2,
        const unsigned long * __restrict p3)
 {
@@ -73,7 +73,7 @@ void xor_arm64_neon_3(unsigned long bytes, unsigned long * __restrict p1,
        } while (--lines > 0);
 }
 
-void xor_arm64_neon_4(unsigned long bytes, unsigned long * __restrict p1,
+static void xor_arm64_neon_4(unsigned long bytes, unsigned long * __restrict p1,
        const unsigned long * __restrict p2,
        const unsigned long * __restrict p3,
        const unsigned long * __restrict p4)
@@ -118,7 +118,7 @@ void xor_arm64_neon_4(unsigned long bytes, unsigned long * __restrict p1,
        } while (--lines > 0);
 }
 
-void xor_arm64_neon_5(unsigned long bytes, unsigned long * __restrict p1,
+static void xor_arm64_neon_5(unsigned long bytes, unsigned long * __restrict p1,
        const unsigned long * __restrict p2,
        const unsigned long * __restrict p3,
        const unsigned long * __restrict p4,
index e1e0dca..1881975 100644 (file)
@@ -364,8 +364,8 @@ void cpu_do_switch_mm(phys_addr_t pgd_phys, struct mm_struct *mm)
        ttbr1 &= ~TTBR_ASID_MASK;
        ttbr1 |= FIELD_PREP(TTBR_ASID_MASK, asid);
 
+       cpu_set_reserved_ttbr0_nosync();
        write_sysreg(ttbr1, ttbr1_el1);
-       isb();
        write_sysreg(ttbr0, ttbr0_el1);
        isb();
        post_ttbr_update_workaround();
index 6045a51..c601007 100644 (file)
@@ -66,6 +66,8 @@ static inline const struct fault_info *esr_to_debug_fault_info(unsigned long esr
 
 static void data_abort_decode(unsigned long esr)
 {
+       unsigned long iss2 = ESR_ELx_ISS2(esr);
+
        pr_alert("Data abort info:\n");
 
        if (esr & ESR_ELx_ISV) {
@@ -78,12 +80,21 @@ static void data_abort_decode(unsigned long esr)
                         (esr & ESR_ELx_SF) >> ESR_ELx_SF_SHIFT,
                         (esr & ESR_ELx_AR) >> ESR_ELx_AR_SHIFT);
        } else {
-               pr_alert("  ISV = 0, ISS = 0x%08lx\n", esr & ESR_ELx_ISS_MASK);
+               pr_alert("  ISV = 0, ISS = 0x%08lx, ISS2 = 0x%08lx\n",
+                        esr & ESR_ELx_ISS_MASK, iss2);
        }
 
-       pr_alert("  CM = %lu, WnR = %lu\n",
+       pr_alert("  CM = %lu, WnR = %lu, TnD = %lu, TagAccess = %lu\n",
                 (esr & ESR_ELx_CM) >> ESR_ELx_CM_SHIFT,
-                (esr & ESR_ELx_WNR) >> ESR_ELx_WNR_SHIFT);
+                (esr & ESR_ELx_WNR) >> ESR_ELx_WNR_SHIFT,
+                (iss2 & ESR_ELx_TnD) >> ESR_ELx_TnD_SHIFT,
+                (iss2 & ESR_ELx_TagAccess) >> ESR_ELx_TagAccess_SHIFT);
+
+       pr_alert("  GCS = %ld, Overlay = %lu, DirtyBit = %lu, Xs = %llu\n",
+                (iss2 & ESR_ELx_GCS) >> ESR_ELx_GCS_SHIFT,
+                (iss2 & ESR_ELx_Overlay) >> ESR_ELx_Overlay_SHIFT,
+                (iss2 & ESR_ELx_DirtyBit) >> ESR_ELx_DirtyBit_SHIFT,
+                (iss2 & ESR_ELx_Xs_MASK) >> ESR_ELx_Xs_SHIFT);
 }
 
 static void mem_abort_decode(unsigned long esr)
@@ -177,6 +188,9 @@ static void show_pte(unsigned long addr)
                        break;
 
                ptep = pte_offset_map(pmdp, addr);
+               if (!ptep)
+                       break;
+
                pte = READ_ONCE(*ptep);
                pr_cont(", pte=%016llx", pte_val(pte));
                pte_unmap(ptep);
@@ -317,7 +331,7 @@ static void report_tag_fault(unsigned long addr, unsigned long esr,
         * find out access size.
         */
        bool is_write = !!(esr & ESR_ELx_WNR);
-       kasan_report(addr, 0, is_write, regs->pc);
+       kasan_report((void *)addr, 0, is_write, regs->pc);
 }
 #else
 /* Tag faults aren't enabled without CONFIG_KASAN_HW_TAGS. */
@@ -885,9 +899,6 @@ void do_sp_pc_abort(unsigned long addr, unsigned long esr, struct pt_regs *regs)
 }
 NOKPROBE_SYMBOL(do_sp_pc_abort);
 
-int __init early_brk64(unsigned long addr, unsigned long esr,
-                      struct pt_regs *regs);
-
 /*
  * __refdata because early_brk64 is __init, but the reference to it is
  * clobbered at arch_initcall time.
index 5f9379b..4e64760 100644 (file)
@@ -8,6 +8,7 @@
 
 #include <linux/export.h>
 #include <linux/mm.h>
+#include <linux/libnvdimm.h>
 #include <linux/pagemap.h>
 
 #include <asm/cacheflush.h>
index 95364e8..21716c9 100644 (file)
@@ -307,14 +307,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
                        return NULL;
 
                WARN_ON(addr & (sz - 1));
-               /*
-                * Note that if this code were ever ported to the
-                * 32-bit arm platform then it will cause trouble in
-                * the case where CONFIG_HIGHPTE is set, since there
-                * will be no pte_unmap() to correspond with this
-                * pte_alloc_map().
-                */
-               ptep = pte_alloc_map(mm, pmdp, addr);
+               ptep = pte_alloc_huge(mm, pmdp, addr);
        } else if (sz == PMD_SIZE) {
                if (want_pmd_share(vma, addr) && pud_none(READ_ONCE(*pudp)))
                        ptep = huge_pmd_share(mm, vma, addr, pudp);
@@ -366,7 +359,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
                return (pte_t *)pmdp;
 
        if (sz == CONT_PTE_SIZE)
-               return pte_offset_kernel(pmdp, (addr & CONT_PTE_MASK));
+               return pte_offset_huge(pmdp, (addr & CONT_PTE_MASK));
 
        return NULL;
 }
index 66e70ca..d31c3a9 100644 (file)
@@ -69,6 +69,7 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit;
 
 #define CRASH_ADDR_LOW_MAX             arm64_dma_phys_limit
 #define CRASH_ADDR_HIGH_MAX            (PHYS_MASK + 1)
+#define CRASH_HIGH_SEARCH_BASE         SZ_4G
 
 #define DEFAULT_CRASH_KERNEL_LOW_SIZE  (128UL << 20)
 
@@ -101,12 +102,13 @@ static int __init reserve_crashkernel_low(unsigned long long low_size)
  */
 static void __init reserve_crashkernel(void)
 {
-       unsigned long long crash_base, crash_size;
-       unsigned long long crash_low_size = 0;
+       unsigned long long crash_low_size = 0, search_base = 0;
        unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
+       unsigned long long crash_base, crash_size;
        char *cmdline = boot_command_line;
-       int ret;
        bool fixed_base = false;
+       bool high = false;
+       int ret;
 
        if (!IS_ENABLED(CONFIG_KEXEC_CORE))
                return;
@@ -129,7 +131,9 @@ static void __init reserve_crashkernel(void)
                else if (ret)
                        return;
 
+               search_base = CRASH_HIGH_SEARCH_BASE;
                crash_max = CRASH_ADDR_HIGH_MAX;
+               high = true;
        } else if (ret || !crash_size) {
                /* The specified value is invalid */
                return;
@@ -140,31 +144,51 @@ static void __init reserve_crashkernel(void)
        /* User specifies base address explicitly. */
        if (crash_base) {
                fixed_base = true;
+               search_base = crash_base;
                crash_max = crash_base + crash_size;
        }
 
 retry:
        crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
-                                              crash_base, crash_max);
+                                              search_base, crash_max);
        if (!crash_base) {
                /*
-                * If the first attempt was for low memory, fall back to
-                * high memory, the minimum required low memory will be
-                * reserved later.
+                * For crashkernel=size[KMG]@offset[KMG], print out failure
+                * message if can't reserve the specified region.
                 */
-               if (!fixed_base && (crash_max == CRASH_ADDR_LOW_MAX)) {
+               if (fixed_base) {
+                       pr_warn("crashkernel reservation failed - memory is in use.\n");
+                       return;
+               }
+
+               /*
+                * For crashkernel=size[KMG], if the first attempt was for
+                * low memory, fall back to high memory, the minimum required
+                * low memory will be reserved later.
+                */
+               if (!high && crash_max == CRASH_ADDR_LOW_MAX) {
                        crash_max = CRASH_ADDR_HIGH_MAX;
+                       search_base = CRASH_ADDR_LOW_MAX;
                        crash_low_size = DEFAULT_CRASH_KERNEL_LOW_SIZE;
                        goto retry;
                }
 
+               /*
+                * For crashkernel=size[KMG],high, if the first attempt was
+                * for high memory, fall back to low memory.
+                */
+               if (high && crash_max == CRASH_ADDR_HIGH_MAX) {
+                       crash_max = CRASH_ADDR_LOW_MAX;
+                       search_base = 0;
+                       goto retry;
+               }
                pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
                        crash_size);
                return;
        }
 
-       if ((crash_base > CRASH_ADDR_LOW_MAX - crash_low_size) &&
-            crash_low_size && reserve_crashkernel_low(crash_low_size)) {
+       if ((crash_base >= CRASH_ADDR_LOW_MAX) && crash_low_size &&
+            reserve_crashkernel_low(crash_low_size)) {
                memblock_phys_free(crash_base, crash_size);
                return;
        }
@@ -442,7 +466,12 @@ void __init bootmem_init(void)
  */
 void __init mem_init(void)
 {
-       swiotlb_init(max_pfn > PFN_DOWN(arm64_dma_phys_limit), SWIOTLB_VERBOSE);
+       bool swiotlb = max_pfn > PFN_DOWN(arm64_dma_phys_limit);
+
+       if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC))
+               swiotlb = true;
+
+       swiotlb_init(swiotlb, SWIOTLB_VERBOSE);
 
        /* this will put all unused low memory onto the freelists */
        memblock_free_all();
index e969e68..f17d066 100644 (file)
@@ -214,7 +214,7 @@ static void __init clear_pgds(unsigned long start,
 static void __init kasan_init_shadow(void)
 {
        u64 kimg_shadow_start, kimg_shadow_end;
-       u64 mod_shadow_start, mod_shadow_end;
+       u64 mod_shadow_start;
        u64 vmalloc_shadow_end;
        phys_addr_t pa_start, pa_end;
        u64 i;
@@ -223,7 +223,6 @@ static void __init kasan_init_shadow(void)
        kimg_shadow_end = PAGE_ALIGN((u64)kasan_mem_to_shadow(KERNEL_END));
 
        mod_shadow_start = (u64)kasan_mem_to_shadow((void *)MODULES_VADDR);
-       mod_shadow_end = (u64)kasan_mem_to_shadow((void *)MODULES_END);
 
        vmalloc_shadow_end = (u64)kasan_mem_to_shadow((void *)VMALLOC_END);
 
@@ -246,17 +245,9 @@ static void __init kasan_init_shadow(void)
        kasan_populate_early_shadow(kasan_mem_to_shadow((void *)PAGE_END),
                                   (void *)mod_shadow_start);
 
-       if (IS_ENABLED(CONFIG_KASAN_VMALLOC)) {
-               BUILD_BUG_ON(VMALLOC_START != MODULES_END);
-               kasan_populate_early_shadow((void *)vmalloc_shadow_end,
-                                           (void *)KASAN_SHADOW_END);
-       } else {
-               kasan_populate_early_shadow((void *)kimg_shadow_end,
-                                           (void *)KASAN_SHADOW_END);
-               if (kimg_shadow_start > mod_shadow_end)
-                       kasan_populate_early_shadow((void *)mod_shadow_end,
-                                                   (void *)kimg_shadow_start);
-       }
+       BUILD_BUG_ON(VMALLOC_START != MODULES_END);
+       kasan_populate_early_shadow((void *)vmalloc_shadow_end,
+                                   (void *)KASAN_SHADOW_END);
 
        for_each_mem_range(i, &pa_start, &pa_end) {
                void *start = (void *)__phys_to_virt(pa_start);
index af6bc84..95d3608 100644 (file)
@@ -451,7 +451,7 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
 void __init create_mapping_noalloc(phys_addr_t phys, unsigned long virt,
                                   phys_addr_t size, pgprot_t prot)
 {
-       if ((virt >= PAGE_END) && (virt < VMALLOC_START)) {
+       if (virt < PAGE_OFFSET) {
                pr_warn("BUG: not creating mapping for %pa at 0x%016lx - outside kernel range\n",
                        &phys, virt);
                return;
@@ -478,7 +478,7 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 static void update_mapping_prot(phys_addr_t phys, unsigned long virt,
                                phys_addr_t size, pgprot_t prot)
 {
-       if ((virt >= PAGE_END) && (virt < VMALLOC_START)) {
+       if (virt < PAGE_OFFSET) {
                pr_warn("BUG: not updating mapping for %pa at 0x%016lx - outside kernel range\n",
                        &phys, virt);
                return;
@@ -663,12 +663,17 @@ static void __init map_kernel_segment(pgd_t *pgdp, void *va_start, void *va_end,
        vm_area_add_early(vma);
 }
 
+static pgprot_t kernel_exec_prot(void)
+{
+       return rodata_enabled ? PAGE_KERNEL_ROX : PAGE_KERNEL_EXEC;
+}
+
 #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
 static int __init map_entry_trampoline(void)
 {
        int i;
 
-       pgprot_t prot = rodata_enabled ? PAGE_KERNEL_ROX : PAGE_KERNEL_EXEC;
+       pgprot_t prot = kernel_exec_prot();
        phys_addr_t pa_start = __pa_symbol(__entry_tramp_text_start);
 
        /* The trampoline is always mapped and can therefore be global */
@@ -723,7 +728,7 @@ static void __init map_kernel(pgd_t *pgdp)
         * mapping to install SW breakpoints. Allow this (only) when
         * explicitly requested with rodata=off.
         */
-       pgprot_t text_prot = rodata_enabled ? PAGE_KERNEL_ROX : PAGE_KERNEL_EXEC;
+       pgprot_t text_prot = kernel_exec_prot();
 
        /*
         * If we have a CPU that supports BTI and a kernel built for
index c2cb437..2baeec4 100644 (file)
@@ -199,7 +199,7 @@ SYM_FUNC_END(idmap_cpu_replace_ttbr1)
 
 #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
 
-#define KPTI_NG_PTE_FLAGS      (PTE_ATTRINDX(MT_NORMAL) | SWAPPER_PTE_FLAGS)
+#define KPTI_NG_PTE_FLAGS      (PTE_ATTRINDX(MT_NORMAL) | SWAPPER_PTE_FLAGS | PTE_WRITE)
 
        .pushsection ".idmap.text", "a"
 
@@ -290,7 +290,7 @@ SYM_TYPED_FUNC_START(idmap_kpti_install_ng_mappings)
        isb
 
        mov     temp_pte, x5
-       mov     pte_flags, #KPTI_NG_PTE_FLAGS
+       mov_q   pte_flags, KPTI_NG_PTE_FLAGS
 
        /* Everybody is enjoying the idmap, so we can rewrite swapper. */
        /* PGD */
@@ -454,6 +454,21 @@ SYM_FUNC_START(__cpu_setup)
 #endif /* CONFIG_ARM64_HW_AFDBM */
        msr     mair_el1, mair
        msr     tcr_el1, tcr
+
+       mrs_s   x1, SYS_ID_AA64MMFR3_EL1
+       ubfx    x1, x1, #ID_AA64MMFR3_EL1_S1PIE_SHIFT, #4
+       cbz     x1, .Lskip_indirection
+
+       mov_q   x0, PIE_E0
+       msr     REG_PIRE0_EL1, x0
+       mov_q   x0, PIE_E1
+       msr     REG_PIR_EL1, x0
+
+       mov     x0, TCR2_EL1x_PIE
+       msr     REG_TCR2_EL1, x0
+
+.Lskip_indirection:
+
        /*
         * Prepare SCTLR
         */
index 40ba954..19c23c4 100644 (file)
@@ -32,16 +32,20 @@ HAS_GENERIC_AUTH_IMP_DEF
 HAS_GIC_CPUIF_SYSREGS
 HAS_GIC_PRIO_MASKING
 HAS_GIC_PRIO_RELAXED_SYNC
+HAS_HCX
 HAS_LDAPR
 HAS_LSE_ATOMICS
+HAS_MOPS
 HAS_NESTED_VIRT
 HAS_NO_FPSIMD
 HAS_NO_HW_PREFETCH
 HAS_PAN
+HAS_S1PIE
 HAS_RAS_EXTN
 HAS_RNG
 HAS_SB
 HAS_STAGE2_FWB
+HAS_TCR2
 HAS_TIDCP1
 HAS_TLB_RANGE
 HAS_VIRT_HOST_EXTN
index 00c9e72..8525980 100755 (executable)
@@ -24,12 +24,12 @@ BEGIN {
 }
 
 /^[vA-Z0-9_]+$/ {
-       printf("#define ARM64_%-30s\t%d\n", $0, cap_num++)
+       printf("#define ARM64_%-40s\t%d\n", $0, cap_num++)
        next
 }
 
 END {
-       printf("#define ARM64_NCAPS\t\t\t\t%d\n", cap_num)
+       printf("#define ARM64_NCAPS\t\t\t\t\t%d\n", cap_num)
        print ""
        print "#endif /* __ASM_CPUCAPS_H */"
 }
index c9a0d1f..1ea4a3d 100644 (file)
 # feature that introduces them (eg, FEAT_LS64_ACCDATA introduces enumeration
# item ACCDATA) though it may be more tasteful to do something else.
 
+Sysreg OSDTRRX_EL1     2       0       0       0       2
+Res0   63:32
+Field  31:0    DTRRX
+EndSysreg
+
+Sysreg MDCCINT_EL1     2       0       0       2       0
+Res0   63:31
+Field  30      RX
+Field  29      TX
+Res0   28:0
+EndSysreg
+
+Sysreg MDSCR_EL1       2       0       0       2       2
+Res0   63:36
+Field  35      EHBWE
+Field  34      EnSPM
+Field  33      TTA
+Field  32      EMBWE
+Field  31      TFO
+Field  30      RXfull
+Field  29      TXfull
+Res0   28
+Field  27      RXO
+Field  26      TXU
+Res0   25:24
+Field  23:22   INTdis
+Field  21      TDA
+Res0   20
+Field  19      SC2
+Res0   18:16
+Field  15      MDE
+Field  14      HDE
+Field  13      KDE
+Field  12      TDCC
+Res0   11:7
+Field  6       ERR
+Res0   5:1
+Field  0       SS
+EndSysreg
+
+Sysreg OSDTRTX_EL1     2       0       0       3       2
+Res0   63:32
+Field  31:0    DTRTX
+EndSysreg
+
+Sysreg OSECCR_EL1      2       0       0       6       2
+Res0   63:32
+Field  31:0    EDECCR
+EndSysreg
+
+Sysreg OSLAR_EL1       2       0       1       0       4
+Res0   63:1
+Field  0       OSLK
+EndSysreg
+
 Sysreg ID_PFR0_EL1     3       0       0       1       0
 Res0   63:32
 UnsignedEnum   31:28   RAS
@@ -1538,6 +1593,78 @@ UnsignedEnum     3:0     CnP
 EndEnum
 EndSysreg
 
+Sysreg ID_AA64MMFR3_EL1        3       0       0       7       3
+UnsignedEnum   63:60   Spec_FPACC
+       0b0000  NI
+       0b0001  IMP
+EndEnum
+UnsignedEnum   59:56   ADERR
+       0b0000  NI
+       0b0001  DEV_ASYNC
+       0b0010  FEAT_ADERR
+       0b0011  FEAT_ADERR_IND
+EndEnum
+UnsignedEnum   55:52   SDERR
+       0b0000  NI
+       0b0001  DEV_SYNC
+       0b0010  FEAT_ADERR
+       0b0011  FEAT_ADERR_IND
+EndEnum
+Res0   51:48
+UnsignedEnum   47:44   ANERR
+       0b0000  NI
+       0b0001  ASYNC
+       0b0010  FEAT_ANERR
+       0b0011  FEAT_ANERR_IND
+EndEnum
+UnsignedEnum   43:40   SNERR
+       0b0000  NI
+       0b0001  SYNC
+       0b0010  FEAT_ANERR
+       0b0011  FEAT_ANERR_IND
+EndEnum
+UnsignedEnum   39:36   D128_2
+       0b0000  NI
+       0b0001  IMP
+EndEnum
+UnsignedEnum   35:32   D128
+       0b0000  NI
+       0b0001  IMP
+EndEnum
+UnsignedEnum   31:28   MEC
+       0b0000  NI
+       0b0001  IMP
+EndEnum
+UnsignedEnum   27:24   AIE
+       0b0000  NI
+       0b0001  IMP
+EndEnum
+UnsignedEnum   23:20   S2POE
+       0b0000  NI
+       0b0001  IMP
+EndEnum
+UnsignedEnum   19:16   S1POE
+       0b0000  NI
+       0b0001  IMP
+EndEnum
+UnsignedEnum   15:12   S2PIE
+       0b0000  NI
+       0b0001  IMP
+EndEnum
+UnsignedEnum   11:8    S1PIE
+       0b0000  NI
+       0b0001  IMP
+EndEnum
+UnsignedEnum   7:4     SCTLRX
+       0b0000  NI
+       0b0001  IMP
+EndEnum
+UnsignedEnum   3:0     TCRX
+       0b0000  NI
+       0b0001  IMP
+EndEnum
+EndSysreg
+
 Sysreg SCTLR_EL1       3       0       1       0       0
 Field  63      TIDCP
 Field  62      SPINTMASK
@@ -2034,7 +2161,17 @@ Fields   ZCR_ELx
 EndSysreg
 
 Sysreg HCRX_EL2        3       4       1       2       2
-Res0   63:12
+Res0   63:23
+Field  22      GCSEn
+Field  21      EnIDCP128
+Field  20      EnSDERR
+Field  19      TMEA
+Field  18      EnSNERR
+Field  17      D128En
+Field  16      PTTWI
+Field  15      SCTLR2En
+Field  14      TCR2En
+Res0   13:12
 Field  11      MSCEn
 Field  10      MCE2
 Field  9       CMOW
@@ -2153,6 +2290,87 @@ Sysreg   TTBR1_EL1       3       0       2       0       1
 Fields TTBRx_EL1
 EndSysreg
 
+SysregFields   TCR2_EL1x
+Res0   63:16
+Field  15      DisCH1
+Field  14      DisCH0
+Res0   13:12
+Field  11      HAFT
+Field  10      PTTWI
+Res0   9:6
+Field  5       D128
+Field  4       AIE
+Field  3       POE
+Field  2       E0POE
+Field  1       PIE
+Field  0       PnCH
+EndSysregFields
+
+Sysreg TCR2_EL1        3       0       2       0       3
+Fields TCR2_EL1x
+EndSysreg
+
+Sysreg TCR2_EL12       3       5       2       0       3
+Fields TCR2_EL1x
+EndSysreg
+
+Sysreg TCR2_EL2        3       4       2       0       3
+Res0   63:16
+Field  15      DisCH1
+Field  14      DisCH0
+Field  13      AMEC1
+Field  12      AMEC0
+Field  11      HAFT
+Field  10      PTTWI
+Field  9:8     SKL1
+Field  7:6     SKL0
+Field  5       D128
+Field  4       AIE
+Field  3       POE
+Field  2       E0POE
+Field  1       PIE
+Field  0       PnCH
+EndSysreg
+
+SysregFields PIRx_ELx
+Field  63:60   Perm15
+Field  59:56   Perm14
+Field  55:52   Perm13
+Field  51:48   Perm12
+Field  47:44   Perm11
+Field  43:40   Perm10
+Field  39:36   Perm9
+Field  35:32   Perm8
+Field  31:28   Perm7
+Field  27:24   Perm6
+Field  23:20   Perm5
+Field  19:16   Perm4
+Field  15:12   Perm3
+Field  11:8    Perm2
+Field  7:4     Perm1
+Field  3:0     Perm0
+EndSysregFields
+
+Sysreg PIRE0_EL1       3       0       10      2       2
+Fields PIRx_ELx
+EndSysreg
+
+Sysreg PIRE0_EL12      3       5       10      2       2
+Fields PIRx_ELx
+EndSysreg
+
+Sysreg PIR_EL1         3       0       10      2       3
+Fields PIRx_ELx
+EndSysreg
+
+Sysreg PIR_EL12        3       5       10      2       3
+Fields PIRx_ELx
+EndSysreg
+
+Sysreg PIR_EL2         3       4       10      2       3
+Fields PIRx_ELx
+EndSysreg
+
 Sysreg LORSA_EL1       3       0       10      4       0
 Res0   63:52
 Field  51:16   SA
@@ -2200,3 +2418,80 @@ Sysreg   ICC_NMIAR1_EL1  3       0       12      9       5
 Res0   63:24
 Field  23:0    INTID
 EndSysreg
+
+Sysreg TRBLIMITR_EL1   3       0       9       11      0
+Field  63:12   LIMIT
+Res0   11:7
+Field  6       XE
+Field  5       nVM
+Enum   4:3     TM
+       0b00    STOP
+       0b01    IRQ
+       0b11    IGNR
+EndEnum
+Enum   2:1     FM
+       0b00    FILL
+       0b01    WRAP
+       0b11    CBUF
+EndEnum
+Field  0       E
+EndSysreg
+
+Sysreg TRBPTR_EL1      3       0       9       11      1
+Field  63:0    PTR
+EndSysreg
+
+Sysreg TRBBASER_EL1    3       0       9       11      2
+Field  63:12   BASE
+Res0   11:0
+EndSysreg
+
+Sysreg TRBSR_EL1       3       0       9       11      3
+Res0   63:56
+Field  55:32   MSS2
+Field  31:26   EC
+Res0   25:24
+Field  23      DAT
+Field  22      IRQ
+Field  21      TRG
+Field  20      WRAP
+Res0   19
+Field  18      EA
+Field  17      S
+Res0   16
+Field  15:0    MSS
+EndSysreg
+
+Sysreg TRBMAR_EL1      3       0       9       11      4
+Res0   63:12
+Enum   11:10   PAS
+       0b00    SECURE
+       0b01    NON_SECURE
+       0b10    ROOT
+       0b11    REALM
+EndEnum
+Enum   9:8     SH
+       0b00    NON_SHAREABLE
+       0b10    OUTER_SHAREABLE
+       0b11    INNER_SHAREABLE
+EndEnum
+Field  7:0     Attr
+EndSysreg
+
+Sysreg TRBTRG_EL1      3       0       9       11      6
+Res0   63:32
+Field  31:0    TRG
+EndSysreg
+
+Sysreg TRBIDR_EL1      3       0       9       11      7
+Res0   63:12
+Enum   11:8    EA
+       0b0000  NON_DESC
+       0b0001  IGNORE
+       0b0010  SERROR
+EndEnum
+Res0   7:6
+Field  5       F
+Field  4       P
+Field  3:0     Align
+EndSysreg
index 4df1f8c..95f1e9b 100644 (file)
@@ -96,6 +96,7 @@ config CSKY
        select HAVE_REGS_AND_STACK_ACCESS_API
        select HAVE_STACKPROTECTOR
        select HAVE_SYSCALL_TRACEPOINTS
+       select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
        select MAY_HAVE_SPARSE_IRQ
        select MODULES_USE_ELF_RELA if MODULES
        select OF
index 60406ef..4dab44f 100644 (file)
@@ -195,41 +195,6 @@ arch_atomic_dec_if_positive(atomic_t *v)
 }
 #define arch_atomic_dec_if_positive arch_atomic_dec_if_positive
 
-#define ATOMIC_OP()                                                    \
-static __always_inline                                                 \
-int arch_atomic_xchg_relaxed(atomic_t *v, int n)                       \
-{                                                                      \
-       return __xchg_relaxed(n, &(v->counter), 4);                     \
-}                                                                      \
-static __always_inline                                                 \
-int arch_atomic_cmpxchg_relaxed(atomic_t *v, int o, int n)             \
-{                                                                      \
-       return __cmpxchg_relaxed(&(v->counter), o, n, 4);               \
-}                                                                      \
-static __always_inline                                                 \
-int arch_atomic_cmpxchg_acquire(atomic_t *v, int o, int n)             \
-{                                                                      \
-       return __cmpxchg_acquire(&(v->counter), o, n, 4);               \
-}                                                                      \
-static __always_inline                                                 \
-int arch_atomic_cmpxchg(atomic_t *v, int o, int n)                     \
-{                                                                      \
-       return __cmpxchg(&(v->counter), o, n, 4);                       \
-}
-
-#define ATOMIC_OPS()                                                   \
-       ATOMIC_OP()
-
-ATOMIC_OPS()
-
-#define arch_atomic_xchg_relaxed       arch_atomic_xchg_relaxed
-#define arch_atomic_cmpxchg_relaxed    arch_atomic_cmpxchg_relaxed
-#define arch_atomic_cmpxchg_acquire    arch_atomic_cmpxchg_acquire
-#define arch_atomic_cmpxchg            arch_atomic_cmpxchg
-
-#undef ATOMIC_OPS
-#undef ATOMIC_OP
-
 #else
 #include <asm-generic/atomic.h>
 #endif
index 668b79c..d3db334 100644 (file)
@@ -23,7 +23,7 @@ void __init set_send_ipi(void (*func)(const struct cpumask *mask), int irq);
 
 int __cpu_disable(void);
 
-void __cpu_die(unsigned int cpu);
+static inline void __cpu_die(unsigned int cpu) { }
 
 #endif /* CONFIG_SMP */
 
index b12e2c3..8e42352 100644 (file)
@@ -291,12 +291,8 @@ int __cpu_disable(void)
        return 0;
 }
 
-void __cpu_die(unsigned int cpu)
+void arch_cpuhp_cleanup_dead_cpu(unsigned int cpu)
 {
-       if (!cpu_wait_death(cpu, 5)) {
-               pr_crit("CPU%u: shutdown failed\n", cpu);
-               return;
-       }
        pr_notice("CPU%u: shutdown\n", cpu);
 }
 
@@ -304,7 +300,7 @@ void __noreturn arch_cpu_idle_dead(void)
 {
        idle_task_exit();
 
-       cpu_report_death();
+       cpuhp_ap_report_dead();
 
        while (!secondary_stack)
                arch_cpu_idle();
index 6e94f8d..2447d08 100644 (file)
@@ -28,58 +28,8 @@ static inline void arch_atomic_set(atomic_t *v, int new)
 
 #define arch_atomic_set_release(v, i)  arch_atomic_set((v), (i))
 
-/**
- * arch_atomic_read - reads a word, atomically
- * @v: pointer to atomic value
- *
- * Assumes all word reads on our architecture are atomic.
- */
 #define arch_atomic_read(v)            READ_ONCE((v)->counter)
 
-/**
- * arch_atomic_xchg - atomic
- * @v: pointer to memory to change
- * @new: new value (technically passed in a register -- see xchg)
- */
-#define arch_atomic_xchg(v, new)       (arch_xchg(&((v)->counter), (new)))
-
-
-/**
- * arch_atomic_cmpxchg - atomic compare-and-exchange values
- * @v: pointer to value to change
- * @old:  desired old value to match
- * @new:  new value to put in
- *
- * Parameters are then pointer, value-in-register, value-in-register,
- * and the output is the old value.
- *
- * Apparently this is complicated for archs that don't support
- * the memw_locked like we do (or it's broken or whatever).
- *
- * Kind of the lynchpin of the rest of the generically defined routines.
- * Remember V2 had that bug with dotnew predicate set by memw_locked.
- *
- * "old" is "expected" old val, __oldval is actual old value
- */
-static inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
-{
-       int __oldval;
-
-       asm volatile(
-               "1:     %0 = memw_locked(%1);\n"
-               "       { P0 = cmp.eq(%0,%2);\n"
-               "         if (!P0.new) jump:nt 2f; }\n"
-               "       memw_locked(%1,P0) = %3;\n"
-               "       if (!P0) jump 1b;\n"
-               "2:\n"
-               : "=&r" (__oldval)
-               : "r" (&v->counter), "r" (old), "r" (new)
-               : "memory", "p0"
-       );
-
-       return __oldval;
-}
-
 #define ATOMIC_OP(op)                                                  \
 static inline void arch_atomic_##op(int i, atomic_t *v)                        \
 {                                                                      \
@@ -135,6 +85,11 @@ static inline int arch_atomic_fetch_##op(int i, atomic_t *v)                \
 ATOMIC_OPS(add)
 ATOMIC_OPS(sub)
 
+#define arch_atomic_add_return                 arch_atomic_add_return
+#define arch_atomic_sub_return                 arch_atomic_sub_return
+#define arch_atomic_fetch_add                  arch_atomic_fetch_add
+#define arch_atomic_fetch_sub                  arch_atomic_fetch_sub
+
 #undef ATOMIC_OPS
 #define ATOMIC_OPS(op) ATOMIC_OP(op) ATOMIC_FETCH_OP(op)
 
@@ -142,21 +97,15 @@ ATOMIC_OPS(and)
 ATOMIC_OPS(or)
 ATOMIC_OPS(xor)
 
+#define arch_atomic_fetch_and                  arch_atomic_fetch_and
+#define arch_atomic_fetch_or                   arch_atomic_fetch_or
+#define arch_atomic_fetch_xor                  arch_atomic_fetch_xor
+
 #undef ATOMIC_OPS
 #undef ATOMIC_FETCH_OP
 #undef ATOMIC_OP_RETURN
 #undef ATOMIC_OP
 
-/**
- * arch_atomic_fetch_add_unless - add unless the number is a given value
- * @v: pointer to value
- * @a: amount to add
- * @u: unless value is equal to u
- *
- * Returns old value.
- *
- */
-
 static inline int arch_atomic_fetch_add_unless(atomic_t *v, int a, int u)
 {
        int __oldval;
index 1880d9b..621674e 100644 (file)
@@ -66,9 +66,9 @@ void __init setup_arch(char **cmdline_p)
                on_simulator = 0;
 
        if (p[0] != '\0')
-               strlcpy(boot_command_line, p, COMMAND_LINE_SIZE);
+               strscpy(boot_command_line, p, COMMAND_LINE_SIZE);
        else
-               strlcpy(boot_command_line, default_command_line,
+               strscpy(boot_command_line, default_command_line,
                        COMMAND_LINE_SIZE);
 
        /*
@@ -76,7 +76,7 @@ void __init setup_arch(char **cmdline_p)
         * are both picked up by the init code. If no reason to
         * make them different, pass the same pointer back.
         */
-       strlcpy(cmd_line, boot_command_line, COMMAND_LINE_SIZE);
+       strscpy(cmd_line, boot_command_line, COMMAND_LINE_SIZE);
        *cmdline_p = cmd_line;
 
        parse_early_param();
index 21fa63c..2cd93e6 100644 (file)
@@ -9,6 +9,7 @@ menu "Processor type and features"
 config IA64
        bool
        select ARCH_BINFMT_ELF_EXTRA_PHDRS
+       select ARCH_HAS_CPU_FINALIZE_INIT
        select ARCH_HAS_DMA_MARK_CLEAN
        select ARCH_HAS_STRNCPY_FROM_USER
        select ARCH_HAS_STRNLEN_USER
index 266c429..6540a62 100644 (file)
@@ -207,13 +207,6 @@ ATOMIC64_FETCH_OP(xor, ^)
 #undef ATOMIC64_FETCH_OP
 #undef ATOMIC64_OP
 
-#define arch_atomic_cmpxchg(v, old, new) (arch_cmpxchg(&((v)->counter), old, new))
-#define arch_atomic_xchg(v, new) (arch_xchg(&((v)->counter), new))
-
-#define arch_atomic64_cmpxchg(v, old, new) \
-       (arch_cmpxchg(&((v)->counter), old, new))
-#define arch_atomic64_xchg(v, new) (arch_xchg(&((v)->counter), new))
-
 #define arch_atomic_add(i,v)           (void)arch_atomic_add_return((i), (v))
 #define arch_atomic_sub(i,v)           (void)arch_atomic_sub_return((i), (v))
 
diff --git a/arch/ia64/include/asm/bugs.h b/arch/ia64/include/asm/bugs.h
deleted file mode 100644 (file)
index 0d6b9bd..0000000
+++ /dev/null
@@ -1,20 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * This is included by init/main.c to check for architecture-dependent bugs.
- *
- * Needs:
- *     void check_bugs(void);
- *
- * Based on <asm-alpha/bugs.h>.
- *
- * Modified 1998, 1999, 2003
- *     David Mosberger-Tang <davidm@hpl.hp.com>,  Hewlett-Packard Co.
- */
-#ifndef _ASM_IA64_BUGS_H
-#define _ASM_IA64_BUGS_H
-
-#include <asm/processor.h>
-
-extern void check_bugs (void);
-
-#endif /* _ASM_IA64_BUGS_H */
index c057280..5a55ac8 100644 (file)
@@ -627,7 +627,7 @@ setup_arch (char **cmdline_p)
         * is physical disk 1 partition 1 and the Linux root disk is
         * physical disk 1 partition 2.
         */
-       ROOT_DEV = Root_SDA2;           /* default to second partition on first drive */
+       ROOT_DEV = MKDEV(SCSI_DISK0_MAJOR, 2);
 
        if (is_uv_system())
                uv_setup(cmdline_p);
@@ -1067,8 +1067,7 @@ cpu_init (void)
        }
 }
 
-void __init
-check_bugs (void)
+void __init arch_cpu_finalize_init(void)
 {
        ia64_patch_mckinley_e9((unsigned long) __start___mckinley_e9_bundles,
                               (unsigned long) __end___mckinley_e9_bundles);
index 72c929d..f8c74ff 100644 (file)
 448    common  process_mrelease                sys_process_mrelease
 449    common  futex_waitv                     sys_futex_waitv
 450    common  set_mempolicy_home_node         sys_set_mempolicy_home_node
+451    common  cachestat                       sys_cachestat
index 78a02e0..adc49f2 100644 (file)
@@ -41,7 +41,7 @@ huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
        if (pud) {
                pmd = pmd_alloc(mm, pud, taddr);
                if (pmd)
-                       pte = pte_alloc_map(mm, pmd, taddr);
+                       pte = pte_alloc_huge(mm, pmd, taddr);
        }
        return pte;
 }
@@ -64,7 +64,7 @@ huge_pte_offset (struct mm_struct *mm, unsigned long addr, unsigned long sz)
                        if (pud_present(*pud)) {
                                pmd = pmd_offset(pud, taddr);
                                if (pmd_present(*pmd))
-                                       pte = pte_offset_map(pmd, taddr);
+                                       pte = pte_offset_huge(pmd, taddr);
                        }
                }
        }
index d38b066..cbab4f9 100644 (file)
@@ -10,6 +10,7 @@ config LOONGARCH
        select ARCH_ENABLE_MEMORY_HOTPLUG
        select ARCH_ENABLE_MEMORY_HOTREMOVE
        select ARCH_HAS_ACPI_TABLE_UPGRADE      if ACPI
+       select ARCH_HAS_CPU_FINALIZE_INIT
        select ARCH_HAS_FORTIFY_SOURCE
        select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
        select ARCH_HAS_PTE_SPECIAL
index 6b9aca9..e27f0c7 100644 (file)
 
 #define ATOMIC_INIT(i)   { (i) }
 
-/*
- * arch_atomic_read - read atomic variable
- * @v: pointer of type atomic_t
- *
- * Atomically reads the value of @v.
- */
 #define arch_atomic_read(v)    READ_ONCE((v)->counter)
-
-/*
- * arch_atomic_set - set atomic variable
- * @v: pointer of type atomic_t
- * @i: required value
- *
- * Atomically sets the value of @v to @i.
- */
 #define arch_atomic_set(v, i)  WRITE_ONCE((v)->counter, (i))
 
 #define ATOMIC_OP(op, I, asm_op)                                       \
@@ -139,14 +125,6 @@ static inline int arch_atomic_fetch_add_unless(atomic_t *v, int a, int u)
 }
 #define arch_atomic_fetch_add_unless arch_atomic_fetch_add_unless
 
-/*
- * arch_atomic_sub_if_positive - conditionally subtract integer from atomic variable
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
- *
- * Atomically test @v and subtract @i if @v is greater or equal than @i.
- * The function returns the old value of @v minus @i.
- */
 static inline int arch_atomic_sub_if_positive(int i, atomic_t *v)
 {
        int result;
@@ -181,31 +159,13 @@ static inline int arch_atomic_sub_if_positive(int i, atomic_t *v)
        return result;
 }
 
-#define arch_atomic_cmpxchg(v, o, n) (arch_cmpxchg(&((v)->counter), (o), (n)))
-#define arch_atomic_xchg(v, new) (arch_xchg(&((v)->counter), (new)))
-
-/*
- * arch_atomic_dec_if_positive - decrement by 1 if old value positive
- * @v: pointer of type atomic_t
- */
 #define arch_atomic_dec_if_positive(v) arch_atomic_sub_if_positive(1, v)
 
 #ifdef CONFIG_64BIT
 
 #define ATOMIC64_INIT(i)    { (i) }
 
-/*
- * arch_atomic64_read - read atomic variable
- * @v: pointer of type atomic64_t
- *
- */
 #define arch_atomic64_read(v)  READ_ONCE((v)->counter)
-
-/*
- * arch_atomic64_set - set atomic variable
- * @v: pointer of type atomic64_t
- * @i: required value
- */
 #define arch_atomic64_set(v, i)        WRITE_ONCE((v)->counter, (i))
 
 #define ATOMIC64_OP(op, I, asm_op)                                     \
@@ -300,14 +260,6 @@ static inline long arch_atomic64_fetch_add_unless(atomic64_t *v, long a, long u)
 }
 #define arch_atomic64_fetch_add_unless arch_atomic64_fetch_add_unless
 
-/*
- * arch_atomic64_sub_if_positive - conditionally subtract integer from atomic variable
- * @i: integer value to subtract
- * @v: pointer of type atomic64_t
- *
- * Atomically test @v and subtract @i if @v is greater or equal than @i.
- * The function returns the old value of @v minus @i.
- */
 static inline long arch_atomic64_sub_if_positive(long i, atomic64_t *v)
 {
        long result;
@@ -342,14 +294,6 @@ static inline long arch_atomic64_sub_if_positive(long i, atomic64_t *v)
        return result;
 }
 
-#define arch_atomic64_cmpxchg(v, o, n) \
-       ((__typeof__((v)->counter))arch_cmpxchg(&((v)->counter), (o), (n)))
-#define arch_atomic64_xchg(v, new) (arch_xchg(&((v)->counter), (new)))
-
-/*
- * arch_atomic64_dec_if_positive - decrement by 1 if old value positive
- * @v: pointer of type atomic64_t
- */
 #define arch_atomic64_dec_if_positive(v)       arch_atomic64_sub_if_positive(1, v)
 
 #endif /* CONFIG_64BIT */
diff --git a/arch/loongarch/include/asm/bugs.h b/arch/loongarch/include/asm/bugs.h
deleted file mode 100644 (file)
index 9839653..0000000
+++ /dev/null
@@ -1,15 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * This is included by init/main.c to check for architecture-dependent bugs.
- *
- * Copyright (C) 2020-2022 Loongson Technology Corporation Limited
- */
-#ifndef _ASM_BUGS_H
-#define _ASM_BUGS_H
-
-#include <asm/cpu.h>
-#include <asm/cpu-info.h>
-
-extern void check_bugs(void);
-
-#endif /* _ASM_BUGS_H */
index 35e8a52..1c2a0a2 100644 (file)
@@ -1167,7 +1167,7 @@ static __always_inline void iocsr_write64(u64 val, u32 reg)
 
 #ifndef __ASSEMBLY__
 
-static inline u64 drdtime(void)
+static __always_inline u64 drdtime(void)
 {
        int rID = 0;
        u64 val = 0;
index 4444b13..78a0035 100644 (file)
@@ -12,6 +12,7 @@
  */
 #include <linux/init.h>
 #include <linux/acpi.h>
+#include <linux/cpu.h>
 #include <linux/dmi.h>
 #include <linux/efi.h>
 #include <linux/export.h>
@@ -37,7 +38,6 @@
 #include <asm/addrspace.h>
 #include <asm/alternative.h>
 #include <asm/bootinfo.h>
-#include <asm/bugs.h>
 #include <asm/cache.h>
 #include <asm/cpu.h>
 #include <asm/dma.h>
@@ -87,7 +87,7 @@ const char *get_system_type(void)
        return "generic-loongson-machine";
 }
 
-void __init check_bugs(void)
+void __init arch_cpu_finalize_init(void)
 {
        alternative_instructions();
 }
index f377e50..c189e03 100644 (file)
@@ -190,9 +190,9 @@ static u64 read_const_counter(struct clocksource *clk)
        return drdtime();
 }
 
-static u64 native_sched_clock(void)
+static noinstr u64 sched_clock_read(void)
 {
-       return read_const_counter(NULL);
+       return drdtime();
 }
 
 static struct clocksource clocksource_const = {
@@ -211,7 +211,7 @@ int __init constant_clocksource_init(void)
 
        res = clocksource_register_hz(&clocksource_const, freq);
 
-       sched_clock_register(native_sched_clock, 64, freq);
+       sched_clock_register(sched_clock_read, 64, freq);
 
        pr_info("Constant clock source device register\n");
 
index 40198a1..dc792b3 100644 (file)
@@ -4,6 +4,7 @@ config M68K
        default y
        select ARCH_32BIT_OFF_T
        select ARCH_HAS_BINFMT_FLAT
+       select ARCH_HAS_CPU_FINALIZE_INIT if MMU
        select ARCH_HAS_CURRENT_STACK_POINTER
        select ARCH_HAS_DMA_PREP_COHERENT if HAS_DMA && MMU && !COLDFIRE
        select ARCH_HAS_SYNC_DMA_FOR_DEVICE if HAS_DMA
index b26469a..62fdca7 100644 (file)
@@ -43,6 +43,7 @@ CONFIG_IOSCHED_BFQ=m
 CONFIG_BINFMT_MISC=m
 CONFIG_SLAB=y
 # CONFIG_COMPACTION is not set
+CONFIG_DMAPOOL_TEST=m
 CONFIG_USERFAULTFD=y
 CONFIG_NET=y
 CONFIG_PACKET=y
@@ -454,7 +455,6 @@ CONFIG_OCFS2_FS=m
 # CONFIG_OCFS2_DEBUG_MASKLOG is not set
 CONFIG_FANOTIFY=y
 CONFIG_QUOTA_NETLINK_INTERFACE=y
-# CONFIG_PRINT_QUOTA_WARNING is not set
 CONFIG_AUTOFS_FS=m
 CONFIG_FUSE_FS=m
 CONFIG_CUSE=m
index 944a49a..5bfbd04 100644 (file)
@@ -39,6 +39,7 @@ CONFIG_IOSCHED_BFQ=m
 CONFIG_BINFMT_MISC=m
 CONFIG_SLAB=y
 # CONFIG_COMPACTION is not set
+CONFIG_DMAPOOL_TEST=m
 CONFIG_USERFAULTFD=y
 CONFIG_NET=y
 CONFIG_PACKET=y
@@ -411,7 +412,6 @@ CONFIG_OCFS2_FS=m
 # CONFIG_OCFS2_DEBUG_MASKLOG is not set
 CONFIG_FANOTIFY=y
 CONFIG_QUOTA_NETLINK_INTERFACE=y
-# CONFIG_PRINT_QUOTA_WARNING is not set
 CONFIG_AUTOFS_FS=m
 CONFIG_FUSE_FS=m
 CONFIG_CUSE=m
index a32dd88..44302f1 100644 (file)
@@ -46,6 +46,7 @@ CONFIG_IOSCHED_BFQ=m
 CONFIG_BINFMT_MISC=m
 CONFIG_SLAB=y
 # CONFIG_COMPACTION is not set
+CONFIG_DMAPOOL_TEST=m
 CONFIG_USERFAULTFD=y
 CONFIG_NET=y
 CONFIG_PACKET=y
@@ -431,7 +432,6 @@ CONFIG_OCFS2_FS=m
 # CONFIG_OCFS2_DEBUG_MASKLOG is not set
 CONFIG_FANOTIFY=y
 CONFIG_QUOTA_NETLINK_INTERFACE=y
-# CONFIG_PRINT_QUOTA_WARNING is not set
 CONFIG_AUTOFS_FS=m
 CONFIG_FUSE_FS=m
 CONFIG_CUSE=m
index 23b7805..f3336f1 100644 (file)
@@ -36,6 +36,7 @@ CONFIG_IOSCHED_BFQ=m
 CONFIG_BINFMT_MISC=m
 CONFIG_SLAB=y
 # CONFIG_COMPACTION is not set
+CONFIG_DMAPOOL_TEST=m
 CONFIG_USERFAULTFD=y
 CONFIG_NET=y
 CONFIG_PACKET=y
@@ -403,7 +404,6 @@ CONFIG_OCFS2_FS=m
 # CONFIG_OCFS2_DEBUG_MASKLOG is not set
 CONFIG_FANOTIFY=y
 CONFIG_QUOTA_NETLINK_INTERFACE=y
-# CONFIG_PRINT_QUOTA_WARNING is not set
 CONFIG_AUTOFS_FS=m
 CONFIG_FUSE_FS=m
 CONFIG_CUSE=m
index 5605ab5..2d1bbac 100644 (file)
@@ -38,6 +38,7 @@ CONFIG_IOSCHED_BFQ=m
 CONFIG_BINFMT_MISC=m
 CONFIG_SLAB=y
 # CONFIG_COMPACTION is not set
+CONFIG_DMAPOOL_TEST=m
 CONFIG_USERFAULTFD=y
 CONFIG_NET=y
 CONFIG_PACKET=y
@@ -413,7 +414,6 @@ CONFIG_OCFS2_FS=m
 # CONFIG_OCFS2_DEBUG_MASKLOG is not set
 CONFIG_FANOTIFY=y
 CONFIG_QUOTA_NETLINK_INTERFACE=y
-# CONFIG_PRINT_QUOTA_WARNING is not set
 CONFIG_AUTOFS_FS=m
 CONFIG_FUSE_FS=m
 CONFIG_CUSE=m
index d0d1f9c..b4428dc 100644 (file)
@@ -37,6 +37,7 @@ CONFIG_IOSCHED_BFQ=m
 CONFIG_BINFMT_MISC=m
 CONFIG_SLAB=y
 # CONFIG_COMPACTION is not set
+CONFIG_DMAPOOL_TEST=m
 CONFIG_USERFAULTFD=y
 CONFIG_NET=y
 CONFIG_PACKET=y
@@ -433,7 +434,6 @@ CONFIG_OCFS2_FS=m
 # CONFIG_OCFS2_DEBUG_MASKLOG is not set
 CONFIG_FANOTIFY=y
 CONFIG_QUOTA_NETLINK_INTERFACE=y
-# CONFIG_PRINT_QUOTA_WARNING is not set
 CONFIG_AUTOFS_FS=m
 CONFIG_FUSE_FS=m
 CONFIG_CUSE=m
index 6d04314..4cd9fa4 100644 (file)
@@ -57,6 +57,7 @@ CONFIG_IOSCHED_BFQ=m
 CONFIG_BINFMT_MISC=m
 CONFIG_SLAB=y
 # CONFIG_COMPACTION is not set
+CONFIG_DMAPOOL_TEST=m
 CONFIG_USERFAULTFD=y
 CONFIG_NET=y
 CONFIG_PACKET=y
@@ -519,7 +520,6 @@ CONFIG_OCFS2_FS=m
 # CONFIG_OCFS2_DEBUG_MASKLOG is not set
 CONFIG_FANOTIFY=y
 CONFIG_QUOTA_NETLINK_INTERFACE=y
-# CONFIG_PRINT_QUOTA_WARNING is not set
 CONFIG_AUTOFS_FS=m
 CONFIG_FUSE_FS=m
 CONFIG_CUSE=m
index e6f5ae5..7ee9ad5 100644 (file)
@@ -35,6 +35,7 @@ CONFIG_IOSCHED_BFQ=m
 CONFIG_BINFMT_MISC=m
 CONFIG_SLAB=y
 # CONFIG_COMPACTION is not set
+CONFIG_DMAPOOL_TEST=m
 CONFIG_USERFAULTFD=y
 CONFIG_NET=y
 CONFIG_PACKET=y
@@ -402,7 +403,6 @@ CONFIG_OCFS2_FS=m
 # CONFIG_OCFS2_DEBUG_MASKLOG is not set
 CONFIG_FANOTIFY=y
 CONFIG_QUOTA_NETLINK_INTERFACE=y
-# CONFIG_PRINT_QUOTA_WARNING is not set
 CONFIG_AUTOFS_FS=m
 CONFIG_FUSE_FS=m
 CONFIG_CUSE=m
index f2d4dff..2488893 100644 (file)
@@ -36,6 +36,7 @@ CONFIG_IOSCHED_BFQ=m
 CONFIG_BINFMT_MISC=m
 CONFIG_SLAB=y
 # CONFIG_COMPACTION is not set
+CONFIG_DMAPOOL_TEST=m
 CONFIG_USERFAULTFD=y
 CONFIG_NET=y
 CONFIG_PACKET=y
@@ -403,7 +404,6 @@ CONFIG_OCFS2_FS=m
 # CONFIG_OCFS2_DEBUG_MASKLOG is not set
 CONFIG_FANOTIFY=y
 CONFIG_QUOTA_NETLINK_INTERFACE=y
-# CONFIG_PRINT_QUOTA_WARNING is not set
 CONFIG_AUTOFS_FS=m
 CONFIG_FUSE_FS=m
 CONFIG_CUSE=m
index 907eede..ffc6762 100644 (file)
@@ -37,6 +37,7 @@ CONFIG_IOSCHED_BFQ=m
 CONFIG_BINFMT_MISC=m
 CONFIG_SLAB=y
 # CONFIG_COMPACTION is not set
+CONFIG_DMAPOOL_TEST=m
 CONFIG_USERFAULTFD=y
 CONFIG_NET=y
 CONFIG_PACKET=y
@@ -420,7 +421,6 @@ CONFIG_OCFS2_FS=m
 # CONFIG_OCFS2_DEBUG_MASKLOG is not set
 CONFIG_FANOTIFY=y
 CONFIG_QUOTA_NETLINK_INTERFACE=y
-# CONFIG_PRINT_QUOTA_WARNING is not set
 CONFIG_AUTOFS_FS=m
 CONFIG_FUSE_FS=m
 CONFIG_CUSE=m
index 9e3d470..1981796 100644 (file)
@@ -402,7 +402,6 @@ CONFIG_OCFS2_FS=m
 # CONFIG_OCFS2_DEBUG_MASKLOG is not set
 CONFIG_FANOTIFY=y
 CONFIG_QUOTA_NETLINK_INTERFACE=y
-# CONFIG_PRINT_QUOTA_WARNING is not set
 CONFIG_AUTOFS_FS=m
 CONFIG_FUSE_FS=m
 CONFIG_CUSE=m
index f654007..85364f6 100644 (file)
@@ -33,6 +33,7 @@ CONFIG_IOSCHED_BFQ=m
 CONFIG_BINFMT_MISC=m
 CONFIG_SLAB=y
 # CONFIG_COMPACTION is not set
+CONFIG_DMAPOOL_TEST=m
 CONFIG_USERFAULTFD=y
 CONFIG_NET=y
 CONFIG_PACKET=y
@@ -401,7 +402,6 @@ CONFIG_OCFS2_FS=m
 # CONFIG_OCFS2_DEBUG_MASKLOG is not set
 CONFIG_FANOTIFY=y
 CONFIG_QUOTA_NETLINK_INTERFACE=y
-# CONFIG_PRINT_QUOTA_WARNING is not set
 CONFIG_AUTOFS_FS=m
 CONFIG_FUSE_FS=m
 CONFIG_CUSE=m
index 8059bd6..311b57e 100644 (file)
@@ -24,8 +24,6 @@ CONFIG_SUN_PARTITION=y
 CONFIG_SYSV68_PARTITION=y
 CONFIG_NET=y
 CONFIG_PACKET=y
-CONFIG_UNIX=y
-CONFIG_INET=y
 CONFIG_IP_PNP=y
 CONFIG_IP_PNP_DHCP=y
 CONFIG_IP_PNP_BOOTP=y
index cfba83d..4bfbc25 100644 (file)
@@ -106,6 +106,11 @@ static inline int arch_atomic_fetch_##op(int i, atomic_t * v)              \
 ATOMIC_OPS(add, +=, add)
 ATOMIC_OPS(sub, -=, sub)
 
+#define arch_atomic_add_return                 arch_atomic_add_return
+#define arch_atomic_sub_return                 arch_atomic_sub_return
+#define arch_atomic_fetch_add                  arch_atomic_fetch_add
+#define arch_atomic_fetch_sub                  arch_atomic_fetch_sub
+
 #undef ATOMIC_OPS
 #define ATOMIC_OPS(op, c_op, asm_op)                                   \
        ATOMIC_OP(op, c_op, asm_op)                                     \
@@ -115,6 +120,10 @@ ATOMIC_OPS(and, &=, and)
 ATOMIC_OPS(or, |=, or)
 ATOMIC_OPS(xor, ^=, eor)
 
+#define arch_atomic_fetch_and                  arch_atomic_fetch_and
+#define arch_atomic_fetch_or                   arch_atomic_fetch_or
+#define arch_atomic_fetch_xor                  arch_atomic_fetch_xor
+
 #undef ATOMIC_OPS
 #undef ATOMIC_FETCH_OP
 #undef ATOMIC_OP_RETURN
@@ -158,12 +167,7 @@ static inline int arch_atomic_inc_and_test(atomic_t *v)
 }
 #define arch_atomic_inc_and_test arch_atomic_inc_and_test
 
-#ifdef CONFIG_RMW_INSNS
-
-#define arch_atomic_cmpxchg(v, o, n) ((int)arch_cmpxchg(&((v)->counter), (o), (n)))
-#define arch_atomic_xchg(v, new) (arch_xchg(&((v)->counter), new))
-
-#else /* !CONFIG_RMW_INSNS */
+#ifndef CONFIG_RMW_INSNS
 
 static inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
 {
@@ -177,6 +181,7 @@ static inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
        local_irq_restore(flags);
        return prev;
 }
+#define arch_atomic_cmpxchg arch_atomic_cmpxchg
 
 static inline int arch_atomic_xchg(atomic_t *v, int new)
 {
@@ -189,6 +194,7 @@ static inline int arch_atomic_xchg(atomic_t *v, int new)
        local_irq_restore(flags);
        return prev;
 }
+#define arch_atomic_xchg arch_atomic_xchg
 
 #endif /* !CONFIG_RMW_INSNS */
 
diff --git a/arch/m68k/include/asm/bugs.h b/arch/m68k/include/asm/bugs.h
deleted file mode 100644 (file)
index 7455306..0000000
+++ /dev/null
@@ -1,21 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- *  include/asm-m68k/bugs.h
- *
- *  Copyright (C) 1994  Linus Torvalds
- */
-
-/*
- * This is included by init/main.c to check for architecture-dependent bugs.
- *
- * Needs:
- *     void check_bugs(void);
- */
-
-#ifdef CONFIG_MMU
-extern void check_bugs(void);  /* in arch/m68k/kernel/setup.c */
-#else
-static void check_bugs(void)
-{
-}
-#endif
index 8ed6ac1..141bbdf 100644 (file)
@@ -99,7 +99,7 @@ static inline void load_ksp_mmu(struct task_struct *task)
        p4d_t *p4d;
        pud_t *pud;
        pmd_t *pmd;
-       pte_t *pte;
+       pte_t *pte = NULL;
        unsigned long mmuar;
 
        local_irq_save(flags);
@@ -139,7 +139,7 @@ static inline void load_ksp_mmu(struct task_struct *task)
 
        pte = (mmuar >= PAGE_OFFSET) ? pte_offset_kernel(pmd, mmuar)
                                     : pte_offset_map(pmd, mmuar);
-       if (pte_none(*pte) || !pte_present(*pte))
+       if (!pte || pte_none(*pte) || !pte_present(*pte))
                goto bug;
 
        set_pte(pte, pte_mkyoung(*pte));
@@ -161,6 +161,8 @@ static inline void load_ksp_mmu(struct task_struct *task)
 bug:
        pr_info("ksp load failed: mm=0x%p ksp=0x08%lx\n", mm, mmuar);
 end:
+       if (pte && mmuar < PAGE_OFFSET)
+               pte_unmap(pte);
        local_irq_restore(flags);
 }
 
index fbff1ce..6f1ae01 100644 (file)
@@ -10,6 +10,7 @@
  */
 
 #include <linux/kernel.h>
+#include <linux/cpu.h>
 #include <linux/mm.h>
 #include <linux/sched.h>
 #include <linux/delay.h>
@@ -504,7 +505,7 @@ static int __init proc_hardware_init(void)
 module_init(proc_hardware_init);
 #endif
 
-void check_bugs(void)
+void __init arch_cpu_finalize_init(void)
 {
 #if defined(CONFIG_FPU) && !defined(CONFIG_M68KFPU_EMU)
        if (m68k_fputype == 0) {
index bd0274c..c586034 100644 (file)
@@ -488,6 +488,8 @@ sys_atomic_cmpxchg_32(unsigned long newval, int oldval, int d3, int d4, int d5,
                if (!pmd_present(*pmd))
                        goto bad_access;
                pte = pte_offset_map_lock(mm, pmd, (unsigned long)mem, &ptl);
+               if (!pte)
+                       goto bad_access;
                if (!pte_present(*pte) || !pte_dirty(*pte)
                    || !pte_write(*pte)) {
                        pte_unmap_unlock(pte, ptl);
index b1f3940..4f50478 100644 (file)
 448    common  process_mrelease                sys_process_mrelease
 449    common  futex_waitv                     sys_futex_waitv
 450    common  set_mempolicy_home_node         sys_set_mempolicy_home_node
+451    common  cachestat                       sys_cachestat
index 70aa097..42f45ab 100644 (file)
@@ -91,7 +91,8 @@ int cf_tlb_miss(struct pt_regs *regs, int write, int dtlb, int extension_word)
        p4d_t *p4d;
        pud_t *pud;
        pmd_t *pmd;
-       pte_t *pte;
+       pte_t *pte = NULL;
+       int ret = -1;
        int asid;
 
        local_irq_save(flags);
@@ -100,47 +101,33 @@ int cf_tlb_miss(struct pt_regs *regs, int write, int dtlb, int extension_word)
                regs->pc + (extension_word * sizeof(long));
 
        mm = (!user_mode(regs) && KMAPAREA(mmuar)) ? &init_mm : current->mm;
-       if (!mm) {
-               local_irq_restore(flags);
-               return -1;
-       }
+       if (!mm)
+               goto out;
 
        pgd = pgd_offset(mm, mmuar);
-       if (pgd_none(*pgd))  {
-               local_irq_restore(flags);
-               return -1;
-       }
+       if (pgd_none(*pgd))
+               goto out;
 
        p4d = p4d_offset(pgd, mmuar);
-       if (p4d_none(*p4d)) {
-               local_irq_restore(flags);
-               return -1;
-       }
+       if (p4d_none(*p4d))
+               goto out;
 
        pud = pud_offset(p4d, mmuar);
-       if (pud_none(*pud)) {
-               local_irq_restore(flags);
-               return -1;
-       }
+       if (pud_none(*pud))
+               goto out;
 
        pmd = pmd_offset(pud, mmuar);
-       if (pmd_none(*pmd)) {
-               local_irq_restore(flags);
-               return -1;
-       }
+       if (pmd_none(*pmd))
+               goto out;
 
        pte = (KMAPAREA(mmuar)) ? pte_offset_kernel(pmd, mmuar)
                                : pte_offset_map(pmd, mmuar);
-       if (pte_none(*pte) || !pte_present(*pte)) {
-               local_irq_restore(flags);
-               return -1;
-       }
+       if (!pte || pte_none(*pte) || !pte_present(*pte))
+               goto out;
 
        if (write) {
-               if (!pte_write(*pte)) {
-                       local_irq_restore(flags);
-                       return -1;
-               }
+               if (!pte_write(*pte))
+                       goto out;
                set_pte(pte, pte_mkdirty(*pte));
        }
 
@@ -161,9 +148,12 @@ int cf_tlb_miss(struct pt_regs *regs, int write, int dtlb, int extension_word)
                mmu_write(MMUOR, MMUOR_ACC | MMUOR_UAA);
        else
                mmu_write(MMUOR, MMUOR_ITLB | MMUOR_ACC | MMUOR_UAA);
-
+       ret = 0;
+out:
+       if (pte && !KMAPAREA(mmuar))
+               pte_unmap(pte);
        local_irq_restore(flags);
-       return 0;
+       return ret;
 }
 
 void __init cf_bootmem_alloc(void)
index a149b3e..1903988 100644 (file)
@@ -18,4 +18,9 @@
 
 #define SMP_CACHE_BYTES        L1_CACHE_BYTES
 
+/* MS be sure that SLAB allocates aligned objects */
+#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
+
+#define ARCH_SLAB_MINALIGN     L1_CACHE_BYTES
+
 #endif /* _ASM_MICROBLAZE_CACHE_H */
index 7b9861b..337f23e 100644 (file)
 
 #ifndef __ASSEMBLY__
 
-/* MS be sure that SLAB allocates aligned objects */
-#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
-
-#define ARCH_SLAB_MINALIGN     L1_CACHE_BYTES
-
 /*
  * PAGE_OFFSET -- the first address of the first page of memory. With MMU
  * it is set to the kernel start address (aligned on a page boundary).
index a06cc1f..3657f5e 100644 (file)
@@ -16,8 +16,6 @@ extern char *klimit;
 
 extern void mmu_reset(void);
 
-void time_init(void);
-void init_IRQ(void);
 void machine_early_init(const char *cmdline, unsigned int ram,
                unsigned int fdt, unsigned int msr, unsigned int tlb0,
                unsigned int tlb1);
index c5c6186..e424c79 100644 (file)
@@ -20,7 +20,7 @@ void __init early_init_devtree(void *params)
 
        early_init_dt_scan(params);
        if (!strlen(boot_command_line))
-               strlcpy(boot_command_line, cmd_line, COMMAND_LINE_SIZE);
+               strscpy(boot_command_line, cmd_line, COMMAND_LINE_SIZE);
 
        memblock_allow_resize();
 
index c3aebec..c78a0ff 100644 (file)
@@ -194,7 +194,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set,
 
        preempt_disable();
        ptep = pte_offset_map(pmdp, address);
-       if (pte_present(*ptep)) {
+       if (ptep && pte_present(*ptep)) {
                address = (unsigned long) page_address(pte_page(*ptep));
                /* MS: I need add offset in page */
                address += ((unsigned long)frame->tramp) & ~PAGE_MASK;
@@ -203,7 +203,8 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set,
                invalidate_icache_range(address, address + 8);
                flush_dcache_range(address, address + 8);
        }
-       pte_unmap(ptep);
+       if (ptep)
+               pte_unmap(ptep);
        preempt_enable();
        if (err)
                return -EFAULT;
index 820145e..858d22b 100644 (file)
 448    common  process_mrelease                sys_process_mrelease
 449    common  futex_waitv                     sys_futex_waitv
 450    common  set_mempolicy_home_node         sys_set_mempolicy_home_node
+451    common  cachestat                       sys_cachestat
index 675a866..ada18f3 100644 (file)
@@ -4,6 +4,7 @@ config MIPS
        default y
        select ARCH_32BIT_OFF_T if !64BIT
        select ARCH_BINFMT_ELF_STATE if MIPS_FP_SUPPORT
+       select ARCH_HAS_CPU_FINALIZE_INIT
        select ARCH_HAS_CURRENT_STACK_POINTER if !CC_IS_CLANG || CLANG_VERSION >= 140000
        select ARCH_HAS_DEBUG_VIRTUAL if !64BIT
        select ARCH_HAS_FORTIFY_SOURCE
@@ -2286,6 +2287,7 @@ config MIPS_CPS
        select MIPS_CM
        select MIPS_CPS_PM if HOTPLUG_CPU
        select SMP
+       select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
        select SYNC_R4K if (CEVT_R4K || CSRC_R4K)
        select SYS_SUPPORTS_HOTPLUG_CPU
        select SYS_SUPPORTS_SCHED_SMT if CPU_MIPSR6
index 549a639..053805c 100644 (file)
@@ -178,7 +178,10 @@ void __init plat_mem_setup(void)
        ioport_resource.start = 0;
        ioport_resource.end = ~0;
 
-       /* intended to somewhat resemble ARM; see Documentation/arm/booting.rst */
+       /*
+        * intended to somewhat resemble ARM; see
+        * Documentation/arch/arm/booting.rst
+        */
        if (fw_arg0 == 0 && fw_arg1 == 0xffffffff)
                dtb = phys_to_virt(fw_arg2);
        else
index 4212584..33c0968 100644 (file)
@@ -345,6 +345,7 @@ void play_dead(void)
        int cpu = cpu_number_map(cvmx_get_core_num());
 
        idle_task_exit();
+       cpuhp_ap_report_dead();
        octeon_processor_boot = 0xff;
        per_cpu(cpu_state, cpu) = CPU_DEAD;
 
index 712fb5a..ba188e7 100644 (file)
@@ -33,17 +33,6 @@ static __always_inline void arch_##pfx##_set(pfx##_t *v, type i)     \
 {                                                                      \
        WRITE_ONCE(v->counter, i);                                      \
 }                                                                      \
-                                                                       \
-static __always_inline type                                            \
-arch_##pfx##_cmpxchg(pfx##_t *v, type o, type n)                       \
-{                                                                      \
-       return arch_cmpxchg(&v->counter, o, n);                         \
-}                                                                      \
-                                                                       \
-static __always_inline type arch_##pfx##_xchg(pfx##_t *v, type n)      \
-{                                                                      \
-       return arch_xchg(&v->counter, n);                               \
-}
 
 ATOMIC_OPS(atomic, int)
 
index 653f78f..84be74a 100644 (file)
@@ -1,17 +1,11 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
- * This is included by init/main.c to check for architecture-dependent bugs.
- *
  * Copyright (C) 2007  Maciej W. Rozycki
- *
- * Needs:
- *     void check_bugs(void);
  */
 #ifndef _ASM_BUGS_H
 #define _ASM_BUGS_H
 
 #include <linux/bug.h>
-#include <linux/delay.h>
 #include <linux/smp.h>
 
 #include <asm/cpu.h>
@@ -24,17 +18,6 @@ extern void check_bugs64_early(void);
 extern void check_bugs32(void);
 extern void check_bugs64(void);
 
-static inline void __init check_bugs(void)
-{
-       unsigned int cpu = smp_processor_id();
-
-       cpu_data[cpu].udelay_val = loops_per_jiffy;
-       check_bugs32();
-
-       if (IS_ENABLED(CONFIG_CPU_R4X00_BUGS64))
-               check_bugs64();
-}
-
 static inline int r4k_daddiu_bug(void)
 {
        if (!IS_ENABLED(CONFIG_CPU_R4X00_BUGS64))
index 25df2f4..b52a6a9 100644 (file)
@@ -17,9 +17,6 @@
 #include <linux/types.h>
 #include <linux/string.h>
 
-typedef long intptr_t;
-
-
 /*
  * Constants
  */
index 44f9824..75abfa8 100644 (file)
@@ -19,7 +19,6 @@
 #define IRQ_STACK_SIZE                 THREAD_SIZE
 #define IRQ_STACK_START                        (IRQ_STACK_SIZE - 16)
 
-extern void __init init_IRQ(void);
 extern void *irq_stack[NR_CPUS];
 
 /*
index eb3ddbe..d8f9dec 100644 (file)
@@ -47,7 +47,6 @@
 
 #include <regs-clk.h>
 #include <regs-mux.h>
-#include <regs-pwm.h>
 #include <regs-rtc.h>
 #include <regs-wdt.h>
 
diff --git a/arch/mips/include/asm/mach-loongson32/regs-pwm.h b/arch/mips/include/asm/mach-loongson32/regs-pwm.h
deleted file mode 100644 (file)
index ec870c8..0000000
+++ /dev/null
@@ -1,25 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-/*
- * Copyright (c) 2014 Zhang, Keguang <keguang.zhang@gmail.com>
- *
- * Loongson 1 PWM Register Definitions.
- */
-
-#ifndef __ASM_MACH_LOONGSON32_REGS_PWM_H
-#define __ASM_MACH_LOONGSON32_REGS_PWM_H
-
-/* Loongson 1 PWM Timer Register Definitions */
-#define PWM_CNT                        0x0
-#define PWM_HRC                        0x4
-#define PWM_LRC                        0x8
-#define PWM_CTRL               0xc
-
-/* PWM Control Register Bits */
-#define CNT_RST                        BIT(7)
-#define INT_SR                 BIT(6)
-#define INT_EN                 BIT(5)
-#define PWM_SINGLE             BIT(4)
-#define PWM_OE                 BIT(3)
-#define CNT_EN                 BIT(0)
-
-#endif /* __ASM_MACH_LOONGSON32_REGS_PWM_H */
index 0145bbf..5719ff4 100644 (file)
@@ -33,6 +33,7 @@ struct plat_smp_ops {
 #ifdef CONFIG_HOTPLUG_CPU
        int (*cpu_disable)(void);
        void (*cpu_die)(unsigned int cpu);
+       void (*cleanup_dead_cpu)(unsigned cpu);
 #endif
 #ifdef CONFIG_KEXEC
        void (*kexec_nonboot_cpu)(void);
index c0e6513..cb871eb 100644 (file)
@@ -11,6 +11,8 @@
  * Copyright (C) 2000, 2001, 2002, 2007         Maciej W. Rozycki
  */
 #include <linux/init.h>
+#include <linux/cpu.h>
+#include <linux/delay.h>
 #include <linux/ioport.h>
 #include <linux/export.h>
 #include <linux/screen_info.h>
@@ -841,3 +843,14 @@ static int __init setnocoherentio(char *str)
 }
 early_param("nocoherentio", setnocoherentio);
 #endif
+
+void __init arch_cpu_finalize_init(void)
+{
+       unsigned int cpu = smp_processor_id();
+
+       cpu_data[cpu].udelay_val = loops_per_jiffy;
+       check_bugs32();
+
+       if (IS_ENABLED(CONFIG_CPU_R4X00_BUGS64))
+               check_bugs64();
+}
index 15466d4..c074ecc 100644 (file)
@@ -392,6 +392,7 @@ static void bmips_cpu_die(unsigned int cpu)
 void __ref play_dead(void)
 {
        idle_task_exit();
+       cpuhp_ap_report_dead();
 
        /* flush data cache */
        _dma_cache_wback_inv(0, ~0);
index 62f677b..d7fdbec 100644 (file)
@@ -503,8 +503,7 @@ void play_dead(void)
                }
        }
 
-       /* This CPU has chosen its way out */
-       (void)cpu_report_death();
+       cpuhp_ap_report_dead();
 
        cps_shutdown_this_cpu(cpu_death);
 
@@ -527,7 +526,9 @@ static void wait_for_sibling_halt(void *ptr_cpu)
        } while (!(halted & TCHALT_H));
 }
 
-static void cps_cpu_die(unsigned int cpu)
+static void cps_cpu_die(unsigned int cpu) { }
+
+static void cps_cleanup_dead_cpu(unsigned cpu)
 {
        unsigned core = cpu_core(&cpu_data[cpu]);
        unsigned int vpe_id = cpu_vpe_id(&cpu_data[cpu]);
@@ -535,12 +536,6 @@ static void cps_cpu_die(unsigned int cpu)
        unsigned stat;
        int err;
 
-       /* Wait for the cpu to choose its way out */
-       if (!cpu_wait_death(cpu, 5)) {
-               pr_err("CPU%u: didn't offline\n", cpu);
-               return;
-       }
-
        /*
         * Now wait for the CPU to actually offline. Without doing this that
         * offlining may race with one or more of:
@@ -624,6 +619,7 @@ static const struct plat_smp_ops cps_smp_ops = {
 #ifdef CONFIG_HOTPLUG_CPU
        .cpu_disable            = cps_cpu_disable,
        .cpu_die                = cps_cpu_die,
+       .cleanup_dead_cpu       = cps_cleanup_dead_cpu,
 #endif
 #ifdef CONFIG_KEXEC
        .kexec_nonboot_cpu      = cps_kexec_nonboot_cpu,
index 1d93b85..90c71d8 100644 (file)
@@ -690,6 +690,14 @@ void flush_tlb_one(unsigned long vaddr)
 EXPORT_SYMBOL(flush_tlb_page);
 EXPORT_SYMBOL(flush_tlb_one);
 
+#ifdef CONFIG_HOTPLUG_CORE_SYNC_DEAD
+void arch_cpuhp_cleanup_dead_cpu(unsigned int cpu)
+{
+       if (mp_ops->cleanup_dead_cpu)
+               mp_ops->cleanup_dead_cpu(cpu);
+}
+#endif
+
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 
 static void tick_broadcast_callee(void *info)
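
A condensed sketch of the reworked CPU-hotplug teardown these hunks adopt,
using hypothetical foo_* platform helpers in place of the real shutdown
steps: with HOTPLUG_CORE_SYNC_DEAD the dying CPU reports itself via
cpuhp_ap_report_dead(), and the hotplug core then calls
arch_cpuhp_cleanup_dead_cpu() on the controlling CPU, replacing the old
cpu_report_death()/cpu_wait_death() handshake.

#include <linux/cpu.h>
#include <linux/cpuhotplug.h>
#include <linux/sched/hotplug.h>

void foo_park_this_cpu(void);			/* hypothetical: spin in low power or power down */
void foo_power_off_core(unsigned int cpu);	/* hypothetical platform-side cleanup */

void play_dead(void)				/* runs on the CPU going offline */
{
        idle_task_exit();
        cpuhp_ap_report_dead();			/* tell the core this CPU is done */
        foo_park_this_cpu();
}

void arch_cpuhp_cleanup_dead_cpu(unsigned int cpu)	/* runs on the controlling CPU */
{
        foo_power_off_core(cpu);
}
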
index 253ff99..1976317 100644 (file)
 448    n32     process_mrelease                sys_process_mrelease
 449    n32     futex_waitv                     sys_futex_waitv
 450    n32     set_mempolicy_home_node         sys_set_mempolicy_home_node
+451    n32     cachestat                       sys_cachestat
index 3f1886a..cfda251 100644 (file)
 448    n64     process_mrelease                sys_process_mrelease
 449    n64     futex_waitv                     sys_futex_waitv
 450    common  set_mempolicy_home_node         sys_set_mempolicy_home_node
+451    n64     cachestat                       sys_cachestat
index 8f243e3..7692234 100644 (file)
 448    o32     process_mrelease                sys_process_mrelease
 449    o32     futex_waitv                     sys_futex_waitv
 450    o32     set_mempolicy_home_node         sys_set_mempolicy_home_node
+451    o32     cachestat                       sys_cachestat
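
All three MIPS ABIs above gain table entry 451 for the new cachestat()
system call. A minimal userspace sketch of calling it; the structure layout
is repeated here from the uapi definition that accompanies the syscall and
is shown only for illustration:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_cachestat
#error "kernel headers too old for cachestat (table entry 451 in this release)"
#endif

struct cachestat_range { uint64_t off, len; };
struct cachestat {
        uint64_t nr_cache, nr_dirty, nr_writeback;
        uint64_t nr_evicted, nr_recently_evicted;
};

int main(int argc, char **argv)
{
        struct cachestat_range range = { .off = 0, .len = 4096 };	/* first page of the file */
        struct cachestat cs;
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                return 1;
        if (syscall(__NR_cachestat, fd, &range, &cs, 0))
                return 1;
        printf("cached=%llu dirty=%llu writeback=%llu\n",
               (unsigned long long)cs.nr_cache,
               (unsigned long long)cs.nr_dirty,
               (unsigned long long)cs.nr_writeback);
        return 0;
}
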
index 2ef9da0..a7c5009 100644 (file)
@@ -35,41 +35,4 @@ config LOONGSON1_LS1C
        select COMMON_CLK
 endchoice
 
-menuconfig CEVT_CSRC_LS1X
-       bool "Use PWM Timer for clockevent/clocksource"
-       select MIPS_EXTERNAL_TIMER
-       depends on CPU_LOONGSON32
-       help
-         This option changes the default clockevent/clocksource to PWM Timer,
-         and is required by Loongson1 CPUFreq support.
-
-         If unsure, say N.
-
-choice
-       prompt "Select clockevent/clocksource"
-       depends on CEVT_CSRC_LS1X
-       default TIMER_USE_PWM0
-
-config TIMER_USE_PWM0
-       bool "Use PWM Timer 0"
-       help
-         Use PWM Timer 0 as the default clockevent/clocksourcer.
-
-config TIMER_USE_PWM1
-       bool "Use PWM Timer 1"
-       help
-         Use PWM Timer 1 as the default clockevent/clocksourcer.
-
-config TIMER_USE_PWM2
-       bool "Use PWM Timer 2"
-       help
-         Use PWM Timer 2 as the default clockevent/clocksourcer.
-
-config TIMER_USE_PWM3
-       bool "Use PWM Timer 3"
-       help
-         Use PWM Timer 3 as the default clockevent/clocksourcer.
-
-endchoice
-
 endif # MACH_LOONGSON32
index 965c04a..74ad2b1 100644 (file)
@@ -5,208 +5,8 @@
 
 #include <linux/clk.h>
 #include <linux/of_clk.h>
-#include <linux/interrupt.h>
-#include <linux/sizes.h>
 #include <asm/time.h>
 
-#include <loongson1.h>
-#include <platform.h>
-
-#ifdef CONFIG_CEVT_CSRC_LS1X
-
-#if defined(CONFIG_TIMER_USE_PWM1)
-#define LS1X_TIMER_BASE        LS1X_PWM1_BASE
-#define LS1X_TIMER_IRQ LS1X_PWM1_IRQ
-
-#elif defined(CONFIG_TIMER_USE_PWM2)
-#define LS1X_TIMER_BASE        LS1X_PWM2_BASE
-#define LS1X_TIMER_IRQ LS1X_PWM2_IRQ
-
-#elif defined(CONFIG_TIMER_USE_PWM3)
-#define LS1X_TIMER_BASE        LS1X_PWM3_BASE
-#define LS1X_TIMER_IRQ LS1X_PWM3_IRQ
-
-#else
-#define LS1X_TIMER_BASE        LS1X_PWM0_BASE
-#define LS1X_TIMER_IRQ LS1X_PWM0_IRQ
-#endif
-
-DEFINE_RAW_SPINLOCK(ls1x_timer_lock);
-
-static void __iomem *timer_reg_base;
-static uint32_t ls1x_jiffies_per_tick;
-
-static inline void ls1x_pwmtimer_set_period(uint32_t period)
-{
-       __raw_writel(period, timer_reg_base + PWM_HRC);
-       __raw_writel(period, timer_reg_base + PWM_LRC);
-}
-
-static inline void ls1x_pwmtimer_restart(void)
-{
-       __raw_writel(0x0, timer_reg_base + PWM_CNT);
-       __raw_writel(INT_EN | CNT_EN, timer_reg_base + PWM_CTRL);
-}
-
-void __init ls1x_pwmtimer_init(void)
-{
-       timer_reg_base = ioremap(LS1X_TIMER_BASE, SZ_16);
-       if (!timer_reg_base)
-               panic("Failed to remap timer registers");
-
-       ls1x_jiffies_per_tick = DIV_ROUND_CLOSEST(mips_hpt_frequency, HZ);
-
-       ls1x_pwmtimer_set_period(ls1x_jiffies_per_tick);
-       ls1x_pwmtimer_restart();
-}
-
-static u64 ls1x_clocksource_read(struct clocksource *cs)
-{
-       unsigned long flags;
-       int count;
-       u32 jifs;
-       static int old_count;
-       static u32 old_jifs;
-
-       raw_spin_lock_irqsave(&ls1x_timer_lock, flags);
-       /*
-        * Although our caller may have the read side of xtime_lock,
-        * this is now a seqlock, and we are cheating in this routine
-        * by having side effects on state that we cannot undo if
-        * there is a collision on the seqlock and our caller has to
-        * retry.  (Namely, old_jifs and old_count.)  So we must treat
-        * jiffies as volatile despite the lock.  We read jiffies
-        * before latching the timer count to guarantee that although
-        * the jiffies value might be older than the count (that is,
-        * the counter may underflow between the last point where
-        * jiffies was incremented and the point where we latch the
-        * count), it cannot be newer.
-        */
-       jifs = jiffies;
-       /* read the count */
-       count = __raw_readl(timer_reg_base + PWM_CNT);
-
-       /*
-        * It's possible for count to appear to go the wrong way for this
-        * reason:
-        *
-        *  The timer counter underflows, but we haven't handled the resulting
-        *  interrupt and incremented jiffies yet.
-        *
-        * Previous attempts to handle these cases intelligently were buggy, so
-        * we just do the simple thing now.
-        */
-       if (count < old_count && jifs == old_jifs)
-               count = old_count;
-
-       old_count = count;
-       old_jifs = jifs;
-
-       raw_spin_unlock_irqrestore(&ls1x_timer_lock, flags);
-
-       return (u64) (jifs * ls1x_jiffies_per_tick) + count;
-}
-
-static struct clocksource ls1x_clocksource = {
-       .name           = "ls1x-pwmtimer",
-       .read           = ls1x_clocksource_read,
-       .mask           = CLOCKSOURCE_MASK(24),
-       .flags          = CLOCK_SOURCE_IS_CONTINUOUS,
-};
-
-static irqreturn_t ls1x_clockevent_isr(int irq, void *devid)
-{
-       struct clock_event_device *cd = devid;
-
-       ls1x_pwmtimer_restart();
-       cd->event_handler(cd);
-
-       return IRQ_HANDLED;
-}
-
-static int ls1x_clockevent_set_state_periodic(struct clock_event_device *cd)
-{
-       raw_spin_lock(&ls1x_timer_lock);
-       ls1x_pwmtimer_set_period(ls1x_jiffies_per_tick);
-       ls1x_pwmtimer_restart();
-       __raw_writel(INT_EN | CNT_EN, timer_reg_base + PWM_CTRL);
-       raw_spin_unlock(&ls1x_timer_lock);
-
-       return 0;
-}
-
-static int ls1x_clockevent_tick_resume(struct clock_event_device *cd)
-{
-       raw_spin_lock(&ls1x_timer_lock);
-       __raw_writel(INT_EN | CNT_EN, timer_reg_base + PWM_CTRL);
-       raw_spin_unlock(&ls1x_timer_lock);
-
-       return 0;
-}
-
-static int ls1x_clockevent_set_state_shutdown(struct clock_event_device *cd)
-{
-       raw_spin_lock(&ls1x_timer_lock);
-       __raw_writel(__raw_readl(timer_reg_base + PWM_CTRL) & ~CNT_EN,
-                    timer_reg_base + PWM_CTRL);
-       raw_spin_unlock(&ls1x_timer_lock);
-
-       return 0;
-}
-
-static int ls1x_clockevent_set_next(unsigned long evt,
-                                   struct clock_event_device *cd)
-{
-       raw_spin_lock(&ls1x_timer_lock);
-       ls1x_pwmtimer_set_period(evt);
-       ls1x_pwmtimer_restart();
-       raw_spin_unlock(&ls1x_timer_lock);
-
-       return 0;
-}
-
-static struct clock_event_device ls1x_clockevent = {
-       .name                   = "ls1x-pwmtimer",
-       .features               = CLOCK_EVT_FEAT_PERIODIC,
-       .rating                 = 300,
-       .irq                    = LS1X_TIMER_IRQ,
-       .set_next_event         = ls1x_clockevent_set_next,
-       .set_state_shutdown     = ls1x_clockevent_set_state_shutdown,
-       .set_state_periodic     = ls1x_clockevent_set_state_periodic,
-       .set_state_oneshot      = ls1x_clockevent_set_state_shutdown,
-       .tick_resume            = ls1x_clockevent_tick_resume,
-};
-
-static void __init ls1x_time_init(void)
-{
-       struct clock_event_device *cd = &ls1x_clockevent;
-       int ret;
-
-       if (!mips_hpt_frequency)
-               panic("Invalid timer clock rate");
-
-       ls1x_pwmtimer_init();
-
-       clockevent_set_clock(cd, mips_hpt_frequency);
-       cd->max_delta_ns = clockevent_delta2ns(0xffffff, cd);
-       cd->max_delta_ticks = 0xffffff;
-       cd->min_delta_ns = clockevent_delta2ns(0x000300, cd);
-       cd->min_delta_ticks = 0x000300;
-       cd->cpumask = cpumask_of(smp_processor_id());
-       clockevents_register_device(cd);
-
-       ls1x_clocksource.rating = 200 + mips_hpt_frequency / 10000000;
-       ret = clocksource_register_hz(&ls1x_clocksource, mips_hpt_frequency);
-       if (ret)
-               panic(KERN_ERR "Failed to register clocksource: %d\n", ret);
-
-       if (request_irq(LS1X_TIMER_IRQ, ls1x_clockevent_isr,
-                       IRQF_PERCPU | IRQF_TIMER, "ls1x-pwmtimer",
-                       &ls1x_clockevent))
-               pr_err("Failed to register ls1x-pwmtimer interrupt\n");
-}
-#endif /* CONFIG_CEVT_CSRC_LS1X */
-
 void __init plat_time_init(void)
 {
        struct clk *clk = NULL;
@@ -214,20 +14,10 @@ void __init plat_time_init(void)
        /* initialize LS1X clocks */
        of_clk_init(NULL);
 
-#ifdef CONFIG_CEVT_CSRC_LS1X
-       /* setup LS1X PWM timer */
-       clk = clk_get(NULL, "ls1x-pwmtimer");
-       if (IS_ERR(clk))
-               panic("unable to get timer clock, err=%ld", PTR_ERR(clk));
-
-       mips_hpt_frequency = clk_get_rate(clk);
-       ls1x_time_init();
-#else
        /* setup mips r4k timer */
        clk = clk_get(NULL, "cpu_clk");
        if (IS_ERR(clk))
                panic("unable to get cpu clock, err=%ld", PTR_ERR(clk));
 
        mips_hpt_frequency = clk_get_rate(clk) / 2;
-#endif /* CONFIG_CEVT_CSRC_LS1X */
 }
index b0e8bb9..cdecd7a 100644 (file)
@@ -775,6 +775,7 @@ void play_dead(void)
        void (*play_dead_at_ckseg1)(int *);
 
        idle_task_exit();
+       cpuhp_ap_report_dead();
 
        prid_imp = read_c0_prid() & PRID_IMP_MASK;
        prid_rev = read_c0_prid() & PRID_REV_MASK;
index 1b939ab..93c2d69 100644 (file)
@@ -297,7 +297,7 @@ void __update_tlb(struct vm_area_struct * vma, unsigned long address, pte_t pte)
        p4d_t *p4dp;
        pud_t *pudp;
        pmd_t *pmdp;
-       pte_t *ptep;
+       pte_t *ptep, *ptemap = NULL;
        int idx, pid;
 
        /*
@@ -344,7 +344,12 @@ void __update_tlb(struct vm_area_struct * vma, unsigned long address, pte_t pte)
        } else
 #endif
        {
-               ptep = pte_offset_map(pmdp, address);
+               ptemap = ptep = pte_offset_map(pmdp, address);
+               /*
+                * update_mmu_cache() is called between pte_offset_map_lock()
+                * and pte_unmap_unlock(), so we can assume that ptep is not
+                * NULL here: and what should be done below if it were NULL?
+                */
 
 #if defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32)
 #ifdef CONFIG_XPA
@@ -373,6 +378,9 @@ void __update_tlb(struct vm_area_struct * vma, unsigned long address, pte_t pte)
        tlbw_use_hazard();
        htw_start();
        flush_micro_tlb_vm(vma);
+
+       if (ptemap)
+               pte_unmap(ptemap);
        local_irq_restore(flags);
 }
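
The hunk above follows the pattern used throughout this pull: pte_offset_map()
may now return NULL, and every successful map has to be paired with a
pte_unmap() before the mapping is left behind. A condensed sketch of that
shape, with inspect_pte() as a purely illustrative caller:

#include <linux/mm.h>

static bool inspect_pte(pmd_t *pmdp, unsigned long address)
{
        pte_t *ptep = pte_offset_map(pmdp, address);
        bool present;

        if (!ptep)			/* the page table may have gone away: bail out */
                return false;

        present = pte_present(*ptep);	/* read what is needed while mapped... */
        pte_unmap(ptep);		/* ...and always drop the mapping afterwards */
        return present;
}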
 
index 203870c..338849c 100644 (file)
@@ -47,7 +47,7 @@ void __init setup_cpuinfo(void)
 
        str = of_get_property(cpu, "altr,implementation", &len);
        if (str)
-               strlcpy(cpuinfo.cpu_impl, str, sizeof(cpuinfo.cpu_impl));
+               strscpy(cpuinfo.cpu_impl, str, sizeof(cpuinfo.cpu_impl));
        else
                strcpy(cpuinfo.cpu_impl, "<unknown>");
 
index 40bc8fb..8582ed9 100644 (file)
@@ -121,7 +121,7 @@ asmlinkage void __init nios2_boot_init(unsigned r4, unsigned r5, unsigned r6,
                dtb_passed = r6;
 
                if (r7)
-                       strlcpy(cmdline_passed, (char *)r7, COMMAND_LINE_SIZE);
+                       strscpy(cmdline_passed, (char *)r7, COMMAND_LINE_SIZE);
        }
 #endif
 
@@ -129,10 +129,10 @@ asmlinkage void __init nios2_boot_init(unsigned r4, unsigned r5, unsigned r6,
 
 #ifndef CONFIG_CMDLINE_FORCE
        if (cmdline_passed[0])
-               strlcpy(boot_command_line, cmdline_passed, COMMAND_LINE_SIZE);
+               strscpy(boot_command_line, cmdline_passed, COMMAND_LINE_SIZE);
 #ifdef CONFIG_NIOS2_CMDLINE_IGNORE_DTB
        else
-               strlcpy(boot_command_line, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
+               strscpy(boot_command_line, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
 #endif
 #endif
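
These nios2 hunks are part of the tree-wide strlcpy() to strscpy() conversion.
The arguments are unchanged, but strscpy() returns the number of bytes copied
or -E2BIG on truncation, and it never scans the source beyond the destination
size. A tiny sketch of a caller that acts on truncation:

#include <linux/string.h>
#include <linux/printk.h>

static void copy_name(char *dst, size_t dstsize, const char *src)
{
        ssize_t n = strscpy(dst, src, dstsize);	/* bytes copied, or -E2BIG */

        if (n < 0)
                pr_warn("name '%s' truncated\n", src);
}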
 
index 326167e..8ce67ec 100644 (file)
@@ -130,7 +130,4 @@ static inline int arch_atomic_fetch_add_unless(atomic_t *v, int a, int u)
 
 #include <asm/cmpxchg.h>
 
-#define arch_atomic_xchg(ptr, v)               (arch_xchg(&(ptr)->counter, (v)))
-#define arch_atomic_cmpxchg(v, old, new)       (arch_cmpxchg(&((v)->counter), (old), (new)))
-
 #endif /* __ASM_OPENRISC_ATOMIC_H */
index 967bde6..c0b4b1c 100644 (file)
@@ -57,6 +57,7 @@ config PARISC
        select HAVE_ARCH_SECCOMP_FILTER
        select HAVE_ARCH_TRACEHOOK
        select HAVE_REGS_AND_STACK_ACCESS_API
+       select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
        select GENERIC_SCHED_CLOCK
        select GENERIC_IRQ_MIGRATION if SMP
        select HAVE_UNSTABLE_SCHED_CLOCK if SMP
index dd5a299..d4f0238 100644 (file)
@@ -73,10 +73,6 @@ static __inline__ int arch_atomic_read(const atomic_t *v)
        return READ_ONCE((v)->counter);
 }
 
-/* exported interface */
-#define arch_atomic_cmpxchg(v, o, n)   (arch_cmpxchg(&((v)->counter), (o), (n)))
-#define arch_atomic_xchg(v, new)       (arch_xchg(&((v)->counter), new))
-
 #define ATOMIC_OP(op, c_op)                                            \
 static __inline__ void arch_atomic_##op(int i, atomic_t *v)            \
 {                                                                      \
@@ -122,6 +118,11 @@ static __inline__ int arch_atomic_fetch_##op(int i, atomic_t *v)   \
 ATOMIC_OPS(add, +=)
 ATOMIC_OPS(sub, -=)
 
+#define arch_atomic_add_return arch_atomic_add_return
+#define arch_atomic_sub_return arch_atomic_sub_return
+#define arch_atomic_fetch_add  arch_atomic_fetch_add
+#define arch_atomic_fetch_sub  arch_atomic_fetch_sub
+
 #undef ATOMIC_OPS
 #define ATOMIC_OPS(op, c_op)                                           \
        ATOMIC_OP(op, c_op)                                             \
@@ -131,6 +132,10 @@ ATOMIC_OPS(and, &=)
 ATOMIC_OPS(or, |=)
 ATOMIC_OPS(xor, ^=)
 
+#define arch_atomic_fetch_and  arch_atomic_fetch_and
+#define arch_atomic_fetch_or   arch_atomic_fetch_or
+#define arch_atomic_fetch_xor  arch_atomic_fetch_xor
+
 #undef ATOMIC_OPS
 #undef ATOMIC_FETCH_OP
 #undef ATOMIC_OP_RETURN
@@ -185,6 +190,11 @@ static __inline__ s64 arch_atomic64_fetch_##op(s64 i, atomic64_t *v)       \
 ATOMIC64_OPS(add, +=)
 ATOMIC64_OPS(sub, -=)
 
+#define arch_atomic64_add_return       arch_atomic64_add_return
+#define arch_atomic64_sub_return       arch_atomic64_sub_return
+#define arch_atomic64_fetch_add                arch_atomic64_fetch_add
+#define arch_atomic64_fetch_sub                arch_atomic64_fetch_sub
+
 #undef ATOMIC64_OPS
 #define ATOMIC64_OPS(op, c_op)                                         \
        ATOMIC64_OP(op, c_op)                                           \
@@ -194,6 +204,10 @@ ATOMIC64_OPS(and, &=)
 ATOMIC64_OPS(or, |=)
 ATOMIC64_OPS(xor, ^=)
 
+#define arch_atomic64_fetch_and                arch_atomic64_fetch_and
+#define arch_atomic64_fetch_or         arch_atomic64_fetch_or
+#define arch_atomic64_fetch_xor                arch_atomic64_fetch_xor
+
 #undef ATOMIC64_OPS
 #undef ATOMIC64_FETCH_OP
 #undef ATOMIC64_OP_RETURN
@@ -218,11 +232,6 @@ arch_atomic64_read(const atomic64_t *v)
        return READ_ONCE((v)->counter);
 }
 
-/* exported interface */
-#define arch_atomic64_cmpxchg(v, o, n) \
-       ((__typeof__((v)->counter))arch_cmpxchg(&((v)->counter), (o), (n)))
-#define arch_atomic64_xchg(v, new) (arch_xchg(&((v)->counter), new))
-
 #endif /* !CONFIG_64BIT */
 
 
diff --git a/arch/parisc/include/asm/bugs.h b/arch/parisc/include/asm/bugs.h
deleted file mode 100644 (file)
index 0a7f9db..0000000
+++ /dev/null
@@ -1,20 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- *  include/asm-parisc/bugs.h
- *
- *  Copyright (C) 1999 Mike Shaver
- */
-
-/*
- * This is included by init/main.c to check for architecture-dependent bugs.
- *
- * Needs:
- *     void check_bugs(void);
- */
-
-#include <asm/processor.h>
-
-static inline void check_bugs(void)
-{
-//     identify_cpu(&boot_cpu_data);
-}
index e715df5..5656395 100644 (file)
@@ -472,9 +472,6 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 
 #define pte_same(A,B)  (pte_val(A) == pte_val(B))
 
-struct seq_file;
-extern void arch_report_meminfo(struct seq_file *m);
-
 #endif /* !__ASSEMBLY__ */
 
 
index ca4a302..5011602 100644 (file)
@@ -426,10 +426,15 @@ void flush_dcache_page(struct page *page)
                offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
                addr = mpnt->vm_start + offset;
                if (parisc_requires_coherency()) {
+                       bool needs_flush = false;
                        pte_t *ptep;
 
                        ptep = get_ptep(mpnt->vm_mm, addr);
-                       if (ptep && pte_needs_flush(*ptep))
+                       if (ptep) {
+                               needs_flush = pte_needs_flush(*ptep);
+                               pte_unmap(ptep);
+                       }
+                       if (needs_flush)
                                flush_user_cache_page(mpnt, addr);
                } else {
                        /*
@@ -561,14 +566,20 @@ EXPORT_SYMBOL(flush_kernel_dcache_page_addr);
 static void flush_cache_page_if_present(struct vm_area_struct *vma,
        unsigned long vmaddr, unsigned long pfn)
 {
-       pte_t *ptep = get_ptep(vma->vm_mm, vmaddr);
+       bool needs_flush = false;
+       pte_t *ptep;
 
        /*
         * The pte check is racy and sometimes the flush will trigger
         * a non-access TLB miss. Hopefully, the page has already been
         * flushed.
         */
-       if (ptep && pte_needs_flush(*ptep))
+       ptep = get_ptep(vma->vm_mm, vmaddr);
+       if (ptep) {
+               needs_flush = pte_needs_flush(*ptep);
+               pte_unmap(ptep);
+       }
+       if (needs_flush)
                flush_cache_page(vma, vmaddr, pfn);
 }
 
@@ -635,17 +646,22 @@ static void flush_cache_pages(struct vm_area_struct *vma, unsigned long start, u
        pte_t *ptep;
 
        for (addr = start; addr < end; addr += PAGE_SIZE) {
+               bool needs_flush = false;
                /*
                 * The vma can contain pages that aren't present. Although
                 * the pte search is expensive, we need the pte to find the
                 * page pfn and to check whether the page should be flushed.
                 */
                ptep = get_ptep(vma->vm_mm, addr);
-               if (ptep && pte_needs_flush(*ptep)) {
+               if (ptep) {
+                       needs_flush = pte_needs_flush(*ptep);
+                       pfn = pte_pfn(*ptep);
+                       pte_unmap(ptep);
+               }
+               if (needs_flush) {
                        if (parisc_requires_coherency()) {
                                flush_user_cache_page(vma, addr);
                        } else {
-                               pfn = pte_pfn(*ptep);
                                if (WARN_ON(!pfn_valid(pfn)))
                                        return;
                                __flush_cache_page(vma, addr, PFN_PHYS(pfn));
index 71ed539..415f12d 100644 (file)
@@ -164,7 +164,7 @@ static inline void unmap_uncached_pte(pmd_t * pmd, unsigned long vaddr,
                pmd_clear(pmd);
                return;
        }
-       pte = pte_offset_map(pmd, vaddr);
+       pte = pte_offset_kernel(pmd, vaddr);
        vaddr &= ~PMD_MASK;
        end = vaddr + size;
        if (end > PMD_SIZE)
index 24411ab..abdbf03 100644 (file)
@@ -171,8 +171,8 @@ void __noreturn arch_cpu_idle_dead(void)
 
        local_irq_disable();
 
-       /* Tell __cpu_die() that this CPU is now safe to dispose of. */
-       (void)cpu_report_death();
+       /* Tell the core that this CPU is now safe to dispose of. */
+       cpuhp_ap_report_dead();
 
        /* Ensure that the cache lines are written out. */
        flush_cache_all_local();
index b7fc859..d0eb1bd 100644 (file)
@@ -271,7 +271,6 @@ void arch_send_call_function_single_ipi(int cpu)
 static void
 smp_cpu_init(int cpunum)
 {
-       extern void init_IRQ(void);    /* arch/parisc/kernel/irq.c */
        extern void start_cpu_itimer(void); /* arch/parisc/kernel/time.c */
 
        /* Set modes and Enable floating point coprocessor */
@@ -500,11 +499,10 @@ int __cpu_disable(void)
 void __cpu_die(unsigned int cpu)
 {
        pdc_cpu_rendezvous_lock();
+}
 
-       if (!cpu_wait_death(cpu, 5)) {
-               pr_crit("CPU%u: cpu didn't die\n", cpu);
-               return;
-       }
+void arch_cpuhp_cleanup_dead_cpu(unsigned int cpu)
+{
        pr_info("CPU%u: is shutting down\n", cpu);
 
        /* set task's state to interruptible sleep */
index 0e42fce..3c71fad 100644 (file)
 448    common  process_mrelease                sys_process_mrelease
 449    common  futex_waitv                     sys_futex_waitv
 450    common  set_mempolicy_home_node         sys_set_mempolicy_home_node
+451    common  cachestat                       sys_cachestat
index d1d3990..a8a1a7c 100644 (file)
@@ -66,7 +66,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
        if (pud) {
                pmd = pmd_alloc(mm, pud, addr);
                if (pmd)
-                       pte = pte_alloc_map(mm, pmd, addr);
+                       pte = pte_alloc_huge(mm, pmd, addr);
        }
        return pte;
 }
@@ -90,7 +90,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
                        if (!pud_none(*pud)) {
                                pmd = pmd_offset(pud, addr);
                                if (!pmd_none(*pmd))
-                                       pte = pte_offset_map(pmd, addr);
+                                       pte = pte_offset_huge(pmd, addr);
                        }
                }
        }
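
The pte_alloc_huge() and pte_offset_huge() helpers used above are the
hugetlb counterparts of the map/unmap pair seen earlier in this pull:
pte_offset_huge() hands back a plain kernel pointer that needs no
pte_unmap(), and pte_alloc_huge() returns NULL only if allocating the page
table fails. A condensed, hypothetical walker in the same shape as the hunk:

#include <linux/mm.h>
#include <linux/hugetlb.h>

static pte_t *foo_huge_pte_offset(struct mm_struct *mm, unsigned long addr)
{
        pgd_t *pgd = pgd_offset(mm, addr);
        p4d_t *p4d;
        pud_t *pud;
        pmd_t *pmd;

        if (pgd_none(*pgd))
                return NULL;
        p4d = p4d_offset(pgd, addr);
        if (p4d_none(*p4d))
                return NULL;
        pud = pud_offset(p4d, addr);
        if (pud_none(*pud))
                return NULL;
        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd))
                return NULL;
        return pte_offset_huge(pmd, addr);	/* no pte_unmap() needed */
}
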
index bff5820..8eb6c5e 100644 (file)
@@ -90,8 +90,7 @@ config NMI_IPI
 
 config PPC_WATCHDOG
        bool
-       depends on HARDLOCKUP_DETECTOR
-       depends on HAVE_HARDLOCKUP_DETECTOR_ARCH
+       depends on HARDLOCKUP_DETECTOR_ARCH
        default y
        help
          This is a placeholder when the powerpc hardlockup detector
@@ -240,7 +239,7 @@ config PPC
        select HAVE_GCC_PLUGINS                 if GCC_VERSION >= 50200   # plugin support on gcc <= 5.1 is buggy on PPC
        select HAVE_GENERIC_VDSO
        select HAVE_HARDLOCKUP_DETECTOR_ARCH    if PPC_BOOK3S_64 && SMP
-       select HAVE_HARDLOCKUP_DETECTOR_PERF    if PERF_EVENTS && HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
+       select HAVE_HARDLOCKUP_DETECTOR_PERF    if PERF_EVENTS && HAVE_PERF_EVENTS_NMI
        select HAVE_HW_BREAKPOINT               if PERF_EVENTS && (PPC_BOOK3S || PPC_8xx)
        select HAVE_IOREMAP_PROT
        select HAVE_IRQ_TIME_ACCOUNTING
index 47228b1..5bf6a4d 100644 (file)
@@ -126,18 +126,6 @@ ATOMIC_OPS(xor, xor, "", K)
 #undef ATOMIC_OP_RETURN_RELAXED
 #undef ATOMIC_OP
 
-#define arch_atomic_cmpxchg(v, o, n) \
-       (arch_cmpxchg(&((v)->counter), (o), (n)))
-#define arch_atomic_cmpxchg_relaxed(v, o, n) \
-       arch_cmpxchg_relaxed(&((v)->counter), (o), (n))
-#define arch_atomic_cmpxchg_acquire(v, o, n) \
-       arch_cmpxchg_acquire(&((v)->counter), (o), (n))
-
-#define arch_atomic_xchg(v, new) \
-       (arch_xchg(&((v)->counter), new))
-#define arch_atomic_xchg_relaxed(v, new) \
-       arch_xchg_relaxed(&((v)->counter), (new))
-
 /**
  * atomic_fetch_add_unless - add unless the number is a given value
  * @v: pointer of type atomic_t
@@ -396,18 +384,6 @@ static __inline__ s64 arch_atomic64_dec_if_positive(atomic64_t *v)
 }
 #define arch_atomic64_dec_if_positive arch_atomic64_dec_if_positive
 
-#define arch_atomic64_cmpxchg(v, o, n) \
-       (arch_cmpxchg(&((v)->counter), (o), (n)))
-#define arch_atomic64_cmpxchg_relaxed(v, o, n) \
-       arch_cmpxchg_relaxed(&((v)->counter), (o), (n))
-#define arch_atomic64_cmpxchg_acquire(v, o, n) \
-       arch_cmpxchg_acquire(&((v)->counter), (o), (n))
-
-#define arch_atomic64_xchg(v, new) \
-       (arch_xchg(&((v)->counter), new))
-#define arch_atomic64_xchg_relaxed(v, new) \
-       arch_xchg_relaxed(&((v)->counter), (new))
-
 /**
  * atomic64_fetch_add_unless - add unless the number is a given value
  * @v: pointer of type atomic64_t
diff --git a/arch/powerpc/include/asm/bugs.h b/arch/powerpc/include/asm/bugs.h
deleted file mode 100644 (file)
index 01b8f6c..0000000
+++ /dev/null
@@ -1,15 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-#ifndef _ASM_POWERPC_BUGS_H
-#define _ASM_POWERPC_BUGS_H
-
-/*
- */
-
-/*
- * This file is included by 'init/main.c' to check for
- * architecture-dependent bugs.
- */
-
-static inline void check_bugs(void) { }
-
-#endif /* _ASM_POWERPC_BUGS_H */
index ae0a68a..6923223 100644 (file)
 
 #define IFETCH_ALIGN_BYTES     (1 << IFETCH_ALIGN_SHIFT)
 
+#ifdef CONFIG_NOT_COHERENT_CACHE
+#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
+#endif
+
 #if !defined(__ASSEMBLY__)
 #ifdef CONFIG_PPC64
 
index deadd21..f257cac 100644 (file)
@@ -50,9 +50,14 @@ extern void *hardirq_ctx[NR_CPUS];
 extern void *softirq_ctx[NR_CPUS];
 
 void __do_IRQ(struct pt_regs *regs);
-extern void __init init_IRQ(void);
 
 int irq_choose_cpu(const struct cpumask *mask);
 
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_NMI_IPI)
+extern void arch_trigger_cpumask_backtrace(const cpumask_t *mask,
+                                          bool exclude_self);
+#define arch_trigger_cpumask_backtrace arch_trigger_cpumask_backtrace
+#endif
+
 #endif /* _ASM_IRQ_H */
 #endif /* __KERNEL__ */
index c3c7ade..49a7534 100644 (file)
@@ -3,18 +3,10 @@
 #define _ASM_NMI_H
 
 #ifdef CONFIG_PPC_WATCHDOG
-extern void arch_touch_nmi_watchdog(void);
 long soft_nmi_interrupt(struct pt_regs *regs);
-void watchdog_nmi_set_timeout_pct(u64 pct);
+void watchdog_hardlockup_set_timeout_pct(u64 pct);
 #else
-static inline void arch_touch_nmi_watchdog(void) {}
-static inline void watchdog_nmi_set_timeout_pct(u64 pct) {}
-#endif
-
-#ifdef CONFIG_NMI_IPI
-extern void arch_trigger_cpumask_backtrace(const cpumask_t *mask,
-                                          bool exclude_self);
-#define arch_trigger_cpumask_backtrace arch_trigger_cpumask_backtrace
+static inline void watchdog_hardlockup_set_timeout_pct(u64 pct) {}
 #endif
 
 extern void hv_nmi_check_nonrecoverable(struct pt_regs *regs);
index 56f2176..b9ac9e3 100644 (file)
 
 #define VM_DATA_DEFAULT_FLAGS  VM_DATA_DEFAULT_FLAGS32
 
-#ifdef CONFIG_NOT_COHERENT_CACHE
-#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
-#endif
-
 #if defined(CONFIG_PPC_256K_PAGES) || \
     (defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES))
 #define PTE_SHIFT      (PAGE_SHIFT - PTE_T_LOG2 - 2)   /* 1/4 of a page */
index 9972626..6a88bfd 100644 (file)
@@ -165,9 +165,6 @@ static inline bool is_ioremap_addr(const void *x)
 
        return addr >= IOREMAP_BASE && addr < IOREMAP_END;
 }
-
-struct seq_file;
-void arch_report_meminfo(struct seq_file *m);
 #endif /* CONFIG_PPC64 */
 
 #endif /* __ASSEMBLY__ */
index 265801a..e95660e 100644 (file)
@@ -417,9 +417,9 @@ noinstr static void nmi_ipi_lock_start(unsigned long *flags)
 {
        raw_local_irq_save(*flags);
        hard_irq_disable();
-       while (arch_atomic_cmpxchg(&__nmi_ipi_lock, 0, 1) == 1) {
+       while (raw_atomic_cmpxchg(&__nmi_ipi_lock, 0, 1) == 1) {
                raw_local_irq_restore(*flags);
-               spin_until_cond(arch_atomic_read(&__nmi_ipi_lock) == 0);
+               spin_until_cond(raw_atomic_read(&__nmi_ipi_lock) == 0);
                raw_local_irq_save(*flags);
                hard_irq_disable();
        }
@@ -427,15 +427,15 @@ noinstr static void nmi_ipi_lock_start(unsigned long *flags)
 
 noinstr static void nmi_ipi_lock(void)
 {
-       while (arch_atomic_cmpxchg(&__nmi_ipi_lock, 0, 1) == 1)
-               spin_until_cond(arch_atomic_read(&__nmi_ipi_lock) == 0);
+       while (raw_atomic_cmpxchg(&__nmi_ipi_lock, 0, 1) == 1)
+               spin_until_cond(raw_atomic_read(&__nmi_ipi_lock) == 0);
 }
 
 noinstr static void nmi_ipi_unlock(void)
 {
        smp_mb();
-       WARN_ON(arch_atomic_read(&__nmi_ipi_lock) != 1);
-       arch_atomic_set(&__nmi_ipi_lock, 0);
+       WARN_ON(raw_atomic_read(&__nmi_ipi_lock) != 1);
+       raw_atomic_set(&__nmi_ipi_lock, 0);
 }
 
 noinstr static void nmi_ipi_unlock_end(unsigned long *flags)
@@ -1605,6 +1605,7 @@ static void add_cpu_to_masks(int cpu)
 }
 
 /* Activate a secondary processor. */
+__no_stack_protector
 void start_secondary(void *unused)
 {
        unsigned int cpu = raw_smp_processor_id();
index a0be127..8c0b08b 100644 (file)
 448    common  process_mrelease                sys_process_mrelease
 449    common  futex_waitv                     sys_futex_waitv
 450    nospu   set_mempolicy_home_node         sys_set_mempolicy_home_node
+451    common  cachestat                       sys_cachestat
index 828d0f4..cba6dd1 100644 (file)
@@ -200,7 +200,7 @@ static int __init TAU_init(void)
        tau_int_enable = IS_ENABLED(CONFIG_TAU_INT) &&
                         !strcmp(cur_cpu_spec->platform, "ppc750");
 
-       tau_workq = alloc_workqueue("tau", WQ_UNBOUND, 1);
+       tau_workq = alloc_ordered_workqueue("tau", 0);
        if (!tau_workq)
                return -ENOMEM;
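
Both workqueue hunks in this pull (here and in the pseries DLPAR code further
down) replace alloc_workqueue(name, WQ_UNBOUND, 1) with
alloc_ordered_workqueue(name, 0), which states the one-item-at-a-time
ordering intent explicitly. A minimal sketch with a hypothetical queue:

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *foo_wq;		/* hypothetical example queue */

static int __init foo_wq_init(void)
{
        foo_wq = alloc_ordered_workqueue("foo", 0);	/* strictly ordered, one item at a time */
        return foo_wq ? 0 : -ENOMEM;
}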
 
index dbcc4a7..edb2dd1 100644 (file)
@@ -438,7 +438,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 {
        int cpu = smp_processor_id();
 
-       if (!(watchdog_enabled & NMI_WATCHDOG_ENABLED))
+       if (!(watchdog_enabled & WATCHDOG_HARDLOCKUP_ENABLED))
                return HRTIMER_NORESTART;
 
        if (!cpumask_test_cpu(cpu, &watchdog_cpumask))
@@ -479,7 +479,7 @@ static void start_watchdog(void *arg)
                return;
        }
 
-       if (!(watchdog_enabled & NMI_WATCHDOG_ENABLED))
+       if (!(watchdog_enabled & WATCHDOG_HARDLOCKUP_ENABLED))
                return;
 
        if (!cpumask_test_cpu(cpu, &watchdog_cpumask))
@@ -546,7 +546,7 @@ static void watchdog_calc_timeouts(void)
        wd_timer_period_ms = watchdog_thresh * 1000 * 2 / 5;
 }
 
-void watchdog_nmi_stop(void)
+void watchdog_hardlockup_stop(void)
 {
        int cpu;
 
@@ -554,7 +554,7 @@ void watchdog_nmi_stop(void)
                stop_watchdog_on_cpu(cpu);
 }
 
-void watchdog_nmi_start(void)
+void watchdog_hardlockup_start(void)
 {
        int cpu;
 
@@ -566,7 +566,7 @@ void watchdog_nmi_start(void)
 /*
  * Invoked from core watchdog init.
  */
-int __init watchdog_nmi_probe(void)
+int __init watchdog_hardlockup_probe(void)
 {
        int err;
 
@@ -582,7 +582,7 @@ int __init watchdog_nmi_probe(void)
 }
 
 #ifdef CONFIG_PPC_PSERIES
-void watchdog_nmi_set_timeout_pct(u64 pct)
+void watchdog_hardlockup_set_timeout_pct(u64 pct)
 {
        pr_info("Set the NMI watchdog timeout factor to %llu%%\n", pct);
        WRITE_ONCE(wd_timeout_pct, pct);
index 461307b..5727078 100644 (file)
@@ -509,7 +509,7 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t *pmd, bool full,
                } else {
                        pte_t *pte;
 
-                       pte = pte_offset_map(p, 0);
+                       pte = pte_offset_kernel(p, 0);
                        kvmppc_unmap_free_pte(kvm, pte, full, lpid);
                        pmd_clear(p);
                }
index a64ea0a..21fcad9 100644 (file)
@@ -239,12 +239,16 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long
        local_irq_save(flags);
        arch_enter_lazy_mmu_mode();
        start_pte = pte_offset_map(pmd, addr);
+       if (!start_pte)
+               goto out;
        for (pte = start_pte; pte < start_pte + PTRS_PER_PTE; pte++) {
                unsigned long pteval = pte_val(*pte);
                if (pteval & H_PAGE_HASHPTE)
                        hpte_need_flush(mm, addr, pte, pteval, 0);
                addr += PAGE_SIZE;
        }
+       pte_unmap(start_pte);
+out:
        arch_leave_lazy_mmu_mode();
        local_irq_restore(flags);
 }
index 81d7185..d19fb1f 100644 (file)
@@ -105,7 +105,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 
                ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
                                FOLL_WRITE | FOLL_LONGTERM,
-                               mem->hpages + entry, NULL);
+                               mem->hpages + entry);
                if (ret == n) {
                        pinned += n;
                        continue;
index b75a9fb..0dc8555 100644 (file)
@@ -71,6 +71,8 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr,
        if (pmd_none(*pmd))
                return;
        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+       if (!pte)
+               return;
        arch_enter_lazy_mmu_mode();
        for (; npages > 0; --npages) {
                pte_update(mm, addr, pte, 0, 0, 0);
index b900933..f7c683b 100644 (file)
@@ -183,7 +183,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
                return NULL;
 
        if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
-               return pte_alloc_map(mm, (pmd_t *)hpdp, addr);
+               return pte_alloc_huge(mm, (pmd_t *)hpdp, addr);
 
        BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
 
index 193cc9c..0c41f4b 100644 (file)
@@ -76,7 +76,8 @@ int pmac_newworld;
 
 static int current_root_goodness = -1;
 
-#define DEFAULT_ROOT_DEVICE Root_SDA1  /* sda1 - slightly silly choice */
+/* sda1 - slightly silly choice */
+#define DEFAULT_ROOT_DEVICE    MKDEV(SCSI_DISK0_MAJOR, 1)
 
 sys_ctrler_t sys_ctrler = SYS_CTRLER_UNKNOWN;
 EXPORT_SYMBOL(sys_ctrler);
index 719c97a..47f8eab 100644 (file)
@@ -564,8 +564,7 @@ int __init dlpar_workqueue_init(void)
        if (pseries_hp_wq)
                return 0;
 
-       pseries_hp_wq = alloc_workqueue("pseries hotplug workqueue",
-                       WQ_UNBOUND, 1);
+       pseries_hp_wq = alloc_ordered_workqueue("pseries hotplug workqueue", 0);
 
        return pseries_hp_wq ? 0 : -ENOMEM;
 }
index 6f30113..cd632ba 100644 (file)
@@ -750,7 +750,7 @@ static int pseries_migrate_partition(u64 handle)
                goto out;
 
        if (factor)
-               watchdog_nmi_set_timeout_pct(factor);
+               watchdog_hardlockup_set_timeout_pct(factor);
 
        ret = pseries_suspend(handle);
        if (ret == 0) {
@@ -766,7 +766,7 @@ static int pseries_migrate_partition(u64 handle)
                pseries_cancel_migration(handle, ret);
 
        if (factor)
-               watchdog_nmi_set_timeout_pct(0);
+               watchdog_hardlockup_set_timeout_pct(0);
 
 out:
        vas_migration_handler(VAS_RESUME);
index 70c4c59..fae747c 100644 (file)
@@ -3376,12 +3376,15 @@ static void show_pte(unsigned long addr)
        printf("pmdp @ 0x%px = 0x%016lx\n", pmdp, pmd_val(*pmdp));
 
        ptep = pte_offset_map(pmdp, addr);
-       if (pte_none(*ptep)) {
+       if (!ptep || pte_none(*ptep)) {
+               if (ptep)
+                       pte_unmap(ptep);
                printf("no valid PTE\n");
                return;
        }
 
        format_pte(ptep, pte_val(*ptep));
+       pte_unmap(ptep);
 
        sync();
        __delay(200);
index 5966ad9..c69572f 100644 (file)
@@ -123,6 +123,7 @@ config RISCV
        select HAVE_RSEQ
        select HAVE_STACKPROTECTOR
        select HAVE_SYSCALL_TRACEPOINTS
+       select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
        select IRQ_DOMAIN
        select IRQ_FORCED_THREADING
        select KASAN_VMALLOC if KASAN
index bba4729..f5dfef6 100644 (file)
@@ -238,78 +238,6 @@ static __always_inline s64 arch_atomic64_fetch_add_unless(atomic64_t *v, s64 a,
 #define arch_atomic64_fetch_add_unless arch_atomic64_fetch_add_unless
 #endif
 
-/*
- * atomic_{cmp,}xchg is required to have exactly the same ordering semantics as
- * {cmp,}xchg and the operations that return, so they need a full barrier.
- */
-#define ATOMIC_OP(c_t, prefix, size)                                   \
-static __always_inline                                                 \
-c_t arch_atomic##prefix##_xchg_relaxed(atomic##prefix##_t *v, c_t n)   \
-{                                                                      \
-       return __xchg_relaxed(&(v->counter), n, size);                  \
-}                                                                      \
-static __always_inline                                                 \
-c_t arch_atomic##prefix##_xchg_acquire(atomic##prefix##_t *v, c_t n)   \
-{                                                                      \
-       return __xchg_acquire(&(v->counter), n, size);                  \
-}                                                                      \
-static __always_inline                                                 \
-c_t arch_atomic##prefix##_xchg_release(atomic##prefix##_t *v, c_t n)   \
-{                                                                      \
-       return __xchg_release(&(v->counter), n, size);                  \
-}                                                                      \
-static __always_inline                                                 \
-c_t arch_atomic##prefix##_xchg(atomic##prefix##_t *v, c_t n)           \
-{                                                                      \
-       return __arch_xchg(&(v->counter), n, size);                     \
-}                                                                      \
-static __always_inline                                                 \
-c_t arch_atomic##prefix##_cmpxchg_relaxed(atomic##prefix##_t *v,       \
-                                    c_t o, c_t n)                      \
-{                                                                      \
-       return __cmpxchg_relaxed(&(v->counter), o, n, size);            \
-}                                                                      \
-static __always_inline                                                 \
-c_t arch_atomic##prefix##_cmpxchg_acquire(atomic##prefix##_t *v,       \
-                                    c_t o, c_t n)                      \
-{                                                                      \
-       return __cmpxchg_acquire(&(v->counter), o, n, size);            \
-}                                                                      \
-static __always_inline                                                 \
-c_t arch_atomic##prefix##_cmpxchg_release(atomic##prefix##_t *v,       \
-                                    c_t o, c_t n)                      \
-{                                                                      \
-       return __cmpxchg_release(&(v->counter), o, n, size);            \
-}                                                                      \
-static __always_inline                                                 \
-c_t arch_atomic##prefix##_cmpxchg(atomic##prefix##_t *v, c_t o, c_t n) \
-{                                                                      \
-       return __cmpxchg(&(v->counter), o, n, size);                    \
-}
-
-#ifdef CONFIG_GENERIC_ATOMIC64
-#define ATOMIC_OPS()                                                   \
-       ATOMIC_OP(int,   , 4)
-#else
-#define ATOMIC_OPS()                                                   \
-       ATOMIC_OP(int,   , 4)                                           \
-       ATOMIC_OP(s64, 64, 8)
-#endif
-
-ATOMIC_OPS()
-
-#define arch_atomic_xchg_relaxed       arch_atomic_xchg_relaxed
-#define arch_atomic_xchg_acquire       arch_atomic_xchg_acquire
-#define arch_atomic_xchg_release       arch_atomic_xchg_release
-#define arch_atomic_xchg               arch_atomic_xchg
-#define arch_atomic_cmpxchg_relaxed    arch_atomic_cmpxchg_relaxed
-#define arch_atomic_cmpxchg_acquire    arch_atomic_cmpxchg_acquire
-#define arch_atomic_cmpxchg_release    arch_atomic_cmpxchg_release
-#define arch_atomic_cmpxchg            arch_atomic_cmpxchg
-
-#undef ATOMIC_OPS
-#undef ATOMIC_OP
-
 static __always_inline bool arch_atomic_inc_unless_negative(atomic_t *v)
 {
        int prev, rc;
index 43b9ebf..8e10a94 100644 (file)
@@ -16,6 +16,4 @@ void riscv_set_intc_hwnode_fn(struct fwnode_handle *(*fn)(void));
 
 struct fwnode_handle *riscv_get_intc_hwnode(void);
 
-extern void __init init_IRQ(void);
-
 #endif /* _ASM_RISCV_IRQ_H */
index c4b7701..0d55584 100644 (file)
@@ -70,7 +70,7 @@ asmlinkage void smp_callin(void);
 
 #if defined CONFIG_HOTPLUG_CPU
 int __cpu_disable(void);
-void __cpu_die(unsigned int cpu);
+static inline void __cpu_die(unsigned int cpu) { }
 #endif /* CONFIG_HOTPLUG_CPU */
 
 #else
index d6a7428..a066978 100644 (file)
@@ -88,6 +88,4 @@ static inline int read_current_timer(unsigned long *timer_val)
        return 0;
 }
 
-extern void time_init(void);
-
 #endif /* _ASM_RISCV_TIMEX_H */
index a941adc..457a18e 100644 (file)
@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 #include <linux/err.h>
 #include <linux/irq.h>
+#include <linux/cpuhotplug.h>
 #include <linux/cpu.h>
 #include <linux/sched/hotplug.h>
 #include <asm/irq.h>
@@ -49,17 +50,15 @@ int __cpu_disable(void)
        return ret;
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
 /*
- * Called on the thread which is asking for a CPU to be shutdown.
+ * Called on the thread which is asking for a CPU to be shutdown, if the
+ * CPU reported dead to the hotplug core.
  */
-void __cpu_die(unsigned int cpu)
+void arch_cpuhp_cleanup_dead_cpu(unsigned int cpu)
 {
        int ret = 0;
 
-       if (!cpu_wait_death(cpu, 5)) {
-               pr_err("CPU %u: didn't die\n", cpu);
-               return;
-       }
        pr_notice("CPU%u: off\n", cpu);
 
        /* Verify from the firmware if the cpu is really stopped*/
@@ -76,9 +75,10 @@ void __noreturn arch_cpu_idle_dead(void)
 {
        idle_task_exit();
 
-       (void)cpu_report_death();
+       cpuhp_ap_report_dead();
 
        cpu_ops[smp_processor_id()]->cpu_stop();
        /* It should never reach here */
        BUG();
 }
+#endif
index e0ef56d..542883b 100644 (file)
@@ -67,7 +67,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 
        for_each_napot_order(order) {
                if (napot_cont_size(order) == sz) {
-                       pte = pte_alloc_map(mm, pmd, addr & napot_cont_mask(order));
+                       pte = pte_alloc_huge(mm, pmd, addr & napot_cont_mask(order));
                        break;
                }
        }
@@ -114,7 +114,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 
        for_each_napot_order(order) {
                if (napot_cont_size(order) == sz) {
-                       pte = pte_offset_kernel(pmd, addr & napot_cont_mask(order));
+                       pte = pte_offset_huge(pmd, addr & napot_cont_mask(order));
                        break;
                }
        }
index bd2e27f..dc20e16 100644 (file)
@@ -31,7 +31,7 @@ $(obj)/strncmp.o: $(srctree)/arch/riscv/lib/strncmp.S FORCE
 $(obj)/sha256.o: $(srctree)/lib/crypto/sha256.c FORCE
        $(call if_changed_rule,cc_o_c)
 
-CFLAGS_sha256.o := -D__DISABLE_EXPORTS
+CFLAGS_sha256.o := -D__DISABLE_EXPORTS -D__NO_FORTIFY
 CFLAGS_string.o := -D__DISABLE_EXPORTS
 CFLAGS_ctype.o := -D__DISABLE_EXPORTS
 
index 6dab9c1..5b39918 100644 (file)
@@ -117,6 +117,7 @@ config S390
        select ARCH_SUPPORTS_ATOMIC_RMW
        select ARCH_SUPPORTS_DEBUG_PAGEALLOC
        select ARCH_SUPPORTS_HUGETLBFS
+       select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && CC_IS_CLANG
        select ARCH_SUPPORTS_NUMA_BALANCING
        select ARCH_SUPPORTS_PER_VMA_LOCK
        select ARCH_USE_BUILTIN_BSWAP
index acb1f8b..c67f59d 100644 (file)
@@ -45,6 +45,13 @@ static void pgtable_populate(unsigned long addr, unsigned long end, enum populat
 
 static pte_t pte_z;
 
+static inline void kasan_populate(unsigned long start, unsigned long end, enum populate_mode mode)
+{
+       start = PAGE_ALIGN_DOWN(__sha(start));
+       end = PAGE_ALIGN(__sha(end));
+       pgtable_populate(start, end, mode);
+}
+
 static void kasan_populate_shadow(void)
 {
        pmd_t pmd_z = __pmd(__pa(kasan_early_shadow_pte) | _SEGMENT_ENTRY);
@@ -95,17 +102,17 @@ static void kasan_populate_shadow(void)
         */
 
        for_each_physmem_usable_range(i, &start, &end)
-               pgtable_populate(__sha(start), __sha(end), POPULATE_KASAN_MAP_SHADOW);
+               kasan_populate(start, end, POPULATE_KASAN_MAP_SHADOW);
        if (IS_ENABLED(CONFIG_KASAN_VMALLOC)) {
                untracked_end = VMALLOC_START;
                /* shallowly populate kasan shadow for vmalloc and modules */
-               pgtable_populate(__sha(VMALLOC_START), __sha(MODULES_END), POPULATE_KASAN_SHALLOW);
+               kasan_populate(VMALLOC_START, MODULES_END, POPULATE_KASAN_SHALLOW);
        } else {
                untracked_end = MODULES_VADDR;
        }
        /* populate kasan shadow for untracked memory */
-       pgtable_populate(__sha(ident_map_size), __sha(untracked_end), POPULATE_KASAN_ZERO_SHADOW);
-       pgtable_populate(__sha(MODULES_END), __sha(_REGION1_SIZE), POPULATE_KASAN_ZERO_SHADOW);
+       kasan_populate(ident_map_size, untracked_end, POPULATE_KASAN_ZERO_SHADOW);
+       kasan_populate(MODULES_END, _REGION1_SIZE, POPULATE_KASAN_ZERO_SHADOW);
 }
 
 static bool kasan_pgd_populate_zero_shadow(pgd_t *pgd, unsigned long addr,
index be3bf03..aa95cf6 100644 (file)
@@ -116,6 +116,7 @@ CONFIG_UNIX=y
 CONFIG_UNIX_DIAG=m
 CONFIG_XFRM_USER=m
 CONFIG_NET_KEY=m
+CONFIG_NET_TC_SKB_EXT=y
 CONFIG_SMC=m
 CONFIG_SMC_DIAG=m
 CONFIG_INET=y
index 769c7ee..f041945 100644 (file)
@@ -107,6 +107,7 @@ CONFIG_UNIX=y
 CONFIG_UNIX_DIAG=m
 CONFIG_XFRM_USER=m
 CONFIG_NET_KEY=m
+CONFIG_NET_TC_SKB_EXT=y
 CONFIG_SMC=m
 CONFIG_SMC_DIAG=m
 CONFIG_INET=y
index 29dc827..d29a9d9 100644 (file)
@@ -5,7 +5,7 @@
  * s390 implementation of the AES Cipher Algorithm with protected keys.
  *
  * s390 Version:
- *   Copyright IBM Corp. 2017,2020
+ *   Copyright IBM Corp. 2017, 2023
  *   Author(s): Martin Schwidefsky <schwidefsky@de.ibm.com>
  *             Harald Freudenberger <freude@de.ibm.com>
  */
@@ -132,7 +132,8 @@ static inline int __paes_keyblob2pkey(struct key_blob *kb,
                if (i > 0 && ret == -EAGAIN && in_task())
                        if (msleep_interruptible(1000))
                                return -EINTR;
-               ret = pkey_keyblob2pkey(kb->key, kb->keylen, pk);
+               ret = pkey_keyblob2pkey(kb->key, kb->keylen,
+                                       pk->protkey, &pk->len, &pk->type);
                if (ret == 0)
                        break;
        }
@@ -145,6 +146,7 @@ static inline int __paes_convert_key(struct s390_paes_ctx *ctx)
        int ret;
        struct pkey_protkey pkey;
 
+       pkey.len = sizeof(pkey.protkey);
        ret = __paes_keyblob2pkey(&ctx->kb, &pkey);
        if (ret)
                return ret;
@@ -414,6 +416,9 @@ static inline int __xts_paes_convert_key(struct s390_pxts_ctx *ctx)
 {
        struct pkey_protkey pkey0, pkey1;
 
+       pkey0.len = sizeof(pkey0.protkey);
+       pkey1.len = sizeof(pkey1.protkey);
+
        if (__paes_keyblob2pkey(&ctx->kb[0], &pkey0) ||
            __paes_keyblob2pkey(&ctx->kb[1], &pkey1))
                return -EINVAL;
index c37eb92..a873e87 100644 (file)
@@ -6,4 +6,8 @@
 #include <asm/fpu/api.h>
 #include <asm-generic/asm-prototypes.h>
 
+__int128_t __ashlti3(__int128_t a, int b);
+__int128_t __ashrti3(__int128_t a, int b);
+__int128_t __lshrti3(__int128_t a, int b);
+
 #endif /* _ASM_S390_PROTOTYPES_H */
index 06e0e42..aae0315 100644 (file)
@@ -190,38 +190,18 @@ static __always_inline unsigned long __cmpxchg(unsigned long address,
 #define arch_cmpxchg_local     arch_cmpxchg
 #define arch_cmpxchg64_local   arch_cmpxchg
 
-#define system_has_cmpxchg_double()    1
+#define system_has_cmpxchg128()                1
 
-static __always_inline int __cmpxchg_double(unsigned long p1, unsigned long p2,
-                                           unsigned long o1, unsigned long o2,
-                                           unsigned long n1, unsigned long n2)
+static __always_inline u128 arch_cmpxchg128(volatile u128 *ptr, u128 old, u128 new)
 {
-       union register_pair old = { .even = o1, .odd = o2, };
-       union register_pair new = { .even = n1, .odd = n2, };
-       int cc;
-
        asm volatile(
                "       cdsg    %[old],%[new],%[ptr]\n"
-               "       ipm     %[cc]\n"
-               "       srl     %[cc],28\n"
-               : [cc] "=&d" (cc), [old] "+&d" (old.pair)
-               : [new] "d" (new.pair),
-                 [ptr] "QS" (*(unsigned long *)p1), "Q" (*(unsigned long *)p2)
+               : [old] "+d" (old), [ptr] "+QS" (*ptr)
+               : [new] "d" (new)
                : "memory", "cc");
-       return !cc;
+       return old;
 }
 
-#define arch_cmpxchg_double(p1, p2, o1, o2, n1, n2)                    \
-({                                                                     \
-       typeof(p1) __p1 = (p1);                                         \
-       typeof(p2) __p2 = (p2);                                         \
-                                                                       \
-       BUILD_BUG_ON(sizeof(*(p1)) != sizeof(long));                    \
-       BUILD_BUG_ON(sizeof(*(p2)) != sizeof(long));                    \
-       VM_BUG_ON((unsigned long)((__p1) + 1) != (unsigned long)(__p2));\
-       __cmpxchg_double((unsigned long)__p1, (unsigned long)__p2,      \
-                        (unsigned long)(o1), (unsigned long)(o2),      \
-                        (unsigned long)(n1), (unsigned long)(n2));     \
-})
+#define arch_cmpxchg128                arch_cmpxchg128
 
 #endif /* __ASM_CMPXCHG_H */
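
The cmpxchg_double() machinery above gives way to a plain 128-bit
compare-and-exchange on a u128. A hypothetical lock-free update of a 16-byte
slot (a pointer in the low half, a sequence count in the high half) using the
generic cmpxchg128() wrapper, which resolves to the arch_cmpxchg128() defined
here on s390; the slot must stay 16-byte aligned for the cdsg instruction:

#include <linux/atomic.h>
#include <linux/types.h>

static u128 foo_slot __aligned(16);		/* hypothetical shared descriptor */

static void foo_update(u64 new_ptr)
{
        u128 old = foo_slot, prev;

        for (;;) {
                u128 new = ((u128)((u64)(old >> 64) + 1) << 64) | new_ptr;

                prev = cmpxchg128(&foo_slot, old, new);
                if (prev == old)
                        break;			/* installed atomically */
                old = prev;			/* raced with another updater: retry */
        }
}
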
index 646b129..b378e2b 100644 (file)
@@ -2,7 +2,7 @@
 /*
  * CP Assist for Cryptographic Functions (CPACF)
  *
- * Copyright IBM Corp. 2003, 2017
+ * Copyright IBM Corp. 2003, 2023
  * Author(s): Thomas Spatzier
  *           Jan Glauber
  *           Harald Freudenberger (freude@de.ibm.com)
 #define CPACF_PCKMO_ENC_AES_128_KEY    0x12
 #define CPACF_PCKMO_ENC_AES_192_KEY    0x13
 #define CPACF_PCKMO_ENC_AES_256_KEY    0x14
+#define CPACF_PCKMO_ENC_ECC_P256_KEY   0x20
+#define CPACF_PCKMO_ENC_ECC_P384_KEY   0x21
+#define CPACF_PCKMO_ENC_ECC_P521_KEY   0x22
+#define CPACF_PCKMO_ENC_ECC_ED25519_KEY        0x28
+#define CPACF_PCKMO_ENC_ECC_ED448_KEY  0x29
 
 /*
  * Function codes for the PRNO (PERFORM RANDOM NUMBER OPERATION)
index 7e417d7..a0de5b9 100644 (file)
@@ -140,7 +140,7 @@ union hws_trailer_header {
                unsigned int dsdes:16;  /* 48-63: size of diagnostic SDE */
                unsigned long long overflow; /* 64 - Overflow Count   */
        };
-       __uint128_t val;
+       u128 val;
 };
 
 struct hws_trailer_entry {
index 0d1c74a..a4d2e10 100644 (file)
@@ -16,6 +16,9 @@
 
 #define OS_INFO_VMCOREINFO     0
 #define OS_INFO_REIPL_BLOCK    1
+#define OS_INFO_FLAGS_ENTRY    2
+
+#define OS_INFO_FLAG_REIPL_CLEAR       (1UL << 0)
 
 struct os_info_entry {
        u64     addr;
@@ -30,8 +33,8 @@ struct os_info {
        u16     version_minor;
        u64     crashkernel_addr;
        u64     crashkernel_size;
-       struct os_info_entry entry[2];
-       u8      reserved[4024];
+       struct os_info_entry entry[3];
+       u8      reserved[4004];
 } __packed;
 
 void os_info_init(void);
index 081837b..264095d 100644 (file)
 #define this_cpu_cmpxchg_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 #define this_cpu_cmpxchg_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 
+#define this_cpu_cmpxchg64(pcp, o, n)  this_cpu_cmpxchg_8(pcp, o, n)
+
+#define this_cpu_cmpxchg128(pcp, oval, nval)                           \
+({                                                                     \
+       typedef typeof(pcp) pcp_op_T__;                                 \
+       u128 old__, new__, ret__;                                       \
+       pcp_op_T__ *ptr__;                                              \
+       old__ = oval;                                                   \
+       new__ = nval;                                                   \
+       preempt_disable_notrace();                                      \
+       ptr__ = raw_cpu_ptr(&(pcp));                                    \
+       ret__ = cmpxchg128((void *)ptr__, old__, new__);                \
+       preempt_enable_notrace();                                       \
+       ret__;                                                          \
+})
+
 #define arch_this_cpu_xchg(pcp, nval)                                  \
 ({                                                                     \
        typeof(pcp) *ptr__;                                             \
 #define this_cpu_xchg_4(pcp, nval) arch_this_cpu_xchg(pcp, nval)
 #define this_cpu_xchg_8(pcp, nval) arch_this_cpu_xchg(pcp, nval)
 
-#define arch_this_cpu_cmpxchg_double(pcp1, pcp2, o1, o2, n1, n2)           \
-({                                                                         \
-       typeof(pcp1) *p1__;                                                 \
-       typeof(pcp2) *p2__;                                                 \
-       int ret__;                                                          \
-                                                                           \
-       preempt_disable_notrace();                                          \
-       p1__ = raw_cpu_ptr(&(pcp1));                                        \
-       p2__ = raw_cpu_ptr(&(pcp2));                                        \
-       ret__ = __cmpxchg_double((unsigned long)p1__, (unsigned long)p2__,  \
-                                (unsigned long)(o1), (unsigned long)(o2),  \
-                                (unsigned long)(n1), (unsigned long)(n2)); \
-       preempt_enable_notrace();                                           \
-       ret__;                                                              \
-})
-
-#define this_cpu_cmpxchg_double_8 arch_this_cpu_cmpxchg_double
-
 #include <asm-generic/percpu.h>
 
 #endif /* __ARCH_S390_PERCPU__ */
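
this_cpu_cmpxchg128() defined above is the per-CPU counterpart of the same
128-bit primitive, taking over from this_cpu_cmpxchg_double_8(). A sketch of
a hypothetical per-CPU 16-byte slot (for example a freelist pointer plus a
transaction id) updated in a single operation:

#include <linux/percpu.h>
#include <linux/types.h>

struct foo_pcp {
        u128 slot;				/* keep 16-byte aligned for cdsg */
} __aligned(16);

static DEFINE_PER_CPU(struct foo_pcp, foo_pcp);

static bool foo_try_update(u128 old, u128 new)
{
        return this_cpu_cmpxchg128(foo_pcp.slot, old, new) == old;
}
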
index 6822a11..c55f3c3 100644 (file)
@@ -42,9 +42,6 @@ static inline void update_page_count(int level, long count)
                atomic_long_add(count, &direct_pages_count[level]);
 }
 
-struct seq_file;
-void arch_report_meminfo(struct seq_file *m);
-
 /*
  * The S390 doesn't have any external MMU info: the kernel page
  * tables contain all the necessary information.
index 8e9c582..9e41a74 100644 (file)
@@ -3,6 +3,7 @@
 #define _ASM_S390_MEM_DETECT_H
 
 #include <linux/types.h>
+#include <asm/page.h>
 
 enum physmem_info_source {
        MEM_DETECT_NONE = 0,
@@ -133,7 +134,7 @@ static inline const char *get_rr_type_name(enum reserved_range_type t)
 
 #define for_each_physmem_reserved_type_range(t, range, p_start, p_end)                         \
        for (range = &physmem_info.reserved[t], *p_start = range->start, *p_end = range->end;   \
-            range && range->end; range = range->chain,                                         \
+            range && range->end; range = range->chain ? __va(range->chain) : NULL,             \
             *p_start = range ? range->start : 0, *p_end = range ? range->end : 0)
 
 static inline struct reserved_range *__physmem_reserved_next(enum reserved_range_type *t,
@@ -145,7 +146,7 @@ static inline struct reserved_range *__physmem_reserved_next(enum reserved_range
                        return range;
        }
        if (range->chain)
-               return range->chain;
+               return __va(range->chain);
        while (++*t < RR_MAX) {
                range = &physmem_info.reserved[*t];
                if (range->end)
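The two fixes above are needed because reserved_range::chain now stores a physical address, so the walker must translate it with __va() before following the link. A small userspace model of walking such a chain; va_offset and __va_model() are artifacts of the model standing in for the kernel identity mapping:

    #include <stdio.h>
    #include <stdint.h>

    /* Pretend "physical" addresses are virtual addresses minus a constant. */
    static uintptr_t va_offset;
    #define __va_model(pa)  ((void *)((uintptr_t)(pa) + va_offset))

    struct range {
            unsigned long start, end;
            unsigned long chain;    /* physical address of next range, or 0 */
    };

    int main(void)
    {
            struct range c = { 300, 400, 0 };
            struct range b = { 200, 300, 0 };
            struct range a = { 100, 200, 0 };

            va_offset = 0x1000;
            a.chain = (uintptr_t)&b - va_offset;
            b.chain = (uintptr_t)&c - va_offset;

            /* Translate each stored physical link before dereferencing it. */
            for (struct range *r = &a; r;
                 r = r->chain ? __va_model(r->chain) : NULL)
                    printf("range %lu-%lu\n", r->start, r->end);
            return 0;
    }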
index dd3d20c..47d80a7 100644 (file)
@@ -2,7 +2,7 @@
 /*
  * Kernelspace interface to the pkey device driver
  *
- * Copyright IBM Corp. 2016,2019
+ * Copyright IBM Corp. 2016, 2023
  *
  * Author: Harald Freudenberger <freude@de.ibm.com>
  *
@@ -23,6 +23,6 @@
  * @return 0 on success, negative errno value on failure
  */
 int pkey_keyblob2pkey(const u8 *key, u32 keylen,
-                     struct pkey_protkey *protkey);
+                     u8 *protkey, u32 *protkeylen, u32 *protkeytype);
 
 #endif /* _KAPI_PKEY_H */
index c7c9792..a674c7d 100644 (file)
@@ -52,9 +52,6 @@ struct thread_info {
 
 struct task_struct;
 
-void arch_release_task_struct(struct task_struct *tsk);
-int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
-
 void arch_setup_new_exec(void);
 #define arch_setup_new_exec arch_setup_new_exec
 
index ce878e8..4d64665 100644 (file)
@@ -63,7 +63,7 @@ static inline int store_tod_clock_ext_cc(union tod_clock *clk)
        return cc;
 }
 
-static inline void store_tod_clock_ext(union tod_clock *tod)
+static __always_inline void store_tod_clock_ext(union tod_clock *tod)
 {
        asm volatile("stcke %0" : "=Q" (*tod) : : "cc");
 }
@@ -177,7 +177,7 @@ static inline void local_tick_enable(unsigned long comp)
 
 typedef unsigned long cycles_t;
 
-static inline unsigned long get_tod_clock(void)
+static __always_inline unsigned long get_tod_clock(void)
 {
        union tod_clock clk;
 
@@ -204,6 +204,11 @@ void init_cpu_timer(void);
 
 extern union tod_clock tod_clock_base;
 
+static __always_inline unsigned long __get_tod_clock_monotonic(void)
+{
+       return get_tod_clock() - tod_clock_base.tod;
+}
+
 /**
  * get_clock_monotonic - returns current time in clock rate units
  *
@@ -216,7 +221,7 @@ static inline unsigned long get_tod_clock_monotonic(void)
        unsigned long tod;
 
        preempt_disable_notrace();
-       tod = get_tod_clock() - tod_clock_base.tod;
+       tod = __get_tod_clock_monotonic();
        preempt_enable_notrace();
        return tod;
 }
@@ -240,7 +245,7 @@ static inline unsigned long get_tod_clock_monotonic(void)
  * -> ns = (th * 125) + ((tl * 125) >> 9);
  *
  */
-static inline unsigned long tod_to_ns(unsigned long todval)
+static __always_inline unsigned long tod_to_ns(unsigned long todval)
 {
        return ((todval >> 9) * 125) + (((todval & 0x1ff) * 125) >> 9);
 }
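The tod_to_ns() helper above implements ns = tod * 125 / 512 exactly (one TOD clock unit is 1/4096 microsecond): splitting the value into its top 55 bits and low 9 bits keeps the multiplication by 125 within 64 bits while producing the same result as the wide computation. A standalone check of that identity against 128-bit arithmetic:

    #include <stdio.h>

    static unsigned long tod_to_ns(unsigned long todval)
    {
            return ((todval >> 9) * 125) + (((todval & 0x1ff) * 125) >> 9);
    }

    int main(void)
    {
            unsigned long samples[] = { 0, 0x1ff, 4096, 0x123456789abcdefUL,
                                        0xffffffffffffffffUL };

            for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                    unsigned long tod = samples[i];
                    unsigned long split = tod_to_ns(tod);
                    unsigned long wide =
                            (unsigned long)(((unsigned __int128)tod * 125) >> 9);

                    printf("%#lx -> %lu ns (wide check: %lu)\n", tod, split, wide);
            }
            return 0;
    }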
index 924b876..f7bae1c 100644 (file)
@@ -2,7 +2,7 @@
 /*
  * Userspace interface to the pkey device driver
  *
- * Copyright IBM Corp. 2017, 2019
+ * Copyright IBM Corp. 2017, 2023
  *
  * Author: Harald Freudenberger <freude@de.ibm.com>
  *
 #define MINKEYBLOBSIZE SECKEYBLOBSIZE
 
 /* defines for the type field within the pkey_protkey struct */
-#define PKEY_KEYTYPE_AES_128                 1
-#define PKEY_KEYTYPE_AES_192                 2
-#define PKEY_KEYTYPE_AES_256                 3
-#define PKEY_KEYTYPE_ECC                     4
+#define PKEY_KEYTYPE_AES_128           1
+#define PKEY_KEYTYPE_AES_192           2
+#define PKEY_KEYTYPE_AES_256           3
+#define PKEY_KEYTYPE_ECC               4
+#define PKEY_KEYTYPE_ECC_P256          5
+#define PKEY_KEYTYPE_ECC_P384          6
+#define PKEY_KEYTYPE_ECC_P521          7
+#define PKEY_KEYTYPE_ECC_ED25519       8
+#define PKEY_KEYTYPE_ECC_ED448         9
 
 /* the newer ioctls use a pkey_key_type enum for type information */
 enum pkey_key_type {
index 8a617be..7af6994 100644 (file)
@@ -568,9 +568,9 @@ static size_t get_elfcorehdr_size(int mem_chunk_cnt)
 int elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size)
 {
        Elf64_Phdr *phdr_notes, *phdr_loads;
+       size_t alloc_size;
        int mem_chunk_cnt;
        void *ptr, *hdr;
-       u32 alloc_size;
        u64 hdr_off;
 
        /* If we are not in kdump or zfcp/nvme dump mode return */
index 34674e3..9f41853 100644 (file)
@@ -34,14 +34,12 @@ void kernel_stack_overflow(struct pt_regs * regs);
 void handle_signal32(struct ksignal *ksig, sigset_t *oldset,
                     struct pt_regs *regs);
 
-void __init init_IRQ(void);
 void do_io_irq(struct pt_regs *regs);
 void do_ext_irq(struct pt_regs *regs);
 void do_restart(void *arg);
 void __init startup_init(void);
 void die(struct pt_regs *regs, const char *str);
 int setup_profiling_timer(unsigned int multiplier);
-void __init time_init(void);
 unsigned long prepare_ftrace_return(unsigned long parent, unsigned long sp, unsigned long ip);
 
 struct s390_mmap_arg_struct;
index f44f70d..85a00d9 100644 (file)
@@ -176,6 +176,8 @@ static bool reipl_fcp_clear;
 static bool reipl_ccw_clear;
 static bool reipl_eckd_clear;
 
+static unsigned long os_info_flags;
+
 static inline int __diag308(unsigned long subcode, unsigned long addr)
 {
        union register_pair r1;
@@ -1938,6 +1940,20 @@ static void dump_reipl_run(struct shutdown_trigger *trigger)
        struct lowcore *abs_lc;
        unsigned int csum;
 
+       /*
+        * Set the REIPL_CLEAR flag in the os_info flags entry to indicate
+        * that the 'clear' sysfs attribute has been set on the panicked
+        * system for the specified reipl type.
+        * Always set for IPL_TYPE_NSS and IPL_TYPE_UNKNOWN.
+        */
+       if ((reipl_type == IPL_TYPE_CCW && reipl_ccw_clear) ||
+           (reipl_type == IPL_TYPE_ECKD && reipl_eckd_clear) ||
+           (reipl_type == IPL_TYPE_FCP && reipl_fcp_clear) ||
+           (reipl_type == IPL_TYPE_NVME && reipl_nvme_clear) ||
+           reipl_type == IPL_TYPE_NSS ||
+           reipl_type == IPL_TYPE_UNKNOWN)
+               os_info_flags |= OS_INFO_FLAG_REIPL_CLEAR;
+       os_info_entry_add(OS_INFO_FLAGS_ENTRY, &os_info_flags, sizeof(os_info_flags));
        csum = (__force unsigned int)
               csum_partial(reipl_block_actual, reipl_block_actual->hdr.len, 0);
        abs_lc = get_abs_lowcore();
index f1b35dc..42215f9 100644 (file)
@@ -352,7 +352,8 @@ static int apply_rela(Elf_Rela *rela, Elf_Addr base, Elf_Sym *symtab,
                        rc = apply_rela_bits(loc, val, 0, 64, 0, write);
                else if (r_type == R_390_GOTENT ||
                         r_type == R_390_GOTPLTENT) {
-                       val += (Elf_Addr) me->mem[MOD_TEXT].base - loc;
+                       val += (Elf_Addr)me->mem[MOD_TEXT].base +
+                               me->arch.got_offset - loc;
                        rc = apply_rela_bits(loc, val, 1, 32, 1, write);
                }
                break;
index cf1b6e8..9067914 100644 (file)
@@ -76,6 +76,7 @@ static inline int ctr_stcctm(enum cpumf_ctr_set set, u64 range, u64 *dest)
 }
 
 struct cpu_cf_events {
+       refcount_t refcnt;              /* Reference count */
        atomic_t                ctr_set[CPUMF_CTR_SET_MAX];
        u64                     state;          /* For perf_event_open SVC */
        u64                     dev_state;      /* For /dev/hwctr */
@@ -88,9 +89,6 @@ struct cpu_cf_events {
        unsigned int sets;              /* # Counter set saved in memory */
 };
 
-/* Per-CPU event structure for the counter facility */
-static DEFINE_PER_CPU(struct cpu_cf_events, cpu_cf_events);
-
 static unsigned int cfdiag_cpu_speed;  /* CPU speed for CF_DIAG trailer */
 static debug_info_t *cf_dbg;
 
@@ -103,6 +101,221 @@ static debug_info_t *cf_dbg;
  */
 static struct cpumf_ctr_info   cpumf_ctr_info;
 
+struct cpu_cf_ptr {
+       struct cpu_cf_events *cpucf;
+};
+
+static struct cpu_cf_root {            /* Anchor to per CPU data */
+       refcount_t refcnt;              /* Overall active events */
+       struct cpu_cf_ptr __percpu *cfptr;
+} cpu_cf_root;
+
+/*
+ * Serialize event initialization and event removal. Both are called from
+ * user space in task context with perf_event_open() and close()
+ * system calls.
+ *
+ * This mutex serializes functions cpum_cf_alloc_cpu() called at event
+ * initialization via cpumf_pmu_event_init() and function cpum_cf_free_cpu()
+ * called at event removal via the callback function hw_perf_event_destroy()
+ * when the event is deleted. They are serialized to enforce correct
+ * bookkeeping of pointer and reference counts anchored by
+ * struct cpu_cf_root and the access to cpu_cf_root::refcnt and the
+ * per CPU pointers stored in cpu_cf_root::cfptr.
+ */
+static DEFINE_MUTEX(pmc_reserve_mutex);
+
+/*
+ * Get pointer to per-cpu structure.
+ *
+ * Function get_cpu_cfhw() is called from
+ * - cfset_copy_all(): This function is protected by cpus_read_lock(), so
+ *   CPU hotplug remove cannot happen. Event removal requires a close()
+ *   first.
+ *
+ * Function this_cpu_cfhw() is called from perf common code functions:
+ * - pmu_{en|dis}able(), pmu_{add|del}() and pmu_{start|stop}():
+ *   All functions execute with interrupts disabled on that particular CPU.
+ * - cfset_ioctl_{on|off}, cfset_cpu_read(): see comment cfset_copy_all().
+ *
+ * Therefore it is safe to access the CPU specific pointer to the event.
+ */
+static struct cpu_cf_events *get_cpu_cfhw(int cpu)
+{
+       struct cpu_cf_ptr __percpu *p = cpu_cf_root.cfptr;
+
+       if (p) {
+               struct cpu_cf_ptr *q = per_cpu_ptr(p, cpu);
+
+               return q->cpucf;
+       }
+       return NULL;
+}
+
+static struct cpu_cf_events *this_cpu_cfhw(void)
+{
+       return get_cpu_cfhw(smp_processor_id());
+}
+
+/* Disable counter sets on dedicated CPU */
+static void cpum_cf_reset_cpu(void *flags)
+{
+       lcctl(0);
+}
+
+/* Free per CPU data when the last event is removed. */
+static void cpum_cf_free_root(void)
+{
+       if (!refcount_dec_and_test(&cpu_cf_root.refcnt))
+               return;
+       free_percpu(cpu_cf_root.cfptr);
+       cpu_cf_root.cfptr = NULL;
+       irq_subclass_unregister(IRQ_SUBCLASS_MEASUREMENT_ALERT);
+       on_each_cpu(cpum_cf_reset_cpu, NULL, 1);
+       debug_sprintf_event(cf_dbg, 4, "%s2 root.refcnt %u cfptr %px\n",
+                           __func__, refcount_read(&cpu_cf_root.refcnt),
+                           cpu_cf_root.cfptr);
+}
+
+/*
+ * On initialization of first event also allocate per CPU data dynamically.
+ * Start with an array of pointers, the array size is the maximum number of
+ * CPUs possible, which might be larger than the number of CPUs currently
+ * online.
+ */
+static int cpum_cf_alloc_root(void)
+{
+       int rc = 0;
+
+       if (refcount_inc_not_zero(&cpu_cf_root.refcnt))
+               return rc;
+
+       /* The memory is already zeroed. */
+       cpu_cf_root.cfptr = alloc_percpu(struct cpu_cf_ptr);
+       if (cpu_cf_root.cfptr) {
+               refcount_set(&cpu_cf_root.refcnt, 1);
+               on_each_cpu(cpum_cf_reset_cpu, NULL, 1);
+               irq_subclass_register(IRQ_SUBCLASS_MEASUREMENT_ALERT);
+       } else {
+               rc = -ENOMEM;
+       }
+
+       return rc;
+}
+
+/* Free CPU counter data structure for a PMU */
+static void cpum_cf_free_cpu(int cpu)
+{
+       struct cpu_cf_events *cpuhw;
+       struct cpu_cf_ptr *p;
+
+       mutex_lock(&pmc_reserve_mutex);
+       /*
+        * When invoked via CPU hotplug handler, there might be no events
+        * installed or that particular CPU might not have an
+        * event installed. This anchor pointer can be NULL!
+        */
+       if (!cpu_cf_root.cfptr)
+               goto out;
+       p = per_cpu_ptr(cpu_cf_root.cfptr, cpu);
+       cpuhw = p->cpucf;
+       /*
+        * Might be NULL when called from the CPU hotplug handler if no
+        * event is installed on that CPU, only on other CPUs.
+        */
+       if (!cpuhw)
+               goto out;
+
+       if (refcount_dec_and_test(&cpuhw->refcnt)) {
+               kfree(cpuhw);
+               p->cpucf = NULL;
+       }
+       cpum_cf_free_root();
+out:
+       mutex_unlock(&pmc_reserve_mutex);
+}
+
+/* Allocate CPU counter data structure for a PMU. Called under mutex lock. */
+static int cpum_cf_alloc_cpu(int cpu)
+{
+       struct cpu_cf_events *cpuhw;
+       struct cpu_cf_ptr *p;
+       int rc;
+
+       mutex_lock(&pmc_reserve_mutex);
+       rc = cpum_cf_alloc_root();
+       if (rc)
+               goto unlock;
+       p = per_cpu_ptr(cpu_cf_root.cfptr, cpu);
+       cpuhw = p->cpucf;
+
+       if (!cpuhw) {
+               cpuhw = kzalloc(sizeof(*cpuhw), GFP_KERNEL);
+               if (cpuhw) {
+                       p->cpucf = cpuhw;
+                       refcount_set(&cpuhw->refcnt, 1);
+               } else {
+                       rc = -ENOMEM;
+               }
+       } else {
+               refcount_inc(&cpuhw->refcnt);
+       }
+       if (rc) {
+               /*
+                * Error in allocation of event, decrement anchor. Since
+                * cpu_cf_event is not created, its destroy() function is not
+                * invoked. Adjust the reference counter for the anchor.
+                */
+               cpum_cf_free_root();
+       }
+unlock:
+       mutex_unlock(&pmc_reserve_mutex);
+       return rc;
+}
+
+/*
+ * Create/delete per CPU data structures for /dev/hwctr interface and events
+ * created by perf_event_open().
+ * If cpu is -1, track the task on all available CPUs. This requires
+ * allocation of hardware data structures for all CPUs. This setup handles
+ * perf_event_open() with task context and the /dev/hwctr interface.
+ * If cpu is not -1, install the event on this CPU only. This setup handles
+ * perf_event_open() with CPU context.
+ */
+static int cpum_cf_alloc(int cpu)
+{
+       cpumask_var_t mask;
+       int rc;
+
+       if (cpu == -1) {
+               if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+                       return -ENOMEM;
+               for_each_online_cpu(cpu) {
+                       rc = cpum_cf_alloc_cpu(cpu);
+                       if (rc) {
+                               for_each_cpu(cpu, mask)
+                                       cpum_cf_free_cpu(cpu);
+                               break;
+                       }
+                       cpumask_set_cpu(cpu, mask);
+               }
+               free_cpumask_var(mask);
+       } else {
+               rc = cpum_cf_alloc_cpu(cpu);
+       }
+       return rc;
+}
+
+static void cpum_cf_free(int cpu)
+{
+       if (cpu == -1) {
+               for_each_online_cpu(cpu)
+                       cpum_cf_free_cpu(cpu);
+       } else {
+               cpum_cf_free_cpu(cpu);
+       }
+}
+
 #define        CF_DIAG_CTRSET_DEF              0xfeef  /* Counter set header mark */
                                                /* interval in seconds */
 
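The block added above replaces the former static DEFINE_PER_CPU structure with per-CPU data that exists only while events exist, tracked on two levels: cpu_cf_root.refcnt guards the per-CPU pointer array, and every cpu_cf_events carries its own refcnt for the events installed on that CPU. A userspace sketch of that two-level shape; a plain array stands in for alloc_percpu(), plain integers for refcount_t, and the pmc_reserve_mutex serialization is left out:

    #include <stdio.h>
    #include <stdlib.h>

    #define NCPUS 4

    struct cpu_data { int refcnt; };

    static struct {
            int refcnt;                     /* users of the pointer array  */
            struct cpu_data **percpu;       /* stand-in for alloc_percpu() */
    } root;

    /* First user allocates the pointer array, later users take a reference. */
    static int root_get(void)
    {
            if (root.refcnt++ == 0)
                    root.percpu = calloc(NCPUS, sizeof(*root.percpu));
            if (!root.percpu) {
                    root.refcnt--;
                    return -1;
            }
            return 0;
    }

    static void root_put(void)
    {
            if (--root.refcnt == 0) {
                    free(root.percpu);
                    root.percpu = NULL;
            }
    }

    /* Per-CPU level: allocate on first use, otherwise take a reference. */
    static int cpu_get(int cpu)
    {
            if (root_get())
                    return -1;
            if (!root.percpu[cpu]) {
                    root.percpu[cpu] = calloc(1, sizeof(struct cpu_data));
                    if (!root.percpu[cpu]) {
                            root_put();     /* undo the root reference */
                            return -1;
                    }
            }
            root.percpu[cpu]->refcnt++;
            return 0;
    }

    static void cpu_put(int cpu)
    {
            if (--root.percpu[cpu]->refcnt == 0) {
                    free(root.percpu[cpu]);
                    root.percpu[cpu] = NULL;
            }
            root_put();
    }

    int main(void)
    {
            cpu_get(0);     /* first event: allocates the array and CPU 0 data */
            cpu_get(0);     /* second event on CPU 0: reference only */
            cpu_put(0);
            cpu_put(0);     /* last put frees CPU 0 data and the array */
            printf("array freed: %s\n", root.percpu ? "no" : "yes");
            return 0;
    }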
@@ -451,10 +664,10 @@ static int validate_ctr_version(const u64 config, enum cpumf_ctr_set set)
  */
 static void cpumf_pmu_enable(struct pmu *pmu)
 {
-       struct cpu_cf_events *cpuhw = this_cpu_ptr(&cpu_cf_events);
+       struct cpu_cf_events *cpuhw = this_cpu_cfhw();
        int err;
 
-       if (cpuhw->flags & PMU_F_ENABLED)
+       if (!cpuhw || (cpuhw->flags & PMU_F_ENABLED))
                return;
 
        err = lcctl(cpuhw->state | cpuhw->dev_state);
@@ -471,11 +684,11 @@ static void cpumf_pmu_enable(struct pmu *pmu)
  */
 static void cpumf_pmu_disable(struct pmu *pmu)
 {
-       struct cpu_cf_events *cpuhw = this_cpu_ptr(&cpu_cf_events);
-       int err;
+       struct cpu_cf_events *cpuhw = this_cpu_cfhw();
        u64 inactive;
+       int err;
 
-       if (!(cpuhw->flags & PMU_F_ENABLED))
+       if (!cpuhw || !(cpuhw->flags & PMU_F_ENABLED))
                return;
 
        inactive = cpuhw->state & ~((1 << CPUMF_LCCTL_ENABLE_SHIFT) - 1);
@@ -487,58 +700,10 @@ static void cpumf_pmu_disable(struct pmu *pmu)
                cpuhw->flags &= ~PMU_F_ENABLED;
 }
 
-#define PMC_INIT      0UL
-#define PMC_RELEASE   1UL
-
-static void cpum_cf_setup_cpu(void *flags)
-{
-       struct cpu_cf_events *cpuhw = this_cpu_ptr(&cpu_cf_events);
-
-       switch ((unsigned long)flags) {
-       case PMC_INIT:
-               cpuhw->flags |= PMU_F_RESERVED;
-               break;
-
-       case PMC_RELEASE:
-               cpuhw->flags &= ~PMU_F_RESERVED;
-               break;
-       }
-
-       /* Disable CPU counter sets */
-       lcctl(0);
-       debug_sprintf_event(cf_dbg, 5, "%s flags %#x flags %#x state %#llx\n",
-                           __func__, *(int *)flags, cpuhw->flags,
-                           cpuhw->state);
-}
-
-/* Initialize the CPU-measurement counter facility */
-static int __kernel_cpumcf_begin(void)
-{
-       on_each_cpu(cpum_cf_setup_cpu, (void *)PMC_INIT, 1);
-       irq_subclass_register(IRQ_SUBCLASS_MEASUREMENT_ALERT);
-
-       return 0;
-}
-
-/* Release the CPU-measurement counter facility */
-static void __kernel_cpumcf_end(void)
-{
-       on_each_cpu(cpum_cf_setup_cpu, (void *)PMC_RELEASE, 1);
-       irq_subclass_unregister(IRQ_SUBCLASS_MEASUREMENT_ALERT);
-}
-
-/* Number of perf events counting hardware events */
-static atomic_t num_events = ATOMIC_INIT(0);
-/* Used to avoid races in calling reserve/release_cpumf_hardware */
-static DEFINE_MUTEX(pmc_reserve_mutex);
-
 /* Release the PMU if event is the last perf event */
 static void hw_perf_event_destroy(struct perf_event *event)
 {
-       mutex_lock(&pmc_reserve_mutex);
-       if (atomic_dec_return(&num_events) == 0)
-               __kernel_cpumcf_end();
-       mutex_unlock(&pmc_reserve_mutex);
+       cpum_cf_free(event->cpu);
 }
 
 /* CPUMF <-> perf event mappings for kernel+userspace (basic set) */
@@ -562,14 +727,6 @@ static const int cpumf_generic_events_user[] = {
        [PERF_COUNT_HW_BUS_CYCLES]          = -1,
 };
 
-static void cpumf_hw_inuse(void)
-{
-       mutex_lock(&pmc_reserve_mutex);
-       if (atomic_inc_return(&num_events) == 1)
-               __kernel_cpumcf_begin();
-       mutex_unlock(&pmc_reserve_mutex);
-}
-
 static int is_userspace_event(u64 ev)
 {
        return cpumf_generic_events_user[PERF_COUNT_HW_CPU_CYCLES] == ev ||
@@ -653,7 +810,8 @@ static int __hw_perf_event_init(struct perf_event *event, unsigned int type)
        }
 
        /* Initialize for using the CPU-measurement counter facility */
-       cpumf_hw_inuse();
+       if (cpum_cf_alloc(event->cpu))
+               return -ENOMEM;
        event->destroy = hw_perf_event_destroy;
 
        /*
@@ -756,7 +914,7 @@ static void cpumf_pmu_read(struct perf_event *event)
 
 static void cpumf_pmu_start(struct perf_event *event, int flags)
 {
-       struct cpu_cf_events *cpuhw = this_cpu_ptr(&cpu_cf_events);
+       struct cpu_cf_events *cpuhw = this_cpu_cfhw();
        struct hw_perf_event *hwc = &event->hw;
        int i;
 
@@ -830,7 +988,7 @@ static int cfdiag_push_sample(struct perf_event *event,
 
 static void cpumf_pmu_stop(struct perf_event *event, int flags)
 {
-       struct cpu_cf_events *cpuhw = this_cpu_ptr(&cpu_cf_events);
+       struct cpu_cf_events *cpuhw = this_cpu_cfhw();
        struct hw_perf_event *hwc = &event->hw;
        int i;
 
@@ -857,8 +1015,7 @@ static void cpumf_pmu_stop(struct perf_event *event, int flags)
                                                      false);
                        if (cfdiag_diffctr(cpuhw, event->hw.config_base))
                                cfdiag_push_sample(event, cpuhw);
-               } else if (cpuhw->flags & PMU_F_RESERVED) {
-                       /* Only update when PMU not hotplugged off */
+               } else {
                        hw_perf_event_update(event);
                }
                hwc->state |= PERF_HES_UPTODATE;
@@ -867,7 +1024,7 @@ static void cpumf_pmu_stop(struct perf_event *event, int flags)
 
 static int cpumf_pmu_add(struct perf_event *event, int flags)
 {
-       struct cpu_cf_events *cpuhw = this_cpu_ptr(&cpu_cf_events);
+       struct cpu_cf_events *cpuhw = this_cpu_cfhw();
 
        ctr_set_enable(&cpuhw->state, event->hw.config_base);
        event->hw.state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
@@ -880,7 +1037,7 @@ static int cpumf_pmu_add(struct perf_event *event, int flags)
 
 static void cpumf_pmu_del(struct perf_event *event, int flags)
 {
-       struct cpu_cf_events *cpuhw = this_cpu_ptr(&cpu_cf_events);
+       struct cpu_cf_events *cpuhw = this_cpu_cfhw();
        int i;
 
        cpumf_pmu_stop(event, PERF_EF_UPDATE);
@@ -912,29 +1069,83 @@ static struct pmu cpumf_pmu = {
        .read         = cpumf_pmu_read,
 };
 
-static int cpum_cf_setup(unsigned int cpu, unsigned long flags)
-{
-       local_irq_disable();
-       cpum_cf_setup_cpu((void *)flags);
-       local_irq_enable();
-       return 0;
-}
+static struct cfset_session {          /* CPUs and counter set bit mask */
+       struct list_head head;          /* Head of list of active processes */
+} cfset_session = {
+       .head = LIST_HEAD_INIT(cfset_session.head)
+};
+
+static refcount_t cfset_opencnt = REFCOUNT_INIT(0);    /* Access count */
+/*
+ * Synchronize access to the /dev/hwctr device. This mutex protects against
+ * concurrent access to functions cfset_open() and cfset_release().
+ * Same for CPU hotplug add and remove events triggering
+ * cpum_cf_online_cpu() and cpum_cf_offline_cpu().
+ * It also serializes concurrent device ioctl access from multiple
+ * processes accessing /dev/hwctr.
+ *
+ * The mutex protects concurrent access to the /dev/hwctr session management
+ * struct cfset_session and reference counting variable cfset_opencnt.
+ */
+static DEFINE_MUTEX(cfset_ctrset_mutex);
 
+/*
+ * CPU hotplug handles only the /dev/hwctr device.
+ * For perf_event_open() the CPU hotplug handling is done in kernel common
+ * code:
+ * - CPU add: Nothing is done since a file descriptor cannot be created
+ *   and returned to the user.
+ * - CPU delete: Handled by common code via pmu_disable(), pmu_stop() and
+ *   pmu_delete(). The event itself is removed when the file descriptor is
+ *   closed.
+ */
 static int cfset_online_cpu(unsigned int cpu);
+
 static int cpum_cf_online_cpu(unsigned int cpu)
 {
-       debug_sprintf_event(cf_dbg, 4, "%s cpu %d in_irq %ld\n", __func__,
-                           cpu, in_interrupt());
-       cpum_cf_setup(cpu, PMC_INIT);
-       return cfset_online_cpu(cpu);
+       int rc = 0;
+
+       debug_sprintf_event(cf_dbg, 4, "%s cpu %d root.refcnt %d "
+                           "opencnt %d\n", __func__, cpu,
+                           refcount_read(&cpu_cf_root.refcnt),
+                           refcount_read(&cfset_opencnt));
+       /*
+        * Ignore notification for perf_event_open().
+        * Handle only /dev/hwctr device sessions.
+        */
+       mutex_lock(&cfset_ctrset_mutex);
+       if (refcount_read(&cfset_opencnt)) {
+               rc = cpum_cf_alloc_cpu(cpu);
+               if (!rc)
+                       cfset_online_cpu(cpu);
+       }
+       mutex_unlock(&cfset_ctrset_mutex);
+       return rc;
 }
 
 static int cfset_offline_cpu(unsigned int cpu);
+
 static int cpum_cf_offline_cpu(unsigned int cpu)
 {
-       debug_sprintf_event(cf_dbg, 4, "%s cpu %d\n", __func__, cpu);
-       cfset_offline_cpu(cpu);
-       return cpum_cf_setup(cpu, PMC_RELEASE);
+       debug_sprintf_event(cf_dbg, 4, "%s cpu %d root.refcnt %d opencnt %d\n",
+                           __func__, cpu, refcount_read(&cpu_cf_root.refcnt),
+                           refcount_read(&cfset_opencnt));
+       /*
+        * During task exit processing of grouped perf events triggered by CPU
+        * hotplug processing, pmu_disable() is called as part of perf context
+        * removal process. Therefore do not trigger event removal now for
+        * perf_event_open() created events. Perf common code triggers event
+        * destruction when the event file descriptor is closed.
+        *
+        * Handle only /dev/hwctr device sessions.
+        */
+       mutex_lock(&cfset_ctrset_mutex);
+       if (refcount_read(&cfset_opencnt)) {
+               cfset_offline_cpu(cpu);
+               cpum_cf_free_cpu(cpu);
+       }
+       mutex_unlock(&cfset_ctrset_mutex);
+       return 0;
 }
 
 /* Return true if store counter set multiple instruction is available */
@@ -953,13 +1164,13 @@ static void cpumf_measurement_alert(struct ext_code ext_code,
                return;
 
        inc_irq_stat(IRQEXT_CMC);
-       cpuhw = this_cpu_ptr(&cpu_cf_events);
 
        /*
         * Measurement alerts are shared and might happen when the PMU
         * is not reserved.  Ignore these alerts in this case.
         */
-       if (!(cpuhw->flags & PMU_F_RESERVED))
+       cpuhw = this_cpu_cfhw();
+       if (!cpuhw)
                return;
 
        /* counter authorization change alert */
@@ -1039,19 +1250,11 @@ out1:
  * counter set via normal file operations.
  */
 
-static atomic_t cfset_opencnt = ATOMIC_INIT(0);                /* Access count */
-static DEFINE_MUTEX(cfset_ctrset_mutex);/* Synchronize access to hardware */
 struct cfset_call_on_cpu_parm {                /* Parm struct for smp_call_on_cpu */
        unsigned int sets;              /* Counter set bit mask */
        atomic_t cpus_ack;              /* # CPUs successfully executed func */
 };
 
-static struct cfset_session {          /* CPUs and counter set bit mask */
-       struct list_head head;          /* Head of list of active processes */
-} cfset_session = {
-       .head = LIST_HEAD_INIT(cfset_session.head)
-};
-
 struct cfset_request {                 /* CPUs and counter set bit mask */
        unsigned long ctrset;           /* Bit mask of counter set to read */
        cpumask_t mask;                 /* CPU mask to read from */
@@ -1113,11 +1316,11 @@ static void cfset_session_add(struct cfset_request *p)
 /* Stop all counter sets via ioctl interface */
 static void cfset_ioctl_off(void *parm)
 {
-       struct cpu_cf_events *cpuhw = this_cpu_ptr(&cpu_cf_events);
+       struct cpu_cf_events *cpuhw = this_cpu_cfhw();
        struct cfset_call_on_cpu_parm *p = parm;
        int rc;
 
-       /* Check if any counter set used by /dev/hwc */
+       /* Check if any counter set used by /dev/hwctr */
        for (rc = CPUMF_CTR_SET_BASIC; rc < CPUMF_CTR_SET_MAX; ++rc)
                if ((p->sets & cpumf_ctr_ctl[rc])) {
                        if (!atomic_dec_return(&cpuhw->ctr_set[rc])) {
@@ -1141,7 +1344,7 @@ static void cfset_ioctl_off(void *parm)
 /* Start counter sets on particular CPU */
 static void cfset_ioctl_on(void *parm)
 {
-       struct cpu_cf_events *cpuhw = this_cpu_ptr(&cpu_cf_events);
+       struct cpu_cf_events *cpuhw = this_cpu_cfhw();
        struct cfset_call_on_cpu_parm *p = parm;
        int rc;
 
@@ -1163,7 +1366,7 @@ static void cfset_ioctl_on(void *parm)
 
 static void cfset_release_cpu(void *p)
 {
-       struct cpu_cf_events *cpuhw = this_cpu_ptr(&cpu_cf_events);
+       struct cpu_cf_events *cpuhw = this_cpu_cfhw();
        int rc;
 
        debug_sprintf_event(cf_dbg, 4, "%s state %#llx dev_state %#llx\n",
@@ -1203,27 +1406,41 @@ static int cfset_release(struct inode *inode, struct file *file)
                kfree(file->private_data);
                file->private_data = NULL;
        }
-       if (!atomic_dec_return(&cfset_opencnt))
+       if (refcount_dec_and_test(&cfset_opencnt)) {    /* Last close */
                on_each_cpu(cfset_release_cpu, NULL, 1);
+               cpum_cf_free(-1);
+       }
        mutex_unlock(&cfset_ctrset_mutex);
-
-       hw_perf_event_destroy(NULL);
        return 0;
 }
 
+/*
+ * Open via /dev/hwctr device. Allocate all per CPU resources on the first
+ * open of the device. The last close releases all per CPU resources.
+ * Parallel perf_event_open system calls also use per CPU resources.
+ * These invocations are handled via reference counting on the per CPU data
+ * structures.
+ */
 static int cfset_open(struct inode *inode, struct file *file)
 {
-       if (!capable(CAP_SYS_ADMIN))
+       int rc = 0;
+
+       if (!perfmon_capable())
                return -EPERM;
+       file->private_data = NULL;
+
        mutex_lock(&cfset_ctrset_mutex);
-       if (atomic_inc_return(&cfset_opencnt) == 1)
-               cfset_session_init();
+       if (!refcount_inc_not_zero(&cfset_opencnt)) {   /* First open */
+               rc = cpum_cf_alloc(-1);
+               if (!rc) {
+                       cfset_session_init();
+                       refcount_set(&cfset_opencnt, 1);
+               }
+       }
        mutex_unlock(&cfset_ctrset_mutex);
 
-       cpumf_hw_inuse();
-       file->private_data = NULL;
        /* nonseekable_open() never fails */
-       return nonseekable_open(inode, file);
+       return rc ?: nonseekable_open(inode, file);
 }
 
 static int cfset_all_start(struct cfset_request *req)
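cfset_open() and cfset_release() above use refcount_inc_not_zero() to tell the first open (counter still zero, per-CPU resources must be allocated) from later opens, and refcount_dec_and_test() to free everything on the last close. A small stdatomic model of those two helpers and of the open/close flow; the names are reused only for readability, these are not the kernel implementations, and the cfset_ctrset_mutex serialization around the whole path is omitted:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static bool refcount_inc_not_zero(atomic_int *r)
    {
            int old = atomic_load(r);

            while (old != 0)
                    if (atomic_compare_exchange_weak(r, &old, old + 1))
                            return true;
            return false;   /* still zero: caller must do first-time setup */
    }

    static bool refcount_dec_and_test(atomic_int *r)
    {
            return atomic_fetch_sub(r, 1) == 1;     /* true on the last put */
    }

    static atomic_int opencnt;

    static void open_dev(void)
    {
            if (!refcount_inc_not_zero(&opencnt)) {
                    puts("first open: allocate per-CPU buffers");
                    atomic_store(&opencnt, 1);
            }
    }

    static void close_dev(void)
    {
            if (refcount_dec_and_test(&opencnt))
                    puts("last close: free per-CPU buffers");
    }

    int main(void)
    {
            open_dev();
            open_dev();
            close_dev();
            close_dev();
            return 0;
    }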
@@ -1280,7 +1497,7 @@ static int cfset_all_copy(unsigned long arg, cpumask_t *mask)
        ctrset_read = (struct s390_ctrset_read __user *)arg;
        uptr = ctrset_read->data;
        for_each_cpu(cpu, mask) {
-               struct cpu_cf_events *cpuhw = per_cpu_ptr(&cpu_cf_events, cpu);
+               struct cpu_cf_events *cpuhw = get_cpu_cfhw(cpu);
                struct s390_ctrset_cpudata __user *ctrset_cpudata;
 
                ctrset_cpudata = uptr;
@@ -1324,7 +1541,7 @@ static size_t cfset_cpuset_read(struct s390_ctrset_setdata *p, int ctrset,
 /* Read all counter sets. */
 static void cfset_cpu_read(void *parm)
 {
-       struct cpu_cf_events *cpuhw = this_cpu_ptr(&cpu_cf_events);
+       struct cpu_cf_events *cpuhw = this_cpu_cfhw();
        struct cfset_call_on_cpu_parm *p = parm;
        int set, set_size;
        size_t space;
@@ -1348,9 +1565,9 @@ static void cfset_cpu_read(void *parm)
                        cpuhw->used += space;
                        cpuhw->sets += 1;
                }
+               debug_sprintf_event(cf_dbg, 4, "%s sets %d used %zd\n", __func__,
+                                   cpuhw->sets, cpuhw->used);
        }
-       debug_sprintf_event(cf_dbg, 4, "%s sets %d used %zd\n", __func__,
-                           cpuhw->sets, cpuhw->used);
 }
 
 static int cfset_all_read(unsigned long arg, struct cfset_request *req)
@@ -1502,6 +1719,7 @@ static struct miscdevice cfset_dev = {
        .name   = S390_HWCTR_DEVICE,
        .minor  = MISC_DYNAMIC_MINOR,
        .fops   = &cfset_fops,
+       .mode   = 0666,
 };
 
 /* Hotplug add of a CPU. Scan through all active processes and add
@@ -1512,7 +1730,6 @@ static int cfset_online_cpu(unsigned int cpu)
        struct cfset_call_on_cpu_parm p;
        struct cfset_request *rp;
 
-       mutex_lock(&cfset_ctrset_mutex);
        if (!list_empty(&cfset_session.head)) {
                list_for_each_entry(rp, &cfset_session.head, node) {
                        p.sets = rp->ctrset;
@@ -1520,19 +1737,18 @@ static int cfset_online_cpu(unsigned int cpu)
                        cpumask_set_cpu(cpu, &rp->mask);
                }
        }
-       mutex_unlock(&cfset_ctrset_mutex);
        return 0;
 }
 
 /* Hotplug remove of a CPU. Scan through all active processes and clear
  * that CPU from the list of CPUs supplied with ioctl(..., START, ...).
+ * Adjust reference counts.
  */
 static int cfset_offline_cpu(unsigned int cpu)
 {
        struct cfset_call_on_cpu_parm p;
        struct cfset_request *rp;
 
-       mutex_lock(&cfset_ctrset_mutex);
        if (!list_empty(&cfset_session.head)) {
                list_for_each_entry(rp, &cfset_session.head, node) {
                        p.sets = rp->ctrset;
@@ -1540,7 +1756,6 @@ static int cfset_offline_cpu(unsigned int cpu)
                        cpumask_clear_cpu(cpu, &rp->mask);
                }
        }
-       mutex_unlock(&cfset_ctrset_mutex);
        return 0;
 }
 
@@ -1618,7 +1833,8 @@ static int cfdiag_event_init(struct perf_event *event)
        }
 
        /* Initialize for using the CPU-measurement counter facility */
-       cpumf_hw_inuse();
+       if (cpum_cf_alloc(event->cpu))
+               return -ENOMEM;
        event->destroy = hw_perf_event_destroy;
 
        err = cfdiag_event_init2(event);
index 7ef72f5..8ecfbce 100644 (file)
@@ -1271,16 +1271,6 @@ static void hw_collect_samples(struct perf_event *event, unsigned long *sdbt,
        }
 }
 
-static inline __uint128_t __cdsg(__uint128_t *ptr, __uint128_t old, __uint128_t new)
-{
-       asm volatile(
-               "       cdsg    %[old],%[new],%[ptr]\n"
-               : [old] "+d" (old), [ptr] "+QS" (*ptr)
-               : [new] "d" (new)
-               : "memory", "cc");
-       return old;
-}
-
 /* hw_perf_event_update() - Process sampling buffer
  * @event:     The perf event
  * @flush_all: Flag to also flush partially filled sample-data-blocks
@@ -1352,7 +1342,7 @@ static void hw_perf_event_update(struct perf_event *event, int flush_all)
                        new.f = 0;
                        new.a = 1;
                        new.overflow = 0;
-                       prev.val = __cdsg(&te->header.val, old.val, new.val);
+                       prev.val = cmpxchg128(&te->header.val, old.val, new.val);
                } while (prev.val != old.val);
 
                /* Advance to next sample-data-block */
@@ -1562,7 +1552,7 @@ static bool aux_set_alert(struct aux_buffer *aux, unsigned long alert_index,
                }
                new.a = 1;
                new.overflow = 0;
-               prev.val = __cdsg(&te->header.val, old.val, new.val);
+               prev.val = cmpxchg128(&te->header.val, old.val, new.val);
        } while (prev.val != old.val);
        return true;
 }
@@ -1636,7 +1626,7 @@ static bool aux_reset_buffer(struct aux_buffer *aux, unsigned long range,
                                new.a = 1;
                        else
                                new.a = 0;
-                       prev.val = __cdsg(&te->header.val, old.val, new.val);
+                       prev.val = cmpxchg128(&te->header.val, old.val, new.val);
                } while (prev.val != old.val);
                *overflow += orig_overflow;
        }
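The hunks above keep the existing optimistic update loops and only swap the private __cdsg() inline assembly for the generic cmpxchg128(): read the sample-data-block header, build the new header from what was read, try to install it, and retry if the hardware or another CPU changed it in the meantime. The shape of such a loop, modeled on a 64-bit word with C11 atomics and a purely illustrative bit layout:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Set an "alert" bit and clear a 16-bit counter in one atomic update. */
    static void set_alert_clear_count(_Atomic uint64_t *hdr)
    {
            uint64_t old = atomic_load(hdr);
            uint64_t new;

            do {
                    new = (old | (UINT64_C(1) << 63)) & ~UINT64_C(0xffff);
            } while (!atomic_compare_exchange_weak(hdr, &old, new));
    }

    int main(void)
    {
            _Atomic uint64_t header = 0x12340007;

            set_alert_clear_count(&header);
            printf("%#llx\n", (unsigned long long)atomic_load(&header));
            return 0;
    }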
index a7b339c..fe7d177 100644 (file)
@@ -36,7 +36,7 @@ struct paicrypt_map {
        unsigned long *page;            /* Page for CPU to store counters */
        struct pai_userdata *save;      /* Page to store no-zero counters */
        unsigned int active_events;     /* # of PAI crypto users */
-       unsigned int refcnt;            /* Reference count mapped buffers */
+       refcount_t refcnt;              /* Reference count mapped buffers */
        enum paievt_mode mode;          /* Type of event */
        struct perf_event *event;       /* Perf event for sampling */
 };
@@ -57,10 +57,11 @@ static void paicrypt_event_destroy(struct perf_event *event)
        static_branch_dec(&pai_key);
        mutex_lock(&pai_reserve_mutex);
        debug_sprintf_event(cfm_dbg, 5, "%s event %#llx cpu %d users %d"
-                           " mode %d refcnt %d\n", __func__,
+                           " mode %d refcnt %u\n", __func__,
                            event->attr.config, event->cpu,
-                           cpump->active_events, cpump->mode, cpump->refcnt);
-       if (!--cpump->refcnt) {
+                           cpump->active_events, cpump->mode,
+                           refcount_read(&cpump->refcnt));
+       if (refcount_dec_and_test(&cpump->refcnt)) {
                debug_sprintf_event(cfm_dbg, 4, "%s page %#lx save %p\n",
                                    __func__, (unsigned long)cpump->page,
                                    cpump->save);
@@ -149,8 +150,10 @@ static int paicrypt_busy(struct perf_event_attr *a, struct paicrypt_map *cpump)
        /* Allocate memory for counter page and counter extraction.
         * Only the first counting event has to allocate a page.
         */
-       if (cpump->page)
+       if (cpump->page) {
+               refcount_inc(&cpump->refcnt);
                goto unlock;
+       }
 
        rc = -ENOMEM;
        cpump->page = (unsigned long *)get_zeroed_page(GFP_KERNEL);
@@ -164,18 +167,18 @@ static int paicrypt_busy(struct perf_event_attr *a, struct paicrypt_map *cpump)
                goto unlock;
        }
        rc = 0;
+       refcount_set(&cpump->refcnt, 1);
 
 unlock:
        /* If rc is non-zero, do not set mode and reference count */
        if (!rc) {
-               cpump->refcnt++;
                cpump->mode = a->sample_period ? PAI_MODE_SAMPLING
                                               : PAI_MODE_COUNTING;
        }
        debug_sprintf_event(cfm_dbg, 5, "%s sample_period %#llx users %d"
-                           " mode %d refcnt %d page %#lx save %p rc %d\n",
+                           " mode %d refcnt %u page %#lx save %p rc %d\n",
                            __func__, a->sample_period, cpump->active_events,
-                           cpump->mode, cpump->refcnt,
+                           cpump->mode, refcount_read(&cpump->refcnt),
                            (unsigned long)cpump->page, cpump->save, rc);
        mutex_unlock(&pai_reserve_mutex);
        return rc;
index fcea307..3b4f384 100644 (file)
@@ -50,7 +50,7 @@ struct paiext_map {
        struct pai_userdata *save;      /* Area to store non-zero counters */
        enum paievt_mode mode;          /* Type of event */
        unsigned int active_events;     /* # of PAI Extension users */
-       unsigned int refcnt;
+       refcount_t refcnt;
        struct perf_event *event;       /* Perf event for sampling */
        struct paiext_cb *paiext_cb;    /* PAI extension control block area */
 };
@@ -60,14 +60,14 @@ struct paiext_mapptr {
 };
 
 static struct paiext_root {            /* Anchor to per CPU data */
-       int refcnt;                     /* Overall active events */
+       refcount_t refcnt;              /* Overall active events */
        struct paiext_mapptr __percpu *mapptr;
 } paiext_root;
 
 /* Free per CPU data when the last event is removed. */
 static void paiext_root_free(void)
 {
-       if (!--paiext_root.refcnt) {
+       if (refcount_dec_and_test(&paiext_root.refcnt)) {
                free_percpu(paiext_root.mapptr);
                paiext_root.mapptr = NULL;
        }
@@ -80,7 +80,7 @@ static void paiext_root_free(void)
  */
 static int paiext_root_alloc(void)
 {
-       if (++paiext_root.refcnt == 1) {
+       if (!refcount_inc_not_zero(&paiext_root.refcnt)) {
                /* The memory is already zeroed. */
                paiext_root.mapptr = alloc_percpu(struct paiext_mapptr);
                if (!paiext_root.mapptr) {
@@ -91,6 +91,7 @@ static int paiext_root_alloc(void)
                         */
                        return -ENOMEM;
                }
+               refcount_set(&paiext_root.refcnt, 1);
        }
        return 0;
 }
@@ -122,7 +123,7 @@ static void paiext_event_destroy(struct perf_event *event)
 
        mutex_lock(&paiext_reserve_mutex);
        cpump->event = NULL;
-       if (!--cpump->refcnt)           /* Last reference gone */
+       if (refcount_dec_and_test(&cpump->refcnt))      /* Last reference gone */
                paiext_free(mp);
        paiext_root_free();
        mutex_unlock(&paiext_reserve_mutex);
@@ -163,7 +164,7 @@ static int paiext_alloc(struct perf_event_attr *a, struct perf_event *event)
                rc = -ENOMEM;
                cpump = kzalloc(sizeof(*cpump), GFP_KERNEL);
                if (!cpump)
-                       goto unlock;
+                       goto undo;
 
                /* Allocate memory for counter area and counter extraction.
                 * These are
@@ -183,8 +184,9 @@ static int paiext_alloc(struct perf_event_attr *a, struct perf_event *event)
                                             GFP_KERNEL);
                if (!cpump->save || !cpump->area || !cpump->paiext_cb) {
                        paiext_free(mp);
-                       goto unlock;
+                       goto undo;
                }
+               refcount_set(&cpump->refcnt, 1);
                cpump->mode = a->sample_period ? PAI_MODE_SAMPLING
                                               : PAI_MODE_COUNTING;
        } else {
@@ -195,15 +197,15 @@ static int paiext_alloc(struct perf_event_attr *a, struct perf_event *event)
                if (cpump->mode == PAI_MODE_SAMPLING ||
                    (cpump->mode == PAI_MODE_COUNTING && a->sample_period)) {
                        rc = -EBUSY;
-                       goto unlock;
+                       goto undo;
                }
+               refcount_inc(&cpump->refcnt);
        }
 
        rc = 0;
        cpump->event = event;
-       ++cpump->refcnt;
 
-unlock:
+undo:
        if (rc) {
                /* Error in allocation of event, decrement anchor. Since
                 * the event is not created, its destroy() function is never
@@ -211,6 +213,7 @@ unlock:
                 */
                paiext_root_free();
        }
+unlock:
        mutex_unlock(&paiext_reserve_mutex);
        /* If rc is non-zero, no increment of counter/sampler was done. */
        return rc;
index b68f475..a6935af 100644 (file)
 448  common    process_mrelease        sys_process_mrelease            sys_process_mrelease
 449  common    futex_waitv             sys_futex_waitv                 sys_futex_waitv
 450  common    set_mempolicy_home_node sys_set_mempolicy_home_node     sys_set_mempolicy_home_node
+451  common    cachestat               sys_cachestat                   sys_cachestat
index 6b7b6d5..2762781 100644 (file)
@@ -102,6 +102,11 @@ void __init time_early_init(void)
                        ((long) qui.old_leap * 4096000000L);
 }
 
+unsigned long long noinstr sched_clock_noinstr(void)
+{
+       return tod_to_ns(__get_tod_clock_monotonic());
+}
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
index cb2ee06..3c62d1b 100644 (file)
@@ -294,6 +294,8 @@ again:
 
        rc = -ENXIO;
        ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
+       if (!ptep)
+               goto out;
        if (pte_present(*ptep) && !(pte_val(*ptep) & _PAGE_INVALID) && pte_write(*ptep)) {
                page = pte_page(*ptep);
                rc = -EAGAIN;
index da6dac3..9bd0a87 100644 (file)
@@ -2777,7 +2777,7 @@ static struct page *get_map_page(struct kvm *kvm, u64 uaddr)
 
        mmap_read_lock(kvm->mm);
        get_user_pages_remote(kvm->mm, uaddr, 1, FOLL_WRITE,
-                             &page, NULL, NULL);
+                             &page, NULL);
        mmap_read_unlock(kvm->mm);
        return page;
 }
index 580d2e3..7c50eca 100644 (file)
@@ -3,7 +3,7 @@
 # Makefile for s390-specific library files..
 #
 
-lib-y += delay.o string.o uaccess.o find.o spinlock.o
+lib-y += delay.o string.o uaccess.o find.o spinlock.o tishift.o
 obj-y += mem.o xor.o
 lib-$(CONFIG_KPROBES) += probes.o
 lib-$(CONFIG_UPROBES) += probes.o
diff --git a/arch/s390/lib/tishift.S b/arch/s390/lib/tishift.S
new file mode 100644 (file)
index 0000000..de33cf0
--- /dev/null
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/linkage.h>
+#include <asm/nospec-insn.h>
+#include <asm/export.h>
+
+       .section .noinstr.text, "ax"
+
+       GEN_BR_THUNK %r14
+
+SYM_FUNC_START(__ashlti3)
+       lmg     %r0,%r1,0(%r3)
+       cije    %r4,0,1f
+       lhi     %r3,64
+       sr      %r3,%r4
+       jnh     0f
+       srlg    %r3,%r1,0(%r3)
+       sllg    %r0,%r0,0(%r4)
+       sllg    %r1,%r1,0(%r4)
+       ogr     %r0,%r3
+       j       1f
+0:     sllg    %r0,%r1,-64(%r4)
+       lghi    %r1,0
+1:     stmg    %r0,%r1,0(%r2)
+       BR_EX   %r14
+SYM_FUNC_END(__ashlti3)
+EXPORT_SYMBOL(__ashlti3)
+
+SYM_FUNC_START(__ashrti3)
+       lmg     %r0,%r1,0(%r3)
+       cije    %r4,0,1f
+       lhi     %r3,64
+       sr      %r3,%r4
+       jnh     0f
+       sllg    %r3,%r0,0(%r3)
+       srlg    %r1,%r1,0(%r4)
+       srag    %r0,%r0,0(%r4)
+       ogr     %r1,%r3
+       j       1f
+0:     srag    %r1,%r0,-64(%r4)
+       srag    %r0,%r0,63
+1:     stmg    %r0,%r1,0(%r2)
+       BR_EX   %r14
+SYM_FUNC_END(__ashrti3)
+EXPORT_SYMBOL(__ashrti3)
+
+SYM_FUNC_START(__lshrti3)
+       lmg     %r0,%r1,0(%r3)
+       cije    %r4,0,1f
+       lhi     %r3,64
+       sr      %r3,%r4
+       jnh     0f
+       sllg    %r3,%r0,0(%r3)
+       srlg    %r1,%r1,0(%r4)
+       srlg    %r0,%r0,0(%r4)
+       ogr     %r1,%r3
+       j       1f
+0:     srlg    %r1,%r0,-64(%r4)
+       lghi    %r0,0
+1:     stmg    %r0,%r1,0(%r2)
+       BR_EX   %r14
+SYM_FUNC_END(__lshrti3)
+EXPORT_SYMBOL(__lshrti3)
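The new tishift.S above provides the 128-bit shift helpers that the compiler may call for __int128 arithmetic from the noinstr text. The logic of __lshrti3, restated as portable C over two 64-bit halves (a sketch of the algorithm, not the s390 implementation): a shift of 0 passes the value through, shifts of 1..63 move bits from the high half into the low half, and shifts of 64..127 take the low half entirely from the high half.

    #include <stdio.h>
    #include <stdint.h>

    struct u128_halves { uint64_t hi, lo; };

    /* 128-bit logical shift right; n must be in the range 0..127. */
    static struct u128_halves lshrti3(struct u128_halves v, unsigned int n)
    {
            struct u128_halves r;

            if (n == 0)
                    return v;
            if (n < 64) {
                    r.lo = (v.lo >> n) | (v.hi << (64 - n));
                    r.hi = v.hi >> n;
            } else {
                    r.lo = v.hi >> (n - 64);
                    r.hi = 0;
            }
            return r;
    }

    int main(void)
    {
            struct u128_halves v = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL };
            struct u128_halves r = lshrti3(v, 68);

            printf("%016llx%016llx\n", (unsigned long long)r.hi,
                   (unsigned long long)r.lo);
            return 0;
    }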
index dc90d1e..f4b6fc7 100644 (file)
@@ -895,12 +895,12 @@ static int gmap_pte_op_fixup(struct gmap *gmap, unsigned long gaddr,
 
 /**
  * gmap_pte_op_end - release the page table lock
- * @ptl: pointer to the spinlock pointer
+ * @ptep: pointer to the locked pte
+ * @ptl: pointer to the page table spinlock
  */
-static void gmap_pte_op_end(spinlock_t *ptl)
+static void gmap_pte_op_end(pte_t *ptep, spinlock_t *ptl)
 {
-       if (ptl)
-               spin_unlock(ptl);
+       pte_unmap_unlock(ptep, ptl);
 }
 
 /**
@@ -1011,7 +1011,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
 {
        int rc;
        pte_t *ptep;
-       spinlock_t *ptl = NULL;
+       spinlock_t *ptl;
        unsigned long pbits = 0;
 
        if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
@@ -1025,7 +1025,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
        pbits |= (bits & GMAP_NOTIFY_SHADOW) ? PGSTE_VSIE_BIT : 0;
        /* Protect and unlock. */
        rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);
-       gmap_pte_op_end(ptl);
+       gmap_pte_op_end(ptep, ptl);
        return rc;
 }
 
@@ -1154,7 +1154,7 @@ int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val)
                                /* Do *NOT* clear the _PAGE_INVALID bit! */
                                rc = 0;
                        }
-                       gmap_pte_op_end(ptl);
+                       gmap_pte_op_end(ptep, ptl);
                }
                if (!rc)
                        break;
@@ -1248,7 +1248,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
                        if (!rc)
                                gmap_insert_rmap(sg, vmaddr, rmap);
                        spin_unlock(&sg->guest_table_lock);
-                       gmap_pte_op_end(ptl);
+                       gmap_pte_op_end(ptep, ptl);
                }
                radix_tree_preload_end();
                if (rc) {
@@ -2156,7 +2156,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
                        tptep = (pte_t *) gmap_table_walk(sg, saddr, 0);
                        if (!tptep) {
                                spin_unlock(&sg->guest_table_lock);
-                               gmap_pte_op_end(ptl);
+                               gmap_pte_op_end(sptep, ptl);
                                radix_tree_preload_end();
                                break;
                        }
@@ -2167,7 +2167,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
                                rmap = NULL;
                                rc = 0;
                        }
-                       gmap_pte_op_end(ptl);
+                       gmap_pte_op_end(sptep, ptl);
                        spin_unlock(&sg->guest_table_lock);
                }
                radix_tree_preload_end();
@@ -2495,7 +2495,7 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long bitmap[4],
                                continue;
                        if (ptep_test_and_clear_uc(gmap->mm, vmaddr, ptep))
                                set_bit(i, bitmap);
-                       spin_unlock(ptl);
+                       pte_unmap_unlock(ptep, ptl);
                }
        }
        gmap_pmd_op_end(gmap, pmdp);
@@ -2537,7 +2537,12 @@ static inline void thp_split_mm(struct mm_struct *mm)
  * Remove all empty zero pages from the mapping for lazy refaulting
  * - This must be called after mm->context.has_pgste is set, to avoid
  *   future creation of zero pages
- * - This must be called after THP was enabled
+ * - This must be called after THP was disabled.
+ *
+ * mm contracts with s390, that even if mm were to remove a page table,
+ * racing with the loop below and so causing pte_offset_map_lock() to fail,
+ * it will never insert a page table containing empty zero pages once
+ * mm_forbids_zeropage(mm) i.e. mm->context.has_pgste is set.
  */
 static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
                           unsigned long end, struct mm_walk *walk)
@@ -2549,6 +2554,8 @@ static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
                spinlock_t *ptl;
 
                ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+               if (!ptep)
+                       break;
                if (is_zero_pfn(pte_pfn(*ptep)))
                        ptep_xchg_direct(walk->mm, addr, ptep, __pte(_PAGE_INVALID));
                pte_unmap_unlock(ptep, ptl);
index 5ba3bd8..ca5a418 100644 (file)
@@ -4,6 +4,7 @@
  * Author(s): Jan Glauber <jang@linux.vnet.ibm.com>
  */
 #include <linux/hugetlb.h>
+#include <linux/proc_fs.h>
 #include <linux/vmalloc.h>
 #include <linux/mm.h>
 #include <asm/cacheflush.h>
index 6effb24..3bd2ab2 100644 (file)
@@ -829,7 +829,7 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
        default:
                return -EFAULT;
        }
-
+again:
        ptl = pmd_lock(mm, pmdp);
        if (!pmd_present(*pmdp)) {
                spin_unlock(ptl);
@@ -850,6 +850,8 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
        spin_unlock(ptl);
 
        ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+       if (!ptep)
+               goto again;
        new = old = pgste_get_lock(ptep);
        pgste_val(new) &= ~(PGSTE_GR_BIT | PGSTE_GC_BIT |
                            PGSTE_ACC_BITS | PGSTE_FP_BIT);
@@ -938,7 +940,7 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
        default:
                return -EFAULT;
        }
-
+again:
        ptl = pmd_lock(mm, pmdp);
        if (!pmd_present(*pmdp)) {
                spin_unlock(ptl);
@@ -955,6 +957,8 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
        spin_unlock(ptl);
 
        ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+       if (!ptep)
+               goto again;
        new = old = pgste_get_lock(ptep);
        /* Reset guest reference bit only */
        pgste_val(new) &= ~PGSTE_GR_BIT;
@@ -1000,7 +1004,7 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
        default:
                return -EFAULT;
        }
-
+again:
        ptl = pmd_lock(mm, pmdp);
        if (!pmd_present(*pmdp)) {
                spin_unlock(ptl);
@@ -1017,6 +1021,8 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
        spin_unlock(ptl);
 
        ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+       if (!ptep)
+               goto again;
        pgste = pgste_get_lock(ptep);
        *key = (pgste_val(pgste) & (PGSTE_ACC_BITS | PGSTE_FP_BIT)) >> 56;
        paddr = pte_val(*ptep) & PAGE_MASK;
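The hunks above all cope with pte_offset_map_lock() now being able to return NULL when the page table disappears concurrently; set_guest_storage_key() and its siblings react by jumping back to the new again: label and re-walking from the pmd level. A toy model of that retry shape; lookup_entry() and failures_to_inject are hypothetical pieces of the model, not kernel APIs:

    #include <stdio.h>

    struct entry { int key; };

    static int failures_to_inject = 1;

    /* Stand-in for the fine-grained lookup that may fail under us. */
    static struct entry *lookup_entry(struct entry *e)
    {
            if (failures_to_inject > 0) {
                    failures_to_inject--;
                    return NULL;            /* "table vanished", try again */
            }
            return e;
    }

    static void set_key(struct entry *e, int key)
    {
            struct entry *p;

    again:
            /* the coarse-level checks (the pmd_lock() path) would be redone here */
            p = lookup_entry(e);
            if (!p)
                    goto again;
            p->key = key;
    }

    int main(void)
    {
            struct entry e = { .key = 0 };

            set_key(&e, 9);
            printf("key = %d after %s\n", e.key,
                   failures_to_inject ? "no retry" : "one retry");
            return 0;
    }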
index 5b22c6e..b9dcb4a 100644 (file)
@@ -667,7 +667,15 @@ static void __init memblock_region_swap(void *a, void *b, int size)
 
 #ifdef CONFIG_KASAN
 #define __sha(x)       ((unsigned long)kasan_mem_to_shadow((void *)x))
+
+static inline int set_memory_kasan(unsigned long start, unsigned long end)
+{
+       start = PAGE_ALIGN_DOWN(__sha(start));
+       end = PAGE_ALIGN(__sha(end));
+       return set_memory_rwnx(start, (end - start) >> PAGE_SHIFT);
+}
 #endif
+
 /*
  * map whole physical memory to virtual memory (identity mapping)
  * we reserve enough space in the vmalloc area for vmemmap to hotplug
@@ -737,10 +745,8 @@ void __init vmem_map_init(void)
        }
 
 #ifdef CONFIG_KASAN
-       for_each_mem_range(i, &base, &end) {
-               set_memory_rwnx(__sha(base),
-                               (__sha(end) - __sha(base)) >> PAGE_SHIFT);
-       }
+       for_each_mem_range(i, &base, &end)
+               set_memory_kasan(base, end);
 #endif
        set_memory_rox((unsigned long)_stext,
                       (unsigned long)(_etext - _stext) >> PAGE_SHIFT);
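set_memory_kasan() above page-aligns the KASAN shadow range before changing its protection: generic KASAN keeps one shadow byte per 8 bytes of memory, so even a page-aligned memory range usually maps to a shadow range that is not page aligned, and the start must be rounded down while the end is rounded up. A tiny illustration of that rounding with a 4 KiB page and an arbitrary shadow offset (neither the offset nor the addresses reflect the real s390 layout):

    #include <stdio.h>

    #define PAGE_SIZE               4096ULL
    #define PAGE_ALIGN(x)           (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
    #define PAGE_ALIGN_DOWN(x)      ((x) & ~(PAGE_SIZE - 1))

    #define SHADOW_OFFSET           0x100000000ULL
    #define sha(addr)               (SHADOW_OFFSET + ((addr) >> 3))

    int main(void)
    {
            unsigned long long start = 0x10001000ULL;   /* page-aligned memory */
            unsigned long long end   = 0x10003000ULL;

            printf("shadow raw    : %#llx - %#llx\n", sha(start), sha(end));
            printf("shadow aligned: %#llx - %#llx\n",
                   PAGE_ALIGN_DOWN(sha(start)), PAGE_ALIGN(sha(end)));
            return 0;
    }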
index cc8cf5a..4e930f5 100644 (file)
@@ -10,7 +10,7 @@ PURGATORY_OBJS = $(addprefix $(obj)/,$(purgatory-y))
 $(obj)/sha256.o: $(srctree)/lib/crypto/sha256.c FORCE
        $(call if_changed_rule,cc_o_c)
 
-CFLAGS_sha256.o := -D__DISABLE_EXPORTS
+CFLAGS_sha256.o := -D__DISABLE_EXPORTS -D__NO_FORTIFY
 
 $(obj)/mem.o: $(srctree)/arch/s390/lib/mem.S FORCE
        $(call if_changed_rule,as_o_S)
index 9652d36..e339745 100644 (file)
@@ -6,6 +6,7 @@ config SUPERH
        select ARCH_ENABLE_MEMORY_HOTREMOVE if SPARSEMEM && MMU
        select ARCH_HAVE_NMI_SAFE_CMPXCHG if (GUSA_RB || CPU_SH4A)
        select ARCH_HAS_BINFMT_FLAT if !MMU
+       select ARCH_HAS_CPU_FINALIZE_INIT
        select ARCH_HAS_CURRENT_STACK_POINTER
        select ARCH_HAS_GIGANTIC_PAGE
        select ARCH_HAS_GCOV_PROFILE_ALL
index ab91704..89cd4a3 100644 (file)
@@ -198,7 +198,7 @@ int request_dma(unsigned int chan, const char *dev_id)
        if (atomic_xchg(&channel->busy, 1))
                return -EBUSY;
 
-       strlcpy(channel->dev_id, dev_id, sizeof(channel->dev_id));
+       strscpy(channel->dev_id, dev_id, sizeof(channel->dev_id));
 
        if (info->ops->request) {
                result = info->ops->request(channel);
index 059791f..cf1c10f 100644 (file)
@@ -71,6 +71,11 @@ static inline int arch_atomic_fetch_##op(int i, atomic_t *v)         \
 ATOMIC_OPS(add)
 ATOMIC_OPS(sub)
 
+#define arch_atomic_add_return arch_atomic_add_return
+#define arch_atomic_sub_return arch_atomic_sub_return
+#define arch_atomic_fetch_add  arch_atomic_fetch_add
+#define arch_atomic_fetch_sub  arch_atomic_fetch_sub
+
 #undef ATOMIC_OPS
 #define ATOMIC_OPS(op) ATOMIC_OP(op) ATOMIC_FETCH_OP(op)
 
@@ -78,6 +83,10 @@ ATOMIC_OPS(and)
 ATOMIC_OPS(or)
 ATOMIC_OPS(xor)
 
+#define arch_atomic_fetch_and  arch_atomic_fetch_and
+#define arch_atomic_fetch_or   arch_atomic_fetch_or
+#define arch_atomic_fetch_xor  arch_atomic_fetch_xor
+
 #undef ATOMIC_OPS
 #undef ATOMIC_FETCH_OP
 #undef ATOMIC_OP_RETURN
index 7665de9..b4090cc 100644 (file)
@@ -55,6 +55,11 @@ static inline int arch_atomic_fetch_##op(int i, atomic_t *v)         \
 ATOMIC_OPS(add, +=)
 ATOMIC_OPS(sub, -=)
 
+#define arch_atomic_add_return arch_atomic_add_return
+#define arch_atomic_sub_return arch_atomic_sub_return
+#define arch_atomic_fetch_add  arch_atomic_fetch_add
+#define arch_atomic_fetch_sub  arch_atomic_fetch_sub
+
 #undef ATOMIC_OPS
 #define ATOMIC_OPS(op, c_op)                                           \
        ATOMIC_OP(op, c_op)                                             \
@@ -64,6 +69,10 @@ ATOMIC_OPS(and, &=)
 ATOMIC_OPS(or, |=)
 ATOMIC_OPS(xor, ^=)
 
+#define arch_atomic_fetch_and  arch_atomic_fetch_and
+#define arch_atomic_fetch_or   arch_atomic_fetch_or
+#define arch_atomic_fetch_xor  arch_atomic_fetch_xor
+
 #undef ATOMIC_OPS
 #undef ATOMIC_FETCH_OP
 #undef ATOMIC_OP_RETURN
index b63dcfb..9ef1fb1 100644 (file)
@@ -73,6 +73,11 @@ static inline int arch_atomic_fetch_##op(int i, atomic_t *v)         \
 ATOMIC_OPS(add)
 ATOMIC_OPS(sub)
 
+#define arch_atomic_add_return arch_atomic_add_return
+#define arch_atomic_sub_return arch_atomic_sub_return
+#define arch_atomic_fetch_add  arch_atomic_fetch_add
+#define arch_atomic_fetch_sub  arch_atomic_fetch_sub
+
 #undef ATOMIC_OPS
 #define ATOMIC_OPS(op) ATOMIC_OP(op) ATOMIC_FETCH_OP(op)
 
@@ -80,6 +85,10 @@ ATOMIC_OPS(and)
 ATOMIC_OPS(or)
 ATOMIC_OPS(xor)
 
+#define arch_atomic_fetch_and  arch_atomic_fetch_and
+#define arch_atomic_fetch_or   arch_atomic_fetch_or
+#define arch_atomic_fetch_xor  arch_atomic_fetch_xor
+
 #undef ATOMIC_OPS
 #undef ATOMIC_FETCH_OP
 #undef ATOMIC_OP_RETURN
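The sh atomic headers above define each arch_atomic_* helper as a macro naming itself; that is the kernel convention that lets the generic atomic fallback headers detect with #ifdef which operations the architecture already provides. A self-contained illustration of the convention (all names here are made up, none are the real kernel helpers):

    #include <stdio.h>

    /* "architecture" header: provide the operation and advertise it. */
    static int arch_op(int x) { return x + 1; }
    #define arch_op arch_op

    /* "generic" header: supply a fallback only when the arch did not. */
    #ifndef arch_op
    static int arch_op(int x) { return x; }
    #endif

    int main(void)
    {
            printf("%d\n", arch_op(41));    /* 42: the arch version is used */
            return 0;
    }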
index 528bfed..7a18cb2 100644 (file)
@@ -30,9 +30,6 @@
 #include <asm/atomic-irq.h>
 #endif
 
-#define arch_atomic_xchg(v, new)       (arch_xchg(&((v)->counter), new))
-#define arch_atomic_cmpxchg(v, o, n)   (arch_cmpxchg(&((v)->counter), (o), (n)))
-
 #endif /* CONFIG_CPU_J2 */
 
 #endif /* __ASM_SH_ATOMIC_H */
diff --git a/arch/sh/include/asm/bugs.h b/arch/sh/include/asm/bugs.h
deleted file mode 100644 (file)
index fe52abb..0000000
+++ /dev/null
@@ -1,74 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __ASM_SH_BUGS_H
-#define __ASM_SH_BUGS_H
-
-/*
- * This is included by init/main.c to check for architecture-dependent bugs.
- *
- * Needs:
- *     void check_bugs(void);
- */
-
-/*
- * I don't know of any Super-H bugs yet.
- */
-
-#include <asm/processor.h>
-
-extern void select_idle_routine(void);
-
-static void __init check_bugs(void)
-{
-       extern unsigned long loops_per_jiffy;
-       char *p = &init_utsname()->machine[2]; /* "sh" */
-
-       select_idle_routine();
-
-       current_cpu_data.loops_per_jiffy = loops_per_jiffy;
-
-       switch (current_cpu_data.family) {
-       case CPU_FAMILY_SH2:
-               *p++ = '2';
-               break;
-       case CPU_FAMILY_SH2A:
-               *p++ = '2';
-               *p++ = 'a';
-               break;
-       case CPU_FAMILY_SH3:
-               *p++ = '3';
-               break;
-       case CPU_FAMILY_SH4:
-               *p++ = '4';
-               break;
-       case CPU_FAMILY_SH4A:
-               *p++ = '4';
-               *p++ = 'a';
-               break;
-       case CPU_FAMILY_SH4AL_DSP:
-               *p++ = '4';
-               *p++ = 'a';
-               *p++ = 'l';
-               *p++ = '-';
-               *p++ = 'd';
-               *p++ = 's';
-               *p++ = 'p';
-               break;
-       case CPU_FAMILY_UNKNOWN:
-               /*
-                * Specifically use CPU_FAMILY_UNKNOWN rather than
-                * default:, so we're able to have the compiler whine
-                * about unhandled enumerations.
-                */
-               break;
-       }
-
-       printk("CPU: %s\n", get_cpu_subtype(&current_cpu_data));
-
-#ifndef __LITTLE_ENDIAN__
-       /* 'eb' means 'Endian Big' */
-       *p++ = 'e';
-       *p++ = 'b';
-#endif
-       *p = '\0';
-}
-#endif /* __ASM_SH_BUGS_H */
index 32dfa6b..b38dbc9 100644 (file)
 
 #define L1_CACHE_BYTES         (1 << L1_CACHE_SHIFT)
 
+/*
+ * Some drivers need to perform DMA into kmalloc'ed buffers
+ * and so we have to increase the kmalloc minalign for this.
+ */
+#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
+
 #define __read_mostly __section(".data..read_mostly")
 
 #ifndef __ASSEMBLY__
index 1c49235..0f384b1 100644 (file)
@@ -22,7 +22,6 @@ extern unsigned short *irq_mask_register;
 /*
  * PINT IRQs
  */
-void init_IRQ_pint(void);
 void make_imask_irq(unsigned int irq);
 
 static inline int generic_irq_demux(int irq)
index 09ac6c7..62f4b9e 100644 (file)
@@ -174,10 +174,4 @@ typedef struct page *pgtable_t;
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
 
-/*
- * Some drivers need to perform DMA into kmalloc'ed buffers
- * and so we have to increase the kmalloc minalign for this.
- */
-#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
-
 #endif /* __ASM_SH_PAGE_H */
index 85a6c1c..73fba7c 100644 (file)
@@ -166,6 +166,8 @@ extern unsigned int instruction_size(unsigned int insn);
 #define instruction_size(insn) (2)
 #endif
 
+void select_idle_routine(void);
+
 #endif /* __ASSEMBLY__ */
 
 #include <asm/processor_32.h>
index 69dbae2..7fe7002 100644 (file)
@@ -2,8 +2,6 @@
 #ifndef _ASM_RTC_H
 #define _ASM_RTC_H
 
-void time_init(void);
-
 #define RTC_CAP_4_DIGIT_YEAR   (1 << 0)
 
 struct sh_rtc_platform_info {
index 1400fbb..9f19a68 100644 (file)
@@ -84,9 +84,6 @@ static inline struct thread_info *current_thread_info(void)
 
 #define THREAD_SIZE_ORDER      (THREAD_SHIFT - PAGE_SHIFT)
 
-extern void arch_task_cache_init(void);
-extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
-extern void arch_release_task_struct(struct task_struct *tsk);
 extern void init_thread_xstate(void);
 
 #endif /* __ASSEMBLY__ */
index d662503..045d93f 100644 (file)
@@ -15,6 +15,7 @@
 #include <linux/irqflags.h>
 #include <linux/smp.h>
 #include <linux/atomic.h>
+#include <asm/processor.h>
 #include <asm/smp.h>
 #include <asm/bl_bit.h>
 
index af977ec..b3da275 100644 (file)
@@ -43,6 +43,7 @@
 #include <asm/smp.h>
 #include <asm/mmu_context.h>
 #include <asm/mmzone.h>
+#include <asm/processor.h>
 #include <asm/sparsemem.h>
 #include <asm/platform_early.h>
 
@@ -304,9 +305,9 @@ void __init setup_arch(char **cmdline_p)
        bss_resource.end = virt_to_phys(__bss_stop)-1;
 
 #ifdef CONFIG_CMDLINE_OVERWRITE
-       strlcpy(command_line, CONFIG_CMDLINE, sizeof(command_line));
+       strscpy(command_line, CONFIG_CMDLINE, sizeof(command_line));
 #else
-       strlcpy(command_line, COMMAND_LINE, sizeof(command_line));
+       strscpy(command_line, COMMAND_LINE, sizeof(command_line));
 #ifdef CONFIG_CMDLINE_EXTEND
        strlcat(command_line, " ", sizeof(command_line));
        strlcat(command_line, CONFIG_CMDLINE, sizeof(command_line));
@@ -354,3 +355,57 @@ int test_mode_pin(int pin)
 {
        return sh_mv.mv_mode_pins() & pin;
 }
+
+void __init arch_cpu_finalize_init(void)
+{
+       char *p = &init_utsname()->machine[2]; /* "sh" */
+
+       select_idle_routine();
+
+       current_cpu_data.loops_per_jiffy = loops_per_jiffy;
+
+       switch (current_cpu_data.family) {
+       case CPU_FAMILY_SH2:
+               *p++ = '2';
+               break;
+       case CPU_FAMILY_SH2A:
+               *p++ = '2';
+               *p++ = 'a';
+               break;
+       case CPU_FAMILY_SH3:
+               *p++ = '3';
+               break;
+       case CPU_FAMILY_SH4:
+               *p++ = '4';
+               break;
+       case CPU_FAMILY_SH4A:
+               *p++ = '4';
+               *p++ = 'a';
+               break;
+       case CPU_FAMILY_SH4AL_DSP:
+               *p++ = '4';
+               *p++ = 'a';
+               *p++ = 'l';
+               *p++ = '-';
+               *p++ = 'd';
+               *p++ = 's';
+               *p++ = 'p';
+               break;
+       case CPU_FAMILY_UNKNOWN:
+               /*
+                * Specifically use CPU_FAMILY_UNKNOWN rather than
+                * default:, so we're able to have the compiler whine
+                * about unhandled enumerations.
+                */
+               break;
+       }
+
+       pr_info("CPU: %s\n", get_cpu_subtype(&current_cpu_data));
+
+#ifndef __LITTLE_ENDIAN__
+       /* 'eb' means 'Endian Big' */
+       *p++ = 'e';
+       *p++ = 'b';
+#endif
+       *p = '\0';
+}
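
For reference, arch_cpu_finalize_init() is the replacement for the old check_bugs() hook removed along with asm/bugs.h above; it is declared in <linux/cpu.h> and called late in start_kernel() once the boot CPU is fully set up. A minimal sketch of the generic side, assuming the weak-default wiring used by ARCH_HAS_CPU_FINALIZE_INIT:

    /* init/main.c (sketch): empty weak default, overridden by architectures
     * such as sh that select ARCH_HAS_CPU_FINALIZE_INIT. */
    void __init __weak arch_cpu_finalize_init(void)
    {
    }
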
index 2de85c9..97377e8 100644 (file)
 448    common  process_mrelease                sys_process_mrelease
 449    common  futex_waitv                     sys_futex_waitv
 450    common  set_mempolicy_home_node         sys_set_mempolicy_home_node
+451    common  cachestat                       sys_cachestat
index 999ab59..6cb0ad7 100644 (file)
@@ -38,7 +38,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
                        if (pud) {
                                pmd = pmd_alloc(mm, pud, addr);
                                if (pmd)
-                                       pte = pte_alloc_map(mm, pmd, addr);
+                                       pte = pte_alloc_huge(mm, pmd, addr);
                        }
                }
        }
@@ -63,7 +63,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
                        if (pud) {
                                pmd = pmd_offset(pud, addr);
                                if (pmd)
-                                       pte = pte_offset_map(pmd, addr);
+                                       pte = pte_offset_huge(pmd, addr);
                        }
                }
        }
index 8535e19..6197b87 100644 (file)
@@ -33,7 +33,7 @@ config SPARC
        select ARCH_WANT_IPC_PARSE_VERSION
        select GENERIC_PCI_IOMAP
        select HAS_IOPORT
-       select HAVE_NMI_WATCHDOG if SPARC64
+       select HAVE_HARDLOCKUP_DETECTOR_SPARC64 if SPARC64
        select HAVE_CBPF_JIT if SPARC32
        select HAVE_EBPF_JIT if SPARC64
        select HAVE_DEBUG_BUGVERBOSE
@@ -52,6 +52,7 @@ config SPARC
 config SPARC32
        def_bool !64BIT
        select ARCH_32BIT_OFF_T
+       select ARCH_HAS_CPU_FINALIZE_INIT if !SMP
        select ARCH_HAS_SYNC_DMA_FOR_CPU
        select CLZ_TAB
        select DMA_DIRECT_REMAP
index 6b2bec1..37e0036 100644 (file)
@@ -14,3 +14,17 @@ config FRAME_POINTER
        bool
        depends on MCOUNT
        default y
+
+config HAVE_HARDLOCKUP_DETECTOR_SPARC64
+       bool
+       depends on HAVE_NMI
+       select HARDLOCKUP_DETECTOR_SPARC64
+       help
+         Sparc64 hardlockup detector is the last one developed before adding
+         the common infrastructure for handling hardlockup detectors. It is
+         always built. It does _not_ use the common command line parameters
+         and sysctl interface, except for /proc/sys/kernel/nmi_watchdog.
+
+config HARDLOCKUP_DETECTOR_SPARC64
+       bool
+       depends on HAVE_HARDLOCKUP_DETECTOR_SPARC64
index d775daa..60ce2fe 100644 (file)
 #include <asm-generic/atomic64.h>
 
 int arch_atomic_add_return(int, atomic_t *);
+#define arch_atomic_add_return arch_atomic_add_return
+
 int arch_atomic_fetch_add(int, atomic_t *);
+#define arch_atomic_fetch_add arch_atomic_fetch_add
+
 int arch_atomic_fetch_and(int, atomic_t *);
+#define arch_atomic_fetch_and arch_atomic_fetch_and
+
 int arch_atomic_fetch_or(int, atomic_t *);
+#define arch_atomic_fetch_or arch_atomic_fetch_or
+
 int arch_atomic_fetch_xor(int, atomic_t *);
+#define arch_atomic_fetch_xor arch_atomic_fetch_xor
+
 int arch_atomic_cmpxchg(atomic_t *, int, int);
+#define arch_atomic_cmpxchg arch_atomic_cmpxchg
+
 int arch_atomic_xchg(atomic_t *, int);
-int arch_atomic_fetch_add_unless(atomic_t *, int, int);
-void arch_atomic_set(atomic_t *, int);
+#define arch_atomic_xchg arch_atomic_xchg
 
+int arch_atomic_fetch_add_unless(atomic_t *, int, int);
 #define arch_atomic_fetch_add_unless arch_atomic_fetch_add_unless
 
+void arch_atomic_set(atomic_t *, int);
+
 #define arch_atomic_set_release(v, i)  arch_atomic_set((v), (i))
 
 #define arch_atomic_read(v)            READ_ONCE((v)->counter)
index 0778916..a5e9c37 100644 (file)
@@ -37,6 +37,16 @@ s64 arch_atomic64_fetch_##op(s64, atomic64_t *);
 ATOMIC_OPS(add)
 ATOMIC_OPS(sub)
 
+#define arch_atomic_add_return                 arch_atomic_add_return
+#define arch_atomic_sub_return                 arch_atomic_sub_return
+#define arch_atomic_fetch_add                  arch_atomic_fetch_add
+#define arch_atomic_fetch_sub                  arch_atomic_fetch_sub
+
+#define arch_atomic64_add_return               arch_atomic64_add_return
+#define arch_atomic64_sub_return               arch_atomic64_sub_return
+#define arch_atomic64_fetch_add                        arch_atomic64_fetch_add
+#define arch_atomic64_fetch_sub                        arch_atomic64_fetch_sub
+
 #undef ATOMIC_OPS
 #define ATOMIC_OPS(op) ATOMIC_OP(op) ATOMIC_FETCH_OP(op)
 
@@ -44,22 +54,19 @@ ATOMIC_OPS(and)
 ATOMIC_OPS(or)
 ATOMIC_OPS(xor)
 
+#define arch_atomic_fetch_and                  arch_atomic_fetch_and
+#define arch_atomic_fetch_or                   arch_atomic_fetch_or
+#define arch_atomic_fetch_xor                  arch_atomic_fetch_xor
+
+#define arch_atomic64_fetch_and                        arch_atomic64_fetch_and
+#define arch_atomic64_fetch_or                 arch_atomic64_fetch_or
+#define arch_atomic64_fetch_xor                        arch_atomic64_fetch_xor
+
 #undef ATOMIC_OPS
 #undef ATOMIC_FETCH_OP
 #undef ATOMIC_OP_RETURN
 #undef ATOMIC_OP
 
-#define arch_atomic_cmpxchg(v, o, n) (arch_cmpxchg(&((v)->counter), (o), (n)))
-
-static inline int arch_atomic_xchg(atomic_t *v, int new)
-{
-       return arch_xchg(&v->counter, new);
-}
-
-#define arch_atomic64_cmpxchg(v, o, n) \
-       ((__typeof__((v)->counter))arch_cmpxchg(&((v)->counter), (o), (n)))
-#define arch_atomic64_xchg(v, new) (arch_xchg(&((v)->counter), new))
-
 s64 arch_atomic64_dec_if_positive(atomic64_t *v);
 #define arch_atomic64_dec_if_positive arch_atomic64_dec_if_positive
 
diff --git a/arch/sparc/include/asm/bugs.h b/arch/sparc/include/asm/bugs.h
deleted file mode 100644 (file)
index 02fa369..0000000
+++ /dev/null
@@ -1,18 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/* include/asm/bugs.h:  Sparc probes for various bugs.
- *
- * Copyright (C) 1996, 2007 David S. Miller (davem@davemloft.net)
- */
-
-#ifdef CONFIG_SPARC32
-#include <asm/cpudata.h>
-#endif
-
-extern unsigned long loops_per_jiffy;
-
-static void __init check_bugs(void)
-{
-#if defined(CONFIG_SPARC32) && !defined(CONFIG_SMP)
-       cpu_data(0).udelay_val = loops_per_jiffy;
-#endif
-}
index 43ec260..6ee4832 100644 (file)
@@ -17,7 +17,6 @@
 
 #define irq_canonicalize(irq)  (irq)
 
-void __init init_IRQ(void);
 void __init sun4d_init_sbi_irq(void);
 
 #define NO_IRQ         0xffffffff
index 154df2c..b436029 100644 (file)
@@ -61,7 +61,6 @@ void sun4u_destroy_msi(unsigned int irq);
 unsigned int irq_alloc(unsigned int dev_handle, unsigned int dev_ino);
 void irq_free(unsigned int irq);
 
-void __init init_IRQ(void);
 void fixup_irqs(void);
 
 static inline void set_softint(unsigned long bits)
index 90ee786..920dc23 100644 (file)
@@ -8,7 +8,6 @@ void nmi_adjust_hz(unsigned int new_hz);
 
 extern atomic_t nmi_active;
 
-void arch_touch_nmi_watchdog(void);
 void start_nmi_watchdog(void *unused);
 void stop_nmi_watchdog(void *unused);
 
index dcfad46..ffff52c 100644 (file)
@@ -34,7 +34,6 @@ extern struct sparc64_tick_ops *tick_ops;
 
 unsigned long sparc64_get_clock_tick(unsigned int cpu);
 void setup_sparc64_timer(void);
-void __init time_init(void);
 
 #define TICK_PRIV_BIT          BIT(63)
 #define TICKCMP_IRQ_BIT                BIT(63)
index 4e4f3d3..a8cbe40 100644 (file)
@@ -191,7 +191,7 @@ static void __iomem *_sparc_alloc_io(unsigned int busno, unsigned long phys,
                tack += sizeof (struct resource);
        }
 
-       strlcpy(tack, name, XNMLN+1);
+       strscpy(tack, name, XNMLN+1);
        res->name = tack;
 
        va = _sparc_ioremap(res, busno, phys, size);
index 9cd09a3..15da3c0 100644 (file)
@@ -91,7 +91,6 @@ extern int static_irq_count;
 extern spinlock_t irq_action_lock;
 
 void unexpected_irq(int irq, void *dev_id, struct pt_regs * regs);
-void init_IRQ(void);
 
 /* sun4m_irq.c */
 void sun4m_init_IRQ(void);
index 060fff9..17cdfdb 100644 (file)
@@ -65,6 +65,11 @@ void arch_touch_nmi_watchdog(void)
 }
 EXPORT_SYMBOL(arch_touch_nmi_watchdog);
 
+int __init watchdog_hardlockup_probe(void)
+{
+       return 0;
+}
+
 static void die_nmi(const char *str, struct pt_regs *regs, int do_panic)
 {
        int this_cpu = smp_processor_id();
@@ -282,11 +287,11 @@ __setup("nmi_watchdog=", setup_nmi_watchdog);
  * sparc specific NMI watchdog enable function.
  * Enables watchdog if it is not enabled already.
  */
-int watchdog_nmi_enable(unsigned int cpu)
+void watchdog_hardlockup_enable(unsigned int cpu)
 {
        if (atomic_read(&nmi_active) == -1) {
                pr_warn("NMI watchdog cannot be enabled or disabled\n");
-               return -1;
+               return;
        }
 
        /*
@@ -295,17 +300,15 @@ int watchdog_nmi_enable(unsigned int cpu)
         * process first.
         */
        if (!nmi_init_done)
-               return 0;
+               return;
 
        smp_call_function_single(cpu, start_nmi_watchdog, NULL, 1);
-
-       return 0;
 }
 /*
  * sparc specific NMI watchdog disable function.
  * Disables watchdog if it is not disabled already.
  */
-void watchdog_nmi_disable(unsigned int cpu)
+void watchdog_hardlockup_disable(unsigned int cpu)
 {
        if (atomic_read(&nmi_active) == -1)
                pr_warn_once("NMI watchdog cannot be enabled or disabled\n");
index c8e0dd9..1adf5c1 100644 (file)
@@ -302,7 +302,7 @@ void __init setup_arch(char **cmdline_p)
 
        /* Initialize PROM console and command line. */
        *cmdline_p = prom_getbootargs();
-       strlcpy(boot_command_line, *cmdline_p, COMMAND_LINE_SIZE);
+       strscpy(boot_command_line, *cmdline_p, COMMAND_LINE_SIZE);
        parse_early_param();
 
        boot_flags_init(*cmdline_p);
@@ -412,3 +412,10 @@ static int __init topology_init(void)
 }
 
 subsys_initcall(topology_init);
+
+#if defined(CONFIG_SPARC32) && !defined(CONFIG_SMP)
+void __init arch_cpu_finalize_init(void)
+{
+       cpu_data(0).udelay_val = loops_per_jiffy;
+}
+#endif
index 48abee4..6546ca9 100644 (file)
@@ -636,7 +636,7 @@ void __init setup_arch(char **cmdline_p)
 {
        /* Initialize PROM console and command line. */
        *cmdline_p = prom_getbootargs();
-       strlcpy(boot_command_line, *cmdline_p, COMMAND_LINE_SIZE);
+       strscpy(boot_command_line, *cmdline_p, COMMAND_LINE_SIZE);
        parse_early_param();
 
        boot_flags_init(*cmdline_p);
index dad3896..ca450c7 100644 (file)
@@ -328,6 +328,8 @@ static void flush_signal_insns(unsigned long address)
                goto out_irqs_on;
 
        ptep = pte_offset_map(pmdp, address);
+       if (!ptep)
+               goto out_irqs_on;
        pte = *ptep;
        if (!pte_present(pte))
                goto out_unmap;
index 4398cc6..faa835f 100644 (file)
 448    common  process_mrelease                sys_process_mrelease
 449    common  futex_waitv                     sys_futex_waitv
 450    common  set_mempolicy_home_node         sys_set_mempolicy_home_node
+451    common  cachestat                       sys_cachestat
index d91305d..d8a407f 100644 (file)
@@ -99,6 +99,7 @@ static unsigned int get_user_insn(unsigned long tpc)
        local_irq_disable();
 
        pmdp = pmd_offset(pudp, tpc);
+again:
        if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
                goto out_irq_enable;
 
@@ -115,6 +116,8 @@ static unsigned int get_user_insn(unsigned long tpc)
 #endif
        {
                ptep = pte_offset_map(pmdp, tpc);
+               if (!ptep)
+                       goto again;
                pte = *ptep;
                if (pte_present(pte)) {
                        pa  = (pte_pfn(pte) << PAGE_SHIFT);
index d8e0e3c..d701882 100644 (file)
@@ -298,7 +298,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
                return NULL;
        if (sz >= PMD_SIZE)
                return (pte_t *)pmd;
-       return pte_alloc_map(mm, pmd, addr);
+       return pte_alloc_huge(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
@@ -325,7 +325,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
                return NULL;
        if (is_hugetlb_pmd(*pmd))
                return (pte_t *)pmd;
-       return pte_offset_map(pmd, addr);
+       return pte_offset_huge(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
index bf3e6d2..133dd42 100644 (file)
@@ -244,7 +244,7 @@ static void *iounit_alloc(struct device *dev, size_t len,
                        long i;
 
                        pmdp = pmd_off_k(addr);
-                       ptep = pte_offset_map(pmdp, addr);
+                       ptep = pte_offset_kernel(pmdp, addr);
 
                        set_pte(ptep, mk_pte(virt_to_page(page), dvma_prot));
 
index 9e3f693..3a6caef 100644 (file)
@@ -358,7 +358,7 @@ static void *sbus_iommu_alloc(struct device *dev, size_t len,
                                __flush_page_to_ram(page);
 
                        pmdp = pmd_off_k(addr);
-                       ptep = pte_offset_map(pmdp, addr);
+                       ptep = pte_offset_kernel(pmdp, addr);
 
                        set_pte(ptep, mk_pte(virt_to_page(page), dvma_prot));
                }
index 9a72554..7ecf855 100644 (file)
@@ -149,6 +149,8 @@ static void tlb_batch_pmd_scan(struct mm_struct *mm, unsigned long vaddr,
        pte_t *pte;
 
        pte = pte_offset_map(&pmd, vaddr);
+       if (!pte)
+               return;
        end = vaddr + HPAGE_SIZE;
        while (vaddr < end) {
                if (pte_val(*pte) & _PAGE_VALID) {
index e3b731f..1c7cd25 100644 (file)
@@ -52,7 +52,7 @@ prom_getbootargs(void)
                 * V3 PROM cannot supply as with more than 128 bytes
                 * of an argument. But a smart bootstrap loader can.
                 */
-               strlcpy(barg_buf, *romvec->pv_v2bootargs.bootargs, sizeof(barg_buf));
+               strscpy(barg_buf, *romvec->pv_v2bootargs.bootargs, sizeof(barg_buf));
                break;
        default:
                break;
index 541a9b1..b5e1793 100644 (file)
@@ -5,7 +5,7 @@ menu "UML-specific options"
 config UML
        bool
        default y
-       select ARCH_EPHEMERAL_INODES
+       select ARCH_HAS_CPU_FINALIZE_INIT
        select ARCH_HAS_FORTIFY_SOURCE
        select ARCH_HAS_GCOV_PROFILE_ALL
        select ARCH_HAS_KCOV
index 8186d47..da4d525 100644 (file)
@@ -149,7 +149,7 @@ export CFLAGS_vmlinux := $(LINK-y) $(LINK_WRAPS) $(LD_FLAGS_CMDLINE) $(CC_FLAGS_
 # When cleaning we don't include .config, so we don't include
 # TT or skas makefiles and don't clean skas_ptregs.h.
 CLEAN_FILES += linux x.i gmon.out
-MRPROPER_FILES += arch/$(SUBARCH)/include/generated
+MRPROPER_FILES += $(HOST_DIR)/include/generated
 
 archclean:
        @find . \( -name '*.bb' -o -name '*.bbg' -o -name '*.da' \
index f4c1e6e..50206fe 100644 (file)
@@ -108,9 +108,9 @@ static inline void ubd_set_bit(__u64 bit, unsigned char *data)
 static DEFINE_MUTEX(ubd_lock);
 static DEFINE_MUTEX(ubd_mutex); /* replaces BKL, might not be needed */
 
-static int ubd_open(struct block_device *bdev, fmode_t mode);
-static void ubd_release(struct gendisk *disk, fmode_t mode);
-static int ubd_ioctl(struct block_device *bdev, fmode_t mode,
+static int ubd_open(struct gendisk *disk, blk_mode_t mode);
+static void ubd_release(struct gendisk *disk);
+static int ubd_ioctl(struct block_device *bdev, blk_mode_t mode,
                     unsigned int cmd, unsigned long arg);
 static int ubd_getgeo(struct block_device *bdev, struct hd_geometry *geo);
 
@@ -1154,9 +1154,8 @@ static int __init ubd_driver_init(void){
 
 device_initcall(ubd_driver_init);
 
-static int ubd_open(struct block_device *bdev, fmode_t mode)
+static int ubd_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct gendisk *disk = bdev->bd_disk;
        struct ubd *ubd_dev = disk->private_data;
        int err = 0;
 
@@ -1171,19 +1170,12 @@ static int ubd_open(struct block_device *bdev, fmode_t mode)
        }
        ubd_dev->count++;
        set_disk_ro(disk, !ubd_dev->openflags.w);
-
-       /* This should no more be needed. And it didn't work anyway to exclude
-        * read-write remounting of filesystems.*/
-       /*if((mode & FMODE_WRITE) && !ubd_dev->openflags.w){
-               if(--ubd_dev->count == 0) ubd_close_dev(ubd_dev);
-               err = -EROFS;
-       }*/
 out:
        mutex_unlock(&ubd_mutex);
        return err;
 }
 
-static void ubd_release(struct gendisk *disk, fmode_t mode)
+static void ubd_release(struct gendisk *disk)
 {
        struct ubd *ubd_dev = disk->private_data;
 
@@ -1397,7 +1389,7 @@ static int ubd_getgeo(struct block_device *bdev, struct hd_geometry *geo)
        return 0;
 }
 
-static int ubd_ioctl(struct block_device *bdev, fmode_t mode,
+static int ubd_ioctl(struct block_device *bdev, blk_mode_t mode,
                     unsigned int cmd, unsigned long arg)
 {
        struct ubd *ubd_dev = bdev->bd_disk->private_data;
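
The prototype changes above track this cycle's block layer rework, where ->open() receives the gendisk plus a blk_mode_t and ->release() only the gendisk. A sketch of how the methods plug into the driver's block_device_operations; the ubd_blops name and member set are assumed from the existing driver:

    static const struct block_device_operations ubd_blops = {
            .owner   = THIS_MODULE,
            .open    = ubd_open,    /* int  (*)(struct gendisk *, blk_mode_t) */
            .release = ubd_release, /* void (*)(struct gendisk *) */
            .ioctl   = ubd_ioctl,   /* int  (*)(struct block_device *, blk_mode_t, ...) */
            .getgeo  = ubd_getgeo,
    };
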
diff --git a/arch/um/include/asm/bugs.h b/arch/um/include/asm/bugs.h
deleted file mode 100644 (file)
index 4473942..0000000
+++ /dev/null
@@ -1,7 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __UM_BUGS_H
-#define __UM_BUGS_H
-
-void check_bugs(void);
-
-#endif
index bda66e5..0347a19 100644 (file)
@@ -52,6 +52,7 @@ static inline int printk(const char *fmt, ...)
 extern int in_aton(char *str);
 extern size_t strlcpy(char *, const char *, size_t);
 extern size_t strlcat(char *, const char *, size_t);
+extern size_t strscpy(char *, const char *, size_t);
 
 /* Copied from linux/compiler-gcc.h since we can't include it directly */
 #define barrier() __asm__ __volatile__("": : :"memory")
index 0a23a98..918fed7 100644 (file)
@@ -3,6 +3,7 @@
  * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
  */
 
+#include <linux/cpu.h>
 #include <linux/delay.h>
 #include <linux/init.h>
 #include <linux/mm.h>
@@ -430,7 +431,7 @@ void __init setup_arch(char **cmdline_p)
        }
 }
 
-void __init check_bugs(void)
+void __init arch_cpu_finalize_init(void)
 {
        arch_check_bugs();
        os_check_bugs();
index 53eb3d5..2284e9c 100644 (file)
@@ -146,7 +146,7 @@ static int tuntap_open(void *data)
                }
                memset(&ifr, 0, sizeof(ifr));
                ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
-               strlcpy(ifr.ifr_name, pri->dev_name, sizeof(ifr.ifr_name));
+               strscpy(ifr.ifr_name, pri->dev_name, sizeof(ifr.ifr_name));
                if (ioctl(pri->fd, TUNSETIFF, &ifr) < 0) {
                        err = -errno;
                        printk(UM_KERN_ERR "TUNSETIFF failed, errno = %d\n",
index 53bab12..d5c6914 100644 (file)
@@ -71,6 +71,7 @@ config X86
        select ARCH_HAS_ACPI_TABLE_UPGRADE      if ACPI
        select ARCH_HAS_CACHE_LINE_SIZE
        select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
+       select ARCH_HAS_CPU_FINALIZE_INIT
        select ARCH_HAS_CURRENT_STACK_POINTER
        select ARCH_HAS_DEBUG_VIRTUAL
        select ARCH_HAS_DEBUG_VM_PGTABLE        if !X86_PAE
@@ -274,7 +275,9 @@ config X86
        select HAVE_UNSTABLE_SCHED_CLOCK
        select HAVE_USER_RETURN_NOTIFIER
        select HAVE_GENERIC_VDSO
+       select HOTPLUG_PARALLEL                 if SMP && X86_64
        select HOTPLUG_SMT                      if SMP
+       select HOTPLUG_SPLIT_STARTUP            if SMP && X86_32
        select IRQ_FORCED_THREADING
        select NEED_PER_CPU_EMBED_FIRST_CHUNK
        select NEED_PER_CPU_PAGE_FIRST_CHUNK
@@ -291,7 +294,6 @@ config X86
        select TRACE_IRQFLAGS_NMI_SUPPORT
        select USER_STACKTRACE_SUPPORT
        select HAVE_ARCH_KCSAN                  if X86_64
-       select X86_FEATURE_NAMES                if PROC_FS
        select PROC_PID_ARCH_STATUS             if PROC_FS
        select HAVE_ARCH_NODE_DEV_GROUP         if X86_SGX
        select FUNCTION_ALIGNMENT_16B           if X86_64 || X86_ALIGNMENT_16
@@ -441,17 +443,6 @@ config SMP
 
          If you don't know what to do here, say N.
 
-config X86_FEATURE_NAMES
-       bool "Processor feature human-readable names" if EMBEDDED
-       default y
-       help
-         This option compiles in a table of x86 feature bits and corresponding
-         names.  This is required to support /proc/cpuinfo and a few kernel
-         messages.  You can disable this to save space, at the expense of
-         making those few kernel messages show numeric feature bits instead.
-
-         If in doubt, say Y.
-
 config X86_X2APIC
        bool "Support x2apic"
        depends on X86_LOCAL_APIC && X86_64 && (IRQ_REMAP || HYPERVISOR_GUEST)
@@ -884,9 +875,11 @@ config INTEL_TDX_GUEST
        bool "Intel TDX (Trust Domain Extensions) - Guest Support"
        depends on X86_64 && CPU_SUP_INTEL
        depends on X86_X2APIC
+       depends on EFI_STUB
        select ARCH_HAS_CC_PLATFORM
        select X86_MEM_ENCRYPT
        select X86_MCE
+       select UNACCEPTED_MEMORY
        help
          Support running as a guest under Intel TDX.  Without this support,
          the guest kernel can not boot or run under TDX.
@@ -1541,11 +1534,13 @@ config X86_MEM_ENCRYPT
 config AMD_MEM_ENCRYPT
        bool "AMD Secure Memory Encryption (SME) support"
        depends on X86_64 && CPU_SUP_AMD
+       depends on EFI_STUB
        select DMA_COHERENT_POOL
        select ARCH_USE_MEMREMAP_PROT
        select INSTRUCTION_DECODER
        select ARCH_HAS_CC_PLATFORM
        select X86_MEM_ENCRYPT
+       select UNACCEPTED_MEMORY
        help
          Say yes to enable support for the encryption of system memory.
          This requires an AMD processor that supports Secure Memory
@@ -2305,49 +2300,6 @@ config HOTPLUG_CPU
        def_bool y
        depends on SMP
 
-config BOOTPARAM_HOTPLUG_CPU0
-       bool "Set default setting of cpu0_hotpluggable"
-       depends on HOTPLUG_CPU
-       help
-         Set whether default state of cpu0_hotpluggable is on or off.
-
-         Say Y here to enable CPU0 hotplug by default. If this switch
-         is turned on, there is no need to give cpu0_hotplug kernel
-         parameter and the CPU0 hotplug feature is enabled by default.
-
-         Please note: there are two known CPU0 dependencies if you want
-         to enable the CPU0 hotplug feature either by this switch or by
-         cpu0_hotplug kernel parameter.
-
-         First, resume from hibernate or suspend always starts from CPU0.
-         So hibernate and suspend are prevented if CPU0 is offline.
-
-         Second dependency is PIC interrupts always go to CPU0. CPU0 can not
-         offline if any interrupt can not migrate out of CPU0. There may
-         be other CPU0 dependencies.
-
-         Please make sure the dependencies are under your control before
-         you enable this feature.
-
-         Say N if you don't want to enable CPU0 hotplug feature by default.
-         You still can enable the CPU0 hotplug feature at boot by kernel
-         parameter cpu0_hotplug.
-
-config DEBUG_HOTPLUG_CPU0
-       def_bool n
-       prompt "Debug CPU0 hotplug"
-       depends on HOTPLUG_CPU
-       help
-         Enabling this option offlines CPU0 (if CPU0 can be offlined) as
-         soon as possible and boots up userspace with CPU0 offlined. User
-         can online CPU0 back after boot time.
-
-         To debug CPU0 hotplug, you need to enable CPU0 offline/online
-         feature by either turning on CONFIG_BOOTPARAM_HOTPLUG_CPU0 during
-         compilation or giving cpu0_hotplug kernel parameter at boot.
-
-         If unsure, say N.
-
 config COMPAT_VDSO
        def_bool n
        prompt "Disable the 32-bit vDSO (needed for glibc 2.3.3)"
index 542377c..00468ad 100644 (file)
@@ -389,7 +389,7 @@ config IA32_FEAT_CTL
 
 config X86_VMX_FEATURE_NAMES
        def_bool y
-       depends on IA32_FEAT_CTL && X86_FEATURE_NAMES
+       depends on IA32_FEAT_CTL
 
 menuconfig PROCESSOR_SELECT
        bool "Supported processor vendors" if EXPERT
index b399759..fdc2e3a 100644 (file)
@@ -305,6 +305,18 @@ ifeq ($(RETPOLINE_CFLAGS),)
 endif
 endif
 
+ifdef CONFIG_UNWINDER_ORC
+orc_hash_h := arch/$(SRCARCH)/include/generated/asm/orc_hash.h
+orc_hash_sh := $(srctree)/scripts/orc_hash.sh
+targets += $(orc_hash_h)
+quiet_cmd_orc_hash = GEN     $@
+      cmd_orc_hash = mkdir -p $(dir $@); \
+                    $(CONFIG_SHELL) $(orc_hash_sh) < $< > $@
+$(orc_hash_h): $(srctree)/arch/x86/include/asm/orc_types.h $(orc_hash_sh) FORCE
+       $(call if_changed,orc_hash)
+archprepare: $(orc_hash_h)
+endif
+
 archclean:
        $(Q)rm -rf $(objtree)/arch/i386
        $(Q)rm -rf $(objtree)/arch/x86_64
diff --git a/arch/x86/Makefile.postlink b/arch/x86/Makefile.postlink
new file mode 100644 (file)
index 0000000..936093d
--- /dev/null
@@ -0,0 +1,47 @@
+# SPDX-License-Identifier: GPL-2.0
+# ===========================================================================
+# Post-link x86 pass
+# ===========================================================================
+#
+# 1. Separate relocations from vmlinux into vmlinux.relocs.
+# 2. Strip relocations from vmlinux.
+
+PHONY := __archpost
+__archpost:
+
+-include include/config/auto.conf
+include $(srctree)/scripts/Kbuild.include
+
+CMD_RELOCS = arch/x86/tools/relocs
+OUT_RELOCS = arch/x86/boot/compressed
+quiet_cmd_relocs = RELOCS  $(OUT_RELOCS)/$@.relocs
+      cmd_relocs = \
+       mkdir -p $(OUT_RELOCS); \
+       $(CMD_RELOCS) $@ > $(OUT_RELOCS)/$@.relocs; \
+       $(CMD_RELOCS) --abs-relocs $@
+
+quiet_cmd_strip_relocs = RSTRIP  $@
+      cmd_strip_relocs = \
+       $(OBJCOPY) --remove-section='.rel.*' --remove-section='.rel__*' \
+                  --remove-section='.rela.*' --remove-section='.rela__*' $@
+
+# `@true` prevents complaint when there is nothing to be done
+
+vmlinux: FORCE
+       @true
+ifeq ($(CONFIG_X86_NEED_RELOCS),y)
+       $(call cmd,relocs)
+       $(call cmd,strip_relocs)
+endif
+
+%.ko: FORCE
+       @true
+
+clean:
+       @rm -f $(OUT_RELOCS)/vmlinux.relocs
+
+PHONY += FORCE clean
+
+FORCE:
+
+.PHONY: $(PHONY)
index 9e38ffa..f33e45e 100644 (file)
@@ -55,14 +55,12 @@ HOST_EXTRACFLAGS += -I$(srctree)/tools/include \
                    -include include/generated/autoconf.h \
                    -D__EXPORTED_HEADERS__
 
-ifdef CONFIG_X86_FEATURE_NAMES
 $(obj)/cpu.o: $(obj)/cpustr.h
 
 quiet_cmd_cpustr = CPUSTR  $@
       cmd_cpustr = $(obj)/mkcpustr > $@
 $(obj)/cpustr.h: $(obj)/mkcpustr FORCE
        $(call if_changed,cpustr)
-endif
 targets += cpustr.h
 
 # ---------------------------------------------------------------------------
index 6b6cfe6..40d2ff5 100644 (file)
@@ -106,7 +106,8 @@ ifdef CONFIG_X86_64
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
-vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o $(obj)/tdx-shared.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/mem.o
 
 vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
@@ -121,11 +122,9 @@ $(obj)/vmlinux.bin: vmlinux FORCE
 
 targets += $(patsubst $(obj)/%,%,$(vmlinux-objs-y)) vmlinux.bin.all vmlinux.relocs
 
-CMD_RELOCS = arch/x86/tools/relocs
-quiet_cmd_relocs = RELOCS  $@
-      cmd_relocs = $(CMD_RELOCS) $< > $@;$(CMD_RELOCS) --abs-relocs $<
-$(obj)/vmlinux.relocs: vmlinux FORCE
-       $(call if_changed,relocs)
+# vmlinux.relocs is created by the vmlinux postlink step.
+$(obj)/vmlinux.relocs: vmlinux
+       @true
 
 vmlinux.bin.all-y := $(obj)/vmlinux.bin
 vmlinux.bin.all-$(CONFIG_X86_NEED_RELOCS) += $(obj)/vmlinux.relocs
index 7db2f41..866c0af 100644 (file)
@@ -16,6 +16,7 @@ typedef guid_t efi_guid_t __aligned(__alignof__(u32));
 #define ACPI_TABLE_GUID                                EFI_GUID(0xeb9d2d30, 0x2d88, 0x11d3,  0x9a, 0x16, 0x00, 0x90, 0x27, 0x3f, 0xc1, 0x4d)
 #define ACPI_20_TABLE_GUID                     EFI_GUID(0x8868e871, 0xe4f1, 0x11d3,  0xbc, 0x22, 0x00, 0x80, 0xc7, 0x3c, 0x88, 0x81)
 #define EFI_CC_BLOB_GUID                       EFI_GUID(0x067b1f5f, 0xcf26, 0x44c5, 0x85, 0x54, 0x93, 0xd7, 0x77, 0x91, 0x2d, 0x42)
+#define LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID    EFI_GUID(0xd5d1de3c, 0x105c, 0x44f9,  0x9e, 0xa9, 0xbc, 0xef, 0x98, 0x12, 0x00, 0x31)
 
 #define EFI32_LOADER_SIGNATURE "EL32"
 #define EFI64_LOADER_SIGNATURE "EL64"
@@ -32,6 +33,7 @@ typedef       struct {
 } efi_table_hdr_t;
 
 #define EFI_CONVENTIONAL_MEMORY                 7
+#define EFI_UNACCEPTED_MEMORY          15
 
 #define EFI_MEMORY_MORE_RELIABLE \
                                ((u64)0x0000000000010000ULL)    /* higher reliability */
@@ -104,6 +106,14 @@ struct efi_setup_data {
        u64 reserved[8];
 };
 
+struct efi_unaccepted_memory {
+       u32 version;
+       u32 unit_size;
+       u64 phys_base;
+       u64 size;
+       unsigned long bitmap[];
+};
+
 static inline int efi_guidcmp (efi_guid_t left, efi_guid_t right)
 {
        return memcmp(&left, &right, sizeof (efi_guid_t));
index c881878..5313c5c 100644 (file)
@@ -22,3 +22,22 @@ void error(char *m)
        while (1)
                asm("hlt");
 }
+
+/* EFI libstub  provides vsnprintf() */
+#ifdef CONFIG_EFI_STUB
+void panic(const char *fmt, ...)
+{
+       static char buf[1024];
+       va_list args;
+       int len;
+
+       va_start(args, fmt);
+       len = vsnprintf(buf, sizeof(buf), fmt, args);
+       va_end(args);
+
+       if (len && buf[len - 1] == '\n')
+               buf[len - 1] = '\0';
+
+       error(buf);
+}
+#endif
index 1de5821..86fe33b 100644 (file)
@@ -6,5 +6,6 @@
 
 void warn(char *m);
 void error(char *m) __noreturn;
+void panic(const char *fmt, ...) __noreturn __cold;
 
 #endif /* BOOT_COMPRESSED_ERROR_H */
index 454757f..9193acf 100644 (file)
@@ -672,6 +672,33 @@ static bool process_mem_region(struct mem_vector *region,
 }
 
 #ifdef CONFIG_EFI
+
+/*
+ * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if supported) are
+ * guaranteed to be free.
+ *
+ * Pick free memory more conservatively than the EFI spec allows: according to
+ * the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also free memory and thus
+ * available to place the kernel image into, but in practice there's firmware
+ * where using that memory leads to crashes. Buggy vendor EFI code registers
+ * for an event that triggers on SetVirtualAddressMap(). The handler assumes
+ * that EFI_BOOT_SERVICES_DATA memory has not been touched by loader yet, which
+ * is probably true for Windows.
+ *
+ * Preserve EFI_BOOT_SERVICES_* regions until after SetVirtualAddressMap().
+ */
+static inline bool memory_type_is_free(efi_memory_desc_t *md)
+{
+       if (md->type == EFI_CONVENTIONAL_MEMORY)
+               return true;
+
+       if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
+           md->type == EFI_UNACCEPTED_MEMORY)
+                   return true;
+
+       return false;
+}
+
 /*
  * Returns true if we processed the EFI memmap, which we prefer over the E820
  * table if it is available.
@@ -716,18 +743,7 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
        for (i = 0; i < nr_desc; i++) {
                md = efi_early_memdesc_ptr(pmap, e->efi_memdesc_size, i);
 
-               /*
-                * Here we are more conservative in picking free memory than
-                * the EFI spec allows:
-                *
-                * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also
-                * free memory and thus available to place the kernel image into,
-                * but in practice there's firmware where using that memory leads
-                * to crashes.
-                *
-                * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
-                */
-               if (md->type != EFI_CONVENTIONAL_MEMORY)
+               if (!memory_type_is_free(md))
                        continue;
 
                if (efi_soft_reserve_enabled() &&
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
new file mode 100644 (file)
index 0000000..3c16092
--- /dev/null
@@ -0,0 +1,86 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "error.h"
+#include "misc.h"
+#include "tdx.h"
+#include "sev.h"
+#include <asm/shared/tdx.h>
+
+/*
+ * accept_memory() and process_unaccepted_memory() are called from the EFI stub,
+ * which runs before the decompressor and its early_tdx_detect().
+ *
+ * Enumerate TDX directly from the early users.
+ */
+static bool early_is_tdx_guest(void)
+{
+       static bool once;
+       static bool is_tdx;
+
+       if (!IS_ENABLED(CONFIG_INTEL_TDX_GUEST))
+               return false;
+
+       if (!once) {
+               u32 eax, sig[3];
+
+               cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax,
+                           &sig[0], &sig[2],  &sig[1]);
+               is_tdx = !memcmp(TDX_IDENT, sig, sizeof(sig));
+               once = true;
+       }
+
+       return is_tdx;
+}
+
+void arch_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+       /* Platform-specific memory-acceptance call goes here */
+       if (early_is_tdx_guest()) {
+               if (!tdx_accept_memory(start, end))
+                       panic("TDX: Failed to accept memory\n");
+       } else if (sev_snp_enabled()) {
+               snp_accept_memory(start, end);
+       } else {
+               error("Cannot accept memory: unknown platform\n");
+       }
+}
+
+bool init_unaccepted_memory(void)
+{
+       guid_t guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
+       struct efi_unaccepted_memory *table;
+       unsigned long cfg_table_pa;
+       unsigned int cfg_table_len;
+       enum efi_type et;
+       int ret;
+
+       et = efi_get_type(boot_params);
+       if (et == EFI_TYPE_NONE)
+               return false;
+
+       ret = efi_get_conf_table(boot_params, &cfg_table_pa, &cfg_table_len);
+       if (ret) {
+               warn("EFI config table not found.");
+               return false;
+       }
+
+       table = (void *)efi_find_vendor_table(boot_params, cfg_table_pa,
+                                             cfg_table_len, guid);
+       if (!table)
+               return false;
+
+       if (table->version != 1)
+               error("Unknown version of unaccepted memory table\n");
+
+       /*
+        * In many cases unaccepted_table is already set by EFI stub, but it
+        * has to be initialized again to cover cases when the table is not
+        * allocated by EFI stub or EFI stub copied the kernel image with
+        * efi_relocate_kernel() before the variable is set.
+        *
+        * It must be initialized before the first usage of accept_memory().
+        */
+       unaccepted_table = table;
+
+       return true;
+}
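
With the table located, acceptance is driven by the bitmap laid out in struct efi_unaccepted_memory shown earlier (one bit per unit_size bytes, starting at phys_base). A rough sketch of how accept_memory() can consume it; the clamping and run-batching done by the real routine are omitted and the exact loop shape is an assumption:

    void accept_memory(phys_addr_t start, phys_addr_t end)
    {
            u64 unit = unaccepted_table->unit_size;
            unsigned long first = (start - unaccepted_table->phys_base) / unit;
            unsigned long last = DIV_ROUND_UP(end - unaccepted_table->phys_base, unit);
            unsigned long bit;

            for (bit = first; bit < last; bit++) {
                    /* Accept each still-unaccepted unit exactly once. */
                    if (test_and_clear_bit(bit, unaccepted_table->bitmap))
                            arch_accept_memory(unaccepted_table->phys_base + bit * unit,
                                               unaccepted_table->phys_base + (bit + 1) * unit);
            }
    }
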
index 014ff22..94b7abc 100644 (file)
@@ -455,6 +455,12 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 #endif
 
        debug_putstr("\nDecompressing Linux... ");
+
+       if (init_unaccepted_memory()) {
+               debug_putstr("Accepting memory... ");
+               accept_memory(__pa(output), __pa(output) + needed_size);
+       }
+
        __decompress(input_data, input_len, NULL, NULL, output, output_len,
                        NULL, error);
        entry_offset = parse_elf(output);
index 2f155a0..964fe90 100644 (file)
@@ -247,4 +247,14 @@ static inline unsigned long efi_find_vendor_table(struct boot_params *bp,
 }
 #endif /* CONFIG_EFI */
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+bool init_unaccepted_memory(void);
+#else
+static inline bool init_unaccepted_memory(void) { return false; }
+#endif
+
+/* Defined in EFI stub */
+extern struct efi_unaccepted_memory *unaccepted_table;
+void accept_memory(phys_addr_t start, phys_addr_t end);
+
 #endif /* BOOT_COMPRESSED_MISC_H */
index 014b89c..09dc8c1 100644 (file)
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 /* Include code for early handlers */
 #include "../../kernel/sev-shared.c"
 
-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
 {
        return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
 }
@@ -181,6 +181,58 @@ static bool early_setup_ghcb(void)
        return true;
 }
 
+static phys_addr_t __snp_accept_memory(struct snp_psc_desc *desc,
+                                      phys_addr_t pa, phys_addr_t pa_end)
+{
+       struct psc_hdr *hdr;
+       struct psc_entry *e;
+       unsigned int i;
+
+       hdr = &desc->hdr;
+       memset(hdr, 0, sizeof(*hdr));
+
+       e = desc->entries;
+
+       i = 0;
+       while (pa < pa_end && i < VMGEXIT_PSC_MAX_ENTRY) {
+               hdr->end_entry = i;
+
+               e->gfn = pa >> PAGE_SHIFT;
+               e->operation = SNP_PAGE_STATE_PRIVATE;
+               if (IS_ALIGNED(pa, PMD_SIZE) && (pa_end - pa) >= PMD_SIZE) {
+                       e->pagesize = RMP_PG_SIZE_2M;
+                       pa += PMD_SIZE;
+               } else {
+                       e->pagesize = RMP_PG_SIZE_4K;
+                       pa += PAGE_SIZE;
+               }
+
+               e++;
+               i++;
+       }
+
+       if (vmgexit_psc(boot_ghcb, desc))
+               sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+       pvalidate_pages(desc);
+
+       return pa;
+}
+
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+       struct snp_psc_desc desc = {};
+       unsigned int i;
+       phys_addr_t pa;
+
+       if (!boot_ghcb && !early_setup_ghcb())
+               sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+       pa = start;
+       while (pa < end)
+               pa = __snp_accept_memory(&desc, pa, end);
+}
+
 void sev_es_shutdown_ghcb(void)
 {
        if (!boot_ghcb)
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644 (file)
index 0000000..fc725a9
--- /dev/null
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/boot/compressed/tdx-shared.c b/arch/x86/boot/compressed/tdx-shared.c
new file mode 100644 (file)
index 0000000..5ac4376
--- /dev/null
@@ -0,0 +1,2 @@
+#include "error.h"
+#include "../../coco/tdx/tdx-shared.c"
index 2d81d3c..8841b94 100644 (file)
@@ -20,7 +20,7 @@ static inline unsigned int tdx_io_in(int size, u16 port)
 {
        struct tdx_hypercall_args args = {
                .r10 = TDX_HYPERCALL_STANDARD,
-               .r11 = EXIT_REASON_IO_INSTRUCTION,
+               .r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
                .r12 = size,
                .r13 = 0,
                .r14 = port,
@@ -36,7 +36,7 @@ static inline void tdx_io_out(int size, u16 port, u32 value)
 {
        struct tdx_hypercall_args args = {
                .r10 = TDX_HYPERCALL_STANDARD,
-               .r11 = EXIT_REASON_IO_INSTRUCTION,
+               .r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
                .r12 = size,
                .r13 = 1,
                .r14 = port,
index 0bbf4f3..feb6dbd 100644 (file)
@@ -14,9 +14,7 @@
  */
 
 #include "boot.h"
-#ifdef CONFIG_X86_FEATURE_NAMES
 #include "cpustr.h"
-#endif
 
 static char *cpu_name(int level)
 {
@@ -35,7 +33,6 @@ static char *cpu_name(int level)
 static void show_cap_strs(u32 *err_flags)
 {
        int i, j;
-#ifdef CONFIG_X86_FEATURE_NAMES
        const unsigned char *msg_strs = (const unsigned char *)x86_cap_strs;
        for (i = 0; i < NCAPINTS; i++) {
                u32 e = err_flags[i];
@@ -58,16 +55,6 @@ static void show_cap_strs(u32 *err_flags)
                        e >>= 1;
                }
        }
-#else
-       for (i = 0; i < NCAPINTS; i++) {
-               u32 e = err_flags[i];
-               for (j = 0; j < 32; j++) {
-                       if (e & 1)
-                               printf("%d:%d ", i, j);
-                       e >>= 1;
-               }
-       }
-#endif
 }
 
 int validate_cpu(void)
index 73f8323..eeec998 100644 (file)
 #include <asm/coco.h>
 #include <asm/processor.h>
 
-enum cc_vendor cc_vendor __ro_after_init;
+enum cc_vendor cc_vendor __ro_after_init = CC_VENDOR_NONE;
 static u64 cc_mask __ro_after_init;
 
-static bool intel_cc_platform_has(enum cc_attr attr)
+static bool noinstr intel_cc_platform_has(enum cc_attr attr)
 {
        switch (attr) {
        case CC_ATTR_GUEST_UNROLL_STRING_IO:
@@ -34,7 +34,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
  * the other levels of SME/SEV functionality, including C-bit
  * based SEV-SNP, are not enabled.
  */
-static __maybe_unused bool amd_cc_platform_vtom(enum cc_attr attr)
+static __maybe_unused __always_inline bool amd_cc_platform_vtom(enum cc_attr attr)
 {
        switch (attr) {
        case CC_ATTR_GUEST_MEM_ENCRYPT:
@@ -58,7 +58,7 @@ static __maybe_unused bool amd_cc_platform_vtom(enum cc_attr attr)
  * the trampoline area must be encrypted.
  */
 
-static bool amd_cc_platform_has(enum cc_attr attr)
+static bool noinstr amd_cc_platform_has(enum cc_attr attr)
 {
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 
@@ -97,7 +97,7 @@ static bool amd_cc_platform_has(enum cc_attr attr)
 #endif
 }
 
-bool cc_platform_has(enum cc_attr attr)
+bool noinstr cc_platform_has(enum cc_attr attr)
 {
        switch (cc_vendor) {
        case CC_VENDOR_AMD:
index 46c5599..2c7dcbf 100644 (file)
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-y += tdx.o tdcall.o
+obj-y += tdx.o tdx-shared.o tdcall.o
diff --git a/arch/x86/coco/tdx/tdx-shared.c b/arch/x86/coco/tdx/tdx-shared.c
new file mode 100644 (file)
index 0000000..ef20ddc
--- /dev/null
@@ -0,0 +1,71 @@
+#include <asm/tdx.h>
+#include <asm/pgtable.h>
+
+static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
+                                   enum pg_level pg_level)
+{
+       unsigned long accept_size = page_level_size(pg_level);
+       u64 tdcall_rcx;
+       u8 page_size;
+
+       if (!IS_ALIGNED(start, accept_size))
+               return 0;
+
+       if (len < accept_size)
+               return 0;
+
+       /*
+        * Pass the page physical address to the TDX module to accept the
+        * pending, private page.
+        *
+        * Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
+        */
+       switch (pg_level) {
+       case PG_LEVEL_4K:
+               page_size = 0;
+               break;
+       case PG_LEVEL_2M:
+               page_size = 1;
+               break;
+       case PG_LEVEL_1G:
+               page_size = 2;
+               break;
+       default:
+               return 0;
+       }
+
+       tdcall_rcx = start | page_size;
+       if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
+               return 0;
+
+       return accept_size;
+}
+
+bool tdx_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+       /*
+        * For shared->private conversion, accept the page using
+        * TDX_ACCEPT_PAGE TDX module call.
+        */
+       while (start < end) {
+               unsigned long len = end - start;
+               unsigned long accept_size;
+
+               /*
+                * Try larger accepts first. It gives chance to VMM to keep
+                * 1G/2M Secure EPT entries where possible and speeds up
+                * process by cutting number of hypercalls (if successful).
+                */
+
+               accept_size = try_accept_one(start, len, PG_LEVEL_1G);
+               if (!accept_size)
+                       accept_size = try_accept_one(start, len, PG_LEVEL_2M);
+               if (!accept_size)
+                       accept_size = try_accept_one(start, len, PG_LEVEL_4K);
+               if (!accept_size)
+                       return false;
+               start += accept_size;
+       }
+
+       return true;
+}
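
As a worked example of the RCX encoding described in the comment above (bits 2:0 carry the page size), accepting one 2M-aligned region looks like this; the address is arbitrary:

    phys_addr_t start = 0x40000000;      /* 2M-aligned guest physical address */
    u8 page_size = 1;                    /* PG_LEVEL_2M per the encoding above */
    u64 tdcall_rcx = start | page_size;  /* 0x40000001 passed to TDX_ACCEPT_PAGE */
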
index e146b59..1d6b863 100644 (file)
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
 
-/* TDX module Call Leaf IDs */
-#define TDX_GET_INFO                   1
-#define TDX_GET_VEINFO                 3
-#define TDX_GET_REPORT                 4
-#define TDX_ACCEPT_PAGE                        6
-#define TDX_WR                         8
-
-/* TDCS fields. To be used by TDG.VM.WR and TDG.VM.RD module calls */
-#define TDCS_NOTIFY_ENABLES            0x9100000000000010
-
-/* TDX hypercall Leaf IDs */
-#define TDVMCALL_MAP_GPA               0x10001
-#define TDVMCALL_REPORT_FATAL_ERROR    0x10003
-
 /* MMIO direction */
 #define EPT_READ       0
 #define EPT_WRITE      1
 
 #define TDREPORT_SUBTYPE_0     0
 
-/*
- * Wrapper for standard use of __tdx_hypercall with no output aside from
- * return code.
- */
-static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
-{
-       struct tdx_hypercall_args args = {
-               .r10 = TDX_HYPERCALL_STANDARD,
-               .r11 = fn,
-               .r12 = r12,
-               .r13 = r13,
-               .r14 = r14,
-               .r15 = r15,
-       };
-
-       return __tdx_hypercall(&args);
-}
-
 /* Called from __tdx_hypercall() for unrecoverable failure */
 noinstr void __tdx_hypercall_failed(void)
 {
@@ -76,17 +44,6 @@ noinstr void __tdx_hypercall_failed(void)
        panic("TDVMCALL failed. TDX module bug?");
 }
 
-/*
- * The TDG.VP.VMCALL-Instruction-execution sub-functions are defined
- * independently from but are currently matched 1:1 with VMX EXIT_REASONs.
- * Reusing the KVM EXIT_REASON macros makes it easier to connect the host and
- * guest sides of these calls.
- */
-static __always_inline u64 hcall_func(u64 exit_reason)
-{
-       return exit_reason;
-}
-
 #ifdef CONFIG_KVM_GUEST
 long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
                       unsigned long p3, unsigned long p4)
@@ -745,47 +702,6 @@ static bool tdx_cache_flush_required(void)
        return true;
 }
 
-static bool try_accept_one(phys_addr_t *start, unsigned long len,
-                         enum pg_level pg_level)
-{
-       unsigned long accept_size = page_level_size(pg_level);
-       u64 tdcall_rcx;
-       u8 page_size;
-
-       if (!IS_ALIGNED(*start, accept_size))
-               return false;
-
-       if (len < accept_size)
-               return false;
-
-       /*
-        * Pass the page physical address to the TDX module to accept the
-        * pending, private page.
-        *
-        * Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
-        */
-       switch (pg_level) {
-       case PG_LEVEL_4K:
-               page_size = 0;
-               break;
-       case PG_LEVEL_2M:
-               page_size = 1;
-               break;
-       case PG_LEVEL_1G:
-               page_size = 2;
-               break;
-       default:
-               return false;
-       }
-
-       tdcall_rcx = *start | page_size;
-       if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
-               return false;
-
-       *start += accept_size;
-       return true;
-}
-
 /*
  * Inform the VMM of the guest's intent for this physical page: shared with
  * the VMM or private to the guest.  The VMM is expected to change its mapping
@@ -810,33 +726,34 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
        if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
                return false;
 
-       /* private->shared conversion  requires only MapGPA call */
-       if (!enc)
-               return true;
+       /* shared->private conversion requires memory to be accepted before use */
+       if (enc)
+               return tdx_accept_memory(start, end);
+
+       return true;
+}
 
+static bool tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
+                                         bool enc)
+{
        /*
-        * For shared->private conversion, accept the page using
-        * TDX_ACCEPT_PAGE TDX module call.
+        * Only handle shared->private conversion here.
+        * See the comment in tdx_early_init().
         */
-       while (start < end) {
-               unsigned long len = end - start;
-
-               /*
-                * Try larger accepts first. It gives chance to VMM to keep
-                * 1G/2M SEPT entries where possible and speeds up process by
-                * cutting number of hypercalls (if successful).
-                */
-
-               if (try_accept_one(&start, len, PG_LEVEL_1G))
-                       continue;
-
-               if (try_accept_one(&start, len, PG_LEVEL_2M))
-                       continue;
-
-               if (!try_accept_one(&start, len, PG_LEVEL_4K))
-                       return false;
-       }
+       if (enc)
+               return tdx_enc_status_changed(vaddr, numpages, enc);
+       return true;
+}
 
+static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
+                                        bool enc)
+{
+       /*
+        * Only handle private->shared conversion here.
+        * See the comment in tdx_early_init().
+        */
+       if (!enc)
+               return tdx_enc_status_changed(vaddr, numpages, enc);
        return true;
 }
 
@@ -852,7 +769,7 @@ void __init tdx_early_init(void)
 
        setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
-       cc_set_vendor(CC_VENDOR_INTEL);
+       cc_vendor = CC_VENDOR_INTEL;
        tdx_parse_tdinfo(&cc_mask);
        cc_set_mask(cc_mask);
 
@@ -867,9 +784,41 @@ void __init tdx_early_init(void)
         */
        physical_mask &= cc_mask - 1;
 
-       x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
-       x86_platform.guest.enc_tlb_flush_required   = tdx_tlb_flush_required;
-       x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
+       /*
+        * The kernel mapping should match the TDX metadata for the page.
+        * load_unaligned_zeropad() can touch memory *adjacent* to that which is
+        * owned by the caller and can catch even _momentary_ mismatches.  Bad
+        * things happen on mismatch:
+        *
+        *   - Private mapping => Shared Page  == Guest shutdown
+        *   - Shared mapping  => Private Page == Recoverable #VE
+        *
+        * guest.enc_status_change_prepare() converts the page from
+        * shared=>private before the mapping becomes private.
+        *
+        * guest.enc_status_change_finish() converts the page from
+        * private=>shared after the mapping becomes private.
+        *
+        * In both cases there is a temporary shared mapping to a private page,
+        * which can result in a #VE.  But, there is never a private mapping to
+        * a shared page.
+        */
+       x86_platform.guest.enc_status_change_prepare = tdx_enc_status_change_prepare;
+       x86_platform.guest.enc_status_change_finish  = tdx_enc_status_change_finish;
+
+       x86_platform.guest.enc_cache_flush_required  = tdx_cache_flush_required;
+       x86_platform.guest.enc_tlb_flush_required    = tdx_tlb_flush_required;
+
+       /*
+        * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
+        * bringup low level code. That raises #VE which cannot be handled
+        * there.
+        *
+        * Intel-TDX has a secure RDMSR hypercall, but that needs to be
+        * implemented separately in the low level startup ASM code.
+        * Until that is in place, disable parallel bringup for TDX.
+        */
+       x86_cpuinit.parallel_bringup = false;
 
        pr_info("Guest detected\n");
 }
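
For readers of the hunk above: the two new hooks are meant to bracket the actual page-table conversion done by the common set_memory code. The sketch below is illustrative only (the caller is not part of this patch); it shows the ordering the comment relies on, using the hook signatures from x86_platform.guest.

	/*
	 * Illustrative caller-side ordering (sketch, not part of the patch):
	 * prepare runs before the PTEs flip, finish runs after.
	 */
	static int sketch_set_memory_enc(unsigned long vaddr, int numpages, bool enc)
	{
		/* shared->private: TDX accepts the pages before the mapping turns private */
		if (!x86_platform.guest.enc_status_change_prepare(vaddr, numpages, enc))
			return -EIO;

		/* ... flip the shared/private bit in the kernel mapping here ... */

		/* private->shared: TDX converts the pages after the mapping turned shared */
		if (!x86_platform.guest.enc_status_change_finish(vaddr, numpages, enc))
			return -EIO;

		return 0;
	}
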
index 320480a..bc0a3c9 100644 (file)
 448    i386    process_mrelease        sys_process_mrelease
 449    i386    futex_waitv             sys_futex_waitv
 450    i386    set_mempolicy_home_node         sys_set_mempolicy_home_node
+451    i386    cachestat               sys_cachestat
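
The new entry wires up the cachestat() system call (page-cache residency statistics for a file range). A hedged user-space sketch follows; the struct layouts and the "len == 0 means to end of file" convention are taken from the 6.5 cachestat series rather than from this hunk, so treat them as assumptions here.

	/* Hedged sketch: query page-cache statistics via the new syscall 451. */
	#include <stdio.h>
	#include <stdint.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	struct cachestat_range { uint64_t off, len; };
	struct cachestat {
		uint64_t nr_cache, nr_dirty, nr_writeback,
			 nr_evicted, nr_recently_evicted;
	};

	int main(int argc, char **argv)
	{
		struct cachestat_range range = { 0, 0 };  /* len == 0: to end of file (assumed) */
		struct cachestat cs;
		int fd;

		if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
			return 1;
		if (syscall(451, fd, &range, &cs, 0))     /* 451 == cachestat per the tables here */
			return 1;
		printf("cached: %llu dirty: %llu writeback: %llu\n",
		       (unsigned long long)cs.nr_cache,
		       (unsigned long long)cs.nr_dirty,
		       (unsigned long long)cs.nr_writeback);
		return 0;
	}
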
index c84d126..227538b 100644 (file)
 448    common  process_mrelease        sys_process_mrelease
 449    common  futex_waitv             sys_futex_waitv
 450    common  set_mempolicy_home_node sys_set_mempolicy_home_node
+451    common  cachestat               sys_cachestat
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
index 5e37f41..27b5da2 100644 (file)
@@ -26,17 +26,7 @@ SYM_FUNC_START(\name)
        pushq %r11
 
        call \func
-       jmp  __thunk_restore
-SYM_FUNC_END(\name)
-       _ASM_NOKPROBE(\name)
-       .endm
-
-       THUNK preempt_schedule_thunk, preempt_schedule
-       THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
-       EXPORT_SYMBOL(preempt_schedule_thunk)
-       EXPORT_SYMBOL(preempt_schedule_notrace_thunk)
 
-SYM_CODE_START_LOCAL(__thunk_restore)
        popq %r11
        popq %r10
        popq %r9
@@ -48,5 +38,11 @@ SYM_CODE_START_LOCAL(__thunk_restore)
        popq %rdi
        popq %rbp
        RET
-       _ASM_NOKPROBE(__thunk_restore)
-SYM_CODE_END(__thunk_restore)
+SYM_FUNC_END(\name)
+       _ASM_NOKPROBE(\name)
+       .endm
+
+THUNK preempt_schedule_thunk, preempt_schedule
+THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
+EXPORT_SYMBOL(preempt_schedule_thunk)
+EXPORT_SYMBOL(preempt_schedule_notrace_thunk)
index 0a9007c..e464030 100644 (file)
@@ -8,6 +8,7 @@
 #include <linux/kernel.h>
 #include <linux/getcpu.h>
 #include <asm/segment.h>
+#include <vdso/processor.h>
 
 notrace long
 __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
index bccea57..abadd5f 100644 (file)
@@ -374,7 +374,7 @@ static int amd_pmu_hw_config(struct perf_event *event)
 
        /* pass precise event sampling to ibs: */
        if (event->attr.precise_ip && get_ibs_caps())
-               return -ENOENT;
+               return forward_event_to_ibs(event);
 
        if (has_branch_stack(event) && !x86_pmu.lbr_nr)
                return -EOPNOTSUPP;
index 6458295..3710148 100644 (file)
@@ -190,7 +190,7 @@ static struct perf_ibs *get_ibs_pmu(int type)
 }
 
 /*
- * Use IBS for precise event sampling:
+ * core pmu config -> IBS config
  *
  *  perf record -a -e cpu-cycles:p ...    # use ibs op counting cycle count
  *  perf record -a -e r076:p ...          # same as -e cpu-cycles:p
@@ -199,25 +199,9 @@ static struct perf_ibs *get_ibs_pmu(int type)
  * IbsOpCntCtl (bit 19) of IBS Execution Control Register (IbsOpCtl,
  * MSRC001_1033) is used to select either cycle or micro-ops counting
  * mode.
- *
- * The rip of IBS samples has skid 0. Thus, IBS supports precise
- * levels 1 and 2 and the PERF_EFLAGS_EXACT is set. In rare cases the
- * rip is invalid when IBS was not able to record the rip correctly.
- * We clear PERF_EFLAGS_EXACT and take the rip from pt_regs then.
- *
  */
-static int perf_ibs_precise_event(struct perf_event *event, u64 *config)
+static int core_pmu_ibs_config(struct perf_event *event, u64 *config)
 {
-       switch (event->attr.precise_ip) {
-       case 0:
-               return -ENOENT;
-       case 1:
-       case 2:
-               break;
-       default:
-               return -EOPNOTSUPP;
-       }
-
        switch (event->attr.type) {
        case PERF_TYPE_HARDWARE:
                switch (event->attr.config) {
@@ -243,22 +227,37 @@ static int perf_ibs_precise_event(struct perf_event *event, u64 *config)
        return -EOPNOTSUPP;
 }
 
+/*
+ * The rip of IBS samples has skid 0. Thus, IBS supports precise
+ * levels 1 and 2 and the PERF_EFLAGS_EXACT is set. In rare cases the
+ * rip is invalid when IBS was not able to record the rip correctly.
+ * We clear PERF_EFLAGS_EXACT and take the rip from pt_regs then.
+ */
+int forward_event_to_ibs(struct perf_event *event)
+{
+       u64 config = 0;
+
+       if (!event->attr.precise_ip || event->attr.precise_ip > 2)
+               return -EOPNOTSUPP;
+
+       if (!core_pmu_ibs_config(event, &config)) {
+               event->attr.type = perf_ibs_op.pmu.type;
+               event->attr.config = config;
+       }
+       return -ENOENT;
+}
+
 static int perf_ibs_init(struct perf_event *event)
 {
        struct hw_perf_event *hwc = &event->hw;
        struct perf_ibs *perf_ibs;
        u64 max_cnt, config;
-       int ret;
 
        perf_ibs = get_ibs_pmu(event->attr.type);
-       if (perf_ibs) {
-               config = event->attr.config;
-       } else {
-               perf_ibs = &perf_ibs_op;
-               ret = perf_ibs_precise_event(event, &config);
-               if (ret)
-                       return ret;
-       }
+       if (!perf_ibs)
+               return -ENOENT;
+
+       config = event->attr.config;
 
        if (event->pmu != &perf_ibs->pmu)
                return -ENOENT;
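
Taken together with the amd_pmu_hw_config() change above, the precise-IP path now works by translating the event and bouncing it to the IBS PMU. A hedged walk-through (the perf core's retry of event init when a PMU changes attr.type and returns -ENOENT is assumed, not shown in this hunk):

	perf record -a -e cpu-cycles:p ...
	  -> amd_pmu_hw_config()              /* core PMU; precise_ip set, IBS present */
	       -> forward_event_to_ibs(event)
	            core_pmu_ibs_config() maps cpu-cycles to IBS op cycle counting
	            event->attr.type   = perf_ibs_op.pmu.type
	            event->attr.config = <IBS op config>
	            returns -ENOENT           /* core PMU declines the event */
	  -> perf core retries with the PMU named by attr.type
	  -> perf_ibs_init() accepts the already-translated event
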
index 89b9c1c..a149faf 100644 (file)
@@ -349,6 +349,16 @@ static struct event_constraint intel_spr_event_constraints[] = {
        EVENT_CONSTRAINT_END
 };
 
+static struct extra_reg intel_gnr_extra_regs[] __read_mostly = {
+       INTEL_UEVENT_EXTRA_REG(0x012a, MSR_OFFCORE_RSP_0, 0x3fffffffffull, RSP_0),
+       INTEL_UEVENT_EXTRA_REG(0x012b, MSR_OFFCORE_RSP_1, 0x3fffffffffull, RSP_1),
+       INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
+       INTEL_UEVENT_EXTRA_REG(0x02c6, MSR_PEBS_FRONTEND, 0x9, FE),
+       INTEL_UEVENT_EXTRA_REG(0x03c6, MSR_PEBS_FRONTEND, 0x7fff1f, FE),
+       INTEL_UEVENT_EXTRA_REG(0x40ad, MSR_PEBS_FRONTEND, 0x7, FE),
+       INTEL_UEVENT_EXTRA_REG(0x04c2, MSR_PEBS_FRONTEND, 0x8, FE),
+       EVENT_EXTRA_END
+};
 
 EVENT_ATTR_STR(mem-loads,      mem_ld_nhm,     "event=0x0b,umask=0x10,ldlat=3");
 EVENT_ATTR_STR(mem-loads,      mem_ld_snb,     "event=0xcd,umask=0x1,ldlat=3");
@@ -2451,7 +2461,7 @@ static void intel_pmu_disable_fixed(struct perf_event *event)
 
        intel_clear_masks(event, idx);
 
-       mask = 0xfULL << ((idx - INTEL_PMC_IDX_FIXED) * 4);
+       mask = intel_fixed_bits_by_idx(idx - INTEL_PMC_IDX_FIXED, INTEL_FIXED_BITS_MASK);
        cpuc->fixed_ctrl_val &= ~mask;
 }
 
@@ -2750,25 +2760,25 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
         * if requested:
         */
        if (!event->attr.precise_ip)
-               bits |= 0x8;
+               bits |= INTEL_FIXED_0_ENABLE_PMI;
        if (hwc->config & ARCH_PERFMON_EVENTSEL_USR)
-               bits |= 0x2;
+               bits |= INTEL_FIXED_0_USER;
        if (hwc->config & ARCH_PERFMON_EVENTSEL_OS)
-               bits |= 0x1;
+               bits |= INTEL_FIXED_0_KERNEL;
 
        /*
         * ANY bit is supported in v3 and up
         */
        if (x86_pmu.version > 2 && hwc->config & ARCH_PERFMON_EVENTSEL_ANY)
-               bits |= 0x4;
+               bits |= INTEL_FIXED_0_ANYTHREAD;
 
        idx -= INTEL_PMC_IDX_FIXED;
-       bits <<= (idx * 4);
-       mask = 0xfULL << (idx * 4);
+       bits = intel_fixed_bits_by_idx(idx, bits);
+       mask = intel_fixed_bits_by_idx(idx, INTEL_FIXED_BITS_MASK);
 
        if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip) {
-               bits |= ICL_FIXED_0_ADAPTIVE << (idx * 4);
-               mask |= ICL_FIXED_0_ADAPTIVE << (idx * 4);
+               bits |= intel_fixed_bits_by_idx(idx, ICL_FIXED_0_ADAPTIVE);
+               mask |= intel_fixed_bits_by_idx(idx, ICL_FIXED_0_ADAPTIVE);
        }
 
        cpuc->fixed_ctrl_val &= ~mask;
@@ -6496,6 +6506,7 @@ __init int intel_pmu_init(void)
        case INTEL_FAM6_SAPPHIRERAPIDS_X:
        case INTEL_FAM6_EMERALDRAPIDS_X:
                x86_pmu.flags |= PMU_FL_MEM_LOADS_AUX;
+               x86_pmu.extra_regs = intel_spr_extra_regs;
                fallthrough;
        case INTEL_FAM6_GRANITERAPIDS_X:
        case INTEL_FAM6_GRANITERAPIDS_D:
@@ -6506,7 +6517,8 @@ __init int intel_pmu_init(void)
 
                x86_pmu.event_constraints = intel_spr_event_constraints;
                x86_pmu.pebs_constraints = intel_spr_pebs_event_constraints;
-               x86_pmu.extra_regs = intel_spr_extra_regs;
+               if (!x86_pmu.extra_regs)
+                       x86_pmu.extra_regs = intel_gnr_extra_regs;
                x86_pmu.limit_period = spr_limit_period;
                x86_pmu.pebs_ept = 1;
                x86_pmu.pebs_aliases = NULL;
@@ -6650,6 +6662,7 @@ __init int intel_pmu_init(void)
                pmu->pebs_constraints = intel_grt_pebs_event_constraints;
                pmu->extra_regs = intel_grt_extra_regs;
                if (is_mtl(boot_cpu_data.x86_model)) {
+                       x86_pmu.hybrid_pmu[X86_HYBRID_PMU_CORE_IDX].extra_regs = intel_gnr_extra_regs;
                        x86_pmu.pebs_latency_data = mtl_latency_data_small;
                        extra_attr = boot_cpu_has(X86_FEATURE_RTM) ?
                                mtl_hybrid_extra_attr_rtm : mtl_hybrid_extra_attr;
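
A stand-alone check (ordinary user-space C, not kernel code) that the new intel_fixed_bits_by_idx() helper used above reproduces the old open-coded shifts, shown for fixed counter 1 with the kernel, user and PMI bits set:

	#include <assert.h>
	#include <stdint.h>

	#define INTEL_FIXED_BITS_MASK		0xFULL
	#define INTEL_FIXED_BITS_STRIDE		4
	#define INTEL_FIXED_0_KERNEL		(1ULL << 0)
	#define INTEL_FIXED_0_USER		(1ULL << 1)
	#define INTEL_FIXED_0_ENABLE_PMI	(1ULL << 3)
	#define intel_fixed_bits_by_idx(_idx, _bits) \
		((_bits) << ((_idx) * INTEL_FIXED_BITS_STRIDE))

	int main(void)
	{
		int idx = 1;	/* second fixed counter */
		uint64_t bits = INTEL_FIXED_0_KERNEL | INTEL_FIXED_0_USER |
				INTEL_FIXED_0_ENABLE_PMI;	/* 0xb */

		/* old code: bits <<= idx * 4;  mask = 0xfULL << (idx * 4); */
		assert(intel_fixed_bits_by_idx(idx, bits) == 0xb0ULL);
		assert(intel_fixed_bits_by_idx(idx, INTEL_FIXED_BITS_MASK) == 0xf0ULL);
		return 0;
	}
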
index cc92388..14f46ad 100644 (file)
@@ -17,6 +17,7 @@
 #include <asm/mem_encrypt.h>
 #include <asm/mshyperv.h>
 #include <asm/hypervisor.h>
+#include <asm/mtrr.h>
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 
@@ -364,7 +365,7 @@ void __init hv_vtom_init(void)
         * Set it here to indicate a vTOM VM.
         */
        sev_status = MSR_AMD64_SNP_VTOM;
-       cc_set_vendor(CC_VENDOR_AMD);
+       cc_vendor = CC_VENDOR_AMD;
        cc_set_mask(ms_hyperv.shared_gpa_boundary);
        physical_mask &= ms_hyperv.shared_gpa_boundary - 1;
 
@@ -372,6 +373,9 @@ void __init hv_vtom_init(void)
        x86_platform.guest.enc_cache_flush_required = hv_vtom_cache_flush_required;
        x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
        x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;
+
+       /* Set WB as the default cache mode. */
+       mtrr_overwrite_state(NULL, 0, MTRR_TYPE_WRBACK);
 }
 
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
index 1e51650..4f1ce5f 100644 (file)
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 
 
+generated-y += orc_hash.h
 generated-y += syscalls_32.h
 generated-y += syscalls_64.h
 generated-y += syscalls_x32.h
index d7da28f..6c15a62 100644 (file)
@@ -113,7 +113,6 @@ extern void callthunks_patch_builtin_calls(void);
 extern void callthunks_patch_module_calls(struct callthunk_sites *sites,
                                          struct module *mod);
 extern void *callthunks_translate_call_dest(void *dest);
-extern bool is_callthunk(void *addr);
 extern int x86_call_depth_emit_accounting(u8 **pprog, void *func);
 #else
 static __always_inline void callthunks_patch_builtin_calls(void) {}
@@ -124,10 +123,6 @@ static __always_inline void *callthunks_translate_call_dest(void *dest)
 {
        return dest;
 }
-static __always_inline bool is_callthunk(void *addr)
-{
-       return false;
-}
 static __always_inline int x86_call_depth_emit_accounting(u8 **pprog,
                                                          void *func)
 {
index 3216da7..98c32aa 100644 (file)
@@ -55,6 +55,8 @@ extern int local_apic_timer_c2_ok;
 extern int disable_apic;
 extern unsigned int lapic_timer_period;
 
+extern int cpuid_to_apicid[];
+
 extern enum apic_intr_mode_id apic_intr_mode;
 enum apic_intr_mode_id {
        APIC_PIC,
@@ -377,7 +379,6 @@ extern struct apic *__apicdrivers[], *__apicdrivers_end[];
  * APIC functionality to boot other CPUs - only used on SMP:
  */
 #ifdef CONFIG_SMP
-extern int wakeup_secondary_cpu_via_nmi(int apicid, unsigned long start_eip);
 extern int lapic_can_unplug_cpu(void);
 #endif
 
@@ -507,10 +508,8 @@ extern int default_check_phys_apicid_present(int phys_apicid);
 #endif /* CONFIG_X86_LOCAL_APIC */
 
 #ifdef CONFIG_SMP
-bool apic_id_is_primary_thread(unsigned int id);
 void apic_smt_update(void);
 #else
-static inline bool apic_id_is_primary_thread(unsigned int id) { return false; }
 static inline void apic_smt_update(void) { }
 #endif
 
index 68d213e..4b125e5 100644 (file)
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_APICDEF_H
 #define _ASM_X86_APICDEF_H
 
+#include <linux/bits.h>
+
 /*
  * Constants for various Intel APICs. (local APIC, IOAPIC, etc.)
  *
 #define                APIC_EILVT_MASKED       (1 << 16)
 
 #define APIC_BASE (fix_to_virt(FIX_APIC_BASE))
-#define APIC_BASE_MSR  0x800
-#define XAPIC_ENABLE   (1UL << 11)
-#define X2APIC_ENABLE  (1UL << 10)
+#define APIC_BASE_MSR          0x800
+#define APIC_X2APIC_ID_MSR     0x802
+#define XAPIC_ENABLE           BIT(11)
+#define X2APIC_ENABLE          BIT(10)
 
 #ifdef CONFIG_X86_32
 # define MAX_IO_APICS 64
 #define APIC_CPUID(apicid)     ((apicid) & XAPIC_DEST_CPUS_MASK)
 #define NUM_APIC_CLUSTERS      ((BAD_APICID + 1) >> XAPIC_DEST_CPUS_SHIFT)
 
+#ifndef __ASSEMBLY__
 /*
  * the local APIC register structure, memory mapped. Not terribly well
  * tested, but we might eventually use this one in the future - the
@@ -435,4 +439,5 @@ enum apic_delivery_modes {
        APIC_DELIVERY_MODE_EXTINT       = 7,
 };
 
+#endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_APICDEF_H */
index 5e754e8..55a55ec 100644 (file)
  * resource counting etc..
  */
 
-/**
- * arch_atomic_read - read atomic variable
- * @v: pointer of type atomic_t
- *
- * Atomically reads the value of @v.
- */
 static __always_inline int arch_atomic_read(const atomic_t *v)
 {
        /*
@@ -29,25 +23,11 @@ static __always_inline int arch_atomic_read(const atomic_t *v)
        return __READ_ONCE((v)->counter);
 }
 
-/**
- * arch_atomic_set - set atomic variable
- * @v: pointer of type atomic_t
- * @i: required value
- *
- * Atomically sets the value of @v to @i.
- */
 static __always_inline void arch_atomic_set(atomic_t *v, int i)
 {
        __WRITE_ONCE(v->counter, i);
 }
 
-/**
- * arch_atomic_add - add integer to atomic variable
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v.
- */
 static __always_inline void arch_atomic_add(int i, atomic_t *v)
 {
        asm volatile(LOCK_PREFIX "addl %1,%0"
@@ -55,13 +35,6 @@ static __always_inline void arch_atomic_add(int i, atomic_t *v)
                     : "ir" (i) : "memory");
 }
 
-/**
- * arch_atomic_sub - subtract integer from atomic variable
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
- *
- * Atomically subtracts @i from @v.
- */
 static __always_inline void arch_atomic_sub(int i, atomic_t *v)
 {
        asm volatile(LOCK_PREFIX "subl %1,%0"
@@ -69,27 +42,12 @@ static __always_inline void arch_atomic_sub(int i, atomic_t *v)
                     : "ir" (i) : "memory");
 }
 
-/**
- * arch_atomic_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
- *
- * Atomically subtracts @i from @v and returns
- * true if the result is zero, or false for all
- * other cases.
- */
 static __always_inline bool arch_atomic_sub_and_test(int i, atomic_t *v)
 {
        return GEN_BINARY_RMWcc(LOCK_PREFIX "subl", v->counter, e, "er", i);
 }
 #define arch_atomic_sub_and_test arch_atomic_sub_and_test
 
-/**
- * arch_atomic_inc - increment atomic variable
- * @v: pointer of type atomic_t
- *
- * Atomically increments @v by 1.
- */
 static __always_inline void arch_atomic_inc(atomic_t *v)
 {
        asm volatile(LOCK_PREFIX "incl %0"
@@ -97,12 +55,6 @@ static __always_inline void arch_atomic_inc(atomic_t *v)
 }
 #define arch_atomic_inc arch_atomic_inc
 
-/**
- * arch_atomic_dec - decrement atomic variable
- * @v: pointer of type atomic_t
- *
- * Atomically decrements @v by 1.
- */
 static __always_inline void arch_atomic_dec(atomic_t *v)
 {
        asm volatile(LOCK_PREFIX "decl %0"
@@ -110,69 +62,30 @@ static __always_inline void arch_atomic_dec(atomic_t *v)
 }
 #define arch_atomic_dec arch_atomic_dec
 
-/**
- * arch_atomic_dec_and_test - decrement and test
- * @v: pointer of type atomic_t
- *
- * Atomically decrements @v by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
- */
 static __always_inline bool arch_atomic_dec_and_test(atomic_t *v)
 {
        return GEN_UNARY_RMWcc(LOCK_PREFIX "decl", v->counter, e);
 }
 #define arch_atomic_dec_and_test arch_atomic_dec_and_test
 
-/**
- * arch_atomic_inc_and_test - increment and test
- * @v: pointer of type atomic_t
- *
- * Atomically increments @v by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
 static __always_inline bool arch_atomic_inc_and_test(atomic_t *v)
 {
        return GEN_UNARY_RMWcc(LOCK_PREFIX "incl", v->counter, e);
 }
 #define arch_atomic_inc_and_test arch_atomic_inc_and_test
 
-/**
- * arch_atomic_add_negative - add and test if negative
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
 static __always_inline bool arch_atomic_add_negative(int i, atomic_t *v)
 {
        return GEN_BINARY_RMWcc(LOCK_PREFIX "addl", v->counter, s, "er", i);
 }
 #define arch_atomic_add_negative arch_atomic_add_negative
 
-/**
- * arch_atomic_add_return - add integer and return
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v and returns @i + @v
- */
 static __always_inline int arch_atomic_add_return(int i, atomic_t *v)
 {
        return i + xadd(&v->counter, i);
 }
 #define arch_atomic_add_return arch_atomic_add_return
 
-/**
- * arch_atomic_sub_return - subtract integer and return
- * @v: pointer of type atomic_t
- * @i: integer value to subtract
- *
- * Atomically subtracts @i from @v and returns @v - @i
- */
 static __always_inline int arch_atomic_sub_return(int i, atomic_t *v)
 {
        return arch_atomic_add_return(-i, v);
index 808b4ee..3486d91 100644 (file)
@@ -61,30 +61,12 @@ ATOMIC64_DECL(add_unless);
 #undef __ATOMIC64_DECL
 #undef ATOMIC64_EXPORT
 
-/**
- * arch_atomic64_cmpxchg - cmpxchg atomic64 variable
- * @v: pointer to type atomic64_t
- * @o: expected value
- * @n: new value
- *
- * Atomically sets @v to @n if it was equal to @o and returns
- * the old value.
- */
-
 static __always_inline s64 arch_atomic64_cmpxchg(atomic64_t *v, s64 o, s64 n)
 {
        return arch_cmpxchg64(&v->counter, o, n);
 }
 #define arch_atomic64_cmpxchg arch_atomic64_cmpxchg
 
-/**
- * arch_atomic64_xchg - xchg atomic64 variable
- * @v: pointer to type atomic64_t
- * @n: value to assign
- *
- * Atomically xchgs the value of @v to @n and returns
- * the old value.
- */
 static __always_inline s64 arch_atomic64_xchg(atomic64_t *v, s64 n)
 {
        s64 o;
@@ -97,13 +79,6 @@ static __always_inline s64 arch_atomic64_xchg(atomic64_t *v, s64 n)
 }
 #define arch_atomic64_xchg arch_atomic64_xchg
 
-/**
- * arch_atomic64_set - set atomic64 variable
- * @v: pointer to type atomic64_t
- * @i: value to assign
- *
- * Atomically sets the value of @v to @n.
- */
 static __always_inline void arch_atomic64_set(atomic64_t *v, s64 i)
 {
        unsigned high = (unsigned)(i >> 32);
@@ -113,12 +88,6 @@ static __always_inline void arch_atomic64_set(atomic64_t *v, s64 i)
                             : "eax", "edx", "memory");
 }
 
-/**
- * arch_atomic64_read - read atomic64 variable
- * @v: pointer to type atomic64_t
- *
- * Atomically reads the value of @v and returns it.
- */
 static __always_inline s64 arch_atomic64_read(const atomic64_t *v)
 {
        s64 r;
@@ -126,13 +95,6 @@ static __always_inline s64 arch_atomic64_read(const atomic64_t *v)
        return r;
 }
 
-/**
- * arch_atomic64_add_return - add and return
- * @i: integer value to add
- * @v: pointer to type atomic64_t
- *
- * Atomically adds @i to @v and returns @i + *@v
- */
 static __always_inline s64 arch_atomic64_add_return(s64 i, atomic64_t *v)
 {
        alternative_atomic64(add_return,
@@ -142,9 +104,6 @@ static __always_inline s64 arch_atomic64_add_return(s64 i, atomic64_t *v)
 }
 #define arch_atomic64_add_return arch_atomic64_add_return
 
-/*
- * Other variants with different arithmetic operators:
- */
 static __always_inline s64 arch_atomic64_sub_return(s64 i, atomic64_t *v)
 {
        alternative_atomic64(sub_return,
@@ -172,13 +131,6 @@ static __always_inline s64 arch_atomic64_dec_return(atomic64_t *v)
 }
 #define arch_atomic64_dec_return arch_atomic64_dec_return
 
-/**
- * arch_atomic64_add - add integer to atomic64 variable
- * @i: integer value to add
- * @v: pointer to type atomic64_t
- *
- * Atomically adds @i to @v.
- */
 static __always_inline s64 arch_atomic64_add(s64 i, atomic64_t *v)
 {
        __alternative_atomic64(add, add_return,
@@ -187,13 +139,6 @@ static __always_inline s64 arch_atomic64_add(s64 i, atomic64_t *v)
        return i;
 }
 
-/**
- * arch_atomic64_sub - subtract the atomic64 variable
- * @i: integer value to subtract
- * @v: pointer to type atomic64_t
- *
- * Atomically subtracts @i from @v.
- */
 static __always_inline s64 arch_atomic64_sub(s64 i, atomic64_t *v)
 {
        __alternative_atomic64(sub, sub_return,
@@ -202,12 +147,6 @@ static __always_inline s64 arch_atomic64_sub(s64 i, atomic64_t *v)
        return i;
 }
 
-/**
- * arch_atomic64_inc - increment atomic64 variable
- * @v: pointer to type atomic64_t
- *
- * Atomically increments @v by 1.
- */
 static __always_inline void arch_atomic64_inc(atomic64_t *v)
 {
        __alternative_atomic64(inc, inc_return, /* no output */,
@@ -215,12 +154,6 @@ static __always_inline void arch_atomic64_inc(atomic64_t *v)
 }
 #define arch_atomic64_inc arch_atomic64_inc
 
-/**
- * arch_atomic64_dec - decrement atomic64 variable
- * @v: pointer to type atomic64_t
- *
- * Atomically decrements @v by 1.
- */
 static __always_inline void arch_atomic64_dec(atomic64_t *v)
 {
        __alternative_atomic64(dec, dec_return, /* no output */,
@@ -228,15 +161,6 @@ static __always_inline void arch_atomic64_dec(atomic64_t *v)
 }
 #define arch_atomic64_dec arch_atomic64_dec
 
-/**
- * arch_atomic64_add_unless - add unless the number is a given value
- * @v: pointer of type atomic64_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
- *
- * Atomically adds @a to @v, so long as it was not @u.
- * Returns non-zero if the add was done, zero otherwise.
- */
 static __always_inline int arch_atomic64_add_unless(atomic64_t *v, s64 a, s64 u)
 {
        unsigned low = (unsigned)u;
index c496595..3165c0f 100644 (file)
 
 #define ATOMIC64_INIT(i)       { (i) }
 
-/**
- * arch_atomic64_read - read atomic64 variable
- * @v: pointer of type atomic64_t
- *
- * Atomically reads the value of @v.
- * Doesn't imply a read memory barrier.
- */
 static __always_inline s64 arch_atomic64_read(const atomic64_t *v)
 {
        return __READ_ONCE((v)->counter);
 }
 
-/**
- * arch_atomic64_set - set atomic64 variable
- * @v: pointer to type atomic64_t
- * @i: required value
- *
- * Atomically sets the value of @v to @i.
- */
 static __always_inline void arch_atomic64_set(atomic64_t *v, s64 i)
 {
        __WRITE_ONCE(v->counter, i);
 }
 
-/**
- * arch_atomic64_add - add integer to atomic64 variable
- * @i: integer value to add
- * @v: pointer to type atomic64_t
- *
- * Atomically adds @i to @v.
- */
 static __always_inline void arch_atomic64_add(s64 i, atomic64_t *v)
 {
        asm volatile(LOCK_PREFIX "addq %1,%0"
@@ -48,13 +27,6 @@ static __always_inline void arch_atomic64_add(s64 i, atomic64_t *v)
                     : "er" (i), "m" (v->counter) : "memory");
 }
 
-/**
- * arch_atomic64_sub - subtract the atomic64 variable
- * @i: integer value to subtract
- * @v: pointer to type atomic64_t
- *
- * Atomically subtracts @i from @v.
- */
 static __always_inline void arch_atomic64_sub(s64 i, atomic64_t *v)
 {
        asm volatile(LOCK_PREFIX "subq %1,%0"
@@ -62,27 +34,12 @@ static __always_inline void arch_atomic64_sub(s64 i, atomic64_t *v)
                     : "er" (i), "m" (v->counter) : "memory");
 }
 
-/**
- * arch_atomic64_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @v: pointer to type atomic64_t
- *
- * Atomically subtracts @i from @v and returns
- * true if the result is zero, or false for all
- * other cases.
- */
 static __always_inline bool arch_atomic64_sub_and_test(s64 i, atomic64_t *v)
 {
        return GEN_BINARY_RMWcc(LOCK_PREFIX "subq", v->counter, e, "er", i);
 }
 #define arch_atomic64_sub_and_test arch_atomic64_sub_and_test
 
-/**
- * arch_atomic64_inc - increment atomic64 variable
- * @v: pointer to type atomic64_t
- *
- * Atomically increments @v by 1.
- */
 static __always_inline void arch_atomic64_inc(atomic64_t *v)
 {
        asm volatile(LOCK_PREFIX "incq %0"
@@ -91,12 +48,6 @@ static __always_inline void arch_atomic64_inc(atomic64_t *v)
 }
 #define arch_atomic64_inc arch_atomic64_inc
 
-/**
- * arch_atomic64_dec - decrement atomic64 variable
- * @v: pointer to type atomic64_t
- *
- * Atomically decrements @v by 1.
- */
 static __always_inline void arch_atomic64_dec(atomic64_t *v)
 {
        asm volatile(LOCK_PREFIX "decq %0"
@@ -105,56 +56,24 @@ static __always_inline void arch_atomic64_dec(atomic64_t *v)
 }
 #define arch_atomic64_dec arch_atomic64_dec
 
-/**
- * arch_atomic64_dec_and_test - decrement and test
- * @v: pointer to type atomic64_t
- *
- * Atomically decrements @v by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
- */
 static __always_inline bool arch_atomic64_dec_and_test(atomic64_t *v)
 {
        return GEN_UNARY_RMWcc(LOCK_PREFIX "decq", v->counter, e);
 }
 #define arch_atomic64_dec_and_test arch_atomic64_dec_and_test
 
-/**
- * arch_atomic64_inc_and_test - increment and test
- * @v: pointer to type atomic64_t
- *
- * Atomically increments @v by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
 static __always_inline bool arch_atomic64_inc_and_test(atomic64_t *v)
 {
        return GEN_UNARY_RMWcc(LOCK_PREFIX "incq", v->counter, e);
 }
 #define arch_atomic64_inc_and_test arch_atomic64_inc_and_test
 
-/**
- * arch_atomic64_add_negative - add and test if negative
- * @i: integer value to add
- * @v: pointer to type atomic64_t
- *
- * Atomically adds @i to @v and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
 static __always_inline bool arch_atomic64_add_negative(s64 i, atomic64_t *v)
 {
        return GEN_BINARY_RMWcc(LOCK_PREFIX "addq", v->counter, s, "er", i);
 }
 #define arch_atomic64_add_negative arch_atomic64_add_negative
 
-/**
- * arch_atomic64_add_return - add and return
- * @i: integer value to add
- * @v: pointer to type atomic64_t
- *
- * Atomically adds @i to @v and returns @i + @v
- */
 static __always_inline s64 arch_atomic64_add_return(s64 i, atomic64_t *v)
 {
        return i + xadd(&v->counter, i);
index 92ae283..f25ca2d 100644 (file)
@@ -4,8 +4,6 @@
 
 #include <asm/processor.h>
 
-extern void check_bugs(void);
-
 #if defined(CONFIG_CPU_SUP_INTEL) && defined(CONFIG_X86_32)
 int ppro_with_ram_bug(void);
 #else
index 540573f..d536365 100644 (file)
@@ -239,29 +239,4 @@ extern void __add_wrong_size(void)
 #define __xadd(ptr, inc, lock) __xchg_op((ptr), (inc), xadd, lock)
 #define xadd(ptr, inc)         __xadd((ptr), (inc), LOCK_PREFIX)
 
-#define __cmpxchg_double(pfx, p1, p2, o1, o2, n1, n2)                  \
-({                                                                     \
-       bool __ret;                                                     \
-       __typeof__(*(p1)) __old1 = (o1), __new1 = (n1);                 \
-       __typeof__(*(p2)) __old2 = (o2), __new2 = (n2);                 \
-       BUILD_BUG_ON(sizeof(*(p1)) != sizeof(long));                    \
-       BUILD_BUG_ON(sizeof(*(p2)) != sizeof(long));                    \
-       VM_BUG_ON((unsigned long)(p1) % (2 * sizeof(long)));            \
-       VM_BUG_ON((unsigned long)((p1) + 1) != (unsigned long)(p2));    \
-       asm volatile(pfx "cmpxchg%c5b %1"                               \
-                    CC_SET(e)                                          \
-                    : CC_OUT(e) (__ret),                               \
-                      "+m" (*(p1)), "+m" (*(p2)),                      \
-                      "+a" (__old1), "+d" (__old2)                     \
-                    : "i" (2 * sizeof(long)),                          \
-                      "b" (__new1), "c" (__new2));                     \
-       __ret;                                                          \
-})
-
-#define arch_cmpxchg_double(p1, p2, o1, o2, n1, n2) \
-       __cmpxchg_double(LOCK_PREFIX, p1, p2, o1, o2, n1, n2)
-
-#define arch_cmpxchg_double_local(p1, p2, o1, o2, n1, n2) \
-       __cmpxchg_double(, p1, p2, o1, o2, n1, n2)
-
 #endif /* ASM_X86_CMPXCHG_H */
index 6ba80ce..b5731c5 100644 (file)
@@ -103,6 +103,6 @@ static inline bool __try_cmpxchg64(volatile u64 *ptr, u64 *pold, u64 new)
 
 #endif
 
-#define system_has_cmpxchg_double() boot_cpu_has(X86_FEATURE_CX8)
+#define system_has_cmpxchg64()         boot_cpu_has(X86_FEATURE_CX8)
 
 #endif /* _ASM_X86_CMPXCHG_32_H */
index 0d3beb2..44b08b5 100644 (file)
        arch_try_cmpxchg((ptr), (po), (n));                             \
 })
 
-#define system_has_cmpxchg_double() boot_cpu_has(X86_FEATURE_CX16)
+union __u128_halves {
+       u128 full;
+       struct {
+               u64 low, high;
+       };
+};
+
+#define __arch_cmpxchg128(_ptr, _old, _new, _lock)                     \
+({                                                                     \
+       union __u128_halves o = { .full = (_old), },                    \
+                           n = { .full = (_new), };                    \
+                                                                       \
+       asm volatile(_lock "cmpxchg16b %[ptr]"                          \
+                    : [ptr] "+m" (*(_ptr)),                            \
+                      "+a" (o.low), "+d" (o.high)                      \
+                    : "b" (n.low), "c" (n.high)                        \
+                    : "memory");                                       \
+                                                                       \
+       o.full;                                                         \
+})
+
+static __always_inline u128 arch_cmpxchg128(volatile u128 *ptr, u128 old, u128 new)
+{
+       return __arch_cmpxchg128(ptr, old, new, LOCK_PREFIX);
+}
+#define arch_cmpxchg128 arch_cmpxchg128
+
+static __always_inline u128 arch_cmpxchg128_local(volatile u128 *ptr, u128 old, u128 new)
+{
+       return __arch_cmpxchg128(ptr, old, new,);
+}
+#define arch_cmpxchg128_local arch_cmpxchg128_local
+
+#define __arch_try_cmpxchg128(_ptr, _oldp, _new, _lock)                        \
+({                                                                     \
+       union __u128_halves o = { .full = *(_oldp), },                  \
+                           n = { .full = (_new), };                    \
+       bool ret;                                                       \
+                                                                       \
+       asm volatile(_lock "cmpxchg16b %[ptr]"                          \
+                    CC_SET(e)                                          \
+                    : CC_OUT(e) (ret),                                 \
+                      [ptr] "+m" (*ptr),                               \
+                      "+a" (o.low), "+d" (o.high)                      \
+                    : "b" (n.low), "c" (n.high)                        \
+                    : "memory");                                       \
+                                                                       \
+       if (unlikely(!ret))                                             \
+               *(_oldp) = o.full;                                      \
+                                                                       \
+       likely(ret);                                                    \
+})
+
+static __always_inline bool arch_try_cmpxchg128(volatile u128 *ptr, u128 *oldp, u128 new)
+{
+       return __arch_try_cmpxchg128(ptr, oldp, new, LOCK_PREFIX);
+}
+#define arch_try_cmpxchg128 arch_try_cmpxchg128
+
+static __always_inline bool arch_try_cmpxchg128_local(volatile u128 *ptr, u128 *oldp, u128 new)
+{
+       return __arch_try_cmpxchg128(ptr, oldp, new,);
+}
+#define arch_try_cmpxchg128_local arch_try_cmpxchg128_local
+
+#define system_has_cmpxchg128()                boot_cpu_has(X86_FEATURE_CX16)
 
 #endif /* _ASM_X86_CMPXCHG_64_H */
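
A hedged usage sketch of the new primitives (the caller below is hypothetical and not part of the patch): the pattern that previously needed cmpxchg_double() on two adjacent longs becomes a plain try_cmpxchg128() retry loop on a 16-byte value, here a {pointer, sequence} pair used as an ABA guard.

	union slot {
		struct {
			u64 ptr;
			u64 seq;
		};
		u128 full;
	} __aligned(16);

	static void slot_set_ptr(union slot *s, u64 new_ptr)
	{
		union slot old = { .full = s->full }, new;

		do {
			new.ptr = new_ptr;
			new.seq = old.seq + 1;	/* bump the ABA counter */
			/* on failure, old.full is refreshed with the current value */
		} while (!arch_try_cmpxchg128(&s->full, &old.full, new.full));
	}
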
index eb08796..6ae2d16 100644 (file)
@@ -10,30 +10,13 @@ enum cc_vendor {
        CC_VENDOR_INTEL,
 };
 
-#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
 extern enum cc_vendor cc_vendor;
 
-static inline enum cc_vendor cc_get_vendor(void)
-{
-       return cc_vendor;
-}
-
-static inline void cc_set_vendor(enum cc_vendor vendor)
-{
-       cc_vendor = vendor;
-}
-
+#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
 void cc_set_mask(u64 mask);
 u64 cc_mkenc(u64 val);
 u64 cc_mkdec(u64 val);
 #else
-static inline enum cc_vendor cc_get_vendor(void)
-{
-       return CC_VENDOR_NONE;
-}
-
-static inline void cc_set_vendor(enum cc_vendor vendor) { }
-
 static inline u64 cc_mkenc(u64 val)
 {
        return val;
index 78796b9..3a233eb 100644 (file)
@@ -30,10 +30,7 @@ struct x86_cpu {
 #ifdef CONFIG_HOTPLUG_CPU
 extern int arch_register_cpu(int num);
 extern void arch_unregister_cpu(int);
-extern void start_cpu0(void);
-#ifdef CONFIG_DEBUG_HOTPLUG_CPU0
-extern int _debug_hotplug_cpu(int cpu, int action);
-#endif
+extern void soft_restart_cpu(void);
 #endif
 
 extern void ap_init_aperfmperf(void);
@@ -98,4 +95,6 @@ extern u64 x86_read_arch_cap_msr(void);
 int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
 int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
 
+extern struct cpumask cpus_stop_mask;
+
 #endif /* _ASM_X86_CPU_H */
index ce0c8f7..a26bebb 100644 (file)
@@ -38,15 +38,10 @@ enum cpuid_leafs
 #define X86_CAP_FMT_NUM "%d:%d"
 #define x86_cap_flag_num(flag) ((flag) >> 5), ((flag) & 31)
 
-#ifdef CONFIG_X86_FEATURE_NAMES
 extern const char * const x86_cap_flags[NCAPINTS*32];
 extern const char * const x86_power_flags[32];
 #define X86_CAP_FMT "%s"
 #define x86_cap_flag(flag) x86_cap_flags[flag]
-#else
-#define X86_CAP_FMT X86_CAP_FMT_NUM
-#define x86_cap_flag x86_cap_flag_num
-#endif
 
 /*
  * In order to save room, we index into this array by doing
index c5aed9e..4acfd57 100644 (file)
@@ -4,11 +4,6 @@
 #ifndef __ASSEMBLY__
 #include <linux/cpumask.h>
 
-extern cpumask_var_t cpu_callin_mask;
-extern cpumask_var_t cpu_callout_mask;
-extern cpumask_var_t cpu_initialized_mask;
-extern cpumask_var_t cpu_sibling_setup_mask;
-
 extern void setup_cpu_local_masks(void);
 
 /*
index 54a6e4a..de0e88b 100644 (file)
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_DOUBLEFAULT_H
 #define _ASM_X86_DOUBLEFAULT_H
 
+#include <linux/linkage.h>
+
 #ifdef CONFIG_X86_32
 extern void doublefault_init_cpu_tss(void);
 #else
@@ -10,4 +12,6 @@ static inline void doublefault_init_cpu_tss(void)
 }
 #endif
 
+asmlinkage void __noreturn doublefault_shim(void);
+
 #endif /* _ASM_X86_DOUBLEFAULT_H */
index 419280d..8b4be7c 100644 (file)
@@ -31,6 +31,8 @@ extern unsigned long efi_mixed_mode_stack_pa;
 
 #define ARCH_EFI_IRQ_FLAGS_MASK        X86_EFLAGS_IF
 
+#define EFI_UNACCEPTED_UNIT_SIZE PMD_SIZE
+
 /*
  * The EFI services are called through variadic functions in many cases. These
  * functions are implemented in assembler and support only a fixed number of
index 503a577..b475d9a 100644 (file)
@@ -109,7 +109,7 @@ extern void fpu_reset_from_exception_fixup(void);
 
 /* Boot, hotplug and resume */
 extern void fpu__init_cpu(void);
-extern void fpu__init_system(struct cpuinfo_x86 *c);
+extern void fpu__init_system(void);
 extern void fpu__init_check_bugs(void);
 extern void fpu__resume_cpu(void);
 
index 5061ac9..b8d4a07 100644 (file)
@@ -106,6 +106,9 @@ struct dyn_arch_ftrace {
 
 #ifndef __ASSEMBLY__
 
+void prepare_ftrace_return(unsigned long ip, unsigned long *parent,
+                          unsigned long frame_pointer);
+
 #if defined(CONFIG_FUNCTION_TRACER) && defined(CONFIG_DYNAMIC_FTRACE)
 extern void set_ftrace_ops_ro(void);
 #else
index 768aa23..29e083b 100644 (file)
@@ -40,8 +40,6 @@ extern void __handle_irq(struct irq_desc *desc, struct pt_regs *regs);
 
 extern void init_ISA_irqs(void);
 
-extern void __init init_IRQ(void);
-
 #ifdef CONFIG_X86_LOCAL_APIC
 void arch_trigger_cpumask_backtrace(const struct cpumask *mask,
                                    bool exclude_self);
index 9646ed6..180b1cb 100644 (file)
@@ -350,4 +350,7 @@ static inline void mce_amd_feature_init(struct cpuinfo_x86 *c)              { }
 #endif
 
 static inline void mce_hygon_feature_init(struct cpuinfo_x86 *c)       { return mce_amd_feature_init(c); }
+
+unsigned long copy_mc_fragile_handle_tail(char *to, char *from, unsigned len);
+
 #endif /* _ASM_X86_MCE_H */
index b712670..7f97a8a 100644 (file)
 
 #include <asm/bootparam.h>
 
+#ifdef CONFIG_X86_MEM_ENCRYPT
+void __init mem_encrypt_init(void);
+#else
+static inline void mem_encrypt_init(void) { }
+#endif
+
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 
 extern u64 sme_me_mask;
@@ -87,9 +93,6 @@ static inline void mem_encrypt_free_decrypted_mem(void) { }
 
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void);
-
 void add_encrypt_protection_map(void);
 
 /*
index 49bb4f2..88d9ef9 100644 (file)
@@ -257,6 +257,11 @@ void hv_set_register(unsigned int reg, u64 value);
 u64 hv_get_non_nested_register(unsigned int reg);
 void hv_set_non_nested_register(unsigned int reg, u64 value);
 
+static __always_inline u64 hv_raw_get_register(unsigned int reg)
+{
+       return __rdmsr(reg);
+}
+
 #else /* CONFIG_HYPERV */
 static inline void hyperv_init(void) {}
 static inline void hyperv_setup_mmu_ops(void) {}
index f0eeaf6..090d658 100644 (file)
 #ifndef _ASM_X86_MTRR_H
 #define _ASM_X86_MTRR_H
 
+#include <linux/bits.h>
 #include <uapi/asm/mtrr.h>
 
+/* Defines for hardware MTRR registers. */
+#define MTRR_CAP_VCNT          GENMASK(7, 0)
+#define MTRR_CAP_FIX           BIT_MASK(8)
+#define MTRR_CAP_WC            BIT_MASK(10)
+
+#define MTRR_DEF_TYPE_TYPE     GENMASK(7, 0)
+#define MTRR_DEF_TYPE_FE       BIT_MASK(10)
+#define MTRR_DEF_TYPE_E                BIT_MASK(11)
+
+#define MTRR_DEF_TYPE_ENABLE   (MTRR_DEF_TYPE_FE | MTRR_DEF_TYPE_E)
+#define MTRR_DEF_TYPE_DISABLE  ~(MTRR_DEF_TYPE_TYPE | MTRR_DEF_TYPE_ENABLE)
+
+#define MTRR_PHYSBASE_TYPE     GENMASK(7, 0)
+#define MTRR_PHYSBASE_RSVD     GENMASK(11, 8)
+
+#define MTRR_PHYSMASK_RSVD     GENMASK(10, 0)
+#define MTRR_PHYSMASK_V                BIT_MASK(11)
+
+struct mtrr_state_type {
+       struct mtrr_var_range var_ranges[MTRR_MAX_VAR_RANGES];
+       mtrr_type fixed_ranges[MTRR_NUM_FIXED_RANGES];
+       unsigned char enabled;
+       bool have_fixed;
+       mtrr_type def_type;
+};
+
 /*
  * The following functions are for use by other drivers that cannot use
  * arch_phys_wc_add and arch_phys_wc_del.
  */
 # ifdef CONFIG_MTRR
 void mtrr_bp_init(void);
+void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
+                         mtrr_type def_type);
 extern u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform);
 extern void mtrr_save_fixed_ranges(void *);
 extern void mtrr_save_state(void);
@@ -40,7 +69,6 @@ extern int mtrr_add_page(unsigned long base, unsigned long size,
                         unsigned int type, bool increment);
 extern int mtrr_del(int reg, unsigned long base, unsigned long size);
 extern int mtrr_del_page(int reg, unsigned long base, unsigned long size);
-extern void mtrr_centaur_report_mcr(int mcr, u32 lo, u32 hi);
 extern void mtrr_bp_restore(void);
 extern int mtrr_trim_uncached_memory(unsigned long end_pfn);
 extern int amd_special_default_mtrr(void);
@@ -48,12 +76,21 @@ void mtrr_disable(void);
 void mtrr_enable(void);
 void mtrr_generic_set_state(void);
 #  else
+static inline void mtrr_overwrite_state(struct mtrr_var_range *var,
+                                       unsigned int num_var,
+                                       mtrr_type def_type)
+{
+}
+
 static inline u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform)
 {
        /*
-        * Return no-MTRRs:
+        * Return the default MTRR type, without any known other types in
+        * that range.
         */
-       return MTRR_TYPE_INVALID;
+       *uniform = 1;
+
+       return MTRR_TYPE_UNCACHABLE;
 }
 #define mtrr_save_fixed_ranges(arg) do {} while (0)
 #define mtrr_save_state() do {} while (0)
@@ -79,9 +116,6 @@ static inline int mtrr_trim_uncached_memory(unsigned long end_pfn)
 {
        return 0;
 }
-static inline void mtrr_centaur_report_mcr(int mcr, u32 lo, u32 hi)
-{
-}
 #define mtrr_bp_init() do {} while (0)
 #define mtrr_bp_restore() do {} while (0)
 #define mtrr_disable() do {} while (0)
@@ -121,7 +155,8 @@ struct mtrr_gentry32 {
 #endif /* CONFIG_COMPAT */
 
 /* Bit fields for enabled in struct mtrr_state_type */
-#define MTRR_STATE_MTRR_FIXED_ENABLED  0x01
-#define MTRR_STATE_MTRR_ENABLED                0x02
+#define MTRR_STATE_SHIFT               10
+#define MTRR_STATE_MTRR_FIXED_ENABLED  (MTRR_DEF_TYPE_FE >> MTRR_STATE_SHIFT)
+#define MTRR_STATE_MTRR_ENABLED                (MTRR_DEF_TYPE_E >> MTRR_STATE_SHIFT)
 
 #endif /* _ASM_X86_MTRR_H */
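
A stand-alone check (user-space C, not kernel code) that the state flags derived from the new MSR bit definitions keep their old 0x01/0x02 values:

	#include <assert.h>

	#define BIT_MASK(n)		(1UL << (n))
	#define MTRR_DEF_TYPE_FE	BIT_MASK(10)	/* fixed-range enable, MSR bit 10 */
	#define MTRR_DEF_TYPE_E		BIT_MASK(11)	/* MTRR enable, MSR bit 11 */
	#define MTRR_STATE_SHIFT	10

	int main(void)
	{
		assert((MTRR_DEF_TYPE_FE >> MTRR_STATE_SHIFT) == 0x01);	/* MTRR_STATE_MTRR_FIXED_ENABLED */
		assert((MTRR_DEF_TYPE_E  >> MTRR_STATE_SHIFT) == 0x02);	/* MTRR_STATE_MTRR_ENABLED */
		return 0;
	}
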
index c5573ea..1c1b755 100644 (file)
@@ -34,6 +34,8 @@
 #define BYTES_NOP7     0x8d,0xb4,0x26,0x00,0x00,0x00,0x00
 #define BYTES_NOP8     0x3e,BYTES_NOP7
 
+#define ASM_NOP_MAX 8
+
 #else
 
 /*
@@ -47,6 +49,9 @@
  * 6: osp nopl 0x00(%eax,%eax,1)
  * 7: nopl 0x00000000(%eax)
  * 8: nopl 0x00000000(%eax,%eax,1)
+ * 9: cs nopl 0x00000000(%eax,%eax,1)
+ * 10: osp cs nopl 0x00000000(%eax,%eax,1)
+ * 11: osp osp cs nopl 0x00000000(%eax,%eax,1)
  */
 #define BYTES_NOP1     0x90
 #define BYTES_NOP2     0x66,BYTES_NOP1
 #define BYTES_NOP6     0x66,BYTES_NOP5
 #define BYTES_NOP7     0x0f,0x1f,0x80,0x00,0x00,0x00,0x00
 #define BYTES_NOP8     0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00
+#define BYTES_NOP9     0x2e,BYTES_NOP8
+#define BYTES_NOP10    0x66,BYTES_NOP9
+#define BYTES_NOP11    0x66,BYTES_NOP10
+
+#define ASM_NOP9  _ASM_BYTES(BYTES_NOP9)
+#define ASM_NOP10 _ASM_BYTES(BYTES_NOP10)
+#define ASM_NOP11 _ASM_BYTES(BYTES_NOP11)
+
+#define ASM_NOP_MAX 11
 
 #endif /* CONFIG_64BIT */
 
@@ -68,8 +82,6 @@
 #define ASM_NOP7 _ASM_BYTES(BYTES_NOP7)
 #define ASM_NOP8 _ASM_BYTES(BYTES_NOP8)
 
-#define ASM_NOP_MAX 8
-
 #ifndef __ASSEMBLY__
 extern const unsigned char * const x86_nops[];
 #endif
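
A stand-alone check (user-space C, not kernel code) that the new long NOPs are simply the 8-byte NOP with operand-size (0x66) and CS (0x2e) prefixes prepended, giving 9-, 10- and 11-byte encodings:

	#include <assert.h>

	#define BYTES_NOP8	0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00
	#define BYTES_NOP9	0x2e,BYTES_NOP8
	#define BYTES_NOP10	0x66,BYTES_NOP9
	#define BYTES_NOP11	0x66,BYTES_NOP10

	int main(void)
	{
		const unsigned char nop11[] = { BYTES_NOP11 };

		assert(sizeof(nop11) == 11);
		assert(nop11[0] == 0x66 && nop11[1] == 0x66 && nop11[2] == 0x2e);
		return 0;
	}
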
index edb2b0c..55388c9 100644 (file)
        movq    $-1, PER_CPU_VAR(pcpu_hot + X86_call_depth);
 
 #define RESET_CALL_DEPTH                                       \
-       mov     $0x80, %rax;                                    \
-       shl     $56, %rax;                                      \
+       xor     %eax, %eax;                                     \
+       bts     $63, %rax;                                      \
        movq    %rax, PER_CPU_VAR(pcpu_hot + X86_call_depth);
 
 #define RESET_CALL_DEPTH_FROM_CALL                             \
-       mov     $0xfc, %rax;                                    \
+       movb    $0xfc, %al;                                     \
        shl     $56, %rax;                                      \
        movq    %rax, PER_CPU_VAR(pcpu_hot + X86_call_depth);   \
        CALL_THUNKS_DEBUG_INC_CALLS
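
A stand-alone check (user-space C, not kernel code) that the rewritten sequences produce the same per-CPU call-depth constants as before, just with shorter instruction encodings:

	#include <assert.h>
	#include <stdint.h>

	int main(void)
	{
		/* RESET_CALL_DEPTH: old mov $0x80,%rax; shl $56  ==  new xor %eax,%eax; bts $63,%rax */
		assert((0x80ULL << 56) == (1ULL << 63));

		/* RESET_CALL_DEPTH_FROM_CALL: shl $56 discards bits 8..63, so writing
		 * only %al (movb $0xfc,%al) yields the same 0xfc00... value */
		assert((0xfcULL << 56) == 0xfc00000000000000ULL);
		return 0;
	}
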
diff --git a/arch/x86/include/asm/orc_header.h b/arch/x86/include/asm/orc_header.h
new file mode 100644 (file)
index 0000000..07bacf3
--- /dev/null
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* Copyright (c) Meta Platforms, Inc. and affiliates. */
+
+#ifndef _ORC_HEADER_H
+#define _ORC_HEADER_H
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+#include <asm/orc_hash.h>
+
+/*
+ * The header is currently a 20-byte hash of the ORC entry definition; see
+ * scripts/orc_hash.sh.
+ */
+#define ORC_HEADER                                     \
+       __used __section(".orc_header") __aligned(4)    \
+       static const u8 orc_header[] = { ORC_HASH }
+
+#endif /* _ORC_HEADER_H */
index 13c0d63..34734d7 100644 (file)
@@ -210,6 +210,67 @@ do {                                                                       \
        (typeof(_var))(unsigned long) pco_old__;                        \
 })
 
+#if defined(CONFIG_X86_32) && !defined(CONFIG_UML)
+#define percpu_cmpxchg64_op(size, qual, _var, _oval, _nval)            \
+({                                                                     \
+       union {                                                         \
+               u64 var;                                                \
+               struct {                                                \
+                       u32 low, high;                                  \
+               };                                                      \
+       } old__, new__;                                                 \
+                                                                       \
+       old__.var = _oval;                                              \
+       new__.var = _nval;                                              \
+                                                                       \
+       asm qual (ALTERNATIVE("leal %P[var], %%esi; call this_cpu_cmpxchg8b_emu", \
+                             "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
+                 : [var] "+m" (_var),                                  \
+                   "+a" (old__.low),                                   \
+                   "+d" (old__.high)                                   \
+                 : "b" (new__.low),                                    \
+                   "c" (new__.high)                                    \
+                 : "memory", "esi");                                   \
+                                                                       \
+       old__.var;                                                      \
+})
+
+#define raw_cpu_cmpxchg64(pcp, oval, nval)     percpu_cmpxchg64_op(8,         , pcp, oval, nval)
+#define this_cpu_cmpxchg64(pcp, oval, nval)    percpu_cmpxchg64_op(8, volatile, pcp, oval, nval)
+#endif
+
+#ifdef CONFIG_X86_64
+#define raw_cpu_cmpxchg64(pcp, oval, nval)     percpu_cmpxchg_op(8,         , pcp, oval, nval)
+#define this_cpu_cmpxchg64(pcp, oval, nval)    percpu_cmpxchg_op(8, volatile, pcp, oval, nval)
+
+#define percpu_cmpxchg128_op(size, qual, _var, _oval, _nval)           \
+({                                                                     \
+       union {                                                         \
+               u128 var;                                               \
+               struct {                                                \
+                       u64 low, high;                                  \
+               };                                                      \
+       } old__, new__;                                                 \
+                                                                       \
+       old__.var = _oval;                                              \
+       new__.var = _nval;                                              \
+                                                                       \
+       asm qual (ALTERNATIVE("leaq %P[var], %%rsi; call this_cpu_cmpxchg16b_emu", \
+                             "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
+                 : [var] "+m" (_var),                                  \
+                   "+a" (old__.low),                                   \
+                   "+d" (old__.high)                                   \
+                 : "b" (new__.low),                                    \
+                   "c" (new__.high)                                    \
+                 : "memory", "rsi");                                   \
+                                                                       \
+       old__.var;                                                      \
+})
+
+#define raw_cpu_cmpxchg128(pcp, oval, nval)    percpu_cmpxchg128_op(16,         , pcp, oval, nval)
+#define this_cpu_cmpxchg128(pcp, oval, nval)   percpu_cmpxchg128_op(16, volatile, pcp, oval, nval)
+#endif
+
 /*
  * this_cpu_read() makes gcc load the percpu variable every time it is
  * accessed while this_cpu_read_stable() allows the value to be cached.
@@ -290,23 +351,6 @@ do {                                                                       \
 #define this_cpu_cmpxchg_2(pcp, oval, nval)    percpu_cmpxchg_op(2, volatile, pcp, oval, nval)
 #define this_cpu_cmpxchg_4(pcp, oval, nval)    percpu_cmpxchg_op(4, volatile, pcp, oval, nval)
 
-#ifdef CONFIG_X86_CMPXCHG64
-#define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2)            \
-({                                                                     \
-       bool __ret;                                                     \
-       typeof(pcp1) __o1 = (o1), __n1 = (n1);                          \
-       typeof(pcp2) __o2 = (o2), __n2 = (n2);                          \
-       asm volatile("cmpxchg8b "__percpu_arg(1)                        \
-                    CC_SET(z)                                          \
-                    : CC_OUT(z) (__ret), "+m" (pcp1), "+m" (pcp2), "+a" (__o1), "+d" (__o2) \
-                    : "b" (__n1), "c" (__n2));                         \
-       __ret;                                                          \
-})
-
-#define raw_cpu_cmpxchg_double_4       percpu_cmpxchg8b_double
-#define this_cpu_cmpxchg_double_4      percpu_cmpxchg8b_double
-#endif /* CONFIG_X86_CMPXCHG64 */
-
 /*
  * Per cpu atomic 64 bit operations are only available under 64 bit.
  * 32 bit must fall back to generic operations.
@@ -329,30 +373,6 @@ do {                                                                       \
 #define this_cpu_add_return_8(pcp, val)                percpu_add_return_op(8, volatile, pcp, val)
 #define this_cpu_xchg_8(pcp, nval)             percpu_xchg_op(8, volatile, pcp, nval)
 #define this_cpu_cmpxchg_8(pcp, oval, nval)    percpu_cmpxchg_op(8, volatile, pcp, oval, nval)
-
-/*
- * Pretty complex macro to generate cmpxchg16 instruction.  The instruction
- * is not supported on early AMD64 processors so we must be able to emulate
- * it in software.  The address used in the cmpxchg16 instruction must be
- * aligned to a 16 byte boundary.
- */
-#define percpu_cmpxchg16b_double(pcp1, pcp2, o1, o2, n1, n2)           \
-({                                                                     \
-       bool __ret;                                                     \
-       typeof(pcp1) __o1 = (o1), __n1 = (n1);                          \
-       typeof(pcp2) __o2 = (o2), __n2 = (n2);                          \
-       alternative_io("leaq %P1,%%rsi\n\tcall this_cpu_cmpxchg16b_emu\n\t", \
-                      "cmpxchg16b " __percpu_arg(1) "\n\tsetz %0\n\t", \
-                      X86_FEATURE_CX16,                                \
-                      ASM_OUTPUT2("=a" (__ret), "+m" (pcp1),           \
-                                  "+m" (pcp2), "+d" (__o2)),           \
-                      "b" (__n1), "c" (__n2), "a" (__o1) : "rsi");     \
-       __ret;                                                          \
-})
-
-#define raw_cpu_cmpxchg_double_8       percpu_cmpxchg16b_double
-#define this_cpu_cmpxchg_double_8      percpu_cmpxchg16b_double
-
 #endif
 
 static __always_inline bool x86_this_cpu_constant_test_bit(unsigned int nr,
index abf0988..85a9fd5 100644 (file)
 #define ARCH_PERFMON_EVENTSEL_INV                      (1ULL << 23)
 #define ARCH_PERFMON_EVENTSEL_CMASK                    0xFF000000ULL
 
+#define INTEL_FIXED_BITS_MASK                          0xFULL
+#define INTEL_FIXED_BITS_STRIDE                        4
+#define INTEL_FIXED_0_KERNEL                           (1ULL << 0)
+#define INTEL_FIXED_0_USER                             (1ULL << 1)
+#define INTEL_FIXED_0_ANYTHREAD                        (1ULL << 2)
+#define INTEL_FIXED_0_ENABLE_PMI                       (1ULL << 3)
+
 #define HSW_IN_TX                                      (1ULL << 32)
 #define HSW_IN_TX_CHECKPOINTED                         (1ULL << 33)
 #define ICL_EVENTSEL_ADAPTIVE                          (1ULL << 34)
 #define ICL_FIXED_0_ADAPTIVE                           (1ULL << 32)
 
+#define intel_fixed_bits_by_idx(_idx, _bits)                   \
+       ((_bits) << ((_idx) * INTEL_FIXED_BITS_STRIDE))
+
 #define AMD64_EVENTSEL_INT_CORE_ENABLE                 (1ULL << 36)
 #define AMD64_EVENTSEL_GUESTONLY                       (1ULL << 40)
 #define AMD64_EVENTSEL_HOSTONLY                                (1ULL << 41)
@@ -478,8 +488,10 @@ struct pebs_xmm {
 
 #ifdef CONFIG_X86_LOCAL_APIC
 extern u32 get_ibs_caps(void);
+extern int forward_event_to_ibs(struct perf_event *event);
 #else
 static inline u32 get_ibs_caps(void) { return 0; }
+static inline int forward_event_to_ibs(struct perf_event *event) { return -ENOENT; }
 #endif
 
 #ifdef CONFIG_PERF_EVENTS
index 15ae4d6..5700bb3 100644 (file)
@@ -27,6 +27,7 @@
 extern pgd_t early_top_pgt[PTRS_PER_PGD];
 bool __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
 
+struct seq_file;
 void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm);
 void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
                                   bool user);
index 7929327..a629b1b 100644 (file)
@@ -237,8 +237,8 @@ static inline void native_pgd_clear(pgd_t *pgd)
 
 #define __pte_to_swp_entry(pte)                ((swp_entry_t) { pte_val((pte)) })
 #define __pmd_to_swp_entry(pmd)                ((swp_entry_t) { pmd_val((pmd)) })
-#define __swp_entry_to_pte(x)          ((pte_t) { .pte = (x).val })
-#define __swp_entry_to_pmd(x)          ((pmd_t) { .pmd = (x).val })
+#define __swp_entry_to_pte(x)          (__pte((x).val))
+#define __swp_entry_to_pmd(x)          (__pmd((x).val))
 
 extern void cleanup_highmap(void);
 
index 447d4be..ba3e255 100644 (file)
@@ -513,9 +513,6 @@ extern void native_pagetable_init(void);
 #define native_pagetable_init        paging_init
 #endif
 
-struct seq_file;
-extern void arch_report_meminfo(struct seq_file *m);
-
 enum pg_level {
        PG_LEVEL_NONE,
        PG_LEVEL_4K,
index a1e4fa5..d46300e 100644 (file)
@@ -551,7 +551,6 @@ extern void switch_gdt_and_percpu_base(int);
 extern void load_direct_gdt(int);
 extern void load_fixmap_gdt(int);
 extern void cpu_init(void);
-extern void cpu_init_secondary(void);
 extern void cpu_init_exception_handling(void);
 extern void cr4_init(void);
 
index f6a1737..87e5482 100644 (file)
@@ -52,6 +52,7 @@ struct trampoline_header {
        u64 efer;
        u32 cr4;
        u32 flags;
+       u32 lock;
 #endif
 };
 
@@ -64,6 +65,8 @@ extern unsigned long initial_stack;
 extern unsigned long initial_vc_handler;
 #endif
 
+extern u32 *trampoline_lock;
+
 extern unsigned char real_mode_blob[];
 extern unsigned char real_mode_relocs[];
 
index 0759af9..b463fcb 100644 (file)
@@ -106,8 +106,13 @@ enum psc_op {
 #define GHCB_HV_FT_SNP                 BIT_ULL(0)
 #define GHCB_HV_FT_SNP_AP_CREATION     BIT_ULL(1)
 
-/* SNP Page State Change NAE event */
-#define VMGEXIT_PSC_MAX_ENTRY          253
+/*
+ * SNP Page State Change NAE event
+ *   The VMGEXIT_PSC_MAX_ENTRY determines the size of the PSC structure, which
+ *   is a local stack variable in set_pages_state(). Do not increase this value
+ *   without evaluating the impact to stack usage.
+ */
+#define VMGEXIT_PSC_MAX_ENTRY          64
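
A quick sizing check of this change: with the 8-byte psc_hdr shown below and assuming each PSC entry is a packed 64-bit (8-byte) record, the on-stack struct snp_psc_desc in set_pages_state() shrinks from roughly 8 + 253 * 8 = 2032 bytes to 8 + 64 * 8 = 520 bytes (the per-entry layout is not part of this hunk, so treat the entry size as an assumption).
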
 
 struct psc_hdr {
        u16 cur_entry;
index 13dc2a9..66c8067 100644 (file)
@@ -14,6 +14,7 @@
 #include <asm/insn.h>
 #include <asm/sev-common.h>
 #include <asm/bootparam.h>
+#include <asm/coco.h>
 
 #define GHCB_PROTOCOL_MIN      1ULL
 #define GHCB_PROTOCOL_MAX      2ULL
@@ -80,11 +81,15 @@ extern void vc_no_ghcb(void);
 extern void vc_boot_ghcb(void);
 extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
 
+/* PVALIDATE return codes */
+#define PVALIDATE_FAIL_SIZEMISMATCH    6
+
 /* Software defined (when rFlags.CF = 1) */
 #define PVALIDATE_FAIL_NOUPDATE                255
 
 /* RMP page size */
 #define RMP_PG_SIZE_4K                 0
+#define RMP_PG_SIZE_2M                 1
 
 #define RMPADJUST_VMSA_PAGE_BIT                BIT(16)
 
@@ -136,24 +141,26 @@ struct snp_secrets_page_layout {
 } __packed;
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
-extern struct static_key_false sev_es_enable_key;
 extern void __sev_es_ist_enter(struct pt_regs *regs);
 extern void __sev_es_ist_exit(void);
 static __always_inline void sev_es_ist_enter(struct pt_regs *regs)
 {
-       if (static_branch_unlikely(&sev_es_enable_key))
+       if (cc_vendor == CC_VENDOR_AMD &&
+           cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT))
                __sev_es_ist_enter(regs);
 }
 static __always_inline void sev_es_ist_exit(void)
 {
-       if (static_branch_unlikely(&sev_es_enable_key))
+       if (cc_vendor == CC_VENDOR_AMD &&
+           cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT))
                __sev_es_ist_exit();
 }
 extern int sev_es_setup_ap_jump_table(struct real_mode_header *rmh);
 extern void __sev_es_nmi_complete(void);
 static __always_inline void sev_es_nmi_complete(void)
 {
-       if (static_branch_unlikely(&sev_es_enable_key))
+       if (cc_vendor == CC_VENDOR_AMD &&
+           cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT))
                __sev_es_nmi_complete();
 }
 extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
@@ -192,16 +199,17 @@ struct snp_guest_request_ioctl;
 
 void setup_ghcb(void);
 void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
-                                        unsigned int npages);
+                                        unsigned long npages);
 void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
-                                       unsigned int npages);
+                                       unsigned long npages);
 void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op);
-void snp_set_memory_shared(unsigned long vaddr, unsigned int npages);
-void snp_set_memory_private(unsigned long vaddr, unsigned int npages);
+void snp_set_memory_shared(unsigned long vaddr, unsigned long npages);
+void snp_set_memory_private(unsigned long vaddr, unsigned long npages);
 void snp_set_wakeup_secondary_cpu(void);
 bool snp_init(struct boot_params *bp);
 void __init __noreturn snp_abort(void);
 int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, struct snp_guest_request_ioctl *rio);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
 #else
 static inline void sev_es_ist_enter(struct pt_regs *regs) { }
 static inline void sev_es_ist_exit(void) { }
@@ -212,12 +220,12 @@ static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate)
 static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs) { return 0; }
 static inline void setup_ghcb(void) { }
 static inline void __init
-early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr, unsigned int npages) { }
+early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr, unsigned long npages) { }
 static inline void __init
-early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr, unsigned int npages) { }
+early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr, unsigned long npages) { }
 static inline void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op) { }
-static inline void snp_set_memory_shared(unsigned long vaddr, unsigned int npages) { }
-static inline void snp_set_memory_private(unsigned long vaddr, unsigned int npages) { }
+static inline void snp_set_memory_shared(unsigned long vaddr, unsigned long npages) { }
+static inline void snp_set_memory_private(unsigned long vaddr, unsigned long npages) { }
 static inline void snp_set_wakeup_secondary_cpu(void) { }
 static inline bool snp_init(struct boot_params *bp) { return false; }
 static inline void snp_abort(void) { }
@@ -225,6 +233,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
 {
        return -ENOTTY;
 }
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
 #endif
 
 #endif
index 2631e01..7513b3b 100644 (file)
 #define TDX_CPUID_LEAF_ID      0x21
 #define TDX_IDENT              "IntelTDX    "
 
+/* TDX module Call Leaf IDs */
+#define TDX_GET_INFO                   1
+#define TDX_GET_VEINFO                 3
+#define TDX_GET_REPORT                 4
+#define TDX_ACCEPT_PAGE                        6
+#define TDX_WR                         8
+
+/* TDCS fields. To be used by TDG.VM.WR and TDG.VM.RD module calls */
+#define TDCS_NOTIFY_ENABLES            0x9100000000000010
+
+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA               0x10001
+#define TDVMCALL_REPORT_FATAL_ERROR    0x10003
+
 #ifndef __ASSEMBLY__
 
 /*
@@ -37,8 +51,58 @@ struct tdx_hypercall_args {
 u64 __tdx_hypercall(struct tdx_hypercall_args *args);
 u64 __tdx_hypercall_ret(struct tdx_hypercall_args *args);
 
+/*
+ * Wrapper for standard use of __tdx_hypercall with no output aside from
+ * return code.
+ */
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+       struct tdx_hypercall_args args = {
+               .r10 = TDX_HYPERCALL_STANDARD,
+               .r11 = fn,
+               .r12 = r12,
+               .r13 = r13,
+               .r14 = r14,
+               .r15 = r15,
+       };
+
+       return __tdx_hypercall(&args);
+}
+
+
 /* Called from __tdx_hypercall() for unrecoverable failure */
 void __tdx_hypercall_failed(void);
 
+/*
+ * Used in __tdx_module_call() to gather the output registers' values of the
+ * TDCALL instruction when requesting services from the TDX module. This is a
+ * software only structure and not part of the TDX module/VMM ABI
+ */
+struct tdx_module_output {
+       u64 rcx;
+       u64 rdx;
+       u64 r8;
+       u64 r9;
+       u64 r10;
+       u64 r11;
+};
+
+/* Used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+                     struct tdx_module_output *out);
+
+bool tdx_accept_memory(phys_addr_t start, phys_addr_t end);
+
+/*
+ * The TDG.VP.VMCALL-Instruction-execution sub-functions are defined
+ * independently from but are currently matched 1:1 with VMX EXIT_REASONs.
+ * Reusing the KVM EXIT_REASON macros makes it easier to connect the host and
+ * guest sides of these calls.
+ */
+static __always_inline u64 hcall_func(u64 exit_reason)
+{
+        return exit_reason;
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_SHARED_TDX_H */
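To illustrate the _tdx_hypercall() wrapper together with the new leaf IDs above, a hedged sketch of issuing a MapGPA request; the register meaning (r12 = guest physical address, r13 = length) follows the GHCI MapGPA convention and is assumed here rather than defined by this hunk:

	/* Sketch only: request host-side mapping of a GPA range. */
	static int example_map_gpa(u64 gpa, u64 len)
	{
		/* r12 = start GPA, r13 = length; r14/r15 unused. */
		if (_tdx_hypercall(TDVMCALL_MAP_GPA, gpa, len, 0, 0))
			return -EIO;

		return 0;
	}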
index 5b1ed65..84eab27 100644 (file)
@@ -85,6 +85,4 @@ struct rt_sigframe_x32 {
 
 #endif /* CONFIG_X86_64 */
 
-void __init init_sigframe_size(void);
-
 #endif /* _ASM_X86_SIGFRAME_H */
index 4e91054..600cf25 100644 (file)
@@ -38,7 +38,9 @@ struct smp_ops {
        void (*crash_stop_other_cpus)(void);
        void (*smp_send_reschedule)(int cpu);
 
-       int (*cpu_up)(unsigned cpu, struct task_struct *tidle);
+       void (*cleanup_dead_cpu)(unsigned cpu);
+       void (*poll_sync_state)(void);
+       int (*kick_ap_alive)(unsigned cpu, struct task_struct *tidle);
        int (*cpu_disable)(void);
        void (*cpu_die)(unsigned int cpu);
        void (*play_dead)(void);
@@ -78,11 +80,6 @@ static inline void smp_cpus_done(unsigned int max_cpus)
        smp_ops.smp_cpus_done(max_cpus);
 }
 
-static inline int __cpu_up(unsigned int cpu, struct task_struct *tidle)
-{
-       return smp_ops.cpu_up(cpu, tidle);
-}
-
 static inline int __cpu_disable(void)
 {
        return smp_ops.cpu_disable();
@@ -90,7 +87,8 @@ static inline int __cpu_disable(void)
 
 static inline void __cpu_die(unsigned int cpu)
 {
-       smp_ops.cpu_die(cpu);
+       if (smp_ops.cpu_die)
+               smp_ops.cpu_die(cpu);
 }
 
 static inline void __noreturn play_dead(void)
@@ -121,22 +119,23 @@ void native_smp_prepare_cpus(unsigned int max_cpus);
 void calculate_max_logical_packages(void);
 void native_smp_cpus_done(unsigned int max_cpus);
 int common_cpu_up(unsigned int cpunum, struct task_struct *tidle);
-int native_cpu_up(unsigned int cpunum, struct task_struct *tidle);
+int native_kick_ap(unsigned int cpu, struct task_struct *tidle);
 int native_cpu_disable(void);
-int common_cpu_die(unsigned int cpu);
-void native_cpu_die(unsigned int cpu);
 void __noreturn hlt_play_dead(void);
 void native_play_dead(void);
 void play_dead_common(void);
 void wbinvd_on_cpu(int cpu);
 int wbinvd_on_all_cpus(void);
-void cond_wakeup_cpu0(void);
+
+void smp_kick_mwait_play_dead(void);
 
 void native_smp_send_reschedule(int cpu);
 void native_send_call_func_ipi(const struct cpumask *mask);
 void native_send_call_func_single_ipi(int cpu);
 void x86_idle_thread_init(unsigned int cpu, struct task_struct *idle);
 
+bool smp_park_other_cpus_in_init(void);
+
 void smp_store_boot_cpu_info(void);
 void smp_store_cpu_info(int id);
 
@@ -201,7 +200,14 @@ extern void nmi_selftest(void);
 #endif
 
 extern unsigned int smpboot_control;
+extern unsigned long apic_mmio_base;
 
 #endif /* !__ASSEMBLY__ */
 
+/* Control bits for startup_64 */
+#define STARTUP_READ_APICID    0x80000000
+
+/* Top 8 bits are reserved for control */
+#define STARTUP_PARALLEL_MASK  0xFF000000
+
 #endif /* _ASM_X86_SMP_H */
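To make the new STARTUP_* encoding concrete, a sketch of how smpboot_control is meant to be consumed: in the legacy path the value is simply the CPU number (see the x86_acpi_suspend_lowlevel() change later in this diff), while STARTUP_READ_APICID asks the starting CPU to derive its number from its own APIC ID. The two helper names below are hypothetical:

	/* Sketch: decode smpboot_control as startup_64 is expected to. */
	static unsigned int example_decode_smpboot_control(unsigned int ctrl)
	{
		if (ctrl & STARTUP_READ_APICID) {
			/* Parallel bringup: map the local APIC ID to a CPU number. */
			return example_cpu_from_apicid(example_read_apicid());
		}

		/* Legacy path: the non-control bits carry the CPU number directly. */
		return ctrl & ~STARTUP_PARALLEL_MASK;
	}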
index 5b85987..4fb36fb 100644 (file)
@@ -127,9 +127,11 @@ static inline int syscall_get_arch(struct task_struct *task)
 }
 
 void do_syscall_64(struct pt_regs *regs, int nr);
-void do_int80_syscall_32(struct pt_regs *regs);
-long do_fast_syscall_32(struct pt_regs *regs);
 
 #endif /* CONFIG_X86_32 */
 
+void do_int80_syscall_32(struct pt_regs *regs);
+long do_fast_syscall_32(struct pt_regs *regs);
+long do_SYSENTER_32(struct pt_regs *regs);
+
 #endif /* _ASM_X86_SYSCALL_H */
index 28d889c..603e6d1 100644 (file)
@@ -5,6 +5,8 @@
 
 #include <linux/init.h>
 #include <linux/bits.h>
+
+#include <asm/errno.h>
 #include <asm/ptrace.h>
 #include <asm/shared/tdx.h>
 
 #ifndef __ASSEMBLY__
 
 /*
- * Used to gather the output registers values of the TDCALL and SEAMCALL
- * instructions when requesting services from the TDX module.
- *
- * This is a software only structure and not part of the TDX module/VMM ABI.
- */
-struct tdx_module_output {
-       u64 rcx;
-       u64 rdx;
-       u64 r8;
-       u64 r9;
-       u64 r10;
-       u64 r11;
-};
-
-/*
  * Used by the #VE exception handler to gather the #VE exception
  * info from the TDX module. This is a software only structure
  * and not part of the TDX module/VMM ABI.
@@ -55,10 +42,6 @@ struct ve_info {
 
 void __init tdx_early_init(void);
 
-/* Used to communicate with the TDX module */
-u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-                     struct tdx_module_output *out);
-
 void tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
index f1cccba..d63b029 100644 (file)
@@ -232,9 +232,6 @@ static inline int arch_within_stack_frames(const void * const stack,
                           current_thread_info()->status & TS_COMPAT)
 #endif
 
-extern void arch_task_cache_init(void);
-extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
-extern void arch_release_task_struct(struct task_struct *tsk);
 extern void arch_setup_new_exec(void);
 #define arch_setup_new_exec arch_setup_new_exec
 #endif /* !__ASSEMBLY__ */
index a53961c..f360104 100644 (file)
@@ -6,7 +6,6 @@
 #include <asm/mc146818rtc.h>
 
 extern void hpet_time_init(void);
-extern void time_init(void);
 extern bool pit_timer_init(void);
 extern bool tsc_clocksource_watchdog_disabled(void);
 
index 75bfaa4..80450e1 100644 (file)
@@ -14,6 +14,8 @@
 #include <asm/processor-flags.h>
 #include <asm/pgtable.h>
 
+DECLARE_PER_CPU(u64, tlbstate_untag_mask);
+
 void __flush_tlb_all(void);
 
 #define TLB_FLUSH_ALL  -1UL
@@ -54,15 +56,6 @@ static inline void cr4_clear_bits(unsigned long mask)
        local_irq_restore(flags);
 }
 
-#ifdef CONFIG_ADDRESS_MASKING
-DECLARE_PER_CPU(u64, tlbstate_untag_mask);
-
-static inline u64 current_untag_mask(void)
-{
-       return this_cpu_read(tlbstate_untag_mask);
-}
-#endif
-
 #ifndef MODULE
 /*
  * 6 because 6 should be plenty and struct tlb_state will fit in two cache
index 458c891..caf41c4 100644 (file)
@@ -31,9 +31,9 @@
  * CONFIG_NUMA.
  */
 #include <linux/numa.h>
+#include <linux/cpumask.h>
 
 #ifdef CONFIG_NUMA
-#include <linux/cpumask.h>
 
 #include <asm/mpspec.h>
 #include <asm/percpu.h>
@@ -139,23 +139,31 @@ static inline int topology_max_smt_threads(void)
 int topology_update_package_map(unsigned int apicid, unsigned int cpu);
 int topology_update_die_map(unsigned int dieid, unsigned int cpu);
 int topology_phys_to_logical_pkg(unsigned int pkg);
-int topology_phys_to_logical_die(unsigned int die, unsigned int cpu);
-bool topology_is_primary_thread(unsigned int cpu);
 bool topology_smt_supported(void);
-#else
+
+extern struct cpumask __cpu_primary_thread_mask;
+#define cpu_primary_thread_mask ((const struct cpumask *)&__cpu_primary_thread_mask)
+
+/**
+ * topology_is_primary_thread - Check whether CPU is the primary SMT thread
+ * @cpu:       CPU to check
+ */
+static inline bool topology_is_primary_thread(unsigned int cpu)
+{
+       return cpumask_test_cpu(cpu, cpu_primary_thread_mask);
+}
+#else /* CONFIG_SMP */
 #define topology_max_packages()                        (1)
 static inline int
 topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }
 static inline int
 topology_update_die_map(unsigned int dieid, unsigned int cpu) { return 0; }
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
-static inline int topology_phys_to_logical_die(unsigned int die,
-               unsigned int cpu) { return 0; }
 static inline int topology_max_die_per_package(void) { return 1; }
 static inline int topology_max_smt_threads(void) { return 1; }
 static inline bool topology_is_primary_thread(unsigned int cpu) { return true; }
 static inline bool topology_smt_supported(void) { return false; }
-#endif
+#endif /* !CONFIG_SMP */
 
 static inline void arch_fix_phys_package_id(int num, u32 slot)
 {
index fbdc3d9..594fce0 100644 (file)
@@ -32,7 +32,6 @@ extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns);
 
 extern void tsc_early_init(void);
 extern void tsc_init(void);
-extern unsigned long calibrate_delay_is_known(void);
 extern void mark_tsc_unstable(char *reason);
 extern int unsynchronized_tsc(void);
 extern int check_tsc_unstable(void);
@@ -55,12 +54,10 @@ extern bool tsc_async_resets;
 #ifdef CONFIG_X86_TSC
 extern bool tsc_store_and_check_tsc_adjust(bool bootcpu);
 extern void tsc_verify_tsc_adjust(bool resume);
-extern void check_tsc_sync_source(int cpu);
 extern void check_tsc_sync_target(void);
 #else
 static inline bool tsc_store_and_check_tsc_adjust(bool bootcpu) { return false; }
 static inline void tsc_verify_tsc_adjust(bool resume) { }
-static inline void check_tsc_sync_source(int cpu) { }
 static inline void check_tsc_sync_target(void) { }
 #endif
 
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
new file mode 100644 (file)
index 0000000..f5937e9
--- /dev/null
@@ -0,0 +1,27 @@
+#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
+#define _ASM_X86_UNACCEPTED_MEMORY_H
+
+#include <linux/efi.h>
+#include <asm/tdx.h>
+#include <asm/sev.h>
+
+static inline void arch_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+       /* Platform-specific memory-acceptance call goes here */
+       if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+               if (!tdx_accept_memory(start, end))
+                       panic("TDX: Failed to accept memory\n");
+       } else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+               snp_accept_memory(start, end);
+       } else {
+               panic("Cannot accept memory: unknown platform\n");
+       }
+}
+
+static inline struct efi_unaccepted_memory *efi_get_unaccepted_table(void)
+{
+       if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
+               return NULL;
+       return __va(efi.unaccepted);
+}
+#endif
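A simplified sketch of how the generic unaccepted-memory code is expected to combine the two helpers above; the field names follow the EFI unaccepted-memory table layout, and locking plus partial-unit handling are omitted:

	static void example_accept_range(phys_addr_t start, phys_addr_t end)
	{
		struct efi_unaccepted_memory *unaccepted = efi_get_unaccepted_table();
		unsigned long unit, i, first, last;

		if (!unaccepted)
			return;		/* No table: nothing left to accept. */

		unit  = unaccepted->unit_size;
		first = (start - unaccepted->phys_base) / unit;
		last  = (end   - unaccepted->phys_base) / unit;

		for (i = first; i < last; i++) {
			/* Set bit == still unaccepted; accept it and clear the bit. */
			if (test_and_clear_bit(i, unaccepted->bitmap))
				arch_accept_memory(unaccepted->phys_base + i * unit,
						   unaccepted->phys_base + (i + 1) * unit);
		}
	}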
index 01cb969..85cc57c 100644 (file)
 
 #else
 
+#define UNWIND_HINT_UNDEFINED \
+       UNWIND_HINT(UNWIND_HINT_TYPE_UNDEFINED, 0, 0, 0)
+
 #define UNWIND_HINT_FUNC \
        UNWIND_HINT(UNWIND_HINT_TYPE_FUNC, ORC_REG_SP, 8, 0)
 
+#define UNWIND_HINT_SAVE \
+       UNWIND_HINT(UNWIND_HINT_TYPE_SAVE, 0, 0, 0)
+
+#define UNWIND_HINT_RESTORE \
+       UNWIND_HINT(UNWIND_HINT_TYPE_RESTORE, 0, 0, 0)
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_X86_UNWIND_HINTS_H */
index d3e3197..5fa76c2 100644 (file)
@@ -177,6 +177,7 @@ struct uv_hub_info_s {
        unsigned short          nr_possible_cpus;
        unsigned short          nr_online_cpus;
        short                   memory_nid;
+       unsigned short          *node_to_socket;
 };
 
 /* CPU specific info with a pointer to the hub common info struct */
@@ -519,25 +520,30 @@ static inline int uv_socket_to_node(int socket)
        return _uv_socket_to_node(socket, uv_hub_info->socket_to_node);
 }
 
+static inline int uv_pnode_to_socket(int pnode)
+{
+       unsigned short *p2s = uv_hub_info->pnode_to_socket;
+
+       return p2s ? p2s[pnode - uv_hub_info->min_pnode] : pnode;
+}
+
 /* pnode, offset --> socket virtual */
 static inline void *uv_pnode_offset_to_vaddr(int pnode, unsigned long offset)
 {
        unsigned int m_val = uv_hub_info->m_val;
        unsigned long base;
-       unsigned short sockid, node, *p2s;
+       unsigned short sockid;
 
        if (m_val)
                return __va(((unsigned long)pnode << m_val) | offset);
 
-       p2s = uv_hub_info->pnode_to_socket;
-       sockid = p2s ? p2s[pnode - uv_hub_info->min_pnode] : pnode;
-       node = uv_socket_to_node(sockid);
+       sockid = uv_pnode_to_socket(pnode);
 
        /* limit address of previous socket is our base, except node 0 is 0 */
-       if (!node)
+       if (sockid == 0)
                return __va((unsigned long)offset);
 
-       base = (unsigned long)(uv_hub_info->gr_table[node - 1].limit);
+       base = (unsigned long)(uv_hub_info->gr_table[sockid - 1].limit);
        return __va(base << UV_GAM_RANGE_SHFT | offset);
 }
 
@@ -644,7 +650,7 @@ static inline int uv_cpu_blade_processor_id(int cpu)
 /* Blade number to Node number (UV2..UV4 is 1:1) */
 static inline int uv_blade_to_node(int blade)
 {
-       return blade;
+       return uv_socket_to_node(blade);
 }
 
 /* Blade number of current cpu. Numbered 0 .. <#blades -1> */
@@ -656,23 +662,27 @@ static inline int uv_numa_blade_id(void)
 /*
  * Convert linux node number to the UV blade number.
  * .. Currently for UV2 thru UV4 the node and the blade are identical.
- * .. If this changes then you MUST check references to this function!
+ * .. UV5 needs conversion when sub-numa clustering is enabled.
  */
 static inline int uv_node_to_blade_id(int nid)
 {
-       return nid;
+       unsigned short *n2s = uv_hub_info->node_to_socket;
+
+       return n2s ? n2s[nid] : nid;
 }
 
 /* Convert a CPU number to the UV blade number */
 static inline int uv_cpu_to_blade_id(int cpu)
 {
-       return uv_node_to_blade_id(cpu_to_node(cpu));
+       return uv_cpu_hub_info(cpu)->numa_blade_id;
 }
 
 /* Convert a blade id to the PNODE of the blade */
 static inline int uv_blade_to_pnode(int bid)
 {
-       return uv_hub_info_list(uv_blade_to_node(bid))->pnode;
+       unsigned short *s2p = uv_hub_info->socket_to_pnode;
+
+       return s2p ? s2p[bid] : bid;
 }
 
 /* Nid of memory node on blade. -1 if no blade-local memory */
index 57fa673..bb45812 100644 (file)
@@ -4199,6 +4199,13 @@ union uvh_rh_gam_mmioh_overlay_config1_u {
 #define UV3H_RH_GAM_MMIOH_REDIRECT_CONFIG0_NASID_SHFT  0
 #define UV3H_RH_GAM_MMIOH_REDIRECT_CONFIG0_NASID_MASK  0x0000000000007fffUL
 
+/* UVH common defines */
+#define UVH_RH_GAM_MMIOH_REDIRECT_CONFIG0_NASID_MASK (                 \
+       is_uv(UV4A) ? UV4AH_RH_GAM_MMIOH_REDIRECT_CONFIG0_NASID_MASK :  \
+       is_uv(UV4)  ?  UV4H_RH_GAM_MMIOH_REDIRECT_CONFIG0_NASID_MASK :  \
+       is_uv(UV3)  ?  UV3H_RH_GAM_MMIOH_REDIRECT_CONFIG0_NASID_MASK :  \
+       0)
+
 
 union uvh_rh_gam_mmioh_redirect_config0_u {
        unsigned long   v;
@@ -4247,8 +4254,8 @@ union uvh_rh_gam_mmioh_redirect_config0_u {
        0)
 
 /* UV4A unique defines */
-#define UV4AH_RH_GAM_MMIOH_REDIRECT_CONFIG0_NASID_SHFT 0
-#define UV4AH_RH_GAM_MMIOH_REDIRECT_CONFIG0_NASID_MASK 0x0000000000000fffUL
+#define UV4AH_RH_GAM_MMIOH_REDIRECT_CONFIG1_NASID_SHFT 0
+#define UV4AH_RH_GAM_MMIOH_REDIRECT_CONFIG1_NASID_MASK 0x0000000000000fffUL
 
 /* UV4 unique defines */
 #define UV4H_RH_GAM_MMIOH_REDIRECT_CONFIG1_NASID_SHFT  0
@@ -4258,6 +4265,13 @@ union uvh_rh_gam_mmioh_redirect_config0_u {
 #define UV3H_RH_GAM_MMIOH_REDIRECT_CONFIG1_NASID_SHFT  0
 #define UV3H_RH_GAM_MMIOH_REDIRECT_CONFIG1_NASID_MASK  0x0000000000007fffUL
 
+/* UVH common defines */
+#define UVH_RH_GAM_MMIOH_REDIRECT_CONFIG1_NASID_MASK (                 \
+       is_uv(UV4A) ? UV4AH_RH_GAM_MMIOH_REDIRECT_CONFIG1_NASID_MASK :  \
+       is_uv(UV4)  ?  UV4H_RH_GAM_MMIOH_REDIRECT_CONFIG1_NASID_MASK :  \
+       is_uv(UV3)  ?  UV3H_RH_GAM_MMIOH_REDIRECT_CONFIG1_NASID_MASK :  \
+       0)
+
 
 union uvh_rh_gam_mmioh_redirect_config1_u {
        unsigned long   v;
index 4cf6794..c81858d 100644 (file)
@@ -231,14 +231,19 @@ static u64 vread_pvclock(void)
                ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
        } while (pvclock_read_retry(pvti, version));
 
-       return ret;
+       return ret & S64_MAX;
 }
 #endif
 
 #ifdef CONFIG_HYPERV_TIMER
 static u64 vread_hvclock(void)
 {
-       return hv_read_tsc_page(&hvclock_page);
+       u64 tsc, time;
+
+       if (hv_read_tsc_page_tsc(&hvclock_page, &tsc, &time))
+               return time & S64_MAX;
+
+       return U64_MAX;
 }
 #endif
 
@@ -246,7 +251,7 @@ static inline u64 __arch_get_hw_counter(s32 clock_mode,
                                        const struct vdso_data *vd)
 {
        if (likely(clock_mode == VDSO_CLOCKMODE_TSC))
-               return (u64)rdtsc_ordered();
+               return (u64)rdtsc_ordered() & S64_MAX;
        /*
         * For any memory-mapped vclock type, we need to make sure that gcc
         * doesn't cleverly hoist a load before the mode check.  Otherwise we
@@ -284,6 +289,9 @@ static inline bool arch_vdso_clocksource_ok(const struct vdso_data *vd)
  * which can be invalidated asynchronously and indicate invalidation by
  * returning U64_MAX, which can be effectively tested by checking for a
  * negative value after casting it to s64.
+ *
+ * This effectively forces an S64_MAX mask on the calculations, unlike the
+ * U64_MAX mask normally used by x86 clocksources.
  */
 static inline bool arch_vdso_cycles_ok(u64 cycles)
 {
@@ -303,18 +311,29 @@ static inline bool arch_vdso_cycles_ok(u64 cycles)
  * @last. If not then use @last, which is the base time of the current
  * conversion period.
  *
- * This variant also removes the masking of the subtraction because the
- * clocksource mask of all VDSO capable clocksources on x86 is U64_MAX
- * which would result in a pointless operation. The compiler cannot
- * optimize it away as the mask comes from the vdso data and is not compile
- * time constant.
+ * This variant also uses a custom mask because while the clocksource mask of
+ * all the VDSO capable clocksources on x86 is U64_MAX, the above code uses
+ * U64_MAX as an exception value; additionally, arch_vdso_cycles_ok() above
+ * declares everything with the MSB/Sign-bit set as invalid. Therefore the
+ * effective mask is S64_MAX.
  */
 static __always_inline
 u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
 {
-       if (cycles > last)
-               return (cycles - last) * mult;
-       return 0;
+       /*
+	 * Due to the MSB/Sign-bit being used as invalid marker (see
+	 * arch_vdso_cycles_ok() above), the effective mask is S64_MAX.
+        */
+       u64 delta = (cycles - last) & S64_MAX;
+
+       /*
+        * Due to the above mentioned TSC wobbles, filter out negative motion.
+        * Per the above masking, the effective sign bit is now bit 62.
+        */
+       if (unlikely(delta & (1ULL << 62)))
+               return 0;
+
+       return delta * mult;
 }
 #define vdso_calc_delta vdso_calc_delta
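A small user-space sketch of the arithmetic above, with mult taken as 1: masking with S64_MAX turns a slightly negative cycles - last into a value with bit 62 set, which the filter clamps to zero, while ordinary forward deltas pass through untouched:

	#include <stdint.h>
	#include <stdio.h>

	#define S64_MAX 0x7fffffffffffffffULL

	/* Mirrors vdso_calc_delta() above with mult == 1. */
	static uint64_t delta(uint64_t cycles, uint64_t last)
	{
		uint64_t d = (cycles - last) & S64_MAX;

		return (d & (1ULL << 62)) ? 0 : d;
	}

	int main(void)
	{
		printf("%llu\n", (unsigned long long)delta(1005, 1000)); /* 5: forward motion */
		printf("%llu\n", (unsigned long long)delta(1000, 1005)); /* 0: wobble filtered */
		return 0;
	}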
 
index 88085f3..5240d88 100644 (file)
@@ -150,7 +150,7 @@ struct x86_init_acpi {
  * @enc_cache_flush_required   Returns true if a cache flush is needed before changing page encryption status
  */
 struct x86_guest {
-       void (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
+       bool (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
        bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
        bool (*enc_tlb_flush_required)(bool enc);
        bool (*enc_cache_flush_required)(void);
@@ -177,11 +177,14 @@ struct x86_init_ops {
  * struct x86_cpuinit_ops - platform specific cpu hotplug setups
  * @setup_percpu_clockev:      set up the per cpu clock event device
  * @early_percpu_clock_init:   early init of the per cpu clock event device
+ * @fixup_cpu_id:              fixup function for cpuinfo_x86::phys_proc_id
+ * @parallel_bringup:          Parallel bringup control
  */
 struct x86_cpuinit_ops {
        void (*setup_percpu_clockev)(void);
        void (*early_percpu_clock_init)(void);
        void (*fixup_cpu_id)(struct cpuinfo_x86 *c, int node);
+       bool parallel_bringup;
 };
 
 struct timespec64;
index 376563f..3a8a8eb 100644 (file)
@@ -81,14 +81,6 @@ typedef __u8 mtrr_type;
 #define MTRR_NUM_FIXED_RANGES 88
 #define MTRR_MAX_VAR_RANGES 256
 
-struct mtrr_state_type {
-       struct mtrr_var_range var_ranges[MTRR_MAX_VAR_RANGES];
-       mtrr_type fixed_ranges[MTRR_NUM_FIXED_RANGES];
-       unsigned char enabled;
-       unsigned char have_fixed;
-       mtrr_type def_type;
-};
-
 #define MTRRphysBase_MSR(reg) (0x200 + 2 * (reg))
 #define MTRRphysMask_MSR(reg) (0x200 + 2 * (reg) + 1)
 
@@ -115,9 +107,9 @@ struct mtrr_state_type {
 #define MTRR_NUM_TYPES       7
 
 /*
- * Invalid MTRR memory type.  mtrr_type_lookup() returns this value when
- * MTRRs are disabled.  Note, this value is allocated from the reserved
- * values (0x7-0xff) of the MTRR memory types.
+ * Invalid MTRR memory type.  No longer used outside of MTRR code.
+ * Note, this value is allocated from the reserved values (0x7-0xff) of
+ * the MTRR memory types.
  */
 #define MTRR_TYPE_INVALID    0xff
 
index 1328c22..6dfecb2 100644 (file)
@@ -16,6 +16,7 @@
 #include <asm/cacheflush.h>
 #include <asm/realmode.h>
 #include <asm/hypervisor.h>
+#include <asm/smp.h>
 
 #include <linux/ftrace.h>
 #include "../../realmode/rm/wakeup.h"
@@ -127,7 +128,13 @@ int x86_acpi_suspend_lowlevel(void)
         * value is in the actual %rsp register.
         */
        current->thread.sp = (unsigned long)temp_stack + sizeof(temp_stack);
-       smpboot_control = smp_processor_id();
+       /*
+        * Ensure the CPU knows which one it is when it comes back, if
+        * it isn't in parallel mode and expected to work that out for
+        * itself.
+        */
+       if (!(smpboot_control & STARTUP_PARALLEL_MASK))
+               smpboot_control = smp_processor_id();
 #endif
        initial_code = (unsigned long)wakeup_long64;
        saved_magic = 0x123456789abcdef0L;
index 171a40c..054c15a 100644 (file)
@@ -12,7 +12,6 @@ extern int wakeup_pmode_return;
 
 extern u8 wake_sleep_flags;
 
-extern unsigned long acpi_copy_wakeup_routine(unsigned long);
 extern void wakeup_long64(void);
 
 extern void do_suspend_lowlevel(void);
index f615e0c..72646d7 100644 (file)
@@ -37,11 +37,23 @@ EXPORT_SYMBOL_GPL(alternatives_patched);
 
 #define MAX_PATCH_LEN (255-1)
 
-static int __initdata_or_module debug_alternative;
+#define DA_ALL         (~0)
+#define DA_ALT         0x01
+#define DA_RET         0x02
+#define DA_RETPOLINE   0x04
+#define DA_ENDBR       0x08
+#define DA_SMP         0x10
+
+static unsigned int __initdata_or_module debug_alternative;
 
 static int __init debug_alt(char *str)
 {
-       debug_alternative = 1;
+       if (str && *str == '=')
+               str++;
+
+       if (!str || kstrtouint(str, 0, &debug_alternative))
+               debug_alternative = DA_ALL;
+
        return 1;
 }
 __setup("debug-alternative", debug_alt);
@@ -55,15 +67,15 @@ static int __init setup_noreplace_smp(char *str)
 }
 __setup("noreplace-smp", setup_noreplace_smp);
 
-#define DPRINTK(fmt, args...)                                          \
+#define DPRINTK(type, fmt, args...)                                    \
 do {                                                                   \
-       if (debug_alternative)                                          \
+       if (debug_alternative & DA_##type)                              \
                printk(KERN_DEBUG pr_fmt(fmt) "\n", ##args);            \
 } while (0)
 
-#define DUMP_BYTES(buf, len, fmt, args...)                             \
+#define DUMP_BYTES(type, buf, len, fmt, args...)                       \
 do {                                                                   \
-       if (unlikely(debug_alternative)) {                              \
+       if (unlikely(debug_alternative & DA_##type)) {                  \
                int j;                                                  \
                                                                        \
                if (!(len))                                             \
@@ -86,6 +98,11 @@ static const unsigned char x86nops[] =
        BYTES_NOP6,
        BYTES_NOP7,
        BYTES_NOP8,
+#ifdef CONFIG_64BIT
+       BYTES_NOP9,
+       BYTES_NOP10,
+       BYTES_NOP11,
+#endif
 };
 
 const unsigned char * const x86_nops[ASM_NOP_MAX+1] =
@@ -99,19 +116,44 @@ const unsigned char * const x86_nops[ASM_NOP_MAX+1] =
        x86nops + 1 + 2 + 3 + 4 + 5,
        x86nops + 1 + 2 + 3 + 4 + 5 + 6,
        x86nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
+#ifdef CONFIG_64BIT
+       x86nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8,
+       x86nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9,
+       x86nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10,
+#endif
 };
 
-/* Use this to add nops to a buffer, then text_poke the whole buffer. */
-static void __init_or_module add_nops(void *insns, unsigned int len)
+/*
+ * Fill the buffer with a single effective instruction of size @len.
+ *
+ * In order not to issue an ORC stack depth tracking CFI entry (Call Frame Info)
+ * for every single-byte NOP, try to generate the maximally available NOP of
+ * size <= ASM_NOP_MAX such that only a single CFI entry is generated (vs one for
+ * each single-byte NOP). If @len to fill out is > ASM_NOP_MAX, pad with INT3 and
+ * *jump* over instead of executing long and daft NOPs.
+ */
+static void __init_or_module add_nop(u8 *instr, unsigned int len)
 {
-       while (len > 0) {
-               unsigned int noplen = len;
-               if (noplen > ASM_NOP_MAX)
-                       noplen = ASM_NOP_MAX;
-               memcpy(insns, x86_nops[noplen], noplen);
-               insns += noplen;
-               len -= noplen;
+       u8 *target = instr + len;
+
+       if (!len)
+               return;
+
+       if (len <= ASM_NOP_MAX) {
+               memcpy(instr, x86_nops[len], len);
+               return;
        }
+
+       if (len < 128) {
+               __text_gen_insn(instr, JMP8_INSN_OPCODE, instr, target, JMP8_INSN_SIZE);
+               instr += JMP8_INSN_SIZE;
+       } else {
+               __text_gen_insn(instr, JMP32_INSN_OPCODE, instr, target, JMP32_INSN_SIZE);
+               instr += JMP32_INSN_SIZE;
+       }
+
+       for (;instr < target; instr++)
+               *instr = INT3_INSN_OPCODE;
 }
 
 extern s32 __retpoline_sites[], __retpoline_sites_end[];
@@ -123,133 +165,223 @@ extern s32 __smp_locks[], __smp_locks_end[];
 void text_poke_early(void *addr, const void *opcode, size_t len);
 
 /*
- * Are we looking at a near JMP with a 1 or 4-byte displacement.
+ * Matches NOP and NOPL, not any of the other possible NOPs.
  */
-static inline bool is_jmp(const u8 opcode)
+static bool insn_is_nop(struct insn *insn)
 {
-       return opcode == 0xeb || opcode == 0xe9;
+       /* Anything NOP, but no REP NOP */
+       if (insn->opcode.bytes[0] == 0x90 &&
+           (!insn->prefixes.nbytes || insn->prefixes.bytes[0] != 0xF3))
+               return true;
+
+       /* NOPL */
+       if (insn->opcode.bytes[0] == 0x0F && insn->opcode.bytes[1] == 0x1F)
+               return true;
+
+       /* TODO: more nops */
+
+       return false;
 }
 
-static void __init_or_module
-recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insn_buff)
+/*
+ * Find the offset of the first non-NOP instruction starting at @offset
+ * but no further than @len.
+ */
+static int skip_nops(u8 *instr, int offset, int len)
 {
-       u8 *next_rip, *tgt_rip;
-       s32 n_dspl, o_dspl;
-       int repl_len;
+       struct insn insn;
 
-       if (a->replacementlen != 5)
-               return;
+       for (; offset < len; offset += insn.length) {
+               if (insn_decode_kernel(&insn, &instr[offset]))
+                       break;
 
-       o_dspl = *(s32 *)(insn_buff + 1);
+               if (!insn_is_nop(&insn))
+                       break;
+       }
 
-       /* next_rip of the replacement JMP */
-       next_rip = repl_insn + a->replacementlen;
-       /* target rip of the replacement JMP */
-       tgt_rip  = next_rip + o_dspl;
-       n_dspl = tgt_rip - orig_insn;
+       return offset;
+}
 
-       DPRINTK("target RIP: %px, new_displ: 0x%x", tgt_rip, n_dspl);
+/*
+ * Optimize a sequence of NOPs, possibly preceded by an unconditional jump
+ * to the end of the NOP sequence into a single NOP.
+ */
+static bool __init_or_module
+__optimize_nops(u8 *instr, size_t len, struct insn *insn, int *next, int *prev, int *target)
+{
+       int i = *next - insn->length;
 
-       if (tgt_rip - orig_insn >= 0) {
-               if (n_dspl - 2 <= 127)
-                       goto two_byte_jmp;
-               else
-                       goto five_byte_jmp;
-       /* negative offset */
-       } else {
-               if (((n_dspl - 2) & 0xff) == (n_dspl - 2))
-                       goto two_byte_jmp;
-               else
-                       goto five_byte_jmp;
+       switch (insn->opcode.bytes[0]) {
+       case JMP8_INSN_OPCODE:
+       case JMP32_INSN_OPCODE:
+               *prev = i;
+               *target = *next + insn->immediate.value;
+               return false;
        }
 
-two_byte_jmp:
-       n_dspl -= 2;
+       if (insn_is_nop(insn)) {
+               int nop = i;
 
-       insn_buff[0] = 0xeb;
-       insn_buff[1] = (s8)n_dspl;
-       add_nops(insn_buff + 2, 3);
+               *next = skip_nops(instr, *next, len);
+               if (*target && *next == *target)
+                       nop = *prev;
 
-       repl_len = 2;
-       goto done;
+               add_nop(instr + nop, *next - nop);
+               DUMP_BYTES(ALT, instr, len, "%px: [%d:%d) optimized NOPs: ", instr, nop, *next);
+               return true;
+       }
+
+       *target = 0;
+       return false;
+}
 
-five_byte_jmp:
-       n_dspl -= 5;
+/*
+ * "noinline" to cause control flow change and thus invalidate I$ and
+ * cause refetch after modification.
+ */
+static void __init_or_module noinline optimize_nops(u8 *instr, size_t len)
+{
+       int prev, target = 0;
 
-       insn_buff[0] = 0xe9;
-       *(s32 *)&insn_buff[1] = n_dspl;
+       for (int next, i = 0; i < len; i = next) {
+               struct insn insn;
 
-       repl_len = 5;
+               if (insn_decode_kernel(&insn, &instr[i]))
+                       return;
 
-done:
+               next = i + insn.length;
 
-       DPRINTK("final displ: 0x%08x, JMP 0x%lx",
-               n_dspl, (unsigned long)orig_insn + n_dspl + repl_len);
+               __optimize_nops(instr, len, &insn, &next, &prev, &target);
+       }
 }
 
 /*
- * optimize_nops_range() - Optimize a sequence of single byte NOPs (0x90)
+ * In this context, "source" is where the instructions are placed in the
+ * section .altinstr_replacement, for example during kernel build by the
+ * toolchain.
+ * "Destination" is where the instructions are being patched in by this
+ * machinery.
  *
- * @instr: instruction byte stream
- * @instrlen: length of the above
- * @off: offset within @instr where the first NOP has been detected
+ * The source offset is:
  *
- * Return: number of NOPs found (and replaced).
+ *   src_imm = target - src_next_ip                  (1)
+ *
+ * and the target offset is:
+ *
+ *   dst_imm = target - dst_next_ip                  (2)
+ *
+ * so rework (1) as an expression for target like:
+ *
+ *   target = src_imm + src_next_ip                  (1a)
+ *
+ * and substitute in (2) to get:
+ *
+ *   dst_imm = (src_imm + src_next_ip) - dst_next_ip (3)
+ *
+ * Now, since the instruction stream is 'identical' at src and dst (it
+ * is being copied after all) it can be stated that:
+ *
+ *   src_next_ip = src + ip_offset
+ *   dst_next_ip = dst + ip_offset                   (4)
+ *
+ * Substitute (4) in (3) and observe ip_offset being cancelled out to
+ * obtain:
+ *
+ *   dst_imm = src_imm + (src + ip_offset) - (dst + ip_offset)
+ *           = src_imm + src - dst + ip_offset - ip_offset
+ *           = src_imm + src - dst                   (5)
+ *
+ * IOW, only the relative displacement of the code block matters.
  */
-static __always_inline int optimize_nops_range(u8 *instr, u8 instrlen, int off)
-{
-       unsigned long flags;
-       int i = off, nnops;
 
-       while (i < instrlen) {
-               if (instr[i] != 0x90)
-                       break;
+#define apply_reloc_n(n_, p_, d_)                              \
+       do {                                                    \
+               s32 v = *(s##n_ *)(p_);                         \
+               v += (d_);                                      \
+               BUG_ON((v >> 31) != (v >> (n_-1)));             \
+               *(s##n_ *)(p_) = (s##n_)v;                      \
+       } while (0)
+
 
-               i++;
+static __always_inline
+void apply_reloc(int n, void *ptr, uintptr_t diff)
+{
+       switch (n) {
+       case 1: apply_reloc_n(8, ptr, diff); break;
+       case 2: apply_reloc_n(16, ptr, diff); break;
+       case 4: apply_reloc_n(32, ptr, diff); break;
+       default: BUG();
        }
+}
 
-       nnops = i - off;
+static __always_inline
+bool need_reloc(unsigned long offset, u8 *src, size_t src_len)
+{
+       u8 *target = src + offset;
+       /*
+        * If the target is inside the patched block, it's relative to the
+        * block itself and does not need relocation.
+        */
+       return (target < src || target > src + src_len);
+}
 
-       if (nnops <= 1)
-               return nnops;
+static void __init_or_module noinline
+apply_relocation(u8 *buf, size_t len, u8 *dest, u8 *src, size_t src_len)
+{
+       int prev, target = 0;
 
-       local_irq_save(flags);
-       add_nops(instr + off, nnops);
-       local_irq_restore(flags);
+       for (int next, i = 0; i < len; i = next) {
+               struct insn insn;
 
-       DUMP_BYTES(instr, instrlen, "%px: [%d:%d) optimized NOPs: ", instr, off, i);
+               if (WARN_ON_ONCE(insn_decode_kernel(&insn, &buf[i])))
+                       return;
 
-       return nnops;
-}
+               next = i + insn.length;
 
-/*
- * "noinline" to cause control flow change and thus invalidate I$ and
- * cause refetch after modification.
- */
-static void __init_or_module noinline optimize_nops(u8 *instr, size_t len)
-{
-       struct insn insn;
-       int i = 0;
+               if (__optimize_nops(buf, len, &insn, &next, &prev, &target))
+                       continue;
 
-       /*
-        * Jump over the non-NOP insns and optimize single-byte NOPs into bigger
-        * ones.
-        */
-       for (;;) {
-               if (insn_decode_kernel(&insn, &instr[i]))
-                       return;
+               switch (insn.opcode.bytes[0]) {
+               case 0x0f:
+                       if (insn.opcode.bytes[1] < 0x80 ||
+                           insn.opcode.bytes[1] > 0x8f)
+                               break;
 
-               /*
-                * See if this and any potentially following NOPs can be
-                * optimized.
-                */
-               if (insn.length == 1 && insn.opcode.bytes[0] == 0x90)
-                       i += optimize_nops_range(instr, len, i);
-               else
-                       i += insn.length;
+                       fallthrough;    /* Jcc.d32 */
+               case 0x70 ... 0x7f:     /* Jcc.d8 */
+               case JMP8_INSN_OPCODE:
+               case JMP32_INSN_OPCODE:
+               case CALL_INSN_OPCODE:
+                       if (need_reloc(next + insn.immediate.value, src, src_len)) {
+                               apply_reloc(insn.immediate.nbytes,
+                                           buf + i + insn_offset_immediate(&insn),
+                                           src - dest);
+                       }
 
-               if (i >= len)
-                       return;
+                       /*
+                        * Where possible, convert JMP.d32 into JMP.d8.
+                        */
+                       if (insn.opcode.bytes[0] == JMP32_INSN_OPCODE) {
+                               s32 imm = insn.immediate.value;
+                               imm += src - dest;
+                               imm += JMP32_INSN_SIZE - JMP8_INSN_SIZE;
+                               if ((imm >> 31) == (imm >> 7)) {
+                                       buf[i+0] = JMP8_INSN_OPCODE;
+                                       buf[i+1] = (s8)imm;
+
+                                       memset(&buf[i+2], INT3_INSN_OPCODE, insn.length - 2);
+                               }
+                       }
+                       break;
+               }
+
+               if (insn_rip_relative(&insn)) {
+                       if (need_reloc(next + insn.displacement.value, src, src_len)) {
+                               apply_reloc(insn.displacement.nbytes,
+                                           buf + i + insn_offset_displacement(&insn),
+                                           src - dest);
+                       }
+               }
        }
 }
 
@@ -270,7 +402,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
        u8 *instr, *replacement;
        u8 insn_buff[MAX_PATCH_LEN];
 
-       DPRINTK("alt table %px, -> %px", start, end);
+       DPRINTK(ALT, "alt table %px, -> %px", start, end);
        /*
         * The scan order should be from start to end. A later scanned
         * alternative code can overwrite previously scanned alternative code.
@@ -294,47 +426,31 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
                 * - feature not present but ALT_FLAG_NOT is set to mean,
                 *   patch if feature is *NOT* present.
                 */
-               if (!boot_cpu_has(a->cpuid) == !(a->flags & ALT_FLAG_NOT))
-                       goto next;
+               if (!boot_cpu_has(a->cpuid) == !(a->flags & ALT_FLAG_NOT)) {
+                       optimize_nops(instr, a->instrlen);
+                       continue;
+               }
 
-               DPRINTK("feat: %s%d*32+%d, old: (%pS (%px) len: %d), repl: (%px, len: %d)",
+               DPRINTK(ALT, "feat: %s%d*32+%d, old: (%pS (%px) len: %d), repl: (%px, len: %d)",
                        (a->flags & ALT_FLAG_NOT) ? "!" : "",
                        a->cpuid >> 5,
                        a->cpuid & 0x1f,
                        instr, instr, a->instrlen,
                        replacement, a->replacementlen);
 
-               DUMP_BYTES(instr, a->instrlen, "%px:   old_insn: ", instr);
-               DUMP_BYTES(replacement, a->replacementlen, "%px:   rpl_insn: ", replacement);
-
                memcpy(insn_buff, replacement, a->replacementlen);
                insn_buff_sz = a->replacementlen;
 
-               /*
-                * 0xe8 is a relative jump; fix the offset.
-                *
-                * Instruction length is checked before the opcode to avoid
-                * accessing uninitialized bytes for zero-length replacements.
-                */
-               if (a->replacementlen == 5 && *insn_buff == 0xe8) {
-                       *(s32 *)(insn_buff + 1) += replacement - instr;
-                       DPRINTK("Fix CALL offset: 0x%x, CALL 0x%lx",
-                               *(s32 *)(insn_buff + 1),
-                               (unsigned long)instr + *(s32 *)(insn_buff + 1) + 5);
-               }
-
-               if (a->replacementlen && is_jmp(replacement[0]))
-                       recompute_jump(a, instr, replacement, insn_buff);
-
                for (; insn_buff_sz < a->instrlen; insn_buff_sz++)
                        insn_buff[insn_buff_sz] = 0x90;
 
-               DUMP_BYTES(insn_buff, insn_buff_sz, "%px: final_insn: ", instr);
+               apply_relocation(insn_buff, a->instrlen, instr, replacement, a->replacementlen);
 
-               text_poke_early(instr, insn_buff, insn_buff_sz);
+               DUMP_BYTES(ALT, instr, a->instrlen, "%px:   old_insn: ", instr);
+               DUMP_BYTES(ALT, replacement, a->replacementlen, "%px:   rpl_insn: ", replacement);
+               DUMP_BYTES(ALT, insn_buff, insn_buff_sz, "%px: final_insn: ", instr);
 
-next:
-               optimize_nops(instr, a->instrlen);
+               text_poke_early(instr, insn_buff, insn_buff_sz);
        }
 }
 
@@ -555,15 +671,15 @@ void __init_or_module noinline apply_retpolines(s32 *start, s32 *end)
                        continue;
                }
 
-               DPRINTK("retpoline at: %pS (%px) len: %d to: %pS",
+               DPRINTK(RETPOLINE, "retpoline at: %pS (%px) len: %d to: %pS",
                        addr, addr, insn.length,
                        addr + insn.length + insn.immediate.value);
 
                len = patch_retpoline(addr, &insn, bytes);
                if (len == insn.length) {
                        optimize_nops(bytes, len);
-                       DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
-                       DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
+                       DUMP_BYTES(RETPOLINE, ((u8*)addr),  len, "%px: orig: ", addr);
+                       DUMP_BYTES(RETPOLINE, ((u8*)bytes), len, "%px: repl: ", addr);
                        text_poke_early(addr, bytes, len);
                }
        }
@@ -590,13 +706,12 @@ static int patch_return(void *addr, struct insn *insn, u8 *bytes)
 {
        int i = 0;
 
+       /* Patch the custom return thunks... */
        if (cpu_feature_enabled(X86_FEATURE_RETHUNK)) {
-               if (x86_return_thunk == __x86_return_thunk)
-                       return -1;
-
                i = JMP32_INSN_SIZE;
                __text_gen_insn(bytes, JMP32_INSN_OPCODE, addr, x86_return_thunk, i);
        } else {
+               /* ... or patch them out if not needed. */
                bytes[i++] = RET_INSN_OPCODE;
        }
 
@@ -609,6 +724,14 @@ void __init_or_module noinline apply_returns(s32 *start, s32 *end)
 {
        s32 *s;
 
+       /*
+	 * Nothing to patch when the needed return thunk is the default
+	 * __x86_return_thunk, which compiler-generated code already uses.
+        */
+       if (cpu_feature_enabled(X86_FEATURE_RETHUNK) &&
+           (x86_return_thunk == __x86_return_thunk))
+               return;
+
        for (s = start; s < end; s++) {
                void *dest = NULL, *addr = (void *)s + *s;
                struct insn insn;
@@ -630,14 +753,14 @@ void __init_or_module noinline apply_returns(s32 *start, s32 *end)
                              addr, dest, 5, addr))
                        continue;
 
-               DPRINTK("return thunk at: %pS (%px) len: %d to: %pS",
+               DPRINTK(RET, "return thunk at: %pS (%px) len: %d to: %pS",
                        addr, addr, insn.length,
                        addr + insn.length + insn.immediate.value);
 
                len = patch_return(addr, &insn, bytes);
                if (len == insn.length) {
-                       DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
-                       DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
+                       DUMP_BYTES(RET, ((u8*)addr),  len, "%px: orig: ", addr);
+                       DUMP_BYTES(RET, ((u8*)bytes), len, "%px: repl: ", addr);
                        text_poke_early(addr, bytes, len);
                }
        }
@@ -655,7 +778,7 @@ void __init_or_module noinline apply_returns(s32 *start, s32 *end) { }
 
 #ifdef CONFIG_X86_KERNEL_IBT
 
-static void poison_endbr(void *addr, bool warn)
+static void __init_or_module poison_endbr(void *addr, bool warn)
 {
        u32 endbr, poison = gen_endbr_poison();
 
@@ -667,13 +790,13 @@ static void poison_endbr(void *addr, bool warn)
                return;
        }
 
-       DPRINTK("ENDBR at: %pS (%px)", addr, addr);
+       DPRINTK(ENDBR, "ENDBR at: %pS (%px)", addr, addr);
 
        /*
         * When we have IBT, the lack of ENDBR will trigger #CP
         */
-       DUMP_BYTES(((u8*)addr), 4, "%px: orig: ", addr);
-       DUMP_BYTES(((u8*)&poison), 4, "%px: repl: ", addr);
+       DUMP_BYTES(ENDBR, ((u8*)addr), 4, "%px: orig: ", addr);
+       DUMP_BYTES(ENDBR, ((u8*)&poison), 4, "%px: repl: ", addr);
        text_poke_early(addr, &poison, 4);
 }
 
@@ -1148,7 +1271,7 @@ void __init_or_module alternatives_smp_module_add(struct module *mod,
        smp->locks_end  = locks_end;
        smp->text       = text;
        smp->text_end   = text_end;
-       DPRINTK("locks %p -> %p, text %p -> %p, name %s\n",
+       DPRINTK(SMP, "locks %p -> %p, text %p -> %p, name %s\n",
                smp->locks, smp->locks_end,
                smp->text, smp->text_end, smp->name);
 
@@ -1225,6 +1348,20 @@ int alternatives_text_reserved(void *start, void *end)
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_PARAVIRT
+
+/* Use this to add nops to a buffer, then text_poke the whole buffer. */
+static void __init_or_module add_nops(void *insns, unsigned int len)
+{
+       while (len > 0) {
+               unsigned int noplen = len;
+               if (noplen > ASM_NOP_MAX)
+                       noplen = ASM_NOP_MAX;
+               memcpy(insns, x86_nops[noplen], noplen);
+               insns += noplen;
+               len -= noplen;
+       }
+}
+
 void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
                                     struct paravirt_patch_site *end)
 {
@@ -1332,6 +1469,35 @@ static noinline void __init int3_selftest(void)
        unregister_die_notifier(&int3_exception_nb);
 }
 
+static __initdata int __alt_reloc_selftest_addr;
+
+__visible noinline void __init __alt_reloc_selftest(void *arg)
+{
+       WARN_ON(arg != &__alt_reloc_selftest_addr);
+}
+
+static noinline void __init alt_reloc_selftest(void)
+{
+       /*
+        * Tests apply_relocation().
+        *
+        * This has a relative immediate (CALL) in a place other than the first
+        * instruction and additionally on x86_64 we get a RIP-relative LEA:
+        *
+        *   lea    0x0(%rip),%rdi  # 5d0: R_X86_64_PC32    .init.data+0x5566c
+        *   call   +0              # 5d5: R_X86_64_PLT32   __alt_reloc_selftest-0x4
+        *
+        * Getting this wrong will either crash and burn or tickle the WARN
+        * above.
+        */
+       asm_inline volatile (
+               ALTERNATIVE("", "lea %[mem], %%" _ASM_ARG1 "; call __alt_reloc_selftest;", X86_FEATURE_ALWAYS)
+               : /* output */
+               : [mem] "m" (__alt_reloc_selftest_addr)
+               : _ASM_ARG1
+       );
+}
+
 void __init alternative_instructions(void)
 {
        int3_selftest();
@@ -1419,6 +1585,8 @@ void __init alternative_instructions(void)
 
        restart_nmi();
        alternatives_patched = 1;
+
+       alt_reloc_selftest();
 }
 
 /**
@@ -1799,7 +1967,7 @@ struct bp_patching_desc *try_get_desc(void)
 {
        struct bp_patching_desc *desc = &bp_desc;
 
-       if (!arch_atomic_inc_not_zero(&desc->refs))
+       if (!raw_atomic_inc_not_zero(&desc->refs))
                return NULL;
 
        return desc;
@@ -1810,7 +1978,7 @@ static __always_inline void put_desc(void)
        struct bp_patching_desc *desc = &bp_desc;
 
        smp_mb__before_atomic();
-       arch_atomic_dec(&desc->refs);
+       raw_atomic_dec(&desc->refs);
 }
 
 static __always_inline void *text_poke_addr(struct text_poke_loc *tp)
@@ -1954,6 +2122,16 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
        atomic_set_release(&bp_desc.refs, 1);
 
        /*
+        * Function tracing can enable thousands of places that need to be
+        * updated. This can take quite some time, and with full kernel debugging
+        * enabled, this could cause the softlockup watchdog to trigger.
+        * This function gets called every 256 entries added to be patched.
+        * Call cond_resched() here to make sure that other tasks can get scheduled
+        * while processing all the functions being patched.
+        */
+       cond_resched();
+
+       /*
         * Corresponding read barrier in int3 notifier for making sure the
         * nr_entries and handler are correctly ordered wrt. patching.
         */
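A worked check of the displacement relation (5) derived in the apply_relocation() comment earlier in this file, using made-up addresses:

	#include <assert.h>
	#include <stdint.h>

	int main(void)
	{
		/* A 5-byte CALL at the very start of a replacement block at src. */
		int64_t src = 0x1000, dst = 0x3000, insn_len = 5;
		int32_t src_imm = 0x200;

		int64_t target  = src + insn_len + src_imm;	/* 0x1205 */
		int32_t dst_imm = src_imm + src - dst;		/* relation (5): -0x1e00 */

		/* The copied CALL at dst still reaches the original target. */
		assert(dst + insn_len + dst_imm == target);
		return 0;
	}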
index 7e331e8..035a3db 100644 (file)
 #include <linux/pci_ids.h>
 #include <asm/amd_nb.h>
 
-#define PCI_DEVICE_ID_AMD_17H_ROOT     0x1450
-#define PCI_DEVICE_ID_AMD_17H_M10H_ROOT        0x15d0
-#define PCI_DEVICE_ID_AMD_17H_M30H_ROOT        0x1480
-#define PCI_DEVICE_ID_AMD_17H_M60H_ROOT        0x1630
-#define PCI_DEVICE_ID_AMD_17H_MA0H_ROOT        0x14b5
-#define PCI_DEVICE_ID_AMD_19H_M10H_ROOT        0x14a4
-#define PCI_DEVICE_ID_AMD_19H_M60H_ROOT        0x14d8
-#define PCI_DEVICE_ID_AMD_19H_M70H_ROOT        0x14e8
-#define PCI_DEVICE_ID_AMD_17H_DF_F4    0x1464
-#define PCI_DEVICE_ID_AMD_17H_M10H_DF_F4 0x15ec
-#define PCI_DEVICE_ID_AMD_17H_M30H_DF_F4 0x1494
-#define PCI_DEVICE_ID_AMD_17H_M60H_DF_F4 0x144c
-#define PCI_DEVICE_ID_AMD_17H_M70H_DF_F4 0x1444
-#define PCI_DEVICE_ID_AMD_17H_MA0H_DF_F4 0x1728
-#define PCI_DEVICE_ID_AMD_19H_DF_F4    0x1654
-#define PCI_DEVICE_ID_AMD_19H_M10H_DF_F4 0x14b1
-#define PCI_DEVICE_ID_AMD_19H_M40H_ROOT        0x14b5
-#define PCI_DEVICE_ID_AMD_19H_M40H_DF_F4 0x167d
-#define PCI_DEVICE_ID_AMD_19H_M50H_DF_F4 0x166e
-#define PCI_DEVICE_ID_AMD_19H_M60H_DF_F4 0x14e4
-#define PCI_DEVICE_ID_AMD_19H_M70H_DF_F4 0x14f4
-#define PCI_DEVICE_ID_AMD_19H_M78H_DF_F4 0x12fc
+#define PCI_DEVICE_ID_AMD_17H_ROOT             0x1450
+#define PCI_DEVICE_ID_AMD_17H_M10H_ROOT                0x15d0
+#define PCI_DEVICE_ID_AMD_17H_M30H_ROOT                0x1480
+#define PCI_DEVICE_ID_AMD_17H_M60H_ROOT                0x1630
+#define PCI_DEVICE_ID_AMD_17H_MA0H_ROOT                0x14b5
+#define PCI_DEVICE_ID_AMD_19H_M10H_ROOT                0x14a4
+#define PCI_DEVICE_ID_AMD_19H_M40H_ROOT                0x14b5
+#define PCI_DEVICE_ID_AMD_19H_M60H_ROOT                0x14d8
+#define PCI_DEVICE_ID_AMD_19H_M70H_ROOT                0x14e8
+#define PCI_DEVICE_ID_AMD_MI200_ROOT           0x14bb
+
+#define PCI_DEVICE_ID_AMD_17H_DF_F4            0x1464
+#define PCI_DEVICE_ID_AMD_17H_M10H_DF_F4       0x15ec
+#define PCI_DEVICE_ID_AMD_17H_M30H_DF_F4       0x1494
+#define PCI_DEVICE_ID_AMD_17H_M60H_DF_F4       0x144c
+#define PCI_DEVICE_ID_AMD_17H_M70H_DF_F4       0x1444
+#define PCI_DEVICE_ID_AMD_17H_MA0H_DF_F4       0x1728
+#define PCI_DEVICE_ID_AMD_19H_DF_F4            0x1654
+#define PCI_DEVICE_ID_AMD_19H_M10H_DF_F4       0x14b1
+#define PCI_DEVICE_ID_AMD_19H_M40H_DF_F4       0x167d
+#define PCI_DEVICE_ID_AMD_19H_M50H_DF_F4       0x166e
+#define PCI_DEVICE_ID_AMD_19H_M60H_DF_F4       0x14e4
+#define PCI_DEVICE_ID_AMD_19H_M70H_DF_F4       0x14f4
+#define PCI_DEVICE_ID_AMD_19H_M78H_DF_F4       0x12fc
+#define PCI_DEVICE_ID_AMD_MI200_DF_F4          0x14d4
 
 /* Protect the PCI config register pairs used for SMN. */
 static DEFINE_MUTEX(smn_mutex);
@@ -53,6 +56,7 @@ static const struct pci_device_id amd_root_ids[] = {
        { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M40H_ROOT) },
        { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M60H_ROOT) },
        { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M70H_ROOT) },
+       { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_ROOT) },
        {}
 };
 
@@ -81,6 +85,7 @@ static const struct pci_device_id amd_nb_misc_ids[] = {
        { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M60H_DF_F3) },
        { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M70H_DF_F3) },
        { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M78H_DF_F3) },
+       { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_DF_F3) },
        {}
 };
 
@@ -101,6 +106,7 @@ static const struct pci_device_id amd_nb_link_ids[] = {
        { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M40H_DF_F4) },
        { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M50H_DF_F4) },
        { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_CNB17H_F4) },
+       { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_DF_F4) },
        {}
 };
 
index 7705571..af49e24 100644 (file)
@@ -101,6 +101,9 @@ static int apic_extnmi __ro_after_init = APIC_EXTNMI_BSP;
  */
 static bool virt_ext_dest_id __ro_after_init;
 
+/* For parallel bootup. */
+unsigned long apic_mmio_base __ro_after_init;
+
 /*
  * Map cpu index to physical APIC ID
  */
@@ -2163,6 +2166,7 @@ void __init register_lapic_address(unsigned long address)
 
        if (!x2apic_mode) {
                set_fixmap_nocache(FIX_APIC_BASE, address);
+               apic_mmio_base = APIC_BASE;
                apic_printk(APIC_VERBOSE, "mapped APIC to %16lx (%16lx)\n",
                            APIC_BASE, address);
        }
@@ -2376,7 +2380,7 @@ static int nr_logical_cpuids = 1;
 /*
  * Used to store mapping between logical CPU IDs and APIC IDs.
  */
-static int cpuid_to_apicid[] = {
+int cpuid_to_apicid[] = {
        [0 ... NR_CPUS - 1] = -1,
 };
 
@@ -2386,20 +2390,31 @@ bool arch_match_cpu_phys_id(int cpu, u64 phys_id)
 }
 
 #ifdef CONFIG_SMP
-/**
- * apic_id_is_primary_thread - Check whether APIC ID belongs to a primary thread
- * @apicid: APIC ID to check
+static void cpu_mark_primary_thread(unsigned int cpu, unsigned int apicid)
+{
+       /* Isolate the SMT bit(s) in the APICID and check for 0 */
+       u32 mask = (1U << (fls(smp_num_siblings) - 1)) - 1;
+
+       if (smp_num_siblings == 1 || !(apicid & mask))
+               cpumask_set_cpu(cpu, &__cpu_primary_thread_mask);
+}
+
+/*
+ * Due to the utter mess of CPUID evaluation smp_num_siblings is not valid
+ * during early boot. Initialize the primary thread mask before SMP
+ * bringup.
  */
-bool apic_id_is_primary_thread(unsigned int apicid)
+static int __init smp_init_primary_thread_mask(void)
 {
-       u32 mask;
+       unsigned int cpu;
 
-       if (smp_num_siblings == 1)
-               return true;
-       /* Isolate the SMT bit(s) in the APICID and check for 0 */
-       mask = (1U << (fls(smp_num_siblings) - 1)) - 1;
-       return !(apicid & mask);
+       for (cpu = 0; cpu < nr_logical_cpuids; cpu++)
+               cpu_mark_primary_thread(cpu, cpuid_to_apicid[cpu]);
+       return 0;
 }
+early_initcall(smp_init_primary_thread_mask);
+#else
+static inline void cpu_mark_primary_thread(unsigned int cpu, unsigned int apicid) { }
 #endif
 
 /*
@@ -2544,6 +2559,9 @@ int generic_processor_info(int apicid, int version)
        set_cpu_present(cpu, true);
        num_processors++;
 
+       if (system_state != SYSTEM_BOOTING)
+               cpu_mark_primary_thread(cpu, apicid);
+
        return cpu;
 }
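
An illustrative userspace sketch of the check performed by cpu_mark_primary_thread() above: the SMT sibling bits of the APIC ID are isolated and a zero value marks the primary thread of the core. fls_u32(), is_primary_thread() and the example APIC IDs below are invented for illustration only; __builtin_clz() stands in for the kernel's fls().

#include <stdio.h>

/* Userspace stand-in for the kernel's fls(): position of the highest set bit, 1-based. */
static int fls_u32(unsigned int x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

/* Non-zero if 'apicid' denotes the first SMT thread of its core. */
static int is_primary_thread(unsigned int apicid, unsigned int siblings)
{
	unsigned int mask = (1U << (fls_u32(siblings) - 1)) - 1;

	return siblings == 1 || !(apicid & mask);
}

int main(void)
{
	/* With 2 siblings the mask is 0x1: APIC IDs 0x10 and 0x11 share a core. */
	printf("0x10 -> %d, 0x11 -> %d\n",
	       is_primary_thread(0x10, 2), is_primary_thread(0x11, 2));
	return 0;
}

With two siblings only the even APIC ID is reported as a primary thread, which is what the mask-and-test above encodes.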
 
index 6bde05a..896bc41 100644 (file)
@@ -97,7 +97,10 @@ static void init_x2apic_ldr(void)
 
 static int x2apic_phys_probe(void)
 {
-       if (x2apic_mode && (x2apic_phys || x2apic_fadt_phys()))
+       if (!x2apic_mode)
+               return 0;
+
+       if (x2apic_phys || x2apic_fadt_phys())
                return 1;
 
        return apic == &apic_x2apic_phys;
index 4828552..d9384d5 100644 (file)
@@ -546,7 +546,6 @@ unsigned long sn_rtc_cycles_per_second;
 EXPORT_SYMBOL(sn_rtc_cycles_per_second);
 
 /* The following values are used for the per node hub info struct */
-static __initdata unsigned short               *_node_to_pnode;
 static __initdata unsigned short               _min_socket, _max_socket;
 static __initdata unsigned short               _min_pnode, _max_pnode, _gr_table_len;
 static __initdata struct uv_gam_range_entry    *uv_gre_table;
@@ -554,6 +553,7 @@ static __initdata struct uv_gam_parameters  *uv_gp_table;
 static __initdata unsigned short               *_socket_to_node;
 static __initdata unsigned short               *_socket_to_pnode;
 static __initdata unsigned short               *_pnode_to_socket;
+static __initdata unsigned short               *_node_to_socket;
 
 static __initdata struct uv_gam_range_s                *_gr_table;
 
@@ -617,7 +617,8 @@ static __init void build_uv_gr_table(void)
 
        bytes = _gr_table_len * sizeof(struct uv_gam_range_s);
        grt = kzalloc(bytes, GFP_KERNEL);
-       BUG_ON(!grt);
+       if (WARN_ON_ONCE(!grt))
+               return;
        _gr_table = grt;
 
        for (; gre->type != UV_GAM_RANGE_TYPE_UNUSED; gre++) {
@@ -1022,7 +1023,7 @@ static void __init calc_mmioh_map(enum mmioh_arch index,
        switch (index) {
        case UVY_MMIOH0:
                mmr = UVH_RH10_GAM_MMIOH_REDIRECT_CONFIG0;
-               nasid_mask = UVH_RH10_GAM_MMIOH_OVERLAY_CONFIG0_BASE_MASK;
+               nasid_mask = UVYH_RH10_GAM_MMIOH_REDIRECT_CONFIG0_NASID_MASK;
                n = UVH_RH10_GAM_MMIOH_REDIRECT_CONFIG0_DEPTH;
                min_nasid = min_pnode;
                max_nasid = max_pnode;
@@ -1030,7 +1031,7 @@ static void __init calc_mmioh_map(enum mmioh_arch index,
                break;
        case UVY_MMIOH1:
                mmr = UVH_RH10_GAM_MMIOH_REDIRECT_CONFIG1;
-               nasid_mask = UVH_RH10_GAM_MMIOH_OVERLAY_CONFIG1_BASE_MASK;
+               nasid_mask = UVYH_RH10_GAM_MMIOH_REDIRECT_CONFIG1_NASID_MASK;
                n = UVH_RH10_GAM_MMIOH_REDIRECT_CONFIG1_DEPTH;
                min_nasid = min_pnode;
                max_nasid = max_pnode;
@@ -1038,7 +1039,7 @@ static void __init calc_mmioh_map(enum mmioh_arch index,
                break;
        case UVX_MMIOH0:
                mmr = UVH_RH_GAM_MMIOH_REDIRECT_CONFIG0;
-               nasid_mask = UVH_RH_GAM_MMIOH_OVERLAY_CONFIG0_BASE_MASK;
+               nasid_mask = UVH_RH_GAM_MMIOH_REDIRECT_CONFIG0_NASID_MASK;
                n = UVH_RH_GAM_MMIOH_REDIRECT_CONFIG0_DEPTH;
                min_nasid = min_pnode * 2;
                max_nasid = max_pnode * 2;
@@ -1046,7 +1047,7 @@ static void __init calc_mmioh_map(enum mmioh_arch index,
                break;
        case UVX_MMIOH1:
                mmr = UVH_RH_GAM_MMIOH_REDIRECT_CONFIG1;
-               nasid_mask = UVH_RH_GAM_MMIOH_OVERLAY_CONFIG1_BASE_MASK;
+               nasid_mask = UVH_RH_GAM_MMIOH_REDIRECT_CONFIG1_NASID_MASK;
                n = UVH_RH_GAM_MMIOH_REDIRECT_CONFIG1_DEPTH;
                min_nasid = min_pnode * 2;
                max_nasid = max_pnode * 2;
@@ -1072,8 +1073,9 @@ static void __init calc_mmioh_map(enum mmioh_arch index,
 
                /* Invalid NASID check */
                if (nasid < min_nasid || max_nasid < nasid) {
-                       pr_err("UV:%s:Invalid NASID:%x (range:%x..%x)\n",
-                               __func__, index, min_nasid, max_nasid);
+                       /* Not an error: unused table entries get "poison" values */
+                       pr_debug("UV:%s:Invalid NASID(%x):%x (range:%x..%x)\n",
+                              __func__, index, nasid, min_nasid, max_nasid);
                        nasid = -1;
                }
 
@@ -1292,6 +1294,7 @@ static void __init uv_init_hub_info(struct uv_hub_info_s *hi)
        hi->nasid_shift         = uv_cpuid.nasid_shift;
        hi->min_pnode           = _min_pnode;
        hi->min_socket          = _min_socket;
+       hi->node_to_socket      = _node_to_socket;
        hi->pnode_to_socket     = _pnode_to_socket;
        hi->socket_to_node      = _socket_to_node;
        hi->socket_to_pnode     = _socket_to_pnode;
@@ -1348,7 +1351,7 @@ static void __init decode_gam_rng_tbl(unsigned long ptr)
        struct uv_gam_range_entry *gre = (struct uv_gam_range_entry *)ptr;
        unsigned long lgre = 0, gend = 0;
        int index = 0;
-       int sock_min = 999999, pnode_min = 99999;
+       int sock_min = INT_MAX, pnode_min = INT_MAX;
        int sock_max = -1, pnode_max = -1;
 
        uv_gre_table = gre;
@@ -1459,11 +1462,37 @@ static int __init decode_uv_systab(void)
        return 0;
 }
 
+/*
+ * Given a bitmask 'bits' representing present blades, numbered
+ * starting at 'base', masking off unused high bits of the blade
+ * number with 'mask', update the minimum and maximum blade numbers
+ * found so far.  (Masking with 'mask' is necessary because of the
+ * BIOS's treatment of system partitioning when creating the table
+ * being interpreted here.)
+ */
+static inline void blade_update_min_max(unsigned long bits, int base, int mask, int *min, int *max)
+{
+       int first, last;
+
+       if (!bits)
+               return;
+       first = (base + __ffs(bits)) & mask;
+       last =  (base + __fls(bits)) & mask;
+
+       if (*min > first)
+               *min = first;
+       if (*max < last)
+               *max = last;
+}
+
 /* Set up physical blade translations from UVH_NODE_PRESENT_TABLE */
 static __init void boot_init_possible_blades(struct uv_hub_info_s *hub_info)
 {
        unsigned long np;
        int i, uv_pb = 0;
+       int sock_min = INT_MAX, sock_max = -1, s_mask;
+
+       s_mask = (1 << uv_cpuid.n_skt) - 1;
 
        if (UVH_NODE_PRESENT_TABLE) {
                pr_info("UV: NODE_PRESENT_DEPTH = %d\n",
@@ -1471,35 +1500,82 @@ static __init void boot_init_possible_blades(struct uv_hub_info_s *hub_info)
                for (i = 0; i < UVH_NODE_PRESENT_TABLE_DEPTH; i++) {
                        np = uv_read_local_mmr(UVH_NODE_PRESENT_TABLE + i * 8);
                        pr_info("UV: NODE_PRESENT(%d) = 0x%016lx\n", i, np);
-                       uv_pb += hweight64(np);
+                       blade_update_min_max(np, i * 64, s_mask, &sock_min, &sock_max);
                }
        }
        if (UVH_NODE_PRESENT_0) {
                np = uv_read_local_mmr(UVH_NODE_PRESENT_0);
                pr_info("UV: NODE_PRESENT_0 = 0x%016lx\n", np);
-               uv_pb += hweight64(np);
+               blade_update_min_max(np, 0, s_mask, &sock_min, &sock_max);
        }
        if (UVH_NODE_PRESENT_1) {
                np = uv_read_local_mmr(UVH_NODE_PRESENT_1);
                pr_info("UV: NODE_PRESENT_1 = 0x%016lx\n", np);
-               uv_pb += hweight64(np);
+               blade_update_min_max(np, 64, s_mask, &sock_min, &sock_max);
+       }
+
+       /* Only update if we actually found some bits indicating blades present */
+       if (sock_max >= sock_min) {
+               _min_socket = sock_min;
+               _max_socket = sock_max;
+               uv_pb = sock_max - sock_min + 1;
        }
        if (uv_possible_blades != uv_pb)
                uv_possible_blades = uv_pb;
 
-       pr_info("UV: number nodes/possible blades %d\n", uv_pb);
+       pr_info("UV: number nodes/possible blades %d (%d - %d)\n",
+               uv_pb, sock_min, sock_max);
+}
+
+static int __init alloc_conv_table(int num_elem, unsigned short **table)
+{
+       int i;
+       size_t bytes;
+
+       bytes = num_elem * sizeof(*table[0]);
+       *table = kmalloc(bytes, GFP_KERNEL);
+       if (WARN_ON_ONCE(!*table))
+               return -ENOMEM;
+       for (i = 0; i < num_elem; i++)
+               ((unsigned short *)*table)[i] = SOCK_EMPTY;
+       return 0;
 }
 
+/* Remove conversion table if it's 1:1 */
+#define FREE_1_TO_1_TABLE(tbl, min, max, max2) free_1_to_1_table(&tbl, #tbl, min, max, max2)
+
+static void __init free_1_to_1_table(unsigned short **tp, char *tname, int min, int max, int max2)
+{
+       int i;
+       unsigned short *table = *tp;
+
+       if (table == NULL)
+               return;
+       if (max != max2)
+               return;
+       for (i = 0; i < max; i++) {
+               if (i != table[i])
+                       return;
+       }
+       kfree(table);
+       *tp = NULL;
+       pr_info("UV: %s is 1:1, conversion table removed\n", tname);
+}
+
+/*
+ * Build Socket Tables
+ * If there is more than one node per socket, the socket-to-node table
+ * will contain the lowest node number on that socket.
+ */
 static void __init build_socket_tables(void)
 {
        struct uv_gam_range_entry *gre = uv_gre_table;
-       int num, nump;
+       int nums, numn, nump;
        int cpu, i, lnid;
        int minsock = _min_socket;
        int maxsock = _max_socket;
        int minpnode = _min_pnode;
        int maxpnode = _max_pnode;
-       size_t bytes;
 
        if (!gre) {
                if (is_uv2_hub() || is_uv3_hub()) {
@@ -1507,39 +1583,36 @@ static void __init build_socket_tables(void)
                        return;
                }
                pr_err("UV: Error: UVsystab address translations not available!\n");
-               BUG();
+               WARN_ON_ONCE(!gre);
+               return;
        }
 
-       /* Build socket id -> node id, pnode */
-       num = maxsock - minsock + 1;
-       bytes = num * sizeof(_socket_to_node[0]);
-       _socket_to_node = kmalloc(bytes, GFP_KERNEL);
-       _socket_to_pnode = kmalloc(bytes, GFP_KERNEL);
-
+       numn = num_possible_nodes();
        nump = maxpnode - minpnode + 1;
-       bytes = nump * sizeof(_pnode_to_socket[0]);
-       _pnode_to_socket = kmalloc(bytes, GFP_KERNEL);
-       BUG_ON(!_socket_to_node || !_socket_to_pnode || !_pnode_to_socket);
-
-       for (i = 0; i < num; i++)
-               _socket_to_node[i] = _socket_to_pnode[i] = SOCK_EMPTY;
-
-       for (i = 0; i < nump; i++)
-               _pnode_to_socket[i] = SOCK_EMPTY;
+       nums = maxsock - minsock + 1;
+
+       /* Allocate and clear tables */
+       if ((alloc_conv_table(nump, &_pnode_to_socket) < 0)
+           || (alloc_conv_table(nums, &_socket_to_pnode) < 0)
+           || (alloc_conv_table(numn, &_node_to_socket) < 0)
+           || (alloc_conv_table(nums, &_socket_to_node) < 0)) {
+               kfree(_pnode_to_socket);
+               kfree(_socket_to_pnode);
+               kfree(_node_to_socket);
+               return;
+       }
 
        /* Fill in pnode/node/addr conversion list values: */
-       pr_info("UV: GAM Building socket/pnode conversion tables\n");
        for (; gre->type != UV_GAM_RANGE_TYPE_UNUSED; gre++) {
                if (gre->type == UV_GAM_RANGE_TYPE_HOLE)
                        continue;
                i = gre->sockid - minsock;
-               /* Duplicate: */
-               if (_socket_to_pnode[i] != SOCK_EMPTY)
-                       continue;
-               _socket_to_pnode[i] = gre->pnode;
+               if (_socket_to_pnode[i] == SOCK_EMPTY)
+                       _socket_to_pnode[i] = gre->pnode;
 
                i = gre->pnode - minpnode;
-               _pnode_to_socket[i] = gre->sockid;
+               if (_pnode_to_socket[i] == SOCK_EMPTY)
+                       _pnode_to_socket[i] = gre->sockid;
 
                pr_info("UV: sid:%02x type:%d nasid:%04x pn:%02x pn2s:%2x\n",
                        gre->sockid, gre->type, gre->nasid,
@@ -1549,66 +1622,39 @@ static void __init build_socket_tables(void)
 
        /* Set socket -> node values: */
        lnid = NUMA_NO_NODE;
-       for_each_present_cpu(cpu) {
+       for_each_possible_cpu(cpu) {
                int nid = cpu_to_node(cpu);
                int apicid, sockid;
 
                if (lnid == nid)
                        continue;
                lnid = nid;
+
                apicid = per_cpu(x86_cpu_to_apicid, cpu);
                sockid = apicid >> uv_cpuid.socketid_shift;
-               _socket_to_node[sockid - minsock] = nid;
-               pr_info("UV: sid:%02x: apicid:%04x node:%2d\n",
-                       sockid, apicid, nid);
-       }
 
-       /* Set up physical blade to pnode translation from GAM Range Table: */
-       bytes = num_possible_nodes() * sizeof(_node_to_pnode[0]);
-       _node_to_pnode = kmalloc(bytes, GFP_KERNEL);
-       BUG_ON(!_node_to_pnode);
+               if (_socket_to_node[sockid - minsock] == SOCK_EMPTY)
+                       _socket_to_node[sockid - minsock] = nid;
 
-       for (lnid = 0; lnid < num_possible_nodes(); lnid++) {
-               unsigned short sockid;
+               if (_node_to_socket[nid] == SOCK_EMPTY)
+                       _node_to_socket[nid] = sockid;
 
-               for (sockid = minsock; sockid <= maxsock; sockid++) {
-                       if (lnid == _socket_to_node[sockid - minsock]) {
-                               _node_to_pnode[lnid] = _socket_to_pnode[sockid - minsock];
-                               break;
-                       }
-               }
-               if (sockid > maxsock) {
-                       pr_err("UV: socket for node %d not found!\n", lnid);
-                       BUG();
-               }
+               pr_info("UV: sid:%02x: apicid:%04x socket:%02d node:%03x s2n:%03x\n",
+                       sockid,
+                       apicid,
+                       _node_to_socket[nid],
+                       nid,
+                       _socket_to_node[sockid - minsock]);
        }
 
        /*
-        * If socket id == pnode or socket id == node for all nodes,
+        * If e.g. socket id == pnode for all pnodes,
         *   system runs faster by removing corresponding conversion table.
         */
-       pr_info("UV: Checking socket->node/pnode for identity maps\n");
-       if (minsock == 0) {
-               for (i = 0; i < num; i++)
-                       if (_socket_to_node[i] == SOCK_EMPTY || i != _socket_to_node[i])
-                               break;
-               if (i >= num) {
-                       kfree(_socket_to_node);
-                       _socket_to_node = NULL;
-                       pr_info("UV: 1:1 socket_to_node table removed\n");
-               }
-       }
-       if (minsock == minpnode) {
-               for (i = 0; i < num; i++)
-                       if (_socket_to_pnode[i] != SOCK_EMPTY &&
-                               _socket_to_pnode[i] != i + minpnode)
-                               break;
-               if (i >= num) {
-                       kfree(_socket_to_pnode);
-                       _socket_to_pnode = NULL;
-                       pr_info("UV: 1:1 socket_to_pnode table removed\n");
-               }
-       }
+       FREE_1_TO_1_TABLE(_socket_to_node, _min_socket, nums, numn);
+       FREE_1_TO_1_TABLE(_node_to_socket, _min_socket, nums, numn);
+       FREE_1_TO_1_TABLE(_socket_to_pnode, _min_pnode, nums, nump);
+       FREE_1_TO_1_TABLE(_pnode_to_socket, _min_pnode, nums, nump);
 }
 
 /* Check which reboot to use */
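
An illustrative userspace sketch of the identity-map pruning done by FREE_1_TO_1_TABLE()/free_1_to_1_table() above: once a conversion table maps every index to itself, it can be dropped and lookups fall back to using the index directly. The table name and sizes below are invented for illustration; the kernel version additionally takes a minimum index and works with SOCK_EMPTY markers.

#include <stdio.h>
#include <stdlib.h>

/* Drop *tp if it maps every index in [0, max) to itself and both index
 * spaces have the same size; callers then use the index directly. */
static void free_1_to_1_table(unsigned short **tp, const char *name, int max, int max2)
{
	unsigned short *table = *tp;
	int i;

	if (!table || max != max2)
		return;
	for (i = 0; i < max; i++) {
		if (table[i] != i)
			return;
	}
	free(table);
	*tp = NULL;
	printf("%s is 1:1, conversion table removed\n", name);
}

int main(void)
{
	unsigned short *socket_to_node = malloc(4 * sizeof(*socket_to_node));
	int i;

	if (!socket_to_node)
		return 1;
	for (i = 0; i < 4; i++)
		socket_to_node[i] = i;	/* identity mapping */
	free_1_to_1_table(&socket_to_node, "socket_to_node", 4, 4);
	return 0;
}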
@@ -1692,12 +1738,13 @@ static __init int uv_system_init_hubless(void)
 static void __init uv_system_init_hub(void)
 {
        struct uv_hub_info_s hub_info = {0};
-       int bytes, cpu, nodeid;
-       unsigned short min_pnode = 9999, max_pnode = 0;
+       int bytes, cpu, nodeid, bid;
+       unsigned short min_pnode = USHRT_MAX, max_pnode = 0;
        char *hub = is_uv5_hub() ? "UV500" :
                    is_uv4_hub() ? "UV400" :
                    is_uv3_hub() ? "UV300" :
                    is_uv2_hub() ? "UV2000/3000" : NULL;
+       struct uv_hub_info_s **uv_hub_info_list_blade;
 
        if (!hub) {
                pr_err("UV: Unknown/unsupported UV hub\n");
@@ -1720,9 +1767,12 @@ static void __init uv_system_init_hub(void)
        build_uv_gr_table();
        set_block_size();
        uv_init_hub_info(&hub_info);
-       uv_possible_blades = num_possible_nodes();
-       if (!_node_to_pnode)
+       /* If UV2 or UV3, may need to get # blades from HW */
+       if (is_uv(UV2|UV3) && !uv_gre_table)
                boot_init_possible_blades(&hub_info);
+       else
+               /* min/max sockets set in decode_gam_rng_tbl */
+               uv_possible_blades = (_max_socket - _min_socket) + 1;
 
        /* uv_num_possible_blades() is really the hub count: */
        pr_info("UV: Found %d hubs, %d nodes, %d CPUs\n", uv_num_possible_blades(), num_possible_nodes(), num_possible_cpus());
@@ -1731,79 +1781,98 @@ static void __init uv_system_init_hub(void)
        hub_info.coherency_domain_number = sn_coherency_id;
        uv_rtc_init();
 
+       /*
+        * __uv_hub_info_list[] is indexed by node, but there is only
+        * one hub_info structure per blade.  First, allocate one
+        * structure per blade.  Further down we create a per-node
+        * table (__uv_hub_info_list[]) pointing to hub_info
+        * structures for the correct blade.
+        */
+
        bytes = sizeof(void *) * uv_num_possible_blades();
-       __uv_hub_info_list = kzalloc(bytes, GFP_KERNEL);
-       BUG_ON(!__uv_hub_info_list);
+       uv_hub_info_list_blade = kzalloc(bytes, GFP_KERNEL);
+       if (WARN_ON_ONCE(!uv_hub_info_list_blade))
+               return;
 
        bytes = sizeof(struct uv_hub_info_s);
-       for_each_node(nodeid) {
+       for_each_possible_blade(bid) {
                struct uv_hub_info_s *new_hub;
 
-               if (__uv_hub_info_list[nodeid]) {
-                       pr_err("UV: Node %d UV HUB already initialized!?\n", nodeid);
-                       BUG();
+               /* Allocate & fill new per hub info list */
+               new_hub = (bid == 0) ?  &uv_hub_info_node0
+                       : kzalloc_node(bytes, GFP_KERNEL, uv_blade_to_node(bid));
+               if (WARN_ON_ONCE(!new_hub)) {
+                       /* do not kfree() bid 0, which is statically allocated */
+                       while (--bid > 0)
+                               kfree(uv_hub_info_list_blade[bid]);
+                       kfree(uv_hub_info_list_blade);
+                       return;
                }
 
-               /* Allocate new per hub info list */
-               new_hub = (nodeid == 0) ?  &uv_hub_info_node0 : kzalloc_node(bytes, GFP_KERNEL, nodeid);
-               BUG_ON(!new_hub);
-               __uv_hub_info_list[nodeid] = new_hub;
-               new_hub = uv_hub_info_list(nodeid);
-               BUG_ON(!new_hub);
+               uv_hub_info_list_blade[bid] = new_hub;
                *new_hub = hub_info;
 
                /* Use information from GAM table if available: */
-               if (_node_to_pnode)
-                       new_hub->pnode = _node_to_pnode[nodeid];
+               if (uv_gre_table)
+                       new_hub->pnode = uv_blade_to_pnode(bid);
                else /* Or fill in during CPU loop: */
                        new_hub->pnode = 0xffff;
 
-               new_hub->numa_blade_id = uv_node_to_blade_id(nodeid);
+               new_hub->numa_blade_id = bid;
                new_hub->memory_nid = NUMA_NO_NODE;
                new_hub->nr_possible_cpus = 0;
                new_hub->nr_online_cpus = 0;
        }
 
+       /*
+        * Now populate __uv_hub_info_list[] for each node with the
+        * pointer to the struct for the blade it resides on.
+        */
+
+       bytes = sizeof(void *) * num_possible_nodes();
+       __uv_hub_info_list = kzalloc(bytes, GFP_KERNEL);
+       if (WARN_ON_ONCE(!__uv_hub_info_list)) {
+               for_each_possible_blade(bid)
+                       /* bid 0 is statically allocated */
+                       if (bid != 0)
+                               kfree(uv_hub_info_list_blade[bid]);
+               kfree(uv_hub_info_list_blade);
+               return;
+       }
+
+       for_each_node(nodeid)
+               __uv_hub_info_list[nodeid] = uv_hub_info_list_blade[uv_node_to_blade_id(nodeid)];
+
        /* Initialize per CPU info: */
        for_each_possible_cpu(cpu) {
-               int apicid = per_cpu(x86_cpu_to_apicid, cpu);
-               int numa_node_id;
+               int apicid = early_per_cpu(x86_cpu_to_apicid, cpu);
+               unsigned short bid;
                unsigned short pnode;
 
-               nodeid = cpu_to_node(cpu);
-               numa_node_id = numa_cpu_node(cpu);
                pnode = uv_apicid_to_pnode(apicid);
+               bid = uv_pnode_to_socket(pnode) - _min_socket;
 
-               uv_cpu_info_per(cpu)->p_uv_hub_info = uv_hub_info_list(nodeid);
+               uv_cpu_info_per(cpu)->p_uv_hub_info = uv_hub_info_list_blade[bid];
                uv_cpu_info_per(cpu)->blade_cpu_id = uv_cpu_hub_info(cpu)->nr_possible_cpus++;
                if (uv_cpu_hub_info(cpu)->memory_nid == NUMA_NO_NODE)
                        uv_cpu_hub_info(cpu)->memory_nid = cpu_to_node(cpu);
 
-               /* Init memoryless node: */
-               if (nodeid != numa_node_id &&
-                   uv_hub_info_list(numa_node_id)->pnode == 0xffff)
-                       uv_hub_info_list(numa_node_id)->pnode = pnode;
-               else if (uv_cpu_hub_info(cpu)->pnode == 0xffff)
+               if (uv_cpu_hub_info(cpu)->pnode == 0xffff)
                        uv_cpu_hub_info(cpu)->pnode = pnode;
        }
 
-       for_each_node(nodeid) {
-               unsigned short pnode = uv_hub_info_list(nodeid)->pnode;
+       for_each_possible_blade(bid) {
+               unsigned short pnode = uv_hub_info_list_blade[bid]->pnode;
 
-               /* Add pnode info for pre-GAM list nodes without CPUs: */
-               if (pnode == 0xffff) {
-                       unsigned long paddr;
+               if (pnode == 0xffff)
+                       continue;
 
-                       paddr = node_start_pfn(nodeid) << PAGE_SHIFT;
-                       pnode = uv_gpa_to_pnode(uv_soc_phys_ram_to_gpa(paddr));
-                       uv_hub_info_list(nodeid)->pnode = pnode;
-               }
                min_pnode = min(pnode, min_pnode);
                max_pnode = max(pnode, max_pnode);
-               pr_info("UV: UVHUB node:%2d pn:%02x nrcpus:%d\n",
-                       nodeid,
-                       uv_hub_info_list(nodeid)->pnode,
-                       uv_hub_info_list(nodeid)->nr_possible_cpus);
+               pr_info("UV: HUB:%2d pn:%02x nrcpus:%d\n",
+                       bid,
+                       uv_hub_info_list_blade[bid]->pnode,
+                       uv_hub_info_list_blade[bid]->nr_possible_cpus);
        }
 
        pr_info("UV: min_pnode:%02x max_pnode:%02x\n", min_pnode, max_pnode);
@@ -1811,6 +1880,9 @@ static void __init uv_system_init_hub(void)
        map_mmr_high(max_pnode);
        map_mmioh_high(min_pnode, max_pnode);
 
+       kfree(uv_hub_info_list_blade);
+       uv_hub_info_list_blade = NULL;
+
        uv_nmi_setup();
        uv_cpu_init();
        uv_setup_proc_files(0);
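
An illustrative userspace sketch of how blade_update_min_max() above derives the lowest and highest present blade numbers from a 64-bit presence word. __builtin_ctzll()/__builtin_clzll() stand in for the kernel's __ffs()/__fls(); the 6-bit socket field width and the example bitmask are invented for illustration.

#include <stdio.h>
#include <limits.h>

static void blade_update_min_max(unsigned long long bits, int base, int mask,
				 int *min, int *max)
{
	int first, last;

	if (!bits)
		return;
	first = (base + __builtin_ctzll(bits)) & mask;		/* lowest set bit */
	last  = (base + 63 - __builtin_clzll(bits)) & mask;	/* highest set bit */

	if (*min > first)
		*min = first;
	if (*max < last)
		*max = last;
}

int main(void)
{
	int sock_min = INT_MAX, sock_max = -1;
	int s_mask = (1 << 6) - 1;	/* pretend the socket ID field is 6 bits wide */

	/* Blades 2, 3 and 5 present in the first 64-entry word. */
	blade_update_min_max(0x2CULL, 0, s_mask, &sock_min, &sock_max);
	printf("min %d max %d (%d possible blades)\n",
	       sock_min, sock_max, sock_max - sock_min + 1);
	return 0;
}

The number of possible blades then follows as max - min + 1, matching the computation in boot_init_possible_blades().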
index 22ab139..c06bfc0 100644 (file)
@@ -133,8 +133,8 @@ static bool skip_addr(void *dest)
        /* Accounts directly */
        if (dest == ret_from_fork)
                return true;
-#ifdef CONFIG_HOTPLUG_CPU
-       if (dest == start_cpu0)
+#if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_AMD_MEM_ENCRYPT)
+       if (dest == soft_restart_cpu)
                return true;
 #endif
 #ifdef CONFIG_FUNCTION_TRACER
@@ -293,7 +293,8 @@ void *callthunks_translate_call_dest(void *dest)
        return target ? : dest;
 }
 
-bool is_callthunk(void *addr)
+#ifdef CONFIG_BPF_JIT
+static bool is_callthunk(void *addr)
 {
        unsigned int tmpl_size = SKL_TMPL_SIZE;
        void *tmpl = skl_call_thunk_template;
@@ -306,7 +307,6 @@ bool is_callthunk(void *addr)
        return !bcmp((void *)(dest - tmpl_size), tmpl, tmpl_size);
 }
 
-#ifdef CONFIG_BPF_JIT
 int x86_call_depth_emit_accounting(u8 **pprog, void *func)
 {
        unsigned int tmpl_size = SKL_TMPL_SIZE;
index d7e3cea..4350f6b 100644 (file)
@@ -27,7 +27,7 @@ obj-y                 += cpuid-deps.o
 obj-y                  += umwait.o
 
 obj-$(CONFIG_PROC_FS)  += proc.o
-obj-$(CONFIG_X86_FEATURE_NAMES) += capflags.o powerflags.o
+obj-y += capflags.o powerflags.o
 
 obj-$(CONFIG_IA32_FEAT_CTL) += feat_ctl.o
 ifdef CONFIG_CPU_SUP_INTEL
@@ -54,7 +54,6 @@ obj-$(CONFIG_X86_LOCAL_APIC)          += perfctr-watchdog.o
 obj-$(CONFIG_HYPERVISOR_GUEST)         += vmware.o hypervisor.o mshyperv.o
 obj-$(CONFIG_ACRN_GUEST)               += acrn.o
 
-ifdef CONFIG_X86_FEATURE_NAMES
 quiet_cmd_mkcapflags = MKCAP   $@
       cmd_mkcapflags = $(CONFIG_SHELL) $(srctree)/$(src)/mkcapflags.sh $@ $^
 
@@ -63,5 +62,4 @@ vmxfeature = $(src)/../../include/asm/vmxfeatures.h
 
 $(obj)/capflags.c: $(cpufeature) $(vmxfeature) $(src)/mkcapflags.sh FORCE
        $(call if_changed,mkcapflags)
-endif
 targets += capflags.c
index 182af64..9e2a918 100644 (file)
@@ -9,7 +9,6 @@
  *     - Andrew D. Balsa (code cleanup).
  */
 #include <linux/init.h>
-#include <linux/utsname.h>
 #include <linux/cpu.h>
 #include <linux/module.h>
 #include <linux/nospec.h>
@@ -27,8 +26,6 @@
 #include <asm/msr.h>
 #include <asm/vmx.h>
 #include <asm/paravirt.h>
-#include <asm/alternative.h>
-#include <asm/set_memory.h>
 #include <asm/intel-family.h>
 #include <asm/e820/api.h>
 #include <asm/hypervisor.h>
@@ -125,21 +122,8 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
 DEFINE_STATIC_KEY_FALSE(mmio_stale_data_clear);
 EXPORT_SYMBOL_GPL(mmio_stale_data_clear);
 
-void __init check_bugs(void)
+void __init cpu_select_mitigations(void)
 {
-       identify_boot_cpu();
-
-       /*
-        * identify_boot_cpu() initialized SMT support information, let the
-        * core code know.
-        */
-       cpu_smt_check_topology();
-
-       if (!IS_ENABLED(CONFIG_SMP)) {
-               pr_info("CPU: ");
-               print_cpu_info(&boot_cpu_data);
-       }
-
        /*
         * Read the SPEC_CTRL MSR to account for reserved bits which may
         * have unknown values. AMD64_LS_CFG MSR is cached in the early AMD
@@ -176,39 +160,6 @@ void __init check_bugs(void)
        md_clear_select_mitigation();
        srbds_select_mitigation();
        l1d_flush_select_mitigation();
-
-       arch_smt_update();
-
-#ifdef CONFIG_X86_32
-       /*
-        * Check whether we are able to run this kernel safely on SMP.
-        *
-        * - i386 is no longer supported.
-        * - In order to run on anything without a TSC, we need to be
-        *   compiled for a i486.
-        */
-       if (boot_cpu_data.x86 < 4)
-               panic("Kernel requires i486+ for 'invlpg' and other features");
-
-       init_utsname()->machine[1] =
-               '0' + (boot_cpu_data.x86 > 6 ? 6 : boot_cpu_data.x86);
-       alternative_instructions();
-
-       fpu__init_check_bugs();
-#else /* CONFIG_X86_64 */
-       alternative_instructions();
-
-       /*
-        * Make sure the first 2MB area is not mapped by huge pages
-        * There are typically fixed size MTRRs in there and overlapping
-        * MTRRs into large pages causes slow downs.
-        *
-        * Right now we don't do that with gbpages because there seems
-        * very little benefit for that case.
-        */
-       if (!direct_gbpages)
-               set_memory_4k((unsigned long)__va(0), 1);
-#endif
 }
 
 /*
index 4063e89..8f86eac 100644 (file)
@@ -39,6 +39,8 @@ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 /* Shared L2 cache maps */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
 
+static cpumask_var_t cpu_cacheinfo_mask;
+
 /* Kernel controls MTRR and/or PAT MSRs. */
 unsigned int memory_caching_control __ro_after_init;
 
@@ -1172,8 +1174,10 @@ void cache_bp_restore(void)
                cache_cpu_init();
 }
 
-static int cache_ap_init(unsigned int cpu)
+static int cache_ap_online(unsigned int cpu)
 {
+       cpumask_set_cpu(cpu, cpu_cacheinfo_mask);
+
        if (!memory_caching_control || get_cache_aps_delayed_init())
                return 0;
 
@@ -1191,11 +1195,17 @@ static int cache_ap_init(unsigned int cpu)
         *      lock to prevent MTRR entry changes
         */
        stop_machine_from_inactive_cpu(cache_rendezvous_handler, NULL,
-                                      cpu_callout_mask);
+                                      cpu_cacheinfo_mask);
 
        return 0;
 }
 
+static int cache_ap_offline(unsigned int cpu)
+{
+       cpumask_clear_cpu(cpu, cpu_cacheinfo_mask);
+       return 0;
+}
+
 /*
  * Delayed cache initialization for all AP's
  */
@@ -1210,9 +1220,12 @@ void cache_aps_init(void)
 
 static int __init cache_ap_register(void)
 {
+       zalloc_cpumask_var(&cpu_cacheinfo_mask, GFP_KERNEL);
+       cpumask_set_cpu(smp_processor_id(), cpu_cacheinfo_mask);
+
        cpuhp_setup_state_nocalls(CPUHP_AP_CACHECTRL_STARTING,
                                  "x86/cachectrl:starting",
-                                 cache_ap_init, NULL);
+                                 cache_ap_online, cache_ap_offline);
        return 0;
 }
-core_initcall(cache_ap_register);
+early_initcall(cache_ap_register);
index 80710a6..52683fd 100644 (file)
 #include <linux/init.h>
 #include <linux/kprobes.h>
 #include <linux/kgdb.h>
+#include <linux/mem_encrypt.h>
 #include <linux/smp.h>
+#include <linux/cpu.h>
 #include <linux/io.h>
 #include <linux/syscore_ops.h>
 #include <linux/pgtable.h>
 #include <linux/stackprotector.h>
+#include <linux/utsname.h>
 
+#include <asm/alternative.h>
 #include <asm/cmdline.h>
 #include <asm/perf_event.h>
 #include <asm/mmu_context.h>
@@ -59,7 +63,7 @@
 #include <asm/intel-family.h>
 #include <asm/cpu_device_id.h>
 #include <asm/uv/uv.h>
-#include <asm/sigframe.h>
+#include <asm/set_memory.h>
 #include <asm/traps.h>
 #include <asm/sev.h>
 
 
 u32 elf_hwcap2 __read_mostly;
 
-/* all of these masks are initialized in setup_cpu_local_masks() */
-cpumask_var_t cpu_initialized_mask;
-cpumask_var_t cpu_callout_mask;
-cpumask_var_t cpu_callin_mask;
-
-/* representing cpus for which sibling maps can be computed */
-cpumask_var_t cpu_sibling_setup_mask;
-
 /* Number of siblings per CPU package */
 int smp_num_siblings = 1;
 EXPORT_SYMBOL(smp_num_siblings);
@@ -169,15 +165,6 @@ clear_ppin:
        clear_cpu_cap(c, info->feature);
 }
 
-/* correctly size the local cpu masks */
-void __init setup_cpu_local_masks(void)
-{
-       alloc_bootmem_cpumask_var(&cpu_initialized_mask);
-       alloc_bootmem_cpumask_var(&cpu_callin_mask);
-       alloc_bootmem_cpumask_var(&cpu_callout_mask);
-       alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
-}
-
 static void default_init(struct cpuinfo_x86 *c)
 {
 #ifdef CONFIG_X86_64
@@ -1502,12 +1489,10 @@ static void __init cpu_parse_early_param(void)
                if (!kstrtouint(opt, 10, &bit)) {
                        if (bit < NCAPINTS * 32) {
 
-#ifdef CONFIG_X86_FEATURE_NAMES
                                /* empty-string, i.e., ""-defined feature flags */
                                if (!x86_cap_flags[bit])
                                        pr_cont(" " X86_CAP_FMT_NUM, x86_cap_flag_num(bit));
                                else
-#endif
                                        pr_cont(" " X86_CAP_FMT, x86_cap_flag(bit));
 
                                setup_clear_cpu_cap(bit);
@@ -1520,7 +1505,6 @@ static void __init cpu_parse_early_param(void)
                        continue;
                }
 
-#ifdef CONFIG_X86_FEATURE_NAMES
                for (bit = 0; bit < 32 * NCAPINTS; bit++) {
                        if (!x86_cap_flag(bit))
                                continue;
@@ -1537,7 +1521,6 @@ static void __init cpu_parse_early_param(void)
 
                if (!found)
                        pr_cont(" (unknown: %s)", opt);
-#endif
        }
        pr_cont("\n");
 
@@ -1600,10 +1583,6 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
 
        sld_setup(c);
 
-       fpu__init_system(c);
-
-       init_sigframe_size();
-
 #ifdef CONFIG_X86_32
        /*
         * Regardless of whether PCID is enumerated, the SDM says
@@ -2123,19 +2102,6 @@ static void dbg_restore_debug_regs(void)
 #define dbg_restore_debug_regs()
 #endif /* ! CONFIG_KGDB */
 
-static void wait_for_master_cpu(int cpu)
-{
-#ifdef CONFIG_SMP
-       /*
-        * wait for ACK from master CPU before continuing
-        * with AP initialization
-        */
-       WARN_ON(cpumask_test_and_set_cpu(cpu, cpu_initialized_mask));
-       while (!cpumask_test_cpu(cpu, cpu_callout_mask))
-               cpu_relax();
-#endif
-}
-
 static inline void setup_getcpu(int cpu)
 {
        unsigned long cpudata = vdso_encode_cpunode(cpu, early_cpu_to_node(cpu));
@@ -2158,11 +2124,7 @@ static inline void setup_getcpu(int cpu)
 }
 
 #ifdef CONFIG_X86_64
-static inline void ucode_cpu_init(int cpu)
-{
-       if (cpu)
-               load_ucode_ap();
-}
+static inline void ucode_cpu_init(int cpu) { }
 
 static inline void tss_setup_ist(struct tss_struct *tss)
 {
@@ -2239,8 +2201,6 @@ void cpu_init(void)
        struct task_struct *cur = current;
        int cpu = raw_smp_processor_id();
 
-       wait_for_master_cpu(cpu);
-
        ucode_cpu_init(cpu);
 
 #ifdef CONFIG_NUMA
@@ -2285,26 +2245,12 @@ void cpu_init(void)
 
        doublefault_init_cpu_tss();
 
-       fpu__init_cpu();
-
        if (is_uv_system())
                uv_cpu_init();
 
        load_fixmap_gdt(cpu);
 }
 
-#ifdef CONFIG_SMP
-void cpu_init_secondary(void)
-{
-       /*
-        * Relies on the BP having set-up the IDT tables, which are loaded
-        * on this CPU in cpu_init_exception_handling().
-        */
-       cpu_init_exception_handling();
-       cpu_init();
-}
-#endif
-
 #ifdef CONFIG_MICROCODE_LATE_LOADING
 /**
  * store_cpu_caps() - Store a snapshot of CPU capabilities
@@ -2362,3 +2308,69 @@ void arch_smt_update(void)
        /* Check whether IPI broadcasting can be enabled */
        apic_smt_update();
 }
+
+void __init arch_cpu_finalize_init(void)
+{
+       identify_boot_cpu();
+
+       /*
+        * identify_boot_cpu() initialized SMT support information, let the
+        * core code know.
+        */
+       cpu_smt_check_topology();
+
+       if (!IS_ENABLED(CONFIG_SMP)) {
+               pr_info("CPU: ");
+               print_cpu_info(&boot_cpu_data);
+       }
+
+       cpu_select_mitigations();
+
+       arch_smt_update();
+
+       if (IS_ENABLED(CONFIG_X86_32)) {
+               /*
+                * Check whether this is a real i386, which is no longer
+                * supported, and fix up the utsname.
+                */
+               if (boot_cpu_data.x86 < 4)
+                       panic("Kernel requires i486+ for 'invlpg' and other features");
+
+               init_utsname()->machine[1] =
+                       '0' + (boot_cpu_data.x86 > 6 ? 6 : boot_cpu_data.x86);
+       }
+
+       /*
+        * Must be before alternatives because it might set or clear
+        * feature bits.
+        */
+       fpu__init_system();
+       fpu__init_cpu();
+
+       alternative_instructions();
+
+       if (IS_ENABLED(CONFIG_X86_64)) {
+               /*
+                * Make sure the first 2MB area is not mapped by huge pages.
+                * There are typically fixed size MTRRs in there and overlapping
+                * MTRRs into large pages causes slowdowns.
+                *
+                * Right now we don't do that with gbpages because there seems
+                * very little benefit for that case.
+                */
+               if (!direct_gbpages)
+                       set_memory_4k((unsigned long)__va(0), 1);
+       } else {
+               fpu__init_check_bugs();
+       }
+
+       /*
+        * This needs to be called before any devices perform DMA
+        * operations that might use the SWIOTLB bounce buffers. It will
+        * mark the bounce buffers as decrypted so that their usage will
+        * not cause "plain-text" data to be decrypted when accessed. It
+        * must be called after late_time_init() so that Hyper-V x86/x64
+        * hypercalls work when the SWIOTLB bounce buffers are decrypted.
+        */
+       mem_encrypt_init();
+}
index f97b0fe..1c44630 100644 (file)
@@ -79,6 +79,7 @@ extern void detect_ht(struct cpuinfo_x86 *c);
 extern void check_null_seg_clears_base(struct cpuinfo_x86 *c);
 
 unsigned int aperfmperf_get_khz(int cpu);
+void cpu_select_mitigations(void);
 
 extern void x86_spec_ctrl_setup_ap(void);
 extern void update_srbds_msr(void);
index 0b971f9..5e74610 100644 (file)
@@ -715,11 +715,13 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
 
 bool amd_mce_is_memory_error(struct mce *m)
 {
+       enum smca_bank_types bank_type;
        /* ErrCodeExt[20:16] */
        u8 xec = (m->status >> 16) & 0x1f;
 
+       bank_type = smca_get_bank_type(m->extcpu, m->bank);
        if (mce_flags.smca)
-               return smca_get_bank_type(m->extcpu, m->bank) == SMCA_UMC && xec == 0x0;
+               return (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) && xec == 0x0;
 
        return m->bank == 4 && xec == 0x8;
 }
@@ -1050,7 +1052,7 @@ static const char *get_name(unsigned int cpu, unsigned int bank, struct threshol
        if (bank_type >= N_SMCA_BANK_TYPES)
                return NULL;
 
-       if (b && bank_type == SMCA_UMC) {
+       if (b && (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2)) {
                if (b->block < ARRAY_SIZE(smca_umc_block_names))
                        return smca_umc_block_names[b->block];
                return NULL;
index 2eec60f..89e2aab 100644 (file)
@@ -1022,12 +1022,12 @@ static noinstr int mce_start(int *no_way_out)
        if (!timeout)
                return ret;
 
-       arch_atomic_add(*no_way_out, &global_nwo);
+       raw_atomic_add(*no_way_out, &global_nwo);
        /*
         * Rely on the implied barrier below, such that global_nwo
         * is updated before mce_callin.
         */
-       order = arch_atomic_inc_return(&mce_callin);
+       order = raw_atomic_inc_return(&mce_callin);
        arch_cpumask_clear_cpu(smp_processor_id(), &mce_missing_cpus);
 
        /* Enable instrumentation around calls to external facilities */
@@ -1036,10 +1036,10 @@ static noinstr int mce_start(int *no_way_out)
        /*
         * Wait for everyone.
         */
-       while (arch_atomic_read(&mce_callin) != num_online_cpus()) {
+       while (raw_atomic_read(&mce_callin) != num_online_cpus()) {
                if (mce_timed_out(&timeout,
                                  "Timeout: Not all CPUs entered broadcast exception handler")) {
-                       arch_atomic_set(&global_nwo, 0);
+                       raw_atomic_set(&global_nwo, 0);
                        goto out;
                }
                ndelay(SPINUNIT);
@@ -1054,7 +1054,7 @@ static noinstr int mce_start(int *no_way_out)
                /*
                 * Monarch: Starts executing now, the others wait.
                 */
-               arch_atomic_set(&mce_executing, 1);
+               raw_atomic_set(&mce_executing, 1);
        } else {
                /*
                 * Subject: Now start the scanning loop one by one in
@@ -1062,10 +1062,10 @@ static noinstr int mce_start(int *no_way_out)
                 * This way when there are any shared banks it will be
                 * only seen by one CPU before cleared, avoiding duplicates.
                 */
-               while (arch_atomic_read(&mce_executing) < order) {
+               while (raw_atomic_read(&mce_executing) < order) {
                        if (mce_timed_out(&timeout,
                                          "Timeout: Subject CPUs unable to finish machine check processing")) {
-                               arch_atomic_set(&global_nwo, 0);
+                               raw_atomic_set(&global_nwo, 0);
                                goto out;
                        }
                        ndelay(SPINUNIT);
@@ -1075,7 +1075,7 @@ static noinstr int mce_start(int *no_way_out)
        /*
         * Cache the global no_way_out state.
         */
-       *no_way_out = arch_atomic_read(&global_nwo);
+       *no_way_out = raw_atomic_read(&global_nwo);
 
        ret = order;
 
@@ -1533,7 +1533,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
                /* If this triggers there is no way to recover. Die hard. */
                BUG_ON(!on_thread_stack() || !user_mode(regs));
 
-               if (kill_current_task)
+               if (!mce_usable_address(&m))
                        queue_task_work(&m, msg, kill_me_now);
                else
                        queue_task_work(&m, msg, kill_me_maybe);
index f5fdeb1..87208e4 100644 (file)
@@ -78,8 +78,6 @@ static u16 find_equiv_id(struct equiv_cpu_table *et, u32 sig)
 
                if (sig == e->installed_cpu)
                        return e->equiv_cpu;
-
-               e++;
        }
        return 0;
 }
@@ -596,11 +594,6 @@ void reload_ucode_amd(unsigned int cpu)
                }
        }
 }
-static u16 __find_equiv_id(unsigned int cpu)
-{
-       struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
-       return find_equiv_id(&equiv_table, uci->cpu_sig.sig);
-}
 
 /*
  * a small, trivial cache of per-family ucode patches
@@ -651,9 +644,11 @@ static void free_cache(void)
 
 static struct ucode_patch *find_patch(unsigned int cpu)
 {
+       struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
        u16 equiv_id;
 
-       equiv_id = __find_equiv_id(cpu);
+
+       equiv_id = find_equiv_id(&equiv_table, uci->cpu_sig.sig);
        if (!equiv_id)
                return NULL;
 
@@ -705,7 +700,7 @@ static enum ucode_state apply_microcode_amd(int cpu)
        rdmsr(MSR_AMD64_PATCH_LEVEL, rev, dummy);
 
        /* need to apply patch? */
-       if (rev >= mc_amd->hdr.patch_id) {
+       if (rev > mc_amd->hdr.patch_id) {
                ret = UCODE_OK;
                goto out;
        }
index cc4f9f1..aee4bc5 100644 (file)
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-y          := mtrr.o if.o generic.o cleanup.o
-obj-$(CONFIG_X86_32) += amd.o cyrix.o centaur.o
+obj-$(CONFIG_X86_32) += amd.o cyrix.o centaur.o legacy.o
 
index eff6ac6..ef3e8e4 100644 (file)
@@ -110,7 +110,7 @@ amd_validate_add_page(unsigned long base, unsigned long size, unsigned int type)
 }
 
 const struct mtrr_ops amd_mtrr_ops = {
-       .vendor            = X86_VENDOR_AMD,
+       .var_regs          = 2,
        .set               = amd_set_mtrr,
        .get               = amd_get_mtrr,
        .get_free_region   = generic_get_free_region,
index b8a74ed..6f6c3ae 100644 (file)
@@ -45,15 +45,6 @@ centaur_get_free_region(unsigned long base, unsigned long size, int replace_reg)
        return -ENOSPC;
 }
 
-/*
- * Report boot time MCR setups
- */
-void mtrr_centaur_report_mcr(int mcr, u32 lo, u32 hi)
-{
-       centaur_mcr[mcr].low = lo;
-       centaur_mcr[mcr].high = hi;
-}
-
 static void
 centaur_get_mcr(unsigned int reg, unsigned long *base,
                unsigned long *size, mtrr_type * type)
@@ -112,7 +103,7 @@ centaur_validate_add_page(unsigned long base, unsigned long size, unsigned int t
 }
 
 const struct mtrr_ops centaur_mtrr_ops = {
-       .vendor            = X86_VENDOR_CENTAUR,
+       .var_regs          = 8,
        .set               = centaur_set_mcr,
        .get               = centaur_get_mcr,
        .get_free_region   = centaur_get_free_region,
index b5f4304..18cf79d 100644 (file)
@@ -55,9 +55,6 @@ static int __initdata                         nr_range;
 
 static struct var_mtrr_range_state __initdata  range_state[RANGE_NUM];
 
-static int __initdata debug_print;
-#define Dprintk(x...) do { if (debug_print) pr_debug(x); } while (0)
-
 #define BIOS_BUG_MSG \
        "WARNING: BIOS bug: VAR MTRR %d contains strange UC entry under 1M, check with your system vendor!\n"
 
@@ -79,12 +76,11 @@ x86_get_mtrr_mem_range(struct range *range, int nr_range,
                nr_range = add_range_with_merge(range, RANGE_NUM, nr_range,
                                                base, base + size);
        }
-       if (debug_print) {
-               pr_debug("After WB checking\n");
-               for (i = 0; i < nr_range; i++)
-                       pr_debug("MTRR MAP PFN: %016llx - %016llx\n",
-                                range[i].start, range[i].end);
-       }
+
+       Dprintk("After WB checking\n");
+       for (i = 0; i < nr_range; i++)
+               Dprintk("MTRR MAP PFN: %016llx - %016llx\n",
+                        range[i].start, range[i].end);
 
        /* Take out UC ranges: */
        for (i = 0; i < num_var_ranges; i++) {
@@ -112,24 +108,22 @@ x86_get_mtrr_mem_range(struct range *range, int nr_range,
                subtract_range(range, RANGE_NUM, extra_remove_base,
                                 extra_remove_base + extra_remove_size);
 
-       if  (debug_print) {
-               pr_debug("After UC checking\n");
-               for (i = 0; i < RANGE_NUM; i++) {
-                       if (!range[i].end)
-                               continue;
-                       pr_debug("MTRR MAP PFN: %016llx - %016llx\n",
-                                range[i].start, range[i].end);
-               }
+       Dprintk("After UC checking\n");
+       for (i = 0; i < RANGE_NUM; i++) {
+               if (!range[i].end)
+                       continue;
+
+               Dprintk("MTRR MAP PFN: %016llx - %016llx\n",
+                        range[i].start, range[i].end);
        }
 
        /* sort the ranges */
        nr_range = clean_sort_range(range, RANGE_NUM);
-       if  (debug_print) {
-               pr_debug("After sorting\n");
-               for (i = 0; i < nr_range; i++)
-                       pr_debug("MTRR MAP PFN: %016llx - %016llx\n",
-                                range[i].start, range[i].end);
-       }
+
+       Dprintk("After sorting\n");
+       for (i = 0; i < nr_range; i++)
+               Dprintk("MTRR MAP PFN: %016llx - %016llx\n",
+                       range[i].start, range[i].end);
 
        return nr_range;
 }
@@ -164,16 +158,9 @@ static int __init enable_mtrr_cleanup_setup(char *str)
 }
 early_param("enable_mtrr_cleanup", enable_mtrr_cleanup_setup);
 
-static int __init mtrr_cleanup_debug_setup(char *str)
-{
-       debug_print = 1;
-       return 0;
-}
-early_param("mtrr_cleanup_debug", mtrr_cleanup_debug_setup);
-
 static void __init
 set_var_mtrr(unsigned int reg, unsigned long basek, unsigned long sizek,
-            unsigned char type, unsigned int address_bits)
+            unsigned char type)
 {
        u32 base_lo, base_hi, mask_lo, mask_hi;
        u64 base, mask;
@@ -183,7 +170,7 @@ set_var_mtrr(unsigned int reg, unsigned long basek, unsigned long sizek,
                return;
        }
 
-       mask = (1ULL << address_bits) - 1;
+       mask = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
        mask &= ~((((u64)sizek) << 10) - 1);
 
        base = ((u64)basek) << 10;
@@ -209,7 +196,7 @@ save_var_mtrr(unsigned int reg, unsigned long basek, unsigned long sizek,
        range_state[reg].type = type;
 }
 
-static void __init set_var_mtrr_all(unsigned int address_bits)
+static void __init set_var_mtrr_all(void)
 {
        unsigned long basek, sizek;
        unsigned char type;
@@ -220,7 +207,7 @@ static void __init set_var_mtrr_all(unsigned int address_bits)
                sizek = range_state[reg].size_pfn << (PAGE_SHIFT - 10);
                type = range_state[reg].type;
 
-               set_var_mtrr(reg, basek, sizek, type, address_bits);
+               set_var_mtrr(reg, basek, sizek, type);
        }
 }
 
@@ -267,7 +254,7 @@ range_to_mtrr(unsigned int reg, unsigned long range_startk,
                        align = max_align;
 
                sizek = 1UL << align;
-               if (debug_print) {
+               if (mtrr_debug) {
                        char start_factor = 'K', size_factor = 'K';
                        unsigned long start_base, size_base;
 
@@ -542,7 +529,7 @@ static void __init print_out_mtrr_range_state(void)
                start_base = to_size_factor(start_base, &start_factor);
                type = range_state[i].type;
 
-               pr_debug("reg %d, base: %ld%cB, range: %ld%cB, type %s\n",
+               Dprintk("reg %d, base: %ld%cB, range: %ld%cB, type %s\n",
                        i, start_base, start_factor,
                        size_base, size_factor,
                        (type == MTRR_TYPE_UNCACHABLE) ? "UC" :
@@ -680,7 +667,7 @@ static int __init mtrr_search_optimal_index(void)
        return index_good;
 }
 
-int __init mtrr_cleanup(unsigned address_bits)
+int __init mtrr_cleanup(void)
 {
        unsigned long x_remove_base, x_remove_size;
        unsigned long base, size, def, dummy;
@@ -689,7 +676,10 @@ int __init mtrr_cleanup(unsigned address_bits)
        int index_good;
        int i;
 
-       if (!is_cpu(INTEL) || enable_mtrr_cleanup < 1)
+       if (!mtrr_enabled())
+               return 0;
+
+       if (!cpu_feature_enabled(X86_FEATURE_MTRR) || enable_mtrr_cleanup < 1)
                return 0;
 
        rdmsr(MSR_MTRRdefType, def, dummy);
@@ -711,7 +701,7 @@ int __init mtrr_cleanup(unsigned address_bits)
                return 0;
 
        /* Print original var MTRRs at first, for debugging: */
-       pr_debug("original variable MTRRs\n");
+       Dprintk("original variable MTRRs\n");
        print_out_mtrr_range_state();
 
        memset(range, 0, sizeof(range));
@@ -742,8 +732,8 @@ int __init mtrr_cleanup(unsigned address_bits)
                mtrr_print_out_one_result(i);
 
                if (!result[i].bad) {
-                       set_var_mtrr_all(address_bits);
-                       pr_debug("New variable MTRRs\n");
+                       set_var_mtrr_all();
+                       Dprintk("New variable MTRRs\n");
                        print_out_mtrr_range_state();
                        return 1;
                }
@@ -763,7 +753,7 @@ int __init mtrr_cleanup(unsigned address_bits)
 
                        mtrr_calc_range_state(chunk_size, gran_size,
                                      x_remove_base, x_remove_size, i);
-                       if (debug_print) {
+                       if (mtrr_debug) {
                                mtrr_print_out_one_result(i);
                                pr_info("\n");
                        }
@@ -786,8 +776,8 @@ int __init mtrr_cleanup(unsigned address_bits)
                gran_size = result[i].gran_sizek;
                gran_size <<= 10;
                x86_setup_var_mtrrs(range, nr_range, chunk_size, gran_size);
-               set_var_mtrr_all(address_bits);
-               pr_debug("New variable MTRRs\n");
+               set_var_mtrr_all();
+               Dprintk("New variable MTRRs\n");
                print_out_mtrr_range_state();
                return 1;
        } else {
@@ -802,7 +792,7 @@ int __init mtrr_cleanup(unsigned address_bits)
        return 0;
 }
 #else
-int __init mtrr_cleanup(unsigned address_bits)
+int __init mtrr_cleanup(void)
 {
        return 0;
 }
@@ -882,15 +872,18 @@ int __init mtrr_trim_uncached_memory(unsigned long end_pfn)
        /* extra one for all 0 */
        int num[MTRR_NUM_TYPES + 1];
 
+       if (!mtrr_enabled())
+               return 0;
+
        /*
         * Make sure we only trim uncachable memory on machines that
         * support the Intel MTRR architecture:
         */
-       if (!is_cpu(INTEL) || disable_mtrr_trim)
+       if (!cpu_feature_enabled(X86_FEATURE_MTRR) || disable_mtrr_trim)
                return 0;
 
        rdmsr(MSR_MTRRdefType, def, dummy);
-       def &= 0xff;
+       def &= MTRR_DEF_TYPE_TYPE;
        if (def != MTRR_TYPE_UNCACHABLE)
                return 0;
 
index 173b9e0..238dad5 100644 (file)
@@ -235,7 +235,7 @@ static void cyrix_set_arr(unsigned int reg, unsigned long base,
 }
 
 const struct mtrr_ops cyrix_mtrr_ops = {
-       .vendor            = X86_VENDOR_CYRIX,
+       .var_regs          = 8,
        .set               = cyrix_set_arr,
        .get               = cyrix_get_arr,
        .get_free_region   = cyrix_get_free_region,
index ee09d35..2d6aa5d 100644 (file)
@@ -8,10 +8,12 @@
 #include <linux/init.h>
 #include <linux/io.h>
 #include <linux/mm.h>
-
+#include <linux/cc_platform.h>
 #include <asm/processor-flags.h>
 #include <asm/cacheinfo.h>
 #include <asm/cpufeature.h>
+#include <asm/hypervisor.h>
+#include <asm/mshyperv.h>
 #include <asm/tlbflush.h>
 #include <asm/mtrr.h>
 #include <asm/msr.h>
@@ -31,6 +33,55 @@ static struct fixed_range_block fixed_range_blocks[] = {
        {}
 };
 
+struct cache_map {
+       u64 start;
+       u64 end;
+       u64 flags;
+       u64 type:8;
+       u64 fixed:1;
+};
+
+bool mtrr_debug;
+
+static int __init mtrr_param_setup(char *str)
+{
+       int rc = 0;
+
+       if (!str)
+               return -EINVAL;
+       if (!strcmp(str, "debug"))
+               mtrr_debug = true;
+       else
+               rc = -EINVAL;
+
+       return rc;
+}
+early_param("mtrr", mtrr_param_setup);
+
+/*
+ * CACHE_MAP_MAX is the maximum number of memory ranges in cache_map, where
+ * no 2 adjacent ranges have the same cache mode (those would be merged).
+ * The number is based on the worst case:
+ * - no two adjacent fixed MTRRs share the same cache mode
+ * - one variable MTRR is spanning a huge area with mode WB
+ * - 255 variable MTRRs with mode UC all overlap with the WB MTRR, creating 2
+ *   additional ranges each (result like "ababababa...aba" with a = WB, b = UC),
+ *   accounting for MTRR_MAX_VAR_RANGES * 2 - 1 range entries
+ * - a TOP_MEM2 area (even when overlapping a UC MTRR it can't add 2 range
+ *   entries to the possible maximum, as it always starts at 4GB and thus
+ *   can't be in the middle of that MTRR, unless that MTRR starts at 0, which
+ *   would remove the initial "a" from the "abababa" pattern above)
+ * The map won't contain ranges with no matching MTRR (those fall back to the
+ * default cache mode).
+ */
+#define CACHE_MAP_MAX  (MTRR_NUM_FIXED_RANGES + MTRR_MAX_VAR_RANGES * 2)
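For concreteness, an editorial note (not part of the patch): plugging in the usual limits from the x86 MTRR headers — which are assumptions here, not something this hunk defines — shows how the bound is reached:

/*
 * Assuming MTRR_NUM_FIXED_RANGES == 88 and MTRR_MAX_VAR_RANGES == 256:
 *   fixed-range entries:        at most  88
 *   "abab...aba" worst case:    2 * 256 - 1 = 511
 *   TOP_MEM2:                   at most    1 extra entry
 * for a total of 88 + 512 = 600 == CACHE_MAP_MAX entries.
 */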
+
+static struct cache_map init_cache_map[CACHE_MAP_MAX] __initdata;
+static struct cache_map *cache_map __refdata = init_cache_map;
+static unsigned int cache_map_size = CACHE_MAP_MAX;
+static unsigned int cache_map_n;
+static unsigned int cache_map_fixed;
+
 static unsigned long smp_changes_mask;
 static int mtrr_state_set;
 u64 mtrr_tom2;
@@ -38,6 +89,9 @@ u64 mtrr_tom2;
 struct mtrr_state_type mtrr_state;
 EXPORT_SYMBOL_GPL(mtrr_state);
 
+/* Reserved bits in the high portion of the MTRRphysBaseN MSR. */
+u32 phys_hi_rsvd;
+
 /*
  * BIOS is expected to clear MtrrFixDramModEn bit, see for example
  * "BIOS and Kernel Developer's Guide for the AMD Athlon 64 and AMD
@@ -69,175 +123,370 @@ static u64 get_mtrr_size(u64 mask)
 {
        u64 size;
 
-       mask >>= PAGE_SHIFT;
-       mask |= size_or_mask;
+       mask |= (u64)phys_hi_rsvd << 32;
        size = -mask;
-       size <<= PAGE_SHIFT;
+
        return size;
 }
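A worked example may help here (editorial, not part of the patch); the physical address width and the mask value are made up for illustration:

/*
 * With x86_phys_bits == 36, phys_hi_rsvd == GENMASK(31, 4) == 0xfffffff0.
 * For a variable MTRR whose PHYSMASK has bits 35:28 set (a 256 MiB range,
 * low 12 bits already cleared by the caller):
 *   mask                        = 0x0000000ff0000000
 *   mask |= phys_hi_rsvd << 32  = 0xfffffffff0000000
 *   size  = -mask               = 0x0000000010000000   (256 MiB)
 */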
 
+static u8 get_var_mtrr_state(unsigned int reg, u64 *start, u64 *size)
+{
+       struct mtrr_var_range *mtrr = mtrr_state.var_ranges + reg;
+
+       if (!(mtrr->mask_lo & MTRR_PHYSMASK_V))
+               return MTRR_TYPE_INVALID;
+
+       *start = (((u64)mtrr->base_hi) << 32) + (mtrr->base_lo & PAGE_MASK);
+       *size = get_mtrr_size((((u64)mtrr->mask_hi) << 32) +
+                             (mtrr->mask_lo & PAGE_MASK));
+
+       return mtrr->base_lo & MTRR_PHYSBASE_TYPE;
+}
+
+static u8 get_effective_type(u8 type1, u8 type2)
+{
+       if (type1 == MTRR_TYPE_UNCACHABLE || type2 == MTRR_TYPE_UNCACHABLE)
+               return MTRR_TYPE_UNCACHABLE;
+
+       if ((type1 == MTRR_TYPE_WRBACK && type2 == MTRR_TYPE_WRTHROUGH) ||
+           (type1 == MTRR_TYPE_WRTHROUGH && type2 == MTRR_TYPE_WRBACK))
+               return MTRR_TYPE_WRTHROUGH;
+
+       if (type1 != type2)
+               return MTRR_TYPE_UNCACHABLE;
+
+       return type1;
+}
+
+static void rm_map_entry_at(int idx)
+{
+       cache_map_n--;
+       if (cache_map_n > idx) {
+               memmove(cache_map + idx, cache_map + idx + 1,
+                       sizeof(*cache_map) * (cache_map_n - idx));
+       }
+}
+
 /*
- * Check and return the effective type for MTRR-MTRR type overlap.
- * Returns 1 if the effective type is UNCACHEABLE, else returns 0
+ * Add an entry into cache_map at a specific index.  Merges adjacent entries if
+ * appropriate.  Return the number of merges for correcting the scan index
+ * (this is needed as merging will reduce the number of entries, which will
+ * result in skipping entries in future iterations if the scan index isn't
+ * corrected).
+ * Note that the corrected index can never go below -1 (resulting in being 0 in
+ * the next scan iteration), as "2" is returned only if the current index is
+ * larger than zero.
  */
-static int check_type_overlap(u8 *prev, u8 *curr)
+static int add_map_entry_at(u64 start, u64 end, u8 type, int idx)
 {
-       if (*prev == MTRR_TYPE_UNCACHABLE || *curr == MTRR_TYPE_UNCACHABLE) {
-               *prev = MTRR_TYPE_UNCACHABLE;
-               *curr = MTRR_TYPE_UNCACHABLE;
-               return 1;
+       bool merge_prev = false, merge_next = false;
+
+       if (start >= end)
+               return 0;
+
+       if (idx > 0) {
+               struct cache_map *prev = cache_map + idx - 1;
+
+               if (!prev->fixed && start == prev->end && type == prev->type)
+                       merge_prev = true;
        }
 
-       if ((*prev == MTRR_TYPE_WRBACK && *curr == MTRR_TYPE_WRTHROUGH) ||
-           (*prev == MTRR_TYPE_WRTHROUGH && *curr == MTRR_TYPE_WRBACK)) {
-               *prev = MTRR_TYPE_WRTHROUGH;
-               *curr = MTRR_TYPE_WRTHROUGH;
+       if (idx < cache_map_n) {
+               struct cache_map *next = cache_map + idx;
+
+               if (!next->fixed && end == next->start && type == next->type)
+                       merge_next = true;
        }
 
-       if (*prev != *curr) {
-               *prev = MTRR_TYPE_UNCACHABLE;
-               *curr = MTRR_TYPE_UNCACHABLE;
+       if (merge_prev && merge_next) {
+               cache_map[idx - 1].end = cache_map[idx].end;
+               rm_map_entry_at(idx);
+               return 2;
+       }
+       if (merge_prev) {
+               cache_map[idx - 1].end = end;
                return 1;
        }
+       if (merge_next) {
+               cache_map[idx].start = start;
+               return 1;
+       }
+
+       /* Sanity check: the array should NEVER be too small! */
+       if (cache_map_n == cache_map_size) {
+               WARN(1, "MTRR cache mode memory map exhausted!\n");
+               cache_map_n = cache_map_fixed;
+               return 0;
+       }
+
+       if (cache_map_n > idx) {
+               memmove(cache_map + idx + 1, cache_map + idx,
+                       sizeof(*cache_map) * (cache_map_n - idx));
+       }
+
+       cache_map[idx].start = start;
+       cache_map[idx].end = end;
+       cache_map[idx].type = type;
+       cache_map[idx].fixed = 0;
+       cache_map_n++;
 
        return 0;
 }
 
-/**
- * mtrr_type_lookup_fixed - look up memory type in MTRR fixed entries
- *
- * Return the MTRR fixed memory type of 'start'.
- *
- * MTRR fixed entries are divided into the following ways:
- *  0x00000 - 0x7FFFF : This range is divided into eight 64KB sub-ranges
- *  0x80000 - 0xBFFFF : This range is divided into sixteen 16KB sub-ranges
- *  0xC0000 - 0xFFFFF : This range is divided into sixty-four 4KB sub-ranges
- *
- * Return Values:
- * MTRR_TYPE_(type)  - Matched memory type
- * MTRR_TYPE_INVALID - Unmatched
+/* Clear a part of an entry. Return 1 if start of entry is still valid. */
+static int clr_map_range_at(u64 start, u64 end, int idx)
+{
+       int ret = start != cache_map[idx].start;
+       u64 tmp;
+
+       if (start == cache_map[idx].start && end == cache_map[idx].end) {
+               rm_map_entry_at(idx);
+       } else if (start == cache_map[idx].start) {
+               cache_map[idx].start = end;
+       } else if (end == cache_map[idx].end) {
+               cache_map[idx].end = start;
+       } else {
+               tmp = cache_map[idx].end;
+               cache_map[idx].end = start;
+               add_map_entry_at(end, tmp, cache_map[idx].type, idx + 1);
+       }
+
+       return ret;
+}
+
+/*
+ * Add MTRR to the map.  The current map is scanned and each part of the MTRR
+ * either overlapping with an existing entry or with a hole in the map is
+ * handled separately.
  */
-static u8 mtrr_type_lookup_fixed(u64 start, u64 end)
+static void add_map_entry(u64 start, u64 end, u8 type)
 {
-       int idx;
+       u8 new_type, old_type;
+       u64 tmp;
+       int i;
 
-       if (start >= 0x100000)
-               return MTRR_TYPE_INVALID;
+       for (i = 0; i < cache_map_n && start < end; i++) {
+               if (start >= cache_map[i].end)
+                       continue;
+
+               if (start < cache_map[i].start) {
+                       /* Region start has no overlap. */
+                       tmp = min(end, cache_map[i].start);
+                       i -= add_map_entry_at(start, tmp,  type, i);
+                       start = tmp;
+                       continue;
+               }
 
-       /* 0x0 - 0x7FFFF */
-       if (start < 0x80000) {
-               idx = 0;
-               idx += (start >> 16);
-               return mtrr_state.fixed_ranges[idx];
-       /* 0x80000 - 0xBFFFF */
-       } else if (start < 0xC0000) {
-               idx = 1 * 8;
-               idx += ((start - 0x80000) >> 14);
-               return mtrr_state.fixed_ranges[idx];
+               new_type = get_effective_type(type, cache_map[i].type);
+               old_type = cache_map[i].type;
+
+               if (cache_map[i].fixed || new_type == old_type) {
+                       /* Cut off start of new entry. */
+                       start = cache_map[i].end;
+                       continue;
+               }
+
+               /* Handle only overlapping part of region. */
+               tmp = min(end, cache_map[i].end);
+               i += clr_map_range_at(start, tmp, i);
+               i -= add_map_entry_at(start, tmp, new_type, i);
+               start = tmp;
        }
 
-       /* 0xC0000 - 0xFFFFF */
-       idx = 3 * 8;
-       idx += ((start - 0xC0000) >> 12);
-       return mtrr_state.fixed_ranges[idx];
+       /* Add rest of region after last map entry (rest might be empty). */
+       add_map_entry_at(start, end, type, i);
 }
 
-/**
- * mtrr_type_lookup_variable - look up memory type in MTRR variable entries
- *
- * Return Value:
- * MTRR_TYPE_(type) - Matched memory type or default memory type (unmatched)
- *
- * Output Arguments:
- * repeat - Set to 1 when [start:end] spanned across MTRR range and type
- *         returned corresponds only to [start:*partial_end].  Caller has
- *         to lookup again for [*partial_end:end].
- *
- * uniform - Set to 1 when an MTRR covers the region uniformly, i.e. the
- *          region is fully covered by a single MTRR entry or the default
- *          type.
+/* Add variable MTRRs to cache map. */
+static void map_add_var(void)
+{
+       u64 start, size;
+       unsigned int i;
+       u8 type;
+
+       /*
+        * Add AMD TOP_MEM2 area.  Can't be added in mtrr_build_map(), as it
+        * needs to be added again when rebuilding the map due to potentially
+        * having moved as a result of variable MTRRs for memory below 4GB.
+        */
+       if (mtrr_tom2) {
+               add_map_entry(BIT_ULL(32), mtrr_tom2, MTRR_TYPE_WRBACK);
+               cache_map[cache_map_n - 1].fixed = 1;
+       }
+
+       for (i = 0; i < num_var_ranges; i++) {
+               type = get_var_mtrr_state(i, &start, &size);
+               if (type != MTRR_TYPE_INVALID)
+                       add_map_entry(start, start + size, type);
+       }
+}
+
+/*
+ * Rebuild map by replacing variable entries.  Needs to be called when MTRR
+ * registers are being changed after boot, as such changes could include
+ * removals of registers, which are complicated to handle without rebuild of
+ * the map.
  */
-static u8 mtrr_type_lookup_variable(u64 start, u64 end, u64 *partial_end,
-                                   int *repeat, u8 *uniform)
+void generic_rebuild_map(void)
 {
-       int i;
-       u64 base, mask;
-       u8 prev_match, curr_match;
+       if (mtrr_if != &generic_mtrr_ops)
+               return;
 
-       *repeat = 0;
-       *uniform = 1;
+       cache_map_n = cache_map_fixed;
 
-       prev_match = MTRR_TYPE_INVALID;
-       for (i = 0; i < num_var_ranges; ++i) {
-               unsigned short start_state, end_state, inclusive;
+       map_add_var();
+}
 
-               if (!(mtrr_state.var_ranges[i].mask_lo & (1 << 11)))
-                       continue;
+static unsigned int __init get_cache_map_size(void)
+{
+       return cache_map_fixed + 2 * num_var_ranges + (mtrr_tom2 != 0);
+}
 
-               base = (((u64)mtrr_state.var_ranges[i].base_hi) << 32) +
-                      (mtrr_state.var_ranges[i].base_lo & PAGE_MASK);
-               mask = (((u64)mtrr_state.var_ranges[i].mask_hi) << 32) +
-                      (mtrr_state.var_ranges[i].mask_lo & PAGE_MASK);
-
-               start_state = ((start & mask) == (base & mask));
-               end_state = ((end & mask) == (base & mask));
-               inclusive = ((start < base) && (end > base));
-
-               if ((start_state != end_state) || inclusive) {
-                       /*
-                        * We have start:end spanning across an MTRR.
-                        * We split the region into either
-                        *
-                        * - start_state:1
-                        * (start:mtrr_end)(mtrr_end:end)
-                        * - end_state:1
-                        * (start:mtrr_start)(mtrr_start:end)
-                        * - inclusive:1
-                        * (start:mtrr_start)(mtrr_start:mtrr_end)(mtrr_end:end)
-                        *
-                        * depending on kind of overlap.
-                        *
-                        * Return the type of the first region and a pointer
-                        * to the start of next region so that caller will be
-                        * advised to lookup again after having adjusted start
-                        * and end.
-                        *
-                        * Note: This way we handle overlaps with multiple
-                        * entries and the default type properly.
-                        */
-                       if (start_state)
-                               *partial_end = base + get_mtrr_size(mask);
-                       else
-                               *partial_end = base;
-
-                       if (unlikely(*partial_end <= start)) {
-                               WARN_ON(1);
-                               *partial_end = start + PAGE_SIZE;
-                       }
+/* Build the cache_map containing the cache modes per memory range. */
+void __init mtrr_build_map(void)
+{
+       u64 start, end, size;
+       unsigned int i;
+       u8 type;
 
-                       end = *partial_end - 1; /* end is inclusive */
-                       *repeat = 1;
-                       *uniform = 0;
+       /* Add fixed MTRRs, optimize for adjacent entries with same type. */
+       if (mtrr_state.enabled & MTRR_STATE_MTRR_FIXED_ENABLED) {
+               /*
+                * Start with 64k size fixed entries, preset 1st one (hence the
+                * loop below is starting with index 1).
+                */
+               start = 0;
+               end = size = 0x10000;
+               type = mtrr_state.fixed_ranges[0];
+
+               for (i = 1; i < MTRR_NUM_FIXED_RANGES; i++) {
+                       /* 8 64k entries, then 16 16k ones, rest 4k. */
+                       if (i == 8 || i == 24)
+                               size >>= 2;
+
+                       if (mtrr_state.fixed_ranges[i] != type) {
+                               add_map_entry(start, end, type);
+                               start = end;
+                               type = mtrr_state.fixed_ranges[i];
+                       }
+                       end += size;
                }
+               add_map_entry(start, end, type);
+       }
 
-               if ((start & mask) != (base & mask))
-                       continue;
+       /* Mark fixed, they take precedence. */
+       for (i = 0; i < cache_map_n; i++)
+               cache_map[i].fixed = 1;
+       cache_map_fixed = cache_map_n;
 
-               curr_match = mtrr_state.var_ranges[i].base_lo & 0xff;
-               if (prev_match == MTRR_TYPE_INVALID) {
-                       prev_match = curr_match;
-                       continue;
+       map_add_var();
+
+       pr_info("MTRR map: %u entries (%u fixed + %u variable; max %u), built from %u variable MTRRs\n",
+               cache_map_n, cache_map_fixed, cache_map_n - cache_map_fixed,
+               get_cache_map_size(), num_var_ranges + (mtrr_tom2 != 0));
+
+       if (mtrr_debug) {
+               for (i = 0; i < cache_map_n; i++) {
+                       pr_info("%3u: %016llx-%016llx %s\n", i,
+                               cache_map[i].start, cache_map[i].end - 1,
+                               mtrr_attrib_to_str(cache_map[i].type));
                }
+       }
+}
 
-               *uniform = 0;
-               if (check_type_overlap(&prev_match, &curr_match))
-                       return curr_match;
+/* Copy the cache_map from __initdata memory to a dynamically allocated one. */
+void __init mtrr_copy_map(void)
+{
+       unsigned int new_size = get_cache_map_size();
+
+       if (!mtrr_state.enabled || !new_size) {
+               cache_map = NULL;
+               return;
+       }
+
+       mutex_lock(&mtrr_mutex);
+
+       cache_map = kcalloc(new_size, sizeof(*cache_map), GFP_KERNEL);
+       if (cache_map) {
+               memmove(cache_map, init_cache_map,
+                       cache_map_n * sizeof(*cache_map));
+               cache_map_size = new_size;
+       } else {
+               mtrr_state.enabled = 0;
+               pr_err("MTRRs disabled due to allocation failure for lookup map.\n");
+       }
+
+       mutex_unlock(&mtrr_mutex);
+}
+
+/**
+ * mtrr_overwrite_state - set static MTRR state
+ *
+ * Used to set MTRR state via different means (e.g. with data obtained from
+ * a hypervisor).
+ * Is allowed only for special cases when running virtualized. Must be called
+ * from the x86_init.hyper.init_platform() hook.  It can be called only once.
+ * The MTRR state can't be changed afterwards.  To ensure that, X86_FEATURE_MTRR
+ * is cleared.
+ */
+void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
+                         mtrr_type def_type)
+{
+       unsigned int i;
+
+       /* Only allowed to be called once before mtrr_bp_init(). */
+       if (WARN_ON_ONCE(mtrr_state_set))
+               return;
+
+       /* Only allowed when running virtualized. */
+       if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
+               return;
+
+       /*
+        * Only allowed for special virtualization cases:
+        * - when running as Hyper-V, SEV-SNP guest using vTOM
+        * - when running as Xen PV guest
+        * - when running as SEV-SNP or TDX guest to avoid unnecessary
+        *   VMM communication/Virtualization exceptions (#VC, #VE)
+        */
+       if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) &&
+           !hv_is_isolation_supported() &&
+           !cpu_feature_enabled(X86_FEATURE_XENPV) &&
+           !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+               return;
+
+       /* Disable MTRR in order to disable MTRR modifications. */
+       setup_clear_cpu_cap(X86_FEATURE_MTRR);
+
+       if (var) {
+               if (num_var > MTRR_MAX_VAR_RANGES) {
+                       pr_warn("Trying to overwrite MTRR state with %u variable entries\n",
+                               num_var);
+                       num_var = MTRR_MAX_VAR_RANGES;
+               }
+               for (i = 0; i < num_var; i++)
+                       mtrr_state.var_ranges[i] = var[i];
+               num_var_ranges = num_var;
        }
 
-       if (prev_match != MTRR_TYPE_INVALID)
-               return prev_match;
+       mtrr_state.def_type = def_type;
+       mtrr_state.enabled |= MTRR_STATE_MTRR_ENABLED;
 
-       return mtrr_state.def_type;
+       mtrr_state_set = 1;
+}
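An editorial sketch, not part of the patch: a guest platform hook could publish a simple all-write-back view by passing no variable ranges. The hook name is hypothetical and the argument values are illustrative only:

static void __init example_guest_init_platform(void)
{
        /* No variable ranges, default type WB for the whole address space. */
        mtrr_overwrite_state(NULL, 0, MTRR_TYPE_WRBACK);
}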
+
+static u8 type_merge(u8 type, u8 new_type, u8 *uniform)
+{
+       u8 effective_type;
+
+       if (type == MTRR_TYPE_INVALID)
+               return new_type;
+
+       effective_type = get_effective_type(type, new_type);
+       if (type != effective_type)
+               *uniform = 0;
+
+       return effective_type;
 }
 
 /**
@@ -248,66 +497,49 @@ static u8 mtrr_type_lookup_variable(u64 start, u64 end, u64 *partial_end,
  * MTRR_TYPE_INVALID - MTRR is disabled
  *
  * Output Argument:
- * uniform - Set to 1 when an MTRR covers the region uniformly, i.e. the
- *          region is fully covered by a single MTRR entry or the default
- *          type.
+ * uniform - Set to 1 when the returned MTRR type is valid for the whole
+ *          region, set to 0 otherwise.
  */
 u8 mtrr_type_lookup(u64 start, u64 end, u8 *uniform)
 {
-       u8 type, prev_type, is_uniform = 1, dummy;
-       int repeat;
-       u64 partial_end;
+       u8 type = MTRR_TYPE_INVALID;
+       unsigned int i;
 
-       /* Make end inclusive instead of exclusive */
-       end--;
+       if (!mtrr_state_set) {
+               /* Uniformity is unknown. */
+               *uniform = 0;
+               return MTRR_TYPE_UNCACHABLE;
+       }
 
-       if (!mtrr_state_set)
-               return MTRR_TYPE_INVALID;
+       *uniform = 1;
 
        if (!(mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED))
-               return MTRR_TYPE_INVALID;
+               return MTRR_TYPE_UNCACHABLE;
 
-       /*
-        * Look up the fixed ranges first, which take priority over
-        * the variable ranges.
-        */
-       if ((start < 0x100000) &&
-           (mtrr_state.have_fixed) &&
-           (mtrr_state.enabled & MTRR_STATE_MTRR_FIXED_ENABLED)) {
-               is_uniform = 0;
-               type = mtrr_type_lookup_fixed(start, end);
-               goto out;
-       }
+       for (i = 0; i < cache_map_n && start < end; i++) {
+               /* Region after current map entry? -> continue with next one. */
+               if (start >= cache_map[i].end)
+                       continue;
 
-       /*
-        * Look up the variable ranges.  Look of multiple ranges matching
-        * this address and pick type as per MTRR precedence.
-        */
-       type = mtrr_type_lookup_variable(start, end, &partial_end,
-                                        &repeat, &is_uniform);
+               /* Start of region not covered by current map entry? */
+               if (start < cache_map[i].start) {
+                       /* At least some part of region has default type. */
+                       type = type_merge(type, mtrr_state.def_type, uniform);
+                       /* End of region not covered, too? -> lookup done. */
+                       if (end <= cache_map[i].start)
+                               return type;
+               }
 
-       /*
-        * Common path is with repeat = 0.
-        * However, we can have cases where [start:end] spans across some
-        * MTRR ranges and/or the default type.  Do repeated lookups for
-        * that case here.
-        */
-       while (repeat) {
-               prev_type = type;
-               start = partial_end;
-               is_uniform = 0;
-               type = mtrr_type_lookup_variable(start, end, &partial_end,
-                                                &repeat, &dummy);
+               /* At least part of region covered by map entry. */
+               type = type_merge(type, cache_map[i].type, uniform);
 
-               if (check_type_overlap(&prev_type, &type))
-                       goto out;
+               start = cache_map[i].end;
        }
 
-       if (mtrr_tom2 && (start >= (1ULL<<32)) && (end < mtrr_tom2))
-               type = MTRR_TYPE_WRBACK;
+       /* End of region past last entry in map? -> use default type. */
+       if (start < end)
+               type = type_merge(type, mtrr_state.def_type, uniform);
 
-out:
-       *uniform = is_uniform;
        return type;
 }
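As an editorial illustration (not part of the patch) of how a caller consumes the result — the wrapper name, address and size below are arbitrary:

static void example_check_range(void)
{
        u8 uniform, type;

        type = mtrr_type_lookup(0xc0000000ULL, 0xc0000000ULL + 0x200000ULL,
                                &uniform);
        if (type == MTRR_TYPE_WRBACK && uniform) {
                /*
                 * The whole 2 MiB range has one effective type, so a single
                 * large-page mapping with that type agrees with the MTRRs.
                 */
        }
}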
 
@@ -363,8 +595,8 @@ static void __init print_fixed_last(void)
        if (!last_fixed_end)
                return;
 
-       pr_debug("  %05X-%05X %s\n", last_fixed_start,
-                last_fixed_end - 1, mtrr_attrib_to_str(last_fixed_type));
+       pr_info("  %05X-%05X %s\n", last_fixed_start,
+               last_fixed_end - 1, mtrr_attrib_to_str(last_fixed_type));
 
        last_fixed_end = 0;
 }
@@ -402,10 +634,10 @@ static void __init print_mtrr_state(void)
        unsigned int i;
        int high_width;
 
-       pr_debug("MTRR default type: %s\n",
-                mtrr_attrib_to_str(mtrr_state.def_type));
+       pr_info("MTRR default type: %s\n",
+               mtrr_attrib_to_str(mtrr_state.def_type));
        if (mtrr_state.have_fixed) {
-               pr_debug("MTRR fixed ranges %sabled:\n",
+               pr_info("MTRR fixed ranges %sabled:\n",
                        ((mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED) &&
                         (mtrr_state.enabled & MTRR_STATE_MTRR_FIXED_ENABLED)) ?
                         "en" : "dis");
@@ -420,26 +652,27 @@ static void __init print_mtrr_state(void)
                /* tail */
                print_fixed_last();
        }
-       pr_debug("MTRR variable ranges %sabled:\n",
-                mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED ? "en" : "dis");
-       high_width = (__ffs64(size_or_mask) - (32 - PAGE_SHIFT) + 3) / 4;
+       pr_info("MTRR variable ranges %sabled:\n",
+               mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED ? "en" : "dis");
+       high_width = (boot_cpu_data.x86_phys_bits - (32 - PAGE_SHIFT) + 3) / 4;
 
        for (i = 0; i < num_var_ranges; ++i) {
-               if (mtrr_state.var_ranges[i].mask_lo & (1 << 11))
-                       pr_debug("  %u base %0*X%05X000 mask %0*X%05X000 %s\n",
-                                i,
-                                high_width,
-                                mtrr_state.var_ranges[i].base_hi,
-                                mtrr_state.var_ranges[i].base_lo >> 12,
-                                high_width,
-                                mtrr_state.var_ranges[i].mask_hi,
-                                mtrr_state.var_ranges[i].mask_lo >> 12,
-                                mtrr_attrib_to_str(mtrr_state.var_ranges[i].base_lo & 0xff));
+               if (mtrr_state.var_ranges[i].mask_lo & MTRR_PHYSMASK_V)
+                       pr_info("  %u base %0*X%05X000 mask %0*X%05X000 %s\n",
+                               i,
+                               high_width,
+                               mtrr_state.var_ranges[i].base_hi,
+                               mtrr_state.var_ranges[i].base_lo >> 12,
+                               high_width,
+                               mtrr_state.var_ranges[i].mask_hi,
+                               mtrr_state.var_ranges[i].mask_lo >> 12,
+                               mtrr_attrib_to_str(mtrr_state.var_ranges[i].base_lo &
+                                                   MTRR_PHYSBASE_TYPE));
                else
-                       pr_debug("  %u disabled\n", i);
+                       pr_info("  %u disabled\n", i);
        }
        if (mtrr_tom2)
-               pr_debug("TOM2: %016llx aka %lldM\n", mtrr_tom2, mtrr_tom2>>20);
+               pr_info("TOM2: %016llx aka %lldM\n", mtrr_tom2, mtrr_tom2>>20);
 }
 
 /* Grab all of the MTRR state for this CPU into *state */
@@ -452,7 +685,7 @@ bool __init get_mtrr_state(void)
        vrs = mtrr_state.var_ranges;
 
        rdmsr(MSR_MTRRcap, lo, dummy);
-       mtrr_state.have_fixed = (lo >> 8) & 1;
+       mtrr_state.have_fixed = lo & MTRR_CAP_FIX;
 
        for (i = 0; i < num_var_ranges; i++)
                get_mtrr_var_range(i, &vrs[i]);
@@ -460,8 +693,8 @@ bool __init get_mtrr_state(void)
                get_fixed_ranges(mtrr_state.fixed_ranges);
 
        rdmsr(MSR_MTRRdefType, lo, dummy);
-       mtrr_state.def_type = (lo & 0xff);
-       mtrr_state.enabled = (lo & 0xc00) >> 10;
+       mtrr_state.def_type = lo & MTRR_DEF_TYPE_TYPE;
+       mtrr_state.enabled = (lo & MTRR_DEF_TYPE_ENABLE) >> MTRR_STATE_SHIFT;
 
        if (amd_special_default_mtrr()) {
                unsigned low, high;
@@ -474,7 +707,8 @@ bool __init get_mtrr_state(void)
                mtrr_tom2 &= 0xffffff800000ULL;
        }
 
-       print_mtrr_state();
+       if (mtrr_debug)
+               print_mtrr_state();
 
        mtrr_state_set = 1;
 
@@ -574,7 +808,7 @@ static void generic_get_mtrr(unsigned int reg, unsigned long *base,
 
        rdmsr(MTRRphysMask_MSR(reg), mask_lo, mask_hi);
 
-       if ((mask_lo & 0x800) == 0) {
+       if (!(mask_lo & MTRR_PHYSMASK_V)) {
                /*  Invalid (i.e. free) range */
                *base = 0;
                *size = 0;
@@ -585,8 +819,8 @@ static void generic_get_mtrr(unsigned int reg, unsigned long *base,
        rdmsr(MTRRphysBase_MSR(reg), base_lo, base_hi);
 
        /* Work out the shifted address mask: */
-       tmp = (u64)mask_hi << (32 - PAGE_SHIFT) | mask_lo >> PAGE_SHIFT;
-       mask = size_or_mask | tmp;
+       tmp = (u64)mask_hi << 32 | (mask_lo & PAGE_MASK);
+       mask = (u64)phys_hi_rsvd << 32 | tmp;
 
        /* Expand tmp with high bits to all 1s: */
        hi = fls64(tmp);
@@ -604,9 +838,9 @@ static void generic_get_mtrr(unsigned int reg, unsigned long *base,
         * This works correctly if size is a power of two, i.e. a
         * contiguous range:
         */
-       *size = -mask;
+       *size = -mask >> PAGE_SHIFT;
        *base = (u64)base_hi << (32 - PAGE_SHIFT) | base_lo >> PAGE_SHIFT;
-       *type = base_lo & 0xff;
+       *type = base_lo & MTRR_PHYSBASE_TYPE;
 
 out_put_cpu:
        put_cpu();
@@ -644,9 +878,8 @@ static bool set_mtrr_var_ranges(unsigned int index, struct mtrr_var_range *vr)
        bool changed = false;
 
        rdmsr(MTRRphysBase_MSR(index), lo, hi);
-       if ((vr->base_lo & 0xfffff0ffUL) != (lo & 0xfffff0ffUL)
-           || (vr->base_hi & (size_and_mask >> (32 - PAGE_SHIFT))) !=
-               (hi & (size_and_mask >> (32 - PAGE_SHIFT)))) {
+       if ((vr->base_lo & ~MTRR_PHYSBASE_RSVD) != (lo & ~MTRR_PHYSBASE_RSVD)
+           || (vr->base_hi & ~phys_hi_rsvd) != (hi & ~phys_hi_rsvd)) {
 
                mtrr_wrmsr(MTRRphysBase_MSR(index), vr->base_lo, vr->base_hi);
                changed = true;
@@ -654,9 +887,8 @@ static bool set_mtrr_var_ranges(unsigned int index, struct mtrr_var_range *vr)
 
        rdmsr(MTRRphysMask_MSR(index), lo, hi);
 
-       if ((vr->mask_lo & 0xfffff800UL) != (lo & 0xfffff800UL)
-           || (vr->mask_hi & (size_and_mask >> (32 - PAGE_SHIFT))) !=
-               (hi & (size_and_mask >> (32 - PAGE_SHIFT)))) {
+       if ((vr->mask_lo & ~MTRR_PHYSMASK_RSVD) != (lo & ~MTRR_PHYSMASK_RSVD)
+           || (vr->mask_hi & ~phys_hi_rsvd) != (hi & ~phys_hi_rsvd)) {
                mtrr_wrmsr(MTRRphysMask_MSR(index), vr->mask_lo, vr->mask_hi);
                changed = true;
        }
@@ -691,11 +923,12 @@ static unsigned long set_mtrr_state(void)
         * Set_mtrr_restore restores the old value of MTRRdefType,
         * so to set it we fiddle with the saved value:
         */
-       if ((deftype_lo & 0xff) != mtrr_state.def_type
-           || ((deftype_lo & 0xc00) >> 10) != mtrr_state.enabled) {
+       if ((deftype_lo & MTRR_DEF_TYPE_TYPE) != mtrr_state.def_type ||
+           ((deftype_lo & MTRR_DEF_TYPE_ENABLE) >> MTRR_STATE_SHIFT) != mtrr_state.enabled) {
 
-               deftype_lo = (deftype_lo & ~0xcff) | mtrr_state.def_type |
-                            (mtrr_state.enabled << 10);
+               deftype_lo = (deftype_lo & MTRR_DEF_TYPE_DISABLE) |
+                            mtrr_state.def_type |
+                            (mtrr_state.enabled << MTRR_STATE_SHIFT);
                change_mask |= MTRR_CHANGE_MASK_DEFTYPE;
        }
 
@@ -708,7 +941,7 @@ void mtrr_disable(void)
        rdmsr(MSR_MTRRdefType, deftype_lo, deftype_hi);
 
        /* Disable MTRRs, and set the default type to uncached */
-       mtrr_wrmsr(MSR_MTRRdefType, deftype_lo & ~0xcff, deftype_hi);
+       mtrr_wrmsr(MSR_MTRRdefType, deftype_lo & MTRR_DEF_TYPE_DISABLE, deftype_hi);
 }
 
 void mtrr_enable(void)
@@ -762,9 +995,9 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
                memset(vr, 0, sizeof(struct mtrr_var_range));
        } else {
                vr->base_lo = base << PAGE_SHIFT | type;
-               vr->base_hi = (base & size_and_mask) >> (32 - PAGE_SHIFT);
-               vr->mask_lo = -size << PAGE_SHIFT | 0x800;
-               vr->mask_hi = (-size & size_and_mask) >> (32 - PAGE_SHIFT);
+               vr->base_hi = (base >> (32 - PAGE_SHIFT)) & ~phys_hi_rsvd;
+               vr->mask_lo = -size << PAGE_SHIFT | MTRR_PHYSMASK_V;
+               vr->mask_hi = (-size >> (32 - PAGE_SHIFT)) & ~phys_hi_rsvd;
 
                mtrr_wrmsr(MTRRphysBase_MSR(reg), vr->base_lo, vr->base_hi);
                mtrr_wrmsr(MTRRphysMask_MSR(reg), vr->mask_lo, vr->mask_hi);
@@ -783,7 +1016,7 @@ int generic_validate_add_page(unsigned long base, unsigned long size,
         * For Intel PPro stepping <= 7
         * must be 4 MiB aligned and not touch 0x70000000 -> 0x7003FFFF
         */
-       if (is_cpu(INTEL) && boot_cpu_data.x86 == 6 &&
+       if (mtrr_if == &generic_mtrr_ops && boot_cpu_data.x86 == 6 &&
            boot_cpu_data.x86_model == 1 &&
            boot_cpu_data.x86_stepping <= 7) {
                if (base & ((1 << (22 - PAGE_SHIFT)) - 1)) {
@@ -817,7 +1050,7 @@ static int generic_have_wrcomb(void)
 {
        unsigned long config, dummy;
        rdmsr(MSR_MTRRcap, config, dummy);
-       return config & (1 << 10);
+       return config & MTRR_CAP_WC;
 }
 
 int positive_have_wrcomb(void)
diff --git a/arch/x86/kernel/cpu/mtrr/legacy.c b/arch/x86/kernel/cpu/mtrr/legacy.c
new file mode 100644 (file)
index 0000000..d25882f
--- /dev/null
@@ -0,0 +1,90 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/syscore_ops.h>
+#include <asm/cpufeature.h>
+#include <asm/mtrr.h>
+#include <asm/processor.h>
+#include "mtrr.h"
+
+void mtrr_set_if(void)
+{
+       switch (boot_cpu_data.x86_vendor) {
+       case X86_VENDOR_AMD:
+               /* Pre-Athlon (K6) AMD CPU MTRRs */
+               if (cpu_feature_enabled(X86_FEATURE_K6_MTRR))
+                       mtrr_if = &amd_mtrr_ops;
+               break;
+       case X86_VENDOR_CENTAUR:
+               if (cpu_feature_enabled(X86_FEATURE_CENTAUR_MCR))
+                       mtrr_if = &centaur_mtrr_ops;
+               break;
+       case X86_VENDOR_CYRIX:
+               if (cpu_feature_enabled(X86_FEATURE_CYRIX_ARR))
+                       mtrr_if = &cyrix_mtrr_ops;
+               break;
+       default:
+               break;
+       }
+}
+
+/*
+ * The suspend/resume methods are only for CPUs without MTRR. CPUs using the
+ * generic MTRR driver don't need this.
+ */
+struct mtrr_value {
+       mtrr_type       ltype;
+       unsigned long   lbase;
+       unsigned long   lsize;
+};
+
+static struct mtrr_value *mtrr_value;
+
+static int mtrr_save(void)
+{
+       int i;
+
+       if (!mtrr_value)
+               return -ENOMEM;
+
+       for (i = 0; i < num_var_ranges; i++) {
+               mtrr_if->get(i, &mtrr_value[i].lbase,
+                               &mtrr_value[i].lsize,
+                               &mtrr_value[i].ltype);
+       }
+       return 0;
+}
+
+static void mtrr_restore(void)
+{
+       int i;
+
+       for (i = 0; i < num_var_ranges; i++) {
+               if (mtrr_value[i].lsize) {
+                       mtrr_if->set(i, mtrr_value[i].lbase,
+                                    mtrr_value[i].lsize,
+                                    mtrr_value[i].ltype);
+               }
+       }
+}
+
+static struct syscore_ops mtrr_syscore_ops = {
+       .suspend        = mtrr_save,
+       .resume         = mtrr_restore,
+};
+
+void mtrr_register_syscore(void)
+{
+       mtrr_value = kcalloc(num_var_ranges, sizeof(*mtrr_value), GFP_KERNEL);
+
+       /*
+        * These CPUs have no generic MTRR support and seem not to support
+        * SMP.  They have vendor-specific drivers, so a tricky method is
+        * used to support suspend/resume for them.
+        *
+        * TBD: is there any system with such a CPU that supports
+        * suspend/resume?  If not, this code should be removed.
+        */
+       register_syscore_ops(&mtrr_syscore_ops);
+}
index 783f321..767bf1c 100644 (file)
 #define MTRR_TO_PHYS_WC_OFFSET 1000
 
 u32 num_var_ranges;
-static bool mtrr_enabled(void)
-{
-       return !!mtrr_if;
-}
 
 unsigned int mtrr_usage_table[MTRR_MAX_VAR_RANGES];
-static DEFINE_MUTEX(mtrr_mutex);
-
-u64 size_or_mask, size_and_mask;
+DEFINE_MUTEX(mtrr_mutex);
 
 const struct mtrr_ops *mtrr_if;
 
@@ -105,21 +99,6 @@ static int have_wrcomb(void)
        return mtrr_if->have_wrcomb ? mtrr_if->have_wrcomb() : 0;
 }
 
-/*  This function returns the number of variable MTRRs  */
-static void __init set_num_var_ranges(bool use_generic)
-{
-       unsigned long config = 0, dummy;
-
-       if (use_generic)
-               rdmsr(MSR_MTRRcap, config, dummy);
-       else if (is_cpu(AMD) || is_cpu(HYGON))
-               config = 2;
-       else if (is_cpu(CYRIX) || is_cpu(CENTAUR))
-               config = 8;
-
-       num_var_ranges = config & 0xff;
-}
-
 static void __init init_table(void)
 {
        int i, max;
@@ -194,20 +173,8 @@ static inline int types_compatible(mtrr_type type1, mtrr_type type2)
  * Note that the mechanism is the same for UP systems, too; all the SMP stuff
  * becomes nops.
  */
-static void
-set_mtrr(unsigned int reg, unsigned long base, unsigned long size, mtrr_type type)
-{
-       struct set_mtrr_data data = { .smp_reg = reg,
-                                     .smp_base = base,
-                                     .smp_size = size,
-                                     .smp_type = type
-                                   };
-
-       stop_machine(mtrr_rendezvous_handler, &data, cpu_online_mask);
-}
-
-static void set_mtrr_cpuslocked(unsigned int reg, unsigned long base,
-                               unsigned long size, mtrr_type type)
+static void set_mtrr(unsigned int reg, unsigned long base, unsigned long size,
+                    mtrr_type type)
 {
        struct set_mtrr_data data = { .smp_reg = reg,
                                      .smp_base = base,
@@ -216,6 +183,8 @@ static void set_mtrr_cpuslocked(unsigned int reg, unsigned long base,
                                    };
 
        stop_machine_cpuslocked(mtrr_rendezvous_handler, &data, cpu_online_mask);
+
+       generic_rebuild_map();
 }
 
 /**
@@ -337,7 +306,7 @@ int mtrr_add_page(unsigned long base, unsigned long size,
        /* Search for an empty MTRR */
        i = mtrr_if->get_free_region(base, size, replace);
        if (i >= 0) {
-               set_mtrr_cpuslocked(i, base, size, type);
+               set_mtrr(i, base, size, type);
                if (likely(replace < 0)) {
                        mtrr_usage_table[i] = 1;
                } else {
@@ -345,7 +314,7 @@ int mtrr_add_page(unsigned long base, unsigned long size,
                        if (increment)
                                mtrr_usage_table[i]++;
                        if (unlikely(replace != i)) {
-                               set_mtrr_cpuslocked(replace, 0, 0, 0);
+                               set_mtrr(replace, 0, 0, 0);
                                mtrr_usage_table[replace] = 0;
                        }
                }
@@ -363,7 +332,7 @@ static int mtrr_check(unsigned long base, unsigned long size)
 {
        if ((base & (PAGE_SIZE - 1)) || (size & (PAGE_SIZE - 1))) {
                pr_warn("size and base must be multiples of 4 kiB\n");
-               pr_debug("size: 0x%lx  base: 0x%lx\n", size, base);
+               Dprintk("size: 0x%lx  base: 0x%lx\n", size, base);
                dump_stack();
                return -1;
        }
@@ -454,8 +423,7 @@ int mtrr_del_page(int reg, unsigned long base, unsigned long size)
                        }
                }
                if (reg < 0) {
-                       pr_debug("no MTRR for %lx000,%lx000 found\n",
-                                base, size);
+                       Dprintk("no MTRR for %lx000,%lx000 found\n", base, size);
                        goto out;
                }
        }
@@ -473,7 +441,7 @@ int mtrr_del_page(int reg, unsigned long base, unsigned long size)
                goto out;
        }
        if (--mtrr_usage_table[reg] < 1)
-               set_mtrr_cpuslocked(reg, 0, 0, 0);
+               set_mtrr(reg, 0, 0, 0);
        error = reg;
  out:
        mutex_unlock(&mtrr_mutex);
@@ -574,136 +542,54 @@ int arch_phys_wc_index(int handle)
 }
 EXPORT_SYMBOL_GPL(arch_phys_wc_index);
 
-/* The suspend/resume methods are only for CPU without MTRR. CPU using generic
- * MTRR driver doesn't require this
- */
-struct mtrr_value {
-       mtrr_type       ltype;
-       unsigned long   lbase;
-       unsigned long   lsize;
-};
-
-static struct mtrr_value mtrr_value[MTRR_MAX_VAR_RANGES];
-
-static int mtrr_save(void)
-{
-       int i;
-
-       for (i = 0; i < num_var_ranges; i++) {
-               mtrr_if->get(i, &mtrr_value[i].lbase,
-                               &mtrr_value[i].lsize,
-                               &mtrr_value[i].ltype);
-       }
-       return 0;
-}
-
-static void mtrr_restore(void)
-{
-       int i;
-
-       for (i = 0; i < num_var_ranges; i++) {
-               if (mtrr_value[i].lsize) {
-                       set_mtrr(i, mtrr_value[i].lbase,
-                                   mtrr_value[i].lsize,
-                                   mtrr_value[i].ltype);
-               }
-       }
-}
-
-
-
-static struct syscore_ops mtrr_syscore_ops = {
-       .suspend        = mtrr_save,
-       .resume         = mtrr_restore,
-};
-
 int __initdata changed_by_mtrr_cleanup;
 
-#define SIZE_OR_MASK_BITS(n)  (~((1ULL << ((n) - PAGE_SHIFT)) - 1))
 /**
- * mtrr_bp_init - initialize mtrrs on the boot CPU
+ * mtrr_bp_init - initialize MTRRs on the boot CPU
  *
  * This needs to be called early; before any of the other CPUs are
  * initialized (i.e. before smp_init()).
- *
  */
 void __init mtrr_bp_init(void)
 {
+       bool generic_mtrrs = cpu_feature_enabled(X86_FEATURE_MTRR);
        const char *why = "(not available)";
-       u32 phys_addr;
-
-       phys_addr = 32;
+       unsigned long config, dummy;
 
-       if (boot_cpu_has(X86_FEATURE_MTRR)) {
-               mtrr_if = &generic_mtrr_ops;
-               size_or_mask = SIZE_OR_MASK_BITS(36);
-               size_and_mask = 0x00f00000;
-               phys_addr = 36;
+       phys_hi_rsvd = GENMASK(31, boot_cpu_data.x86_phys_bits - 32);
 
+       if (!generic_mtrrs && mtrr_state.enabled) {
                /*
-                * This is an AMD specific MSR, but we assume(hope?) that
-                * Intel will implement it too when they extend the address
-                * bus of the Xeon.
+                * Software overwrite of MTRR state, only for generic case.
+                * Note that X86_FEATURE_MTRR has been reset in this case.
                 */
-               if (cpuid_eax(0x80000000) >= 0x80000008) {
-                       phys_addr = cpuid_eax(0x80000008) & 0xff;
-                       /* CPUID workaround for Intel 0F33/0F34 CPU */
-                       if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
-                           boot_cpu_data.x86 == 0xF &&
-                           boot_cpu_data.x86_model == 0x3 &&
-                           (boot_cpu_data.x86_stepping == 0x3 ||
-                            boot_cpu_data.x86_stepping == 0x4))
-                               phys_addr = 36;
-
-                       size_or_mask = SIZE_OR_MASK_BITS(phys_addr);
-                       size_and_mask = ~size_or_mask & 0xfffff00000ULL;
-               } else if (boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR &&
-                          boot_cpu_data.x86 == 6) {
-                       /*
-                        * VIA C* family have Intel style MTRRs,
-                        * but don't support PAE
-                        */
-                       size_or_mask = SIZE_OR_MASK_BITS(32);
-                       size_and_mask = 0;
-                       phys_addr = 32;
-               }
-       } else {
-               switch (boot_cpu_data.x86_vendor) {
-               case X86_VENDOR_AMD:
-                       if (cpu_feature_enabled(X86_FEATURE_K6_MTRR)) {
-                               /* Pre-Athlon (K6) AMD CPU MTRRs */
-                               mtrr_if = &amd_mtrr_ops;
-                               size_or_mask = SIZE_OR_MASK_BITS(32);
-                               size_and_mask = 0;
-                       }
-                       break;
-               case X86_VENDOR_CENTAUR:
-                       if (cpu_feature_enabled(X86_FEATURE_CENTAUR_MCR)) {
-                               mtrr_if = &centaur_mtrr_ops;
-                               size_or_mask = SIZE_OR_MASK_BITS(32);
-                               size_and_mask = 0;
-                       }
-                       break;
-               case X86_VENDOR_CYRIX:
-                       if (cpu_feature_enabled(X86_FEATURE_CYRIX_ARR)) {
-                               mtrr_if = &cyrix_mtrr_ops;
-                               size_or_mask = SIZE_OR_MASK_BITS(32);
-                               size_and_mask = 0;
-                       }
-                       break;
-               default:
-                       break;
-               }
+               init_table();
+               mtrr_build_map();
+               pr_info("MTRRs set to read-only\n");
+
+               return;
        }
 
+       if (generic_mtrrs)
+               mtrr_if = &generic_mtrr_ops;
+       else
+               mtrr_set_if();
+
        if (mtrr_enabled()) {
-               set_num_var_ranges(mtrr_if == &generic_mtrr_ops);
+               /* Get the number of variable MTRR ranges. */
+               if (mtrr_if == &generic_mtrr_ops)
+                       rdmsr(MSR_MTRRcap, config, dummy);
+               else
+                       config = mtrr_if->var_regs;
+               num_var_ranges = config & MTRR_CAP_VCNT;
+
                init_table();
                if (mtrr_if == &generic_mtrr_ops) {
                        /* BIOS may override */
                        if (get_mtrr_state()) {
                                memory_caching_control |= CACHE_MTRR;
-                               changed_by_mtrr_cleanup = mtrr_cleanup(phys_addr);
+                               changed_by_mtrr_cleanup = mtrr_cleanup();
+                               mtrr_build_map();
                        } else {
                                mtrr_if = NULL;
                                why = "by BIOS";
@@ -730,8 +616,14 @@ void mtrr_save_state(void)
        smp_call_function_single(first_cpu, mtrr_save_fixed_ranges, NULL, 1);
 }
 
-static int __init mtrr_init_finialize(void)
+static int __init mtrr_init_finalize(void)
 {
+       /*
+        * Map might exist if mtrr_overwrite_state() has been called or if
+        * mtrr_enabled() returns true.
+        */
+       mtrr_copy_map();
+
        if (!mtrr_enabled())
                return 0;
 
@@ -741,16 +633,8 @@ static int __init mtrr_init_finialize(void)
                return 0;
        }
 
-       /*
-        * The CPU has no MTRR and seems to not support SMP. They have
-        * specific drivers, we use a tricky method to support
-        * suspend/resume for them.
-        *
-        * TBD: is there any system with such CPU which supports
-        * suspend/resume? If no, we should remove the code.
-        */
-       register_syscore_ops(&mtrr_syscore_ops);
+       mtrr_register_syscore();
 
        return 0;
 }
-subsys_initcall(mtrr_init_finialize);
+subsys_initcall(mtrr_init_finalize);
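Editorial aside on the reworked boot path above (not part of the patch): with the generic driver, num_var_ranges now comes straight from the MTRRcap MSR. A minimal decode using only the masks this series introduces; the wrapper name and message text are illustrative:

static void example_dump_mtrrcap(void)
{
        u32 lo, hi;

        rdmsr(MSR_MTRRcap, lo, hi);
        pr_info("%u variable MTRRs, fixed ranges %ssupported, WC %ssupported\n",
                lo & MTRR_CAP_VCNT,
                (lo & MTRR_CAP_FIX) ? "" : "not ",
                (lo & MTRR_CAP_WC) ? "" : "not ");
}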
index 02eb587..5655f25 100644 (file)
 #define MTRR_CHANGE_MASK_VARIABLE  0x02
 #define MTRR_CHANGE_MASK_DEFTYPE   0x04
 
+extern bool mtrr_debug;
+#define Dprintk(x...) do { if (mtrr_debug) pr_info(x); } while (0)
+
 extern unsigned int mtrr_usage_table[MTRR_MAX_VAR_RANGES];
 
 struct mtrr_ops {
-       u32     vendor;
+       u32     var_regs;
        void    (*set)(unsigned int reg, unsigned long base,
                       unsigned long size, mtrr_type type);
        void    (*get)(unsigned int reg, unsigned long *base,
@@ -51,18 +54,26 @@ void fill_mtrr_var_range(unsigned int index,
                u32 base_lo, u32 base_hi, u32 mask_lo, u32 mask_hi);
 bool get_mtrr_state(void);
 
-extern u64 size_or_mask, size_and_mask;
 extern const struct mtrr_ops *mtrr_if;
-
-#define is_cpu(vnd)    (mtrr_if && mtrr_if->vendor == X86_VENDOR_##vnd)
+extern struct mutex mtrr_mutex;
 
 extern unsigned int num_var_ranges;
 extern u64 mtrr_tom2;
 extern struct mtrr_state_type mtrr_state;
+extern u32 phys_hi_rsvd;
 
 void mtrr_state_warn(void);
 const char *mtrr_attrib_to_str(int x);
 void mtrr_wrmsr(unsigned, unsigned, unsigned);
+#ifdef CONFIG_X86_32
+void mtrr_set_if(void);
+void mtrr_register_syscore(void);
+#else
+static inline void mtrr_set_if(void) { }
+static inline void mtrr_register_syscore(void) { }
+#endif
+void mtrr_build_map(void);
+void mtrr_copy_map(void);
 
 /* CPU specific mtrr_ops vectors. */
 extern const struct mtrr_ops amd_mtrr_ops;
@@ -70,4 +81,14 @@ extern const struct mtrr_ops cyrix_mtrr_ops;
 extern const struct mtrr_ops centaur_mtrr_ops;
 
 extern int changed_by_mtrr_cleanup;
-extern int mtrr_cleanup(unsigned address_bits);
+extern int mtrr_cleanup(void);
+
+/*
+ * Must be used by code which uses mtrr_if to call platform-specific
+ * MTRR manipulation functions.
+ */
+static inline bool mtrr_enabled(void)
+{
+       return !!mtrr_if;
+}
+void generic_rebuild_map(void);
index 6ad33f3..7253440 100644 (file)
@@ -726,11 +726,15 @@ unlock:
 static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s)
 {
        struct task_struct *p, *t;
+       pid_t pid;
 
        rcu_read_lock();
        for_each_process_thread(p, t) {
-               if (is_closid_match(t, r) || is_rmid_match(t, r))
-                       seq_printf(s, "%d\n", t->pid);
+               if (is_closid_match(t, r) || is_rmid_match(t, r)) {
+                       pid = task_pid_vnr(t);
+                       if (pid)
+                               seq_printf(s, "%d\n", pid);
+               }
        }
        rcu_read_unlock();
 }
@@ -2301,6 +2305,26 @@ static struct rdtgroup *kernfs_to_rdtgroup(struct kernfs_node *kn)
        }
 }
 
+static void rdtgroup_kn_get(struct rdtgroup *rdtgrp, struct kernfs_node *kn)
+{
+       atomic_inc(&rdtgrp->waitcount);
+       kernfs_break_active_protection(kn);
+}
+
+static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn)
+{
+       if (atomic_dec_and_test(&rdtgrp->waitcount) &&
+           (rdtgrp->flags & RDT_DELETED)) {
+               if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
+                   rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
+                       rdtgroup_pseudo_lock_remove(rdtgrp);
+               kernfs_unbreak_active_protection(kn);
+               rdtgroup_remove(rdtgrp);
+       } else {
+               kernfs_unbreak_active_protection(kn);
+       }
+}
+
 struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn)
 {
        struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn);
@@ -2308,8 +2332,7 @@ struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn)
        if (!rdtgrp)
                return NULL;
 
-       atomic_inc(&rdtgrp->waitcount);
-       kernfs_break_active_protection(kn);
+       rdtgroup_kn_get(rdtgrp, kn);
 
        mutex_lock(&rdtgroup_mutex);
 
@@ -2328,17 +2351,7 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn)
                return;
 
        mutex_unlock(&rdtgroup_mutex);
-
-       if (atomic_dec_and_test(&rdtgrp->waitcount) &&
-           (rdtgrp->flags & RDT_DELETED)) {
-               if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
-                   rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)
-                       rdtgroup_pseudo_lock_remove(rdtgrp);
-               kernfs_unbreak_active_protection(kn);
-               rdtgroup_remove(rdtgrp);
-       } else {
-               kernfs_unbreak_active_protection(kn);
-       }
+       rdtgroup_kn_put(rdtgrp, kn);
 }
 
 static int mkdir_mondata_all(struct kernfs_node *parent_kn,
@@ -3505,6 +3518,133 @@ out:
        return ret;
 }
 
+/**
+ * mongrp_reparent() - replace parent CTRL_MON group of a MON group
+ * @rdtgrp:            the MON group whose parent should be replaced
+ * @new_prdtgrp:       replacement parent CTRL_MON group for @rdtgrp
+ * @cpus:              cpumask provided by the caller for use during this call
+ *
+ * Replaces the parent CTRL_MON group for a MON group, resulting in all member
+ * tasks' CLOSID immediately changing to that of the new parent group.
+ * Monitoring data for the group is unaffected by this operation.
+ */
+static void mongrp_reparent(struct rdtgroup *rdtgrp,
+                           struct rdtgroup *new_prdtgrp,
+                           cpumask_var_t cpus)
+{
+       struct rdtgroup *prdtgrp = rdtgrp->mon.parent;
+
+       WARN_ON(rdtgrp->type != RDTMON_GROUP);
+       WARN_ON(new_prdtgrp->type != RDTCTRL_GROUP);
+
+       /* Nothing to do when simply renaming a MON group. */
+       if (prdtgrp == new_prdtgrp)
+               return;
+
+       WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list));
+       list_move_tail(&rdtgrp->mon.crdtgrp_list,
+                      &new_prdtgrp->mon.crdtgrp_list);
+
+       rdtgrp->mon.parent = new_prdtgrp;
+       rdtgrp->closid = new_prdtgrp->closid;
+
+       /* Propagate updated closid to all tasks in this group. */
+       rdt_move_group_tasks(rdtgrp, rdtgrp, cpus);
+
+       update_closid_rmid(cpus, NULL);
+}
+
+static int rdtgroup_rename(struct kernfs_node *kn,
+                          struct kernfs_node *new_parent, const char *new_name)
+{
+       struct rdtgroup *new_prdtgrp;
+       struct rdtgroup *rdtgrp;
+       cpumask_var_t tmpmask;
+       int ret;
+
+       rdtgrp = kernfs_to_rdtgroup(kn);
+       new_prdtgrp = kernfs_to_rdtgroup(new_parent);
+       if (!rdtgrp || !new_prdtgrp)
+               return -ENOENT;
+
+       /* Release both kernfs active_refs before obtaining rdtgroup mutex. */
+       rdtgroup_kn_get(rdtgrp, kn);
+       rdtgroup_kn_get(new_prdtgrp, new_parent);
+
+       mutex_lock(&rdtgroup_mutex);
+
+       rdt_last_cmd_clear();
+
+       /*
+        * Don't allow kernfs_to_rdtgroup() to return a parent rdtgroup if
+        * either kernfs_node is a file.
+        */
+       if (kernfs_type(kn) != KERNFS_DIR ||
+           kernfs_type(new_parent) != KERNFS_DIR) {
+               rdt_last_cmd_puts("Source and destination must be directories\n");
+               ret = -EPERM;
+               goto out;
+       }
+
+       if ((rdtgrp->flags & RDT_DELETED) || (new_prdtgrp->flags & RDT_DELETED)) {
+               ret = -ENOENT;
+               goto out;
+       }
+
+       if (rdtgrp->type != RDTMON_GROUP || !kn->parent ||
+           !is_mon_groups(kn->parent, kn->name)) {
+               rdt_last_cmd_puts("Source must be a MON group\n");
+               ret = -EPERM;
+               goto out;
+       }
+
+       if (!is_mon_groups(new_parent, new_name)) {
+               rdt_last_cmd_puts("Destination must be a mon_groups subdirectory\n");
+               ret = -EPERM;
+               goto out;
+       }
+
+       /*
+        * If the MON group is monitoring CPUs, the CPUs must be assigned to the
+        * current parent CTRL_MON group and therefore cannot be assigned to
+        * the new parent, making the move illegal.
+        */
+       if (!cpumask_empty(&rdtgrp->cpu_mask) &&
+           rdtgrp->mon.parent != new_prdtgrp) {
+               rdt_last_cmd_puts("Cannot move a MON group that monitors CPUs\n");
+               ret = -EPERM;
+               goto out;
+       }
+
+       /*
+        * Allocate the cpumask for use in mongrp_reparent() to avoid the
+        * possibility of failing to allocate it after kernfs_rename() has
+        * succeeded.
+        */
+       if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
+               ret = -ENOMEM;
+               goto out;
+       }
+
+       /*
+        * Perform all input validation and allocations needed to ensure
+        * mongrp_reparent() will succeed before calling kernfs_rename(),
+        * otherwise it would be necessary to revert this call if
+        * mongrp_reparent() failed.
+        */
+       ret = kernfs_rename(kn, new_parent, new_name);
+       if (!ret)
+               mongrp_reparent(rdtgrp, new_prdtgrp, tmpmask);
+
+       free_cpumask_var(tmpmask);
+
+out:
+       mutex_unlock(&rdtgroup_mutex);
+       rdtgroup_kn_put(rdtgrp, kn);
+       rdtgroup_kn_put(new_prdtgrp, new_parent);
+       return ret;
+}
+
 static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf)
 {
        if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3))
@@ -3522,6 +3662,7 @@ static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf)
 static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = {
        .mkdir          = rdtgroup_mkdir,
        .rmdir          = rdtgroup_rmdir,
+       .rename         = rdtgroup_rename,
        .show_options   = rdtgroup_show_options,
 };
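Editorial usage sketch, not part of the patch: with the new .rename handler wired up, moving a MON group between CTRL_MON groups becomes an ordinary rename(2) on the resctrl filesystem. The group names and paths below are hypothetical.

#include <stdio.h>

int main(void)
{
        /* Reparent MON group "m1" from CTRL_MON group "g1" to "g2". */
        if (rename("/sys/fs/resctrl/g1/mon_groups/m1",
                   "/sys/fs/resctrl/g2/mon_groups/m1"))
                perror("rename");
        return 0;
}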
 
index 2a0e90f..91fa70e 100644 (file)
@@ -755,6 +755,7 @@ static void sgx_mmu_notifier_release(struct mmu_notifier *mn,
 {
        struct sgx_encl_mm *encl_mm = container_of(mn, struct sgx_encl_mm, mmu_notifier);
        struct sgx_encl_mm *tmp = NULL;
+       bool found = false;
 
        /*
         * The enclave itself can remove encl_mm.  Note, objects can't be moved
@@ -764,12 +765,13 @@ static void sgx_mmu_notifier_release(struct mmu_notifier *mn,
        list_for_each_entry(tmp, &encl_mm->encl->mm_list, list) {
                if (tmp == encl_mm) {
                        list_del_rcu(&encl_mm->list);
+                       found = true;
                        break;
                }
        }
        spin_unlock(&encl_mm->encl->mm_lock);
 
-       if (tmp == encl_mm) {
+       if (found) {
                synchronize_srcu(&encl_mm->encl->srcu);
                mmu_notifier_put(mn);
        }
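
The SGX change above replaces the post-loop "tmp == encl_mm" test with an
explicit flag: list_for_each_entry()'s cursor is never NULL after the loop, so
using it outside the loop is the pattern the kernel has been moving away from.
A minimal sketch of the found-flag idiom (illustrative only, not the SGX code):

	#include <linux/list.h>
	#include <linux/types.h>

	struct item {
		struct list_head list;
	};

	/* Delete @victim from @head if present; report whether it was found. */
	static bool remove_if_present(struct list_head *head, struct item *victim)
	{
		struct item *it;
		bool found = false;

		list_for_each_entry(it, head, list) {
			if (it == victim) {
				list_del(&it->list);
				found = true;	/* record the hit instead of testing 'it' later */
				break;
			}
		}
		return found;
	}
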
index 21ca0a8..5d390df 100644 (file)
@@ -214,7 +214,7 @@ static int __sgx_encl_add_page(struct sgx_encl *encl,
        if (!(vma->vm_flags & VM_MAYEXEC))
                return -EACCES;
 
-       ret = get_user_pages(src, 1, 0, &src_page, NULL);
+       ret = get_user_pages(src, 1, 0, &src_page);
        if (ret < 1)
                return -EFAULT;
 
index 3b58d87..6eaf9a6 100644 (file)
@@ -9,6 +9,7 @@
 #include <asm/processor.h>
 #include <asm/desc.h>
 #include <asm/traps.h>
+#include <asm/doublefault.h>
 
 #define ptr_ok(x) ((x) > PAGE_OFFSET && (x) < PAGE_OFFSET + MAXMEM)
 
index 851eb13..998a08f 100644 (file)
@@ -53,7 +53,7 @@ void fpu__init_cpu(void)
        fpu__init_cpu_xstate();
 }
 
-static bool fpu__probe_without_cpuid(void)
+static bool __init fpu__probe_without_cpuid(void)
 {
        unsigned long cr0;
        u16 fsw, fcw;
@@ -71,7 +71,7 @@ static bool fpu__probe_without_cpuid(void)
        return fsw == 0 && (fcw & 0x103f) == 0x003f;
 }
 
-static void fpu__init_system_early_generic(struct cpuinfo_x86 *c)
+static void __init fpu__init_system_early_generic(void)
 {
        if (!boot_cpu_has(X86_FEATURE_CPUID) &&
            !test_bit(X86_FEATURE_FPU, (unsigned long *)cpu_caps_cleared)) {
@@ -211,10 +211,10 @@ static void __init fpu__init_system_xstate_size_legacy(void)
  * Called on the boot CPU once per system bootup, to set up the initial
  * FPU state that is later cloned into all processes:
  */
-void __init fpu__init_system(struct cpuinfo_x86 *c)
+void __init fpu__init_system(void)
 {
        fpstate_reset(&current->thread.fpu);
-       fpu__init_system_early_generic(c);
+       fpu__init_system_early_generic();
 
        /*
         * The FPU has to be operational for some of the
index 5e7ead5..01e8f34 100644 (file)
@@ -525,9 +525,6 @@ static void *addr_from_call(void *ptr)
        return ptr + CALL_INSN_SIZE + call.disp;
 }
 
-void prepare_ftrace_return(unsigned long ip, unsigned long *parent,
-                          unsigned long frame_pointer);
-
 /*
  * If the ops->trampoline was not allocated, then it probably
  * has a static trampoline func, or is the ftrace caller itself.
index 10c27b4..246a609 100644 (file)
@@ -69,6 +69,7 @@ asmlinkage __visible void __init __noreturn i386_start_kernel(void)
  * to the first kernel PMD. Note the upper half of each PMD or PTE is
  * always zero at this stage.
  */
+void __init mk_early_pgtbl_32(void);
 void __init mk_early_pgtbl_32(void)
 {
 #ifdef __pa
index 67c8ed9..c931899 100644 (file)
@@ -138,20 +138,6 @@ SYM_CODE_START(startup_32)
        jmp .Ldefault_entry
 SYM_CODE_END(startup_32)
 
-#ifdef CONFIG_HOTPLUG_CPU
-/*
- * Boot CPU0 entry point. It's called from play_dead(). Everything has been set
- * up already except stack. We just set up stack here. Then call
- * start_secondary().
- */
-SYM_FUNC_START(start_cpu0)
-       movl initial_stack, %ecx
-       movl %ecx, %esp
-       call *(initial_code)
-1:     jmp 1b
-SYM_FUNC_END(start_cpu0)
-#endif
-
 /*
  * Non-boot CPU entry point; entered from trampoline.S
  * We can't lgdt here, because lgdt itself uses a data segment, but
index 113c133..c5b9289 100644 (file)
@@ -24,7 +24,9 @@
 #include "../entry/calling.h"
 #include <asm/export.h>
 #include <asm/nospec-branch.h>
+#include <asm/apicdef.h>
 #include <asm/fixmap.h>
+#include <asm/smp.h>
 
 /*
  * We are not able to switch in one step to the final KERNEL ADDRESS SPACE
@@ -234,8 +236,67 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
        ANNOTATE_NOENDBR // above
 
 #ifdef CONFIG_SMP
+       /*
+        * For parallel boot, the APIC ID is read from the APIC, and then
+        * used to look up the CPU number.  For booting a single CPU, the
+        * CPU number is encoded in smpboot_control.
+        *
+        * Bit 31       STARTUP_READ_APICID (Read APICID from APIC)
+        * Bit 0-23     CPU# if STARTUP_xx flags are not set
+        */
        movl    smpboot_control(%rip), %ecx
+       testl   $STARTUP_READ_APICID, %ecx
+       jnz     .Lread_apicid
+       /*
+        * No control bit set, single CPU bringup. CPU number is provided
+        * in bit 0-23. This is also the boot CPU case (CPU number 0).
+        */
+       andl    $(~STARTUP_PARALLEL_MASK), %ecx
+       jmp     .Lsetup_cpu
 
+.Lread_apicid:
+       /* Check whether X2APIC mode is already enabled */
+       mov     $MSR_IA32_APICBASE, %ecx
+       rdmsr
+       testl   $X2APIC_ENABLE, %eax
+       jnz     .Lread_apicid_msr
+
+       /* Read the APIC ID from the fix-mapped MMIO space. */
+       movq    apic_mmio_base(%rip), %rcx
+       addq    $APIC_ID, %rcx
+       movl    (%rcx), %eax
+       shr     $24, %eax
+       jmp     .Llookup_AP
+
+.Lread_apicid_msr:
+       mov     $APIC_X2APIC_ID_MSR, %ecx
+       rdmsr
+
+.Llookup_AP:
+       /* EAX contains the APIC ID of the current CPU */
+       xorq    %rcx, %rcx
+       leaq    cpuid_to_apicid(%rip), %rbx
+
+.Lfind_cpunr:
+       cmpl    (%rbx,%rcx,4), %eax
+       jz      .Lsetup_cpu
+       inc     %ecx
+#ifdef CONFIG_FORCE_NR_CPUS
+       cmpl    $NR_CPUS, %ecx
+#else
+       cmpl    nr_cpu_ids(%rip), %ecx
+#endif
+       jb      .Lfind_cpunr
+
+       /*  APIC ID not found in the table. Drop the trampoline lock and bail. */
+       movq    trampoline_lock(%rip), %rax
+       movl    $0, (%rax)
+
+1:     cli
+       hlt
+       jmp     1b
+
+.Lsetup_cpu:
        /* Get the per cpu offset for the given CPU# which is in ECX */
        movq    __per_cpu_offset(,%rcx,8), %rdx
 #else
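
For readers not fluent in the assembly: when STARTUP_READ_APICID is clear the
CPU number is simply the low bits of smpboot_control; otherwise the APIC ID is
read (MMIO or MSR, depending on x2APIC mode) and translated to a CPU number by
a linear scan of cpuid_to_apicid[]. A rough C equivalent of the .Lfind_cpunr
loop (a sketch only, not in-tree code; the externs stand in for the kernel
symbols the assembly references):

	extern int cpuid_to_apicid[];
	extern unsigned int nr_cpu_ids;

	static int find_cpu_for_apicid(unsigned int apicid)
	{
		unsigned int cpu;

		/* Linear scan over 4-byte entries, as in .Lfind_cpunr. */
		for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
			if (cpuid_to_apicid[cpu] == apicid)
				return cpu;	/* ends up in ECX for .Lsetup_cpu */
		}
		return -1;	/* not found: drop the trampoline lock, then cli; hlt */
	}
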
@@ -252,6 +313,16 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
        movq    TASK_threadsp(%rax), %rsp
 
        /*
+        * Now that this CPU is running on its own stack, drop the realmode
+        * protection. For the boot CPU the pointer is NULL!
+        */
+       movq    trampoline_lock(%rip), %rax
+       testq   %rax, %rax
+       jz      .Lsetup_gdt
+       movl    $0, (%rax)
+
+.Lsetup_gdt:
+       /*
         * We must switch to a new descriptor in kernel space for the GDT
         * because soon the kernel won't have access anymore to the userspace
         * addresses where we're currently running on. We have to do that here
@@ -375,13 +446,13 @@ SYM_CODE_END(secondary_startup_64)
 #include "verify_cpu.S"
 #include "sev_verify_cbit.S"
 
-#ifdef CONFIG_HOTPLUG_CPU
+#if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_AMD_MEM_ENCRYPT)
 /*
- * Boot CPU0 entry point. It's called from play_dead(). Everything has been set
- * up already except stack. We just set up stack here. Then call
- * start_secondary() via .Ljump_to_C_code.
+ * Entry point for soft restart of a CPU. Invoked from xxx_play_dead() for
+ * restarting the boot CPU or for restarting SEV guest CPUs after CPU hot
+ * unplug. Everything is set up already except the stack.
  */
-SYM_CODE_START(start_cpu0)
+SYM_CODE_START(soft_restart_cpu)
        ANNOTATE_NOENDBR
        UNWIND_HINT_END_OF_STACK
 
@@ -390,7 +461,7 @@ SYM_CODE_START(start_cpu0)
        movq    TASK_threadsp(%rcx), %rsp
 
        jmp     .Ljump_to_C_code
-SYM_CODE_END(start_cpu0)
+SYM_CODE_END(soft_restart_cpu)
 #endif
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
@@ -433,6 +504,8 @@ SYM_DATA(initial_code,      .quad x86_64_start_kernel)
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 SYM_DATA(initial_vc_handler,   .quad handle_vc_boot_ghcb)
 #endif
+
+SYM_DATA(trampoline_lock, .quad 0);
        __FINITDATA
 
        __INIT
index 766ffe3..9f668d2 100644 (file)
@@ -211,6 +211,13 @@ u64 arch_irq_stat_cpu(unsigned int cpu)
 #ifdef CONFIG_X86_MCE_THRESHOLD
        sum += irq_stats(cpu)->irq_threshold_count;
 #endif
+#ifdef CONFIG_X86_HV_CALLBACK_VECTOR
+       sum += irq_stats(cpu)->irq_hv_callback_count;
+#endif
+#if IS_ENABLED(CONFIG_HYPERV)
+       sum += irq_stats(cpu)->irq_hv_reenlightenment_count;
+       sum += irq_stats(cpu)->hyperv_stimer0_count;
+#endif
 #ifdef CONFIG_X86_MCE
        sum += per_cpu(mce_exception_count, cpu);
        sum += per_cpu(mce_poll_count, cpu);
index 670eb08..ee4fe8c 100644 (file)
@@ -165,32 +165,19 @@ int arch_asym_cpu_priority(int cpu)
 
 /**
  * sched_set_itmt_core_prio() - Set CPU priority based on ITMT
- * @prio:      Priority of cpu core
- * @core_cpu:  The cpu number associated with the core
+ * @prio:      Priority of @cpu
+ * @cpu:       The CPU number
  *
  * The pstate driver will find out the max boost frequency
  * and call this function to set a priority proportional
- * to the max boost frequency. CPU with higher boost
+ * to the max boost frequency. CPUs with higher boost
  * frequency will receive higher priority.
  *
  * No need to rebuild sched domain after updating
  * the CPU priorities. The sched domains have no
  * dependency on CPU priorities.
  */
-void sched_set_itmt_core_prio(int prio, int core_cpu)
+void sched_set_itmt_core_prio(int prio, int cpu)
 {
-       int cpu, i = 1;
-
-       for_each_cpu(cpu, topology_sibling_cpumask(core_cpu)) {
-               int smt_prio;
-
-               /*
-                * Ensure that the siblings are moved to the end
-                * of the priority chain and only used when
-                * all other high priority cpus are out of capacity.
-                */
-               smt_prio = prio * smp_num_siblings / (i * i);
-               per_cpu(sched_core_priority, cpu) = smt_prio;
-               i++;
-       }
+       per_cpu(sched_core_priority, cpu) = prio;
 }
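
For reference, the removed scaling gave SMT siblings sharply decreasing
priorities: with prio = 100 and smp_num_siblings = 2, the first thread of a
core got 100 * 2 / (1 * 1) = 200 and its sibling 100 * 2 / (2 * 2) = 50. After
this change every CPU is simply assigned the priority the pstate driver passes
in; ordering between SMT siblings is presumably left to the scheduler, which
matches the x86_smt_flags() change further down that stops adding
x86_sched_itmt_flags() to the SMT domain flags.
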
index 0f35d44..fb8f521 100644 (file)
@@ -71,7 +71,7 @@ static int kvm_set_wallclock(const struct timespec64 *now)
        return -ENODEV;
 }
 
-static noinstr u64 kvm_clock_read(void)
+static u64 kvm_clock_read(void)
 {
        u64 ret;
 
@@ -88,7 +88,7 @@ static u64 kvm_clock_get_cycles(struct clocksource *cs)
 
 static noinstr u64 kvm_sched_clock_read(void)
 {
-       return kvm_clock_read() - kvm_sched_clock_offset;
+       return pvclock_clocksource_read_nowd(this_cpu_pvti()) - kvm_sched_clock_offset;
 }
 
 static inline void kvm_sched_clock_init(bool stable)
index 525876e..adc67f9 100644 (file)
@@ -367,8 +367,10 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
 
                va = (unsigned long)ldt_slot_va(ldt->slot) + offset;
                ptep = get_locked_pte(mm, va, &ptl);
-               pte_clear(mm, va, ptep);
-               pte_unmap_unlock(ptep, ptl);
+               if (!WARN_ON_ONCE(!ptep)) {
+                       pte_clear(mm, va, ptep);
+                       pte_unmap_unlock(ptep, ptl);
+               }
        }
 
        va = (unsigned long)ldt_slot_va(ldt->slot);
index 776f4b1..a0c5518 100644 (file)
@@ -496,7 +496,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
         */
        sev_es_nmi_complete();
        if (IS_ENABLED(CONFIG_NMI_CHECK_CPU))
-               arch_atomic_long_inc(&nsp->idt_calls);
+               raw_atomic_long_inc(&nsp->idt_calls);
 
        if (IS_ENABLED(CONFIG_SMP) && arch_cpu_is_offline(smp_processor_id()))
                return;
index b348a67..b525fe6 100644 (file)
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/kernel.h>
 #include <linux/init.h>
+#include <linux/pnp.h>
 
 #include <asm/setup.h>
 #include <asm/bios_ebda.h>
index dac41a0..ff9b80a 100644 (file)
@@ -759,15 +759,26 @@ bool xen_set_default_idle(void)
 }
 #endif
 
+struct cpumask cpus_stop_mask;
+
 void __noreturn stop_this_cpu(void *dummy)
 {
+       struct cpuinfo_x86 *c = this_cpu_ptr(&cpu_info);
+       unsigned int cpu = smp_processor_id();
+
        local_irq_disable();
+
        /*
-        * Remove this CPU:
+        * Remove this CPU from the online mask and disable it
+        * unconditionally. This might be redundant in case that the reboot
+        * vector was handled late and stop_other_cpus() sent an NMI.
+        *
+        * According to the SDM and APM, NMIs can be accepted even after soft
+        * disabling the local APIC.
         */
-       set_cpu_online(smp_processor_id(), false);
+       set_cpu_online(cpu, false);
        disable_local_APIC();
-       mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
+       mcheck_cpu_clear(c);
 
        /*
         * Use wbinvd on processors that support SME. This provides support
@@ -781,8 +792,17 @@ void __noreturn stop_this_cpu(void *dummy)
         * Test the CPUID bit directly because the machine might've cleared
         * X86_FEATURE_SME due to cmdline options.
         */
-       if (cpuid_eax(0x8000001f) & BIT(0))
+       if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
                native_wbinvd();
+
+       /*
+        * This brings a cache line back and dirties it, but
+        * native_stop_other_cpus() will overwrite cpus_stop_mask after it
+        * observed that all CPUs reported stop. This write will invalidate
+        * the related cache line on this CPU.
+        */
+       cpumask_clear_cpu(cpu, &cpus_stop_mask);
+
        for (;;) {
                /*
                 * Use native_halt() so that memory contents don't change
index 56acf53..b3f8137 100644 (file)
@@ -101,11 +101,11 @@ u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
         * updating at the same time, and one of them could be slightly behind,
         * making the assumption that last_value always go forward fail to hold.
         */
-       last = arch_atomic64_read(&last_value);
+       last = raw_atomic64_read(&last_value);
        do {
                if (ret <= last)
                        return last;
-       } while (!arch_atomic64_try_cmpxchg(&last_value, &last, ret));
+       } while (!raw_atomic64_try_cmpxchg(&last_value, &last, ret));
 
        return ret;
 }
index 16babff..fd975a4 100644 (file)
@@ -796,7 +796,6 @@ static void __init early_reserve_memory(void)
 
        memblock_x86_reserve_range_setup_data();
 
-       reserve_ibft_region();
        reserve_bios_regions();
        trim_snb_memory();
 }
@@ -1032,11 +1031,14 @@ void __init setup_arch(char **cmdline_p)
        if (efi_enabled(EFI_BOOT))
                efi_init();
 
+       reserve_ibft_region();
        dmi_setup();
 
        /*
         * VMware detection requires dmi to be available, so this
         * needs to be done after dmi_setup(), for the boot CPU.
+        * For some guest types (Xen PV, SEV-SNP, TDX) it is required to be
+        * called before cache_bp_init() for setting up MTRR state.
         */
        init_hypervisor_platform();
 
index 3a5b0c9..2eabccd 100644 (file)
@@ -12,6 +12,9 @@
 #ifndef __BOOT_COMPRESSED
 #define error(v)       pr_err(v)
 #define has_cpuflag(f) boot_cpu_has(f)
+#else
+#undef WARN
+#define WARN(condition, format...) (!!(condition))
 #endif
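
A note on the new #else branch: the boot/compressed environment has no
printk-backed WARN(), so WARN() is redefined to merely evaluate (and return)
its condition. That keeps the "if (WARN(...))" call sites shared with the
running kernel compilable in the decompressor, where only the kernel proper
actually logs anything.
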
 
 /* I/O parameters for CPUID-related helpers */
@@ -991,3 +994,103 @@ static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
                        cpuid_ext_range_max = fn->eax;
        }
 }
+
+static void pvalidate_pages(struct snp_psc_desc *desc)
+{
+       struct psc_entry *e;
+       unsigned long vaddr;
+       unsigned int size;
+       unsigned int i;
+       bool validate;
+       int rc;
+
+       for (i = 0; i <= desc->hdr.end_entry; i++) {
+               e = &desc->entries[i];
+
+               vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
+               size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+               validate = e->operation == SNP_PAGE_STATE_PRIVATE;
+
+               rc = pvalidate(vaddr, size, validate);
+               if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+                       unsigned long vaddr_end = vaddr + PMD_SIZE;
+
+                       for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
+                               rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+                               if (rc)
+                                       break;
+                       }
+               }
+
+               if (rc) {
+                       WARN(1, "Failed to validate address 0x%lx ret %d", vaddr, rc);
+                       sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
+               }
+       }
+}
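
The size-mismatch fallback above covers the case where the guest asks
PVALIDATE to validate a 2M range but the RMP holds 4K entries for it:
PVALIDATE_FAIL_SIZEMISMATCH is returned and the same 2M region is then
revalidated as 512 individual 4K pages.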
+
+static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
+{
+       int cur_entry, end_entry, ret = 0;
+       struct snp_psc_desc *data;
+       struct es_em_ctxt ctxt;
+
+       vc_ghcb_invalidate(ghcb);
+
+       /* Copy the input desc into GHCB shared buffer */
+       data = (struct snp_psc_desc *)ghcb->shared_buffer;
+       memcpy(ghcb->shared_buffer, desc, min_t(int, GHCB_SHARED_BUF_SIZE, sizeof(*desc)));
+
+       /*
+        * As per the GHCB specification, the hypervisor can resume the guest
+        * before processing all the entries. Check whether all the entries
+        * are processed. If not, then keep retrying. Note, the hypervisor
+        * will update the data memory directly to indicate the status, so
+        * reference the data->hdr everywhere.
+        *
+        * The strategy here is to wait for the hypervisor to change the page
+        * state in the RMP table before the guest accesses the memory pages. If the
+        * page state change was not successful, then later memory access will
+        * result in a crash.
+        */
+       cur_entry = data->hdr.cur_entry;
+       end_entry = data->hdr.end_entry;
+
+       while (data->hdr.cur_entry <= data->hdr.end_entry) {
+               ghcb_set_sw_scratch(ghcb, (u64)__pa(data));
+
+               /* This will advance 'cur_entry' in the shared buffer that 'data' points to. */
+               ret = sev_es_ghcb_hv_call(ghcb, &ctxt, SVM_VMGEXIT_PSC, 0, 0);
+
+               /*
+                * Page State Change VMGEXIT can pass error code through
+                * exit_info_2.
+                */
+               if (WARN(ret || ghcb->save.sw_exit_info_2,
+                        "SNP: PSC failed ret=%d exit_info_2=%llx\n",
+                        ret, ghcb->save.sw_exit_info_2)) {
+                       ret = 1;
+                       goto out;
+               }
+
+               /* Verify that reserved bit is not set */
+               if (WARN(data->hdr.reserved, "Reserved bit is set in the PSC header\n")) {
+                       ret = 1;
+                       goto out;
+               }
+
+               /*
+                * Sanity check that entry processing is not going backwards.
+                * This will happen only if hypervisor is tricking us.
+                */
+               if (WARN(data->hdr.end_entry > end_entry || cur_entry > data->hdr.cur_entry,
+"SNP: PSC processing going backward, end_entry %d (got %d) cur_entry %d (got %d)\n",
+                        end_entry, data->hdr.end_entry, cur_entry, data->hdr.cur_entry)) {
+                       ret = 1;
+                       goto out;
+               }
+       }
+
+out:
+       return ret;
+}
index b031244..1ee7bed 100644 (file)
@@ -113,13 +113,23 @@ struct ghcb_state {
 };
 
 static DEFINE_PER_CPU(struct sev_es_runtime_data*, runtime_data);
-DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
-
 static DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);
 
 struct sev_config {
        __u64 debug             : 1,
-             __reserved        : 63;
+
+             /*
+              * A flag used by __set_pages_state() that indicates when the
+              * per-CPU GHCB has been created and registered and thus can be
+              * used by the BSP instead of the early boot GHCB.
+              *
+              * For APs, the per-CPU GHCB is created before they are started
+              * and registered upon startup, so this flag can be used globally
+              * for the BSP and APs.
+              */
+             ghcbs_initialized : 1,
+
+             __reserved        : 62;
 };
 
 static struct sev_config sev_cfg __read_mostly;
@@ -645,32 +655,26 @@ static u64 __init get_jump_table_addr(void)
        return ret;
 }
 
-static void pvalidate_pages(unsigned long vaddr, unsigned int npages, bool validate)
-{
-       unsigned long vaddr_end;
-       int rc;
-
-       vaddr = vaddr & PAGE_MASK;
-       vaddr_end = vaddr + (npages << PAGE_SHIFT);
-
-       while (vaddr < vaddr_end) {
-               rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
-               if (WARN(rc, "Failed to validate address 0x%lx ret %d", vaddr, rc))
-                       sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
-
-               vaddr = vaddr + PAGE_SIZE;
-       }
-}
-
-static void __init early_set_pages_state(unsigned long paddr, unsigned int npages, enum psc_op op)
+static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
+                                 unsigned long npages, enum psc_op op)
 {
        unsigned long paddr_end;
        u64 val;
+       int ret;
+
+       vaddr = vaddr & PAGE_MASK;
 
        paddr = paddr & PAGE_MASK;
        paddr_end = paddr + (npages << PAGE_SHIFT);
 
        while (paddr < paddr_end) {
+               if (op == SNP_PAGE_STATE_SHARED) {
+                       /* Page validation must be rescinded before changing to shared */
+                       ret = pvalidate(vaddr, RMP_PG_SIZE_4K, false);
+                       if (WARN(ret, "Failed to validate address 0x%lx ret %d", paddr, ret))
+                               goto e_term;
+               }
+
                /*
                 * Use the MSR protocol because this function can be called before
                 * the GHCB is established.
@@ -691,7 +695,15 @@ static void __init early_set_pages_state(unsigned long paddr, unsigned int npage
                         paddr, GHCB_MSR_PSC_RESP_VAL(val)))
                        goto e_term;
 
-               paddr = paddr + PAGE_SIZE;
+               if (op == SNP_PAGE_STATE_PRIVATE) {
+                       /* Page validation must be performed after changing to private */
+                       ret = pvalidate(vaddr, RMP_PG_SIZE_4K, true);
+                       if (WARN(ret, "Failed to validate address 0x%lx ret %d", paddr, ret))
+                               goto e_term;
+               }
+
+               vaddr += PAGE_SIZE;
+               paddr += PAGE_SIZE;
        }
 
        return;
@@ -701,7 +713,7 @@ e_term:
 }
 
 void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
-                                        unsigned int npages)
+                                        unsigned long npages)
 {
        /*
         * This can be invoked in early boot while running identity mapped, so
@@ -716,14 +728,11 @@ void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long padd
          * Ask the hypervisor to mark the memory pages as private in the RMP
          * table.
          */
-       early_set_pages_state(paddr, npages, SNP_PAGE_STATE_PRIVATE);
-
-       /* Validate the memory pages after they've been added in the RMP table. */
-       pvalidate_pages(vaddr, npages, true);
+       early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_PRIVATE);
 }
 
 void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
-                                       unsigned int npages)
+                                       unsigned long npages)
 {
        /*
         * This can be invoked in early boot while running identity mapped, so
@@ -734,11 +743,8 @@ void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
        if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
                return;
 
-       /* Invalidate the memory pages before they are marked shared in the RMP table. */
-       pvalidate_pages(vaddr, npages, false);
-
         /* Ask hypervisor to mark the memory pages shared in the RMP table. */
-       early_set_pages_state(paddr, npages, SNP_PAGE_STATE_SHARED);
+       early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_SHARED);
 }
 
 void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op)
@@ -756,96 +762,16 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
                WARN(1, "invalid memory op %d\n", op);
 }
 
-static int vmgexit_psc(struct snp_psc_desc *desc)
+static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
+                                      unsigned long vaddr_end, int op)
 {
-       int cur_entry, end_entry, ret = 0;
-       struct snp_psc_desc *data;
        struct ghcb_state state;
-       struct es_em_ctxt ctxt;
-       unsigned long flags;
-       struct ghcb *ghcb;
-
-       /*
-        * __sev_get_ghcb() needs to run with IRQs disabled because it is using
-        * a per-CPU GHCB.
-        */
-       local_irq_save(flags);
-
-       ghcb = __sev_get_ghcb(&state);
-       if (!ghcb) {
-               ret = 1;
-               goto out_unlock;
-       }
-
-       /* Copy the input desc into GHCB shared buffer */
-       data = (struct snp_psc_desc *)ghcb->shared_buffer;
-       memcpy(ghcb->shared_buffer, desc, min_t(int, GHCB_SHARED_BUF_SIZE, sizeof(*desc)));
-
-       /*
-        * As per the GHCB specification, the hypervisor can resume the guest
-        * before processing all the entries. Check whether all the entries
-        * are processed. If not, then keep retrying. Note, the hypervisor
-        * will update the data memory directly to indicate the status, so
-        * reference the data->hdr everywhere.
-        *
-        * The strategy here is to wait for the hypervisor to change the page
-        * state in the RMP table before guest accesses the memory pages. If the
-        * page state change was not successful, then later memory access will
-        * result in a crash.
-        */
-       cur_entry = data->hdr.cur_entry;
-       end_entry = data->hdr.end_entry;
-
-       while (data->hdr.cur_entry <= data->hdr.end_entry) {
-               ghcb_set_sw_scratch(ghcb, (u64)__pa(data));
-
-               /* This will advance the shared buffer data points to. */
-               ret = sev_es_ghcb_hv_call(ghcb, &ctxt, SVM_VMGEXIT_PSC, 0, 0);
-
-               /*
-                * Page State Change VMGEXIT can pass error code through
-                * exit_info_2.
-                */
-               if (WARN(ret || ghcb->save.sw_exit_info_2,
-                        "SNP: PSC failed ret=%d exit_info_2=%llx\n",
-                        ret, ghcb->save.sw_exit_info_2)) {
-                       ret = 1;
-                       goto out;
-               }
-
-               /* Verify that reserved bit is not set */
-               if (WARN(data->hdr.reserved, "Reserved bit is set in the PSC header\n")) {
-                       ret = 1;
-                       goto out;
-               }
-
-               /*
-                * Sanity check that entry processing is not going backwards.
-                * This will happen only if hypervisor is tricking us.
-                */
-               if (WARN(data->hdr.end_entry > end_entry || cur_entry > data->hdr.cur_entry,
-"SNP: PSC processing going backward, end_entry %d (got %d) cur_entry %d (got %d)\n",
-                        end_entry, data->hdr.end_entry, cur_entry, data->hdr.cur_entry)) {
-                       ret = 1;
-                       goto out;
-               }
-       }
-
-out:
-       __sev_put_ghcb(&state);
-
-out_unlock:
-       local_irq_restore(flags);
-
-       return ret;
-}
-
-static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
-                             unsigned long vaddr_end, int op)
-{
+       bool use_large_entry;
        struct psc_hdr *hdr;
        struct psc_entry *e;
+       unsigned long flags;
        unsigned long pfn;
+       struct ghcb *ghcb;
        int i;
 
        hdr = &data->hdr;
@@ -854,74 +780,104 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
        memset(data, 0, sizeof(*data));
        i = 0;
 
-       while (vaddr < vaddr_end) {
-               if (is_vmalloc_addr((void *)vaddr))
+       while (vaddr < vaddr_end && i < ARRAY_SIZE(data->entries)) {
+               hdr->end_entry = i;
+
+               if (is_vmalloc_addr((void *)vaddr)) {
                        pfn = vmalloc_to_pfn((void *)vaddr);
-               else
+                       use_large_entry = false;
+               } else {
                        pfn = __pa(vaddr) >> PAGE_SHIFT;
+                       use_large_entry = true;
+               }
 
                e->gfn = pfn;
                e->operation = op;
-               hdr->end_entry = i;
 
-               /*
-                * Current SNP implementation doesn't keep track of the RMP page
-                * size so use 4K for simplicity.
-                */
-               e->pagesize = RMP_PG_SIZE_4K;
+               if (use_large_entry && IS_ALIGNED(vaddr, PMD_SIZE) &&
+                   (vaddr_end - vaddr) >= PMD_SIZE) {
+                       e->pagesize = RMP_PG_SIZE_2M;
+                       vaddr += PMD_SIZE;
+               } else {
+                       e->pagesize = RMP_PG_SIZE_4K;
+                       vaddr += PAGE_SIZE;
+               }
 
-               vaddr = vaddr + PAGE_SIZE;
                e++;
                i++;
        }
 
-       if (vmgexit_psc(data))
+       /* Page validation must be rescinded before changing to shared */
+       if (op == SNP_PAGE_STATE_SHARED)
+               pvalidate_pages(data);
+
+       local_irq_save(flags);
+
+       if (sev_cfg.ghcbs_initialized)
+               ghcb = __sev_get_ghcb(&state);
+       else
+               ghcb = boot_ghcb;
+
+       /* Invoke the hypervisor to perform the page state changes */
+       if (!ghcb || vmgexit_psc(ghcb, data))
                sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+       if (sev_cfg.ghcbs_initialized)
+               __sev_put_ghcb(&state);
+
+       local_irq_restore(flags);
+
+       /* Page validation must be performed after changing to private */
+       if (op == SNP_PAGE_STATE_PRIVATE)
+               pvalidate_pages(data);
+
+       return vaddr;
 }
 
-static void set_pages_state(unsigned long vaddr, unsigned int npages, int op)
+static void set_pages_state(unsigned long vaddr, unsigned long npages, int op)
 {
-       unsigned long vaddr_end, next_vaddr;
-       struct snp_psc_desc *desc;
+       struct snp_psc_desc desc;
+       unsigned long vaddr_end;
 
-       desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
-       if (!desc)
-               panic("SNP: failed to allocate memory for PSC descriptor\n");
+       /* Use the MSR protocol when a GHCB is not available. */
+       if (!boot_ghcb)
+               return early_set_pages_state(vaddr, __pa(vaddr), npages, op);
 
        vaddr = vaddr & PAGE_MASK;
        vaddr_end = vaddr + (npages << PAGE_SHIFT);
 
-       while (vaddr < vaddr_end) {
-               /* Calculate the last vaddr that fits in one struct snp_psc_desc. */
-               next_vaddr = min_t(unsigned long, vaddr_end,
-                                  (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
-
-               __set_pages_state(desc, vaddr, next_vaddr, op);
-
-               vaddr = next_vaddr;
-       }
-
-       kfree(desc);
+       while (vaddr < vaddr_end)
+               vaddr = __set_pages_state(&desc, vaddr, vaddr_end, op);
 }
 
-void snp_set_memory_shared(unsigned long vaddr, unsigned int npages)
+void snp_set_memory_shared(unsigned long vaddr, unsigned long npages)
 {
        if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
                return;
 
-       pvalidate_pages(vaddr, npages, false);
-
        set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED);
 }
 
-void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
+void snp_set_memory_private(unsigned long vaddr, unsigned long npages)
 {
        if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
                return;
 
        set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+}
+
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+       unsigned long vaddr;
+       unsigned int npages;
+
+       if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+               return;
+
+       vaddr = (unsigned long)__va(start);
+       npages = (end - start) >> PAGE_SHIFT;
 
-       pvalidate_pages(vaddr, npages, true);
+       set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
 }
 
 static int snp_set_vmsa(void *va, bool vmsa)
@@ -1267,6 +1223,8 @@ void setup_ghcb(void)
                if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
                        snp_register_per_cpu_ghcb();
 
+               sev_cfg.ghcbs_initialized = true;
+
                return;
        }
 
@@ -1328,7 +1286,7 @@ static void sev_es_play_dead(void)
         * If we get here, the VCPU was woken up again. Jump to CPU
         * startup code to get it back online.
         */
-       start_cpu0();
+       soft_restart_cpu();
 }
 #else  /* CONFIG_HOTPLUG_CPU */
 #define sev_es_play_dead       native_play_dead
@@ -1395,9 +1353,6 @@ void __init sev_es_init_vc_handling(void)
                        sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
        }
 
-       /* Enable SEV-ES special handling */
-       static_branch_enable(&sev_es_enable_key);
-
        /* Initialize per-cpu GHCB pages */
        for_each_possible_cpu(cpu) {
                alloc_runtime_data(cpu);
index 004cb30..cfeec3e 100644 (file)
@@ -182,7 +182,7 @@ get_sigframe(struct ksignal *ksig, struct pt_regs *regs, size_t frame_size,
 static unsigned long __ro_after_init max_frame_size;
 static unsigned int __ro_after_init fpu_default_state_size;
 
-void __init init_sigframe_size(void)
+static int __init init_sigframe_size(void)
 {
        fpu_default_state_size = fpu__get_fpstate_size();
 
@@ -194,7 +194,9 @@ void __init init_sigframe_size(void)
        max_frame_size = round_up(max_frame_size, FRAME_ALIGNMENT);
 
        pr_info("max sigframe size: %lu\n", max_frame_size);
+       return 0;
 }
+early_initcall(init_sigframe_size);
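
Turning init_sigframe_size() into an early_initcall() lets the init machinery
run it automatically during boot, so the explicit call site and the public
prototype can go away; the computed max_frame_size only needs to be in place
before the first user-space signal can be delivered.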
 
 unsigned long get_sigframe_size(void)
 {
index 375b33e..7eb18ca 100644 (file)
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/kexec.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 #include <asm/proto.h>
 #include <asm/apic.h>
+#include <asm/cpu.h>
 #include <asm/idtentry.h>
 #include <asm/nmi.h>
 #include <asm/mce.h>
@@ -129,7 +131,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
 }
 
 /*
- * this function calls the 'stop' function on all other CPUs in the system.
+ * Disable virtualization, APIC etc. and park the CPU in a HLT loop
  */
 DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
 {
@@ -146,61 +148,96 @@ static int register_stop_handler(void)
 
 static void native_stop_other_cpus(int wait)
 {
-       unsigned long flags;
-       unsigned long timeout;
+       unsigned int cpu = smp_processor_id();
+       unsigned long flags, timeout;
 
        if (reboot_force)
                return;
 
-       /*
-        * Use an own vector here because smp_call_function
-        * does lots of things not suitable in a panic situation.
-        */
+       /* Only proceed if this is the first CPU to reach this code */
+       if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
+               return;
+
+       /* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
+       if (kexec_in_progress)
+               smp_kick_mwait_play_dead();
 
        /*
-        * We start by using the REBOOT_VECTOR irq.
-        * The irq is treated as a sync point to allow critical
-        * regions of code on other cpus to release their spin locks
-        * and re-enable irqs.  Jumping straight to an NMI might
-        * accidentally cause deadlocks with further shutdown/panic
-        * code.  By syncing, we give the cpus up to one second to
-        * finish their work before we force them off with the NMI.
+        * 1) Send an IPI on the reboot vector to all other CPUs.
+        *
+        *    The other CPUs should react on it after leaving critical
+        *    sections and re-enabling interrupts. They might still hold
+        *    locks, but there is nothing which can be done about that.
+        *
+        * 2) Wait for all other CPUs to report that they reached the
+        *    HLT loop in stop_this_cpu()
+        *
+        * 3) If the system uses INIT/STARTUP for CPU bringup, then
+        *    send all present CPUs an INIT vector, which brings them
+        *    completely out of the way.
+        *
+        * 4) If #3 is not possible and #2 timed out send an NMI to the
+        *    CPUs which did not yet report
+        *
+        * 5) Wait for all other CPUs to report that they reached the
+        *    HLT loop in stop_this_cpu()
+        *
+        * #4 can obviously race against a CPU reaching the HLT loop late.
+        * That CPU will have reported already and the "have all CPUs
+        * reached HLT" condition will be true despite the fact that the
+        * other CPU is still handling the NMI. Again, there is no
+        * protection against that as "disabled" APICs still respond to
+        * NMIs.
         */
-       if (num_online_cpus() > 1) {
-               /* did someone beat us here? */
-               if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
-                       return;
-
-               /* sync above data before sending IRQ */
-               wmb();
+       cpumask_copy(&cpus_stop_mask, cpu_online_mask);
+       cpumask_clear_cpu(cpu, &cpus_stop_mask);
 
+       if (!cpumask_empty(&cpus_stop_mask)) {
                apic_send_IPI_allbutself(REBOOT_VECTOR);
 
                /*
                 * Don't wait longer than a second for IPI completion. The
                 * wait request is not checked here because that would
-                * prevent an NMI shutdown attempt in case that not all
+                * prevent an NMI/INIT shutdown in case that not all
                 * CPUs reach shutdown state.
                 */
                timeout = USEC_PER_SEC;
-               while (num_online_cpus() > 1 && timeout--)
+               while (!cpumask_empty(&cpus_stop_mask) && timeout--)
                        udelay(1);
        }
 
-       /* if the REBOOT_VECTOR didn't work, try with the NMI */
-       if (num_online_cpus() > 1) {
+       /*
+        * Park all other CPUs in INIT including "offline" CPUs, if
+        * possible. That's a safe place where they can't resume execution
+        * of HLT and then execute the HLT loop from overwritten text or
+        * page tables.
+        *
+        * The only downside is a broadcast MCE, but up to the point where
+        * the kexec() kernel brought all APs online again an MCE will just
+        * make HLT resume and handle the MCE. The machine crashes and burns
+        * due to overwritten text, page tables and data. So there is a
+        * choice between fire and frying pan. The result is pretty much
+        * the same. Chose frying pan until x86 provides a sane mechanism
+        * the same. Choose frying pan until x86 provides a sane mechanism
+        */
+       if (smp_park_other_cpus_in_init())
+               goto done;
+
+       /*
+        * If park with INIT was not possible and the REBOOT_VECTOR didn't
+        * take all secondary CPUs offline, try with the NMI.
+        */
+       if (!cpumask_empty(&cpus_stop_mask)) {
                /*
                 * If NMI IPI is enabled, try to register the stop handler
                 * and send the IPI. In any case try to wait for the other
                 * CPUs to stop.
                 */
                if (!smp_no_nmi_ipi && !register_stop_handler()) {
-                       /* Sync above data before sending IRQ */
-                       wmb();
-
                        pr_emerg("Shutting down cpus with NMI\n");
 
-                       apic_send_IPI_allbutself(NMI_VECTOR);
+                       for_each_cpu(cpu, &cpus_stop_mask)
+                               apic->send_IPI(cpu, NMI_VECTOR);
                }
                /*
                 * Don't wait longer than 10 ms if the caller didn't
@@ -208,14 +245,21 @@ static void native_stop_other_cpus(int wait)
                 * one or more CPUs do not reach shutdown state.
                 */
                timeout = USEC_PER_MSEC * 10;
-               while (num_online_cpus() > 1 && (wait || timeout--))
+               while (!cpumask_empty(&cpus_stop_mask) && (wait || timeout--))
                        udelay(1);
        }
 
+done:
        local_irq_save(flags);
        disable_local_APIC();
        mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
        local_irq_restore(flags);
+
+       /*
+        * Ensure that the cpus_stop_mask cache lines are invalidated on
+        * the other CPUs. See comment vs. SME in stop_this_cpu().
+        */
+       cpumask_clear(&cpus_stop_mask);
 }
 
 /*
@@ -268,8 +312,7 @@ struct smp_ops smp_ops = {
 #endif
        .smp_send_reschedule    = native_smp_send_reschedule,
 
-       .cpu_up                 = native_cpu_up,
-       .cpu_die                = native_cpu_die,
+       .kick_ap_alive          = native_kick_ap,
        .cpu_disable            = native_cpu_disable,
        .play_dead              = native_play_dead,
 
index 352f0ce..ed2d519 100644 (file)
 #include <linux/tboot.h>
 #include <linux/gfp.h>
 #include <linux/cpuidle.h>
+#include <linux/kexec.h>
 #include <linux/numa.h>
 #include <linux/pgtable.h>
 #include <linux/overflow.h>
 #include <linux/stackprotector.h>
+#include <linux/cpuhotplug.h>
+#include <linux/mc146818rtc.h>
 
 #include <asm/acpi.h>
 #include <asm/cacheinfo.h>
@@ -74,7 +77,7 @@
 #include <asm/fpu/api.h>
 #include <asm/setup.h>
 #include <asm/uv/uv.h>
-#include <linux/mc146818rtc.h>
+#include <asm/microcode.h>
 #include <asm/i8259.h>
 #include <asm/misc.h>
 #include <asm/qspinlock.h>
@@ -101,6 +104,26 @@ EXPORT_PER_CPU_SYMBOL(cpu_die_map);
 DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
 
+/* CPUs which are the primary SMT threads */
+struct cpumask __cpu_primary_thread_mask __read_mostly;
+
+/* Representing CPUs for which sibling maps can be computed */
+static cpumask_var_t cpu_sibling_setup_mask;
+
+struct mwait_cpu_dead {
+       unsigned int    control;
+       unsigned int    status;
+};
+
+#define CPUDEAD_MWAIT_WAIT     0xDEADBEEF
+#define CPUDEAD_MWAIT_KEXEC_HLT        0x4A17DEAD
+
+/*
+ * Cache line aligned data for mwait_play_dead(). Separate on purpose so
+ * that it's unlikely to be touched by other CPUs.
+ */
+static DEFINE_PER_CPU_ALIGNED(struct mwait_cpu_dead, mwait_cpu_dead);
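
These two per-CPU words are presumably the cache line that a CPU parked in
mwait_play_dead() arms with MONITOR: on kexec, smp_kick_mwait_play_dead()
(called from native_stop_other_cpus() when kexec_in_progress) writes
CPUDEAD_MWAIT_KEXEC_HLT into ->control, which satisfies the monitor and lets
the parked CPU switch from MWAIT to HLT before the new kernel overwrites the
old text. The "mop up" writes in ap_starting() clear both words again when the
CPU is brought back online.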
+
 /* Logical package management. We might want to allocate that dynamically */
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
@@ -121,7 +144,6 @@ int arch_update_cpu_topology(void)
        return retval;
 }
 
-
 static unsigned int smpboot_warm_reset_vector_count;
 
 static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip)
@@ -154,66 +176,63 @@ static inline void smpboot_restore_warm_reset_vector(void)
 
 }
 
-/*
- * Report back to the Boot Processor during boot time or to the caller processor
- * during CPU online.
- */
-static void smp_callin(void)
+/* Run the next set of setup steps for the upcoming CPU */
+static void ap_starting(void)
 {
-       int cpuid;
+       int cpuid = smp_processor_id();
 
-       /*
-        * If waken up by an INIT in an 82489DX configuration
-        * cpu_callout_mask guarantees we don't get here before
-        * an INIT_deassert IPI reaches our local APIC, so it is
-        * now safe to touch our local APIC.
-        */
-       cpuid = smp_processor_id();
+       /* Mop up possible mwait_play_dead() wreckage */
+       this_cpu_write(mwait_cpu_dead.status, 0);
+       this_cpu_write(mwait_cpu_dead.control, 0);
 
        /*
-        * the boot CPU has finished the init stage and is spinning
-        * on callin_map until we finish. We are free to set up this
-        * CPU, first the APIC. (this is probably redundant on most
-        * boards)
+        * If woken up by an INIT in an 82489DX configuration the alive
+        * synchronization guarantees that the CPU does not reach this
+        * point before an INIT_deassert IPI reaches the local APIC, so it
+        * is now safe to touch the local APIC.
+        *
+        * Set up this CPU, first the APIC, which is probably redundant on
+        * most boards.
         */
        apic_ap_setup();
 
-       /*
-        * Save our processor parameters. Note: this information
-        * is needed for clock calibration.
-        */
+       /* Save the processor parameters. */
        smp_store_cpu_info(cpuid);
 
        /*
         * The topology information must be up to date before
-        * calibrate_delay() and notify_cpu_starting().
+        * notify_cpu_starting().
         */
-       set_cpu_sibling_map(raw_smp_processor_id());
+       set_cpu_sibling_map(cpuid);
 
        ap_init_aperfmperf();
 
-       /*
-        * Get our bogomips.
-        * Update loops_per_jiffy in cpu_data. Previous call to
-        * smp_store_cpu_info() stored a value that is close but not as
-        * accurate as the value just calculated.
-        */
-       calibrate_delay();
-       cpu_data(cpuid).loops_per_jiffy = loops_per_jiffy;
        pr_debug("Stack at about %p\n", &cpuid);
 
        wmb();
 
+       /*
+        * This runs the AP through all the cpuhp states to its target
+        * state CPUHP_ONLINE.
+        */
        notify_cpu_starting(cpuid);
+}
 
+static void ap_calibrate_delay(void)
+{
        /*
-        * Allow the master to continue.
+        * Calibrate the delay loop and update loops_per_jiffy in cpu_data.
+        * smp_store_cpu_info() stored a value that is close but not as
+        * accurate as the value just calculated.
+        *
+        * As this is invoked after the TSC synchronization check,
+        * calibrate_delay_is_known() will skip the calibration routine
+        * when TSC is synchronized across sockets.
         */
-       cpumask_set_cpu(cpuid, cpu_callin_mask);
+       calibrate_delay();
+       cpu_data(smp_processor_id()).loops_per_jiffy = loops_per_jiffy;
 }
 
-static int cpu0_logical_apicid;
-static int enable_start_cpu0;
 /*
  * Activate a secondary processor.
  */
@@ -226,24 +245,63 @@ static void notrace start_secondary(void *unused)
         */
        cr4_init();
 
-#ifdef CONFIG_X86_32
-       /* switch away from the initial page table */
-       load_cr3(swapper_pg_dir);
-       __flush_tlb_all();
-#endif
-       cpu_init_secondary();
+       /*
+        * 32-bit specific. 64-bit reaches this code with the correct page
+        * table established. Yet another historical divergence.
+        */
+       if (IS_ENABLED(CONFIG_X86_32)) {
+               /* switch away from the initial page table */
+               load_cr3(swapper_pg_dir);
+               __flush_tlb_all();
+       }
+
+       cpu_init_exception_handling();
+
+       /*
+        * 32-bit systems load the microcode from the ASM startup code for
+        * historical reasons.
+        *
+        * On 64-bit systems load it before reaching the AP alive
+        * synchronization point below so it is not part of the full per
+        * synchronization point below, so it is not part of the fully
+        * serialized per-CPU bringup when "parallel" bringup is enabled.
+        * That's even safe when hyperthreading is enabled in the CPU as
+        * the core code starts the primary threads first and leaves the
+        * secondary threads waiting for SIPI. Loading microcode on
+        * physical cores concurrently is a safe operation.
+        *
+        * This covers both the Intel specific issue that concurrent
+        * microcode loading on SMT siblings must be prohibited and the
+        * vendor independent issue that microcode loading which changes
+        * CPUID, MSRs etc. must be strictly serialized to maintain
+        * software state correctness.
+        */
+       if (IS_ENABLED(CONFIG_X86_64))
+               load_ucode_ap();
+
+       /*
+        * Synchronization point with the hotplug core. Sets this CPUs
+        * synchronization state to ALIVE and spin-waits for the control CPU to
+        * release this CPU for further bringup.
+        */
+       cpuhp_ap_sync_alive();
+
+       cpu_init();
+       fpu__init_cpu();
        rcu_cpu_starting(raw_smp_processor_id());
        x86_cpuinit.early_percpu_clock_init();
-       smp_callin();
 
-       enable_start_cpu0 = 0;
+       ap_starting();
+
+       /* Check TSC synchronization with the control CPU. */
+       check_tsc_sync_target();
 
-       /* otherwise gcc will move up smp_processor_id before the cpu_init */
-       barrier();
        /*
-        * Check TSC synchronization with the boot CPU:
+        * Calibrate the delay loop after the TSC synchronization check.
+        * This allows to skip the calibration when TSC is synchronized
+        * across sockets.
         */
-       check_tsc_sync_target();
+       ap_calibrate_delay();
 
        speculative_store_bypass_ht_init();
 
@@ -257,7 +315,6 @@ static void notrace start_secondary(void *unused)
        set_cpu_online(smp_processor_id(), true);
        lapic_online();
        unlock_vector_lock();
-       cpu_set_state_online(smp_processor_id());
        x86_platform.nmi_init();
 
        /* enable local interrupts */
@@ -270,15 +327,6 @@ static void notrace start_secondary(void *unused)
 }
 
 /**
- * topology_is_primary_thread - Check whether CPU is the primary SMT thread
- * @cpu:       CPU to check
- */
-bool topology_is_primary_thread(unsigned int cpu)
-{
-       return apic_id_is_primary_thread(per_cpu(x86_cpu_to_apicid, cpu));
-}
-
-/**
  * topology_smt_supported - Check whether SMT is supported by the CPUs
  */
 bool topology_smt_supported(void)
@@ -288,6 +336,7 @@ bool topology_smt_supported(void)
 
 /**
  * topology_phys_to_logical_pkg - Map a physical package id to a logical
+ * @phys_pkg:  The physical package id to map
  *
  * Returns logical package id or -1 if not found
  */
@@ -304,15 +353,17 @@ int topology_phys_to_logical_pkg(unsigned int phys_pkg)
        return -1;
 }
 EXPORT_SYMBOL(topology_phys_to_logical_pkg);
+
 /**
  * topology_phys_to_logical_die - Map a physical die id to logical
+ * @die_id:    The physical die id to map
+ * @cur_cpu:   The CPU for which the mapping is done
  *
  * Returns logical die id or -1 if not found
  */
-int topology_phys_to_logical_die(unsigned int die_id, unsigned int cur_cpu)
+static int topology_phys_to_logical_die(unsigned int die_id, unsigned int cur_cpu)
 {
-       int cpu;
-       int proc_id = cpu_data(cur_cpu).phys_proc_id;
+       int cpu, proc_id = cpu_data(cur_cpu).phys_proc_id;
 
        for_each_possible_cpu(cpu) {
                struct cpuinfo_x86 *c = &cpu_data(cpu);
@@ -323,7 +374,6 @@ int topology_phys_to_logical_die(unsigned int die_id, unsigned int cur_cpu)
        }
        return -1;
 }
-EXPORT_SYMBOL(topology_phys_to_logical_die);
 
 /**
  * topology_update_package_map - Update the physical to logical package map
@@ -398,7 +448,7 @@ void smp_store_cpu_info(int id)
        c->cpu_index = id;
        /*
         * During boot time, CPU0 has this setup already. Save the info when
-        * bringing up AP or offlined CPU0.
+        * bringing up an AP.
         */
        identify_secondary_cpu(c);
        c->initialized = true;
@@ -552,7 +602,7 @@ static int x86_core_flags(void)
 #ifdef CONFIG_SCHED_SMT
 static int x86_smt_flags(void)
 {
-       return cpu_smt_flags() | x86_sched_itmt_flags();
+       return cpu_smt_flags();
 }
 #endif
 #ifdef CONFIG_SCHED_CLUSTER
@@ -563,50 +613,57 @@ static int x86_cluster_flags(void)
 #endif
 #endif
 
-static struct sched_domain_topology_level x86_numa_in_package_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-       { cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
-#endif
-#ifdef CONFIG_SCHED_CLUSTER
-       { cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
-#endif
-#ifdef CONFIG_SCHED_MC
-       { cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
-#endif
-       { NULL, },
-};
+/*
+ * Set if a package/die has multiple NUMA nodes inside.
+ * AMD Magny-Cours, Intel Cluster-on-Die, and Intel
+ * Sub-NUMA Clustering have this.
+ */
+static bool x86_has_numa_in_package;
 
-static struct sched_domain_topology_level x86_hybrid_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-       { cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
-#endif
-#ifdef CONFIG_SCHED_MC
-       { cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
-#endif
-       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
-       { NULL, },
-};
+static struct sched_domain_topology_level x86_topology[6];
+
+static void __init build_sched_topology(void)
+{
+       int i = 0;
 
-static struct sched_domain_topology_level x86_topology[] = {
 #ifdef CONFIG_SCHED_SMT
-       { cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
+       x86_topology[i++] = (struct sched_domain_topology_level){
+               cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT)
+       };
 #endif
 #ifdef CONFIG_SCHED_CLUSTER
-       { cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+       /*
+        * For now, skip the cluster domain on Hybrid.
+        */
+       if (!cpu_feature_enabled(X86_FEATURE_HYBRID_CPU)) {
+               x86_topology[i++] = (struct sched_domain_topology_level){
+                       cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS)
+               };
+       }
 #endif
 #ifdef CONFIG_SCHED_MC
-       { cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
+       x86_topology[i++] = (struct sched_domain_topology_level){
+               cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC)
+       };
 #endif
-       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
-       { NULL, },
-};
+       /*
+        * When there is NUMA topology inside the package skip the DIE domain
+        * since the NUMA domains will auto-magically create the right spanning
+        * domains based on the SLIT.
+        */
+       if (!x86_has_numa_in_package) {
+               x86_topology[i++] = (struct sched_domain_topology_level){
+                       cpu_cpu_mask, SD_INIT_NAME(DIE)
+               };
+       }
 
-/*
- * Set if a package/die has multiple NUMA nodes inside.
- * AMD Magny-Cours, Intel Cluster-on-Die, and Intel
- * Sub-NUMA Clustering have this.
- */
-static bool x86_has_numa_in_package;
+       /*
+        * There must be one trailing NULL entry left.
+        */
+       BUG_ON(i >= ARRAY_SIZE(x86_topology)-1);
+
+       set_sched_topology(x86_topology);
+}
 
 void set_cpu_sibling_map(int cpu)
 {
@@ -706,9 +763,9 @@ static void impress_friends(void)
         * Allow the user to impress friends.
         */
        pr_debug("Before bogomips\n");
-       for_each_possible_cpu(cpu)
-               if (cpumask_test_cpu(cpu, cpu_callout_mask))
-                       bogosum += cpu_data(cpu).loops_per_jiffy;
+       for_each_online_cpu(cpu)
+               bogosum += cpu_data(cpu).loops_per_jiffy;
+
        pr_info("Total of %d processors activated (%lu.%02lu BogoMIPS)\n",
                num_online_cpus(),
                bogosum/(500000/HZ),
@@ -795,86 +852,42 @@ static void __init smp_quirk_init_udelay(void)
 }
 
 /*
- * Poke the other CPU in the eye via NMI to wake it up. Remember that the normal
- * INIT, INIT, STARTUP sequence will reset the chip hard for us, and this
- * won't ... remember to clear down the APIC, etc later.
+ * Wake up AP by INIT, INIT, STARTUP sequence.
  */
-int
-wakeup_secondary_cpu_via_nmi(int apicid, unsigned long start_eip)
+static void send_init_sequence(int phys_apicid)
 {
-       u32 dm = apic->dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
-       unsigned long send_status, accept_status = 0;
-       int maxlvt;
+       int maxlvt = lapic_get_maxlvt();
 
-       /* Target chip */
-       /* Boot on the stack */
-       /* Kick the second */
-       apic_icr_write(APIC_DM_NMI | dm, apicid);
-
-       pr_debug("Waiting for send to finish...\n");
-       send_status = safe_apic_wait_icr_idle();
-
-       /*
-        * Give the other CPU some time to accept the IPI.
-        */
-       udelay(200);
+       /* Be paranoid about clearing APIC errors. */
        if (APIC_INTEGRATED(boot_cpu_apic_version)) {
-               maxlvt = lapic_get_maxlvt();
-               if (maxlvt > 3)                 /* Due to the Pentium erratum 3AP.  */
+               /* Due to the Pentium erratum 3AP.  */
+               if (maxlvt > 3)
                        apic_write(APIC_ESR, 0);
-               accept_status = (apic_read(APIC_ESR) & 0xEF);
+               apic_read(APIC_ESR);
        }
-       pr_debug("NMI sent\n");
 
-       if (send_status)
-               pr_err("APIC never delivered???\n");
-       if (accept_status)
-               pr_err("APIC delivery error (%lx)\n", accept_status);
+       /* Assert INIT on the target CPU */
+       apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, phys_apicid);
+       safe_apic_wait_icr_idle();
 
-       return (send_status | accept_status);
+       udelay(init_udelay);
+
+       /* Deassert INIT on the target CPU */
+       apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
+       safe_apic_wait_icr_idle();
 }
 
-static int
-wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
+/*
+ * Wake up AP by INIT, INIT, STARTUP sequence.
+ */
+static int wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
 {
        unsigned long send_status = 0, accept_status = 0;
-       int maxlvt, num_starts, j;
+       int num_starts, j, maxlvt;
 
+       preempt_disable();
        maxlvt = lapic_get_maxlvt();
-
-       /*
-        * Be paranoid about clearing APIC errors.
-        */
-       if (APIC_INTEGRATED(boot_cpu_apic_version)) {
-               if (maxlvt > 3)         /* Due to the Pentium erratum 3AP.  */
-                       apic_write(APIC_ESR, 0);
-               apic_read(APIC_ESR);
-       }
-
-       pr_debug("Asserting INIT\n");
-
-       /*
-        * Turn INIT on target chip
-        */
-       /*
-        * Send IPI
-        */
-       apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
-                      phys_apicid);
-
-       pr_debug("Waiting for send to finish...\n");
-       send_status = safe_apic_wait_icr_idle();
-
-       udelay(init_udelay);
-
-       pr_debug("Deasserting INIT\n");
-
-       /* Target chip */
-       /* Send IPI */
-       apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
-
-       pr_debug("Waiting for send to finish...\n");
-       send_status = safe_apic_wait_icr_idle();
+       send_init_sequence(phys_apicid);
 
        mb();
 
@@ -945,15 +958,16 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
        if (accept_status)
                pr_err("APIC delivery error (%lx)\n", accept_status);
 
+       preempt_enable();
        return (send_status | accept_status);
 }
 
 /* reduce the number of lines printed when booting a large cpu count system */
 static void announce_cpu(int cpu, int apicid)
 {
+       static int width, node_width, first = 1;
        static int current_node = NUMA_NO_NODE;
        int node = early_cpu_to_node(cpu);
-       static int width, node_width;
 
        if (!width)
                width = num_digits(num_possible_cpus()) + 1; /* + '#' sign */
@@ -961,10 +975,10 @@ static void announce_cpu(int cpu, int apicid)
        if (!node_width)
                node_width = num_digits(num_possible_nodes()) + 1; /* + '#' */
 
-       if (cpu == 1)
-               printk(KERN_INFO "x86: Booting SMP configuration:\n");
-
        if (system_state < SYSTEM_RUNNING) {
+               if (first)
+                       pr_info("x86: Booting SMP configuration:\n");
+
                if (node != current_node) {
                        if (current_node > (-1))
                                pr_cont("\n");
@@ -975,77 +989,16 @@ static void announce_cpu(int cpu, int apicid)
                }
 
                /* Add padding for the BSP */
-               if (cpu == 1)
+               if (first)
                        pr_cont("%*s", width + 1, " ");
+               first = 0;
 
                pr_cont("%*s#%d", width - num_digits(cpu), " ", cpu);
-
        } else
                pr_info("Booting Node %d Processor %d APIC 0x%x\n",
                        node, cpu, apicid);
 }
 
-static int wakeup_cpu0_nmi(unsigned int cmd, struct pt_regs *regs)
-{
-       int cpu;
-
-       cpu = smp_processor_id();
-       if (cpu == 0 && !cpu_online(cpu) && enable_start_cpu0)
-               return NMI_HANDLED;
-
-       return NMI_DONE;
-}
-
-/*
- * Wake up AP by INIT, INIT, STARTUP sequence.
- *
- * Instead of waiting for STARTUP after INITs, BSP will execute the BIOS
- * boot-strap code which is not a desired behavior for waking up BSP. To
- * void the boot-strap code, wake up CPU0 by NMI instead.
- *
- * This works to wake up soft offlined CPU0 only. If CPU0 is hard offlined
- * (i.e. physically hot removed and then hot added), NMI won't wake it up.
- * We'll change this code in the future to wake up hard offlined CPU0 if
- * real platform and request are available.
- */
-static int
-wakeup_cpu_via_init_nmi(int cpu, unsigned long start_ip, int apicid,
-              int *cpu0_nmi_registered)
-{
-       int id;
-       int boot_error;
-
-       preempt_disable();
-
-       /*
-        * Wake up AP by INIT, INIT, STARTUP sequence.
-        */
-       if (cpu) {
-               boot_error = wakeup_secondary_cpu_via_init(apicid, start_ip);
-               goto out;
-       }
-
-       /*
-        * Wake up BSP by nmi.
-        *
-        * Register a NMI handler to help wake up CPU0.
-        */
-       boot_error = register_nmi_handler(NMI_LOCAL,
-                                         wakeup_cpu0_nmi, 0, "wake_cpu0");
-
-       if (!boot_error) {
-               enable_start_cpu0 = 1;
-               *cpu0_nmi_registered = 1;
-               id = apic->dest_mode_logical ? cpu0_logical_apicid : apicid;
-               boot_error = wakeup_secondary_cpu_via_nmi(id, start_ip);
-       }
-
-out:
-       preempt_enable();
-
-       return boot_error;
-}
-
 int common_cpu_up(unsigned int cpu, struct task_struct *idle)
 {
        int ret;
@@ -1071,17 +1024,13 @@ int common_cpu_up(unsigned int cpu, struct task_struct *idle)
 /*
  * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
  * (ie clustered apic addressing mode), this is a LOGICAL apic ID.
- * Returns zero if CPU booted OK, else error code from
+ * Returns zero if startup was successfully sent, else error code from
  * ->wakeup_secondary_cpu.
  */
-static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
-                      int *cpu0_nmi_registered)
+static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
 {
-       /* start_ip had better be page-aligned! */
        unsigned long start_ip = real_mode_header->trampoline_start;
-
-       unsigned long boot_error = 0;
-       unsigned long timeout;
+       int ret;
 
 #ifdef CONFIG_X86_64
        /* If 64-bit wakeup method exists, use the 64-bit mode trampoline IP */
@@ -1094,7 +1043,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
        if (IS_ENABLED(CONFIG_X86_32)) {
                early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
                initial_stack  = idle->thread.sp;
-       } else {
+       } else if (!(smpboot_control & STARTUP_PARALLEL_MASK)) {
                smpboot_control = cpu;
        }
 
@@ -1108,7 +1057,6 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
         * This grunge runs the startup process for
         * the targeted processor.
         */
-
        if (x86_platform.legacy.warm_reset) {
 
                pr_debug("Setting warm reset code and vector.\n");
@@ -1123,13 +1071,6 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
                }
        }
 
-       /*
-        * AP might wait on cpu_callout_mask in cpu_init() with
-        * cpu_initialized_mask set if previous attempt to online
-        * it timed-out. Clear cpu_initialized_mask so that after
-        * INIT/SIPI it could start with a clean state.
-        */
-       cpumask_clear_cpu(cpu, cpu_initialized_mask);
        smp_mb();
 
        /*
@@ -1137,66 +1078,25 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
         * - Use a method from the APIC driver if one defined, with wakeup
         *   straight to 64-bit mode preferred over wakeup to RM.
         * Otherwise,
-        * - Use an INIT boot APIC message for APs or NMI for BSP.
+        * - Use an INIT boot APIC message
         */
        if (apic->wakeup_secondary_cpu_64)
-               boot_error = apic->wakeup_secondary_cpu_64(apicid, start_ip);
+               ret = apic->wakeup_secondary_cpu_64(apicid, start_ip);
        else if (apic->wakeup_secondary_cpu)
-               boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
+               ret = apic->wakeup_secondary_cpu(apicid, start_ip);
        else
-               boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
-                                                    cpu0_nmi_registered);
-
-       if (!boot_error) {
-               /*
-                * Wait 10s total for first sign of life from AP
-                */
-               boot_error = -1;
-               timeout = jiffies + 10*HZ;
-               while (time_before(jiffies, timeout)) {
-                       if (cpumask_test_cpu(cpu, cpu_initialized_mask)) {
-                               /*
-                                * Tell AP to proceed with initialization
-                                */
-                               cpumask_set_cpu(cpu, cpu_callout_mask);
-                               boot_error = 0;
-                               break;
-                       }
-                       schedule();
-               }
-       }
-
-       if (!boot_error) {
-               /*
-                * Wait till AP completes initial initialization
-                */
-               while (!cpumask_test_cpu(cpu, cpu_callin_mask)) {
-                       /*
-                        * Allow other tasks to run while we wait for the
-                        * AP to come online. This also gives a chance
-                        * for the MTRR work(triggered by the AP coming online)
-                        * to be completed in the stop machine context.
-                        */
-                       schedule();
-               }
-       }
+               ret = wakeup_secondary_cpu_via_init(apicid, start_ip);
 
-       if (x86_platform.legacy.warm_reset) {
-               /*
-                * Cleanup possible dangling ends...
-                */
-               smpboot_restore_warm_reset_vector();
-       }
-
-       return boot_error;
+       /* If the wakeup mechanism failed, cleanup the warm reset vector */
+       if (ret)
+               arch_cpuhp_cleanup_kick_cpu(cpu);
+       return ret;
 }
 
-int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
+int native_kick_ap(unsigned int cpu, struct task_struct *tidle)
 {
        int apicid = apic->cpu_present_to_apicid(cpu);
-       int cpu0_nmi_registered = 0;
-       unsigned long flags;
-       int err, ret = 0;
+       int err;
 
        lockdep_assert_irqs_enabled();
 
@@ -1210,24 +1110,11 @@ int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
        }
 
        /*
-        * Already booted CPU?
-        */
-       if (cpumask_test_cpu(cpu, cpu_callin_mask)) {
-               pr_debug("do_boot_cpu %d Already started\n", cpu);
-               return -ENOSYS;
-       }
-
-       /*
         * Save current MTRR state in case it was changed since early boot
         * (e.g. by the ACPI SMI) to initialize new CPUs with MTRRs in sync:
         */
        mtrr_save_state();
 
-       /* x86 CPUs take themselves offline, so delayed offline is OK. */
-       err = cpu_check_up_prepare(cpu);
-       if (err && err != -EBUSY)
-               return err;
-
        /* the FPU context is blank, nobody can own it */
        per_cpu(fpu_fpregs_owner_ctx, cpu) = NULL;
 
@@ -1235,41 +1122,44 @@ int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
        if (err)
                return err;
 
-       err = do_boot_cpu(apicid, cpu, tidle, &cpu0_nmi_registered);
-       if (err) {
+       err = do_boot_cpu(apicid, cpu, tidle);
+       if (err)
                pr_err("do_boot_cpu failed(%d) to wakeup CPU#%u\n", err, cpu);
-               ret = -EIO;
-               goto unreg_nmi;
-       }
 
-       /*
-        * Check TSC synchronization with the AP (keep irqs disabled
-        * while doing so):
-        */
-       local_irq_save(flags);
-       check_tsc_sync_source(cpu);
-       local_irq_restore(flags);
+       return err;
+}
 
-       while (!cpu_online(cpu)) {
-               cpu_relax();
-               touch_nmi_watchdog();
-       }
+int arch_cpuhp_kick_ap_alive(unsigned int cpu, struct task_struct *tidle)
+{
+       return smp_ops.kick_ap_alive(cpu, tidle);
+}
 
-unreg_nmi:
-       /*
-        * Clean up the nmi handler. Do this after the callin and callout sync
-        * to avoid impact of possible long unregister time.
-        */
-       if (cpu0_nmi_registered)
-               unregister_nmi_handler(NMI_LOCAL, "wake_cpu0");
+void arch_cpuhp_cleanup_kick_cpu(unsigned int cpu)
+{
+       /* Cleanup possible dangling ends... */
+       if (smp_ops.kick_ap_alive == native_kick_ap && x86_platform.legacy.warm_reset)
+               smpboot_restore_warm_reset_vector();
+}
 
-       return ret;
+void arch_cpuhp_cleanup_dead_cpu(unsigned int cpu)
+{
+       if (smp_ops.cleanup_dead_cpu)
+               smp_ops.cleanup_dead_cpu(cpu);
+
+       if (system_state == SYSTEM_RUNNING)
+               pr_info("CPU %u is now offline\n", cpu);
+}
+
+void arch_cpuhp_sync_state_poll(void)
+{
+       if (smp_ops.poll_sync_state)
+               smp_ops.poll_sync_state();
 }
 
 /**
- * arch_disable_smp_support() - disables SMP support for x86 at runtime
+ * arch_disable_smp_support() - Disables SMP support for x86 at boottime
  */
-void arch_disable_smp_support(void)
+void __init arch_disable_smp_support(void)
 {
        disable_ioapic_support();
 }
@@ -1361,14 +1251,6 @@ static void __init smp_cpu_index_default(void)
        }
 }
 
-static void __init smp_get_logical_apicid(void)
-{
-       if (x2apic_mode)
-               cpu0_logical_apicid = apic_read(APIC_LDR);
-       else
-               cpu0_logical_apicid = GET_APIC_LOGICAL_ID(apic_read(APIC_LDR));
-}
-
 void __init smp_prepare_cpus_common(void)
 {
        unsigned int i;
@@ -1379,7 +1261,6 @@ void __init smp_prepare_cpus_common(void)
         * Setup boot CPU information
         */
        smp_store_boot_cpu_info(); /* Final full version of the data */
-       cpumask_copy(cpu_callin_mask, cpumask_of(0));
        mb();
 
        for_each_possible_cpu(i) {
@@ -1390,18 +1271,24 @@ void __init smp_prepare_cpus_common(void)
                zalloc_cpumask_var(&per_cpu(cpu_l2c_shared_map, i), GFP_KERNEL);
        }
 
-       /*
-        * Set 'default' x86 topology, this matches default_topology() in that
-        * it has NUMA nodes as a topology level. See also
-        * native_smp_cpus_done().
-        *
-        * Must be done before set_cpus_sibling_map() is ran.
-        */
-       set_sched_topology(x86_topology);
-
        set_cpu_sibling_map(0);
 }
 
+#ifdef CONFIG_X86_64
+/* Establish whether parallel bringup can be supported. */
+bool __init arch_cpuhp_init_parallel_bringup(void)
+{
+       if (!x86_cpuinit.parallel_bringup) {
+               pr_info("Parallel CPU startup disabled by the platform\n");
+               return false;
+       }
+
+       smpboot_control = STARTUP_READ_APICID;
+       pr_debug("Parallel CPU startup enabled: 0x%08x\n", smpboot_control);
+       return true;
+}
+#endif
+
 /*
  * Prepare for SMP bootup.
  * @max_cpus: configured maximum number of CPUs; it is a legacy parameter
@@ -1431,8 +1318,6 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
        /* Setup local timer */
        x86_init.timers.setup_percpu_clockev();
 
-       smp_get_logical_apicid();
-
        pr_info("CPU0: ");
        print_cpu_info(&cpu_data(0));
 
@@ -1455,6 +1340,25 @@ void arch_thaw_secondary_cpus_end(void)
        cache_aps_init();
 }
 
+bool smp_park_other_cpus_in_init(void)
+{
+       unsigned int cpu, this_cpu = smp_processor_id();
+       unsigned int apicid;
+
+       if (apic->wakeup_secondary_cpu_64 || apic->wakeup_secondary_cpu)
+               return false;
+
+       for_each_present_cpu(cpu) {
+               if (cpu == this_cpu)
+                       continue;
+               apicid = apic->cpu_present_to_apicid(cpu);
+               if (apicid == BAD_APICID)
+                       continue;
+               send_init_sequence(apicid);
+       }
+       return true;
+}
+
 /*
  * Early setup to make printk work.
  */
@@ -1466,9 +1370,6 @@ void __init native_smp_prepare_boot_cpu(void)
        if (!IS_ENABLED(CONFIG_SMP))
                switch_gdt_and_percpu_base(me);
 
-       /* already set me in cpu_online_mask in boot_cpu_init() */
-       cpumask_set_cpu(me, cpu_callout_mask);
-       cpu_set_state_online(me);
        native_pv_lock_init();
 }
 
@@ -1490,13 +1391,7 @@ void __init native_smp_cpus_done(unsigned int max_cpus)
        pr_debug("Boot done\n");
 
        calculate_max_logical_packages();
-
-       /* XXX for now assume numa-in-package and hybrid don't overlap */
-       if (x86_has_numa_in_package)
-               set_sched_topology(x86_numa_in_package_topology);
-       if (cpu_feature_enabled(X86_FEATURE_HYBRID_CPU))
-               set_sched_topology(x86_hybrid_topology);
-
+       build_sched_topology();
        nmi_selftest();
        impress_friends();
        cache_aps_init();
@@ -1592,6 +1487,12 @@ __init void prefill_possible_map(void)
                set_cpu_possible(i, true);
 }
 
+/* correctly size the local cpu masks */
+void __init setup_cpu_local_masks(void)
+{
+       alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
+}
+
 #ifdef CONFIG_HOTPLUG_CPU
 
 /* Recompute SMT state for all CPUs on offline */
@@ -1650,10 +1551,6 @@ static void remove_siblinginfo(int cpu)
 static void remove_cpu_from_maps(int cpu)
 {
        set_cpu_online(cpu, false);
-       cpumask_clear_cpu(cpu, cpu_callout_mask);
-       cpumask_clear_cpu(cpu, cpu_callin_mask);
-       /* was set by cpu_init() */
-       cpumask_clear_cpu(cpu, cpu_initialized_mask);
        numa_remove_cpu(cpu);
 }
 
@@ -1704,64 +1601,27 @@ int native_cpu_disable(void)
        return 0;
 }
 
-int common_cpu_die(unsigned int cpu)
-{
-       int ret = 0;
-
-       /* We don't do anything here: idle task is faking death itself. */
-
-       /* They ack this in play_dead() by setting CPU_DEAD */
-       if (cpu_wait_death(cpu, 5)) {
-               if (system_state == SYSTEM_RUNNING)
-                       pr_info("CPU %u is now offline\n", cpu);
-       } else {
-               pr_err("CPU %u didn't die...\n", cpu);
-               ret = -1;
-       }
-
-       return ret;
-}
-
-void native_cpu_die(unsigned int cpu)
-{
-       common_cpu_die(cpu);
-}
-
 void play_dead_common(void)
 {
        idle_task_exit();
 
-       /* Ack it */
-       (void)cpu_report_death();
-
+       cpuhp_ap_report_dead();
        /*
         * With physical CPU hotplug, we should halt the cpu
         */
        local_irq_disable();
 }
 
-/**
- * cond_wakeup_cpu0 - Wake up CPU0 if needed.
- *
- * If NMI wants to wake up CPU0, start CPU0.
- */
-void cond_wakeup_cpu0(void)
-{
-       if (smp_processor_id() == 0 && enable_start_cpu0)
-               start_cpu0();
-}
-EXPORT_SYMBOL_GPL(cond_wakeup_cpu0);
-
 /*
  * We need to flush the caches before going to sleep, lest we have
  * dirty data in our caches when we come back up.
  */
 static inline void mwait_play_dead(void)
 {
+       struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
        unsigned int eax, ebx, ecx, edx;
        unsigned int highest_cstate = 0;
        unsigned int highest_subcstate = 0;
-       void *mwait_ptr;
        int i;
 
        if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
@@ -1796,12 +1656,9 @@ static inline void mwait_play_dead(void)
                        (highest_subcstate - 1);
        }
 
-       /*
-        * This should be a memory location in a cache line which is
-        * unlikely to be touched by other processors.  The actual
-        * content is immaterial as it is not actually modified in any way.
-        */
-       mwait_ptr = &current_thread_info()->flags;
+       /* Set up state for the kexec() hack below */
+       md->status = CPUDEAD_MWAIT_WAIT;
+       md->control = CPUDEAD_MWAIT_WAIT;
 
        wbinvd();
 
@@ -1814,13 +1671,58 @@ static inline void mwait_play_dead(void)
                 * case where we return around the loop.
                 */
                mb();
-               clflush(mwait_ptr);
+               clflush(md);
                mb();
-               __monitor(mwait_ptr, 0, 0);
+               __monitor(md, 0, 0);
                mb();
                __mwait(eax, 0);
 
-               cond_wakeup_cpu0();
+               if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
+                       /*
+                        * Kexec is about to happen. Don't go back into mwait() as
+                        * the kexec kernel might overwrite text and data including
+                        * page tables and stack. So mwait() would resume when the
+                        * monitor cache line is written to and then the CPU goes
+                        * south due to overwritten text, page tables and stack.
+                        *
+                        * Note: This does _NOT_ protect against a stray MCE, NMI,
+                        * SMI. They will resume execution at the instruction
+                        * following the HLT instruction and run into the problem
+                        * which this is trying to prevent.
+                        */
+                       WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
+                       while(1)
+                               native_halt();
+               }
+       }
+}
+
+/*
+ * Kick all "offline" CPUs out of mwait on kexec(). See comment in
+ * mwait_play_dead().
+ */
+void smp_kick_mwait_play_dead(void)
+{
+       u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;
+       struct mwait_cpu_dead *md;
+       unsigned int cpu, i;
+
+       for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
+               md = per_cpu_ptr(&mwait_cpu_dead, cpu);
+
+               /* Does it sit in mwait_play_dead() ? */
+               if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
+                       continue;
+
+               /* Wait up to 5ms */
+               for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
+                       /* Bring it out of mwait */
+                       WRITE_ONCE(md->control, newstate);
+                       udelay(5);
+               }
+
+               if (READ_ONCE(md->status) != newstate)
+                       pr_err_once("CPU%u is stuck in mwait_play_dead()\n", cpu);
        }
 }
 
@@ -1829,11 +1731,8 @@ void __noreturn hlt_play_dead(void)
        if (__this_cpu_read(cpu_info.x86) >= 4)
                wbinvd();
 
-       while (1) {
+       while (1)
                native_halt();
-
-               cond_wakeup_cpu0();
-       }
 }
 
 void native_play_dead(void)
@@ -1852,12 +1751,6 @@ int native_cpu_disable(void)
        return -ENOSYS;
 }
 
-void native_cpu_die(unsigned int cpu)
-{
-       /* We said "no" in __cpu_disable */
-       BUG();
-}
-
 void native_play_dead(void)
 {
        BUG();
index 1b83377..ca004e2 100644 (file)
 static DEFINE_PER_CPU(struct x86_cpu, cpu_devices);
 
 #ifdef CONFIG_HOTPLUG_CPU
-
-#ifdef CONFIG_BOOTPARAM_HOTPLUG_CPU0
-static int cpu0_hotpluggable = 1;
-#else
-static int cpu0_hotpluggable;
-static int __init enable_cpu0_hotplug(char *str)
-{
-       cpu0_hotpluggable = 1;
-       return 1;
-}
-
-__setup("cpu0_hotplug", enable_cpu0_hotplug);
-#endif
-
-#ifdef CONFIG_DEBUG_HOTPLUG_CPU0
-/*
- * This function offlines a CPU as early as possible and allows userspace to
- * boot up without the CPU. The CPU can be onlined back by user after boot.
- *
- * This is only called for debugging CPU offline/online feature.
- */
-int _debug_hotplug_cpu(int cpu, int action)
+int arch_register_cpu(int cpu)
 {
-       int ret;
-
-       if (!cpu_is_hotpluggable(cpu))
-               return -EINVAL;
+       struct x86_cpu *xc = per_cpu_ptr(&cpu_devices, cpu);
 
-       switch (action) {
-       case 0:
-               ret = remove_cpu(cpu);
-               if (!ret)
-                       pr_info("DEBUG_HOTPLUG_CPU0: CPU %u is now offline\n", cpu);
-               else
-                       pr_debug("Can't offline CPU%d.\n", cpu);
-               break;
-       case 1:
-               ret = add_cpu(cpu);
-               if (ret)
-                       pr_debug("Can't online CPU%d.\n", cpu);
-
-               break;
-       default:
-               ret = -EINVAL;
-       }
-
-       return ret;
-}
-
-static int __init debug_hotplug_cpu(void)
-{
-       _debug_hotplug_cpu(0, 0);
-       return 0;
-}
-
-late_initcall_sync(debug_hotplug_cpu);
-#endif /* CONFIG_DEBUG_HOTPLUG_CPU0 */
-
-int arch_register_cpu(int num)
-{
-       struct cpuinfo_x86 *c = &cpu_data(num);
-
-       /*
-        * Currently CPU0 is only hotpluggable on Intel platforms. Other
-        * vendors can add hotplug support later.
-        * Xen PV guests don't support CPU0 hotplug at all.
-        */
-       if (c->x86_vendor != X86_VENDOR_INTEL ||
-           cpu_feature_enabled(X86_FEATURE_XENPV))
-               cpu0_hotpluggable = 0;
-
-       /*
-        * Two known BSP/CPU0 dependencies: Resume from suspend/hibernate
-        * depends on BSP. PIC interrupts depend on BSP.
-        *
-        * If the BSP dependencies are under control, one can tell kernel to
-        * enable BSP hotplug. This basically adds a control file and
-        * one can attempt to offline BSP.
-        */
-       if (num == 0 && cpu0_hotpluggable) {
-               unsigned int irq;
-               /*
-                * We won't take down the boot processor on i386 if some
-                * interrupts only are able to be serviced by the BSP in PIC.
-                */
-               for_each_active_irq(irq) {
-                       if (!IO_APIC_IRQ(irq) && irq_has_action(irq)) {
-                               cpu0_hotpluggable = 0;
-                               break;
-                       }
-               }
-       }
-       if (num || cpu0_hotpluggable)
-               per_cpu(cpu_devices, num).cpu.hotpluggable = 1;
-
-       return register_cpu(&per_cpu(cpu_devices, num).cpu, num);
+       xc->cpu.hotpluggable = cpu > 0;
+       return register_cpu(&xc->cpu, cpu);
 }
 EXPORT_SYMBOL(arch_register_cpu);
 
index 3446988..3425c6a 100644 (file)
@@ -69,12 +69,10 @@ static int __init tsc_early_khz_setup(char *buf)
 }
 early_param("tsc_early_khz", tsc_early_khz_setup);
 
-__always_inline void cyc2ns_read_begin(struct cyc2ns_data *data)
+__always_inline void __cyc2ns_read(struct cyc2ns_data *data)
 {
        int seq, idx;
 
-       preempt_disable_notrace();
-
        do {
                seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
                idx = seq & 1;
@@ -86,6 +84,12 @@ __always_inline void cyc2ns_read_begin(struct cyc2ns_data *data)
        } while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
 }
 
+__always_inline void cyc2ns_read_begin(struct cyc2ns_data *data)
+{
+       preempt_disable_notrace();
+       __cyc2ns_read(data);
+}
+
 __always_inline void cyc2ns_read_end(void)
 {
        preempt_enable_notrace();
@@ -115,18 +119,25 @@ __always_inline void cyc2ns_read_end(void)
  *                      -johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
 
-static __always_inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static __always_inline unsigned long long __cycles_2_ns(unsigned long long cyc)
 {
        struct cyc2ns_data data;
        unsigned long long ns;
 
-       cyc2ns_read_begin(&data);
+       __cyc2ns_read(&data);
 
        ns = data.cyc2ns_offset;
        ns += mul_u64_u32_shr(cyc, data.cyc2ns_mul, data.cyc2ns_shift);
 
-       cyc2ns_read_end();
+       return ns;
+}
 
+static __always_inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+       unsigned long long ns;
+       preempt_disable_notrace();
+       ns = __cycles_2_ns(cyc);
+       preempt_enable_notrace();
        return ns;
 }
 
@@ -223,7 +234,7 @@ noinstr u64 native_sched_clock(void)
                u64 tsc_now = rdtsc();
 
                /* return the value in ns */
-               return cycles_2_ns(tsc_now);
+               return __cycles_2_ns(tsc_now);
        }
 
        /*
@@ -250,7 +261,7 @@ u64 native_sched_clock_from_tsc(u64 tsc)
 /* We need to define a real function for sched_clock, to override the
    weak default version */
 #ifdef CONFIG_PARAVIRT
-noinstr u64 sched_clock(void)
+noinstr u64 sched_clock_noinstr(void)
 {
        return paravirt_sched_clock();
 }
@@ -260,11 +271,20 @@ bool using_native_sched_clock(void)
        return static_call_query(pv_sched_clock) == native_sched_clock;
 }
 #else
-u64 sched_clock(void) __attribute__((alias("native_sched_clock")));
+u64 sched_clock_noinstr(void) __attribute__((alias("native_sched_clock")));
 
 bool using_native_sched_clock(void) { return true; }
 #endif
 
+notrace u64 sched_clock(void)
+{
+       u64 now;
+       preempt_disable_notrace();
+       now = sched_clock_noinstr();
+       preempt_enable_notrace();
+       return now;
+}
+
 int check_tsc_unstable(void)
 {
        return tsc_unstable;
@@ -1598,10 +1618,7 @@ void __init tsc_init(void)
 
 #ifdef CONFIG_SMP
 /*
- * If we have a constant TSC and are using the TSC for the delay loop,
- * we can skip clock calibration if another cpu in the same socket has already
- * been calibrated. This assumes that CONSTANT_TSC applies to all
- * cpus in the socket - this should be a safe assumption.
+ * Check whether existing calibration data can be reused.
  */
 unsigned long calibrate_delay_is_known(void)
 {
@@ -1609,6 +1626,21 @@ unsigned long calibrate_delay_is_known(void)
        int constant_tsc = cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC);
        const struct cpumask *mask = topology_core_cpumask(cpu);
 
+       /*
+        * If TSC has constant frequency and TSC is synchronized across
+        * sockets then reuse CPU0 calibration.
+        */
+       if (constant_tsc && !tsc_unstable)
+               return cpu_data(0).loops_per_jiffy;
+
+       /*
+        * If TSC has constant frequency and TSC is not synchronized across
+        * sockets and this is not the first CPU in the socket, then reuse
+        * the calibration value of an already online CPU on that socket.
+        *
+        * This assumes that CONSTANT_TSC is consistent for all CPUs in a
+        * socket.
+        */
        if (!constant_tsc || !mask)
                return 0;
 
index 9452dc9..bbc440c 100644 (file)
@@ -245,7 +245,6 @@ bool tsc_store_and_check_tsc_adjust(bool bootcpu)
  */
 static atomic_t start_count;
 static atomic_t stop_count;
-static atomic_t skip_test;
 static atomic_t test_runs;
 
 /*
@@ -344,21 +343,14 @@ static inline unsigned int loop_timeout(int cpu)
 }
 
 /*
- * Source CPU calls into this - it waits for the freshly booted
- * target CPU to arrive and then starts the measurement:
+ * The freshly booted CPU initiates this via an async SMP function call.
  */
-void check_tsc_sync_source(int cpu)
+static void check_tsc_sync_source(void *__cpu)
 {
+       unsigned int cpu = (unsigned long)__cpu;
        int cpus = 2;
 
        /*
-        * No need to check if we already know that the TSC is not
-        * synchronized or if we have no TSC.
-        */
-       if (unsynchronized_tsc())
-               return;
-
-       /*
         * Set the maximum number of test runs to
         *  1 if the CPU does not provide the TSC_ADJUST MSR
         *  3 if the MSR is available, so the target can try to adjust
@@ -368,16 +360,9 @@ void check_tsc_sync_source(int cpu)
        else
                atomic_set(&test_runs, 3);
 retry:
-       /*
-        * Wait for the target to start or to skip the test:
-        */
-       while (atomic_read(&start_count) != cpus - 1) {
-               if (atomic_read(&skip_test) > 0) {
-                       atomic_set(&skip_test, 0);
-                       return;
-               }
+       /* Wait for the target to start. */
+       while (atomic_read(&start_count) != cpus - 1)
                cpu_relax();
-       }
 
        /*
         * Trigger the target to continue into the measurement too:
@@ -397,14 +382,14 @@ retry:
        if (!nr_warps) {
                atomic_set(&test_runs, 0);
 
-               pr_debug("TSC synchronization [CPU#%d -> CPU#%d]: passed\n",
+               pr_debug("TSC synchronization [CPU#%d -> CPU#%u]: passed\n",
                        smp_processor_id(), cpu);
 
        } else if (atomic_dec_and_test(&test_runs) || random_warps) {
                /* Force it to 0 if random warps brought us here */
                atomic_set(&test_runs, 0);
 
-               pr_warn("TSC synchronization [CPU#%d -> CPU#%d]:\n",
+               pr_warn("TSC synchronization [CPU#%d -> CPU#%u]:\n",
                        smp_processor_id(), cpu);
                pr_warn("Measured %Ld cycles TSC warp between CPUs, "
                        "turning off TSC clock.\n", max_warp);
@@ -457,11 +442,12 @@ void check_tsc_sync_target(void)
         * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
  * register might have been wrecked by the BIOS.
         */
-       if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
-               atomic_inc(&skip_test);
+       if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable)
                return;
-       }
 
+       /* Kick the control CPU into the TSC synchronization function */
+       smp_call_function_single(cpumask_first(cpu_online_mask), check_tsc_sync_source,
+                                (unsigned long *)(unsigned long)cpu, 0);
 retry:
        /*
         * Register this CPU's participation and wait for the
index 3ac50b7..7e574cf 100644 (file)
@@ -7,14 +7,23 @@
 #include <asm/unwind.h>
 #include <asm/orc_types.h>
 #include <asm/orc_lookup.h>
+#include <asm/orc_header.h>
+
+ORC_HEADER;
 
 #define orc_warn(fmt, ...) \
        printk_deferred_once(KERN_WARNING "WARNING: " fmt, ##__VA_ARGS__)
 
 #define orc_warn_current(args...)                                      \
 ({                                                                     \
-       if (state->task == current && !state->error)                    \
+       static bool dumped_before;                                      \
+       if (state->task == current && !state->error) {                  \
                orc_warn(args);                                         \
+               if (unwind_debug && !dumped_before) {                   \
+                       dumped_before = true;                           \
+                       unwind_dump(state);                             \
+               }                                                       \
+       }                                                               \
 })
 
 extern int __start_orc_unwind_ip[];
@@ -23,8 +32,49 @@ extern struct orc_entry __start_orc_unwind[];
 extern struct orc_entry __stop_orc_unwind[];
 
 static bool orc_init __ro_after_init;
+static bool unwind_debug __ro_after_init;
 static unsigned int lookup_num_blocks __ro_after_init;
 
+static int __init unwind_debug_cmdline(char *str)
+{
+       unwind_debug = true;
+
+       return 0;
+}
+early_param("unwind_debug", unwind_debug_cmdline);
+
+static void unwind_dump(struct unwind_state *state)
+{
+       static bool dumped_before;
+       unsigned long word, *sp;
+       struct stack_info stack_info = {0};
+       unsigned long visit_mask = 0;
+
+       if (dumped_before)
+               return;
+
+       dumped_before = true;
+
+       printk_deferred("unwind stack type:%d next_sp:%p mask:0x%lx graph_idx:%d\n",
+                       state->stack_info.type, state->stack_info.next_sp,
+                       state->stack_mask, state->graph_idx);
+
+       for (sp = __builtin_frame_address(0); sp;
+            sp = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
+               if (get_stack_info(sp, state->task, &stack_info, &visit_mask))
+                       break;
+
+               for (; sp < stack_info.end; sp++) {
+
+                       word = READ_ONCE_NOCHECK(*sp);
+
+                       printk_deferred("%0*lx: %0*lx (%pB)\n", BITS_PER_LONG/4,
+                                       (unsigned long)sp, BITS_PER_LONG/4,
+                                       word, (void *)word);
+               }
+       }
+}
+
 static inline unsigned long orc_ip(const int *ip)
 {
        return (unsigned long)ip + *ip;
@@ -136,21 +186,6 @@ static struct orc_entry null_orc_entry = {
        .type = ORC_TYPE_CALL
 };
 
-#ifdef CONFIG_CALL_THUNKS
-static struct orc_entry *orc_callthunk_find(unsigned long ip)
-{
-       if (!is_callthunk((void *)ip))
-               return NULL;
-
-       return &null_orc_entry;
-}
-#else
-static struct orc_entry *orc_callthunk_find(unsigned long ip)
-{
-       return NULL;
-}
-#endif
-
 /* Fake frame pointer entry -- used as a fallback for generated code */
 static struct orc_entry orc_fp_entry = {
        .type           = ORC_TYPE_CALL,
@@ -203,11 +238,7 @@ static struct orc_entry *orc_find(unsigned long ip)
        if (orc)
                return orc;
 
-       orc =  orc_ftrace_find(ip);
-       if (orc)
-               return orc;
-
-       return orc_callthunk_find(ip);
+       return orc_ftrace_find(ip);
 }
 
 #ifdef CONFIG_MODULES
@@ -219,7 +250,6 @@ static struct orc_entry *cur_orc_table = __start_orc_unwind;
 static void orc_sort_swap(void *_a, void *_b, int size)
 {
        struct orc_entry *orc_a, *orc_b;
-       struct orc_entry orc_tmp;
        int *a = _a, *b = _b, tmp;
        int delta = _b - _a;
 
@@ -231,9 +261,7 @@ static void orc_sort_swap(void *_a, void *_b, int size)
        /* Swap the corresponding .orc_unwind entries: */
        orc_a = cur_orc_table + (a - cur_orc_ip_table);
        orc_b = cur_orc_table + (b - cur_orc_ip_table);
-       orc_tmp = *orc_a;
-       *orc_a = *orc_b;
-       *orc_b = orc_tmp;
+       swap(*orc_a, *orc_b);
 }
 
 static int orc_sort_cmp(const void *_a, const void *_b)
index 25f1552..03c885d 100644 (file)
@@ -508,4 +508,8 @@ INIT_PER_CPU(irq_stack_backing_store);
            "fixed_percpu_data is not at start of per-cpu area");
 #endif
 
+#ifdef CONFIG_RETHUNK
+. = ASSERT((__x86_return_thunk & 0x3f) == 0, "__x86_return_thunk not cacheline-aligned");
+#endif
+
 #endif /* CONFIG_X86_64 */
index d82f4fa..a37ebd3 100644 (file)
@@ -126,12 +126,13 @@ struct x86_init_ops x86_init __initdata = {
 struct x86_cpuinit_ops x86_cpuinit = {
        .early_percpu_clock_init        = x86_init_noop,
        .setup_percpu_clockev           = setup_secondary_APIC_clock,
+       .parallel_bringup               = true,
 };
 
 static void default_nmi_init(void) { };
 
-static void enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { }
-static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return false; }
+static bool enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return true; }
+static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return true; }
 static bool enc_tlb_flush_required_noop(bool enc) { return false; }
 static bool enc_cache_flush_required_noop(void) { return false; }
 static bool is_private_mmio_noop(u64 addr) {return false; }
index 04b57a3..7f70207 100644 (file)
@@ -2799,14 +2799,13 @@ static u64 read_tsc(void)
 static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
                          int *mode)
 {
-       long v;
        u64 tsc_pg_val;
+       long v;
 
        switch (clock->vclock_mode) {
        case VDSO_CLOCKMODE_HVCLOCK:
-               tsc_pg_val = hv_read_tsc_page_tsc(hv_get_tsc_page(),
-                                                 tsc_timestamp);
-               if (tsc_pg_val != U64_MAX) {
+               if (hv_read_tsc_page_tsc(hv_get_tsc_page(),
+                                        tsc_timestamp, &tsc_pg_val)) {
                        /* TSC page valid */
                        *mode = VDSO_CLOCKMODE_HVCLOCK;
                        v = (tsc_pg_val - clock->cycle_last) &
@@ -13162,7 +13161,7 @@ EXPORT_SYMBOL_GPL(kvm_arch_end_assignment);
 
 bool noinstr kvm_arch_has_assigned_device(struct kvm *kvm)
 {
-       return arch_atomic_read(&kvm->arch.assigned_device_count);
+       return raw_atomic_read(&kvm->arch.assigned_device_count);
 }
 EXPORT_SYMBOL_GPL(kvm_arch_has_assigned_device);
 
index 01932af..ea3a28e 100644 (file)
@@ -61,8 +61,9 @@ ifeq ($(CONFIG_X86_32),y)
         lib-y += strstr_32.o
         lib-y += string_32.o
         lib-y += memmove_32.o
+        lib-y += cmpxchg8b_emu.o
 ifneq ($(CONFIG_X86_CMPXCHG64),y)
-        lib-y += cmpxchg8b_emu.o atomic64_386_32.o
+        lib-y += atomic64_386_32.o
 endif
 else
         obj-y += iomap_copy_64.o
index 33c70c0..6962df3 100644 (file)
@@ -1,47 +1,54 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 #include <linux/linkage.h>
 #include <asm/percpu.h>
+#include <asm/processor-flags.h>
 
 .text
 
 /*
+ * Emulate 'cmpxchg16b %gs:(%rsi)'
+ *
  * Inputs:
  * %rsi : memory location to compare
  * %rax : low 64 bits of old value
  * %rdx : high 64 bits of old value
  * %rbx : low 64 bits of new value
  * %rcx : high 64 bits of new value
- * %al  : Operation successful
+ *
+ * Notably this is not LOCK prefixed and is not safe against NMIs
  */
 SYM_FUNC_START(this_cpu_cmpxchg16b_emu)
 
-#
-# Emulate 'cmpxchg16b %gs:(%rsi)' except we return the result in %al not
-# via the ZF.  Caller will access %al to get result.
-#
-# Note that this is only useful for a cpuops operation.  Meaning that we
-# do *not* have a fully atomic operation but just an operation that is
-# *atomic* on a single cpu (as provided by the this_cpu_xx class of
-# macros).
-#
        pushfq
        cli
 
-       cmpq PER_CPU_VAR((%rsi)), %rax
-       jne .Lnot_same
-       cmpq PER_CPU_VAR(8(%rsi)), %rdx
-       jne .Lnot_same
+       /* if (*ptr == old) */
+       cmpq    PER_CPU_VAR(0(%rsi)), %rax
+       jne     .Lnot_same
+       cmpq    PER_CPU_VAR(8(%rsi)), %rdx
+       jne     .Lnot_same
 
-       movq %rbx, PER_CPU_VAR((%rsi))
-       movq %rcx, PER_CPU_VAR(8(%rsi))
+       /* *ptr = new */
+       movq    %rbx, PER_CPU_VAR(0(%rsi))
+       movq    %rcx, PER_CPU_VAR(8(%rsi))
+
+       /* set ZF in EFLAGS to indicate success */
+       orl     $X86_EFLAGS_ZF, (%rsp)
 
        popfq
-       mov $1, %al
        RET
 
 .Lnot_same:
+       /* *ptr != old */
+
+       /* old = *ptr */
+       movq    PER_CPU_VAR(0(%rsi)), %rax
+       movq    PER_CPU_VAR(8(%rsi)), %rdx
+
+       /* clear ZF in EFLAGS to indicate failure */
+       andl    $(~X86_EFLAGS_ZF), (%rsp)
+
        popfq
-       xor %al,%al
        RET
 
 SYM_FUNC_END(this_cpu_cmpxchg16b_emu)
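The rewritten emulation reports success through ZF and, on failure, reloads the old-value registers from memory, which is the contract of a "try" compare-and-exchange. A plain-C sketch of that contract (illustrative only; it ignores the per-CPU segment addressing and the pushfq/cli/popfq fencing):

        struct u128_emu { u64 lo, hi; };

        static bool try_cmpxchg_emu(struct u128_emu *ptr, struct u128_emu *old,
                                    struct u128_emu new)
        {
                if (ptr->lo == old->lo && ptr->hi == old->hi) {
                        *ptr = new;     /* movq %rbx/%rcx into memory     */
                        return true;    /* orl $X86_EFLAGS_ZF, (%rsp)     */
                }
                *old = *ptr;            /* reload %rax:%rdx on failure    */
                return false;           /* andl $(~X86_EFLAGS_ZF), (%rsp) */
        }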
index 6a912d5..4980525 100644 (file)
@@ -2,10 +2,16 @@
 
 #include <linux/linkage.h>
 #include <asm/export.h>
+#include <asm/percpu.h>
+#include <asm/processor-flags.h>
 
 .text
 
+#ifndef CONFIG_X86_CMPXCHG64
+
 /*
+ * Emulate 'cmpxchg8b (%esi)' on UP
+ *
  * Inputs:
  * %esi : memory location to compare
  * %eax : low 32 bits of old value
  */
 SYM_FUNC_START(cmpxchg8b_emu)
 
-#
-# Emulate 'cmpxchg8b (%esi)' on UP except we don't
-# set the whole ZF thing (caller will just compare
-# eax:edx with the expected value)
-#
        pushfl
        cli
 
-       cmpl  (%esi), %eax
-       jne .Lnot_same
-       cmpl 4(%esi), %edx
-       jne .Lhalf_same
+       cmpl    0(%esi), %eax
+       jne     .Lnot_same
+       cmpl    4(%esi), %edx
+       jne     .Lnot_same
+
+       movl    %ebx, 0(%esi)
+       movl    %ecx, 4(%esi)
 
-       movl %ebx,  (%esi)
-       movl %ecx, 4(%esi)
+       orl     $X86_EFLAGS_ZF, (%esp)
 
        popfl
        RET
 
 .Lnot_same:
-       movl  (%esi), %eax
-.Lhalf_same:
-       movl 4(%esi), %edx
+       movl    0(%esi), %eax
+       movl    4(%esi), %edx
+
+       andl    $(~X86_EFLAGS_ZF), (%esp)
 
        popfl
        RET
 
 SYM_FUNC_END(cmpxchg8b_emu)
 EXPORT_SYMBOL(cmpxchg8b_emu)
+
+#endif
+
+#ifndef CONFIG_UML
+
+SYM_FUNC_START(this_cpu_cmpxchg8b_emu)
+
+       pushfl
+       cli
+
+       cmpl    PER_CPU_VAR(0(%esi)), %eax
+       jne     .Lnot_same2
+       cmpl    PER_CPU_VAR(4(%esi)), %edx
+       jne     .Lnot_same2
+
+       movl    %ebx, PER_CPU_VAR(0(%esi))
+       movl    %ecx, PER_CPU_VAR(4(%esi))
+
+       orl     $X86_EFLAGS_ZF, (%esp)
+
+       popfl
+       RET
+
+.Lnot_same2:
+       movl    PER_CPU_VAR(0(%esi)), %eax
+       movl    PER_CPU_VAR(4(%esi)), %edx
+
+       andl    $(~X86_EFLAGS_ZF), (%esp)
+
+       popfl
+       RET
+
+SYM_FUNC_END(this_cpu_cmpxchg8b_emu)
+
+#endif
index 50734a2..cea25ca 100644 (file)
@@ -5,22 +5,34 @@
  * This file contains network checksum routines that are better done
  * in an architecture-specific manner due to speed.
  */
+
 #include <linux/compiler.h>
 #include <linux/export.h>
 #include <asm/checksum.h>
 #include <asm/word-at-a-time.h>
 
-static inline unsigned short from32to16(unsigned a) 
+static inline unsigned short from32to16(unsigned a)
 {
-       unsigned short b = a >> 16; 
+       unsigned short b = a >> 16;
        asm("addw %w2,%w0\n\t"
-           "adcw $0,%w0\n" 
+           "adcw $0,%w0\n"
            : "=r" (b)
            : "0" (b), "r" (a));
        return b;
 }
 
+static inline __wsum csum_tail(u64 temp64, int odd)
+{
+       unsigned int result;
+
+       result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
+       if (unlikely(odd)) {
+               result = from32to16(result);
+               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
+       }
+       return (__force __wsum)result;
+}
+
 /*
  * Do a checksum on an arbitrary memory area.
  * Returns a 32bit checksum.
@@ -35,7 +47,7 @@ static inline unsigned short from32to16(unsigned a)
 __wsum csum_partial(const void *buff, int len, __wsum sum)
 {
        u64 temp64 = (__force u64)sum;
-       unsigned odd, result;
+       unsigned odd;
 
        odd = 1 & (unsigned long) buff;
        if (unlikely(odd)) {
@@ -47,21 +59,52 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
                buff++;
        }
 
-       while (unlikely(len >= 64)) {
+       /*
+        * len == 40 is the hot case due to IPv6 headers, but annotating it likely()
+        * has a noticeable negative effect on codegen for all other cases with
+        * minimal performance benefit here.
+        */
+       if (len == 40) {
                asm("addq 0*8(%[src]),%[res]\n\t"
                    "adcq 1*8(%[src]),%[res]\n\t"
                    "adcq 2*8(%[src]),%[res]\n\t"
                    "adcq 3*8(%[src]),%[res]\n\t"
                    "adcq 4*8(%[src]),%[res]\n\t"
-                   "adcq 5*8(%[src]),%[res]\n\t"
-                   "adcq 6*8(%[src]),%[res]\n\t"
-                   "adcq 7*8(%[src]),%[res]\n\t"
                    "adcq $0,%[res]"
-                   : [res] "+r" (temp64)
-                   : [src] "r" (buff)
-                   : "memory");
-               buff += 64;
-               len -= 64;
+                   : [res] "+r"(temp64)
+                   : [src] "r"(buff), "m"(*(const char(*)[40])buff));
+               return csum_tail(temp64, odd);
+       }
+       if (unlikely(len >= 64)) {
+               /*
+                * Extra accumulators for better ILP in the loop.
+                */
+               u64 tmp_accum, tmp_carries;
+
+               asm("xorl %k[tmp_accum],%k[tmp_accum]\n\t"
+                   "xorl %k[tmp_carries],%k[tmp_carries]\n\t"
+                   "subl $64, %[len]\n\t"
+                   "1:\n\t"
+                   "addq 0*8(%[src]),%[res]\n\t"
+                   "adcq 1*8(%[src]),%[res]\n\t"
+                   "adcq 2*8(%[src]),%[res]\n\t"
+                   "adcq 3*8(%[src]),%[res]\n\t"
+                   "adcl $0,%k[tmp_carries]\n\t"
+                   "addq 4*8(%[src]),%[tmp_accum]\n\t"
+                   "adcq 5*8(%[src]),%[tmp_accum]\n\t"
+                   "adcq 6*8(%[src]),%[tmp_accum]\n\t"
+                   "adcq 7*8(%[src]),%[tmp_accum]\n\t"
+                   "adcl $0,%k[tmp_carries]\n\t"
+                   "addq $64, %[src]\n\t"
+                   "subl $64, %[len]\n\t"
+                   "jge 1b\n\t"
+                   "addq %[tmp_accum],%[res]\n\t"
+                   "adcq %[tmp_carries],%[res]\n\t"
+                   "adcq $0,%[res]"
+                   : [tmp_accum] "=&r"(tmp_accum),
+                     [tmp_carries] "=&r"(tmp_carries), [res] "+r"(temp64),
+                     [len] "+r"(len), [src] "+r"(buff)
+                   : "m"(*(const char *)buff));
        }
 
        if (len & 32) {
@@ -70,45 +113,37 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
                    "adcq 2*8(%[src]),%[res]\n\t"
                    "adcq 3*8(%[src]),%[res]\n\t"
                    "adcq $0,%[res]"
-                       : [res] "+r" (temp64)
-                       : [src] "r" (buff)
-                       : "memory");
+                   : [res] "+r"(temp64)
+                   : [src] "r"(buff), "m"(*(const char(*)[32])buff));
                buff += 32;
        }
        if (len & 16) {
                asm("addq 0*8(%[src]),%[res]\n\t"
                    "adcq 1*8(%[src]),%[res]\n\t"
                    "adcq $0,%[res]"
-                       : [res] "+r" (temp64)
-                       : [src] "r" (buff)
-                       : "memory");
+                   : [res] "+r"(temp64)
+                   : [src] "r"(buff), "m"(*(const char(*)[16])buff));
                buff += 16;
        }
        if (len & 8) {
                asm("addq 0*8(%[src]),%[res]\n\t"
                    "adcq $0,%[res]"
-                       : [res] "+r" (temp64)
-                       : [src] "r" (buff)
-                       : "memory");
+                   : [res] "+r"(temp64)
+                   : [src] "r"(buff), "m"(*(const char(*)[8])buff));
                buff += 8;
        }
        if (len & 7) {
-               unsigned int shift = (8 - (len & 7)) * 8;
+               unsigned int shift = (-len << 3) & 63;
                unsigned long trail;
 
                trail = (load_unaligned_zeropad(buff) << shift) >> shift;
 
                asm("addq %[trail],%[res]\n\t"
                    "adcq $0,%[res]"
-                       : [res] "+r" (temp64)
-                       : [trail] "r" (trail));
+                   : [res] "+r"(temp64)
+                   : [trail] "r"(trail));
        }
-       result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
-       if (unlikely(odd)) {
-               result = from32to16(result);
-               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
-       }
-       return (__force __wsum)result;
+       return csum_tail(temp64, odd);
 }
 EXPORT_SYMBOL(csum_partial);
 
@@ -118,6 +153,6 @@ EXPORT_SYMBOL(csum_partial);
  */
 __sum16 ip_compute_csum(const void *buff, int len)
 {
-       return csum_fold(csum_partial(buff,len,0));
+       return csum_fold(csum_partial(buff, len, 0));
 }
 EXPORT_SYMBOL(ip_compute_csum);
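The assembly in csum_partial() is a one's-complement sum over 64-bit words, finished by the 64-to-32-bit fold in csum_tail(). A plain-C model of the same arithmetic, with illustrative helper names that are not kernel API:

        /* C equivalent of the addq/adcq chains: add with end-around carry */
        static u64 example_csum64(const u64 *p, size_t nwords, u64 sum)
        {
                size_t i;

                for (i = 0; i < nwords; i++) {
                        u64 prev = sum;

                        sum += p[i];
                        sum += (sum < prev);    /* fold the carry, like adcq $0 */
                }
                return sum;
        }

        /* C equivalent of add32_with_carry(): fold the 64-bit sum to 32 bits */
        static u32 example_fold32(u64 sum)
        {
                u64 s = (sum >> 32) + (u32)sum;

                return (u32)s + (u32)(s >> 32);
        }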
index b64a2bd..9c63713 100644 (file)
@@ -143,43 +143,43 @@ SYM_FUNC_END(__get_user_nocheck_8)
 EXPORT_SYMBOL(__get_user_nocheck_8)
 
 
-SYM_CODE_START_LOCAL(.Lbad_get_user_clac)
+SYM_CODE_START_LOCAL(__get_user_handle_exception)
        ASM_CLAC
 .Lbad_get_user:
        xor %edx,%edx
        mov $(-EFAULT),%_ASM_AX
        RET
-SYM_CODE_END(.Lbad_get_user_clac)
+SYM_CODE_END(__get_user_handle_exception)
 
 #ifdef CONFIG_X86_32
-SYM_CODE_START_LOCAL(.Lbad_get_user_8_clac)
+SYM_CODE_START_LOCAL(__get_user_8_handle_exception)
        ASM_CLAC
 bad_get_user_8:
        xor %edx,%edx
        xor %ecx,%ecx
        mov $(-EFAULT),%_ASM_AX
        RET
-SYM_CODE_END(.Lbad_get_user_8_clac)
+SYM_CODE_END(__get_user_8_handle_exception)
 #endif
 
 /* get_user */
-       _ASM_EXTABLE(1b, .Lbad_get_user_clac)
-       _ASM_EXTABLE(2b, .Lbad_get_user_clac)
-       _ASM_EXTABLE(3b, .Lbad_get_user_clac)
+       _ASM_EXTABLE(1b, __get_user_handle_exception)
+       _ASM_EXTABLE(2b, __get_user_handle_exception)
+       _ASM_EXTABLE(3b, __get_user_handle_exception)
 #ifdef CONFIG_X86_64
-       _ASM_EXTABLE(4b, .Lbad_get_user_clac)
+       _ASM_EXTABLE(4b, __get_user_handle_exception)
 #else
-       _ASM_EXTABLE(4b, .Lbad_get_user_8_clac)
-       _ASM_EXTABLE(5b, .Lbad_get_user_8_clac)
+       _ASM_EXTABLE(4b, __get_user_8_handle_exception)
+       _ASM_EXTABLE(5b, __get_user_8_handle_exception)
 #endif
 
 /* __get_user */
-       _ASM_EXTABLE(6b, .Lbad_get_user_clac)
-       _ASM_EXTABLE(7b, .Lbad_get_user_clac)
-       _ASM_EXTABLE(8b, .Lbad_get_user_clac)
+       _ASM_EXTABLE(6b, __get_user_handle_exception)
+       _ASM_EXTABLE(7b, __get_user_handle_exception)
+       _ASM_EXTABLE(8b, __get_user_handle_exception)
 #ifdef CONFIG_X86_64
-       _ASM_EXTABLE(9b, .Lbad_get_user_clac)
+       _ASM_EXTABLE(9b, __get_user_handle_exception)
 #else
-       _ASM_EXTABLE(9b, .Lbad_get_user_8_clac)
-       _ASM_EXTABLE(10b, .Lbad_get_user_8_clac)
+       _ASM_EXTABLE(9b, __get_user_8_handle_exception)
+       _ASM_EXTABLE(10b, __get_user_8_handle_exception)
 #endif
index 0266186..0559b20 100644 (file)
@@ -38,10 +38,12 @@ SYM_FUNC_START(__memmove)
        cmp %rdi, %r8
        jg 2f
 
-       /* FSRM implies ERMS => no length checks, do the copy directly */
+#define CHECK_LEN      cmp $0x20, %rdx; jb 1f
+#define MEMMOVE_BYTES  movq %rdx, %rcx; rep movsb; RET
 .Lmemmove_begin_forward:
-       ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
-       ALTERNATIVE "", "jmp .Lmemmove_erms", X86_FEATURE_ERMS
+       ALTERNATIVE_2 __stringify(CHECK_LEN), \
+                     __stringify(CHECK_LEN; MEMMOVE_BYTES), X86_FEATURE_ERMS, \
+                     __stringify(MEMMOVE_BYTES), X86_FEATURE_FSRM
 
        /*
  * The movsq instruction has a high startup latency
@@ -207,11 +209,6 @@ SYM_FUNC_START(__memmove)
        movb %r11b, (%rdi)
 13:
        RET
-
-.Lmemmove_erms:
-       movq %rdx, %rcx
-       rep movsb
-       RET
 SYM_FUNC_END(__memmove)
 EXPORT_SYMBOL(__memmove)
 
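Spelled out, the ALTERNATIVE_2 dispatch added at .Lmemmove_begin_forward picks one of three behaviours, roughly as in this pseudo-C sketch of the intent (not generated code; small_copy stands for the short-length path at label 1):

        if (boot_cpu_has(X86_FEATURE_FSRM)) {
                rep_movsb(dst, src, len);               /* MEMMOVE_BYTES     */
                return;
        } else if (boot_cpu_has(X86_FEATURE_ERMS)) {
                if (len < 0x20)
                        goto small_copy;                /* CHECK_LEN: jb 1f  */
                rep_movsb(dst, src, len);               /* MEMMOVE_BYTES     */
                return;
        } else {
                if (len < 0x20)
                        goto small_copy;                /* CHECK_LEN: jb 1f  */
                /* otherwise fall through to the movsq-based copy */
        }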
index b09cd2a..47fd9bd 100644 (file)
@@ -27,14 +27,14 @@ void msrs_free(struct msr *msrs)
 EXPORT_SYMBOL(msrs_free);
 
 /**
- * Read an MSR with error handling
- *
+ * msr_read - Read an MSR with error handling
  * @msr: MSR to read
  * @m: value to read into
  *
  * It returns read data only on success, otherwise it doesn't change the output
  * argument @m.
  *
+ * Return: %0 for success, otherwise an error code
  */
 static int msr_read(u32 msr, struct msr *m)
 {
@@ -49,10 +49,12 @@ static int msr_read(u32 msr, struct msr *m)
 }
 
 /**
- * Write an MSR with error handling
+ * msr_write - Write an MSR with error handling
  *
  * @msr: MSR to write
  * @m: value to write
+ *
+ * Return: %0 for success, otherwise an error code
  */
 static int msr_write(u32 msr, struct msr *m)
 {
@@ -88,12 +90,14 @@ static inline int __flip_bit(u32 msr, u8 bit, bool set)
 }
 
 /**
- * Set @bit in a MSR @msr.
+ * msr_set_bit - Set @bit in a MSR @msr.
+ * @msr: MSR to write
+ * @bit: bit number to set
  *
- * Retval:
- * < 0: An error was encountered.
- * = 0: Bit was already set.
- * > 0: Hardware accepted the MSR write.
+ * Return:
+ * < 0: An error was encountered.
+ * = 0: Bit was already set.
+ * > 0: Hardware accepted the MSR write.
  */
 int msr_set_bit(u32 msr, u8 bit)
 {
@@ -101,12 +105,14 @@ int msr_set_bit(u32 msr, u8 bit)
 }
 
 /**
- * Clear @bit in a MSR @msr.
+ * msr_clear_bit - Clear @bit in a MSR @msr.
+ * @msr: MSR to write
+ * @bit: bit number to clear
  *
- * Retval:
- * < 0: An error was encountered.
- * = 0: Bit was already cleared.
- * > 0: Hardware accepted the MSR write.
+ * Return:
+ * < 0: An error was encountered.
+ * = 0: Bit was already cleared.
+ * > 0: Hardware accepted the MSR write.
  */
 int msr_clear_bit(u32 msr, u8 bit)
 {
index 3062d09..1451e0c 100644 (file)
@@ -131,22 +131,22 @@ SYM_FUNC_START(__put_user_nocheck_8)
 SYM_FUNC_END(__put_user_nocheck_8)
 EXPORT_SYMBOL(__put_user_nocheck_8)
 
-SYM_CODE_START_LOCAL(.Lbad_put_user_clac)
+SYM_CODE_START_LOCAL(__put_user_handle_exception)
        ASM_CLAC
 .Lbad_put_user:
        movl $-EFAULT,%ecx
        RET
-SYM_CODE_END(.Lbad_put_user_clac)
+SYM_CODE_END(__put_user_handle_exception)
 
-       _ASM_EXTABLE(1b, .Lbad_put_user_clac)
-       _ASM_EXTABLE(2b, .Lbad_put_user_clac)
-       _ASM_EXTABLE(3b, .Lbad_put_user_clac)
-       _ASM_EXTABLE(4b, .Lbad_put_user_clac)
-       _ASM_EXTABLE(5b, .Lbad_put_user_clac)
-       _ASM_EXTABLE(6b, .Lbad_put_user_clac)
-       _ASM_EXTABLE(7b, .Lbad_put_user_clac)
-       _ASM_EXTABLE(9b, .Lbad_put_user_clac)
+       _ASM_EXTABLE(1b, __put_user_handle_exception)
+       _ASM_EXTABLE(2b, __put_user_handle_exception)
+       _ASM_EXTABLE(3b, __put_user_handle_exception)
+       _ASM_EXTABLE(4b, __put_user_handle_exception)
+       _ASM_EXTABLE(5b, __put_user_handle_exception)
+       _ASM_EXTABLE(6b, __put_user_handle_exception)
+       _ASM_EXTABLE(7b, __put_user_handle_exception)
+       _ASM_EXTABLE(9b, __put_user_handle_exception)
 #ifdef CONFIG_X86_32
-       _ASM_EXTABLE(8b, .Lbad_put_user_clac)
-       _ASM_EXTABLE(10b, .Lbad_put_user_clac)
+       _ASM_EXTABLE(8b, __put_user_handle_exception)
+       _ASM_EXTABLE(10b, __put_user_handle_exception)
 #endif
index b3b1e37..3fd066d 100644 (file)
@@ -143,7 +143,7 @@ SYM_CODE_END(__x86_indirect_jump_thunk_array)
  *    from re-poisoning the BTB prediction.
  */
        .align 64
-       .skip 63, 0xcc
+       .skip 64 - (__x86_return_thunk - zen_untrain_ret), 0xcc
 SYM_START(zen_untrain_ret, SYM_L_GLOBAL, SYM_A_NONE)
        ANNOTATE_NOENDBR
        /*
index 003d901..e9251b8 100644 (file)
@@ -9,6 +9,7 @@
 #include <linux/export.h>
 #include <linux/uaccess.h>
 #include <linux/highmem.h>
+#include <linux/libnvdimm.h>
 
 /*
  * Zero Userspace
index 7fe56c5..91c52ea 100644 (file)
@@ -32,6 +32,7 @@
 #include <asm/traps.h>
 #include <asm/user.h>
 #include <asm/fpu/api.h>
+#include <asm/fpu/regset.h>
 
 #include "fpu_system.h"
 #include "fpu_emu.h"
index 2c54b76..d9efa35 100644 (file)
@@ -3,6 +3,7 @@
 #include <linux/export.h>
 #include <linux/swap.h> /* for totalram_pages */
 #include <linux/memblock.h>
+#include <asm/numa.h>
 
 void __init set_highmem_pages_init(void)
 {
index d4e2648..b63403d 100644 (file)
@@ -45,7 +45,6 @@
 #include <asm/olpc_ofw.h>
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
-#include <asm/paravirt.h>
 #include <asm/setup.h>
 #include <asm/set_memory.h>
 #include <asm/page_types.h>
@@ -74,7 +73,6 @@ static pmd_t * __init one_md_table_init(pgd_t *pgd)
 #ifdef CONFIG_X86_PAE
        if (!(pgd_val(*pgd) & _PAGE_PRESENT)) {
                pmd_table = (pmd_t *)alloc_low_page();
-               paravirt_alloc_pmd(&init_mm, __pa(pmd_table) >> PAGE_SHIFT);
                set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
                p4d = p4d_offset(pgd, 0);
                pud = pud_offset(p4d, 0);
@@ -99,7 +97,6 @@ static pte_t * __init one_page_table_init(pmd_t *pmd)
        if (!(pmd_val(*pmd) & _PAGE_PRESENT)) {
                pte_t *page_table = (pte_t *)alloc_low_page();
 
-               paravirt_alloc_pte(&init_mm, __pa(page_table) >> PAGE_SHIFT);
                set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE));
                BUG_ON(page_table != pte_offset_kernel(pmd, 0));
        }
@@ -181,12 +178,10 @@ static pte_t *__init page_table_kmap_check(pte_t *pte, pmd_t *pmd,
                        set_pte(newpte + i, pte[i]);
                *adr = (void *)(((unsigned long)(*adr)) + PAGE_SIZE);
 
-               paravirt_alloc_pte(&init_mm, __pa(newpte) >> PAGE_SHIFT);
                set_pmd(pmd, __pmd(__pa(newpte)|_PAGE_TABLE));
                BUG_ON(newpte != pte_offset_kernel(pmd, 0));
                __flush_tlb_all();
 
-               paravirt_release_pte(__pa(pte) >> PAGE_SHIFT);
                pte = newpte;
        }
        BUG_ON(vaddr < fix_to_virt(FIX_KMAP_BEGIN - 1)
@@ -482,7 +477,6 @@ void __init native_pagetable_init(void)
                                pfn, pmd, __pa(pmd), pte, __pa(pte));
                pte_clear(NULL, va, pte);
        }
-       paravirt_alloc_pmd(&init_mm, __pa(base) >> PAGE_SHIFT);
        paging_init();
 }
 
@@ -491,15 +485,8 @@ void __init native_pagetable_init(void)
  * point, we've been running on some set of pagetables constructed by
  * the boot process.
  *
- * If we're booting on native hardware, this will be a pagetable
- * constructed in arch/x86/kernel/head_32.S.  The root of the
- * pagetable will be swapper_pg_dir.
- *
- * If we're booting paravirtualized under a hypervisor, then there are
- * more options: we may already be running PAE, and the pagetable may
- * or may not be based in swapper_pg_dir.  In any case,
- * paravirt_pagetable_init() will set up swapper_pg_dir
- * appropriately for the rest of the initialization to work.
+ * This will be a pagetable constructed in arch/x86/kernel/head_32.S.
+ * The root of the pagetable will be swapper_pg_dir.
  *
  * In general, pagetable_init() assumes that the pagetable may already
  * be partially populated, and so it avoids stomping on any existing
index 557f0fe..37db264 100644 (file)
@@ -172,10 +172,10 @@ void __meminit init_trampoline_kaslr(void)
                set_p4d(p4d_tramp,
                        __p4d(_KERNPG_TABLE | __pa(pud_page_tramp)));
 
-               set_pgd(&trampoline_pgd_entry,
-                       __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+               trampoline_pgd_entry =
+                       __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp));
        } else {
-               set_pgd(&trampoline_pgd_entry,
-                       __pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
+               trampoline_pgd_entry =
+                       __pgd(_KERNPG_TABLE | __pa(pud_page_tramp));
        }
 }
index e0b51c0..54bbd51 100644 (file)
@@ -319,7 +319,7 @@ static void enc_dec_hypercall(unsigned long vaddr, int npages, bool enc)
 #endif
 }
 
-static void amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
+static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
 {
        /*
         * To maintain the security guarantees of SEV-SNP guests, make sure
@@ -327,6 +327,8 @@ static void amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool
         */
        if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && !enc)
                snp_set_memory_shared(vaddr, npages);
+
+       return true;
 }
 
 /* Return true unconditionally: return value doesn't matter for the SEV side */
@@ -501,6 +503,21 @@ void __init sme_early_init(void)
        x86_platform.guest.enc_status_change_finish  = amd_enc_status_change_finish;
        x86_platform.guest.enc_tlb_flush_required    = amd_enc_tlb_flush_required;
        x86_platform.guest.enc_cache_flush_required  = amd_enc_cache_flush_required;
+
+       /*
+        * AMD-SEV-ES intercepts the RDMSR to read the X2APIC ID in the
+        * parallel bringup low level code. That raises #VC which cannot be
+        * handled there.
+        * It does not provide a RDMSR GHCB protocol so the early startup
+        * code cannot directly communicate with the secure firmware. The
+        * alternative solution to retrieve the APIC ID via CPUID(0xb),
+        * which is covered by the GHCB protocol, is not viable either
+        * because there is no enforcement of the CPUID(0xb) provided
+        * "initial" APIC ID to be the same as the real APIC ID.
+        * Disable parallel bootup.
+        */
+       if (sev_status & MSR_AMD64_SEV_ES_ENABLED)
+               x86_cpuinit.parallel_bringup = false;
 }
 
 void __init mem_encrypt_free_decrypted_mem(void)
index c6efcf5..d73aeb1 100644 (file)
@@ -188,7 +188,7 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
        if (pmd_large(*pmd))
                return;
 
-       pte = pte_offset_map(pmd, ppd->vaddr);
+       pte = pte_offset_kernel(pmd, ppd->vaddr);
        if (pte_none(*pte))
                set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
 }
@@ -612,7 +612,7 @@ void __init sme_enable(struct boot_params *bp)
 out:
        if (sme_me_mask) {
                physical_mask &= ~sme_me_mask;
-               cc_set_vendor(CC_VENDOR_AMD);
+               cc_vendor = CC_VENDOR_AMD;
                cc_set_mask(sme_me_mask);
        }
 }
index 7159cf7..df4182b 100644 (file)
@@ -9,6 +9,7 @@
 #include <linux/mm.h>
 #include <linux/interrupt.h>
 #include <linux/seq_file.h>
+#include <linux/proc_fs.h>
 #include <linux/debugfs.h>
 #include <linux/pfn.h>
 #include <linux/percpu.h>
@@ -231,7 +232,7 @@ within_inclusive(unsigned long addr, unsigned long start, unsigned long end)
  * points to #2, but almost all physical-to-virtual translations point to #1.
  *
  * This is so that we can have both a directmap of all physical memory *and*
- * take full advantage of the the limited (s32) immediate addressing range (2G)
+ * take full advantage of the limited (s32) immediate addressing range (2G)
  * of x86_64.
  *
  * See Documentation/arch/x86/x86_64/mm.rst for more detail.
@@ -2151,7 +2152,8 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
                cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
 
        /* Notify hypervisor that we are about to set/clr encryption attribute. */
-       x86_platform.guest.enc_status_change_prepare(addr, numpages, enc);
+       if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
+               return -EIO;
 
        ret = __change_page_attr_set_clr(&cpa, 1);
 
index e4f499e..15a8009 100644 (file)
@@ -702,14 +702,8 @@ void p4d_clear_huge(p4d_t *p4d)
  * pud_set_huge - setup kernel PUD mapping
  *
  * MTRRs can override PAT memory types with 4KiB granularity. Therefore, this
- * function sets up a huge page only if any of the following conditions are met:
- *
- * - MTRRs are disabled, or
- *
- * - MTRRs are enabled and the range is completely covered by a single MTRR, or
- *
- * - MTRRs are enabled and the corresponding MTRR memory type is WB, which
- *   has no effect on the requested PAT memory type.
+ * function sets up a huge page only if the complete range has the same MTRR
+ * caching mode.
  *
  * Callers should try to decrease page size (1GB -> 2MB -> 4K) if the bigger
  * page mapping attempt fails.
@@ -718,11 +712,10 @@ void p4d_clear_huge(p4d_t *p4d)
  */
 int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
 {
-       u8 mtrr, uniform;
+       u8 uniform;
 
-       mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
-       if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
-           (mtrr != MTRR_TYPE_WRBACK))
+       mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
+       if (!uniform)
                return 0;
 
        /* Bail out if we are on a populated non-leaf entry: */
@@ -745,11 +738,10 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
  */
 int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
 {
-       u8 mtrr, uniform;
+       u8 uniform;
 
-       mtrr = mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
-       if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
-           (mtrr != MTRR_TYPE_WRBACK)) {
+       mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
+       if (!uniform) {
                pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a huge-page mapping due to MTRR override.\n",
                             __func__, addr, addr + PMD_SIZE);
                return 0;
index 584c25b..8731370 100644 (file)
@@ -83,7 +83,7 @@ static void ehci_reg_read(struct sim_dev_reg *reg, u32 *value)
                *value |= 0x100;
 }
 
-void sata_revid_init(struct sim_dev_reg *reg)
+static void sata_revid_init(struct sim_dev_reg *reg)
 {
        reg->sim_reg.value = 0x01060100;
        reg->sim_reg.mask = 0;
@@ -172,7 +172,7 @@ static inline void extract_bytes(u32 *value, int reg, int len)
        *value &= mask;
 }
 
-int bridge_read(unsigned int devfn, int reg, int len, u32 *value)
+static int bridge_read(unsigned int devfn, int reg, int len, u32 *value)
 {
        u32 av_bridge_base, av_bridge_limit;
        int retval = 0;
index f3f2d87..e9f99c5 100644 (file)
@@ -96,6 +96,9 @@ static const unsigned long * const efi_tables[] = {
 #ifdef CONFIG_EFI_COCO_SECRET
        &efi.coco_secret,
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+       &efi.unaccepted,
+#endif
 };
 
 u64 efi_setup;         /* efi setup_data physical address */
index 75e3319..74ebd68 100644 (file)
@@ -234,7 +234,7 @@ static int __init olpc_dt_compatible_match(phandle node, const char *compat)
        return 0;
 }
 
-void __init olpc_dt_fixup(void)
+static void __init olpc_dt_fixup(void)
 {
        phandle node;
        u32 board_rev;
index 7a4d5e9..63230ff 100644 (file)
@@ -351,43 +351,6 @@ static int bsp_pm_callback(struct notifier_block *nb, unsigned long action,
        case PM_HIBERNATION_PREPARE:
                ret = bsp_check();
                break;
-#ifdef CONFIG_DEBUG_HOTPLUG_CPU0
-       case PM_RESTORE_PREPARE:
-               /*
-                * When system resumes from hibernation, online CPU0 because
-                * 1. it's required for resume and
-                * 2. the CPU was online before hibernation
-                */
-               if (!cpu_online(0))
-                       _debug_hotplug_cpu(0, 1);
-               break;
-       case PM_POST_RESTORE:
-               /*
-                * When a resume really happens, this code won't be called.
-                *
-                * This code is called only when user space hibernation software
-                * prepares for snapshot device during boot time. So we just
-                * call _debug_hotplug_cpu() to restore to CPU0's state prior to
-                * preparing the snapshot device.
-                *
-                * This works for normal boot case in our CPU0 hotplug debug
-                * mode, i.e. CPU0 is offline and user mode hibernation
-                * software initializes during boot time.
-                *
-                * If CPU0 is online and user application accesses snapshot
-                * device after boot time, this will offline CPU0 and user may
-                * see different CPU0 state before and after accessing
-                * the snapshot device. But hopefully this is not a case when
-                * user debugging CPU0 hotplug. Even if users hit this case,
-                * they can easily online CPU0 back.
-                *
-                * To simplify this debug code, we only consider normal boot
-                * case. Otherwise we need to remember CPU0's state and restore
-                * to that state and resolve racy conditions etc.
-                */
-               _debug_hotplug_cpu(0, 0);
-               break;
-#endif
        default:
                break;
        }
index 42abd6a..c2a29be 100644 (file)
@@ -12,7 +12,7 @@ $(obj)/string.o: $(srctree)/arch/x86/boot/compressed/string.c FORCE
 $(obj)/sha256.o: $(srctree)/lib/crypto/sha256.c FORCE
        $(call if_changed_rule,cc_o_c)
 
-CFLAGS_sha256.o := -D__DISABLE_EXPORTS
+CFLAGS_sha256.o := -D__DISABLE_EXPORTS -D__NO_FORTIFY
 
 # When profile-guided optimization is enabled, llvm emits two different
 # overlapping text sections, which is not supported by kexec. Remove profile
index af56581..788e555 100644 (file)
@@ -154,6 +154,9 @@ static void __init setup_real_mode(void)
 
        trampoline_header->flags = 0;
 
+       trampoline_lock = &trampoline_header->lock;
+       *trampoline_lock = 0;
+
        trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
 
        /* Map the real mode stub as virtual == physical */
index e38d61d..c9f76fa 100644 (file)
        .text
        .code16
 
+.macro LOCK_AND_LOAD_REALMODE_ESP lock_pa=0
+       /*
+        * Make sure only one CPU fiddles with the realmode stack
+        */
+.Llock_rm\@:
+       .if \lock_pa
+        lock btsl       $0, pa_tr_lock
+       .else
+        lock btsl       $0, tr_lock
+       .endif
+        jnc             2f
+        pause
+        jmp             .Llock_rm\@
+2:
+       # Setup stack
+       movl    $rm_stack_end, %esp
+.endm
+
        .balign PAGE_SIZE
 SYM_CODE_START(trampoline_start)
        cli                     # We should be safe anyway
@@ -49,8 +67,7 @@ SYM_CODE_START(trampoline_start)
        mov     %ax, %es
        mov     %ax, %ss
 
-       # Setup stack
-       movl    $rm_stack_end, %esp
+       LOCK_AND_LOAD_REALMODE_ESP
 
        call    verify_cpu              # Verify the cpu supports long mode
        testl   %eax, %eax              # Check for return code
@@ -93,8 +110,7 @@ SYM_CODE_START(sev_es_trampoline_start)
        mov     %ax, %es
        mov     %ax, %ss
 
-       # Setup stack
-       movl    $rm_stack_end, %esp
+       LOCK_AND_LOAD_REALMODE_ESP
 
        jmp     .Lswitch_to_protected
 SYM_CODE_END(sev_es_trampoline_start)
@@ -177,7 +193,7 @@ SYM_CODE_START(pa_trampoline_compat)
         * In compatibility mode.  Prep ESP and DX for startup_32, then disable
         * paging and complete the switch to legacy 32-bit mode.
         */
-       movl    $rm_stack_end, %esp
+       LOCK_AND_LOAD_REALMODE_ESP lock_pa=1
        movw    $__KERNEL_DS, %dx
 
        movl    $(CR0_STATE & ~X86_CR0_PG), %eax
@@ -241,6 +257,7 @@ SYM_DATA_START(trampoline_header)
        SYM_DATA(tr_efer,               .space 8)
        SYM_DATA(tr_cr4,                .space 4)
        SYM_DATA(tr_flags,              .space 4)
+       SYM_DATA(tr_lock,               .space 4)
 SYM_DATA_END(trampoline_header)
 
 #include "trampoline_common.S"
index 9fd2484..9e91430 100644 (file)
@@ -10,6 +10,7 @@
 #include <linux/pci.h>
 #include <linux/module.h>
 #include <linux/vgaarb.h>
+#include <asm/fb.h>
 
 int fb_is_primary_device(struct fb_info *info)
 {
index 7d7ffb9..863d0d6 100644 (file)
@@ -16,6 +16,8 @@
 #include <asm/setup.h>
 #include <asm/xen/hypercall.h>
 
+#include "xen-ops.h"
+
 static efi_char16_t vendor[100] __initdata;
 
 static efi_system_table_t efi_systab_xen __initdata = {
index c1cd28e..a6820ca 100644 (file)
@@ -161,13 +161,12 @@ static int xen_cpu_up_prepare_hvm(unsigned int cpu)
        int rc = 0;
 
        /*
-        * This can happen if CPU was offlined earlier and
-        * offlining timed out in common_cpu_die().
+        * If a CPU was offlined earlier and offlining timed out then the
+        * lock mechanism is still initialized. Uninit it unconditionally
+        * as it's safe to call even if already uninited. Interrupts and
+        * timer have already been handled in xen_cpu_dead_hvm().
         */
-       if (cpu_report_state(cpu) == CPU_DEAD_FROZEN) {
-               xen_smp_intr_free(cpu);
-               xen_uninit_lock_cpu(cpu);
-       }
+       xen_uninit_lock_cpu(cpu);
 
        if (cpu_acpi_id(cpu) != U32_MAX)
                per_cpu(xen_vcpu_id, cpu) = cpu_acpi_id(cpu);
index 093b78c..93b6582 100644 (file)
@@ -68,6 +68,7 @@
 #include <asm/reboot.h>
 #include <asm/hypervisor.h>
 #include <asm/mach_traps.h>
+#include <asm/mtrr.h>
 #include <asm/mwait.h>
 #include <asm/pci_x86.h>
 #include <asm/cpu.h>
@@ -119,6 +120,54 @@ static int __init parse_xen_msr_safe(char *str)
 }
 early_param("xen_msr_safe", parse_xen_msr_safe);
 
+/* Get MTRR settings from Xen and put them into mtrr_state. */
+static void __init xen_set_mtrr_data(void)
+{
+#ifdef CONFIG_MTRR
+       struct xen_platform_op op = {
+               .cmd = XENPF_read_memtype,
+               .interface_version = XENPF_INTERFACE_VERSION,
+       };
+       unsigned int reg;
+       unsigned long mask;
+       uint32_t eax, width;
+       static struct mtrr_var_range var[MTRR_MAX_VAR_RANGES] __initdata;
+
+       /* Get physical address width (only 64-bit cpus supported). */
+       width = 36;
+       eax = cpuid_eax(0x80000000);
+       if ((eax >> 16) == 0x8000 && eax >= 0x80000008) {
+               eax = cpuid_eax(0x80000008);
+               width = eax & 0xff;
+       }
+
+       for (reg = 0; reg < MTRR_MAX_VAR_RANGES; reg++) {
+               op.u.read_memtype.reg = reg;
+               if (HYPERVISOR_platform_op(&op))
+                       break;
+
+               /*
+                * Only called in dom0, which has all RAM PFNs mapped at
+                * RAM MFNs, and all PCI space etc. is identity mapped.
+                * This means we can treat MFN == PFN regarding MTRR settings.
+                */
+               var[reg].base_lo = op.u.read_memtype.type;
+               var[reg].base_lo |= op.u.read_memtype.mfn << PAGE_SHIFT;
+               var[reg].base_hi = op.u.read_memtype.mfn >> (32 - PAGE_SHIFT);
+               mask = ~((op.u.read_memtype.nr_mfns << PAGE_SHIFT) - 1);
+               mask &= (1UL << width) - 1;
+               if (mask)
+                       mask |= MTRR_PHYSMASK_V;
+               var[reg].mask_lo = mask;
+               var[reg].mask_hi = mask >> 32;
+       }
+
+       /* Only overwrite MTRR state if any MTRR could be obtained from Xen. */
+       if (reg)
+               mtrr_overwrite_state(var, reg, MTRR_TYPE_UNCACHABLE);
+#endif
+}
+
 static void __init xen_pv_init_platform(void)
 {
        /* PV guests can't operate virtio devices without grants. */
@@ -135,6 +184,11 @@ static void __init xen_pv_init_platform(void)
 
        /* pvclock is in shared info area */
        xen_init_time_ops();
+
+       if (xen_initial_domain())
+               xen_set_mtrr_data();
+       else
+               mtrr_overwrite_state(NULL, 0, MTRR_TYPE_WRBACK);
 }
 
 static void __init xen_pv_guest_late_init(void)
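A hedged worked example of the base/mask construction in xen_set_mtrr_data() above, using made-up numbers: a write-back range starting at MFN 0x100000 (4 GiB) covering 0x40000 pages (1 GiB), on a CPU with a 36-bit physical address width:

	static u64 xen_mtrr_mask_example(void)
	{
		unsigned long nr_mfns = 0x40000;
		unsigned long width = 36;
		unsigned long mask;

		mask = ~((nr_mfns << PAGE_SHIFT) - 1);	/* 0xffffffffc0000000 */
		mask &= (1UL << width) - 1;		/* 0x0000000fc0000000 */
		mask |= MTRR_PHYSMASK_V;		/* bit 11: range valid */

		/*
		 * Split across the lo/hi halves as the loop above does:
		 *   base_lo = 0x00000006 (low 32 bits of 4 GiB | MTRR_TYPE_WRBACK)
		 *   base_hi = 0x00000001 (mfn >> (32 - PAGE_SHIFT))
		 *   mask_lo = 0xc0000800, mask_hi = 0x0000000f
		 */
		return mask;
	}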
index b3b8d28..e0a9751 100644 (file)
 #include "mmu.h"
 #include "debugfs.h"
 
+/*
+ * Prototypes for functions called via PV_CALLEE_SAVE_REGS_THUNK() in order
+ * to avoid warnings with "-Wmissing-prototypes".
+ */
+pteval_t xen_pte_val(pte_t pte);
+pgdval_t xen_pgd_val(pgd_t pgd);
+pmdval_t xen_pmd_val(pmd_t pmd);
+pudval_t xen_pud_val(pud_t pud);
+p4dval_t xen_p4d_val(p4d_t p4d);
+pte_t xen_make_pte(pteval_t pte);
+pgd_t xen_make_pgd(pgdval_t pgd);
+pmd_t xen_make_pmd(pmdval_t pmd);
+pud_t xen_make_pud(pudval_t pud);
+p4d_t xen_make_p4d(p4dval_t p4d);
+pte_t xen_make_pte_init(pteval_t pte);
+
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
 /* l3 pud for userspace vsyscall mapping */
 static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss;
index c2be3ef..8b5cf7b 100644 (file)
@@ -6,6 +6,7 @@
  */
 
 #include <linux/init.h>
+#include <linux/iscsi_ibft.h>
 #include <linux/sched.h>
 #include <linux/kstrtox.h>
 #include <linux/mm.h>
@@ -764,17 +765,26 @@ char * __init xen_memory_setup(void)
        BUG_ON(memmap.nr_entries == 0);
        xen_e820_table.nr_entries = memmap.nr_entries;
 
-       /*
-        * Xen won't allow a 1:1 mapping to be created to UNUSABLE
-        * regions, so if we're using the machine memory map leave the
-        * region as RAM as it is in the pseudo-physical map.
-        *
-        * UNUSABLE regions in domUs are not handled and will need
-        * a patch in the future.
-        */
-       if (xen_initial_domain())
+       if (xen_initial_domain()) {
+               /*
+                * Xen won't allow a 1:1 mapping to be created to UNUSABLE
+                * regions, so if we're using the machine memory map leave the
+                * region as RAM as it is in the pseudo-physical map.
+                *
+                * UNUSABLE regions in domUs are not handled and will need
+                * a patch in the future.
+                */
                xen_ignore_unusable();
 
+#ifdef CONFIG_ISCSI_IBFT_FIND
+               /* Reserve 0.5 MiB to 1 MiB region so iBFT can be found */
+               xen_e820_table.entries[xen_e820_table.nr_entries].addr = IBFT_START;
+               xen_e820_table.entries[xen_e820_table.nr_entries].size = IBFT_END - IBFT_START;
+               xen_e820_table.entries[xen_e820_table.nr_entries].type = E820_TYPE_RESERVED;
+               xen_e820_table.nr_entries++;
+#endif
+       }
+
        /* Make sure the Xen-supplied memory map is well-ordered. */
        e820__update_table(&xen_e820_table);
 
index 22fb982..c20cbb1 100644 (file)
@@ -2,6 +2,10 @@
 #ifndef _XEN_SMP_H
 
 #ifdef CONFIG_SMP
+
+void asm_cpu_bringup_and_idle(void);
+asmlinkage void cpu_bringup_and_idle(void);
+
 extern void xen_send_IPI_mask(const struct cpumask *mask,
                              int vector);
 extern void xen_send_IPI_mask_allbutself(const struct cpumask *mask,
index b70afdf..ac95d19 100644 (file)
@@ -55,18 +55,16 @@ static void __init xen_hvm_smp_prepare_cpus(unsigned int max_cpus)
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
-static void xen_hvm_cpu_die(unsigned int cpu)
+static void xen_hvm_cleanup_dead_cpu(unsigned int cpu)
 {
-       if (common_cpu_die(cpu) == 0) {
-               if (xen_have_vector_callback) {
-                       xen_smp_intr_free(cpu);
-                       xen_uninit_lock_cpu(cpu);
-                       xen_teardown_timer(cpu);
-               }
+       if (xen_have_vector_callback) {
+               xen_smp_intr_free(cpu);
+               xen_uninit_lock_cpu(cpu);
+               xen_teardown_timer(cpu);
        }
 }
 #else
-static void xen_hvm_cpu_die(unsigned int cpu)
+static void xen_hvm_cleanup_dead_cpu(unsigned int cpu)
 {
        BUG();
 }
@@ -77,7 +75,7 @@ void __init xen_hvm_smp_init(void)
        smp_ops.smp_prepare_boot_cpu = xen_hvm_smp_prepare_boot_cpu;
        smp_ops.smp_prepare_cpus = xen_hvm_smp_prepare_cpus;
        smp_ops.smp_cpus_done = xen_smp_cpus_done;
-       smp_ops.cpu_die = xen_hvm_cpu_die;
+       smp_ops.cleanup_dead_cpu = xen_hvm_cleanup_dead_cpu;
 
        if (!xen_have_vector_callback) {
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
index a9cf8c8..d5ae5de 100644 (file)
@@ -55,13 +55,13 @@ static DEFINE_PER_CPU(struct xen_common_irq, xen_irq_work) = { .irq = -1 };
 static DEFINE_PER_CPU(struct xen_common_irq, xen_pmu_irq) = { .irq = -1 };
 
 static irqreturn_t xen_irq_work_interrupt(int irq, void *dev_id);
-void asm_cpu_bringup_and_idle(void);
 
 static void cpu_bringup(void)
 {
        int cpu;
 
        cr4_init();
+       cpuhp_ap_sync_alive();
        cpu_init();
        touch_softlockup_watchdog();
 
@@ -83,7 +83,7 @@ static void cpu_bringup(void)
 
        set_cpu_online(cpu, true);
 
-       cpu_set_state_online(cpu);  /* Implies full memory barrier. */
+       smp_mb();
 
        /* We can take interrupts now: we're officially "up". */
        local_irq_enable();
@@ -254,15 +254,12 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle)
        struct desc_struct *gdt;
        unsigned long gdt_mfn;
 
-       /* used to tell cpu_init() that it can proceed with initialization */
-       cpumask_set_cpu(cpu, cpu_callout_mask);
        if (cpumask_test_and_set_cpu(cpu, xen_cpu_initialized_map))
                return 0;
 
        ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL);
        if (ctxt == NULL) {
                cpumask_clear_cpu(cpu, xen_cpu_initialized_map);
-               cpumask_clear_cpu(cpu, cpu_callout_mask);
                return -ENOMEM;
        }
 
@@ -316,7 +313,7 @@ cpu_initialize_context(unsigned int cpu, struct task_struct *idle)
        return 0;
 }
 
-static int xen_pv_cpu_up(unsigned int cpu, struct task_struct *idle)
+static int xen_pv_kick_ap(unsigned int cpu, struct task_struct *idle)
 {
        int rc;
 
@@ -326,14 +323,6 @@ static int xen_pv_cpu_up(unsigned int cpu, struct task_struct *idle)
 
        xen_setup_runstate_info(cpu);
 
-       /*
-        * PV VCPUs are always successfully taken down (see 'while' loop
-        * in xen_cpu_die()), so -EBUSY is an error.
-        */
-       rc = cpu_check_up_prepare(cpu);
-       if (rc)
-               return rc;
-
        /* make sure interrupts start blocked */
        per_cpu(xen_vcpu, cpu)->evtchn_upcall_mask = 1;
 
@@ -343,15 +332,20 @@ static int xen_pv_cpu_up(unsigned int cpu, struct task_struct *idle)
 
        xen_pmu_init(cpu);
 
-       rc = HYPERVISOR_vcpu_op(VCPUOP_up, xen_vcpu_nr(cpu), NULL);
-       BUG_ON(rc);
-
-       while (cpu_report_state(cpu) != CPU_ONLINE)
-               HYPERVISOR_sched_op(SCHEDOP_yield, NULL);
+       /*
+        * Why is this a BUG? If the hypercall fails then everything can be
+        * rolled back, no?
+        */
+       BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_up, xen_vcpu_nr(cpu), NULL));
 
        return 0;
 }
 
+static void xen_pv_poll_sync_state(void)
+{
+       HYPERVISOR_sched_op(SCHEDOP_yield, NULL);
+}
+
 #ifdef CONFIG_HOTPLUG_CPU
 static int xen_pv_cpu_disable(void)
 {
@@ -367,18 +361,18 @@ static int xen_pv_cpu_disable(void)
 
 static void xen_pv_cpu_die(unsigned int cpu)
 {
-       while (HYPERVISOR_vcpu_op(VCPUOP_is_up,
-                                 xen_vcpu_nr(cpu), NULL)) {
+       while (HYPERVISOR_vcpu_op(VCPUOP_is_up, xen_vcpu_nr(cpu), NULL)) {
                __set_current_state(TASK_UNINTERRUPTIBLE);
                schedule_timeout(HZ/10);
        }
+}
 
-       if (common_cpu_die(cpu) == 0) {
-               xen_smp_intr_free(cpu);
-               xen_uninit_lock_cpu(cpu);
-               xen_teardown_timer(cpu);
-               xen_pmu_finish(cpu);
-       }
+static void xen_pv_cleanup_dead_cpu(unsigned int cpu)
+{
+       xen_smp_intr_free(cpu);
+       xen_uninit_lock_cpu(cpu);
+       xen_teardown_timer(cpu);
+       xen_pmu_finish(cpu);
 }
 
 static void __noreturn xen_pv_play_dead(void) /* used only with HOTPLUG_CPU */
@@ -400,6 +394,11 @@ static void xen_pv_cpu_die(unsigned int cpu)
        BUG();
 }
 
+static void xen_pv_cleanup_dead_cpu(unsigned int cpu)
+{
+       BUG();
+}
+
 static void __noreturn xen_pv_play_dead(void)
 {
        BUG();
@@ -438,8 +437,10 @@ static const struct smp_ops xen_smp_ops __initconst = {
        .smp_prepare_cpus = xen_pv_smp_prepare_cpus,
        .smp_cpus_done = xen_smp_cpus_done,
 
-       .cpu_up = xen_pv_cpu_up,
+       .kick_ap_alive = xen_pv_kick_ap,
        .cpu_die = xen_pv_cpu_die,
+       .cleanup_dead_cpu = xen_pv_cleanup_dead_cpu,
+       .poll_sync_state = xen_pv_poll_sync_state,
        .cpu_disable = xen_pv_cpu_disable,
        .play_dead = xen_pv_play_dead,
 
index b74ac25..52fa560 100644 (file)
@@ -66,11 +66,10 @@ static noinstr u64 xen_sched_clock(void)
         struct pvclock_vcpu_time_info *src;
        u64 ret;
 
-       preempt_disable_notrace();
        src = &__this_cpu_read(xen_vcpu)->time;
        ret = pvclock_clocksource_read_nowd(src);
        ret -= xen_sched_clock_offset;
-       preempt_enable_notrace();
+
        return ret;
 }
 
index a109037..408a2aa 100644 (file)
@@ -72,8 +72,6 @@ void xen_restore_time_memory_area(void);
 void xen_init_time_ops(void);
 void xen_hvm_init_time_ops(void);
 
-irqreturn_t xen_debug_interrupt(int irq, void *dev_id);
-
 bool xen_vcpu_stolen(int vcpu);
 
 void xen_vcpu_setup(int cpu);
@@ -148,9 +146,12 @@ int xen_cpuhp_setup(int (*cpu_up_prepare_cb)(unsigned int),
 void xen_pin_vcpu(int cpu);
 
 void xen_emergency_restart(void);
+void xen_force_evtchn_callback(void);
+
 #ifdef CONFIG_XEN_PV
 void xen_pv_pre_suspend(void);
 void xen_pv_post_suspend(int suspend_cancelled);
+void xen_start_kernel(struct start_info *si);
 #else
 static inline void xen_pv_pre_suspend(void) {}
 static inline void xen_pv_post_suspend(int suspend_cancelled) {}
index 3c6e547..c1bcfc2 100644 (file)
@@ -16,7 +16,6 @@ config XTENSA
        select ARCH_USE_MEMTEST
        select ARCH_USE_QUEUED_RWLOCKS
        select ARCH_USE_QUEUED_SPINLOCKS
-       select ARCH_WANT_FRAME_POINTERS
        select ARCH_WANT_IPC_PARSE_VERSION
        select BUILDTIME_TABLE_SORT
        select CLONE_BACKWARDS
@@ -35,6 +34,7 @@ config XTENSA
        select HAVE_ARCH_KCSAN
        select HAVE_ARCH_SECCOMP_FILTER
        select HAVE_ARCH_TRACEHOOK
+       select HAVE_ASM_MODVERSIONS
        select HAVE_CONTEXT_TRACKING_USER
        select HAVE_DEBUG_KMEMLEAK
        select HAVE_DMA_CONTIGUOUS
@@ -203,6 +203,18 @@ config XTENSA_UNALIGNED_USER
 
          Say Y here to enable unaligned memory access in user space.
 
+config XTENSA_LOAD_STORE
+       bool "Load/store exception handler for memory only readable with l32"
+       help
+         The Xtensa architecture only allows reading memory attached to its
+         instruction bus with l32r and l32i instructions; all other
+         instructions raise an exception with the LoadStoreErrorCause code.
+         This makes it hard to use some configurations, e.g. storing string
+         literals in FLASH memory attached to the instruction bus.
+
+         Say Y here to enable an exception handler that allows transparent
+         byte and 2-byte access to memory attached to the instruction bus.
+
 config HAVE_SMP
        bool "System Supports SMP (MX)"
        depends on XTENSA_VARIANT_CUSTOM
index 83cc8d1..e84172a 100644 (file)
@@ -38,3 +38,11 @@ config PRINT_STACK_DEPTH
        help
          This option allows you to set the stack depth that the kernel
          prints in stack traces.
+
+config PRINT_USER_CODE_ON_UNHANDLED_EXCEPTION
+       bool "Dump user code around unhandled exception address"
+       help
+         Enable this option to display user code around the PC of the
+         unhandled exception (starting at an address aligned on a 16-byte
+         boundary). This may simplify finding the faulting code in the
+         absence of other debug facilities.
index 1d1d462..c0eef3f 100644 (file)
@@ -6,16 +6,12 @@
 
 OBJCOPY_ARGS := -O $(if $(CONFIG_CPU_BIG_ENDIAN),elf32-xtensa-be,elf32-xtensa-le)
 
-LD_ARGS        = -T $(srctree)/$(obj)/boot.ld
-
 boot-y := bootstrap.o
 targets        += $(boot-y)
 
 OBJS   := $(addprefix $(obj)/,$(boot-y))
 LIBS   := arch/xtensa/boot/lib/lib.a arch/xtensa/lib/lib.a
 
-LIBGCC := $(shell $(CC) $(KBUILD_CFLAGS) -print-libgcc-file-name)
-
 $(obj)/zImage.o: $(obj)/../vmlinux.bin.gz $(OBJS)
        $(Q)$(OBJCOPY) $(OBJCOPY_ARGS) -R .comment \
                --add-section image=$< \
@@ -23,7 +19,10 @@ $(obj)/zImage.o: $(obj)/../vmlinux.bin.gz $(OBJS)
                $(OBJS) $@
 
 $(obj)/zImage.elf: $(obj)/zImage.o $(LIBS)
-       $(Q)$(LD) $(LD_ARGS) -o $@ $^ -L/xtensa-elf/lib $(LIBGCC)
+       $(Q)$(LD) $(KBUILD_LDFLAGS) \
+               -T $(srctree)/$(obj)/boot.ld \
+               --build-id=none \
+               -o $@ $^
 
 $(obj)/../zImage.redboot: $(obj)/zImage.elf
        $(Q)$(OBJCOPY) -S -O binary $< $@
diff --git a/arch/xtensa/include/asm/asm-prototypes.h b/arch/xtensa/include/asm/asm-prototypes.h
new file mode 100644 (file)
index 0000000..b0da618
--- /dev/null
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_PROTOTYPES_H
+#define __ASM_PROTOTYPES_H
+
+#include <asm/cacheflush.h>
+#include <asm/checksum.h>
+#include <asm/ftrace.h>
+#include <asm/page.h>
+#include <asm/string.h>
+#include <asm/uaccess.h>
+
+#include <asm-generic/asm-prototypes.h>
+
+/*
+ * gcc internal math functions
+ */
+long long __ashrdi3(long long, int);
+long long __ashldi3(long long, int);
+long long __bswapdi2(long long);
+int __bswapsi2(int);
+long long __lshrdi3(long long, int);
+int __divsi3(int, int);
+int __modsi3(int, int);
+int __mulsi3(int, int);
+unsigned int __udivsi3(unsigned int, unsigned int);
+unsigned int __umodsi3(unsigned int, unsigned int);
+unsigned long long __umulsidi3(unsigned int, unsigned int);
+
+#endif /* __ASM_PROTOTYPES_H */
index e3474ca..01bf7d9 100644 (file)
@@ -11,6 +11,7 @@
 #ifndef _XTENSA_ASMMACRO_H
 #define _XTENSA_ASMMACRO_H
 
+#include <asm-generic/export.h>
 #include <asm/core.h>
 
 /*
index 52da614..7308b7f 100644 (file)
@@ -245,6 +245,11 @@ static inline int arch_atomic_fetch_##op(int i, atomic_t * v)              \
 ATOMIC_OPS(add)
 ATOMIC_OPS(sub)
 
+#define arch_atomic_add_return                 arch_atomic_add_return
+#define arch_atomic_sub_return                 arch_atomic_sub_return
+#define arch_atomic_fetch_add                  arch_atomic_fetch_add
+#define arch_atomic_fetch_sub                  arch_atomic_fetch_sub
+
 #undef ATOMIC_OPS
 #define ATOMIC_OPS(op) ATOMIC_OP(op) ATOMIC_FETCH_OP(op)
 
@@ -252,12 +257,13 @@ ATOMIC_OPS(and)
 ATOMIC_OPS(or)
 ATOMIC_OPS(xor)
 
+#define arch_atomic_fetch_and                  arch_atomic_fetch_and
+#define arch_atomic_fetch_or                   arch_atomic_fetch_or
+#define arch_atomic_fetch_xor                  arch_atomic_fetch_xor
+
 #undef ATOMIC_OPS
 #undef ATOMIC_FETCH_OP
 #undef ATOMIC_OP_RETURN
 #undef ATOMIC_OP
 
-#define arch_atomic_cmpxchg(v, o, n) ((int)arch_cmpxchg(&((v)->counter), (o), (n)))
-#define arch_atomic_xchg(v, new) (arch_xchg(&((v)->counter), new))
-
 #endif /* _XTENSA_ATOMIC_H */
diff --git a/arch/xtensa/include/asm/bugs.h b/arch/xtensa/include/asm/bugs.h
deleted file mode 100644 (file)
index 69b29d1..0000000
+++ /dev/null
@@ -1,18 +0,0 @@
-/*
- * include/asm-xtensa/bugs.h
- *
- * This is included by init/main.c to check for architecture-dependent bugs.
- *
- * Xtensa processors don't have any bugs.  :)
- *
- * This file is subject to the terms and conditions of the GNU General
- * Public License.  See the file "COPYING" in the main directory of
- * this archive for more details.
- */
-
-#ifndef _XTENSA_BUGS_H
-#define _XTENSA_BUGS_H
-
-static void check_bugs(void) { }
-
-#endif /* _XTENSA_BUGS_H */
index f856d2b..0e1bb6f 100644 (file)
 #define XCHAL_SPANNING_WAY 0
 #endif
 
+#ifndef XCHAL_HAVE_TRAX
+#define XCHAL_HAVE_TRAX 0
+#endif
+
+#ifndef XCHAL_NUM_PERF_COUNTERS
+#define XCHAL_NUM_PERF_COUNTERS 0
+#endif
+
 #if XCHAL_HAVE_WINDOWED
 #if defined(CONFIG_USER_ABI_DEFAULT) || defined(CONFIG_USER_ABI_CALL0_PROBE)
 /* Whether windowed ABI is supported in userspace. */
index 6c6d9a9..0ea4f84 100644 (file)
 #include <asm/processor.h>
 
 #ifndef __ASSEMBLY__
-#define ftrace_return_address0 ({ unsigned long a0, a1; \
-               __asm__ __volatile__ ( \
-                       "mov %0, a0\n" \
-                       "mov %1, a1\n" \
-                       : "=r"(a0), "=r"(a1)); \
-               MAKE_PC_FROM_RA(a0, a1); })
-
-#ifdef CONFIG_FRAME_POINTER
 extern unsigned long return_address(unsigned level);
 #define ftrace_return_address(n) return_address(n)
-#endif
 #endif /* __ASSEMBLY__ */
 
 #ifdef CONFIG_FUNCTION_TRACER
index 354ca94..94f13fa 100644 (file)
@@ -28,31 +28,11 @@ extern void platform_init(bp_tag_t*);
 extern void platform_setup (char **);
 
 /*
- * platform_restart is called to restart the system.
- */
-extern void platform_restart (void);
-
-/*
- * platform_halt is called to stop the system and halt.
- */
-extern void platform_halt (void);
-
-/*
- * platform_power_off is called to stop the system and power it off.
- */
-extern void platform_power_off (void);
-
-/*
  * platform_idle is called from the idle function.
  */
 extern void platform_idle (void);
 
 /*
- * platform_heartbeat is called every HZ
- */
-extern void platform_heartbeat (void);
-
-/*
  * platform_calibrate_ccount calibrates cpu clock freq (CONFIG_XTENSA_CALIBRATE)
  */
 extern void platform_calibrate_ccount (void);
index 89b51a0..ffce435 100644 (file)
@@ -118,9 +118,6 @@ extern void *__memcpy(void *__to, __const__ void *__from, size_t __n);
 extern void *memmove(void *__dest, __const__ void *__src, size_t __n);
 extern void *__memmove(void *__dest, __const__ void *__src, size_t __n);
 
-/* Don't build bcopy at all ...  */
-#define __HAVE_ARCH_BCOPY
-
 #if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
 
 /*
index 6f74ccc..212c3b9 100644 (file)
@@ -47,6 +47,7 @@ __init trap_set_handler(int cause, xtensa_exception_handler *handler);
 asmlinkage void fast_illegal_instruction_user(void);
 asmlinkage void fast_syscall_user(void);
 asmlinkage void fast_alloca(void);
+asmlinkage void fast_load_store(void);
 asmlinkage void fast_unaligned(void);
 asmlinkage void fast_second_level_miss(void);
 asmlinkage void fast_store_prohibited(void);
@@ -64,8 +65,14 @@ void do_unhandled(struct pt_regs *regs);
 static inline void __init early_trap_init(void)
 {
        static struct exc_table init_exc_table __initdata = {
+#ifdef CONFIG_XTENSA_LOAD_STORE
+               .fast_kernel_handler[EXCCAUSE_LOAD_STORE_ERROR] =
+                       fast_load_store,
+#endif
+#ifdef CONFIG_MMU
                .fast_kernel_handler[EXCCAUSE_DTLB_MISS] =
                        fast_second_level_miss,
+#endif
        };
        xtensa_set_sr(&init_exc_table, excsave1);
 }
index d062c73..20d6b49 100644 (file)
 #include <asm/asmmacro.h>
 #include <asm/processor.h>
 
-#if XCHAL_UNALIGNED_LOAD_EXCEPTION || XCHAL_UNALIGNED_STORE_EXCEPTION
+#if XCHAL_UNALIGNED_LOAD_EXCEPTION || defined CONFIG_XTENSA_LOAD_STORE
+#define LOAD_EXCEPTION_HANDLER
+#endif
+
+#if XCHAL_UNALIGNED_STORE_EXCEPTION || defined LOAD_EXCEPTION_HANDLER
+#define ANY_EXCEPTION_HANDLER
+#endif
+
+#if XCHAL_HAVE_WINDOWED
+#define UNALIGNED_USER_EXCEPTION
+#endif
 
 /*  First-level exception handler for unaligned exceptions.
  *
  *  BE  shift left / mask 0 0 X X
  */
 
-#if XCHAL_HAVE_WINDOWED
-#define UNALIGNED_USER_EXCEPTION
-#endif
-
 #if XCHAL_HAVE_BE
 
 #define HWORD_START    16
  *
  *            23                           0
  *             -----------------------------
- *     res               0000           0010
+ *     L8UI    xxxx xxxx 0000 ssss tttt 0010
  *     L16UI   xxxx xxxx 0001 ssss tttt 0010
  *     L32I    xxxx xxxx 0010 ssss tttt 0010
  *     XXX               0011 ssss tttt 0010
 
 #define OP0_L32I_N     0x8             /* load immediate narrow */
 #define OP0_S32I_N     0x9             /* store immediate narrow */
+#define OP0_LSAI       0x2             /* load/store */
 #define OP1_SI_MASK    0x4             /* OP1 bit set for stores */
 #define OP1_SI_BIT     2               /* OP1 bit number for stores */
 
+#define OP1_L8UI       0x0
 #define OP1_L32I       0x2
 #define OP1_L16UI      0x1
 #define OP1_L16SI      0x9
  */
 
        .literal_position
-ENTRY(fast_unaligned)
+#ifdef CONFIG_XTENSA_LOAD_STORE
+ENTRY(fast_load_store)
 
-       /* Note: We don't expect the address to be aligned on a word
-        *       boundary. After all, the processor generated that exception
-        *       and it would be a hardware fault.
-        */
+       call0   .Lsave_and_load_instruction
 
-       /* Save some working register */
+       /* Analyze the instruction (load or store?). */
 
-       s32i    a4, a2, PT_AREG4
-       s32i    a5, a2, PT_AREG5
-       s32i    a6, a2, PT_AREG6
-       s32i    a7, a2, PT_AREG7
-       s32i    a8, a2, PT_AREG8
+       extui   a0, a4, INSN_OP0, 4     # get insn.op0 nibble
 
-       rsr     a0, depc
-       s32i    a0, a2, PT_AREG2
-       s32i    a3, a2, PT_AREG3
+#if XCHAL_HAVE_DENSITY
+       _beqi   a0, OP0_L32I_N, 1f      # L32I.N, jump
+#endif
+       bnei    a0, OP0_LSAI, .Linvalid_instruction
+       /* 'store indicator bit' set, jump */
+       bbsi.l  a4, OP1_SI_BIT + INSN_OP1, .Linvalid_instruction
 
-       rsr     a3, excsave1
-       movi    a4, fast_unaligned_fixup
-       s32i    a4, a3, EXC_TABLE_FIXUP
+1:
+       movi    a3, ~3
+       and     a3, a3, a8              # align memory address
 
-       /* Keep value of SAR in a0 */
+       __ssa8  a8
 
-       rsr     a0, sar
-       rsr     a8, excvaddr            # load unaligned memory address
+#ifdef CONFIG_MMU
+       /* l32e can't be used here even when it's available. */
+       /* TODO access_ok(a3) could be used here */
+       j       .Linvalid_instruction
+#endif
+       l32i    a5, a3, 0
+       l32i    a6, a3, 4
+       __src_b a3, a5, a6              # a3 has the data word
 
-       /* Now, identify one of the following load/store instructions.
-        *
-        * The only possible danger of a double exception on the
-        * following l32i instructions is kernel code in vmalloc
-        * memory. The processor was just executing at the EPC_1
-        * address, and indeed, already fetched the instruction.  That
-        * guarantees a TLB mapping, which hasn't been replaced by
-        * this unaligned exception handler that uses only static TLB
-        * mappings. However, high-level interrupt handlers might
-        * modify TLB entries, so for the generic case, we register a
-        * TABLE_FIXUP handler here, too.
-        */
+#if XCHAL_HAVE_DENSITY
+       addi    a7, a7, 2               # increment PC (assume 16-bit insn)
+       _beqi   a0, OP0_L32I_N, .Lload_w# l32i.n: jump
+       addi    a7, a7, 1
+#else
+       addi    a7, a7, 3
+#endif
 
-       /* a3...a6 saved on stack, a2 = SP */
+       extui   a5, a4, INSN_OP1, 4
+       _beqi   a5, OP1_L32I, .Lload_w
+       bnei    a5, OP1_L8UI, .Lload16
+       extui   a3, a3, 0, 8
+       j       .Lload_w
 
-       /* Extract the instruction that caused the unaligned access. */
+ENDPROC(fast_load_store)
+#endif
 
-       rsr     a7, epc1        # load exception address
-       movi    a3, ~3
-       and     a3, a3, a7      # mask lower bits
+/*
+ * Entry condition:
+ *
+ *   a0:       trashed, original value saved on stack (PT_AREG0)
+ *   a1:       a1
+ *   a2:       new stack pointer, original in DEPC
+ *   a3:       a3
+ *   depc:     a2, original value saved on stack (PT_DEPC)
+ *   excsave_1:        dispatch table
+ *
+ *   PT_DEPC >= VALID_DOUBLE_EXCEPTION_ADDRESS: double exception, DEPC
+ *          <  VALID_DOUBLE_EXCEPTION_ADDRESS: regular exception
+ */
 
-       l32i    a4, a3, 0       # load 2 words
-       l32i    a5, a3, 4
+#ifdef ANY_EXCEPTION_HANDLER
+ENTRY(fast_unaligned)
 
-       __ssa8  a7
-       __src_b a4, a4, a5      # a4 has the instruction
+#if XCHAL_UNALIGNED_LOAD_EXCEPTION || XCHAL_UNALIGNED_STORE_EXCEPTION
+
+       call0   .Lsave_and_load_instruction
 
        /* Analyze the instruction (load or store?). */
 
@@ -222,12 +244,17 @@ ENTRY(fast_unaligned)
        /* 'store indicator bit' not set, jump */
        _bbci.l a4, OP1_SI_BIT + INSN_OP1, .Lload
 
+#endif
+#if XCHAL_UNALIGNED_STORE_EXCEPTION
+
        /* Store: Jump to table entry to get the value in the source register.*/
 
 .Lstore:movi   a5, .Lstore_table       # table
        extui   a6, a4, INSN_T, 4       # get source register
        addx8   a5, a6, a5
        jx      a5                      # jump into table
+#endif
+#if XCHAL_UNALIGNED_LOAD_EXCEPTION
 
        /* Load: Load memory address. */
 
@@ -249,7 +276,7 @@ ENTRY(fast_unaligned)
        addi    a7, a7, 2               # increment PC (assume 16-bit insn)
 
        extui   a5, a4, INSN_OP0, 4
-       _beqi   a5, OP0_L32I_N, 1f      # l32i.n: jump
+       _beqi   a5, OP0_L32I_N, .Lload_w# l32i.n: jump
 
        addi    a7, a7, 1
 #else
@@ -257,21 +284,26 @@ ENTRY(fast_unaligned)
 #endif
 
        extui   a5, a4, INSN_OP1, 4
-       _beqi   a5, OP1_L32I, 1f        # l32i: jump
-
+       _beqi   a5, OP1_L32I, .Lload_w  # l32i: jump
+#endif
+#ifdef LOAD_EXCEPTION_HANDLER
+.Lload16:
        extui   a3, a3, 0, 16           # extract lower 16 bits
-       _beqi   a5, OP1_L16UI, 1f
+       _beqi   a5, OP1_L16UI, .Lload_w
        addi    a5, a5, -OP1_L16SI
-       _bnez   a5, .Linvalid_instruction_load
+       _bnez   a5, .Linvalid_instruction
 
        /* sign extend value */
-
+#if XCHAL_HAVE_SEXT
+       sext    a3, a3, 15
+#else
        slli    a3, a3, 16
        srai    a3, a3, 16
+#endif
 
        /* Set target register. */
 
-1:
+.Lload_w:
        extui   a4, a4, INSN_T, 4       # extract target register
        movi    a5, .Lload_table
        addx8   a4, a4, a5
@@ -295,30 +327,32 @@ ENTRY(fast_unaligned)
        mov     a13, a3         ;       _j .Lexit;      .align 8
        mov     a14, a3         ;       _j .Lexit;      .align 8
        mov     a15, a3         ;       _j .Lexit;      .align 8
-
+#endif
+#if XCHAL_UNALIGNED_STORE_EXCEPTION
 .Lstore_table:
-       l32i    a3, a2, PT_AREG0;       _j 1f;  .align 8
-       mov     a3, a1;                 _j 1f;  .align 8        # fishy??
-       l32i    a3, a2, PT_AREG2;       _j 1f;  .align 8
-       l32i    a3, a2, PT_AREG3;       _j 1f;  .align 8
-       l32i    a3, a2, PT_AREG4;       _j 1f;  .align 8
-       l32i    a3, a2, PT_AREG5;       _j 1f;  .align 8
-       l32i    a3, a2, PT_AREG6;       _j 1f;  .align 8
-       l32i    a3, a2, PT_AREG7;       _j 1f;  .align 8
-       l32i    a3, a2, PT_AREG8;       _j 1f;  .align 8
-       mov     a3, a9          ;       _j 1f;  .align 8
-       mov     a3, a10         ;       _j 1f;  .align 8
-       mov     a3, a11         ;       _j 1f;  .align 8
-       mov     a3, a12         ;       _j 1f;  .align 8
-       mov     a3, a13         ;       _j 1f;  .align 8
-       mov     a3, a14         ;       _j 1f;  .align 8
-       mov     a3, a15         ;       _j 1f;  .align 8
+       l32i    a3, a2, PT_AREG0;       _j .Lstore_w;   .align 8
+       mov     a3, a1;                 _j .Lstore_w;   .align 8        # fishy??
+       l32i    a3, a2, PT_AREG2;       _j .Lstore_w;   .align 8
+       l32i    a3, a2, PT_AREG3;       _j .Lstore_w;   .align 8
+       l32i    a3, a2, PT_AREG4;       _j .Lstore_w;   .align 8
+       l32i    a3, a2, PT_AREG5;       _j .Lstore_w;   .align 8
+       l32i    a3, a2, PT_AREG6;       _j .Lstore_w;   .align 8
+       l32i    a3, a2, PT_AREG7;       _j .Lstore_w;   .align 8
+       l32i    a3, a2, PT_AREG8;       _j .Lstore_w;   .align 8
+       mov     a3, a9          ;       _j .Lstore_w;   .align 8
+       mov     a3, a10         ;       _j .Lstore_w;   .align 8
+       mov     a3, a11         ;       _j .Lstore_w;   .align 8
+       mov     a3, a12         ;       _j .Lstore_w;   .align 8
+       mov     a3, a13         ;       _j .Lstore_w;   .align 8
+       mov     a3, a14         ;       _j .Lstore_w;   .align 8
+       mov     a3, a15         ;       _j .Lstore_w;   .align 8
+#endif
 
+#ifdef ANY_EXCEPTION_HANDLER
        /* We cannot handle this exception. */
 
        .extern _kernel_exception
-.Linvalid_instruction_load:
-.Linvalid_instruction_store:
+.Linvalid_instruction:
 
        movi    a4, 0
        rsr     a3, excsave1
@@ -326,6 +360,7 @@ ENTRY(fast_unaligned)
 
        /* Restore a4...a8 and SAR, set SP, and jump to default exception. */
 
+       l32i    a0, a2, PT_SAR
        l32i    a8, a2, PT_AREG8
        l32i    a7, a2, PT_AREG7
        l32i    a6, a2, PT_AREG6
@@ -342,9 +377,11 @@ ENTRY(fast_unaligned)
 
 2:     movi    a0, _user_exception
        jx      a0
+#endif
+#if XCHAL_UNALIGNED_STORE_EXCEPTION
 
-1:     # a7: instruction pointer, a4: instruction, a3: value
-
+       # a7: instruction pointer, a4: instruction, a3: value
+.Lstore_w:
        movi    a6, 0                   # mask: ffffffff:00000000
 
 #if XCHAL_HAVE_DENSITY
@@ -361,7 +398,7 @@ ENTRY(fast_unaligned)
 
        extui   a5, a4, INSN_OP1, 4     # extract OP1
        _beqi   a5, OP1_S32I, 1f        # jump if 32 bit store
-       _bnei   a5, OP1_S16I, .Linvalid_instruction_store
+       _bnei   a5, OP1_S16I, .Linvalid_instruction
 
        movi    a5, -1
        __extl  a3, a3                  # get 16-bit value
@@ -406,7 +443,8 @@ ENTRY(fast_unaligned)
 #else
        s32i    a6, a4, 4
 #endif
-
+#endif
+#ifdef ANY_EXCEPTION_HANDLER
 .Lexit:
 #if XCHAL_HAVE_LOOPS
        rsr     a4, lend                # check if we reached LEND
@@ -434,6 +472,7 @@ ENTRY(fast_unaligned)
 
        /* Restore working register */
 
+       l32i    a0, a2, PT_SAR
        l32i    a8, a2, PT_AREG8
        l32i    a7, a2, PT_AREG7
        l32i    a6, a2, PT_AREG6
@@ -448,6 +487,59 @@ ENTRY(fast_unaligned)
        l32i    a2, a2, PT_AREG2
        rfe
 
+       .align  4
+.Lsave_and_load_instruction:
+
+       /* Save some working register */
+
+       s32i    a3, a2, PT_AREG3
+       s32i    a4, a2, PT_AREG4
+       s32i    a5, a2, PT_AREG5
+       s32i    a6, a2, PT_AREG6
+       s32i    a7, a2, PT_AREG7
+       s32i    a8, a2, PT_AREG8
+
+       rsr     a4, depc
+       s32i    a4, a2, PT_AREG2
+
+       rsr     a5, sar
+       s32i    a5, a2, PT_SAR
+
+       rsr     a3, excsave1
+       movi    a4, fast_unaligned_fixup
+       s32i    a4, a3, EXC_TABLE_FIXUP
+
+       rsr     a8, excvaddr            # load unaligned memory address
+
+       /* Now, identify one of the following load/store instructions.
+        *
+        * The only possible danger of a double exception on the
+        * following l32i instructions is kernel code in vmalloc
+        * memory. The processor was just executing at the EPC_1
+        * address, and indeed, already fetched the instruction.  That
+        * guarantees a TLB mapping, which hasn't been replaced by
+        * this unaligned exception handler that uses only static TLB
+        * mappings. However, high-level interrupt handlers might
+        * modify TLB entries, so for the generic case, we register a
+        * TABLE_FIXUP handler here, too.
+        */
+
+       /* a3...a6 saved on stack, a2 = SP */
+
+       /* Extract the instruction that caused the unaligned access. */
+
+       rsr     a7, epc1        # load exception address
+       movi    a3, ~3
+       and     a3, a3, a7      # mask lower bits
+
+       l32i    a4, a3, 0       # load 2 words
+       l32i    a5, a3, 4
+
+       __ssa8  a7
+       __src_b a4, a4, a5      # a4 has the instruction
+
+       ret
+#endif
 ENDPROC(fast_unaligned)
 
 ENTRY(fast_unaligned_fixup)
@@ -459,10 +551,11 @@ ENTRY(fast_unaligned_fixup)
        l32i    a7, a2, PT_AREG7
        l32i    a6, a2, PT_AREG6
        l32i    a5, a2, PT_AREG5
-       l32i    a4, a2, PT_AREG4
+       l32i    a4, a2, PT_SAR
        l32i    a0, a2, PT_AREG2
-       xsr     a0, depc                        # restore depc and a0
-       wsr     a0, sar
+       wsr     a4, sar
+       wsr     a0, depc                        # restore depc and a0
+       l32i    a4, a2, PT_AREG4
 
        rsr     a0, exccause
        s32i    a0, a2, PT_DEPC                 # mark as a regular exception
@@ -483,5 +576,4 @@ ENTRY(fast_unaligned_fixup)
        jx      a0
 
 ENDPROC(fast_unaligned_fixup)
-
-#endif /* XCHAL_UNALIGNED_LOAD_EXCEPTION || XCHAL_UNALIGNED_STORE_EXCEPTION */
+#endif
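For reference, the instruction decode the handler performs boils down to nibble extraction on the 24-bit instruction word. A hedged C sketch using the little-endian field layout shown in the encoding table above (the SKETCH_INSN_* offset values are assumptions, not copied from the kernel headers):

	#include <stdbool.h>
	#include <stdint.h>

	#define SKETCH_INSN_OP0	 0	/* bits  3..0:  OP0 (0x2 = LSAI)  */
	#define SKETCH_INSN_T	 4	/* bits  7..4:  target register t */
	#define SKETCH_INSN_OP1	12	/* bits 15..12: OP1 (0x0 = L8UI)  */

	static bool insn_is_l8ui(uint32_t insn)
	{
		unsigned int op0 = (insn >> SKETCH_INSN_OP0) & 0xf;
		unsigned int op1 = (insn >> SKETCH_INSN_OP1) & 0xf;

		return op0 == 0x2 && op1 == 0x0;	/* OP0_LSAI && OP1_L8UI */
	}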
index 51daaf4..309b329 100644 (file)
@@ -78,6 +78,7 @@ ENTRY(_mcount)
 #error Unsupported Xtensa ABI
 #endif
 ENDPROC(_mcount)
+EXPORT_SYMBOL(_mcount)
 
 ENTRY(ftrace_stub)
        abi_entry_default
index ac1e0e5..926b8bf 100644 (file)
 #include <asm/platform.h>
 #include <asm/timex.h>
 
-#define _F(r,f,a,b)                                                    \
-       r __platform_##f a b;                                           \
-       r platform_##f a __attribute__((weak, alias("__platform_"#f)))
-
 /*
  * Default functions that are used if no platform specific function is defined.
- * (Please, refer to include/asm-xtensa/platform.h for more information)
+ * (Please, refer to arch/xtensa/include/asm/platform.h for more information)
  */
 
-_F(void, init, (bp_tag_t *first), { });
-_F(void, setup, (char** cmd), { });
-_F(void, restart, (void), { while(1); });
-_F(void, halt, (void), { while(1); });
-_F(void, power_off, (void), { while(1); });
-_F(void, idle, (void), { __asm__ __volatile__ ("waiti 0" ::: "memory"); });
-_F(void, heartbeat, (void), { });
+void __weak __init platform_init(bp_tag_t *first)
+{
+}
+
+void __weak __init platform_setup(char **cmd)
+{
+}
+
+void __weak platform_idle(void)
+{
+       __asm__ __volatile__ ("waiti 0" ::: "memory");
+}
 
 #ifdef CONFIG_XTENSA_CALIBRATE_CCOUNT
-_F(void, calibrate_ccount, (void),
+void __weak platform_calibrate_ccount(void)
 {
        pr_err("ERROR: Cannot calibrate cpu frequency! Assuming 10MHz.\n");
        ccount_freq = 10 * 1000000UL;
-});
+}
 #endif
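With the _F() weak-alias macro gone, a platform overrides one of these defaults simply by providing a strong definition of the same symbol. A hedged sketch (the 40 MHz value is made up):

	/* in a board file; overrides the __weak default above at link time */
	void platform_calibrate_ccount(void)
	{
		ccount_freq = 40 * 1000000UL;	/* fixed board crystal, no calibration */
	}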
index 9191738..aba3ff4 100644 (file)
@@ -22,6 +22,7 @@
 #include <linux/screen_info.h>
 #include <linux/kernel.h>
 #include <linux/percpu.h>
+#include <linux/reboot.h>
 #include <linux/cpu.h>
 #include <linux/of.h>
 #include <linux/of_fdt.h>
@@ -46,6 +47,7 @@
 #include <asm/smp.h>
 #include <asm/sysmem.h>
 #include <asm/timex.h>
+#include <asm/traps.h>
 
 #if defined(CONFIG_VGA_CONSOLE) || defined(CONFIG_DUMMY_CONSOLE)
 struct screen_info screen_info = {
@@ -241,6 +243,12 @@ void __init early_init_devtree(void *params)
 
 void __init init_arch(bp_tag_t *bp_start)
 {
+       /* Initialize basic exception handling if configuration may need it */
+
+       if (IS_ENABLED(CONFIG_KASAN) ||
+           IS_ENABLED(CONFIG_XTENSA_LOAD_STORE))
+               early_trap_init();
+
        /* Initialize MMU. */
 
        init_mmu();
@@ -522,19 +530,30 @@ void cpu_reset(void)
 
 void machine_restart(char * cmd)
 {
-       platform_restart();
+       local_irq_disable();
+       smp_send_stop();
+       do_kernel_restart(cmd);
+       pr_err("Reboot failed -- System halted\n");
+       while (1)
+               cpu_relax();
 }
 
 void machine_halt(void)
 {
-       platform_halt();
-       while (1);
+       local_irq_disable();
+       smp_send_stop();
+       do_kernel_power_off();
+       while (1)
+               cpu_relax();
 }
 
 void machine_power_off(void)
 {
-       platform_power_off();
-       while (1);
+       local_irq_disable();
+       smp_send_stop();
+       do_kernel_power_off();
+       while (1)
+               cpu_relax();
 }
 #ifdef CONFIG_PROC_FS
 
@@ -574,6 +593,12 @@ c_show(struct seq_file *f, void *slot)
 # if XCHAL_HAVE_OCD
                     "ocd "
 # endif
+#if XCHAL_HAVE_TRAX
+                    "trax "
+#endif
+#if XCHAL_NUM_PERF_COUNTERS
+                    "perf "
+#endif
 #endif
 #if XCHAL_HAVE_DENSITY
                     "density "
@@ -623,11 +648,13 @@ c_show(struct seq_file *f, void *slot)
        seq_printf(f,"physical aregs\t: %d\n"
                     "misc regs\t: %d\n"
                     "ibreak\t\t: %d\n"
-                    "dbreak\t\t: %d\n",
+                    "dbreak\t\t: %d\n"
+                    "perf counters\t: %d\n",
                     XCHAL_NUM_AREGS,
                     XCHAL_NUM_MISC_REGS,
                     XCHAL_NUM_IBREAK,
-                    XCHAL_NUM_DBREAK);
+                    XCHAL_NUM_DBREAK,
+                    XCHAL_NUM_PERF_COUNTERS);
 
 
        /* Interrupt. */
index 7f7755c..f643ea5 100644 (file)
@@ -237,8 +237,6 @@ EXPORT_SYMBOL_GPL(save_stack_trace);
 
 #endif
 
-#ifdef CONFIG_FRAME_POINTER
-
 struct return_addr_data {
        unsigned long addr;
        unsigned skip;
@@ -271,5 +269,3 @@ unsigned long return_address(unsigned level)
        return r.addr;
 }
 EXPORT_SYMBOL(return_address);
-
-#endif
index 52c94ab..2b69c3c 100644 (file)
 448    common  process_mrelease                sys_process_mrelease
 449    common  futex_waitv                     sys_futex_waitv
 450    common  set_mempolicy_home_node         sys_set_mempolicy_home_node
+451    common  cachestat                       sys_cachestat
index 16b8a62..1c3dfea 100644 (file)
@@ -121,10 +121,6 @@ static irqreturn_t timer_interrupt(int irq, void *dev_id)
 
        set_linux_timer(get_linux_timer());
        evt->event_handler(evt);
-
-       /* Allow platform to do something useful (Wdog). */
-       platform_heartbeat();
-
        return IRQ_HANDLED;
 }
 
index f0a7d1c..17eb180 100644 (file)
@@ -54,9 +54,10 @@ static void do_interrupt(struct pt_regs *regs);
 #if XTENSA_FAKE_NMI
 static void do_nmi(struct pt_regs *regs);
 #endif
-#if XCHAL_UNALIGNED_LOAD_EXCEPTION || XCHAL_UNALIGNED_STORE_EXCEPTION
-static void do_unaligned_user(struct pt_regs *regs);
+#ifdef CONFIG_XTENSA_LOAD_STORE
+static void do_load_store(struct pt_regs *regs);
 #endif
+static void do_unaligned_user(struct pt_regs *regs);
 static void do_multihit(struct pt_regs *regs);
 #if XTENSA_HAVE_COPROCESSORS
 static void do_coprocessor(struct pt_regs *regs);
@@ -91,7 +92,10 @@ static dispatch_init_table_t __initdata dispatch_init_table[] = {
 { EXCCAUSE_SYSTEM_CALL,                USER,      fast_syscall_user },
 { EXCCAUSE_SYSTEM_CALL,                0,         system_call },
 /* EXCCAUSE_INSTRUCTION_FETCH unhandled */
-/* EXCCAUSE_LOAD_STORE_ERROR unhandled*/
+#ifdef CONFIG_XTENSA_LOAD_STORE
+{ EXCCAUSE_LOAD_STORE_ERROR,   USER|KRNL, fast_load_store },
+{ EXCCAUSE_LOAD_STORE_ERROR,   0,         do_load_store },
+#endif
 { EXCCAUSE_LEVEL1_INTERRUPT,   0,         do_interrupt },
 #ifdef SUPPORT_WINDOWED
 { EXCCAUSE_ALLOCA,             USER|KRNL, fast_alloca },
@@ -102,9 +106,9 @@ static dispatch_init_table_t __initdata dispatch_init_table[] = {
 #ifdef CONFIG_XTENSA_UNALIGNED_USER
 { EXCCAUSE_UNALIGNED,          USER,      fast_unaligned },
 #endif
-{ EXCCAUSE_UNALIGNED,          0,         do_unaligned_user },
 { EXCCAUSE_UNALIGNED,          KRNL,      fast_unaligned },
 #endif
+{ EXCCAUSE_UNALIGNED,          0,         do_unaligned_user },
 #ifdef CONFIG_MMU
 { EXCCAUSE_ITLB_MISS,                  0,         do_page_fault },
 { EXCCAUSE_ITLB_MISS,                  USER|KRNL, fast_second_level_miss},
@@ -171,6 +175,23 @@ __die_if_kernel(const char *str, struct pt_regs *regs, long err)
                die(str, regs, err);
 }
 
+#ifdef CONFIG_PRINT_USER_CODE_ON_UNHANDLED_EXCEPTION
+static inline void dump_user_code(struct pt_regs *regs)
+{
+       char buf[32];
+
+       if (copy_from_user(buf, (void __user *)(regs->pc & -16), sizeof(buf)) == 0) {
+               print_hex_dump(KERN_INFO, " ", DUMP_PREFIX_NONE,
+                              32, 1, buf, sizeof(buf), false);
+
+       }
+}
+#else
+static inline void dump_user_code(struct pt_regs *regs)
+{
+}
+#endif
+
 /*
  * Unhandled Exceptions. Kill user task or panic if in kernel space.
  */
@@ -186,6 +207,7 @@ void do_unhandled(struct pt_regs *regs)
                            "\tEXCCAUSE is %ld\n",
                            current->comm, task_pid_nr(current), regs->pc,
                            regs->exccause);
+       dump_user_code(regs);
        force_sig(SIGILL);
 }
 
@@ -349,6 +371,19 @@ static void do_div0(struct pt_regs *regs)
        force_sig_fault(SIGFPE, FPE_INTDIV, (void __user *)regs->pc);
 }
 
+#ifdef CONFIG_XTENSA_LOAD_STORE
+static void do_load_store(struct pt_regs *regs)
+{
+       __die_if_kernel("Unhandled load/store exception in kernel",
+                       regs, SIGKILL);
+
+       pr_info_ratelimited("Load/store error to %08lx in '%s' (pid = %d, pc = %#010lx)\n",
+                           regs->excvaddr, current->comm,
+                           task_pid_nr(current), regs->pc);
+       force_sig_fault(SIGBUS, BUS_ADRERR, (void *)regs->excvaddr);
+}
+#endif
+
 /*
  * Handle unaligned memory accesses from user space. Kill task.
  *
@@ -356,7 +391,6 @@ static void do_div0(struct pt_regs *regs)
  * accesses from user space.
  */
 
-#if XCHAL_UNALIGNED_LOAD_EXCEPTION || XCHAL_UNALIGNED_STORE_EXCEPTION
 static void do_unaligned_user(struct pt_regs *regs)
 {
        __die_if_kernel("Unhandled unaligned exception in kernel",
@@ -368,7 +402,6 @@ static void do_unaligned_user(struct pt_regs *regs)
                            task_pid_nr(current), regs->pc);
        force_sig_fault(SIGBUS, BUS_ADRALN, (void *) regs->excvaddr);
 }
-#endif
 
 #if XTENSA_HAVE_COPROCESSORS
 static void do_coprocessor(struct pt_regs *regs)
@@ -534,31 +567,58 @@ static void show_trace(struct task_struct *task, unsigned long *sp,
 }
 
 #define STACK_DUMP_ENTRY_SIZE 4
-#define STACK_DUMP_LINE_SIZE 32
+#define STACK_DUMP_LINE_SIZE 16
 static size_t kstack_depth_to_print = CONFIG_PRINT_STACK_DEPTH;
 
-void show_stack(struct task_struct *task, unsigned long *sp, const char *loglvl)
+struct stack_fragment
 {
-       size_t len, off = 0;
-
-       if (!sp)
-               sp = stack_pointer(task);
+       size_t len;
+       size_t off;
+       u8 *sp;
+       const char *loglvl;
+};
 
-       len = min((-(size_t)sp) & (THREAD_SIZE - STACK_DUMP_ENTRY_SIZE),
-                 kstack_depth_to_print * STACK_DUMP_ENTRY_SIZE);
+static int show_stack_fragment_cb(struct stackframe *frame, void *data)
+{
+       struct stack_fragment *sf = data;
 
-       printk("%sStack:\n", loglvl);
-       while (off < len) {
+       while (sf->off < sf->len) {
                u8 line[STACK_DUMP_LINE_SIZE];
-               size_t line_len = len - off > STACK_DUMP_LINE_SIZE ?
-                       STACK_DUMP_LINE_SIZE : len - off;
+               size_t line_len = sf->len - sf->off > STACK_DUMP_LINE_SIZE ?
+                       STACK_DUMP_LINE_SIZE : sf->len - sf->off;
+               bool arrow = sf->off == 0;
 
-               __memcpy(line, (u8 *)sp + off, line_len);
-               print_hex_dump(loglvl, " ", DUMP_PREFIX_NONE,
+               if (frame && frame->sp == (unsigned long)(sf->sp + sf->off))
+                       arrow = true;
+
+               __memcpy(line, sf->sp + sf->off, line_len);
+               print_hex_dump(sf->loglvl, arrow ? "> " : "  ", DUMP_PREFIX_NONE,
                               STACK_DUMP_LINE_SIZE, STACK_DUMP_ENTRY_SIZE,
                               line, line_len, false);
-               off += STACK_DUMP_LINE_SIZE;
+               sf->off += STACK_DUMP_LINE_SIZE;
+               if (arrow)
+                       return 0;
        }
+       return 1;
+}
+
+void show_stack(struct task_struct *task, unsigned long *sp, const char *loglvl)
+{
+       struct stack_fragment sf;
+
+       if (!sp)
+               sp = stack_pointer(task);
+
+       sf.len = min((-(size_t)sp) & (THREAD_SIZE - STACK_DUMP_ENTRY_SIZE),
+                    kstack_depth_to_print * STACK_DUMP_ENTRY_SIZE);
+       sf.off = 0;
+       sf.sp = (u8 *)sp;
+       sf.loglvl = loglvl;
+
+       printk("%sStack:\n", loglvl);
+       walk_stackframe(sp, show_stack_fragment_cb, &sf);
+       while (sf.off < sf.len)
+               show_stack_fragment_cb(NULL, &sf);
        show_trace(task, sp, loglvl);
 }
 
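
The reworked show_stack() feeds the hex dump through walk_stackframe() and a callback; for reference, a callback of the same shape that merely counts frames could look like the sketch below (assuming the walk_stackframe()/struct stackframe interface used above, where a non-zero return stops the walk; the helper names are made up):

    #include <asm/stacktrace.h>

    static int count_frame_cb(struct stackframe *frame, void *data)
    {
            unsigned int *count = data;

            (*count)++;
            return 0;       /* keep walking; a non-zero return would stop */
    }

    static unsigned int count_kernel_frames(unsigned long *sp)
    {
            unsigned int count = 0;

            walk_stackframe(sp, count_frame_cb, &count);
            return count;
    }
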
index 17a7ef8..62d81e7 100644 (file)
  */
 
 #include <linux/module.h>
-#include <linux/string.h>
-#include <linux/mm.h>
-#include <linux/interrupt.h>
-#include <asm/irq.h>
-#include <linux/in6.h>
-
-#include <linux/uaccess.h>
-#include <asm/cacheflush.h>
-#include <asm/checksum.h>
-#include <asm/dma.h>
-#include <asm/io.h>
-#include <asm/page.h>
-#include <asm/ftrace.h>
-#ifdef CONFIG_BLK_DEV_FD
-#include <asm/floppy.h>
-#endif
-#ifdef CONFIG_NET
-#include <net/checksum.h>
-#endif /* CONFIG_NET */
-
-
-/*
- * String functions
- */
-EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memcpy);
-EXPORT_SYMBOL(memmove);
-EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memcpy);
-EXPORT_SYMBOL(__memmove);
-#ifdef CONFIG_ARCH_HAS_STRNCPY_FROM_USER
-EXPORT_SYMBOL(__strncpy_user);
-#endif
-EXPORT_SYMBOL(clear_page);
-EXPORT_SYMBOL(copy_page);
+#include <asm/pgtable.h>
 
 EXPORT_SYMBOL(empty_zero_page);
 
-/*
- * gcc internal math functions
- */
-extern long long __ashrdi3(long long, int);
-extern long long __ashldi3(long long, int);
-extern long long __bswapdi2(long long);
-extern int __bswapsi2(int);
-extern long long __lshrdi3(long long, int);
-extern int __divsi3(int, int);
-extern int __modsi3(int, int);
-extern int __mulsi3(int, int);
-extern unsigned int __udivsi3(unsigned int, unsigned int);
-extern unsigned int __umodsi3(unsigned int, unsigned int);
-extern unsigned long long __umulsidi3(unsigned int, unsigned int);
-
-EXPORT_SYMBOL(__ashldi3);
-EXPORT_SYMBOL(__ashrdi3);
-EXPORT_SYMBOL(__bswapdi2);
-EXPORT_SYMBOL(__bswapsi2);
-EXPORT_SYMBOL(__lshrdi3);
-EXPORT_SYMBOL(__divsi3);
-EXPORT_SYMBOL(__modsi3);
-EXPORT_SYMBOL(__mulsi3);
-EXPORT_SYMBOL(__udivsi3);
-EXPORT_SYMBOL(__umodsi3);
-EXPORT_SYMBOL(__umulsidi3);
-
 unsigned int __sync_fetch_and_and_4(volatile void *p, unsigned int v)
 {
        BUG();
@@ -89,35 +28,3 @@ unsigned int __sync_fetch_and_or_4(volatile void *p, unsigned int v)
        BUG();
 }
 EXPORT_SYMBOL(__sync_fetch_and_or_4);
-
-/*
- * Networking support
- */
-EXPORT_SYMBOL(csum_partial);
-EXPORT_SYMBOL(csum_partial_copy_generic);
-
-/*
- * Architecture-specific symbols
- */
-EXPORT_SYMBOL(__xtensa_copy_user);
-EXPORT_SYMBOL(__invalidate_icache_range);
-
-/*
- * Kernel hacking ...
- */
-
-#if defined(CONFIG_VGA_CONSOLE) || defined(CONFIG_DUMMY_CONSOLE)
-// FIXME EXPORT_SYMBOL(screen_info);
-#endif
-
-extern long common_exception_return;
-EXPORT_SYMBOL(common_exception_return);
-
-#ifdef CONFIG_FUNCTION_TRACER
-EXPORT_SYMBOL(_mcount);
-#endif
-
-EXPORT_SYMBOL(__invalidate_dcache_range);
-#if XCHAL_DCACHE_IS_WRITEBACK
-EXPORT_SYMBOL(__flush_dcache_range);
-#endif
index c9c2614..6e5b223 100644 (file)
@@ -6,7 +6,8 @@
 lib-y  += memcopy.o memset.o checksum.o \
           ashldi3.o ashrdi3.o bswapdi2.o bswapsi2.o lshrdi3.o \
           divsi3.o udivsi3.o modsi3.o umodsi3.o mulsi3.o umulsidi3.o \
-          usercopy.o strncpy_user.o strnlen_user.o
+          usercopy.o strnlen_user.o
+lib-$(CONFIG_ARCH_HAS_STRNCPY_FROM_USER) += strncpy_user.o
 lib-$(CONFIG_PCI) += pci-auto.o
 lib-$(CONFIG_KCSAN) += kcsan-stubs.o
 KCSAN_SANITIZE_kcsan-stubs.o := n
index 67fb0da..cd6b731 100644 (file)
@@ -26,3 +26,4 @@ ENTRY(__ashldi3)
        abi_ret_default
 
 ENDPROC(__ashldi3)
+EXPORT_SYMBOL(__ashldi3)
index cbf052c..07bc6e7 100644 (file)
@@ -26,3 +26,4 @@ ENTRY(__ashrdi3)
        abi_ret_default
 
 ENDPROC(__ashrdi3)
+EXPORT_SYMBOL(__ashrdi3)
index d8e52e0..5d94a93 100644 (file)
@@ -19,3 +19,4 @@ ENTRY(__bswapdi2)
        abi_ret_default
 
 ENDPROC(__bswapdi2)
+EXPORT_SYMBOL(__bswapdi2)
index 9c1de13..fbfb861 100644 (file)
@@ -14,3 +14,4 @@ ENTRY(__bswapsi2)
        abi_ret_default
 
 ENDPROC(__bswapsi2)
+EXPORT_SYMBOL(__bswapsi2)
index cf1bed1..ffee6f9 100644 (file)
@@ -169,6 +169,7 @@ ENTRY(csum_partial)
        j       5b              /* branch to handle the remaining byte */
 
 ENDPROC(csum_partial)
+EXPORT_SYMBOL(csum_partial)
 
 /*
  * Copy from ds while checksumming, otherwise like csum_partial
@@ -346,6 +347,7 @@ EX(10f)     s8i     a8, a3, 1
        j       4b              /* process the possible trailing odd byte */
 
 ENDPROC(csum_partial_copy_generic)
+EXPORT_SYMBOL(csum_partial_copy_generic)
 
 
 # Exception handler:
index b044b47..edb3c4a 100644 (file)
@@ -72,3 +72,4 @@ ENTRY(__divsi3)
        abi_ret_default
 
 ENDPROC(__divsi3)
+EXPORT_SYMBOL(__divsi3)
index 129ef8d..e432e1a 100644 (file)
@@ -26,3 +26,4 @@ ENTRY(__lshrdi3)
        abi_ret_default
 
 ENDPROC(__lshrdi3)
+EXPORT_SYMBOL(__lshrdi3)
index b20d206..f607603 100644 (file)
@@ -273,21 +273,8 @@ WEAK(memcpy)
        abi_ret_default
 
 ENDPROC(__memcpy)
-
-/*
- * void bcopy(const void *src, void *dest, size_t n);
- */
-
-ENTRY(bcopy)
-
-       abi_entry_default
-       # a2=src, a3=dst, a4=len
-       mov     a5, a3
-       mov     a3, a2
-       mov     a2, a5
-       j       .Lmovecommon    # go to common code for memmove+bcopy
-
-ENDPROC(bcopy)
+EXPORT_SYMBOL(__memcpy)
+EXPORT_SYMBOL(memcpy)
 
 /*
  * void *memmove(void *dst, const void *src, size_t len);
@@ -551,3 +538,5 @@ WEAK(memmove)
        abi_ret_default
 
 ENDPROC(__memmove)
+EXPORT_SYMBOL(__memmove)
+EXPORT_SYMBOL(memmove)
index 59b1524..262c3f3 100644 (file)
@@ -142,6 +142,8 @@ EX(10f) s8i a3, a5, 0
        abi_ret_default
 
 ENDPROC(__memset)
+EXPORT_SYMBOL(__memset)
+EXPORT_SYMBOL(memset)
 
        .section .fixup, "ax"
        .align  4
index d00e771..c5f4295 100644 (file)
@@ -60,6 +60,7 @@ ENTRY(__modsi3)
        abi_ret_default
 
 ENDPROC(__modsi3)
+EXPORT_SYMBOL(__modsi3)
 
 #if !XCHAL_HAVE_NSA
        .section .rodata
index 91a9d7c..c6b4fd4 100644 (file)
@@ -131,3 +131,4 @@ ENTRY(__mulsi3)
        abi_ret_default
 
 ENDPROC(__mulsi3)
+EXPORT_SYMBOL(__mulsi3)
index 0731912..9841d16 100644 (file)
@@ -201,6 +201,7 @@ EX(10f)     s8i     a9, a11, 0
        abi_ret_default
 
 ENDPROC(__strncpy_user)
+EXPORT_SYMBOL(__strncpy_user)
 
        .section .fixup, "ax"
        .align  4
index 3d391dc..cdcf574 100644 (file)
@@ -133,6 +133,7 @@ EX(10f)     l32i    a9, a4, 0       # get word with first two bytes of string
        abi_ret_default
 
 ENDPROC(__strnlen_user)
+EXPORT_SYMBOL(__strnlen_user)
 
        .section .fixup, "ax"
        .align  4
index d2477e0..59ea2df 100644 (file)
@@ -66,3 +66,4 @@ ENTRY(__udivsi3)
        abi_ret_default
 
 ENDPROC(__udivsi3)
+EXPORT_SYMBOL(__udivsi3)
index 5f031bf..d39a7e5 100644 (file)
@@ -55,3 +55,4 @@ ENTRY(__umodsi3)
        abi_ret_default
 
 ENDPROC(__umodsi3)
+EXPORT_SYMBOL(__umodsi3)
index 1360816..8c7a94a 100644 (file)
@@ -228,3 +228,4 @@ ENTRY(__umulsidi3)
 #endif /* XCHAL_NO_MUL */
 
 ENDPROC(__umulsidi3)
+EXPORT_SYMBOL(__umulsidi3)
index 16128c0..2c665c0 100644 (file)
@@ -283,6 +283,7 @@ EX(10f)     s8i     a6, a5,  0
        abi_ret(STACK_SIZE)
 
 ENDPROC(__xtensa_copy_user)
+EXPORT_SYMBOL(__xtensa_copy_user)
 
        .section .fixup, "ax"
        .align  4
index 1fef24d..f00d122 100644 (file)
@@ -14,7 +14,6 @@
 #include <linux/kernel.h>
 #include <asm/initialize_mmu.h>
 #include <asm/tlbflush.h>
-#include <asm/traps.h>
 
 void __init kasan_early_init(void)
 {
@@ -31,7 +30,6 @@ void __init kasan_early_init(void)
                BUG_ON(!pmd_none(*pmd));
                set_pmd(pmd, __pmd((unsigned long)kasan_early_shadow_pte));
        }
-       early_trap_init();
 }
 
 static void __init populate(void *start, void *end)
index 0527bf6..ec36f73 100644 (file)
@@ -47,6 +47,7 @@ ENTRY(clear_page)
        abi_ret_default
 
 ENDPROC(clear_page)
+EXPORT_SYMBOL(clear_page)
 
 /*
  * copy_page and copy_user_page are the same for non-cache-aliased configs.
@@ -89,6 +90,7 @@ ENTRY(copy_page)
        abi_ret_default
 
 ENDPROC(copy_page)
+EXPORT_SYMBOL(copy_page)
 
 #ifdef CONFIG_MMU
 /*
@@ -367,6 +369,7 @@ ENTRY(__invalidate_icache_range)
        abi_ret_default
 
 ENDPROC(__invalidate_icache_range)
+EXPORT_SYMBOL(__invalidate_icache_range)
 
 /*
  * void __flush_invalidate_dcache_range(ulong start, ulong size)
@@ -397,6 +400,7 @@ ENTRY(__flush_dcache_range)
        abi_ret_default
 
 ENDPROC(__flush_dcache_range)
+EXPORT_SYMBOL(__flush_dcache_range)
 
 /*
  * void _invalidate_dcache_range(ulong start, ulong size)
@@ -411,6 +415,7 @@ ENTRY(__invalidate_dcache_range)
        abi_ret_default
 
 ENDPROC(__invalidate_dcache_range)
+EXPORT_SYMBOL(__invalidate_dcache_range)
 
 /*
  * void _invalidate_icache_all(void)
index 27a477d..0a11fc5 100644 (file)
@@ -179,6 +179,7 @@ static unsigned get_pte_for_vaddr(unsigned vaddr)
        pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;
+       unsigned int pteval;
 
        if (!mm)
                mm = task->active_mm;
@@ -197,7 +198,9 @@ static unsigned get_pte_for_vaddr(unsigned vaddr)
        pte = pte_offset_map(pmd, vaddr);
        if (!pte)
                return 0;
-       return pte_val(*pte);
+       pteval = pte_val(*pte);
+       pte_unmap(pte);
+       return pteval;
 }
 
 enum {
index d3433e1..0f1fe13 100644 (file)
@@ -16,6 +16,7 @@
 #include <linux/notifier.h>
 #include <linux/panic_notifier.h>
 #include <linux/printk.h>
+#include <linux/reboot.h>
 #include <linux/string.h>
 
 #include <asm/platform.h>
 #include <platform/simcall.h>
 
 
-void platform_halt(void)
-{
-       pr_info(" ** Called platform_halt() **\n");
-       simc_exit(0);
-}
-
-void platform_power_off(void)
+static int iss_power_off(struct sys_off_data *unused)
 {
        pr_info(" ** Called platform_power_off() **\n");
        simc_exit(0);
+       return NOTIFY_DONE;
 }
 
-void platform_restart(void)
+static int iss_restart(struct notifier_block *this,
+                      unsigned long event, void *ptr)
 {
        /* Flush and reset the mmu, simulate a processor reset, and
         * jump to the reset vector. */
        cpu_reset();
-       /* control never gets here */
+
+       return NOTIFY_DONE;
 }
 
+static struct notifier_block iss_restart_block = {
+       .notifier_call = iss_restart,
+};
+
 static int
 iss_panic_event(struct notifier_block *this, unsigned long event, void *ptr)
 {
@@ -82,4 +84,8 @@ void __init platform_setup(char **p_cmdline)
        }
 
        atomic_notifier_chain_register(&panic_notifier_list, &iss_panic_block);
+       register_restart_handler(&iss_restart_block);
+       register_sys_off_handler(SYS_OFF_MODE_POWER_OFF,
+                                SYS_OFF_PRIO_PLATFORM,
+                                iss_power_off, NULL);
 }
index f50caaa..178cf96 100644 (file)
@@ -120,9 +120,9 @@ static void simdisk_submit_bio(struct bio *bio)
        bio_endio(bio);
 }
 
-static int simdisk_open(struct block_device *bdev, fmode_t mode)
+static int simdisk_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct simdisk *dev = bdev->bd_disk->private_data;
+       struct simdisk *dev = disk->private_data;
 
        spin_lock(&dev->lock);
        ++dev->users;
@@ -130,7 +130,7 @@ static int simdisk_open(struct block_device *bdev, fmode_t mode)
        return 0;
 }
 
-static void simdisk_release(struct gendisk *disk, fmode_t mode)
+static void simdisk_release(struct gendisk *disk)
 {
        struct simdisk *dev = disk->private_data;
        spin_lock(&dev->lock);
index 0dc22c3..258e01a 100644 (file)
@@ -23,6 +23,7 @@
 #include <linux/platform_device.h>
 #include <linux/serial.h>
 #include <linux/serial_8250.h>
+#include <linux/timer.h>
 
 #include <asm/processor.h>
 #include <asm/platform.h>
@@ -41,51 +42,46 @@ static void led_print (int f, char *s)
                    break;
 }
 
-void platform_halt(void)
-{
-       led_print (0, "  HALT  ");
-       local_irq_disable();
-       while (1);
-}
-
-void platform_power_off(void)
+static int xt2000_power_off(struct sys_off_data *unused)
 {
        led_print (0, "POWEROFF");
        local_irq_disable();
        while (1);
+       return NOTIFY_DONE;
 }
 
-void platform_restart(void)
+static int xt2000_restart(struct notifier_block *this,
+                         unsigned long event, void *ptr)
 {
        /* Flush and reset the mmu, simulate a processor reset, and
         * jump to the reset vector. */
        cpu_reset();
-       /* control never gets here */
+
+       return NOTIFY_DONE;
 }
 
+static struct notifier_block xt2000_restart_block = {
+       .notifier_call = xt2000_restart,
+};
+
 void __init platform_setup(char** cmdline)
 {
        led_print (0, "LINUX   ");
 }
 
-/* early initialization */
+/* Heartbeat. Let the LED blink. */
 
-void __init platform_init(bp_tag_t *first)
-{
-}
+static void xt2000_heartbeat(struct timer_list *unused);
 
-/* Heartbeat. Let the LED blink. */
+static DEFINE_TIMER(heartbeat_timer, xt2000_heartbeat);
 
-void platform_heartbeat(void)
+static void xt2000_heartbeat(struct timer_list *unused)
 {
-       static int i, t;
+       static int i;
 
-       if (--t < 0)
-       {
-               t = 59;
-               led_print(7, i ? ".": " ");
-               i ^= 1;
-       }
+       led_print(7, i ? "." : " ");
+       i ^= 1;
+       mod_timer(&heartbeat_timer, jiffies + HZ / 2);
 }
 
 //#define RS_TABLE_SIZE 2
@@ -143,7 +139,11 @@ static int __init xt2000_setup_devinit(void)
 {
        platform_device_register(&xt2000_serial8250_device);
        platform_device_register(&xt2000_sonic_device);
-
+       mod_timer(&heartbeat_timer, jiffies + HZ / 2);
+       register_restart_handler(&xt2000_restart_block);
+       register_sys_off_handler(SYS_OFF_MODE_POWER_OFF,
+                                SYS_OFF_PRIO_DEFAULT,
+                                xt2000_power_off, NULL);
        return 0;
 }
 
index c79c1d0..a2432f0 100644 (file)
 #include <platform/lcd.h>
 #include <platform/hardware.h>
 
-void platform_halt(void)
-{
-       lcd_disp_at_pos(" HALT ", 0);
-       local_irq_disable();
-       while (1)
-               cpu_relax();
-}
-
-void platform_power_off(void)
+static int xtfpga_power_off(struct sys_off_data *unused)
 {
        lcd_disp_at_pos("POWEROFF", 0);
        local_irq_disable();
        while (1)
                cpu_relax();
+       return NOTIFY_DONE;
 }
 
-void platform_restart(void)
+static int xtfpga_restart(struct notifier_block *this,
+                         unsigned long event, void *ptr)
 {
        /* Try software reset first. */
        WRITE_ONCE(*(u32 *)XTFPGA_SWRST_VADDR, 0xdead);
@@ -58,9 +52,14 @@ void platform_restart(void)
         * simulate a processor reset, and jump to the reset vector.
         */
        cpu_reset();
-       /* control never gets here */
+
+       return NOTIFY_DONE;
 }
 
+static struct notifier_block xtfpga_restart_block = {
+       .notifier_call = xtfpga_restart,
+};
+
 #ifdef CONFIG_XTENSA_CALIBRATE_CCOUNT
 
 void __init platform_calibrate_ccount(void)
@@ -70,6 +69,14 @@ void __init platform_calibrate_ccount(void)
 
 #endif
 
+static void __init xtfpga_register_handlers(void)
+{
+       register_restart_handler(&xtfpga_restart_block);
+       register_sys_off_handler(SYS_OFF_MODE_POWER_OFF,
+                                SYS_OFF_PRIO_DEFAULT,
+                                xtfpga_power_off, NULL);
+}
+
 #ifdef CONFIG_USE_OF
 
 static void __init xtfpga_clk_setup(struct device_node *np)
@@ -134,6 +141,9 @@ static int __init machine_setup(void)
        if ((eth = of_find_compatible_node(eth, NULL, "opencores,ethoc")))
                update_local_mac(eth);
        of_node_put(eth);
+
+       xtfpga_register_handlers();
+
        return 0;
 }
 arch_initcall(machine_setup);
@@ -281,6 +291,8 @@ static int __init xtavnet_init(void)
        pr_info("XTFPGA: Ethernet MAC %pM\n", ethoc_pdata.hwaddr);
        ethoc_pdata.eth_clkfreq = *(long *)XTFPGA_CLKFRQ_VADDR;
 
+       xtfpga_register_handlers();
+
        return 0;
 }
 
index b31b053..46ada9d 100644 (file)
@@ -9,7 +9,7 @@ obj-y           := bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \
                        blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
                        blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
                        genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
-                       disk-events.o blk-ia-ranges.o
+                       disk-events.o blk-ia-ranges.o early-lookup.o
 
 obj-$(CONFIG_BOUNCE)           += bounce.o
 obj-$(CONFIG_BLK_DEV_BSG_COMMON) += bsg.o
index 21c63bf..979e28a 100644 (file)
@@ -93,7 +93,7 @@ EXPORT_SYMBOL(invalidate_bdev);
  * Drop all buffers & page cache for the given bdev range. This function bails
  * with an error if the bdev has another exclusive owner (such as a filesystem).
  */
-int truncate_bdev_range(struct block_device *bdev, fmode_t mode,
+int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
                        loff_t lstart, loff_t lend)
 {
        /*
@@ -101,14 +101,14 @@ int truncate_bdev_range(struct block_device *bdev, fmode_t mode,
         * while we discard the buffer cache to avoid discarding buffers
         * under live filesystem.
         */
-       if (!(mode & FMODE_EXCL)) {
-               int err = bd_prepare_to_claim(bdev, truncate_bdev_range);
+       if (!(mode & BLK_OPEN_EXCL)) {
+               int err = bd_prepare_to_claim(bdev, truncate_bdev_range, NULL);
                if (err)
                        goto invalidate;
        }
 
        truncate_inode_pages_range(bdev->bd_inode->i_mapping, lstart, lend);
-       if (!(mode & FMODE_EXCL))
+       if (!(mode & BLK_OPEN_EXCL))
                bd_abort_claiming(bdev, truncate_bdev_range);
        return 0;
 
@@ -308,7 +308,7 @@ EXPORT_SYMBOL(thaw_bdev);
  * pseudo-fs
  */
 
-static  __cacheline_aligned_in_smp DEFINE_SPINLOCK(bdev_lock);
+static  __cacheline_aligned_in_smp DEFINE_MUTEX(bdev_lock);
 static struct kmem_cache * bdev_cachep __read_mostly;
 
 static struct inode *bdev_alloc_inode(struct super_block *sb)
@@ -415,6 +415,7 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
        bdev = I_BDEV(inode);
        mutex_init(&bdev->bd_fsfreeze_mutex);
        spin_lock_init(&bdev->bd_size_lock);
+       mutex_init(&bdev->bd_holder_lock);
        bdev->bd_partno = partno;
        bdev->bd_inode = inode;
        bdev->bd_queue = disk->queue;
@@ -463,39 +464,48 @@ long nr_blockdev_pages(void)
 /**
  * bd_may_claim - test whether a block device can be claimed
  * @bdev: block device of interest
- * @whole: whole block device containing @bdev, may equal @bdev
  * @holder: holder trying to claim @bdev
+ * @hops: holder ops
  *
  * Test whether @bdev can be claimed by @holder.
  *
- * CONTEXT:
- * spin_lock(&bdev_lock).
- *
  * RETURNS:
  * %true if @bdev can be claimed, %false otherwise.
  */
-static bool bd_may_claim(struct block_device *bdev, struct block_device *whole,
-                        void *holder)
+static bool bd_may_claim(struct block_device *bdev, void *holder,
+               const struct blk_holder_ops *hops)
 {
-       if (bdev->bd_holder == holder)
-               return true;     /* already a holder */
-       else if (bdev->bd_holder != NULL)
-               return false;    /* held by someone else */
-       else if (whole == bdev)
-               return true;     /* is a whole device which isn't held */
-
-       else if (whole->bd_holder == bd_may_claim)
-               return true;     /* is a partition of a device that is being partitioned */
-       else if (whole->bd_holder != NULL)
-               return false;    /* is a partition of a held device */
-       else
-               return true;     /* is a partition of an un-held device */
+       struct block_device *whole = bdev_whole(bdev);
+
+       lockdep_assert_held(&bdev_lock);
+
+       if (bdev->bd_holder) {
+               /*
+                * The same holder can always re-claim.
+                */
+               if (bdev->bd_holder == holder) {
+                       if (WARN_ON_ONCE(bdev->bd_holder_ops != hops))
+                               return false;
+                       return true;
+               }
+               return false;
+       }
+
+       /*
+        * If the whole device's holder is set to bd_may_claim, a partition on
+        * the device is claimed, but not the whole device.
+        */
+       if (whole != bdev &&
+           whole->bd_holder && whole->bd_holder != bd_may_claim)
+               return false;
+       return true;
 }
 
 /**
  * bd_prepare_to_claim - claim a block device
  * @bdev: block device of interest
  * @holder: holder trying to claim @bdev
+ * @hops: holder ops.
  *
  * Claim @bdev.  This function fails if @bdev is already claimed by another
  * holder and waits if another claiming is in progress. On successful return, the caller
@@ -504,17 +514,18 @@ static bool bd_may_claim(struct block_device *bdev, struct block_device *whole,
  * RETURNS:
  * 0 if @bdev can be claimed, -EBUSY otherwise.
  */
-int bd_prepare_to_claim(struct block_device *bdev, void *holder)
+int bd_prepare_to_claim(struct block_device *bdev, void *holder,
+               const struct blk_holder_ops *hops)
 {
        struct block_device *whole = bdev_whole(bdev);
 
        if (WARN_ON_ONCE(!holder))
                return -EINVAL;
 retry:
-       spin_lock(&bdev_lock);
+       mutex_lock(&bdev_lock);
        /* if someone else claimed, fail */
-       if (!bd_may_claim(bdev, whole, holder)) {
-               spin_unlock(&bdev_lock);
+       if (!bd_may_claim(bdev, holder, hops)) {
+               mutex_unlock(&bdev_lock);
                return -EBUSY;
        }
 
@@ -524,7 +535,7 @@ retry:
                DEFINE_WAIT(wait);
 
                prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
-               spin_unlock(&bdev_lock);
+               mutex_unlock(&bdev_lock);
                schedule();
                finish_wait(wq, &wait);
                goto retry;
@@ -532,7 +543,7 @@ retry:
 
        /* yay, all mine */
        whole->bd_claiming = holder;
-       spin_unlock(&bdev_lock);
+       mutex_unlock(&bdev_lock);
        return 0;
 }
 EXPORT_SYMBOL_GPL(bd_prepare_to_claim); /* only for the loop driver */
@@ -550,16 +561,18 @@ static void bd_clear_claiming(struct block_device *whole, void *holder)
  * bd_finish_claiming - finish claiming of a block device
  * @bdev: block device of interest
  * @holder: holder that has claimed @bdev
+ * @hops: block device holder operations
  *
  * Finish exclusive open of a block device. Mark the device as exclusively
  * open by the holder and wake up all waiters for exclusive open to finish.
  */
-static void bd_finish_claiming(struct block_device *bdev, void *holder)
+static void bd_finish_claiming(struct block_device *bdev, void *holder,
+               const struct blk_holder_ops *hops)
 {
        struct block_device *whole = bdev_whole(bdev);
 
-       spin_lock(&bdev_lock);
-       BUG_ON(!bd_may_claim(bdev, whole, holder));
+       mutex_lock(&bdev_lock);
+       BUG_ON(!bd_may_claim(bdev, holder, hops));
        /*
         * Note that for a whole device bd_holders will be incremented twice,
         * and bd_holder will be set to bd_may_claim before being set to holder
@@ -567,9 +580,12 @@ static void bd_finish_claiming(struct block_device *bdev, void *holder)
        whole->bd_holders++;
        whole->bd_holder = bd_may_claim;
        bdev->bd_holders++;
+       mutex_lock(&bdev->bd_holder_lock);
        bdev->bd_holder = holder;
+       bdev->bd_holder_ops = hops;
+       mutex_unlock(&bdev->bd_holder_lock);
        bd_clear_claiming(whole, holder);
-       spin_unlock(&bdev_lock);
+       mutex_unlock(&bdev_lock);
 }
 
 /**
@@ -583,12 +599,47 @@ static void bd_finish_claiming(struct block_device *bdev, void *holder)
  */
 void bd_abort_claiming(struct block_device *bdev, void *holder)
 {
-       spin_lock(&bdev_lock);
+       mutex_lock(&bdev_lock);
        bd_clear_claiming(bdev_whole(bdev), holder);
-       spin_unlock(&bdev_lock);
+       mutex_unlock(&bdev_lock);
 }
 EXPORT_SYMBOL(bd_abort_claiming);
 
+static void bd_end_claim(struct block_device *bdev, void *holder)
+{
+       struct block_device *whole = bdev_whole(bdev);
+       bool unblock = false;
+
+       /*
+        * Release a claim on the device.  The holder fields are protected with
+        * bdev_lock.  open_mutex is used to synchronize disk_holder unlinking.
+        */
+       mutex_lock(&bdev_lock);
+       WARN_ON_ONCE(bdev->bd_holder != holder);
+       WARN_ON_ONCE(--bdev->bd_holders < 0);
+       WARN_ON_ONCE(--whole->bd_holders < 0);
+       if (!bdev->bd_holders) {
+               mutex_lock(&bdev->bd_holder_lock);
+               bdev->bd_holder = NULL;
+               bdev->bd_holder_ops = NULL;
+               mutex_unlock(&bdev->bd_holder_lock);
+               if (bdev->bd_write_holder)
+                       unblock = true;
+       }
+       if (!whole->bd_holders)
+               whole->bd_holder = NULL;
+       mutex_unlock(&bdev_lock);
+
+       /*
+        * If this was the last claim, remove holder link and unblock evpoll if
+        * it was a write holder.
+        */
+       if (unblock) {
+               disk_unblock_events(bdev->bd_disk);
+               bdev->bd_write_holder = false;
+       }
+}
+
 static void blkdev_flush_mapping(struct block_device *bdev)
 {
        WARN_ON_ONCE(bdev->bd_holders);
@@ -597,13 +648,13 @@ static void blkdev_flush_mapping(struct block_device *bdev)
        bdev_write_inode(bdev);
 }
 
-static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
+static int blkdev_get_whole(struct block_device *bdev, blk_mode_t mode)
 {
        struct gendisk *disk = bdev->bd_disk;
        int ret;
 
        if (disk->fops->open) {
-               ret = disk->fops->open(bdev, mode);
+               ret = disk->fops->open(disk, mode);
                if (ret) {
                        /* avoid ghost partitions on a removed medium */
                        if (ret == -ENOMEDIUM &&
@@ -621,22 +672,19 @@ static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
        return 0;
 }
 
-static void blkdev_put_whole(struct block_device *bdev, fmode_t mode)
+static void blkdev_put_whole(struct block_device *bdev)
 {
        if (atomic_dec_and_test(&bdev->bd_openers))
                blkdev_flush_mapping(bdev);
        if (bdev->bd_disk->fops->release)
-               bdev->bd_disk->fops->release(bdev->bd_disk, mode);
+               bdev->bd_disk->fops->release(bdev->bd_disk);
 }
 
-static int blkdev_get_part(struct block_device *part, fmode_t mode)
+static int blkdev_get_part(struct block_device *part, blk_mode_t mode)
 {
        struct gendisk *disk = part->bd_disk;
        int ret;
 
-       if (atomic_read(&part->bd_openers))
-               goto done;
-
        ret = blkdev_get_whole(bdev_whole(part), mode);
        if (ret)
                return ret;
@@ -645,26 +693,27 @@ static int blkdev_get_part(struct block_device *part, fmode_t mode)
        if (!bdev_nr_sectors(part))
                goto out_blkdev_put;
 
-       disk->open_partitions++;
-       set_init_blocksize(part);
-done:
+       if (!atomic_read(&part->bd_openers)) {
+               disk->open_partitions++;
+               set_init_blocksize(part);
+       }
        atomic_inc(&part->bd_openers);
        return 0;
 
 out_blkdev_put:
-       blkdev_put_whole(bdev_whole(part), mode);
+       blkdev_put_whole(bdev_whole(part));
        return ret;
 }
 
-static void blkdev_put_part(struct block_device *part, fmode_t mode)
+static void blkdev_put_part(struct block_device *part)
 {
        struct block_device *whole = bdev_whole(part);
 
-       if (!atomic_dec_and_test(&part->bd_openers))
-               return;
-       blkdev_flush_mapping(part);
-       whole->bd_disk->open_partitions--;
-       blkdev_put_whole(whole, mode);
+       if (atomic_dec_and_test(&part->bd_openers)) {
+               blkdev_flush_mapping(part);
+               whole->bd_disk->open_partitions--;
+       }
+       blkdev_put_whole(whole);
 }
 
 struct block_device *blkdev_get_no_open(dev_t dev)
@@ -695,17 +744,17 @@ void blkdev_put_no_open(struct block_device *bdev)
 {
        put_device(&bdev->bd_device);
 }
-
+       
 /**
  * blkdev_get_by_dev - open a block device by device number
  * @dev: device number of block device to open
- * @mode: FMODE_* mask
+ * @mode: open mode (BLK_OPEN_*)
  * @holder: exclusive holder identifier
+ * @hops: holder operations
  *
- * Open the block device described by device number @dev. If @mode includes
- * %FMODE_EXCL, the block device is opened with exclusive access.  Specifying
- * %FMODE_EXCL with a %NULL @holder is invalid.  Exclusive opens may nest for
- * the same @holder.
+ * Open the block device described by device number @dev. If @holder is not
+ * %NULL, the block device is opened with exclusive access.  Exclusive opens may
+ * nest for the same @holder.
  *
  * Use this interface ONLY if you really do not have anything better - i.e. when
  * you are behind a truly sucky interface and all you are given is a device
@@ -717,7 +766,8 @@ void blkdev_put_no_open(struct block_device *bdev)
  * RETURNS:
  * Reference to the block_device on success, ERR_PTR(-errno) on failure.
  */
-struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
+struct block_device *blkdev_get_by_dev(dev_t dev, blk_mode_t mode, void *holder,
+               const struct blk_holder_ops *hops)
 {
        bool unblock_events = true;
        struct block_device *bdev;
@@ -726,8 +776,8 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
 
        ret = devcgroup_check_permission(DEVCG_DEV_BLOCK,
                        MAJOR(dev), MINOR(dev),
-                       ((mode & FMODE_READ) ? DEVCG_ACC_READ : 0) |
-                       ((mode & FMODE_WRITE) ? DEVCG_ACC_WRITE : 0));
+                       ((mode & BLK_OPEN_READ) ? DEVCG_ACC_READ : 0) |
+                       ((mode & BLK_OPEN_WRITE) ? DEVCG_ACC_WRITE : 0));
        if (ret)
                return ERR_PTR(ret);
 
@@ -736,10 +786,16 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
                return ERR_PTR(-ENXIO);
        disk = bdev->bd_disk;
 
-       if (mode & FMODE_EXCL) {
-               ret = bd_prepare_to_claim(bdev, holder);
+       if (holder) {
+               mode |= BLK_OPEN_EXCL;
+               ret = bd_prepare_to_claim(bdev, holder, hops);
                if (ret)
                        goto put_blkdev;
+       } else {
+               if (WARN_ON_ONCE(mode & BLK_OPEN_EXCL)) {
+                       ret = -EIO;
+                       goto put_blkdev;
+               }
        }
 
        disk_block_events(disk);
@@ -756,8 +812,8 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
                ret = blkdev_get_whole(bdev, mode);
        if (ret)
                goto put_module;
-       if (mode & FMODE_EXCL) {
-               bd_finish_claiming(bdev, holder);
+       if (holder) {
+               bd_finish_claiming(bdev, holder, hops);
 
                /*
                 * Block event polling for write claims if requested.  Any write
@@ -766,7 +822,7 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
                 * writeable reference is too fragile given the way @mode is
                 * used in blkdev_get/put().
                 */
-               if ((mode & FMODE_WRITE) && !bdev->bd_write_holder &&
+               if ((mode & BLK_OPEN_WRITE) && !bdev->bd_write_holder &&
                    (disk->event_flags & DISK_EVENT_FLAG_BLOCK_ON_EXCL_WRITE)) {
                        bdev->bd_write_holder = true;
                        unblock_events = false;
@@ -780,7 +836,7 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
 put_module:
        module_put(disk->fops->owner);
 abort_claiming:
-       if (mode & FMODE_EXCL)
+       if (holder)
                bd_abort_claiming(bdev, holder);
        mutex_unlock(&disk->open_mutex);
        disk_unblock_events(disk);
@@ -793,13 +849,13 @@ EXPORT_SYMBOL(blkdev_get_by_dev);
 /**
  * blkdev_get_by_path - open a block device by name
  * @path: path to the block device to open
- * @mode: FMODE_* mask
+ * @mode: open mode (BLK_OPEN_*)
  * @holder: exclusive holder identifier
+ * @hops: holder operations
  *
- * Open the block device described by the device file at @path.  If @mode
- * includes %FMODE_EXCL, the block device is opened with exclusive access.
- * Specifying %FMODE_EXCL with a %NULL @holder is invalid.  Exclusive opens may
- * nest for the same @holder.
+ * Open the block device described by the device file at @path.  If @holder is
+ * not %NULL, the block device is opened with exclusive access.  Exclusive opens
+ * may nest for the same @holder.
  *
  * CONTEXT:
  * Might sleep.
@@ -807,8 +863,8 @@ EXPORT_SYMBOL(blkdev_get_by_dev);
  * RETURNS:
  * Reference to the block_device on success, ERR_PTR(-errno) on failure.
  */
-struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
-                                       void *holder)
+struct block_device *blkdev_get_by_path(const char *path, blk_mode_t mode,
+               void *holder, const struct blk_holder_ops *hops)
 {
        struct block_device *bdev;
        dev_t dev;
@@ -818,9 +874,9 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
        if (error)
                return ERR_PTR(error);
 
-       bdev = blkdev_get_by_dev(dev, mode, holder);
-       if (!IS_ERR(bdev) && (mode & FMODE_WRITE) && bdev_read_only(bdev)) {
-               blkdev_put(bdev, mode);
+       bdev = blkdev_get_by_dev(dev, mode, holder, hops);
+       if (!IS_ERR(bdev) && (mode & BLK_OPEN_WRITE) && bdev_read_only(bdev)) {
+               blkdev_put(bdev, holder);
                return ERR_PTR(-EACCES);
        }
 
@@ -828,7 +884,7 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
 }
 EXPORT_SYMBOL(blkdev_get_by_path);
 
-void blkdev_put(struct block_device *bdev, fmode_t mode)
+void blkdev_put(struct block_device *bdev, void *holder)
 {
        struct gendisk *disk = bdev->bd_disk;
 
@@ -843,36 +899,8 @@ void blkdev_put(struct block_device *bdev, fmode_t mode)
                sync_blockdev(bdev);
 
        mutex_lock(&disk->open_mutex);
-       if (mode & FMODE_EXCL) {
-               struct block_device *whole = bdev_whole(bdev);
-               bool bdev_free;
-
-               /*
-                * Release a claim on the device.  The holder fields
-                * are protected with bdev_lock.  open_mutex is to
-                * synchronize disk_holder unlinking.
-                */
-               spin_lock(&bdev_lock);
-
-               WARN_ON_ONCE(--bdev->bd_holders < 0);
-               WARN_ON_ONCE(--whole->bd_holders < 0);
-
-               if ((bdev_free = !bdev->bd_holders))
-                       bdev->bd_holder = NULL;
-               if (!whole->bd_holders)
-                       whole->bd_holder = NULL;
-
-               spin_unlock(&bdev_lock);
-
-               /*
-                * If this was the last claim, remove holder link and
-                * unblock evpoll if it was a write holder.
-                */
-               if (bdev_free && bdev->bd_write_holder) {
-                       disk_unblock_events(disk);
-                       bdev->bd_write_holder = false;
-               }
-       }
+       if (holder)
+               bd_end_claim(bdev, holder);
 
        /*
         * Trigger event checking and tell drivers to flush MEDIA_CHANGE
@@ -882,9 +910,9 @@ void blkdev_put(struct block_device *bdev, fmode_t mode)
        disk_flush_events(disk, DISK_EVENT_MEDIA_CHANGE);
 
        if (bdev_is_partition(bdev))
-               blkdev_put_part(bdev, mode);
+               blkdev_put_part(bdev);
        else
-               blkdev_put_whole(bdev, mode);
+               blkdev_put_whole(bdev);
        mutex_unlock(&disk->open_mutex);
 
        module_put(disk->fops->owner);
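
With this conversion an exclusive open is requested by passing a non-NULL @holder instead of FMODE_EXCL, and blkdev_put() must be given the same holder cookie. A usage sketch against the interfaces above (function name, flags and error handling are illustrative only):

    #include <linux/blkdev.h>

    /* Open a block device exclusively for read-write, do some work, and
     * release the claim again with the matching holder cookie. */
    static int example_use_exclusive(dev_t devt, void *holder)
    {
            struct block_device *bdev;

            bdev = blkdev_get_by_dev(devt, BLK_OPEN_READ | BLK_OPEN_WRITE,
                                     holder, NULL /* no holder ops */);
            if (IS_ERR(bdev))
                    return PTR_ERR(bdev);

            /* ... issue I/O against bdev ... */

            blkdev_put(bdev, holder);   /* must match the holder passed above */
            return 0;
    }
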
index 3164e31..09bbbcf 100644 (file)
@@ -5403,6 +5403,10 @@ void bfq_put_queue(struct bfq_queue *bfqq)
        if (bfqq->bfqd->last_completed_rq_bfqq == bfqq)
                bfqq->bfqd->last_completed_rq_bfqq = NULL;
 
+       WARN_ON_ONCE(!list_empty(&bfqq->fifo));
+       WARN_ON_ONCE(!RB_EMPTY_ROOT(&bfqq->sort_list));
+       WARN_ON_ONCE(bfqq->dispatched);
+
        kmem_cache_free(bfq_pool, bfqq);
        bfqg_and_blkg_put(bfqg);
 }
@@ -7135,6 +7139,7 @@ static void bfq_exit_queue(struct elevator_queue *e)
 {
        struct bfq_data *bfqd = e->elevator_data;
        struct bfq_queue *bfqq, *n;
+       unsigned int actuator;
 
        hrtimer_cancel(&bfqd->idle_slice_timer);
 
@@ -7143,6 +7148,10 @@ static void bfq_exit_queue(struct elevator_queue *e)
                bfq_deactivate_bfqq(bfqd, bfqq, false, false);
        spin_unlock_irq(&bfqd->lock);
 
+       for (actuator = 0; actuator < bfqd->num_actuators; actuator++)
+               WARN_ON_ONCE(bfqd->rq_in_driver[actuator]);
+       WARN_ON_ONCE(bfqd->tot_rq_in_driver);
+
        hrtimer_cancel(&bfqd->idle_slice_timer);
 
        /* release oom-queue reference to root group */
index 043944f..8672179 100644 (file)
@@ -1138,6 +1138,14 @@ int bio_add_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL(bio_add_page);
 
+void bio_add_folio_nofail(struct bio *bio, struct folio *folio, size_t len,
+                         size_t off)
+{
+       WARN_ON_ONCE(len > UINT_MAX);
+       WARN_ON_ONCE(off > UINT_MAX);
+       __bio_add_page(bio, &folio->page, len, off);
+}
+
 /**
  * bio_add_folio - Attempt to add part of a folio to a bio.
  * @bio: BIO to add to.
@@ -1169,7 +1177,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
        bio_for_each_segment_all(bvec, bio, iter_all) {
                if (mark_dirty && !PageCompound(bvec->bv_page))
                        set_page_dirty_lock(bvec->bv_page);
-               put_page(bvec->bv_page);
+               bio_release_page(bio, bvec->bv_page);
        }
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1191,7 +1199,6 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
        bio->bi_io_vec = (struct bio_vec *)iter->bvec;
        bio->bi_iter.bi_bvec_done = iter->iov_offset;
        bio->bi_iter.bi_size = size;
-       bio_set_flag(bio, BIO_NO_PAGE_REF);
        bio_set_flag(bio, BIO_CLONED);
 }
 
@@ -1206,7 +1213,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
        }
 
        if (same_page)
-               put_page(page);
+               bio_release_page(bio, page);
        return 0;
 }
 
@@ -1220,7 +1227,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
                        queue_max_zone_append_sectors(q), &same_page) != len)
                return -EINVAL;
        if (same_page)
-               put_page(page);
+               bio_release_page(bio, page);
        return 0;
 }
 
@@ -1231,10 +1238,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Extracts pages from *iter and appends them to @bio's bvec array.  The pages
+ * will have to be cleaned up in the way indicated by the BIO_PAGE_PINNED flag.
+ * For a multi-segment *iter, this function only adds pages from the next
+ * non-empty segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1266,9 +1273,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
         * result to ensure the bio's total size is correct. The remainder of
         * the iov data will be picked up in the next bio iteration.
         */
-       size = iov_iter_get_pages(iter, pages,
-                                 UINT_MAX - bio->bi_iter.bi_size,
-                                 nr_pages, &offset, extraction_flags);
+       size = iov_iter_extract_pages(iter, &pages,
+                                     UINT_MAX - bio->bi_iter.bi_size,
+                                     nr_pages, extraction_flags, &offset);
        if (unlikely(size <= 0))
                return size ? size : -EFAULT;
 
@@ -1301,7 +1308,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
        iov_iter_revert(iter, left);
 out:
        while (i < nr_pages)
-               put_page(pages[i++]);
+               bio_release_page(bio, pages[i++]);
 
        return ret;
 }
@@ -1336,6 +1343,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
                return 0;
        }
 
+       if (iov_iter_extract_will_pin(iter))
+               bio_set_flag(bio, BIO_PAGE_PINNED);
        do {
                ret = __bio_iov_iter_get_pages(bio, iter);
        } while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
@@ -1489,8 +1498,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will unpin each page and will run one bio_put() against the
+ * BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
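
The put_page() calls replaced above now go through bio_release_page(), which is expected to undo whatever reference iov_iter_extract_pages() took, as indicated by the new BIO_PAGE_PINNED flag. A rough sketch of the idea (the real helper lives elsewhere in the block layer and may differ in detail):

    #include <linux/bio.h>
    #include <linux/mm.h>

    /* Release one page attached to @bio in the way it was acquired:
     * pinned pages are unpinned, pages holding a plain reference are put. */
    static inline void example_bio_release_page(struct bio *bio, struct page *page)
    {
            if (bio_flagged(bio, BIO_PAGE_PINNED))
                    unpin_user_page(page);
            else
                    put_page(page);
    }
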
index 842e5e1..3ec2133 100644 (file)
@@ -34,7 +34,7 @@ int blkcg_set_fc_appid(char *app_id, u64 cgrp_id, size_t app_id_len)
         * the vmid from the fabric.
         * Adding the overhead of a lock is not necessary.
         */
-       strlcpy(blkcg->fc_app_id, app_id, app_id_len);
+       strscpy(blkcg->fc_app_id, app_id, app_id_len);
        css_put(css);
 out_cgrp_put:
        cgroup_put(cgrp);
index f0b5c9c..aaf9903 100644 (file)
@@ -624,8 +624,13 @@ static int blkcg_reset_stats(struct cgroup_subsys_state *css,
                        struct blkg_iostat_set *bis =
                                per_cpu_ptr(blkg->iostat_cpu, cpu);
                        memset(bis, 0, sizeof(*bis));
+
+                       /* Re-initialize the cleared blkg_iostat_set */
+                       u64_stats_init(&bis->sync);
+                       bis->blkg = blkg;
                }
                memset(&blkg->iostat, 0, sizeof(blkg->iostat));
+               u64_stats_init(&blkg->iostat.sync);
 
                for (i = 0; i < BLKCG_MAX_POLS; i++) {
                        struct blkcg_policy *pol = blkcg_policy[i];
@@ -762,6 +767,13 @@ int blkg_conf_open_bdev(struct blkg_conf_ctx *ctx)
                return -ENODEV;
        }
 
+       mutex_lock(&bdev->bd_queue->rq_qos_mutex);
+       if (!disk_live(bdev->bd_disk)) {
+               blkdev_put_no_open(bdev);
+               mutex_unlock(&bdev->bd_queue->rq_qos_mutex);
+               return -ENODEV;
+       }
+
        ctx->body = input;
        ctx->bdev = bdev;
        return 0;
@@ -906,6 +918,7 @@ EXPORT_SYMBOL_GPL(blkg_conf_prep);
  */
 void blkg_conf_exit(struct blkg_conf_ctx *ctx)
        __releases(&ctx->bdev->bd_queue->queue_lock)
+       __releases(&ctx->bdev->bd_queue->rq_qos_mutex)
 {
        if (ctx->blkg) {
                spin_unlock_irq(&bdev_get_queue(ctx->bdev)->queue_lock);
@@ -913,6 +926,7 @@ void blkg_conf_exit(struct blkg_conf_ctx *ctx)
        }
 
        if (ctx->bdev) {
+               mutex_unlock(&ctx->bdev->bd_queue->rq_qos_mutex);
                blkdev_put_no_open(ctx->bdev);
                ctx->body = NULL;
                ctx->bdev = NULL;
@@ -970,6 +984,7 @@ static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
        struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
        struct llist_node *lnode;
        struct blkg_iostat_set *bisc, *next_bisc;
+       unsigned long flags;
 
        rcu_read_lock();
 
@@ -983,7 +998,7 @@ static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
         * When flushing from cgroup, cgroup_rstat_lock is always held, so
         * this lock won't cause contention most of the time.
         */
-       raw_spin_lock(&blkg_stat_lock);
+       raw_spin_lock_irqsave(&blkg_stat_lock, flags);
 
        /*
         * Iterate only the iostat_cpu's queued in the lockless list.
@@ -1009,7 +1024,7 @@ static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
                        blkcg_iostat_update(parent, &blkg->iostat.cur,
                                            &blkg->iostat.last);
        }
-       raw_spin_unlock(&blkg_stat_lock);
+       raw_spin_unlock_irqrestore(&blkg_stat_lock, flags);
 out:
        rcu_read_unlock();
 }
index 1da77e7..3fc68b9 100644 (file)
@@ -420,6 +420,7 @@ struct request_queue *blk_alloc_queue(int node_id)
        mutex_init(&q->debugfs_mutex);
        mutex_init(&q->sysfs_lock);
        mutex_init(&q->sysfs_dir_lock);
+       mutex_init(&q->rq_qos_mutex);
        spin_lock_init(&q->queue_lock);
 
        init_waitqueue_head(&q->mq_freeze_wq);
index 04698ed..dba392c 100644 (file)
@@ -188,7 +188,9 @@ static void blk_flush_complete_seq(struct request *rq,
 
        case REQ_FSEQ_DATA:
                list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
-               blk_mq_add_to_requeue_list(rq, BLK_MQ_INSERT_AT_HEAD);
+               spin_lock(&q->requeue_lock);
+               list_add_tail(&rq->queuelist, &q->flush_list);
+               spin_unlock(&q->requeue_lock);
                blk_mq_kick_requeue_list(q);
                break;
 
@@ -346,7 +348,10 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
        smp_wmb();
        req_ref_set(flush_rq, 1);
 
-       blk_mq_add_to_requeue_list(flush_rq, 0);
+       spin_lock(&q->requeue_lock);
+       list_add_tail(&flush_rq->queuelist, &q->flush_list);
+       spin_unlock(&q->requeue_lock);
+
        blk_mq_kick_requeue_list(q);
 }
 
@@ -376,22 +381,29 @@ static enum rq_end_io_ret mq_flush_data_end_io(struct request *rq,
        return RQ_END_IO_NONE;
 }
 
-/**
- * blk_insert_flush - insert a new PREFLUSH/FUA request
- * @rq: request to insert
- *
- * To be called from __elv_add_request() for %ELEVATOR_INSERT_FLUSH insertions.
- * or __blk_mq_run_hw_queue() to dispatch request.
- * @rq is being submitted.  Analyze what needs to be done and put it on the
- * right queue.
+static void blk_rq_init_flush(struct request *rq)
+{
+       rq->flush.seq = 0;
+       INIT_LIST_HEAD(&rq->flush.list);
+       rq->rq_flags |= RQF_FLUSH_SEQ;
+       rq->flush.saved_end_io = rq->end_io; /* Usually NULL */
+       rq->end_io = mq_flush_data_end_io;
+}
+
+/*
+ * Insert a PREFLUSH/FUA request into the flush state machine.
+ * Returns true if the request has been consumed by the flush state machine,
+ * or false if the caller should continue to process it.
  */
-void blk_insert_flush(struct request *rq)
+bool blk_insert_flush(struct request *rq)
 {
        struct request_queue *q = rq->q;
        unsigned long fflags = q->queue_flags;  /* may change, cache */
        unsigned int policy = blk_flush_policy(fflags, rq);
        struct blk_flush_queue *fq = blk_get_flush_queue(q, rq->mq_ctx);
-       struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+
+       /* FLUSH/FUA request must never be merged */
+       WARN_ON_ONCE(rq->bio != rq->biotail);
 
        /*
         * @policy now records what operations need to be done.  Adjust
@@ -408,45 +420,45 @@ void blk_insert_flush(struct request *rq)
         */
        rq->cmd_flags |= REQ_SYNC;
 
-       /*
-        * An empty flush handed down from a stacking driver may
-        * translate into nothing if the underlying device does not
-        * advertise a write-back cache.  In this case, simply
-        * complete the request.
-        */
-       if (!policy) {
+       switch (policy) {
+       case 0:
+               /*
+                * An empty flush handed down from a stacking driver may
+                * translate into nothing if the underlying device does not
+                * advertise a write-back cache.  In this case, simply
+                * complete the request.
+                */
                blk_mq_end_request(rq, 0);
-               return;
-       }
-
-       BUG_ON(rq->bio != rq->biotail); /*assumes zero or single bio rq */
-
-       /*
-        * If there's data but flush is not necessary, the request can be
-        * processed directly without going through flush machinery.  Queue
-        * for normal execution.
-        */
-       if ((policy & REQ_FSEQ_DATA) &&
-           !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
-               blk_mq_request_bypass_insert(rq, 0);
-               blk_mq_run_hw_queue(hctx, false);
-               return;
+               return true;
+       case REQ_FSEQ_DATA:
+               /*
+                * If there's data, but no flush is necessary, the request can
+                * be processed directly without going through flush machinery.
+                * Queue for normal execution.
+                */
+               return false;
+       case REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH:
+               /*
+                * Initialize the flush fields and completion handler to trigger
+                * the post flush, and then just pass the command on.
+                */
+               blk_rq_init_flush(rq);
+               rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
+               spin_lock_irq(&fq->mq_flush_lock);
+               list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
+               spin_unlock_irq(&fq->mq_flush_lock);
+               return false;
+       default:
+               /*
+                * Mark the request as part of a flush sequence and submit it
+                * for further processing to the flush state machine.
+                */
+               blk_rq_init_flush(rq);
+               spin_lock_irq(&fq->mq_flush_lock);
+               blk_flush_complete_seq(rq, fq, REQ_FSEQ_ACTIONS & ~policy, 0);
+               spin_unlock_irq(&fq->mq_flush_lock);
+               return true;
        }
-
-       /*
-        * @rq should go through flush machinery.  Mark it part of flush
-        * sequence and submit for further processing.
-        */
-       memset(&rq->flush, 0, sizeof(rq->flush));
-       INIT_LIST_HEAD(&rq->flush.list);
-       rq->rq_flags |= RQF_FLUSH_SEQ;
-       rq->flush.saved_end_io = rq->end_io; /* Usually NULL */
-
-       rq->end_io = mq_flush_data_end_io;
-
-       spin_lock_irq(&fq->mq_flush_lock);
-       blk_flush_complete_seq(rq, fq, REQ_FSEQ_ACTIONS & ~policy, 0);
-       spin_unlock_irq(&fq->mq_flush_lock);
 }
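
With blk_insert_flush() now returning a bool, the submission path only bails out when the request was actually consumed by the flush state machine; the blk_mq_submit_bio() hunk later in this patch is the caller side. A minimal sketch of that calling convention (illustrative only, mirroring the hunk below):

        if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
                return;         /* consumed by the flush state machine */
        /* otherwise continue with plugging / normal insertion */
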
 
 /**
index 63fc020..25dd4db 100644 (file)
@@ -77,6 +77,10 @@ static void ioc_destroy_icq(struct io_cq *icq)
        struct elevator_type *et = q->elevator->type;
 
        lockdep_assert_held(&ioc->lock);
+       lockdep_assert_held(&q->queue_lock);
+
+       if (icq->flags & ICQ_DESTROYED)
+               return;
 
        radix_tree_delete(&ioc->icq_tree, icq->q->id);
        hlist_del_init(&icq->ioc_node);
@@ -128,12 +132,7 @@ static void ioc_release_fn(struct work_struct *work)
                        spin_lock(&q->queue_lock);
                        spin_lock(&ioc->lock);
 
-                       /*
-                        * The icq may have been destroyed when the ioc lock
-                        * was released.
-                        */
-                       if (!(icq->flags & ICQ_DESTROYED))
-                               ioc_destroy_icq(icq);
+                       ioc_destroy_icq(icq);
 
                        spin_unlock(&q->queue_lock);
                        rcu_read_unlock();
@@ -171,23 +170,20 @@ static bool ioc_delay_free(struct io_context *ioc)
  */
 void ioc_clear_queue(struct request_queue *q)
 {
-       LIST_HEAD(icq_list);
-
        spin_lock_irq(&q->queue_lock);
-       list_splice_init(&q->icq_list, &icq_list);
-       spin_unlock_irq(&q->queue_lock);
-
-       rcu_read_lock();
-       while (!list_empty(&icq_list)) {
+       while (!list_empty(&q->icq_list)) {
                struct io_cq *icq =
-                       list_entry(icq_list.next, struct io_cq, q_node);
-
-               spin_lock_irq(&icq->ioc->lock);
-               if (!(icq->flags & ICQ_DESTROYED))
-                       ioc_destroy_icq(icq);
-               spin_unlock_irq(&icq->ioc->lock);
+                       list_first_entry(&q->icq_list, struct io_cq, q_node);
+
+               /*
+                * Other contexts never wait for queue_lock while holding the
+                * ioc lock (see ioc_release_fn() for details), so taking the
+                * ioc lock under queue_lock here cannot deadlock.
+                */
+               spin_lock(&icq->ioc->lock);
+               ioc_destroy_icq(icq);
+               spin_unlock(&icq->ioc->lock);
        }
-       rcu_read_unlock();
+       spin_unlock_irq(&q->queue_lock);
 }
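
The comment above depends on a lock-ordering rule that the ioc_release_fn() hunk earlier in this patch establishes; a condensed summary, inferred from the hunks shown here (descriptive comment only, not new code):

        /*
         * ioc_clear_queue():  q->queue_lock taken first, then icq->ioc->lock,
         *                     both unconditionally.
         * ioc_release_fn():   acquires the two locks in the same
         *                     q->queue_lock -> ioc->lock order (visible in the
         *                     hunk above) and never blocks on queue_lock while
         *                     holding an ioc lock, so the nesting here cannot
         *                     deadlock.
         */
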
 #else /* CONFIG_BLK_ICQ */
 static inline void ioc_exit_icqs(struct io_context *ioc)
index 285ced3..6084a95 100644 (file)
@@ -2455,6 +2455,7 @@ static u64 adjust_inuse_and_calc_cost(struct ioc_gq *iocg, u64 vtime,
        u32 hwi, adj_step;
        s64 margin;
        u64 cost, new_inuse;
+       unsigned long flags;
 
        current_hweight(iocg, NULL, &hwi);
        old_hwi = hwi;
@@ -2473,11 +2474,11 @@ static u64 adjust_inuse_and_calc_cost(struct ioc_gq *iocg, u64 vtime,
            iocg->inuse == iocg->active)
                return cost;
 
-       spin_lock_irq(&ioc->lock);
+       spin_lock_irqsave(&ioc->lock, flags);
 
        /* we own inuse only when @iocg is in the normal active state */
        if (iocg->abs_vdebt || list_empty(&iocg->active_list)) {
-               spin_unlock_irq(&ioc->lock);
+               spin_unlock_irqrestore(&ioc->lock, flags);
                return cost;
        }
 
@@ -2498,7 +2499,7 @@ static u64 adjust_inuse_and_calc_cost(struct ioc_gq *iocg, u64 vtime,
        } while (time_after64(vtime + cost, now->vnow) &&
                 iocg->inuse != iocg->active);
 
-       spin_unlock_irq(&ioc->lock);
+       spin_unlock_irqrestore(&ioc->lock, flags);
 
        TRACE_IOCG_PATH(inuse_adjust, iocg, now,
                        old_inuse, iocg->inuse, old_hwi, hwi);
index 055529b..4051fad 100644 (file)
 /**
  * enum prio_policy - I/O priority class policy.
  * @POLICY_NO_CHANGE: (default) do not modify the I/O priority class.
- * @POLICY_NONE_TO_RT: modify IOPRIO_CLASS_NONE into IOPRIO_CLASS_RT.
+ * @POLICY_PROMOTE_TO_RT: modify any I/O priority class other than
+ *             IOPRIO_CLASS_RT into IOPRIO_CLASS_RT.
  * @POLICY_RESTRICT_TO_BE: modify IOPRIO_CLASS_NONE and IOPRIO_CLASS_RT into
  *             IOPRIO_CLASS_BE.
  * @POLICY_ALL_TO_IDLE: change the I/O priority class into IOPRIO_CLASS_IDLE.
+ * @POLICY_NONE_TO_RT: an alias for POLICY_PROMOTE_TO_RT.
  *
  * See also <linux/ioprio.h>.
  */
 enum prio_policy {
        POLICY_NO_CHANGE        = 0,
-       POLICY_NONE_TO_RT       = 1,
+       POLICY_PROMOTE_TO_RT    = 1,
        POLICY_RESTRICT_TO_BE   = 2,
        POLICY_ALL_TO_IDLE      = 3,
+       POLICY_NONE_TO_RT       = 4,
 };
 
 static const char *policy_name[] = {
        [POLICY_NO_CHANGE]      = "no-change",
-       [POLICY_NONE_TO_RT]     = "none-to-rt",
+       [POLICY_PROMOTE_TO_RT]  = "promote-to-rt",
        [POLICY_RESTRICT_TO_BE] = "restrict-to-be",
        [POLICY_ALL_TO_IDLE]    = "idle",
+       [POLICY_NONE_TO_RT]     = "none-to-rt",
 };
 
 static struct blkcg_policy ioprio_policy;
@@ -189,6 +192,20 @@ void blkcg_set_ioprio(struct bio *bio)
        if (!blkcg || blkcg->prio_policy == POLICY_NO_CHANGE)
                return;
 
+       if (blkcg->prio_policy == POLICY_PROMOTE_TO_RT ||
+           blkcg->prio_policy == POLICY_NONE_TO_RT) {
+               /*
+                * For RT threads the default I/O priority level is 4, because
+                * task_nice() is 0. Non-RT I/O priorities are therefore
+                * promoted to the RT class at level 4, while requests that are
+                * already RT class keep whatever (possibly stronger) level the
+                * task requested via ioprio_set().
+                */
+               if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) != IOPRIO_CLASS_RT)
+                       bio->bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4);
+               return;
+       }
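
The comment above assumes that a task wanting a stronger RT level than the promoted default of 4 sets it explicitly. A hedged userspace sketch of doing so (assumes the uapi <linux/ioprio.h> header and the raw ioprio_set() syscall; not part of this patch):

        #include <sys/syscall.h>
        #include <unistd.h>
        #include <linux/ioprio.h>

        /* Request RT class at level 0 (strongest); without this, the
         * promote-to-rt policy above would leave the task's I/O at the
         * promoted default of RT level 4. */
        static int set_strong_rt_ioprio(void)
        {
                return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                               IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0));
        }
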
+
        /*
         * Except for IOPRIO_CLASS_NONE, higher I/O priority numbers
         * correspond to a lower priority. Hence, the max_t() below selects
index 46eed2e..44d74a3 100644 (file)
@@ -281,21 +281,21 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 
        if (blk_queue_pci_p2pdma(rq->q))
                extraction_flags |= ITER_ALLOW_P2PDMA;
+       if (iov_iter_extract_will_pin(iter))
+               bio_set_flag(bio, BIO_PAGE_PINNED);
 
        while (iov_iter_count(iter)) {
-               struct page **pages, *stack_pages[UIO_FASTIOV];
+               struct page *stack_pages[UIO_FASTIOV];
+               struct page **pages = stack_pages;
                ssize_t bytes;
                size_t offs;
                int npages;
 
-               if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
-                       pages = stack_pages;
-                       bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
-                                                  nr_vecs, &offs, extraction_flags);
-               } else {
-                       bytes = iov_iter_get_pages_alloc(iter, &pages,
-                                               LONG_MAX, &offs, extraction_flags);
-               }
+               if (nr_vecs > ARRAY_SIZE(stack_pages))
+                       pages = NULL;
+
+               bytes = iov_iter_extract_pages(iter, &pages, LONG_MAX,
+                                              nr_vecs, extraction_flags, &offs);
                if (unlikely(bytes <= 0)) {
                        ret = bytes ? bytes : -EFAULT;
                        goto out_unmap;
@@ -317,7 +317,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
                                if (!bio_add_hw_page(rq->q, bio, page, n, offs,
                                                     max_sectors, &same_page)) {
                                        if (same_page)
-                                               put_page(page);
+                                               bio_release_page(bio, page);
                                        break;
                                }
 
@@ -329,7 +329,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
                 * release the pages we didn't map into the bio, if any
                 */
                while (j < npages)
-                       put_page(pages[j++]);
+                       bio_release_page(bio, pages[j++]);
                if (pages != stack_pages)
                        kvfree(pages);
                /* couldn't stuff something into bio? */
index d23a855..c3b5930 100644 (file)
@@ -88,6 +88,7 @@ static const char *const blk_queue_flag_name[] = {
        QUEUE_FLAG_NAME(IO_STAT),
        QUEUE_FLAG_NAME(NOXMERGES),
        QUEUE_FLAG_NAME(ADD_RANDOM),
+       QUEUE_FLAG_NAME(SYNCHRONOUS),
        QUEUE_FLAG_NAME(SAME_FORCE),
        QUEUE_FLAG_NAME(INIT_DONE),
        QUEUE_FLAG_NAME(STABLE_WRITES),
@@ -103,6 +104,8 @@ static const char *const blk_queue_flag_name[] = {
        QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
        QUEUE_FLAG_NAME(HCTX_ACTIVE),
        QUEUE_FLAG_NAME(NOWAIT),
+       QUEUE_FLAG_NAME(SQ_SCHED),
+       QUEUE_FLAG_NAME(SKIP_TAGSET_QUIESCE),
 };
 #undef QUEUE_FLAG_NAME
 
@@ -241,14 +244,14 @@ static const char *const cmd_flag_name[] = {
 #define RQF_NAME(name) [ilog2((__force u32)RQF_##name)] = #name
 static const char *const rqf_name[] = {
        RQF_NAME(STARTED),
-       RQF_NAME(SOFTBARRIER),
        RQF_NAME(FLUSH_SEQ),
        RQF_NAME(MIXED_MERGE),
        RQF_NAME(MQ_INFLIGHT),
        RQF_NAME(DONTPREP),
+       RQF_NAME(SCHED_TAGS),
+       RQF_NAME(USE_SCHED),
        RQF_NAME(FAILED),
        RQF_NAME(QUIET),
-       RQF_NAME(ELVPRIV),
        RQF_NAME(IO_STAT),
        RQF_NAME(PM),
        RQF_NAME(HASHED),
@@ -256,7 +259,6 @@ static const char *const rqf_name[] = {
        RQF_NAME(SPECIAL_PAYLOAD),
        RQF_NAME(ZONE_WRITE_LOCKED),
        RQF_NAME(TIMED_OUT),
-       RQF_NAME(ELV),
        RQF_NAME(RESV),
 };
 #undef RQF_NAME
@@ -399,7 +401,7 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
        seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
        seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
        seq_printf(m, "active_queues=%d\n",
-                  atomic_read(&tags->active_queues));
+                  READ_ONCE(tags->active_queues));
 
        seq_puts(m, "\nbitmap_tags:\n");
        sbitmap_queue_show(&tags->bitmap_tags, m);
index 7c3cbad..1326526 100644 (file)
@@ -37,7 +37,7 @@ static inline bool
 blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
                         struct bio *bio)
 {
-       if (rq->rq_flags & RQF_ELV) {
+       if (rq->rq_flags & RQF_USE_SCHED) {
                struct elevator_queue *e = q->elevator;
 
                if (e->type->ops.allow_merge)
@@ -48,7 +48,7 @@ blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
 
 static inline void blk_mq_sched_completed_request(struct request *rq, u64 now)
 {
-       if (rq->rq_flags & RQF_ELV) {
+       if (rq->rq_flags & RQF_USE_SCHED) {
                struct elevator_queue *e = rq->q->elevator;
 
                if (e->type->ops.completed_request)
@@ -58,11 +58,11 @@ static inline void blk_mq_sched_completed_request(struct request *rq, u64 now)
 
 static inline void blk_mq_sched_requeue_request(struct request *rq)
 {
-       if (rq->rq_flags & RQF_ELV) {
+       if (rq->rq_flags & RQF_USE_SCHED) {
                struct request_queue *q = rq->q;
                struct elevator_queue *e = q->elevator;
 
-               if ((rq->rq_flags & RQF_ELVPRIV) && e->type->ops.requeue_request)
+               if (e->type->ops.requeue_request)
                        e->type->ops.requeue_request(rq);
        }
 }
index dfd81ca..cc57e2d 100644 (file)
@@ -38,6 +38,7 @@ static void blk_mq_update_wake_batch(struct blk_mq_tags *tags,
 void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
        unsigned int users;
+       struct blk_mq_tags *tags = hctx->tags;
 
        /*
         * calling test_bit() prior to test_and_set_bit() is intentional,
@@ -55,9 +56,11 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
                        return;
        }
 
-       users = atomic_inc_return(&hctx->tags->active_queues);
-
-       blk_mq_update_wake_batch(hctx->tags, users);
+       spin_lock_irq(&tags->lock);
+       users = tags->active_queues + 1;
+       WRITE_ONCE(tags->active_queues, users);
+       blk_mq_update_wake_batch(tags, users);
+       spin_unlock_irq(&tags->lock);
 }
 
 /*
@@ -90,9 +93,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
                        return;
        }
 
-       users = atomic_dec_return(&tags->active_queues);
-
+       spin_lock_irq(&tags->lock);
+       users = tags->active_queues - 1;
+       WRITE_ONCE(tags->active_queues, users);
        blk_mq_update_wake_batch(tags, users);
+       spin_unlock_irq(&tags->lock);
 
        blk_mq_tag_wakeup_all(tags, false);
 }
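
These two hunks replace the atomic active_queues counter with a plain value serialized by tags->lock on the writer side; readers stay lockless, which is why the hctx_may_queue() and debugfs hunks elsewhere in this patch switch to READ_ONCE(). Restating the resulting pattern (descriptive only, taken from the hunks):

        /* writers (__blk_mq_tag_busy()/__blk_mq_tag_idle()): under tags->lock */
        spin_lock_irq(&tags->lock);
        WRITE_ONCE(tags->active_queues, users);
        blk_mq_update_wake_batch(tags, users);
        spin_unlock_irq(&tags->lock);

        /* readers (e.g. hctx_may_queue()): lockless snapshot */
        users = READ_ONCE(hctx->tags->active_queues);
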
index 850bfb8..decb6ab 100644 (file)
@@ -45,6 +45,8 @@
 static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
 
 static void blk_mq_insert_request(struct request *rq, blk_insert_t flags);
+static void blk_mq_request_bypass_insert(struct request *rq,
+               blk_insert_t flags);
 static void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
                struct list_head *list);
 
@@ -354,12 +356,12 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
                data->rq_flags |= RQF_IO_STAT;
        rq->rq_flags = data->rq_flags;
 
-       if (!(data->rq_flags & RQF_ELV)) {
-               rq->tag = tag;
-               rq->internal_tag = BLK_MQ_NO_TAG;
-       } else {
+       if (data->rq_flags & RQF_SCHED_TAGS) {
                rq->tag = BLK_MQ_NO_TAG;
                rq->internal_tag = tag;
+       } else {
+               rq->tag = tag;
+               rq->internal_tag = BLK_MQ_NO_TAG;
        }
        rq->timeout = 0;
 
@@ -386,17 +388,14 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
        WRITE_ONCE(rq->deadline, 0);
        req_ref_set(rq, 1);
 
-       if (rq->rq_flags & RQF_ELV) {
+       if (rq->rq_flags & RQF_USE_SCHED) {
                struct elevator_queue *e = data->q->elevator;
 
                INIT_HLIST_NODE(&rq->hash);
                RB_CLEAR_NODE(&rq->rb_node);
 
-               if (!op_is_flush(data->cmd_flags) &&
-                   e->type->ops.prepare_request) {
+               if (e->type->ops.prepare_request)
                        e->type->ops.prepare_request(rq);
-                       rq->rq_flags |= RQF_ELVPRIV;
-               }
        }
 
        return rq;
@@ -449,26 +448,32 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
                data->flags |= BLK_MQ_REQ_NOWAIT;
 
        if (q->elevator) {
-               struct elevator_queue *e = q->elevator;
-
-               data->rq_flags |= RQF_ELV;
+               /*
+                * All requests use scheduler tags when an I/O scheduler is
+                * enabled for the queue.
+                */
+               data->rq_flags |= RQF_SCHED_TAGS;
 
                /*
                 * Flush/passthrough requests are special and go directly to the
-                * dispatch list. Don't include reserved tags in the
-                * limiting, as it isn't useful.
+                * dispatch list.
                 */
-               if (!op_is_flush(data->cmd_flags) &&
-                   !blk_op_is_passthrough(data->cmd_flags) &&
-                   e->type->ops.limit_depth &&
-                   !(data->flags & BLK_MQ_REQ_RESERVED))
-                       e->type->ops.limit_depth(data->cmd_flags, data);
+               if ((data->cmd_flags & REQ_OP_MASK) != REQ_OP_FLUSH &&
+                   !blk_op_is_passthrough(data->cmd_flags)) {
+                       struct elevator_mq_ops *ops = &q->elevator->type->ops;
+
+                       WARN_ON_ONCE(data->flags & BLK_MQ_REQ_RESERVED);
+
+                       data->rq_flags |= RQF_USE_SCHED;
+                       if (ops->limit_depth)
+                               ops->limit_depth(data->cmd_flags, data);
+               }
        }
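
The RQF_ELV split gives two flags with distinct meanings; a summary of how the hunks in this patch use them (descriptive comment only):

        /*
         * RQF_SCHED_TAGS: the request was allocated from hctx->sched_tags,
         *                 set whenever the queue has an elevator attached.
         * RQF_USE_SCHED:  the request is actually routed through the I/O
         *                 scheduler; set only for non-flush, non-passthrough
         *                 requests on such queues.
         */
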
 
 retry:
        data->ctx = blk_mq_get_ctx(q);
        data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
-       if (!(data->rq_flags & RQF_ELV))
+       if (!(data->rq_flags & RQF_SCHED_TAGS))
                blk_mq_tag_busy(data->hctx);
 
        if (data->flags & BLK_MQ_REQ_RESERVED)
@@ -648,10 +653,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
                goto out_queue_exit;
        data.ctx = __blk_mq_get_ctx(q, cpu);
 
-       if (!q->elevator)
-               blk_mq_tag_busy(data.hctx);
+       if (q->elevator)
+               data.rq_flags |= RQF_SCHED_TAGS;
        else
-               data.rq_flags |= RQF_ELV;
+               blk_mq_tag_busy(data.hctx);
 
        if (flags & BLK_MQ_REQ_RESERVED)
                data.rq_flags |= RQF_RESV;
@@ -699,7 +704,7 @@ void blk_mq_free_request(struct request *rq)
 {
        struct request_queue *q = rq->q;
 
-       if ((rq->rq_flags & RQF_ELVPRIV) &&
+       if ((rq->rq_flags & RQF_USE_SCHED) &&
            q->elevator->type->ops.finish_request)
                q->elevator->type->ops.finish_request(rq);
 
@@ -957,6 +962,8 @@ EXPORT_SYMBOL_GPL(blk_update_request);
 
 static inline void blk_account_io_done(struct request *req, u64 now)
 {
+       trace_block_io_done(req);
+
        /*
         * Account IO completion.  flush_rq isn't accounted as a
         * normal IO on queueing nor completion.  Accounting the
@@ -976,6 +983,8 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 
 static inline void blk_account_io_start(struct request *req)
 {
+       trace_block_io_start(req);
+
        if (blk_do_io_stat(req)) {
                /*
                 * All non-passthrough requests are created from a bio with one
@@ -1176,8 +1185,9 @@ bool blk_mq_complete_request_remote(struct request *rq)
         * or a polled request, always complete locally,
         * it's pointless to redirect the completion.
         */
-       if (rq->mq_hctx->nr_ctx == 1 ||
-               rq->cmd_flags & REQ_POLLED)
+       if ((rq->mq_hctx->nr_ctx == 1 &&
+            rq->mq_ctx->cpu == raw_smp_processor_id()) ||
+            rq->cmd_flags & REQ_POLLED)
                return false;
 
        if (blk_mq_complete_need_ipi(rq)) {
@@ -1270,7 +1280,7 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
 
        if (!plug->multiple_queues && last && last->q != rq->q)
                plug->multiple_queues = true;
-       if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
+       if (!plug->has_elevator && (rq->rq_flags & RQF_USE_SCHED))
                plug->has_elevator = true;
        rq->rq_next = NULL;
        rq_list_add(&plug->mq_list, rq);
@@ -1411,13 +1421,16 @@ static void __blk_mq_requeue_request(struct request *rq)
 void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
 {
        struct request_queue *q = rq->q;
+       unsigned long flags;
 
        __blk_mq_requeue_request(rq);
 
        /* this request will be re-inserted to io scheduler queue */
        blk_mq_sched_requeue_request(rq);
 
-       blk_mq_add_to_requeue_list(rq, BLK_MQ_INSERT_AT_HEAD);
+       spin_lock_irqsave(&q->requeue_lock, flags);
+       list_add_tail(&rq->queuelist, &q->requeue_list);
+       spin_unlock_irqrestore(&q->requeue_lock, flags);
 
        if (kick_requeue_list)
                blk_mq_kick_requeue_list(q);
@@ -1429,13 +1442,16 @@ static void blk_mq_requeue_work(struct work_struct *work)
        struct request_queue *q =
                container_of(work, struct request_queue, requeue_work.work);
        LIST_HEAD(rq_list);
-       struct request *rq, *next;
+       LIST_HEAD(flush_list);
+       struct request *rq;
 
        spin_lock_irq(&q->requeue_lock);
        list_splice_init(&q->requeue_list, &rq_list);
+       list_splice_init(&q->flush_list, &flush_list);
        spin_unlock_irq(&q->requeue_lock);
 
-       list_for_each_entry_safe(rq, next, &rq_list, queuelist) {
+       while (!list_empty(&rq_list)) {
+               rq = list_entry(rq_list.next, struct request, queuelist);
                /*
                 * If RQF_DONTPREP is set, the request has been started by the
                 * driver already and might have driver-specific data allocated
@@ -1443,18 +1459,16 @@ static void blk_mq_requeue_work(struct work_struct *work)
                 * block layer merges for the request.
                 */
                if (rq->rq_flags & RQF_DONTPREP) {
-                       rq->rq_flags &= ~RQF_SOFTBARRIER;
                        list_del_init(&rq->queuelist);
                        blk_mq_request_bypass_insert(rq, 0);
-               } else if (rq->rq_flags & RQF_SOFTBARRIER) {
-                       rq->rq_flags &= ~RQF_SOFTBARRIER;
+               } else {
                        list_del_init(&rq->queuelist);
                        blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
                }
        }
 
-       while (!list_empty(&rq_list)) {
-               rq = list_entry(rq_list.next, struct request, queuelist);
+       while (!list_empty(&flush_list)) {
+               rq = list_entry(flush_list.next, struct request, queuelist);
                list_del_init(&rq->queuelist);
                blk_mq_insert_request(rq, 0);
        }
@@ -1462,27 +1476,6 @@ static void blk_mq_requeue_work(struct work_struct *work)
        blk_mq_run_hw_queues(q, false);
 }
 
-void blk_mq_add_to_requeue_list(struct request *rq, blk_insert_t insert_flags)
-{
-       struct request_queue *q = rq->q;
-       unsigned long flags;
-
-       /*
-        * We abuse this flag that is otherwise used by the I/O scheduler to
-        * request head insertion from the workqueue.
-        */
-       BUG_ON(rq->rq_flags & RQF_SOFTBARRIER);
-
-       spin_lock_irqsave(&q->requeue_lock, flags);
-       if (insert_flags & BLK_MQ_INSERT_AT_HEAD) {
-               rq->rq_flags |= RQF_SOFTBARRIER;
-               list_add(&rq->queuelist, &q->requeue_list);
-       } else {
-               list_add_tail(&rq->queuelist, &q->requeue_list);
-       }
-       spin_unlock_irqrestore(&q->requeue_lock, flags);
-}
-
 void blk_mq_kick_requeue_list(struct request_queue *q)
 {
        kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work, 0);
@@ -2427,7 +2420,7 @@ static void blk_mq_run_work_fn(struct work_struct *work)
  * Should only be used carefully, when the caller knows we want to
  * bypass a potential IO scheduler on the target device.
  */
-void blk_mq_request_bypass_insert(struct request *rq, blk_insert_t flags)
+static void blk_mq_request_bypass_insert(struct request *rq, blk_insert_t flags)
 {
        struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
@@ -2492,7 +2485,7 @@ static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
                 * dispatch it given we prioritize requests in hctx->dispatch.
                 */
                blk_mq_request_bypass_insert(rq, flags);
-       } else if (rq->rq_flags & RQF_FLUSH_SEQ) {
+       } else if (req_op(rq) == REQ_OP_FLUSH) {
                /*
                 * Firstly normal IO request is inserted to scheduler queue or
                 * sw queue, meantime we add flush request to dispatch queue(
@@ -2622,7 +2615,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
                return;
        }
 
-       if ((rq->rq_flags & RQF_ELV) || !blk_mq_get_budget_and_tag(rq)) {
+       if ((rq->rq_flags & RQF_USE_SCHED) || !blk_mq_get_budget_and_tag(rq)) {
                blk_mq_insert_request(rq, 0);
                blk_mq_run_hw_queue(hctx, false);
                return;
@@ -2711,6 +2704,7 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
        struct request *requeue_list = NULL;
        struct request **requeue_lastp = &requeue_list;
        unsigned int depth = 0;
+       bool is_passthrough = false;
        LIST_HEAD(list);
 
        do {
@@ -2719,7 +2713,9 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
                if (!this_hctx) {
                        this_hctx = rq->mq_hctx;
                        this_ctx = rq->mq_ctx;
-               } else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) {
+                       is_passthrough = blk_rq_is_passthrough(rq);
+               } else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx ||
+                          is_passthrough != blk_rq_is_passthrough(rq)) {
                        rq_list_add_tail(&requeue_lastp, rq);
                        continue;
                }
@@ -2731,7 +2727,13 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
        trace_block_unplug(this_hctx->queue, depth, !from_sched);
 
        percpu_ref_get(&this_hctx->queue->q_usage_counter);
-       if (this_hctx->queue->elevator) {
+       /* passthrough requests should never be issued to the I/O scheduler */
+       if (is_passthrough) {
+               spin_lock(&this_hctx->lock);
+               list_splice_tail_init(&list, &this_hctx->dispatch);
+               spin_unlock(&this_hctx->lock);
+               blk_mq_run_hw_queue(this_hctx, from_sched);
+       } else if (this_hctx->queue->elevator) {
                this_hctx->queue->elevator->type->ops.insert_requests(this_hctx,
                                &list, 0);
                blk_mq_run_hw_queue(this_hctx, from_sched);
@@ -2970,10 +2972,8 @@ void blk_mq_submit_bio(struct bio *bio)
                return;
        }
 
-       if (op_is_flush(bio->bi_opf)) {
-               blk_insert_flush(rq);
+       if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
                return;
-       }
 
        if (plug) {
                blk_add_rq_to_plug(plug, rq);
@@ -2981,7 +2981,7 @@ void blk_mq_submit_bio(struct bio *bio)
        }
 
        hctx = rq->mq_hctx;
-       if ((rq->rq_flags & RQF_ELV) ||
+       if ((rq->rq_flags & RQF_USE_SCHED) ||
            (hctx->dispatch_busy && (q->nr_hw_queues == 1 || !is_sync))) {
                blk_mq_insert_request(rq, 0);
                blk_mq_run_hw_queue(hctx, true);
@@ -4232,6 +4232,7 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
        blk_mq_update_poll_flag(q);
 
        INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
+       INIT_LIST_HEAD(&q->flush_list);
        INIT_LIST_HEAD(&q->requeue_list);
        spin_lock_init(&q->requeue_lock);
 
@@ -4608,9 +4609,6 @@ static bool blk_mq_elv_switch_none(struct list_head *head,
 {
        struct blk_mq_qe_pair *qe;
 
-       if (!q->elevator)
-               return true;
-
        qe = kmalloc(sizeof(*qe), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
        if (!qe)
                return false;
@@ -4618,6 +4616,12 @@ static bool blk_mq_elv_switch_none(struct list_head *head,
        /* q->elevator needs protection from ->sysfs_lock */
        mutex_lock(&q->sysfs_lock);
 
+       /* the check has to be done while holding sysfs_lock */
+       if (!q->elevator) {
+               kfree(qe);
+               goto unlock;
+       }
+
        INIT_LIST_HEAD(&qe->node);
        qe->q = q;
        qe->type = q->elevator->type;
@@ -4625,6 +4629,7 @@ static bool blk_mq_elv_switch_none(struct list_head *head,
        __elevator_get(qe->type);
        list_add(&qe->node, head);
        elevator_disable(q);
+unlock:
        mutex_unlock(&q->sysfs_lock);
 
        return true;
index e876584..1743857 100644 (file)
@@ -47,7 +47,6 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *,
                             unsigned int);
-void blk_mq_add_to_requeue_list(struct request *rq, blk_insert_t insert_flags);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
                                        struct blk_mq_ctx *start);
@@ -64,10 +63,6 @@ struct blk_mq_tags *blk_mq_alloc_map_and_rqs(struct blk_mq_tag_set *set,
 void blk_mq_free_map_and_rqs(struct blk_mq_tag_set *set,
                             struct blk_mq_tags *tags,
                             unsigned int hctx_idx);
-/*
- * Internal helpers for request insertion into sw queues
- */
-void blk_mq_request_bypass_insert(struct request *rq, blk_insert_t flags);
 
 /*
  * CPU -> queue mappings
@@ -226,9 +221,9 @@ static inline bool blk_mq_is_shared_tags(unsigned int flags)
 
 static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
 {
-       if (!(data->rq_flags & RQF_ELV))
-               return data->hctx->tags;
-       return data->hctx->sched_tags;
+       if (data->rq_flags & RQF_SCHED_TAGS)
+               return data->hctx->sched_tags;
+       return data->hctx->tags;
 }
 
 static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
@@ -417,8 +412,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
                        return true;
        }
 
-       users = atomic_read(&hctx->tags->active_queues);
-
+       users = READ_ONCE(hctx->tags->active_queues);
        if (!users)
                return true;
 
index d8cc820..167be74 100644 (file)
@@ -288,11 +288,13 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data,
 
 void rq_qos_exit(struct request_queue *q)
 {
+       mutex_lock(&q->rq_qos_mutex);
        while (q->rq_qos) {
                struct rq_qos *rqos = q->rq_qos;
                q->rq_qos = rqos->next;
                rqos->ops->exit(rqos);
        }
+       mutex_unlock(&q->rq_qos_mutex);
 }
 
 int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk, enum rq_qos_id id,
@@ -300,6 +302,8 @@ int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk, enum rq_qos_id id,
 {
        struct request_queue *q = disk->queue;
 
+       lockdep_assert_held(&q->rq_qos_mutex);
+
        rqos->disk = disk;
        rqos->id = id;
        rqos->ops = ops;
@@ -307,18 +311,13 @@ int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk, enum rq_qos_id id,
        /*
         * No IO can be in-flight when adding rqos, so freeze queue, which
         * is fine since we only support rq_qos for blk-mq queue.
-        *
-        * Reuse ->queue_lock for protecting against other concurrent
-        * rq_qos adding/deleting
         */
        blk_mq_freeze_queue(q);
 
-       spin_lock_irq(&q->queue_lock);
        if (rq_qos_id(q, rqos->id))
                goto ebusy;
        rqos->next = q->rq_qos;
        q->rq_qos = rqos;
-       spin_unlock_irq(&q->queue_lock);
 
        blk_mq_unfreeze_queue(q);
 
@@ -330,7 +329,6 @@ int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk, enum rq_qos_id id,
 
        return 0;
 ebusy:
-       spin_unlock_irq(&q->queue_lock);
        blk_mq_unfreeze_queue(q);
        return -EBUSY;
 }
@@ -340,21 +338,15 @@ void rq_qos_del(struct rq_qos *rqos)
        struct request_queue *q = rqos->disk->queue;
        struct rq_qos **cur;
 
-       /*
-        * See comment in rq_qos_add() about freezing queue & using
-        * ->queue_lock.
-        */
-       blk_mq_freeze_queue(q);
+       lockdep_assert_held(&q->rq_qos_mutex);
 
-       spin_lock_irq(&q->queue_lock);
+       blk_mq_freeze_queue(q);
        for (cur = &q->rq_qos; *cur; cur = &(*cur)->next) {
                if (*cur == rqos) {
                        *cur = rqos->next;
                        break;
                }
        }
-       spin_unlock_irq(&q->queue_lock);
-
        blk_mq_unfreeze_queue(q);
 
        mutex_lock(&q->debugfs_mutex);
index 9ec2a2f..7a87506 100644 (file)
@@ -944,7 +944,9 @@ int wbt_init(struct gendisk *disk)
        /*
         * Assign rwb and add the stats callback.
         */
+       mutex_lock(&q->rq_qos_mutex);
        ret = rq_qos_add(&rwb->rqos, disk, RQ_QOS_WBT, &wbt_rqos_ops);
+       mutex_unlock(&q->rq_qos_mutex);
        if (ret)
                goto err_free;
 
index fce9082..0f9f97c 100644 (file)
@@ -57,16 +57,10 @@ EXPORT_SYMBOL_GPL(blk_zone_cond_str);
  */
 bool blk_req_needs_zone_write_lock(struct request *rq)
 {
-       if (blk_rq_is_passthrough(rq))
-               return false;
-
        if (!rq->q->disk->seq_zones_wlock)
                return false;
 
-       if (bdev_op_is_zoned_write(rq->q->disk->part0, req_op(rq)))
-               return blk_rq_zone_is_seq(rq);
-
-       return false;
+       return blk_rq_is_seq_zoned_write(rq);
 }
 EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);
 
@@ -329,8 +323,8 @@ static int blkdev_copy_zone_to_user(struct blk_zone *zone, unsigned int idx,
  * BLKREPORTZONE ioctl processing.
  * Called from blkdev_ioctl.
  */
-int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
-                             unsigned int cmd, unsigned long arg)
+int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
+               unsigned long arg)
 {
        void __user *argp = (void __user *)arg;
        struct zone_report_args args;
@@ -362,8 +356,8 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
        return 0;
 }
 
-static int blkdev_truncate_zone_range(struct block_device *bdev, fmode_t mode,
-                                     const struct blk_zone_range *zrange)
+static int blkdev_truncate_zone_range(struct block_device *bdev,
+               blk_mode_t mode, const struct blk_zone_range *zrange)
 {
        loff_t start, end;
 
@@ -382,7 +376,7 @@ static int blkdev_truncate_zone_range(struct block_device *bdev, fmode_t mode,
  * BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE and BLKFINISHZONE ioctl processing.
  * Called from blkdev_ioctl.
  */
-int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
+int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
                           unsigned int cmd, unsigned long arg)
 {
        void __user *argp = (void __user *)arg;
@@ -396,7 +390,7 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
        if (!bdev_is_zoned(bdev))
                return -ENOTTY;
 
-       if (!(mode & FMODE_WRITE))
+       if (!(mode & BLK_OPEN_WRITE))
                return -EBADF;
 
        if (copy_from_user(&zrange, argp, sizeof(struct blk_zone_range)))
index 45547bc..608c5dc 100644 (file)
@@ -269,7 +269,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
  */
 #define ELV_ON_HASH(rq) ((rq)->rq_flags & RQF_HASHED)
 
-void blk_insert_flush(struct request *rq);
+bool blk_insert_flush(struct request *rq);
 
 int elevator_switch(struct request_queue *q, struct elevator_type *new_e);
 void elevator_disable(struct request_queue *q);
@@ -394,10 +394,27 @@ static inline struct bio *blk_queue_bounce(struct bio *bio,
 #ifdef CONFIG_BLK_DEV_ZONED
 void disk_free_zone_bitmaps(struct gendisk *disk);
 void disk_clear_zone_settings(struct gendisk *disk);
-#else
+int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
+               unsigned long arg);
+int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
+               unsigned int cmd, unsigned long arg);
+#else /* CONFIG_BLK_DEV_ZONED */
 static inline void disk_free_zone_bitmaps(struct gendisk *disk) {}
 static inline void disk_clear_zone_settings(struct gendisk *disk) {}
-#endif
+static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
+               unsigned int cmd, unsigned long arg)
+{
+       return -ENOTTY;
+}
+static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
+               blk_mode_t mode, unsigned int cmd, unsigned long arg)
+{
+       return -ENOTTY;
+}
+#endif /* CONFIG_BLK_DEV_ZONED */
+
+struct block_device *bdev_alloc(struct gendisk *disk, u8 partno);
+void bdev_add(struct block_device *bdev, dev_t dev);
 
 int blk_alloc_ext_minor(void);
 void blk_free_ext_minor(unsigned int minor);
@@ -409,7 +426,7 @@ int bdev_add_partition(struct gendisk *disk, int partno, sector_t start,
 int bdev_del_partition(struct gendisk *disk, int partno);
 int bdev_resize_partition(struct gendisk *disk, int partno, sector_t start,
                sector_t length);
-void blk_drop_partitions(struct gendisk *disk);
+void drop_partition(struct block_device *part);
 
 void bdev_set_nr_sectors(struct block_device *bdev, sector_t sectors);
 
@@ -420,9 +437,19 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
                struct page *page, unsigned int len, unsigned int offset,
                unsigned int max_sectors, bool *same_page);
 
+/*
+ * Clean up a page appropriately, where the page may be pinned, may have a
+ * ref taken on it or neither.
+ */
+static inline void bio_release_page(struct bio *bio, struct page *page)
+{
+       if (bio_flagged(bio, BIO_PAGE_PINNED))
+               unpin_user_page(page);
+}
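
This helper is what the earlier put_page() -> bio_release_page() conversions call into. As the hunks suggest, pages obtained through iov_iter_extract_pages() are pinned rather than ref-counted, so cleanup has to be conditional on BIO_PAGE_PINNED (summary only):

        /*
         * BIO_PAGE_PINNED set:   pages came from iov_iter_extract_pages()
         *                        with a pin; release with unpin_user_page().
         * BIO_PAGE_PINNED clear: no reference was taken on the pages, so
         *                        there is nothing to drop.
         */
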
+
 struct request_queue *blk_alloc_queue(int node_id);
 
-int disk_scan_partitions(struct gendisk *disk, fmode_t mode);
+int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode);
 
 int disk_alloc_events(struct gendisk *disk);
 void disk_add_events(struct gendisk *disk);
@@ -437,6 +464,9 @@ extern struct device_attribute dev_attr_events_poll_msecs;
 
 extern struct attribute_group blk_trace_attr_group;
 
+blk_mode_t file_to_blk_mode(struct file *file);
+int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
+               loff_t lstart, loff_t lend);
 long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
 long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
 
index 435c323..b3acdbd 100644 (file)
@@ -26,7 +26,7 @@ struct bsg_set {
 };
 
 static int bsg_transport_sg_io_fn(struct request_queue *q, struct sg_io_v4 *hdr,
-               fmode_t mode, unsigned int timeout)
+               bool open_for_write, unsigned int timeout)
 {
        struct bsg_job *job;
        struct request *rq;
index 7eca43f..1a9396a 100644 (file)
@@ -39,7 +39,7 @@ static inline struct bsg_device *to_bsg_device(struct inode *inode)
 #define BSG_MAX_DEVS           32768
 
 static DEFINE_IDA(bsg_minor_ida);
-static struct class *bsg_class;
+static const struct class bsg_class;
 static int bsg_major;
 
 static unsigned int bsg_timeout(struct bsg_device *bd, struct sg_io_v4 *hdr)
@@ -54,7 +54,8 @@ static unsigned int bsg_timeout(struct bsg_device *bd, struct sg_io_v4 *hdr)
        return max_t(unsigned int, timeout, BLK_MIN_SG_TIMEOUT);
 }
 
-static int bsg_sg_io(struct bsg_device *bd, fmode_t mode, void __user *uarg)
+static int bsg_sg_io(struct bsg_device *bd, bool open_for_write,
+                    void __user *uarg)
 {
        struct sg_io_v4 hdr;
        int ret;
@@ -63,7 +64,8 @@ static int bsg_sg_io(struct bsg_device *bd, fmode_t mode, void __user *uarg)
                return -EFAULT;
        if (hdr.guard != 'Q')
                return -EINVAL;
-       ret = bd->sg_io_fn(bd->queue, &hdr, mode, bsg_timeout(bd, &hdr));
+       ret = bd->sg_io_fn(bd->queue, &hdr, open_for_write,
+                          bsg_timeout(bd, &hdr));
        if (!ret && copy_to_user(uarg, &hdr, sizeof(hdr)))
                return -EFAULT;
        return ret;
@@ -146,7 +148,7 @@ static long bsg_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
        case SG_EMULATED_HOST:
                return put_user(1, intp);
        case SG_IO:
-               return bsg_sg_io(bd, file->f_mode, uarg);
+               return bsg_sg_io(bd, file->f_mode & FMODE_WRITE, uarg);
        case SCSI_IOCTL_SEND_COMMAND:
                pr_warn_ratelimited("%s: calling unsupported SCSI_IOCTL_SEND_COMMAND\n",
                                current->comm);
@@ -206,7 +208,7 @@ struct bsg_device *bsg_register_queue(struct request_queue *q,
                return ERR_PTR(ret);
        }
        bd->device.devt = MKDEV(bsg_major, ret);
-       bd->device.class = bsg_class;
+       bd->device.class = &bsg_class;
        bd->device.parent = parent;
        bd->device.release = bsg_device_release;
        dev_set_name(&bd->device, "%s", name);
@@ -240,15 +242,19 @@ static char *bsg_devnode(const struct device *dev, umode_t *mode)
        return kasprintf(GFP_KERNEL, "bsg/%s", dev_name(dev));
 }
 
+static const struct class bsg_class = {
+       .name           = "bsg",
+       .devnode        = bsg_devnode,
+};
+
 static int __init bsg_init(void)
 {
        dev_t devid;
        int ret;
 
-       bsg_class = class_create("bsg");
-       if (IS_ERR(bsg_class))
-               return PTR_ERR(bsg_class);
-       bsg_class->devnode = bsg_devnode;
+       ret = class_register(&bsg_class);
+       if (ret)
+               return ret;
 
        ret = alloc_chrdev_region(&devid, 0, BSG_MAX_DEVS, "bsg");
        if (ret)
@@ -260,7 +266,7 @@ static int __init bsg_init(void)
        return 0;
 
 destroy_bsg_class:
-       class_destroy(bsg_class);
+       class_unregister(&bsg_class);
        return ret;
 }
 
index aee25a7..0cfac46 100644 (file)
@@ -263,31 +263,31 @@ static unsigned int disk_clear_events(struct gendisk *disk, unsigned int mask)
 }
 
 /**
- * bdev_check_media_change - check if a removable media has been changed
- * @bdev: block device to check
+ * disk_check_media_change - check if a removable media has been changed
+ * @disk: gendisk to check
  *
  * Check whether a removable media has been changed, and attempt to free all
  * dentries and inodes and invalidates all block device page cache entries in
  * that case.
  *
- * Returns %true if the block device changed, or %false if not.
+ * Returns %true if the media has changed, or %false if not.
  */
-bool bdev_check_media_change(struct block_device *bdev)
+bool disk_check_media_change(struct gendisk *disk)
 {
        unsigned int events;
 
-       events = disk_clear_events(bdev->bd_disk, DISK_EVENT_MEDIA_CHANGE |
+       events = disk_clear_events(disk, DISK_EVENT_MEDIA_CHANGE |
                                   DISK_EVENT_EJECT_REQUEST);
        if (!(events & DISK_EVENT_MEDIA_CHANGE))
                return false;
 
-       if (__invalidate_device(bdev, true))
+       if (__invalidate_device(disk->part0, true))
                pr_warn("VFS: busy inodes on changed media %s\n",
-                       bdev->bd_disk->disk_name);
-       set_bit(GD_NEED_PART_SCAN, &bdev->bd_disk->state);
+                       disk->disk_name);
+       set_bit(GD_NEED_PART_SCAN, &disk->state);
        return true;
 }
-EXPORT_SYMBOL(bdev_check_media_change);
+EXPORT_SYMBOL(disk_check_media_change);
 
 /**
  * disk_force_media_change - force a media change event
@@ -307,6 +307,7 @@ bool disk_force_media_change(struct gendisk *disk, unsigned int events)
        if (!(events & DISK_EVENT_MEDIA_CHANGE))
                return false;
 
+       inc_diskseq(disk);
        if (__invalidate_device(disk->part0, true))
                pr_warn("VFS: busy inodes on changed media %s\n",
                        disk->disk_name);
diff --git a/block/early-lookup.c b/block/early-lookup.c
new file mode 100644 (file)
index 0000000..3effbd0
--- /dev/null
@@ -0,0 +1,316 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Code for looking up block devices in the early boot code before mounting the
+ * root file system.
+ */
+#include <linux/blkdev.h>
+#include <linux/ctype.h>
+
+struct uuidcmp {
+       const char *uuid;
+       int len;
+};
+
+/**
+ * match_dev_by_uuid - callback for finding a partition using its uuid
+ * @dev:       device passed in by the caller
+ * @data:      opaque pointer to the desired struct uuidcmp to match
+ *
+ * Returns 1 if the device matches, and 0 otherwise.
+ */
+static int __init match_dev_by_uuid(struct device *dev, const void *data)
+{
+       struct block_device *bdev = dev_to_bdev(dev);
+       const struct uuidcmp *cmp = data;
+
+       if (!bdev->bd_meta_info ||
+           strncasecmp(cmp->uuid, bdev->bd_meta_info->uuid, cmp->len))
+               return 0;
+       return 1;
+}
+
+/**
+ * devt_from_partuuid - looks up the dev_t of a partition by its UUID
+ * @uuid_str:  char array containing ascii UUID
+ * @devt:      dev_t result
+ *
+ * The function will return the first partition which contains a matching
+ * UUID value in its partition_meta_info struct.  This does not search
+ * by filesystem UUIDs.
+ *
+ * If @uuid_str is followed by a "/PARTNROFF=%d", then the number will be
+ * extracted and used as an offset from the partition identified by the UUID.
+ *
+ * Returns 0 on success or a negative error code on failure.
+ */
+static int __init devt_from_partuuid(const char *uuid_str, dev_t *devt)
+{
+       struct uuidcmp cmp;
+       struct device *dev = NULL;
+       int offset = 0;
+       char *slash;
+
+       cmp.uuid = uuid_str;
+
+       slash = strchr(uuid_str, '/');
+       /* Check for optional partition number offset attributes. */
+       if (slash) {
+               char c = 0;
+
+               /* Explicitly fail on poor PARTUUID syntax. */
+               if (sscanf(slash + 1, "PARTNROFF=%d%c", &offset, &c) != 1)
+                       goto out_invalid;
+               cmp.len = slash - uuid_str;
+       } else {
+               cmp.len = strlen(uuid_str);
+       }
+
+       if (!cmp.len)
+               goto out_invalid;
+
+       dev = class_find_device(&block_class, NULL, &cmp, &match_dev_by_uuid);
+       if (!dev)
+               return -ENODEV;
+
+       if (offset) {
+               /*
+                * Attempt to find the requested partition by adding an offset
+                * to the partition number found by UUID.
+                */
+               *devt = part_devt(dev_to_disk(dev),
+                                 dev_to_bdev(dev)->bd_partno + offset);
+       } else {
+               *devt = dev->devt;
+       }
+
+       put_device(dev);
+       return 0;
+
+out_invalid:
+       pr_err("VFS: PARTUUID= is invalid.\n"
+              "Expected PARTUUID=<valid-uuid-id>[/PARTNROFF=%%d]\n");
+       return -EINVAL;
+}
+
+/**
+ * match_dev_by_label - callback for finding a partition using its label
+ * @dev:       device passed in by the caller
+ * @data:      opaque pointer to the label to match
+ *
+ * Returns 1 if the device matches, and 0 otherwise.
+ */
+static int __init match_dev_by_label(struct device *dev, const void *data)
+{
+       struct block_device *bdev = dev_to_bdev(dev);
+       const char *label = data;
+
+       if (!bdev->bd_meta_info || strcmp(label, bdev->bd_meta_info->volname))
+               return 0;
+       return 1;
+}
+
+static int __init devt_from_partlabel(const char *label, dev_t *devt)
+{
+       struct device *dev;
+
+       dev = class_find_device(&block_class, NULL, label, &match_dev_by_label);
+       if (!dev)
+               return -ENODEV;
+       *devt = dev->devt;
+       put_device(dev);
+       return 0;
+}
+
+static dev_t __init blk_lookup_devt(const char *name, int partno)
+{
+       dev_t devt = MKDEV(0, 0);
+       struct class_dev_iter iter;
+       struct device *dev;
+
+       class_dev_iter_init(&iter, &block_class, NULL, &disk_type);
+       while ((dev = class_dev_iter_next(&iter))) {
+               struct gendisk *disk = dev_to_disk(dev);
+
+               if (strcmp(dev_name(dev), name))
+                       continue;
+
+               if (partno < disk->minors) {
+                       /* We need to return the right devno, even
+                        * if the partition doesn't exist yet.
+                        */
+                       devt = MKDEV(MAJOR(dev->devt),
+                                    MINOR(dev->devt) + partno);
+               } else {
+                       devt = part_devt(disk, partno);
+                       if (devt)
+                               break;
+               }
+       }
+       class_dev_iter_exit(&iter);
+       return devt;
+}
+
+static int __init devt_from_devname(const char *name, dev_t *devt)
+{
+       int part;
+       char s[32];
+       char *p;
+
+       if (strlen(name) > 31)
+               return -EINVAL;
+       strcpy(s, name);
+       for (p = s; *p; p++) {
+               if (*p == '/')
+                       *p = '!';
+       }
+
+       *devt = blk_lookup_devt(s, 0);
+       if (*devt)
+               return 0;
+
+       /*
+        * Try a non-existent but valid partition, which may only exist after
+        * opening the device, like partitioned md devices.
+        */
+       while (p > s && isdigit(p[-1]))
+               p--;
+       if (p == s || !*p || *p == '0')
+               return -ENODEV;
+
+       /* try disk name without <part number> */
+       part = simple_strtoul(p, NULL, 10);
+       *p = '\0';
+       *devt = blk_lookup_devt(s, part);
+       if (*devt)
+               return 0;
+
+       /* try disk name without p<part number> */
+       if (p < s + 2 || !isdigit(p[-2]) || p[-1] != 'p')
+               return -ENODEV;
+       p[-1] = '\0';
+       *devt = blk_lookup_devt(s, part);
+       if (*devt)
+               return 0;
+       return -ENODEV;
+}
+
+static int __init devt_from_devnum(const char *name, dev_t *devt)
+{
+       unsigned maj, min, offset;
+       char *p, dummy;
+
+       if (sscanf(name, "%u:%u%c", &maj, &min, &dummy) == 2 ||
+           sscanf(name, "%u:%u:%u:%c", &maj, &min, &offset, &dummy) == 3) {
+               *devt = MKDEV(maj, min);
+               if (maj != MAJOR(*devt) || min != MINOR(*devt))
+                       return -EINVAL;
+       } else {
+               *devt = new_decode_dev(simple_strtoul(name, &p, 16));
+               if (*p)
+                       return -EINVAL;
+       }
+
+       return 0;
+}
+
+/*
+ *     Convert a name into device number.  We accept the following variants:
+ *
+ *     1) <hex_major><hex_minor> device number in hexadecimal represents itself
+ *         no leading 0x, for example b302.
+ *     2) /dev/<disk_name> represents the device number of disk
+ *     3) /dev/<disk_name><decimal> represents the device number
+ *         of partition - device number of disk plus the partition number
+ *     4) /dev/<disk_name>p<decimal> - same as the above, that form is
+ *        used when disk name of partitioned disk ends on a digit.
+ *     5) PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF representing the
+ *        unique id of a partition if the partition table provides it.
+ *        The UUID may be either an EFI/GPT UUID, or refer to an MSDOS
+ *        partition using the format SSSSSSSS-PP, where SSSSSSSS is a zero-
+ *        filled hex representation of the 32-bit "NT disk signature", and PP
+ *        is a zero-filled hex representation of the 1-based partition number.
+ *     6) PARTUUID=<UUID>/PARTNROFF=<int> to select a partition in relation to
+ *        a partition with a known unique id.
+ *     7) <major>:<minor> major and minor number of the device separated by
+ *        a colon.
+ *     8) PARTLABEL=<name> with name being the GPT partition label.
+ *        MSDOS partitions do not support labels!
+ *
+ *     If name doesn't fall into one of the categories above, an error is
+ *     returned.
+ *     block_class is used to check if something is a disk name. If the disk
+ *     name contains slashes, the device name has them replaced with
+ *     bangs.
+ */
+int __init early_lookup_bdev(const char *name, dev_t *devt)
+{
+       if (strncmp(name, "PARTUUID=", 9) == 0)
+               return devt_from_partuuid(name + 9, devt);
+       if (strncmp(name, "PARTLABEL=", 10) == 0)
+               return devt_from_partlabel(name + 10, devt);
+       if (strncmp(name, "/dev/", 5) == 0)
+               return devt_from_devname(name + 5, devt);
+       return devt_from_devnum(name, devt);
+}
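
A minimal sketch of the intended calling convention for the new helper (hypothetical caller, illustrative only; the name string would typically come from an early boot parameter such as root=):

        dev_t devt;

        if (early_lookup_bdev(name, &devt))
                panic("cannot resolve block device %s", name);
        /* devt now identifies the requested disk or partition */
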
+
+static char __init *bdevt_str(dev_t devt, char *buf)
+{
+       if (MAJOR(devt) <= 0xff && MINOR(devt) <= 0xff) {
+               char tbuf[BDEVT_SIZE];
+               snprintf(tbuf, BDEVT_SIZE, "%02x%02x", MAJOR(devt), MINOR(devt));
+               snprintf(buf, BDEVT_SIZE, "%-9s", tbuf);
+       } else
+               snprintf(buf, BDEVT_SIZE, "%03x:%05x", MAJOR(devt), MINOR(devt));
+
+       return buf;
+}
+
+/*
+ * print a full list of all partitions - intended for places where the root
+ * filesystem can't be mounted and thus to give the victim some idea of what
+ * went wrong
+ */
+void __init printk_all_partitions(void)
+{
+       struct class_dev_iter iter;
+       struct device *dev;
+
+       class_dev_iter_init(&iter, &block_class, NULL, &disk_type);
+       while ((dev = class_dev_iter_next(&iter))) {
+               struct gendisk *disk = dev_to_disk(dev);
+               struct block_device *part;
+               char devt_buf[BDEVT_SIZE];
+               unsigned long idx;
+
+               /*
+                * Don't show empty devices or things that have been
+                * suppressed
+                */
+               if (get_capacity(disk) == 0 || (disk->flags & GENHD_FL_HIDDEN))
+                       continue;
+
+               /*
+                * Note, unlike /proc/partitions, I am showing the numbers in
+                * hex - the same format as the root= option takes.
+                */
+               rcu_read_lock();
+               xa_for_each(&disk->part_tbl, idx, part) {
+                       if (!bdev_nr_sectors(part))
+                               continue;
+                       printk("%s%s %10llu %pg %s",
+                              bdev_is_partition(part) ? "  " : "",
+                              bdevt_str(part->bd_dev, devt_buf),
+                              bdev_nr_sectors(part) >> 1, part,
+                              part->bd_meta_info ?
+                                       part->bd_meta_info->uuid : "");
+                       if (bdev_is_partition(part))
+                               printk("\n");
+                       else if (dev->parent && dev->parent->driver)
+                               printk(" driver: %s\n",
+                                       dev->parent->driver->name);
+                       else
+                               printk(" (driver?)\n");
+               }
+               rcu_read_unlock();
+       }
+       class_dev_iter_exit(&iter);
+}
index 2490906..8400e30 100644 (file)
@@ -751,7 +751,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *buf,
        if (!elv_support_iosched(q))
                return count;
 
-       strlcpy(elevator_name, buf, sizeof(elevator_name));
+       strscpy(elevator_name, buf, sizeof(elevator_name));
        ret = elevator_change(q, strstrip(elevator_name));
        if (!ret)
                return count;
index 58d0aeb..a286bf3 100644 (file)
@@ -54,7 +54,7 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
 static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
                struct iov_iter *iter, unsigned int nr_pages)
 {
-       struct block_device *bdev = iocb->ki_filp->private_data;
+       struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
        struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs;
        loff_t pos = iocb->ki_pos;
        bool should_dirty = false;
@@ -170,7 +170,7 @@ static void blkdev_bio_end_io(struct bio *bio)
 static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
                unsigned int nr_pages)
 {
-       struct block_device *bdev = iocb->ki_filp->private_data;
+       struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
        struct blk_plug plug;
        struct blkdev_dio *dio;
        struct bio *bio;
@@ -310,7 +310,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
                                        struct iov_iter *iter,
                                        unsigned int nr_pages)
 {
-       struct block_device *bdev = iocb->ki_filp->private_data;
+       struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
        bool is_read = iov_iter_rw(iter) == READ;
        blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
        struct blkdev_dio *dio;
@@ -451,7 +451,7 @@ static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
 static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
                int datasync)
 {
-       struct block_device *bdev = filp->private_data;
+       struct block_device *bdev = I_BDEV(filp->f_mapping->host);
        int error;
 
        error = file_write_and_wait_range(filp, start, end);
@@ -470,6 +470,30 @@ static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
        return error;
 }
 
+blk_mode_t file_to_blk_mode(struct file *file)
+{
+       blk_mode_t mode = 0;
+
+       if (file->f_mode & FMODE_READ)
+               mode |= BLK_OPEN_READ;
+       if (file->f_mode & FMODE_WRITE)
+               mode |= BLK_OPEN_WRITE;
+       if (file->private_data)
+               mode |= BLK_OPEN_EXCL;
+       if (file->f_flags & O_NDELAY)
+               mode |= BLK_OPEN_NDELAY;
+
+       /*
+        * If all bits in O_ACCMODE are set (aka O_RDWR | O_WRONLY), the floppy
+        * driver has historically allowed ioctls as if the file was opened for
+        * writing, but does not allow any actual reads or writes.
+        */
+       if ((file->f_flags & O_ACCMODE) == (O_RDWR | O_WRONLY))
+               mode |= BLK_OPEN_WRITE_IOCTL;
+
+       return mode;
+}
+
 static int blkdev_open(struct inode *inode, struct file *filp)
 {
        struct block_device *bdev;
@@ -481,30 +505,31 @@ static int blkdev_open(struct inode *inode, struct file *filp)
         * during an unstable branch.
         */
        filp->f_flags |= O_LARGEFILE;
-       filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+       filp->f_mode |= FMODE_BUF_RASYNC;
 
-       if (filp->f_flags & O_NDELAY)
-               filp->f_mode |= FMODE_NDELAY;
+       /*
+        * Use the file private data to store the holder for exclusive opens.
+        * file_to_blk_mode relies on it being present to set BLK_OPEN_EXCL.
+        */
        if (filp->f_flags & O_EXCL)
-               filp->f_mode |= FMODE_EXCL;
-       if ((filp->f_flags & O_ACCMODE) == 3)
-               filp->f_mode |= FMODE_WRITE_IOCTL;
+               filp->private_data = filp;
 
-       bdev = blkdev_get_by_dev(inode->i_rdev, filp->f_mode, filp);
+       bdev = blkdev_get_by_dev(inode->i_rdev, file_to_blk_mode(filp),
+                                filp->private_data, NULL);
        if (IS_ERR(bdev))
                return PTR_ERR(bdev);
 
-       filp->private_data = bdev;
+       if (bdev_nowait(bdev))
+               filp->f_mode |= FMODE_NOWAIT;
+
        filp->f_mapping = bdev->bd_inode->i_mapping;
        filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
        return 0;
 }
 
-static int blkdev_close(struct inode *inode, struct file *filp)
+static int blkdev_release(struct inode *inode, struct file *filp)
 {
-       struct block_device *bdev = filp->private_data;
-
-       blkdev_put(bdev, filp->f_mode);
+       blkdev_put(I_BDEV(filp->f_mapping->host), filp->private_data);
        return 0;
 }
 
@@ -517,10 +542,9 @@ static int blkdev_close(struct inode *inode, struct file *filp)
  */
 static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
-       struct block_device *bdev = iocb->ki_filp->private_data;
+       struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
        struct inode *bd_inode = bdev->bd_inode;
        loff_t size = bdev_nr_bytes(bdev);
-       struct blk_plug plug;
        size_t shorted = 0;
        ssize_t ret;
 
@@ -545,18 +569,16 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
                iov_iter_truncate(from, size);
        }
 
-       blk_start_plug(&plug);
        ret = __generic_file_write_iter(iocb, from);
        if (ret > 0)
                ret = generic_write_sync(iocb, ret);
        iov_iter_reexpand(from, iov_iter_count(from) + shorted);
-       blk_finish_plug(&plug);
        return ret;
 }
 
 static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
-       struct block_device *bdev = iocb->ki_filp->private_data;
+       struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
        loff_t size = bdev_nr_bytes(bdev);
        loff_t pos = iocb->ki_pos;
        size_t shorted = 0;
@@ -576,21 +598,9 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
                goto reexpand; /* skip atime */
 
        if (iocb->ki_flags & IOCB_DIRECT) {
-               struct address_space *mapping = iocb->ki_filp->f_mapping;
-
-               if (iocb->ki_flags & IOCB_NOWAIT) {
-                       if (filemap_range_needs_writeback(mapping, pos,
-                                                         pos + count - 1)) {
-                               ret = -EAGAIN;
-                               goto reexpand;
-                       }
-               } else {
-                       ret = filemap_write_and_wait_range(mapping, pos,
-                                                          pos + count - 1);
-                       if (ret < 0)
-                               goto reexpand;
-               }
-
+               ret = kiocb_write_and_wait(iocb, count);
+               if (ret < 0)
+                       goto reexpand;
                file_accessed(iocb->ki_filp);
 
                ret = blkdev_direct_IO(iocb, to);
@@ -649,7 +659,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
        filemap_invalidate_lock(inode->i_mapping);
 
        /* Invalidate the page cache, including dirty pages. */
-       error = truncate_bdev_range(bdev, file->f_mode, start, end);
+       error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
        if (error)
                goto fail;
 
@@ -690,7 +700,7 @@ static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
 
 const struct file_operations def_blk_fops = {
        .open           = blkdev_open,
-       .release        = blkdev_close,
+       .release        = blkdev_release,
        .llseek         = blkdev_llseek,
        .read_iter      = blkdev_read_iter,
        .write_iter     = blkdev_write_iter,
@@ -701,7 +711,7 @@ const struct file_operations def_blk_fops = {
 #ifdef CONFIG_COMPAT
        .compat_ioctl   = compat_blkdev_ioctl,
 #endif
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .splice_write   = iter_file_splice_write,
        .fallocate      = blkdev_fallocate,
 };
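
The file_to_blk_mode() helper added above re-derives the block-open mode from the struct file on every call instead of caching FMODE_* bits at open time. A rough standalone sketch of the same mapping, in plain C with stand-in constants (MY_OPEN_* and flags_to_blk_mode() are illustrative placeholders, not the kernel's blk_mode_t/BLK_OPEN_* definitions):

#include <fcntl.h>      /* O_RDONLY, O_WRONLY, O_RDWR, O_NDELAY, O_ACCMODE */

enum {
        MY_OPEN_READ        = 1 << 0,   /* stand-in for BLK_OPEN_READ */
        MY_OPEN_WRITE       = 1 << 1,   /* stand-in for BLK_OPEN_WRITE */
        MY_OPEN_EXCL        = 1 << 2,   /* stand-in for BLK_OPEN_EXCL */
        MY_OPEN_NDELAY      = 1 << 3,   /* stand-in for BLK_OPEN_NDELAY */
        MY_OPEN_WRITE_IOCTL = 1 << 4,   /* stand-in for BLK_OPEN_WRITE_IOCTL */
};

/*
 * Mirror of the open-flag translation: read/write come from the access
 * mode, exclusivity from whether a holder was stored at open time, and the
 * historical O_ACCMODE == 3 floppy quirk maps to the "ioctl-only write" bit.
 */
static unsigned int flags_to_blk_mode(int f_flags, int have_holder)
{
        unsigned int mode = 0;
        int acc = f_flags & O_ACCMODE;

        if (acc == O_RDONLY || acc == O_RDWR)
                mode |= MY_OPEN_READ;
        if (acc == O_WRONLY || acc == O_RDWR)
                mode |= MY_OPEN_WRITE;
        if (have_holder)
                mode |= MY_OPEN_EXCL;
        if (f_flags & O_NDELAY)
                mode |= MY_OPEN_NDELAY;
        if (acc == (O_RDWR | O_WRONLY))
                mode |= MY_OPEN_WRITE_IOCTL;
        return mode;
}

Note how an access mode of O_RDWR | O_WRONLY (the value 3) yields neither the read nor the write bit, only the ioctl-as-writer bit, matching the floppy quirk described in the comment above.
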
index 1cb489b..3d287b3 100644 (file)
@@ -25,8 +25,9 @@
 #include <linux/pm_runtime.h>
 #include <linux/badblocks.h>
 #include <linux/part_stat.h>
-#include "blk-throttle.h"
+#include <linux/blktrace_api.h>
 
+#include "blk-throttle.h"
 #include "blk.h"
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
@@ -253,7 +254,7 @@ int __register_blkdev(unsigned int major, const char *name,
 #ifdef CONFIG_BLOCK_LEGACY_AUTOLOAD
        p->probe = probe;
 #endif
-       strlcpy(p->name, name, sizeof(p->name));
+       strscpy(p->name, name, sizeof(p->name));
        p->next = NULL;
        index = major_to_index(major);
 
@@ -318,18 +319,6 @@ void blk_free_ext_minor(unsigned int minor)
        ida_free(&ext_devt_ida, minor);
 }
 
-static char *bdevt_str(dev_t devt, char *buf)
-{
-       if (MAJOR(devt) <= 0xff && MINOR(devt) <= 0xff) {
-               char tbuf[BDEVT_SIZE];
-               snprintf(tbuf, BDEVT_SIZE, "%02x%02x", MAJOR(devt), MINOR(devt));
-               snprintf(buf, BDEVT_SIZE, "%-9s", tbuf);
-       } else
-               snprintf(buf, BDEVT_SIZE, "%03x:%05x", MAJOR(devt), MINOR(devt));
-
-       return buf;
-}
-
 void disk_uevent(struct gendisk *disk, enum kobject_action action)
 {
        struct block_device *part;
@@ -351,7 +340,7 @@ void disk_uevent(struct gendisk *disk, enum kobject_action action)
 }
 EXPORT_SYMBOL_GPL(disk_uevent);
 
-int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
+int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode)
 {
        struct block_device *bdev;
        int ret = 0;
@@ -369,18 +358,20 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
         * synchronize with other exclusive openers and other partition
         * scanners.
         */
-       if (!(mode & FMODE_EXCL)) {
-               ret = bd_prepare_to_claim(disk->part0, disk_scan_partitions);
+       if (!(mode & BLK_OPEN_EXCL)) {
+               ret = bd_prepare_to_claim(disk->part0, disk_scan_partitions,
+                                         NULL);
                if (ret)
                        return ret;
        }
 
        set_bit(GD_NEED_PART_SCAN, &disk->state);
-       bdev = blkdev_get_by_dev(disk_devt(disk), mode & ~FMODE_EXCL, NULL);
+       bdev = blkdev_get_by_dev(disk_devt(disk), mode & ~BLK_OPEN_EXCL, NULL,
+                                NULL);
        if (IS_ERR(bdev))
                ret =  PTR_ERR(bdev);
        else
-               blkdev_put(bdev, mode & ~FMODE_EXCL);
+               blkdev_put(bdev, NULL);
 
        /*
         * If blkdev_get_by_dev() failed early, GD_NEED_PART_SCAN is still set,
@@ -388,7 +379,7 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
         * create partitions for the underlying disk.
         */
        clear_bit(GD_NEED_PART_SCAN, &disk->state);
-       if (!(mode & FMODE_EXCL))
+       if (!(mode & BLK_OPEN_EXCL))
                bd_abort_claiming(disk->part0, disk_scan_partitions);
        return ret;
 }
@@ -516,7 +507,7 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
 
                bdev_add(disk->part0, ddev->devt);
                if (get_capacity(disk))
-                       disk_scan_partitions(disk, FMODE_READ);
+                       disk_scan_partitions(disk, BLK_OPEN_READ);
 
                /*
                 * Announce the disk and partitions after all partitions are
@@ -563,6 +554,28 @@ out_exit_elevator:
 }
 EXPORT_SYMBOL(device_add_disk);
 
+static void blk_report_disk_dead(struct gendisk *disk)
+{
+       struct block_device *bdev;
+       unsigned long idx;
+
+       rcu_read_lock();
+       xa_for_each(&disk->part_tbl, idx, bdev) {
+               if (!kobject_get_unless_zero(&bdev->bd_device.kobj))
+                       continue;
+               rcu_read_unlock();
+
+               mutex_lock(&bdev->bd_holder_lock);
+               if (bdev->bd_holder_ops && bdev->bd_holder_ops->mark_dead)
+                       bdev->bd_holder_ops->mark_dead(bdev);
+               mutex_unlock(&bdev->bd_holder_lock);
+
+               put_device(&bdev->bd_device);
+               rcu_read_lock();
+       }
+       rcu_read_unlock();
+}
+
 /**
  * blk_mark_disk_dead - mark a disk as dead
  * @disk: disk to mark as dead
@@ -572,13 +585,26 @@ EXPORT_SYMBOL(device_add_disk);
  */
 void blk_mark_disk_dead(struct gendisk *disk)
 {
-       set_bit(GD_DEAD, &disk->state);
-       blk_queue_start_drain(disk->queue);
+       /*
+        * Fail any new I/O.
+        */
+       if (test_and_set_bit(GD_DEAD, &disk->state))
+               return;
+
+       if (test_bit(GD_OWNS_QUEUE, &disk->state))
+               blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
 
        /*
         * Stop buffered writers from dirtying pages that can't be written out.
         */
-       set_capacity_and_notify(disk, 0);
+       set_capacity(disk, 0);
+
+       /*
+        * Prevent new I/O from crossing bio_queue_enter().
+        */
+       blk_queue_start_drain(disk->queue);
+
+       blk_report_disk_dead(disk);
 }
 EXPORT_SYMBOL_GPL(blk_mark_disk_dead);
 
@@ -604,6 +630,8 @@ EXPORT_SYMBOL_GPL(blk_mark_disk_dead);
 void del_gendisk(struct gendisk *disk)
 {
        struct request_queue *q = disk->queue;
+       struct block_device *part;
+       unsigned long idx;
 
        might_sleep();
 
@@ -612,26 +640,27 @@ void del_gendisk(struct gendisk *disk)
 
        disk_del_events(disk);
 
+       /*
+        * Prevent new openers by unlinking the bdev inode, and write out
+        * dirty data before marking the disk dead and stopping all I/O.
+        */
        mutex_lock(&disk->open_mutex);
-       remove_inode_hash(disk->part0->bd_inode);
-       blk_drop_partitions(disk);
+       xa_for_each(&disk->part_tbl, idx, part) {
+               remove_inode_hash(part->bd_inode);
+               fsync_bdev(part);
+               __invalidate_device(part, true);
+       }
        mutex_unlock(&disk->open_mutex);
 
-       fsync_bdev(disk->part0);
-       __invalidate_device(disk->part0, true);
+       blk_mark_disk_dead(disk);
 
        /*
-        * Fail any new I/O.
+        * Drop all partitions now that the disk is marked dead.
         */
-       set_bit(GD_DEAD, &disk->state);
-       if (test_bit(GD_OWNS_QUEUE, &disk->state))
-               blk_queue_flag_set(QUEUE_FLAG_DYING, q);
-       set_capacity(disk, 0);
-
-       /*
-        * Prevent new I/O from crossing bio_queue_enter().
-        */
-       blk_queue_start_drain(q);
+       mutex_lock(&disk->open_mutex);
+       xa_for_each_start(&disk->part_tbl, idx, part, 1)
+               drop_partition(part);
+       mutex_unlock(&disk->open_mutex);
 
        if (!(disk->flags & GENHD_FL_HIDDEN)) {
                sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
@@ -755,57 +784,6 @@ void blk_request_module(dev_t devt)
 }
 #endif /* CONFIG_BLOCK_LEGACY_AUTOLOAD */
 
-/*
- * print a full list of all partitions - intended for places where the root
- * filesystem can't be mounted and thus to give the victim some idea of what
- * went wrong
- */
-void __init printk_all_partitions(void)
-{
-       struct class_dev_iter iter;
-       struct device *dev;
-
-       class_dev_iter_init(&iter, &block_class, NULL, &disk_type);
-       while ((dev = class_dev_iter_next(&iter))) {
-               struct gendisk *disk = dev_to_disk(dev);
-               struct block_device *part;
-               char devt_buf[BDEVT_SIZE];
-               unsigned long idx;
-
-               /*
-                * Don't show empty devices or things that have been
-                * suppressed
-                */
-               if (get_capacity(disk) == 0 || (disk->flags & GENHD_FL_HIDDEN))
-                       continue;
-
-               /*
-                * Note, unlike /proc/partitions, I am showing the numbers in
-                * hex - the same format as the root= option takes.
-                */
-               rcu_read_lock();
-               xa_for_each(&disk->part_tbl, idx, part) {
-                       if (!bdev_nr_sectors(part))
-                               continue;
-                       printk("%s%s %10llu %pg %s",
-                              bdev_is_partition(part) ? "  " : "",
-                              bdevt_str(part->bd_dev, devt_buf),
-                              bdev_nr_sectors(part) >> 1, part,
-                              part->bd_meta_info ?
-                                       part->bd_meta_info->uuid : "");
-                       if (bdev_is_partition(part))
-                               printk("\n");
-                       else if (dev->parent && dev->parent->driver)
-                               printk(" driver: %s\n",
-                                       dev->parent->driver->name);
-                       else
-                               printk(" (driver?)\n");
-               }
-               rcu_read_unlock();
-       }
-       class_dev_iter_exit(&iter);
-}
-
 #ifdef CONFIG_PROC_FS
 /* iterator */
 static void *disk_seqf_start(struct seq_file *seqf, loff_t *pos)
@@ -1171,6 +1149,8 @@ static void disk_release(struct device *dev)
        might_sleep();
        WARN_ON_ONCE(disk_live(disk));
 
+       blk_trace_remove(disk->queue);
+
        /*
         * To undo the all initialization from blk_mq_init_allocated_queue in
         * case of a probe failure where add_disk is never called we have to
@@ -1339,35 +1319,6 @@ dev_t part_devt(struct gendisk *disk, u8 partno)
        return devt;
 }
 
-dev_t blk_lookup_devt(const char *name, int partno)
-{
-       dev_t devt = MKDEV(0, 0);
-       struct class_dev_iter iter;
-       struct device *dev;
-
-       class_dev_iter_init(&iter, &block_class, NULL, &disk_type);
-       while ((dev = class_dev_iter_next(&iter))) {
-               struct gendisk *disk = dev_to_disk(dev);
-
-               if (strcmp(dev_name(dev), name))
-                       continue;
-
-               if (partno < disk->minors) {
-                       /* We need to return the right devno, even
-                        * if the partition doesn't exist yet.
-                        */
-                       devt = MKDEV(MAJOR(dev->devt),
-                                    MINOR(dev->devt) + partno);
-               } else {
-                       devt = part_devt(disk, partno);
-                       if (devt)
-                               break;
-               }
-       }
-       class_dev_iter_exit(&iter);
-       return devt;
-}
-
 struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
                struct lock_class_key *lkclass)
 {
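
blk_report_disk_dead() above walks every block_device of the disk and, under bd_holder_lock, invokes the holder's mark_dead callback. A minimal sketch of how an exclusive holder could opt in to that notification, using only the calls visible in these hunks (the my_fs_* names are hypothetical):

#include <linux/blkdev.h>

static void my_fs_mark_dead(struct block_device *bdev)
{
        /*
         * Called from blk_report_disk_dead() with bd_holder_lock held;
         * fence off the filesystem so no new I/O is issued.
         */
        pr_warn("%pg: underlying disk marked dead\n", bdev);
}

static const struct blk_holder_ops my_fs_holder_ops = {
        .mark_dead = my_fs_mark_dead,
};

static struct block_device *my_fs_open_bdev(dev_t devt, void *holder)
{
        /*
         * The holder pointer makes the open exclusive; passing holder ops
         * as the last argument arms the dead-disk callback shown above.
         */
        return blkdev_get_by_dev(devt, BLK_OPEN_READ | BLK_OPEN_WRITE,
                                 holder, &my_fs_holder_ops);
}
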
index 9c5f637..3be1194 100644 (file)
@@ -82,7 +82,7 @@ static int compat_blkpg_ioctl(struct block_device *bdev,
 }
 #endif
 
-static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
+static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
                unsigned long arg)
 {
        uint64_t range[2];
@@ -90,7 +90,7 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
        struct inode *inode = bdev->bd_inode;
        int err;
 
-       if (!(mode & FMODE_WRITE))
+       if (!(mode & BLK_OPEN_WRITE))
                return -EBADF;
 
        if (!bdev_max_discard_sectors(bdev))
@@ -120,14 +120,14 @@ fail:
        return err;
 }
 
-static int blk_ioctl_secure_erase(struct block_device *bdev, fmode_t mode,
+static int blk_ioctl_secure_erase(struct block_device *bdev, blk_mode_t mode,
                void __user *argp)
 {
        uint64_t start, len;
        uint64_t range[2];
        int err;
 
-       if (!(mode & FMODE_WRITE))
+       if (!(mode & BLK_OPEN_WRITE))
                return -EBADF;
        if (!bdev_max_secure_erase_sectors(bdev))
                return -EOPNOTSUPP;
@@ -151,7 +151,7 @@ static int blk_ioctl_secure_erase(struct block_device *bdev, fmode_t mode,
 }
 
 
-static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
+static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
                unsigned long arg)
 {
        uint64_t range[2];
@@ -159,7 +159,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
        struct inode *inode = bdev->bd_inode;
        int err;
 
-       if (!(mode & FMODE_WRITE))
+       if (!(mode & BLK_OPEN_WRITE))
                return -EBADF;
 
        if (copy_from_user(range, (void __user *)arg, sizeof(range)))
@@ -240,7 +240,7 @@ static int compat_put_ulong(compat_ulong_t __user *argp, compat_ulong_t val)
  * drivers that implement only commands that are completely compatible
  * between 32-bit and 64-bit user space
  */
-int blkdev_compat_ptr_ioctl(struct block_device *bdev, fmode_t mode,
+int blkdev_compat_ptr_ioctl(struct block_device *bdev, blk_mode_t mode,
                        unsigned cmd, unsigned long arg)
 {
        struct gendisk *disk = bdev->bd_disk;
@@ -254,13 +254,28 @@ int blkdev_compat_ptr_ioctl(struct block_device *bdev, fmode_t mode,
 EXPORT_SYMBOL(blkdev_compat_ptr_ioctl);
 #endif
 
-static int blkdev_pr_register(struct block_device *bdev,
+static bool blkdev_pr_allowed(struct block_device *bdev, blk_mode_t mode)
+{
+       /* no sense to make reservations for partitions */
+       if (bdev_is_partition(bdev))
+               return false;
+
+       if (capable(CAP_SYS_ADMIN))
+               return true;
+       /*
+        * Only allow unprivileged reservations if the file descriptor is open
+        * for writing.
+        */
+       return mode & BLK_OPEN_WRITE;
+}
+
+static int blkdev_pr_register(struct block_device *bdev, blk_mode_t mode,
                struct pr_registration __user *arg)
 {
        const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
        struct pr_registration reg;
 
-       if (!capable(CAP_SYS_ADMIN))
+       if (!blkdev_pr_allowed(bdev, mode))
                return -EPERM;
        if (!ops || !ops->pr_register)
                return -EOPNOTSUPP;
@@ -272,13 +287,13 @@ static int blkdev_pr_register(struct block_device *bdev,
        return ops->pr_register(bdev, reg.old_key, reg.new_key, reg.flags);
 }
 
-static int blkdev_pr_reserve(struct block_device *bdev,
+static int blkdev_pr_reserve(struct block_device *bdev, blk_mode_t mode,
                struct pr_reservation __user *arg)
 {
        const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
        struct pr_reservation rsv;
 
-       if (!capable(CAP_SYS_ADMIN))
+       if (!blkdev_pr_allowed(bdev, mode))
                return -EPERM;
        if (!ops || !ops->pr_reserve)
                return -EOPNOTSUPP;
@@ -290,13 +305,13 @@ static int blkdev_pr_reserve(struct block_device *bdev,
        return ops->pr_reserve(bdev, rsv.key, rsv.type, rsv.flags);
 }
 
-static int blkdev_pr_release(struct block_device *bdev,
+static int blkdev_pr_release(struct block_device *bdev, blk_mode_t mode,
                struct pr_reservation __user *arg)
 {
        const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
        struct pr_reservation rsv;
 
-       if (!capable(CAP_SYS_ADMIN))
+       if (!blkdev_pr_allowed(bdev, mode))
                return -EPERM;
        if (!ops || !ops->pr_release)
                return -EOPNOTSUPP;
@@ -308,13 +323,13 @@ static int blkdev_pr_release(struct block_device *bdev,
        return ops->pr_release(bdev, rsv.key, rsv.type);
 }
 
-static int blkdev_pr_preempt(struct block_device *bdev,
+static int blkdev_pr_preempt(struct block_device *bdev, blk_mode_t mode,
                struct pr_preempt __user *arg, bool abort)
 {
        const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
        struct pr_preempt p;
 
-       if (!capable(CAP_SYS_ADMIN))
+       if (!blkdev_pr_allowed(bdev, mode))
                return -EPERM;
        if (!ops || !ops->pr_preempt)
                return -EOPNOTSUPP;
@@ -326,13 +341,13 @@ static int blkdev_pr_preempt(struct block_device *bdev,
        return ops->pr_preempt(bdev, p.old_key, p.new_key, p.type, abort);
 }
 
-static int blkdev_pr_clear(struct block_device *bdev,
+static int blkdev_pr_clear(struct block_device *bdev, blk_mode_t mode,
                struct pr_clear __user *arg)
 {
        const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
        struct pr_clear c;
 
-       if (!capable(CAP_SYS_ADMIN))
+       if (!blkdev_pr_allowed(bdev, mode))
                return -EPERM;
        if (!ops || !ops->pr_clear)
                return -EOPNOTSUPP;
@@ -344,8 +359,8 @@ static int blkdev_pr_clear(struct block_device *bdev,
        return ops->pr_clear(bdev, c.key);
 }
 
-static int blkdev_flushbuf(struct block_device *bdev, fmode_t mode,
-               unsigned cmd, unsigned long arg)
+static int blkdev_flushbuf(struct block_device *bdev, unsigned cmd,
+               unsigned long arg)
 {
        if (!capable(CAP_SYS_ADMIN))
                return -EACCES;
@@ -354,8 +369,8 @@ static int blkdev_flushbuf(struct block_device *bdev, fmode_t mode,
        return 0;
 }
 
-static int blkdev_roset(struct block_device *bdev, fmode_t mode,
-               unsigned cmd, unsigned long arg)
+static int blkdev_roset(struct block_device *bdev, unsigned cmd,
+               unsigned long arg)
 {
        int ret, n;
 
@@ -439,7 +454,7 @@ static int compat_hdio_getgeo(struct block_device *bdev,
 #endif
 
 /* set the logical block size */
-static int blkdev_bszset(struct block_device *bdev, fmode_t mode,
+static int blkdev_bszset(struct block_device *bdev, blk_mode_t mode,
                int __user *argp)
 {
        int ret, n;
@@ -451,13 +466,13 @@ static int blkdev_bszset(struct block_device *bdev, fmode_t mode,
        if (get_user(n, argp))
                return -EFAULT;
 
-       if (mode & FMODE_EXCL)
+       if (mode & BLK_OPEN_EXCL)
                return set_blocksize(bdev, n);
 
-       if (IS_ERR(blkdev_get_by_dev(bdev->bd_dev, mode | FMODE_EXCL, &bdev)))
+       if (IS_ERR(blkdev_get_by_dev(bdev->bd_dev, mode, &bdev, NULL)))
                return -EBUSY;
        ret = set_blocksize(bdev, n);
-       blkdev_put(bdev, mode | FMODE_EXCL);
+       blkdev_put(bdev, &bdev);
 
        return ret;
 }
@@ -467,7 +482,7 @@ static int blkdev_bszset(struct block_device *bdev, fmode_t mode,
  * user space. Note the separate arg/argp parameters that are needed
  * to deal with the compat_ptr() conversion.
  */
-static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
+static int blkdev_common_ioctl(struct block_device *bdev, blk_mode_t mode,
                               unsigned int cmd, unsigned long arg,
                               void __user *argp)
 {
@@ -475,9 +490,9 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
 
        switch (cmd) {
        case BLKFLSBUF:
-               return blkdev_flushbuf(bdev, mode, cmd, arg);
+               return blkdev_flushbuf(bdev, cmd, arg);
        case BLKROSET:
-               return blkdev_roset(bdev, mode, cmd, arg);
+               return blkdev_roset(bdev, cmd, arg);
        case BLKDISCARD:
                return blk_ioctl_discard(bdev, mode, arg);
        case BLKSECDISCARD:
@@ -487,7 +502,7 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
        case BLKGETDISKSEQ:
                return put_u64(argp, bdev->bd_disk->diskseq);
        case BLKREPORTZONE:
-               return blkdev_report_zones_ioctl(bdev, mode, cmd, arg);
+               return blkdev_report_zones_ioctl(bdev, cmd, arg);
        case BLKRESETZONE:
        case BLKOPENZONE:
        case BLKCLOSEZONE:
@@ -534,17 +549,17 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
        case BLKTRACETEARDOWN:
                return blk_trace_ioctl(bdev, cmd, argp);
        case IOC_PR_REGISTER:
-               return blkdev_pr_register(bdev, argp);
+               return blkdev_pr_register(bdev, mode, argp);
        case IOC_PR_RESERVE:
-               return blkdev_pr_reserve(bdev, argp);
+               return blkdev_pr_reserve(bdev, mode, argp);
        case IOC_PR_RELEASE:
-               return blkdev_pr_release(bdev, argp);
+               return blkdev_pr_release(bdev, mode, argp);
        case IOC_PR_PREEMPT:
-               return blkdev_pr_preempt(bdev, argp, false);
+               return blkdev_pr_preempt(bdev, mode, argp, false);
        case IOC_PR_PREEMPT_ABORT:
-               return blkdev_pr_preempt(bdev, argp, true);
+               return blkdev_pr_preempt(bdev, mode, argp, true);
        case IOC_PR_CLEAR:
-               return blkdev_pr_clear(bdev, argp);
+               return blkdev_pr_clear(bdev, mode, argp);
        default:
                return -ENOIOCTLCMD;
        }
@@ -560,18 +575,9 @@ long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 {
        struct block_device *bdev = I_BDEV(file->f_mapping->host);
        void __user *argp = (void __user *)arg;
-       fmode_t mode = file->f_mode;
+       blk_mode_t mode = file_to_blk_mode(file);
        int ret;
 
-       /*
-        * O_NDELAY can be altered using fcntl(.., F_SETFL, ..), so we have
-        * to updated it before every ioctl.
-        */
-       if (file->f_flags & O_NDELAY)
-               mode |= FMODE_NDELAY;
-       else
-               mode &= ~FMODE_NDELAY;
-
        switch (cmd) {
        /* These need separate implementations for the data structure */
        case HDIO_GETGEO:
@@ -630,16 +636,7 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
        void __user *argp = compat_ptr(arg);
        struct block_device *bdev = I_BDEV(file->f_mapping->host);
        struct gendisk *disk = bdev->bd_disk;
-       fmode_t mode = file->f_mode;
-
-       /*
-        * O_NDELAY can be altered using fcntl(.., F_SETFL, ..), so we have
-        * to updated it before every ioctl.
-        */
-       if (file->f_flags & O_NDELAY)
-               mode |= FMODE_NDELAY;
-       else
-               mode &= ~FMODE_NDELAY;
+       blk_mode_t mode = file_to_blk_mode(file);
 
        switch (cmd) {
        /* These need separate implementations for the data structure */
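
With blkdev_pr_allowed() above, the persistent-reservation ioctls no longer require CAP_SYS_ADMIN as long as the caller holds a writable file descriptor on the whole device. A hedged userspace sketch of registering a reservation key under that relaxed check (the device path and key value are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/pr.h>

int main(void)
{
        struct pr_registration reg = {
                .old_key = 0,                   /* no existing registration */
                .new_key = 0xabcd1234,          /* key to register */
                .flags   = 0,
        };
        int fd = open("/dev/sdX", O_RDWR);      /* illustrative path */

        if (fd < 0)
                return 1;
        /* Permitted without CAP_SYS_ADMIN now that the fd is writable. */
        if (ioctl(fd, IOC_PR_REGISTER, &reg) < 0)
                perror("IOC_PR_REGISTER");
        close(fd);
        return 0;
}
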
index 5839a02..6aa5daf 100644 (file)
@@ -74,8 +74,8 @@ struct dd_per_prio {
        struct list_head dispatch;
        struct rb_root sort_list[DD_DIR_COUNT];
        struct list_head fifo_list[DD_DIR_COUNT];
-       /* Next request in FIFO order. Read, write or both are NULL. */
-       struct request *next_rq[DD_DIR_COUNT];
+       /* Position of the most recently dispatched request. */
+       sector_t latest_pos[DD_DIR_COUNT];
        struct io_stats_per_prio stats;
 };
 
@@ -156,6 +156,40 @@ deadline_latter_request(struct request *rq)
        return NULL;
 }
 
+/*
+ * Return the first request for which blk_rq_pos() >= @pos. For zoned devices,
+ * return the first request after the start of the zone containing @pos.
+ */
+static inline struct request *deadline_from_pos(struct dd_per_prio *per_prio,
+                               enum dd_data_dir data_dir, sector_t pos)
+{
+       struct rb_node *node = per_prio->sort_list[data_dir].rb_node;
+       struct request *rq, *res = NULL;
+
+       if (!node)
+               return NULL;
+
+       rq = rb_entry_rq(node);
+       /*
+        * A zoned write may have been requeued with a starting position that
+        * is below that of the most recently dispatched request. Hence, for
+        * zoned writes, start searching from the start of a zone.
+        */
+       if (blk_rq_is_seq_zoned_write(rq))
+               pos -= round_down(pos, rq->q->limits.chunk_sectors);
+
+       while (node) {
+               rq = rb_entry_rq(node);
+               if (blk_rq_pos(rq) >= pos) {
+                       res = rq;
+                       node = node->rb_left;
+               } else {
+                       node = node->rb_right;
+               }
+       }
+       return res;
+}
+
 static void
 deadline_add_rq_rb(struct dd_per_prio *per_prio, struct request *rq)
 {
@@ -167,11 +201,6 @@ deadline_add_rq_rb(struct dd_per_prio *per_prio, struct request *rq)
 static inline void
 deadline_del_rq_rb(struct dd_per_prio *per_prio, struct request *rq)
 {
-       const enum dd_data_dir data_dir = rq_data_dir(rq);
-
-       if (per_prio->next_rq[data_dir] == rq)
-               per_prio->next_rq[data_dir] = deadline_latter_request(rq);
-
        elv_rb_del(deadline_rb_root(per_prio, rq), rq);
 }
 
@@ -251,10 +280,6 @@ static void
 deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
                      struct request *rq)
 {
-       const enum dd_data_dir data_dir = rq_data_dir(rq);
-
-       per_prio->next_rq[data_dir] = deadline_latter_request(rq);
-
        /*
         * take it off the sort and fifo list
         */
@@ -272,21 +297,15 @@ static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
 }
 
 /*
- * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
- * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
+ * deadline_check_fifo returns true if and only if there are expired requests
+ * in the FIFO list. Requires !list_empty(&dd->fifo_list[data_dir]).
  */
-static inline int deadline_check_fifo(struct dd_per_prio *per_prio,
-                                     enum dd_data_dir data_dir)
+static inline bool deadline_check_fifo(struct dd_per_prio *per_prio,
+                                      enum dd_data_dir data_dir)
 {
        struct request *rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
 
-       /*
-        * rq is expired!
-        */
-       if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
-               return 1;
-
-       return 0;
+       return time_is_before_eq_jiffies((unsigned long)rq->fifo_time);
 }
 
 /*
@@ -310,14 +329,11 @@ static struct request *deadline_skip_seq_writes(struct deadline_data *dd,
                                                struct request *rq)
 {
        sector_t pos = blk_rq_pos(rq);
-       sector_t skipped_sectors = 0;
 
-       while (rq) {
-               if (blk_rq_pos(rq) != pos + skipped_sectors)
-                       break;
-               skipped_sectors += blk_rq_sectors(rq);
+       do {
+               pos += blk_rq_sectors(rq);
                rq = deadline_latter_request(rq);
-       }
+       } while (rq && blk_rq_pos(rq) == pos);
 
        return rq;
 }
@@ -330,7 +346,7 @@ static struct request *
 deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
                      enum dd_data_dir data_dir)
 {
-       struct request *rq;
+       struct request *rq, *rb_rq, *next;
        unsigned long flags;
 
        if (list_empty(&per_prio->fifo_list[data_dir]))
@@ -348,7 +364,12 @@ deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
         * zones and these zones are unlocked.
         */
        spin_lock_irqsave(&dd->zone_lock, flags);
-       list_for_each_entry(rq, &per_prio->fifo_list[DD_WRITE], queuelist) {
+       list_for_each_entry_safe(rq, next, &per_prio->fifo_list[DD_WRITE],
+                                queuelist) {
+               /* Check whether a prior request exists for the same zone. */
+               rb_rq = deadline_from_pos(per_prio, data_dir, blk_rq_pos(rq));
+               if (rb_rq && blk_rq_pos(rb_rq) < blk_rq_pos(rq))
+                       rq = rb_rq;
                if (blk_req_can_dispatch_to_zone(rq) &&
                    (blk_queue_nonrot(rq->q) ||
                     !deadline_is_seq_write(dd, rq)))
@@ -372,7 +393,8 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
        struct request *rq;
        unsigned long flags;
 
-       rq = per_prio->next_rq[data_dir];
+       rq = deadline_from_pos(per_prio, data_dir,
+                              per_prio->latest_pos[data_dir]);
        if (!rq)
                return NULL;
 
@@ -435,6 +457,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
                if (started_after(dd, rq, latest_start))
                        return NULL;
                list_del_init(&rq->queuelist);
+               data_dir = rq_data_dir(rq);
                goto done;
        }
 
@@ -442,9 +465,11 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
         * batches are currently reads XOR writes
         */
        rq = deadline_next_request(dd, per_prio, dd->last_dir);
-       if (rq && dd->batching < dd->fifo_batch)
-               /* we have a next request are still entitled to batch */
+       if (rq && dd->batching < dd->fifo_batch) {
+               /* we have a next request and are still entitled to batch */
+               data_dir = rq_data_dir(rq);
                goto dispatch_request;
+       }
 
        /*
         * at this point we are not running a batch. select the appropriate
@@ -522,6 +547,7 @@ dispatch_request:
 done:
        ioprio_class = dd_rq_ioclass(rq);
        prio = ioprio_class_to_prio[ioprio_class];
+       dd->per_prio[prio].latest_pos[data_dir] = blk_rq_pos(rq);
        dd->per_prio[prio].stats.dispatched++;
        /*
         * If the request needs its target zone locked, do it.
@@ -766,7 +792,7 @@ static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
  * add rq to rbtree and fifo
  */
 static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
-                             blk_insert_t flags)
+                             blk_insert_t flags, struct list_head *free)
 {
        struct request_queue *q = hctx->queue;
        struct deadline_data *dd = q->elevator->elevator_data;
@@ -775,7 +801,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
        u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
        struct dd_per_prio *per_prio;
        enum dd_prio prio;
-       LIST_HEAD(free);
 
        lockdep_assert_held(&dd->lock);
 
@@ -792,10 +817,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
                rq->elv.priv[0] = (void *)(uintptr_t)1;
        }
 
-       if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
-               blk_mq_free_requests(&free);
+       if (blk_mq_sched_try_insert_merge(q, rq, free))
                return;
-       }
 
        trace_block_rq_insert(rq);
 
@@ -803,6 +826,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
                list_add(&rq->queuelist, &per_prio->dispatch);
                rq->fifo_time = jiffies;
        } else {
+               struct list_head *insert_before;
+
                deadline_add_rq_rb(per_prio, rq);
 
                if (rq_mergeable(rq)) {
@@ -815,7 +840,20 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
                 * set expire time and add to fifo list
                 */
                rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
-               list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
+               insert_before = &per_prio->fifo_list[data_dir];
+#ifdef CONFIG_BLK_DEV_ZONED
+               /*
+                * Insert zoned writes such that requests are sorted by
+                * position per zone.
+                */
+               if (blk_rq_is_seq_zoned_write(rq)) {
+                       struct request *rq2 = deadline_latter_request(rq);
+
+                       if (rq2 && blk_rq_zone_no(rq2) == blk_rq_zone_no(rq))
+                               insert_before = &rq2->queuelist;
+               }
+#endif
+               list_add_tail(&rq->queuelist, insert_before);
        }
 }
 
@@ -828,6 +866,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 {
        struct request_queue *q = hctx->queue;
        struct deadline_data *dd = q->elevator->elevator_data;
+       LIST_HEAD(free);
 
        spin_lock(&dd->lock);
        while (!list_empty(list)) {
@@ -835,9 +874,11 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 
                rq = list_first_entry(list, struct request, queuelist);
                list_del_init(&rq->queuelist);
-               dd_insert_request(hctx, rq, flags);
+               dd_insert_request(hctx, rq, flags, &free);
        }
        spin_unlock(&dd->lock);
+
+       blk_mq_free_requests(&free);
 }
 
 /* Callback from inside blk_mq_rq_ctx_init(). */
@@ -1035,8 +1076,10 @@ static int deadline_##name##_next_rq_show(void *data,                    \
        struct request_queue *q = data;                                 \
        struct deadline_data *dd = q->elevator->elevator_data;          \
        struct dd_per_prio *per_prio = &dd->per_prio[prio];             \
-       struct request *rq = per_prio->next_rq[data_dir];               \
+       struct request *rq;                                             \
                                                                        \
+       rq = deadline_from_pos(per_prio, data_dir,                      \
+                              per_prio->latest_pos[data_dir]);         \
        if (rq)                                                         \
                __blk_mq_debugfs_rq_show(m, rq);                        \
        return 0;                                                       \
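
deadline_from_pos() above replaces the cached next_rq pointer with an on-demand lookup for the first request at or after the last dispatched position; the rbtree descent is a standard lower-bound search. The same logic over a sorted array, as a small self-contained sketch (lower_bound() is a generic helper, not kernel code):

#include <stddef.h>

/* Return the index of the first element >= pos, or n if there is none. */
static size_t lower_bound(const unsigned long long *sorted, size_t n,
                          unsigned long long pos)
{
        size_t lo = 0, hi = n, res = n;

        while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;

                if (sorted[mid] >= pos) {
                        res = mid;      /* candidate; keep searching left */
                        hi = mid;
                } else {
                        lo = mid + 1;   /* everything here is too small */
                }
        }
        return res;
}
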
index 5c8624e..ed222b9 100644 (file)
 #define pr_fmt(fmt) fmt
 
 #include <linux/types.h>
+#include <linux/mm_types.h>
+#include <linux/overflow.h>
 #include <linux/affs_hardblocks.h>
 
 #include "check.h"
 
+/* magic offsets in partition DosEnvVec */
+#define NR_HD  3
+#define NR_SECT        5
+#define LO_CYL 9
+#define HI_CYL 10
+
 static __inline__ u32
 checksum_block(__be32 *m, int size)
 {
@@ -31,8 +39,12 @@ int amiga_partition(struct parsed_partitions *state)
        unsigned char *data;
        struct RigidDiskBlock *rdb;
        struct PartitionBlock *pb;
-       int start_sect, nr_sects, blk, part, res = 0;
-       int blksize = 1;        /* Multiplier for disk block size */
+       u64 start_sect, nr_sects;
+       sector_t blk, end_sect;
+       u32 cylblk;             /* rdb_CylBlocks = nr_heads*sect_per_track */
+       u32 nr_hd, nr_sect, lo_cyl, hi_cyl;
+       int part, res = 0;
+       unsigned int blksize = 1;       /* Multiplier for disk block size */
        int slot = 1;
 
        for (blk = 0; ; blk++, put_dev_sector(sect)) {
@@ -40,7 +52,7 @@ int amiga_partition(struct parsed_partitions *state)
                        goto rdb_done;
                data = read_part_sector(state, blk, &sect);
                if (!data) {
-                       pr_err("Dev %s: unable to read RDB block %d\n",
+                       pr_err("Dev %s: unable to read RDB block %llu\n",
                               state->disk->disk_name, blk);
                        res = -1;
                        goto rdb_done;
@@ -57,12 +69,12 @@ int amiga_partition(struct parsed_partitions *state)
                *(__be32 *)(data+0xdc) = 0;
                if (checksum_block((__be32 *)data,
                                be32_to_cpu(rdb->rdb_SummedLongs) & 0x7F)==0) {
-                       pr_err("Trashed word at 0xd0 in block %d ignored in checksum calculation\n",
+                       pr_err("Trashed word at 0xd0 in block %llu ignored in checksum calculation\n",
                               blk);
                        break;
                }
 
-               pr_err("Dev %s: RDB in block %d has bad checksum\n",
+               pr_err("Dev %s: RDB in block %llu has bad checksum\n",
                       state->disk->disk_name, blk);
        }
 
@@ -79,10 +91,15 @@ int amiga_partition(struct parsed_partitions *state)
        blk = be32_to_cpu(rdb->rdb_PartitionList);
        put_dev_sector(sect);
        for (part = 1; blk>0 && part<=16; part++, put_dev_sector(sect)) {
-               blk *= blksize; /* Read in terms partition table understands */
+               /* Read in terms the partition table understands */
+               if (check_mul_overflow(blk, (sector_t) blksize, &blk)) {
+                       pr_err("Dev %s: overflow calculating partition block %llu! Skipping partitions %u and beyond\n",
+                               state->disk->disk_name, blk, part);
+                       break;
+               }
                data = read_part_sector(state, blk, &sect);
                if (!data) {
-                       pr_err("Dev %s: unable to read partition block %d\n",
+                       pr_err("Dev %s: unable to read partition block %llu\n",
                               state->disk->disk_name, blk);
                        res = -1;
                        goto rdb_done;
@@ -94,19 +111,70 @@ int amiga_partition(struct parsed_partitions *state)
                if (checksum_block((__be32 *)pb, be32_to_cpu(pb->pb_SummedLongs) & 0x7F) != 0 )
                        continue;
 
-               /* Tell Kernel about it */
+               /* RDB gives us more than enough rope to hang ourselves with,
+                * many times over (2^128 bytes if all fields max out).
+                * Some careful checks are in order, so check for potential
+                * overflows.
+                * We are multiplying four 32 bit numbers into one sector_t!
+                */
+
+               nr_hd   = be32_to_cpu(pb->pb_Environment[NR_HD]);
+               nr_sect = be32_to_cpu(pb->pb_Environment[NR_SECT]);
+
+               /* CylBlocks is total number of blocks per cylinder */
+               if (check_mul_overflow(nr_hd, nr_sect, &cylblk)) {
+                       pr_err("Dev %s: heads*sects %u overflows u32, skipping partition!\n",
+                               state->disk->disk_name, cylblk);
+                       continue;
+               }
+
+               /* check for consistency with RDB defined CylBlocks */
+               if (cylblk > be32_to_cpu(rdb->rdb_CylBlocks)) {
+                       pr_warn("Dev %s: cylblk %u > rdb_CylBlocks %u!\n",
+                               state->disk->disk_name, cylblk,
+                               be32_to_cpu(rdb->rdb_CylBlocks));
+               }
+
+               /* RDB allows for variable logical block size -
+                * normalize to 512 byte blocks and check result.
+                */
+
+               if (check_mul_overflow(cylblk, blksize, &cylblk)) {
+                       pr_err("Dev %s: partition %u bytes per cyl. overflows u32, skipping partition!\n",
+                               state->disk->disk_name, part);
+                       continue;
+               }
+
+               /* Calculate partition start and end. Limit of 32 bit on cylblk
+                * guarantees no overflow occurs if LBD support is enabled.
+                */
+
+               lo_cyl = be32_to_cpu(pb->pb_Environment[LO_CYL]);
+               start_sect = ((u64) lo_cyl * cylblk);
+
+               hi_cyl = be32_to_cpu(pb->pb_Environment[HI_CYL]);
+               nr_sects = (((u64) hi_cyl - lo_cyl + 1) * cylblk);
 
-               nr_sects = (be32_to_cpu(pb->pb_Environment[10]) + 1 -
-                           be32_to_cpu(pb->pb_Environment[9])) *
-                          be32_to_cpu(pb->pb_Environment[3]) *
-                          be32_to_cpu(pb->pb_Environment[5]) *
-                          blksize;
                if (!nr_sects)
                        continue;
-               start_sect = be32_to_cpu(pb->pb_Environment[9]) *
-                            be32_to_cpu(pb->pb_Environment[3]) *
-                            be32_to_cpu(pb->pb_Environment[5]) *
-                            blksize;
+
+               /* Warn user if partition end overflows u32 (AmigaDOS limit) */
+
+               if ((start_sect + nr_sects) > UINT_MAX) {
+                       pr_warn("Dev %s: partition %u (%llu-%llu) needs 64 bit device support!\n",
+                               state->disk->disk_name, part,
+                               start_sect, start_sect + nr_sects);
+               }
+
+               if (check_add_overflow(start_sect, nr_sects, &end_sect)) {
+                       pr_err("Dev %s: partition %u (%llu-%llu) needs LBD device support, skipping partition!\n",
+                               state->disk->disk_name, part,
+                               start_sect, end_sect);
+                       continue;
+               }
+
+               /* Tell Kernel about it */
+
                put_partition(state,slot++,start_sect,nr_sects);
                {
                        /* Be even more informative to aid mounting */
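
The guards above rely on check_mul_overflow()/check_add_overflow() from <linux/overflow.h>, thin wrappers around the compiler overflow builtins. A userspace sketch of the same guarded geometry arithmetic, with illustrative RDB field values:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t nr_hd = 16, nr_sect = 63, blksize = 1, lo_cyl = 2,
                 hi_cyl = 1023, cylblk;
        uint64_t start, len, end;

        /* heads * sectors, then scaled to 512-byte blocks, must fit in u32 */
        if (__builtin_mul_overflow(nr_hd, nr_sect, &cylblk) ||
            __builtin_mul_overflow(cylblk, blksize, &cylblk))
                return 1;

        start = (uint64_t)lo_cyl * cylblk;
        len   = ((uint64_t)hi_cyl - lo_cyl + 1) * cylblk;

        /* the partition end must still fit in 64 bits */
        if (__builtin_add_overflow(start, len, &end))
                return 1;

        printf("partition: start %llu, %llu sectors\n",
               (unsigned long long)start, (unsigned long long)len);
        return 0;
}
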
index 49e0496..13a7341 100644 (file)
@@ -12,7 +12,7 @@
 #include <linux/raid/detect.h>
 #include "check.h"
 
-static int (*check_part[])(struct parsed_partitions *) = {
+static int (*const check_part[])(struct parsed_partitions *) = {
        /*
         * Probe partition formats with tables at disk address 0
         * that also have an ADFS boot block at 0xdc0.
@@ -228,7 +228,7 @@ static struct attribute *part_attrs[] = {
        NULL
 };
 
-static struct attribute_group part_attr_group = {
+static const struct attribute_group part_attr_group = {
        .attrs = part_attrs,
 };
 
@@ -256,31 +256,36 @@ static int part_uevent(const struct device *dev, struct kobj_uevent_env *env)
        return 0;
 }
 
-struct device_type part_type = {
+const struct device_type part_type = {
        .name           = "partition",
        .groups         = part_attr_groups,
        .release        = part_release,
        .uevent         = part_uevent,
 };
 
-static void delete_partition(struct block_device *part)
+void drop_partition(struct block_device *part)
 {
        lockdep_assert_held(&part->bd_disk->open_mutex);
 
-       fsync_bdev(part);
-       __invalidate_device(part, true);
-
        xa_erase(&part->bd_disk->part_tbl, part->bd_partno);
        kobject_put(part->bd_holder_dir);
+
        device_del(&part->bd_device);
+       put_device(&part->bd_device);
+}
 
+static void delete_partition(struct block_device *part)
+{
        /*
         * Remove the block device from the inode hash, so that it cannot be
         * looked up any more even when openers still hold references.
         */
        remove_inode_hash(part->bd_inode);
 
-       put_device(&part->bd_device);
+       fsync_bdev(part);
+       __invalidate_device(part, true);
+
+       drop_partition(part);
 }
 
 static ssize_t whole_disk_show(struct device *dev,
@@ -288,7 +293,7 @@ static ssize_t whole_disk_show(struct device *dev,
 {
        return 0;
 }
-static DEVICE_ATTR(whole_disk, 0444, whole_disk_show, NULL);
+static const DEVICE_ATTR(whole_disk, 0444, whole_disk_show, NULL);
 
 /*
  * Must be called either with open_mutex held, before a disk can be opened or
@@ -436,10 +441,21 @@ static bool partition_overlaps(struct gendisk *disk, sector_t start,
 int bdev_add_partition(struct gendisk *disk, int partno, sector_t start,
                sector_t length)
 {
+       sector_t capacity = get_capacity(disk), end;
        struct block_device *part;
        int ret;
 
        mutex_lock(&disk->open_mutex);
+       if (check_add_overflow(start, length, &end)) {
+               ret = -EINVAL;
+               goto out;
+       }
+
+       if (start >= capacity || end > capacity) {
+               ret = -EINVAL;
+               goto out;
+       }
+
        if (!disk_live(disk)) {
                ret = -ENXIO;
                goto out;
@@ -519,17 +535,6 @@ static bool disk_unlock_native_capacity(struct gendisk *disk)
        return true;
 }
 
-void blk_drop_partitions(struct gendisk *disk)
-{
-       struct block_device *part;
-       unsigned long idx;
-
-       lockdep_assert_held(&disk->open_mutex);
-
-       xa_for_each_start(&disk->part_tbl, idx, part, 1)
-               delete_partition(part);
-}
-
 static bool blk_add_partition(struct gendisk *disk,
                struct parsed_partitions *state, int p)
 {
@@ -646,6 +651,8 @@ out_free_state:
 
 int bdev_disk_changed(struct gendisk *disk, bool invalidate)
 {
+       struct block_device *part;
+       unsigned long idx;
        int ret = 0;
 
        lockdep_assert_held(&disk->open_mutex);
@@ -658,8 +665,9 @@ rescan:
                return -EBUSY;
        sync_blockdev(disk->part0);
        invalidate_bdev(disk->part0);
-       blk_drop_partitions(disk);
 
+       xa_for_each_start(&disk->part_tbl, idx, part, 1)
+               delete_partition(part);
        clear_bit(GD_NEED_PART_SCAN, &disk->state);
 
        /*
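
The new check in bdev_add_partition() rejects a requested partition whose start plus length wraps around or extends beyond the disk capacity. The same validation as a small userspace-style helper (part_range_ok() is a hypothetical name, shown only to illustrate the check):

#include <stdbool.h>
#include <stdint.h>

static bool part_range_ok(uint64_t start, uint64_t length, uint64_t capacity)
{
        uint64_t end;

        if (__builtin_add_overflow(start, length, &end))
                return false;           /* start + length wraps */
        return start < capacity && end <= capacity;
}
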
index e42c1f9..e9a1cb7 100644 (file)
@@ -23,6 +23,7 @@
 #include <linux/wait.h>
 #include <drm/drm_file.h>
 #include <drm/drm_gem.h>
+#include <drm/drm_prime.h>
 #include <drm/drm_print.h>
 #include <uapi/drm/qaic_accel.h>
 
@@ -616,8 +617,7 @@ static void qaic_free_object(struct drm_gem_object *obj)
 
        if (obj->import_attach) {
                /* DMABUF/PRIME Path */
-               dma_buf_detach(obj->import_attach->dmabuf, obj->import_attach);
-               dma_buf_put(obj->import_attach->dmabuf);
+               drm_prime_gem_destroy(obj, NULL);
        } else {
                /* Private buffer allocation path */
                qaic_free_sgt(bo->sgt);
index 19aff80..8d51269 100644 (file)
@@ -9,8 +9,6 @@
 #include <linux/idr.h>
 #include <linux/io.h>
 
-#include <linux/arm-smccc.h>
-
 static struct acpi_ffh_info ffh_ctx;
 
 int __weak acpi_ffh_address_space_arch_setup(void *handler_ctxt,
index 77186f0..539e700 100644 (file)
@@ -201,11 +201,19 @@ static void byt_i2c_setup(struct lpss_private_data *pdata)
        writel(0, pdata->mmio_base + LPSS_I2C_ENABLE);
 }
 
-/* BSW PWM used for backlight control by the i915 driver */
+/*
+ * BSW PWM1 is used for backlight control by the i915 driver
+ * BSW PWM2 is used for backlight control for fixed (etched into the glass)
+ * touch controls on some models. These touch-controls have specialized
+ * drivers which know they need the "pwm_soc_lpss_2" con-id.
+ */
 static struct pwm_lookup bsw_pwm_lookup[] = {
        PWM_LOOKUP_WITH_MODULE("80862288:00", 0, "0000:00:02.0",
                               "pwm_soc_backlight", 0, PWM_POLARITY_NORMAL,
                               "pwm-lpss-platform"),
+       PWM_LOOKUP_WITH_MODULE("80862289:00", 0, NULL,
+                              "pwm_soc_lpss_2", 0, PWM_POLARITY_NORMAL,
+                              "pwm-lpss-platform"),
 };
 
 static void bsw_pwm_setup(struct lpss_private_data *pdata)
index 02f1a1b..7a453c5 100644 (file)
@@ -66,6 +66,7 @@ static void power_saving_mwait_init(void)
        case X86_VENDOR_AMD:
        case X86_VENDOR_INTEL:
        case X86_VENDOR_ZHAOXIN:
+       case X86_VENDOR_CENTAUR:
                /*
                 * AMD Fam10h TSC will tick in all
                 * C/P/S0/S1 states when this bit is set.
index 7514e38..5427e49 100644 (file)
@@ -34,7 +34,7 @@
 #define ACPI_BERT_PRINT_MAX_RECORDS 5
 #define ACPI_BERT_PRINT_MAX_LEN 1024
 
-static int bert_disable;
+static int bert_disable __initdata;
 
 /*
  * Print "all" the error records in the BERT table, but avoid huge spam to
index 34ad071..ef59d6e 100644 (file)
@@ -152,7 +152,6 @@ struct ghes_vendor_record_entry {
 };
 
 static struct gen_pool *ghes_estatus_pool;
-static unsigned long ghes_estatus_pool_size_request;
 
 static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
 static atomic_t ghes_estatus_cache_alloced;
@@ -191,7 +190,6 @@ int ghes_estatus_pool_init(unsigned int num_ghes)
        len = GHES_ESTATUS_CACHE_AVG_SIZE * GHES_ESTATUS_CACHE_ALLOCED_MAX;
        len += (num_ghes * GHES_ESOURCE_PREALLOC_MAX_SIZE);
 
-       ghes_estatus_pool_size_request = PAGE_ALIGN(len);
        addr = (unsigned long)vmalloc(PAGE_ALIGN(len));
        if (!addr)
                goto err_pool_alloc;
@@ -1544,6 +1542,8 @@ struct list_head *ghes_get_devices(void)
 
                        pr_warn_once("Force-loading ghes_edac on an unsupported platform. You're on your own!\n");
                }
+       } else if (list_empty(&ghes_devs)) {
+               return NULL;
        }
 
        return &ghes_devs;
index e21a9e8..f81fe24 100644 (file)
@@ -3,4 +3,4 @@ obj-$(CONFIG_ACPI_AGDI)         += agdi.o
 obj-$(CONFIG_ACPI_IORT)        += iort.o
 obj-$(CONFIG_ACPI_GTDT)        += gtdt.o
 obj-$(CONFIG_ACPI_APMT)        += apmt.o
-obj-y                          += dma.o
+obj-y                          += dma.o init.o
index f605302..8b3c7d4 100644 (file)
@@ -9,11 +9,11 @@
 #define pr_fmt(fmt) "ACPI: AGDI: " fmt
 
 #include <linux/acpi.h>
-#include <linux/acpi_agdi.h>
 #include <linux/arm_sdei.h>
 #include <linux/io.h>
 #include <linux/kernel.h>
 #include <linux/platform_device.h>
+#include "init.h"
 
 struct agdi_data {
        int sdei_event;
index 8cab69f..bb010f6 100644 (file)
 #define pr_fmt(fmt)    "ACPI: APMT: " fmt
 
 #include <linux/acpi.h>
-#include <linux/acpi_apmt.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/platform_device.h>
+#include "init.h"
 
 #define DEV_NAME "arm-cs-arch-pmu"
 
@@ -35,11 +35,13 @@ static int __init apmt_init_resources(struct resource *res,
 
        num_res++;
 
-       res[num_res].start = node->base_address1;
-       res[num_res].end = node->base_address1 + SZ_4K - 1;
-       res[num_res].flags = IORESOURCE_MEM;
+       if (node->flags & ACPI_APMT_FLAGS_DUAL_PAGE) {
+               res[num_res].start = node->base_address1;
+               res[num_res].end = node->base_address1 + SZ_4K - 1;
+               res[num_res].flags = IORESOURCE_MEM;
 
-       num_res++;
+               num_res++;
+       }
 
        if (node->ovflw_irq != 0) {
                trigger = (node->ovflw_irq_flags & ACPI_APMT_OVFLW_IRQ_FLAGS_MODE);
diff --git a/drivers/acpi/arm64/init.c b/drivers/acpi/arm64/init.c
new file mode 100644 (file)
index 0000000..d3ce53d
--- /dev/null
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/acpi.h>
+#include "init.h"
+
+void __init acpi_arm_init(void)
+{
+       if (IS_ENABLED(CONFIG_ACPI_AGDI))
+               acpi_agdi_init();
+       if (IS_ENABLED(CONFIG_ACPI_APMT))
+               acpi_apmt_init();
+       if (IS_ENABLED(CONFIG_ACPI_IORT))
+               acpi_iort_init();
+}
diff --git a/drivers/acpi/arm64/init.h b/drivers/acpi/arm64/init.h
new file mode 100644 (file)
index 0000000..a1715a2
--- /dev/null
@@ -0,0 +1,6 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#include <linux/init.h>
+
+void __init acpi_agdi_init(void);
+void __init acpi_apmt_init(void);
+void __init acpi_iort_init(void);
index 38fb849..3631230 100644 (file)
@@ -19,6 +19,7 @@
 #include <linux/platform_device.h>
 #include <linux/slab.h>
 #include <linux/dma-map-ops.h>
+#include "init.h"
 
 #define IORT_TYPE_MASK(type)   (1 << (type))
 #define IORT_MSI_TYPE          (1 << ACPI_IORT_NODE_ITS_GROUP)
index d161ff7..e3e0bd0 100644 (file)
@@ -26,9 +26,6 @@
 #include <asm/mpspec.h>
 #include <linux/dmi.h>
 #endif
-#include <linux/acpi_agdi.h>
-#include <linux/acpi_apmt.h>
-#include <linux/acpi_iort.h>
 #include <linux/acpi_viot.h>
 #include <linux/pci.h>
 #include <acpi/apei.h>
@@ -530,65 +527,30 @@ static void acpi_notify_device(acpi_handle handle, u32 event, void *data)
        acpi_drv->ops.notify(device, event);
 }
 
-static void acpi_notify_device_fixed(void *data)
-{
-       struct acpi_device *device = data;
-
-       /* Fixed hardware devices have no handles */
-       acpi_notify_device(NULL, ACPI_FIXED_HARDWARE_EVENT, device);
-}
-
-static u32 acpi_device_fixed_event(void *data)
-{
-       acpi_os_execute(OSL_NOTIFY_HANDLER, acpi_notify_device_fixed, data);
-       return ACPI_INTERRUPT_HANDLED;
-}
-
 static int acpi_device_install_notify_handler(struct acpi_device *device,
                                              struct acpi_driver *acpi_drv)
 {
-       acpi_status status;
-
-       if (device->device_type == ACPI_BUS_TYPE_POWER_BUTTON) {
-               status =
-                   acpi_install_fixed_event_handler(ACPI_EVENT_POWER_BUTTON,
-                                                    acpi_device_fixed_event,
-                                                    device);
-       } else if (device->device_type == ACPI_BUS_TYPE_SLEEP_BUTTON) {
-               status =
-                   acpi_install_fixed_event_handler(ACPI_EVENT_SLEEP_BUTTON,
-                                                    acpi_device_fixed_event,
-                                                    device);
-       } else {
-               u32 type = acpi_drv->flags & ACPI_DRIVER_ALL_NOTIFY_EVENTS ?
+       u32 type = acpi_drv->flags & ACPI_DRIVER_ALL_NOTIFY_EVENTS ?
                                ACPI_ALL_NOTIFY : ACPI_DEVICE_NOTIFY;
+       acpi_status status;
 
-               status = acpi_install_notify_handler(device->handle, type,
-                                                    acpi_notify_device,
-                                                    device);
-       }
-
+       status = acpi_install_notify_handler(device->handle, type,
+                                            acpi_notify_device, device);
        if (ACPI_FAILURE(status))
                return -EINVAL;
+
        return 0;
 }
 
 static void acpi_device_remove_notify_handler(struct acpi_device *device,
                                              struct acpi_driver *acpi_drv)
 {
-       if (device->device_type == ACPI_BUS_TYPE_POWER_BUTTON) {
-               acpi_remove_fixed_event_handler(ACPI_EVENT_POWER_BUTTON,
-                                               acpi_device_fixed_event);
-       } else if (device->device_type == ACPI_BUS_TYPE_SLEEP_BUTTON) {
-               acpi_remove_fixed_event_handler(ACPI_EVENT_SLEEP_BUTTON,
-                                               acpi_device_fixed_event);
-       } else {
-               u32 type = acpi_drv->flags & ACPI_DRIVER_ALL_NOTIFY_EVENTS ?
+       u32 type = acpi_drv->flags & ACPI_DRIVER_ALL_NOTIFY_EVENTS ?
                                ACPI_ALL_NOTIFY : ACPI_DEVICE_NOTIFY;
 
-               acpi_remove_notify_handler(device->handle, type,
-                                          acpi_notify_device);
-       }
+       acpi_remove_notify_handler(device->handle, type,
+                                  acpi_notify_device);
+
        acpi_os_wait_events_complete();
 }
 
@@ -1408,7 +1370,7 @@ static int __init acpi_init(void)
        acpi_init_ffh();
 
        pci_mmcfg_late_init();
-       acpi_iort_init();
+       acpi_arm_init();
        acpi_viot_early_init();
        acpi_hest_init();
        acpi_ghes_init();
@@ -1420,8 +1382,6 @@ static int __init acpi_init(void)
        acpi_debugger_init();
        acpi_setup_sb_notify_handler();
        acpi_viot_init();
-       acpi_agdi_init();
-       acpi_apmt_init();
        return 0;
 }
 
index 475e1ed..1e76a64 100644 (file)
@@ -78,6 +78,15 @@ static const struct dmi_system_id dmi_lid_quirks[] = {
                .driver_data = (void *)(long)ACPI_BUTTON_LID_INIT_DISABLED,
        },
        {
+               /* Nextbook Ares 8A tablet, _LID device always reports lid closed */
+               .matches = {
+                       DMI_MATCH(DMI_SYS_VENDOR, "Insyde"),
+                       DMI_MATCH(DMI_PRODUCT_NAME, "CherryTrail"),
+                       DMI_MATCH(DMI_BIOS_VERSION, "M882"),
+               },
+               .driver_data = (void *)(long)ACPI_BUTTON_LID_INIT_DISABLED,
+       },
+       {
                /*
                 * Lenovo Yoga 9 14ITL5, initial notification of the LID device
                 * never happens.
@@ -126,7 +135,6 @@ static const struct dmi_system_id dmi_lid_quirks[] = {
 
 static int acpi_button_add(struct acpi_device *device);
 static void acpi_button_remove(struct acpi_device *device);
-static void acpi_button_notify(struct acpi_device *device, u32 event);
 
 #ifdef CONFIG_PM_SLEEP
 static int acpi_button_suspend(struct device *dev);
@@ -144,7 +152,6 @@ static struct acpi_driver acpi_button_driver = {
        .ops = {
                .add = acpi_button_add,
                .remove = acpi_button_remove,
-               .notify = acpi_button_notify,
        },
        .drv.pm = &acpi_button_pm,
 };
@@ -400,45 +407,65 @@ static void acpi_lid_initialize_state(struct acpi_device *device)
        button->lid_state_initialized = true;
 }
 
-static void acpi_button_notify(struct acpi_device *device, u32 event)
+static void acpi_lid_notify(acpi_handle handle, u32 event, void *data)
 {
-       struct acpi_button *button = acpi_driver_data(device);
+       struct acpi_device *device = data;
+       struct acpi_button *button;
+
+       if (event != ACPI_BUTTON_NOTIFY_STATUS) {
+               acpi_handle_debug(device->handle, "Unsupported event [0x%x]\n",
+                                 event);
+               return;
+       }
+
+       button = acpi_driver_data(device);
+       if (!button->lid_state_initialized)
+               return;
+
+       acpi_lid_update_state(device, true);
+}
+
+static void acpi_button_notify(acpi_handle handle, u32 event, void *data)
+{
+       struct acpi_device *device = data;
+       struct acpi_button *button;
        struct input_dev *input;
+       int keycode;
 
-       switch (event) {
-       case ACPI_FIXED_HARDWARE_EVENT:
-               event = ACPI_BUTTON_NOTIFY_STATUS;
-               fallthrough;
-       case ACPI_BUTTON_NOTIFY_STATUS:
-               input = button->input;
-               if (button->type == ACPI_BUTTON_TYPE_LID) {
-                       if (button->lid_state_initialized)
-                               acpi_lid_update_state(device, true);
-               } else {
-                       int keycode;
-
-                       acpi_pm_wakeup_event(&device->dev);
-                       if (button->suspended)
-                               break;
-
-                       keycode = test_bit(KEY_SLEEP, input->keybit) ?
-                                               KEY_SLEEP : KEY_POWER;
-                       input_report_key(input, keycode, 1);
-                       input_sync(input);
-                       input_report_key(input, keycode, 0);
-                       input_sync(input);
-
-                       acpi_bus_generate_netlink_event(
-                                       device->pnp.device_class,
-                                       dev_name(&device->dev),
-                                       event, ++button->pushed);
-               }
-               break;
-       default:
+       if (event != ACPI_BUTTON_NOTIFY_STATUS) {
                acpi_handle_debug(device->handle, "Unsupported event [0x%x]\n",
                                  event);
-               break;
+               return;
        }
+
+       acpi_pm_wakeup_event(&device->dev);
+
+       button = acpi_driver_data(device);
+       if (button->suspended)
+               return;
+
+       input = button->input;
+       keycode = test_bit(KEY_SLEEP, input->keybit) ? KEY_SLEEP : KEY_POWER;
+
+       input_report_key(input, keycode, 1);
+       input_sync(input);
+       input_report_key(input, keycode, 0);
+       input_sync(input);
+
+       acpi_bus_generate_netlink_event(device->pnp.device_class,
+                                       dev_name(&device->dev),
+                                       event, ++button->pushed);
+}
+
+static void acpi_button_notify_run(void *data)
+{
+       acpi_button_notify(NULL, ACPI_BUTTON_NOTIFY_STATUS, data);
+}
+
+static u32 acpi_button_event(void *data)
+{
+       acpi_os_execute(OSL_NOTIFY_HANDLER, acpi_button_notify_run, data);
+       return ACPI_INTERRUPT_HANDLED;
 }
 
 #ifdef CONFIG_PM_SLEEP
@@ -480,11 +507,13 @@ static int acpi_lid_input_open(struct input_dev *input)
 
 static int acpi_button_add(struct acpi_device *device)
 {
+       acpi_notify_handler handler;
        struct acpi_button *button;
        struct input_dev *input;
        const char *hid = acpi_device_hid(device);
+       acpi_status status;
        char *name, *class;
-       int error;
+       int error = 0;
 
        if (!strcmp(hid, ACPI_BUTTON_HID_LID) &&
             lid_init_state == ACPI_BUTTON_LID_INIT_DISABLED)
@@ -508,17 +537,20 @@ static int acpi_button_add(struct acpi_device *device)
        if (!strcmp(hid, ACPI_BUTTON_HID_POWER) ||
            !strcmp(hid, ACPI_BUTTON_HID_POWERF)) {
                button->type = ACPI_BUTTON_TYPE_POWER;
+               handler = acpi_button_notify;
                strcpy(name, ACPI_BUTTON_DEVICE_NAME_POWER);
                sprintf(class, "%s/%s",
                        ACPI_BUTTON_CLASS, ACPI_BUTTON_SUBCLASS_POWER);
        } else if (!strcmp(hid, ACPI_BUTTON_HID_SLEEP) ||
                   !strcmp(hid, ACPI_BUTTON_HID_SLEEPF)) {
                button->type = ACPI_BUTTON_TYPE_SLEEP;
+               handler = acpi_button_notify;
                strcpy(name, ACPI_BUTTON_DEVICE_NAME_SLEEP);
                sprintf(class, "%s/%s",
                        ACPI_BUTTON_CLASS, ACPI_BUTTON_SUBCLASS_SLEEP);
        } else if (!strcmp(hid, ACPI_BUTTON_HID_LID)) {
                button->type = ACPI_BUTTON_TYPE_LID;
+               handler = acpi_lid_notify;
                strcpy(name, ACPI_BUTTON_DEVICE_NAME_LID);
                sprintf(class, "%s/%s",
                        ACPI_BUTTON_CLASS, ACPI_BUTTON_SUBCLASS_LID);
@@ -526,12 +558,15 @@ static int acpi_button_add(struct acpi_device *device)
        } else {
                pr_info("Unsupported hid [%s]\n", hid);
                error = -ENODEV;
-               goto err_free_input;
        }
 
-       error = acpi_button_add_fs(device);
-       if (error)
-               goto err_free_input;
+       if (!error)
+               error = acpi_button_add_fs(device);
+
+       if (error) {
+               input_free_device(input);
+               goto err_free_button;
+       }
 
        snprintf(button->phys, sizeof(button->phys), "%s/button/input0", hid);
 
@@ -559,6 +594,29 @@ static int acpi_button_add(struct acpi_device *device)
        error = input_register_device(input);
        if (error)
                goto err_remove_fs;
+
+       switch (device->device_type) {
+       case ACPI_BUS_TYPE_POWER_BUTTON:
+               status = acpi_install_fixed_event_handler(ACPI_EVENT_POWER_BUTTON,
+                                                         acpi_button_event,
+                                                         device);
+               break;
+       case ACPI_BUS_TYPE_SLEEP_BUTTON:
+               status = acpi_install_fixed_event_handler(ACPI_EVENT_SLEEP_BUTTON,
+                                                         acpi_button_event,
+                                                         device);
+               break;
+       default:
+               status = acpi_install_notify_handler(device->handle,
+                                                    ACPI_DEVICE_NOTIFY, handler,
+                                                    device);
+               break;
+       }
+       if (ACPI_FAILURE(status)) {
+               error = -ENODEV;
+               goto err_input_unregister;
+       }
+
        if (button->type == ACPI_BUTTON_TYPE_LID) {
                /*
                 * This assumes there's only one lid device, or if there are
@@ -571,11 +629,11 @@ static int acpi_button_add(struct acpi_device *device)
        pr_info("%s [%s]\n", name, acpi_device_bid(device));
        return 0;
 
- err_remove_fs:
+err_input_unregister:
+       input_unregister_device(input);
+err_remove_fs:
        acpi_button_remove_fs(device);
- err_free_input:
-       input_free_device(input);
- err_free_button:
+err_free_button:
        kfree(button);
        return error;
 }
@@ -584,6 +642,24 @@ static void acpi_button_remove(struct acpi_device *device)
 {
        struct acpi_button *button = acpi_driver_data(device);
 
+       switch (device->device_type) {
+       case ACPI_BUS_TYPE_POWER_BUTTON:
+               acpi_remove_fixed_event_handler(ACPI_EVENT_POWER_BUTTON,
+                                               acpi_button_event);
+               break;
+       case ACPI_BUS_TYPE_SLEEP_BUTTON:
+               acpi_remove_fixed_event_handler(ACPI_EVENT_SLEEP_BUTTON,
+                                               acpi_button_event);
+               break;
+       default:
+               acpi_remove_notify_handler(device->handle, ACPI_DEVICE_NOTIFY,
+                                          button->type == ACPI_BUTTON_TYPE_LID ?
+                                               acpi_lid_notify :
+                                               acpi_button_notify);
+               break;
+       }
+       acpi_os_wait_events_complete();
+
        acpi_button_remove_fs(device);
        input_unregister_device(button->input);
        kfree(button);
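
In the reworked button driver above, the fixed-event handler acpi_button_event() runs in an interrupt-like context and only queues acpi_button_notify_run() via acpi_os_execute(); the actual input reporting happens later in process context. The following is a minimal user-space sketch of that split, with a toy array standing in for the acpi_os_execute() work queue; all names are invented and nothing here is taken from the patch itself.

#include <stdio.h>

typedef void (*deferred_fn)(void *data);

struct deferred_work {
        deferred_fn fn;
        void *data;
};

static struct deferred_work queue[8];
static int queued;

/* Toy stand-in for acpi_os_execute(OSL_NOTIFY_HANDLER, ...). */
static void os_execute(deferred_fn fn, void *data)
{
        queue[queued].fn = fn;
        queue[queued].data = data;
        queued++;
}

/* Runs later, in "process context": safe to do the real work here. */
static void button_notify_run(void *data)
{
        printf("notify for %s\n", (const char *)data);
}

/* Fixed-event handler: must not block, so it only queues the work. */
static unsigned int button_event(void *data)
{
        os_execute(button_notify_run, data);
        return 1;       /* "interrupt handled" */
}

int main(void)
{
        button_event("PWRB");                   /* power button "fires" */

        for (int i = 0; i < queued; i++)        /* later, outside the handler */
                queue[i].fn(queue[i].data);

        return 0;
}
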
index 928899a..8569f55 100644 (file)
@@ -662,21 +662,6 @@ static void advance_transaction(struct acpi_ec *ec, bool interrupt)
 
        ec_dbg_stm("%s (%d)", interrupt ? "IRQ" : "TASK", smp_processor_id());
 
-       /*
-        * Clear GPE_STS upfront to allow subsequent hardware GPE_STS 0->1
-        * changes to always trigger a GPE interrupt.
-        *
-        * GPE STS is a W1C register, which means:
-        *
-        * 1. Software can clear it without worrying about clearing the other
-        *    GPEs' STS bits when the hardware sets them in parallel.
-        *
-        * 2. As long as software can ensure only clearing it when it is set,
-        *    hardware won't set it in parallel.
-        */
-       if (ec->gpe >= 0 && acpi_ec_gpe_status_set(ec))
-               acpi_clear_gpe(NULL, ec->gpe);
-
        status = acpi_ec_read_status(ec);
 
        /*
@@ -1287,6 +1272,22 @@ static void acpi_ec_handle_interrupt(struct acpi_ec *ec)
        unsigned long flags;
 
        spin_lock_irqsave(&ec->lock, flags);
+
+       /*
+        * Clear GPE_STS upfront to allow subsequent hardware GPE_STS 0->1
+        * changes to always trigger a GPE interrupt.
+        *
+        * GPE STS is a W1C register, which means:
+        *
+        * 1. Software can clear it without worrying about clearing the other
+        *    GPEs' STS bits when the hardware sets them in parallel.
+        *
+        * 2. As long as software can ensure only clearing it when it is set,
+        *    hardware won't set it in parallel.
+        */
+       if (ec->gpe >= 0 && acpi_ec_gpe_status_set(ec))
+               acpi_clear_gpe(NULL, ec->gpe);
+
        advance_transaction(ec, true);
        spin_unlock_irqrestore(&ec->lock, flags);
 }
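
The comment moved into acpi_ec_handle_interrupt() above relies on GPE_STS being a write-1-to-clear (W1C) register: writing a 1 clears only the bits that were written, so clearing the EC's bit cannot lose a concurrent 0->1 transition of another GPE bit. A small user-space model of that property follows; the register and bit names are invented for illustration.

#include <stdio.h>
#include <stdint.h>

#define EC_GPE          (1u << 3)       /* invented bit positions */
#define OTHER_GPE       (1u << 7)

static volatile uint32_t gpe_sts;       /* pretend hardware status register */

static void hw_raise(uint32_t bit)      /* hardware sets a status bit */
{
        gpe_sts |= bit;
}

static void w1c_write(uint32_t val)     /* software writes 1s to clear */
{
        /*
         * Only the bits written as 1 are cleared; a concurrent 0->1 set of a
         * different bit by hardware is never lost.
         */
        gpe_sts &= ~val;
}

int main(void)
{
        hw_raise(EC_GPE);                       /* EC event pending */
        hw_raise(OTHER_GPE);                    /* unrelated GPE also pending */

        if (gpe_sts & EC_GPE)                   /* only clear it when set */
                w1c_write(EC_GPE);

        printf("still pending: 0x%x\n", (unsigned int)gpe_sts); /* OTHER_GPE */
        return 0;
}
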
index 6023ad6..573bc0d 100644 (file)
@@ -347,4 +347,6 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm,
 void acpi_nfit_desc_init(struct acpi_nfit_desc *acpi_desc, struct device *dev);
 bool intel_fwa_supported(struct nvdimm_bus *nvdimm_bus);
 extern struct device_attribute dev_attr_firmware_activate_noidle;
+void nfit_intel_shutdown_status(struct nfit_mem *nfit_mem);
+
 #endif /* __NFIT_H__ */
index 9718d07..dc615ef 100644 (file)
@@ -597,10 +597,6 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
                        io_idle(cx->address);
                } else
                        return -ENODEV;
-
-#if defined(CONFIG_X86) && defined(CONFIG_HOTPLUG_CPU)
-               cond_wakeup_cpu0();
-#endif
        }
 
        /* Never reached */
index 0800a9d..1dd8d5a 100644 (file)
@@ -470,52 +470,6 @@ static const struct dmi_system_id asus_laptop[] = {
        { }
 };
 
-static const struct dmi_system_id lenovo_laptop[] = {
-       {
-               .ident = "LENOVO IdeaPad Flex 5 14ALC7",
-               .matches = {
-                       DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
-                       DMI_MATCH(DMI_PRODUCT_NAME, "82R9"),
-               },
-       },
-       {
-               .ident = "LENOVO IdeaPad Flex 5 16ALC7",
-               .matches = {
-                       DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
-                       DMI_MATCH(DMI_PRODUCT_NAME, "82RA"),
-               },
-       },
-       { }
-};
-
-static const struct dmi_system_id tongfang_gm_rg[] = {
-       {
-               .ident = "TongFang GMxRGxx/XMG CORE 15 (M22)/TUXEDO Stellaris 15 Gen4 AMD",
-               .matches = {
-                       DMI_MATCH(DMI_BOARD_NAME, "GMxRGxx"),
-               },
-       },
-       { }
-};
-
-static const struct dmi_system_id maingear_laptop[] = {
-       {
-               .ident = "MAINGEAR Vector Pro 2 15",
-               .matches = {
-                       DMI_MATCH(DMI_SYS_VENDOR, "Micro Electronics Inc"),
-                       DMI_MATCH(DMI_PRODUCT_NAME, "MG-VCP2-15A3070T"),
-               }
-       },
-       {
-               .ident = "MAINGEAR Vector Pro 2 17",
-               .matches = {
-                       DMI_MATCH(DMI_SYS_VENDOR, "Micro Electronics Inc"),
-                       DMI_MATCH(DMI_PRODUCT_NAME, "MG-VCP2-17A3070T"),
-               },
-       },
-       { }
-};
-
 static const struct dmi_system_id lg_laptop[] = {
        {
                .ident = "LG Electronics 17U70P",
@@ -539,10 +493,6 @@ struct irq_override_cmp {
 static const struct irq_override_cmp override_table[] = {
        { medion_laptop, 1, ACPI_LEVEL_SENSITIVE, ACPI_ACTIVE_LOW, 0, false },
        { asus_laptop, 1, ACPI_LEVEL_SENSITIVE, ACPI_ACTIVE_LOW, 0, false },
-       { lenovo_laptop, 6, ACPI_LEVEL_SENSITIVE, ACPI_ACTIVE_LOW, 0, true },
-       { lenovo_laptop, 10, ACPI_LEVEL_SENSITIVE, ACPI_ACTIVE_LOW, 0, true },
-       { tongfang_gm_rg, 1, ACPI_EDGE_SENSITIVE, ACPI_ACTIVE_LOW, 1, true },
-       { maingear_laptop, 1, ACPI_EDGE_SENSITIVE, ACPI_ACTIVE_LOW, 1, true },
        { lg_laptop, 1, ACPI_LEVEL_SENSITIVE, ACPI_ACTIVE_LOW, 0, false },
 };
 
@@ -562,16 +512,6 @@ static bool acpi_dev_irq_override(u32 gsi, u8 triggering, u8 polarity,
                        return entry->override;
        }
 
-#ifdef CONFIG_X86
-       /*
-        * IRQ override isn't needed on modern AMD Zen systems and
-        * this override breaks active low IRQs on AMD Ryzen 6000 and
-        * newer systems. Skip it.
-        */
-       if (boot_cpu_has(X86_FEATURE_ZEN))
-               return false;
-#endif
-
        return true;
 }
 
index 0c6f06a..1c3e1e2 100644 (file)
@@ -2029,8 +2029,6 @@ static u32 acpi_scan_check_dep(acpi_handle handle, bool check_dep)
        return count;
 }
 
-static bool acpi_bus_scan_second_pass;
-
 static acpi_status acpi_bus_check_add(acpi_handle handle, bool check_dep,
                                      struct acpi_device **adev_p)
 {
@@ -2050,10 +2048,8 @@ static acpi_status acpi_bus_check_add(acpi_handle handle, bool check_dep,
                        return AE_OK;
 
                /* Bail out if there are dependencies. */
-               if (acpi_scan_check_dep(handle, check_dep) > 0) {
-                       acpi_bus_scan_second_pass = true;
+               if (acpi_scan_check_dep(handle, check_dep) > 0)
                        return AE_CTRL_DEPTH;
-               }
 
                fallthrough;
        case ACPI_TYPE_ANY:     /* for ACPI_ROOT_OBJECT */
@@ -2301,6 +2297,12 @@ static bool acpi_scan_clear_dep_queue(struct acpi_device *adev)
        return true;
 }
 
+static void acpi_scan_delete_dep_data(struct acpi_dep_data *dep)
+{
+       list_del(&dep->node);
+       kfree(dep);
+}
+
 static int acpi_scan_clear_dep(struct acpi_dep_data *dep, void *data)
 {
        struct acpi_device *adev = acpi_get_acpi_dev(dep->consumer);
@@ -2311,8 +2313,10 @@ static int acpi_scan_clear_dep(struct acpi_dep_data *dep, void *data)
                        acpi_dev_put(adev);
        }
 
-       list_del(&dep->node);
-       kfree(dep);
+       if (dep->free_when_met)
+               acpi_scan_delete_dep_data(dep);
+       else
+               dep->met = true;
 
        return 0;
 }
@@ -2406,6 +2410,55 @@ struct acpi_device *acpi_dev_get_next_consumer_dev(struct acpi_device *supplier,
 }
 EXPORT_SYMBOL_GPL(acpi_dev_get_next_consumer_dev);
 
+static void acpi_scan_postponed_branch(acpi_handle handle)
+{
+       struct acpi_device *adev = NULL;
+
+       if (ACPI_FAILURE(acpi_bus_check_add(handle, false, &adev)))
+               return;
+
+       acpi_walk_namespace(ACPI_TYPE_ANY, handle, ACPI_UINT32_MAX,
+                           acpi_bus_check_add_2, NULL, NULL, (void **)&adev);
+       acpi_bus_attach(adev, NULL);
+}
+
+static void acpi_scan_postponed(void)
+{
+       struct acpi_dep_data *dep, *tmp;
+
+       mutex_lock(&acpi_dep_list_lock);
+
+       list_for_each_entry_safe(dep, tmp, &acpi_dep_list, node) {
+               acpi_handle handle = dep->consumer;
+
+               /*
+                * In case there are multiple acpi_dep_list entries with the
+                * same consumer, skip the current entry if the consumer device
+                * object corresponding to it is present already.
+                */
+               if (!acpi_fetch_acpi_dev(handle)) {
+                       /*
+                        * Even though the lock is released here, tmp is
+                        * guaranteed to be valid, because none of the list
+                        * entries following dep is marked as "free when met"
+                        * and so they cannot be deleted.
+                        */
+                       mutex_unlock(&acpi_dep_list_lock);
+
+                       acpi_scan_postponed_branch(handle);
+
+                       mutex_lock(&acpi_dep_list_lock);
+               }
+
+               if (dep->met)
+                       acpi_scan_delete_dep_data(dep);
+               else
+                       dep->free_when_met = true;
+       }
+
+       mutex_unlock(&acpi_dep_list_lock);
+}
+
 /**
  * acpi_bus_scan - Add ACPI device node objects in a given namespace scope.
  * @handle: Root of the namespace scope to scan.
@@ -2424,8 +2477,6 @@ int acpi_bus_scan(acpi_handle handle)
 {
        struct acpi_device *device = NULL;
 
-       acpi_bus_scan_second_pass = false;
-
        /* Pass 1: Avoid enumerating devices with missing dependencies. */
 
        if (ACPI_SUCCESS(acpi_bus_check_add(handle, true, &device)))
@@ -2438,19 +2489,9 @@ int acpi_bus_scan(acpi_handle handle)
 
        acpi_bus_attach(device, (void *)true);
 
-       if (!acpi_bus_scan_second_pass)
-               return 0;
-
        /* Pass 2: Enumerate all of the remaining devices. */
 
-       device = NULL;
-
-       if (ACPI_SUCCESS(acpi_bus_check_add(handle, false, &device)))
-               acpi_walk_namespace(ACPI_TYPE_ANY, handle, ACPI_UINT32_MAX,
-                                   acpi_bus_check_add_2, NULL, NULL,
-                                   (void **)&device);
-
-       acpi_bus_attach(device, NULL);
+       acpi_scan_postponed();
 
        return 0;
 }
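
The scan.c rework above replaces the global second-pass flag with per-entry state: whichever side reaches an acpi_dep_list entry second frees it, while the side that gets there first only leaves a marker (met or free_when_met). Below is a reduced, single-entry, lock-free user-space sketch of that handshake; the function names are invented and it deliberately ignores the list walking and locking shown in the patch.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct dep {
        bool met;               /* supplier turned up already */
        bool free_when_met;     /* postponed scan is done with this entry */
};

/* Supplier side, in the spirit of acpi_scan_clear_dep(). */
static struct dep *supplier_ready(struct dep *dep)
{
        if (dep->free_when_met) {       /* scan already passed it: drop it */
                free(dep);
                return NULL;
        }
        dep->met = true;                /* scan will free it when it gets here */
        return dep;
}

/* Scan side, in the spirit of the per-entry step in acpi_scan_postponed(). */
static struct dep *postponed_scan_step(struct dep *dep)
{
        if (dep->met) {                 /* supplier already handled: drop it */
                free(dep);
                return NULL;
        }
        dep->free_when_met = true;      /* supplier will free it later */
        return dep;
}

int main(void)
{
        struct dep *dep = calloc(1, sizeof(*dep));

        if (!dep)
                return 1;

        dep = postponed_scan_step(dep); /* scan happens to run first here... */
        if (dep)
                dep = supplier_ready(dep); /* ...so the supplier path frees it */

        printf("entry %s\n", dep ? "still pending" : "freed");
        return 0;
}
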
index f32570f..808484d 100644 (file)
@@ -848,7 +848,7 @@ void __weak acpi_s2idle_setup(void)
        s2idle_set_ops(&acpi_s2idle_ops);
 }
 
-static void acpi_sleep_suspend_setup(void)
+static void __init acpi_sleep_suspend_setup(void)
 {
        bool suspend_ops_needed = false;
        int i;
index 4720a36..f9f6ebb 100644 (file)
 #define ACPI_THERMAL_NOTIFY_HOT                0xF1
 #define ACPI_THERMAL_MODE_ACTIVE       0x00
 
-#define ACPI_THERMAL_MAX_ACTIVE        10
-#define ACPI_THERMAL_MAX_LIMIT_STR_LEN 65
+#define ACPI_THERMAL_MAX_ACTIVE                10
+#define ACPI_THERMAL_MAX_LIMIT_STR_LEN 65
 
-MODULE_AUTHOR("Paul Diefenbaugh");
-MODULE_DESCRIPTION("ACPI Thermal Zone Driver");
-MODULE_LICENSE("GPL");
+#define ACPI_TRIPS_CRITICAL    BIT(0)
+#define ACPI_TRIPS_HOT         BIT(1)
+#define ACPI_TRIPS_PASSIVE     BIT(2)
+#define ACPI_TRIPS_ACTIVE      BIT(3)
+#define ACPI_TRIPS_DEVICES     BIT(4)
+
+#define ACPI_TRIPS_THRESHOLDS  (ACPI_TRIPS_PASSIVE | ACPI_TRIPS_ACTIVE)
+
+#define ACPI_TRIPS_INIT                (ACPI_TRIPS_CRITICAL | ACPI_TRIPS_HOT | \
+                                ACPI_TRIPS_PASSIVE | ACPI_TRIPS_ACTIVE | \
+                                ACPI_TRIPS_DEVICES)
+
+/*
+ * This exception is thrown out in two cases:
+ * 1. An invalid trip point becomes valid or a valid trip point becomes invalid
+ *    when re-evaluating the AML code.
+ * 2. TODO: Devices listed in _PSL, _ALx, _TZD may change.
+ *   We need to re-bind the cooling devices of a thermal zone when this occurs.
+ */
+#define ACPI_THERMAL_TRIPS_EXCEPTION(flags, tz, str) \
+do { \
+       if (flags != ACPI_TRIPS_INIT) \
+               acpi_handle_info(tz->device->handle, \
+                       "ACPI thermal trip point %s changed\n" \
+                       "Please report to linux-acpi@vger.kernel.org\n", str); \
+} while (0)
 
 static int act;
 module_param(act, int, 0644);
@@ -73,75 +96,30 @@ MODULE_PARM_DESC(psv, "Disable or override all passive trip points.");
 
 static struct workqueue_struct *acpi_thermal_pm_queue;
 
-static int acpi_thermal_add(struct acpi_device *device);
-static void acpi_thermal_remove(struct acpi_device *device);
-static void acpi_thermal_notify(struct acpi_device *device, u32 event);
-
-static const struct acpi_device_id  thermal_device_ids[] = {
-       {ACPI_THERMAL_HID, 0},
-       {"", 0},
-};
-MODULE_DEVICE_TABLE(acpi, thermal_device_ids);
-
-#ifdef CONFIG_PM_SLEEP
-static int acpi_thermal_suspend(struct device *dev);
-static int acpi_thermal_resume(struct device *dev);
-#else
-#define acpi_thermal_suspend NULL
-#define acpi_thermal_resume NULL
-#endif
-static SIMPLE_DEV_PM_OPS(acpi_thermal_pm, acpi_thermal_suspend, acpi_thermal_resume);
-
-static struct acpi_driver acpi_thermal_driver = {
-       .name = "thermal",
-       .class = ACPI_THERMAL_CLASS,
-       .ids = thermal_device_ids,
-       .ops = {
-               .add = acpi_thermal_add,
-               .remove = acpi_thermal_remove,
-               .notify = acpi_thermal_notify,
-               },
-       .drv.pm = &acpi_thermal_pm,
-};
-
-struct acpi_thermal_state {
-       u8 critical:1;
-       u8 hot:1;
-       u8 passive:1;
-       u8 active:1;
-       u8 reserved:4;
-       int active_index;
-};
-
-struct acpi_thermal_state_flags {
-       u8 valid:1;
-       u8 enabled:1;
-       u8 reserved:6;
-};
-
 struct acpi_thermal_critical {
-       struct acpi_thermal_state_flags flags;
        unsigned long temperature;
+       bool valid;
 };
 
 struct acpi_thermal_hot {
-       struct acpi_thermal_state_flags flags;
        unsigned long temperature;
+       bool valid;
 };
 
 struct acpi_thermal_passive {
-       struct acpi_thermal_state_flags flags;
+       struct acpi_handle_list devices;
        unsigned long temperature;
        unsigned long tc1;
        unsigned long tc2;
        unsigned long tsp;
-       struct acpi_handle_list devices;
+       bool valid;
 };
 
 struct acpi_thermal_active {
-       struct acpi_thermal_state_flags flags;
-       unsigned long temperature;
        struct acpi_handle_list devices;
+       unsigned long temperature;
+       bool valid;
+       bool enabled;
 };
 
 struct acpi_thermal_trips {
@@ -151,12 +129,6 @@ struct acpi_thermal_trips {
        struct acpi_thermal_active active[ACPI_THERMAL_MAX_ACTIVE];
 };
 
-struct acpi_thermal_flags {
-       u8 cooling_mode:1;      /* _SCP */
-       u8 devices:1;           /* _TZD */
-       u8 reserved:6;
-};
-
 struct acpi_thermal {
        struct acpi_device *device;
        acpi_bus_id name;
@@ -164,8 +136,6 @@ struct acpi_thermal {
        unsigned long last_temperature;
        unsigned long polling_frequency;
        volatile u8 zombie;
-       struct acpi_thermal_flags flags;
-       struct acpi_thermal_state state;
        struct acpi_thermal_trips trips;
        struct acpi_handle_list devices;
        struct thermal_zone_device *thermal_zone;
@@ -220,52 +190,12 @@ static int acpi_thermal_get_polling_frequency(struct acpi_thermal *tz)
        return 0;
 }
 
-static int acpi_thermal_set_cooling_mode(struct acpi_thermal *tz, int mode)
-{
-       if (!tz)
-               return -EINVAL;
-
-       if (ACPI_FAILURE(acpi_execute_simple_method(tz->device->handle,
-                                                   "_SCP", mode)))
-               return -ENODEV;
-
-       return 0;
-}
-
-#define ACPI_TRIPS_CRITICAL    0x01
-#define ACPI_TRIPS_HOT         0x02
-#define ACPI_TRIPS_PASSIVE     0x04
-#define ACPI_TRIPS_ACTIVE      0x08
-#define ACPI_TRIPS_DEVICES     0x10
-
-#define ACPI_TRIPS_REFRESH_THRESHOLDS  (ACPI_TRIPS_PASSIVE | ACPI_TRIPS_ACTIVE)
-#define ACPI_TRIPS_REFRESH_DEVICES     ACPI_TRIPS_DEVICES
-
-#define ACPI_TRIPS_INIT      (ACPI_TRIPS_CRITICAL | ACPI_TRIPS_HOT |   \
-                             ACPI_TRIPS_PASSIVE | ACPI_TRIPS_ACTIVE |  \
-                             ACPI_TRIPS_DEVICES)
-
-/*
- * This exception is thrown out in two cases:
- * 1.An invalid trip point becomes invalid or a valid trip point becomes invalid
- *   when re-evaluating the AML code.
- * 2.TODO: Devices listed in _PSL, _ALx, _TZD may change.
- *   We need to re-bind the cooling devices of a thermal zone when this occurs.
- */
-#define ACPI_THERMAL_TRIPS_EXCEPTION(flags, tz, str)   \
-do {   \
-       if (flags != ACPI_TRIPS_INIT)   \
-               acpi_handle_info(tz->device->handle,    \
-               "ACPI thermal trip point %s changed\n"  \
-               "Please report to linux-acpi@vger.kernel.org\n", str); \
-} while (0)
-
 static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
 {
        acpi_status status;
        unsigned long long tmp;
        struct acpi_handle_list devices;
-       int valid = 0;
+       bool valid = false;
        int i;
 
        /* Critical Shutdown */
@@ -279,21 +209,21 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
                 * ... so lets discard those as invalid.
                 */
                if (ACPI_FAILURE(status)) {
-                       tz->trips.critical.flags.valid = 0;
+                       tz->trips.critical.valid = false;
                        acpi_handle_debug(tz->device->handle,
                                          "No critical threshold\n");
                } else if (tmp <= 2732) {
                        pr_info(FW_BUG "Invalid critical threshold (%llu)\n", tmp);
-                       tz->trips.critical.flags.valid = 0;
+                       tz->trips.critical.valid = false;
                } else {
-                       tz->trips.critical.flags.valid = 1;
+                       tz->trips.critical.valid = true;
                        acpi_handle_debug(tz->device->handle,
                                          "Found critical threshold [%lu]\n",
                                          tz->trips.critical.temperature);
                }
-               if (tz->trips.critical.flags.valid) {
+               if (tz->trips.critical.valid) {
                        if (crt == -1) {
-                               tz->trips.critical.flags.valid = 0;
+                               tz->trips.critical.valid = false;
                        } else if (crt > 0) {
                                unsigned long crt_k = celsius_to_deci_kelvin(crt);
 
@@ -312,12 +242,12 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
        if (flag & ACPI_TRIPS_HOT) {
                status = acpi_evaluate_integer(tz->device->handle, "_HOT", NULL, &tmp);
                if (ACPI_FAILURE(status)) {
-                       tz->trips.hot.flags.valid = 0;
+                       tz->trips.hot.valid = false;
                        acpi_handle_debug(tz->device->handle,
                                          "No hot threshold\n");
                } else {
                        tz->trips.hot.temperature = tmp;
-                       tz->trips.hot.flags.valid = 1;
+                       tz->trips.hot.valid = true;
                        acpi_handle_debug(tz->device->handle,
                                          "Found hot threshold [%lu]\n",
                                          tz->trips.hot.temperature);
@@ -325,9 +255,9 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
        }
 
        /* Passive (optional) */
-       if (((flag & ACPI_TRIPS_PASSIVE) && tz->trips.passive.flags.valid) ||
+       if (((flag & ACPI_TRIPS_PASSIVE) && tz->trips.passive.valid) ||
            flag == ACPI_TRIPS_INIT) {
-               valid = tz->trips.passive.flags.valid;
+               valid = tz->trips.passive.valid;
                if (psv == -1) {
                        status = AE_SUPPORT;
                } else if (psv > 0) {
@@ -339,44 +269,44 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
                }
 
                if (ACPI_FAILURE(status)) {
-                       tz->trips.passive.flags.valid = 0;
+                       tz->trips.passive.valid = false;
                } else {
                        tz->trips.passive.temperature = tmp;
-                       tz->trips.passive.flags.valid = 1;
+                       tz->trips.passive.valid = true;
                        if (flag == ACPI_TRIPS_INIT) {
                                status = acpi_evaluate_integer(tz->device->handle,
                                                               "_TC1", NULL, &tmp);
                                if (ACPI_FAILURE(status))
-                                       tz->trips.passive.flags.valid = 0;
+                                       tz->trips.passive.valid = false;
                                else
                                        tz->trips.passive.tc1 = tmp;
 
                                status = acpi_evaluate_integer(tz->device->handle,
                                                               "_TC2", NULL, &tmp);
                                if (ACPI_FAILURE(status))
-                                       tz->trips.passive.flags.valid = 0;
+                                       tz->trips.passive.valid = false;
                                else
                                        tz->trips.passive.tc2 = tmp;
 
                                status = acpi_evaluate_integer(tz->device->handle,
                                                               "_TSP", NULL, &tmp);
                                if (ACPI_FAILURE(status))
-                                       tz->trips.passive.flags.valid = 0;
+                                       tz->trips.passive.valid = false;
                                else
                                        tz->trips.passive.tsp = tmp;
                        }
                }
        }
-       if ((flag & ACPI_TRIPS_DEVICES) && tz->trips.passive.flags.valid) {
+       if ((flag & ACPI_TRIPS_DEVICES) && tz->trips.passive.valid) {
                memset(&devices, 0, sizeof(struct acpi_handle_list));
                status = acpi_evaluate_reference(tz->device->handle, "_PSL",
                                                 NULL, &devices);
                if (ACPI_FAILURE(status)) {
                        acpi_handle_info(tz->device->handle,
                                         "Invalid passive threshold\n");
-                       tz->trips.passive.flags.valid = 0;
+                       tz->trips.passive.valid = false;
                } else {
-                       tz->trips.passive.flags.valid = 1;
+                       tz->trips.passive.valid = true;
                }
 
                if (memcmp(&tz->trips.passive.devices, &devices,
@@ -387,24 +317,24 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
                }
        }
        if ((flag & ACPI_TRIPS_PASSIVE) || (flag & ACPI_TRIPS_DEVICES)) {
-               if (valid != tz->trips.passive.flags.valid)
+               if (valid != tz->trips.passive.valid)
                        ACPI_THERMAL_TRIPS_EXCEPTION(flag, tz, "state");
        }
 
        /* Active (optional) */
        for (i = 0; i < ACPI_THERMAL_MAX_ACTIVE; i++) {
                char name[5] = { '_', 'A', 'C', ('0' + i), '\0' };
-               valid = tz->trips.active[i].flags.valid;
+               valid = tz->trips.active[i].valid;
 
                if (act == -1)
                        break; /* disable all active trip points */
 
                if (flag == ACPI_TRIPS_INIT || ((flag & ACPI_TRIPS_ACTIVE) &&
-                   tz->trips.active[i].flags.valid)) {
+                   tz->trips.active[i].valid)) {
                        status = acpi_evaluate_integer(tz->device->handle,
                                                       name, NULL, &tmp);
                        if (ACPI_FAILURE(status)) {
-                               tz->trips.active[i].flags.valid = 0;
+                               tz->trips.active[i].valid = false;
                                if (i == 0)
                                        break;
 
@@ -426,21 +356,21 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
                                break;
                        } else {
                                tz->trips.active[i].temperature = tmp;
-                               tz->trips.active[i].flags.valid = 1;
+                               tz->trips.active[i].valid = true;
                        }
                }
 
                name[2] = 'L';
-               if ((flag & ACPI_TRIPS_DEVICES) && tz->trips.active[i].flags.valid) {
+               if ((flag & ACPI_TRIPS_DEVICES) && tz->trips.active[i].valid) {
                        memset(&devices, 0, sizeof(struct acpi_handle_list));
                        status = acpi_evaluate_reference(tz->device->handle,
                                                         name, NULL, &devices);
                        if (ACPI_FAILURE(status)) {
                                acpi_handle_info(tz->device->handle,
                                                 "Invalid active%d threshold\n", i);
-                               tz->trips.active[i].flags.valid = 0;
+                               tz->trips.active[i].valid = false;
                        } else {
-                               tz->trips.active[i].flags.valid = 1;
+                               tz->trips.active[i].valid = true;
                        }
 
                        if (memcmp(&tz->trips.active[i].devices, &devices,
@@ -451,10 +381,10 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
                        }
                }
                if ((flag & ACPI_TRIPS_ACTIVE) || (flag & ACPI_TRIPS_DEVICES))
-                       if (valid != tz->trips.active[i].flags.valid)
+                       if (valid != tz->trips.active[i].valid)
                                ACPI_THERMAL_TRIPS_EXCEPTION(flag, tz, "state");
 
-               if (!tz->trips.active[i].flags.valid)
+               if (!tz->trips.active[i].valid)
                        break;
        }
 
@@ -474,17 +404,18 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
 
 static int acpi_thermal_get_trip_points(struct acpi_thermal *tz)
 {
-       int i, valid, ret = acpi_thermal_trips_update(tz, ACPI_TRIPS_INIT);
+       int i, ret = acpi_thermal_trips_update(tz, ACPI_TRIPS_INIT);
+       bool valid;
 
        if (ret)
                return ret;
 
-       valid = tz->trips.critical.flags.valid |
-               tz->trips.hot.flags.valid |
-               tz->trips.passive.flags.valid;
+       valid = tz->trips.critical.valid |
+               tz->trips.hot.valid |
+               tz->trips.passive.valid;
 
        for (i = 0; i < ACPI_THERMAL_MAX_ACTIVE; i++)
-               valid |= tz->trips.active[i].flags.valid;
+               valid = valid || tz->trips.active[i].valid;
 
        if (!valid) {
                pr_warn(FW_BUG "No valid trip found\n");
@@ -521,7 +452,7 @@ static int thermal_get_trip_type(struct thermal_zone_device *thermal,
        if (!tz || trip < 0)
                return -EINVAL;
 
-       if (tz->trips.critical.flags.valid) {
+       if (tz->trips.critical.valid) {
                if (!trip) {
                        *type = THERMAL_TRIP_CRITICAL;
                        return 0;
@@ -529,7 +460,7 @@ static int thermal_get_trip_type(struct thermal_zone_device *thermal,
                trip--;
        }
 
-       if (tz->trips.hot.flags.valid) {
+       if (tz->trips.hot.valid) {
                if (!trip) {
                        *type = THERMAL_TRIP_HOT;
                        return 0;
@@ -537,7 +468,7 @@ static int thermal_get_trip_type(struct thermal_zone_device *thermal,
                trip--;
        }
 
-       if (tz->trips.passive.flags.valid) {
+       if (tz->trips.passive.valid) {
                if (!trip) {
                        *type = THERMAL_TRIP_PASSIVE;
                        return 0;
@@ -545,7 +476,7 @@ static int thermal_get_trip_type(struct thermal_zone_device *thermal,
                trip--;
        }
 
-       for (i = 0; i < ACPI_THERMAL_MAX_ACTIVE && tz->trips.active[i].flags.valid; i++) {
+       for (i = 0; i < ACPI_THERMAL_MAX_ACTIVE && tz->trips.active[i].valid; i++) {
                if (!trip) {
                        *type = THERMAL_TRIP_ACTIVE;
                        return 0;
@@ -565,7 +496,7 @@ static int thermal_get_trip_temp(struct thermal_zone_device *thermal,
        if (!tz || trip < 0)
                return -EINVAL;
 
-       if (tz->trips.critical.flags.valid) {
+       if (tz->trips.critical.valid) {
                if (!trip) {
                        *temp = deci_kelvin_to_millicelsius_with_offset(
                                        tz->trips.critical.temperature,
@@ -575,7 +506,7 @@ static int thermal_get_trip_temp(struct thermal_zone_device *thermal,
                trip--;
        }
 
-       if (tz->trips.hot.flags.valid) {
+       if (tz->trips.hot.valid) {
                if (!trip) {
                        *temp = deci_kelvin_to_millicelsius_with_offset(
                                        tz->trips.hot.temperature,
@@ -585,7 +516,7 @@ static int thermal_get_trip_temp(struct thermal_zone_device *thermal,
                trip--;
        }
 
-       if (tz->trips.passive.flags.valid) {
+       if (tz->trips.passive.valid) {
                if (!trip) {
                        *temp = deci_kelvin_to_millicelsius_with_offset(
                                        tz->trips.passive.temperature,
@@ -596,7 +527,7 @@ static int thermal_get_trip_temp(struct thermal_zone_device *thermal,
        }
 
        for (i = 0; i < ACPI_THERMAL_MAX_ACTIVE &&
-               tz->trips.active[i].flags.valid; i++) {
+               tz->trips.active[i].valid; i++) {
                if (!trip) {
                        *temp = deci_kelvin_to_millicelsius_with_offset(
                                        tz->trips.active[i].temperature,
@@ -614,7 +545,7 @@ static int thermal_get_crit_temp(struct thermal_zone_device *thermal,
 {
        struct acpi_thermal *tz = thermal_zone_device_priv(thermal);
 
-       if (tz->trips.critical.flags.valid) {
+       if (tz->trips.critical.valid) {
                *temperature = deci_kelvin_to_millicelsius_with_offset(
                                        tz->trips.critical.temperature,
                                        tz->kelvin_offset);
@@ -700,13 +631,13 @@ static int acpi_thermal_cooling_device_cb(struct thermal_zone_device *thermal,
        int trip = -1;
        int result = 0;
 
-       if (tz->trips.critical.flags.valid)
+       if (tz->trips.critical.valid)
                trip++;
 
-       if (tz->trips.hot.flags.valid)
+       if (tz->trips.hot.valid)
                trip++;
 
-       if (tz->trips.passive.flags.valid) {
+       if (tz->trips.passive.valid) {
                trip++;
                for (i = 0; i < tz->trips.passive.devices.count; i++) {
                        handle = tz->trips.passive.devices.handles[i];
@@ -731,7 +662,7 @@ static int acpi_thermal_cooling_device_cb(struct thermal_zone_device *thermal,
        }
 
        for (i = 0; i < ACPI_THERMAL_MAX_ACTIVE; i++) {
-               if (!tz->trips.active[i].flags.valid)
+               if (!tz->trips.active[i].valid)
                        break;
 
                trip++;
@@ -819,19 +750,19 @@ static int acpi_thermal_register_thermal_zone(struct acpi_thermal *tz)
        acpi_status status;
        int i;
 
-       if (tz->trips.critical.flags.valid)
+       if (tz->trips.critical.valid)
                trips++;
 
-       if (tz->trips.hot.flags.valid)
+       if (tz->trips.hot.valid)
                trips++;
 
-       if (tz->trips.passive.flags.valid)
+       if (tz->trips.passive.valid)
                trips++;
 
-       for (i = 0; i < ACPI_THERMAL_MAX_ACTIVE && tz->trips.active[i].flags.valid;
+       for (i = 0; i < ACPI_THERMAL_MAX_ACTIVE && tz->trips.active[i].valid;
             i++, trips++);
 
-       if (tz->trips.passive.flags.valid)
+       if (tz->trips.passive.valid)
                tz->thermal_zone = thermal_zone_device_register("acpitz", trips, 0, tz,
                                                                &acpi_thermal_zone_ops, NULL,
                                                                tz->trips.passive.tsp * 100,
@@ -906,13 +837,13 @@ static void acpi_thermal_notify(struct acpi_device *device, u32 event)
                acpi_queue_thermal_check(tz);
                break;
        case ACPI_THERMAL_NOTIFY_THRESHOLDS:
-               acpi_thermal_trips_update(tz, ACPI_TRIPS_REFRESH_THRESHOLDS);
+               acpi_thermal_trips_update(tz, ACPI_TRIPS_THRESHOLDS);
                acpi_queue_thermal_check(tz);
                acpi_bus_generate_netlink_event(device->pnp.device_class,
                                                dev_name(&device->dev), event, 0);
                break;
        case ACPI_THERMAL_NOTIFY_DEVICES:
-               acpi_thermal_trips_update(tz, ACPI_TRIPS_REFRESH_DEVICES);
+               acpi_thermal_trips_update(tz, ACPI_TRIPS_DEVICES);
                acpi_queue_thermal_check(tz);
                acpi_bus_generate_netlink_event(device->pnp.device_class,
                                                dev_name(&device->dev), event, 0);
@@ -976,9 +907,8 @@ static int acpi_thermal_get_info(struct acpi_thermal *tz)
                return result;
 
        /* Set the cooling mode [_SCP] to active cooling (default) */
-       result = acpi_thermal_set_cooling_mode(tz, ACPI_THERMAL_MODE_ACTIVE);
-       if (!result)
-               tz->flags.cooling_mode = 1;
+       acpi_execute_simple_method(tz->device->handle, "_SCP",
+                                  ACPI_THERMAL_MODE_ACTIVE);
 
        /* Get default polling frequency [_TZP] (optional) */
        if (tzp)
@@ -1001,7 +931,7 @@ static int acpi_thermal_get_info(struct acpi_thermal *tz)
  */
 static void acpi_thermal_guess_offset(struct acpi_thermal *tz)
 {
-       if (tz->trips.critical.flags.valid &&
+       if (tz->trips.critical.valid &&
            (tz->trips.critical.temperature % 5) == 1)
                tz->kelvin_offset = 273100;
        else
@@ -1110,27 +1040,48 @@ static int acpi_thermal_resume(struct device *dev)
                return -EINVAL;
 
        for (i = 0; i < ACPI_THERMAL_MAX_ACTIVE; i++) {
-               if (!tz->trips.active[i].flags.valid)
+               if (!tz->trips.active[i].valid)
                        break;
 
-               tz->trips.active[i].flags.enabled = 1;
+               tz->trips.active[i].enabled = true;
                for (j = 0; j < tz->trips.active[i].devices.count; j++) {
                        result = acpi_bus_update_power(
                                        tz->trips.active[i].devices.handles[j],
                                        &power_state);
                        if (result || (power_state != ACPI_STATE_D0)) {
-                               tz->trips.active[i].flags.enabled = 0;
+                               tz->trips.active[i].enabled = false;
                                break;
                        }
                }
-               tz->state.active |= tz->trips.active[i].flags.enabled;
        }
 
        acpi_queue_thermal_check(tz);
 
        return AE_OK;
 }
+#else
+#define acpi_thermal_suspend   NULL
+#define acpi_thermal_resume    NULL
 #endif
+static SIMPLE_DEV_PM_OPS(acpi_thermal_pm, acpi_thermal_suspend, acpi_thermal_resume);
+
+static const struct acpi_device_id  thermal_device_ids[] = {
+       {ACPI_THERMAL_HID, 0},
+       {"", 0},
+};
+MODULE_DEVICE_TABLE(acpi, thermal_device_ids);
+
+static struct acpi_driver acpi_thermal_driver = {
+       .name = "thermal",
+       .class = ACPI_THERMAL_CLASS,
+       .ids = thermal_device_ids,
+       .ops = {
+               .add = acpi_thermal_add,
+               .remove = acpi_thermal_remove,
+               .notify = acpi_thermal_notify,
+               },
+       .drv.pm = &acpi_thermal_pm,
+};
 
 static int thermal_act(const struct dmi_system_id *d) {
        if (act == 0) {
@@ -1236,3 +1187,7 @@ static void __exit acpi_thermal_exit(void)
 
 module_init(acpi_thermal_init);
 module_exit(acpi_thermal_exit);
+
+MODULE_AUTHOR("Paul Diefenbaugh");
+MODULE_DESCRIPTION("ACPI Thermal Zone Driver");
+MODULE_LICENSE("GPL");
index 598f548..6353be6 100644 (file)
@@ -19,18 +19,52 @@ static const struct acpi_device_id tiny_power_button_device_ids[] = {
 };
 MODULE_DEVICE_TABLE(acpi, tiny_power_button_device_ids);
 
-static int acpi_noop_add(struct acpi_device *device)
+static void acpi_tiny_power_button_notify(acpi_handle handle, u32 event, void *data)
 {
-       return 0;
+       kill_cad_pid(power_signal, 1);
 }
 
-static void acpi_noop_remove(struct acpi_device *device)
+static void acpi_tiny_power_button_notify_run(void *not_used)
 {
+       acpi_tiny_power_button_notify(NULL, ACPI_FIXED_HARDWARE_EVENT, NULL);
 }
 
-static void acpi_tiny_power_button_notify(struct acpi_device *device, u32 event)
+static u32 acpi_tiny_power_button_event(void *not_used)
 {
-       kill_cad_pid(power_signal, 1);
+       acpi_os_execute(OSL_NOTIFY_HANDLER, acpi_tiny_power_button_notify_run, NULL);
+       return ACPI_INTERRUPT_HANDLED;
+}
+
+static int acpi_tiny_power_button_add(struct acpi_device *device)
+{
+       acpi_status status;
+
+       if (device->device_type == ACPI_BUS_TYPE_POWER_BUTTON) {
+               status = acpi_install_fixed_event_handler(ACPI_EVENT_POWER_BUTTON,
+                                                         acpi_tiny_power_button_event,
+                                                         NULL);
+       } else {
+               status = acpi_install_notify_handler(device->handle,
+                                                    ACPI_DEVICE_NOTIFY,
+                                                    acpi_tiny_power_button_notify,
+                                                    NULL);
+       }
+       if (ACPI_FAILURE(status))
+               return -ENODEV;
+
+       return 0;
+}
+
+static void acpi_tiny_power_button_remove(struct acpi_device *device)
+{
+       if (device->device_type == ACPI_BUS_TYPE_POWER_BUTTON) {
+               acpi_remove_fixed_event_handler(ACPI_EVENT_POWER_BUTTON,
+                                               acpi_tiny_power_button_event);
+       } else {
+               acpi_remove_notify_handler(device->handle, ACPI_DEVICE_NOTIFY,
+                                          acpi_tiny_power_button_notify);
+       }
+       acpi_os_wait_events_complete();
 }
 
 static struct acpi_driver acpi_tiny_power_button_driver = {
@@ -38,9 +72,8 @@ static struct acpi_driver acpi_tiny_power_button_driver = {
        .class = "tiny-power-button",
        .ids = tiny_power_button_device_ids,
        .ops = {
-               .add = acpi_noop_add,
-               .remove = acpi_noop_remove,
-               .notify = acpi_tiny_power_button_notify,
+               .add = acpi_tiny_power_button_add,
+               .remove = acpi_tiny_power_button_remove,
        },
 };
 
index bcc25d4..18cc08c 100644 (file)
@@ -471,6 +471,22 @@ static const struct dmi_system_id video_detect_dmi_table[] = {
                },
        },
        {
+        .callback = video_detect_force_native,
+        /* Lenovo ThinkPad X131e (3371 AMD version) */
+        .matches = {
+               DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
+               DMI_MATCH(DMI_PRODUCT_NAME, "3371"),
+               },
+       },
+       {
+        .callback = video_detect_force_native,
+        /* Apple iMac11,3 */
+        .matches = {
+               DMI_MATCH(DMI_SYS_VENDOR, "Apple Inc."),
+               DMI_MATCH(DMI_PRODUCT_NAME, "iMac11,3"),
+               },
+       },
+       {
         /* https://bugzilla.redhat.com/show_bug.cgi?id=1217249 */
         .callback = video_detect_force_native,
         /* Apple MacBook Pro 12,1 */
@@ -514,6 +530,14 @@ static const struct dmi_system_id video_detect_dmi_table[] = {
        },
        {
         .callback = video_detect_force_native,
+        /* Dell Studio 1569 */
+        .matches = {
+               DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+               DMI_MATCH(DMI_PRODUCT_NAME, "Studio 1569"),
+               },
+       },
+       {
+        .callback = video_detect_force_native,
         /* Acer Aspire 3830TG */
         .matches = {
                DMI_MATCH(DMI_SYS_VENDOR, "Acer"),
@@ -828,6 +852,27 @@ enum acpi_backlight_type __acpi_video_get_backlight_type(bool native, bool *auto
        if (native_available)
                return acpi_backlight_native;
 
+       /*
+        * The vendor specific BIOS interfaces are only necessary for
+        * laptops from before ~2008.
+        *
+        * For laptops from ~2008 till ~2023 this point is never reached, since
+        * on those the (video_caps & ACPI_VIDEO_BACKLIGHT) check above is true.
+        *
+        * Laptops from after ~2023 no longer support ACPI_VIDEO_BACKLIGHT; if
+        * this point is reached on those, it likely means that the GPU kms
+        * driver which sets native_available has not loaded yet.
+        *
+        * Returning acpi_backlight_vendor in this case is known to sometimes
+        * cause a non working vendor specific /sys/class/backlight device to
+        * get registered.
+        *
+        * Return acpi_backlight_none on laptops with ACPI tables written
+        * for Windows 8 (laptops from after ~2012) to avoid this problem.
+        */
+       if (acpi_osi_is_win8())
+               return acpi_backlight_none;
+
        /* No ACPI video/native (old hw), use vendor specific fw methods. */
        return acpi_backlight_vendor;
 }
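
The video_detect.c changes above extend DMI quirk tables, where each entry lists substrings that must all be contained in the firmware identification strings; a match makes its callback record the forced backlight type. The following is a toy user-space model of that table walk, simplified to "first match wins" with invented fields; dmi_check_system() does the real matching in the kernel.

#include <stdio.h>
#include <string.h>

struct quirk {
        const char *vendor;     /* substring required in DMI_SYS_VENDOR */
        const char *product;    /* substring required in DMI_PRODUCT_NAME */
        int backlight_type;     /* stand-in for .callback / .driver_data */
};

static const struct quirk quirks[] = {
        { "Apple Inc.", "iMac11,3",    1 },     /* force native */
        { "Dell Inc.",  "Studio 1569", 1 },     /* force native */
        { NULL, NULL, 0 }                       /* terminator, like the {} entry */
};

static int check_quirks(const char *vendor, const char *product)
{
        for (const struct quirk *q = quirks; q->vendor; q++) {
                if (strstr(vendor, q->vendor) && strstr(product, q->product))
                        return q->backlight_type;       /* first hit wins */
        }
        return 0;                                       /* no quirk applies */
}

int main(void)
{
        printf("quirk result: %d\n", check_quirks("Apple Inc.", "iMac11,3"));
        return 0;
}
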
index e499c60..ce62e61 100644 (file)
@@ -59,6 +59,7 @@ static int lps0_dsm_func_mask;
 
 static guid_t lps0_dsm_guid_microsoft;
 static int lps0_dsm_func_mask_microsoft;
+static int lps0_dsm_state;
 
 /* Device constraint entry structure */
 struct lpi_device_info {
@@ -320,6 +321,44 @@ static void lpi_check_constraints(void)
        }
 }
 
+static bool acpi_s2idle_vendor_amd(void)
+{
+       return boot_cpu_data.x86_vendor == X86_VENDOR_AMD;
+}
+
+static const char *acpi_sleep_dsm_state_to_str(unsigned int state)
+{
+       if (lps0_dsm_func_mask_microsoft || !acpi_s2idle_vendor_amd()) {
+               switch (state) {
+               case ACPI_LPS0_SCREEN_OFF:
+                       return "screen off";
+               case ACPI_LPS0_SCREEN_ON:
+                       return "screen on";
+               case ACPI_LPS0_ENTRY:
+                       return "lps0 entry";
+               case ACPI_LPS0_EXIT:
+                       return "lps0 exit";
+               case ACPI_LPS0_MS_ENTRY:
+                       return "lps0 ms entry";
+               case ACPI_LPS0_MS_EXIT:
+                       return "lps0 ms exit";
+               }
+       } else {
+               switch (state) {
+               case ACPI_LPS0_SCREEN_ON_AMD:
+                       return "screen on";
+               case ACPI_LPS0_SCREEN_OFF_AMD:
+                       return "screen off";
+               case ACPI_LPS0_ENTRY_AMD:
+                       return "lps0 entry";
+               case ACPI_LPS0_EXIT_AMD:
+                       return "lps0 exit";
+               }
+       }
+
+       return "unknown";
+}
+
 static void acpi_sleep_run_lps0_dsm(unsigned int func, unsigned int func_mask, guid_t dsm_guid)
 {
        union acpi_object *out_obj;
@@ -331,14 +370,15 @@ static void acpi_sleep_run_lps0_dsm(unsigned int func, unsigned int func_mask, g
                                        rev_id, func, NULL);
        ACPI_FREE(out_obj);
 
-       acpi_handle_debug(lps0_device_handle, "_DSM function %u evaluation %s\n",
-                         func, out_obj ? "successful" : "failed");
+       lps0_dsm_state = func;
+       if (pm_debug_messages_on) {
+               acpi_handle_info(lps0_device_handle,
+                               "%s transitioned to state %s\n",
+                                out_obj ? "Successfully" : "Failed to",
+                                acpi_sleep_dsm_state_to_str(lps0_dsm_state));
+       }
 }
 
-static bool acpi_s2idle_vendor_amd(void)
-{
-       return boot_cpu_data.x86_vendor == X86_VENDOR_AMD;
-}
 
 static int validate_dsm(acpi_handle handle, const char *uuid, int rev, guid_t *dsm_guid)
 {
@@ -485,11 +525,11 @@ int acpi_s2idle_prepare_late(void)
                                        ACPI_LPS0_ENTRY,
                                        lps0_dsm_func_mask, lps0_dsm_guid);
        if (lps0_dsm_func_mask_microsoft > 0) {
-               acpi_sleep_run_lps0_dsm(ACPI_LPS0_ENTRY,
-                               lps0_dsm_func_mask_microsoft, lps0_dsm_guid_microsoft);
                /* modern standby entry */
                acpi_sleep_run_lps0_dsm(ACPI_LPS0_MS_ENTRY,
                                lps0_dsm_func_mask_microsoft, lps0_dsm_guid_microsoft);
+               acpi_sleep_run_lps0_dsm(ACPI_LPS0_ENTRY,
+                               lps0_dsm_func_mask_microsoft, lps0_dsm_guid_microsoft);
        }
 
        list_for_each_entry(handler, &lps0_s2idle_devops_head, list_node) {
@@ -524,11 +564,6 @@ void acpi_s2idle_restore_early(void)
                if (handler->restore)
                        handler->restore();
 
-       /* Modern standby exit */
-       if (lps0_dsm_func_mask_microsoft > 0)
-               acpi_sleep_run_lps0_dsm(ACPI_LPS0_MS_EXIT,
-                               lps0_dsm_func_mask_microsoft, lps0_dsm_guid_microsoft);
-
        /* LPS0 exit */
        if (lps0_dsm_func_mask > 0)
                acpi_sleep_run_lps0_dsm(acpi_s2idle_vendor_amd() ?
@@ -539,6 +574,11 @@ void acpi_s2idle_restore_early(void)
                acpi_sleep_run_lps0_dsm(ACPI_LPS0_EXIT,
                                lps0_dsm_func_mask_microsoft, lps0_dsm_guid_microsoft);
 
+       /* Modern standby exit */
+       if (lps0_dsm_func_mask_microsoft > 0)
+               acpi_sleep_run_lps0_dsm(ACPI_LPS0_MS_EXIT,
+                               lps0_dsm_func_mask_microsoft, lps0_dsm_guid_microsoft);
+
        /* Screen on */
        if (lps0_dsm_func_mask_microsoft > 0)
                acpi_sleep_run_lps0_dsm(ACPI_LPS0_SCREEN_ON,
index 9c2d6f3..c2b925f 100644 (file)
@@ -259,10 +259,11 @@ bool force_storage_d3(void)
  * drivers/platform/x86/x86-android-tablets.c kernel module.
  */
 #define ACPI_QUIRK_SKIP_I2C_CLIENTS                            BIT(0)
-#define ACPI_QUIRK_UART1_TTY_UART2_SKIP                                BIT(1)
-#define ACPI_QUIRK_SKIP_ACPI_AC_AND_BATTERY                    BIT(2)
-#define ACPI_QUIRK_USE_ACPI_AC_AND_BATTERY                     BIT(3)
-#define ACPI_QUIRK_SKIP_GPIO_EVENT_HANDLERS                    BIT(4)
+#define ACPI_QUIRK_UART1_SKIP                                  BIT(1)
+#define ACPI_QUIRK_UART1_TTY_UART2_SKIP                                BIT(2)
+#define ACPI_QUIRK_SKIP_ACPI_AC_AND_BATTERY                    BIT(3)
+#define ACPI_QUIRK_USE_ACPI_AC_AND_BATTERY                     BIT(4)
+#define ACPI_QUIRK_SKIP_GPIO_EVENT_HANDLERS                    BIT(5)
 
 static const struct dmi_system_id acpi_quirk_skip_dmi_ids[] = {
        /*
@@ -319,6 +320,7 @@ static const struct dmi_system_id acpi_quirk_skip_dmi_ids[] = {
                        DMI_EXACT_MATCH(DMI_PRODUCT_VERSION, "YETI-11"),
                },
                .driver_data = (void *)(ACPI_QUIRK_SKIP_I2C_CLIENTS |
+                                       ACPI_QUIRK_UART1_SKIP |
                                        ACPI_QUIRK_SKIP_ACPI_AC_AND_BATTERY |
                                        ACPI_QUIRK_SKIP_GPIO_EVENT_HANDLERS),
        },
@@ -365,7 +367,7 @@ static const struct dmi_system_id acpi_quirk_skip_dmi_ids[] = {
                                        ACPI_QUIRK_SKIP_ACPI_AC_AND_BATTERY),
        },
        {
-               /* Nextbook Ares 8 */
+               /* Nextbook Ares 8 (BYT version) */
                .matches = {
                        DMI_MATCH(DMI_SYS_VENDOR, "Insyde"),
                        DMI_MATCH(DMI_PRODUCT_NAME, "M890BAP"),
@@ -375,6 +377,16 @@ static const struct dmi_system_id acpi_quirk_skip_dmi_ids[] = {
                                        ACPI_QUIRK_SKIP_GPIO_EVENT_HANDLERS),
        },
        {
+               /* Nextbook Ares 8A (CHT version) */
+               .matches = {
+                       DMI_MATCH(DMI_SYS_VENDOR, "Insyde"),
+                       DMI_MATCH(DMI_PRODUCT_NAME, "CherryTrail"),
+                       DMI_MATCH(DMI_BIOS_VERSION, "M882"),
+               },
+               .driver_data = (void *)(ACPI_QUIRK_SKIP_I2C_CLIENTS |
+                                       ACPI_QUIRK_SKIP_ACPI_AC_AND_BATTERY),
+       },
+       {
                /* Whitelabel (sold as various brands) TM800A550L */
                .matches = {
                        DMI_MATCH(DMI_BOARD_VENDOR, "AMI Corporation"),
@@ -392,6 +404,7 @@ static const struct dmi_system_id acpi_quirk_skip_dmi_ids[] = {
 #if IS_ENABLED(CONFIG_X86_ANDROID_TABLETS)
 static const struct acpi_device_id i2c_acpi_known_good_ids[] = {
        { "10EC5640", 0 }, /* RealTek ALC5640 audio codec */
+       { "10EC5651", 0 }, /* RealTek ALC5651 audio codec */
        { "INT33F4", 0 },  /* X-Powers AXP288 PMIC */
        { "INT33FD", 0 },  /* Intel Crystal Cove PMIC */
        { "INT34D3", 0 },  /* Intel Whiskey Cove PMIC */
@@ -438,6 +451,9 @@ int acpi_quirk_skip_serdev_enumeration(struct device *controller_parent, bool *s
        if (dmi_id)
                quirks = (unsigned long)dmi_id->driver_data;
 
+       if ((quirks & ACPI_QUIRK_UART1_SKIP) && uid == 1)
+               *skip = true;
+
        if (quirks & ACPI_QUIRK_UART1_TTY_UART2_SKIP) {
                if (uid == 1)
                        return -ENODEV; /* Create tty cdev instead of serdev */
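
A rough sketch of how a serial/serdev controller driver is expected to consume this helper; the surrounding caller is assumed, only acpi_quirk_skip_serdev_enumeration() itself comes from this file, and the exact error handling differs per caller:

    bool skip = false;
    int ret;

    ret = acpi_quirk_skip_serdev_enumeration(ctrl_dev->parent, &skip);
    if (ret)
            return ret;     /* e.g. -ENODEV: create a tty cdev instead of a serdev */
    if (skip)
            return 0;       /* broken ACPI-enumerated client, don't register serdev */
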
index 0242599..d44814b 100644 (file)
@@ -820,7 +820,7 @@ static const struct of_device_id ht16k33_of_match[] = {
 MODULE_DEVICE_TABLE(of, ht16k33_of_match);
 
 static struct i2c_driver ht16k33_driver = {
-       .probe_new      = ht16k33_probe,
+       .probe          = ht16k33_probe,
        .remove         = ht16k33_remove,
        .driver         = {
                .name           = DRIVER_NAME,
index 135831a..6422be0 100644 (file)
@@ -365,7 +365,7 @@ static struct i2c_driver lcd2s_i2c_driver = {
                .name = "lcd2s",
                .of_match_table = lcd2s_of_table,
        },
-       .probe_new = lcd2s_i2c_probe,
+       .probe = lcd2s_i2c_probe,
        .remove = lcd2s_i2c_remove,
        .id_table = lcd2s_i2c_id,
 };
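
Both conversions above depend on the single-argument I2C probe prototype that .probe now uses; a sketch of the expected shape (body elided):

    /* New-style probe: the const struct i2c_device_id * argument is gone. */
    static int lcd2s_i2c_probe(struct i2c_client *client);
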
index 9c09ca5..878aa76 100644 (file)
@@ -751,14 +751,12 @@ static int really_probe_debug(struct device *dev, struct device_driver *drv)
  *
  * Should somehow figure out how to use a semaphore, not an atomic variable...
  */
-int driver_probe_done(void)
+bool __init driver_probe_done(void)
 {
        int local_probe_count = atomic_read(&probe_count);
 
        pr_debug("%s: probe_count = %d\n", __func__, local_probe_count);
-       if (local_probe_count)
-               return -EBUSY;
-       return 0;
+       return !local_probe_count;
 }
 
 /**
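
Callers now treat the return value as a plain boolean instead of checking for -EBUSY, and the __init annotation restricts them to early boot code; a sketch of the typical wait loop (the caller shown here is assumed, not part of this hunk):

    #include <linux/delay.h>

    /* From __init code: spin until all outstanding driver probes finished. */
    while (!driver_probe_done())
            msleep(100);
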
index 5c998cf..3df0025 100644 (file)
@@ -29,10 +29,10 @@ struct devres {
         * Some archs want to perform DMA into kmalloc caches
         * and need a guaranteed alignment larger than
         * the alignment of a 64-bit integer.
-        * Thus we use ARCH_KMALLOC_MINALIGN here and get exactly the same
-        * buffer alignment as if it was allocated by plain kmalloc().
+        * Thus we use ARCH_DMA_MINALIGN for data[] which will force the same
+        * alignment for struct devres when allocated by kmalloc().
         */
-       u8 __aligned(ARCH_KMALLOC_MINALIGN) data[];
+       u8 __aligned(ARCH_DMA_MINALIGN) data[];
 };
 
 struct devres_group {
index b46db17..6559759 100644 (file)
@@ -449,6 +449,9 @@ static ssize_t node_read_meminfo(struct device *dev,
                             "Node %d FileHugePages: %8lu kB\n"
                             "Node %d FilePmdMapped: %8lu kB\n"
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+                            "Node %d Unaccepted:     %8lu kB\n"
+#endif
                             ,
                             nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
                             nid, K(node_page_state(pgdat, NR_WRITEBACK)),
@@ -478,6 +481,10 @@ static ssize_t node_read_meminfo(struct device *dev,
                             nid, K(node_page_state(pgdat, NR_FILE_THPS)),
                             nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED))
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+                            ,
+                            nid, K(sum_zone_node_page_state(nid, NR_UNACCEPTED))
+#endif
                            );
        len += hugetlb_report_node_meminfo(buf, len, nid);
        return len;
index 32084e3..5cb2023 100644 (file)
@@ -1632,9 +1632,6 @@ static int genpd_add_device(struct generic_pm_domain *genpd, struct device *dev,
 
        dev_dbg(dev, "%s()\n", __func__);
 
-       if (IS_ERR_OR_NULL(genpd) || IS_ERR_OR_NULL(dev))
-               return -EINVAL;
-
        gpd_data = genpd_alloc_dev_data(dev, gd);
        if (IS_ERR(gpd_data))
                return PTR_ERR(gpd_data);
@@ -1676,6 +1673,9 @@ int pm_genpd_add_device(struct generic_pm_domain *genpd, struct device *dev)
 {
        int ret;
 
+       if (!genpd || !dev)
+               return -EINVAL;
+
        mutex_lock(&gpd_list_lock);
        ret = genpd_add_device(genpd, dev, dev);
        mutex_unlock(&gpd_list_lock);
@@ -2523,6 +2523,9 @@ int of_genpd_add_device(struct of_phandle_args *genpdspec, struct device *dev)
        struct generic_pm_domain *genpd;
        int ret;
 
+       if (!dev)
+               return -EINVAL;
+
        mutex_lock(&gpd_list_lock);
 
        genpd = genpd_get_from_provider(genpdspec);
@@ -2939,10 +2942,10 @@ static int genpd_parse_state(struct genpd_power_state *genpd_state,
 
        err = of_property_read_u32(state_node, "min-residency-us", &residency);
        if (!err)
-               genpd_state->residency_ns = 1000 * residency;
+               genpd_state->residency_ns = 1000LL * residency;
 
-       genpd_state->power_on_latency_ns = 1000 * exit_latency;
-       genpd_state->power_off_latency_ns = 1000 * entry_latency;
+       genpd_state->power_on_latency_ns = 1000LL * exit_latency;
+       genpd_state->power_off_latency_ns = 1000LL * entry_latency;
        genpd_state->fwnode = &state_node->fwnode;
 
        return 0;
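
The LL suffix matters because the latency and residency values are read into 32-bit variables, so the old expressions multiplied in 32-bit arithmetic before widening to s64; a sketch with an illustrative value:

    u32 residency_us = 5 * 1000 * 1000;    /* 5 s, a plausible deep idle state */
    s64 wrapped = 1000 * residency_us;     /* 32-bit multiply, wraps modulo 2^32 */
    s64 correct = 1000LL * residency_us;   /* widened to 64 bits before multiplying */
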
index 7cc0c0c..a917219 100644 (file)
 
 #include "power.h"
 
-#ifndef CONFIG_SUSPEND
-suspend_state_t pm_suspend_target_state;
-#define pm_suspend_target_state        (PM_SUSPEND_ON)
-#endif
-
 #define list_for_each_entry_rcu_locked(pos, head, member) \
        list_for_each_entry_rcu(pos, head, member, \
                srcu_read_lock_held(&wakeup_srcu))
index f6c6cb0..5fdd084 100644 (file)
@@ -8,7 +8,7 @@ obj-$(CONFIG_DEBUG_FS) += regmap-debugfs.o
 obj-$(CONFIG_REGMAP_KUNIT) += regmap-kunit.o
 obj-$(CONFIG_REGMAP_AC97) += regmap-ac97.o
 obj-$(CONFIG_REGMAP_I2C) += regmap-i2c.o
-obj-$(CONFIG_REGMAP_RAM) += regmap-ram.o
+obj-$(CONFIG_REGMAP_RAM) += regmap-ram.o regmap-raw-ram.o
 obj-$(CONFIG_REGMAP_SLIMBUS) += regmap-slimbus.o
 obj-$(CONFIG_REGMAP_SPI) += regmap-spi.o
 obj-$(CONFIG_REGMAP_SPMI) += regmap-spmi.o
index 9bd0dfd..9a9ea51 100644 (file)
@@ -125,6 +125,9 @@ struct regmap {
        int reg_stride;
        int reg_stride_order;
 
+       /* If set, will always write field to HW. */
+       bool force_write_field;
+
        /* regcache specific members */
        const struct regcache_ops *cache_ops;
        enum regcache_type cache_type;
@@ -257,6 +260,8 @@ int regcache_sync_block(struct regmap *map, void *block,
                        unsigned long *cache_present,
                        unsigned int block_base, unsigned int start,
                        unsigned int end);
+bool regcache_reg_needs_sync(struct regmap *map, unsigned int reg,
+                            unsigned int val);
 
 static inline const void *regcache_get_val_addr(struct regmap *map,
                                                const void *base,
@@ -267,7 +272,7 @@ static inline const void *regcache_get_val_addr(struct regmap *map,
 
 unsigned int regcache_get_val(struct regmap *map, const void *base,
                              unsigned int idx);
-bool regcache_set_val(struct regmap *map, void *base, unsigned int idx,
+void regcache_set_val(struct regmap *map, void *base, unsigned int idx,
                      unsigned int val);
 int regcache_lookup_reg(struct regmap *map, unsigned int reg);
 int regcache_sync_val(struct regmap *map, unsigned int reg, unsigned int val);
@@ -312,6 +317,7 @@ struct regmap_ram_data {
        unsigned int *vals;  /* Allocated by caller */
        bool *read;
        bool *written;
+       enum regmap_endian reg_endian;
 };
 
 /*
@@ -326,5 +332,12 @@ struct regmap *__regmap_init_ram(const struct regmap_config *config,
 #define regmap_init_ram(config, data)                                  \
        __regmap_lockdep_wrapper(__regmap_init_ram, #config, config, data)
 
+struct regmap *__regmap_init_raw_ram(const struct regmap_config *config,
+                                    struct regmap_ram_data *data,
+                                    struct lock_class_key *lock_key,
+                                    const char *lock_name);
+
+#define regmap_init_raw_ram(config, data)                              \
+       __regmap_lockdep_wrapper(__regmap_init_raw_ram, #config, config, data)
 
 #endif
index c2e3a0f..283c2e0 100644 (file)
@@ -186,6 +186,55 @@ out_unlocked:
        return ret;
 }
 
+static int regcache_maple_sync_block(struct regmap *map, unsigned long *entry,
+                                    struct ma_state *mas,
+                                    unsigned int min, unsigned int max)
+{
+       void *buf;
+       unsigned long r;
+       size_t val_bytes = map->format.val_bytes;
+       int ret = 0;
+
+       mas_pause(mas);
+       rcu_read_unlock();
+
+       /*
+        * Use a raw write if writing more than one register to a
+        * device that supports raw writes to reduce transaction
+        * overheads.
+        */
+       if (max - min > 1 && regmap_can_raw_write(map)) {
+               buf = kmalloc(val_bytes * (max - min), map->alloc_flags);
+               if (!buf) {
+                       ret = -ENOMEM;
+                       goto out;
+               }
+
+               /* Render the data for a raw write */
+               for (r = min; r < max; r++) {
+                       regcache_set_val(map, buf, r - min,
+                                        entry[r - mas->index]);
+               }
+
+               ret = _regmap_raw_write(map, min, buf, (max - min) * val_bytes,
+                                       false);
+
+               kfree(buf);
+       } else {
+               for (r = min; r < max; r++) {
+                       ret = _regmap_write(map, r,
+                                           entry[r - mas->index]);
+                       if (ret != 0)
+                               goto out;
+               }
+       }
+
+out:
+       rcu_read_lock();
+
+       return ret;
+}
+
 static int regcache_maple_sync(struct regmap *map, unsigned int min,
                               unsigned int max)
 {
@@ -194,8 +243,9 @@ static int regcache_maple_sync(struct regmap *map, unsigned int min,
        MA_STATE(mas, mt, min, max);
        unsigned long lmin = min;
        unsigned long lmax = max;
-       unsigned int r;
+       unsigned int r, v, sync_start;
        int ret;
+       bool sync_needed = false;
 
        map->cache_bypass = true;
 
@@ -203,18 +253,38 @@ static int regcache_maple_sync(struct regmap *map, unsigned int min,
 
        mas_for_each(&mas, entry, max) {
                for (r = max(mas.index, lmin); r <= min(mas.last, lmax); r++) {
-                       mas_pause(&mas);
-                       rcu_read_unlock();
-                       ret = regcache_sync_val(map, r, entry[r - mas.index]);
+                       v = entry[r - mas.index];
+
+                       if (regcache_reg_needs_sync(map, r, v)) {
+                               if (!sync_needed) {
+                                       sync_start = r;
+                                       sync_needed = true;
+                               }
+                               continue;
+                       }
+
+                       if (!sync_needed)
+                               continue;
+
+                       ret = regcache_maple_sync_block(map, entry, &mas,
+                                                       sync_start, r);
+                       if (ret != 0)
+                               goto out;
+                       sync_needed = false;
+               }
+
+               if (sync_needed) {
+                       ret = regcache_maple_sync_block(map, entry, &mas,
+                                                       sync_start, r);
                        if (ret != 0)
                                goto out;
-                       rcu_read_lock();
+                       sync_needed = false;
                }
        }
 
+out:
        rcu_read_unlock();
 
-out:
        map->cache_bypass = false;
 
        return ret;
@@ -242,11 +312,41 @@ static int regcache_maple_exit(struct regmap *map)
        return 0;
 }
 
+static int regcache_maple_insert_block(struct regmap *map, int first,
+                                       int last)
+{
+       struct maple_tree *mt = map->cache;
+       MA_STATE(mas, mt, first, last);
+       unsigned long *entry;
+       int i, ret;
+
+       entry = kcalloc(last - first + 1, sizeof(unsigned long), GFP_KERNEL);
+       if (!entry)
+               return -ENOMEM;
+
+       for (i = 0; i < last - first + 1; i++)
+               entry[i] = map->reg_defaults[first + i].def;
+
+       mas_lock(&mas);
+
+       mas_set_range(&mas, map->reg_defaults[first].reg,
+                     map->reg_defaults[last].reg);
+       ret = mas_store_gfp(&mas, entry, GFP_KERNEL);
+
+       mas_unlock(&mas);
+
+       if (ret)
+               kfree(entry);
+
+       return ret;
+}
+
 static int regcache_maple_init(struct regmap *map)
 {
        struct maple_tree *mt;
        int i;
        int ret;
+       int range_start;
 
        mt = kmalloc(sizeof(*mt), GFP_KERNEL);
        if (!mt)
@@ -255,14 +355,30 @@ static int regcache_maple_init(struct regmap *map)
 
        mt_init(mt);
 
-       for (i = 0; i < map->num_reg_defaults; i++) {
-               ret = regcache_maple_write(map,
-                                          map->reg_defaults[i].reg,
-                                          map->reg_defaults[i].def);
-               if (ret)
-                       goto err;
+       if (!map->num_reg_defaults)
+               return 0;
+
+       range_start = 0;
+
+       /* Scan for ranges of contiguous registers */
+       for (i = 1; i < map->num_reg_defaults; i++) {
+               if (map->reg_defaults[i].reg !=
+                   map->reg_defaults[i - 1].reg + 1) {
+                       ret = regcache_maple_insert_block(map, range_start,
+                                                         i - 1);
+                       if (ret != 0)
+                               goto err;
+
+                       range_start = i;
+               }
        }
 
+       /* Add the last block */
+       ret = regcache_maple_insert_block(map, range_start,
+                                         map->num_reg_defaults - 1);
+       if (ret != 0)
+               goto err;
+
        return 0;
 
 err:
index 97c681f..28bc3ae 100644 (file)
@@ -279,8 +279,8 @@ int regcache_write(struct regmap *map,
        return 0;
 }
 
-static bool regcache_reg_needs_sync(struct regmap *map, unsigned int reg,
-                                   unsigned int val)
+bool regcache_reg_needs_sync(struct regmap *map, unsigned int reg,
+                            unsigned int val)
 {
        int ret;
 
@@ -561,17 +561,14 @@ void regcache_cache_bypass(struct regmap *map, bool enable)
 }
 EXPORT_SYMBOL_GPL(regcache_cache_bypass);
 
-bool regcache_set_val(struct regmap *map, void *base, unsigned int idx,
+void regcache_set_val(struct regmap *map, void *base, unsigned int idx,
                      unsigned int val)
 {
-       if (regcache_get_val(map, base, idx) == val)
-               return true;
-
        /* Use device native format if possible */
        if (map->format.format_val) {
                map->format.format_val(base + (map->cache_word_size * idx),
                                       val, 0);
-               return false;
+               return;
        }
 
        switch (map->cache_word_size) {
@@ -604,7 +601,6 @@ bool regcache_set_val(struct regmap *map, void *base, unsigned int idx,
        default:
                BUG();
        }
-       return false;
 }
 
 unsigned int regcache_get_val(struct regmap *map, const void *base,
index c491fab..f360275 100644 (file)
@@ -636,6 +636,17 @@ void regmap_debugfs_init(struct regmap *map)
                                    &regmap_cache_bypass_fops);
        }
 
+       /*
+        * This could interfere with driver operation. Therefore, don't provide
+        * any real compile time configuration option for this feature. One will
+        * have to modify the source code directly in order to use it.
+        */
+#undef REGMAP_ALLOW_FORCE_WRITE_FIELD_DEBUGFS
+#ifdef REGMAP_ALLOW_FORCE_WRITE_FIELD_DEBUGFS
+       debugfs_create_bool("force_write_field", 0600, map->debugfs,
+                           &map->force_write_field);
+#endif
+
        next = rb_first(&map->range_tree);
        while (next) {
                range_node = rb_entry(next, struct regmap_range_node, node);
index b99bb23..ced0dcf 100644 (file)
@@ -30,9 +30,6 @@ struct regmap_irq_chip_data {
        int irq;
        int wake_count;
 
-       unsigned int mask_base;
-       unsigned int unmask_base;
-
        void *status_reg_buf;
        unsigned int *main_status_buf;
        unsigned int *status_buf;
@@ -41,7 +38,6 @@ struct regmap_irq_chip_data {
        unsigned int *wake_buf;
        unsigned int *type_buf;
        unsigned int *type_buf_def;
-       unsigned int **virt_buf;
        unsigned int **config_buf;
 
        unsigned int irq_reg_stride;
@@ -114,25 +110,22 @@ static void regmap_irq_sync_unlock(struct irq_data *data)
         * suppress pointless writes.
         */
        for (i = 0; i < d->chip->num_regs; i++) {
-               if (d->mask_base) {
-                       if (d->chip->handle_mask_sync)
-                               d->chip->handle_mask_sync(d->map, i,
-                                                         d->mask_buf_def[i],
-                                                         d->mask_buf[i],
-                                                         d->chip->irq_drv_data);
-                       else {
-                               reg = d->get_irq_reg(d, d->mask_base, i);
-                               ret = regmap_update_bits(d->map, reg,
-                                               d->mask_buf_def[i],
-                                               d->mask_buf[i]);
-                               if (ret)
-                                       dev_err(d->map->dev, "Failed to sync masks in %x\n",
-                                               reg);
-                       }
+               if (d->chip->handle_mask_sync)
+                       d->chip->handle_mask_sync(i, d->mask_buf_def[i],
+                                                 d->mask_buf[i],
+                                                 d->chip->irq_drv_data);
+
+               if (d->chip->mask_base && !d->chip->handle_mask_sync) {
+                       reg = d->get_irq_reg(d, d->chip->mask_base, i);
+                       ret = regmap_update_bits(d->map, reg,
+                                                d->mask_buf_def[i],
+                                                d->mask_buf[i]);
+                       if (ret)
+                               dev_err(d->map->dev, "Failed to sync masks in %x\n", reg);
                }
 
-               if (d->unmask_base) {
-                       reg = d->get_irq_reg(d, d->unmask_base, i);
+               if (d->chip->unmask_base && !d->chip->handle_mask_sync) {
+                       reg = d->get_irq_reg(d, d->chip->unmask_base, i);
                        ret = regmap_update_bits(d->map, reg,
                                        d->mask_buf_def[i], ~d->mask_buf[i]);
                        if (ret)
@@ -183,34 +176,6 @@ static void regmap_irq_sync_unlock(struct irq_data *data)
                }
        }
 
-       /* Don't update the type bits if we're using mask bits for irq type. */
-       if (!d->chip->type_in_mask) {
-               for (i = 0; i < d->chip->num_type_reg; i++) {
-                       if (!d->type_buf_def[i])
-                               continue;
-                       reg = d->get_irq_reg(d, d->chip->type_base, i);
-                       ret = regmap_update_bits(d->map, reg,
-                                                d->type_buf_def[i], d->type_buf[i]);
-                       if (ret != 0)
-                               dev_err(d->map->dev, "Failed to sync type in %x\n",
-                                       reg);
-               }
-       }
-
-       if (d->chip->num_virt_regs) {
-               for (i = 0; i < d->chip->num_virt_regs; i++) {
-                       for (j = 0; j < d->chip->num_regs; j++) {
-                               reg = d->get_irq_reg(d, d->chip->virt_reg_base[i],
-                                                    j);
-                               ret = regmap_write(map, reg, d->virt_buf[i][j]);
-                               if (ret != 0)
-                                       dev_err(d->map->dev,
-                                               "Failed to write virt 0x%x: %d\n",
-                                               reg, ret);
-                       }
-               }
-       }
-
        for (i = 0; i < d->chip->num_config_bases; i++) {
                for (j = 0; j < d->chip->num_config_regs; j++) {
                        reg = d->get_irq_reg(d, d->chip->config_base[i], j);
@@ -289,41 +254,9 @@ static int regmap_irq_set_type(struct irq_data *data, unsigned int type)
 
        reg = t->type_reg_offset / map->reg_stride;
 
-       if (t->type_reg_mask)
-               d->type_buf[reg] &= ~t->type_reg_mask;
-       else
-               d->type_buf[reg] &= ~(t->type_falling_val |
-                                     t->type_rising_val |
-                                     t->type_level_low_val |
-                                     t->type_level_high_val);
-       switch (type) {
-       case IRQ_TYPE_EDGE_FALLING:
-               d->type_buf[reg] |= t->type_falling_val;
-               break;
-
-       case IRQ_TYPE_EDGE_RISING:
-               d->type_buf[reg] |= t->type_rising_val;
-               break;
-
-       case IRQ_TYPE_EDGE_BOTH:
-               d->type_buf[reg] |= (t->type_falling_val |
-                                       t->type_rising_val);
-               break;
-
-       case IRQ_TYPE_LEVEL_HIGH:
-               d->type_buf[reg] |= t->type_level_high_val;
-               break;
-
-       case IRQ_TYPE_LEVEL_LOW:
-               d->type_buf[reg] |= t->type_level_low_val;
-               break;
-       default:
-               return -EINVAL;
-       }
-
-       if (d->chip->set_type_virt) {
-               ret = d->chip->set_type_virt(d->virt_buf, type, data->hwirq,
-                                            reg);
+       if (d->chip->type_in_mask) {
+               ret = regmap_irq_set_type_config_simple(&d->type_buf, type,
+                                                       irq_data, reg, d->chip->irq_drv_data);
                if (ret)
                        return ret;
        }
@@ -390,15 +323,8 @@ static inline int read_sub_irq_data(struct regmap_irq_chip_data *data,
                        unsigned int offset = subreg->offset[i];
                        unsigned int index = offset / map->reg_stride;
 
-                       if (chip->not_fixed_stride)
-                               ret = regmap_read(map,
-                                               chip->status_base + offset,
-                                               &data->status_buf[b]);
-                       else
-                               ret = regmap_read(map,
-                                               chip->status_base + offset,
-                                               &data->status_buf[index]);
-
+                       ret = regmap_read(map, chip->status_base + offset,
+                                         &data->status_buf[index]);
                        if (ret)
                                break;
                }
@@ -453,17 +379,7 @@ static irqreturn_t regmap_irq_thread(int irq, void *d)
                 * sake of simplicity. and add bulk reads only if needed
                 */
                for (i = 0; i < chip->num_main_regs; i++) {
-                       /*
-                        * For not_fixed_stride, don't use ->get_irq_reg().
-                        * It would produce an incorrect result.
-                        */
-                       if (data->chip->not_fixed_stride)
-                               reg = chip->main_status +
-                                       i * map->reg_stride * data->irq_reg_stride;
-                       else
-                               reg = data->get_irq_reg(data,
-                                                       chip->main_status, i);
-
+                       reg = data->get_irq_reg(data, chip->main_status, i);
                        ret = regmap_read(map, reg, &data->main_status_buf[i]);
                        if (ret) {
                                dev_err(map->dev,
@@ -586,12 +502,12 @@ static irqreturn_t regmap_irq_thread(int irq, void *d)
        }
 
 exit:
-       if (chip->runtime_pm)
-               pm_runtime_put(map->dev);
-
        if (chip->handle_post_irq)
                chip->handle_post_irq(chip->irq_drv_data);
 
+       if (chip->runtime_pm)
+               pm_runtime_put(map->dev);
+
        if (handled)
                return IRQ_HANDLED;
        else
@@ -629,20 +545,8 @@ static const struct irq_domain_ops regmap_domain_ops = {
 unsigned int regmap_irq_get_irq_reg_linear(struct regmap_irq_chip_data *data,
                                           unsigned int base, int index)
 {
-       const struct regmap_irq_chip *chip = data->chip;
        struct regmap *map = data->map;
 
-       /*
-        * FIXME: This is for backward compatibility and should be removed
-        * when not_fixed_stride is dropped (it's only used by qcom-pm8008).
-        */
-       if (chip->not_fixed_stride && chip->sub_reg_offsets) {
-               struct regmap_irq_sub_irq_map *subreg;
-
-               subreg = &chip->sub_reg_offsets[0];
-               return base + subreg->offset[0];
-       }
-
        return base + index * map->reg_stride * data->irq_reg_stride;
 }
 EXPORT_SYMBOL_GPL(regmap_irq_get_irq_reg_linear);
@@ -730,8 +634,6 @@ int regmap_add_irq_chip_fwnode(struct fwnode_handle *fwnode,
        struct regmap_irq_chip_data *d;
        int i;
        int ret = -ENOMEM;
-       int num_type_reg;
-       int num_regs;
        u32 reg;
 
        if (chip->num_regs <= 0)
@@ -740,6 +642,9 @@ int regmap_add_irq_chip_fwnode(struct fwnode_handle *fwnode,
        if (chip->clear_on_unmask && (chip->ack_base || chip->use_ack))
                return -EINVAL;
 
+       if (chip->mask_base && chip->unmask_base && !chip->mask_unmask_non_inverted)
+               return -EINVAL;
+
        for (i = 0; i < chip->num_irqs; i++) {
                if (chip->irqs[i].reg_offset % map->reg_stride)
                        return -EINVAL;
@@ -748,20 +653,6 @@ int regmap_add_irq_chip_fwnode(struct fwnode_handle *fwnode,
                        return -EINVAL;
        }
 
-       if (chip->not_fixed_stride) {
-               dev_warn(map->dev, "not_fixed_stride is deprecated; use ->get_irq_reg() instead");
-
-               for (i = 0; i < chip->num_regs; i++)
-                       if (chip->sub_reg_offsets[i].num_regs != 1)
-                               return -EINVAL;
-       }
-
-       if (chip->num_type_reg)
-               dev_warn(map->dev, "type registers are deprecated; use config registers instead");
-
-       if (chip->num_virt_regs || chip->virt_reg_base || chip->set_type_virt)
-               dev_warn(map->dev, "virtual registers are deprecated; use config registers instead");
-
        if (irq_base) {
                irq_base = irq_alloc_descs(irq_base, 0, chip->num_irqs, 0);
                if (irq_base < 0) {
@@ -806,43 +697,17 @@ int regmap_add_irq_chip_fwnode(struct fwnode_handle *fwnode,
                        goto err_alloc;
        }
 
-       /*
-        * Use num_config_regs if defined, otherwise fall back to num_type_reg
-        * to maintain backward compatibility.
-        */
-       num_type_reg = chip->num_config_regs ? chip->num_config_regs
-                       : chip->num_type_reg;
-       num_regs = chip->type_in_mask ? chip->num_regs : num_type_reg;
-       if (num_regs) {
-               d->type_buf_def = kcalloc(num_regs,
+       if (chip->type_in_mask) {
+               d->type_buf_def = kcalloc(chip->num_regs,
                                          sizeof(*d->type_buf_def), GFP_KERNEL);
                if (!d->type_buf_def)
                        goto err_alloc;
 
-               d->type_buf = kcalloc(num_regs, sizeof(*d->type_buf),
-                                     GFP_KERNEL);
+               d->type_buf = kcalloc(chip->num_regs, sizeof(*d->type_buf), GFP_KERNEL);
                if (!d->type_buf)
                        goto err_alloc;
        }
 
-       if (chip->num_virt_regs) {
-               /*
-                * Create virt_buf[chip->num_extra_config_regs][chip->num_regs]
-                */
-               d->virt_buf = kcalloc(chip->num_virt_regs, sizeof(*d->virt_buf),
-                                     GFP_KERNEL);
-               if (!d->virt_buf)
-                       goto err_alloc;
-
-               for (i = 0; i < chip->num_virt_regs; i++) {
-                       d->virt_buf[i] = kcalloc(chip->num_regs,
-                                                sizeof(**d->virt_buf),
-                                                GFP_KERNEL);
-                       if (!d->virt_buf[i])
-                               goto err_alloc;
-               }
-       }
-
        if (chip->num_config_bases && chip->num_config_regs) {
                /*
                 * Create config_buf[num_config_bases][num_config_regs]
@@ -868,28 +733,6 @@ int regmap_add_irq_chip_fwnode(struct fwnode_handle *fwnode,
        d->chip = chip;
        d->irq_base = irq_base;
 
-       if (chip->mask_base && chip->unmask_base &&
-           !chip->mask_unmask_non_inverted) {
-               /*
-                * Chips that specify both mask_base and unmask_base used to
-                * get inverted mask behavior by default, with no way to ask
-                * for the normal, non-inverted behavior. This "inverted by
-                * default" behavior is deprecated, but we have to support it
-                * until existing drivers have been fixed.
-                *
-                * Existing drivers should be updated by swapping mask_base
-                * and unmask_base and setting mask_unmask_non_inverted=true.
-                * New drivers should always set the flag.
-                */
-               dev_warn(map->dev, "mask_base and unmask_base are inverted, please fix it");
-
-               d->mask_base = chip->unmask_base;
-               d->unmask_base = chip->mask_base;
-       } else {
-               d->mask_base = chip->mask_base;
-               d->unmask_base = chip->unmask_base;
-       }
-
        if (chip->irq_reg_stride)
                d->irq_reg_stride = chip->irq_reg_stride;
        else
@@ -918,29 +761,28 @@ int regmap_add_irq_chip_fwnode(struct fwnode_handle *fwnode,
        for (i = 0; i < chip->num_regs; i++) {
                d->mask_buf[i] = d->mask_buf_def[i];
 
-               if (d->mask_base) {
-                       if (chip->handle_mask_sync) {
-                               ret = chip->handle_mask_sync(d->map, i,
-                                                            d->mask_buf_def[i],
-                                                            d->mask_buf[i],
-                                                            chip->irq_drv_data);
-                               if (ret)
-                                       goto err_alloc;
-                       } else {
-                               reg = d->get_irq_reg(d, d->mask_base, i);
-                               ret = regmap_update_bits(d->map, reg,
-                                               d->mask_buf_def[i],
-                                               d->mask_buf[i]);
-                               if (ret) {
-                                       dev_err(map->dev, "Failed to set masks in 0x%x: %d\n",
-                                               reg, ret);
-                                       goto err_alloc;
-                               }
+               if (chip->handle_mask_sync) {
+                       ret = chip->handle_mask_sync(i, d->mask_buf_def[i],
+                                                    d->mask_buf[i],
+                                                    chip->irq_drv_data);
+                       if (ret)
+                               goto err_alloc;
+               }
+
+               if (chip->mask_base && !chip->handle_mask_sync) {
+                       reg = d->get_irq_reg(d, chip->mask_base, i);
+                       ret = regmap_update_bits(d->map, reg,
+                                                d->mask_buf_def[i],
+                                                d->mask_buf[i]);
+                       if (ret) {
+                               dev_err(map->dev, "Failed to set masks in 0x%x: %d\n",
+                                       reg, ret);
+                               goto err_alloc;
                        }
                }
 
-               if (d->unmask_base) {
-                       reg = d->get_irq_reg(d, d->unmask_base, i);
+               if (chip->unmask_base && !chip->handle_mask_sync) {
+                       reg = d->get_irq_reg(d, chip->unmask_base, i);
                        ret = regmap_update_bits(d->map, reg,
                                        d->mask_buf_def[i], ~d->mask_buf[i]);
                        if (ret) {
@@ -1014,20 +856,6 @@ int regmap_add_irq_chip_fwnode(struct fwnode_handle *fwnode,
                }
        }
 
-       if (chip->num_type_reg && !chip->type_in_mask) {
-               for (i = 0; i < chip->num_type_reg; ++i) {
-                       reg = d->get_irq_reg(d, d->chip->type_base, i);
-
-                       ret = regmap_read(map, reg, &d->type_buf_def[i]);
-
-                       if (ret) {
-                               dev_err(map->dev, "Failed to get type defaults at 0x%x: %d\n",
-                                       reg, ret);
-                               goto err_alloc;
-                       }
-               }
-       }
-
        if (irq_base)
                d->domain = irq_domain_create_legacy(fwnode, chip->num_irqs,
                                                     irq_base, 0,
@@ -1064,11 +892,6 @@ err_alloc:
        kfree(d->mask_buf);
        kfree(d->status_buf);
        kfree(d->status_reg_buf);
-       if (d->virt_buf) {
-               for (i = 0; i < chip->num_virt_regs; i++)
-                       kfree(d->virt_buf[i]);
-               kfree(d->virt_buf);
-       }
        if (d->config_buf) {
                for (i = 0; i < chip->num_config_bases; i++)
                        kfree(d->config_buf[i]);
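
The new call sites imply an updated handle_mask_sync() callback prototype without the struct regmap * argument; a sketch derived from these calls (see include/linux/regmap.h for the authoritative definition):

    int (*handle_mask_sync)(int index, unsigned int mask_buf_def,
                            unsigned int mask_buf, void *irq_drv_data);
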
index f76d416..24257aa 100644 (file)
@@ -92,6 +92,11 @@ static struct regmap *gen_regmap(struct regmap_config *config,
        return ret;
 }
 
+static bool reg_5_false(struct device *context, unsigned int reg)
+{
+       return reg != 5;
+}
+
 static void basic_read_write(struct kunit *test)
 {
        struct regcache_types *t = (struct regcache_types *)test->param_value;
@@ -191,6 +196,81 @@ static void bulk_read(struct kunit *test)
        regmap_exit(map);
 }
 
+static void write_readonly(struct kunit *test)
+{
+       struct regcache_types *t = (struct regcache_types *)test->param_value;
+       struct regmap *map;
+       struct regmap_config config;
+       struct regmap_ram_data *data;
+       unsigned int val;
+       int i;
+
+       config = test_regmap_config;
+       config.cache_type = t->type;
+       config.num_reg_defaults = BLOCK_TEST_SIZE;
+       config.writeable_reg = reg_5_false;
+
+       map = gen_regmap(&config, &data);
+       KUNIT_ASSERT_FALSE(test, IS_ERR(map));
+       if (IS_ERR(map))
+               return;
+
+       get_random_bytes(&val, sizeof(val));
+
+       for (i = 0; i < BLOCK_TEST_SIZE; i++)
+               data->written[i] = false;
+
+       /* Change the value of all registers, readonly should fail */
+       for (i = 0; i < BLOCK_TEST_SIZE; i++)
+               KUNIT_EXPECT_EQ(test, i != 5, regmap_write(map, i, val) == 0);
+
+       /* Did that match what we see on the device? */
+       for (i = 0; i < BLOCK_TEST_SIZE; i++)
+               KUNIT_EXPECT_EQ(test, i != 5, data->written[i]);
+
+       regmap_exit(map);
+}
+
+static void read_writeonly(struct kunit *test)
+{
+       struct regcache_types *t = (struct regcache_types *)test->param_value;
+       struct regmap *map;
+       struct regmap_config config;
+       struct regmap_ram_data *data;
+       unsigned int val;
+       int i;
+
+       config = test_regmap_config;
+       config.cache_type = t->type;
+       config.readable_reg = reg_5_false;
+
+       map = gen_regmap(&config, &data);
+       KUNIT_ASSERT_FALSE(test, IS_ERR(map));
+       if (IS_ERR(map))
+               return;
+
+       for (i = 0; i < BLOCK_TEST_SIZE; i++)
+               data->read[i] = false;
+
+       /*
+        * Try to read all the registers; the writeonly one should
+        * fail if we aren't using the flat cache.
+        */
+       for (i = 0; i < BLOCK_TEST_SIZE; i++) {
+               if (t->type != REGCACHE_FLAT) {
+                       KUNIT_EXPECT_EQ(test, i != 5,
+                                       regmap_read(map, i, &val) == 0);
+               } else {
+                       KUNIT_EXPECT_EQ(test, 0, regmap_read(map, i, &val));
+               }
+       }
+
+       /* Did we trigger a hardware access? */
+       KUNIT_EXPECT_FALSE(test, data->read[5]);
+
+       regmap_exit(map);
+}
+
 static void reg_defaults(struct kunit *test)
 {
        struct regcache_types *t = (struct regcache_types *)test->param_value;
@@ -609,6 +689,47 @@ static void cache_sync_defaults(struct kunit *test)
        regmap_exit(map);
 }
 
+static void cache_sync_readonly(struct kunit *test)
+{
+       struct regcache_types *t = (struct regcache_types *)test->param_value;
+       struct regmap *map;
+       struct regmap_config config;
+       struct regmap_ram_data *data;
+       unsigned int val;
+       int i;
+
+       config = test_regmap_config;
+       config.cache_type = t->type;
+       config.writeable_reg = reg_5_false;
+
+       map = gen_regmap(&config, &data);
+       KUNIT_ASSERT_FALSE(test, IS_ERR(map));
+       if (IS_ERR(map))
+               return;
+
+       /* Read all registers to fill the cache */
+       for (i = 0; i < BLOCK_TEST_SIZE; i++)
+               KUNIT_EXPECT_EQ(test, 0, regmap_read(map, i, &val));
+
+       /* Change the value of all registers, readonly should fail */
+       get_random_bytes(&val, sizeof(val));
+       regcache_cache_only(map, true);
+       for (i = 0; i < BLOCK_TEST_SIZE; i++)
+               KUNIT_EXPECT_EQ(test, i != 5, regmap_write(map, i, val) == 0);
+       regcache_cache_only(map, false);
+
+       /* Resync */
+       for (i = 0; i < BLOCK_TEST_SIZE; i++)
+               data->written[i] = false;
+       KUNIT_EXPECT_EQ(test, 0, regcache_sync(map));
+
+       /* Did that match what we see on the device? */
+       for (i = 0; i < BLOCK_TEST_SIZE; i++)
+               KUNIT_EXPECT_EQ(test, i != 5, data->written[i]);
+
+       regmap_exit(map);
+}
+
 static void cache_sync_patch(struct kunit *test)
 {
        struct regcache_types *t = (struct regcache_types *)test->param_value;
@@ -712,10 +833,333 @@ static void cache_drop(struct kunit *test)
        regmap_exit(map);
 }
 
+struct raw_test_types {
+       const char *name;
+
+       enum regcache_type cache_type;
+       enum regmap_endian val_endian;
+};
+
+static void raw_to_desc(const struct raw_test_types *t, char *desc)
+{
+       strcpy(desc, t->name);
+}
+
+static const struct raw_test_types raw_types_list[] = {
+       { "none-little",   REGCACHE_NONE,   REGMAP_ENDIAN_LITTLE },
+       { "none-big",      REGCACHE_NONE,   REGMAP_ENDIAN_BIG },
+       { "flat-little",   REGCACHE_FLAT,   REGMAP_ENDIAN_LITTLE },
+       { "flat-big",      REGCACHE_FLAT,   REGMAP_ENDIAN_BIG },
+       { "rbtree-little", REGCACHE_RBTREE, REGMAP_ENDIAN_LITTLE },
+       { "rbtree-big",    REGCACHE_RBTREE, REGMAP_ENDIAN_BIG },
+       { "maple-little",  REGCACHE_MAPLE,  REGMAP_ENDIAN_LITTLE },
+       { "maple-big",     REGCACHE_MAPLE,  REGMAP_ENDIAN_BIG },
+};
+
+KUNIT_ARRAY_PARAM(raw_test_types, raw_types_list, raw_to_desc);
+
+static const struct raw_test_types raw_cache_types_list[] = {
+       { "flat-little",   REGCACHE_FLAT,   REGMAP_ENDIAN_LITTLE },
+       { "flat-big",      REGCACHE_FLAT,   REGMAP_ENDIAN_BIG },
+       { "rbtree-little", REGCACHE_RBTREE, REGMAP_ENDIAN_LITTLE },
+       { "rbtree-big",    REGCACHE_RBTREE, REGMAP_ENDIAN_BIG },
+       { "maple-little",  REGCACHE_MAPLE,  REGMAP_ENDIAN_LITTLE },
+       { "maple-big",     REGCACHE_MAPLE,  REGMAP_ENDIAN_BIG },
+};
+
+KUNIT_ARRAY_PARAM(raw_test_cache_types, raw_cache_types_list, raw_to_desc);
+
+static const struct regmap_config raw_regmap_config = {
+       .max_register = BLOCK_TEST_SIZE,
+
+       .reg_format_endian = REGMAP_ENDIAN_LITTLE,
+       .reg_bits = 16,
+       .val_bits = 16,
+};
+
+static struct regmap *gen_raw_regmap(struct regmap_config *config,
+                                    struct raw_test_types *test_type,
+                                    struct regmap_ram_data **data)
+{
+       u16 *buf;
+       struct regmap *ret;
+       size_t size = (config->max_register + 1) * config->reg_bits / 8;
+       int i;
+       struct reg_default *defaults;
+
+       config->cache_type = test_type->cache_type;
+       config->val_format_endian = test_type->val_endian;
+
+       buf = kmalloc(size, GFP_KERNEL);
+       if (!buf)
+               return ERR_PTR(-ENOMEM);
+
+       get_random_bytes(buf, size);
+
+       *data = kzalloc(sizeof(**data), GFP_KERNEL);
+       if (!(*data))
+               return ERR_PTR(-ENOMEM);
+       (*data)->vals = (void *)buf;
+
+       config->num_reg_defaults = config->max_register + 1;
+       defaults = kcalloc(config->num_reg_defaults,
+                          sizeof(struct reg_default),
+                          GFP_KERNEL);
+       if (!defaults)
+               return ERR_PTR(-ENOMEM);
+       config->reg_defaults = defaults;
+
+       for (i = 0; i < config->num_reg_defaults; i++) {
+               defaults[i].reg = i;
+               switch (test_type->val_endian) {
+               case REGMAP_ENDIAN_LITTLE:
+                       defaults[i].def = le16_to_cpu(buf[i]);
+                       break;
+               case REGMAP_ENDIAN_BIG:
+                       defaults[i].def = be16_to_cpu(buf[i]);
+                       break;
+               default:
+                       return ERR_PTR(-EINVAL);
+               }
+       }
+
+       /*
+        * We use the defaults in the tests but they don't make sense
+        * to the core if there's no cache.
+        */
+       if (config->cache_type == REGCACHE_NONE)
+               config->num_reg_defaults = 0;
+
+       ret = regmap_init_raw_ram(config, *data);
+       if (IS_ERR(ret)) {
+               kfree(buf);
+               kfree(*data);
+       }
+
+       return ret;
+}
+
+static void raw_read_defaults_single(struct kunit *test)
+{
+       struct raw_test_types *t = (struct raw_test_types *)test->param_value;
+       struct regmap *map;
+       struct regmap_config config;
+       struct regmap_ram_data *data;
+       unsigned int rval;
+       int i;
+
+       config = raw_regmap_config;
+
+       map = gen_raw_regmap(&config, t, &data);
+       KUNIT_ASSERT_FALSE(test, IS_ERR(map));
+       if (IS_ERR(map))
+               return;
+
+       /* Check that we can read the defaults via the API */
+       for (i = 0; i < config.max_register + 1; i++) {
+               KUNIT_EXPECT_EQ(test, 0, regmap_read(map, i, &rval));
+               KUNIT_EXPECT_EQ(test, config.reg_defaults[i].def, rval);
+       }
+
+       regmap_exit(map);
+}
+
+static void raw_read_defaults(struct kunit *test)
+{
+       struct raw_test_types *t = (struct raw_test_types *)test->param_value;
+       struct regmap *map;
+       struct regmap_config config;
+       struct regmap_ram_data *data;
+       u16 *rval;
+       u16 def;
+       size_t val_len;
+       int i;
+
+       config = raw_regmap_config;
+
+       map = gen_raw_regmap(&config, t, &data);
+       KUNIT_ASSERT_FALSE(test, IS_ERR(map));
+       if (IS_ERR(map))
+               return;
+
+       val_len = sizeof(*rval) * (config.max_register + 1);
+       rval = kmalloc(val_len, GFP_KERNEL);
+       KUNIT_ASSERT_TRUE(test, rval != NULL);
+       if (!rval)
+               return;
+       
+       /* Check that we can read the defaults via the API */
+       KUNIT_EXPECT_EQ(test, 0, regmap_raw_read(map, 0, rval, val_len));
+       for (i = 0; i < config.max_register + 1; i++) {
+               def = config.reg_defaults[i].def;
+               if (config.val_format_endian == REGMAP_ENDIAN_BIG) {
+                       KUNIT_EXPECT_EQ(test, def, be16_to_cpu(rval[i]));
+               } else {
+                       KUNIT_EXPECT_EQ(test, def, le16_to_cpu(rval[i]));
+               }
+       }
+       
+       kfree(rval);
+       regmap_exit(map);
+}
+
+static void raw_write_read_single(struct kunit *test)
+{
+       struct raw_test_types *t = (struct raw_test_types *)test->param_value;
+       struct regmap *map;
+       struct regmap_config config;
+       struct regmap_ram_data *data;
+       u16 val;
+       unsigned int rval;
+
+       config = raw_regmap_config;
+
+       map = gen_raw_regmap(&config, t, &data);
+       KUNIT_ASSERT_FALSE(test, IS_ERR(map));
+       if (IS_ERR(map))
+               return;
+
+       get_random_bytes(&val, sizeof(val));
+
+       /* If we write a value to a register we can read it back */
+       KUNIT_EXPECT_EQ(test, 0, regmap_write(map, 0, val));
+       KUNIT_EXPECT_EQ(test, 0, regmap_read(map, 0, &rval));
+       KUNIT_EXPECT_EQ(test, val, rval);
+
+       regmap_exit(map);
+}
+
+static void raw_write(struct kunit *test)
+{
+       struct raw_test_types *t = (struct raw_test_types *)test->param_value;
+       struct regmap *map;
+       struct regmap_config config;
+       struct regmap_ram_data *data;
+       u16 *hw_buf;
+       u16 val[2];
+       unsigned int rval;
+       int i;
+
+       config = raw_regmap_config;
+
+       map = gen_raw_regmap(&config, t, &data);
+       KUNIT_ASSERT_FALSE(test, IS_ERR(map));
+       if (IS_ERR(map))
+               return;
+
+       hw_buf = (u16 *)data->vals;
+
+       get_random_bytes(&val, sizeof(val));
+
+       /* Do a raw write */
+       KUNIT_EXPECT_EQ(test, 0, regmap_raw_write(map, 2, val, sizeof(val)));
+
+       /* We should read back the new values, and defaults for the rest */
+       for (i = 0; i < config.max_register + 1; i++) {
+               KUNIT_EXPECT_EQ(test, 0, regmap_read(map, i, &rval));
+
+               switch (i) {
+               case 2:
+               case 3:
+                       if (config.val_format_endian == REGMAP_ENDIAN_BIG) {
+                               KUNIT_EXPECT_EQ(test, rval,
+                                               be16_to_cpu(val[i % 2]));
+                       } else {
+                               KUNIT_EXPECT_EQ(test, rval,
+                                               le16_to_cpu(val[i % 2]));
+                       }
+                       break;
+               default:
+                       KUNIT_EXPECT_EQ(test, config.reg_defaults[i].def, rval);
+                       break;
+               }
+       }
+
+       /* The values should appear in the "hardware" */
+       KUNIT_EXPECT_MEMEQ(test, &hw_buf[2], val, sizeof(val));
+
+       regmap_exit(map);
+}
+
+static void raw_sync(struct kunit *test)
+{
+       struct raw_test_types *t = (struct raw_test_types *)test->param_value;
+       struct regmap *map;
+       struct regmap_config config;
+       struct regmap_ram_data *data;
+       u16 val[2];
+       u16 *hw_buf;
+       unsigned int rval;
+       int i;
+
+       config = raw_regmap_config;
+
+       map = gen_raw_regmap(&config, t, &data);
+       KUNIT_ASSERT_FALSE(test, IS_ERR(map));
+       if (IS_ERR(map))
+               return;
+
+       hw_buf = (u16 *)data->vals;
+
+       get_random_bytes(&val, sizeof(val));
+
+       /* Do a regular write and a raw write in cache only mode */
+       regcache_cache_only(map, true);
+       KUNIT_EXPECT_EQ(test, 0, regmap_raw_write(map, 2, val, sizeof(val)));
+       if (config.val_format_endian == REGMAP_ENDIAN_BIG)
+               KUNIT_EXPECT_EQ(test, 0, regmap_write(map, 6,
+                                                     be16_to_cpu(val[0])));
+       else
+               KUNIT_EXPECT_EQ(test, 0, regmap_write(map, 6,
+                                                     le16_to_cpu(val[0])));
+
+       /* We should read back the new values, and defaults for the rest */
+       for (i = 0; i < config.max_register + 1; i++) {
+               KUNIT_EXPECT_EQ(test, 0, regmap_read(map, i, &rval));
+
+               switch (i) {
+               case 2:
+               case 3:
+               case 6:
+                       if (config.val_format_endian == REGMAP_ENDIAN_BIG) {
+                               KUNIT_EXPECT_EQ(test, rval,
+                                               be16_to_cpu(val[i % 2]));
+                       } else {
+                               KUNIT_EXPECT_EQ(test, rval,
+                                               le16_to_cpu(val[i % 2]));
+                       }
+                       break;
+               default:
+                       KUNIT_EXPECT_EQ(test, config.reg_defaults[i].def, rval);
+                       break;
+               }
+       }
+       
+       /* The values should not appear in the "hardware" */
+       KUNIT_EXPECT_MEMNEQ(test, &hw_buf[2], val, sizeof(val));
+       KUNIT_EXPECT_MEMNEQ(test, &hw_buf[6], val, sizeof(u16));
+
+       for (i = 0; i < config.max_register + 1; i++)
+               data->written[i] = false;
+
+       /* Do the sync */
+       regcache_cache_only(map, false);
+       regcache_mark_dirty(map);
+       KUNIT_EXPECT_EQ(test, 0, regcache_sync(map));
+
+       /* The values should now appear in the "hardware" */
+       KUNIT_EXPECT_MEMEQ(test, &hw_buf[2], val, sizeof(val));
+       KUNIT_EXPECT_MEMEQ(test, &hw_buf[6], val, sizeof(u16));
+
+       regmap_exit(map);
+}
+
 static struct kunit_case regmap_test_cases[] = {
        KUNIT_CASE_PARAM(basic_read_write, regcache_types_gen_params),
        KUNIT_CASE_PARAM(bulk_write, regcache_types_gen_params),
        KUNIT_CASE_PARAM(bulk_read, regcache_types_gen_params),
+       KUNIT_CASE_PARAM(write_readonly, regcache_types_gen_params),
+       KUNIT_CASE_PARAM(read_writeonly, regcache_types_gen_params),
        KUNIT_CASE_PARAM(reg_defaults, regcache_types_gen_params),
        KUNIT_CASE_PARAM(reg_defaults_read_dev, regcache_types_gen_params),
        KUNIT_CASE_PARAM(register_patch, regcache_types_gen_params),
@@ -725,8 +1169,15 @@ static struct kunit_case regmap_test_cases[] = {
        KUNIT_CASE_PARAM(cache_bypass, real_cache_types_gen_params),
        KUNIT_CASE_PARAM(cache_sync, real_cache_types_gen_params),
        KUNIT_CASE_PARAM(cache_sync_defaults, real_cache_types_gen_params),
+       KUNIT_CASE_PARAM(cache_sync_readonly, real_cache_types_gen_params),
        KUNIT_CASE_PARAM(cache_sync_patch, real_cache_types_gen_params),
        KUNIT_CASE_PARAM(cache_drop, sparse_cache_types_gen_params),
+
+       KUNIT_CASE_PARAM(raw_read_defaults_single, raw_test_types_gen_params),
+       KUNIT_CASE_PARAM(raw_read_defaults, raw_test_types_gen_params),
+       KUNIT_CASE_PARAM(raw_write_read_single, raw_test_types_gen_params),
+       KUNIT_CASE_PARAM(raw_write, raw_test_types_gen_params),
+       KUNIT_CASE_PARAM(raw_sync, raw_test_cache_types_gen_params),
        {}
 };
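
For reference, each KUNIT_CASE_PARAM() entry above re-runs its test once per configuration yielded by the named *_gen_params callback. A minimal, stand-alone sketch of that parameterisation pattern (the example_* names and fields are placeholders, not part of this patch):

#include <kunit/test.h>

struct example_param {
        const char *name;
        bool cache;
};

static const struct example_param example_params[] = {
        { .name = "flat", .cache = true },
        { .name = "none", .cache = false },
};

/* Generates example_gen_params(), one array entry per test run */
KUNIT_ARRAY_PARAM(example, example_params, NULL);

static void example_test(struct kunit *test)
{
        const struct example_param *p = test->param_value;

        KUNIT_EXPECT_TRUE(test, p->name != NULL);
}

static struct kunit_case example_cases[] = {
        KUNIT_CASE_PARAM(example_test, example_gen_params),
        {}
};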
 
index 3ccdd86..8132b5c 100644 (file)
@@ -448,7 +448,7 @@ static struct regmap_mmio_context *regmap_mmio_gen_context(struct device *dev,
        if (min_stride < 0)
                return ERR_PTR(min_stride);
 
-       if (config->reg_stride < min_stride)
+       if (config->reg_stride && config->reg_stride < min_stride)
                return ERR_PTR(-EINVAL);
 
        if (config->use_relaxed_mmio && config->io_port)
diff --git a/drivers/base/regmap/regmap-raw-ram.c b/drivers/base/regmap/regmap-raw-ram.c
new file mode 100644 (file)
index 0000000..c9b8008
--- /dev/null
@@ -0,0 +1,133 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Register map access API - Memory region with raw access
+//
+// This is intended for testing only
+//
+// Copyright (c) 2023, Arm Ltd
+
+#include <linux/clk.h>
+#include <linux/err.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/regmap.h>
+#include <linux/slab.h>
+#include <linux/swab.h>
+
+#include "internal.h"
+
+static unsigned int decode_reg(enum regmap_endian endian, const void *reg)
+{
+       const u16 *r = reg;
+
+       if (endian == REGMAP_ENDIAN_BIG)
+               return be16_to_cpu(*r);
+       else
+               return le16_to_cpu(*r);
+}
+
+static int regmap_raw_ram_gather_write(void *context,
+                                      const void *reg, size_t reg_len,
+                                      const void *val, size_t val_len)
+{
+       struct regmap_ram_data *data = context;
+       unsigned int r;
+       u16 *our_buf = (u16 *)data->vals;
+       int i;
+
+       if (reg_len != 2)
+               return -EINVAL;
+       if (val_len % 2)
+               return -EINVAL;
+
+       r = decode_reg(data->reg_endian, reg);
+       memcpy(&our_buf[r], val, val_len);
+
+       for (i = 0; i < val_len / 2; i++)
+               data->written[r + i] = true;
+
+       return 0;
+}
+
+static int regmap_raw_ram_write(void *context, const void *data, size_t count)
+{
+       return regmap_raw_ram_gather_write(context, data, 2,
+                                          data + 2, count - 2);
+}
+
+static int regmap_raw_ram_read(void *context,
+                              const void *reg, size_t reg_len,
+                              void *val, size_t val_len)
+{
+       struct regmap_ram_data *data = context;
+       unsigned int r;
+       u16 *our_buf = (u16 *)data->vals;
+       int i;
+
+       if (reg_len != 2)
+               return -EINVAL;
+       if (val_len % 2)
+               return -EINVAL;
+
+       r = decode_reg(data->reg_endian, reg);
+       memcpy(val, &our_buf[r], val_len);
+
+       for (i = 0; i < val_len / 2; i++)
+               data->read[r + i] = true;
+
+       return 0;
+}
+
+static void regmap_raw_ram_free_context(void *context)
+{
+       struct regmap_ram_data *data = context;
+
+       kfree(data->vals);
+       kfree(data->read);
+       kfree(data->written);
+       kfree(data);
+}
+
+static const struct regmap_bus regmap_raw_ram = {
+       .fast_io = true,
+       .write = regmap_raw_ram_write,
+       .gather_write = regmap_raw_ram_gather_write,
+       .read = regmap_raw_ram_read,
+       .free_context = regmap_raw_ram_free_context,
+};
+
+struct regmap *__regmap_init_raw_ram(const struct regmap_config *config,
+                                    struct regmap_ram_data *data,
+                                    struct lock_class_key *lock_key,
+                                    const char *lock_name)
+{
+       struct regmap *map;
+
+       if (config->reg_bits != 16)
+               return ERR_PTR(-EINVAL);
+
+       if (!config->max_register) {
+               pr_crit("No max_register specified for RAM regmap\n");
+               return ERR_PTR(-EINVAL);
+       }
+
+       data->read = kcalloc(config->max_register + 1, sizeof(bool),
+                            GFP_KERNEL);
+       if (!data->read)
+               return ERR_PTR(-ENOMEM);
+
+       data->written = kcalloc(config->max_register + 1, sizeof(bool),
+                               GFP_KERNEL);
+       if (!data->written)
+               return ERR_PTR(-ENOMEM);
+
+       data->reg_endian = config->reg_format_endian;
+
+       map = __regmap_init(NULL, &regmap_raw_ram, data, config,
+                           lock_key, lock_name);
+
+       return map;
+}
+EXPORT_SYMBOL_GPL(__regmap_init_raw_ram);
+
+MODULE_LICENSE("GPL v2");
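
A minimal sketch of how a test might instantiate the raw RAM bus defined above. The in-tree KUnit tests do this through their gen_raw_regmap() helper, which is not shown in this hunk; the example_* names here are placeholders, and the regmap_ram_data field usage is an assumption based on the code above (the declarations are taken to live in the regmap internal header, as the new file's include suggests). The config must use 16-bit registers and set max_register, per the checks in __regmap_init_raw_ram().

#include <linux/slab.h>
#include "internal.h"   /* struct regmap_ram_data, __regmap_init_raw_ram() */

static struct regmap *example_raw_ram_regmap(struct regmap_config *config,
                                             struct regmap_ram_data **datap)
{
        static struct lock_class_key example_key;
        struct regmap_ram_data *data;

        data = kzalloc(sizeof(*data), GFP_KERNEL);
        if (!data)
                return ERR_PTR(-ENOMEM);

        /* Backing "hardware": one u16 slot per register, 0..max_register */
        data->vals = kcalloc(config->max_register + 1, sizeof(u16), GFP_KERNEL);
        if (!data->vals) {
                kfree(data);
                return ERR_PTR(-ENOMEM);
        }

        *datap = data;
        return __regmap_init_raw_ram(config, data, &example_key,
                                     "example-raw-ram");
}
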
index fa2d3fb..89a7f1c 100644 (file)
@@ -2983,6 +2983,11 @@ int regmap_raw_read(struct regmap *map, unsigned int reg, void *val,
                size_t chunk_count, chunk_bytes;
                size_t chunk_regs = val_count;
 
+               if (!map->cache_bypass && map->cache_only) {
+                       ret = -EBUSY;
+                       goto out;
+               }
+
                if (!map->read) {
                        ret = -ENOTSUPP;
                        goto out;
@@ -3078,18 +3083,19 @@ int regmap_noinc_read(struct regmap *map, unsigned int reg,
                goto out_unlock;
        }
 
+       /*
+        * We have not defined the FIFO semantics for cache, as the
+        * cache is just one value deep. Should we return the last
+        * written value? Just avoid this by always reading the FIFO
+        * even when using cache. Cache only will not work.
+        */
+       if (!map->cache_bypass && map->cache_only) {
+               ret = -EBUSY;
+               goto out_unlock;
+       }
+
        /* Use the accelerated operation if we can */
        if (map->bus->reg_noinc_read) {
-               /*
-                * We have not defined the FIFO semantics for cache, as the
-                * cache is just one value deep. Should we return the last
-                * written value? Just avoid this by always reading the FIFO
-                * even when using cache. Cache only will not work.
-                */
-               if (map->cache_only) {
-                       ret = -EBUSY;
-                       goto out_unlock;
-               }
                ret = regmap_noinc_readwrite(map, reg, val, val_len, false);
                goto out_unlock;
        }
@@ -3273,7 +3279,7 @@ static int _regmap_update_bits(struct regmap *map, unsigned int reg,
                tmp = orig & ~mask;
                tmp |= val & mask;
 
-               if (force_write || (tmp != orig)) {
+               if (force_write || (tmp != orig) || map->force_write_field) {
                        ret = _regmap_write(map, reg, tmp);
                        if (ret == 0 && change)
                                *change = true;
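
The net effect of the two cache-only hunks above, as a hedged sketch: while the cache is in cache-only mode and not bypassed, raw and non-incrementing reads now fail with -EBUSY instead of reaching the bus. The map and register here are whatever the caller set up.

#include <linux/regmap.h>

static int example_raw_read_is_refused(struct regmap *map, unsigned int reg)
{
        u16 buf[2];
        int ret;

        regcache_cache_only(map, true);

        /* raw reads must hit the bus, so this is refused while cache-only */
        ret = regmap_raw_read(map, reg, buf, sizeof(buf));      /* -EBUSY */

        regcache_cache_only(map, false);
        return ret;
}
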
index 4c8b2ba..e460c97 100644 (file)
@@ -1532,7 +1532,7 @@ static int fd_getgeo(struct block_device *bdev, struct hd_geometry *geo)
        return 0;
 }
 
-static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode,
+static int fd_locked_ioctl(struct block_device *bdev, blk_mode_t mode,
                    unsigned int cmd, unsigned long param)
 {
        struct amiga_floppy_struct *p = bdev->bd_disk->private_data;
@@ -1607,7 +1607,7 @@ static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode,
        return 0;
 }
 
-static int fd_ioctl(struct block_device *bdev, fmode_t mode,
+static int fd_ioctl(struct block_device *bdev, blk_mode_t mode,
                             unsigned int cmd, unsigned long param)
 {
        int ret;
@@ -1654,10 +1654,10 @@ static void fd_probe(int dev)
  * /dev/PS0 etc), and disallows simultaneous access to the same
  * drive with different device numbers.
  */
-static int floppy_open(struct block_device *bdev, fmode_t mode)
+static int floppy_open(struct gendisk *disk, blk_mode_t mode)
 {
-       int drive = MINOR(bdev->bd_dev) & 3;
-       int system =  (MINOR(bdev->bd_dev) & 4) >> 2;
+       int drive = disk->first_minor & 3;
+       int system = (disk->first_minor & 4) >> 2;
        int old_dev;
        unsigned long flags;
 
@@ -1673,10 +1673,9 @@ static int floppy_open(struct block_device *bdev, fmode_t mode)
                mutex_unlock(&amiflop_mutex);
                return -ENXIO;
        }
-
-       if (mode & (FMODE_READ|FMODE_WRITE)) {
-               bdev_check_media_change(bdev);
-               if (mode & FMODE_WRITE) {
+       if (mode & (BLK_OPEN_READ | BLK_OPEN_WRITE)) {
+               disk_check_media_change(disk);
+               if (mode & BLK_OPEN_WRITE) {
                        int wrprot;
 
                        get_fdc(drive);
@@ -1691,7 +1690,6 @@ static int floppy_open(struct block_device *bdev, fmode_t mode)
                        }
                }
        }
-
        local_irq_save(flags);
        fd_ref[drive]++;
        fd_device[drive] = system;
@@ -1709,7 +1707,7 @@ static int floppy_open(struct block_device *bdev, fmode_t mode)
        return 0;
 }
 
-static void floppy_release(struct gendisk *disk, fmode_t mode)
+static void floppy_release(struct gendisk *disk)
 {
        struct amiga_floppy_struct *p = disk->private_data;
        int drive = p - unit;
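
The block-driver conversions in this pull follow the prototype change visible above: ->open() now receives the gendisk plus a blk_mode_t (BLK_OPEN_READ/WRITE/EXCL/NDELAY replace the FMODE_* bits) and ->release() loses its mode argument. A minimal sketch with placeholder exampledrv_* names:

#include <linux/blkdev.h>
#include <linux/module.h>

static int exampledrv_open(struct gendisk *disk, blk_mode_t mode)
{
        /* media-change handling now keys off the gendisk, not a bdev */
        if (mode & (BLK_OPEN_READ | BLK_OPEN_WRITE))
                disk_check_media_change(disk);
        return 0;
}

static void exampledrv_release(struct gendisk *disk)
{
}

static const struct block_device_operations exampledrv_fops = {
        .owner          = THIS_MODULE,
        .open           = exampledrv_open,
        .release        = exampledrv_release,
};
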
index 128722c..cf68837 100644 (file)
@@ -204,9 +204,9 @@ aoedisk_rm_debugfs(struct aoedev *d)
 }
 
 static int
-aoeblk_open(struct block_device *bdev, fmode_t mode)
+aoeblk_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct aoedev *d = bdev->bd_disk->private_data;
+       struct aoedev *d = disk->private_data;
        ulong flags;
 
        if (!virt_addr_valid(d)) {
@@ -232,7 +232,7 @@ aoeblk_open(struct block_device *bdev, fmode_t mode)
 }
 
 static void
-aoeblk_release(struct gendisk *disk, fmode_t mode)
+aoeblk_release(struct gendisk *disk)
 {
        struct aoedev *d = disk->private_data;
        ulong flags;
@@ -285,7 +285,7 @@ aoeblk_getgeo(struct block_device *bdev, struct hd_geometry *geo)
 }
 
 static int
-aoeblk_ioctl(struct block_device *bdev, fmode_t mode, uint cmd, ulong arg)
+aoeblk_ioctl(struct block_device *bdev, blk_mode_t mode, uint cmd, ulong arg)
 {
        struct aoedev *d;
 
index 4c666f7..a42c4bc 100644 (file)
@@ -49,7 +49,7 @@ static int emsgs_head_idx, emsgs_tail_idx;
 static struct completion emsgs_comp;
 static spinlock_t emsgs_lock;
 static int nblocked_emsgs_readers;
-static struct class *aoe_class;
+
 static struct aoe_chardev chardevs[] = {
        { MINOR_ERR, "err" },
        { MINOR_DISCOVER, "discover" },
@@ -58,6 +58,16 @@ static struct aoe_chardev chardevs[] = {
        { MINOR_FLUSH, "flush" },
 };
 
+static char *aoe_devnode(const struct device *dev, umode_t *mode)
+{
+       return kasprintf(GFP_KERNEL, "etherd/%s", dev_name(dev));
+}
+
+static const struct class aoe_class = {
+       .name = "aoe",
+       .devnode = aoe_devnode,
+};
+
 static int
 discover(void)
 {
@@ -273,11 +283,6 @@ static const struct file_operations aoe_fops = {
        .llseek = noop_llseek,
 };
 
-static char *aoe_devnode(const struct device *dev, umode_t *mode)
-{
-       return kasprintf(GFP_KERNEL, "etherd/%s", dev_name(dev));
-}
-
 int __init
 aoechr_init(void)
 {
@@ -290,15 +295,14 @@ aoechr_init(void)
        }
        init_completion(&emsgs_comp);
        spin_lock_init(&emsgs_lock);
-       aoe_class = class_create("aoe");
-       if (IS_ERR(aoe_class)) {
+       n = class_register(&aoe_class);
+       if (n) {
                unregister_chrdev(AOE_MAJOR, "aoechr");
-               return PTR_ERR(aoe_class);
+               return n;
        }
-       aoe_class->devnode = aoe_devnode;
 
        for (i = 0; i < ARRAY_SIZE(chardevs); ++i)
-               device_create(aoe_class, NULL,
+               device_create(&aoe_class, NULL,
                              MKDEV(AOE_MAJOR, chardevs[i].minor), NULL,
                              chardevs[i].name);
 
@@ -311,8 +315,8 @@ aoechr_exit(void)
        int i;
 
        for (i = 0; i < ARRAY_SIZE(chardevs); ++i)
-               device_destroy(aoe_class, MKDEV(AOE_MAJOR, chardevs[i].minor));
-       class_destroy(aoe_class);
+               device_destroy(&aoe_class, MKDEV(AOE_MAJOR, chardevs[i].minor));
+       class_unregister(&aoe_class);
        unregister_chrdev(AOE_MAJOR, "aoechr");
 }
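
A sketch of the class-registration pattern the aoechr conversion above adopts: a statically defined const struct class with ->devnode set at definition time, registered with class_register() rather than allocated by class_create(). The example_* names are placeholders.

#include <linux/device.h>
#include <linux/string.h>

static char *example_devnode(const struct device *dev, umode_t *mode)
{
        return kasprintf(GFP_KERNEL, "exampledir/%s", dev_name(dev));
}

static const struct class example_class = {
        .name           = "example",
        .devnode        = example_devnode,
};

static int __init example_init(void)
{
        int ret = class_register(&example_class);

        if (ret)
                return ret;
        /* devices are then created against &example_class directly */
        device_create(&example_class, NULL, MKDEV(0, 0), NULL, "example0");
        return 0;
}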
 
index 9deb4df..cd738ca 100644 (file)
@@ -442,13 +442,13 @@ static void fd_times_out(struct timer_list *unused);
 static void finish_fdc( void );
 static void finish_fdc_done( int dummy );
 static void setup_req_params( int drive );
-static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode, unsigned int
-                     cmd, unsigned long param);
+static int fd_locked_ioctl(struct block_device *bdev, blk_mode_t mode,
+               unsigned int cmd, unsigned long param);
 static void fd_probe( int drive );
 static int fd_test_drive_present( int drive );
 static void config_types( void );
-static int floppy_open(struct block_device *bdev, fmode_t mode);
-static void floppy_release(struct gendisk *disk, fmode_t mode);
+static int floppy_open(struct gendisk *disk, blk_mode_t mode);
+static void floppy_release(struct gendisk *disk);
 
 /************************* End of Prototypes **************************/
 
@@ -1581,7 +1581,7 @@ out:
        return BLK_STS_OK;
 }
 
-static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode,
+static int fd_locked_ioctl(struct block_device *bdev, blk_mode_t mode,
                    unsigned int cmd, unsigned long param)
 {
        struct gendisk *disk = bdev->bd_disk;
@@ -1760,15 +1760,15 @@ static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode,
                /* invalidate the buffer track to force a reread */
                BufferDrive = -1;
                set_bit(drive, &fake_change);
-               if (bdev_check_media_change(bdev))
-                       floppy_revalidate(bdev->bd_disk);
+               if (disk_check_media_change(disk))
+                       floppy_revalidate(disk);
                return 0;
        default:
                return -EINVAL;
        }
 }
 
-static int fd_ioctl(struct block_device *bdev, fmode_t mode,
+static int fd_ioctl(struct block_device *bdev, blk_mode_t mode,
                             unsigned int cmd, unsigned long arg)
 {
        int ret;
@@ -1915,32 +1915,31 @@ static void __init config_types( void )
  * drive with different device numbers.
  */
 
-static int floppy_open(struct block_device *bdev, fmode_t mode)
+static int floppy_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct atari_floppy_struct *p = bdev->bd_disk->private_data;
-       int type  = MINOR(bdev->bd_dev) >> 2;
+       struct atari_floppy_struct *p = disk->private_data;
+       int type = disk->first_minor >> 2;
 
        DPRINT(("fd_open: type=%d\n",type));
        if (p->ref && p->type != type)
                return -EBUSY;
 
-       if (p->ref == -1 || (p->ref && mode & FMODE_EXCL))
+       if (p->ref == -1 || (p->ref && mode & BLK_OPEN_EXCL))
                return -EBUSY;
-
-       if (mode & FMODE_EXCL)
+       if (mode & BLK_OPEN_EXCL)
                p->ref = -1;
        else
                p->ref++;
 
        p->type = type;
 
-       if (mode & FMODE_NDELAY)
+       if (mode & BLK_OPEN_NDELAY)
                return 0;
 
-       if (mode & (FMODE_READ|FMODE_WRITE)) {
-               if (bdev_check_media_change(bdev))
-                       floppy_revalidate(bdev->bd_disk);
-               if (mode & FMODE_WRITE) {
+       if (mode & (BLK_OPEN_READ | BLK_OPEN_WRITE)) {
+               if (disk_check_media_change(disk))
+                       floppy_revalidate(disk);
+               if (mode & BLK_OPEN_WRITE) {
                        if (p->wpstat) {
                                if (p->ref < 0)
                                        p->ref = 0;
@@ -1953,18 +1952,18 @@ static int floppy_open(struct block_device *bdev, fmode_t mode)
        return 0;
 }
 
-static int floppy_unlocked_open(struct block_device *bdev, fmode_t mode)
+static int floppy_unlocked_open(struct gendisk *disk, blk_mode_t mode)
 {
        int ret;
 
        mutex_lock(&ataflop_mutex);
-       ret = floppy_open(bdev, mode);
+       ret = floppy_open(disk, mode);
        mutex_unlock(&ataflop_mutex);
 
        return ret;
 }
 
-static void floppy_release(struct gendisk *disk, fmode_t mode)
+static void floppy_release(struct gendisk *disk)
 {
        struct atari_floppy_struct *p = disk->private_data;
        mutex_lock(&ataflop_mutex);
index bcad9b9..970bd6f 100644 (file)
@@ -19,7 +19,7 @@
 #include <linux/highmem.h>
 #include <linux/mutex.h>
 #include <linux/pagemap.h>
-#include <linux/radix-tree.h>
+#include <linux/xarray.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/backing-dev.h>
@@ -28,7 +28,7 @@
 #include <linux/uaccess.h>
 
 /*
- * Each block ramdisk device has a radix_tree brd_pages of pages that stores
+ * Each block ramdisk device has an xarray brd_pages of pages that stores
  * the pages containing the block device's contents. A brd page's ->index is
  * its offset in PAGE_SIZE units. This is similar to, but in no way connected
  * with, the kernel's pagecache or buffer cache (which sit above our block
@@ -40,11 +40,9 @@ struct brd_device {
        struct list_head        brd_list;
 
        /*
-        * Backing store of pages and lock to protect it. This is the contents
-        * of the block device.
+        * Backing store of pages. This is the contents of the block device.
         */
-       spinlock_t              brd_lock;
-       struct radix_tree_root  brd_pages;
+       struct xarray           brd_pages;
        u64                     brd_nr_pages;
 };
 
@@ -56,21 +54,8 @@ static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
        pgoff_t idx;
        struct page *page;
 
-       /*
-        * The page lifetime is protected by the fact that we have opened the
-        * device node -- brd pages will never be deleted under us, so we
-        * don't need any further locking or refcounting.
-        *
-        * This is strictly true for the radix-tree nodes as well (ie. we
-        * don't actually need the rcu_read_lock()), however that is not a
-        * documented feature of the radix-tree API so it is better to be
-        * safe here (we don't have total exclusion from radix tree updates
-        * here, only deletes).
-        */
-       rcu_read_lock();
        idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */
-       page = radix_tree_lookup(&brd->brd_pages, idx);
-       rcu_read_unlock();
+       page = xa_load(&brd->brd_pages, idx);
 
        BUG_ON(page && page->index != idx);
 
@@ -83,7 +68,7 @@ static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
 static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
 {
        pgoff_t idx;
-       struct page *page;
+       struct page *page, *cur;
        int ret = 0;
 
        page = brd_lookup_page(brd, sector);
@@ -94,71 +79,42 @@ static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
        if (!page)
                return -ENOMEM;
 
-       if (radix_tree_maybe_preload(gfp)) {
-               __free_page(page);
-               return -ENOMEM;
-       }
+       xa_lock(&brd->brd_pages);
 
-       spin_lock(&brd->brd_lock);
        idx = sector >> PAGE_SECTORS_SHIFT;
        page->index = idx;
-       if (radix_tree_insert(&brd->brd_pages, idx, page)) {
+
+       cur = __xa_cmpxchg(&brd->brd_pages, idx, NULL, page, gfp);
+
+       if (unlikely(cur)) {
                __free_page(page);
-               page = radix_tree_lookup(&brd->brd_pages, idx);
-               if (!page)
-                       ret = -ENOMEM;
-               else if (page->index != idx)
+               ret = xa_err(cur);
+               if (!ret && (cur->index != idx))
                        ret = -EIO;
        } else {
                brd->brd_nr_pages++;
        }
-       spin_unlock(&brd->brd_lock);
 
-       radix_tree_preload_end();
+       xa_unlock(&brd->brd_pages);
+
        return ret;
 }
 
 /*
- * Free all backing store pages and radix tree. This must only be called when
+ * Free all backing store pages and the xarray. This must only be called when
  * there are no other users of the device.
  */
-#define FREE_BATCH 16
 static void brd_free_pages(struct brd_device *brd)
 {
-       unsigned long pos = 0;
-       struct page *pages[FREE_BATCH];
-       int nr_pages;
-
-       do {
-               int i;
-
-               nr_pages = radix_tree_gang_lookup(&brd->brd_pages,
-                               (void **)pages, pos, FREE_BATCH);
-
-               for (i = 0; i < nr_pages; i++) {
-                       void *ret;
-
-                       BUG_ON(pages[i]->index < pos);
-                       pos = pages[i]->index;
-                       ret = radix_tree_delete(&brd->brd_pages, pos);
-                       BUG_ON(!ret || ret != pages[i]);
-                       __free_page(pages[i]);
-               }
-
-               pos++;
+       struct page *page;
+       pgoff_t idx;
 
-               /*
-                * It takes 3.4 seconds to remove 80GiB ramdisk.
-                * So, we need cond_resched to avoid stalling the CPU.
-                */
+       xa_for_each(&brd->brd_pages, idx, page) {
+               __free_page(page);
                cond_resched();
+       }
 
-               /*
-                * This assumes radix_tree_gang_lookup always returns as
-                * many pages as possible. If the radix-tree code changes,
-                * so will this have to.
-                */
-       } while (nr_pages == FREE_BATCH);
+       xa_destroy(&brd->brd_pages);
 }
 
 /*
@@ -372,8 +328,7 @@ static int brd_alloc(int i)
        brd->brd_number         = i;
        list_add_tail(&brd->brd_list, &brd_devices);
 
-       spin_lock_init(&brd->brd_lock);
-       INIT_RADIX_TREE(&brd->brd_pages, GFP_ATOMIC);
+       xa_init(&brd->brd_pages);
 
        snprintf(buf, DISK_NAME_LEN, "ram%d", i);
        if (!IS_ERR_OR_NULL(brd_debugfs_dir))
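
A sketch of the lookup/insert pattern brd now uses with the XArray: xa_load() replaces the RCU radix-tree lookup, and __xa_cmpxchg() under xa_lock() inserts a page only if the slot is still empty, so a racing inserter's page wins and the local one is freed. Illustrative only; the names are placeholders.

#include <linux/mm.h>
#include <linux/xarray.h>

static int example_insert_page(struct xarray *pages, pgoff_t idx, gfp_t gfp)
{
        struct page *page, *cur;
        int ret = 0;

        if (xa_load(pages, idx))
                return 0;               /* slot already populated */

        page = alloc_page(gfp | __GFP_ZERO);
        if (!page)
                return -ENOMEM;

        xa_lock(pages);
        cur = __xa_cmpxchg(pages, idx, NULL, page, gfp);
        if (cur) {                      /* lost the race, or an xa_err() */
                __free_page(page);
                ret = xa_err(cur);
        }
        xa_unlock(pages);

        return ret;
}
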
index 6ac8c54..85ca000 100644 (file)
@@ -1043,9 +1043,7 @@ static void bm_page_io_async(struct drbd_bm_aio_ctx *ctx, int page_nr) __must_ho
        bio = bio_alloc_bioset(device->ldev->md_bdev, 1, op, GFP_NOIO,
                        &drbd_md_io_bio_set);
        bio->bi_iter.bi_sector = on_disk_sector;
-       /* bio_add_page of a single page to an empty bio will always succeed,
-        * according to api.  Do we want to assert that? */
-       bio_add_page(bio, page, len, 0);
+       __bio_add_page(bio, page, len, 0);
        bio->bi_private = ctx;
        bio->bi_end_io = drbd_bm_endio;
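
Both __bio_add_page() conversions in this pull (here and in the floppy driver below) rely on the same guarantee: the bio was just allocated with room for the page, so the unchecked helper can replace bio_add_page(). A hedged sketch, with placeholder names:

#include <linux/bio.h>

static struct bio *example_single_page_bio(struct block_device *bdev,
                                           struct page *page, unsigned int len)
{
        /*
         * One bio_vec is reserved up front, so the unchecked add cannot
         * fail; GFP_NOIO allows direct reclaim, so bio_alloc() succeeds.
         */
        struct bio *bio = bio_alloc(bdev, 1, REQ_OP_READ, GFP_NOIO);

        __bio_add_page(bio, page, len, 0);
        bio->bi_iter.bi_sector = 0;
        return bio;
}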
 
index ea82d67..79ab532 100644 (file)
@@ -37,7 +37,6 @@
 #include <linux/notifier.h>
 #include <linux/kthread.h>
 #include <linux/workqueue.h>
-#define __KERNEL_SYSCALLS__
 #include <linux/unistd.h>
 #include <linux/vmalloc.h>
 #include <linux/sched/signal.h>
@@ -50,8 +49,8 @@
 #include "drbd_debugfs.h"
 
 static DEFINE_MUTEX(drbd_main_mutex);
-static int drbd_open(struct block_device *bdev, fmode_t mode);
-static void drbd_release(struct gendisk *gd, fmode_t mode);
+static int drbd_open(struct gendisk *disk, blk_mode_t mode);
+static void drbd_release(struct gendisk *gd);
 static void md_sync_timer_fn(struct timer_list *t);
 static int w_bitmap_io(struct drbd_work *w, int unused);
 
@@ -1887,9 +1886,9 @@ int drbd_send_all(struct drbd_connection *connection, struct socket *sock, void
        return 0;
 }
 
-static int drbd_open(struct block_device *bdev, fmode_t mode)
+static int drbd_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct drbd_device *device = bdev->bd_disk->private_data;
+       struct drbd_device *device = disk->private_data;
        unsigned long flags;
        int rv = 0;
 
@@ -1899,7 +1898,7 @@ static int drbd_open(struct block_device *bdev, fmode_t mode)
         * and no race with updating open_cnt */
 
        if (device->state.role != R_PRIMARY) {
-               if (mode & FMODE_WRITE)
+               if (mode & BLK_OPEN_WRITE)
                        rv = -EROFS;
                else if (!drbd_allow_oos)
                        rv = -EMEDIUMTYPE;
@@ -1913,9 +1912,10 @@ static int drbd_open(struct block_device *bdev, fmode_t mode)
        return rv;
 }
 
-static void drbd_release(struct gendisk *gd, fmode_t mode)
+static void drbd_release(struct gendisk *gd)
 {
        struct drbd_device *device = gd->private_data;
+
        mutex_lock(&drbd_main_mutex);
        device->open_cnt--;
        mutex_unlock(&drbd_main_mutex);
index 1a5d3d7..cddae6f 100644 (file)
@@ -1640,8 +1640,8 @@ static struct block_device *open_backing_dev(struct drbd_device *device,
        struct block_device *bdev;
        int err = 0;
 
-       bdev = blkdev_get_by_path(bdev_path,
-                                 FMODE_READ | FMODE_WRITE | FMODE_EXCL, claim_ptr);
+       bdev = blkdev_get_by_path(bdev_path, BLK_OPEN_READ | BLK_OPEN_WRITE,
+                                 claim_ptr, NULL);
        if (IS_ERR(bdev)) {
                drbd_err(device, "open(\"%s\") failed with %ld\n",
                                bdev_path, PTR_ERR(bdev));
@@ -1653,7 +1653,7 @@ static struct block_device *open_backing_dev(struct drbd_device *device,
 
        err = bd_link_disk_holder(bdev, device->vdisk);
        if (err) {
-               blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+               blkdev_put(bdev, claim_ptr);
                drbd_err(device, "bd_link_disk_holder(\"%s\", ...) failed with %d\n",
                                bdev_path, err);
                bdev = ERR_PTR(err);
@@ -1695,13 +1695,13 @@ static int open_backing_devices(struct drbd_device *device,
 }
 
 static void close_backing_dev(struct drbd_device *device, struct block_device *bdev,
-       bool do_bd_unlink)
+               void *claim_ptr, bool do_bd_unlink)
 {
        if (!bdev)
                return;
        if (do_bd_unlink)
                bd_unlink_disk_holder(bdev, device->vdisk);
-       blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+       blkdev_put(bdev, claim_ptr);
 }
 
 void drbd_backing_dev_free(struct drbd_device *device, struct drbd_backing_dev *ldev)
@@ -1709,8 +1709,11 @@ void drbd_backing_dev_free(struct drbd_device *device, struct drbd_backing_dev *
        if (ldev == NULL)
                return;
 
-       close_backing_dev(device, ldev->md_bdev, ldev->md_bdev != ldev->backing_bdev);
-       close_backing_dev(device, ldev->backing_bdev, true);
+       close_backing_dev(device, ldev->md_bdev,
+                         ldev->md.meta_dev_idx < 0 ?
+                               (void *)device : (void *)drbd_m_holder,
+                         ldev->md_bdev != ldev->backing_bdev);
+       close_backing_dev(device, ldev->backing_bdev, device, true);
 
        kfree(ldev->disk_conf);
        kfree(ldev);
@@ -2126,8 +2129,11 @@ int drbd_adm_attach(struct sk_buff *skb, struct genl_info *info)
  fail:
        conn_reconfig_done(connection);
        if (nbc) {
-               close_backing_dev(device, nbc->md_bdev, nbc->md_bdev != nbc->backing_bdev);
-               close_backing_dev(device, nbc->backing_bdev, true);
+               close_backing_dev(device, nbc->md_bdev,
+                         nbc->disk_conf->meta_dev_idx < 0 ?
+                               (void *)device : (void *)drbd_m_holder,
+                         nbc->md_bdev != nbc->backing_bdev);
+               close_backing_dev(device, nbc->backing_bdev, device, true);
                kfree(nbc);
        }
        kfree(new_disk_conf);
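
A sketch of the open/close pairing the drbd hunks above move to: blkdev_get_by_path() now takes a blk_mode_t, a holder pointer (a non-NULL holder requests the exclusive claim that FMODE_EXCL used to express) and optional blk_holder_ops, and the same holder must be handed back to blkdev_put(). The example_* names are placeholders.

#include <linux/blkdev.h>

static struct block_device *example_open_backing(const char *path, void *holder)
{
        return blkdev_get_by_path(path, BLK_OPEN_READ | BLK_OPEN_WRITE,
                                  holder, NULL);
}

static void example_close_backing(struct block_device *bdev, void *holder)
{
        blkdev_put(bdev, holder);
}
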
index 8c2bc47..0c9f541 100644 (file)
@@ -27,7 +27,6 @@
 #include <uapi/linux/sched/types.h>
 #include <linux/sched/signal.h>
 #include <linux/pkt_sched.h>
-#define __KERNEL_SYSCALLS__
 #include <linux/unistd.h>
 #include <linux/vmalloc.h>
 #include <linux/random.h>
index cec2c20..2db9b18 100644 (file)
@@ -402,7 +402,7 @@ static struct floppy_drive_struct drive_state[N_DRIVE];
 static struct floppy_write_errors write_errors[N_DRIVE];
 static struct timer_list motor_off_timer[N_DRIVE];
 static struct blk_mq_tag_set tag_sets[N_DRIVE];
-static struct block_device *opened_bdev[N_DRIVE];
+static struct gendisk *opened_disk[N_DRIVE];
 static DEFINE_MUTEX(open_lock);
 static struct floppy_raw_cmd *raw_cmd, default_raw_cmd;
 
@@ -3210,13 +3210,13 @@ static int floppy_raw_cmd_ioctl(int type, int drive, int cmd,
 
 #endif
 
-static int invalidate_drive(struct block_device *bdev)
+static int invalidate_drive(struct gendisk *disk)
 {
        /* invalidate the buffer track to force a reread */
-       set_bit((long)bdev->bd_disk->private_data, &fake_change);
+       set_bit((long)disk->private_data, &fake_change);
        process_fd_request();
-       if (bdev_check_media_change(bdev))
-               floppy_revalidate(bdev->bd_disk);
+       if (disk_check_media_change(disk))
+               floppy_revalidate(disk);
        return 0;
 }
 
@@ -3251,10 +3251,11 @@ static int set_geometry(unsigned int cmd, struct floppy_struct *g,
                            floppy_type[type].size + 1;
                process_fd_request();
                for (cnt = 0; cnt < N_DRIVE; cnt++) {
-                       struct block_device *bdev = opened_bdev[cnt];
-                       if (!bdev || ITYPE(drive_state[cnt].fd_device) != type)
+                       struct gendisk *disk = opened_disk[cnt];
+
+                       if (!disk || ITYPE(drive_state[cnt].fd_device) != type)
                                continue;
-                       __invalidate_device(bdev, true);
+                       __invalidate_device(disk->part0, true);
                }
                mutex_unlock(&open_lock);
        } else {
@@ -3287,7 +3288,7 @@ static int set_geometry(unsigned int cmd, struct floppy_struct *g,
                    drive_state[current_drive].maxtrack ||
                    ((user_params[drive].sect ^ oldStretch) &
                     (FD_SWAPSIDES | FD_SECTBASEMASK)))
-                       invalidate_drive(bdev);
+                       invalidate_drive(bdev->bd_disk);
                else
                        process_fd_request();
        }
@@ -3393,8 +3394,8 @@ static bool valid_floppy_drive_params(const short autodetect[FD_AUTODETECT_SIZE]
        return true;
 }
 
-static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
-                   unsigned long param)
+static int fd_locked_ioctl(struct block_device *bdev, blk_mode_t mode,
+               unsigned int cmd, unsigned long param)
 {
        int drive = (long)bdev->bd_disk->private_data;
        int type = ITYPE(drive_state[drive].fd_device);
@@ -3427,7 +3428,8 @@ static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode, unsigned int
                return ret;
 
        /* permission checks */
-       if (((cmd & 0x40) && !(mode & (FMODE_WRITE | FMODE_WRITE_IOCTL))) ||
+       if (((cmd & 0x40) &&
+            !(mode & (BLK_OPEN_WRITE | BLK_OPEN_WRITE_IOCTL))) ||
            ((cmd & 0x80) && !capable(CAP_SYS_ADMIN)))
                return -EPERM;
 
@@ -3464,7 +3466,7 @@ static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode, unsigned int
                current_type[drive] = NULL;
                floppy_sizes[drive] = MAX_DISK_SIZE << 1;
                drive_state[drive].keep_data = 0;
-               return invalidate_drive(bdev);
+               return invalidate_drive(bdev->bd_disk);
        case FDSETPRM:
        case FDDEFPRM:
                return set_geometry(cmd, &inparam.g, drive, type, bdev);
@@ -3503,7 +3505,7 @@ static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode, unsigned int
        case FDFLUSH:
                if (lock_fdc(drive))
                        return -EINTR;
-               return invalidate_drive(bdev);
+               return invalidate_drive(bdev->bd_disk);
        case FDSETEMSGTRESH:
                drive_params[drive].max_errors.reporting = (unsigned short)(param & 0x0f);
                return 0;
@@ -3565,7 +3567,7 @@ static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode, unsigned int
        return 0;
 }
 
-static int fd_ioctl(struct block_device *bdev, fmode_t mode,
+static int fd_ioctl(struct block_device *bdev, blk_mode_t mode,
                             unsigned int cmd, unsigned long param)
 {
        int ret;
@@ -3653,8 +3655,8 @@ struct compat_floppy_write_errors {
 #define FDGETFDCSTAT32 _IOR(2, 0x15, struct compat_floppy_fdc_state)
 #define FDWERRORGET32  _IOR(2, 0x17, struct compat_floppy_write_errors)
 
-static int compat_set_geometry(struct block_device *bdev, fmode_t mode, unsigned int cmd,
-                   struct compat_floppy_struct __user *arg)
+static int compat_set_geometry(struct block_device *bdev, blk_mode_t mode,
+               unsigned int cmd, struct compat_floppy_struct __user *arg)
 {
        struct floppy_struct v;
        int drive, type;
@@ -3663,7 +3665,7 @@ static int compat_set_geometry(struct block_device *bdev, fmode_t mode, unsigned
        BUILD_BUG_ON(offsetof(struct floppy_struct, name) !=
                     offsetof(struct compat_floppy_struct, name));
 
-       if (!(mode & (FMODE_WRITE | FMODE_WRITE_IOCTL)))
+       if (!(mode & (BLK_OPEN_WRITE | BLK_OPEN_WRITE_IOCTL)))
                return -EPERM;
 
        memset(&v, 0, sizeof(struct floppy_struct));
@@ -3860,8 +3862,8 @@ static int compat_werrorget(int drive,
        return 0;
 }
 
-static int fd_compat_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
-                   unsigned long param)
+static int fd_compat_ioctl(struct block_device *bdev, blk_mode_t mode,
+               unsigned int cmd, unsigned long param)
 {
        int drive = (long)bdev->bd_disk->private_data;
        switch (cmd) {
@@ -3962,7 +3964,7 @@ static void __init config_types(void)
                pr_cont("\n");
 }
 
-static void floppy_release(struct gendisk *disk, fmode_t mode)
+static void floppy_release(struct gendisk *disk)
 {
        int drive = (long)disk->private_data;
 
@@ -3973,7 +3975,7 @@ static void floppy_release(struct gendisk *disk, fmode_t mode)
                drive_state[drive].fd_ref = 0;
        }
        if (!drive_state[drive].fd_ref)
-               opened_bdev[drive] = NULL;
+               opened_disk[drive] = NULL;
        mutex_unlock(&open_lock);
        mutex_unlock(&floppy_mutex);
 }
@@ -3983,9 +3985,9 @@ static void floppy_release(struct gendisk *disk, fmode_t mode)
  * /dev/PS0 etc), and disallows simultaneous access to the same
  * drive with different device numbers.
  */
-static int floppy_open(struct block_device *bdev, fmode_t mode)
+static int floppy_open(struct gendisk *disk, blk_mode_t mode)
 {
-       int drive = (long)bdev->bd_disk->private_data;
+       int drive = (long)disk->private_data;
        int old_dev, new_dev;
        int try;
        int res = -EBUSY;
@@ -3994,7 +3996,7 @@ static int floppy_open(struct block_device *bdev, fmode_t mode)
        mutex_lock(&floppy_mutex);
        mutex_lock(&open_lock);
        old_dev = drive_state[drive].fd_device;
-       if (opened_bdev[drive] && opened_bdev[drive] != bdev)
+       if (opened_disk[drive] && opened_disk[drive] != disk)
                goto out2;
 
        if (!drive_state[drive].fd_ref && (drive_params[drive].flags & FD_BROKEN_DCL)) {
@@ -4004,7 +4006,7 @@ static int floppy_open(struct block_device *bdev, fmode_t mode)
 
        drive_state[drive].fd_ref++;
 
-       opened_bdev[drive] = bdev;
+       opened_disk[drive] = disk;
 
        res = -ENXIO;
 
@@ -4038,7 +4040,7 @@ static int floppy_open(struct block_device *bdev, fmode_t mode)
                }
        }
 
-       new_dev = MINOR(bdev->bd_dev);
+       new_dev = disk->first_minor;
        drive_state[drive].fd_device = new_dev;
        set_capacity(disks[drive][ITYPE(new_dev)], floppy_sizes[new_dev]);
        if (old_dev != -1 && old_dev != new_dev) {
@@ -4048,21 +4050,20 @@ static int floppy_open(struct block_device *bdev, fmode_t mode)
 
        if (fdc_state[FDC(drive)].rawcmd == 1)
                fdc_state[FDC(drive)].rawcmd = 2;
-
-       if (!(mode & FMODE_NDELAY)) {
-               if (mode & (FMODE_READ|FMODE_WRITE)) {
+       if (!(mode & BLK_OPEN_NDELAY)) {
+               if (mode & (BLK_OPEN_READ | BLK_OPEN_WRITE)) {
                        drive_state[drive].last_checked = 0;
                        clear_bit(FD_OPEN_SHOULD_FAIL_BIT,
                                  &drive_state[drive].flags);
-                       if (bdev_check_media_change(bdev))
-                               floppy_revalidate(bdev->bd_disk);
+                       if (disk_check_media_change(disk))
+                               floppy_revalidate(disk);
                        if (test_bit(FD_DISK_CHANGED_BIT, &drive_state[drive].flags))
                                goto out;
                        if (test_bit(FD_OPEN_SHOULD_FAIL_BIT, &drive_state[drive].flags))
                                goto out;
                }
                res = -EROFS;
-               if ((mode & FMODE_WRITE) &&
+               if ((mode & BLK_OPEN_WRITE) &&
                    !test_bit(FD_DISK_WRITABLE_BIT, &drive_state[drive].flags))
                        goto out;
        }
@@ -4073,7 +4074,7 @@ out:
        drive_state[drive].fd_ref--;
 
        if (!drive_state[drive].fd_ref)
-               opened_bdev[drive] = NULL;
+               opened_disk[drive] = NULL;
 out2:
        mutex_unlock(&open_lock);
        mutex_unlock(&floppy_mutex);
@@ -4147,7 +4148,7 @@ static int __floppy_read_block_0(struct block_device *bdev, int drive)
        cbdata.drive = drive;
 
        bio_init(&bio, bdev, &bio_vec, 1, REQ_OP_READ);
-       bio_add_page(&bio, page, block_size(bdev), 0);
+       __bio_add_page(&bio, page, block_size(bdev), 0);
 
        bio.bi_iter.bi_sector = 0;
        bio.bi_flags |= (1 << BIO_QUIET);
@@ -4203,7 +4204,8 @@ static int floppy_revalidate(struct gendisk *disk)
                        drive_state[drive].generation++;
                if (drive_no_geom(drive)) {
                        /* auto-sensing */
-                       res = __floppy_read_block_0(opened_bdev[drive], drive);
+                       res = __floppy_read_block_0(opened_disk[drive]->part0,
+                                                   drive);
                } else {
                        if (cf)
                                poll_drive(false, FD_RAW_NEED_DISK);
index bc31bb7..37511d2 100644 (file)
@@ -990,7 +990,7 @@ loop_set_status_from_info(struct loop_device *lo,
        return 0;
 }
 
-static int loop_configure(struct loop_device *lo, fmode_t mode,
+static int loop_configure(struct loop_device *lo, blk_mode_t mode,
                          struct block_device *bdev,
                          const struct loop_config *config)
 {
@@ -1014,8 +1014,8 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
         * If we don't hold exclusive handle for the device, upgrade to it
         * here to avoid changing device under exclusive owner.
         */
-       if (!(mode & FMODE_EXCL)) {
-               error = bd_prepare_to_claim(bdev, loop_configure);
+       if (!(mode & BLK_OPEN_EXCL)) {
+               error = bd_prepare_to_claim(bdev, loop_configure, NULL);
                if (error)
                        goto out_putf;
        }
@@ -1050,7 +1050,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
        if (error)
                goto out_unlock;
 
-       if (!(file->f_mode & FMODE_WRITE) || !(mode & FMODE_WRITE) ||
+       if (!(file->f_mode & FMODE_WRITE) || !(mode & BLK_OPEN_WRITE) ||
            !file->f_op->write_iter)
                lo->lo_flags |= LO_FLAGS_READ_ONLY;
 
@@ -1116,7 +1116,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
        if (partscan)
                loop_reread_partitions(lo);
 
-       if (!(mode & FMODE_EXCL))
+       if (!(mode & BLK_OPEN_EXCL))
                bd_abort_claiming(bdev, loop_configure);
 
        return 0;
@@ -1124,7 +1124,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 out_unlock:
        loop_global_unlock(lo, is_loop);
 out_bdev:
-       if (!(mode & FMODE_EXCL))
+       if (!(mode & BLK_OPEN_EXCL))
                bd_abort_claiming(bdev, loop_configure);
 out_putf:
        fput(file);
@@ -1528,7 +1528,7 @@ static int lo_simple_ioctl(struct loop_device *lo, unsigned int cmd,
        return err;
 }
 
-static int lo_ioctl(struct block_device *bdev, fmode_t mode,
+static int lo_ioctl(struct block_device *bdev, blk_mode_t mode,
        unsigned int cmd, unsigned long arg)
 {
        struct loop_device *lo = bdev->bd_disk->private_data;
@@ -1563,24 +1563,22 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
                return loop_clr_fd(lo);
        case LOOP_SET_STATUS:
                err = -EPERM;
-               if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+               if ((mode & BLK_OPEN_WRITE) || capable(CAP_SYS_ADMIN))
                        err = loop_set_status_old(lo, argp);
-               }
                break;
        case LOOP_GET_STATUS:
                return loop_get_status_old(lo, argp);
        case LOOP_SET_STATUS64:
                err = -EPERM;
-               if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+               if ((mode & BLK_OPEN_WRITE) || capable(CAP_SYS_ADMIN))
                        err = loop_set_status64(lo, argp);
-               }
                break;
        case LOOP_GET_STATUS64:
                return loop_get_status64(lo, argp);
        case LOOP_SET_CAPACITY:
        case LOOP_SET_DIRECT_IO:
        case LOOP_SET_BLOCK_SIZE:
-               if (!(mode & FMODE_WRITE) && !capable(CAP_SYS_ADMIN))
+               if (!(mode & BLK_OPEN_WRITE) && !capable(CAP_SYS_ADMIN))
                        return -EPERM;
                fallthrough;
        default:
@@ -1691,7 +1689,7 @@ loop_get_status_compat(struct loop_device *lo,
        return err;
 }
 
-static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
+static int lo_compat_ioctl(struct block_device *bdev, blk_mode_t mode,
                           unsigned int cmd, unsigned long arg)
 {
        struct loop_device *lo = bdev->bd_disk->private_data;
@@ -1727,7 +1725,7 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 }
 #endif
 
-static void lo_release(struct gendisk *disk, fmode_t mode)
+static void lo_release(struct gendisk *disk)
 {
        struct loop_device *lo = disk->private_data;
 
index 815d77b..b200950 100644 (file)
@@ -3041,7 +3041,7 @@ static int rssd_disk_name_format(char *prefix,
  *                 structure pointer.
  */
 static int mtip_block_ioctl(struct block_device *dev,
-                           fmode_t mode,
+                           blk_mode_t mode,
                            unsigned cmd,
                            unsigned long arg)
 {
@@ -3079,7 +3079,7 @@ static int mtip_block_ioctl(struct block_device *dev,
  *                 structure pointer.
  */
 static int mtip_block_compat_ioctl(struct block_device *dev,
-                           fmode_t mode,
+                           blk_mode_t mode,
                            unsigned cmd,
                            unsigned long arg)
 {
index 65ecde3..8576d69 100644 (file)
@@ -1502,7 +1502,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
        return -ENOTTY;
 }
 
-static int nbd_ioctl(struct block_device *bdev, fmode_t mode,
+static int nbd_ioctl(struct block_device *bdev, blk_mode_t mode,
                     unsigned int cmd, unsigned long arg)
 {
        struct nbd_device *nbd = bdev->bd_disk->private_data;
@@ -1553,13 +1553,13 @@ static struct nbd_config *nbd_alloc_config(void)
        return config;
 }
 
-static int nbd_open(struct block_device *bdev, fmode_t mode)
+static int nbd_open(struct gendisk *disk, blk_mode_t mode)
 {
        struct nbd_device *nbd;
        int ret = 0;
 
        mutex_lock(&nbd_index_mutex);
-       nbd = bdev->bd_disk->private_data;
+       nbd = disk->private_data;
        if (!nbd) {
                ret = -ENXIO;
                goto out;
@@ -1587,17 +1587,17 @@ static int nbd_open(struct block_device *bdev, fmode_t mode)
                refcount_inc(&nbd->refs);
                mutex_unlock(&nbd->config_lock);
                if (max_part)
-                       set_bit(GD_NEED_PART_SCAN, &bdev->bd_disk->state);
+                       set_bit(GD_NEED_PART_SCAN, &disk->state);
        } else if (nbd_disconnected(nbd->config)) {
                if (max_part)
-                       set_bit(GD_NEED_PART_SCAN, &bdev->bd_disk->state);
+                       set_bit(GD_NEED_PART_SCAN, &disk->state);
        }
 out:
        mutex_unlock(&nbd_index_mutex);
        return ret;
 }
 
-static void nbd_release(struct gendisk *disk, fmode_t mode)
+static void nbd_release(struct gendisk *disk)
 {
        struct nbd_device *nbd = disk->private_data;
 
@@ -1776,7 +1776,8 @@ static struct nbd_device *nbd_dev_add(int index, unsigned int refs)
                if (err == -ENOSPC)
                        err = -EEXIST;
        } else {
-               err = idr_alloc(&nbd_index_idr, nbd, 0, 0, GFP_KERNEL);
+               err = idr_alloc(&nbd_index_idr, nbd, 0,
+                               (MINORMASK >> part_shift) + 1, GFP_KERNEL);
                if (err >= 0)
                        index = err;
        }
index d5d7884..a142853 100644 (file)
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
-#include <linux/pktcdvd.h>
-#include <linux/module.h>
-#include <linux/types.h>
-#include <linux/kernel.h>
+#include <linux/backing-dev.h>
 #include <linux/compat.h>
-#include <linux/kthread.h>
+#include <linux/debugfs.h>
+#include <linux/device.h>
 #include <linux/errno.h>
-#include <linux/spinlock.h>
 #include <linux/file.h>
-#include <linux/proc_fs.h>
-#include <linux/seq_file.h>
-#include <linux/miscdevice.h>
 #include <linux/freezer.h>
+#include <linux/kernel.h>
+#include <linux/kthread.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/nospec.h>
+#include <linux/pktcdvd.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
 #include <linux/slab.h>
-#include <linux/backing-dev.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+
+#include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_ioctl.h>
-#include <scsi/scsi.h>
-#include <linux/debugfs.h>
-#include <linux/device.h>
-#include <linux/nospec.h>
-#include <linux/uaccess.h>
 
-#define DRIVER_NAME    "pktcdvd"
+#include <asm/unaligned.h>
 
-#define pkt_err(pd, fmt, ...)                                          \
-       pr_err("%s: " fmt, pd->name, ##__VA_ARGS__)
-#define pkt_notice(pd, fmt, ...)                                       \
-       pr_notice("%s: " fmt, pd->name, ##__VA_ARGS__)
-#define pkt_info(pd, fmt, ...)                                         \
-       pr_info("%s: " fmt, pd->name, ##__VA_ARGS__)
-
-#define pkt_dbg(level, pd, fmt, ...)                                   \
-do {                                                                   \
-       if (level == 2 && PACKET_DEBUG >= 2)                            \
-               pr_notice("%s: %s():" fmt,                              \
-                         pd->name, __func__, ##__VA_ARGS__);           \
-       else if (level == 1 && PACKET_DEBUG >= 1)                       \
-               pr_notice("%s: " fmt, pd->name, ##__VA_ARGS__);         \
-} while (0)
+#define DRIVER_NAME    "pktcdvd"
 
 #define MAX_SPEED 0xffff
 
@@ -107,7 +94,6 @@ static struct dentry *pkt_debugfs_root = NULL; /* /sys/kernel/debug/pktcdvd */
 /* forward declaration */
 static int pkt_setup_dev(dev_t dev, dev_t* pkt_dev);
 static int pkt_remove_dev(dev_t pkt_dev);
-static int pkt_seq_show(struct seq_file *m, void *p);
 
 static sector_t get_zone(sector_t sector, struct pktcdvd_device *pd)
 {
@@ -253,15 +239,16 @@ static ssize_t congestion_off_store(struct device *dev,
                                    const char *buf, size_t len)
 {
        struct pktcdvd_device *pd = dev_get_drvdata(dev);
-       int val;
+       int val, ret;
 
-       if (sscanf(buf, "%d", &val) == 1) {
-               spin_lock(&pd->lock);
-               pd->write_congestion_off = val;
-               init_write_congestion_marks(&pd->write_congestion_off,
-                                       &pd->write_congestion_on);
-               spin_unlock(&pd->lock);
-       }
+       ret = kstrtoint(buf, 10, &val);
+       if (ret)
+               return ret;
+
+       spin_lock(&pd->lock);
+       pd->write_congestion_off = val;
+       init_write_congestion_marks(&pd->write_congestion_off, &pd->write_congestion_on);
+       spin_unlock(&pd->lock);
        return len;
 }
 static DEVICE_ATTR_RW(congestion_off);
@@ -283,15 +270,16 @@ static ssize_t congestion_on_store(struct device *dev,
                                   const char *buf, size_t len)
 {
        struct pktcdvd_device *pd = dev_get_drvdata(dev);
-       int val;
+       int val, ret;
 
-       if (sscanf(buf, "%d", &val) == 1) {
-               spin_lock(&pd->lock);
-               pd->write_congestion_on = val;
-               init_write_congestion_marks(&pd->write_congestion_off,
-                                       &pd->write_congestion_on);
-               spin_unlock(&pd->lock);
-       }
+       ret = kstrtoint(buf, 10, &val);
+       if (ret)
+               return ret;
+
+       spin_lock(&pd->lock);
+       pd->write_congestion_on = val;
+       init_write_congestion_marks(&pd->write_congestion_off, &pd->write_congestion_on);
+       spin_unlock(&pd->lock);
        return len;
 }
 static DEVICE_ATTR_RW(congestion_on);
@@ -319,7 +307,7 @@ static void pkt_sysfs_dev_new(struct pktcdvd_device *pd)
        if (class_is_registered(&class_pktcdvd)) {
                pd->dev = device_create_with_groups(&class_pktcdvd, NULL,
                                                    MKDEV(0, 0), pd, pkt_groups,
-                                                   "%s", pd->name);
+                                                   "%s", pd->disk->disk_name);
                if (IS_ERR(pd->dev))
                        pd->dev = NULL;
        }
@@ -349,8 +337,8 @@ static ssize_t device_map_show(const struct class *c, const struct class_attribu
                struct pktcdvd_device *pd = pkt_devs[idx];
                if (!pd)
                        continue;
-               n += sprintf(data+n, "%s %u:%u %u:%u\n",
-                       pd->name,
+               n += sysfs_emit_at(data, n, "%s %u:%u %u:%u\n",
+                       pd->disk->disk_name,
                        MAJOR(pd->pkt_dev), MINOR(pd->pkt_dev),
                        MAJOR(pd->bdev->bd_dev),
                        MINOR(pd->bdev->bd_dev));
@@ -428,34 +416,92 @@ static void pkt_sysfs_cleanup(void)
 
  *******************************************************************/
 
-static int pkt_debugfs_seq_show(struct seq_file *m, void *p)
+static void pkt_count_states(struct pktcdvd_device *pd, int *states)
 {
-       return pkt_seq_show(m, p);
+       struct packet_data *pkt;
+       int i;
+
+       for (i = 0; i < PACKET_NUM_STATES; i++)
+               states[i] = 0;
+
+       spin_lock(&pd->cdrw.active_list_lock);
+       list_for_each_entry(pkt, &pd->cdrw.pkt_active_list, list) {
+               states[pkt->state]++;
+       }
+       spin_unlock(&pd->cdrw.active_list_lock);
 }
 
-static int pkt_debugfs_fops_open(struct inode *inode, struct file *file)
+static int pkt_seq_show(struct seq_file *m, void *p)
 {
-       return single_open(file, pkt_debugfs_seq_show, inode->i_private);
-}
+       struct pktcdvd_device *pd = m->private;
+       char *msg;
+       int states[PACKET_NUM_STATES];
 
-static const struct file_operations debug_fops = {
-       .open           = pkt_debugfs_fops_open,
-       .read           = seq_read,
-       .llseek         = seq_lseek,
-       .release        = single_release,
-       .owner          = THIS_MODULE,
-};
+       seq_printf(m, "Writer %s mapped to %pg:\n", pd->disk->disk_name, pd->bdev);
+
+       seq_printf(m, "\nSettings:\n");
+       seq_printf(m, "\tpacket size:\t\t%dkB\n", pd->settings.size / 2);
+
+       if (pd->settings.write_type == 0)
+               msg = "Packet";
+       else
+               msg = "Unknown";
+       seq_printf(m, "\twrite type:\t\t%s\n", msg);
+
+       seq_printf(m, "\tpacket type:\t\t%s\n", pd->settings.fp ? "Fixed" : "Variable");
+       seq_printf(m, "\tlink loss:\t\t%d\n", pd->settings.link_loss);
+
+       seq_printf(m, "\ttrack mode:\t\t%d\n", pd->settings.track_mode);
+
+       if (pd->settings.block_mode == PACKET_BLOCK_MODE1)
+               msg = "Mode 1";
+       else if (pd->settings.block_mode == PACKET_BLOCK_MODE2)
+               msg = "Mode 2";
+       else
+               msg = "Unknown";
+       seq_printf(m, "\tblock mode:\t\t%s\n", msg);
+
+       seq_printf(m, "\nStatistics:\n");
+       seq_printf(m, "\tpackets started:\t%lu\n", pd->stats.pkt_started);
+       seq_printf(m, "\tpackets ended:\t\t%lu\n", pd->stats.pkt_ended);
+       seq_printf(m, "\twritten:\t\t%lukB\n", pd->stats.secs_w >> 1);
+       seq_printf(m, "\tread gather:\t\t%lukB\n", pd->stats.secs_rg >> 1);
+       seq_printf(m, "\tread:\t\t\t%lukB\n", pd->stats.secs_r >> 1);
+
+       seq_printf(m, "\nMisc:\n");
+       seq_printf(m, "\treference count:\t%d\n", pd->refcnt);
+       seq_printf(m, "\tflags:\t\t\t0x%lx\n", pd->flags);
+       seq_printf(m, "\tread speed:\t\t%ukB/s\n", pd->read_speed);
+       seq_printf(m, "\twrite speed:\t\t%ukB/s\n", pd->write_speed);
+       seq_printf(m, "\tstart offset:\t\t%lu\n", pd->offset);
+       seq_printf(m, "\tmode page offset:\t%u\n", pd->mode_offset);
+
+       seq_printf(m, "\nQueue state:\n");
+       seq_printf(m, "\tbios queued:\t\t%d\n", pd->bio_queue_size);
+       seq_printf(m, "\tbios pending:\t\t%d\n", atomic_read(&pd->cdrw.pending_bios));
+       seq_printf(m, "\tcurrent sector:\t\t0x%llx\n", pd->current_sector);
+
+       pkt_count_states(pd, states);
+       seq_printf(m, "\tstate:\t\t\ti:%d ow:%d rw:%d ww:%d rec:%d fin:%d\n",
+                  states[0], states[1], states[2], states[3], states[4], states[5]);
+
+       seq_printf(m, "\twrite congestion marks:\toff=%d on=%d\n",
+                       pd->write_congestion_off,
+                       pd->write_congestion_on);
+       return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(pkt_seq);
 
 static void pkt_debugfs_dev_new(struct pktcdvd_device *pd)
 {
        if (!pkt_debugfs_root)
                return;
-       pd->dfs_d_root = debugfs_create_dir(pd->name, pkt_debugfs_root);
+       pd->dfs_d_root = debugfs_create_dir(pd->disk->disk_name, pkt_debugfs_root);
        if (!pd->dfs_d_root)
                return;
 
-       pd->dfs_f_info = debugfs_create_file("info", 0444,
-                                            pd->dfs_d_root, pd, &debug_fops);
+       pd->dfs_f_info = debugfs_create_file("info", 0444, pd->dfs_d_root,
+                                            pd, &pkt_seq_fops);
 }
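
DEFINE_SHOW_ATTRIBUTE(pkt_seq) above generates pkt_seq_open() and pkt_seq_fops by wrapping single_open() around pkt_seq_show(), replacing the hand-rolled file_operations that were removed. A minimal stand-alone equivalent, with placeholder example_* names:

#include <linux/debugfs.h>
#include <linux/seq_file.h>

static int example_show(struct seq_file *m, void *p)
{
        seq_puts(m, "hello\n");
        return 0;
}
DEFINE_SHOW_ATTRIBUTE(example);  /* provides example_open() and example_fops */

/* used as: debugfs_create_file("example", 0444, parent, NULL, &example_fops); */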
 
 static void pkt_debugfs_dev_remove(struct pktcdvd_device *pd)
@@ -484,9 +530,11 @@ static void pkt_debugfs_cleanup(void)
 
 static void pkt_bio_finished(struct pktcdvd_device *pd)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
+
        BUG_ON(atomic_read(&pd->cdrw.pending_bios) <= 0);
        if (atomic_dec_and_test(&pd->cdrw.pending_bios)) {
-               pkt_dbg(2, pd, "queue empty\n");
+               dev_dbg(ddev, "queue empty\n");
                atomic_set(&pd->iosched.attention, 1);
                wake_up(&pd->wqueue);
        }
@@ -717,15 +765,16 @@ static const char *sense_key_string(__u8 index)
 static void pkt_dump_sense(struct pktcdvd_device *pd,
                           struct packet_command *cgc)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
        struct scsi_sense_hdr *sshdr = cgc->sshdr;
 
        if (sshdr)
-               pkt_err(pd, "%*ph - sense %02x.%02x.%02x (%s)\n",
+               dev_err(ddev, "%*ph - sense %02x.%02x.%02x (%s)\n",
                        CDROM_PACKET_SIZE, cgc->cmd,
                        sshdr->sense_key, sshdr->asc, sshdr->ascq,
                        sense_key_string(sshdr->sense_key));
        else
-               pkt_err(pd, "%*ph - no sense\n", CDROM_PACKET_SIZE, cgc->cmd);
+               dev_err(ddev, "%*ph - no sense\n", CDROM_PACKET_SIZE, cgc->cmd);
 }
 
 /*
@@ -762,10 +811,8 @@ static noinline_for_stack int pkt_set_speed(struct pktcdvd_device *pd,
        init_cdrom_command(&cgc, NULL, 0, CGC_DATA_NONE);
        cgc.sshdr = &sshdr;
        cgc.cmd[0] = GPCMD_SET_SPEED;
-       cgc.cmd[2] = (read_speed >> 8) & 0xff;
-       cgc.cmd[3] = read_speed & 0xff;
-       cgc.cmd[4] = (write_speed >> 8) & 0xff;
-       cgc.cmd[5] = write_speed & 0xff;
+       put_unaligned_be16(read_speed, &cgc.cmd[2]);
+       put_unaligned_be16(write_speed, &cgc.cmd[4]);
 
        ret = pkt_generic_packet(pd, &cgc);
        if (ret)
@@ -809,6 +856,7 @@ static void pkt_queue_bio(struct pktcdvd_device *pd, struct bio *bio)
  */
 static void pkt_iosched_process_queue(struct pktcdvd_device *pd)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
 
        if (atomic_read(&pd->iosched.attention) == 0)
                return;
@@ -836,7 +884,7 @@ static void pkt_iosched_process_queue(struct pktcdvd_device *pd)
                                need_write_seek = 0;
                        if (need_write_seek && reads_queued) {
                                if (atomic_read(&pd->cdrw.pending_bios) > 0) {
-                                       pkt_dbg(2, pd, "write, waiting\n");
+                                       dev_dbg(ddev, "write, waiting\n");
                                        break;
                                }
                                pkt_flush_cache(pd);
@@ -845,7 +893,7 @@ static void pkt_iosched_process_queue(struct pktcdvd_device *pd)
                } else {
                        if (!reads_queued && writes_queued) {
                                if (atomic_read(&pd->cdrw.pending_bios) > 0) {
-                                       pkt_dbg(2, pd, "read, waiting\n");
+                                       dev_dbg(ddev, "read, waiting\n");
                                        break;
                                }
                                pd->iosched.writing = 1;
@@ -892,25 +940,27 @@ static void pkt_iosched_process_queue(struct pktcdvd_device *pd)
  */
 static int pkt_set_segment_merging(struct pktcdvd_device *pd, struct request_queue *q)
 {
-       if ((pd->settings.size << 9) / CD_FRAMESIZE
-           <= queue_max_segments(q)) {
+       struct device *ddev = disk_to_dev(pd->disk);
+
+       if ((pd->settings.size << 9) / CD_FRAMESIZE <= queue_max_segments(q)) {
                /*
                 * The cdrom device can handle one segment/frame
                 */
                clear_bit(PACKET_MERGE_SEGS, &pd->flags);
                return 0;
-       } else if ((pd->settings.size << 9) / PAGE_SIZE
-                  <= queue_max_segments(q)) {
+       }
+
+       if ((pd->settings.size << 9) / PAGE_SIZE <= queue_max_segments(q)) {
                /*
                 * We can handle this case at the expense of some extra memory
                 * copies during write operations
                 */
                set_bit(PACKET_MERGE_SEGS, &pd->flags);
                return 0;
-       } else {
-               pkt_err(pd, "cdrom max_phys_segments too small\n");
-               return -EIO;
        }
+
+       dev_err(ddev, "cdrom max_phys_segments too small\n");
+       return -EIO;
 }
 
 static void pkt_end_io_read(struct bio *bio)
@@ -919,9 +969,8 @@ static void pkt_end_io_read(struct bio *bio)
        struct pktcdvd_device *pd = pkt->pd;
        BUG_ON(!pd);
 
-       pkt_dbg(2, pd, "bio=%p sec0=%llx sec=%llx err=%d\n",
-               bio, (unsigned long long)pkt->sector,
-               (unsigned long long)bio->bi_iter.bi_sector, bio->bi_status);
+       dev_dbg(disk_to_dev(pd->disk), "bio=%p sec0=%llx sec=%llx err=%d\n",
+               bio, pkt->sector, bio->bi_iter.bi_sector, bio->bi_status);
 
        if (bio->bi_status)
                atomic_inc(&pkt->io_errors);
@@ -939,7 +988,7 @@ static void pkt_end_io_packet_write(struct bio *bio)
        struct pktcdvd_device *pd = pkt->pd;
        BUG_ON(!pd);
 
-       pkt_dbg(2, pd, "id=%d, err=%d\n", pkt->id, bio->bi_status);
+       dev_dbg(disk_to_dev(pd->disk), "id=%d, err=%d\n", pkt->id, bio->bi_status);
 
        pd->stats.pkt_ended++;
 
@@ -955,6 +1004,7 @@ static void pkt_end_io_packet_write(struct bio *bio)
  */
 static void pkt_gather_data(struct pktcdvd_device *pd, struct packet_data *pkt)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
        int frames_read = 0;
        struct bio *bio;
        int f;
@@ -983,8 +1033,7 @@ static void pkt_gather_data(struct pktcdvd_device *pd, struct packet_data *pkt)
        spin_unlock(&pkt->lock);
 
        if (pkt->cache_valid) {
-               pkt_dbg(2, pd, "zone %llx cached\n",
-                       (unsigned long long)pkt->sector);
+               dev_dbg(ddev, "zone %llx cached\n", pkt->sector);
                goto out_account;
        }
 
@@ -1005,8 +1054,8 @@ static void pkt_gather_data(struct pktcdvd_device *pd, struct packet_data *pkt)
 
                p = (f * CD_FRAMESIZE) / PAGE_SIZE;
                offset = (f * CD_FRAMESIZE) % PAGE_SIZE;
-               pkt_dbg(2, pd, "Adding frame %d, page:%p offs:%d\n",
-                       f, pkt->pages[p], offset);
+               dev_dbg(ddev, "Adding frame %d, page:%p offs:%d\n", f,
+                       pkt->pages[p], offset);
                if (!bio_add_page(bio, pkt->pages[p], CD_FRAMESIZE, offset))
                        BUG();
 
@@ -1016,8 +1065,7 @@ static void pkt_gather_data(struct pktcdvd_device *pd, struct packet_data *pkt)
        }
 
 out_account:
-       pkt_dbg(2, pd, "need %d frames for zone %llx\n",
-               frames_read, (unsigned long long)pkt->sector);
+       dev_dbg(ddev, "need %d frames for zone %llx\n", frames_read, pkt->sector);
        pd->stats.pkt_started++;
        pd->stats.secs_rg += frames_read * (CD_FRAMESIZE >> 9);
 }
@@ -1051,17 +1099,17 @@ static void pkt_put_packet_data(struct pktcdvd_device *pd, struct packet_data *p
        }
 }
 
-static inline void pkt_set_state(struct packet_data *pkt, enum packet_data_state state)
+static inline void pkt_set_state(struct device *ddev, struct packet_data *pkt,
+                                enum packet_data_state state)
 {
-#if PACKET_DEBUG > 1
        static const char *state_name[] = {
                "IDLE", "WAITING", "READ_WAIT", "WRITE_WAIT", "RECOVERY", "FINISHED"
        };
        enum packet_data_state old_state = pkt->state;
-       pkt_dbg(2, pd, "pkt %2d : s=%6llx %s -> %s\n",
-               pkt->id, (unsigned long long)pkt->sector,
-               state_name[old_state], state_name[state]);
-#endif
+
+       dev_dbg(ddev, "pkt %2d : s=%6llx %s -> %s\n",
+               pkt->id, pkt->sector, state_name[old_state], state_name[state]);
+
        pkt->state = state;
 }
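
With the conversion to dev_dbg(), the old compile-time PACKET_DEBUG guard around the state trace can go away: dev_dbg() already compiles down to nothing unless DEBUG is defined for the file or CONFIG_DYNAMIC_DEBUG enables the call site at run time, while still type-checking its arguments. A rough illustration of that gating (my_dbg is a hypothetical stand-in, not the real <linux/dev_printk.h> definition):

    #ifdef DEBUG
    #define my_dbg(dev, fmt, ...) \
            dev_printk(KERN_DEBUG, dev, fmt, ##__VA_ARGS__)
    #else
    /* Compiles away, but the format string and arguments are still checked. */
    #define my_dbg(dev, fmt, ...) \
            do { if (0) dev_printk(KERN_DEBUG, dev, fmt, ##__VA_ARGS__); } while (0)
    #endif
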
 
@@ -1071,6 +1119,7 @@ static inline void pkt_set_state(struct packet_data *pkt, enum packet_data_state
  */
 static int pkt_handle_queue(struct pktcdvd_device *pd)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
        struct packet_data *pkt, *p;
        struct bio *bio = NULL;
        sector_t zone = 0; /* Suppress gcc warning */
@@ -1080,7 +1129,7 @@ static int pkt_handle_queue(struct pktcdvd_device *pd)
        atomic_set(&pd->scan_queue, 0);
 
        if (list_empty(&pd->cdrw.pkt_free_list)) {
-               pkt_dbg(2, pd, "no pkt\n");
+               dev_dbg(ddev, "no pkt\n");
                return 0;
        }
 
@@ -1117,7 +1166,7 @@ try_next_bio:
        }
        spin_unlock(&pd->lock);
        if (!bio) {
-               pkt_dbg(2, pd, "no bio\n");
+               dev_dbg(ddev, "no bio\n");
                return 0;
        }
 
@@ -1133,12 +1182,13 @@ try_next_bio:
         * to this packet.
         */
        spin_lock(&pd->lock);
-       pkt_dbg(2, pd, "looking for zone %llx\n", (unsigned long long)zone);
+       dev_dbg(ddev, "looking for zone %llx\n", zone);
        while ((node = pkt_rbtree_find(pd, zone)) != NULL) {
+               sector_t tmp = get_zone(node->bio->bi_iter.bi_sector, pd);
+
                bio = node->bio;
-               pkt_dbg(2, pd, "found zone=%llx\n", (unsigned long long)
-                       get_zone(bio->bi_iter.bi_sector, pd));
-               if (get_zone(bio->bi_iter.bi_sector, pd) != zone)
+               dev_dbg(ddev, "found zone=%llx\n", tmp);
+               if (tmp != zone)
                        break;
                pkt_rbtree_erase(pd, node);
                spin_lock(&pkt->lock);
@@ -1157,7 +1207,7 @@ try_next_bio:
        spin_unlock(&pd->lock);
 
        pkt->sleep_time = max(PACKET_WAIT_TIME, 1);
-       pkt_set_state(pkt, PACKET_WAITING_STATE);
+       pkt_set_state(ddev, pkt, PACKET_WAITING_STATE);
        atomic_set(&pkt->run_sm, 1);
 
        spin_lock(&pd->cdrw.active_list_lock);
@@ -1209,6 +1259,7 @@ static void bio_list_copy_data(struct bio *dst, struct bio *src)
  */
 static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
        int f;
 
        bio_init(pkt->w_bio, pd->bdev, pkt->w_bio->bi_inline_vecs, pkt->frames,
@@ -1225,7 +1276,7 @@ static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)
                if (!bio_add_page(pkt->w_bio, page, CD_FRAMESIZE, offset))
                        BUG();
        }
-       pkt_dbg(2, pd, "vcnt=%d\n", pkt->w_bio->bi_vcnt);
+       dev_dbg(ddev, "vcnt=%d\n", pkt->w_bio->bi_vcnt);
 
        /*
         * Fill-in bvec with data from orig_bios.
@@ -1233,11 +1284,10 @@ static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)
        spin_lock(&pkt->lock);
        bio_list_copy_data(pkt->w_bio, pkt->orig_bios.head);
 
-       pkt_set_state(pkt, PACKET_WRITE_WAIT_STATE);
+       pkt_set_state(ddev, pkt, PACKET_WRITE_WAIT_STATE);
        spin_unlock(&pkt->lock);
 
-       pkt_dbg(2, pd, "Writing %d frames for zone %llx\n",
-               pkt->write_size, (unsigned long long)pkt->sector);
+       dev_dbg(ddev, "Writing %d frames for zone %llx\n", pkt->write_size, pkt->sector);
 
        if (test_bit(PACKET_MERGE_SEGS, &pd->flags) || (pkt->write_size < pkt->frames))
                pkt->cache_valid = 1;
@@ -1265,7 +1315,9 @@ static void pkt_finish_packet(struct packet_data *pkt, blk_status_t status)
 
 static void pkt_run_state_machine(struct pktcdvd_device *pd, struct packet_data *pkt)
 {
-       pkt_dbg(2, pd, "pkt %d\n", pkt->id);
+       struct device *ddev = disk_to_dev(pd->disk);
+
+       dev_dbg(ddev, "pkt %d\n", pkt->id);
 
        for (;;) {
                switch (pkt->state) {
@@ -1275,7 +1327,7 @@ static void pkt_run_state_machine(struct pktcdvd_device *pd, struct packet_data
 
                        pkt->sleep_time = 0;
                        pkt_gather_data(pd, pkt);
-                       pkt_set_state(pkt, PACKET_READ_WAIT_STATE);
+                       pkt_set_state(ddev, pkt, PACKET_READ_WAIT_STATE);
                        break;
 
                case PACKET_READ_WAIT_STATE:
@@ -1283,7 +1335,7 @@ static void pkt_run_state_machine(struct pktcdvd_device *pd, struct packet_data
                                return;
 
                        if (atomic_read(&pkt->io_errors) > 0) {
-                               pkt_set_state(pkt, PACKET_RECOVERY_STATE);
+                               pkt_set_state(ddev, pkt, PACKET_RECOVERY_STATE);
                        } else {
                                pkt_start_write(pd, pkt);
                        }
@@ -1294,15 +1346,15 @@ static void pkt_run_state_machine(struct pktcdvd_device *pd, struct packet_data
                                return;
 
                        if (!pkt->w_bio->bi_status) {
-                               pkt_set_state(pkt, PACKET_FINISHED_STATE);
+                               pkt_set_state(ddev, pkt, PACKET_FINISHED_STATE);
                        } else {
-                               pkt_set_state(pkt, PACKET_RECOVERY_STATE);
+                               pkt_set_state(ddev, pkt, PACKET_RECOVERY_STATE);
                        }
                        break;
 
                case PACKET_RECOVERY_STATE:
-                       pkt_dbg(2, pd, "No recovery possible\n");
-                       pkt_set_state(pkt, PACKET_FINISHED_STATE);
+                       dev_dbg(ddev, "No recovery possible\n");
+                       pkt_set_state(ddev, pkt, PACKET_FINISHED_STATE);
                        break;
 
                case PACKET_FINISHED_STATE:
@@ -1318,6 +1370,7 @@ static void pkt_run_state_machine(struct pktcdvd_device *pd, struct packet_data
 
 static void pkt_handle_packets(struct pktcdvd_device *pd)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
        struct packet_data *pkt, *next;
 
        /*
@@ -1338,28 +1391,13 @@ static void pkt_handle_packets(struct pktcdvd_device *pd)
                if (pkt->state == PACKET_FINISHED_STATE) {
                        list_del(&pkt->list);
                        pkt_put_packet_data(pd, pkt);
-                       pkt_set_state(pkt, PACKET_IDLE_STATE);
+                       pkt_set_state(ddev, pkt, PACKET_IDLE_STATE);
                        atomic_set(&pd->scan_queue, 1);
                }
        }
        spin_unlock(&pd->cdrw.active_list_lock);
 }
 
-static void pkt_count_states(struct pktcdvd_device *pd, int *states)
-{
-       struct packet_data *pkt;
-       int i;
-
-       for (i = 0; i < PACKET_NUM_STATES; i++)
-               states[i] = 0;
-
-       spin_lock(&pd->cdrw.active_list_lock);
-       list_for_each_entry(pkt, &pd->cdrw.pkt_active_list, list) {
-               states[pkt->state]++;
-       }
-       spin_unlock(&pd->cdrw.active_list_lock);
-}
-
 /*
  * kcdrwd is woken up when writes have been queued for one of our
  * registered devices
@@ -1367,7 +1405,9 @@ static void pkt_count_states(struct pktcdvd_device *pd, int *states)
 static int kcdrwd(void *foobar)
 {
        struct pktcdvd_device *pd = foobar;
+       struct device *ddev = disk_to_dev(pd->disk);
        struct packet_data *pkt;
+       int states[PACKET_NUM_STATES];
        long min_sleep_time, residue;
 
        set_user_nice(current, MIN_NICE);
@@ -1398,13 +1438,9 @@ static int kcdrwd(void *foobar)
                                goto work_to_do;
 
                        /* Otherwise, go to sleep */
-                       if (PACKET_DEBUG > 1) {
-                               int states[PACKET_NUM_STATES];
-                               pkt_count_states(pd, states);
-                               pkt_dbg(2, pd, "i:%d ow:%d rw:%d ww:%d rec:%d fin:%d\n",
-                                       states[0], states[1], states[2],
-                                       states[3], states[4], states[5]);
-                       }
+                       pkt_count_states(pd, states);
+                       dev_dbg(ddev, "i:%d ow:%d rw:%d ww:%d rec:%d fin:%d\n",
+                               states[0], states[1], states[2], states[3], states[4], states[5]);
 
                        min_sleep_time = MAX_SCHEDULE_TIMEOUT;
                        list_for_each_entry(pkt, &pd->cdrw.pkt_active_list, list) {
@@ -1412,9 +1448,9 @@ static int kcdrwd(void *foobar)
                                        min_sleep_time = pkt->sleep_time;
                        }
 
-                       pkt_dbg(2, pd, "sleeping\n");
+                       dev_dbg(ddev, "sleeping\n");
                        residue = schedule_timeout(min_sleep_time);
-                       pkt_dbg(2, pd, "wake up\n");
+                       dev_dbg(ddev, "wake up\n");
 
                        /* make swsusp happy with our thread */
                        try_to_freeze();
@@ -1462,7 +1498,7 @@ work_to_do:
 
 static void pkt_print_settings(struct pktcdvd_device *pd)
 {
-       pkt_info(pd, "%s packets, %u blocks, Mode-%c disc\n",
+       dev_info(disk_to_dev(pd->disk), "%s packets, %u blocks, Mode-%c disc\n",
                 pd->settings.fp ? "Fixed" : "Variable",
                 pd->settings.size >> 2,
                 pd->settings.block_mode == 8 ? '1' : '2');
@@ -1474,8 +1510,7 @@ static int pkt_mode_sense(struct pktcdvd_device *pd, struct packet_command *cgc,
 
        cgc->cmd[0] = GPCMD_MODE_SENSE_10;
        cgc->cmd[2] = page_code | (page_control << 6);
-       cgc->cmd[7] = cgc->buflen >> 8;
-       cgc->cmd[8] = cgc->buflen & 0xff;
+       put_unaligned_be16(cgc->buflen, &cgc->cmd[7]);
        cgc->data_direction = CGC_DATA_READ;
        return pkt_generic_packet(pd, cgc);
 }
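
This hunk, like the pkt_set_speed() change above, replaces open-coded big-endian byte shifts with put_unaligned_be16() from <asm/unaligned.h>, which stores a u16 most-significant byte first at a possibly unaligned address. A small stand-alone sketch of the equivalence (set_cdb_be16_len is a hypothetical helper, not driver code):

    #include <asm/unaligned.h>

    /* Hypothetical helper: store a 16-bit length big-endian into a CDB. */
    static void set_cdb_be16_len(u8 *cmd, u16 buflen)
    {
            /* Same effect as the removed open-coded form:
             *   cmd[7] = buflen >> 8;
             *   cmd[8] = buflen & 0xff;
             */
            put_unaligned_be16(buflen, &cmd[7]);
    }
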
@@ -1486,8 +1521,7 @@ static int pkt_mode_select(struct pktcdvd_device *pd, struct packet_command *cgc
        memset(cgc->buffer, 0, 2);
        cgc->cmd[0] = GPCMD_MODE_SELECT_10;
        cgc->cmd[1] = 0x10;             /* PF */
-       cgc->cmd[7] = cgc->buflen >> 8;
-       cgc->cmd[8] = cgc->buflen & 0xff;
+       put_unaligned_be16(cgc->buflen, &cgc->cmd[7]);
        cgc->data_direction = CGC_DATA_WRITE;
        return pkt_generic_packet(pd, cgc);
 }
@@ -1528,8 +1562,7 @@ static int pkt_get_track_info(struct pktcdvd_device *pd, __u16 track, __u8 type,
        init_cdrom_command(&cgc, ti, 8, CGC_DATA_READ);
        cgc.cmd[0] = GPCMD_READ_TRACK_RZONE_INFO;
        cgc.cmd[1] = type & 3;
-       cgc.cmd[4] = (track & 0xff00) >> 8;
-       cgc.cmd[5] = track & 0xff;
+       put_unaligned_be16(track, &cgc.cmd[4]);
        cgc.cmd[8] = 8;
        cgc.quiet = 1;
 
@@ -1590,6 +1623,7 @@ static noinline_for_stack int pkt_get_last_written(struct pktcdvd_device *pd,
  */
 static noinline_for_stack int pkt_set_write_settings(struct pktcdvd_device *pd)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
        struct packet_command cgc;
        struct scsi_sense_hdr sshdr;
        write_param_page *wp;
@@ -1609,8 +1643,8 @@ static noinline_for_stack int pkt_set_write_settings(struct pktcdvd_device *pd)
                return ret;
        }
 
-       size = 2 + ((buffer[0] << 8) | (buffer[1] & 0xff));
-       pd->mode_offset = (buffer[6] << 8) | (buffer[7] & 0xff);
+       size = 2 + get_unaligned_be16(&buffer[0]);
+       pd->mode_offset = get_unaligned_be16(&buffer[6]);
        if (size > sizeof(buffer))
                size = sizeof(buffer);
 
@@ -1656,7 +1690,7 @@ static noinline_for_stack int pkt_set_write_settings(struct pktcdvd_device *pd)
                /*
                 * paranoia
                 */
-               pkt_err(pd, "write mode wrong %d\n", wp->data_block_type);
+               dev_err(ddev, "write mode wrong %d\n", wp->data_block_type);
                return 1;
        }
        wp->packet_size = cpu_to_be32(pd->settings.size >> 2);
@@ -1677,6 +1711,8 @@ static noinline_for_stack int pkt_set_write_settings(struct pktcdvd_device *pd)
  */
 static int pkt_writable_track(struct pktcdvd_device *pd, track_information *ti)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
+
        switch (pd->mmc3_profile) {
                case 0x1a: /* DVD+RW */
                case 0x12: /* DVD-RAM */
@@ -1701,7 +1737,7 @@ static int pkt_writable_track(struct pktcdvd_device *pd, track_information *ti)
        if (ti->rt == 1 && ti->blank == 0)
                return 1;
 
-       pkt_err(pd, "bad state %d-%d-%d\n", ti->rt, ti->blank, ti->packet);
+       dev_err(ddev, "bad state %d-%d-%d\n", ti->rt, ti->blank, ti->packet);
        return 0;
 }
 
@@ -1710,6 +1746,8 @@ static int pkt_writable_track(struct pktcdvd_device *pd, track_information *ti)
  */
 static int pkt_writable_disc(struct pktcdvd_device *pd, disc_information *di)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
+
        switch (pd->mmc3_profile) {
                case 0x0a: /* CD-RW */
                case 0xffff: /* MMC3 not supported */
@@ -1719,8 +1757,7 @@ static int pkt_writable_disc(struct pktcdvd_device *pd, disc_information *di)
                case 0x12: /* DVD-RAM */
                        return 1;
                default:
-                       pkt_dbg(2, pd, "Wrong disc profile (%x)\n",
-                               pd->mmc3_profile);
+                       dev_dbg(ddev, "Wrong disc profile (%x)\n", pd->mmc3_profile);
                        return 0;
        }
 
@@ -1729,22 +1766,22 @@ static int pkt_writable_disc(struct pktcdvd_device *pd, disc_information *di)
         * but i'm not sure, should we leave this to user apps? probably.
         */
        if (di->disc_type == 0xff) {
-               pkt_notice(pd, "unknown disc - no track?\n");
+               dev_notice(ddev, "unknown disc - no track?\n");
                return 0;
        }
 
        if (di->disc_type != 0x20 && di->disc_type != 0) {
-               pkt_err(pd, "wrong disc type (%x)\n", di->disc_type);
+               dev_err(ddev, "wrong disc type (%x)\n", di->disc_type);
                return 0;
        }
 
        if (di->erasable == 0) {
-               pkt_notice(pd, "disc not erasable\n");
+               dev_err(ddev, "disc not erasable\n");
                return 0;
        }
 
        if (di->border_status == PACKET_SESSION_RESERVED) {
-               pkt_err(pd, "can't write to last track (reserved)\n");
+               dev_err(ddev, "can't write to last track (reserved)\n");
                return 0;
        }
 
@@ -1753,6 +1790,7 @@ static int pkt_writable_disc(struct pktcdvd_device *pd, disc_information *di)
 
 static noinline_for_stack int pkt_probe_settings(struct pktcdvd_device *pd)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
        struct packet_command cgc;
        unsigned char buf[12];
        disc_information di;
@@ -1763,14 +1801,14 @@ static noinline_for_stack int pkt_probe_settings(struct pktcdvd_device *pd)
        cgc.cmd[0] = GPCMD_GET_CONFIGURATION;
        cgc.cmd[8] = 8;
        ret = pkt_generic_packet(pd, &cgc);
-       pd->mmc3_profile = ret ? 0xffff : buf[6] << 8 | buf[7];
+       pd->mmc3_profile = ret ? 0xffff : get_unaligned_be16(&buf[6]);
 
        memset(&di, 0, sizeof(disc_information));
        memset(&ti, 0, sizeof(track_information));
 
        ret = pkt_get_disc_info(pd, &di);
        if (ret) {
-               pkt_err(pd, "failed get_disc\n");
+               dev_err(ddev, "failed get_disc\n");
                return ret;
        }
 
@@ -1782,12 +1820,12 @@ static noinline_for_stack int pkt_probe_settings(struct pktcdvd_device *pd)
        track = 1; /* (di.last_track_msb << 8) | di.last_track_lsb; */
        ret = pkt_get_track_info(pd, track, 1, &ti);
        if (ret) {
-               pkt_err(pd, "failed get_track\n");
+               dev_err(ddev, "failed get_track\n");
                return ret;
        }
 
        if (!pkt_writable_track(pd, &ti)) {
-               pkt_err(pd, "can't write to this track\n");
+               dev_err(ddev, "can't write to this track\n");
                return -EROFS;
        }
 
@@ -1797,11 +1835,11 @@ static noinline_for_stack int pkt_probe_settings(struct pktcdvd_device *pd)
         */
        pd->settings.size = be32_to_cpu(ti.fixed_packet_size) << 2;
        if (pd->settings.size == 0) {
-               pkt_notice(pd, "detected zero packet size!\n");
+               dev_notice(ddev, "detected zero packet size!\n");
                return -ENXIO;
        }
        if (pd->settings.size > PACKET_MAX_SECTORS) {
-               pkt_err(pd, "packet size is too big\n");
+               dev_err(ddev, "packet size is too big\n");
                return -EROFS;
        }
        pd->settings.fp = ti.fp;
@@ -1843,7 +1881,7 @@ static noinline_for_stack int pkt_probe_settings(struct pktcdvd_device *pd)
                        pd->settings.block_mode = PACKET_BLOCK_MODE2;
                        break;
                default:
-                       pkt_err(pd, "unknown data mode\n");
+                       dev_err(ddev, "unknown data mode\n");
                        return -EROFS;
        }
        return 0;
@@ -1854,6 +1892,7 @@ static noinline_for_stack int pkt_probe_settings(struct pktcdvd_device *pd)
  */
 static noinline_for_stack int pkt_write_caching(struct pktcdvd_device *pd)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
        struct packet_command cgc;
        struct scsi_sense_hdr sshdr;
        unsigned char buf[64];
@@ -1880,13 +1919,13 @@ static noinline_for_stack int pkt_write_caching(struct pktcdvd_device *pd)
         */
        buf[pd->mode_offset + 10] |= (set << 2);
 
-       cgc.buflen = cgc.cmd[8] = 2 + ((buf[0] << 8) | (buf[1] & 0xff));
+       cgc.buflen = cgc.cmd[8] = 2 + get_unaligned_be16(&buf[0]);
        ret = pkt_mode_select(pd, &cgc);
        if (ret) {
-               pkt_err(pd, "write caching control failed\n");
+               dev_err(ddev, "write caching control failed\n");
                pkt_dump_sense(pd, &cgc);
        } else if (!ret && set)
-               pkt_notice(pd, "enabled write caching\n");
+               dev_notice(ddev, "enabled write caching\n");
        return ret;
 }
 
@@ -1935,12 +1974,12 @@ static noinline_for_stack int pkt_get_max_speed(struct pktcdvd_device *pd,
                 * Speed Performance Descriptor Block", use the information
                 * in the first block. (contains the highest speed)
                 */
-               int num_spdb = (cap_buf[30] << 8) + cap_buf[31];
+               int num_spdb = get_unaligned_be16(&cap_buf[30]);
                if (num_spdb > 0)
                        offset = 34;
        }
 
-       *write_speed = (cap_buf[offset] << 8) | cap_buf[offset + 1];
+       *write_speed = get_unaligned_be16(&cap_buf[offset]);
        return 0;
 }
 
@@ -1967,6 +2006,7 @@ static char us_clv_to_speed[16] = {
 static noinline_for_stack int pkt_media_speed(struct pktcdvd_device *pd,
                                                unsigned *speed)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
        struct packet_command cgc;
        struct scsi_sense_hdr sshdr;
        unsigned char buf[64];
@@ -1984,7 +2024,7 @@ static noinline_for_stack int pkt_media_speed(struct pktcdvd_device *pd,
                pkt_dump_sense(pd, &cgc);
                return ret;
        }
-       size = ((unsigned int) buf[0]<<8) + buf[1] + 2;
+       size = 2 + get_unaligned_be16(&buf[0]);
        if (size > sizeof(buf))
                size = sizeof(buf);
 
@@ -2001,11 +2041,11 @@ static noinline_for_stack int pkt_media_speed(struct pktcdvd_device *pd,
        }
 
        if (!(buf[6] & 0x40)) {
-               pkt_notice(pd, "disc type is not CD-RW\n");
+               dev_notice(ddev, "disc type is not CD-RW\n");
                return 1;
        }
        if (!(buf[6] & 0x4)) {
-               pkt_notice(pd, "A1 values on media are not valid, maybe not CDRW?\n");
+               dev_notice(ddev, "A1 values on media are not valid, maybe not CDRW?\n");
                return 1;
        }
 
@@ -2025,25 +2065,26 @@ static noinline_for_stack int pkt_media_speed(struct pktcdvd_device *pd,
                        *speed = us_clv_to_speed[sp];
                        break;
                default:
-                       pkt_notice(pd, "unknown disc sub-type %d\n", st);
+                       dev_notice(ddev, "unknown disc sub-type %d\n", st);
                        return 1;
        }
        if (*speed) {
-               pkt_info(pd, "maximum media speed: %d\n", *speed);
+               dev_info(ddev, "maximum media speed: %d\n", *speed);
                return 0;
        } else {
-               pkt_notice(pd, "unknown speed %d for sub-type %d\n", sp, st);
+               dev_notice(ddev, "unknown speed %d for sub-type %d\n", sp, st);
                return 1;
        }
 }
 
 static noinline_for_stack int pkt_perform_opc(struct pktcdvd_device *pd)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
        struct packet_command cgc;
        struct scsi_sense_hdr sshdr;
        int ret;
 
-       pkt_dbg(2, pd, "Performing OPC\n");
+       dev_dbg(ddev, "Performing OPC\n");
 
        init_cdrom_command(&cgc, NULL, 0, CGC_DATA_NONE);
        cgc.sshdr = &sshdr;
@@ -2058,18 +2099,19 @@ static noinline_for_stack int pkt_perform_opc(struct pktcdvd_device *pd)
 
 static int pkt_open_write(struct pktcdvd_device *pd)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
        int ret;
        unsigned int write_speed, media_write_speed, read_speed;
 
        ret = pkt_probe_settings(pd);
        if (ret) {
-               pkt_dbg(2, pd, "failed probe\n");
+               dev_dbg(ddev, "failed probe\n");
                return ret;
        }
 
        ret = pkt_set_write_settings(pd);
        if (ret) {
-               pkt_dbg(1, pd, "failed saving write settings\n");
+               dev_notice(ddev, "failed saving write settings\n");
                return -EIO;
        }
 
@@ -2082,30 +2124,29 @@ static int pkt_open_write(struct pktcdvd_device *pd)
                case 0x13: /* DVD-RW */
                case 0x1a: /* DVD+RW */
                case 0x12: /* DVD-RAM */
-                       pkt_dbg(1, pd, "write speed %ukB/s\n", write_speed);
+                       dev_notice(ddev, "write speed %ukB/s\n", write_speed);
                        break;
                default:
                        ret = pkt_media_speed(pd, &media_write_speed);
                        if (ret)
                                media_write_speed = 16;
                        write_speed = min(write_speed, media_write_speed * 177);
-                       pkt_dbg(1, pd, "write speed %ux\n", write_speed / 176);
+                       dev_notice(ddev, "write speed %ux\n", write_speed / 176);
                        break;
        }
        read_speed = write_speed;
 
        ret = pkt_set_speed(pd, write_speed, read_speed);
        if (ret) {
-               pkt_dbg(1, pd, "couldn't set write speed\n");
+               dev_notice(ddev, "couldn't set write speed\n");
                return -EIO;
        }
        pd->write_speed = write_speed;
        pd->read_speed = read_speed;
 
        ret = pkt_perform_opc(pd);
-       if (ret) {
-               pkt_dbg(1, pd, "Optimum Power Calibration failed\n");
-       }
+       if (ret)
+               dev_notice(ddev, "Optimum Power Calibration failed\n");
 
        return 0;
 }
@@ -2113,8 +2154,9 @@ static int pkt_open_write(struct pktcdvd_device *pd)
 /*
  * called at open time.
  */
-static int pkt_open_dev(struct pktcdvd_device *pd, fmode_t write)
+static int pkt_open_dev(struct pktcdvd_device *pd, bool write)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
        int ret;
        long lba;
        struct request_queue *q;
@@ -2125,7 +2167,7 @@ static int pkt_open_dev(struct pktcdvd_device *pd, fmode_t write)
         * to read/write from/to it. It is already opened in O_NONBLOCK mode
         * so open should not fail.
         */
-       bdev = blkdev_get_by_dev(pd->bdev->bd_dev, FMODE_READ | FMODE_EXCL, pd);
+       bdev = blkdev_get_by_dev(pd->bdev->bd_dev, BLK_OPEN_READ, pd, NULL);
        if (IS_ERR(bdev)) {
                ret = PTR_ERR(bdev);
                goto out;
@@ -2133,7 +2175,7 @@ static int pkt_open_dev(struct pktcdvd_device *pd, fmode_t write)
 
        ret = pkt_get_last_written(pd, &lba);
        if (ret) {
-               pkt_err(pd, "pkt_get_last_written failed\n");
+               dev_err(ddev, "pkt_get_last_written failed\n");
                goto out_putdev;
        }
 
@@ -2162,17 +2204,17 @@ static int pkt_open_dev(struct pktcdvd_device *pd, fmode_t write)
 
        if (write) {
                if (!pkt_grow_pktlist(pd, CONFIG_CDROM_PKTCDVD_BUFFERS)) {
-                       pkt_err(pd, "not enough memory for buffers\n");
+                       dev_err(ddev, "not enough memory for buffers\n");
                        ret = -ENOMEM;
                        goto out_putdev;
                }
-               pkt_info(pd, "%lukB available on disc\n", lba << 1);
+               dev_info(ddev, "%lukB available on disc\n", lba << 1);
        }
 
        return 0;
 
 out_putdev:
-       blkdev_put(bdev, FMODE_READ | FMODE_EXCL);
+       blkdev_put(bdev, pd);
 out:
        return ret;
 }
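
The open/close calls above follow the reworked 6.5 block-layer API: blkdev_get_by_dev() now takes a blk_mode_t plus a holder cookie and an optional holder-ops pointer, and blkdev_put() must be passed that same holder instead of an fmode_t. A minimal sketch of the pairing (open_cdrom_ro/close_cdrom are hypothetical names):

    #include <linux/blkdev.h>

    static struct block_device *open_cdrom_ro(dev_t dev, void *holder)
    {
            /* BLK_OPEN_READ replaces FMODE_READ; no blk_holder_ops needed here. */
            return blkdev_get_by_dev(dev, BLK_OPEN_READ, holder, NULL);
    }

    static void close_cdrom(struct block_device *bdev, void *holder)
    {
            /* The holder passed at open time identifies the owner on release. */
            blkdev_put(bdev, holder);
    }
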
@@ -2183,13 +2225,15 @@ out:
  */
 static void pkt_release_dev(struct pktcdvd_device *pd, int flush)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
+
        if (flush && pkt_flush_cache(pd))
-               pkt_dbg(1, pd, "not flushing cache\n");
+               dev_notice(ddev, "not flushing cache\n");
 
        pkt_lock_door(pd, 0);
 
        pkt_set_speed(pd, MAX_SPEED, MAX_SPEED);
-       blkdev_put(pd->bdev, FMODE_READ | FMODE_EXCL);
+       blkdev_put(pd->bdev, pd);
 
        pkt_shrink_pktlist(pd);
 }
@@ -2203,14 +2247,14 @@ static struct pktcdvd_device *pkt_find_dev_from_minor(unsigned int dev_minor)
        return pkt_devs[dev_minor];
 }
 
-static int pkt_open(struct block_device *bdev, fmode_t mode)
+static int pkt_open(struct gendisk *disk, blk_mode_t mode)
 {
        struct pktcdvd_device *pd = NULL;
        int ret;
 
        mutex_lock(&pktcdvd_mutex);
        mutex_lock(&ctl_mutex);
-       pd = pkt_find_dev_from_minor(MINOR(bdev->bd_dev));
+       pd = pkt_find_dev_from_minor(disk->first_minor);
        if (!pd) {
                ret = -ENODEV;
                goto out;
@@ -2219,22 +2263,21 @@ static int pkt_open(struct block_device *bdev, fmode_t mode)
 
        pd->refcnt++;
        if (pd->refcnt > 1) {
-               if ((mode & FMODE_WRITE) &&
+               if ((mode & BLK_OPEN_WRITE) &&
                    !test_bit(PACKET_WRITABLE, &pd->flags)) {
                        ret = -EBUSY;
                        goto out_dec;
                }
        } else {
-               ret = pkt_open_dev(pd, mode & FMODE_WRITE);
+               ret = pkt_open_dev(pd, mode & BLK_OPEN_WRITE);
                if (ret)
                        goto out_dec;
                /*
                 * needed here as well, since ext2 (among others) may change
                 * the blocksize at mount time
                 */
-               set_blocksize(bdev, CD_FRAMESIZE);
+               set_blocksize(disk->part0, CD_FRAMESIZE);
        }
-
        mutex_unlock(&ctl_mutex);
        mutex_unlock(&pktcdvd_mutex);
        return 0;
@@ -2247,7 +2290,7 @@ out:
        return ret;
 }
 
-static void pkt_close(struct gendisk *disk, fmode_t mode)
+static void pkt_release(struct gendisk *disk)
 {
        struct pktcdvd_device *pd = disk->private_data;
 
@@ -2385,15 +2428,15 @@ static void pkt_make_request_write(struct request_queue *q, struct bio *bio)
 static void pkt_submit_bio(struct bio *bio)
 {
        struct pktcdvd_device *pd = bio->bi_bdev->bd_disk->queue->queuedata;
+       struct device *ddev = disk_to_dev(pd->disk);
        struct bio *split;
 
        bio = bio_split_to_limits(bio);
        if (!bio)
                return;
 
-       pkt_dbg(2, pd, "start = %6llx stop = %6llx\n",
-               (unsigned long long)bio->bi_iter.bi_sector,
-               (unsigned long long)bio_end_sector(bio));
+       dev_dbg(ddev, "start = %6llx stop = %6llx\n",
+               bio->bi_iter.bi_sector, bio_end_sector(bio));
 
        /*
         * Clone READ bios so we can have our own bi_end_io callback.
@@ -2404,13 +2447,12 @@ static void pkt_submit_bio(struct bio *bio)
        }
 
        if (!test_bit(PACKET_WRITABLE, &pd->flags)) {
-               pkt_notice(pd, "WRITE for ro device (%llu)\n",
-                          (unsigned long long)bio->bi_iter.bi_sector);
+               dev_notice(ddev, "WRITE for ro device (%llu)\n", bio->bi_iter.bi_sector);
                goto end_io;
        }
 
        if (!bio->bi_iter.bi_size || (bio->bi_iter.bi_size % CD_FRAMESIZE)) {
-               pkt_err(pd, "wrong bio size\n");
+               dev_err(ddev, "wrong bio size\n");
                goto end_io;
        }
 
@@ -2446,74 +2488,15 @@ static void pkt_init_queue(struct pktcdvd_device *pd)
        q->queuedata = pd;
 }
 
-static int pkt_seq_show(struct seq_file *m, void *p)
-{
-       struct pktcdvd_device *pd = m->private;
-       char *msg;
-       int states[PACKET_NUM_STATES];
-
-       seq_printf(m, "Writer %s mapped to %pg:\n", pd->name, pd->bdev);
-
-       seq_printf(m, "\nSettings:\n");
-       seq_printf(m, "\tpacket size:\t\t%dkB\n", pd->settings.size / 2);
-
-       if (pd->settings.write_type == 0)
-               msg = "Packet";
-       else
-               msg = "Unknown";
-       seq_printf(m, "\twrite type:\t\t%s\n", msg);
-
-       seq_printf(m, "\tpacket type:\t\t%s\n", pd->settings.fp ? "Fixed" : "Variable");
-       seq_printf(m, "\tlink loss:\t\t%d\n", pd->settings.link_loss);
-
-       seq_printf(m, "\ttrack mode:\t\t%d\n", pd->settings.track_mode);
-
-       if (pd->settings.block_mode == PACKET_BLOCK_MODE1)
-               msg = "Mode 1";
-       else if (pd->settings.block_mode == PACKET_BLOCK_MODE2)
-               msg = "Mode 2";
-       else
-               msg = "Unknown";
-       seq_printf(m, "\tblock mode:\t\t%s\n", msg);
-
-       seq_printf(m, "\nStatistics:\n");
-       seq_printf(m, "\tpackets started:\t%lu\n", pd->stats.pkt_started);
-       seq_printf(m, "\tpackets ended:\t\t%lu\n", pd->stats.pkt_ended);
-       seq_printf(m, "\twritten:\t\t%lukB\n", pd->stats.secs_w >> 1);
-       seq_printf(m, "\tread gather:\t\t%lukB\n", pd->stats.secs_rg >> 1);
-       seq_printf(m, "\tread:\t\t\t%lukB\n", pd->stats.secs_r >> 1);
-
-       seq_printf(m, "\nMisc:\n");
-       seq_printf(m, "\treference count:\t%d\n", pd->refcnt);
-       seq_printf(m, "\tflags:\t\t\t0x%lx\n", pd->flags);
-       seq_printf(m, "\tread speed:\t\t%ukB/s\n", pd->read_speed);
-       seq_printf(m, "\twrite speed:\t\t%ukB/s\n", pd->write_speed);
-       seq_printf(m, "\tstart offset:\t\t%lu\n", pd->offset);
-       seq_printf(m, "\tmode page offset:\t%u\n", pd->mode_offset);
-
-       seq_printf(m, "\nQueue state:\n");
-       seq_printf(m, "\tbios queued:\t\t%d\n", pd->bio_queue_size);
-       seq_printf(m, "\tbios pending:\t\t%d\n", atomic_read(&pd->cdrw.pending_bios));
-       seq_printf(m, "\tcurrent sector:\t\t0x%llx\n", (unsigned long long)pd->current_sector);
-
-       pkt_count_states(pd, states);
-       seq_printf(m, "\tstate:\t\t\ti:%d ow:%d rw:%d ww:%d rec:%d fin:%d\n",
-                  states[0], states[1], states[2], states[3], states[4], states[5]);
-
-       seq_printf(m, "\twrite congestion marks:\toff=%d on=%d\n",
-                       pd->write_congestion_off,
-                       pd->write_congestion_on);
-       return 0;
-}
-
 static int pkt_new_dev(struct pktcdvd_device *pd, dev_t dev)
 {
+       struct device *ddev = disk_to_dev(pd->disk);
        int i;
        struct block_device *bdev;
        struct scsi_device *sdev;
 
        if (pd->pkt_dev == dev) {
-               pkt_err(pd, "recursive setup not allowed\n");
+               dev_err(ddev, "recursive setup not allowed\n");
                return -EBUSY;
        }
        for (i = 0; i < MAX_WRITERS; i++) {
@@ -2521,21 +2504,22 @@ static int pkt_new_dev(struct pktcdvd_device *pd, dev_t dev)
                if (!pd2)
                        continue;
                if (pd2->bdev->bd_dev == dev) {
-                       pkt_err(pd, "%pg already setup\n", pd2->bdev);
+                       dev_err(ddev, "%pg already setup\n", pd2->bdev);
                        return -EBUSY;
                }
                if (pd2->pkt_dev == dev) {
-                       pkt_err(pd, "can't chain pktcdvd devices\n");
+                       dev_err(ddev, "can't chain pktcdvd devices\n");
                        return -EBUSY;
                }
        }
 
-       bdev = blkdev_get_by_dev(dev, FMODE_READ | FMODE_NDELAY, NULL);
+       bdev = blkdev_get_by_dev(dev, BLK_OPEN_READ | BLK_OPEN_NDELAY, NULL,
+                                NULL);
        if (IS_ERR(bdev))
                return PTR_ERR(bdev);
        sdev = scsi_device_from_queue(bdev->bd_disk->queue);
        if (!sdev) {
-               blkdev_put(bdev, FMODE_READ | FMODE_NDELAY);
+               blkdev_put(bdev, NULL);
                return -EINVAL;
        }
        put_device(&sdev->sdev_gendev);
@@ -2549,30 +2533,31 @@ static int pkt_new_dev(struct pktcdvd_device *pd, dev_t dev)
        pkt_init_queue(pd);
 
        atomic_set(&pd->cdrw.pending_bios, 0);
-       pd->cdrw.thread = kthread_run(kcdrwd, pd, "%s", pd->name);
+       pd->cdrw.thread = kthread_run(kcdrwd, pd, "%s", pd->disk->disk_name);
        if (IS_ERR(pd->cdrw.thread)) {
-               pkt_err(pd, "can't start kernel thread\n");
+               dev_err(ddev, "can't start kernel thread\n");
                goto out_mem;
        }
 
-       proc_create_single_data(pd->name, 0, pkt_proc, pkt_seq_show, pd);
-       pkt_dbg(1, pd, "writer mapped to %pg\n", bdev);
+       proc_create_single_data(pd->disk->disk_name, 0, pkt_proc, pkt_seq_show, pd);
+       dev_notice(ddev, "writer mapped to %pg\n", bdev);
        return 0;
 
 out_mem:
-       blkdev_put(bdev, FMODE_READ | FMODE_NDELAY);
+       blkdev_put(bdev, NULL);
        /* This is safe: open() is still holding a reference. */
        module_put(THIS_MODULE);
        return -ENOMEM;
 }
 
-static int pkt_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd, unsigned long arg)
+static int pkt_ioctl(struct block_device *bdev, blk_mode_t mode,
+               unsigned int cmd, unsigned long arg)
 {
        struct pktcdvd_device *pd = bdev->bd_disk->private_data;
+       struct device *ddev = disk_to_dev(pd->disk);
        int ret;
 
-       pkt_dbg(2, pd, "cmd %x, dev %d:%d\n",
-               cmd, MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev));
+       dev_dbg(ddev, "cmd %x, dev %d:%d\n", cmd, MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev));
 
        mutex_lock(&pktcdvd_mutex);
        switch (cmd) {
@@ -2598,7 +2583,7 @@ static int pkt_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
                        ret = bdev->bd_disk->fops->ioctl(bdev, mode, cmd, arg);
                break;
        default:
-               pkt_dbg(2, pd, "Unknown ioctl (%x)\n", cmd);
+               dev_dbg(ddev, "Unknown ioctl (%x)\n", cmd);
                ret = -ENOTTY;
        }
        mutex_unlock(&pktcdvd_mutex);
@@ -2631,7 +2616,7 @@ static const struct block_device_operations pktcdvd_ops = {
        .owner =                THIS_MODULE,
        .submit_bio =           pkt_submit_bio,
        .open =                 pkt_open,
-       .release =              pkt_close,
+       .release =              pkt_release,
        .ioctl =                pkt_ioctl,
        .compat_ioctl =         blkdev_compat_ptr_ioctl,
        .check_events =         pkt_check_events,
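
The ops table reflects the other 6.5 interface change visible in this series: block_device_operations::open now receives the gendisk and a blk_mode_t, and ::release lost its mode argument. A minimal, self-contained sketch of the new callback shapes (the example_* names are hypothetical):

    #include <linux/module.h>
    #include <linux/blkdev.h>

    static int example_open(struct gendisk *disk, blk_mode_t mode)
    {
            /* Reject writers on a read-only disk; get_disk_ro() is the stock helper. */
            if ((mode & BLK_OPEN_WRITE) && get_disk_ro(disk))
                    return -EROFS;
            return 0;
    }

    static void example_release(struct gendisk *disk)
    {
            /* No per-open state to tear down in this sketch. */
    }

    static const struct block_device_operations example_ops = {
            .owner   = THIS_MODULE,
            .open    = example_open,
            .release = example_release,
    };
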
@@ -2676,7 +2661,6 @@ static int pkt_setup_dev(dev_t dev, dev_t* pkt_dev)
        spin_lock_init(&pd->iosched.lock);
        bio_list_init(&pd->iosched.read_queue);
        bio_list_init(&pd->iosched.write_queue);
-       sprintf(pd->name, DRIVER_NAME"%d", idx);
        init_waitqueue_head(&pd->wqueue);
        pd->bio_queue = RB_ROOT;
 
@@ -2693,7 +2677,7 @@ static int pkt_setup_dev(dev_t dev, dev_t* pkt_dev)
        disk->minors = 1;
        disk->fops = &pktcdvd_ops;
        disk->flags = GENHD_FL_REMOVABLE | GENHD_FL_NO_PART;
-       strcpy(disk->disk_name, pd->name);
+       snprintf(disk->disk_name, sizeof(disk->disk_name), DRIVER_NAME"%d", idx);
        disk->private_data = pd;
 
        pd->pkt_dev = MKDEV(pktdev_major, idx);
@@ -2735,6 +2719,7 @@ out_mutex:
 static int pkt_remove_dev(dev_t pkt_dev)
 {
        struct pktcdvd_device *pd;
+       struct device *ddev;
        int idx;
        int ret = 0;
 
@@ -2755,6 +2740,9 @@ static int pkt_remove_dev(dev_t pkt_dev)
                ret = -EBUSY;
                goto out;
        }
+
+       ddev = disk_to_dev(pd->disk);
+
        if (!IS_ERR(pd->cdrw.thread))
                kthread_stop(pd->cdrw.thread);
 
@@ -2763,10 +2751,10 @@ static int pkt_remove_dev(dev_t pkt_dev)
        pkt_debugfs_dev_remove(pd);
        pkt_sysfs_dev_remove(pd);
 
-       blkdev_put(pd->bdev, FMODE_READ | FMODE_NDELAY);
+       blkdev_put(pd->bdev, NULL);
 
-       remove_proc_entry(pd->name, pkt_proc);
-       pkt_dbg(1, pd, "writer unmapped\n");
+       remove_proc_entry(pd->disk->disk_name, pkt_proc);
+       dev_notice(ddev, "writer unmapped\n");
 
        del_gendisk(pd->disk);
        put_disk(pd->disk);
index 632751d..bd0e075 100644 (file)
@@ -660,9 +660,9 @@ static bool pending_result_dec(struct pending_result *pending, int *result)
        return true;
 }
 
-static int rbd_open(struct block_device *bdev, fmode_t mode)
+static int rbd_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct rbd_device *rbd_dev = bdev->bd_disk->private_data;
+       struct rbd_device *rbd_dev = disk->private_data;
        bool removing = false;
 
        spin_lock_irq(&rbd_dev->lock);
@@ -679,7 +679,7 @@ static int rbd_open(struct block_device *bdev, fmode_t mode)
        return 0;
 }
 
-static void rbd_release(struct gendisk *disk, fmode_t mode)
+static void rbd_release(struct gendisk *disk)
 {
        struct rbd_device *rbd_dev = disk->private_data;
        unsigned long open_count_before;
index 40b3163..208e5f8 100644 (file)
@@ -3,13 +3,11 @@
 ccflags-y := -I$(srctree)/drivers/infiniband/ulp/rtrs
 
 rnbd-client-y := rnbd-clt.o \
-                 rnbd-clt-sysfs.o \
-                 rnbd-common.o
+                 rnbd-clt-sysfs.o
 
 CFLAGS_rnbd-srv-trace.o = -I$(src)
 
-rnbd-server-y := rnbd-common.o \
-                 rnbd-srv.o \
+rnbd-server-y := rnbd-srv.o \
                  rnbd-srv-sysfs.o \
                  rnbd-srv-trace.o
 
index 8c60879..c36d8b1 100644 (file)
@@ -24,7 +24,9 @@
 #include "rnbd-clt.h"
 
 static struct device *rnbd_dev;
-static struct class *rnbd_dev_class;
+static const struct class rnbd_dev_class = {
+       .name = "rnbd_client",
+};
 static struct kobject *rnbd_devs_kobj;
 
 enum {
@@ -278,7 +280,7 @@ static ssize_t access_mode_show(struct kobject *kobj,
 
        dev = container_of(kobj, struct rnbd_clt_dev, kobj);
 
-       return sysfs_emit(page, "%s\n", rnbd_access_mode_str(dev->access_mode));
+       return sysfs_emit(page, "%s\n", rnbd_access_modes[dev->access_mode].str);
 }
 
 static struct kobj_attribute rnbd_clt_access_mode =
@@ -596,7 +598,7 @@ static ssize_t rnbd_clt_map_device_store(struct kobject *kobj,
 
        pr_info("Mapping device %s on session %s, (access_mode: %s, nr_poll_queues: %d)\n",
                pathname, sessname,
-               rnbd_access_mode_str(access_mode),
+               rnbd_access_modes[access_mode].str,
                nr_poll_queues);
 
        dev = rnbd_clt_map_device(sessname, paths, path_cnt, port_nr, pathname,
@@ -646,11 +648,11 @@ int rnbd_clt_create_sysfs_files(void)
 {
        int err;
 
-       rnbd_dev_class = class_create("rnbd-client");
-       if (IS_ERR(rnbd_dev_class))
-               return PTR_ERR(rnbd_dev_class);
+       err = class_register(&rnbd_dev_class);
+       if (err)
+               return err;
 
-       rnbd_dev = device_create_with_groups(rnbd_dev_class, NULL,
+       rnbd_dev = device_create_with_groups(&rnbd_dev_class, NULL,
                                              MKDEV(0, 0), NULL,
                                              default_attr_groups, "ctl");
        if (IS_ERR(rnbd_dev)) {
@@ -666,9 +668,9 @@ int rnbd_clt_create_sysfs_files(void)
        return 0;
 
 dev_destroy:
-       device_destroy(rnbd_dev_class, MKDEV(0, 0));
+       device_destroy(&rnbd_dev_class, MKDEV(0, 0));
 cls_destroy:
-       class_destroy(rnbd_dev_class);
+       class_unregister(&rnbd_dev_class);
 
        return err;
 }
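
The sysfs setup above switches from a heap-allocated class (class_create()) to a statically defined const struct class registered with class_register() and torn down with class_unregister(), in line with the recent driver-core direction. Roughly, under those assumptions (example_class and the init/exit names are hypothetical):

    #include <linux/init.h>
    #include <linux/device.h>

    static const struct class example_class = {
            .name = "example",
    };

    static int __init example_init(void)
    {
            int err = class_register(&example_class);

            if (err)
                    return err;
            /* Devices are then created against the same object, e.g.:
             * device_create(&example_class, NULL, MKDEV(0, 0), NULL, "ctl");
             */
            return 0;
    }

    static void __exit example_exit(void)
    {
            class_unregister(&example_class);
    }
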
@@ -678,6 +680,6 @@ void rnbd_clt_destroy_sysfs_files(void)
        sysfs_remove_group(&rnbd_dev->kobj, &default_attr_group);
        kobject_del(rnbd_devs_kobj);
        kobject_put(rnbd_devs_kobj);
-       device_destroy(rnbd_dev_class, MKDEV(0, 0));
-       class_destroy(rnbd_dev_class);
+       device_destroy(&rnbd_dev_class, MKDEV(0, 0));
+       class_unregister(&rnbd_dev_class);
 }
index 5eb8c78..b0550b6 100644 (file)
@@ -921,11 +921,11 @@ rnbd_clt_session *find_or_create_sess(const char *sessname, bool *first)
        return sess;
 }
 
-static int rnbd_client_open(struct block_device *block_device, fmode_t mode)
+static int rnbd_client_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct rnbd_clt_dev *dev = block_device->bd_disk->private_data;
+       struct rnbd_clt_dev *dev = disk->private_data;
 
-       if (get_disk_ro(dev->gd) && (mode & FMODE_WRITE))
+       if (get_disk_ro(dev->gd) && (mode & BLK_OPEN_WRITE))
                return -EPERM;
 
        if (dev->dev_state == DEV_STATE_UNMAPPED ||
@@ -935,7 +935,7 @@ static int rnbd_client_open(struct block_device *block_device, fmode_t mode)
        return 0;
 }
 
-static void rnbd_client_release(struct gendisk *gen, fmode_t mode)
+static void rnbd_client_release(struct gendisk *gen)
 {
        struct rnbd_clt_dev *dev = gen->private_data;
 
diff --git a/drivers/block/rnbd/rnbd-common.c b/drivers/block/rnbd/rnbd-common.c
deleted file mode 100644 (file)
index 596c3f7..0000000
+++ /dev/null
@@ -1,23 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/*
- * RDMA Network Block Driver
- *
- * Copyright (c) 2014 - 2018 ProfitBricks GmbH. All rights reserved.
- * Copyright (c) 2018 - 2019 1&1 IONOS Cloud GmbH. All rights reserved.
- * Copyright (c) 2019 - 2020 1&1 IONOS SE. All rights reserved.
- */
-#include "rnbd-proto.h"
-
-const char *rnbd_access_mode_str(enum rnbd_access_mode mode)
-{
-       switch (mode) {
-       case RNBD_ACCESS_RO:
-               return "ro";
-       case RNBD_ACCESS_RW:
-               return "rw";
-       case RNBD_ACCESS_MIGRATION:
-               return "migration";
-       default:
-               return "unknown";
-       }
-}
index da1d054..e32f8f2 100644 (file)
@@ -61,6 +61,15 @@ enum rnbd_access_mode {
        RNBD_ACCESS_MIGRATION,
 };
 
+static const __maybe_unused struct {
+       enum rnbd_access_mode mode;
+       const char *str;
+} rnbd_access_modes[] = {
+       [RNBD_ACCESS_RO] = {RNBD_ACCESS_RO, "ro"},
+       [RNBD_ACCESS_RW] = {RNBD_ACCESS_RW, "rw"},
+       [RNBD_ACCESS_MIGRATION] = {RNBD_ACCESS_MIGRATION, "migration"},
+};
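
With rnbd-common.c removed, both client and server resolve the access-mode string through this shared, designated-initializer table instead of the old rnbd_access_mode_str() switch. A tiny sketch of a bounds-checked lookup (access_mode_name is a hypothetical wrapper; the driver indexes the table directly):

    #include <linux/kernel.h>   /* ARRAY_SIZE */

    static const char *access_mode_name(enum rnbd_access_mode mode)
    {
            if ((size_t)mode >= ARRAY_SIZE(rnbd_access_modes))
                    return "unknown";
            return rnbd_access_modes[mode].str;
    }
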
+
 /**
  * struct rnbd_msg_sess_info - initial session info from client to server
  * @hdr:               message header
@@ -185,7 +194,6 @@ struct rnbd_msg_io {
 enum rnbd_io_flags {
 
        /* Operations */
-
        RNBD_OP_READ            = 0,
        RNBD_OP_WRITE           = 1,
        RNBD_OP_FLUSH           = 2,
@@ -193,15 +201,9 @@ enum rnbd_io_flags {
        RNBD_OP_SECURE_ERASE    = 4,
        RNBD_OP_WRITE_SAME      = 5,
 
-       RNBD_OP_LAST,
-
        /* Flags */
-
        RNBD_F_SYNC  = 1<<(RNBD_OP_BITS + 0),
        RNBD_F_FUA   = 1<<(RNBD_OP_BITS + 1),
-
-       RNBD_F_ALL   = (RNBD_F_SYNC | RNBD_F_FUA)
-
 };
 
 static inline u32 rnbd_op(u32 flags)
@@ -214,21 +216,6 @@ static inline u32 rnbd_flags(u32 flags)
        return flags & ~RNBD_OP_MASK;
 }
 
-static inline bool rnbd_flags_supported(u32 flags)
-{
-       u32 op;
-
-       op = rnbd_op(flags);
-       flags = rnbd_flags(flags);
-
-       if (op >= RNBD_OP_LAST)
-               return false;
-       if (flags & ~RNBD_F_ALL)
-               return false;
-
-       return true;
-}
-
 static inline blk_opf_t rnbd_to_bio_flags(u32 rnbd_opf)
 {
        blk_opf_t bio_opf;
index d5d9267..cba6ba4 100644 (file)
@@ -9,7 +9,6 @@
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
 
-#include <uapi/linux/limits.h>
 #include <linux/kobject.h>
 #include <linux/sysfs.h>
 #include <linux/stat.h>
@@ -20,7 +19,9 @@
 #include "rnbd-srv.h"
 
 static struct device *rnbd_dev;
-static struct class *rnbd_dev_class;
+static const struct class rnbd_dev_class = {
+       .name = "rnbd-server",
+};
 static struct kobject *rnbd_devs_kobj;
 
 static void rnbd_srv_dev_release(struct kobject *kobj)
@@ -88,8 +89,7 @@ static ssize_t read_only_show(struct kobject *kobj, struct kobj_attribute *attr,
 
        sess_dev = container_of(kobj, struct rnbd_srv_sess_dev, kobj);
 
-       return sysfs_emit(page, "%d\n",
-                         !(sess_dev->open_flags & FMODE_WRITE));
+       return sysfs_emit(page, "%d\n", sess_dev->readonly);
 }
 
 static struct kobj_attribute rnbd_srv_dev_session_ro_attr =
@@ -104,7 +104,7 @@ static ssize_t access_mode_show(struct kobject *kobj,
        sess_dev = container_of(kobj, struct rnbd_srv_sess_dev, kobj);
 
        return sysfs_emit(page, "%s\n",
-                         rnbd_access_mode_str(sess_dev->access_mode));
+                         rnbd_access_modes[sess_dev->access_mode].str);
 }
 
 static struct kobj_attribute rnbd_srv_dev_session_access_mode_attr =
@@ -215,12 +215,12 @@ int rnbd_srv_create_sysfs_files(void)
 {
        int err;
 
-       rnbd_dev_class = class_create("rnbd-server");
-       if (IS_ERR(rnbd_dev_class))
-               return PTR_ERR(rnbd_dev_class);
+       err = class_register(&rnbd_dev_class);
+       if (err)
+               return err;
 
-       rnbd_dev = device_create(rnbd_dev_class, NULL,
-                                 MKDEV(0, 0), NULL, "ctl");
+       rnbd_dev = device_create(&rnbd_dev_class, NULL,
+                                MKDEV(0, 0), NULL, "ctl");
        if (IS_ERR(rnbd_dev)) {
                err = PTR_ERR(rnbd_dev);
                goto cls_destroy;
@@ -234,9 +234,9 @@ int rnbd_srv_create_sysfs_files(void)
        return 0;
 
 dev_destroy:
-       device_destroy(rnbd_dev_class, MKDEV(0, 0));
+       device_destroy(&rnbd_dev_class, MKDEV(0, 0));
 cls_destroy:
-       class_destroy(rnbd_dev_class);
+       class_unregister(&rnbd_dev_class);
 
        return err;
 }
@@ -245,6 +245,6 @@ void rnbd_srv_destroy_sysfs_files(void)
 {
        kobject_del(rnbd_devs_kobj);
        kobject_put(rnbd_devs_kobj);
-       device_destroy(rnbd_dev_class, MKDEV(0, 0));
-       class_destroy(rnbd_dev_class);
+       device_destroy(&rnbd_dev_class, MKDEV(0, 0));
+       class_unregister(&rnbd_dev_class);
 }
index 2cfed2e..c186df0 100644 (file)
@@ -96,7 +96,7 @@ rnbd_get_sess_dev(int dev_id, struct rnbd_srv_session *srv_sess)
                ret = kref_get_unless_zero(&sess_dev->kref);
        rcu_read_unlock();
 
-       if (!sess_dev || !ret)
+       if (!ret)
                return ERR_PTR(-ENXIO);
 
        return sess_dev;
@@ -180,7 +180,7 @@ static void destroy_device(struct kref *kref)
 
        WARN_ONCE(!list_empty(&dev->sess_dev_list),
                  "Device %s is being destroyed but still in use!\n",
-                 dev->id);
+                 dev->name);
 
        spin_lock(&dev_lock);
        list_del(&dev->list);
@@ -219,10 +219,10 @@ void rnbd_destroy_sess_dev(struct rnbd_srv_sess_dev *sess_dev, bool keep_id)
        rnbd_put_sess_dev(sess_dev);
        wait_for_completion(&dc); /* wait for inflights to drop to zero */
 
-       blkdev_put(sess_dev->bdev, sess_dev->open_flags);
+       blkdev_put(sess_dev->bdev, NULL);
        mutex_lock(&sess_dev->dev->lock);
        list_del(&sess_dev->dev_list);
-       if (sess_dev->open_flags & FMODE_WRITE)
+       if (!sess_dev->readonly)
                sess_dev->dev->open_write_cnt--;
        mutex_unlock(&sess_dev->dev->lock);
 
@@ -356,7 +356,7 @@ static int process_msg_open(struct rnbd_srv_session *srv_sess,
                            const void *msg, size_t len,
                            void *data, size_t datalen);
 
-static int process_msg_sess_info(struct rnbd_srv_session *srv_sess,
+static void process_msg_sess_info(struct rnbd_srv_session *srv_sess,
                                 const void *msg, size_t len,
                                 void *data, size_t datalen);
 
@@ -384,8 +384,7 @@ static int rnbd_srv_rdma_ev(void *priv, struct rtrs_srv_op *id,
                ret = process_msg_open(srv_sess, usr, usrlen, data, datalen);
                break;
        case RNBD_MSG_SESS_INFO:
-               ret = process_msg_sess_info(srv_sess, usr, usrlen, data,
-                                           datalen);
+               process_msg_sess_info(srv_sess, usr, usrlen, data, datalen);
                break;
        default:
                pr_warn("Received unexpected message type %d from session %s\n",
@@ -431,7 +430,7 @@ static struct rnbd_srv_dev *rnbd_srv_init_srv_dev(struct block_device *bdev)
        if (!dev)
                return ERR_PTR(-ENOMEM);
 
-       snprintf(dev->id, sizeof(dev->id), "%pg", bdev);
+       snprintf(dev->name, sizeof(dev->name), "%pg", bdev);
        kref_init(&dev->kref);
        INIT_LIST_HEAD(&dev->sess_dev_list);
        mutex_init(&dev->lock);
@@ -446,7 +445,7 @@ rnbd_srv_find_or_add_srv_dev(struct rnbd_srv_dev *new_dev)
 
        spin_lock(&dev_lock);
        list_for_each_entry(dev, &dev_list, list) {
-               if (!strncmp(dev->id, new_dev->id, sizeof(dev->id))) {
+               if (!strncmp(dev->name, new_dev->name, sizeof(dev->name))) {
                        if (!kref_get_unless_zero(&dev->kref))
                                /*
                                 * We lost the race, device is almost dead.
@@ -467,39 +466,38 @@ static int rnbd_srv_check_update_open_perm(struct rnbd_srv_dev *srv_dev,
                                            struct rnbd_srv_session *srv_sess,
                                            enum rnbd_access_mode access_mode)
 {
-       int ret = -EPERM;
+       int ret = 0;
 
        mutex_lock(&srv_dev->lock);
 
        switch (access_mode) {
        case RNBD_ACCESS_RO:
-               ret = 0;
                break;
        case RNBD_ACCESS_RW:
                if (srv_dev->open_write_cnt == 0)  {
                        srv_dev->open_write_cnt++;
-                       ret = 0;
                } else {
                        pr_err("Mapping device '%s' for session %s with RW permissions failed. Device already opened as 'RW' by %d client(s), access mode %s.\n",
-                              srv_dev->id, srv_sess->sessname,
+                              srv_dev->name, srv_sess->sessname,
                               srv_dev->open_write_cnt,
-                              rnbd_access_mode_str(access_mode));
+                              rnbd_access_modes[access_mode].str);
+                       ret = -EPERM;
                }
                break;
        case RNBD_ACCESS_MIGRATION:
                if (srv_dev->open_write_cnt < 2) {
                        srv_dev->open_write_cnt++;
-                       ret = 0;
                } else {
                        pr_err("Mapping device '%s' for session %s with migration permissions failed. Device already opened as 'RW' by %d client(s), access mode %s.\n",
-                              srv_dev->id, srv_sess->sessname,
+                              srv_dev->name, srv_sess->sessname,
                               srv_dev->open_write_cnt,
-                              rnbd_access_mode_str(access_mode));
+                              rnbd_access_modes[access_mode].str);
+                       ret = -EPERM;
                }
                break;
        default:
                pr_err("Received mapping request for device '%s' on session %s with invalid access mode: %d\n",
-                      srv_dev->id, srv_sess->sessname, access_mode);
+                      srv_dev->name, srv_sess->sessname, access_mode);
                ret = -EINVAL;
        }
 
@@ -561,7 +559,7 @@ static void rnbd_srv_fill_msg_open_rsp(struct rnbd_msg_open_rsp *rsp,
 static struct rnbd_srv_sess_dev *
 rnbd_srv_create_set_sess_dev(struct rnbd_srv_session *srv_sess,
                              const struct rnbd_msg_open *open_msg,
-                             struct block_device *bdev, fmode_t open_flags,
+                             struct block_device *bdev, bool readonly,
                              struct rnbd_srv_dev *srv_dev)
 {
        struct rnbd_srv_sess_dev *sdev = rnbd_sess_dev_alloc(srv_sess);
@@ -576,7 +574,7 @@ rnbd_srv_create_set_sess_dev(struct rnbd_srv_session *srv_sess,
        sdev->bdev              = bdev;
        sdev->sess              = srv_sess;
        sdev->dev               = srv_dev;
-       sdev->open_flags        = open_flags;
+       sdev->readonly          = readonly;
        sdev->access_mode       = open_msg->access_mode;
 
        return sdev;
@@ -631,7 +629,7 @@ static char *rnbd_srv_get_full_path(struct rnbd_srv_session *srv_sess,
        return full_path;
 }
 
-static int process_msg_sess_info(struct rnbd_srv_session *srv_sess,
+static void process_msg_sess_info(struct rnbd_srv_session *srv_sess,
                                 const void *msg, size_t len,
                                 void *data, size_t datalen)
 {
@@ -644,8 +642,6 @@ static int process_msg_sess_info(struct rnbd_srv_session *srv_sess,
 
        rsp->hdr.type = cpu_to_le16(RNBD_MSG_SESS_INFO_RSP);
        rsp->ver = srv_sess->ver;
-
-       return 0;
 }
 
 /**
@@ -681,15 +677,14 @@ static int process_msg_open(struct rnbd_srv_session *srv_sess,
        struct rnbd_srv_sess_dev *srv_sess_dev;
        const struct rnbd_msg_open *open_msg = msg;
        struct block_device *bdev;
-       fmode_t open_flags;
+       blk_mode_t open_flags = BLK_OPEN_READ;
        char *full_path;
        struct rnbd_msg_open_rsp *rsp = data;
 
        trace_process_msg_open(srv_sess, open_msg);
 
-       open_flags = FMODE_READ;
        if (open_msg->access_mode != RNBD_ACCESS_RO)
-               open_flags |= FMODE_WRITE;
+               open_flags |= BLK_OPEN_WRITE;
 
        mutex_lock(&srv_sess->lock);
 
@@ -719,7 +714,7 @@ static int process_msg_open(struct rnbd_srv_session *srv_sess,
                goto reject;
        }
 
-       bdev = blkdev_get_by_path(full_path, open_flags, THIS_MODULE);
+       bdev = blkdev_get_by_path(full_path, open_flags, NULL, NULL);
        if (IS_ERR(bdev)) {
                ret = PTR_ERR(bdev);
                pr_err("Opening device '%s' on session %s failed, failed to open the block device, err: %d\n",
@@ -736,9 +731,9 @@ static int process_msg_open(struct rnbd_srv_session *srv_sess,
                goto blkdev_put;
        }
 
-       srv_sess_dev = rnbd_srv_create_set_sess_dev(srv_sess, open_msg,
-                                                    bdev, open_flags,
-                                                    srv_dev);
+       srv_sess_dev = rnbd_srv_create_set_sess_dev(srv_sess, open_msg, bdev,
+                               open_msg->access_mode == RNBD_ACCESS_RO,
+                               srv_dev);
        if (IS_ERR(srv_sess_dev)) {
                pr_err("Opening device '%s' on session %s failed, creating sess_dev failed, err: %ld\n",
                       full_path, srv_sess->sessname, PTR_ERR(srv_sess_dev));
@@ -774,7 +769,7 @@ static int process_msg_open(struct rnbd_srv_session *srv_sess,
        list_add(&srv_sess_dev->dev_list, &srv_dev->sess_dev_list);
        mutex_unlock(&srv_dev->lock);
 
-       rnbd_srv_info(srv_sess_dev, "Opened device '%s'\n", srv_dev->id);
+       rnbd_srv_info(srv_sess_dev, "Opened device '%s'\n", srv_dev->name);
 
        kfree(full_path);
 
@@ -795,7 +790,7 @@ srv_dev_put:
        }
        rnbd_put_srv_dev(srv_dev);
 blkdev_put:
-       blkdev_put(bdev, open_flags);
+       blkdev_put(bdev, NULL);
 free_path:
        kfree(full_path);
 reject:
@@ -808,7 +803,7 @@ static struct rtrs_srv_ctx *rtrs_ctx;
 static struct rtrs_srv_ops rtrs_ops;
 static int __init rnbd_srv_init_module(void)
 {
-       int err;
+       int err = 0;
 
        BUILD_BUG_ON(sizeof(struct rnbd_msg_hdr) != 4);
        BUILD_BUG_ON(sizeof(struct rnbd_msg_sess_info) != 36);
@@ -822,19 +817,17 @@ static int __init rnbd_srv_init_module(void)
        };
        rtrs_ctx = rtrs_srv_open(&rtrs_ops, port_nr);
        if (IS_ERR(rtrs_ctx)) {
-               err = PTR_ERR(rtrs_ctx);
                pr_err("rtrs_srv_open(), err: %d\n", err);
-               return err;
+               return PTR_ERR(rtrs_ctx);
        }
 
        err = rnbd_srv_create_sysfs_files();
        if (err) {
                pr_err("rnbd_srv_create_sysfs_files(), err: %d\n", err);
                rtrs_srv_close(rtrs_ctx);
-               return err;
        }
 
-       return 0;
+       return err;
 }
 
 static void __exit rnbd_srv_cleanup_module(void)
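
The open path above also switches to the reworked block-device open API: FMODE_* flags become blk_mode_t BLK_OPEN_* flags, and blkdev_get_by_path()/blkdev_put() now pair on a holder pointer rather than on the open mode. A rough sketch of the calling convention, assuming a non-exclusive read-write open (NULL holder, NULL holder ops) and an illustrative device path:

    struct block_device *bdev;

    bdev = blkdev_get_by_path("/dev/sda", BLK_OPEN_READ | BLK_OPEN_WRITE,
                              NULL, NULL);          /* holder, blk_holder_ops */
    if (IS_ERR(bdev))
            return PTR_ERR(bdev);
    /* ... issue I/O against bdev ... */
    blkdev_put(bdev, NULL);                         /* same holder as the get */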
index f5962fd..1027656 100644 (file)
@@ -35,7 +35,7 @@ struct rnbd_srv_dev {
        struct kobject                  dev_kobj;
        struct kobject                  *dev_sessions_kobj;
        struct kref                     kref;
-       char                            id[NAME_MAX];
+       char                            name[NAME_MAX];
        /* List of rnbd_srv_sess_dev structs */
        struct list_head                sess_dev_list;
        struct mutex                    lock;
@@ -52,7 +52,7 @@ struct rnbd_srv_sess_dev {
        struct kobject                  kobj;
        u32                             device_id;
        bool                            keep_id;
-       fmode_t                         open_flags;
+       bool                            readonly;
        struct kref                     kref;
        struct completion               *destroy_comp;
        char                            pathname[NAME_MAX];
index 9fa821f..7bf4b48 100644 (file)
@@ -139,7 +139,7 @@ static int vdc_getgeo(struct block_device *bdev, struct hd_geometry *geo)
  * when vdisk_mtype is VD_MEDIA_TYPE_CD or VD_MEDIA_TYPE_DVD.
  * Needed to be able to install inside an ldom from an iso image.
  */
-static int vdc_ioctl(struct block_device *bdev, fmode_t mode,
+static int vdc_ioctl(struct block_device *bdev, blk_mode_t mode,
                     unsigned command, unsigned long argument)
 {
        struct vdc_port *port = bdev->bd_disk->private_data;
index 42b4b68..f85b6af 100644 (file)
@@ -608,20 +608,18 @@ static void setup_medium(struct floppy_state *fs)
        }
 }
 
-static int floppy_open(struct block_device *bdev, fmode_t mode)
+static int floppy_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct floppy_state *fs = bdev->bd_disk->private_data;
+       struct floppy_state *fs = disk->private_data;
        struct swim __iomem *base = fs->swd->base;
        int err;
 
-       if (fs->ref_count == -1 || (fs->ref_count && mode & FMODE_EXCL))
+       if (fs->ref_count == -1 || (fs->ref_count && mode & BLK_OPEN_EXCL))
                return -EBUSY;
-
-       if (mode & FMODE_EXCL)
+       if (mode & BLK_OPEN_EXCL)
                fs->ref_count = -1;
        else
                fs->ref_count++;
-
        swim_write(base, setup, S_IBM_DRIVE  | S_FCLK_DIV2);
        udelay(10);
        swim_drive(base, fs->location);
@@ -636,13 +634,13 @@ static int floppy_open(struct block_device *bdev, fmode_t mode)
 
        set_capacity(fs->disk, fs->total_secs);
 
-       if (mode & FMODE_NDELAY)
+       if (mode & BLK_OPEN_NDELAY)
                return 0;
 
-       if (mode & (FMODE_READ|FMODE_WRITE)) {
-               if (bdev_check_media_change(bdev) && fs->disk_in)
+       if (mode & (BLK_OPEN_READ | BLK_OPEN_WRITE)) {
+               if (disk_check_media_change(disk) && fs->disk_in)
                        fs->ejected = 0;
-               if ((mode & FMODE_WRITE) && fs->write_protected) {
+               if ((mode & BLK_OPEN_WRITE) && fs->write_protected) {
                        err = -EROFS;
                        goto out;
                }
@@ -659,18 +657,18 @@ out:
        return err;
 }
 
-static int floppy_unlocked_open(struct block_device *bdev, fmode_t mode)
+static int floppy_unlocked_open(struct gendisk *disk, blk_mode_t mode)
 {
        int ret;
 
        mutex_lock(&swim_mutex);
-       ret = floppy_open(bdev, mode);
+       ret = floppy_open(disk, mode);
        mutex_unlock(&swim_mutex);
 
        return ret;
 }
 
-static void floppy_release(struct gendisk *disk, fmode_t mode)
+static void floppy_release(struct gendisk *disk)
 {
        struct floppy_state *fs = disk->private_data;
        struct swim __iomem *base = fs->swd->base;
@@ -686,7 +684,7 @@ static void floppy_release(struct gendisk *disk, fmode_t mode)
        mutex_unlock(&swim_mutex);
 }
 
-static int floppy_ioctl(struct block_device *bdev, fmode_t mode,
+static int floppy_ioctl(struct block_device *bdev, blk_mode_t mode,
                        unsigned int cmd, unsigned long param)
 {
        struct floppy_state *fs = bdev->bd_disk->private_data;
index da811a7..dc43a63 100644 (file)
@@ -246,10 +246,9 @@ static int grab_drive(struct floppy_state *fs, enum swim_state state,
                      int interruptible);
 static void release_drive(struct floppy_state *fs);
 static int fd_eject(struct floppy_state *fs);
-static int floppy_ioctl(struct block_device *bdev, fmode_t mode,
+static int floppy_ioctl(struct block_device *bdev, blk_mode_t mode,
                        unsigned int cmd, unsigned long param);
-static int floppy_open(struct block_device *bdev, fmode_t mode);
-static void floppy_release(struct gendisk *disk, fmode_t mode);
+static int floppy_open(struct gendisk *disk, blk_mode_t mode);
 static unsigned int floppy_check_events(struct gendisk *disk,
                                        unsigned int clearing);
 static int floppy_revalidate(struct gendisk *disk);
@@ -883,7 +882,7 @@ static int fd_eject(struct floppy_state *fs)
 static struct floppy_struct floppy_type =
        { 2880,18,2,80,0,0x1B,0x00,0xCF,0x6C,NULL };    /*  7 1.44MB 3.5"   */
 
-static int floppy_locked_ioctl(struct block_device *bdev, fmode_t mode,
+static int floppy_locked_ioctl(struct block_device *bdev, blk_mode_t mode,
                        unsigned int cmd, unsigned long param)
 {
        struct floppy_state *fs = bdev->bd_disk->private_data;
@@ -911,7 +910,7 @@ static int floppy_locked_ioctl(struct block_device *bdev, fmode_t mode,
        return -ENOTTY;
 }
 
-static int floppy_ioctl(struct block_device *bdev, fmode_t mode,
+static int floppy_ioctl(struct block_device *bdev, blk_mode_t mode,
                                 unsigned int cmd, unsigned long param)
 {
        int ret;
@@ -923,9 +922,9 @@ static int floppy_ioctl(struct block_device *bdev, fmode_t mode,
        return ret;
 }
 
-static int floppy_open(struct block_device *bdev, fmode_t mode)
+static int floppy_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct floppy_state *fs = bdev->bd_disk->private_data;
+       struct floppy_state *fs = disk->private_data;
        struct swim3 __iomem *sw = fs->swim3;
        int n, err = 0;
 
@@ -958,18 +957,18 @@ static int floppy_open(struct block_device *bdev, fmode_t mode)
                swim3_action(fs, SETMFM);
                swim3_select(fs, RELAX);
 
-       } else if (fs->ref_count == -1 || mode & FMODE_EXCL)
+       } else if (fs->ref_count == -1 || mode & BLK_OPEN_EXCL)
                return -EBUSY;
 
-       if (err == 0 && (mode & FMODE_NDELAY) == 0
-           && (mode & (FMODE_READ|FMODE_WRITE))) {
-               if (bdev_check_media_change(bdev))
-                       floppy_revalidate(bdev->bd_disk);
+       if (err == 0 && !(mode & BLK_OPEN_NDELAY) &&
+           (mode & (BLK_OPEN_READ | BLK_OPEN_WRITE))) {
+               if (disk_check_media_change(disk))
+                       floppy_revalidate(disk);
                if (fs->ejected)
                        err = -ENXIO;
        }
 
-       if (err == 0 && (mode & FMODE_WRITE)) {
+       if (err == 0 && (mode & BLK_OPEN_WRITE)) {
                if (fs->write_prot < 0)
                        fs->write_prot = swim3_readbit(fs, WRITE_PROT);
                if (fs->write_prot)
@@ -985,7 +984,7 @@ static int floppy_open(struct block_device *bdev, fmode_t mode)
                return err;
        }
 
-       if (mode & FMODE_EXCL)
+       if (mode & BLK_OPEN_EXCL)
                fs->ref_count = -1;
        else
                ++fs->ref_count;
@@ -993,18 +992,18 @@ static int floppy_open(struct block_device *bdev, fmode_t mode)
        return 0;
 }
 
-static int floppy_unlocked_open(struct block_device *bdev, fmode_t mode)
+static int floppy_unlocked_open(struct gendisk *disk, blk_mode_t mode)
 {
        int ret;
 
        mutex_lock(&swim3_mutex);
-       ret = floppy_open(bdev, mode);
+       ret = floppy_open(disk, mode);
        mutex_unlock(&swim3_mutex);
 
        return ret;
 }
 
-static void floppy_release(struct gendisk *disk, fmode_t mode)
+static void floppy_release(struct gendisk *disk)
 {
        struct floppy_state *fs = disk->private_data;
        struct swim3 __iomem *sw = fs->swim3;
index 33d3298..1c82375 100644 (file)
@@ -43,6 +43,7 @@
 #include <asm/page.h>
 #include <linux/task_work.h>
 #include <linux/namei.h>
+#include <linux/kref.h>
 #include <uapi/linux/ublk_cmd.h>
 
 #define UBLK_MINORS            (1U << MINORBITS)
@@ -54,7 +55,8 @@
                | UBLK_F_USER_RECOVERY \
                | UBLK_F_USER_RECOVERY_REISSUE \
                | UBLK_F_UNPRIVILEGED_DEV \
-               | UBLK_F_CMD_IOCTL_ENCODE)
+               | UBLK_F_CMD_IOCTL_ENCODE \
+               | UBLK_F_USER_COPY)
 
 /* All UBLK_PARAM_TYPE_* should be included here */
 #define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | \
@@ -62,7 +64,8 @@
 
 struct ublk_rq_data {
        struct llist_node node;
-       struct callback_head work;
+
+       struct kref ref;
 };
 
 struct ublk_uring_cmd_pdu {
@@ -182,8 +185,13 @@ struct ublk_params_header {
        __u32   types;
 };
 
+static inline void __ublk_complete_rq(struct request *req);
+static void ublk_complete_rq(struct kref *ref);
+
 static dev_t ublk_chr_devt;
-static struct class *ublk_chr_class;
+static const struct class ublk_chr_class = {
+       .name = "ublk-char",
+};
 
 static DEFINE_IDR(ublk_index_idr);
 static DEFINE_SPINLOCK(ublk_idr_lock);
@@ -202,6 +210,23 @@ static unsigned int ublks_added;   /* protected by ublk_ctl_mutex */
 
 static struct miscdevice ublk_misc;
 
+static inline unsigned ublk_pos_to_hwq(loff_t pos)
+{
+       return ((pos - UBLKSRV_IO_BUF_OFFSET) >> UBLK_QID_OFF) &
+               UBLK_QID_BITS_MASK;
+}
+
+static inline unsigned ublk_pos_to_buf_off(loff_t pos)
+{
+       return (pos - UBLKSRV_IO_BUF_OFFSET) & UBLK_IO_BUF_BITS_MASK;
+}
+
+static inline unsigned ublk_pos_to_tag(loff_t pos)
+{
+       return ((pos - UBLKSRV_IO_BUF_OFFSET) >> UBLK_TAG_OFF) &
+               UBLK_TAG_BITS_MASK;
+}
+
 static void ublk_dev_param_basic_apply(struct ublk_device *ub)
 {
        struct request_queue *q = ub->ub_disk->queue;
@@ -290,12 +315,52 @@ static int ublk_apply_params(struct ublk_device *ub)
        return 0;
 }
 
-static inline bool ublk_can_use_task_work(const struct ublk_queue *ubq)
+static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
 {
-       if (IS_BUILTIN(CONFIG_BLK_DEV_UBLK) &&
-                       !(ubq->flags & UBLK_F_URING_CMD_COMP_IN_TASK))
-               return true;
-       return false;
+       return ubq->flags & UBLK_F_USER_COPY;
+}
+
+static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
+{
+       /*
+        * read()/write() is involved in user copy, so request reference
+        * has to be grabbed
+        */
+       return ublk_support_user_copy(ubq);
+}
+
+static inline void ublk_init_req_ref(const struct ublk_queue *ubq,
+               struct request *req)
+{
+       if (ublk_need_req_ref(ubq)) {
+               struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
+
+               kref_init(&data->ref);
+       }
+}
+
+static inline bool ublk_get_req_ref(const struct ublk_queue *ubq,
+               struct request *req)
+{
+       if (ublk_need_req_ref(ubq)) {
+               struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
+
+               return kref_get_unless_zero(&data->ref);
+       }
+
+       return true;
+}
+
+static inline void ublk_put_req_ref(const struct ublk_queue *ubq,
+               struct request *req)
+{
+       if (ublk_need_req_ref(ubq)) {
+               struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
+
+               kref_put(&data->ref, ublk_complete_rq);
+       } else {
+               __ublk_complete_rq(req);
+       }
 }
 
 static inline bool ublk_need_get_data(const struct ublk_queue *ubq)
@@ -384,9 +449,9 @@ static void ublk_store_owner_uid_gid(unsigned int *owner_uid,
        *owner_gid = from_kgid(&init_user_ns, gid);
 }
 
-static int ublk_open(struct block_device *bdev, fmode_t mode)
+static int ublk_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct ublk_device *ub = bdev->bd_disk->private_data;
+       struct ublk_device *ub = disk->private_data;
 
        if (capable(CAP_SYS_ADMIN))
                return 0;
@@ -421,49 +486,39 @@ static const struct block_device_operations ub_fops = {
 
 #define UBLK_MAX_PIN_PAGES     32
 
-struct ublk_map_data {
-       const struct request *rq;
-       unsigned long   ubuf;
-       unsigned int    len;
-};
-
 struct ublk_io_iter {
        struct page *pages[UBLK_MAX_PIN_PAGES];
-       unsigned pg_off;        /* offset in the 1st page in pages */
-       int nr_pages;           /* how many page pointers in pages */
        struct bio *bio;
        struct bvec_iter iter;
 };
 
-static inline unsigned ublk_copy_io_pages(struct ublk_io_iter *data,
-               unsigned max_bytes, bool to_vm)
+/* copy 'total' bytes between the request's bio pages and the pinned user pages */
+static void ublk_copy_io_pages(struct ublk_io_iter *data,
+               size_t total, size_t pg_off, int dir)
 {
-       const unsigned total = min_t(unsigned, max_bytes,
-                       PAGE_SIZE - data->pg_off +
-                       ((data->nr_pages - 1) << PAGE_SHIFT));
        unsigned done = 0;
        unsigned pg_idx = 0;
 
        while (done < total) {
                struct bio_vec bv = bio_iter_iovec(data->bio, data->iter);
-               const unsigned int bytes = min3(bv.bv_len, total - done,
-                               (unsigned)(PAGE_SIZE - data->pg_off));
+               unsigned int bytes = min3(bv.bv_len, (unsigned)total - done,
+                               (unsigned)(PAGE_SIZE - pg_off));
                void *bv_buf = bvec_kmap_local(&bv);
                void *pg_buf = kmap_local_page(data->pages[pg_idx]);
 
-               if (to_vm)
-                       memcpy(pg_buf + data->pg_off, bv_buf, bytes);
+               if (dir == ITER_DEST)
+                       memcpy(pg_buf + pg_off, bv_buf, bytes);
                else
-                       memcpy(bv_buf, pg_buf + data->pg_off, bytes);
+                       memcpy(bv_buf, pg_buf + pg_off, bytes);
 
                kunmap_local(pg_buf);
                kunmap_local(bv_buf);
 
                /* advance page array */
-               data->pg_off += bytes;
-               if (data->pg_off == PAGE_SIZE) {
+               pg_off += bytes;
+               if (pg_off == PAGE_SIZE) {
                        pg_idx += 1;
-                       data->pg_off = 0;
+                       pg_off = 0;
                }
 
                done += bytes;
@@ -477,41 +532,58 @@ static inline unsigned ublk_copy_io_pages(struct ublk_io_iter *data,
                        data->iter = data->bio->bi_iter;
                }
        }
+}
 
-       return done;
+static bool ublk_advance_io_iter(const struct request *req,
+               struct ublk_io_iter *iter, unsigned int offset)
+{
+       struct bio *bio = req->bio;
+
+       for_each_bio(bio) {
+               if (bio->bi_iter.bi_size > offset) {
+                       iter->bio = bio;
+                       iter->iter = bio->bi_iter;
+                       bio_advance_iter(iter->bio, &iter->iter, offset);
+                       return true;
+               }
+               offset -= bio->bi_iter.bi_size;
+       }
+       return false;
 }
 
-static int ublk_copy_user_pages(struct ublk_map_data *data, bool to_vm)
+/*
+ * Copy data between the request's pages and an io_iter; 'offset' is
+ * the byte offset into the request at which the copy starts.
+ */
+static size_t ublk_copy_user_pages(const struct request *req,
+               unsigned offset, struct iov_iter *uiter, int dir)
 {
-       const unsigned int gup_flags = to_vm ? FOLL_WRITE : 0;
-       const unsigned long start_vm = data->ubuf;
-       unsigned int done = 0;
-       struct ublk_io_iter iter = {
-               .pg_off = start_vm & (PAGE_SIZE - 1),
-               .bio    = data->rq->bio,
-               .iter   = data->rq->bio->bi_iter,
-       };
-       const unsigned int nr_pages = round_up(data->len +
-                       (start_vm & (PAGE_SIZE - 1)), PAGE_SIZE) >> PAGE_SHIFT;
-
-       while (done < nr_pages) {
-               const unsigned to_pin = min_t(unsigned, UBLK_MAX_PIN_PAGES,
-                               nr_pages - done);
-               unsigned i, len;
-
-               iter.nr_pages = get_user_pages_fast(start_vm +
-                               (done << PAGE_SHIFT), to_pin, gup_flags,
-                               iter.pages);
-               if (iter.nr_pages <= 0)
-                       return done == 0 ? iter.nr_pages : done;
-               len = ublk_copy_io_pages(&iter, data->len, to_vm);
-               for (i = 0; i < iter.nr_pages; i++) {
-                       if (to_vm)
+       struct ublk_io_iter iter;
+       size_t done = 0;
+
+       if (!ublk_advance_io_iter(req, &iter, offset))
+               return 0;
+
+       while (iov_iter_count(uiter) && iter.bio) {
+               unsigned nr_pages;
+               ssize_t len;
+               size_t off;
+               int i;
+
+               len = iov_iter_get_pages2(uiter, iter.pages,
+                               iov_iter_count(uiter),
+                               UBLK_MAX_PIN_PAGES, &off);
+               if (len <= 0)
+                       return done;
+
+               ublk_copy_io_pages(&iter, len, off, dir);
+               nr_pages = DIV_ROUND_UP(len + off, PAGE_SIZE);
+               for (i = 0; i < nr_pages; i++) {
+                       if (dir == ITER_DEST)
                                set_page_dirty(iter.pages[i]);
                        put_page(iter.pages[i]);
                }
-               data->len -= len;
-               done += iter.nr_pages;
+               done += len;
        }
 
        return done;
@@ -532,21 +604,23 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
 {
        const unsigned int rq_bytes = blk_rq_bytes(req);
 
+       if (ublk_support_user_copy(ubq))
+               return rq_bytes;
+
        /*
         * no zero copy, we delay copy WRITE request data into ublksrv
         * context and the big benefit is that pinning pages in current
         * context is pretty fast, see ublk_pin_user_pages
         */
        if (ublk_need_map_req(req)) {
-               struct ublk_map_data data = {
-                       .rq     =       req,
-                       .ubuf   =       io->addr,
-                       .len    =       rq_bytes,
-               };
+               struct iov_iter iter;
+               struct iovec iov;
+               const int dir = ITER_DEST;
 
-               ublk_copy_user_pages(&data, true);
+               import_single_range(dir, u64_to_user_ptr(io->addr), rq_bytes,
+                               &iov, &iter);
 
-               return rq_bytes - data.len;
+               return ublk_copy_user_pages(req, 0, &iter, dir);
        }
        return rq_bytes;
 }
@@ -557,18 +631,19 @@ static int ublk_unmap_io(const struct ublk_queue *ubq,
 {
        const unsigned int rq_bytes = blk_rq_bytes(req);
 
+       if (ublk_support_user_copy(ubq))
+               return rq_bytes;
+
        if (ublk_need_unmap_req(req)) {
-               struct ublk_map_data data = {
-                       .rq     =       req,
-                       .ubuf   =       io->addr,
-                       .len    =       io->res,
-               };
+               struct iov_iter iter;
+               struct iovec iov;
+               const int dir = ITER_SOURCE;
 
                WARN_ON_ONCE(io->res > rq_bytes);
 
-               ublk_copy_user_pages(&data, false);
-
-               return io->res - data.len;
+               import_single_range(dir, u64_to_user_ptr(io->addr), io->res,
+                               &iov, &iter);
+               return ublk_copy_user_pages(req, 0, &iter, dir);
        }
        return rq_bytes;
 }
@@ -648,13 +723,19 @@ static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
 }
 
 /* todo: handle partial completion */
-static void ublk_complete_rq(struct request *req)
+static inline void __ublk_complete_rq(struct request *req)
 {
        struct ublk_queue *ubq = req->mq_hctx->driver_data;
        struct ublk_io *io = &ubq->ios[req->tag];
        unsigned int unmapped_bytes;
        blk_status_t res = BLK_STS_OK;
 
+       /* called from ublk_abort_queue() code path */
+       if (io->flags & UBLK_IO_FLAG_ABORTED) {
+               res = BLK_STS_IOERR;
+               goto exit;
+       }
+
        /* failed read IO if nothing is read */
        if (!io->res && req_op(req) == REQ_OP_READ)
                io->res = -EIO;
@@ -694,6 +775,15 @@ exit:
        blk_mq_end_request(req, res);
 }
 
+static void ublk_complete_rq(struct kref *ref)
+{
+       struct ublk_rq_data *data = container_of(ref, struct ublk_rq_data,
+                       ref);
+       struct request *req = blk_mq_rq_from_pdu(data);
+
+       __ublk_complete_rq(req);
+}
+
 /*
  * Since __ublk_rq_task_work always fails requests immediately during
  * exiting, __ublk_fail_req() is only called from abort context during
@@ -712,7 +802,7 @@ static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io,
                if (ublk_queue_can_use_recovery_reissue(ubq))
                        blk_mq_requeue_request(req, false);
                else
-                       blk_mq_end_request(req, BLK_STS_IOERR);
+                       ublk_put_req_ref(ubq, req);
        }
 }
 
@@ -821,6 +911,7 @@ static inline void __ublk_rq_task_work(struct request *req,
                        mapped_bytes >> 9;
        }
 
+       ublk_init_req_ref(ubq, req);
        ubq_complete_io_cmd(io, UBLK_IO_RES_OK, issue_flags);
 }
 
@@ -852,17 +943,6 @@ static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd, unsigned issue_flags)
        ublk_forward_io_cmds(ubq, issue_flags);
 }
 
-static void ublk_rq_task_work_fn(struct callback_head *work)
-{
-       struct ublk_rq_data *data = container_of(work,
-                       struct ublk_rq_data, work);
-       struct request *req = blk_mq_rq_from_pdu(data);
-       struct ublk_queue *ubq = req->mq_hctx->driver_data;
-       unsigned issue_flags = IO_URING_F_UNLOCKED;
-
-       ublk_forward_io_cmds(ubq, issue_flags);
-}
-
 static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
 {
        struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
@@ -886,10 +966,6 @@ static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
         */
        if (unlikely(io->flags & UBLK_IO_FLAG_ABORTED)) {
                ublk_abort_io_cmds(ubq);
-       } else if (ublk_can_use_task_work(ubq)) {
-               if (task_work_add(ubq->ubq_daemon, &data->work,
-                                       TWA_SIGNAL_NO_IPI))
-                       ublk_abort_io_cmds(ubq);
        } else {
                struct io_uring_cmd *cmd = io->cmd;
                struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
@@ -961,19 +1037,9 @@ static int ublk_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data,
        return 0;
 }
 
-static int ublk_init_rq(struct blk_mq_tag_set *set, struct request *req,
-               unsigned int hctx_idx, unsigned int numa_node)
-{
-       struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
-
-       init_task_work(&data->work, ublk_rq_task_work_fn);
-       return 0;
-}
-
 static const struct blk_mq_ops ublk_mq_ops = {
        .queue_rq       = ublk_queue_rq,
        .init_hctx      = ublk_init_hctx,
-       .init_request   = ublk_init_rq,
        .timeout        = ublk_timeout,
 };
 
@@ -1050,7 +1116,7 @@ static void ublk_commit_completion(struct ublk_device *ub,
        req = blk_mq_tag_to_rq(ub->tag_set.tags[qid], tag);
 
        if (req && likely(!blk_should_fake_timeout(req->q)))
-               ublk_complete_rq(req);
+               ublk_put_req_ref(ubq, req);
 }
 
 /*
@@ -1295,6 +1361,14 @@ static inline int ublk_check_cmd_op(u32 cmd_op)
        return 0;
 }
 
+static inline void ublk_fill_io_cmd(struct ublk_io *io,
+               struct io_uring_cmd *cmd, unsigned long buf_addr)
+{
+       io->cmd = cmd;
+       io->flags |= UBLK_IO_FLAG_ACTIVE;
+       io->addr = buf_addr;
+}
+
 static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
                               unsigned int issue_flags,
                               const struct ublksrv_io_cmd *ub_cmd)
@@ -1340,6 +1414,11 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
                        ^ (_IOC_NR(cmd_op) == UBLK_IO_NEED_GET_DATA))
                goto out;
 
+       if (ublk_support_user_copy(ubq) && ub_cmd->addr) {
+               ret = -EINVAL;
+               goto out;
+       }
+
        ret = ublk_check_cmd_op(cmd_op);
        if (ret)
                goto out;
@@ -1358,36 +1437,41 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
                 */
                if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)
                        goto out;
-               /* FETCH_RQ has to provide IO buffer if NEED GET DATA is not enabled */
-               if (!ub_cmd->addr && !ublk_need_get_data(ubq))
-                       goto out;
-               io->cmd = cmd;
-               io->flags |= UBLK_IO_FLAG_ACTIVE;
-               io->addr = ub_cmd->addr;
 
+               if (!ublk_support_user_copy(ubq)) {
+                       /*
+                        * FETCH_RQ has to provide IO buffer if NEED GET
+                        * DATA is not enabled
+                        */
+                       if (!ub_cmd->addr && !ublk_need_get_data(ubq))
+                               goto out;
+               }
+
+               ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
                ublk_mark_io_ready(ub, ubq);
                break;
        case UBLK_IO_COMMIT_AND_FETCH_REQ:
                req = blk_mq_tag_to_rq(ub->tag_set.tags[ub_cmd->q_id], tag);
-               /*
-                * COMMIT_AND_FETCH_REQ has to provide IO buffer if NEED GET DATA is
-                * not enabled or it is Read IO.
-                */
-               if (!ub_cmd->addr && (!ublk_need_get_data(ubq) || req_op(req) == REQ_OP_READ))
-                       goto out;
+
                if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
                        goto out;
-               io->addr = ub_cmd->addr;
-               io->flags |= UBLK_IO_FLAG_ACTIVE;
-               io->cmd = cmd;
+
+               if (!ublk_support_user_copy(ubq)) {
+                       /*
+                        * COMMIT_AND_FETCH_REQ has to provide IO buffer if
+                        * NEED GET DATA is not enabled or it is Read IO.
+                        */
+                       if (!ub_cmd->addr && (!ublk_need_get_data(ubq) ||
+                                               req_op(req) == REQ_OP_READ))
+                               goto out;
+               }
+               ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
                ublk_commit_completion(ub, ub_cmd);
                break;
        case UBLK_IO_NEED_GET_DATA:
                if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
                        goto out;
-               io->addr = ub_cmd->addr;
-               io->cmd = cmd;
-               io->flags |= UBLK_IO_FLAG_ACTIVE;
+               ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
                ublk_handle_need_get_data(ub, ub_cmd->q_id, ub_cmd->tag);
                break;
        default:
@@ -1402,6 +1486,36 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
        return -EIOCBQUEUED;
 }
 
+static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
+               struct ublk_queue *ubq, int tag, size_t offset)
+{
+       struct request *req;
+
+       if (!ublk_need_req_ref(ubq))
+               return NULL;
+
+       req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+       if (!req)
+               return NULL;
+
+       if (!ublk_get_req_ref(ubq, req))
+               return NULL;
+
+       if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
+               goto fail_put;
+
+       if (!ublk_rq_has_data(req))
+               goto fail_put;
+
+       if (offset > blk_rq_bytes(req))
+               goto fail_put;
+
+       return req;
+fail_put:
+       ublk_put_req_ref(ubq, req);
+       return NULL;
+}
+
 static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 {
        /*
@@ -1419,11 +1533,112 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
        return __ublk_ch_uring_cmd(cmd, issue_flags, &ub_cmd);
 }
 
+static inline bool ublk_check_ubuf_dir(const struct request *req,
+               int ubuf_dir)
+{
+       /* copy ubuf to request pages */
+       if (req_op(req) == REQ_OP_READ && ubuf_dir == ITER_SOURCE)
+               return true;
+
+       /* copy request pages to ubuf */
+       if (req_op(req) == REQ_OP_WRITE && ubuf_dir == ITER_DEST)
+               return true;
+
+       return false;
+}
+
+static struct request *ublk_check_and_get_req(struct kiocb *iocb,
+               struct iov_iter *iter, size_t *off, int dir)
+{
+       struct ublk_device *ub = iocb->ki_filp->private_data;
+       struct ublk_queue *ubq;
+       struct request *req;
+       size_t buf_off;
+       u16 tag, q_id;
+
+       if (!ub)
+               return ERR_PTR(-EACCES);
+
+       if (!user_backed_iter(iter))
+               return ERR_PTR(-EACCES);
+
+       if (ub->dev_info.state == UBLK_S_DEV_DEAD)
+               return ERR_PTR(-EACCES);
+
+       tag = ublk_pos_to_tag(iocb->ki_pos);
+       q_id = ublk_pos_to_hwq(iocb->ki_pos);
+       buf_off = ublk_pos_to_buf_off(iocb->ki_pos);
+
+       if (q_id >= ub->dev_info.nr_hw_queues)
+               return ERR_PTR(-EINVAL);
+
+       ubq = ublk_get_queue(ub, q_id);
+       if (!ubq)
+               return ERR_PTR(-EINVAL);
+
+       if (tag >= ubq->q_depth)
+               return ERR_PTR(-EINVAL);
+
+       req = __ublk_check_and_get_req(ub, ubq, tag, buf_off);
+       if (!req)
+               return ERR_PTR(-EINVAL);
+
+       if (!req->mq_hctx || !req->mq_hctx->driver_data)
+               goto fail;
+
+       if (!ublk_check_ubuf_dir(req, dir))
+               goto fail;
+
+       *off = buf_off;
+       return req;
+fail:
+       ublk_put_req_ref(ubq, req);
+       return ERR_PTR(-EACCES);
+}
+
+static ssize_t ublk_ch_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+       struct ublk_queue *ubq;
+       struct request *req;
+       size_t buf_off;
+       size_t ret;
+
+       req = ublk_check_and_get_req(iocb, to, &buf_off, ITER_DEST);
+       if (IS_ERR(req))
+               return PTR_ERR(req);
+
+       ret = ublk_copy_user_pages(req, buf_off, to, ITER_DEST);
+       ubq = req->mq_hctx->driver_data;
+       ublk_put_req_ref(ubq, req);
+
+       return ret;
+}
+
+static ssize_t ublk_ch_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+       struct ublk_queue *ubq;
+       struct request *req;
+       size_t buf_off;
+       size_t ret;
+
+       req = ublk_check_and_get_req(iocb, from, &buf_off, ITER_SOURCE);
+       if (IS_ERR(req))
+               return PTR_ERR(req);
+
+       ret = ublk_copy_user_pages(req, buf_off, from, ITER_SOURCE);
+       ubq = req->mq_hctx->driver_data;
+       ublk_put_req_ref(ubq, req);
+
+       return ret;
+}
+
 static const struct file_operations ublk_ch_fops = {
        .owner = THIS_MODULE,
        .open = ublk_ch_open,
        .release = ublk_ch_release,
        .llseek = no_llseek,
+       .read_iter = ublk_ch_read_iter,
+       .write_iter = ublk_ch_write_iter,
        .uring_cmd = ublk_ch_uring_cmd,
        .mmap = ublk_ch_mmap,
 };
@@ -1547,7 +1762,7 @@ static int ublk_add_chdev(struct ublk_device *ub)
 
        dev->parent = ublk_misc.this_device;
        dev->devt = MKDEV(MAJOR(ublk_chr_devt), minor);
-       dev->class = ublk_chr_class;
+       dev->class = &ublk_chr_class;
        dev->release = ublk_cdev_rel;
        device_initialize(dev);
 
@@ -1818,10 +2033,12 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
         */
        ub->dev_info.flags &= UBLK_F_ALL;
 
-       if (!IS_BUILTIN(CONFIG_BLK_DEV_UBLK))
-               ub->dev_info.flags |= UBLK_F_URING_CMD_COMP_IN_TASK;
+       ub->dev_info.flags |= UBLK_F_CMD_IOCTL_ENCODE |
+               UBLK_F_URING_CMD_COMP_IN_TASK;
 
-       ub->dev_info.flags |= UBLK_F_CMD_IOCTL_ENCODE;
+       /* GET_DATA isn't needed any more with USER_COPY */
+       if (ub->dev_info.flags & UBLK_F_USER_COPY)
+               ub->dev_info.flags &= ~UBLK_F_NEED_GET_DATA;
 
        /* We are not ready to support zero copy */
        ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
@@ -2133,6 +2350,21 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
        return ret;
 }
 
+static int ublk_ctrl_get_features(struct io_uring_cmd *cmd)
+{
+       const struct ublksrv_ctrl_cmd *header = io_uring_sqe_cmd(cmd->sqe);
+       void __user *argp = (void __user *)(unsigned long)header->addr;
+       u64 features = UBLK_F_ALL & ~UBLK_F_SUPPORT_ZERO_COPY;
+
+       if (header->len != UBLK_FEATURES_LEN || !header->addr)
+               return -EINVAL;
+
+       if (copy_to_user(argp, &features, UBLK_FEATURES_LEN))
+               return -EFAULT;
+
+       return 0;
+}
+
 /*
  * All control commands are sent via /dev/ublk-control, so we have to check
  * the destination device's permission
@@ -2213,6 +2445,7 @@ static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub,
        case UBLK_CMD_GET_DEV_INFO2:
        case UBLK_CMD_GET_QUEUE_AFFINITY:
        case UBLK_CMD_GET_PARAMS:
+       case (_IOC_NR(UBLK_U_CMD_GET_FEATURES)):
                mask = MAY_READ;
                break;
        case UBLK_CMD_START_DEV:
@@ -2262,6 +2495,11 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
        if (ret)
                goto out;
 
+       if (cmd_op == UBLK_U_CMD_GET_FEATURES) {
+               ret = ublk_ctrl_get_features(cmd);
+               goto out;
+       }
+
        if (_IOC_NR(cmd_op) != UBLK_CMD_ADD_DEV) {
                ret = -ENODEV;
                ub = ublk_get_device_from_id(header->dev_id);
@@ -2337,6 +2575,9 @@ static int __init ublk_init(void)
 {
        int ret;
 
+       BUILD_BUG_ON((u64)UBLKSRV_IO_BUF_OFFSET +
+                       UBLKSRV_IO_BUF_TOTAL_SIZE < UBLKSRV_IO_BUF_OFFSET);
+
        init_waitqueue_head(&ublk_idr_wq);
 
        ret = misc_register(&ublk_misc);
@@ -2347,11 +2588,10 @@ static int __init ublk_init(void)
        if (ret)
                goto unregister_mis;
 
-       ublk_chr_class = class_create("ublk-char");
-       if (IS_ERR(ublk_chr_class)) {
-               ret = PTR_ERR(ublk_chr_class);
+       ret = class_register(&ublk_chr_class);
+       if (ret)
                goto free_chrdev_region;
-       }
+
        return 0;
 
 free_chrdev_region:
@@ -2369,7 +2609,7 @@ static void __exit ublk_exit(void)
        idr_for_each_entry(&ublk_index_idr, ub, id)
                ublk_remove(ub);
 
-       class_destroy(ublk_chr_class);
+       class_unregister(&ublk_chr_class);
        misc_deregister(&ublk_misc);
 
        idr_destroy(&ublk_index_idr);
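
The ublk changes above introduce UBLK_F_USER_COPY: the server copies request payloads by pread()/pwrite() on the ublk char device at an offset that encodes the hardware queue, tag and byte offset (decoded by ublk_pos_to_hwq()/ublk_pos_to_tag()/ublk_pos_to_buf_off() above), while a per-request kref keeps the request alive across those copies. A sketch of how a user-space server might build that offset; the helper name is hypothetical, and the UBLK_*_OFF and UBLKSRV_IO_BUF_OFFSET constants are assumed to come from the uapi header:

    /* Hypothetical encoder mirroring the ublk_pos_to_*() decoders above. */
    static inline __u64 ublk_user_copy_pos(__u16 q_id, __u16 tag, __u32 off)
    {
            return UBLKSRV_IO_BUF_OFFSET +
                    ((((__u64)q_id) << UBLK_QID_OFF) |
                     (((__u64)tag)  << UBLK_TAG_OFF) |
                     off);
    }

For a REQ_OP_WRITE request the server pread()s the payload out of the request pages, and for a REQ_OP_READ it pwrite()s the completed data back, matching ublk_check_ubuf_dir() above.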
index 4807af1..bb66178 100644 (file)
@@ -473,7 +473,7 @@ static void xenvbd_sysfs_delif(struct xenbus_device *dev)
 static void xen_vbd_free(struct xen_vbd *vbd)
 {
        if (vbd->bdev)
-               blkdev_put(vbd->bdev, vbd->readonly ? FMODE_READ : FMODE_WRITE);
+               blkdev_put(vbd->bdev, NULL);
        vbd->bdev = NULL;
 }
 
@@ -492,7 +492,7 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
        vbd->pdevice  = MKDEV(major, minor);
 
        bdev = blkdev_get_by_dev(vbd->pdevice, vbd->readonly ?
-                                FMODE_READ : FMODE_WRITE, NULL);
+                                BLK_OPEN_READ : BLK_OPEN_WRITE, NULL, NULL);
 
        if (IS_ERR(bdev)) {
                pr_warn("xen_vbd_create: device %08x could not be opened\n",
index c1890c8..434fab3 100644 (file)
@@ -509,7 +509,7 @@ static int blkif_getgeo(struct block_device *bd, struct hd_geometry *hg)
        return 0;
 }
 
-static int blkif_ioctl(struct block_device *bdev, fmode_t mode,
+static int blkif_ioctl(struct block_device *bdev, blk_mode_t mode,
                       unsigned command, unsigned long argument)
 {
        struct blkfront_info *info = bdev->bd_disk->private_data;
index c1e85f3..1149316 100644 (file)
@@ -140,16 +140,14 @@ static void get_chipram(void)
        return;
 }
 
-static int z2_open(struct block_device *bdev, fmode_t mode)
+static int z2_open(struct gendisk *disk, blk_mode_t mode)
 {
-       int device;
+       int device = disk->first_minor;
        int max_z2_map = (Z2RAM_SIZE / Z2RAM_CHUNKSIZE) * sizeof(z2ram_map[0]);
        int max_chip_map = (amiga_chip_size / Z2RAM_CHUNKSIZE) *
            sizeof(z2ram_map[0]);
        int rc = -ENOMEM;
 
-       device = MINOR(bdev->bd_dev);
-
        mutex_lock(&z2ram_mutex);
        if (current_device != -1 && current_device != device) {
                rc = -EBUSY;
@@ -290,7 +288,7 @@ err_out:
        return rc;
 }
 
-static void z2_release(struct gendisk *disk, fmode_t mode)
+static void z2_release(struct gendisk *disk)
 {
        mutex_lock(&z2ram_mutex);
        if (current_device == -1) {
index f6d90f1..5676e6d 100644 (file)
@@ -420,7 +420,7 @@ static void reset_bdev(struct zram *zram)
                return;
 
        bdev = zram->bdev;
-       blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
+       blkdev_put(bdev, zram);
        /* hope filp_close flush all of IO */
        filp_close(zram->backing_dev, NULL);
        zram->backing_dev = NULL;
@@ -507,8 +507,8 @@ static ssize_t backing_dev_store(struct device *dev,
                goto out;
        }
 
-       bdev = blkdev_get_by_dev(inode->i_rdev,
-                       FMODE_READ | FMODE_WRITE | FMODE_EXCL, zram);
+       bdev = blkdev_get_by_dev(inode->i_rdev, BLK_OPEN_READ | BLK_OPEN_WRITE,
+                                zram, NULL);
        if (IS_ERR(bdev)) {
                err = PTR_ERR(bdev);
                bdev = NULL;
@@ -539,7 +539,7 @@ out:
        kvfree(bitmap);
 
        if (bdev)
-               blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+               blkdev_put(bdev, zram);
 
        if (backing_dev)
                filp_close(backing_dev, NULL);
@@ -700,7 +700,7 @@ static ssize_t writeback_store(struct device *dev,
                bio_init(&bio, zram->bdev, &bio_vec, 1,
                         REQ_OP_WRITE | REQ_SYNC);
                bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
-               bio_add_page(&bio, page, PAGE_SIZE, 0);
+               __bio_add_page(&bio, page, PAGE_SIZE, 0);
 
                /*
                 * XXX: A single page IO would be inefficient for write
@@ -1753,7 +1753,7 @@ static ssize_t recompress_store(struct device *dev,
                }
        }
 
-       if (threshold >= PAGE_SIZE)
+       if (threshold >= huge_class_size)
                return -EINVAL;
 
        down_read(&zram->init_lock);
@@ -2097,19 +2097,16 @@ static ssize_t reset_store(struct device *dev,
        return len;
 }
 
-static int zram_open(struct block_device *bdev, fmode_t mode)
+static int zram_open(struct gendisk *disk, blk_mode_t mode)
 {
-       int ret = 0;
-       struct zram *zram;
+       struct zram *zram = disk->private_data;
 
-       WARN_ON(!mutex_is_locked(&bdev->bd_disk->open_mutex));
+       WARN_ON(!mutex_is_locked(&disk->open_mutex));
 
-       zram = bdev->bd_disk->private_data;
        /* zram was claimed to reset so open request fails */
        if (zram->claim)
-               ret = -EBUSY;
-
-       return ret;
+               return -EBUSY;
+       return 0;
 }
 
 static const struct block_device_operations zram_devops = {
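
The swim, swim3, z2ram, zram and ublk hunks above all adopt the new block_device_operations prototypes: ->open() receives the gendisk plus a blk_mode_t, and ->release() no longer takes a mode. A minimal sketch with hypothetical names:

    static int foo_open(struct gendisk *disk, blk_mode_t mode)
    {
            struct foo_dev *foo = disk->private_data;

            if ((mode & BLK_OPEN_WRITE) && foo->read_only)
                    return -EROFS;
            return 0;
    }

    static void foo_release(struct gendisk *disk)
    {
            /* nothing to tear down in this sketch */
    }

    static const struct block_device_operations foo_fops = {
            .owner          = THIS_MODULE,
            .open           = foo_open,
            .release        = foo_release,
    };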
index 416f723..cc28398 100644 (file)
 #include <linux/errno.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/nospec.h>
 #include <linux/slab.h> 
 #include <linux/cdrom.h>
 #include <linux/sysctl.h>
@@ -978,15 +979,6 @@ static void cdrom_dvd_rw_close_write(struct cdrom_device_info *cdi)
        cdi->media_written = 0;
 }
 
-static int cdrom_close_write(struct cdrom_device_info *cdi)
-{
-#if 0
-       return cdrom_flush_cache(cdi);
-#else
-       return 0;
-#endif
-}
-
 /* badly broken, I know. Is due for a fixup anytime. */
 static void cdrom_count_tracks(struct cdrom_device_info *cdi, tracktype *tracks)
 {
@@ -1155,8 +1147,7 @@ clean_up_and_return:
  * is in their own interest: device control becomes a lot easier
  * this way.
  */
-int cdrom_open(struct cdrom_device_info *cdi, struct block_device *bdev,
-              fmode_t mode)
+int cdrom_open(struct cdrom_device_info *cdi, blk_mode_t mode)
 {
        int ret;
 
@@ -1165,7 +1156,7 @@ int cdrom_open(struct cdrom_device_info *cdi, struct block_device *bdev,
        /* if this was a O_NONBLOCK open and we should honor the flags,
         * do a quick open without drive/disc integrity checks. */
        cdi->use_count++;
-       if ((mode & FMODE_NDELAY) && (cdi->options & CDO_USE_FFLAGS)) {
+       if ((mode & BLK_OPEN_NDELAY) && (cdi->options & CDO_USE_FFLAGS)) {
                ret = cdi->ops->open(cdi, 1);
        } else {
                ret = open_for_data(cdi);
@@ -1173,7 +1164,7 @@ int cdrom_open(struct cdrom_device_info *cdi, struct block_device *bdev,
                        goto err;
                if (CDROM_CAN(CDC_GENERIC_PACKET))
                        cdrom_mmc3_profile(cdi);
-               if (mode & FMODE_WRITE) {
+               if (mode & BLK_OPEN_WRITE) {
                        ret = -EROFS;
                        if (cdrom_open_write(cdi))
                                goto err_release;
@@ -1182,6 +1173,7 @@ int cdrom_open(struct cdrom_device_info *cdi, struct block_device *bdev,
                        ret = 0;
                        cdi->media_written = 0;
                }
+               cdi->opened_for_data = true;
        }
 
        if (ret)
@@ -1259,10 +1251,9 @@ static int check_for_audio_disc(struct cdrom_device_info *cdi,
        return 0;
 }
 
-void cdrom_release(struct cdrom_device_info *cdi, fmode_t mode)
+void cdrom_release(struct cdrom_device_info *cdi)
 {
        const struct cdrom_device_ops *cdo = cdi->ops;
-       int opened_for_data;
 
        cd_dbg(CD_CLOSE, "entering cdrom_release\n");
 
@@ -1280,20 +1271,12 @@ void cdrom_release(struct cdrom_device_info *cdi, fmode_t mode)
                }
        }
 
-       opened_for_data = !(cdi->options & CDO_USE_FFLAGS) ||
-               !(mode & FMODE_NDELAY);
-
-       /*
-        * flush cache on last write release
-        */
-       if (CDROM_CAN(CDC_RAM) && !cdi->use_count && cdi->for_data)
-               cdrom_close_write(cdi);
-
        cdo->release(cdi);
-       if (cdi->use_count == 0) {      /* last process that closes dev*/
-               if (opened_for_data &&
-                   cdi->options & CDO_AUTO_EJECT && CDROM_CAN(CDC_OPEN_TRAY))
+
+       if (cdi->use_count == 0 && cdi->opened_for_data) {
+               if (cdi->options & CDO_AUTO_EJECT && CDROM_CAN(CDC_OPEN_TRAY))
                        cdo->tray_move(cdi, 1);
+               cdi->opened_for_data = false;
        }
 }
 EXPORT_SYMBOL(cdrom_release);
@@ -2329,6 +2312,9 @@ static int cdrom_ioctl_media_changed(struct cdrom_device_info *cdi,
        if (arg >= cdi->capacity)
                return -EINVAL;
 
+       /* Prevent arg from speculatively bypassing the length check */
+       barrier_nospec();
+
        info = kmalloc(sizeof(*info), GFP_KERNEL);
        if (!info)
                return -ENOMEM;
@@ -3337,7 +3323,7 @@ static int mmc_ioctl(struct cdrom_device_info *cdi, unsigned int cmd,
  * ATAPI / SCSI specific code now mainly resides in mmc_ioctl().
  */
 int cdrom_ioctl(struct cdrom_device_info *cdi, struct block_device *bdev,
-               fmode_t mode, unsigned int cmd, unsigned long arg)
+               unsigned int cmd, unsigned long arg)
 {
        void __user *argp = (void __user *)arg;
        int ret;
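
The barrier_nospec() added to cdrom_ioctl_media_changed() above is the usual Spectre-v1 hardening: once a user-controlled index passes its bounds check, speculation is fenced (or the index clamped) before the value is used. The equivalent clamp with array_index_nospec(), on a hypothetical table lookup:

    #include <linux/nospec.h>

    static long foo_lookup(unsigned long idx, unsigned long nr, const long *table)
    {
            if (idx >= nr)
                    return -EINVAL;
            /* prevent idx from speculatively indexing past nr */
            idx = array_index_nospec(idx, nr);
            return table[idx];
    }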
index ceded57..3a46e27 100644 (file)
@@ -474,19 +474,19 @@ static const struct cdrom_device_ops gdrom_ops = {
                                  CDC_RESET | CDC_DRIVE_STATUS | CDC_CD_R,
 };
 
-static int gdrom_bdops_open(struct block_device *bdev, fmode_t mode)
+static int gdrom_bdops_open(struct gendisk *disk, blk_mode_t mode)
 {
        int ret;
 
-       bdev_check_media_change(bdev);
+       disk_check_media_change(disk);
 
        mutex_lock(&gdrom_mutex);
-       ret = cdrom_open(gd.cd_info, bdev, mode);
+       ret = cdrom_open(gd.cd_info, mode);
        mutex_unlock(&gdrom_mutex);
        return ret;
 }
 
-static void gdrom_bdops_release(struct gendisk *disk, fmode_t mode)
+static void gdrom_bdops_release(struct gendisk *disk)
 {
        mutex_lock(&gdrom_mutex);
-       cdrom_release(gd.cd_info, mode);
+       cdrom_release(gd.cd_info);
@@ -499,13 +499,13 @@ static unsigned int gdrom_bdops_check_events(struct gendisk *disk,
        return cdrom_check_events(gd.cd_info, clearing);
 }
 
-static int gdrom_bdops_ioctl(struct block_device *bdev, fmode_t mode,
+static int gdrom_bdops_ioctl(struct block_device *bdev, blk_mode_t mode,
        unsigned cmd, unsigned long arg)
 {
        int ret;
 
        mutex_lock(&gdrom_mutex);
-       ret = cdrom_ioctl(gd.cd_info, bdev, mode, cmd, arg);
+       ret = cdrom_ioctl(gd.cd_info, bdev, cmd, arg);
        mutex_unlock(&gdrom_mutex);
 
        return ret;
index 253f2dd..3cb3776 100644 (file)
@@ -1546,7 +1546,7 @@ const struct file_operations random_fops = {
        .compat_ioctl = compat_ptr_ioctl,
        .fasync = random_fasync,
        .llseek = noop_llseek,
-       .splice_read = generic_file_splice_read,
+       .splice_read = copy_splice_read,
        .splice_write = iter_file_splice_write,
 };
 
@@ -1557,7 +1557,7 @@ const struct file_operations urandom_fops = {
        .compat_ioctl = compat_ptr_ioctl,
        .fasync = random_fasync,
        .llseek = noop_llseek,
-       .splice_read = generic_file_splice_read,
+       .splice_read = copy_splice_read,
        .splice_write = iter_file_splice_write,
 };
 
index 016814e..c0c8e52 100644 (file)
@@ -82,7 +82,7 @@ config COMMON_CLK_MAX9485
 
 config COMMON_CLK_RK808
        tristate "Clock driver for RK805/RK808/RK809/RK817/RK818"
-       depends on MFD_RK808
+       depends on MFD_RK8XX
        help
          This driver supports RK805, RK809 and RK817, RK808 and RK818 crystal oscillator clock.
          These multi-function devices have two fixed-rate oscillators, clocked at 32KHz each.
index 32f833d..f7412b1 100644 (file)
 #include <linux/slab.h>
 #include <linux/platform_device.h>
 #include <linux/mfd/rk808.h>
-#include <linux/i2c.h>
 
 struct rk808_clkout {
-       struct rk808 *rk808;
+       struct regmap           *regmap;
        struct clk_hw           clkout1_hw;
        struct clk_hw           clkout2_hw;
 };
@@ -31,9 +30,8 @@ static int rk808_clkout2_enable(struct clk_hw *hw, bool enable)
        struct rk808_clkout *rk808_clkout = container_of(hw,
                                                         struct rk808_clkout,
                                                         clkout2_hw);
-       struct rk808 *rk808 = rk808_clkout->rk808;
 
-       return regmap_update_bits(rk808->regmap, RK808_CLK32OUT_REG,
+       return regmap_update_bits(rk808_clkout->regmap, RK808_CLK32OUT_REG,
                                  CLK32KOUT2_EN, enable ? CLK32KOUT2_EN : 0);
 }
 
@@ -52,10 +50,9 @@ static int rk808_clkout2_is_prepared(struct clk_hw *hw)
        struct rk808_clkout *rk808_clkout = container_of(hw,
                                                         struct rk808_clkout,
                                                         clkout2_hw);
-       struct rk808 *rk808 = rk808_clkout->rk808;
        uint32_t val;
 
-       int ret = regmap_read(rk808->regmap, RK808_CLK32OUT_REG, &val);
+       int ret = regmap_read(rk808_clkout->regmap, RK808_CLK32OUT_REG, &val);
 
        if (ret < 0)
                return ret;
@@ -93,9 +90,8 @@ static int rk817_clkout2_enable(struct clk_hw *hw, bool enable)
        struct rk808_clkout *rk808_clkout = container_of(hw,
                                                         struct rk808_clkout,
                                                         clkout2_hw);
-       struct rk808 *rk808 = rk808_clkout->rk808;
 
-       return regmap_update_bits(rk808->regmap, RK817_SYS_CFG(1),
+       return regmap_update_bits(rk808_clkout->regmap, RK817_SYS_CFG(1),
                                  RK817_CLK32KOUT2_EN,
                                  enable ? RK817_CLK32KOUT2_EN : 0);
 }
@@ -115,10 +111,9 @@ static int rk817_clkout2_is_prepared(struct clk_hw *hw)
        struct rk808_clkout *rk808_clkout = container_of(hw,
                                                         struct rk808_clkout,
                                                         clkout2_hw);
-       struct rk808 *rk808 = rk808_clkout->rk808;
        unsigned int val;
 
-       int ret = regmap_read(rk808->regmap, RK817_SYS_CFG(1), &val);
+       int ret = regmap_read(rk808_clkout->regmap, RK817_SYS_CFG(1), &val);
 
        if (ret < 0)
                return 0;
@@ -153,18 +148,21 @@ static const struct clk_ops *rkpmic_get_ops(long variant)
 static int rk808_clkout_probe(struct platform_device *pdev)
 {
        struct rk808 *rk808 = dev_get_drvdata(pdev->dev.parent);
-       struct i2c_client *client = rk808->i2c;
-       struct device_node *node = client->dev.of_node;
+       struct device *dev = &pdev->dev;
        struct clk_init_data init = {};
        struct rk808_clkout *rk808_clkout;
        int ret;
 
-       rk808_clkout = devm_kzalloc(&client->dev,
+       dev->of_node = pdev->dev.parent->of_node;
+
+       rk808_clkout = devm_kzalloc(dev,
                                    sizeof(*rk808_clkout), GFP_KERNEL);
        if (!rk808_clkout)
                return -ENOMEM;
 
-       rk808_clkout->rk808 = rk808;
+       rk808_clkout->regmap = dev_get_regmap(pdev->dev.parent, NULL);
+       if (!rk808_clkout->regmap)
+               return -ENODEV;
 
        init.parent_names = NULL;
        init.num_parents = 0;
@@ -173,10 +171,10 @@ static int rk808_clkout_probe(struct platform_device *pdev)
        rk808_clkout->clkout1_hw.init = &init;
 
        /* optional override of the clockname */
-       of_property_read_string_index(node, "clock-output-names",
+       of_property_read_string_index(dev->of_node, "clock-output-names",
                                      0, &init.name);
 
-       ret = devm_clk_hw_register(&client->dev, &rk808_clkout->clkout1_hw);
+       ret = devm_clk_hw_register(dev, &rk808_clkout->clkout1_hw);
        if (ret)
                return ret;
 
@@ -185,10 +183,10 @@ static int rk808_clkout_probe(struct platform_device *pdev)
        rk808_clkout->clkout2_hw.init = &init;
 
        /* optional override of the clockname */
-       of_property_read_string_index(node, "clock-output-names",
+       of_property_read_string_index(dev->of_node, "clock-output-names",
                                      1, &init.name);
 
-       ret = devm_clk_hw_register(&client->dev, &rk808_clkout->clkout2_hw);
+       ret = devm_clk_hw_register(dev, &rk808_clkout->clkout2_hw);
        if (ret)
                return ret;
 
index 22fc749..f6ea7e5 100644 (file)
@@ -10,7 +10,6 @@
 #include <linux/of.h>
 #include <linux/of_address.h>
 #include <dt-bindings/clock/imx1-clock.h>
-#include <soc/imx/timer.h>
 #include <asm/irq.h>
 
 #include "clk.h"
index 5d17712..99618de 100644 (file)
@@ -8,7 +8,6 @@
 #include <linux/of_address.h>
 #include <dt-bindings/clock/imx27-clock.h>
 #include <soc/imx/revision.h>
-#include <soc/imx/timer.h>
 #include <asm/irq.h>
 
 #include "clk.h"
index c44e18c..4c8d9ff 100644 (file)
@@ -11,7 +11,6 @@
 #include <linux/of.h>
 #include <linux/of_address.h>
 #include <soc/imx/revision.h>
-#include <soc/imx/timer.h>
 #include <asm/irq.h>
 
 #include "clk.h"
index 7dcbaea..3b6fdb4 100644 (file)
@@ -10,7 +10,6 @@
 #include <linux/of.h>
 #include <linux/err.h>
 #include <soc/imx/revision.h>
-#include <soc/imx/timer.h>
 #include <asm/irq.h>
 
 #include "clk.h"
index 526382d..c4d671a 100644 (file)
@@ -612,6 +612,15 @@ config TIMER_IMX_SYS_CTR
          Enable this option to use i.MX system counter timer as a
          clockevent.
 
+config CLKSRC_LOONGSON1_PWM
+       bool "Clocksource using Loongson1 PWM"
+       depends on MACH_LOONGSON32 || COMPILE_TEST
+       select MIPS_EXTERNAL_TIMER
+       select TIMER_OF
+       help
+         Enable this option to use the Loongson1 PWM timer as a clocksource
+         instead of the performance counter.
+
 config CLKSRC_ST_LPC
        bool "Low power clocksource found in the LPC" if COMPILE_TEST
        select TIMER_OF if OF
index f12d398..5d93c9e 100644 (file)
@@ -89,3 +89,4 @@ obj-$(CONFIG_MICROCHIP_PIT64B)                += timer-microchip-pit64b.o
 obj-$(CONFIG_MSC313E_TIMER)            += timer-msc313e.o
 obj-$(CONFIG_GOLDFISH_TIMER)           += timer-goldfish.o
 obj-$(CONFIG_GXP_TIMER)                        += timer-gxp.o
+obj-$(CONFIG_CLKSRC_LOONGSON1_PWM)     += timer-loongson1-pwm.o
index e09d442..e733a2a 100644 (file)
@@ -191,22 +191,40 @@ u32 arch_timer_reg_read(int access, enum arch_timer_reg reg,
        return val;
 }
 
-static notrace u64 arch_counter_get_cntpct_stable(void)
+static noinstr u64 raw_counter_get_cntpct_stable(void)
 {
        return __arch_counter_get_cntpct_stable();
 }
 
-static notrace u64 arch_counter_get_cntpct(void)
+static notrace u64 arch_counter_get_cntpct_stable(void)
+{
+       u64 val;
+       preempt_disable_notrace();
+       val = __arch_counter_get_cntpct_stable();
+       preempt_enable_notrace();
+       return val;
+}
+
+static noinstr u64 arch_counter_get_cntpct(void)
 {
        return __arch_counter_get_cntpct();
 }
 
-static notrace u64 arch_counter_get_cntvct_stable(void)
+static noinstr u64 raw_counter_get_cntvct_stable(void)
 {
        return __arch_counter_get_cntvct_stable();
 }
 
-static notrace u64 arch_counter_get_cntvct(void)
+static notrace u64 arch_counter_get_cntvct_stable(void)
+{
+       u64 val;
+       preempt_disable_notrace();
+       val = __arch_counter_get_cntvct_stable();
+       preempt_enable_notrace();
+       return val;
+}
+
+static noinstr u64 arch_counter_get_cntvct(void)
 {
        return __arch_counter_get_cntvct();
 }
@@ -753,14 +771,14 @@ static int arch_timer_set_next_event_phys(unsigned long evt,
        return 0;
 }
 
-static u64 arch_counter_get_cnt_mem(struct arch_timer *t, int offset_lo)
+static noinstr u64 arch_counter_get_cnt_mem(struct arch_timer *t, int offset_lo)
 {
        u32 cnt_lo, cnt_hi, tmp_hi;
 
        do {
-               cnt_hi = readl_relaxed(t->base + offset_lo + 4);
-               cnt_lo = readl_relaxed(t->base + offset_lo);
-               tmp_hi = readl_relaxed(t->base + offset_lo + 4);
+               cnt_hi = __le32_to_cpu((__le32 __force)__raw_readl(t->base + offset_lo + 4));
+               cnt_lo = __le32_to_cpu((__le32 __force)__raw_readl(t->base + offset_lo));
+               tmp_hi = __le32_to_cpu((__le32 __force)__raw_readl(t->base + offset_lo + 4));
        } while (cnt_hi != tmp_hi);
 
        return ((u64) cnt_hi << 32) | cnt_lo;
@@ -1060,7 +1078,7 @@ bool arch_timer_evtstrm_available(void)
        return cpumask_test_cpu(raw_smp_processor_id(), &evtstrm_available);
 }
 
-static u64 arch_counter_get_cntvct_mem(void)
+static noinstr u64 arch_counter_get_cntvct_mem(void)
 {
        return arch_counter_get_cnt_mem(arch_timer_mem, CNTVCT_LO);
 }
@@ -1074,6 +1092,7 @@ struct arch_timer_kvm_info *arch_timer_get_kvm_info(void)
 
 static void __init arch_counter_register(unsigned type)
 {
+       u64 (*scr)(void);
        u64 start_count;
        int width;
 
@@ -1083,21 +1102,28 @@ static void __init arch_counter_register(unsigned type)
 
                if ((IS_ENABLED(CONFIG_ARM64) && !is_hyp_mode_available()) ||
                    arch_timer_uses_ppi == ARCH_TIMER_VIRT_PPI) {
-                       if (arch_timer_counter_has_wa())
+                       if (arch_timer_counter_has_wa()) {
                                rd = arch_counter_get_cntvct_stable;
-                       else
+                               scr = raw_counter_get_cntvct_stable;
+                       } else {
                                rd = arch_counter_get_cntvct;
+                               scr = arch_counter_get_cntvct;
+                       }
                } else {
-                       if (arch_timer_counter_has_wa())
+                       if (arch_timer_counter_has_wa()) {
                                rd = arch_counter_get_cntpct_stable;
-                       else
+                               scr = raw_counter_get_cntpct_stable;
+                       } else {
                                rd = arch_counter_get_cntpct;
+                               scr = arch_counter_get_cntpct;
+                       }
                }
 
                arch_timer_read_counter = rd;
                clocksource_counter.vdso_clock_mode = vdso_default;
        } else {
                arch_timer_read_counter = arch_counter_get_cntvct_mem;
+               scr = arch_counter_get_cntvct_mem;
        }
 
        width = arch_counter_get_width();
@@ -1113,7 +1139,7 @@ static void __init arch_counter_register(unsigned type)
        timecounter_init(&arch_timer_kvm_info.timecounter,
                         &cyclecounter, start_count);
 
-       sched_clock_register(arch_timer_read_counter, width, arch_timer_rate);
+       sched_clock_register(scr, width, arch_timer_rate);
 }
 
 static void arch_timer_stop(struct clock_event_device *clk)
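A note on arch_counter_get_cnt_mem() above: the hi/lo/hi sequence is the standard way to sample a 64-bit counter exposed as two 32-bit registers. A generic standalone sketch of the pattern, with read_hi()/read_lo() as placeholder accessors rather than real kernel helpers:

	#include <stdint.h>

	/* Placeholder MMIO accessors, for illustration only. */
	static uint32_t read_hi(void);
	static uint32_t read_lo(void);

	static uint64_t read_split_counter64(void)
	{
		uint32_t hi, lo, tmp;

		do {
			hi  = read_hi();	/* sample the high word */
			lo  = read_lo();	/* sample the low word */
			tmp = read_hi();	/* re-sample the high word */
		} while (hi != tmp);		/* low word wrapped mid-read; retry */

		return ((uint64_t)hi << 32) | lo;
	}

Re-reading the high word guarantees that the low word did not roll over between the two samples.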
index bcd9042..e56307a 100644 (file)
@@ -365,6 +365,20 @@ void hv_stimer_global_cleanup(void)
 }
 EXPORT_SYMBOL_GPL(hv_stimer_global_cleanup);
 
+static __always_inline u64 read_hv_clock_msr(void)
+{
+       /*
+        * Read the partition counter to get the current tick count. This count
+        * is set to 0 when the partition is created and is incremented in 100
+        * nanosecond units.
+        *
+        * Use hv_raw_get_register() because this function is used from
+        * noinstr. Notably, while HV_REGISTER_TIME_REF_COUNT is a synthetic
+        * register it doesn't need the GHCB path.
+        */
+       return hv_raw_get_register(HV_REGISTER_TIME_REF_COUNT);
+}
+
 /*
  * Code and definitions for the Hyper-V clocksources.  Two
  * clocksources are defined: one that reads the Hyper-V defined MSR, and
@@ -393,14 +407,20 @@ struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
 }
 EXPORT_SYMBOL_GPL(hv_get_tsc_page);
 
-static u64 notrace read_hv_clock_tsc(void)
+static __always_inline u64 read_hv_clock_tsc(void)
 {
-       u64 current_tick = hv_read_tsc_page(hv_get_tsc_page());
+       u64 cur_tsc, time;
 
-       if (current_tick == U64_MAX)
-               current_tick = hv_get_register(HV_REGISTER_TIME_REF_COUNT);
+       /*
+        * The Hyper-V Top-Level Function Spec (TLFS), section Timers,
+        * subsection Reference Counter, guarantees that the TSC and MSR
+        * times are in sync and monotonic. Therefore we can fall back
+        * to the MSR in case the TSC page indicates unavailability.
+        */
+       if (!hv_read_tsc_page_tsc(tsc_page, &cur_tsc, &time))
+               time = read_hv_clock_msr();
 
-       return current_tick;
+       return time;
 }
 
 static u64 notrace read_hv_clock_tsc_cs(struct clocksource *arg)
@@ -408,7 +428,7 @@ static u64 notrace read_hv_clock_tsc_cs(struct clocksource *arg)
        return read_hv_clock_tsc();
 }
 
-static u64 notrace read_hv_sched_clock_tsc(void)
+static u64 noinstr read_hv_sched_clock_tsc(void)
 {
        return (read_hv_clock_tsc() - hv_sched_clock_offset) *
                (NSEC_PER_SEC / HV_CLOCK_HZ);
@@ -460,30 +480,14 @@ static struct clocksource hyperv_cs_tsc = {
 #endif
 };
 
-static u64 notrace read_hv_clock_msr(void)
-{
-       /*
-        * Read the partition counter to get the current tick count. This count
-        * is set to 0 when the partition is created and is incremented in
-        * 100 nanosecond units.
-        */
-       return hv_get_register(HV_REGISTER_TIME_REF_COUNT);
-}
-
 static u64 notrace read_hv_clock_msr_cs(struct clocksource *arg)
 {
        return read_hv_clock_msr();
 }
 
-static u64 notrace read_hv_sched_clock_msr(void)
-{
-       return (read_hv_clock_msr() - hv_sched_clock_offset) *
-               (NSEC_PER_SEC / HV_CLOCK_HZ);
-}
-
 static struct clocksource hyperv_cs_msr = {
        .name   = "hyperv_clocksource_msr",
-       .rating = 500,
+       .rating = 495,
        .read   = read_hv_clock_msr_cs,
        .mask   = CLOCKSOURCE_MASK(64),
        .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
@@ -513,7 +517,7 @@ static __always_inline void hv_setup_sched_clock(void *sched_clock)
 static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
 #endif /* CONFIG_GENERIC_SCHED_CLOCK */
 
-static bool __init hv_init_tsc_clocksource(void)
+static void __init hv_init_tsc_clocksource(void)
 {
        union hv_reference_tsc_msr tsc_msr;
 
@@ -524,17 +528,14 @@ static bool __init hv_init_tsc_clocksource(void)
         * Hyper-V Reference TSC rating, causing the generic TSC to be used.
         * TSC_INVARIANT is not offered on ARM64, so the Hyper-V Reference
         * TSC will be preferred over the virtualized ARM64 arch counter.
-        * While the Hyper-V MSR clocksource won't be used since the
-        * Reference TSC clocksource is present, change its rating as
-        * well for consistency.
         */
        if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT) {
                hyperv_cs_tsc.rating = 250;
-               hyperv_cs_msr.rating = 250;
+               hyperv_cs_msr.rating = 245;
        }
 
        if (!(ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE))
-               return false;
+               return;
 
        hv_read_reference_counter = read_hv_clock_tsc;
 
@@ -565,33 +566,34 @@ static bool __init hv_init_tsc_clocksource(void)
 
        clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);
 
-       hv_sched_clock_offset = hv_read_reference_counter();
-       hv_setup_sched_clock(read_hv_sched_clock_tsc);
-
-       return true;
+       /*
+        * If TSC is invariant, then let it stay as the sched clock since it
+        * will be faster than reading the TSC page. But if not invariant, use
+        * the TSC page so that live migrations across hosts with different
+        * frequencies are handled correctly.
+        */
+       if (!(ms_hyperv.features & HV_ACCESS_TSC_INVARIANT)) {
+               hv_sched_clock_offset = hv_read_reference_counter();
+               hv_setup_sched_clock(read_hv_sched_clock_tsc);
+       }
 }
 
 void __init hv_init_clocksource(void)
 {
        /*
-        * Try to set up the TSC page clocksource. If it succeeds, we're
-        * done. Otherwise, set up the MSR clocksource.  At least one of
-        * these will always be available except on very old versions of
-        * Hyper-V on x86.  In that case we won't have a Hyper-V
+        * Try to set up the TSC page clocksource, then the MSR clocksource.
+        * At least one of these will always be available except on very old
+        * versions of Hyper-V on x86.  In that case we won't have a Hyper-V
         * clocksource, but Linux will still run with a clocksource based
         * on the emulated PIT or LAPIC timer.
+        *
+        * Never use the MSR clocksource as sched clock.  It's too slow.
+        * Better to use the native sched clock as the fallback.
         */
-       if (hv_init_tsc_clocksource())
-               return;
-
-       if (!(ms_hyperv.features & HV_MSR_TIME_REF_COUNT_AVAILABLE))
-               return;
-
-       hv_read_reference_counter = read_hv_clock_msr;
-       clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
+       hv_init_tsc_clocksource();
 
-       hv_sched_clock_offset = hv_read_reference_counter();
-       hv_setup_sched_clock(read_hv_sched_clock_msr);
+       if (ms_hyperv.features & HV_MSR_TIME_REF_COUNT_AVAILABLE)
+               clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
 }
 
 void __init hv_remap_tsc_clocksource(void)
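On the scaling in read_hv_sched_clock_tsc() above: the reference counter ticks in 100 ns units (see the read_hv_clock_msr() comment), so NSEC_PER_SEC / HV_CLOCK_HZ works out to 100. A minimal standalone sketch of the conversion; hv_ticks_to_ns() is an illustrative name, not a kernel symbol:

	/* Convert 100 ns reference-counter ticks to nanoseconds. */
	static inline unsigned long long hv_ticks_to_ns(unsigned long long ticks,
							unsigned long long offset)
	{
		return (ticks - offset) * 100;	/* NSEC_PER_SEC / HV_CLOCK_HZ == 100 */
	}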
index 089ce64..154ee5f 100644 (file)
@@ -369,7 +369,7 @@ static int __init ingenic_tcu_probe(struct platform_device *pdev)
        return 0;
 }
 
-static int __maybe_unused ingenic_tcu_suspend(struct device *dev)
+static int ingenic_tcu_suspend(struct device *dev)
 {
        struct ingenic_tcu *tcu = dev_get_drvdata(dev);
        unsigned int cpu;
@@ -382,7 +382,7 @@ static int __maybe_unused ingenic_tcu_suspend(struct device *dev)
        return 0;
 }
 
-static int __maybe_unused ingenic_tcu_resume(struct device *dev)
+static int ingenic_tcu_resume(struct device *dev)
 {
        struct ingenic_tcu *tcu = dev_get_drvdata(dev);
        unsigned int cpu;
@@ -406,7 +406,7 @@ err_timer_clk_disable:
        return ret;
 }
 
-static const struct dev_pm_ops __maybe_unused ingenic_tcu_pm_ops = {
+static const struct dev_pm_ops ingenic_tcu_pm_ops = {
        /* _noirq: We want the TCU clocks to be gated last / ungated first */
        .suspend_noirq = ingenic_tcu_suspend,
        .resume_noirq  = ingenic_tcu_resume,
@@ -415,9 +415,7 @@ static const struct dev_pm_ops __maybe_unused ingenic_tcu_pm_ops = {
 static struct platform_driver ingenic_tcu_driver = {
        .driver = {
                .name   = "ingenic-tcu-timer",
-#ifdef CONFIG_PM_SLEEP
-               .pm     = &ingenic_tcu_pm_ops,
-#endif
+               .pm     = pm_sleep_ptr(&ingenic_tcu_pm_ops),
                .of_match_table = ingenic_tcu_of_match,
        },
 };
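For context on the pm_sleep_ptr() conversion above: the helper evaluates to the ops pointer when CONFIG_PM_SLEEP is enabled and to NULL otherwise, and because the ops structure is still referenced in the expression, both the #ifdef and the __maybe_unused annotations become unnecessary. A rough sketch of the idea (not the exact include/linux/pm.h definition):

	/* Illustrative only; drivers should use the kernel's pm_sleep_ptr(). */
	#define my_pm_sleep_ptr(ptr) \
		(IS_ENABLED(CONFIG_PM_SLEEP) ? (ptr) : NULL)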
index 4efd0cf..0d52e28 100644 (file)
@@ -486,10 +486,10 @@ static int __init ttc_timer_probe(struct platform_device *pdev)
         * and use it. Note that the event timer uses the interrupt and it's the
         * 2nd TTC hence the irq_of_parse_and_map(,1)
         */
-       timer_baseaddr = of_iomap(timer, 0);
-       if (!timer_baseaddr) {
+       timer_baseaddr = devm_of_iomap(&pdev->dev, timer, 0, NULL);
+       if (IS_ERR(timer_baseaddr)) {
                pr_err("ERROR: invalid timer base address\n");
-               return -ENXIO;
+               return PTR_ERR(timer_baseaddr);
        }
 
        irq = irq_of_parse_and_map(timer, 1);
@@ -513,20 +513,27 @@ static int __init ttc_timer_probe(struct platform_device *pdev)
        clk_ce = of_clk_get(timer, clksel);
        if (IS_ERR(clk_ce)) {
                pr_err("ERROR: timer input clock not found\n");
-               return PTR_ERR(clk_ce);
+               ret = PTR_ERR(clk_ce);
+               goto put_clk_cs;
        }
 
        ret = ttc_setup_clocksource(clk_cs, timer_baseaddr, timer_width);
        if (ret)
-               return ret;
+               goto put_clk_ce;
 
        ret = ttc_setup_clockevent(clk_ce, timer_baseaddr + 4, irq);
        if (ret)
-               return ret;
+               goto put_clk_ce;
 
        pr_info("%pOFn #0 at %p, irq=%d\n", timer, timer_baseaddr, irq);
 
        return 0;
+
+put_clk_ce:
+       clk_put(clk_ce);
+put_clk_cs:
+       clk_put(clk_cs);
+       return ret;
 }
 
 static const struct of_device_id ttc_timer_of_match[] = {
index ca3e4cb..28ab4f1 100644 (file)
@@ -16,7 +16,6 @@
 #include <linux/of.h>
 #include <linux/of_address.h>
 #include <linux/of_irq.h>
-#include <soc/imx/timer.h>
 
 /*
  * There are 4 versions of the timer hardware on Freescale MXC hardware.
  *  - MX25, MX31, MX35, MX37, MX51, MX6Q(rev1.0)
  *  - MX6DL, MX6SX, MX6Q(rev1.1+)
  */
+enum imx_gpt_type {
+       GPT_TYPE_IMX1,          /* i.MX1 */
+       GPT_TYPE_IMX21,         /* i.MX21/27 */
+       GPT_TYPE_IMX31,         /* i.MX31/35/25/37/51/6Q */
+       GPT_TYPE_IMX6DL,        /* i.MX6DL/SX/SL */
+};
 
 /* defines common for all i.MX */
 #define MXC_TCTL               0x00
@@ -93,13 +98,11 @@ static void imx1_gpt_irq_disable(struct imx_timer *imxtm)
        tmp = readl_relaxed(imxtm->base + MXC_TCTL);
        writel_relaxed(tmp & ~MX1_2_TCTL_IRQEN, imxtm->base + MXC_TCTL);
 }
-#define imx21_gpt_irq_disable imx1_gpt_irq_disable
 
 static void imx31_gpt_irq_disable(struct imx_timer *imxtm)
 {
        writel_relaxed(0, imxtm->base + V2_IR);
 }
-#define imx6dl_gpt_irq_disable imx31_gpt_irq_disable
 
 static void imx1_gpt_irq_enable(struct imx_timer *imxtm)
 {
@@ -108,13 +111,11 @@ static void imx1_gpt_irq_enable(struct imx_timer *imxtm)
        tmp = readl_relaxed(imxtm->base + MXC_TCTL);
        writel_relaxed(tmp | MX1_2_TCTL_IRQEN, imxtm->base + MXC_TCTL);
 }
-#define imx21_gpt_irq_enable imx1_gpt_irq_enable
 
 static void imx31_gpt_irq_enable(struct imx_timer *imxtm)
 {
        writel_relaxed(1<<0, imxtm->base + V2_IR);
 }
-#define imx6dl_gpt_irq_enable imx31_gpt_irq_enable
 
 static void imx1_gpt_irq_acknowledge(struct imx_timer *imxtm)
 {
@@ -131,7 +132,6 @@ static void imx31_gpt_irq_acknowledge(struct imx_timer *imxtm)
 {
        writel_relaxed(V2_TSTAT_OF1, imxtm->base + V2_TSTAT);
 }
-#define imx6dl_gpt_irq_acknowledge imx31_gpt_irq_acknowledge
 
 static void __iomem *sched_clock_reg;
 
@@ -296,7 +296,6 @@ static void imx1_gpt_setup_tctl(struct imx_timer *imxtm)
        tctl_val = MX1_2_TCTL_FRR | MX1_2_TCTL_CLK_PCLK1 | MXC_TCTL_TEN;
        writel_relaxed(tctl_val, imxtm->base + MXC_TCTL);
 }
-#define imx21_gpt_setup_tctl imx1_gpt_setup_tctl
 
 static void imx31_gpt_setup_tctl(struct imx_timer *imxtm)
 {
@@ -343,10 +342,10 @@ static const struct imx_gpt_data imx21_gpt_data = {
        .reg_tstat = MX1_2_TSTAT,
        .reg_tcn = MX1_2_TCN,
        .reg_tcmp = MX1_2_TCMP,
-       .gpt_irq_enable = imx21_gpt_irq_enable,
-       .gpt_irq_disable = imx21_gpt_irq_disable,
+       .gpt_irq_enable = imx1_gpt_irq_enable,
+       .gpt_irq_disable = imx1_gpt_irq_disable,
        .gpt_irq_acknowledge = imx21_gpt_irq_acknowledge,
-       .gpt_setup_tctl = imx21_gpt_setup_tctl,
+       .gpt_setup_tctl = imx1_gpt_setup_tctl,
        .set_next_event = mx1_2_set_next_event,
 };
 
@@ -365,9 +364,9 @@ static const struct imx_gpt_data imx6dl_gpt_data = {
        .reg_tstat = V2_TSTAT,
        .reg_tcn = V2_TCN,
        .reg_tcmp = V2_TCMP,
-       .gpt_irq_enable = imx6dl_gpt_irq_enable,
-       .gpt_irq_disable = imx6dl_gpt_irq_disable,
-       .gpt_irq_acknowledge = imx6dl_gpt_irq_acknowledge,
+       .gpt_irq_enable = imx31_gpt_irq_enable,
+       .gpt_irq_disable = imx31_gpt_irq_disable,
+       .gpt_irq_acknowledge = imx31_gpt_irq_acknowledge,
        .gpt_setup_tctl = imx6dl_gpt_setup_tctl,
        .set_next_event = v2_set_next_event,
 };
diff --git a/drivers/clocksource/timer-loongson1-pwm.c b/drivers/clocksource/timer-loongson1-pwm.c
new file mode 100644 (file)
index 0000000..6335fee
--- /dev/null
@@ -0,0 +1,236 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Clocksource driver for Loongson-1 SoC
+ *
+ * Copyright (c) 2023 Keguang Zhang <keguang.zhang@gmail.com>
+ */
+
+#include <linux/clockchips.h>
+#include <linux/interrupt.h>
+#include <linux/sizes.h>
+#include "timer-of.h"
+
+/* Loongson-1 PWM Timer Register Definitions */
+#define PWM_CNTR               0x0
+#define PWM_HRC                        0x4
+#define PWM_LRC                        0x8
+#define PWM_CTRL               0xc
+
+/* PWM Control Register Bits */
+#define INT_LRC_EN             BIT(11)
+#define INT_HRC_EN             BIT(10)
+#define CNTR_RST               BIT(7)
+#define INT_SR                 BIT(6)
+#define INT_EN                 BIT(5)
+#define PWM_SINGLE             BIT(4)
+#define PWM_OE                 BIT(3)
+#define CNT_EN                 BIT(0)
+
+#define CNTR_WIDTH             24
+
+DEFINE_RAW_SPINLOCK(ls1x_timer_lock);
+
+struct ls1x_clocksource {
+       void __iomem *reg_base;
+       unsigned long ticks_per_jiffy;
+       struct clocksource clksrc;
+};
+
+static inline struct ls1x_clocksource *to_ls1x_clksrc(struct clocksource *c)
+{
+       return container_of(c, struct ls1x_clocksource, clksrc);
+}
+
+static inline void ls1x_pwmtimer_set_period(unsigned int period,
+                                           struct timer_of *to)
+{
+       writel(period, timer_of_base(to) + PWM_LRC);
+       writel(period, timer_of_base(to) + PWM_HRC);
+}
+
+static inline void ls1x_pwmtimer_clear(struct timer_of *to)
+{
+       writel(0, timer_of_base(to) + PWM_CNTR);
+}
+
+static inline void ls1x_pwmtimer_start(struct timer_of *to)
+{
+       writel((INT_EN | PWM_OE | CNT_EN), timer_of_base(to) + PWM_CTRL);
+}
+
+static inline void ls1x_pwmtimer_stop(struct timer_of *to)
+{
+       writel(0, timer_of_base(to) + PWM_CTRL);
+}
+
+static inline void ls1x_pwmtimer_irq_ack(struct timer_of *to)
+{
+       int val;
+
+       val = readl(timer_of_base(to) + PWM_CTRL);
+       val |= INT_SR;
+       writel(val, timer_of_base(to) + PWM_CTRL);
+}
+
+static irqreturn_t ls1x_clockevent_isr(int irq, void *dev_id)
+{
+       struct clock_event_device *clkevt = dev_id;
+       struct timer_of *to = to_timer_of(clkevt);
+
+       ls1x_pwmtimer_irq_ack(to);
+       ls1x_pwmtimer_clear(to);
+       ls1x_pwmtimer_start(to);
+
+       clkevt->event_handler(clkevt);
+
+       return IRQ_HANDLED;
+}
+
+static int ls1x_clockevent_set_state_periodic(struct clock_event_device *clkevt)
+{
+       struct timer_of *to = to_timer_of(clkevt);
+
+       raw_spin_lock(&ls1x_timer_lock);
+       ls1x_pwmtimer_set_period(timer_of_period(to), to);
+       ls1x_pwmtimer_clear(to);
+       ls1x_pwmtimer_start(to);
+       raw_spin_unlock(&ls1x_timer_lock);
+
+       return 0;
+}
+
+static int ls1x_clockevent_tick_resume(struct clock_event_device *clkevt)
+{
+       raw_spin_lock(&ls1x_timer_lock);
+       ls1x_pwmtimer_start(to_timer_of(clkevt));
+       raw_spin_unlock(&ls1x_timer_lock);
+
+       return 0;
+}
+
+static int ls1x_clockevent_set_state_shutdown(struct clock_event_device *clkevt)
+{
+       raw_spin_lock(&ls1x_timer_lock);
+       ls1x_pwmtimer_stop(to_timer_of(clkevt));
+       raw_spin_unlock(&ls1x_timer_lock);
+
+       return 0;
+}
+
+static int ls1x_clockevent_set_next(unsigned long evt,
+                                   struct clock_event_device *clkevt)
+{
+       struct timer_of *to = to_timer_of(clkevt);
+
+       raw_spin_lock(&ls1x_timer_lock);
+       ls1x_pwmtimer_set_period(evt, to);
+       ls1x_pwmtimer_clear(to);
+       ls1x_pwmtimer_start(to);
+       raw_spin_unlock(&ls1x_timer_lock);
+
+       return 0;
+}
+
+static struct timer_of ls1x_to = {
+       .flags = TIMER_OF_IRQ | TIMER_OF_BASE | TIMER_OF_CLOCK,
+       .clkevt = {
+               .name                   = "ls1x-pwmtimer",
+               .features               = CLOCK_EVT_FEAT_PERIODIC |
+                                         CLOCK_EVT_FEAT_ONESHOT,
+               .rating                 = 300,
+               .set_next_event         = ls1x_clockevent_set_next,
+               .set_state_periodic     = ls1x_clockevent_set_state_periodic,
+               .set_state_oneshot      = ls1x_clockevent_set_state_shutdown,
+               .set_state_shutdown     = ls1x_clockevent_set_state_shutdown,
+               .tick_resume            = ls1x_clockevent_tick_resume,
+       },
+       .of_irq = {
+               .handler                = ls1x_clockevent_isr,
+               .flags                  = IRQF_TIMER,
+       },
+};
+
+/*
+ * Since the PWM timer overflows every two ticks, it's not very useful
+ * to just read by itself. So use jiffies to emulate a free
+ * running counter:
+ */
+static u64 ls1x_clocksource_read(struct clocksource *cs)
+{
+       struct ls1x_clocksource *ls1x_cs = to_ls1x_clksrc(cs);
+       unsigned long flags;
+       int count;
+       u32 jifs;
+       static int old_count;
+       static u32 old_jifs;
+
+       raw_spin_lock_irqsave(&ls1x_timer_lock, flags);
+       /*
+        * Although our caller may have the read side of xtime_lock,
+        * this is now a seqlock, and we are cheating in this routine
+        * by having side effects on state that we cannot undo if
+        * there is a collision on the seqlock and our caller has to
+        * retry.  (Namely, old_jifs and old_count.)  So we must treat
+        * jiffies as volatile despite the lock.  We read jiffies
+        * before latching the timer count to guarantee that although
+        * the jiffies value might be older than the count (that is,
+        * the counter may underflow between the last point where
+        * jiffies was incremented and the point where we latch the
+        * count), it cannot be newer.
+        */
+       jifs = jiffies;
+       /* read the count */
+       count = readl(ls1x_cs->reg_base + PWM_CNTR);
+
+       /*
+        * It's possible for count to appear to go the wrong way for this
+        * reason:
+        *
+        *  The timer counter underflows, but we haven't handled the resulting
+        *  interrupt and incremented jiffies yet.
+        *
+        * Previous attempts to handle these cases intelligently were buggy, so
+        * we just do the simple thing now.
+        */
+       if (count < old_count && jifs == old_jifs)
+               count = old_count;
+
+       old_count = count;
+       old_jifs = jifs;
+
+       raw_spin_unlock_irqrestore(&ls1x_timer_lock, flags);
+
+       return (u64)(jifs * ls1x_cs->ticks_per_jiffy) + count;
+}
+
+static struct ls1x_clocksource ls1x_clocksource = {
+       .clksrc = {
+               .name           = "ls1x-pwmtimer",
+               .rating         = 300,
+               .read           = ls1x_clocksource_read,
+               .mask           = CLOCKSOURCE_MASK(CNTR_WIDTH),
+               .flags          = CLOCK_SOURCE_IS_CONTINUOUS,
+       },
+};
+
+static int __init ls1x_pwm_clocksource_init(struct device_node *np)
+{
+       struct timer_of *to = &ls1x_to;
+       int ret;
+
+       ret = timer_of_init(np, to);
+       if (ret)
+               return ret;
+
+       clockevents_config_and_register(&to->clkevt, timer_of_rate(to),
+                                       0x1, GENMASK(CNTR_WIDTH - 1, 0));
+
+       ls1x_clocksource.reg_base = timer_of_base(to);
+       ls1x_clocksource.ticks_per_jiffy = timer_of_period(to);
+
+       return clocksource_register_hz(&ls1x_clocksource.clksrc,
+                                      timer_of_rate(to));
+}
+
+TIMER_OF_DECLARE(ls1x_pwm_clocksource, "loongson,ls1b-pwmtimer",
+                ls1x_pwm_clocksource_init);
index 2c839bd..a1c51ab 100644 (file)
@@ -38,7 +38,7 @@ choice
        prompt "Default CPUFreq governor"
        default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1110_CPUFREQ
        default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if ARM64 || ARM
-       default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP
+       default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if (X86_INTEL_PSTATE || X86_AMD_PSTATE) && SMP
        default CPU_FREQ_DEFAULT_GOV_PERFORMANCE
        help
          This option sets which CPUFreq governor shall be loaded at
index 00476e9..438c9e7 100644 (file)
@@ -51,6 +51,23 @@ config X86_AMD_PSTATE
 
          If in doubt, say N.
 
+config X86_AMD_PSTATE_DEFAULT_MODE
+       int "AMD Processor P-State default mode"
+       depends on X86_AMD_PSTATE
+       default 3 if X86_AMD_PSTATE
+       range 1 4
+       help
+         Select the default mode the amd-pstate driver will use on
+         supported hardware.
+         The value set has the following meanings:
+               1 -> Disabled
+               2 -> Passive
+               3 -> Active (EPP)
+               4 -> Guided
+
+         For details, take a look at:
+         <file:Documentation/admin-guide/pm/amd-pstate.rst>.
+
 config X86_AMD_PSTATE_UT
        tristate "selftest for AMD Processor P-State driver"
        depends on X86 && ACPI_PROCESSOR
index ddd346a..81fba0d 100644 (file)
@@ -62,7 +62,8 @@
 static struct cpufreq_driver *current_pstate_driver;
 static struct cpufreq_driver amd_pstate_driver;
 static struct cpufreq_driver amd_pstate_epp_driver;
-static int cppc_state = AMD_PSTATE_DISABLE;
+static int cppc_state = AMD_PSTATE_UNDEFINED;
+static bool cppc_enabled;
 
 /*
  * AMD Energy Preference Performance (EPP)
@@ -228,7 +229,28 @@ static int amd_pstate_set_energy_pref_index(struct amd_cpudata *cpudata,
 
 static inline int pstate_enable(bool enable)
 {
-       return wrmsrl_safe(MSR_AMD_CPPC_ENABLE, enable);
+       int ret, cpu;
+       unsigned long logical_proc_id_mask = 0;
+
+       if (enable == cppc_enabled)
+               return 0;
+
+       for_each_present_cpu(cpu) {
+               unsigned long logical_id = topology_logical_die_id(cpu);
+
+               if (test_bit(logical_id, &logical_proc_id_mask))
+                       continue;
+
+               set_bit(logical_id, &logical_proc_id_mask);
+
+               ret = wrmsrl_safe_on_cpu(cpu, MSR_AMD_CPPC_ENABLE,
+                               enable);
+               if (ret)
+                       return ret;
+       }
+
+       cppc_enabled = enable;
+       return 0;
 }
 
 static int cppc_enable(bool enable)
@@ -236,6 +258,9 @@ static int cppc_enable(bool enable)
        int cpu, ret = 0;
        struct cppc_perf_ctrls perf_ctrls;
 
+       if (enable == cppc_enabled)
+               return 0;
+
        for_each_present_cpu(cpu) {
                ret = cppc_set_enable(cpu, enable);
                if (ret)
@@ -251,6 +276,7 @@ static int cppc_enable(bool enable)
                }
        }
 
+       cppc_enabled = enable;
        return ret;
 }
 
@@ -1045,6 +1071,26 @@ static const struct attribute_group amd_pstate_global_attr_group = {
        .attrs = pstate_global_attributes,
 };
 
+static bool amd_pstate_acpi_pm_profile_server(void)
+{
+       switch (acpi_gbl_FADT.preferred_profile) {
+       case PM_ENTERPRISE_SERVER:
+       case PM_SOHO_SERVER:
+       case PM_PERFORMANCE_SERVER:
+               return true;
+       }
+       return false;
+}
+
+static bool amd_pstate_acpi_pm_profile_undefined(void)
+{
+       if (acpi_gbl_FADT.preferred_profile == PM_UNSPECIFIED)
+               return true;
+       if (acpi_gbl_FADT.preferred_profile >= NR_PM_PROFILES)
+               return true;
+       return false;
+}
+
 static int amd_pstate_epp_cpu_init(struct cpufreq_policy *policy)
 {
        int min_freq, max_freq, nominal_freq, lowest_nonlinear_freq, ret;
@@ -1102,10 +1148,14 @@ static int amd_pstate_epp_cpu_init(struct cpufreq_policy *policy)
        policy->max = policy->cpuinfo.max_freq;
 
        /*
-        * Set the policy to powersave to provide a valid fallback value in case
+        * Set the policy to provide a valid fallback value in case
         * the default cpufreq governor is neither powersave nor performance.
         */
-       policy->policy = CPUFREQ_POLICY_POWERSAVE;
+       if (amd_pstate_acpi_pm_profile_server() ||
+           amd_pstate_acpi_pm_profile_undefined())
+               policy->policy = CPUFREQ_POLICY_PERFORMANCE;
+       else
+               policy->policy = CPUFREQ_POLICY_POWERSAVE;
 
        if (boot_cpu_has(X86_FEATURE_CPPC)) {
                ret = rdmsrl_on_cpu(cpudata->cpu, MSR_AMD_CPPC_REQ, &value);
@@ -1356,10 +1406,29 @@ static struct cpufreq_driver amd_pstate_epp_driver = {
        .online         = amd_pstate_epp_cpu_online,
        .suspend        = amd_pstate_epp_suspend,
        .resume         = amd_pstate_epp_resume,
-       .name           = "amd_pstate_epp",
+       .name           = "amd-pstate-epp",
        .attr           = amd_pstate_epp_attr,
 };
 
+static int __init amd_pstate_set_driver(int mode_idx)
+{
+       if (mode_idx >= AMD_PSTATE_DISABLE && mode_idx < AMD_PSTATE_MAX) {
+               cppc_state = mode_idx;
+               if (cppc_state == AMD_PSTATE_DISABLE)
+                       pr_info("driver is explicitly disabled\n");
+
+               if (cppc_state == AMD_PSTATE_ACTIVE)
+                       current_pstate_driver = &amd_pstate_epp_driver;
+
+               if (cppc_state == AMD_PSTATE_PASSIVE || cppc_state == AMD_PSTATE_GUIDED)
+                       current_pstate_driver = &amd_pstate_driver;
+
+               return 0;
+       }
+
+       return -EINVAL;
+}
+
 static int __init amd_pstate_init(void)
 {
        struct device *dev_root;
@@ -1367,15 +1436,6 @@ static int __init amd_pstate_init(void)
 
        if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD)
                return -ENODEV;
-       /*
-        * by default the pstate driver is disabled to load
-        * enable the amd_pstate passive mode driver explicitly
-        * with amd_pstate=passive or other modes in kernel command line
-        */
-       if (cppc_state == AMD_PSTATE_DISABLE) {
-               pr_info("driver load is disabled, boot with specific mode to enable this\n");
-               return -ENODEV;
-       }
 
        if (!acpi_cpc_valid()) {
                pr_warn_once("the _CPC object is not present in SBIOS or ACPI disabled\n");
@@ -1386,6 +1446,33 @@ static int __init amd_pstate_init(void)
        if (cpufreq_get_current_driver())
                return -EEXIST;
 
+       switch (cppc_state) {
+       case AMD_PSTATE_UNDEFINED:
+               /* Disable on the following configs by default:
+                * 1. Undefined platforms
+                * 2. Server platforms
+                * 3. Shared memory designs
+                */
+               if (amd_pstate_acpi_pm_profile_undefined() ||
+                   amd_pstate_acpi_pm_profile_server() ||
+                   !boot_cpu_has(X86_FEATURE_CPPC)) {
+                       pr_info("driver load is disabled, boot with specific mode to enable this\n");
+                       return -ENODEV;
+               }
+               ret = amd_pstate_set_driver(CONFIG_X86_AMD_PSTATE_DEFAULT_MODE);
+               if (ret)
+                       return ret;
+               break;
+       case AMD_PSTATE_DISABLE:
+               return -ENODEV;
+       case AMD_PSTATE_PASSIVE:
+       case AMD_PSTATE_ACTIVE:
+       case AMD_PSTATE_GUIDED:
+               break;
+       default:
+               return -EINVAL;
+       }
+
        /* capability check */
        if (boot_cpu_has(X86_FEATURE_CPPC)) {
                pr_debug("AMD CPPC MSR based functionality is supported\n");
@@ -1438,21 +1525,7 @@ static int __init amd_pstate_param(char *str)
        size = strlen(str);
        mode_idx = get_mode_idx_from_str(str, size);
 
-       if (mode_idx >= AMD_PSTATE_DISABLE && mode_idx < AMD_PSTATE_MAX) {
-               cppc_state = mode_idx;
-               if (cppc_state == AMD_PSTATE_DISABLE)
-                       pr_info("driver is explicitly disabled\n");
-
-               if (cppc_state == AMD_PSTATE_ACTIVE)
-                       current_pstate_driver = &amd_pstate_epp_driver;
-
-               if (cppc_state == AMD_PSTATE_PASSIVE || cppc_state == AMD_PSTATE_GUIDED)
-                       current_pstate_driver = &amd_pstate_driver;
-
-               return 0;
-       }
-
-       return -EINVAL;
+       return amd_pstate_set_driver(mode_idx);
 }
 early_param("amd_pstate", amd_pstate_param);
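Usage note for the consolidated amd_pstate_set_driver()/amd_pstate_param() path above: the operating mode can still be forced on the kernel command line, for example (mode strings per Documentation/admin-guide/pm/amd-pstate.rst):

	amd_pstate=disable	# do not load the driver
	amd_pstate=passive	# classic amd-pstate driver
	amd_pstate=active	# amd-pstate-epp (EPP) driver
	amd_pstate=guided	# guided autonomous mode

Without the parameter, CONFIG_X86_AMD_PSTATE_DEFAULT_MODE is used, while undefined/server platforms and shared-memory designs stay disabled by default.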
 
index 6b52ebe..50bbc96 100644 (file)
@@ -2828,7 +2828,8 @@ int cpufreq_register_driver(struct cpufreq_driver *driver_data)
             (driver_data->setpolicy && (driver_data->target_index ||
                    driver_data->target)) ||
             (!driver_data->get_intermediate != !driver_data->target_intermediate) ||
-            (!driver_data->online != !driver_data->offline))
+            (!driver_data->online != !driver_data->offline) ||
+            (driver_data->adjust_perf && !driver_data->fast_switch))
                return -EINVAL;
 
        pr_debug("trying to register driver %s\n", driver_data->name);
index 2548ec9..f291825 100644 (file)
@@ -824,6 +824,8 @@ static ssize_t store_energy_performance_preference(
                        err = cpufreq_start_governor(policy);
                        if (!ret)
                                ret = err;
+               } else {
+                       ret = 0;
                }
        }
 
index 8e929f6..737a026 100644 (file)
@@ -145,7 +145,7 @@ static noinstr void enter_s2idle_proper(struct cpuidle_driver *drv,
 
        instrumentation_begin();
 
-       time_start = ns_to_ktime(local_clock());
+       time_start = ns_to_ktime(local_clock_noinstr());
 
        tick_freeze();
        /*
@@ -169,7 +169,7 @@ static noinstr void enter_s2idle_proper(struct cpuidle_driver *drv,
        tick_unfreeze();
        start_critical_timings();
 
-       time_end = ns_to_ktime(local_clock());
+       time_end = ns_to_ktime(local_clock_noinstr());
 
        dev->states_usage[index].s2idle_time += ktime_us_delta(time_end, time_start);
        dev->states_usage[index].s2idle_usage++;
@@ -243,7 +243,7 @@ noinstr int cpuidle_enter_state(struct cpuidle_device *dev,
        sched_idle_set_state(target_state);
 
        trace_cpu_idle(index, dev->cpu);
-       time_start = ns_to_ktime(local_clock());
+       time_start = ns_to_ktime(local_clock_noinstr());
 
        stop_critical_timings();
        if (!(target_state->flags & CPUIDLE_FLAG_RCU_IDLE)) {
@@ -276,7 +276,7 @@ noinstr int cpuidle_enter_state(struct cpuidle_device *dev,
        start_critical_timings();
 
        sched_clock_idle_wakeup_event();
-       time_end = ns_to_ktime(local_clock());
+       time_end = ns_to_ktime(local_clock_noinstr());
        trace_cpu_idle(PWR_EVENT_EXIT, dev->cpu);
 
        /* The cpu is no longer idle or about to enter idle. */
index bdcfeae..9b6d90a 100644 (file)
@@ -15,7 +15,7 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
 {
        u64 time_start;
 
-       time_start = local_clock();
+       time_start = local_clock_noinstr();
 
        dev->poll_time_limit = false;
 
@@ -32,7 +32,7 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
                                continue;
 
                        loop_count = 0;
-                       if (local_clock() - time_start > limit) {
+                       if (local_clock_noinstr() - time_start > limit) {
                                dev->poll_time_limit = true;
                                break;
                        }
index 10fe9f7..f2dd667 100644 (file)
@@ -8,7 +8,7 @@
  * keysize in CBC and ECB mode.
  * Add support also for DES and 3DES in CBC and ECB mode.
  *
- * You could find the datasheet in Documentation/arm/sunxi.rst
+ * You could find the datasheet in Documentation/arch/arm/sunxi.rst
  */
 #include "sun4i-ss.h"
 
index 006e401..51a3a7b 100644 (file)
@@ -6,7 +6,7 @@
  *
  * Core file which registers crypto algorithms supported by the SS.
  *
- * You could find a link for the datasheet in Documentation/arm/sunxi.rst
+ * You could find a link for the datasheet in Documentation/arch/arm/sunxi.rst
  */
 #include <linux/clk.h>
 #include <linux/crypto.h>
index d282927..f7893e4 100644 (file)
@@ -6,7 +6,7 @@
  *
  * This file add support for MD5 and SHA1.
  *
- * You could find the datasheet in Documentation/arm/sunxi.rst
+ * You could find the datasheet in Documentation/arch/arm/sunxi.rst
  */
 #include "sun4i-ss.h"
 #include <asm/unaligned.h>
index ba59c7a..6c5d4aa 100644 (file)
@@ -8,7 +8,7 @@
  * Support MD5 and SHA1 hash algorithms.
  * Support DES and 3DES
  *
- * You could find the datasheet in Documentation/arm/sunxi.rst
+ * You could find the datasheet in Documentation/arch/arm/sunxi.rst
  */
 
 #include <linux/clk.h>
index 74b4e91..c135500 100644 (file)
@@ -8,7 +8,7 @@
  * This file add support for AES cipher with 128,192,256 bits keysize in
  * CBC and ECB mode.
  *
- * You could find a link for the datasheet in Documentation/arm/sunxi.rst
+ * You could find a link for the datasheet in Documentation/arch/arm/sunxi.rst
  */
 
 #include <linux/bottom_half.h>
index a6865ff..07ea0cc 100644 (file)
@@ -7,7 +7,7 @@
  *
  * Core file which registers crypto algorithms supported by the CryptoEngine.
  *
- * You could find a link for the datasheet in Documentation/arm/sunxi.rst
+ * You could find a link for the datasheet in Documentation/arch/arm/sunxi.rst
  */
 #include <linux/clk.h>
 #include <linux/crypto.h>
index 8b5b9b9..930ad15 100644 (file)
@@ -7,7 +7,7 @@
  *
  * This file add support for MD5 and SHA1/SHA224/SHA256/SHA384/SHA512.
  *
- * You could find the datasheet in Documentation/arm/sunxi.rst
+ * You could find the datasheet in Documentation/arch/arm/sunxi.rst
  */
 #include <linux/bottom_half.h>
 #include <linux/dma-mapping.h>
index b3cc43e..8081537 100644 (file)
@@ -7,7 +7,7 @@
  *
  * This file handle the PRNG
  *
- * You could find a link for the datasheet in Documentation/arm/sunxi.rst
+ * You could find a link for the datasheet in Documentation/arch/arm/sunxi.rst
  */
 #include "sun8i-ce.h"
 #include <linux/dma-mapping.h>
index e2b9b91..9c35f2a 100644 (file)
@@ -7,7 +7,7 @@
  *
  * This file handle the TRNG
  *
- * You could find a link for the datasheet in Documentation/arm/sunxi.rst
+ * You could find a link for the datasheet in Documentation/arch/arm/sunxi.rst
  */
 #include "sun8i-ce.h"
 #include <linux/dma-mapping.h>
index 16966cc..381a90f 100644 (file)
@@ -8,7 +8,7 @@
  * This file add support for AES cipher with 128,192,256 bits keysize in
  * CBC and ECB mode.
  *
- * You could find a link for the datasheet in Documentation/arm/sunxi.rst
+ * You could find a link for the datasheet in Documentation/arch/arm/sunxi.rst
  */
 
 #include <linux/bottom_half.h>
index c9dc06f..3dd844b 100644 (file)
@@ -7,7 +7,7 @@
  *
  * Core file which registers crypto algorithms supported by the SecuritySystem
  *
- * You could find a link for the datasheet in Documentation/arm/sunxi.rst
+ * You could find a link for the datasheet in Documentation/arch/arm/sunxi.rst
  */
 #include <linux/clk.h>
 #include <linux/crypto.h>
index 577bf63..a4b67d1 100644 (file)
@@ -7,7 +7,7 @@
  *
  * This file add support for MD5 and SHA1/SHA224/SHA256.
  *
- * You could find the datasheet in Documentation/arm/sunxi.rst
+ * You could find the datasheet in Documentation/arch/arm/sunxi.rst
  */
 #include <linux/bottom_half.h>
 #include <linux/dma-mapping.h>
index 70c7b5d..a923cfc 100644 (file)
@@ -7,7 +7,7 @@
  *
  * This file handle the PRNG found in the SS
  *
- * You could find a link for the datasheet in Documentation/arm/sunxi.rst
+ * You could find a link for the datasheet in Documentation/arch/arm/sunxi.rst
  */
 #include "sun8i-ss.h"
 #include <linux/dma-mapping.h>
index ddf6e91..30e6acf 100644 (file)
@@ -357,9 +357,9 @@ static int cptpf_vfpf_mbox_init(struct otx2_cptpf_dev *cptpf, int num_vfs)
        u64 vfpf_mbox_base;
        int err, i;
 
-       cptpf->vfpf_mbox_wq = alloc_workqueue("cpt_vfpf_mailbox",
-                                             WQ_UNBOUND | WQ_HIGHPRI |
-                                             WQ_MEM_RECLAIM, 1);
+       cptpf->vfpf_mbox_wq =
+               alloc_ordered_workqueue("cpt_vfpf_mailbox",
+                                       WQ_HIGHPRI | WQ_MEM_RECLAIM);
        if (!cptpf->vfpf_mbox_wq)
                return -ENOMEM;
 
@@ -453,9 +453,9 @@ static int cptpf_afpf_mbox_init(struct otx2_cptpf_dev *cptpf)
        resource_size_t offset;
        int err;
 
-       cptpf->afpf_mbox_wq = alloc_workqueue("cpt_afpf_mailbox",
-                                             WQ_UNBOUND | WQ_HIGHPRI |
-                                             WQ_MEM_RECLAIM, 1);
+       cptpf->afpf_mbox_wq =
+               alloc_ordered_workqueue("cpt_afpf_mailbox",
+                                       WQ_HIGHPRI | WQ_MEM_RECLAIM);
        if (!cptpf->afpf_mbox_wq)
                return -ENOMEM;
 
index 392e9fe..6023a7a 100644 (file)
@@ -75,9 +75,9 @@ static int cptvf_pfvf_mbox_init(struct otx2_cptvf_dev *cptvf)
        resource_size_t offset, size;
        int ret;
 
-       cptvf->pfvf_mbox_wq = alloc_workqueue("cpt_pfvf_mailbox",
-                                             WQ_UNBOUND | WQ_HIGHPRI |
-                                             WQ_MEM_RECLAIM, 1);
+       cptvf->pfvf_mbox_wq =
+               alloc_ordered_workqueue("cpt_pfvf_mailbox",
+                                       WQ_HIGHPRI | WQ_MEM_RECLAIM);
        if (!cptvf->pfvf_mbox_wq)
                return -ENOMEM;
 
index 8841444..245898f 100644 (file)
@@ -518,6 +518,7 @@ static struct platform_driver exynos_bus_platdrv = {
 };
 module_platform_driver(exynos_bus_platdrv);
 
+MODULE_SOFTDEP("pre: exynos_ppmu");
 MODULE_DESCRIPTION("Generic Exynos Bus frequency driver");
 MODULE_AUTHOR("Chanwoo Choi <cw00.choi@samsung.com>");
 MODULE_LICENSE("GPL v2");
index e5458ad..6354622 100644 (file)
@@ -127,7 +127,7 @@ static int mtk_ccifreq_target(struct device *dev, unsigned long *freq,
                              u32 flags)
 {
        struct mtk_ccifreq_drv *drv = dev_get_drvdata(dev);
-       struct clk *cci_pll = clk_get_parent(drv->cci_clk);
+       struct clk *cci_pll;
        struct dev_pm_opp *opp;
        unsigned long opp_rate;
        int voltage, pre_voltage, inter_voltage, target_voltage, ret;
@@ -139,6 +139,7 @@ static int mtk_ccifreq_target(struct device *dev, unsigned long *freq,
                return 0;
 
        inter_voltage = drv->inter_voltage;
+       cci_pll = clk_get_parent(drv->cci_clk);
 
        opp_rate = *freq;
        opp = devfreq_recommended_opp(dev, &opp_rate, 1);
index 68f5767..110e99b 100644 (file)
@@ -550,4 +550,15 @@ config EDAC_ZYNQMP
          Xilinx ZynqMP OCM (On Chip Memory) controller. It can also be
          built as a module. In that case it will be called zynqmp_edac.
 
+config EDAC_NPCM
+       tristate "Nuvoton NPCM DDR Memory Controller"
+       depends on (ARCH_NPCM || COMPILE_TEST)
+       help
+         Support for error detection and correction on the Nuvoton NPCM DDR
+         memory controller.
+
+         The memory controller supports single-bit error correction and double-bit
+         error detection (in-line ECC, in which 1/8th of the memory device used
+         to store data is used for ECC storage).
+
 endif # EDAC
index 9b025c5..61945d3 100644 (file)
@@ -84,4 +84,5 @@ obj-$(CONFIG_EDAC_QCOM)                       += qcom_edac.o
 obj-$(CONFIG_EDAC_ASPEED)              += aspeed_edac.o
 obj-$(CONFIG_EDAC_BLUEFIELD)           += bluefield_edac.o
 obj-$(CONFIG_EDAC_DMC520)              += dmc520_edac.o
+obj-$(CONFIG_EDAC_NPCM)                        += npcm_edac.o
 obj-$(CONFIG_EDAC_ZYNQMP)              += zynqmp_edac.o
index 5c4292e..597dae7 100644 (file)
@@ -975,6 +975,74 @@ static int sys_addr_to_csrow(struct mem_ctl_info *mci, u64 sys_addr)
        return csrow;
 }
 
+/*
+ * See AMD PPR DF::LclNodeTypeMap
+ *
+ * This register gives information for nodes of the same type within a system.
+ *
+ * Reading this register from a GPU node tells how many GPU nodes are in the
+ * system and what the lowest AMD Node ID value is for the GPU nodes. Use this
+ * info to fix up the Linux logical "Node ID" value set in the AMD NB code and EDAC.
+ */
+static struct local_node_map {
+       u16 node_count;
+       u16 base_node_id;
+} gpu_node_map;
+
+#define PCI_DEVICE_ID_AMD_MI200_DF_F1          0x14d1
+#define REG_LOCAL_NODE_TYPE_MAP                        0x144
+
+/* Local Node Type Map (LNTM) fields */
+#define LNTM_NODE_COUNT                                GENMASK(27, 16)
+#define LNTM_BASE_NODE_ID                      GENMASK(11, 0)
+
+static int gpu_get_node_map(void)
+{
+       struct pci_dev *pdev;
+       int ret;
+       u32 tmp;
+
+       /*
+        * Node ID 0 is reserved for CPUs.
+        * Therefore, a non-zero Node ID means we've already cached the values.
+        */
+       if (gpu_node_map.base_node_id)
+               return 0;
+
+       pdev = pci_get_device(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_DF_F1, NULL);
+       if (!pdev) {
+               ret = -ENODEV;
+               goto out;
+       }
+
+       ret = pci_read_config_dword(pdev, REG_LOCAL_NODE_TYPE_MAP, &tmp);
+       if (ret)
+               goto out;
+
+       gpu_node_map.node_count = FIELD_GET(LNTM_NODE_COUNT, tmp);
+       gpu_node_map.base_node_id = FIELD_GET(LNTM_BASE_NODE_ID, tmp);
+
+out:
+       pci_dev_put(pdev);
+       return ret;
+}
+
+static int fixup_node_id(int node_id, struct mce *m)
+{
+       /* MCA_IPID[InstanceIdHi] gives the AMD Node ID for the bank. */
+       u8 nid = (m->ipid >> 44) & 0xF;
+
+       if (smca_get_bank_type(m->extcpu, m->bank) != SMCA_UMC_V2)
+               return node_id;
+
+       /* Nodes below the GPU base node are CPU nodes and don't need a fixup. */
+       if (nid < gpu_node_map.base_node_id)
+               return node_id;
+
+       /* Convert the hardware-provided AMD Node ID to a Linux logical one. */
+       return nid - gpu_node_map.base_node_id + 1;
+}
+
 /* Protect the PCI config register pairs used for DF indirect access. */
 static DEFINE_MUTEX(df_indirect_mutex);
 
@@ -1426,12 +1494,47 @@ static int umc_get_cs_mode(int dimm, u8 ctrl, struct amd64_pvt *pvt)
        return cs_mode;
 }
 
+static int __addr_mask_to_cs_size(u32 addr_mask_orig, unsigned int cs_mode,
+                                 int csrow_nr, int dimm)
+{
+       u32 msb, weight, num_zero_bits;
+       u32 addr_mask_deinterleaved;
+       int size = 0;
+
+       /*
+        * The number of zero bits in the mask is equal to the number of bits
+        * in a full mask minus the number of bits in the current mask.
+        *
+        * The MSB is the number of bits in the full mask because BIT[0] is
+        * always 0.
+        *
+        * In the special 3 Rank interleaving case, a single bit is flipped
+        * without swapping with the most significant bit. This can be handled
+        * by keeping the MSB where it is and ignoring the single zero bit.
+        */
+       msb = fls(addr_mask_orig) - 1;
+       weight = hweight_long(addr_mask_orig);
+       num_zero_bits = msb - weight - !!(cs_mode & CS_3R_INTERLEAVE);
+
+       /* Take the number of zero bits off from the top of the mask. */
+       addr_mask_deinterleaved = GENMASK_ULL(msb - num_zero_bits, 1);
+
+       edac_dbg(1, "CS%d DIMM%d AddrMasks:\n", csrow_nr, dimm);
+       edac_dbg(1, "  Original AddrMask: 0x%x\n", addr_mask_orig);
+       edac_dbg(1, "  Deinterleaved AddrMask: 0x%x\n", addr_mask_deinterleaved);
+
+       /* Register [31:1] = Address [39:9]. Size is in kBs here. */
+       size = (addr_mask_deinterleaved >> 2) + 1;
+
+       /* Return size in MBs. */
+       return size >> 10;
+}
+
 static int umc_addr_mask_to_cs_size(struct amd64_pvt *pvt, u8 umc,
                                    unsigned int cs_mode, int csrow_nr)
 {
-       u32 addr_mask_orig, addr_mask_deinterleaved;
-       u32 msb, weight, num_zero_bits;
        int cs_mask_nr = csrow_nr;
+       u32 addr_mask_orig;
        int dimm, size = 0;
 
        /* No Chip Selects are enabled. */
@@ -1475,33 +1578,7 @@ static int umc_addr_mask_to_cs_size(struct amd64_pvt *pvt, u8 umc,
        else
                addr_mask_orig = pvt->csels[umc].csmasks[cs_mask_nr];
 
-       /*
-        * The number of zero bits in the mask is equal to the number of bits
-        * in a full mask minus the number of bits in the current mask.
-        *
-        * The MSB is the number of bits in the full mask because BIT[0] is
-        * always 0.
-        *
-        * In the special 3 Rank interleaving case, a single bit is flipped
-        * without swapping with the most significant bit. This can be handled
-        * by keeping the MSB where it is and ignoring the single zero bit.
-        */
-       msb = fls(addr_mask_orig) - 1;
-       weight = hweight_long(addr_mask_orig);
-       num_zero_bits = msb - weight - !!(cs_mode & CS_3R_INTERLEAVE);
-
-       /* Take the number of zero bits off from the top of the mask. */
-       addr_mask_deinterleaved = GENMASK_ULL(msb - num_zero_bits, 1);
-
-       edac_dbg(1, "CS%d DIMM%d AddrMasks:\n", csrow_nr, dimm);
-       edac_dbg(1, "  Original AddrMask: 0x%x\n", addr_mask_orig);
-       edac_dbg(1, "  Deinterleaved AddrMask: 0x%x\n", addr_mask_deinterleaved);
-
-       /* Register [31:1] = Address [39:9]. Size is in kBs here. */
-       size = (addr_mask_deinterleaved >> 2) + 1;
-
-       /* Return size in MBs. */
-       return size >> 10;
+       return __addr_mask_to_cs_size(addr_mask_orig, cs_mode, csrow_nr, dimm);
 }
 
 static void umc_debug_display_dimm_sizes(struct amd64_pvt *pvt, u8 ctrl)
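To make the factored-out __addr_mask_to_cs_size() arithmetic concrete, here is a standalone userspace sketch with a made-up, non-interleaved address mask (0x0000fffe is purely illustrative, not a value read from hardware):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint32_t mask = 0x0000fffe;		/* assumed: bits [15:1] set, no interleave */
		int msb = 31 - __builtin_clz(mask);	/* fls(mask) - 1 = 15 */
		int weight = __builtin_popcount(mask);	/* 15 set bits */
		int zeros = msb - weight;		/* 0: the mask is contiguous */
		/* GENMASK(msb - zeros, 1): keep bits [15:1] */
		uint32_t deint = ((1u << (msb - zeros + 1)) - 1) & ~1u;
		uint32_t size_kb = (deint >> 2) + 1;	/* 16384 kB */

		printf("%u MB\n", size_kb >> 10);	/* prints 16 */
		return 0;
	}

In the 3 Rank interleaving case the single flipped zero bit is ignored via the !!(cs_mode & CS_3R_INTERLEAVE) term, as the comment in the helper explains.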
@@ -2992,6 +3069,8 @@ static void decode_umc_error(int node_id, struct mce *m)
        struct err_info err;
        u64 sys_addr;
 
+       node_id = fixup_node_id(node_id, m);
+
        mci = edac_mc_find(node_id);
        if (!mci)
                return;
@@ -3675,6 +3754,227 @@ static int umc_hw_info_get(struct amd64_pvt *pvt)
        return 0;
 }
 
+/*
+ * The CPUs have one channel per UMC, so UMC number is equivalent to a
+ * channel number. The GPUs have 8 channels per UMC, so the UMC number no
+ * longer works as a channel number.
+ *
+ * The channel number within a GPU UMC is given in MCA_IPID[15:12].
+ * However, the IDs are split such that two UMC values go to one UMC, and
+ * the channel numbers are split in two groups of four.
+ *
+ * Refer to comment on gpu_get_umc_base().
+ *
+ * For example,
+ * UMC0 CH[3:0] = 0x0005[3:0]000
+ * UMC0 CH[7:4] = 0x0015[3:0]000
+ * UMC1 CH[3:0] = 0x0025[3:0]000
+ * UMC1 CH[7:4] = 0x0035[3:0]000
+ */
+static void gpu_get_err_info(struct mce *m, struct err_info *err)
+{
+       u8 ch = (m->ipid & GENMASK(31, 0)) >> 20;
+       u8 phy = ((m->ipid >> 12) & 0xf);
+
+       err->channel = ch % 2 ? phy + 4 : phy;
+       err->csrow = phy;
+}
+
+static int gpu_addr_mask_to_cs_size(struct amd64_pvt *pvt, u8 umc,
+                                   unsigned int cs_mode, int csrow_nr)
+{
+       u32 addr_mask_orig = pvt->csels[umc].csmasks[csrow_nr];
+
+       return __addr_mask_to_cs_size(addr_mask_orig, cs_mode, csrow_nr, csrow_nr >> 1);
+}
+
+static void gpu_debug_display_dimm_sizes(struct amd64_pvt *pvt, u8 ctrl)
+{
+       int size, cs_mode, cs = 0;
+
+       edac_printk(KERN_DEBUG, EDAC_MC, "UMC%d chip selects:\n", ctrl);
+
+       cs_mode = CS_EVEN_PRIMARY | CS_ODD_PRIMARY;
+
+       for_each_chip_select(cs, ctrl, pvt) {
+               size = gpu_addr_mask_to_cs_size(pvt, ctrl, cs_mode, cs);
+               amd64_info(EDAC_MC ": %d: %5dMB\n", cs, size);
+       }
+}
+
+static void gpu_dump_misc_regs(struct amd64_pvt *pvt)
+{
+       struct amd64_umc *umc;
+       u32 i;
+
+       for_each_umc(i) {
+               umc = &pvt->umc[i];
+
+               edac_dbg(1, "UMC%d UMC cfg: 0x%x\n", i, umc->umc_cfg);
+               edac_dbg(1, "UMC%d SDP ctrl: 0x%x\n", i, umc->sdp_ctrl);
+               edac_dbg(1, "UMC%d ECC ctrl: 0x%x\n", i, umc->ecc_ctrl);
+               edac_dbg(1, "UMC%d All HBMs support ECC: yes\n", i);
+
+               gpu_debug_display_dimm_sizes(pvt, i);
+       }
+}
+
+static u32 gpu_get_csrow_nr_pages(struct amd64_pvt *pvt, u8 dct, int csrow_nr)
+{
+       u32 nr_pages;
+       int cs_mode = CS_EVEN_PRIMARY | CS_ODD_PRIMARY;
+
+       nr_pages   = gpu_addr_mask_to_cs_size(pvt, dct, cs_mode, csrow_nr);
+       nr_pages <<= 20 - PAGE_SHIFT;
+
+       edac_dbg(0, "csrow: %d, channel: %d\n", csrow_nr, dct);
+       edac_dbg(0, "nr_pages/channel: %u\n", nr_pages);
+
+       return nr_pages;
+}
+
+static void gpu_init_csrows(struct mem_ctl_info *mci)
+{
+       struct amd64_pvt *pvt = mci->pvt_info;
+       struct dimm_info *dimm;
+       u8 umc, cs;
+
+       for_each_umc(umc) {
+               for_each_chip_select(cs, umc, pvt) {
+                       if (!csrow_enabled(cs, umc, pvt))
+                               continue;
+
+                       dimm = mci->csrows[umc]->channels[cs]->dimm;
+
+                       edac_dbg(1, "MC node: %d, csrow: %d\n",
+                                pvt->mc_node_id, cs);
+
+                       dimm->nr_pages = gpu_get_csrow_nr_pages(pvt, umc, cs);
+                       dimm->edac_mode = EDAC_SECDED;
+                       dimm->mtype = MEM_HBM2;
+                       dimm->dtype = DEV_X16;
+                       dimm->grain = 64;
+               }
+       }
+}
+
+static void gpu_setup_mci_misc_attrs(struct mem_ctl_info *mci)
+{
+       struct amd64_pvt *pvt = mci->pvt_info;
+
+       mci->mtype_cap          = MEM_FLAG_HBM2;
+       mci->edac_ctl_cap       = EDAC_FLAG_SECDED;
+
+       mci->edac_cap           = EDAC_FLAG_EC;
+       mci->mod_name           = EDAC_MOD_STR;
+       mci->ctl_name           = pvt->ctl_name;
+       mci->dev_name           = pci_name(pvt->F3);
+       mci->ctl_page_to_phys   = NULL;
+
+       gpu_init_csrows(mci);
+}
+
+/* ECC is enabled by default on GPU nodes */
+static bool gpu_ecc_enabled(struct amd64_pvt *pvt)
+{
+       return true;
+}
+
+static inline u32 gpu_get_umc_base(u8 umc, u8 channel)
+{
+       /*
+        * On CPUs, there is one channel per UMC, so UMC numbering equals
+        * channel numbering. On GPUs, there are eight channels per UMC,
+        * so the channel numbering is different from UMC numbering.
+        *
+        * On CPU nodes channels are selected in 6th nibble
+        * UMC chY[3:0]= [(chY*2 + 1) : (chY*2)]50000;
+        *
+        * On GPU nodes channels are selected in 3rd nibble
+        * HBM chX[3:0]= [Y  ]5X[3:0]000;
+        * HBM chX[7:4]= [Y+1]5X[3:0]000
+        */
+       umc *= 2;
+
+       if (channel >= 4)
+               umc++;
+
+       return 0x50000 + (umc << 20) + ((channel % 4) << 12);
+}
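
For a quick check of the address arithmetic, here is a stand-alone copy evaluated for a made-up (umc, channel) pair; it reproduces the 0x0035[3:0]000 pattern from the comment.

#include <stdio.h>
#include <stdint.h>

/* Same arithmetic as gpu_get_umc_base(), copied so it can run on its own. */
static uint32_t sketch_umc_base(uint8_t umc, uint8_t channel)
{
	umc *= 2;
	if (channel >= 4)
		umc++;

	return 0x50000 + ((uint32_t)umc << 20) + ((channel % 4) << 12);
}

int main(void)
{
	/* UMC1, channel 5: second entry of the UMC1 CH[7:4] group. */
	printf("0x%08x\n", sketch_umc_base(1, 5));	/* 0x00351000 */
	return 0;
}
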
+
+static void gpu_read_mc_regs(struct amd64_pvt *pvt)
+{
+       u8 nid = pvt->mc_node_id;
+       struct amd64_umc *umc;
+       u32 i, umc_base;
+
+       /* Read registers from each UMC */
+       for_each_umc(i) {
+               umc_base = gpu_get_umc_base(i, 0);
+               umc = &pvt->umc[i];
+
+               amd_smn_read(nid, umc_base + UMCCH_UMC_CFG, &umc->umc_cfg);
+               amd_smn_read(nid, umc_base + UMCCH_SDP_CTRL, &umc->sdp_ctrl);
+               amd_smn_read(nid, umc_base + UMCCH_ECC_CTRL, &umc->ecc_ctrl);
+       }
+}
+
+static void gpu_read_base_mask(struct amd64_pvt *pvt)
+{
+       u32 base_reg, mask_reg;
+       u32 *base, *mask;
+       int umc, cs;
+
+       for_each_umc(umc) {
+               for_each_chip_select(cs, umc, pvt) {
+                       base_reg = gpu_get_umc_base(umc, cs) + UMCCH_BASE_ADDR;
+                       base = &pvt->csels[umc].csbases[cs];
+
+                       if (!amd_smn_read(pvt->mc_node_id, base_reg, base)) {
+                               edac_dbg(0, "  DCSB%d[%d]=0x%08x reg: 0x%x\n",
+                                        umc, cs, *base, base_reg);
+                       }
+
+                       mask_reg = gpu_get_umc_base(umc, cs) + UMCCH_ADDR_MASK;
+                       mask = &pvt->csels[umc].csmasks[cs];
+
+                       if (!amd_smn_read(pvt->mc_node_id, mask_reg, mask)) {
+                               edac_dbg(0, "  DCSM%d[%d]=0x%08x reg: 0x%x\n",
+                                        umc, cs, *mask, mask_reg);
+                       }
+               }
+       }
+}
+
+static void gpu_prep_chip_selects(struct amd64_pvt *pvt)
+{
+       int umc;
+
+       for_each_umc(umc) {
+               pvt->csels[umc].b_cnt = 8;
+               pvt->csels[umc].m_cnt = 8;
+       }
+}
+
+static int gpu_hw_info_get(struct amd64_pvt *pvt)
+{
+       int ret;
+
+       ret = gpu_get_node_map();
+       if (ret)
+               return ret;
+
+       pvt->umc = kcalloc(pvt->max_mcs, sizeof(struct amd64_umc), GFP_KERNEL);
+       if (!pvt->umc)
+               return -ENOMEM;
+
+       gpu_prep_chip_selects(pvt);
+       gpu_read_base_mask(pvt);
+       gpu_read_mc_regs(pvt);
+
+       return 0;
+}
+
 static void hw_info_put(struct amd64_pvt *pvt)
 {
        pci_dev_put(pvt->F1);
@@ -3690,6 +3990,14 @@ static struct low_ops umc_ops = {
        .get_err_info                   = umc_get_err_info,
 };
 
+static struct low_ops gpu_ops = {
+       .hw_info_get                    = gpu_hw_info_get,
+       .ecc_enabled                    = gpu_ecc_enabled,
+       .setup_mci_misc_attrs           = gpu_setup_mci_misc_attrs,
+       .dump_misc_regs                 = gpu_dump_misc_regs,
+       .get_err_info                   = gpu_get_err_info,
+};
+
 /* Use Family 16h versions for defaults and adjust as needed below. */
 static struct low_ops dct_ops = {
        .map_sysaddr_to_csrow           = f1x_map_sysaddr_to_csrow,
@@ -3813,9 +4121,27 @@ static int per_family_init(struct amd64_pvt *pvt)
                case 0x20 ... 0x2f:
                        pvt->ctl_name                   = "F19h_M20h";
                        break;
+               case 0x30 ... 0x3f:
+                       if (pvt->F3->device == PCI_DEVICE_ID_AMD_MI200_DF_F3) {
+                               pvt->ctl_name           = "MI200";
+                               pvt->max_mcs            = 4;
+                               pvt->ops                = &gpu_ops;
+                       } else {
+                               pvt->ctl_name           = "F19h_M30h";
+                               pvt->max_mcs            = 8;
+                       }
+                       break;
                case 0x50 ... 0x5f:
                        pvt->ctl_name                   = "F19h_M50h";
                        break;
+               case 0x60 ... 0x6f:
+                       pvt->ctl_name                   = "F19h_M60h";
+                       pvt->flags.zn_regs_v2           = 1;
+                       break;
+               case 0x70 ... 0x7f:
+                       pvt->ctl_name                   = "F19h_M70h";
+                       pvt->flags.zn_regs_v2           = 1;
+                       break;
                case 0xa0 ... 0xaf:
                        pvt->ctl_name                   = "F19h_MA0h";
                        pvt->max_mcs                    = 12;
@@ -3846,11 +4172,17 @@ static int init_one_instance(struct amd64_pvt *pvt)
        struct edac_mc_layer layers[2];
        int ret = -ENOMEM;
 
+       /*
+        * For the heterogeneous (GPU node) family, the EDAC CHIP_SELECT and
+        * CHANNEL layer sizes are swapped so the GPU topology fits the layers.
+        */
        layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
-       layers[0].size = pvt->csels[0].b_cnt;
+       layers[0].size = (pvt->F3->device == PCI_DEVICE_ID_AMD_MI200_DF_F3) ?
+                        pvt->max_mcs : pvt->csels[0].b_cnt;
        layers[0].is_virt_csrow = true;
        layers[1].type = EDAC_MC_LAYER_CHANNEL;
-       layers[1].size = pvt->max_mcs;
+       layers[1].size = (pvt->F3->device == PCI_DEVICE_ID_AMD_MI200_DF_F3) ?
+                        pvt->csels[0].b_cnt : pvt->max_mcs;
        layers[1].is_virt_csrow = false;
 
        mci = edac_mc_alloc(pvt->mc_node_id, ARRAY_SIZE(layers), layers, 0);
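
A minimal sketch of what the swap above amounts to; the counts are assumptions that roughly match the MI200 values set elsewhere in this patch (max_mcs of 4, eight chip selects per UMC).

#include <stdio.h>
#include <stdbool.h>

int main(void)
{
	/* Hypothetical counts: 4 UMCs (max_mcs), 8 chip selects per UMC. */
	unsigned int max_mcs = 4, cs_per_umc = 8;
	bool heterogeneous = true;	/* i.e. the MI200 DF function was matched */

	unsigned int csrow_layer = heterogeneous ? max_mcs : cs_per_umc;
	unsigned int chan_layer  = heterogeneous ? cs_per_umc : max_mcs;

	printf("CHIP_SELECT layer=%u CHANNEL layer=%u\n", csrow_layer, chan_layer);
	return 0;
}
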
@@ -4074,8 +4406,6 @@ static int __init amd64_edac_init(void)
        amd64_err("%s on 32-bit is unsupported. USE AT YOUR OWN RISK!\n", EDAC_MOD_STR);
 #endif
 
-       printk(KERN_INFO "AMD64 EDAC driver v%s\n", EDAC_AMD64_VERSION);
-
        return 0;
 
 err_pci:
@@ -4121,7 +4451,7 @@ module_exit(amd64_edac_exit);
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("SoftwareBitMaker: Doug Thompson, Dave Peterson, Thayne Harbaugh; AMD");
-MODULE_DESCRIPTION("MC support for AMD64 memory controllers - " EDAC_AMD64_VERSION);
+MODULE_DESCRIPTION("MC support for AMD64 memory controllers");
 
 module_param(edac_op_state, int, 0444);
 MODULE_PARM_DESC(edac_op_state, "EDAC Error Reporting state: 0=Poll,1=NMI");
index e84fe0d..5a4e4a5 100644 (file)
@@ -16,6 +16,7 @@
 #include <linux/slab.h>
 #include <linux/mmzone.h>
 #include <linux/edac.h>
+#include <linux/bitfield.h>
 #include <asm/cpu_device_id.h>
 #include <asm/msr.h>
 #include "edac_module.h"
@@ -85,7 +86,6 @@
  *         sections 3.5.4 and 3.5.5 for more information.
  */
 
-#define EDAC_AMD64_VERSION             "3.5.0"
 #define EDAC_MOD_STR                   "amd64_edac"
 
 /* Extended Model from CPUID, for CPU Revision numbers */
index cc5c63f..9215c06 100644 (file)
@@ -1186,7 +1186,8 @@ static void decode_smca_error(struct mce *m)
        if (xec < smca_mce_descs[bank_type].num_descs)
                pr_cont(", %s.\n", smca_mce_descs[bank_type].descs[xec]);
 
-       if (bank_type == SMCA_UMC && xec == 0 && decode_dram_ecc)
+       if ((bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) &&
+           xec == 0 && decode_dram_ecc)
                decode_dram_ecc(topology_die_id(m->extcpu), m);
 }
 
diff --git a/drivers/edac/npcm_edac.c b/drivers/edac/npcm_edac.c
new file mode 100644 (file)
index 0000000..12b95be
--- /dev/null
@@ -0,0 +1,543 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright (c) 2022 Nuvoton Technology Corporation
+
+#include <linux/debugfs.h>
+#include <linux/iopoll.h>
+#include <linux/of_device.h>
+#include <linux/regmap.h>
+#include "edac_module.h"
+
+#define EDAC_MOD_NAME                  "npcm-edac"
+#define EDAC_MSG_SIZE                  256
+
+/* chip serials */
+#define NPCM7XX_CHIP                   BIT(0)
+#define NPCM8XX_CHIP                   BIT(1)
+
+/* syndrome values */
+#define UE_SYNDROME                    0x03
+
+/* error injection */
+#define ERROR_TYPE_CORRECTABLE         0
+#define ERROR_TYPE_UNCORRECTABLE       1
+#define ERROR_LOCATION_DATA            0
+#define ERROR_LOCATION_CHECKCODE       1
+#define ERROR_BIT_DATA_MAX             63
+#define ERROR_BIT_CHECKCODE_MAX                7
+
+static char data_synd[] = {
+       0xf4, 0xf1, 0xec, 0xea, 0xe9, 0xe6, 0xe5, 0xe3,
+       0xdc, 0xda, 0xd9, 0xd6, 0xd5, 0xd3, 0xce, 0xcb,
+       0xb5, 0xb0, 0xad, 0xab, 0xa8, 0xa7, 0xa4, 0xa2,
+       0x9d, 0x9b, 0x98, 0x97, 0x94, 0x92, 0x8f, 0x8a,
+       0x75, 0x70, 0x6d, 0x6b, 0x68, 0x67, 0x64, 0x62,
+       0x5e, 0x5b, 0x58, 0x57, 0x54, 0x52, 0x4f, 0x4a,
+       0x34, 0x31, 0x2c, 0x2a, 0x29, 0x26, 0x25, 0x23,
+       0x1c, 0x1a, 0x19, 0x16, 0x15, 0x13, 0x0e, 0x0b
+};
+
+static struct regmap *npcm_regmap;
+
+struct npcm_platform_data {
+       /* chip serials */
+       int chip;
+
+       /* memory controller registers */
+       u32 ctl_ecc_en;
+       u32 ctl_int_status;
+       u32 ctl_int_ack;
+       u32 ctl_int_mask_master;
+       u32 ctl_int_mask_ecc;
+       u32 ctl_ce_addr_l;
+       u32 ctl_ce_addr_h;
+       u32 ctl_ce_data_l;
+       u32 ctl_ce_data_h;
+       u32 ctl_ce_synd;
+       u32 ctl_ue_addr_l;
+       u32 ctl_ue_addr_h;
+       u32 ctl_ue_data_l;
+       u32 ctl_ue_data_h;
+       u32 ctl_ue_synd;
+       u32 ctl_source_id;
+       u32 ctl_controller_busy;
+       u32 ctl_xor_check_bits;
+
+       /* masks and shifts */
+       u32 ecc_en_mask;
+       u32 int_status_ce_mask;
+       u32 int_status_ue_mask;
+       u32 int_ack_ce_mask;
+       u32 int_ack_ue_mask;
+       u32 int_mask_master_non_ecc_mask;
+       u32 int_mask_master_global_mask;
+       u32 int_mask_ecc_non_event_mask;
+       u32 ce_addr_h_mask;
+       u32 ce_synd_mask;
+       u32 ce_synd_shift;
+       u32 ue_addr_h_mask;
+       u32 ue_synd_mask;
+       u32 ue_synd_shift;
+       u32 source_id_ce_mask;
+       u32 source_id_ce_shift;
+       u32 source_id_ue_mask;
+       u32 source_id_ue_shift;
+       u32 controller_busy_mask;
+       u32 xor_check_bits_mask;
+       u32 xor_check_bits_shift;
+       u32 writeback_en_mask;
+       u32 fwc_mask;
+};
+
+struct priv_data {
+       void __iomem *reg;
+       char message[EDAC_MSG_SIZE];
+       const struct npcm_platform_data *pdata;
+
+       /* error injection */
+       struct dentry *debugfs;
+       u8 error_type;
+       u8 location;
+       u8 bit;
+};
+
+static void handle_ce(struct mem_ctl_info *mci)
+{
+       struct priv_data *priv = mci->pvt_info;
+       const struct npcm_platform_data *pdata;
+       u32 val_h = 0, val_l, id, synd;
+       u64 addr = 0, data = 0;
+
+       pdata = priv->pdata;
+       regmap_read(npcm_regmap, pdata->ctl_ce_addr_l, &val_l);
+       if (pdata->chip == NPCM8XX_CHIP) {
+               regmap_read(npcm_regmap, pdata->ctl_ce_addr_h, &val_h);
+               val_h &= pdata->ce_addr_h_mask;
+       }
+       addr = ((addr | val_h) << 32) | val_l;
+
+       regmap_read(npcm_regmap, pdata->ctl_ce_data_l, &val_l);
+       if (pdata->chip == NPCM8XX_CHIP)
+               regmap_read(npcm_regmap, pdata->ctl_ce_data_h, &val_h);
+       data = ((data | val_h) << 32) | val_l;
+
+       regmap_read(npcm_regmap, pdata->ctl_source_id, &id);
+       id = (id & pdata->source_id_ce_mask) >> pdata->source_id_ce_shift;
+
+       regmap_read(npcm_regmap, pdata->ctl_ce_synd, &synd);
+       synd = (synd & pdata->ce_synd_mask) >> pdata->ce_synd_shift;
+
+       snprintf(priv->message, EDAC_MSG_SIZE,
+                "addr = 0x%llx, data = 0x%llx, id = 0x%x", addr, data, id);
+
+       edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci, 1, addr >> PAGE_SHIFT,
+                            addr & ~PAGE_MASK, synd, 0, 0, -1, priv->message, "");
+}
+
+static void handle_ue(struct mem_ctl_info *mci)
+{
+       struct priv_data *priv = mci->pvt_info;
+       const struct npcm_platform_data *pdata;
+       u32 val_h = 0, val_l, id, synd;
+       u64 addr = 0, data = 0;
+
+       pdata = priv->pdata;
+       regmap_read(npcm_regmap, pdata->ctl_ue_addr_l, &val_l);
+       if (pdata->chip == NPCM8XX_CHIP) {
+               regmap_read(npcm_regmap, pdata->ctl_ue_addr_h, &val_h);
+               val_h &= pdata->ue_addr_h_mask;
+       }
+       addr = ((addr | val_h) << 32) | val_l;
+
+       regmap_read(npcm_regmap, pdata->ctl_ue_data_l, &val_l);
+       if (pdata->chip == NPCM8XX_CHIP)
+               regmap_read(npcm_regmap, pdata->ctl_ue_data_h, &val_h);
+       data = ((data | val_h) << 32) | val_l;
+
+       regmap_read(npcm_regmap, pdata->ctl_source_id, &id);
+       id = (id & pdata->source_id_ue_mask) >> pdata->source_id_ue_shift;
+
+       regmap_read(npcm_regmap, pdata->ctl_ue_synd, &synd);
+       synd = (synd & pdata->ue_synd_mask) >> pdata->ue_synd_shift;
+
+       snprintf(priv->message, EDAC_MSG_SIZE,
+                "addr = 0x%llx, data = 0x%llx, id = 0x%x", addr, data, id);
+
+       edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci, 1, addr >> PAGE_SHIFT,
+                            addr & ~PAGE_MASK, synd, 0, 0, -1, priv->message, "");
+}
+
+static irqreturn_t edac_ecc_isr(int irq, void *dev_id)
+{
+       const struct npcm_platform_data *pdata;
+       struct mem_ctl_info *mci = dev_id;
+       u32 status;
+
+       pdata = ((struct priv_data *)mci->pvt_info)->pdata;
+       regmap_read(npcm_regmap, pdata->ctl_int_status, &status);
+       if (status & pdata->int_status_ce_mask) {
+               handle_ce(mci);
+
+               /* acknowledge the CE interrupt */
+               regmap_write(npcm_regmap, pdata->ctl_int_ack,
+                            pdata->int_ack_ce_mask);
+               return IRQ_HANDLED;
+       } else if (status & pdata->int_status_ue_mask) {
+               handle_ue(mci);
+
+               /* acknowledge the UE interrupt */
+               regmap_write(npcm_regmap, pdata->ctl_int_ack,
+                            pdata->int_ack_ue_mask);
+               return IRQ_HANDLED;
+       }
+
+       WARN_ON_ONCE(1);
+       return IRQ_NONE;
+}
+
+static ssize_t force_ecc_error(struct file *file, const char __user *data,
+                              size_t count, loff_t *ppos)
+{
+       struct device *dev = file->private_data;
+       struct mem_ctl_info *mci = to_mci(dev);
+       struct priv_data *priv = mci->pvt_info;
+       const struct npcm_platform_data *pdata;
+       u32 val, syndrome;
+       int ret;
+
+       pdata = priv->pdata;
+       edac_printk(KERN_INFO, EDAC_MOD_NAME,
+                   "force an ECC error, type = %d, location = %d, bit = %d\n",
+                   priv->error_type, priv->location, priv->bit);
+
+       /* ensure no pending writes */
+       ret = regmap_read_poll_timeout(npcm_regmap, pdata->ctl_controller_busy,
+                                      val, !(val & pdata->controller_busy_mask),
+                                      1000, 10000);
+       if (ret) {
+               edac_printk(KERN_INFO, EDAC_MOD_NAME,
+                           "wait pending writes timeout\n");
+               return count;
+       }
+
+       regmap_read(npcm_regmap, pdata->ctl_xor_check_bits, &val);
+       val &= ~pdata->xor_check_bits_mask;
+
+       /* write syndrome to XOR_CHECK_BITS */
+       if (priv->error_type == ERROR_TYPE_CORRECTABLE) {
+               if (priv->location == ERROR_LOCATION_DATA &&
+                   priv->bit > ERROR_BIT_DATA_MAX) {
+                       edac_printk(KERN_INFO, EDAC_MOD_NAME,
+                                   "data bit should not exceed %d (%d)\n",
+                                   ERROR_BIT_DATA_MAX, priv->bit);
+                       return count;
+               }
+
+               if (priv->location == ERROR_LOCATION_CHECKCODE &&
+                   priv->bit > ERROR_BIT_CHECKCODE_MAX) {
+                       edac_printk(KERN_INFO, EDAC_MOD_NAME,
+                                   "checkcode bit should not exceed %d (%d)\n",
+                                   ERROR_BIT_CHECKCODE_MAX, priv->bit);
+                       return count;
+               }
+
+               syndrome = priv->location ? 1 << priv->bit
+                                         : data_synd[priv->bit];
+
+               regmap_write(npcm_regmap, pdata->ctl_xor_check_bits,
+                            val | (syndrome << pdata->xor_check_bits_shift) |
+                            pdata->writeback_en_mask);
+       } else if (priv->error_type == ERROR_TYPE_UNCORRECTABLE) {
+               regmap_write(npcm_regmap, pdata->ctl_xor_check_bits,
+                            val | (UE_SYNDROME << pdata->xor_check_bits_shift));
+       }
+
+       /* force write check */
+       regmap_update_bits(npcm_regmap, pdata->ctl_xor_check_bits,
+                          pdata->fwc_mask, pdata->fwc_mask);
+
+       return count;
+}
+
+static const struct file_operations force_ecc_error_fops = {
+       .open = simple_open,
+       .write = force_ecc_error,
+       .llseek = generic_file_llseek,
+};
+
+/*
+ * Setup debugfs for error injection.
+ *
+ * Nodes:
+ *   error_type        - 0: CE, 1: UE
+ *   location          - 0: data, 1: checkcode
+ *   bit               - 0 ~ 63 for data and 0 ~ 7 for checkcode
+ *   force_ecc_error   - trigger
+ *
+ * Examples:
+ *   1. Inject a correctable error (CE) at checkcode bit 7.
+ *      ~# echo 0 > /sys/kernel/debug/edac/npcm-edac/error_type
+ *      ~# echo 1 > /sys/kernel/debug/edac/npcm-edac/location
+ *      ~# echo 7 > /sys/kernel/debug/edac/npcm-edac/bit
+ *      ~# echo 1 > /sys/kernel/debug/edac/npcm-edac/force_ecc_error
+ *
+ *   2. Inject an uncorrectable error (UE).
+ *      ~# echo 1 > /sys/kernel/debug/edac/npcm-edac/error_type
+ *      ~# echo 1 > /sys/kernel/debug/edac/npcm-edac/force_ecc_error
+ */
+static void setup_debugfs(struct mem_ctl_info *mci)
+{
+       struct priv_data *priv = mci->pvt_info;
+
+       priv->debugfs = edac_debugfs_create_dir(mci->mod_name);
+       if (!priv->debugfs)
+               return;
+
+       edac_debugfs_create_x8("error_type", 0644, priv->debugfs, &priv->error_type);
+       edac_debugfs_create_x8("location", 0644, priv->debugfs, &priv->location);
+       edac_debugfs_create_x8("bit", 0644, priv->debugfs, &priv->bit);
+       edac_debugfs_create_file("force_ecc_error", 0200, priv->debugfs,
+                                &mci->dev, &force_ecc_error_fops);
+}
+
+static int setup_irq(struct mem_ctl_info *mci, struct platform_device *pdev)
+{
+       const struct npcm_platform_data *pdata;
+       int ret, irq;
+
+       pdata = ((struct priv_data *)mci->pvt_info)->pdata;
+       irq = platform_get_irq(pdev, 0);
+       if (irq < 0) {
+               edac_printk(KERN_ERR, EDAC_MOD_NAME, "IRQ not defined in DTS\n");
+               return irq;
+       }
+
+       ret = devm_request_irq(&pdev->dev, irq, edac_ecc_isr, 0,
+                              dev_name(&pdev->dev), mci);
+       if (ret < 0) {
+               edac_printk(KERN_ERR, EDAC_MOD_NAME, "failed to request IRQ\n");
+               return ret;
+       }
+
+       /* enable the functional group of ECC and mask the others */
+       regmap_write(npcm_regmap, pdata->ctl_int_mask_master,
+                    pdata->int_mask_master_non_ecc_mask);
+
+       if (pdata->chip == NPCM8XX_CHIP)
+               regmap_write(npcm_regmap, pdata->ctl_int_mask_ecc,
+                            pdata->int_mask_ecc_non_event_mask);
+
+       return 0;
+}
+
+static const struct regmap_config npcm_regmap_cfg = {
+       .reg_bits       = 32,
+       .reg_stride     = 4,
+       .val_bits       = 32,
+};
+
+static int edac_probe(struct platform_device *pdev)
+{
+       const struct npcm_platform_data *pdata;
+       struct device *dev = &pdev->dev;
+       struct edac_mc_layer layers[1];
+       struct mem_ctl_info *mci;
+       struct priv_data *priv;
+       void __iomem *reg;
+       u32 val;
+       int rc;
+
+       reg = devm_platform_ioremap_resource(pdev, 0);
+       if (IS_ERR(reg))
+               return PTR_ERR(reg);
+
+       npcm_regmap = devm_regmap_init_mmio(dev, reg, &npcm_regmap_cfg);
+       if (IS_ERR(npcm_regmap))
+               return PTR_ERR(npcm_regmap);
+
+       pdata = of_device_get_match_data(dev);
+       if (!pdata)
+               return -EINVAL;
+
+       /* bail out if ECC is not enabled */
+       regmap_read(npcm_regmap, pdata->ctl_ecc_en, &val);
+       if (!(val & pdata->ecc_en_mask)) {
+               edac_printk(KERN_ERR, EDAC_MOD_NAME, "ECC is not enabled\n");
+               return -EPERM;
+       }
+
+       edac_op_state = EDAC_OPSTATE_INT;
+
+       layers[0].type = EDAC_MC_LAYER_ALL_MEM;
+       layers[0].size = 1;
+
+       mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers,
+                           sizeof(struct priv_data));
+       if (!mci)
+               return -ENOMEM;
+
+       mci->pdev = &pdev->dev;
+       priv = mci->pvt_info;
+       priv->reg = reg;
+       priv->pdata = pdata;
+       platform_set_drvdata(pdev, mci);
+
+       mci->mtype_cap = MEM_FLAG_DDR4;
+       mci->edac_ctl_cap = EDAC_FLAG_SECDED;
+       mci->scrub_cap = SCRUB_FLAG_HW_SRC;
+       mci->scrub_mode = SCRUB_HW_SRC;
+       mci->edac_cap = EDAC_FLAG_SECDED;
+       mci->ctl_name = "npcm_ddr_controller";
+       mci->dev_name = dev_name(&pdev->dev);
+       mci->mod_name = EDAC_MOD_NAME;
+       mci->ctl_page_to_phys = NULL;
+
+       rc = setup_irq(mci, pdev);
+       if (rc)
+               goto free_edac_mc;
+
+       rc = edac_mc_add_mc(mci);
+       if (rc)
+               goto free_edac_mc;
+
+       if (IS_ENABLED(CONFIG_EDAC_DEBUG) && pdata->chip == NPCM8XX_CHIP)
+               setup_debugfs(mci);
+
+       return rc;
+
+free_edac_mc:
+       edac_mc_free(mci);
+       return rc;
+}
+
+static int edac_remove(struct platform_device *pdev)
+{
+       struct mem_ctl_info *mci = platform_get_drvdata(pdev);
+       struct priv_data *priv = mci->pvt_info;
+       const struct npcm_platform_data *pdata;
+
+       pdata = priv->pdata;
+       if (IS_ENABLED(CONFIG_EDAC_DEBUG) && pdata->chip == NPCM8XX_CHIP)
+               edac_debugfs_remove_recursive(priv->debugfs);
+
+       edac_mc_del_mc(&pdev->dev);
+       edac_mc_free(mci);
+
+       regmap_write(npcm_regmap, pdata->ctl_int_mask_master,
+                    pdata->int_mask_master_global_mask);
+       regmap_update_bits(npcm_regmap, pdata->ctl_ecc_en, pdata->ecc_en_mask, 0);
+
+       return 0;
+}
+
+static const struct npcm_platform_data npcm750_edac = {
+       .chip                           = NPCM7XX_CHIP,
+
+       /* memory controller registers */
+       .ctl_ecc_en                     = 0x174,
+       .ctl_int_status                 = 0x1d0,
+       .ctl_int_ack                    = 0x1d4,
+       .ctl_int_mask_master            = 0x1d8,
+       .ctl_ce_addr_l                  = 0x188,
+       .ctl_ce_data_l                  = 0x190,
+       .ctl_ce_synd                    = 0x18c,
+       .ctl_ue_addr_l                  = 0x17c,
+       .ctl_ue_data_l                  = 0x184,
+       .ctl_ue_synd                    = 0x180,
+       .ctl_source_id                  = 0x194,
+
+       /* masks and shifts */
+       .ecc_en_mask                    = BIT(24),
+       .int_status_ce_mask             = GENMASK(4, 3),
+       .int_status_ue_mask             = GENMASK(6, 5),
+       .int_ack_ce_mask                = GENMASK(4, 3),
+       .int_ack_ue_mask                = GENMASK(6, 5),
+       .int_mask_master_non_ecc_mask   = GENMASK(30, 7) | GENMASK(2, 0),
+       .int_mask_master_global_mask    = BIT(31),
+       .ce_synd_mask                   = GENMASK(6, 0),
+       .ce_synd_shift                  = 0,
+       .ue_synd_mask                   = GENMASK(6, 0),
+       .ue_synd_shift                  = 0,
+       .source_id_ce_mask              = GENMASK(29, 16),
+       .source_id_ce_shift             = 16,
+       .source_id_ue_mask              = GENMASK(13, 0),
+       .source_id_ue_shift             = 0,
+};
+
+static const struct npcm_platform_data npcm845_edac = {
+       .chip                           = NPCM8XX_CHIP,
+
+       /* memory controller registers */
+       .ctl_ecc_en                     = 0x16c,
+       .ctl_int_status                 = 0x228,
+       .ctl_int_ack                    = 0x244,
+       .ctl_int_mask_master            = 0x220,
+       .ctl_int_mask_ecc               = 0x260,
+       .ctl_ce_addr_l                  = 0x18c,
+       .ctl_ce_addr_h                  = 0x190,
+       .ctl_ce_data_l                  = 0x194,
+       .ctl_ce_data_h                  = 0x198,
+       .ctl_ce_synd                    = 0x190,
+       .ctl_ue_addr_l                  = 0x17c,
+       .ctl_ue_addr_h                  = 0x180,
+       .ctl_ue_data_l                  = 0x184,
+       .ctl_ue_data_h                  = 0x188,
+       .ctl_ue_synd                    = 0x180,
+       .ctl_source_id                  = 0x19c,
+       .ctl_controller_busy            = 0x20c,
+       .ctl_xor_check_bits             = 0x174,
+
+       /* masks and shifts */
+       .ecc_en_mask                    = GENMASK(17, 16),
+       .int_status_ce_mask             = GENMASK(1, 0),
+       .int_status_ue_mask             = GENMASK(3, 2),
+       .int_ack_ce_mask                = GENMASK(1, 0),
+       .int_ack_ue_mask                = GENMASK(3, 2),
+       .int_mask_master_non_ecc_mask   = GENMASK(30, 3) | GENMASK(1, 0),
+       .int_mask_master_global_mask    = BIT(31),
+       .int_mask_ecc_non_event_mask    = GENMASK(8, 4),
+       .ce_addr_h_mask                 = GENMASK(1, 0),
+       .ce_synd_mask                   = GENMASK(15, 8),
+       .ce_synd_shift                  = 8,
+       .ue_addr_h_mask                 = GENMASK(1, 0),
+       .ue_synd_mask                   = GENMASK(15, 8),
+       .ue_synd_shift                  = 8,
+       .source_id_ce_mask              = GENMASK(29, 16),
+       .source_id_ce_shift             = 16,
+       .source_id_ue_mask              = GENMASK(13, 0),
+       .source_id_ue_shift             = 0,
+       .controller_busy_mask           = BIT(0),
+       .xor_check_bits_mask            = GENMASK(23, 16),
+       .xor_check_bits_shift           = 16,
+       .writeback_en_mask              = BIT(24),
+       .fwc_mask                       = BIT(8),
+};
+
+static const struct of_device_id npcm_edac_of_match[] = {
+       {
+               .compatible = "nuvoton,npcm750-memory-controller",
+               .data = &npcm750_edac
+       },
+       {
+               .compatible = "nuvoton,npcm845-memory-controller",
+               .data = &npcm845_edac
+       },
+       {},
+};
+
+MODULE_DEVICE_TABLE(of, npcm_edac_of_match);
+
+static struct platform_driver npcm_edac_driver = {
+       .driver = {
+               .name = "npcm-edac",
+               .of_match_table = npcm_edac_of_match,
+       },
+       .probe = edac_probe,
+       .remove = edac_remove,
+};
+
+module_platform_driver(npcm_edac_driver);
+
+MODULE_AUTHOR("Medad CChien <medadyoung@gmail.com>");
+MODULE_AUTHOR("Marvin Lin <kflin@nuvoton.com>");
+MODULE_DESCRIPTION("Nuvoton NPCM EDAC Driver");
+MODULE_LICENSE("GPL");
index 0bcd9f0..b9c5772 100644 (file)
@@ -481,7 +481,7 @@ static int thunderx_create_debugfs_nodes(struct dentry *parent,
                ent = edac_debugfs_create_file(attrs[i]->name, attrs[i]->mode,
                                               parent, data, &attrs[i]->fops);
 
-               if (!ent)
+               if (IS_ERR(ent))
                        break;
        }
 
index 043ca31..231f1c7 100644 (file)
@@ -269,6 +269,20 @@ config EFI_COCO_SECRET
          virt/coco/efi_secret module to access the secrets, which in turn
          allows userspace programs to access the injected secrets.
 
+config UNACCEPTED_MEMORY
+       bool
+       depends on EFI_STUB
+       help
+          Some Virtual Machine platforms, such as Intel TDX, require
+          some memory to be "accepted" by the guest before it can be used.
+          This mechanism helps prevent malicious hosts from making changes
+          to guest memory.
+
+          The UEFI specification v2.9 introduced the EFI_UNACCEPTED_MEMORY memory type.
+
+          This option adds support for unaccepted memory and makes such memory
+          usable by the kernel.
+
 config EFI_EMBEDDED_FIRMWARE
        bool
        select CRYPTO_LIB_SHA256
index b51f2a4..e489fef 100644 (file)
@@ -41,3 +41,4 @@ obj-$(CONFIG_EFI_CAPSULE_LOADER)      += capsule-loader.o
 obj-$(CONFIG_EFI_EARLYCON)             += earlycon.o
 obj-$(CONFIG_UEFI_CPER_ARM)            += cper-arm.o
 obj-$(CONFIG_UEFI_CPER_X86)            += cper-x86.o
+obj-$(CONFIG_UNACCEPTED_MEMORY)                += unaccepted_memory.o
index 34b9e78..3a6ee7b 100644 (file)
@@ -50,6 +50,9 @@ struct efi __read_mostly efi = {
 #ifdef CONFIG_EFI_COCO_SECRET
        .coco_secret            = EFI_INVALID_TABLE_ADDR,
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+       .unaccepted             = EFI_INVALID_TABLE_ADDR,
+#endif
 };
 EXPORT_SYMBOL(efi);
 
@@ -584,6 +587,9 @@ static const efi_config_table_type_t common_tables[] __initconst = {
 #ifdef CONFIG_EFI_COCO_SECRET
        {LINUX_EFI_COCO_SECRET_AREA_GUID,       &efi.coco_secret,       "CocoSecret"    },
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+       {LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID,   &efi.unaccepted,        "Unaccepted"    },
+#endif
 #ifdef CONFIG_EFI_GENERIC_STUB
        {LINUX_EFI_SCREEN_INFO_TABLE_GUID,      &screen_info_table                      },
 #endif
@@ -738,6 +744,25 @@ int __init efi_config_parse_tables(const efi_config_table_t *config_tables,
                }
        }
 
+       if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
+           efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
+               struct efi_unaccepted_memory *unaccepted;
+
+               unaccepted = early_memremap(efi.unaccepted, sizeof(*unaccepted));
+               if (unaccepted) {
+                       unsigned long size;
+
+                       if (unaccepted->version == 1) {
+                               size = sizeof(*unaccepted) + unaccepted->size;
+                               memblock_reserve(efi.unaccepted, size);
+                       } else {
+                               efi.unaccepted = EFI_INVALID_TABLE_ADDR;
+                       }
+
+                       early_memunmap(unaccepted, sizeof(*unaccepted));
+               }
+       }
+
        return 0;
 }
 
@@ -822,6 +847,7 @@ static __initdata char memory_type_name[][13] = {
        "MMIO Port",
        "PAL Code",
        "Persistent",
+       "Unaccepted",
 };
 
 char * __init efi_md_typeattr_format(char *buf, size_t size,
index 3abb2b3..16d64a3 100644 (file)
@@ -96,6 +96,8 @@ CFLAGS_arm32-stub.o           := -DTEXT_OFFSET=$(TEXT_OFFSET)
 zboot-obj-$(CONFIG_RISCV)      := lib-clz_ctz.o lib-ashldi3.o
 lib-$(CONFIG_EFI_ZBOOT)                += zboot.o $(zboot-obj-y)
 
+lib-$(CONFIG_UNACCEPTED_MEMORY) += unaccepted_memory.o bitmap.o find.o
+
 extra-y                                := $(lib-y)
 lib-y                          := $(patsubst %.o,%.stub.o,$(lib-y))
 
diff --git a/drivers/firmware/efi/libstub/bitmap.c b/drivers/firmware/efi/libstub/bitmap.c
new file mode 100644 (file)
index 0000000..5c9bba0
--- /dev/null
@@ -0,0 +1,41 @@
+#include <linux/bitmap.h>
+
+void __bitmap_set(unsigned long *map, unsigned int start, int len)
+{
+       unsigned long *p = map + BIT_WORD(start);
+       const unsigned int size = start + len;
+       int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
+       unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
+
+       while (len - bits_to_set >= 0) {
+               *p |= mask_to_set;
+               len -= bits_to_set;
+               bits_to_set = BITS_PER_LONG;
+               mask_to_set = ~0UL;
+               p++;
+       }
+       if (len) {
+               mask_to_set &= BITMAP_LAST_WORD_MASK(size);
+               *p |= mask_to_set;
+       }
+}
+
+void __bitmap_clear(unsigned long *map, unsigned int start, int len)
+{
+       unsigned long *p = map + BIT_WORD(start);
+       const unsigned int size = start + len;
+       int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+       unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+       while (len - bits_to_clear >= 0) {
+               *p &= ~mask_to_clear;
+               len -= bits_to_clear;
+               bits_to_clear = BITS_PER_LONG;
+               mask_to_clear = ~0UL;
+               p++;
+       }
+       if (len) {
+               mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+               *p &= ~mask_to_clear;
+       }
+}
index 54a2822..6aa38a1 100644 (file)
@@ -1136,4 +1136,10 @@ void efi_remap_image(unsigned long image_base, unsigned alloc_size,
 asmlinkage efi_status_t __efiapi
 efi_zboot_entry(efi_handle_t handle, efi_system_table_t *systab);
 
+efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
+                                       struct efi_boot_memmap *map);
+void process_unaccepted_memory(u64 start, u64 end);
+void accept_memory(phys_addr_t start, phys_addr_t end);
+void arch_accept_memory(phys_addr_t start, phys_addr_t end);
+
 #endif
diff --git a/drivers/firmware/efi/libstub/find.c b/drivers/firmware/efi/libstub/find.c
new file mode 100644 (file)
index 0000000..4e7740d
--- /dev/null
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/bitmap.h>
+#include <linux/math.h>
+#include <linux/minmax.h>
+
+/*
+ * Common helper for find_next_bit() function family
+ * @FETCH: The expression that fetches and pre-processes each word of bitmap(s)
+ * @MUNGE: The expression that post-processes a word containing found bit (may be empty)
+ * @size: The bitmap size in bits
+ * @start: The bitnumber to start searching at
+ */
+#define FIND_NEXT_BIT(FETCH, MUNGE, size, start)                               \
+({                                                                             \
+       unsigned long mask, idx, tmp, sz = (size), __start = (start);           \
+                                                                               \
+       if (unlikely(__start >= sz))                                            \
+               goto out;                                                       \
+                                                                               \
+       mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start));                          \
+       idx = __start / BITS_PER_LONG;                                          \
+                                                                               \
+       for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) {                       \
+               if ((idx + 1) * BITS_PER_LONG >= sz)                            \
+                       goto out;                                               \
+               idx++;                                                          \
+       }                                                                       \
+                                                                               \
+       sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz);                  \
+out:                                                                           \
+       sz;                                                                     \
+})
+
+unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
+{
+       return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
+}
+
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+                                        unsigned long start)
+{
+       return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
+}
diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
new file mode 100644 (file)
index 0000000..ca61f47
--- /dev/null
@@ -0,0 +1,222 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/efi.h>
+#include <asm/efi.h>
+#include "efistub.h"
+
+struct efi_unaccepted_memory *unaccepted_table;
+
+efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
+                                       struct efi_boot_memmap *map)
+{
+       efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
+       u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size;
+       efi_status_t status;
+       int i;
+
+       /* Check if the table is already installed */
+       unaccepted_table = get_efi_config_table(unaccepted_table_guid);
+       if (unaccepted_table) {
+               if (unaccepted_table->version != 1) {
+                       efi_err("Unknown version of unaccepted memory table\n");
+                       return EFI_UNSUPPORTED;
+               }
+               return EFI_SUCCESS;
+       }
+
+       /* Check if there's any unaccepted memory and find the max address */
+       for (i = 0; i < nr_desc; i++) {
+               efi_memory_desc_t *d;
+               unsigned long m = (unsigned long)map->map;
+
+               d = efi_early_memdesc_ptr(m, map->desc_size, i);
+               if (d->type != EFI_UNACCEPTED_MEMORY)
+                       continue;
+
+               unaccepted_start = min(unaccepted_start, d->phys_addr);
+               unaccepted_end = max(unaccepted_end,
+                                    d->phys_addr + d->num_pages * PAGE_SIZE);
+       }
+
+       if (unaccepted_start == ULLONG_MAX)
+               return EFI_SUCCESS;
+
+       unaccepted_start = round_down(unaccepted_start,
+                                     EFI_UNACCEPTED_UNIT_SIZE);
+       unaccepted_end = round_up(unaccepted_end, EFI_UNACCEPTED_UNIT_SIZE);
+
+       /*
+        * If unaccepted memory is present, allocate a bitmap to track what
+        * memory has to be accepted before access.
+        *
+        * One bit in the bitmap represents 2MiB in the address space:
+        * A 4k bitmap can track 64GiB of physical address space.
+        *
+        * In the worst case scenario -- a huge hole in the middle of the
+        * address space -- it needs 256MiB to handle 4PiB of the address
+        * space.
+        *
+        * The bitmap will be populated in setup_e820() according to the memory
+        * map after efi_exit_boot_services().
+        */
+       bitmap_size = DIV_ROUND_UP(unaccepted_end - unaccepted_start,
+                                  EFI_UNACCEPTED_UNIT_SIZE * BITS_PER_BYTE);
+
+       status = efi_bs_call(allocate_pool, EFI_LOADER_DATA,
+                            sizeof(*unaccepted_table) + bitmap_size,
+                            (void **)&unaccepted_table);
+       if (status != EFI_SUCCESS) {
+               efi_err("Failed to allocate unaccepted memory config table\n");
+               return status;
+       }
+
+       unaccepted_table->version = 1;
+       unaccepted_table->unit_size = EFI_UNACCEPTED_UNIT_SIZE;
+       unaccepted_table->phys_base = unaccepted_start;
+       unaccepted_table->size = bitmap_size;
+       memset(unaccepted_table->bitmap, 0, bitmap_size);
+
+       status = efi_bs_call(install_configuration_table,
+                            &unaccepted_table_guid, unaccepted_table);
+       if (status != EFI_SUCCESS) {
+               efi_bs_call(free_pool, unaccepted_table);
+               efi_err("Failed to install unaccepted memory config table!\n");
+       }
+
+       return status;
+}
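
The sizing rule spelled out in the comment above (one bit per unit, so a 4k bitmap covers 64GiB at 2MiB granularity) is plain integer arithmetic. A hedged sketch, with the unit size assumed to be 2MiB as on x86:

#include <stdio.h>
#include <stdint.h>

#define UNIT_SIZE	(2ULL << 20)	/* assumed EFI_UNACCEPTED_UNIT_SIZE: 2 MiB */
#define BITS_PER_BYTE	8ULL

int main(void)
{
	/* One bitmap bit covers UNIT_SIZE of physical address space. */
	uint64_t spans[] = { 64ULL << 30, 4ULL << 50 };	/* 64 GiB and 4 PiB */

	for (int i = 0; i < 2; i++) {
		uint64_t per_byte = UNIT_SIZE * BITS_PER_BYTE;
		uint64_t bitmap_bytes = (spans[i] + per_byte - 1) / per_byte;

		printf("span %llu GiB -> %llu KiB bitmap\n",
		       (unsigned long long)(spans[i] >> 30),
		       (unsigned long long)(bitmap_bytes >> 10));
	}
	return 0;	/* 4 KiB for 64 GiB, 262144 KiB (256 MiB) for 4 PiB */
}
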
+
+/*
+ * The accepted memory bitmap only works at unit_size granularity.  Take
+ * unaligned start/end addresses and either:
+ *  1. Accept the memory immediately and in its entirety, or
+ *  2. Accept the unaligned parts and mark *some* aligned part unaccepted.
+ *
+ * The function will never reach the bitmap_set() with zero bits to set.
+ */
+void process_unaccepted_memory(u64 start, u64 end)
+{
+       u64 unit_size = unaccepted_table->unit_size;
+       u64 unit_mask = unaccepted_table->unit_size - 1;
+       u64 bitmap_size = unaccepted_table->size;
+
+       /*
+        * Ensure that at least one bit will be set in the bitmap by
+        * immediately accepting all regions under 2*unit_size.  This is
+        * imprecise and may immediately accept some areas that could
+        * have been represented in the bitmap, but it results in simpler
+        * code below.
+        *
+        * Consider case like this (assuming unit_size == 2MB):
+        *
+        * | 4k | 2044k |    2048k   |
+        * ^ 0x0        ^ 2MB        ^ 4MB
+        *
+        * Only the first 4k has been accepted. The 0MB->2MB region cannot be
+        * represented in the bitmap. The 2MB->4MB region can be represented in
+        * the bitmap. But, the 0MB->4MB region is <2*unit_size and will be
+        * immediately accepted in its entirety.
+        */
+       if (end - start < 2 * unit_size) {
+               arch_accept_memory(start, end);
+               return;
+       }
+
+       /*
+        * No matter how the start and end are aligned, at least one unaccepted
+        * unit_size area will remain to be marked in the bitmap.
+        */
+
+       /* Immediately accept a <unit_size piece at the start: */
+       if (start & unit_mask) {
+               arch_accept_memory(start, round_up(start, unit_size));
+               start = round_up(start, unit_size);
+       }
+
+       /* Immediately accept a <unit_size piece at the end: */
+       if (end & unit_mask) {
+               arch_accept_memory(round_down(end, unit_size), end);
+               end = round_down(end, unit_size);
+       }
+
+       /*
+        * Accept the part of the range that lies before phys_base and cannot
+        * be recorded in the bitmap.
+        */
+       if (start < unaccepted_table->phys_base) {
+               arch_accept_memory(start,
+                                  min(unaccepted_table->phys_base, end));
+               start = unaccepted_table->phys_base;
+       }
+
+       /* Nothing to record */
+       if (end < unaccepted_table->phys_base)
+               return;
+
+       /* Translate to offsets from the beginning of the bitmap */
+       start -= unaccepted_table->phys_base;
+       end -= unaccepted_table->phys_base;
+
+       /* Accept memory that doesn't fit into bitmap */
+       if (end > bitmap_size * unit_size * BITS_PER_BYTE) {
+               unsigned long phys_start, phys_end;
+
+               phys_start = bitmap_size * unit_size * BITS_PER_BYTE +
+                            unaccepted_table->phys_base;
+               phys_end = end + unaccepted_table->phys_base;
+
+               arch_accept_memory(phys_start, phys_end);
+               end = bitmap_size * unit_size * BITS_PER_BYTE;
+       }
+
+       /*
+        * 'start' and 'end' are now both unit_size-aligned.
+        * Record the range as being unaccepted:
+        */
+       bitmap_set(unaccepted_table->bitmap,
+                  start / unit_size, (end - start) / unit_size);
+}
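
To make the three-way split above concrete, here is an illustrative user-space rendition: phys_base is taken as 0, the unit size is assumed to be 2MiB, the bitmap-overflow case is omitted, and arch_accept_memory()/bitmap_set() are replaced by printf().

#include <stdio.h>
#include <stdint.h>

#define UNIT	(2ULL << 20)	/* assumed unit_size: 2 MiB */

static uint64_t rdown(uint64_t x) { return x & ~(UNIT - 1); }
static uint64_t rup(uint64_t x)   { return rdown(x + UNIT - 1); }

/* Same control flow as process_unaccepted_memory() above, simplified. */
static void split(uint64_t start, uint64_t end)
{
	if (end - start < 2 * UNIT) {
		printf("accept now: [%#llx, %#llx)\n",
		       (unsigned long long)start, (unsigned long long)end);
		return;
	}
	if (start & (UNIT - 1)) {		/* unaligned head */
		printf("accept now: [%#llx, %#llx)\n",
		       (unsigned long long)start, (unsigned long long)rup(start));
		start = rup(start);
	}
	if (end & (UNIT - 1)) {			/* unaligned tail */
		printf("accept now: [%#llx, %#llx)\n",
		       (unsigned long long)rdown(end), (unsigned long long)end);
		end = rdown(end);
	}
	printf("record in bitmap: [%#llx, %#llx), %llu bit(s)\n",
	       (unsigned long long)start, (unsigned long long)end,
	       (unsigned long long)((end - start) / UNIT));
}

int main(void)
{
	/* 4 KiB .. 5 MiB: head and tail are accepted immediately, one bit is
	 * recorded for the aligned 2 MiB..4 MiB unit in the middle. */
	split(0x1000, 0x500000);
	return 0;
}
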
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+       unsigned long range_start, range_end;
+       unsigned long bitmap_size;
+       u64 unit_size;
+
+       if (!unaccepted_table)
+               return;
+
+       unit_size = unaccepted_table->unit_size;
+
+       /*
+        * Only care for the part of the range that is represented
+        * in the bitmap.
+        */
+       if (start < unaccepted_table->phys_base)
+               start = unaccepted_table->phys_base;
+       if (end < unaccepted_table->phys_base)
+               return;
+
+       /* Translate to offsets from the beginning of the bitmap */
+       start -= unaccepted_table->phys_base;
+       end -= unaccepted_table->phys_base;
+
+       /* Make sure not to overrun the bitmap */
+       if (end > unaccepted_table->size * unit_size * BITS_PER_BYTE)
+               end = unaccepted_table->size * unit_size * BITS_PER_BYTE;
+
+       range_start = start / unit_size;
+       bitmap_size = DIV_ROUND_UP(end, unit_size);
+
+       for_each_set_bitrange_from(range_start, range_end,
+                                  unaccepted_table->bitmap, bitmap_size) {
+               unsigned long phys_start, phys_end;
+
+               phys_start = range_start * unit_size + unaccepted_table->phys_base;
+               phys_end = range_end * unit_size + unaccepted_table->phys_base;
+
+               arch_accept_memory(phys_start, phys_end);
+               bitmap_clear(unaccepted_table->bitmap,
+                            range_start, range_end - range_start);
+       }
+}
index a0bfd31..220be75 100644 (file)
@@ -26,6 +26,17 @@ const efi_dxe_services_table_t *efi_dxe_table;
 u32 image_offset __section(".data");
 static efi_loaded_image_t *image = NULL;
 
+typedef union sev_memory_acceptance_protocol sev_memory_acceptance_protocol_t;
+union sev_memory_acceptance_protocol {
+       struct {
+               efi_status_t (__efiapi * allow_unaccepted_memory)(
+                       sev_memory_acceptance_protocol_t *);
+       };
+       struct {
+               u32 allow_unaccepted_memory;
+       } mixed_mode;
+};
+
 static efi_status_t
 preserve_pci_rom_image(efi_pci_io_protocol_t *pci, struct pci_setup_rom **__rom)
 {
@@ -310,6 +321,29 @@ setup_memory_protection(unsigned long image_base, unsigned long image_size)
 #endif
 }
 
+static void setup_unaccepted_memory(void)
+{
+       efi_guid_t mem_acceptance_proto = OVMF_SEV_MEMORY_ACCEPTANCE_PROTOCOL_GUID;
+       sev_memory_acceptance_protocol_t *proto;
+       efi_status_t status;
+
+       if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
+               return;
+
+       /*
+        * Enable unaccepted memory before calling exit boot services so that
+        * the firmware does not accept all memory on EBS.
+        */
+       status = efi_bs_call(locate_protocol, &mem_acceptance_proto, NULL,
+                            (void **)&proto);
+       if (status != EFI_SUCCESS)
+               return;
+
+       status = efi_call_proto(proto, allow_unaccepted_memory);
+       if (status != EFI_SUCCESS)
+               efi_err("Memory acceptance protocol failed\n");
+}
+
 static const efi_char16_t apple[] = L"Apple";
 
 static void setup_quirks(struct boot_params *boot_params,
@@ -613,6 +647,16 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
                        e820_type = E820_TYPE_PMEM;
                        break;
 
+               case EFI_UNACCEPTED_MEMORY:
+                       if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
+                               efi_warn_once(
+"The system has unaccepted memory,  but kernel does not support it\nConsider enabling CONFIG_UNACCEPTED_MEMORY\n");
+                               continue;
+                       }
+                       e820_type = E820_TYPE_RAM;
+                       process_unaccepted_memory(d->phys_addr,
+                                                 d->phys_addr + PAGE_SIZE * d->num_pages);
+                       break;
                default:
                        continue;
                }
@@ -681,28 +725,27 @@ static efi_status_t allocate_e820(struct boot_params *params,
                                  struct setup_data **e820ext,
                                  u32 *e820ext_size)
 {
-       unsigned long map_size, desc_size, map_key;
+       struct efi_boot_memmap *map;
        efi_status_t status;
-       __u32 nr_desc, desc_version;
-
-       /* Only need the size of the mem map and size of each mem descriptor */
-       map_size = 0;
-       status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
-                            &desc_size, &desc_version);
-       if (status != EFI_BUFFER_TOO_SMALL)
-               return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
+       __u32 nr_desc;
 
-       nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
+       status = efi_get_memory_map(&map, false);
+       if (status != EFI_SUCCESS)
+               return status;
 
-       if (nr_desc > ARRAY_SIZE(params->e820_table)) {
-               u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
+       nr_desc = map->map_size / map->desc_size;
+       if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
+               u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) +
+                                EFI_MMAP_NR_SLACK_SLOTS;
 
                status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
-               if (status != EFI_SUCCESS)
-                       return status;
        }
 
-       return EFI_SUCCESS;
+       if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && status == EFI_SUCCESS)
+               status = allocate_unaccepted_bitmap(nr_desc, map);
+
+       efi_bs_call(free_pool, map);
+       return status;
 }
 
 struct exit_boot_struct {
@@ -899,6 +942,8 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
 
        setup_quirks(boot_params, bzimage_addr, buffer_end - buffer_start);
 
+       setup_unaccepted_memory();
+
        status = exit_boot(boot_params, handle);
        if (status != EFI_SUCCESS) {
                efi_err("exit_boot() failed!\n");
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
new file mode 100644 (file)
index 0000000..853f7dc
--- /dev/null
@@ -0,0 +1,147 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/efi.h>
+#include <linux/memblock.h>
+#include <linux/spinlock.h>
+#include <asm/unaccepted_memory.h>
+
+/* Protects unaccepted memory bitmap */
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+/*
+ * accept_memory() -- Consult bitmap and accept the memory if needed.
+ *
+ * Only memory that is explicitly marked as unaccepted in the bitmap requires
+ * an action. All the remaining memory is implicitly accepted and doesn't need
+ * acceptance.
+ *
+ * No need to accept:
+ *  - anything if the system has no unaccepted table;
+ *  - memory that is below phys_base;
+ *  - memory that is above the range addressable by the bitmap.
+ */
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+       struct efi_unaccepted_memory *unaccepted;
+       unsigned long range_start, range_end;
+       unsigned long flags;
+       u64 unit_size;
+
+       unaccepted = efi_get_unaccepted_table();
+       if (!unaccepted)
+               return;
+
+       unit_size = unaccepted->unit_size;
+
+       /*
+        * Only care for the part of the range that is represented
+        * in the bitmap.
+        */
+       if (start < unaccepted->phys_base)
+               start = unaccepted->phys_base;
+       if (end < unaccepted->phys_base)
+               return;
+
+       /* Translate to offsets from the beginning of the bitmap */
+       start -= unaccepted->phys_base;
+       end -= unaccepted->phys_base;
+
+       /*
+        * load_unaligned_zeropad() can lead to unwanted loads across page
+        * boundaries. The unwanted loads are typically harmless. But, they
+        * might be made to totally unrelated or even unmapped memory.
+        * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
+        * #VE) to recover from these unwanted loads.
+        *
+        * But, this approach does not work for unaccepted memory. For TDX, a
+        * load from unaccepted memory will not lead to a recoverable exception
+        * within the guest. The guest will exit to the VMM where the only
+        * recourse is to terminate the guest.
+        *
+        * There are two parts to fix this issue and comprehensively avoid
+        * access to unaccepted memory. Together these ensure that an extra
+        * "guard" page is accepted in addition to the memory that needs to be
+        * used:
+        *
+        * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
+        *    checks up to end+unit_size if 'end' is aligned on a unit_size
+        *    boundary.
+        *
+        * 2. Implicitly extend accept_memory(start, end) to end+unit_size if
+        *    'end' is aligned on a unit_size boundary. (immediately following
+        *    this comment)
+        */
+       if (!(end % unit_size))
+               end += unit_size;
+
+       /* Make sure not to overrun the bitmap */
+       if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
+               end = unaccepted->size * unit_size * BITS_PER_BYTE;
+
+       range_start = start / unit_size;
+
+       spin_lock_irqsave(&unaccepted_memory_lock, flags);
+       for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
+                                  DIV_ROUND_UP(end, unit_size)) {
+               unsigned long phys_start, phys_end;
+               unsigned long len = range_end - range_start;
+
+               phys_start = range_start * unit_size + unaccepted->phys_base;
+               phys_end = range_end * unit_size + unaccepted->phys_base;
+
+               arch_accept_memory(phys_start, phys_end);
+               bitmap_clear(unaccepted->bitmap, range_start, len);
+       }
+       spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
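
The end-alignment adjustment above is what provides the extra "guard" unit described in the load_unaligned_zeropad() comment. A tiny sketch of the effect, with the unit size again assumed to be 2MiB; the real code additionally consults the bitmap, so already-accepted units are skipped.

#include <stdio.h>
#include <stdint.h>

#define UNIT	(2ULL << 20)	/* assumed unit_size: 2 MiB */

int main(void)
{
	/* Caller asks to accept exactly one unit: [2 MiB, 4 MiB). */
	uint64_t start = 1 * UNIT, end = 2 * UNIT;

	/* Because 'end' is unit-aligned, one extra guard unit is also covered
	 * so a cross-boundary load can never touch unaccepted memory. */
	if (!(end % UNIT))
		end += UNIT;

	printf("units covered: [%llu, %llu)\n",
	       (unsigned long long)(start / UNIT),
	       (unsigned long long)(end / UNIT));	/* [1, 3): units 1 and 2 */
	return 0;
}
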
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
+{
+       struct efi_unaccepted_memory *unaccepted;
+       unsigned long flags;
+       bool ret = false;
+       u64 unit_size;
+
+       unaccepted = efi_get_unaccepted_table();
+       if (!unaccepted)
+               return false;
+
+       unit_size = unaccepted->unit_size;
+
+       /*
+        * Only care for the part of the range that is represented
+        * in the bitmap.
+        */
+       if (start < unaccepted->phys_base)
+               start = unaccepted->phys_base;
+       if (end < unaccepted->phys_base)
+               return false;
+
+       /* Translate to offsets from the beginning of the bitmap */
+       start -= unaccepted->phys_base;
+       end -= unaccepted->phys_base;
+
+       /*
+        * Also consider the unaccepted state of the *next* page. See fix #1 in
+        * the comment on load_unaligned_zeropad() in accept_memory().
+        */
+       if (!(end % unit_size))
+               end += unit_size;
+
+       /* Make sure not to overrun the bitmap */
+       if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
+               end = unaccepted->size * unit_size * BITS_PER_BYTE;
+
+       spin_lock_irqsave(&unaccepted_memory_lock, flags);
+       while (start < end) {
+               if (test_bit(start / unit_size, unaccepted->bitmap)) {
+                       ret = true;
+                       break;
+               }
+
+               start += unit_size;
+       }
+       spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+       return ret;
+}
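Both helpers above share the same offset arithmetic: physical addresses are clipped to the table's phys_base, converted to bit indices in unit_size granularity, and a unit-aligned 'end' is pushed out by one extra unit so that the "guard" unit covers any load_unaligned_zeropad() overshoot. A minimal, self-contained sketch of that arithmetic (the phys_base, unit_size and range values below are made up for illustration, not taken from a real unaccepted-memory table):

#include <stdio.h>

int main(void)
{
	unsigned long long phys_base = 0x100000000ULL;	/* assumed table base */
	unsigned long long unit_size = 0x200000ULL;	/* assumed 2 MiB units */
	unsigned long long start = 0x140000000ULL;	/* example range */
	unsigned long long end = 0x140400000ULL;

	/* translate to offsets from the beginning of the bitmap */
	start -= phys_base;
	end -= phys_base;

	/* guard unit: a unit-aligned end is extended by one more unit */
	if (!(end % unit_size))
		end += unit_size;

	/* prints bits 512 and 515: two requested units plus one guard unit */
	printf("first bit %llu, one past last bit %llu\n",
	       start / unit_size, (end + unit_size - 1) / unit_size);
	return 0;
}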
index 94b49cc..71f5130 100644 (file)
@@ -42,8 +42,6 @@ static const struct {
 };
 
 #define IBFT_SIGN_LEN 4
-#define IBFT_START 0x80000 /* 512kB */
-#define IBFT_END 0x100000 /* 1MB */
 #define VGA_MEM 0xA0000 /* VGA buffer */
 #define VGA_SIZE 0x20000 /* 128kB */
 
@@ -52,9 +50,9 @@ static const struct {
  */
 void __init reserve_ibft_region(void)
 {
-       unsigned long pos;
+       unsigned long pos, virt_pos = 0;
        unsigned int len = 0;
-       void *virt;
+       void *virt = NULL;
        int i;
 
        ibft_phys_addr = 0;
@@ -70,13 +68,20 @@ void __init reserve_ibft_region(void)
                 * so skip that area */
                if (pos == VGA_MEM)
                        pos += VGA_SIZE;
-               virt = isa_bus_to_virt(pos);
+
+               /* Map page by page */
+               if (offset_in_page(pos) == 0) {
+                       if (virt)
+                               early_memunmap(virt, PAGE_SIZE);
+                       virt = early_memremap_ro(pos, PAGE_SIZE);
+                       virt_pos = pos;
+               }
 
                for (i = 0; i < ARRAY_SIZE(ibft_signs); i++) {
-                       if (memcmp(virt, ibft_signs[i].sign, IBFT_SIGN_LEN) ==
-                           0) {
+                       if (memcmp(virt + (pos - virt_pos), ibft_signs[i].sign,
+                                  IBFT_SIGN_LEN) == 0) {
                                unsigned long *addr =
-                                   (unsigned long *)isa_bus_to_virt(pos + 4);
+                                   (unsigned long *)(virt + pos - virt_pos + 4);
                                len = *addr;
                                /* if the length of the table extends past 1M,
                                 * the table cannot be valid. */
@@ -84,9 +89,12 @@ void __init reserve_ibft_region(void)
                                        ibft_phys_addr = pos;
                                        memblock_reserve(ibft_phys_addr, PAGE_ALIGN(len));
                                        pr_info("iBFT found at %pa.\n", &ibft_phys_addr);
-                                       return;
+                                       goto out;
                                }
                        }
                }
        }
+
+out:
+       early_memunmap(virt, PAGE_SIZE);
 }
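The scan now maps one page at a time instead of relying on a permanent ISA-bus mapping: whenever the search position crosses a page boundary the previous window is unmapped, the next page is mapped read-only, and every access goes through the current window via pos - virt_pos. A condensed sketch of that pattern (the scan bounds and the look_for_sign() helper are placeholders, not the driver's real names):

	void *virt = NULL;
	unsigned long pos, virt_pos = 0;

	for (pos = SCAN_START; pos < SCAN_END; pos += 16) {
		/* Map page by page: remap only when a page boundary is crossed */
		if (offset_in_page(pos) == 0) {
			if (virt)
				early_memunmap(virt, PAGE_SIZE);
			virt = early_memremap_ro(pos, PAGE_SIZE);
			virt_pos = pos;
		}

		/* all accesses go through the currently mapped window */
		look_for_sign(virt + (pos - virt_pos));
	}
	if (virt)
		early_memunmap(virt, PAGE_SIZE);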
index f2253fd..8ff5f4f 100644 (file)
@@ -100,13 +100,23 @@ static const struct regmap_irq dio48e_regmap_irqs[] = {
        DIO48E_REGMAP_IRQ(0), DIO48E_REGMAP_IRQ(1),
 };
 
-static int dio48e_handle_mask_sync(struct regmap *const map, const int index,
+/**
+ * struct dio48e_gpio - GPIO device private data structure
+ * @map:       Regmap for the device
+ * @irq_mask:  Current IRQ mask state on the device
+ */
+struct dio48e_gpio {
+       struct regmap *map;
+       unsigned int irq_mask;
+};
+
+static int dio48e_handle_mask_sync(const int index,
                                   const unsigned int mask_buf_def,
                                   const unsigned int mask_buf,
                                   void *const irq_drv_data)
 {
-       unsigned int *const irq_mask = irq_drv_data;
-       const unsigned int prev_mask = *irq_mask;
+       struct dio48e_gpio *const dio48egpio = irq_drv_data;
+       const unsigned int prev_mask = dio48egpio->irq_mask;
        int err;
        unsigned int val;
 
@@ -115,19 +125,19 @@ static int dio48e_handle_mask_sync(struct regmap *const map, const int index,
                return 0;
 
        /* remember the current mask for the next mask sync */
-       *irq_mask = mask_buf;
+       dio48egpio->irq_mask = mask_buf;
 
        /* if all previously masked, enable interrupts when unmasking */
        if (prev_mask == mask_buf_def) {
-               err = regmap_write(map, DIO48E_CLEAR_INTERRUPT, 0x00);
+               err = regmap_write(dio48egpio->map, DIO48E_CLEAR_INTERRUPT, 0x00);
                if (err)
                        return err;
-               return regmap_write(map, DIO48E_ENABLE_INTERRUPT, 0x00);
+               return regmap_write(dio48egpio->map, DIO48E_ENABLE_INTERRUPT, 0x00);
        }
 
        /* if all are currently masked, disable interrupts */
        if (mask_buf == mask_buf_def)
-               return regmap_read(map, DIO48E_DISABLE_INTERRUPT, &val);
+               return regmap_read(dio48egpio->map, DIO48E_DISABLE_INTERRUPT, &val);
 
        return 0;
 }
@@ -168,7 +178,7 @@ static int dio48e_probe(struct device *dev, unsigned int id)
        struct regmap *map;
        int err;
        struct regmap_irq_chip *chip;
-       unsigned int irq_mask;
+       struct dio48e_gpio *dio48egpio;
        struct regmap_irq_chip_data *chip_data;
 
        if (!devm_request_region(dev, base[id], DIO48E_EXTENT, name)) {
@@ -186,12 +196,14 @@ static int dio48e_probe(struct device *dev, unsigned int id)
                return dev_err_probe(dev, PTR_ERR(map),
                                     "Unable to initialize register map\n");
 
-       chip = devm_kzalloc(dev, sizeof(*chip), GFP_KERNEL);
-       if (!chip)
+       dio48egpio = devm_kzalloc(dev, sizeof(*dio48egpio), GFP_KERNEL);
+       if (!dio48egpio)
                return -ENOMEM;
 
-       chip->irq_drv_data = devm_kzalloc(dev, sizeof(irq_mask), GFP_KERNEL);
-       if (!chip->irq_drv_data)
+       dio48egpio->map = map;
+
+       chip = devm_kzalloc(dev, sizeof(*chip), GFP_KERNEL);
+       if (!chip)
                return -ENOMEM;
 
        chip->name = name;
@@ -202,6 +214,7 @@ static int dio48e_probe(struct device *dev, unsigned int id)
        chip->irqs = dio48e_regmap_irqs;
        chip->num_irqs = ARRAY_SIZE(dio48e_regmap_irqs);
        chip->handle_mask_sync = dio48e_handle_mask_sync;
+       chip->irq_drv_data = dio48egpio;
 
        /* Initialize to prevent spurious interrupts before we're ready */
        err = dio48e_irq_init_hw(map);
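Because handle_mask_sync() no longer receives the regmap pointer from the regmap-irq core, everything the callback needs is bundled into a driver-private structure and handed over through chip->irq_drv_data. A stripped-down sketch of that wiring (struct names, the register offset and the probe signature are hypothetical):

struct my_gpio {
	struct regmap *map;
	unsigned int irq_mask;
};

static int my_handle_mask_sync(const int index, const unsigned int mask_buf_def,
			       const unsigned int mask_buf, void *irq_drv_data)
{
	struct my_gpio *priv = irq_drv_data;

	/* cache the mask and talk to the hardware through the saved regmap */
	priv->irq_mask = mask_buf;
	return regmap_write(priv->map, 0x0b /* hypothetical register */, 0x00);
}

static int my_probe(struct device *dev, struct regmap *map,
		    struct regmap_irq_chip *chip)
{
	struct my_gpio *priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);

	if (!priv)
		return -ENOMEM;

	priv->map = map;
	chip->handle_mask_sync = my_handle_mask_sync;
	chip->irq_drv_data = priv;	/* handed back to the callback above */
	return 0;
}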
index 98939cd..745e5f6 100644 (file)
@@ -221,8 +221,12 @@ static int sifive_gpio_probe(struct platform_device *pdev)
                return -ENODEV;
        }
 
-       for (i = 0; i < ngpio; i++)
-               chip->irq_number[i] = platform_get_irq(pdev, i);
+       for (i = 0; i < ngpio; i++) {
+               ret = platform_get_irq(pdev, i);
+               if (ret < 0)
+                       return ret;
+               chip->irq_number[i] = ret;
+       }
 
        ret = bgpio_init(&chip->gc, dev, 4,
                         chip->base + SIFIVE_GPIO_INPUT_VAL,
index a7220e0..5be8ad6 100644 (file)
@@ -1745,7 +1745,7 @@ static void gpiochip_irqchip_remove(struct gpio_chip *gc)
        }
 
        /* Remove all IRQ mappings and delete the domain */
-       if (gc->irq.domain) {
+       if (!gc->irq.domain_is_allocated_externally && gc->irq.domain) {
                unsigned int irq;
 
                for (offset = 0; offset < gc->ngpio; offset++) {
@@ -1791,6 +1791,15 @@ int gpiochip_irqchip_add_domain(struct gpio_chip *gc,
 
        gc->to_irq = gpiochip_to_irq;
        gc->irq.domain = domain;
+       gc->irq.domain_is_allocated_externally = true;
+
+       /*
+        * Using barrier() here to prevent compiler from reordering
+        * gc->irq.initialized before adding irqdomain.
+        */
+       barrier();
+
+       gc->irq.initialized = true;
 
        return 0;
 }
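barrier() here is a compiler-only fence: it keeps the store to gc->irq.initialized from being moved before the store that publishes the externally allocated domain, so a consumer that sees the flag set also sees a valid domain pointer. A minimal sketch of the publish/consume pairing (the reader side is illustrative only, not the exact gpiolib code):

	/* writer: publish the pointer, then announce readiness */
	gc->irq.domain = domain;
	barrier();			/* no compiler reordering across this point */
	gc->irq.initialized = true;

	/* reader (illustrative): only dereference after seeing the flag */
	if (gc->irq.initialized)
		irq = irq_create_mapping(gc->irq.domain, offset);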
index 1c5d938..5f610e9 100644 (file)
@@ -1509,7 +1509,7 @@ struct atom_context *amdgpu_atom_parse(struct card_info *card, void *bios)
        str = CSTR(idx);
        if (*str != '\0') {
                pr_info("ATOM BIOS: %s\n", str);
-               strlcpy(ctx->vbios_version, str, sizeof(ctx->vbios_version));
+               strscpy(ctx->vbios_version, str, sizeof(ctx->vbios_version));
        }
 
        atom_rom_header = (struct _ATOM_ROM_HEADER *)CSTR(base);
index d3fe149..81fb4e5 100644 (file)
@@ -794,7 +794,7 @@ void amdgpu_add_thermal_controller(struct amdgpu_device *adev)
                                struct i2c_board_info info = { };
                                const char *name = pp_lib_thermal_controller_names[controller->ucType];
                                info.addr = controller->ucI2cAddress >> 1;
-                               strlcpy(info.type, name, sizeof(info.type));
+                               strscpy(info.type, name, sizeof(info.type));
                                i2c_new_client_device(&adev->pm.i2c_bus->adapter, &info);
                        }
                } else {
index 16565a0..e6a78fd 100644 (file)
@@ -2103,7 +2103,7 @@ int drm_dp_aux_register(struct drm_dp_aux *aux)
        aux->ddc.owner = THIS_MODULE;
        aux->ddc.dev.parent = aux->dev;
 
-       strlcpy(aux->ddc.name, aux->name ? aux->name : dev_name(aux->dev),
+       strscpy(aux->ddc.name, aux->name ? aux->name : dev_name(aux->dev),
                sizeof(aux->ddc.name));
 
        ret = drm_dp_aux_register_devnode(aux);
index 38dab76..943a00d 100644 (file)
@@ -3404,7 +3404,7 @@ int drm_dp_add_payload_part2(struct drm_dp_mst_topology_mgr *mgr,
 
        /* Skip failed payloads */
        if (payload->vc_start_slot == -1) {
-               drm_dbg_kms(state->dev, "Part 1 of payload creation for %s failed, skipping part 2\n",
+               drm_dbg_kms(mgr->dev, "Part 1 of payload creation for %s failed, skipping part 2\n",
                            payload->port->connector->name);
                return -EIO;
        }
@@ -5702,7 +5702,7 @@ static int drm_dp_mst_register_i2c_bus(struct drm_dp_mst_port *port)
        aux->ddc.dev.parent = parent_dev;
        aux->ddc.dev.of_node = parent_dev->of_node;
 
-       strlcpy(aux->ddc.name, aux->name ? aux->name : dev_name(parent_dev),
+       strscpy(aux->ddc.name, aux->name ? aux->name : dev_name(parent_dev),
                sizeof(aux->ddc.name));
 
        return i2c_add_adapter(&aux->ddc);
index 1a5a2cd..78dcae2 100644 (file)
@@ -496,13 +496,13 @@ int drm_gem_create_mmap_offset(struct drm_gem_object *obj)
 EXPORT_SYMBOL(drm_gem_create_mmap_offset);
 
 /*
- * Move pages to appropriate lru and release the pagevec, decrementing the
- * ref count of those pages.
+ * Move folios to appropriate lru and release the folios, decrementing the
+ * ref count of those folios.
  */
-static void drm_gem_check_release_pagevec(struct pagevec *pvec)
+static void drm_gem_check_release_batch(struct folio_batch *fbatch)
 {
-       check_move_unevictable_pages(pvec);
-       __pagevec_release(pvec);
+       check_move_unevictable_folios(fbatch);
+       __folio_batch_release(fbatch);
        cond_resched();
 }
 
@@ -534,10 +534,10 @@ static void drm_gem_check_release_pagevec(struct pagevec *pvec)
 struct page **drm_gem_get_pages(struct drm_gem_object *obj)
 {
        struct address_space *mapping;
-       struct page *p, **pages;
-       struct pagevec pvec;
-       int i, npages;
-
+       struct page **pages;
+       struct folio *folio;
+       struct folio_batch fbatch;
+       int i, j, npages;
 
        if (WARN_ON(!obj->filp))
                return ERR_PTR(-EINVAL);
@@ -559,11 +559,14 @@ struct page **drm_gem_get_pages(struct drm_gem_object *obj)
 
        mapping_set_unevictable(mapping);
 
-       for (i = 0; i < npages; i++) {
-               p = shmem_read_mapping_page(mapping, i);
-               if (IS_ERR(p))
+       i = 0;
+       while (i < npages) {
+               folio = shmem_read_folio_gfp(mapping, i,
+                               mapping_gfp_mask(mapping));
+               if (IS_ERR(folio))
                        goto fail;
-               pages[i] = p;
+               for (j = 0; j < folio_nr_pages(folio); j++, i++)
+                       pages[i] = folio_file_page(folio, i);
 
                /* Make sure shmem keeps __GFP_DMA32 allocated pages in the
                 * correct region during swapin. Note that this requires
@@ -571,23 +574,26 @@ struct page **drm_gem_get_pages(struct drm_gem_object *obj)
                 * so shmem can relocate pages during swapin if required.
                 */
                BUG_ON(mapping_gfp_constraint(mapping, __GFP_DMA32) &&
-                               (page_to_pfn(p) >= 0x00100000UL));
+                               (folio_pfn(folio) >= 0x00100000UL));
        }
 
        return pages;
 
 fail:
        mapping_clear_unevictable(mapping);
-       pagevec_init(&pvec);
-       while (i--) {
-               if (!pagevec_add(&pvec, pages[i]))
-                       drm_gem_check_release_pagevec(&pvec);
+       folio_batch_init(&fbatch);
+       j = 0;
+       while (j < i) {
+               struct folio *f = page_folio(pages[j]);
+               if (!folio_batch_add(&fbatch, f))
+                       drm_gem_check_release_batch(&fbatch);
+               j += folio_nr_pages(f);
        }
-       if (pagevec_count(&pvec))
-               drm_gem_check_release_pagevec(&pvec);
+       if (fbatch.nr)
+               drm_gem_check_release_batch(&fbatch);
 
        kvfree(pages);
-       return ERR_CAST(p);
+       return ERR_CAST(folio);
 }
 EXPORT_SYMBOL(drm_gem_get_pages);
 
@@ -603,7 +609,7 @@ void drm_gem_put_pages(struct drm_gem_object *obj, struct page **pages,
 {
        int i, npages;
        struct address_space *mapping;
-       struct pagevec pvec;
+       struct folio_batch fbatch;
 
        mapping = file_inode(obj->filp)->i_mapping;
        mapping_clear_unevictable(mapping);
@@ -616,23 +622,27 @@ void drm_gem_put_pages(struct drm_gem_object *obj, struct page **pages,
 
        npages = obj->size >> PAGE_SHIFT;
 
-       pagevec_init(&pvec);
+       folio_batch_init(&fbatch);
        for (i = 0; i < npages; i++) {
+               struct folio *folio;
+
                if (!pages[i])
                        continue;
+               folio = page_folio(pages[i]);
 
                if (dirty)
-                       set_page_dirty(pages[i]);
+                       folio_mark_dirty(folio);
 
                if (accessed)
-                       mark_page_accessed(pages[i]);
+                       folio_mark_accessed(folio);
 
                /* Undo the reference we took when populating the table */
-               if (!pagevec_add(&pvec, pages[i]))
-                       drm_gem_check_release_pagevec(&pvec);
+               if (!folio_batch_add(&fbatch, folio))
+                       drm_gem_check_release_batch(&fbatch);
+               i += folio_nr_pages(folio) - 1;
        }
-       if (pagevec_count(&pvec))
-               drm_gem_check_release_pagevec(&pvec);
+       if (folio_batch_count(&fbatch))
+               drm_gem_check_release_batch(&fbatch);
 
        kvfree(pages);
 }
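The pagevec-to-folio_batch conversion keeps the same batching idea on both the allocation-failure path and the put path: collect folios until the batch is full, then move them to the proper LRU and drop the references in one call, skipping the tail pages of large folios. A condensed sketch of the release loop, mirroring drm_gem_put_pages() above:

static void release_all(struct page **pages, int npages, bool dirty)
{
	struct folio_batch fbatch;
	int i;

	folio_batch_init(&fbatch);
	for (i = 0; i < npages; i++) {
		struct folio *folio;

		if (!pages[i])
			continue;
		folio = page_folio(pages[i]);

		if (dirty)
			folio_mark_dirty(folio);
		/* a full batch is flushed to the LRU and released in one go */
		if (!folio_batch_add(&fbatch, folio))
			drm_gem_check_release_batch(&fbatch);
		i += folio_nr_pages(folio) - 1;	/* skip the folio's tail pages */
	}
	if (folio_batch_count(&fbatch))
		drm_gem_check_release_batch(&fbatch);
}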
index c21c3f6..5423ad8 100644 (file)
@@ -49,10 +49,10 @@ struct drmres {
         * Some archs want to perform DMA into kmalloc caches
         * and need a guaranteed alignment larger than
         * the alignment of a 64-bit integer.
-        * Thus we use ARCH_KMALLOC_MINALIGN here and get exactly the same
-        * buffer alignment as if it was allocated by plain kmalloc().
+        * Thus we use ARCH_DMA_MINALIGN for data[] which will force the same
+        * alignment for struct drmres when allocated by kmalloc().
         */
-       u8 __aligned(ARCH_KMALLOC_MINALIGN) data[];
+       u8 __aligned(ARCH_DMA_MINALIGN) data[];
 };
 
 static void free_dr(struct drmres *dr)
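With this series ARCH_KMALLOC_MINALIGN may be smaller than the DMA alignment on some architectures, so the managed payload is aligned to ARCH_DMA_MINALIGN to stay safe for non-coherent DMA. The layout idea, reduced to a hypothetical container:

struct my_res {
	struct list_head entry;			/* management header */
	size_t size;
	/* payload starts on an ARCH_DMA_MINALIGN boundary */
	u8 __aligned(ARCH_DMA_MINALIGN) data[];
};

/*
 * Allocate with kmalloc(struct_size(r, data, len), GFP_KERNEL) and hand
 * &r->data to the consumer; the aligned flexible array forces the same
 * alignment for the whole allocation when it comes from kmalloc(), as the
 * drmres comment above describes.
 */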
index 3fd6c73..6252ac0 100644 (file)
@@ -223,7 +223,7 @@ mipi_dsi_device_register_full(struct mipi_dsi_host *host,
 
        device_set_node(&dsi->dev, of_fwnode_handle(info->node));
        dsi->channel = info->channel;
-       strlcpy(dsi->name, info->type, sizeof(dsi->name));
+       strscpy(dsi->name, info->type, sizeof(dsi->name));
 
        ret = mipi_dsi_device_add(dsi);
        if (ret) {
index db5c934..0918d80 100644 (file)
@@ -1951,7 +1951,7 @@ static int tda998x_create(struct device *dev)
         * offset.
         */
        memset(&cec_info, 0, sizeof(cec_info));
-       strlcpy(cec_info.type, "tda9950", sizeof(cec_info.type));
+       strscpy(cec_info.type, "tda9950", sizeof(cec_info.type));
        cec_info.addr = priv->cec_addr;
        cec_info.platform_data = &priv->cec_glue;
        cec_info.irq = client->irq;
index 37d1efc..adf1154 100644 (file)
 #include "i915_trace.h"
 
 /*
- * Move pages to appropriate lru and release the pagevec, decrementing the
- * ref count of those pages.
+ * Move folios to appropriate lru and release the batch, decrementing the
+ * ref count of those folios.
  */
-static void check_release_pagevec(struct pagevec *pvec)
+static void check_release_folio_batch(struct folio_batch *fbatch)
 {
-       check_move_unevictable_pages(pvec);
-       __pagevec_release(pvec);
+       check_move_unevictable_folios(fbatch);
+       __folio_batch_release(fbatch);
        cond_resched();
 }
 
@@ -33,24 +33,29 @@ void shmem_sg_free_table(struct sg_table *st, struct address_space *mapping,
                         bool dirty, bool backup)
 {
        struct sgt_iter sgt_iter;
-       struct pagevec pvec;
+       struct folio_batch fbatch;
+       struct folio *last = NULL;
        struct page *page;
 
        mapping_clear_unevictable(mapping);
 
-       pagevec_init(&pvec);
+       folio_batch_init(&fbatch);
        for_each_sgt_page(page, sgt_iter, st) {
-               if (dirty)
-                       set_page_dirty(page);
+               struct folio *folio = page_folio(page);
 
+               if (folio == last)
+                       continue;
+               last = folio;
+               if (dirty)
+                       folio_mark_dirty(folio);
                if (backup)
-                       mark_page_accessed(page);
+                       folio_mark_accessed(folio);
 
-               if (!pagevec_add(&pvec, page))
-                       check_release_pagevec(&pvec);
+               if (!folio_batch_add(&fbatch, folio))
+                       check_release_folio_batch(&fbatch);
        }
-       if (pagevec_count(&pvec))
-               check_release_pagevec(&pvec);
+       if (fbatch.nr)
+               check_release_folio_batch(&fbatch);
 
        sg_free_table(st);
 }
@@ -63,8 +68,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
        unsigned int page_count; /* restricted by sg_alloc_table */
        unsigned long i;
        struct scatterlist *sg;
-       struct page *page;
-       unsigned long last_pfn = 0;     /* suppress gcc warning */
+       unsigned long next_pfn = 0;     /* suppress gcc warning */
        gfp_t noreclaim;
        int ret;
 
@@ -95,6 +99,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
        sg = st->sgl;
        st->nents = 0;
        for (i = 0; i < page_count; i++) {
+               struct folio *folio;
                const unsigned int shrink[] = {
                        I915_SHRINK_BOUND | I915_SHRINK_UNBOUND,
                        0,
@@ -103,12 +108,12 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
 
                do {
                        cond_resched();
-                       page = shmem_read_mapping_page_gfp(mapping, i, gfp);
-                       if (!IS_ERR(page))
+                       folio = shmem_read_folio_gfp(mapping, i, gfp);
+                       if (!IS_ERR(folio))
                                break;
 
                        if (!*s) {
-                               ret = PTR_ERR(page);
+                               ret = PTR_ERR(folio);
                                goto err_sg;
                        }
 
@@ -147,19 +152,21 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
 
                if (!i ||
                    sg->length >= max_segment ||
-                   page_to_pfn(page) != last_pfn + 1) {
+                   folio_pfn(folio) != next_pfn) {
                        if (i)
                                sg = sg_next(sg);
 
                        st->nents++;
-                       sg_set_page(sg, page, PAGE_SIZE, 0);
+                       sg_set_folio(sg, folio, folio_size(folio), 0);
                } else {
-                       sg->length += PAGE_SIZE;
+                       /* XXX: could overflow? */
+                       sg->length += folio_size(folio);
                }
-               last_pfn = page_to_pfn(page);
+               next_pfn = folio_pfn(folio) + folio_nr_pages(folio);
+               i += folio_nr_pages(folio) - 1;
 
                /* Check that the i965g/gm workaround works. */
-               GEM_BUG_ON(gfp & __GFP_DMA32 && last_pfn >= 0x00100000UL);
+               GEM_BUG_ON(gfp & __GFP_DMA32 && next_pfn >= 0x00100000UL);
        }
        if (sg) /* loop terminated early; short sg table */
                sg_mark_end(sg);
index 5627990..01e271b 100644 (file)
@@ -1681,7 +1681,9 @@ static int igt_mmap_gpu(void *arg)
 
 static int check_present_pte(pte_t *pte, unsigned long addr, void *data)
 {
-       if (!pte_present(*pte) || pte_none(*pte)) {
+       pte_t ptent = ptep_get(pte);
+
+       if (!pte_present(ptent) || pte_none(ptent)) {
                pr_err("missing PTE:%lx\n",
                       (addr - (unsigned long)data) >> PAGE_SHIFT);
                return -EINVAL;
@@ -1692,7 +1694,9 @@ static int check_present_pte(pte_t *pte, unsigned long addr, void *data)
 
 static int check_absent_pte(pte_t *pte, unsigned long addr, void *data)
 {
-       if (pte_present(*pte) && !pte_none(*pte)) {
+       pte_t ptent = ptep_get(pte);
+
+       if (pte_present(ptent) && !pte_none(ptent)) {
                pr_err("present PTE:%lx; expected to be revoked\n",
                       (addr - (unsigned long)data) >> PAGE_SHIFT);
                return -EINVAL;
index f020c00..35f70bb 100644 (file)
@@ -187,64 +187,64 @@ i915_error_printer(struct drm_i915_error_state_buf *e)
 }
 
 /* single threaded page allocator with a reserved stash for emergencies */
-static void pool_fini(struct pagevec *pv)
+static void pool_fini(struct folio_batch *fbatch)
 {
-       pagevec_release(pv);
+       folio_batch_release(fbatch);
 }
 
-static int pool_refill(struct pagevec *pv, gfp_t gfp)
+static int pool_refill(struct folio_batch *fbatch, gfp_t gfp)
 {
-       while (pagevec_space(pv)) {
-               struct page *p;
+       while (folio_batch_space(fbatch)) {
+               struct folio *folio;
 
-               p = alloc_page(gfp);
-               if (!p)
+               folio = folio_alloc(gfp, 0);
+               if (!folio)
                        return -ENOMEM;
 
-               pagevec_add(pv, p);
+               folio_batch_add(fbatch, folio);
        }
 
        return 0;
 }
 
-static int pool_init(struct pagevec *pv, gfp_t gfp)
+static int pool_init(struct folio_batch *fbatch, gfp_t gfp)
 {
        int err;
 
-       pagevec_init(pv);
+       folio_batch_init(fbatch);
 
-       err = pool_refill(pv, gfp);
+       err = pool_refill(fbatch, gfp);
        if (err)
-               pool_fini(pv);
+               pool_fini(fbatch);
 
        return err;
 }
 
-static void *pool_alloc(struct pagevec *pv, gfp_t gfp)
+static void *pool_alloc(struct folio_batch *fbatch, gfp_t gfp)
 {
-       struct page *p;
+       struct folio *folio;
 
-       p = alloc_page(gfp);
-       if (!p && pagevec_count(pv))
-               p = pv->pages[--pv->nr];
+       folio = folio_alloc(gfp, 0);
+       if (!folio && folio_batch_count(fbatch))
+               folio = fbatch->folios[--fbatch->nr];
 
-       return p ? page_address(p) : NULL;
+       return folio ? folio_address(folio) : NULL;
 }
 
-static void pool_free(struct pagevec *pv, void *addr)
+static void pool_free(struct folio_batch *fbatch, void *addr)
 {
-       struct page *p = virt_to_page(addr);
+       struct folio *folio = virt_to_folio(addr);
 
-       if (pagevec_space(pv))
-               pagevec_add(pv, p);
+       if (folio_batch_space(fbatch))
+               folio_batch_add(fbatch, folio);
        else
-               __free_page(p);
+               folio_put(folio);
 }
 
 #ifdef CONFIG_DRM_I915_COMPRESS_ERROR
 
 struct i915_vma_compress {
-       struct pagevec pool;
+       struct folio_batch pool;
        struct z_stream_s zstream;
        void *tmp;
 };
@@ -381,7 +381,7 @@ static void err_compression_marker(struct drm_i915_error_state_buf *m)
 #else
 
 struct i915_vma_compress {
-       struct pagevec pool;
+       struct folio_batch pool;
 };
 
 static bool compress_init(struct i915_vma_compress *c)
index 2fc9214..4d39ea0 100644 (file)
@@ -295,7 +295,7 @@ static int mtk_hdmi_ddc_probe(struct platform_device *pdev)
                return ret;
        }
 
-       strlcpy(ddc->adap.name, "mediatek-hdmi-ddc", sizeof(ddc->adap.name));
+       strscpy(ddc->adap.name, "mediatek-hdmi-ddc", sizeof(ddc->adap.name));
        ddc->adap.owner = THIS_MODULE;
        ddc->adap.class = I2C_CLASS_DDC;
        ddc->adap.algo = &mtk_hdmi_ddc_algorithm;
index 4ad5a32..bf3c411 100644 (file)
@@ -2105,7 +2105,7 @@ static int radeon_atombios_parse_power_table_1_3(struct radeon_device *rdev)
                        const char *name = thermal_controller_names[power_info->info.
                                                                    ucOverdriveThermalController];
                        info.addr = power_info->info.ucOverdriveControllerAddress >> 1;
-                       strlcpy(info.type, name, sizeof(info.type));
+                       strscpy(info.type, name, sizeof(info.type));
                        i2c_new_client_device(&rdev->pm.i2c_bus->adapter, &info);
                }
        }
@@ -2355,7 +2355,7 @@ static void radeon_atombios_add_pplib_thermal_controller(struct radeon_device *r
                                struct i2c_board_info info = { };
                                const char *name = pp_lib_thermal_controller_names[controller->ucType];
                                info.addr = controller->ucI2cAddress >> 1;
-                               strlcpy(info.type, name, sizeof(info.type));
+                               strscpy(info.type, name, sizeof(info.type));
                                i2c_new_client_device(&rdev->pm.i2c_bus->adapter, &info);
                        }
                } else {
index 783a6b8..795c366 100644 (file)
@@ -2702,7 +2702,7 @@ void radeon_combios_get_power_modes(struct radeon_device *rdev)
                                struct i2c_board_info info = { };
                                const char *name = thermal_controller_names[thermal_controller];
                                info.addr = i2c_addr >> 1;
-                               strlcpy(info.type, name, sizeof(info.type));
+                               strscpy(info.type, name, sizeof(info.type));
                                i2c_new_client_device(&rdev->pm.i2c_bus->adapter, &info);
                        }
                }
@@ -2719,7 +2719,7 @@ void radeon_combios_get_power_modes(struct radeon_device *rdev)
                                struct i2c_board_info info = { };
                                const char *name = "f75375";
                                info.addr = 0x28;
-                               strlcpy(info.type, name, sizeof(info.type));
+                               strscpy(info.type, name, sizeof(info.type));
                                i2c_new_client_device(&rdev->pm.i2c_bus->adapter, &info);
                                DRM_INFO("Possible %s thermal controller at 0x%02x\n",
                                         name, info.addr);
index 2220cdf..3a9db03 100644 (file)
@@ -359,7 +359,7 @@ static int radeon_ttm_tt_pin_userptr(struct ttm_device *bdev, struct ttm_tt *ttm
                struct page **pages = ttm->pages + pinned;
 
                r = get_user_pages(userptr, num_pages, write ? FOLL_WRITE : 0,
-                                  pages, NULL);
+                                  pages);
                if (r < 0)
                        goto release_pages;
 
index f517748..9afb889 100644 (file)
@@ -797,7 +797,7 @@ static struct i2c_adapter *inno_hdmi_i2c_adapter(struct inno_hdmi *hdmi)
        adap->dev.parent = hdmi->dev;
        adap->dev.of_node = hdmi->dev->of_node;
        adap->algo = &inno_hdmi_algorithm;
-       strlcpy(adap->name, "Inno HDMI", sizeof(adap->name));
+       strscpy(adap->name, "Inno HDMI", sizeof(adap->name));
        i2c_set_adapdata(adap, hdmi);
 
        ret = i2c_add_adapter(adap);
index 90145ad..b5d042e 100644 (file)
@@ -730,7 +730,7 @@ static struct i2c_adapter *rk3066_hdmi_i2c_adapter(struct rk3066_hdmi *hdmi)
        adap->dev.parent = hdmi->dev;
        adap->dev.of_node = hdmi->dev->of_node;
        adap->algo = &rk3066_hdmi_algorithm;
-       strlcpy(adap->name, "RK3066 HDMI", sizeof(adap->name));
+       strscpy(adap->name, "RK3066 HDMI", sizeof(adap->name));
        i2c_set_adapdata(adap, hdmi);
 
        ret = i2c_add_adapter(adap);
index c7d7e9f..d1a65a9 100644 (file)
@@ -304,7 +304,7 @@ int sun4i_hdmi_i2c_create(struct device *dev, struct sun4i_hdmi *hdmi)
        adap->owner = THIS_MODULE;
        adap->class = I2C_CLASS_DDC;
        adap->algo = &sun4i_hdmi_i2c_algorithm;
-       strlcpy(adap->name, "sun4i_hdmi_i2c adapter", sizeof(adap->name));
+       strscpy(adap->name, "sun4i_hdmi_i2c adapter", sizeof(adap->name));
        i2c_set_adapdata(adap, hdmi);
 
        ret = i2c_add_adapter(adap);
index 0b74ca2..23899d7 100644 (file)
                         flags, magic, bp,              \
                         eax, ebx, ecx, edx, si, di)    \
 ({                                                     \
-        asm volatile ("push %%rbp;"                    \
+        asm volatile (                                 \
+               UNWIND_HINT_SAVE                        \
+               "push %%rbp;"                           \
+               UNWIND_HINT_UNDEFINED                   \
                 "mov %12, %%rbp;"                      \
                 VMWARE_HYPERCALL_HB_OUT                        \
-                "pop %%rbp;" :                         \
+                "pop %%rbp;"                           \
+               UNWIND_HINT_RESTORE :                   \
                 "=a"(eax),                             \
                 "=b"(ebx),                             \
                 "=c"(ecx),                             \
                        flags, magic, bp,               \
                        eax, ebx, ecx, edx, si, di)     \
 ({                                                     \
-        asm volatile ("push %%rbp;"                    \
+        asm volatile (                                 \
+               UNWIND_HINT_SAVE                        \
+               "push %%rbp;"                           \
+               UNWIND_HINT_UNDEFINED                   \
                 "mov %12, %%rbp;"                      \
                 VMWARE_HYPERCALL_HB_IN                 \
-                "pop %%rbp" :                          \
+                "pop %%rbp;"                           \
+               UNWIND_HINT_RESTORE :                   \
                 "=a"(eax),                             \
                 "=b"(ebx),                             \
                 "=c"(ecx),                             \
index e3799a5..9c88861 100644 (file)
@@ -187,8 +187,8 @@ _gb_connection_create(struct gb_host_device *hd, int hd_cport_id,
        spin_lock_init(&connection->lock);
        INIT_LIST_HEAD(&connection->operations);
 
-       connection->wq = alloc_workqueue("%s:%d", WQ_UNBOUND, 1,
-                                        dev_name(&hd->dev), hd_cport_id);
+       connection->wq = alloc_ordered_workqueue("%s:%d", 0, dev_name(&hd->dev),
+                                                hd_cport_id);
        if (!connection->wq) {
                ret = -ENOMEM;
                goto err_free_connection;
index 16cced8..0d7e749 100644 (file)
@@ -1318,7 +1318,7 @@ struct gb_svc *gb_svc_create(struct gb_host_device *hd)
        if (!svc)
                return NULL;
 
-       svc->wq = alloc_workqueue("%s:svc", WQ_UNBOUND, 1, dev_name(&hd->dev));
+       svc->wq = alloc_ordered_workqueue("%s:svc", 0, dev_name(&hd->dev));
        if (!svc->wq) {
                kfree(svc);
                return NULL;
index 1fc4fd7..1bab91c 100644 (file)
@@ -218,7 +218,7 @@ static inline void set_trbe_enabled(struct trbe_cpudata *cpudata, u64 trblimitr)
         * Enable the TRBE without clearing LIMITPTR which
         * might be required for fetching the buffer limits.
         */
-       trblimitr |= TRBLIMITR_ENABLE;
+       trblimitr |= TRBLIMITR_EL1_E;
        write_sysreg_s(trblimitr, SYS_TRBLIMITR_EL1);
 
        /* Synchronize the TRBE enable event */
@@ -236,7 +236,7 @@ static inline void set_trbe_disabled(struct trbe_cpudata *cpudata)
         * Disable the TRBE without clearing LIMITPTR which
         * might be required for fetching the buffer limits.
         */
-       trblimitr &= ~TRBLIMITR_ENABLE;
+       trblimitr &= ~TRBLIMITR_EL1_E;
        write_sysreg_s(trblimitr, SYS_TRBLIMITR_EL1);
 
        if (trbe_needs_drain_after_disable(cpudata))
@@ -582,12 +582,12 @@ static void clr_trbe_status(void)
        u64 trbsr = read_sysreg_s(SYS_TRBSR_EL1);
 
        WARN_ON(is_trbe_enabled());
-       trbsr &= ~TRBSR_IRQ;
-       trbsr &= ~TRBSR_TRG;
-       trbsr &= ~TRBSR_WRAP;
-       trbsr &= ~(TRBSR_EC_MASK << TRBSR_EC_SHIFT);
-       trbsr &= ~(TRBSR_BSC_MASK << TRBSR_BSC_SHIFT);
-       trbsr &= ~TRBSR_STOP;
+       trbsr &= ~TRBSR_EL1_IRQ;
+       trbsr &= ~TRBSR_EL1_TRG;
+       trbsr &= ~TRBSR_EL1_WRAP;
+       trbsr &= ~TRBSR_EL1_EC_MASK;
+       trbsr &= ~TRBSR_EL1_BSC_MASK;
+       trbsr &= ~TRBSR_EL1_S;
        write_sysreg_s(trbsr, SYS_TRBSR_EL1);
 }
 
@@ -596,13 +596,13 @@ static void set_trbe_limit_pointer_enabled(struct trbe_buf *buf)
        u64 trblimitr = read_sysreg_s(SYS_TRBLIMITR_EL1);
        unsigned long addr = buf->trbe_limit;
 
-       WARN_ON(!IS_ALIGNED(addr, (1UL << TRBLIMITR_LIMIT_SHIFT)));
+       WARN_ON(!IS_ALIGNED(addr, (1UL << TRBLIMITR_EL1_LIMIT_SHIFT)));
        WARN_ON(!IS_ALIGNED(addr, PAGE_SIZE));
 
-       trblimitr &= ~TRBLIMITR_NVM;
-       trblimitr &= ~(TRBLIMITR_FILL_MODE_MASK << TRBLIMITR_FILL_MODE_SHIFT);
-       trblimitr &= ~(TRBLIMITR_TRIG_MODE_MASK << TRBLIMITR_TRIG_MODE_SHIFT);
-       trblimitr &= ~(TRBLIMITR_LIMIT_MASK << TRBLIMITR_LIMIT_SHIFT);
+       trblimitr &= ~TRBLIMITR_EL1_nVM;
+       trblimitr &= ~TRBLIMITR_EL1_FM_MASK;
+       trblimitr &= ~TRBLIMITR_EL1_TM_MASK;
+       trblimitr &= ~TRBLIMITR_EL1_LIMIT_MASK;
 
        /*
         * Fill trace buffer mode is used here while configuring the
@@ -613,14 +613,15 @@ static void set_trbe_limit_pointer_enabled(struct trbe_buf *buf)
         * trace data in the interrupt handler, before reconfiguring
         * the TRBE.
         */
-       trblimitr |= (TRBE_FILL_MODE_FILL & TRBLIMITR_FILL_MODE_MASK) << TRBLIMITR_FILL_MODE_SHIFT;
+       trblimitr |= (TRBLIMITR_EL1_FM_FILL << TRBLIMITR_EL1_FM_SHIFT) &
+                    TRBLIMITR_EL1_FM_MASK;
 
        /*
         * Trigger mode is not used here while configuring the TRBE for
         * the trace capture. Hence just keep this in the ignore mode.
         */
-       trblimitr |= (TRBE_TRIG_MODE_IGNORE & TRBLIMITR_TRIG_MODE_MASK) <<
-                     TRBLIMITR_TRIG_MODE_SHIFT;
+       trblimitr |= (TRBLIMITR_EL1_TM_IGNR << TRBLIMITR_EL1_TM_SHIFT) &
+                    TRBLIMITR_EL1_TM_MASK;
        trblimitr |= (addr & PAGE_MASK);
        set_trbe_enabled(buf->cpudata, trblimitr);
 }
index 98ff1b1..77cbb5c 100644 (file)
@@ -30,7 +30,7 @@ static inline bool is_trbe_enabled(void)
 {
        u64 trblimitr = read_sysreg_s(SYS_TRBLIMITR_EL1);
 
-       return trblimitr & TRBLIMITR_ENABLE;
+       return trblimitr & TRBLIMITR_EL1_E;
 }
 
 #define TRBE_EC_OTHERS         0
@@ -39,7 +39,7 @@ static inline bool is_trbe_enabled(void)
 
 static inline int get_trbe_ec(u64 trbsr)
 {
-       return (trbsr >> TRBSR_EC_SHIFT) & TRBSR_EC_MASK;
+       return (trbsr & TRBSR_EL1_EC_MASK) >> TRBSR_EL1_EC_SHIFT;
 }
 
 #define TRBE_BSC_NOT_STOPPED 0
@@ -48,63 +48,55 @@ static inline int get_trbe_ec(u64 trbsr)
 
 static inline int get_trbe_bsc(u64 trbsr)
 {
-       return (trbsr >> TRBSR_BSC_SHIFT) & TRBSR_BSC_MASK;
+       return (trbsr & TRBSR_EL1_BSC_MASK) >> TRBSR_EL1_BSC_SHIFT;
 }
 
 static inline void clr_trbe_irq(void)
 {
        u64 trbsr = read_sysreg_s(SYS_TRBSR_EL1);
 
-       trbsr &= ~TRBSR_IRQ;
+       trbsr &= ~TRBSR_EL1_IRQ;
        write_sysreg_s(trbsr, SYS_TRBSR_EL1);
 }
 
 static inline bool is_trbe_irq(u64 trbsr)
 {
-       return trbsr & TRBSR_IRQ;
+       return trbsr & TRBSR_EL1_IRQ;
 }
 
 static inline bool is_trbe_trg(u64 trbsr)
 {
-       return trbsr & TRBSR_TRG;
+       return trbsr & TRBSR_EL1_TRG;
 }
 
 static inline bool is_trbe_wrap(u64 trbsr)
 {
-       return trbsr & TRBSR_WRAP;
+       return trbsr & TRBSR_EL1_WRAP;
 }
 
 static inline bool is_trbe_abort(u64 trbsr)
 {
-       return trbsr & TRBSR_ABORT;
+       return trbsr & TRBSR_EL1_EA;
 }
 
 static inline bool is_trbe_running(u64 trbsr)
 {
-       return !(trbsr & TRBSR_STOP);
+       return !(trbsr & TRBSR_EL1_S);
 }
 
-#define TRBE_TRIG_MODE_STOP            0
-#define TRBE_TRIG_MODE_IRQ             1
-#define TRBE_TRIG_MODE_IGNORE          3
-
-#define TRBE_FILL_MODE_FILL            0
-#define TRBE_FILL_MODE_WRAP            1
-#define TRBE_FILL_MODE_CIRCULAR_BUFFER 3
-
 static inline bool get_trbe_flag_update(u64 trbidr)
 {
-       return trbidr & TRBIDR_FLAG;
+       return trbidr & TRBIDR_EL1_F;
 }
 
 static inline bool is_trbe_programmable(u64 trbidr)
 {
-       return !(trbidr & TRBIDR_PROG);
+       return !(trbidr & TRBIDR_EL1_P);
 }
 
 static inline int get_trbe_address_align(u64 trbidr)
 {
-       return (trbidr >> TRBIDR_ALIGN_SHIFT) & TRBIDR_ALIGN_MASK;
+       return (trbidr & TRBIDR_EL1_Align_MASK) >> TRBIDR_EL1_Align_SHIFT;
 }
 
 static inline unsigned long get_trbe_write_pointer(void)
@@ -121,7 +113,7 @@ static inline void set_trbe_write_pointer(unsigned long addr)
 static inline unsigned long get_trbe_limit_pointer(void)
 {
        u64 trblimitr = read_sysreg_s(SYS_TRBLIMITR_EL1);
-       unsigned long addr = trblimitr & (TRBLIMITR_LIMIT_MASK << TRBLIMITR_LIMIT_SHIFT);
+       unsigned long addr = trblimitr & TRBLIMITR_EL1_LIMIT_MASK;
 
        WARN_ON(!IS_ALIGNED(addr, PAGE_SIZE));
        return addr;
@@ -130,7 +122,7 @@ static inline unsigned long get_trbe_limit_pointer(void)
 static inline unsigned long get_trbe_base_pointer(void)
 {
        u64 trbbaser = read_sysreg_s(SYS_TRBBASER_EL1);
-       unsigned long addr = trbbaser & (TRBBASER_BASE_MASK << TRBBASER_BASE_SHIFT);
+       unsigned long addr = trbbaser & TRBBASER_EL1_BASE_MASK;
 
        WARN_ON(!IS_ALIGNED(addr, PAGE_SIZE));
        return addr;
@@ -139,7 +131,7 @@ static inline unsigned long get_trbe_base_pointer(void)
 static inline void set_trbe_base_pointer(unsigned long addr)
 {
        WARN_ON(is_trbe_enabled());
-       WARN_ON(!IS_ALIGNED(addr, (1UL << TRBBASER_BASE_SHIFT)));
+       WARN_ON(!IS_ALIGNED(addr, (1UL << TRBBASER_EL1_BASE_SHIFT)));
        WARN_ON(!IS_ALIGNED(addr, PAGE_SIZE));
        write_sysreg_s(addr, SYS_TRBBASER_EL1);
 }
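The TRBE driver now uses the generated sysreg field definitions, where each *_MASK is already shifted into position; extraction is therefore mask-then-shift and insertion is shift-then-mask, replacing the old hand-rolled shift-then-mask pairs. A small sketch of the convention with placeholder constants (not real TRBSR/TRBLIMITR fields):

#define EX_FIELD_SHIFT	5
#define EX_FIELD_MASK	GENMASK_ULL(10, 5)	/* pre-shifted, like TRBSR_EL1_*_MASK */

static inline unsigned long ex_field_get(u64 reg)
{
	return (reg & EX_FIELD_MASK) >> EX_FIELD_SHIFT;	/* mask first, then shift */
}

static inline u64 ex_field_set(u64 reg, unsigned long val)
{
	reg &= ~EX_FIELD_MASK;
	reg |= ((u64)val << EX_FIELD_SHIFT) & EX_FIELD_MASK; /* shift, then mask */
	return reg;
}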
index 1af0a63..4d24ceb 100644 (file)
@@ -201,8 +201,8 @@ static void lpi2c_imx_stop(struct lpi2c_imx_struct *lpi2c_imx)
 /* CLKLO = I2C_CLK_RATIO * CLKHI, SETHOLD = CLKHI, DATAVD = CLKHI/2 */
 static int lpi2c_imx_config(struct lpi2c_imx_struct *lpi2c_imx)
 {
-       u8 prescale, filt, sethold, clkhi, clklo, datavd;
-       unsigned int clk_rate, clk_cycle;
+       u8 prescale, filt, sethold, datavd;
+       unsigned int clk_rate, clk_cycle, clkhi, clklo;
        enum lpi2c_imx_pincfg pincfg;
        unsigned int temp;
 
index 2e153f2..7868238 100644 (file)
@@ -1752,16 +1752,21 @@ nodma:
        if (!clk_freq || clk_freq > I2C_MAX_FAST_MODE_PLUS_FREQ) {
                dev_err(qup->dev, "clock frequency not supported %d\n",
                        clk_freq);
-               return -EINVAL;
+               ret = -EINVAL;
+               goto fail_dma;
        }
 
        qup->base = devm_platform_ioremap_resource(pdev, 0);
-       if (IS_ERR(qup->base))
-               return PTR_ERR(qup->base);
+       if (IS_ERR(qup->base)) {
+               ret = PTR_ERR(qup->base);
+               goto fail_dma;
+       }
 
        qup->irq = platform_get_irq(pdev, 0);
-       if (qup->irq < 0)
-               return qup->irq;
+       if (qup->irq < 0) {
+               ret = qup->irq;
+               goto fail_dma;
+       }
 
        if (has_acpi_companion(qup->dev)) {
                ret = device_property_read_u32(qup->dev,
@@ -1775,13 +1780,15 @@ nodma:
                qup->clk = devm_clk_get(qup->dev, "core");
                if (IS_ERR(qup->clk)) {
                        dev_err(qup->dev, "Could not get core clock\n");
-                       return PTR_ERR(qup->clk);
+                       ret = PTR_ERR(qup->clk);
+                       goto fail_dma;
                }
 
                qup->pclk = devm_clk_get(qup->dev, "iface");
                if (IS_ERR(qup->pclk)) {
                        dev_err(qup->dev, "Could not get iface clock\n");
-                       return PTR_ERR(qup->pclk);
+                       ret = PTR_ERR(qup->pclk);
+                       goto fail_dma;
                }
                qup_i2c_enable_clocks(qup);
                src_clk_freq = clk_get_rate(qup->clk);
index aa2d19d..34201d7 100644 (file)
@@ -199,6 +199,43 @@ static __cpuidle int intel_idle_xstate(struct cpuidle_device *dev,
        return __intel_idle(dev, drv, index);
 }
 
+static __always_inline int __intel_idle_hlt(struct cpuidle_device *dev,
+                                       struct cpuidle_driver *drv, int index)
+{
+       raw_safe_halt();
+       raw_local_irq_disable();
+       return index;
+}
+
+/**
+ * intel_idle_hlt - Ask the processor to enter the given idle state using hlt.
+ * @dev: cpuidle device of the target CPU.
+ * @drv: cpuidle driver (assumed to point to intel_idle_driver).
+ * @index: Target idle state index.
+ *
+ * Use the HLT instruction to notify the processor that the CPU represented by
+ * @dev is idle and it can try to enter the idle state corresponding to @index.
+ *
+ * Must be called under local_irq_disable().
+ */
+static __cpuidle int intel_idle_hlt(struct cpuidle_device *dev,
+                               struct cpuidle_driver *drv, int index)
+{
+       return __intel_idle_hlt(dev, drv, index);
+}
+
+static __cpuidle int intel_idle_hlt_irq_on(struct cpuidle_device *dev,
+                                   struct cpuidle_driver *drv, int index)
+{
+       int ret;
+
+       raw_local_irq_enable();
+       ret = __intel_idle_hlt(dev, drv, index);
+       raw_local_irq_disable();
+
+       return ret;
+}
+
 /**
  * intel_idle_s2idle - Ask the processor to enter the given idle state.
  * @dev: cpuidle device of the target CPU.
@@ -1242,6 +1279,25 @@ static struct cpuidle_state snr_cstates[] __initdata = {
                .enter = NULL }
 };
 
+static struct cpuidle_state vmguest_cstates[] __initdata = {
+       {
+               .name = "C1",
+               .desc = "HLT",
+               .flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
+               .exit_latency = 5,
+               .target_residency = 10,
+               .enter = &intel_idle_hlt, },
+       {
+               .name = "C1L",
+               .desc = "Long HLT",
+               .flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_TLB_FLUSHED,
+               .exit_latency = 5,
+               .target_residency = 200,
+               .enter = &intel_idle_hlt, },
+       {
+               .enter = NULL }
+};
+
 static const struct idle_cpu idle_cpu_nehalem __initconst = {
        .state_table = nehalem_cstates,
        .auto_demotion_disable_flags = NHM_C1_AUTO_DEMOTE | NHM_C3_AUTO_DEMOTE,
@@ -1839,6 +1895,66 @@ static bool __init intel_idle_verify_cstate(unsigned int mwait_hint)
        return true;
 }
 
+static void state_update_enter_method(struct cpuidle_state *state, int cstate)
+{
+       if (state->enter == intel_idle_hlt) {
+               if (force_irq_on) {
+                       pr_info("forced intel_idle_irq for state %d\n", cstate);
+                       state->enter = intel_idle_hlt_irq_on;
+               }
+               return;
+       }
+       if (state->enter == intel_idle_hlt_irq_on)
+               return; /* no update scenarios */
+
+       if (state->flags & CPUIDLE_FLAG_INIT_XSTATE) {
+               /*
+                * Combining XSTATE with the IBRS or IRQ_ENABLE flags
+                * is not currently supported by this driver.
+                */
+               WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IBRS);
+               WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IRQ_ENABLE);
+               state->enter = intel_idle_xstate;
+               return;
+       }
+
+       if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS) &&
+                          state->flags & CPUIDLE_FLAG_IBRS) {
+               /*
+                * IBRS mitigation requires that C-states are entered
+                * with interrupts disabled.
+                */
+               WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IRQ_ENABLE);
+               state->enter = intel_idle_ibrs;
+               return;
+       }
+
+       if (state->flags & CPUIDLE_FLAG_IRQ_ENABLE) {
+               state->enter = intel_idle_irq;
+               return;
+       }
+
+       if (force_irq_on) {
+               pr_info("forced intel_idle_irq for state %d\n", cstate);
+               state->enter = intel_idle_irq;
+       }
+}
+
+/*
+ * For mwait based states, we want to verify the cpuid data to see if the state
+ * is actually supported by this specific CPU.
+ * For non-mwait based states, this check should be skipped.
+ */
+static bool should_verify_mwait(struct cpuidle_state *state)
+{
+       if (state->enter == intel_idle_hlt)
+               return false;
+       if (state->enter == intel_idle_hlt_irq_on)
+               return false;
+
+       return true;
+}
+
 static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
 {
        int cstate;
@@ -1887,35 +2003,15 @@ static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
                }
 
                mwait_hint = flg2MWAIT(cpuidle_state_table[cstate].flags);
-               if (!intel_idle_verify_cstate(mwait_hint))
+               if (should_verify_mwait(&cpuidle_state_table[cstate]) && !intel_idle_verify_cstate(mwait_hint))
                        continue;
 
                /* Structure copy. */
                drv->states[drv->state_count] = cpuidle_state_table[cstate];
                state = &drv->states[drv->state_count];
 
-               if (state->flags & CPUIDLE_FLAG_INIT_XSTATE) {
-                       /*
-                        * Combining with XSTATE with IBRS or IRQ_ENABLE flags
-                        * is not currently supported but this driver.
-                        */
-                       WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IBRS);
-                       WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IRQ_ENABLE);
-                       state->enter = intel_idle_xstate;
-               } else if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS) &&
-                          state->flags & CPUIDLE_FLAG_IBRS) {
-                       /*
-                        * IBRS mitigation requires that C-states are entered
-                        * with interrupts disabled.
-                        */
-                       WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IRQ_ENABLE);
-                       state->enter = intel_idle_ibrs;
-               } else if (state->flags & CPUIDLE_FLAG_IRQ_ENABLE) {
-                       state->enter = intel_idle_irq;
-               } else if (force_irq_on) {
-                       pr_info("forced intel_idle_irq for state %d\n", cstate);
-                       state->enter = intel_idle_irq;
-               }
+               state_update_enter_method(state, cstate);
+
 
                if ((disabled_states_mask & BIT(drv->state_count)) ||
                    ((icpu->use_acpi || force_use_acpi) &&
@@ -2041,6 +2137,93 @@ static void __init intel_idle_cpuidle_devices_uninit(void)
                cpuidle_unregister_device(per_cpu_ptr(intel_idle_cpuidle_devices, i));
 }
 
+/*
+ * Match up the latency and break even point of the bare metal (cpu based)
+ * states with the deepest VM available state.
+ *
+ * We only want to do this for the deepest states, the ones that have
+ * the TLB_FLUSHED flag set.
+ *
+ * All our short idle states are dominated by vmexit/vmenter latencies,
+ * not the underlying hardware latencies so we keep our values for these.
+ */
+static void matchup_vm_state_with_baremetal(void)
+{
+       int cstate;
+
+       for (cstate = 0; cstate < CPUIDLE_STATE_MAX; ++cstate) {
+               int matching_cstate;
+
+               if (intel_idle_max_cstate_reached(cstate))
+                       break;
+
+               if (!cpuidle_state_table[cstate].enter)
+                       break;
+
+               if (!(cpuidle_state_table[cstate].flags & CPUIDLE_FLAG_TLB_FLUSHED))
+                       continue;
+
+               for (matching_cstate = 0; matching_cstate < CPUIDLE_STATE_MAX; ++matching_cstate) {
+                       if (!icpu->state_table[matching_cstate].enter)
+                               break;
+                       if (icpu->state_table[matching_cstate].exit_latency > cpuidle_state_table[cstate].exit_latency) {
+                               cpuidle_state_table[cstate].exit_latency = icpu->state_table[matching_cstate].exit_latency;
+                               cpuidle_state_table[cstate].target_residency = icpu->state_table[matching_cstate].target_residency;
+                       }
+               }
+
+       }
+}
+
+
+static int __init intel_idle_vminit(const struct x86_cpu_id *id)
+{
+       int retval;
+
+       cpuidle_state_table = vmguest_cstates;
+
+       icpu = (const struct idle_cpu *)id->driver_data;
+
+       pr_debug("v" INTEL_IDLE_VERSION " model 0x%X\n",
+                boot_cpu_data.x86_model);
+
+       intel_idle_cpuidle_devices = alloc_percpu(struct cpuidle_device);
+       if (!intel_idle_cpuidle_devices)
+               return -ENOMEM;
+
+       /*
+        * We don't know exactly what the host will do when we go idle, but as a
+        * worst-case estimate we can assume that the exit latency of the deepest
+        * host state will be hit for our deep (long duration) guest idle state.
+        * The same logic applies to the break-even point for the long duration
+        * guest idle state. So let's copy these two properties from the table we
+        * found for the host CPU type.
+        */
+       matchup_vm_state_with_baremetal();
+
+       intel_idle_cpuidle_driver_init(&intel_idle_driver);
+
+       retval = cpuidle_register_driver(&intel_idle_driver);
+       if (retval) {
+               struct cpuidle_driver *drv = cpuidle_get_driver();
+               printk(KERN_DEBUG pr_fmt("intel_idle yielding to %s\n"),
+                      drv ? drv->name : "none");
+               goto init_driver_fail;
+       }
+
+       retval = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "idle/intel:online",
+                                  intel_idle_cpu_online, NULL);
+       if (retval < 0)
+               goto hp_setup_fail;
+
+       return 0;
+hp_setup_fail:
+       intel_idle_cpuidle_devices_uninit();
+       cpuidle_unregister_driver(&intel_idle_driver);
+init_driver_fail:
+       free_percpu(intel_idle_cpuidle_devices);
+       return retval;
+}
+
 static int __init intel_idle_init(void)
 {
        const struct x86_cpu_id *id;
@@ -2059,6 +2242,8 @@ static int __init intel_idle_init(void)
        id = x86_match_cpu(intel_idle_ids);
        if (id) {
                if (!boot_cpu_has(X86_FEATURE_MWAIT)) {
+                       if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+                               return intel_idle_vminit(id);
                        pr_debug("Please enable MWAIT in BIOS SETUP\n");
                        return -ENODEV;
                }
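So a guest without MWAIT now gets two HLT-backed states: a shallow C1 whose cost is dominated by vmexit/vmenter, and a "Long HLT" state whose exit latency and target residency are inherited from the deepest bare-metal state known for the host CPU model. A sketch of that inheritance step, assuming a single TLB_FLUSHED-flagged guest state and a host state table (mirroring matchup_vm_state_with_baremetal() above):

static void inherit_from_deepest_host_state(struct cpuidle_state *guest,
					    const struct cpuidle_state *host,
					    int max_states)
{
	int i;

	for (i = 0; i < max_states; i++) {
		if (!host[i].enter)
			break;
		/* keep the largest host latency/residency seen so far */
		if (host[i].exit_latency > guest->exit_latency) {
			guest->exit_latency = host[i].exit_latency;
			guest->target_residency = host[i].target_residency;
		}
	}
}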
index f693bc7..1bb7507 100644 (file)
@@ -111,7 +111,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
                ret = pin_user_pages(start_page + got * PAGE_SIZE,
                                     num_pages - got,
                                     FOLL_LONGTERM | FOLL_WRITE,
-                                    p + got, NULL);
+                                    p + got);
                if (ret < 0) {
                        mmap_read_unlock(current->mm);
                        goto bail_release;
index 2a5cac2..84e0f41 100644 (file)
@@ -140,7 +140,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
                ret = pin_user_pages(cur_base,
                                     min_t(unsigned long, npages,
                                     PAGE_SIZE / sizeof(struct page *)),
-                                    gup_flags, page_list, NULL);
+                                    gup_flags, page_list);
 
                if (ret < 0)
                        goto out;
index 4d8f6b8..83093e1 100644 (file)
@@ -1357,7 +1357,7 @@ static int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
        if (cleanup_err)
                rxe_err_mr(mr, "cleanup failed, err = %d", cleanup_err);
 
-       kfree_rcu(mr);
+       kfree_rcu_mightsleep(mr);
        return 0;
 
 err_out:
index f51ab2c..e6e25f1 100644 (file)
@@ -422,7 +422,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
                umem->page_chunk[i].plist = plist;
                while (nents) {
                        rv = pin_user_pages(first_page_va, nents, foll_flags,
-                                           plist, NULL);
+                                           plist);
                        if (rv < 0)
                                goto out_sem_up;
 
index 81a54a5..8a320e6 100644 (file)
@@ -609,7 +609,7 @@ config INPUT_PWM_VIBRA
 
 config INPUT_RK805_PWRKEY
        tristate "Rockchip RK805 PMIC power key support"
-       depends on MFD_RK808
+       depends on MFD_RK8XX
        help
          Select this option to enable power key driver for RK805.
 
index 577c75c..bb3c607 100644 (file)
@@ -22,7 +22,7 @@
  * in the kernel). So this driver offers straight forward, reliable single
  * touch functionality only.
  *
- * s.a. A20 User Manual "1.15 TP" (Documentation/arm/sunxi.rst)
+ * s.a. A20 User Manual "1.15 TP" (Documentation/arch/arm/sunxi.rst)
  * (looks like the description in the A20 User Manual v1.3 is better
  * than the one in the A10 User Manual v.1.5)
  */
index 4d80060..2b12b58 100644 (file)
@@ -152,6 +152,7 @@ config IOMMU_DMA
        select IOMMU_IOVA
        select IRQ_MSI_IOMMU
        select NEED_SG_DMA_LENGTH
+       select NEED_SG_DMA_FLAGS if SWIOTLB
 
 # Shared Virtual Addressing
 config IOMMU_SVA
index 2ddbda3..ab8aa8f 100644 (file)
@@ -986,8 +986,13 @@ union irte_ga_hi {
 };
 
 struct irte_ga {
-       union irte_ga_lo lo;
-       union irte_ga_hi hi;
+       union {
+               struct {
+                       union irte_ga_lo lo;
+                       union irte_ga_hi hi;
+               };
+               u128 irte;
+       };
 };
 
 struct irq_2_irte {
index dc1ec68..9ea4096 100644 (file)
@@ -2078,10 +2078,6 @@ static struct protection_domain *protection_domain_alloc(unsigned int type)
        int mode = DEFAULT_PGTABLE_LEVEL;
        int ret;
 
-       domain = kzalloc(sizeof(*domain), GFP_KERNEL);
-       if (!domain)
-               return NULL;
-
        /*
         * Force IOMMU v1 page table when iommu=pt and
         * when allocating domain for pass-through devices.
@@ -2097,6 +2093,10 @@ static struct protection_domain *protection_domain_alloc(unsigned int type)
                return NULL;
        }
 
+       domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+       if (!domain)
+               return NULL;
+
        switch (pgtable) {
        case AMD_IOMMU_V1:
                ret = protection_domain_init_v1(domain, mode);
@@ -3023,10 +3023,10 @@ out:
 static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
                          struct irte_ga *irte, struct amd_ir_data *data)
 {
-       bool ret;
        struct irq_remap_table *table;
-       unsigned long flags;
        struct irte_ga *entry;
+       unsigned long flags;
+       u128 old;
 
        table = get_irq_table(iommu, devid);
        if (!table)
@@ -3037,16 +3037,14 @@ static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
        entry = (struct irte_ga *)table->table;
        entry = &entry[index];
 
-       ret = cmpxchg_double(&entry->lo.val, &entry->hi.val,
-                            entry->lo.val, entry->hi.val,
-                            irte->lo.val, irte->hi.val);
        /*
         * We use cmpxchg16 to atomically update the 128-bit IRTE,
         * and it cannot be updated by the hardware or other processors
         * behind us, so the return value of cmpxchg16 should be the
         * same as the old value.
         */
-       WARN_ON(!ret);
+       old = entry->irte;
+       WARN_ON(!try_cmpxchg128(&entry->irte, &old, irte->irte));
 
        if (data)
                data->ref = entry;
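
Note: the hunk above replaces cmpxchg_double() with try_cmpxchg128() over the new u128 view of the IRTE added to struct irte_ga. The same pattern can be exercised outside the kernel; the sketch below is an illustrative user-space stand-in (names, types and values are assumptions, not the kernel implementation), assuming a 64-bit GCC/Clang target with __int128 and the atomic builtins (x86-64 may need -mcx16 and/or -latomic):

    #include <stdio.h>
    #include <stdbool.h>

    typedef unsigned __int128 u128;

    /* Illustrative 128-bit table entry with a u128 overlay, mirroring the
     * union added to struct irte_ga (fields and values are made up). */
    struct entry128 {
        union {
            struct { unsigned long long lo, hi; };
            u128 val;
        };
    };

    /* try_cmpxchg-style helper: true on success, *old updated on failure. */
    static bool cas128(u128 *ptr, u128 *old, u128 newv)
    {
        return __atomic_compare_exchange_n(ptr, old, newv, false,
                                           __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }

    int main(void)
    {
        struct entry128 cur  = { .lo = 0x1, .hi = 0x2 };
        struct entry128 next = { .lo = 0x3, .hi = 0x4 };
        u128 old = cur.val;

        /* No concurrent writer here, so the exchange must succeed, which is
         * exactly what the WARN_ON() in modify_irte_ga() asserts. */
        if (!cas128(&cur.val, &old, next.val))
            fprintf(stderr, "unexpected concurrent update\n");

        printf("lo=%llx hi=%llx\n", cur.lo, cur.hi);
        return 0;
    }
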
index 7a9f0b0..e86ae46 100644 (file)
@@ -520,9 +520,38 @@ static bool dev_is_untrusted(struct device *dev)
        return dev_is_pci(dev) && to_pci_dev(dev)->untrusted;
 }
 
-static bool dev_use_swiotlb(struct device *dev)
+static bool dev_use_swiotlb(struct device *dev, size_t size,
+                           enum dma_data_direction dir)
 {
-       return IS_ENABLED(CONFIG_SWIOTLB) && dev_is_untrusted(dev);
+       return IS_ENABLED(CONFIG_SWIOTLB) &&
+               (dev_is_untrusted(dev) ||
+                dma_kmalloc_needs_bounce(dev, size, dir));
+}
+
+static bool dev_use_sg_swiotlb(struct device *dev, struct scatterlist *sg,
+                              int nents, enum dma_data_direction dir)
+{
+       struct scatterlist *s;
+       int i;
+
+       if (!IS_ENABLED(CONFIG_SWIOTLB))
+               return false;
+
+       if (dev_is_untrusted(dev))
+               return true;
+
+       /*
+        * If kmalloc() buffers are not DMA-safe for this device and
+        * direction, check the individual lengths in the sg list. If any
+        * element is deemed unsafe, use the swiotlb for bouncing.
+        */
+       if (!dma_kmalloc_safe(dev, dir)) {
+               for_each_sg(sg, s, nents, i)
+                       if (!dma_kmalloc_size_aligned(s->length))
+                               return true;
+       }
+
+       return false;
 }
 
 /**
@@ -922,7 +951,7 @@ static void iommu_dma_sync_single_for_cpu(struct device *dev,
 {
        phys_addr_t phys;
 
-       if (dev_is_dma_coherent(dev) && !dev_use_swiotlb(dev))
+       if (dev_is_dma_coherent(dev) && !dev_use_swiotlb(dev, size, dir))
                return;
 
        phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
@@ -938,7 +967,7 @@ static void iommu_dma_sync_single_for_device(struct device *dev,
 {
        phys_addr_t phys;
 
-       if (dev_is_dma_coherent(dev) && !dev_use_swiotlb(dev))
+       if (dev_is_dma_coherent(dev) && !dev_use_swiotlb(dev, size, dir))
                return;
 
        phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
@@ -956,7 +985,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
        struct scatterlist *sg;
        int i;
 
-       if (dev_use_swiotlb(dev))
+       if (sg_dma_is_swiotlb(sgl))
                for_each_sg(sgl, sg, nelems, i)
                        iommu_dma_sync_single_for_cpu(dev, sg_dma_address(sg),
                                                      sg->length, dir);
@@ -972,7 +1001,7 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
        struct scatterlist *sg;
        int i;
 
-       if (dev_use_swiotlb(dev))
+       if (sg_dma_is_swiotlb(sgl))
                for_each_sg(sgl, sg, nelems, i)
                        iommu_dma_sync_single_for_device(dev,
                                                         sg_dma_address(sg),
@@ -998,7 +1027,8 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
         * If both the physical buffer start address and size are
         * page aligned, we don't need to use a bounce page.
         */
-       if (dev_use_swiotlb(dev) && iova_offset(iovad, phys | size)) {
+       if (dev_use_swiotlb(dev, size, dir) &&
+           iova_offset(iovad, phys | size)) {
                void *padding_start;
                size_t padding_size, aligned_size;
 
@@ -1080,7 +1110,7 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
                sg_dma_address(s) = DMA_MAPPING_ERROR;
                sg_dma_len(s) = 0;
 
-               if (sg_is_dma_bus_address(s)) {
+               if (sg_dma_is_bus_address(s)) {
                        if (i > 0)
                                cur = sg_next(cur);
 
@@ -1136,7 +1166,7 @@ static void __invalidate_sg(struct scatterlist *sg, int nents)
        int i;
 
        for_each_sg(sg, s, nents, i) {
-               if (sg_is_dma_bus_address(s)) {
+               if (sg_dma_is_bus_address(s)) {
                        sg_dma_unmark_bus_address(s);
                } else {
                        if (sg_dma_address(s) != DMA_MAPPING_ERROR)
@@ -1166,6 +1196,8 @@ static int iommu_dma_map_sg_swiotlb(struct device *dev, struct scatterlist *sg,
        struct scatterlist *s;
        int i;
 
+       sg_dma_mark_swiotlb(sg);
+
        for_each_sg(sg, s, nents, i) {
                sg_dma_address(s) = iommu_dma_map_page(dev, sg_page(s),
                                s->offset, s->length, dir, attrs);
@@ -1210,7 +1242,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
                        goto out;
        }
 
-       if (dev_use_swiotlb(dev))
+       if (dev_use_sg_swiotlb(dev, sg, nents, dir))
                return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
 
        if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
@@ -1315,7 +1347,7 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
        struct scatterlist *tmp;
        int i;
 
-       if (dev_use_swiotlb(dev)) {
+       if (sg_dma_is_swiotlb(sg)) {
                iommu_dma_unmap_sg_swiotlb(dev, sg, nents, dir, attrs);
                return;
        }
@@ -1329,7 +1361,7 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
         * just have to be determined.
         */
        for_each_sg(sg, tmp, nents, i) {
-               if (sg_is_dma_bus_address(tmp)) {
+               if (sg_dma_is_bus_address(tmp)) {
                        sg_dma_unmark_bus_address(tmp);
                        continue;
                }
@@ -1343,7 +1375,7 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 
        nents -= i;
        for_each_sg(tmp, tmp, nents, i) {
-               if (sg_is_dma_bus_address(tmp)) {
+               if (sg_dma_is_bus_address(tmp)) {
                        sg_dma_unmark_bus_address(tmp);
                        continue;
                }
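
Note: dev_use_swiotlb() now also takes the mapping size and direction into account, and the new dev_use_sg_swiotlb() walks the scatterlist so that a single small kmalloc()-backed element forces the whole list through the bounce buffer. The sketch below is a deliberately simplified user-space stand-in for that decision; DMA_MIN_ALIGN, the struct and the lengths are assumptions, and the plain modulo check stands in for dma_kmalloc_safe()/dma_kmalloc_size_aligned():

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define DMA_MIN_ALIGN 64   /* illustrative; the kernel derives this from the arch */

    struct sg_entry { size_t length; };

    /* Sketch of the dev_use_sg_swiotlb() decision: if kmalloc() buffers are
     * not DMA-safe for this device/direction, any element whose length is
     * not a multiple of the minimum DMA alignment forces bouncing. */
    static bool list_needs_bounce(const struct sg_entry *sg, int nents,
                                  bool kmalloc_dma_safe)
    {
        if (kmalloc_dma_safe)
            return false;
        for (int i = 0; i < nents; i++)
            if (sg[i].length % DMA_MIN_ALIGN)
                return true;
        return false;
    }

    int main(void)
    {
        struct sg_entry sgl[] = { { 4096 }, { 120 }, { 8192 } };

        /* 120 is not cache-line sized, so the list bounces -> prints 1 */
        printf("bounce: %d\n", list_needs_bounce(sgl, 3, false));
        return 0;
    }
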
index a1b9873..08f5632 100644 (file)
@@ -175,18 +175,14 @@ static int modify_irte(struct irq_2_iommu *irq_iommu,
        irte = &iommu->ir_table->base[index];
 
        if ((irte->pst == 1) || (irte_modified->pst == 1)) {
-               bool ret;
-
-               ret = cmpxchg_double(&irte->low, &irte->high,
-                                    irte->low, irte->high,
-                                    irte_modified->low, irte_modified->high);
                /*
                 * We use cmpxchg16 to atomically update the 128-bit IRTE,
                 * and it cannot be updated by the hardware or other processors
                 * behind us, so the return value of cmpxchg16 should be the
                 * same as the old value.
                 */
-               WARN_ON(!ret);
+               u128 old = irte->irte;
+               WARN_ON(!try_cmpxchg128(&irte->irte, &old, irte_modified->irte));
        } else {
                WRITE_ONCE(irte->low, irte_modified->low);
                WRITE_ONCE(irte->high, irte_modified->high);
index f1dcfa3..eb62055 100644 (file)
@@ -2567,7 +2567,7 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
                        len = 0;
                }
 
-               if (sg_is_dma_bus_address(sg))
+               if (sg_dma_is_bus_address(sg))
                        goto next;
 
                if (len) {
index 3c47846..412ca96 100644 (file)
@@ -786,7 +786,7 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
                        user->locked = 1;
                }
                rc = pin_user_pages_remote(pages->source_mm, uptr, npages,
-                                          user->gup_flags, user->upages, NULL,
+                                          user->gup_flags, user->upages,
                                           &user->locked);
        }
        if (rc <= 0) {
@@ -1799,7 +1799,7 @@ static int iopt_pages_rw_page(struct iopt_pages *pages, unsigned long index,
        rc = pin_user_pages_remote(
                pages->source_mm, (uintptr_t)(pages->uptr + index * PAGE_SIZE),
                1, (flags & IOMMUFD_ACCESS_RW_WRITE) ? FOLL_WRITE : 0, &page,
-               NULL, NULL);
+               NULL);
        mmap_read_unlock(pages->source_mm);
        if (rc != 1) {
                if (WARN_ON(rc >= 0))
index 77ebe7e..e731e07 100644 (file)
@@ -212,12 +212,6 @@ out_kfree:
        return err;
 }
 
-void __init clps711x_intc_init(phys_addr_t base, resource_size_t size)
-{
-       BUG_ON(_clps711x_intc_init(NULL, base, size));
-}
-
-#ifdef CONFIG_IRQCHIP
 static int __init clps711x_intc_init_dt(struct device_node *np,
                                        struct device_node *parent)
 {
@@ -231,4 +225,3 @@ static int __init clps711x_intc_init_dt(struct device_node *np,
        return _clps711x_intc_init(np, res.start, resource_size(&res));
 }
 IRQCHIP_DECLARE(clps711x, "cirrus,ep7209-intc", clps711x_intc_init_dt);
-#endif
index 46a3aa6..359efc1 100644 (file)
@@ -125,7 +125,7 @@ static struct irq_chip ft010_irq_chip = {
 /* Local static for the IRQ entry call */
 static struct ft010_irq_data firq;
 
-asmlinkage void __exception_irq_entry ft010_irqchip_handle_irq(struct pt_regs *regs)
+static asmlinkage void __exception_irq_entry ft010_irqchip_handle_irq(struct pt_regs *regs)
 {
        struct ft010_irq_data *f = &firq;
        int irq;
@@ -162,7 +162,7 @@ static const struct irq_domain_ops ft010_irqdomain_ops = {
        .xlate = irq_domain_xlate_onetwocell,
 };
 
-int __init ft010_of_init_irq(struct device_node *node,
+static int __init ft010_of_init_irq(struct device_node *node,
                              struct device_node *parent)
 {
        struct ft010_irq_data *f = &firq;
index 0ec2b1e..1994541 100644 (file)
@@ -3585,6 +3585,7 @@ static int its_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
                irqd = irq_get_irq_data(virq + i);
                irqd_set_single_target(irqd);
                irqd_set_affinity_on_activate(irqd);
+               irqd_set_resend_when_in_progress(irqd);
                pr_debug("ID:%d pID:%d vID:%d\n",
                         (int)(hwirq + i - its_dev->event_map.lpi_base),
                         (int)(hwirq + i), virq + i);
@@ -4523,6 +4524,7 @@ static int its_vpe_irq_domain_alloc(struct irq_domain *domain, unsigned int virq
                irq_domain_set_hwirq_and_chip(domain, virq + i, i,
                                              irqchip, vm->vpes[i]);
                set_bit(i, bitmap);
+               irqd_set_resend_when_in_progress(irq_get_irq_data(virq + i));
        }
 
        if (err) {
index a605aa7..0c6c1af 100644 (file)
@@ -40,6 +40,7 @@
 #define FLAGS_WORKAROUND_GICR_WAKER_MSM8996    (1ULL << 0)
 #define FLAGS_WORKAROUND_CAVIUM_ERRATUM_38539  (1ULL << 1)
 #define FLAGS_WORKAROUND_MTK_GICR_SAVE         (1ULL << 2)
+#define FLAGS_WORKAROUND_ASR_ERRATUM_8601001   (1ULL << 3)
 
 #define GIC_IRQ_TYPE_PARTITION (GIC_IRQ_TYPE_LPI + 1)
 
@@ -656,10 +657,16 @@ static int gic_irq_set_vcpu_affinity(struct irq_data *d, void *vcpu)
        return 0;
 }
 
-static u64 gic_mpidr_to_affinity(unsigned long mpidr)
+static u64 gic_cpu_to_affinity(int cpu)
 {
+       u64 mpidr = cpu_logical_map(cpu);
        u64 aff;
 
+       /* ASR8601 needs to have its affinities shifted down... */
+       if (unlikely(gic_data.flags & FLAGS_WORKAROUND_ASR_ERRATUM_8601001))
+               mpidr = (MPIDR_AFFINITY_LEVEL(mpidr, 1) |
+                        (MPIDR_AFFINITY_LEVEL(mpidr, 2) << 8));
+
        aff = ((u64)MPIDR_AFFINITY_LEVEL(mpidr, 3) << 32 |
               MPIDR_AFFINITY_LEVEL(mpidr, 2) << 16 |
               MPIDR_AFFINITY_LEVEL(mpidr, 1) << 8  |
@@ -914,7 +921,7 @@ static void __init gic_dist_init(void)
         * Set all global interrupts to the boot CPU only. ARE must be
         * enabled.
         */
-       affinity = gic_mpidr_to_affinity(cpu_logical_map(smp_processor_id()));
+       affinity = gic_cpu_to_affinity(smp_processor_id());
        for (i = 32; i < GIC_LINE_NR; i++)
                gic_write_irouter(affinity, base + GICD_IROUTER + i * 8);
 
@@ -963,7 +970,7 @@ static int gic_iterate_rdists(int (*fn)(struct redist_region *, void __iomem *))
 
 static int __gic_populate_rdist(struct redist_region *region, void __iomem *ptr)
 {
-       unsigned long mpidr = cpu_logical_map(smp_processor_id());
+       unsigned long mpidr;
        u64 typer;
        u32 aff;
 
@@ -971,6 +978,8 @@ static int __gic_populate_rdist(struct redist_region *region, void __iomem *ptr)
         * Convert affinity to a 32bit value that can be matched to
         * GICR_TYPER bits [63:32].
         */
+       mpidr = gic_cpu_to_affinity(smp_processor_id());
+
        aff = (MPIDR_AFFINITY_LEVEL(mpidr, 3) << 24 |
               MPIDR_AFFINITY_LEVEL(mpidr, 2) << 16 |
               MPIDR_AFFINITY_LEVEL(mpidr, 1) << 8 |
@@ -1084,7 +1093,7 @@ static inline bool gic_dist_security_disabled(void)
 static void gic_cpu_sys_reg_init(void)
 {
        int i, cpu = smp_processor_id();
-       u64 mpidr = cpu_logical_map(cpu);
+       u64 mpidr = gic_cpu_to_affinity(cpu);
        u64 need_rss = MPIDR_RS(mpidr);
        bool group0;
        u32 pribits;
@@ -1183,11 +1192,11 @@ static void gic_cpu_sys_reg_init(void)
        for_each_online_cpu(i) {
                bool have_rss = per_cpu(has_rss, i) && per_cpu(has_rss, cpu);
 
-               need_rss |= MPIDR_RS(cpu_logical_map(i));
+               need_rss |= MPIDR_RS(gic_cpu_to_affinity(i));
                if (need_rss && (!have_rss))
                        pr_crit("CPU%d (%lx) can't SGI CPU%d (%lx), no RSS\n",
                                cpu, (unsigned long)mpidr,
-                               i, (unsigned long)cpu_logical_map(i));
+                               i, (unsigned long)gic_cpu_to_affinity(i));
        }
 
        /**
@@ -1263,9 +1272,11 @@ static u16 gic_compute_target_list(int *base_cpu, const struct cpumask *mask,
                                   unsigned long cluster_id)
 {
        int next_cpu, cpu = *base_cpu;
-       unsigned long mpidr = cpu_logical_map(cpu);
+       unsigned long mpidr;
        u16 tlist = 0;
 
+       mpidr = gic_cpu_to_affinity(cpu);
+
        while (cpu < nr_cpu_ids) {
                tlist |= 1 << (mpidr & 0xf);
 
@@ -1274,7 +1285,7 @@ static u16 gic_compute_target_list(int *base_cpu, const struct cpumask *mask,
                        goto out;
                cpu = next_cpu;
 
-               mpidr = cpu_logical_map(cpu);
+               mpidr = gic_cpu_to_affinity(cpu);
 
                if (cluster_id != MPIDR_TO_SGI_CLUSTER_ID(mpidr)) {
                        cpu--;
@@ -1319,7 +1330,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
        dsb(ishst);
 
        for_each_cpu(cpu, mask) {
-               u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
+               u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(gic_cpu_to_affinity(cpu));
                u16 tlist;
 
                tlist = gic_compute_target_list(&cpu, mask, cluster_id);
@@ -1377,7 +1388,7 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 
        offset = convert_offset_index(d, GICD_IROUTER, &index);
        reg = gic_dist_base(d) + offset + (index * 8);
-       val = gic_mpidr_to_affinity(cpu_logical_map(cpu));
+       val = gic_cpu_to_affinity(cpu);
 
        gic_write_irouter(val, reg);
 
@@ -1796,6 +1807,15 @@ static bool gic_enable_quirk_nvidia_t241(void *data)
        return true;
 }
 
+static bool gic_enable_quirk_asr8601(void *data)
+{
+       struct gic_chip_data *d = data;
+
+       d->flags |= FLAGS_WORKAROUND_ASR_ERRATUM_8601001;
+
+       return true;
+}
+
 static const struct gic_quirk gic_quirks[] = {
        {
                .desc   = "GICv3: Qualcomm MSM8996 broken firmware",
@@ -1803,6 +1823,11 @@ static const struct gic_quirk gic_quirks[] = {
                .init   = gic_enable_quirk_msm8996,
        },
        {
+               .desc   = "GICv3: ASR erratum 8601001",
+               .compatible = "asr,asr8601-gic-v3",
+               .init   = gic_enable_quirk_asr8601,
+       },
+       {
                .desc   = "GICv3: Mediatek Chromebook GICR save problem",
                .property = "mediatek,broken-save-restore-fw",
                .init   = gic_enable_quirk_mtk_gicr,
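
Note: the GICv3 changes route all affinity computations through the new gic_cpu_to_affinity() helper so the ASR8601 erratum can shift the affinity fields down one level before they are packed into the GICD_IROUTER layout. A stand-alone sketch of that packing follows; the MPIDR value is arbitrary and aff_level() is a simplified stand-in for the kernel's MPIDR_AFFINITY_LEVEL() (Aff3 lives in bits [39:32]):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    /* Extract affinity level 0..3 from an MPIDR-style value. */
    static uint64_t aff_level(uint64_t mpidr, int level)
    {
        static const int shift[] = { 0, 8, 16, 32 };
        return (mpidr >> shift[level]) & 0xff;
    }

    /* Sketch of gic_cpu_to_affinity(): optionally shift affinities down one
     * level (the ASR8601 workaround), then pack them for GICD_IROUTER. */
    static uint64_t cpu_to_affinity(uint64_t mpidr, bool asr8601_quirk)
    {
        if (asr8601_quirk)
            mpidr = aff_level(mpidr, 1) | (aff_level(mpidr, 2) << 8);

        return (aff_level(mpidr, 3) << 32) |
               (aff_level(mpidr, 2) << 16) |
               (aff_level(mpidr, 1) << 8)  |
                aff_level(mpidr, 0);
    }

    int main(void)
    {
        uint64_t mpidr = 0x81020304ULL;  /* made-up example value */

        printf("aff=%#llx quirk=%#llx\n",
               (unsigned long long)cpu_to_affinity(mpidr, false),
               (unsigned long long)cpu_to_affinity(mpidr, true));
        return 0;
    }
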
index 5f47d8e..b9dcc8e 100644 (file)
@@ -68,6 +68,7 @@ static int __init aic_irq_of_init(struct device_node *node,
        unsigned min_irq = JCORE_AIC2_MIN_HWIRQ;
        unsigned dom_sz = JCORE_AIC_MAX_HWIRQ+1;
        struct irq_domain *domain;
+       int ret;
 
        pr_info("Initializing J-Core AIC\n");
 
@@ -100,6 +101,12 @@ static int __init aic_irq_of_init(struct device_node *node,
        jcore_aic.irq_unmask = noop;
        jcore_aic.name = "AIC";
 
+       ret = irq_alloc_descs(-1, min_irq, dom_sz - min_irq,
+                             of_node_to_nid(node));
+
+       if (ret < 0)
+               return ret;
+
        domain = irq_domain_add_legacy(node, dom_sz - min_irq, min_irq, min_irq,
                                       &jcore_aic_irqdomain_ops,
                                       &jcore_aic);
index 71ef19f..92d8aa2 100644 (file)
@@ -36,6 +36,7 @@ static int nr_pics;
 
 struct eiointc_priv {
        u32                     node;
+       u32                     vec_count;
        nodemask_t              node_map;
        cpumask_t               cpuspan_map;
        struct fwnode_handle    *domain_handle;
@@ -153,18 +154,18 @@ static int eiointc_router_init(unsigned int cpu)
        if ((cpu_logical_map(cpu) % CORES_PER_EIO_NODE) == 0) {
                eiointc_enable();
 
-               for (i = 0; i < VEC_COUNT / 32; i++) {
+               for (i = 0; i < eiointc_priv[0]->vec_count / 32; i++) {
                        data = (((1 << (i * 2 + 1)) << 16) | (1 << (i * 2)));
                        iocsr_write32(data, EIOINTC_REG_NODEMAP + i * 4);
                }
 
-               for (i = 0; i < VEC_COUNT / 32 / 4; i++) {
+               for (i = 0; i < eiointc_priv[0]->vec_count / 32 / 4; i++) {
                        bit = BIT(1 + index); /* Route to IP[1 + index] */
                        data = bit | (bit << 8) | (bit << 16) | (bit << 24);
                        iocsr_write32(data, EIOINTC_REG_IPMAP + i * 4);
                }
 
-               for (i = 0; i < VEC_COUNT / 4; i++) {
+               for (i = 0; i < eiointc_priv[0]->vec_count / 4; i++) {
                        /* Route to Node-0 Core-0 */
                        if (index == 0)
                                bit = BIT(cpu_logical_map(0));
@@ -175,7 +176,7 @@ static int eiointc_router_init(unsigned int cpu)
                        iocsr_write32(data, EIOINTC_REG_ROUTE + i * 4);
                }
 
-               for (i = 0; i < VEC_COUNT / 32; i++) {
+               for (i = 0; i < eiointc_priv[0]->vec_count / 32; i++) {
                        data = 0xffffffff;
                        iocsr_write32(data, EIOINTC_REG_ENABLE + i * 4);
                        iocsr_write32(data, EIOINTC_REG_BOUNCE + i * 4);
@@ -195,7 +196,7 @@ static void eiointc_irq_dispatch(struct irq_desc *desc)
 
        chained_irq_enter(chip, desc);
 
-       for (i = 0; i < VEC_REG_COUNT; i++) {
+       for (i = 0; i < eiointc_priv[0]->vec_count / VEC_COUNT_PER_REG; i++) {
                pending = iocsr_read64(EIOINTC_REG_ISR + (i << 3));
                iocsr_write64(pending, EIOINTC_REG_ISR + (i << 3));
                while (pending) {
@@ -310,11 +311,11 @@ static void eiointc_resume(void)
        eiointc_router_init(0);
 
        for (i = 0; i < nr_pics; i++) {
-               for (j = 0; j < VEC_COUNT; j++) {
+               for (j = 0; j < eiointc_priv[0]->vec_count; j++) {
                        desc = irq_resolve_mapping(eiointc_priv[i]->eiointc_domain, j);
                        if (desc && desc->handle_irq && desc->handle_irq != handle_bad_irq) {
                                raw_spin_lock(&desc->lock);
-                               irq_data = &desc->irq_data;
+                               irq_data = irq_domain_get_irq_data(eiointc_priv[i]->eiointc_domain, irq_desc_get_irq(desc));
                                eiointc_set_irq_affinity(irq_data, irq_data->common->affinity, 0);
                                raw_spin_unlock(&desc->lock);
                        }
@@ -375,11 +376,47 @@ static int __init acpi_cascade_irqdomain_init(void)
        return 0;
 }
 
+static int __init eiointc_init(struct eiointc_priv *priv, int parent_irq,
+                              u64 node_map)
+{
+       int i;
+
+       node_map = node_map ? node_map : -1ULL;
+       for_each_possible_cpu(i) {
+               if (node_map & (1ULL << (cpu_to_eio_node(i)))) {
+                       node_set(cpu_to_eio_node(i), priv->node_map);
+                       cpumask_or(&priv->cpuspan_map, &priv->cpuspan_map,
+                                  cpumask_of(i));
+               }
+       }
+
+       priv->eiointc_domain = irq_domain_create_linear(priv->domain_handle,
+                                                       priv->vec_count,
+                                                       &eiointc_domain_ops,
+                                                       priv);
+       if (!priv->eiointc_domain) {
+               pr_err("loongson-extioi: cannot add IRQ domain\n");
+               return -ENOMEM;
+       }
+
+       eiointc_priv[nr_pics++] = priv;
+       eiointc_router_init(0);
+       irq_set_chained_handler_and_data(parent_irq, eiointc_irq_dispatch, priv);
+
+       if (nr_pics == 1) {
+               register_syscore_ops(&eiointc_syscore_ops);
+               cpuhp_setup_state_nocalls(CPUHP_AP_IRQ_LOONGARCH_STARTING,
+                                         "irqchip/loongarch/intc:starting",
+                                         eiointc_router_init, NULL);
+       }
+
+       return 0;
+}
+
 int __init eiointc_acpi_init(struct irq_domain *parent,
                                     struct acpi_madt_eio_pic *acpi_eiointc)
 {
-       int i, ret, parent_irq;
-       unsigned long node_map;
+       int parent_irq, ret;
        struct eiointc_priv *priv;
        int node;
 
@@ -394,37 +431,14 @@ int __init eiointc_acpi_init(struct irq_domain *parent,
                goto out_free_priv;
        }
 
+       priv->vec_count = VEC_COUNT;
        priv->node = acpi_eiointc->node;
-       node_map = acpi_eiointc->node_map ? : -1ULL;
-
-       for_each_possible_cpu(i) {
-               if (node_map & (1ULL << cpu_to_eio_node(i))) {
-                       node_set(cpu_to_eio_node(i), priv->node_map);
-                       cpumask_or(&priv->cpuspan_map, &priv->cpuspan_map, cpumask_of(i));
-               }
-       }
-
-       /* Setup IRQ domain */
-       priv->eiointc_domain = irq_domain_create_linear(priv->domain_handle, VEC_COUNT,
-                                       &eiointc_domain_ops, priv);
-       if (!priv->eiointc_domain) {
-               pr_err("loongson-eiointc: cannot add IRQ domain\n");
-               goto out_free_handle;
-       }
-
-       eiointc_priv[nr_pics++] = priv;
-
-       eiointc_router_init(0);
 
        parent_irq = irq_create_mapping(parent, acpi_eiointc->cascade);
-       irq_set_chained_handler_and_data(parent_irq, eiointc_irq_dispatch, priv);
 
-       if (nr_pics == 1) {
-               register_syscore_ops(&eiointc_syscore_ops);
-               cpuhp_setup_state_nocalls(CPUHP_AP_IRQ_LOONGARCH_STARTING,
-                                 "irqchip/loongarch/intc:starting",
-                                 eiointc_router_init, NULL);
-       }
+       ret = eiointc_init(priv, parent_irq, acpi_eiointc->node_map);
+       if (ret < 0)
+               goto out_free_handle;
 
        if (cpu_has_flatmode)
                node = cpu_to_node(acpi_eiointc->node * CORES_PER_EIO_NODE);
@@ -432,7 +446,10 @@ int __init eiointc_acpi_init(struct irq_domain *parent,
                node = acpi_eiointc->node;
        acpi_set_vec_parent(node, priv->eiointc_domain, pch_group);
        acpi_set_vec_parent(node, priv->eiointc_domain, msi_group);
+
        ret = acpi_cascade_irqdomain_init();
+       if (ret < 0)
+               goto out_free_handle;
 
        return ret;
 
@@ -444,3 +461,49 @@ out_free_priv:
 
        return -ENOMEM;
 }
+
+static int __init eiointc_of_init(struct device_node *of_node,
+                                 struct device_node *parent)
+{
+       int parent_irq, ret;
+       struct eiointc_priv *priv;
+
+       priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+       if (!priv)
+               return -ENOMEM;
+
+       parent_irq = irq_of_parse_and_map(of_node, 0);
+       if (parent_irq <= 0) {
+               ret = -ENODEV;
+               goto out_free_priv;
+       }
+
+       ret = irq_set_handler_data(parent_irq, priv);
+       if (ret < 0)
+               goto out_free_priv;
+
+       /*
+        * In particular, the number of devices supported by the LS2K0500
+        * extended I/O interrupt vector is 128.
+        */
+       if (of_device_is_compatible(of_node, "loongson,ls2k0500-eiointc"))
+               priv->vec_count = 128;
+       else
+               priv->vec_count = VEC_COUNT;
+
+       priv->node = 0;
+       priv->domain_handle = of_node_to_fwnode(of_node);
+
+       ret = eiointc_init(priv, parent_irq, 0);
+       if (ret < 0)
+               goto out_free_priv;
+
+       return 0;
+
+out_free_priv:
+       kfree(priv);
+       return ret;
+}
+
+IRQCHIP_DECLARE(loongson_ls2k0500_eiointc, "loongson,ls2k0500-eiointc", eiointc_of_init);
+IRQCHIP_DECLARE(loongson_ls2k2000_eiointc, "loongson,ls2k2000-eiointc", eiointc_of_init);
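
Note: eiointc_init() is now shared by the ACPI and DT probe paths; a node_map of zero is treated as "all nodes", and every possible CPU whose EIO node appears in the map joins the controller's CPU span. A small sketch of that selection logic, with invented CPU/node counts and a trivial cpu-to-node mapping standing in for cpu_to_eio_node():

    #include <stdio.h>

    #define MAX_CPUS        8
    #define CORES_PER_NODE  4

    int main(void)
    {
        unsigned long long node_map = 0;            /* 0 -> every node */

        /* Same default as eiointc_init(): an empty map selects all nodes. */
        node_map = node_map ? node_map : -1ULL;

        for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
            int node = cpu / CORES_PER_NODE;        /* stand-in for cpu_to_eio_node() */
            if (node_map & (1ULL << node))
                printf("cpu %d -> node %d in span\n", cpu, node);
        }
        return 0;
    }
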
index 8d00a9a..e4b33ae 100644 (file)
 #define LIOINTC_REG_INTC_EN_STATUS     (LIOINTC_INTC_CHIP_START + 0x04)
 #define LIOINTC_REG_INTC_ENABLE        (LIOINTC_INTC_CHIP_START + 0x08)
 #define LIOINTC_REG_INTC_DISABLE       (LIOINTC_INTC_CHIP_START + 0x0c)
+/*
+ * LIOINTC_REG_INTC_POL register is only valid for Loongson-2K series, and
+ * Loongson-3 series behave as noops.
+ */
 #define LIOINTC_REG_INTC_POL   (LIOINTC_INTC_CHIP_START + 0x10)
 #define LIOINTC_REG_INTC_EDGE  (LIOINTC_INTC_CHIP_START + 0x14)
 
@@ -116,19 +120,19 @@ static int liointc_set_type(struct irq_data *data, unsigned int type)
        switch (type) {
        case IRQ_TYPE_LEVEL_HIGH:
                liointc_set_bit(gc, LIOINTC_REG_INTC_EDGE, mask, false);
-               liointc_set_bit(gc, LIOINTC_REG_INTC_POL, mask, true);
+               liointc_set_bit(gc, LIOINTC_REG_INTC_POL, mask, false);
                break;
        case IRQ_TYPE_LEVEL_LOW:
                liointc_set_bit(gc, LIOINTC_REG_INTC_EDGE, mask, false);
-               liointc_set_bit(gc, LIOINTC_REG_INTC_POL, mask, false);
+               liointc_set_bit(gc, LIOINTC_REG_INTC_POL, mask, true);
                break;
        case IRQ_TYPE_EDGE_RISING:
                liointc_set_bit(gc, LIOINTC_REG_INTC_EDGE, mask, true);
-               liointc_set_bit(gc, LIOINTC_REG_INTC_POL, mask, true);
+               liointc_set_bit(gc, LIOINTC_REG_INTC_POL, mask, false);
                break;
        case IRQ_TYPE_EDGE_FALLING:
                liointc_set_bit(gc, LIOINTC_REG_INTC_EDGE, mask, true);
-               liointc_set_bit(gc, LIOINTC_REG_INTC_POL, mask, false);
+               liointc_set_bit(gc, LIOINTC_REG_INTC_POL, mask, true);
                break;
        default:
                irq_gc_unlock_irqrestore(gc, flags);
@@ -291,6 +295,7 @@ static int liointc_init(phys_addr_t addr, unsigned long size, int revision,
        ct->chip.irq_mask = irq_gc_mask_disable_reg;
        ct->chip.irq_mask_ack = irq_gc_mask_disable_reg;
        ct->chip.irq_set_type = liointc_set_type;
+       ct->chip.flags = IRQCHIP_SKIP_SET_WAKE;
 
        gc->mask_cache = 0;
        priv->gc = gc;
index e5fe4d5..93a71f6 100644 (file)
@@ -164,7 +164,7 @@ static int pch_pic_domain_translate(struct irq_domain *d,
                if (fwspec->param_count < 2)
                        return -EINVAL;
 
-               *hwirq = fwspec->param[0] + priv->ht_vec_base;
+               *hwirq = fwspec->param[0];
                *type = fwspec->param[1] & IRQ_TYPE_SENSE_MASK;
        } else {
                if (fwspec->param_count < 1)
@@ -196,7 +196,7 @@ static int pch_pic_alloc(struct irq_domain *domain, unsigned int virq,
 
        parent_fwspec.fwnode = domain->parent->fwnode;
        parent_fwspec.param_count = 1;
-       parent_fwspec.param[0] = hwirq;
+       parent_fwspec.param[0] = hwirq + priv->ht_vec_base;
 
        err = irq_domain_alloc_irqs_parent(domain, virq, 1, &parent_fwspec);
        if (err)
@@ -401,14 +401,12 @@ static int __init acpi_cascade_irqdomain_init(void)
 int __init pch_pic_acpi_init(struct irq_domain *parent,
                                        struct acpi_madt_bio_pic *acpi_pchpic)
 {
-       int ret, vec_base;
+       int ret;
        struct fwnode_handle *domain_handle;
 
        if (find_pch_pic(acpi_pchpic->gsi_base) >= 0)
                return 0;
 
-       vec_base = acpi_pchpic->gsi_base - GSI_MIN_PCH_IRQ;
-
        domain_handle = irq_domain_alloc_fwnode(&acpi_pchpic->address);
        if (!domain_handle) {
                pr_err("Unable to allocate domain handle\n");
@@ -416,7 +414,7 @@ int __init pch_pic_acpi_init(struct irq_domain *parent,
        }
 
        ret = pch_pic_init(acpi_pchpic->address, acpi_pchpic->size,
-                               vec_base, parent, domain_handle, acpi_pchpic->gsi_base);
+                               0, parent, domain_handle, acpi_pchpic->gsi_base);
 
        if (ret < 0) {
                irq_domain_free_fwnode(domain_handle);
index 83455ca..25cf4f8 100644 (file)
@@ -244,132 +244,6 @@ static void __exception_irq_entry mmp2_handle_irq(struct pt_regs *regs)
        generic_handle_domain_irq(icu_data[0].domain, hwirq);
 }
 
-/* MMP (ARMv5) */
-void __init icu_init_irq(void)
-{
-       int irq;
-
-       max_icu_nr = 1;
-       mmp_icu_base = ioremap(0xd4282000, 0x1000);
-       icu_data[0].conf_enable = mmp_conf.conf_enable;
-       icu_data[0].conf_disable = mmp_conf.conf_disable;
-       icu_data[0].conf_mask = mmp_conf.conf_mask;
-       icu_data[0].nr_irqs = 64;
-       icu_data[0].virq_base = 0;
-       icu_data[0].domain = irq_domain_add_legacy(NULL, 64, 0, 0,
-                                                  &irq_domain_simple_ops,
-                                                  &icu_data[0]);
-       for (irq = 0; irq < 64; irq++) {
-               icu_mask_irq(irq_get_irq_data(irq));
-               irq_set_chip_and_handler(irq, &icu_irq_chip, handle_level_irq);
-       }
-       irq_set_default_host(icu_data[0].domain);
-       set_handle_irq(mmp_handle_irq);
-}
-
-/* MMP2 (ARMv7) */
-void __init mmp2_init_icu(void)
-{
-       int irq, end;
-
-       max_icu_nr = 8;
-       mmp_icu_base = ioremap(0xd4282000, 0x1000);
-       icu_data[0].conf_enable = mmp2_conf.conf_enable;
-       icu_data[0].conf_disable = mmp2_conf.conf_disable;
-       icu_data[0].conf_mask = mmp2_conf.conf_mask;
-       icu_data[0].nr_irqs = 64;
-       icu_data[0].virq_base = 0;
-       icu_data[0].domain = irq_domain_add_legacy(NULL, 64, 0, 0,
-                                                  &irq_domain_simple_ops,
-                                                  &icu_data[0]);
-       icu_data[1].reg_status = mmp_icu_base + 0x150;
-       icu_data[1].reg_mask = mmp_icu_base + 0x168;
-       icu_data[1].clr_mfp_irq_base = icu_data[0].virq_base +
-                               icu_data[0].nr_irqs;
-       icu_data[1].clr_mfp_hwirq = 1;          /* offset to IRQ_MMP2_PMIC_BASE */
-       icu_data[1].nr_irqs = 2;
-       icu_data[1].cascade_irq = 4;
-       icu_data[1].virq_base = icu_data[0].virq_base + icu_data[0].nr_irqs;
-       icu_data[1].domain = irq_domain_add_legacy(NULL, icu_data[1].nr_irqs,
-                                                  icu_data[1].virq_base, 0,
-                                                  &irq_domain_simple_ops,
-                                                  &icu_data[1]);
-       icu_data[2].reg_status = mmp_icu_base + 0x154;
-       icu_data[2].reg_mask = mmp_icu_base + 0x16c;
-       icu_data[2].nr_irqs = 2;
-       icu_data[2].cascade_irq = 5;
-       icu_data[2].virq_base = icu_data[1].virq_base + icu_data[1].nr_irqs;
-       icu_data[2].domain = irq_domain_add_legacy(NULL, icu_data[2].nr_irqs,
-                                                  icu_data[2].virq_base, 0,
-                                                  &irq_domain_simple_ops,
-                                                  &icu_data[2]);
-       icu_data[3].reg_status = mmp_icu_base + 0x180;
-       icu_data[3].reg_mask = mmp_icu_base + 0x17c;
-       icu_data[3].nr_irqs = 3;
-       icu_data[3].cascade_irq = 9;
-       icu_data[3].virq_base = icu_data[2].virq_base + icu_data[2].nr_irqs;
-       icu_data[3].domain = irq_domain_add_legacy(NULL, icu_data[3].nr_irqs,
-                                                  icu_data[3].virq_base, 0,
-                                                  &irq_domain_simple_ops,
-                                                  &icu_data[3]);
-       icu_data[4].reg_status = mmp_icu_base + 0x158;
-       icu_data[4].reg_mask = mmp_icu_base + 0x170;
-       icu_data[4].nr_irqs = 5;
-       icu_data[4].cascade_irq = 17;
-       icu_data[4].virq_base = icu_data[3].virq_base + icu_data[3].nr_irqs;
-       icu_data[4].domain = irq_domain_add_legacy(NULL, icu_data[4].nr_irqs,
-                                                  icu_data[4].virq_base, 0,
-                                                  &irq_domain_simple_ops,
-                                                  &icu_data[4]);
-       icu_data[5].reg_status = mmp_icu_base + 0x15c;
-       icu_data[5].reg_mask = mmp_icu_base + 0x174;
-       icu_data[5].nr_irqs = 15;
-       icu_data[5].cascade_irq = 35;
-       icu_data[5].virq_base = icu_data[4].virq_base + icu_data[4].nr_irqs;
-       icu_data[5].domain = irq_domain_add_legacy(NULL, icu_data[5].nr_irqs,
-                                                  icu_data[5].virq_base, 0,
-                                                  &irq_domain_simple_ops,
-                                                  &icu_data[5]);
-       icu_data[6].reg_status = mmp_icu_base + 0x160;
-       icu_data[6].reg_mask = mmp_icu_base + 0x178;
-       icu_data[6].nr_irqs = 2;
-       icu_data[6].cascade_irq = 51;
-       icu_data[6].virq_base = icu_data[5].virq_base + icu_data[5].nr_irqs;
-       icu_data[6].domain = irq_domain_add_legacy(NULL, icu_data[6].nr_irqs,
-                                                  icu_data[6].virq_base, 0,
-                                                  &irq_domain_simple_ops,
-                                                  &icu_data[6]);
-       icu_data[7].reg_status = mmp_icu_base + 0x188;
-       icu_data[7].reg_mask = mmp_icu_base + 0x184;
-       icu_data[7].nr_irqs = 2;
-       icu_data[7].cascade_irq = 55;
-       icu_data[7].virq_base = icu_data[6].virq_base + icu_data[6].nr_irqs;
-       icu_data[7].domain = irq_domain_add_legacy(NULL, icu_data[7].nr_irqs,
-                                                  icu_data[7].virq_base, 0,
-                                                  &irq_domain_simple_ops,
-                                                  &icu_data[7]);
-       end = icu_data[7].virq_base + icu_data[7].nr_irqs;
-       for (irq = 0; irq < end; irq++) {
-               icu_mask_irq(irq_get_irq_data(irq));
-               if (irq == icu_data[1].cascade_irq ||
-                   irq == icu_data[2].cascade_irq ||
-                   irq == icu_data[3].cascade_irq ||
-                   irq == icu_data[4].cascade_irq ||
-                   irq == icu_data[5].cascade_irq ||
-                   irq == icu_data[6].cascade_irq ||
-                   irq == icu_data[7].cascade_irq) {
-                       irq_set_chip(irq, &icu_irq_chip);
-                       irq_set_chained_handler(irq, icu_mux_irq_demux);
-               } else {
-                       irq_set_chip_and_handler(irq, &icu_irq_chip,
-                                                handle_level_irq);
-               }
-       }
-       irq_set_default_host(icu_data[0].domain);
-       set_handle_irq(mmp2_handle_irq);
-}
-
-#ifdef CONFIG_OF
 static int __init mmp_init_bases(struct device_node *node)
 {
        int ret, nr_irqs, irq, i = 0;
@@ -548,4 +422,3 @@ err:
        return -EINVAL;
 }
 IRQCHIP_DECLARE(mmp2_mux_intc, "mrvl,mmp2-mux-intc", mmp2_mux_of_init);
-#endif
index 55cb6b5..be96806 100644 (file)
@@ -201,6 +201,7 @@ static int __init icoll_of_init(struct device_node *np,
        stmp_reset_block(icoll_priv.ctrl);
 
        icoll_add_domain(np, ICOLL_NUM_IRQS);
+       set_handle_irq(icoll_handle_irq);
 
        return 0;
 }
index 6a3f749..b5fa76c 100644 (file)
@@ -173,6 +173,16 @@ static struct irq_chip stm32_exti_h_chip_direct;
 #define EXTI_INVALID_IRQ       U8_MAX
 #define STM32MP1_DESC_IRQ_SIZE (ARRAY_SIZE(stm32mp1_exti_banks) * IRQS_PER_BANK)
 
+/*
+ * Use some intentionally tricky logic here to initialize the whole array to
+ * EXTI_INVALID_IRQ, but then override certain fields, requiring us to indicate
+ * that we "know" that there are overrides in this structure, and we'll need to
+ * disable that warning from W=1 builds.
+ */
+__diag_push();
+__diag_ignore_all("-Woverride-init",
+                 "logic to initialize all and then override some is OK");
+
 static const u8 stm32mp1_desc_irq[] = {
        /* default value */
        [0 ... (STM32MP1_DESC_IRQ_SIZE - 1)] = EXTI_INVALID_IRQ,
@@ -208,6 +218,7 @@ static const u8 stm32mp1_desc_irq[] = {
        [31] = 53,
        [32] = 82,
        [33] = 83,
+       [46] = 151,
        [47] = 93,
        [48] = 138,
        [50] = 139,
@@ -266,6 +277,8 @@ static const u8 stm32mp13_desc_irq[] = {
        [70] = 98,
 };
 
+__diag_pop();
+
 static const struct stm32_exti_drv_data stm32mp1_drv_data = {
        .exti_banks = stm32mp1_exti_banks,
        .bank_nr = ARRAY_SIZE(stm32mp1_exti_banks),
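
Note: the stm32-exti tables rely on a GNU range designator to fill every slot with EXTI_INVALID_IRQ and then override individual entries, which is exactly what -Woverride-init complains about and why the __diag_push()/__diag_ignore_all() pair was added. A compilable illustration of the same initializer pattern, with arbitrary size and values (build with GCC or Clang, since range designators are an extension):

    #include <stdio.h>

    #define INVALID     0xff
    #define TABLE_SIZE  96

    /* Initialize everything to a default, then override selected slots;
     * -Wextra's -Woverride-init warns on the later designators. */
    static const unsigned char desc_irq[TABLE_SIZE] = {
        [0 ... TABLE_SIZE - 1] = INVALID,
        [0]  = 6,
        [46] = 151,   /* example of a single overridden entry */
    };

    int main(void)
    {
        printf("desc_irq[1]=%u desc_irq[46]=%u\n", desc_irq[1], desc_irq[46]);
        return 0;
    }
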
index aebb7ef..5a79bb3 100644 (file)
@@ -275,7 +275,7 @@ struct bcache_device {
 
        int (*cache_miss)(struct btree *b, struct search *s,
                          struct bio *bio, unsigned int sectors);
-       int (*ioctl)(struct bcache_device *d, fmode_t mode,
+       int (*ioctl)(struct bcache_device *d, blk_mode_t mode,
                     unsigned int cmd, unsigned long arg);
 };
 
@@ -1004,11 +1004,11 @@ extern struct workqueue_struct *bch_flush_wq;
 extern struct mutex bch_register_lock;
 extern struct list_head bch_cache_sets;
 
-extern struct kobj_type bch_cached_dev_ktype;
-extern struct kobj_type bch_flash_dev_ktype;
-extern struct kobj_type bch_cache_set_ktype;
-extern struct kobj_type bch_cache_set_internal_ktype;
-extern struct kobj_type bch_cache_ktype;
+extern const struct kobj_type bch_cached_dev_ktype;
+extern const struct kobj_type bch_flash_dev_ktype;
+extern const struct kobj_type bch_cache_set_ktype;
+extern const struct kobj_type bch_cache_set_internal_ktype;
+extern const struct kobj_type bch_cache_ktype;
 
 void bch_cached_dev_release(struct kobject *kobj);
 void bch_flash_dev_release(struct kobject *kobj);
index 147c493..fd121a6 100644 (file)
@@ -559,6 +559,27 @@ static void mca_data_alloc(struct btree *b, struct bkey *k, gfp_t gfp)
        }
 }
 
+#define cmp_int(l, r)          ((l > r) - (l < r))
+
+#ifdef CONFIG_PROVE_LOCKING
+static int btree_lock_cmp_fn(const struct lockdep_map *_a,
+                            const struct lockdep_map *_b)
+{
+       const struct btree *a = container_of(_a, struct btree, lock.dep_map);
+       const struct btree *b = container_of(_b, struct btree, lock.dep_map);
+
+       return -cmp_int(a->level, b->level) ?: bkey_cmp(&a->key, &b->key);
+}
+
+static void btree_lock_print_fn(const struct lockdep_map *map)
+{
+       const struct btree *b = container_of(map, struct btree, lock.dep_map);
+
+       printk(KERN_CONT " l=%u %llu:%llu", b->level,
+              KEY_INODE(&b->key), KEY_OFFSET(&b->key));
+}
+#endif
+
 static struct btree *mca_bucket_alloc(struct cache_set *c,
                                      struct bkey *k, gfp_t gfp)
 {
@@ -572,7 +593,7 @@ static struct btree *mca_bucket_alloc(struct cache_set *c,
                return NULL;
 
        init_rwsem(&b->lock);
-       lockdep_set_novalidate_class(&b->lock);
+       lock_set_cmp_fn(&b->lock, btree_lock_cmp_fn, btree_lock_print_fn);
        mutex_init(&b->write_lock);
        lockdep_set_novalidate_class(&b->write_lock);
        INIT_LIST_HEAD(&b->list);
@@ -885,7 +906,7 @@ static struct btree *mca_cannibalize(struct cache_set *c, struct btree_op *op,
  * cannibalize_bucket() will take. This means every time we unlock the root of
  * the btree, we need to release this lock if we have it held.
  */
-static void bch_cannibalize_unlock(struct cache_set *c)
+void bch_cannibalize_unlock(struct cache_set *c)
 {
        spin_lock(&c->btree_cannibalize_lock);
        if (c->btree_cache_alloc_lock == current) {
@@ -1090,10 +1111,12 @@ struct btree *__bch_btree_node_alloc(struct cache_set *c, struct btree_op *op,
                                     struct btree *parent)
 {
        BKEY_PADDED(key) k;
-       struct btree *b = ERR_PTR(-EAGAIN);
+       struct btree *b;
 
        mutex_lock(&c->bucket_lock);
 retry:
+       /* return ERR_PTR(-EAGAIN) when it fails */
+       b = ERR_PTR(-EAGAIN);
        if (__bch_bucket_alloc_set(c, RESERVE_BTREE, &k.key, wait))
                goto err;
 
@@ -1138,7 +1161,7 @@ static struct btree *btree_node_alloc_replacement(struct btree *b,
 {
        struct btree *n = bch_btree_node_alloc(b->c, op, b->level, b->parent);
 
-       if (!IS_ERR_OR_NULL(n)) {
+       if (!IS_ERR(n)) {
                mutex_lock(&n->write_lock);
                bch_btree_sort_into(&b->keys, &n->keys, &b->c->sort);
                bkey_copy_key(&n->key, &b->key);
@@ -1340,7 +1363,7 @@ static int btree_gc_coalesce(struct btree *b, struct btree_op *op,
        memset(new_nodes, 0, sizeof(new_nodes));
        closure_init_stack(&cl);
 
-       while (nodes < GC_MERGE_NODES && !IS_ERR_OR_NULL(r[nodes].b))
+       while (nodes < GC_MERGE_NODES && !IS_ERR(r[nodes].b))
                keys += r[nodes++].keys;
 
        blocks = btree_default_blocks(b->c) * 2 / 3;
@@ -1352,7 +1375,7 @@ static int btree_gc_coalesce(struct btree *b, struct btree_op *op,
 
        for (i = 0; i < nodes; i++) {
                new_nodes[i] = btree_node_alloc_replacement(r[i].b, NULL);
-               if (IS_ERR_OR_NULL(new_nodes[i]))
+               if (IS_ERR(new_nodes[i]))
                        goto out_nocoalesce;
        }
 
@@ -1487,7 +1510,7 @@ out_nocoalesce:
        bch_keylist_free(&keylist);
 
        for (i = 0; i < nodes; i++)
-               if (!IS_ERR_OR_NULL(new_nodes[i])) {
+               if (!IS_ERR(new_nodes[i])) {
                        btree_node_free(new_nodes[i]);
                        rw_unlock(true, new_nodes[i]);
                }
@@ -1669,7 +1692,7 @@ static int bch_btree_gc_root(struct btree *b, struct btree_op *op,
        if (should_rewrite) {
                n = btree_node_alloc_replacement(b, NULL);
 
-               if (!IS_ERR_OR_NULL(n)) {
+               if (!IS_ERR(n)) {
                        bch_btree_node_write_sync(n);
 
                        bch_btree_set_root(n);
@@ -1968,6 +1991,15 @@ static int bch_btree_check_thread(void *arg)
                        c->gc_stats.nodes++;
                        bch_btree_op_init(&op, 0);
                        ret = bcache_btree(check_recurse, p, c->root, &op);
+                       /*
+                        * The op may be added to cache_set's btree_cache_wait
+                        * in mca_cannibalize(), must ensure it is removed from
+                        * the list and release btree_cache_alloc_lock before
+                        * free op memory.
+                        * Otherwise, the btree_cache_wait will be damaged.
+                        */
+                       bch_cannibalize_unlock(c);
+                       finish_wait(&c->btree_cache_wait, &(&op)->wait);
                        if (ret)
                                goto out;
                }
index 1b5fdbc..45d64b5 100644 (file)
@@ -247,8 +247,8 @@ static inline void bch_btree_op_init(struct btree_op *op, int write_lock_level)
 
 static inline void rw_lock(bool w, struct btree *b, int level)
 {
-       w ? down_write_nested(&b->lock, level + 1)
-         : down_read_nested(&b->lock, level + 1);
+       w ? down_write(&b->lock)
+         : down_read(&b->lock);
        if (w)
                b->seq++;
 }
@@ -282,6 +282,7 @@ void bch_initial_gc_finish(struct cache_set *c);
 void bch_moving_gc(struct cache_set *c);
 int bch_btree_check(struct cache_set *c);
 void bch_initial_mark_key(struct cache_set *c, int level, struct bkey *k);
+void bch_cannibalize_unlock(struct cache_set *c);
 
 static inline void wake_up_gc(struct cache_set *c)
 {
index 67a2e29..a9b1f38 100644 (file)
@@ -1228,7 +1228,7 @@ void cached_dev_submit_bio(struct bio *bio)
                detached_dev_do_request(d, bio, orig_bdev, start_time);
 }
 
-static int cached_dev_ioctl(struct bcache_device *d, fmode_t mode,
+static int cached_dev_ioctl(struct bcache_device *d, blk_mode_t mode,
                            unsigned int cmd, unsigned long arg)
 {
        struct cached_dev *dc = container_of(d, struct cached_dev, disk);
@@ -1318,7 +1318,7 @@ void flash_dev_submit_bio(struct bio *bio)
        continue_at(cl, search_free, NULL);
 }
 
-static int flash_dev_ioctl(struct bcache_device *d, fmode_t mode,
+static int flash_dev_ioctl(struct bcache_device *d, blk_mode_t mode,
                           unsigned int cmd, unsigned long arg)
 {
        return -ENOTTY;
index bd3afc8..21b445f 100644 (file)
@@ -18,7 +18,6 @@ struct cache_stats {
        unsigned long cache_misses;
        unsigned long cache_bypass_hits;
        unsigned long cache_bypass_misses;
-       unsigned long cache_readaheads;
        unsigned long cache_miss_collisions;
        unsigned long sectors_bypassed;
 
index 7e9d19f..e2a8036 100644 (file)
@@ -732,9 +732,9 @@ out:
 
 /* Bcache device */
 
-static int open_dev(struct block_device *b, fmode_t mode)
+static int open_dev(struct gendisk *disk, blk_mode_t mode)
 {
-       struct bcache_device *d = b->bd_disk->private_data;
+       struct bcache_device *d = disk->private_data;
 
        if (test_bit(BCACHE_DEV_CLOSING, &d->flags))
                return -ENXIO;
@@ -743,14 +743,14 @@ static int open_dev(struct block_device *b, fmode_t mode)
        return 0;
 }
 
-static void release_dev(struct gendisk *b, fmode_t mode)
+static void release_dev(struct gendisk *b)
 {
        struct bcache_device *d = b->private_data;
 
        closure_put(&d->cl);
 }
 
-static int ioctl_dev(struct block_device *b, fmode_t mode,
+static int ioctl_dev(struct block_device *b, blk_mode_t mode,
                     unsigned int cmd, unsigned long arg)
 {
        struct bcache_device *d = b->bd_disk->private_data;
@@ -1369,7 +1369,7 @@ static void cached_dev_free(struct closure *cl)
                put_page(virt_to_page(dc->sb_disk));
 
        if (!IS_ERR_OR_NULL(dc->bdev))
-               blkdev_put(dc->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
+               blkdev_put(dc->bdev, bcache_kobj);
 
        wake_up(&unregister_wait);
 
@@ -1723,7 +1723,7 @@ static void cache_set_flush(struct closure *cl)
        if (!IS_ERR_OR_NULL(c->gc_thread))
                kthread_stop(c->gc_thread);
 
-       if (!IS_ERR_OR_NULL(c->root))
+       if (!IS_ERR(c->root))
                list_add(&c->root->list, &c->btree_cache);
 
        /*
@@ -2087,7 +2087,7 @@ static int run_cache_set(struct cache_set *c)
 
                err = "cannot allocate new btree root";
                c->root = __bch_btree_node_alloc(c, NULL, 0, true, NULL);
-               if (IS_ERR_OR_NULL(c->root))
+               if (IS_ERR(c->root))
                        goto err;
 
                mutex_lock(&c->root->write_lock);
@@ -2218,7 +2218,7 @@ void bch_cache_release(struct kobject *kobj)
                put_page(virt_to_page(ca->sb_disk));
 
        if (!IS_ERR_OR_NULL(ca->bdev))
-               blkdev_put(ca->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
+               blkdev_put(ca->bdev, bcache_kobj);
 
        kfree(ca);
        module_put(THIS_MODULE);
@@ -2359,7 +2359,7 @@ static int register_cache(struct cache_sb *sb, struct cache_sb_disk *sb_disk,
                 * call blkdev_put() to bdev in bch_cache_release(). So we
                 * explicitly call blkdev_put() here.
                 */
-               blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
+               blkdev_put(bdev, bcache_kobj);
                if (ret == -ENOMEM)
                        err = "cache_alloc(): -ENOMEM";
                else if (ret == -EPERM)
@@ -2461,7 +2461,7 @@ static void register_bdev_worker(struct work_struct *work)
        if (!dc) {
                fail = true;
                put_page(virt_to_page(args->sb_disk));
-               blkdev_put(args->bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+               blkdev_put(args->bdev, bcache_kobj);
                goto out;
        }
 
@@ -2491,7 +2491,7 @@ static void register_cache_worker(struct work_struct *work)
        if (!ca) {
                fail = true;
                put_page(virt_to_page(args->sb_disk));
-               blkdev_put(args->bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+               blkdev_put(args->bdev, bcache_kobj);
                goto out;
        }
 
@@ -2558,9 +2558,8 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 
        ret = -EINVAL;
        err = "failed to open device";
-       bdev = blkdev_get_by_path(strim(path),
-                                 FMODE_READ|FMODE_WRITE|FMODE_EXCL,
-                                 sb);
+       bdev = blkdev_get_by_path(strim(path), BLK_OPEN_READ | BLK_OPEN_WRITE,
+                                 bcache_kobj, NULL);
        if (IS_ERR(bdev)) {
                if (bdev == ERR_PTR(-EBUSY)) {
                        dev_t dev;
@@ -2648,7 +2647,7 @@ async_done:
 out_put_sb_page:
        put_page(virt_to_page(sb_disk));
 out_blkdev_put:
-       blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+       blkdev_put(bdev, register_bcache);
 out_free_sb:
        kfree(sb);
 out_free_path:
index c6f6770..0e2c188 100644 (file)
@@ -1111,26 +1111,25 @@ SHOW(__bch_cache)
 
                vfree(p);
 
-               ret = scnprintf(buf, PAGE_SIZE,
-                               "Unused:                %zu%%\n"
-                               "Clean:         %zu%%\n"
-                               "Dirty:         %zu%%\n"
-                               "Metadata:      %zu%%\n"
-                               "Average:       %llu\n"
-                               "Sectors per Q: %zu\n"
-                               "Quantiles:     [",
-                               unused * 100 / (size_t) ca->sb.nbuckets,
-                               available * 100 / (size_t) ca->sb.nbuckets,
-                               dirty * 100 / (size_t) ca->sb.nbuckets,
-                               meta * 100 / (size_t) ca->sb.nbuckets, sum,
-                               n * ca->sb.bucket_size / (ARRAY_SIZE(q) + 1));
+               ret = sysfs_emit(buf,
+                                "Unused:               %zu%%\n"
+                                "Clean:                %zu%%\n"
+                                "Dirty:                %zu%%\n"
+                                "Metadata:     %zu%%\n"
+                                "Average:      %llu\n"
+                                "Sectors per Q:        %zu\n"
+                                "Quantiles:    [",
+                                unused * 100 / (size_t) ca->sb.nbuckets,
+                                available * 100 / (size_t) ca->sb.nbuckets,
+                                dirty * 100 / (size_t) ca->sb.nbuckets,
+                                meta * 100 / (size_t) ca->sb.nbuckets, sum,
+                                n * ca->sb.bucket_size / (ARRAY_SIZE(q) + 1));
 
                for (i = 0; i < ARRAY_SIZE(q); i++)
-                       ret += scnprintf(buf + ret, PAGE_SIZE - ret,
-                                        "%u ", q[i]);
+                       ret += sysfs_emit_at(buf, ret, "%u ", q[i]);
                ret--;
 
-               ret += scnprintf(buf + ret, PAGE_SIZE - ret, "]\n");
+               ret += sysfs_emit_at(buf, ret, "]\n");
 
                return ret;
        }
index a2ff644..65b8bd9 100644 (file)
@@ -3,7 +3,7 @@
 #define _BCACHE_SYSFS_H_
 
 #define KTYPE(type)                                                    \
-struct kobj_type type ## _ktype = {                                    \
+const struct kobj_type type ## _ktype = {                                      \
        .release        = type ## _release,                             \
        .sysfs_ops      = &((const struct sysfs_ops) {                  \
                .show   = type ## _show,                                \
index d4a5fc0..24c0490 100644 (file)
@@ -890,6 +890,16 @@ static int bch_root_node_dirty_init(struct cache_set *c,
        if (ret < 0)
                pr_warn("sectors dirty init failed, ret=%d!\n", ret);
 
+       /*
+        * The op may be added to cache_set's btree_cache_wait
+        * in mca_cannibalize(), must ensure it is removed from
+        * the list and release btree_cache_alloc_lock before
+        * free op memory.
+        * Otherwise, the btree_cache_wait will be damaged.
+        */
+       bch_cannibalize_unlock(c);
+       finish_wait(&c->btree_cache_wait, &(&op.op)->wait);
+
        return ret;
 }
 
index 8728962..911f73f 100644 (file)
@@ -2051,8 +2051,8 @@ static int parse_metadata_dev(struct cache_args *ca, struct dm_arg_set *as,
        if (!at_least_one_arg(as, error))
                return -EINVAL;
 
-       r = dm_get_device(ca->ti, dm_shift_arg(as), FMODE_READ | FMODE_WRITE,
-                         &ca->metadata_dev);
+       r = dm_get_device(ca->ti, dm_shift_arg(as),
+                         BLK_OPEN_READ | BLK_OPEN_WRITE, &ca->metadata_dev);
        if (r) {
                *error = "Error opening metadata device";
                return r;
@@ -2074,8 +2074,8 @@ static int parse_cache_dev(struct cache_args *ca, struct dm_arg_set *as,
        if (!at_least_one_arg(as, error))
                return -EINVAL;
 
-       r = dm_get_device(ca->ti, dm_shift_arg(as), FMODE_READ | FMODE_WRITE,
-                         &ca->cache_dev);
+       r = dm_get_device(ca->ti, dm_shift_arg(as),
+                         BLK_OPEN_READ | BLK_OPEN_WRITE, &ca->cache_dev);
        if (r) {
                *error = "Error opening cache device";
                return r;
@@ -2093,8 +2093,8 @@ static int parse_origin_dev(struct cache_args *ca, struct dm_arg_set *as,
        if (!at_least_one_arg(as, error))
                return -EINVAL;
 
-       r = dm_get_device(ca->ti, dm_shift_arg(as), FMODE_READ | FMODE_WRITE,
-                         &ca->origin_dev);
+       r = dm_get_device(ca->ti, dm_shift_arg(as),
+                         BLK_OPEN_READ | BLK_OPEN_WRITE, &ca->origin_dev);
        if (r) {
                *error = "Error opening origin device";
                return r;
index f467cdb..94b2fc3 100644 (file)
@@ -1683,8 +1683,8 @@ static int parse_metadata_dev(struct clone *clone, struct dm_arg_set *as, char *
        int r;
        sector_t metadata_dev_size;
 
-       r = dm_get_device(clone->ti, dm_shift_arg(as), FMODE_READ | FMODE_WRITE,
-                         &clone->metadata_dev);
+       r = dm_get_device(clone->ti, dm_shift_arg(as),
+                         BLK_OPEN_READ | BLK_OPEN_WRITE, &clone->metadata_dev);
        if (r) {
                *error = "Error opening metadata device";
                return r;
@@ -1703,8 +1703,8 @@ static int parse_dest_dev(struct clone *clone, struct dm_arg_set *as, char **err
        int r;
        sector_t dest_dev_size;
 
-       r = dm_get_device(clone->ti, dm_shift_arg(as), FMODE_READ | FMODE_WRITE,
-                         &clone->dest_dev);
+       r = dm_get_device(clone->ti, dm_shift_arg(as),
+                         BLK_OPEN_READ | BLK_OPEN_WRITE, &clone->dest_dev);
        if (r) {
                *error = "Error opening destination device";
                return r;
@@ -1725,7 +1725,7 @@ static int parse_source_dev(struct clone *clone, struct dm_arg_set *as, char **e
        int r;
        sector_t source_dev_size;
 
-       r = dm_get_device(clone->ti, dm_shift_arg(as), FMODE_READ,
+       r = dm_get_device(clone->ti, dm_shift_arg(as), BLK_OPEN_READ,
                          &clone->source_dev);
        if (r) {
                *error = "Error opening source device";
index aecab0c..ce913ad 100644 (file)
@@ -207,11 +207,10 @@ struct dm_table {
        unsigned integrity_added:1;
 
        /*
-        * Indicates the rw permissions for the new logical
-        * device.  This should be a combination of FMODE_READ
-        * and FMODE_WRITE.
+        * Indicates the rw permissions for the new logical device.  This
+        * should be a combination of BLK_OPEN_READ and BLK_OPEN_WRITE.
         */
-       fmode_t mode;
+       blk_mode_t mode;
 
        /* a list of devices used by this table */
        struct list_head devices;
index 8b47b91..15424bf 100644 (file)
@@ -1693,8 +1693,7 @@ retry:
 
                len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;
 
-               bio_add_page(clone, page, len, 0);
-
+               __bio_add_page(clone, page, len, 0);
                remaining_size -= len;
        }
 
@@ -3256,7 +3255,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 
        cc->per_bio_data_size = ti->per_io_data_size =
                ALIGN(sizeof(struct dm_crypt_io) + cc->dmreq_start + additional_req_size,
-                     ARCH_KMALLOC_MINALIGN);
+                     ARCH_DMA_MINALIGN);
 
        ret = mempool_init(&cc->page_pool, BIO_MAX_VECS, crypt_page_alloc, crypt_page_free, cc);
        if (ret) {
index 0d70914..6acfa5b 100644 (file)
@@ -1482,14 +1482,16 @@ static int era_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 
        era->ti = ti;
 
-       r = dm_get_device(ti, argv[0], FMODE_READ | FMODE_WRITE, &era->metadata_dev);
+       r = dm_get_device(ti, argv[0], BLK_OPEN_READ | BLK_OPEN_WRITE,
+                         &era->metadata_dev);
        if (r) {
                ti->error = "Error opening metadata device";
                era_destroy(era);
                return -EINVAL;
        }
 
-       r = dm_get_device(ti, argv[1], FMODE_READ | FMODE_WRITE, &era->origin_dev);
+       r = dm_get_device(ti, argv[1], BLK_OPEN_READ | BLK_OPEN_WRITE,
+                         &era->origin_dev);
        if (r) {
                ti->error = "Error opening data device";
                era_destroy(era);
index d369457..2a71bcd 100644 (file)
@@ -293,8 +293,10 @@ static int __init dm_init_init(void)
 
        for (i = 0; i < ARRAY_SIZE(waitfor); i++) {
                if (waitfor[i]) {
+                       dev_t dev;
+
                        DMINFO("waiting for device %s ...", waitfor[i]);
-                       while (!dm_get_dev_t(waitfor[i]))
+                       while (early_lookup_bdev(waitfor[i], &dev))
                                fsleep(5000);
                }
        }
index 31838b1..63ec502 100644 (file)
@@ -4268,10 +4268,10 @@ static int dm_integrity_ctr(struct dm_target *ti, unsigned int argc, char **argv
        }
 
        /*
-        * If this workqueue were percpu, it would cause bio reordering
+        * If this workqueue weren't ordered, it would cause bio reordering
         * and reduced performance.
         */
-       ic->wait_wq = alloc_workqueue("dm-integrity-wait", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+       ic->wait_wq = alloc_ordered_workqueue("dm-integrity-wait", WQ_MEM_RECLAIM);
        if (!ic->wait_wq) {
                ti->error = "Cannot allocate workqueue";
                r = -ENOMEM;
index 7d5c9c5..6d30101 100644 (file)
@@ -861,7 +861,7 @@ static void __dev_status(struct mapped_device *md, struct dm_ioctl *param)
 
                table = dm_get_inactive_table(md, &srcu_idx);
                if (table) {
-                       if (!(dm_table_get_mode(table) & FMODE_WRITE))
+                       if (!(dm_table_get_mode(table) & BLK_OPEN_WRITE))
                                param->flags |= DM_READONLY_FLAG;
                        param->target_count = table->num_targets;
                }
@@ -1189,7 +1189,7 @@ static int do_resume(struct dm_ioctl *param)
                if (old_size && new_size && old_size != new_size)
                        need_resize_uevent = true;
 
-               if (dm_table_get_mode(new_map) & FMODE_WRITE)
+               if (dm_table_get_mode(new_map) & BLK_OPEN_WRITE)
                        set_disk_ro(dm_disk(md), 0);
                else
                        set_disk_ro(dm_disk(md), 1);
@@ -1378,12 +1378,12 @@ static int dev_arm_poll(struct file *filp, struct dm_ioctl *param, size_t param_
        return 0;
 }
 
-static inline fmode_t get_mode(struct dm_ioctl *param)
+static inline blk_mode_t get_mode(struct dm_ioctl *param)
 {
-       fmode_t mode = FMODE_READ | FMODE_WRITE;
+       blk_mode_t mode = BLK_OPEN_READ | BLK_OPEN_WRITE;
 
        if (param->flags & DM_READONLY_FLAG)
-               mode = FMODE_READ;
+               mode = BLK_OPEN_READ;
 
        return mode;
 }
index c8821fc..8846bf5 100644 (file)
@@ -3750,11 +3750,11 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
                 * canceling read-auto mode
                 */
                mddev->ro = 0;
-               if (!mddev->suspended && mddev->sync_thread)
+               if (!mddev->suspended)
                        md_wakeup_thread(mddev->sync_thread);
        }
        set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-       if (!mddev->suspended && mddev->thread)
+       if (!mddev->suspended)
                md_wakeup_thread(mddev->thread);
 
        return 0;
index 9c49f53..bf7a574 100644 (file)
@@ -1241,9 +1241,8 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
        int i;
        int r = -EINVAL;
        char *origin_path, *cow_path;
-       dev_t origin_dev, cow_dev;
        unsigned int args_used, num_flush_bios = 1;
-       fmode_t origin_mode = FMODE_READ;
+       blk_mode_t origin_mode = BLK_OPEN_READ;
 
        if (argc < 4) {
                ti->error = "requires 4 or more arguments";
@@ -1253,7 +1252,7 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 
        if (dm_target_is_snapshot_merge(ti)) {
                num_flush_bios = 2;
-               origin_mode = FMODE_WRITE;
+               origin_mode = BLK_OPEN_WRITE;
        }
 
        s = kzalloc(sizeof(*s), GFP_KERNEL);
@@ -1279,24 +1278,21 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
                ti->error = "Cannot get origin device";
                goto bad_origin;
        }
-       origin_dev = s->origin->bdev->bd_dev;
 
        cow_path = argv[0];
        argv++;
        argc--;
 
-       cow_dev = dm_get_dev_t(cow_path);
-       if (cow_dev && cow_dev == origin_dev) {
-               ti->error = "COW device cannot be the same as origin device";
-               r = -EINVAL;
-               goto bad_cow;
-       }
-
        r = dm_get_device(ti, cow_path, dm_table_get_mode(ti->table), &s->cow);
        if (r) {
                ti->error = "Cannot get COW device";
                goto bad_cow;
        }
+       if (s->cow->bdev && s->cow->bdev == s->origin->bdev) {
+               ti->error = "COW device cannot be the same as origin device";
+               r = -EINVAL;
+               goto bad_store;
+       }
 
        r = dm_exception_store_create(ti, argc, argv, s, &args_used, &s->store);
        if (r) {
index 1398f1d..7d208b2 100644 (file)
@@ -126,7 +126,7 @@ static int alloc_targets(struct dm_table *t, unsigned int num)
        return 0;
 }
 
-int dm_table_create(struct dm_table **result, fmode_t mode,
+int dm_table_create(struct dm_table **result, blk_mode_t mode,
                    unsigned int num_targets, struct mapped_device *md)
 {
        struct dm_table *t = kzalloc(sizeof(*t), GFP_KERNEL);
@@ -304,7 +304,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
  * device and not to touch the existing bdev field in case
  * it is accessed concurrently.
  */
-static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
+static int upgrade_mode(struct dm_dev_internal *dd, blk_mode_t new_mode,
                        struct mapped_device *md)
 {
        int r;
@@ -324,23 +324,13 @@ static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
 }
 
 /*
- * Convert the path to a device
- */
-dev_t dm_get_dev_t(const char *path)
-{
-       dev_t dev;
-
-       if (lookup_bdev(path, &dev))
-               dev = name_to_dev_t(path);
-       return dev;
-}
-EXPORT_SYMBOL_GPL(dm_get_dev_t);
-
-/*
  * Add a device to the list, or just increment the usage count if
  * it's already present.
+ *
+ * Note: the __ref annotation is needed because this function can call the
+ * __init-annotated early_lookup_bdev() when invoked from early boot code in
+ * dm-init.c.
  */
-int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
+int __ref dm_get_device(struct dm_target *ti, const char *path, blk_mode_t mode,
                  struct dm_dev **result)
 {
        int r;
@@ -358,9 +348,13 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
                if (MAJOR(dev) != major || MINOR(dev) != minor)
                        return -EOVERFLOW;
        } else {
-               dev = dm_get_dev_t(path);
-               if (!dev)
-                       return -ENODEV;
+               r = lookup_bdev(path, &dev);
+#ifndef MODULE
+               if (r && system_state < SYSTEM_RUNNING)
+                       r = early_lookup_bdev(path, &dev);
+#endif
+               if (r)
+                       return r;
        }
        if (dev == disk_devt(t->md->disk))
                return -EINVAL;
@@ -668,7 +662,8 @@ int dm_table_add_target(struct dm_table *t, const char *type,
                t->singleton = true;
        }
 
-       if (dm_target_always_writeable(ti->type) && !(t->mode & FMODE_WRITE)) {
+       if (dm_target_always_writeable(ti->type) &&
+           !(t->mode & BLK_OPEN_WRITE)) {
                ti->error = "target type may not be included in a read-only table";
                goto bad;
        }
@@ -2039,7 +2034,7 @@ struct list_head *dm_table_get_devices(struct dm_table *t)
        return &t->devices;
 }
 
-fmode_t dm_table_get_mode(struct dm_table *t)
+blk_mode_t dm_table_get_mode(struct dm_table *t)
 {
        return t->mode;
 }
index 39410bf..f1d0dcb 100644 (file)
@@ -3300,7 +3300,7 @@ static int pool_ctr(struct dm_target *ti, unsigned int argc, char **argv)
        unsigned long block_size;
        dm_block_t low_water_blocks;
        struct dm_dev *metadata_dev;
-       fmode_t metadata_mode;
+       blk_mode_t metadata_mode;
 
        /*
         * FIXME Remove validation from scope of lock.
@@ -3333,7 +3333,8 @@ static int pool_ctr(struct dm_target *ti, unsigned int argc, char **argv)
        if (r)
                goto out_unlock;
 
-       metadata_mode = FMODE_READ | ((pf.mode == PM_READ_ONLY) ? 0 : FMODE_WRITE);
+       metadata_mode = BLK_OPEN_READ |
+               ((pf.mode == PM_READ_ONLY) ? 0 : BLK_OPEN_WRITE);
        r = dm_get_device(ti, argv[0], metadata_mode, &metadata_dev);
        if (r) {
                ti->error = "Error opening metadata block device";
@@ -3341,7 +3342,7 @@ static int pool_ctr(struct dm_target *ti, unsigned int argc, char **argv)
        }
        warn_if_metadata_device_too_big(metadata_dev->bdev);
 
-       r = dm_get_device(ti, argv[1], FMODE_READ | FMODE_WRITE, &data_dev);
+       r = dm_get_device(ti, argv[1], BLK_OPEN_READ | BLK_OPEN_WRITE, &data_dev);
        if (r) {
                ti->error = "Error getting data device";
                goto out_metadata;
@@ -4222,7 +4223,7 @@ static int thin_ctr(struct dm_target *ti, unsigned int argc, char **argv)
                        goto bad_origin_dev;
                }
 
-               r = dm_get_device(ti, argv[2], FMODE_READ, &origin_dev);
+               r = dm_get_device(ti, argv[2], BLK_OPEN_READ, &origin_dev);
                if (r) {
                        ti->error = "Error opening origin device";
                        goto bad_origin_dev;
index a9ee2fa..3ef9f01 100644 (file)
@@ -607,7 +607,7 @@ int verity_fec_parse_opt_args(struct dm_arg_set *as, struct dm_verity *v,
        (*argc)--;
 
        if (!strcasecmp(arg_name, DM_VERITY_OPT_FEC_DEV)) {
-               r = dm_get_device(ti, arg_value, FMODE_READ, &v->fec->dev);
+               r = dm_get_device(ti, arg_value, BLK_OPEN_READ, &v->fec->dev);
                if (r) {
                        ti->error = "FEC device lookup failed";
                        return r;
index e35c16e..26adcfe 100644 (file)
@@ -1196,7 +1196,7 @@ static int verity_ctr(struct dm_target *ti, unsigned int argc, char **argv)
        if (r)
                goto bad;
 
-       if ((dm_table_get_mode(ti->table) & ~FMODE_READ)) {
+       if ((dm_table_get_mode(ti->table) & ~BLK_OPEN_READ)) {
                ti->error = "Device must be readonly";
                r = -EINVAL;
                goto bad;
@@ -1225,13 +1225,13 @@ static int verity_ctr(struct dm_target *ti, unsigned int argc, char **argv)
        }
        v->version = num;
 
-       r = dm_get_device(ti, argv[1], FMODE_READ, &v->data_dev);
+       r = dm_get_device(ti, argv[1], BLK_OPEN_READ, &v->data_dev);
        if (r) {
                ti->error = "Data device lookup failed";
                goto bad;
        }
 
-       r = dm_get_device(ti, argv[2], FMODE_READ, &v->hash_dev);
+       r = dm_get_device(ti, argv[2], BLK_OPEN_READ, &v->hash_dev);
        if (r) {
                ti->error = "Hash device lookup failed";
                goto bad;
index 8f0896a..9d3cca8 100644 (file)
@@ -577,7 +577,7 @@ static struct dmz_mblock *dmz_get_mblock_slow(struct dmz_metadata *zmd,
        bio->bi_iter.bi_sector = dmz_blk2sect(block);
        bio->bi_private = mblk;
        bio->bi_end_io = dmz_mblock_bio_end_io;
-       bio_add_page(bio, mblk->page, DMZ_BLOCK_SIZE, 0);
+       __bio_add_page(bio, mblk->page, DMZ_BLOCK_SIZE, 0);
        submit_bio(bio);
 
        return mblk;
@@ -728,7 +728,7 @@ static int dmz_write_mblock(struct dmz_metadata *zmd, struct dmz_mblock *mblk,
        bio->bi_iter.bi_sector = dmz_blk2sect(block);
        bio->bi_private = mblk;
        bio->bi_end_io = dmz_mblock_bio_end_io;
-       bio_add_page(bio, mblk->page, DMZ_BLOCK_SIZE, 0);
+       __bio_add_page(bio, mblk->page, DMZ_BLOCK_SIZE, 0);
        submit_bio(bio);
 
        return 0;
@@ -752,7 +752,7 @@ static int dmz_rdwr_block(struct dmz_dev *dev, enum req_op op,
        bio = bio_alloc(dev->bdev, 1, op | REQ_SYNC | REQ_META | REQ_PRIO,
                        GFP_NOIO);
        bio->bi_iter.bi_sector = dmz_blk2sect(block);
-       bio_add_page(bio, page, DMZ_BLOCK_SIZE, 0);
+       __bio_add_page(bio, page, DMZ_BLOCK_SIZE, 0);
        ret = submit_bio_wait(bio);
        bio_put(bio);
 
index fffb0cb..fe2d475 100644 (file)
@@ -207,7 +207,7 @@ static int __init local_init(void)
        if (r)
                return r;
 
-       deferred_remove_workqueue = alloc_workqueue("kdmremove", WQ_UNBOUND, 1);
+       deferred_remove_workqueue = alloc_ordered_workqueue("kdmremove", 0);
        if (!deferred_remove_workqueue) {
                r = -ENOMEM;
                goto out_uevent_exit;
@@ -310,13 +310,13 @@ int dm_deleting_md(struct mapped_device *md)
        return test_bit(DMF_DELETING, &md->flags);
 }
 
-static int dm_blk_open(struct block_device *bdev, fmode_t mode)
+static int dm_blk_open(struct gendisk *disk, blk_mode_t mode)
 {
        struct mapped_device *md;
 
        spin_lock(&_minor_lock);
 
-       md = bdev->bd_disk->private_data;
+       md = disk->private_data;
        if (!md)
                goto out;
 
@@ -334,7 +334,7 @@ out:
        return md ? 0 : -ENXIO;
 }
 
-static void dm_blk_close(struct gendisk *disk, fmode_t mode)
+static void dm_blk_close(struct gendisk *disk)
 {
        struct mapped_device *md;
 
@@ -448,7 +448,7 @@ static void dm_unprepare_ioctl(struct mapped_device *md, int srcu_idx)
        dm_put_live_table(md, srcu_idx);
 }
 
-static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
+static int dm_blk_ioctl(struct block_device *bdev, blk_mode_t mode,
                        unsigned int cmd, unsigned long arg)
 {
        struct mapped_device *md = bdev->bd_disk->private_data;
@@ -734,7 +734,7 @@ static char *_dm_claim_ptr = "I belong to device-mapper";
  * Open a table device so we can use it as a map destination.
  */
 static struct table_device *open_table_device(struct mapped_device *md,
-               dev_t dev, fmode_t mode)
+               dev_t dev, blk_mode_t mode)
 {
        struct table_device *td;
        struct block_device *bdev;
@@ -746,7 +746,7 @@ static struct table_device *open_table_device(struct mapped_device *md,
                return ERR_PTR(-ENOMEM);
        refcount_set(&td->count, 1);
 
-       bdev = blkdev_get_by_dev(dev, mode | FMODE_EXCL, _dm_claim_ptr);
+       bdev = blkdev_get_by_dev(dev, mode, _dm_claim_ptr, NULL);
        if (IS_ERR(bdev)) {
                r = PTR_ERR(bdev);
                goto out_free_td;
@@ -771,7 +771,7 @@ static struct table_device *open_table_device(struct mapped_device *md,
        return td;
 
 out_blkdev_put:
-       blkdev_put(bdev, mode | FMODE_EXCL);
+       blkdev_put(bdev, _dm_claim_ptr);
 out_free_td:
        kfree(td);
        return ERR_PTR(r);
@@ -784,14 +784,14 @@ static void close_table_device(struct table_device *td, struct mapped_device *md
 {
        if (md->disk->slave_dir)
                bd_unlink_disk_holder(td->dm_dev.bdev, md->disk);
-       blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
+       blkdev_put(td->dm_dev.bdev, _dm_claim_ptr);
        put_dax(td->dm_dev.dax_dev);
        list_del(&td->list);
        kfree(td);
 }
 
 static struct table_device *find_table_device(struct list_head *l, dev_t dev,
-                                             fmode_t mode)
+                                             blk_mode_t mode)
 {
        struct table_device *td;
 
@@ -802,7 +802,7 @@ static struct table_device *find_table_device(struct list_head *l, dev_t dev,
        return NULL;
 }
 
-int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
+int dm_get_table_device(struct mapped_device *md, dev_t dev, blk_mode_t mode,
                        struct dm_dev **result)
 {
        struct table_device *td;
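
The open_table_device() and close_table_device() hunks above follow the reworked block-device open API in this release: FMODE_EXCL is gone, passing a non-NULL holder to blkdev_get_by_dev() is what makes the open exclusive, and the same holder pointer must be handed back to blkdev_put() to release the claim. A rough sketch of that pairing with an invented claim pointer; the error handling is illustrative only:

#include <linux/blkdev.h>
#include <linux/err.h>

static char example_claim[] = "example exclusive-open holder";

static int example_open_bdev(dev_t dev)
{
        struct block_device *bdev;

        /* Non-NULL holder => exclusive open; NULL blk_holder_ops is allowed. */
        bdev = blkdev_get_by_dev(dev, BLK_OPEN_READ | BLK_OPEN_WRITE,
                                 example_claim, NULL);
        if (IS_ERR(bdev))
                return PTR_ERR(bdev);

        /* ... use the device ... */

        blkdev_put(bdev, example_claim);        /* same holder drops the claim */
        return 0;
}
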
index a856e0a..63d9010 100644 (file)
@@ -203,7 +203,7 @@ int dm_open_count(struct mapped_device *md);
 int dm_lock_for_deletion(struct mapped_device *md, bool mark_deferred, bool only_deferred);
 int dm_cancel_deferred_remove(struct mapped_device *md);
 int dm_request_based(struct mapped_device *md);
-int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
+int dm_get_table_device(struct mapped_device *md, dev_t dev, blk_mode_t mode,
                        struct dm_dev **result);
 void dm_put_table_device(struct mapped_device *md, struct dm_dev *d);
 
index 91836e6..6eaa0ea 100644 (file)
@@ -147,7 +147,8 @@ static void __init md_setup_drive(struct md_setup_args *args)
                if (p)
                        *p++ = 0;
 
-               dev = name_to_dev_t(devname);
+               if (early_lookup_bdev(devname, &dev))
+                       dev = 0;
                if (strncmp(devname, "/dev/", 5) == 0)
                        devname += 5;
                snprintf(comp_name, 63, "/dev/%s", devname);
index bc8d756..1ff7128 100644 (file)
@@ -54,14 +54,7 @@ __acquires(bitmap->lock)
 {
        unsigned char *mappage;
 
-       if (page >= bitmap->pages) {
-               /* This can happen if bitmap_start_sync goes beyond
-                * End-of-device while looking for a whole page.
-                * It is harmless.
-                */
-               return -EINVAL;
-       }
-
+       WARN_ON_ONCE(page >= bitmap->pages);
        if (bitmap->bp[page].hijacked) /* it's hijacked, don't try to alloc */
                return 0;
 
@@ -1023,7 +1016,6 @@ static int md_bitmap_file_test_bit(struct bitmap *bitmap, sector_t block)
        return set;
 }
 
-
 /* this gets called when the md device is ready to unplug its underlying
  * (slave) device queues -- before we let any writes go down, we need to
  * sync the dirty pages of the bitmap file to disk */
@@ -1033,8 +1025,7 @@ void md_bitmap_unplug(struct bitmap *bitmap)
        int dirty, need_write;
        int writing = 0;
 
-       if (!bitmap || !bitmap->storage.filemap ||
-           test_bit(BITMAP_STALE, &bitmap->flags))
+       if (!md_bitmap_enabled(bitmap))
                return;
 
        /* look at each page to see if there are any set bits that need to be
@@ -1063,6 +1054,35 @@ void md_bitmap_unplug(struct bitmap *bitmap)
 }
 EXPORT_SYMBOL(md_bitmap_unplug);
 
+struct bitmap_unplug_work {
+       struct work_struct work;
+       struct bitmap *bitmap;
+       struct completion *done;
+};
+
+static void md_bitmap_unplug_fn(struct work_struct *work)
+{
+       struct bitmap_unplug_work *unplug_work =
+               container_of(work, struct bitmap_unplug_work, work);
+
+       md_bitmap_unplug(unplug_work->bitmap);
+       complete(unplug_work->done);
+}
+
+void md_bitmap_unplug_async(struct bitmap *bitmap)
+{
+       DECLARE_COMPLETION_ONSTACK(done);
+       struct bitmap_unplug_work unplug_work;
+
+       INIT_WORK_ONSTACK(&unplug_work.work, md_bitmap_unplug_fn);
+       unplug_work.bitmap = bitmap;
+       unplug_work.done = &done;
+
+       queue_work(md_bitmap_wq, &unplug_work.work);
+       wait_for_completion(&done);
+}
+EXPORT_SYMBOL(md_bitmap_unplug_async);
+
 static void md_bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int needed);
 /* * bitmap_init_from_disk -- called at bitmap_create time to initialize
  * the in-memory bitmap from the on-disk bitmap -- also, sets up the
@@ -1241,11 +1261,28 @@ static bitmap_counter_t *md_bitmap_get_counter(struct bitmap_counts *bitmap,
                                               sector_t offset, sector_t *blocks,
                                               int create);
 
+static void mddev_set_timeout(struct mddev *mddev, unsigned long timeout,
+                             bool force)
+{
+       struct md_thread *thread;
+
+       rcu_read_lock();
+       thread = rcu_dereference(mddev->thread);
+
+       if (!thread)
+               goto out;
+
+       if (force || thread->timeout < MAX_SCHEDULE_TIMEOUT)
+               thread->timeout = timeout;
+
+out:
+       rcu_read_unlock();
+}
+
 /*
  * bitmap daemon -- periodically wakes up to clean bits and flush pages
  *                     out to disk
  */
-
 void md_bitmap_daemon_work(struct mddev *mddev)
 {
        struct bitmap *bitmap;
@@ -1269,7 +1306,7 @@ void md_bitmap_daemon_work(struct mddev *mddev)
 
        bitmap->daemon_lastrun = jiffies;
        if (bitmap->allclean) {
-               mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
+               mddev_set_timeout(mddev, MAX_SCHEDULE_TIMEOUT, true);
                goto done;
        }
        bitmap->allclean = 1;
@@ -1366,8 +1403,7 @@ void md_bitmap_daemon_work(struct mddev *mddev)
 
  done:
        if (bitmap->allclean == 0)
-               mddev->thread->timeout =
-                       mddev->bitmap_info.daemon_sleep;
+               mddev_set_timeout(mddev, mddev->bitmap_info.daemon_sleep, true);
        mutex_unlock(&mddev->bitmap_info.mutex);
 }
 
@@ -1387,6 +1423,14 @@ __acquires(bitmap->lock)
        sector_t csize;
        int err;
 
+       if (page >= bitmap->pages) {
+               /*
+                * This can happen if bitmap_start_sync goes beyond
+                * end-of-device while looking for a whole page, or if
+                * the user wrote a huge number to the bitmap_set_bits
+                * sysfs attribute.
+                */
+               return NULL;
+       }
        err = md_bitmap_checkpage(bitmap, page, create, 0);
 
        if (bitmap->bp[page].hijacked ||
@@ -1820,8 +1864,7 @@ void md_bitmap_destroy(struct mddev *mddev)
        mddev->bitmap = NULL; /* disconnect from the md device */
        spin_unlock(&mddev->lock);
        mutex_unlock(&mddev->bitmap_info.mutex);
-       if (mddev->thread)
-               mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
+       mddev_set_timeout(mddev, MAX_SCHEDULE_TIMEOUT, true);
 
        md_bitmap_free(bitmap);
 }
@@ -1964,7 +2007,7 @@ int md_bitmap_load(struct mddev *mddev)
        /* Kick recovery in case any bits were set */
        set_bit(MD_RECOVERY_NEEDED, &bitmap->mddev->recovery);
 
-       mddev->thread->timeout = mddev->bitmap_info.daemon_sleep;
+       mddev_set_timeout(mddev, mddev->bitmap_info.daemon_sleep, true);
        md_wakeup_thread(mddev->thread);
 
        md_bitmap_update_sb(bitmap);
@@ -2469,17 +2512,11 @@ timeout_store(struct mddev *mddev, const char *buf, size_t len)
                timeout = MAX_SCHEDULE_TIMEOUT-1;
        if (timeout < 1)
                timeout = 1;
+
        mddev->bitmap_info.daemon_sleep = timeout;
-       if (mddev->thread) {
-               /* if thread->timeout is MAX_SCHEDULE_TIMEOUT, then
-                * the bitmap is all clean and we don't need to
-                * adjust the timeout right now
-                */
-               if (mddev->thread->timeout < MAX_SCHEDULE_TIMEOUT) {
-                       mddev->thread->timeout = timeout;
-                       md_wakeup_thread(mddev->thread);
-               }
-       }
+       mddev_set_timeout(mddev, timeout, false);
+       md_wakeup_thread(mddev->thread);
+
        return len;
 }
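
The md_bitmap_unplug_async() helper added earlier in this md-bitmap.c hunk uses the on-stack work plus completion idiom: the caller queues a work item that lives on its own stack and then blocks in wait_for_completion(), which guarantees the work function has finished touching the stack object before the frame is unwound. A generic restatement of the idiom with invented names, assuming a caller that is allowed to sleep:

#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/printk.h>
#include <linux/workqueue.h>

struct example_ctx {
        struct work_struct work;
        int arg;
        struct completion *done;
};

static void example_fn(struct work_struct *work)
{
        struct example_ctx *ctx = container_of(work, struct example_ctx, work);

        pr_info("running with arg=%d in workqueue context\n", ctx->arg);
        complete(ctx->done);            /* last touch of the on-stack object */
}

static void example_run_and_wait(struct workqueue_struct *wq, int arg)
{
        DECLARE_COMPLETION_ONSTACK(done);
        struct example_ctx ctx = { .arg = arg, .done = &done };

        INIT_WORK_ONSTACK(&ctx.work, example_fn);
        queue_work(wq, &ctx.work);
        wait_for_completion(&done);     /* ctx must stay valid until here */
        destroy_work_on_stack(&ctx.work);
}
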
 
index cfd7395..8a3788c 100644 (file)
@@ -264,6 +264,7 @@ void md_bitmap_sync_with_cluster(struct mddev *mddev,
                                 sector_t new_lo, sector_t new_hi);
 
 void md_bitmap_unplug(struct bitmap *bitmap);
+void md_bitmap_unplug_async(struct bitmap *bitmap);
 void md_bitmap_daemon_work(struct mddev *mddev);
 
 int md_bitmap_resize(struct bitmap *bitmap, sector_t blocks,
@@ -273,6 +274,13 @@ int md_bitmap_copy_from_slot(struct mddev *mddev, int slot,
                             sector_t *lo, sector_t *hi, bool clear_bits);
 void md_bitmap_free(struct bitmap *bitmap);
 void md_bitmap_wait_behind_writes(struct mddev *mddev);
+
+static inline bool md_bitmap_enabled(struct bitmap *bitmap)
+{
+       return bitmap && bitmap->storage.filemap &&
+              !test_bit(BITMAP_STALE, &bitmap->flags);
+}
+
 #endif
 
 #endif
index 10e0c53..3d9fd74 100644 (file)
@@ -75,14 +75,14 @@ struct md_cluster_info {
        sector_t suspend_hi;
        int suspend_from; /* the slot which broadcast suspend_lo/hi */
 
-       struct md_thread *recovery_thread;
+       struct md_thread __rcu *recovery_thread;
        unsigned long recovery_map;
        /* communication loc resources */
        struct dlm_lock_resource *ack_lockres;
        struct dlm_lock_resource *message_lockres;
        struct dlm_lock_resource *token_lockres;
        struct dlm_lock_resource *no_new_dev_lockres;
-       struct md_thread *recv_thread;
+       struct md_thread __rcu *recv_thread;
        struct completion newdisk_completion;
        wait_queue_head_t wait;
        unsigned long state;
@@ -362,8 +362,8 @@ static void __recover_slot(struct mddev *mddev, int slot)
 
        set_bit(slot, &cinfo->recovery_map);
        if (!cinfo->recovery_thread) {
-               cinfo->recovery_thread = md_register_thread(recover_bitmaps,
-                               mddev, "recover");
+               rcu_assign_pointer(cinfo->recovery_thread,
+                       md_register_thread(recover_bitmaps, mddev, "recover"));
                if (!cinfo->recovery_thread) {
                        pr_warn("md-cluster: Could not create recovery thread\n");
                        return;
@@ -526,11 +526,15 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
 static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
 {
        int got_lock = 0;
+       struct md_thread *thread;
        struct md_cluster_info *cinfo = mddev->cluster_info;
        mddev->good_device_nr = le32_to_cpu(msg->raid_slot);
 
        dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR);
-       wait_event(mddev->thread->wqueue,
+
+       /* the daemon thread must exist */
+       thread = rcu_dereference_protected(mddev->thread, true);
+       wait_event(thread->wqueue,
                   (got_lock = mddev_trylock(mddev)) ||
                    test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
        md_reload_sb(mddev, mddev->good_device_nr);
@@ -889,7 +893,8 @@ static int join(struct mddev *mddev, int nodes)
        }
        /* Initiate the communication resources */
        ret = -ENOMEM;
-       cinfo->recv_thread = md_register_thread(recv_daemon, mddev, "cluster_recv");
+       rcu_assign_pointer(cinfo->recv_thread,
+                       md_register_thread(recv_daemon, mddev, "cluster_recv"));
        if (!cinfo->recv_thread) {
                pr_err("md-cluster: cannot allocate memory for recv_thread!\n");
                goto err;
index 66edf5e..92c45be 100644 (file)
@@ -400,8 +400,8 @@ static int multipath_run (struct mddev *mddev)
        if (ret)
                goto out_free_conf;
 
-       mddev->thread = md_register_thread(multipathd, mddev,
-                                          "multipath");
+       rcu_assign_pointer(mddev->thread,
+                          md_register_thread(multipathd, mddev, "multipath"));
        if (!mddev->thread)
                goto out_free_conf;
 
index 8e344b4..cf3733c 100644 (file)
 #include "md-bitmap.h"
 #include "md-cluster.h"
 
-/* pers_list is a list of registered personalities protected
- * by pers_lock.
- * pers_lock does extra service to protect accesses to
- * mddev->thread when the mutex cannot be held.
- */
+/* pers_list is a list of registered personalities protected by pers_lock. */
 static LIST_HEAD(pers_list);
 static DEFINE_SPINLOCK(pers_lock);
 
@@ -87,23 +83,13 @@ static struct module *md_cluster_mod;
 static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
 static struct workqueue_struct *md_wq;
 static struct workqueue_struct *md_misc_wq;
-static struct workqueue_struct *md_rdev_misc_wq;
+struct workqueue_struct *md_bitmap_wq;
 
 static int remove_and_add_spares(struct mddev *mddev,
                                 struct md_rdev *this);
 static void mddev_detach(struct mddev *mddev);
-
-enum md_ro_state {
-       MD_RDWR,
-       MD_RDONLY,
-       MD_AUTO_READ,
-       MD_MAX_STATE
-};
-
-static bool md_is_rdwr(struct mddev *mddev)
-{
-       return (mddev->ro == MD_RDWR);
-}
+static void export_rdev(struct md_rdev *rdev, struct mddev *mddev);
+static void md_wakeup_thread_directly(struct md_thread __rcu *thread);
 
 /*
  * Default number of read corrections we'll attempt on an rdev
@@ -360,10 +346,6 @@ EXPORT_SYMBOL_GPL(md_new_event);
 static LIST_HEAD(all_mddevs);
 static DEFINE_SPINLOCK(all_mddevs_lock);
 
-static bool is_md_suspended(struct mddev *mddev)
-{
-       return percpu_ref_is_dying(&mddev->active_io);
-}
 /* Rather than calling directly into the personality make_request function,
  * IO requests come here first so that we can check if the device is
  * being suspended pending a reconfiguration.
@@ -457,13 +439,19 @@ static void md_submit_bio(struct bio *bio)
  */
 void mddev_suspend(struct mddev *mddev)
 {
-       WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
-       lockdep_assert_held(&mddev->reconfig_mutex);
+       struct md_thread *thread = rcu_dereference_protected(mddev->thread,
+                       lockdep_is_held(&mddev->reconfig_mutex));
+
+       WARN_ON_ONCE(thread && current == thread->tsk);
        if (mddev->suspended++)
                return;
        wake_up(&mddev->sb_wait);
        set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
        percpu_ref_kill(&mddev->active_io);
+
+       if (mddev->pers->prepare_suspend)
+               mddev->pers->prepare_suspend(mddev);
+
        wait_event(mddev->sb_wait, percpu_ref_is_zero(&mddev->active_io));
        mddev->pers->quiesce(mddev, 1);
        clear_bit_unlock(MD_ALLOW_SB_UPDATE, &mddev->flags);
@@ -655,9 +643,11 @@ void mddev_init(struct mddev *mddev)
 {
        mutex_init(&mddev->open_mutex);
        mutex_init(&mddev->reconfig_mutex);
+       mutex_init(&mddev->delete_mutex);
        mutex_init(&mddev->bitmap_info.mutex);
        INIT_LIST_HEAD(&mddev->disks);
        INIT_LIST_HEAD(&mddev->all_mddevs);
+       INIT_LIST_HEAD(&mddev->deleting);
        timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
        atomic_set(&mddev->active, 1);
        atomic_set(&mddev->openers, 0);
@@ -759,6 +749,24 @@ static void mddev_free(struct mddev *mddev)
 
 static const struct attribute_group md_redundancy_group;
 
+static void md_free_rdev(struct mddev *mddev)
+{
+       struct md_rdev *rdev;
+       struct md_rdev *tmp;
+
+       mutex_lock(&mddev->delete_mutex);
+       if (list_empty(&mddev->deleting))
+               goto out;
+
+       list_for_each_entry_safe(rdev, tmp, &mddev->deleting, same_set) {
+               list_del_init(&rdev->same_set);
+               kobject_del(&rdev->kobj);
+               export_rdev(rdev, mddev);
+       }
+out:
+       mutex_unlock(&mddev->delete_mutex);
+}
+
 void mddev_unlock(struct mddev *mddev)
 {
        if (mddev->to_remove) {
@@ -800,13 +808,10 @@ void mddev_unlock(struct mddev *mddev)
        } else
                mutex_unlock(&mddev->reconfig_mutex);
 
-       /* As we've dropped the mutex we need a spinlock to
-        * make sure the thread doesn't disappear
-        */
-       spin_lock(&pers_lock);
+       md_free_rdev(mddev);
+
        md_wakeup_thread(mddev->thread);
        wake_up(&mddev->sb_wait);
-       spin_unlock(&pers_lock);
 }
 EXPORT_SYMBOL_GPL(mddev_unlock);
 
@@ -938,7 +943,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
        atomic_inc(&rdev->nr_pending);
 
        bio->bi_iter.bi_sector = sector;
-       bio_add_page(bio, page, size, 0);
+       __bio_add_page(bio, page, size, 0);
        bio->bi_private = rdev;
        bio->bi_end_io = super_written;
 
@@ -979,7 +984,7 @@ int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
                bio.bi_iter.bi_sector = sector + rdev->new_data_offset;
        else
                bio.bi_iter.bi_sector = sector + rdev->data_offset;
-       bio_add_page(&bio, page, size, 0);
+       __bio_add_page(&bio, page, size, 0);
 
        submit_bio_wait(&bio);
 
@@ -2440,16 +2445,12 @@ static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev)
        return err;
 }
 
-static void rdev_delayed_delete(struct work_struct *ws)
-{
-       struct md_rdev *rdev = container_of(ws, struct md_rdev, del_work);
-       kobject_del(&rdev->kobj);
-       kobject_put(&rdev->kobj);
-}
-
 void md_autodetect_dev(dev_t dev);
 
-static void export_rdev(struct md_rdev *rdev)
+/* just for claiming the bdev */
+static struct md_rdev claim_rdev;
+
+static void export_rdev(struct md_rdev *rdev, struct mddev *mddev)
 {
        pr_debug("md: export_rdev(%pg)\n", rdev->bdev);
        md_rdev_clear(rdev);
@@ -2457,13 +2458,15 @@ static void export_rdev(struct md_rdev *rdev)
        if (test_bit(AutoDetected, &rdev->flags))
                md_autodetect_dev(rdev->bdev->bd_dev);
 #endif
-       blkdev_put(rdev->bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+       blkdev_put(rdev->bdev, mddev->major_version == -2 ? &claim_rdev : rdev);
        rdev->bdev = NULL;
        kobject_put(&rdev->kobj);
 }
 
 static void md_kick_rdev_from_array(struct md_rdev *rdev)
 {
+       struct mddev *mddev = rdev->mddev;
+
        bd_unlink_disk_holder(rdev->bdev, rdev->mddev->gendisk);
        list_del_rcu(&rdev->same_set);
        pr_debug("md: unbind<%pg>\n", rdev->bdev);
@@ -2477,15 +2480,17 @@ static void md_kick_rdev_from_array(struct md_rdev *rdev)
        rdev->sysfs_unack_badblocks = NULL;
        rdev->sysfs_badblocks = NULL;
        rdev->badblocks.count = 0;
-       /* We need to delay this, otherwise we can deadlock when
-        * writing to 'remove' to "dev/state".  We also need
-        * to delay it due to rcu usage.
-        */
+
        synchronize_rcu();
-       INIT_WORK(&rdev->del_work, rdev_delayed_delete);
-       kobject_get(&rdev->kobj);
-       queue_work(md_rdev_misc_wq, &rdev->del_work);
-       export_rdev(rdev);
+
+       /*
+        * kobject_del() waits for all in-progress sysfs writers to finish,
+        * and those writers hold reconfig_mutex; hence it can't be called
+        * with reconfig_mutex held and is deferred to mddev_unlock().
+        */
+       mutex_lock(&mddev->delete_mutex);
+       list_add(&rdev->same_set, &mddev->deleting);
+       mutex_unlock(&mddev->delete_mutex);
 }
 
 static void export_array(struct mddev *mddev)
@@ -3553,6 +3558,7 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr,
 {
        struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
        struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj);
+       struct kernfs_node *kn = NULL;
        ssize_t rv;
        struct mddev *mddev = rdev->mddev;
 
@@ -3560,6 +3566,10 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr,
                return -EIO;
        if (!capable(CAP_SYS_ADMIN))
                return -EACCES;
+
+       if (entry->store == state_store && cmd_match(page, "remove"))
+               kn = sysfs_break_active_protection(kobj, attr);
+
        rv = mddev ? mddev_lock(mddev) : -ENODEV;
        if (!rv) {
                if (rdev->mddev == NULL)
@@ -3568,6 +3578,10 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr,
                        rv = entry->store(rdev, page, length);
                mddev_unlock(mddev);
        }
+
+       if (kn)
+               sysfs_unbreak_active_protection(kn);
+
        return rv;
 }
 
@@ -3612,6 +3626,7 @@ int md_rdev_init(struct md_rdev *rdev)
        return badblocks_init(&rdev->badblocks, 0);
 }
 EXPORT_SYMBOL_GPL(md_rdev_init);
+
 /*
  * Import a device. If 'super_format' >= 0, then sanity check the superblock
  *
@@ -3624,7 +3639,6 @@ EXPORT_SYMBOL_GPL(md_rdev_init);
  */
 static struct md_rdev *md_import_device(dev_t newdev, int super_format, int super_minor)
 {
-       static struct md_rdev claim_rdev; /* just for claiming the bdev */
        struct md_rdev *rdev;
        sector_t size;
        int err;
@@ -3640,9 +3654,8 @@ static struct md_rdev *md_import_device(dev_t newdev, int super_format, int supe
        if (err)
                goto out_clear_rdev;
 
-       rdev->bdev = blkdev_get_by_dev(newdev,
-                       FMODE_READ | FMODE_WRITE | FMODE_EXCL,
-                       super_format == -2 ? &claim_rdev : rdev);
+       rdev->bdev = blkdev_get_by_dev(newdev, BLK_OPEN_READ | BLK_OPEN_WRITE,
+                       super_format == -2 ? &claim_rdev : rdev, NULL);
        if (IS_ERR(rdev->bdev)) {
                pr_warn("md: could not open device unknown-block(%u,%u).\n",
                        MAJOR(newdev), MINOR(newdev));
@@ -3679,7 +3692,7 @@ static struct md_rdev *md_import_device(dev_t newdev, int super_format, int supe
        return rdev;
 
 out_blkdev_put:
-       blkdev_put(rdev->bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+       blkdev_put(rdev->bdev, super_format == -2 ? &claim_rdev : rdev);
 out_clear_rdev:
        md_rdev_clear(rdev);
 out_free_rdev:
@@ -3794,8 +3807,9 @@ int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale)
 static ssize_t
 safe_delay_show(struct mddev *mddev, char *page)
 {
-       int msec = (mddev->safemode_delay*1000)/HZ;
-       return sprintf(page, "%d.%03d\n", msec/1000, msec%1000);
+       unsigned int msec = ((unsigned long)mddev->safemode_delay*1000)/HZ;
+
+       return sprintf(page, "%u.%03u\n", msec/1000, msec%1000);
 }
 static ssize_t
 safe_delay_store(struct mddev *mddev, const char *cbuf, size_t len)
@@ -3807,7 +3821,7 @@ safe_delay_store(struct mddev *mddev, const char *cbuf, size_t len)
                return -EINVAL;
        }
 
-       if (strict_strtoul_scaled(cbuf, &msec, 3) < 0)
+       if (strict_strtoul_scaled(cbuf, &msec, 3) < 0 || msec > UINT_MAX / HZ)
                return -EINVAL;
        if (msec == 0)
                mddev->safemode_delay = 0;
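
The added bound in safe_delay_store() rejects millisecond values above UINT_MAX / HZ. The value is presumably scaled to jiffies by multiplying with HZ later on, and without the check a huge user-supplied value would wrap during that multiplication, in particular on 32-bit. A tiny sketch of the guard; the conversion shown is an assumption for illustration, not copied from the driver:

#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/limits.h>

/* Reject inputs that would overflow when scaled towards jiffies. */
static int example_store_safemode_delay(unsigned long msec,
                                        unsigned long *delay)
{
        if (msec > UINT_MAX / HZ)       /* msec * HZ could wrap */
                return -EINVAL;

        *delay = msec * HZ / 1000;      /* assumed conversion */
        if (*delay == 0 && msec != 0)
                *delay = 1;             /* keep tiny non-zero delays non-zero */
        return 0;
}
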
@@ -4477,6 +4491,8 @@ max_corrected_read_errors_store(struct mddev *mddev, const char *buf, size_t len
        rv = kstrtouint(buf, 10, &n);
        if (rv < 0)
                return rv;
+       if (n > INT_MAX)
+               return -EINVAL;
        atomic_set(&mddev->max_corr_read_errors, n);
        return len;
 }
@@ -4491,20 +4507,6 @@ null_show(struct mddev *mddev, char *page)
        return -EINVAL;
 }
 
-/* need to ensure rdev_delayed_delete() has completed */
-static void flush_rdev_wq(struct mddev *mddev)
-{
-       struct md_rdev *rdev;
-
-       rcu_read_lock();
-       rdev_for_each_rcu(rdev, mddev)
-               if (work_pending(&rdev->del_work)) {
-                       flush_workqueue(md_rdev_misc_wq);
-                       break;
-               }
-       rcu_read_unlock();
-}
-
 static ssize_t
 new_dev_store(struct mddev *mddev, const char *buf, size_t len)
 {
@@ -4532,7 +4534,6 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len)
            minor != MINOR(dev))
                return -EOVERFLOW;
 
-       flush_rdev_wq(mddev);
        err = mddev_lock(mddev);
        if (err)
                return err;
@@ -4560,7 +4561,7 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len)
        err = bind_rdev_to_array(rdev, mddev);
  out:
        if (err)
-               export_rdev(rdev);
+               export_rdev(rdev, mddev);
        mddev_unlock(mddev);
        if (!err)
                md_new_event();
@@ -4804,11 +4805,21 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                        return -EINVAL;
                err = mddev_lock(mddev);
                if (!err) {
-                       if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+                       if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
                                err =  -EBUSY;
-                       else {
+                       } else if (mddev->reshape_position == MaxSector ||
+                                  mddev->pers->check_reshape == NULL ||
+                                  mddev->pers->check_reshape(mddev)) {
                                clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
                                err = mddev->pers->start_reshape(mddev);
+                       } else {
+                               /*
+                                * If a reshape is still in progress, and
+                                * md_check_recovery() can continue it, don't
+                                * restart the reshape here, because doing so
+                                * can corrupt data on raid456.
+                                */
+                               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
                        }
                        mddev_unlock(mddev);
                }
@@ -5592,7 +5603,6 @@ struct mddev *md_alloc(dev_t dev, char *name)
         * removed (mddev_delayed_delete).
         */
        flush_workqueue(md_misc_wq);
-       flush_workqueue(md_rdev_misc_wq);
 
        mutex_lock(&disks_mutex);
        mddev = mddev_alloc(dev);
@@ -6269,10 +6279,12 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
        }
        if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
                set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-       if (mddev->sync_thread)
-               /* Thread might be blocked waiting for metadata update
-                * which will now never happen */
-               wake_up_process(mddev->sync_thread->tsk);
+
+       /*
+        * Thread might be blocked waiting for metadata update which will now
+        * never happen
+        */
+       md_wakeup_thread_directly(mddev->sync_thread);
 
        if (mddev->external && test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
                return -EBUSY;
@@ -6333,10 +6345,12 @@ static int do_md_stop(struct mddev *mddev, int mode,
        }
        if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
                set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-       if (mddev->sync_thread)
-               /* Thread might be blocked waiting for metadata update
-                * which will now never happen */
-               wake_up_process(mddev->sync_thread->tsk);
+
+       /*
+        * Thread might be blocked waiting for metadata update which will now
+        * never happen
+        */
+       md_wakeup_thread_directly(mddev->sync_thread);
 
        mddev_unlock(mddev);
        wait_event(resync_wait, (mddev->sync_thread == NULL &&
@@ -6498,7 +6512,7 @@ static void autorun_devices(int part)
                        rdev_for_each_list(rdev, tmp, &candidates) {
                                list_del_init(&rdev->same_set);
                                if (bind_rdev_to_array(rdev, mddev))
-                                       export_rdev(rdev);
+                                       export_rdev(rdev, mddev);
                        }
                        autorun_array(mddev);
                        mddev_unlock(mddev);
@@ -6508,7 +6522,7 @@ static void autorun_devices(int part)
                 */
                rdev_for_each_list(rdev, tmp, &candidates) {
                        list_del_init(&rdev->same_set);
-                       export_rdev(rdev);
+                       export_rdev(rdev, mddev);
                }
                mddev_put(mddev);
        }
@@ -6696,13 +6710,13 @@ int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info)
                                pr_warn("md: %pg has different UUID to %pg\n",
                                        rdev->bdev,
                                        rdev0->bdev);
-                               export_rdev(rdev);
+                               export_rdev(rdev, mddev);
                                return -EINVAL;
                        }
                }
                err = bind_rdev_to_array(rdev, mddev);
                if (err)
-                       export_rdev(rdev);
+                       export_rdev(rdev, mddev);
                return err;
        }
 
@@ -6733,7 +6747,6 @@ int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info)
                        if (info->state & (1<<MD_DISK_SYNC)  &&
                            info->raid_disk < mddev->raid_disks) {
                                rdev->raid_disk = info->raid_disk;
-                               set_bit(In_sync, &rdev->flags);
                                clear_bit(Bitmap_sync, &rdev->flags);
                        } else
                                rdev->raid_disk = -1;
@@ -6746,7 +6759,7 @@ int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info)
                        /* This was a hot-add request, but events doesn't
                         * match, so reject it.
                         */
-                       export_rdev(rdev);
+                       export_rdev(rdev, mddev);
                        return -EINVAL;
                }
 
@@ -6772,7 +6785,7 @@ int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info)
                                }
                        }
                        if (has_journal || mddev->bitmap) {
-                               export_rdev(rdev);
+                               export_rdev(rdev, mddev);
                                return -EBUSY;
                        }
                        set_bit(Journal, &rdev->flags);
@@ -6787,7 +6800,7 @@ int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info)
                                /* --add initiated by this node */
                                err = md_cluster_ops->add_new_disk(mddev, rdev);
                                if (err) {
-                                       export_rdev(rdev);
+                                       export_rdev(rdev, mddev);
                                        return err;
                                }
                        }
@@ -6797,7 +6810,7 @@ int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info)
                err = bind_rdev_to_array(rdev, mddev);
 
                if (err)
-                       export_rdev(rdev);
+                       export_rdev(rdev, mddev);
 
                if (mddev_is_clustered(mddev)) {
                        if (info->state & (1 << MD_DISK_CANDIDATE)) {
@@ -6860,7 +6873,7 @@ int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info)
 
                err = bind_rdev_to_array(rdev, mddev);
                if (err) {
-                       export_rdev(rdev);
+                       export_rdev(rdev, mddev);
                        return err;
                }
        }
@@ -6985,7 +6998,7 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev)
        return 0;
 
 abort_export:
-       export_rdev(rdev);
+       export_rdev(rdev, mddev);
        return err;
 }
 
@@ -7486,7 +7499,7 @@ static int __md_set_array_info(struct mddev *mddev, void __user *argp)
        return err;
 }
 
-static int md_ioctl(struct block_device *bdev, fmode_t mode,
+static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
                        unsigned int cmd, unsigned long arg)
 {
        int err = 0;
@@ -7555,9 +7568,6 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
 
        }
 
-       if (cmd == ADD_NEW_DISK || cmd == HOT_ADD_DISK)
-               flush_rdev_wq(mddev);
-
        if (cmd == HOT_REMOVE_DISK)
                /* need to ensure recovery thread has run */
                wait_event_interruptible_timeout(mddev->sb_wait,
@@ -7718,7 +7728,7 @@ out:
        return err;
 }
 #ifdef CONFIG_COMPAT
-static int md_compat_ioctl(struct block_device *bdev, fmode_t mode,
+static int md_compat_ioctl(struct block_device *bdev, blk_mode_t mode,
                    unsigned int cmd, unsigned long arg)
 {
        switch (cmd) {
@@ -7767,13 +7777,13 @@ out_unlock:
        return err;
 }
 
-static int md_open(struct block_device *bdev, fmode_t mode)
+static int md_open(struct gendisk *disk, blk_mode_t mode)
 {
        struct mddev *mddev;
        int err;
 
        spin_lock(&all_mddevs_lock);
-       mddev = mddev_get(bdev->bd_disk->private_data);
+       mddev = mddev_get(disk->private_data);
        spin_unlock(&all_mddevs_lock);
        if (!mddev)
                return -ENODEV;
@@ -7789,7 +7799,7 @@ static int md_open(struct block_device *bdev, fmode_t mode)
        atomic_inc(&mddev->openers);
        mutex_unlock(&mddev->open_mutex);
 
-       bdev_check_media_change(bdev);
+       disk_check_media_change(disk);
        return 0;
 
 out_unlock:
@@ -7799,7 +7809,7 @@ out:
        return err;
 }
 
-static void md_release(struct gendisk *disk, fmode_t mode)
+static void md_release(struct gendisk *disk)
 {
        struct mddev *mddev = disk->private_data;
 
@@ -7886,13 +7896,29 @@ static int md_thread(void *arg)
        return 0;
 }
 
-void md_wakeup_thread(struct md_thread *thread)
+static void md_wakeup_thread_directly(struct md_thread __rcu *thread)
 {
-       if (thread) {
-               pr_debug("md: waking up MD thread %s.\n", thread->tsk->comm);
-               set_bit(THREAD_WAKEUP, &thread->flags);
-               wake_up(&thread->wqueue);
+       struct md_thread *t;
+
+       rcu_read_lock();
+       t = rcu_dereference(thread);
+       if (t)
+               wake_up_process(t->tsk);
+       rcu_read_unlock();
+}
+
+void md_wakeup_thread(struct md_thread __rcu *thread)
+{
+       struct md_thread *t;
+
+       rcu_read_lock();
+       t = rcu_dereference(thread);
+       if (t) {
+               pr_debug("md: waking up MD thread %s.\n", t->tsk->comm);
+               set_bit(THREAD_WAKEUP, &t->flags);
+               wake_up(&t->wqueue);
        }
+       rcu_read_unlock();
 }
 EXPORT_SYMBOL(md_wakeup_thread);
 
@@ -7922,22 +7948,15 @@ struct md_thread *md_register_thread(void (*run) (struct md_thread *),
 }
 EXPORT_SYMBOL(md_register_thread);
 
-void md_unregister_thread(struct md_thread **threadp)
+void md_unregister_thread(struct md_thread __rcu **threadp)
 {
-       struct md_thread *thread;
+       struct md_thread *thread = rcu_dereference_protected(*threadp, true);
 
-       /*
-        * Locking ensures that mddev_unlock does not wake_up a
-        * non-existent thread
-        */
-       spin_lock(&pers_lock);
-       thread = *threadp;
-       if (!thread) {
-               spin_unlock(&pers_lock);
+       if (!thread)
                return;
-       }
-       *threadp = NULL;
-       spin_unlock(&pers_lock);
+
+       rcu_assign_pointer(*threadp, NULL);
+       synchronize_rcu();
 
        pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
        kthread_stop(thread->tsk);
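
The md_wakeup_thread() and md_unregister_thread() hunks above replace the pers_lock dance with RCU protection of the thread pointers: wakers dereference the pointer only inside rcu_read_lock(), while md_unregister_thread() NULLs the pointer and calls synchronize_rcu() before stopping the kthread, so a waker can never touch a thread that is being freed. A minimal sketch of the resulting lifecycle; the pointer and run function are invented:

#include "md.h"         /* struct mddev, struct md_thread, md_* helpers */

static struct md_thread __rcu *example_thread;

static void example_run(struct md_thread *thread)
{
        /* the thread's periodic work would go here */
}

static int example_start(struct mddev *mddev)
{
        struct md_thread *t = md_register_thread(example_run, mddev, "example");

        if (!t)
                return -ENOMEM;
        rcu_assign_pointer(example_thread, t);  /* publish for RCU readers */
        return 0;
}

static void example_wake(void)
{
        /* md_wakeup_thread() dereferences the pointer under rcu_read_lock() */
        md_wakeup_thread(example_thread);
}

static void example_stop(void)
{
        /* NULLs the pointer, synchronize_rcu(), then kthread_stop() */
        md_unregister_thread(&example_thread);
}
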
@@ -9100,6 +9119,7 @@ void md_do_sync(struct md_thread *thread)
        spin_unlock(&mddev->lock);
 
        wake_up(&resync_wait);
+       wake_up(&mddev->sb_wait);
        md_wakeup_thread(mddev->thread);
        return;
 }
@@ -9202,9 +9222,8 @@ static void md_start_sync(struct work_struct *ws)
 {
        struct mddev *mddev = container_of(ws, struct mddev, del_work);
 
-       mddev->sync_thread = md_register_thread(md_do_sync,
-                                               mddev,
-                                               "resync");
+       rcu_assign_pointer(mddev->sync_thread,
+                          md_register_thread(md_do_sync, mddev, "resync"));
        if (!mddev->sync_thread) {
                pr_warn("%s: could not start resync thread...\n",
                        mdname(mddev));
@@ -9619,9 +9638,10 @@ static int __init md_init(void)
        if (!md_misc_wq)
                goto err_misc_wq;
 
-       md_rdev_misc_wq = alloc_workqueue("md_rdev_misc", 0, 0);
-       if (!md_rdev_misc_wq)
-               goto err_rdev_misc_wq;
+       md_bitmap_wq = alloc_workqueue("md_bitmap", WQ_MEM_RECLAIM | WQ_UNBOUND,
+                                      0);
+       if (!md_bitmap_wq)
+               goto err_bitmap_wq;
 
        ret = __register_blkdev(MD_MAJOR, "md", md_probe);
        if (ret < 0)
@@ -9641,8 +9661,8 @@ static int __init md_init(void)
 err_mdp:
        unregister_blkdev(MD_MAJOR, "md");
 err_md:
-       destroy_workqueue(md_rdev_misc_wq);
-err_rdev_misc_wq:
+       destroy_workqueue(md_bitmap_wq);
+err_bitmap_wq:
        destroy_workqueue(md_misc_wq);
 err_misc_wq:
        destroy_workqueue(md_wq);
@@ -9938,8 +9958,8 @@ static __exit void md_exit(void)
        }
        spin_unlock(&all_mddevs_lock);
 
-       destroy_workqueue(md_rdev_misc_wq);
        destroy_workqueue(md_misc_wq);
+       destroy_workqueue(md_bitmap_wq);
        destroy_workqueue(md_wq);
 }
 
index fd8f260..bfd2306 100644 (file)
@@ -122,8 +122,6 @@ struct md_rdev {
 
        struct serial_in_rdev *serial;  /* used for raid1 io serialization */
 
-       struct work_struct del_work;    /* used for delayed sysfs removal */
-
        struct kernfs_node *sysfs_state; /* handle for 'state'
                                           * sysfs entry */
        /* handle for 'unacknowledged_bad_blocks' sysfs dentry */
@@ -367,8 +365,8 @@ struct mddev {
        int                             new_chunk_sectors;
        int                             reshape_backwards;
 
-       struct md_thread                *thread;        /* management thread */
-       struct md_thread                *sync_thread;   /* doing resync or reconstruct */
+       struct md_thread __rcu          *thread;        /* management thread */
+       struct md_thread __rcu          *sync_thread;   /* doing resync or reconstruct */
 
        /* 'last_sync_action' is initialized to "none".  It is set when a
         * sync operation (i.e "data-check", "requested-resync", "resync",
@@ -531,6 +529,14 @@ struct mddev {
        unsigned int                    good_device_nr; /* good device num within cluster raid */
        unsigned int                    noio_flag; /* for memalloc scope API */
 
+       /*
+        * Temporarily store rdevs that will finally be removed once
+        * reconfig_mutex is unlocked.
+        */
+       struct list_head                deleting;
+       /* Protect the deleting list */
+       struct mutex                    delete_mutex;
+
        bool    has_superblocks:1;
        bool    fail_last_dev:1;
        bool    serialize_policy:1;
@@ -555,6 +561,23 @@ enum recovery_flags {
        MD_RESYNCING_REMOTE,    /* remote node is running resync thread */
 };
 
+enum md_ro_state {
+       MD_RDWR,
+       MD_RDONLY,
+       MD_AUTO_READ,
+       MD_MAX_STATE
+};
+
+static inline bool md_is_rdwr(struct mddev *mddev)
+{
+       return (mddev->ro == MD_RDWR);
+}
+
+static inline bool is_md_suspended(struct mddev *mddev)
+{
+       return percpu_ref_is_dying(&mddev->active_io);
+}
+
 static inline int __must_check mddev_lock(struct mddev *mddev)
 {
        return mutex_lock_interruptible(&mddev->reconfig_mutex);
@@ -614,6 +637,7 @@ struct md_personality
        int (*start_reshape) (struct mddev *mddev);
        void (*finish_reshape) (struct mddev *mddev);
        void (*update_reshape_pos) (struct mddev *mddev);
+       void (*prepare_suspend) (struct mddev *mddev);
        /* quiesce suspends or resumes internal processing.
         * 1 - stop new actions and wait for action io to complete
         * 0 - return to normal behaviour
@@ -734,8 +758,8 @@ extern struct md_thread *md_register_thread(
        void (*run)(struct md_thread *thread),
        struct mddev *mddev,
        const char *name);
-extern void md_unregister_thread(struct md_thread **threadp);
-extern void md_wakeup_thread(struct md_thread *thread);
+extern void md_unregister_thread(struct md_thread __rcu **threadp);
+extern void md_wakeup_thread(struct md_thread __rcu *thread);
 extern void md_check_recovery(struct mddev *mddev);
 extern void md_reap_sync_thread(struct mddev *mddev);
 extern int mddev_init_writes_pending(struct mddev *mddev);
@@ -828,6 +852,7 @@ struct mdu_array_info_s;
 struct mdu_disk_info_s;
 
 extern int mdp_major;
+extern struct workqueue_struct *md_bitmap_wq;
 void md_autostart_arrays(int part);
 int md_set_array_info(struct mddev *mddev, struct mdu_array_info_s *info);
 int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info);
index e61f6ca..169ebe2 100644 (file)
@@ -21,6 +21,7 @@
 #define IO_MADE_GOOD ((struct bio *)2)
 
 #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
+#define MAX_PLUG_BIO 32
 
 /* for managing resync I/O pages */
 struct resync_pages {
@@ -31,6 +32,7 @@ struct resync_pages {
 struct raid1_plug_cb {
        struct blk_plug_cb      cb;
        struct bio_list         pending;
+       unsigned int            count;
 };
 
 static void rbio_pool_free(void *rbio, void *data)
@@ -101,11 +103,73 @@ static void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
                struct page *page = resync_fetch_page(rp, idx);
                int len = min_t(int, size, PAGE_SIZE);
 
-               /*
-                * won't fail because the vec table is big
-                * enough to hold all these pages
-                */
-               bio_add_page(bio, page, len, 0);
+               if (WARN_ON(!bio_add_page(bio, page, len, 0))) {
+                       bio->bi_status = BLK_STS_RESOURCE;
+                       bio_endio(bio);
+                       return;
+               }
+
                size -= len;
        } while (idx++ < RESYNC_PAGES && size > 0);
 }
+
+
+static inline void raid1_submit_write(struct bio *bio)
+{
+       struct md_rdev *rdev = (struct md_rdev *)bio->bi_bdev;
+
+       bio->bi_next = NULL;
+       bio_set_dev(bio, rdev->bdev);
+       if (test_bit(Faulty, &rdev->flags))
+               bio_io_error(bio);
+       else if (unlikely(bio_op(bio) ==  REQ_OP_DISCARD &&
+                         !bdev_max_discard_sectors(bio->bi_bdev)))
+               /* Just ignore it */
+               bio_endio(bio);
+       else
+               submit_bio_noacct(bio);
+}
+
+static inline bool raid1_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
+                                     blk_plug_cb_fn unplug, int copies)
+{
+       struct raid1_plug_cb *plug = NULL;
+       struct blk_plug_cb *cb;
+
+       /*
+        * If bitmap is not enabled, it's safe to submit the io directly, and
+        * this can get optimal performance.
+        */
+       if (!md_bitmap_enabled(mddev->bitmap)) {
+               raid1_submit_write(bio);
+               return true;
+       }
+
+       cb = blk_check_plugged(unplug, mddev, sizeof(*plug));
+       if (!cb)
+               return false;
+
+       plug = container_of(cb, struct raid1_plug_cb, cb);
+       bio_list_add(&plug->pending, bio);
+       if (++plug->count / MAX_PLUG_BIO >= copies) {
+               list_del(&cb->list);
+               cb->callback(cb, false);
+       }
+
+
+       return true;
+}
+
+/*
+ * current->bio_list is set under submit_bio() context; in that case bitmap
+ * io is added to the list and waits for the current io submission to finish,
+ * while the current io submission must wait for the bitmap io to complete.
+ * To avoid this deadlock, submit bitmap io asynchronously.
+ */
+static inline void raid1_prepare_flush_writes(struct bitmap *bitmap)
+{
+       if (current->bio_list)
+               md_bitmap_unplug_async(bitmap);
+       else
+               md_bitmap_unplug(bitmap);
+}
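
raid1_add_bio_to_plug() above batches writes on the per-task plug and flushes early once roughly MAX_PLUG_BIO bios per copy have accumulated (the ++plug->count / MAX_PLUG_BIO >= copies test), instead of waiting for blk_finish_plug(); when no bitmap is attached it bypasses the plug entirely, and raid1_prepare_flush_writes() defers the bitmap unplug when called from submit_bio() context. A toy, self-contained sketch of just the counting logic, with hypothetical names and no block-layer types:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_PLUG_BIO 32

    /* Toy model of the plug counter: report "flush now" once roughly
     * MAX_PLUG_BIO requests per copy have been queued. */
    struct toy_plug {
            unsigned int count;
    };

    static bool toy_add_to_plug(struct toy_plug *plug, int copies)
    {
            if (++plug->count / MAX_PLUG_BIO >= copies) {
                    plug->count = 0;    /* models the plug cb being torn down */
                    return true;        /* caller flushes the pending list now */
            }
            return false;               /* keep batching until the plug is finished */
    }

    int main(void)
    {
            struct toy_plug plug = { 0 };
            int flushes = 0;

            for (int i = 0; i < 200; i++)
                    if (toy_add_to_plug(&plug, 2))   /* e.g. a 2-copy raid1 write */
                            flushes++;

            printf("flushed %d times for 200 queued requests\n", flushes);
            return 0;
    }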
index 68a9e2d..dd25832 100644 (file)
@@ -794,22 +794,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 static void flush_bio_list(struct r1conf *conf, struct bio *bio)
 {
        /* flush any pending bitmap writes to disk before proceeding w/ I/O */
-       md_bitmap_unplug(conf->mddev->bitmap);
+       raid1_prepare_flush_writes(conf->mddev->bitmap);
        wake_up(&conf->wait_barrier);
 
        while (bio) { /* submit pending writes */
                struct bio *next = bio->bi_next;
-               struct md_rdev *rdev = (void *)bio->bi_bdev;
-               bio->bi_next = NULL;
-               bio_set_dev(bio, rdev->bdev);
-               if (test_bit(Faulty, &rdev->flags)) {
-                       bio_io_error(bio);
-               } else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
-                                   !bdev_max_discard_sectors(bio->bi_bdev)))
-                       /* Just ignore it */
-                       bio_endio(bio);
-               else
-                       submit_bio_noacct(bio);
+
+               raid1_submit_write(bio);
                bio = next;
                cond_resched();
        }
@@ -1147,7 +1138,10 @@ static void alloc_behind_master_bio(struct r1bio *r1_bio,
                if (unlikely(!page))
                        goto free_pages;
 
-               bio_add_page(behind_bio, page, len, 0);
+               if (!bio_add_page(behind_bio, page, len, 0)) {
+                       put_page(page);
+                       goto free_pages;
+               }
 
                size -= len;
                i++;
@@ -1175,7 +1169,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
        struct r1conf *conf = mddev->private;
        struct bio *bio;
 
-       if (from_schedule || current->bio_list) {
+       if (from_schedule) {
                spin_lock_irq(&conf->device_lock);
                bio_list_merge(&conf->pending_bio_list, &plug->pending);
                spin_unlock_irq(&conf->device_lock);
@@ -1343,8 +1337,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
        struct bitmap *bitmap = mddev->bitmap;
        unsigned long flags;
        struct md_rdev *blocked_rdev;
-       struct blk_plug_cb *cb;
-       struct raid1_plug_cb *plug = NULL;
        int first_clone;
        int max_sectors;
        bool write_behind = false;
@@ -1573,15 +1565,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
                                              r1_bio->sector);
                /* flush_pending_writes() needs access to the rdev so...*/
                mbio->bi_bdev = (void *)rdev;
-
-               cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
-               if (cb)
-                       plug = container_of(cb, struct raid1_plug_cb, cb);
-               else
-                       plug = NULL;
-               if (plug) {
-                       bio_list_add(&plug->pending, mbio);
-               } else {
+               if (!raid1_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) {
                        spin_lock_irqsave(&conf->device_lock, flags);
                        bio_list_add(&conf->pending_bio_list, mbio);
                        spin_unlock_irqrestore(&conf->device_lock, flags);
@@ -2914,7 +2898,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
                                 * won't fail because the vec table is big
                                 * enough to hold all these pages
                                 */
-                               bio_add_page(bio, page, len, 0);
+                               __bio_add_page(bio, page, len, 0);
                        }
                }
                nr_sectors += len>>9;
@@ -3084,7 +3068,8 @@ static struct r1conf *setup_conf(struct mddev *mddev)
        }
 
        err = -ENOMEM;
-       conf->thread = md_register_thread(raid1d, mddev, "raid1");
+       rcu_assign_pointer(conf->thread,
+                          md_register_thread(raid1d, mddev, "raid1"));
        if (!conf->thread)
                goto abort;
 
@@ -3177,8 +3162,8 @@ static int raid1_run(struct mddev *mddev)
        /*
         * Ok, everything is just fine now
         */
-       mddev->thread = conf->thread;
-       conf->thread = NULL;
+       rcu_assign_pointer(mddev->thread, conf->thread);
+       rcu_assign_pointer(conf->thread, NULL);
        mddev->private = conf;
        set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
 
index ebb6788..468f189 100644 (file)
@@ -130,7 +130,7 @@ struct r1conf {
        /* When taking over an array from a different personality, we store
         * the new thread here until we fully activate the array.
         */
-       struct md_thread        *thread;
+       struct md_thread __rcu  *thread;
 
        /* Keep track of cluster resync window to send to other
         * nodes.
index 4fcfcb3..d0de8c9 100644 (file)
@@ -779,8 +779,16 @@ static struct md_rdev *read_balance(struct r10conf *conf,
                disk = r10_bio->devs[slot].devnum;
                rdev = rcu_dereference(conf->mirrors[disk].replacement);
                if (rdev == NULL || test_bit(Faulty, &rdev->flags) ||
-                   r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
+                   r10_bio->devs[slot].addr + sectors >
+                   rdev->recovery_offset) {
+                       /*
+                        * Read replacement first to prevent reading both rdev
+                        * and replacement as NULL while the replacement is
+                        * replacing the rdev.
+                        */
+                       smp_mb();
                        rdev = rcu_dereference(conf->mirrors[disk].rdev);
+               }
                if (rdev == NULL ||
                    test_bit(Faulty, &rdev->flags))
                        continue;
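
The read_balance() hunk above (and the matching one in raid10_write_request() below) orders the two loads: replacement is read first, then smp_mb(), then rdev, so a concurrent promotion of the replacement into the rdev slot can never leave a reader seeing both pointers as NULL. A stand-alone C11 analogue of that message-passing pattern, with seq_cst fences standing in for smp_mb() and purely illustrative names:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int device = 42;
    static _Atomic(int *) rdev;                    /* starts NULL: old disk is gone */
    static _Atomic(int *) replacement = &device;

    /* Writer: promote the replacement into the rdev slot, then clear it. */
    static void *promote(void *arg)
    {
            (void)arg;
            atomic_store_explicit(&rdev, &device, memory_order_relaxed);
            atomic_thread_fence(memory_order_seq_cst);   /* writer-side smp_mb() */
            atomic_store_explicit(&replacement, NULL, memory_order_relaxed);
            return NULL;
    }

    int main(void)
    {
            pthread_t t;

            pthread_create(&t, NULL, promote, NULL);

            /* Reader: replacement first, fence, then rdev. */
            int *repl = atomic_load_explicit(&replacement, memory_order_relaxed);
            atomic_thread_fence(memory_order_seq_cst);   /* reader-side smp_mb() */
            int *dev  = atomic_load_explicit(&rdev, memory_order_relaxed);

            /* With this order, the pair (repl, dev) can never be (NULL, NULL):
             * if replacement reads NULL, the fences guarantee the earlier
             * store to rdev is visible. */
            printf("replacement=%p rdev=%p\n", (void *)repl, (void *)dev);

            pthread_join(t, NULL);
            return 0;
    }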
@@ -902,25 +910,15 @@ static void flush_pending_writes(struct r10conf *conf)
                __set_current_state(TASK_RUNNING);
 
                blk_start_plug(&plug);
-               /* flush any pending bitmap writes to disk
-                * before proceeding w/ I/O */
-               md_bitmap_unplug(conf->mddev->bitmap);
+               raid1_prepare_flush_writes(conf->mddev->bitmap);
                wake_up(&conf->wait_barrier);
 
                while (bio) { /* submit pending writes */
                        struct bio *next = bio->bi_next;
-                       struct md_rdev *rdev = (void*)bio->bi_bdev;
-                       bio->bi_next = NULL;
-                       bio_set_dev(bio, rdev->bdev);
-                       if (test_bit(Faulty, &rdev->flags)) {
-                               bio_io_error(bio);
-                       } else if (unlikely((bio_op(bio) ==  REQ_OP_DISCARD) &&
-                                           !bdev_max_discard_sectors(bio->bi_bdev)))
-                               /* Just ignore it */
-                               bio_endio(bio);
-                       else
-                               submit_bio_noacct(bio);
+
+                       raid1_submit_write(bio);
                        bio = next;
+                       cond_resched();
                }
                blk_finish_plug(&plug);
        } else
@@ -982,6 +980,7 @@ static void lower_barrier(struct r10conf *conf)
 static bool stop_waiting_barrier(struct r10conf *conf)
 {
        struct bio_list *bio_list = current->bio_list;
+       struct md_thread *thread;
 
        /* barrier is dropped */
        if (!conf->barrier)
@@ -997,12 +996,14 @@ static bool stop_waiting_barrier(struct r10conf *conf)
            (!bio_list_empty(&bio_list[0]) || !bio_list_empty(&bio_list[1])))
                return true;
 
+       /* daemon thread must exist while handling io */
+       thread = rcu_dereference_protected(conf->mddev->thread, true);
        /*
         * move on if io is issued from raid10d(), nr_pending is not released
         * from original io(see handle_read_error()). All raise barrier is
         * blocked until this io is done.
         */
-       if (conf->mddev->thread->tsk == current) {
+       if (thread->tsk == current) {
                WARN_ON_ONCE(atomic_read(&conf->nr_pending) == 0);
                return true;
        }
@@ -1113,7 +1114,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
        struct r10conf *conf = mddev->private;
        struct bio *bio;
 
-       if (from_schedule || current->bio_list) {
+       if (from_schedule) {
                spin_lock_irq(&conf->device_lock);
                bio_list_merge(&conf->pending_bio_list, &plug->pending);
                spin_unlock_irq(&conf->device_lock);
@@ -1125,23 +1126,15 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 
        /* we aren't scheduling, so we can do the write-out directly. */
        bio = bio_list_get(&plug->pending);
-       md_bitmap_unplug(mddev->bitmap);
+       raid1_prepare_flush_writes(mddev->bitmap);
        wake_up(&conf->wait_barrier);
 
        while (bio) { /* submit pending writes */
                struct bio *next = bio->bi_next;
-               struct md_rdev *rdev = (void*)bio->bi_bdev;
-               bio->bi_next = NULL;
-               bio_set_dev(bio, rdev->bdev);
-               if (test_bit(Faulty, &rdev->flags)) {
-                       bio_io_error(bio);
-               } else if (unlikely((bio_op(bio) ==  REQ_OP_DISCARD) &&
-                                   !bdev_max_discard_sectors(bio->bi_bdev)))
-                       /* Just ignore it */
-                       bio_endio(bio);
-               else
-                       submit_bio_noacct(bio);
+
+               raid1_submit_write(bio);
                bio = next;
+               cond_resched();
        }
        kfree(plug);
 }
@@ -1282,8 +1275,6 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
        const blk_opf_t do_sync = bio->bi_opf & REQ_SYNC;
        const blk_opf_t do_fua = bio->bi_opf & REQ_FUA;
        unsigned long flags;
-       struct blk_plug_cb *cb;
-       struct raid1_plug_cb *plug = NULL;
        struct r10conf *conf = mddev->private;
        struct md_rdev *rdev;
        int devnum = r10_bio->devs[n_copy].devnum;
@@ -1323,14 +1314,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
 
        atomic_inc(&r10_bio->remaining);
 
-       cb = blk_check_plugged(raid10_unplug, mddev, sizeof(*plug));
-       if (cb)
-               plug = container_of(cb, struct raid1_plug_cb, cb);
-       else
-               plug = NULL;
-       if (plug) {
-               bio_list_add(&plug->pending, mbio);
-       } else {
+       if (!raid1_add_bio_to_plug(mddev, mbio, raid10_unplug, conf->copies)) {
                spin_lock_irqsave(&conf->device_lock, flags);
                bio_list_add(&conf->pending_bio_list, mbio);
                spin_unlock_irqrestore(&conf->device_lock, flags);
@@ -1479,9 +1463,15 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 
        for (i = 0;  i < conf->copies; i++) {
                int d = r10_bio->devs[i].devnum;
-               struct md_rdev *rdev = rcu_dereference(conf->mirrors[d].rdev);
-               struct md_rdev *rrdev = rcu_dereference(
-                       conf->mirrors[d].replacement);
+               struct md_rdev *rdev, *rrdev;
+
+               rrdev = rcu_dereference(conf->mirrors[d].replacement);
+               /*
+                * Read replacement first to prevent reading both rdev and
+                * replacement as NULL while the replacement is replacing the rdev.
+                */
+               smp_mb();
+               rdev = rcu_dereference(conf->mirrors[d].rdev);
                if (rdev == rrdev)
                        rrdev = NULL;
                if (rdev && (test_bit(Faulty, &rdev->flags)))
@@ -2148,9 +2138,10 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 {
        struct r10conf *conf = mddev->private;
        int err = -EEXIST;
-       int mirror;
+       int mirror, repl_slot = -1;
        int first = 0;
        int last = conf->geo.raid_disks - 1;
+       struct raid10_info *p;
 
        if (mddev->recovery_cp < MaxSector)
                /* only hot-add to in-sync arrays, as recovery is
@@ -2173,23 +2164,14 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
        else
                mirror = first;
        for ( ; mirror <= last ; mirror++) {
-               struct raid10_info *p = &conf->mirrors[mirror];
+               p = &conf->mirrors[mirror];
                if (p->recovery_disabled == mddev->recovery_disabled)
                        continue;
                if (p->rdev) {
-                       if (!test_bit(WantReplacement, &p->rdev->flags) ||
-                           p->replacement != NULL)
-                               continue;
-                       clear_bit(In_sync, &rdev->flags);
-                       set_bit(Replacement, &rdev->flags);
-                       rdev->raid_disk = mirror;
-                       err = 0;
-                       if (mddev->gendisk)
-                               disk_stack_limits(mddev->gendisk, rdev->bdev,
-                                                 rdev->data_offset << 9);
-                       conf->fullsync = 1;
-                       rcu_assign_pointer(p->replacement, rdev);
-                       break;
+                       if (test_bit(WantReplacement, &p->rdev->flags) &&
+                           p->replacement == NULL && repl_slot < 0)
+                               repl_slot = mirror;
+                       continue;
                }
 
                if (mddev->gendisk)
@@ -2206,6 +2188,19 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
                break;
        }
 
+       if (err && repl_slot >= 0) {
+               p = &conf->mirrors[repl_slot];
+               clear_bit(In_sync, &rdev->flags);
+               set_bit(Replacement, &rdev->flags);
+               rdev->raid_disk = repl_slot;
+               err = 0;
+               if (mddev->gendisk)
+                       disk_stack_limits(mddev->gendisk, rdev->bdev,
+                                         rdev->data_offset << 9);
+               conf->fullsync = 1;
+               rcu_assign_pointer(p->replacement, rdev);
+       }
+
        print_conf(conf);
        return err;
 }
@@ -3303,6 +3298,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
        int chunks_skipped = 0;
        sector_t chunk_mask = conf->geo.chunk_mask;
        int page_idx = 0;
+       int error_disk = -1;
 
        /*
         * Allow skipping a full rebuild for incremental assembly
@@ -3386,8 +3382,21 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
                return reshape_request(mddev, sector_nr, skipped);
 
        if (chunks_skipped >= conf->geo.raid_disks) {
-               /* if there has been nothing to do on any drive,
-                * then there is nothing to do at all..
+               pr_err("md/raid10:%s: %s fails\n", mdname(mddev),
+                       test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?  "resync" : "recovery");
+               if (error_disk >= 0 &&
+                   !test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+                       /*
+                        * recovery failed: set mirrors.recovery_disabled so
+                        * the device won't be added back there.
+                        */
+                       conf->mirrors[error_disk].recovery_disabled =
+                                               mddev->recovery_disabled;
+                       return 0;
+               }
+               /*
+                * if there has been nothing to do on any drive,
+                * then there is nothing to do at all.
                 */
                *skipped = 1;
                return (max_sector - sector_nr) + sectors_skipped;
@@ -3437,8 +3446,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
                        sector_t sect;
                        int must_sync;
                        int any_working;
-                       int need_recover = 0;
-                       int need_replace = 0;
                        struct raid10_info *mirror = &conf->mirrors[i];
                        struct md_rdev *mrdev, *mreplace;
 
@@ -3446,15 +3453,13 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
                        mrdev = rcu_dereference(mirror->rdev);
                        mreplace = rcu_dereference(mirror->replacement);
 
-                       if (mrdev != NULL &&
-                           !test_bit(Faulty, &mrdev->flags) &&
-                           !test_bit(In_sync, &mrdev->flags))
-                               need_recover = 1;
-                       if (mreplace != NULL &&
-                           !test_bit(Faulty, &mreplace->flags))
-                               need_replace = 1;
+                       if (mrdev && (test_bit(Faulty, &mrdev->flags) ||
+                           test_bit(In_sync, &mrdev->flags)))
+                               mrdev = NULL;
+                       if (mreplace && test_bit(Faulty, &mreplace->flags))
+                               mreplace = NULL;
 
-                       if (!need_recover && !need_replace) {
+                       if (!mrdev && !mreplace) {
                                rcu_read_unlock();
                                continue;
                        }
@@ -3470,8 +3475,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
                                rcu_read_unlock();
                                continue;
                        }
-                       if (mreplace && test_bit(Faulty, &mreplace->flags))
-                               mreplace = NULL;
                        /* Unless we are doing a full sync, or a replacement
                         * we only need to recover the block if it is set in
                         * the bitmap
@@ -3490,7 +3493,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
                                rcu_read_unlock();
                                continue;
                        }
-                       atomic_inc(&mrdev->nr_pending);
+                       if (mrdev)
+                               atomic_inc(&mrdev->nr_pending);
                        if (mreplace)
                                atomic_inc(&mreplace->nr_pending);
                        rcu_read_unlock();
@@ -3577,7 +3581,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
                                r10_bio->devs[1].devnum = i;
                                r10_bio->devs[1].addr = to_addr;
 
-                               if (need_recover) {
+                               if (mrdev) {
                                        bio = r10_bio->devs[1].bio;
                                        bio->bi_next = biolist;
                                        biolist = bio;
@@ -3594,11 +3598,11 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
                                bio = r10_bio->devs[1].repl_bio;
                                if (bio)
                                        bio->bi_end_io = NULL;
-                               /* Note: if need_replace, then bio
+                                * Note: if mreplace is not NULL, then bio
                                 * cannot be NULL as r10buf_pool_alloc will
                                 * have allocated it.
                                 */
-                               if (!need_replace)
+                               if (!mreplace)
                                        break;
                                bio->bi_next = biolist;
                                biolist = bio;
@@ -3622,7 +3626,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
                                        for (k = 0; k < conf->copies; k++)
                                                if (r10_bio->devs[k].devnum == i)
                                                        break;
-                                       if (!test_bit(In_sync,
+                                       if (mrdev && !test_bit(In_sync,
                                                      &mrdev->flags)
                                            && !rdev_set_badblocks(
                                                    mrdev,
@@ -3643,17 +3647,21 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
                                                       mdname(mddev));
                                        mirror->recovery_disabled
                                                = mddev->recovery_disabled;
+                               } else {
+                                       error_disk = i;
                                }
                                put_buf(r10_bio);
                                if (rb2)
                                        atomic_dec(&rb2->remaining);
                                r10_bio = rb2;
-                               rdev_dec_pending(mrdev, mddev);
+                               if (mrdev)
+                                       rdev_dec_pending(mrdev, mddev);
                                if (mreplace)
                                        rdev_dec_pending(mreplace, mddev);
                                break;
                        }
-                       rdev_dec_pending(mrdev, mddev);
+                       if (mrdev)
+                               rdev_dec_pending(mrdev, mddev);
                        if (mreplace)
                                rdev_dec_pending(mreplace, mddev);
                        if (r10_bio->devs[0].bio->bi_opf & MD_FAILFAST) {
@@ -3819,11 +3827,11 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
                for (bio= biolist ; bio ; bio=bio->bi_next) {
                        struct resync_pages *rp = get_resync_pages(bio);
                        page = resync_fetch_page(rp, page_idx);
-                       /*
-                        * won't fail because the vec table is big enough
-                        * to hold all these pages
-                        */
-                       bio_add_page(bio, page, len, 0);
+                       if (WARN_ON(!bio_add_page(bio, page, len, 0))) {
+                               bio->bi_status = BLK_STS_RESOURCE;
+                               bio_endio(bio);
+                               goto giveup;
+                       }
                }
                nr_sectors += len>>9;
                sector_nr += len>>9;
@@ -4107,7 +4115,8 @@ static struct r10conf *setup_conf(struct mddev *mddev)
        atomic_set(&conf->nr_pending, 0);
 
        err = -ENOMEM;
-       conf->thread = md_register_thread(raid10d, mddev, "raid10");
+       rcu_assign_pointer(conf->thread,
+                          md_register_thread(raid10d, mddev, "raid10"));
        if (!conf->thread)
                goto out;
 
@@ -4152,8 +4161,8 @@ static int raid10_run(struct mddev *mddev)
        if (!conf)
                goto out;
 
-       mddev->thread = conf->thread;
-       conf->thread = NULL;
+       rcu_assign_pointer(mddev->thread, conf->thread);
+       rcu_assign_pointer(conf->thread, NULL);
 
        if (mddev_is_clustered(conf->mddev)) {
                int fc, fo;
@@ -4296,8 +4305,8 @@ static int raid10_run(struct mddev *mddev)
                clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
                set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
                set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
-               mddev->sync_thread = md_register_thread(md_do_sync, mddev,
-                                                       "reshape");
+               rcu_assign_pointer(mddev->sync_thread,
+                       md_register_thread(md_do_sync, mddev, "reshape"));
                if (!mddev->sync_thread)
                        goto out_free_conf;
        }
@@ -4698,8 +4707,8 @@ out:
        set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
        set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
 
-       mddev->sync_thread = md_register_thread(md_do_sync, mddev,
-                                               "reshape");
+       rcu_assign_pointer(mddev->sync_thread,
+                          md_register_thread(md_do_sync, mddev, "reshape"));
        if (!mddev->sync_thread) {
                ret = -EAGAIN;
                goto abort;
@@ -4997,11 +5006,11 @@ read_more:
                if (len > PAGE_SIZE)
                        len = PAGE_SIZE;
                for (bio = blist; bio ; bio = bio->bi_next) {
-                       /*
-                        * won't fail because the vec table is big enough
-                        * to hold all these pages
-                        */
-                       bio_add_page(bio, page, len, 0);
+                       if (WARN_ON(!bio_add_page(bio, page, len, 0))) {
+                               bio->bi_status = BLK_STS_RESOURCE;
+                               bio_endio(bio);
+                               return sectors_done;
+                       }
                }
                sector_nr += len >> 9;
                nr_sectors += len >> 9;
index 8c072ce..63e48b1 100644 (file)
@@ -100,7 +100,7 @@ struct r10conf {
        /* When taking over an array from a different personality, we store
         * the new thread here until we fully activate the array.
         */
-       struct md_thread        *thread;
+       struct md_thread __rcu  *thread;
 
        /*
         * Keep track of cluster resync window to send to other nodes.
index 46182b9..47ba7d9 100644 (file)
@@ -120,7 +120,7 @@ struct r5l_log {
        struct bio_set bs;
        mempool_t meta_pool;
 
-       struct md_thread *reclaim_thread;
+       struct md_thread __rcu *reclaim_thread;
        unsigned long reclaim_target;   /* number of space that need to be
                                         * reclaimed.  if it's 0, reclaim spaces
                                         * used by io_units which are in
@@ -792,7 +792,7 @@ static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
        io->current_bio = r5l_bio_alloc(log);
        io->current_bio->bi_end_io = r5l_log_endio;
        io->current_bio->bi_private = io;
-       bio_add_page(io->current_bio, io->meta_page, PAGE_SIZE, 0);
+       __bio_add_page(io->current_bio, io->meta_page, PAGE_SIZE, 0);
 
        r5_reserve_log_entry(log, io);
 
@@ -1576,17 +1576,18 @@ void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
 
 void r5l_quiesce(struct r5l_log *log, int quiesce)
 {
-       struct mddev *mddev;
+       struct mddev *mddev = log->rdev->mddev;
+       struct md_thread *thread = rcu_dereference_protected(
+               log->reclaim_thread, lockdep_is_held(&mddev->reconfig_mutex));
 
        if (quiesce) {
                /* make sure r5l_write_super_and_discard_space exits */
-               mddev = log->rdev->mddev;
                wake_up(&mddev->sb_wait);
-               kthread_park(log->reclaim_thread->tsk);
+               kthread_park(thread->tsk);
                r5l_wake_reclaim(log, MaxSector);
                r5l_do_reclaim(log);
        } else
-               kthread_unpark(log->reclaim_thread->tsk);
+               kthread_unpark(thread->tsk);
 }
 
 bool r5l_log_disk_error(struct r5conf *conf)
@@ -3063,6 +3064,7 @@ void r5c_update_on_rdev_error(struct mddev *mddev, struct md_rdev *rdev)
 int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 {
        struct r5l_log *log;
+       struct md_thread *thread;
        int ret;
 
        pr_debug("md/raid:%s: using device %pg as journal\n",
@@ -3121,11 +3123,13 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
        spin_lock_init(&log->tree_lock);
        INIT_RADIX_TREE(&log->big_stripe_tree, GFP_NOWAIT | __GFP_NOWARN);
 
-       log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
-                                                log->rdev->mddev, "reclaim");
-       if (!log->reclaim_thread)
+       thread = md_register_thread(r5l_reclaim_thread, log->rdev->mddev,
+                                   "reclaim");
+       if (!thread)
                goto reclaim_thread;
-       log->reclaim_thread->timeout = R5C_RECLAIM_WAKEUP_INTERVAL;
+
+       thread->timeout = R5C_RECLAIM_WAKEUP_INTERVAL;
+       rcu_assign_pointer(log->reclaim_thread, thread);
 
        init_waitqueue_head(&log->iounit_wait);
 
index e495939..eaea57a 100644 (file)
@@ -465,7 +465,7 @@ static void ppl_submit_iounit(struct ppl_io_unit *io)
 
        bio->bi_end_io = ppl_log_endio;
        bio->bi_iter.bi_sector = log->next_io_sector;
-       bio_add_page(bio, io->header_page, PAGE_SIZE, 0);
+       __bio_add_page(bio, io->header_page, PAGE_SIZE, 0);
 
        pr_debug("%s: log->current_io_sector: %llu\n", __func__,
            (unsigned long long)log->next_io_sector);
@@ -496,7 +496,7 @@ static void ppl_submit_iounit(struct ppl_io_unit *io)
                                               prev->bi_opf, GFP_NOIO,
                                               &ppl_conf->bs);
                        bio->bi_iter.bi_sector = bio_end_sector(prev);
-                       bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0);
+                       __bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0);
 
                        bio_chain(bio, prev);
                        ppl_submit_iounit_bio(io, prev);
index 9ea285f..85b3004 100644 (file)
@@ -2433,7 +2433,7 @@ static int grow_stripes(struct r5conf *conf, int num)
 
        conf->active_name = 0;
        sc = kmem_cache_create(conf->cache_name[conf->active_name],
-                              sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
+                              struct_size_t(struct stripe_head, dev, devs),
                               0, 0, NULL);
        if (!sc)
                return 1;
@@ -2559,7 +2559,7 @@ static int resize_stripes(struct r5conf *conf, int newsize)
 
        /* Step 1 */
        sc = kmem_cache_create(conf->cache_name[1-conf->active_name],
-                              sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev),
+                              struct_size_t(struct stripe_head, dev, newsize),
                               0, 0, NULL);
        if (!sc)
                return -ENOMEM;
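
The two kmem_cache_create() hunks above, together with the dev[] change in raid5.h further down, replace the old one-element dev[1] array and its "sizeof(struct stripe_head) + (devs - 1) * sizeof(struct r5dev)" arithmetic with a C99 flexible array member sized via struct_size_t(). A small user-space sketch of the same sizing idiom (hypothetical toy types; the kernel macro additionally checks the multiplication for overflow):

    #include <stdio.h>
    #include <stdlib.h>

    struct toy_dev {
            unsigned long flags;
    };

    struct toy_stripe {
            int disks;
            struct toy_dev dev[];   /* flexible array member, like dev[] above */
    };

    /* Allocate header plus n trailing elements in one chunk:
     * sizeof(struct toy_stripe) already excludes the flexible member. */
    static struct toy_stripe *toy_stripe_alloc(int devs)
    {
            size_t size = sizeof(struct toy_stripe) +
                          (size_t)devs * sizeof(struct toy_dev);
            struct toy_stripe *sh = calloc(1, size);

            if (sh)
                    sh->disks = devs;
            return sh;
    }

    int main(void)
    {
            struct toy_stripe *sh = toy_stripe_alloc(8);

            if (!sh)
                    return 1;
            printf("allocated %d devs in one %zu-byte object\n",
                   sh->disks, sizeof(*sh) + 8 * sizeof(sh->dev[0]));
            free(sh);
            return 0;
    }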
@@ -5966,6 +5966,19 @@ out:
        return ret;
 }
 
+static bool reshape_inprogress(struct mddev *mddev)
+{
+       return test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+              test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+              !test_bit(MD_RECOVERY_DONE, &mddev->recovery) &&
+              !test_bit(MD_RECOVERY_INTR, &mddev->recovery);
+}
+
+static bool reshape_disabled(struct mddev *mddev)
+{
+       return is_md_suspended(mddev) || !md_is_rdwr(mddev);
+}
+
 static enum stripe_result make_stripe_request(struct mddev *mddev,
                struct r5conf *conf, struct stripe_request_ctx *ctx,
                sector_t logical_sector, struct bio *bi)
@@ -5997,7 +6010,8 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
                        if (ahead_of_reshape(mddev, logical_sector,
                                             conf->reshape_safe)) {
                                spin_unlock_irq(&conf->device_lock);
-                               return STRIPE_SCHEDULE_AND_RETRY;
+                               ret = STRIPE_SCHEDULE_AND_RETRY;
+                               goto out;
                        }
                }
                spin_unlock_irq(&conf->device_lock);
@@ -6076,6 +6090,15 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 
 out_release:
        raid5_release_stripe(sh);
+out:
+       if (ret == STRIPE_SCHEDULE_AND_RETRY && !reshape_inprogress(mddev) &&
+           reshape_disabled(mddev)) {
+               bi->bi_status = BLK_STS_IOERR;
+               ret = STRIPE_FAIL;
+               pr_err("md/raid456:%s: io failed across reshape position while reshape can't make progress.\n",
+                      mdname(mddev));
+       }
+
        return ret;
 }
 
@@ -7708,7 +7731,8 @@ static struct r5conf *setup_conf(struct mddev *mddev)
        }
 
        sprintf(pers_name, "raid%d", mddev->new_level);
-       conf->thread = md_register_thread(raid5d, mddev, pers_name);
+       rcu_assign_pointer(conf->thread,
+                          md_register_thread(raid5d, mddev, pers_name));
        if (!conf->thread) {
                pr_warn("md/raid:%s: couldn't allocate thread.\n",
                        mdname(mddev));
@@ -7931,8 +7955,8 @@ static int raid5_run(struct mddev *mddev)
        }
 
        conf->min_offset_diff = min_offset_diff;
-       mddev->thread = conf->thread;
-       conf->thread = NULL;
+       rcu_assign_pointer(mddev->thread, conf->thread);
+       rcu_assign_pointer(conf->thread, NULL);
        mddev->private = conf;
 
        for (i = 0; i < conf->raid_disks && conf->previous_raid_disks;
@@ -8029,8 +8053,8 @@ static int raid5_run(struct mddev *mddev)
                clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
                set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
                set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
-               mddev->sync_thread = md_register_thread(md_do_sync, mddev,
-                                                       "reshape");
+               rcu_assign_pointer(mddev->sync_thread,
+                       md_register_thread(md_do_sync, mddev, "reshape"));
                if (!mddev->sync_thread)
                        goto abort;
        }
@@ -8377,6 +8401,7 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
                p = conf->disks + disk;
                tmp = rdev_mdlock_deref(mddev, p->rdev);
                if (test_bit(WantReplacement, &tmp->flags) &&
+                   mddev->reshape_position == MaxSector &&
                    p->replacement == NULL) {
                        clear_bit(In_sync, &rdev->flags);
                        set_bit(Replacement, &rdev->flags);
@@ -8500,6 +8525,7 @@ static int raid5_start_reshape(struct mddev *mddev)
        struct r5conf *conf = mddev->private;
        struct md_rdev *rdev;
        int spares = 0;
+       int i;
        unsigned long flags;
 
        if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
@@ -8511,6 +8537,13 @@ static int raid5_start_reshape(struct mddev *mddev)
        if (has_failed(conf))
                return -EINVAL;
 
+       /* raid5 can't handle concurrent reshape and recovery */
+       if (mddev->recovery_cp < MaxSector)
+               return -EBUSY;
+       for (i = 0; i < conf->raid_disks; i++)
+               if (rdev_mdlock_deref(mddev, conf->disks[i].replacement))
+                       return -EBUSY;
+
        rdev_for_each(rdev, mddev) {
                if (!test_bit(In_sync, &rdev->flags)
                    && !test_bit(Faulty, &rdev->flags))
@@ -8607,8 +8640,8 @@ static int raid5_start_reshape(struct mddev *mddev)
        clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
        set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
        set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
-       mddev->sync_thread = md_register_thread(md_do_sync, mddev,
-                                               "reshape");
+       rcu_assign_pointer(mddev->sync_thread,
+                          md_register_thread(md_do_sync, mddev, "reshape"));
        if (!mddev->sync_thread) {
                mddev->recovery = 0;
                spin_lock_irq(&conf->device_lock);
@@ -9043,6 +9076,22 @@ static int raid5_start(struct mddev *mddev)
        return r5l_start(conf->log);
 }
 
+static void raid5_prepare_suspend(struct mddev *mddev)
+{
+       struct r5conf *conf = mddev->private;
+
+       wait_event(mddev->sb_wait, !reshape_inprogress(mddev) ||
+                                   percpu_ref_is_zero(&mddev->active_io));
+       if (percpu_ref_is_zero(&mddev->active_io))
+               return;
+
+       /*
+        * Reshape is not in progress and the array is suspended; io that is
+        * waiting for reshape can never complete.
+        */
+       wake_up(&conf->wait_for_overlap);
+}
+
 static struct md_personality raid6_personality =
 {
        .name           = "raid6",
@@ -9063,6 +9112,7 @@ static struct md_personality raid6_personality =
        .check_reshape  = raid6_check_reshape,
        .start_reshape  = raid5_start_reshape,
        .finish_reshape = raid5_finish_reshape,
+       .prepare_suspend = raid5_prepare_suspend,
        .quiesce        = raid5_quiesce,
        .takeover       = raid6_takeover,
        .change_consistency_policy = raid5_change_consistency_policy,
@@ -9087,6 +9137,7 @@ static struct md_personality raid5_personality =
        .check_reshape  = raid5_check_reshape,
        .start_reshape  = raid5_start_reshape,
        .finish_reshape = raid5_finish_reshape,
+       .prepare_suspend = raid5_prepare_suspend,
        .quiesce        = raid5_quiesce,
        .takeover       = raid5_takeover,
        .change_consistency_policy = raid5_change_consistency_policy,
@@ -9112,6 +9163,7 @@ static struct md_personality raid4_personality =
        .check_reshape  = raid5_check_reshape,
        .start_reshape  = raid5_start_reshape,
        .finish_reshape = raid5_finish_reshape,
+       .prepare_suspend = raid5_prepare_suspend,
        .quiesce        = raid5_quiesce,
        .takeover       = raid4_takeover,
        .change_consistency_policy = raid5_change_consistency_policy,
index e873938..97a7959 100644 (file)
@@ -268,7 +268,7 @@ struct stripe_head {
                unsigned long   flags;
                u32             log_checksum;
                unsigned short  write_hint;
-       } dev[1]; /* allocated with extra space depending of RAID geometry */
+       } dev[]; /* allocated depending on RAID geometry ("disks" member) */
 };
 
 /* stripe_head_state - collects and tracks the dynamic state of a stripe_head
@@ -679,7 +679,7 @@ struct r5conf {
        /* When taking over an array from a different personality, we store
         * the new thread here until we fully activate the array.
         */
-       struct md_thread        *thread;
+       struct md_thread __rcu  *thread;
        struct list_head        temp_inactive_list[NR_STRIPE_HASH_LOCKS];
        struct r5worker_group   *worker_groups;
        int                     group_cnt;
index de23627..43d85a5 100644 (file)
@@ -254,7 +254,7 @@ static int vpu_core_register(struct device *dev, struct vpu_core *core)
        if (vpu_core_is_exist(vpu, core))
                return 0;
 
-       core->workqueue = alloc_workqueue("vpu", WQ_UNBOUND | WQ_MEM_RECLAIM, 1);
+       core->workqueue = alloc_ordered_workqueue("vpu", WQ_MEM_RECLAIM);
        if (!core->workqueue) {
                dev_err(core->dev, "fail to alloc workqueue\n");
                return -ENOMEM;
index 6773b88..a48edb4 100644 (file)
@@ -740,7 +740,7 @@ int vpu_v4l2_open(struct file *file, struct vpu_inst *inst)
        inst->fh.ctrl_handler = &inst->ctrl_handler;
        file->private_data = &inst->fh;
        inst->state = VPU_CODEC_STATE_DEINIT;
-       inst->workqueue = alloc_workqueue("vpu_inst", WQ_UNBOUND | WQ_MEM_RECLAIM, 1);
+       inst->workqueue = alloc_ordered_workqueue("vpu_inst", WQ_MEM_RECLAIM);
        if (inst->workqueue) {
                INIT_WORK(&inst->msg_work, vpu_inst_run_work);
                ret = kfifo_init(&inst->msg_fifo,
index d013ea5..ac9a642 100644 (file)
@@ -3268,7 +3268,7 @@ static int coda_probe(struct platform_device *pdev)
                                                       &dev->iram.blob);
        }
 
-       dev->workqueue = alloc_workqueue("coda", WQ_UNBOUND | WQ_MEM_RECLAIM, 1);
+       dev->workqueue = alloc_ordered_workqueue("coda", WQ_MEM_RECLAIM);
        if (!dev->workqueue) {
                dev_err(&pdev->dev, "unable to alloc workqueue\n");
                ret = -ENOMEM;
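
The three media-driver conversions above swap a WQ_UNBOUND workqueue with max_active = 1 for alloc_ordered_workqueue(), which states the intended guarantee directly: work items execute one at a time, in queueing order. In kernel terms the two forms are roughly equivalent here, only the intent is clearer:

    /* before: unbound queue limited to a single in-flight work item */
    wq = alloc_workqueue("coda", WQ_UNBOUND | WQ_MEM_RECLAIM, 1);
    /* after: same single-threaded, in-order execution, stated explicitly */
    wq = alloc_ordered_workqueue("coda", WQ_MEM_RECLAIM);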
index 5300153..405b89e 100644 (file)
@@ -180,7 +180,7 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
                data, size, dma->nr_pages);
 
        err = pin_user_pages(data & PAGE_MASK, dma->nr_pages, gup_flags,
-                            dma->pages, NULL);
+                            dma->pages);
 
        if (err != dma->nr_pages) {
                dma->nr_pages = (err >= 0) ? err : 0;
index 42bfc46..461f5ff 100644 (file)
@@ -44,12 +44,10 @@ static const char *tpc_names[] = {
  * memstick_debug_get_tpc_name - debug helper that returns string for
  * a TPC number
  */
-const char *memstick_debug_get_tpc_name(int tpc)
+static __maybe_unused const char *memstick_debug_get_tpc_name(int tpc)
 {
        return tpc_names[tpc-1];
 }
-EXPORT_SYMBOL(memstick_debug_get_tpc_name);
-
 
 /* Read a register*/
 static inline u32 r592_read_reg(struct r592_device *dev, int address)
index e90463c..f89f455 100644 (file)
@@ -1183,12 +1183,17 @@ config MFD_RC5T583
          Additional drivers must be enabled in order to use the
          different functionality of the device.
 
-config MFD_RK808
+config MFD_RK8XX
+       bool
+       select MFD_CORE
+
+config MFD_RK8XX_I2C
        tristate "Rockchip RK805/RK808/RK809/RK817/RK818 Power Management Chip"
        depends on I2C && OF
        select MFD_CORE
        select REGMAP_I2C
        select REGMAP_IRQ
+       select MFD_RK8XX
        help
          If you say yes here you get support for the RK805, RK808, RK809,
          RK817 and RK818 Power Management chips.
@@ -1196,6 +1201,20 @@ config MFD_RK808
          through I2C interface. The device supports multiple sub-devices
          including interrupts, RTC, LDO & DCDC regulators, and onkey.
 
+config MFD_RK8XX_SPI
+       tristate "Rockchip RK806 Power Management Chip"
+       depends on SPI && OF
+       select MFD_CORE
+       select REGMAP_SPI
+       select REGMAP_IRQ
+       select MFD_RK8XX
+       help
+         If you say yes here you get support for the RK806 Power Management
+         chip.
+         This driver provides common support for accessing the device
+         through an SPI interface. The device supports multiple sub-devices
+         including interrupts, LDO & DCDC regulators, and power on-key.
+
 config MFD_RN5T618
        tristate "Ricoh RN5T567/618 PMIC"
        depends on I2C
@@ -1679,6 +1698,38 @@ config MFD_TPS65912_SPI
          If you say yes here you get support for the TPS65912 series of
          PM chips with SPI interface.
 
+config MFD_TPS6594
+       tristate
+       select MFD_CORE
+       select REGMAP
+       select REGMAP_IRQ
+
+config MFD_TPS6594_I2C
+       tristate "TI TPS6594 Power Management chip with I2C"
+       select MFD_TPS6594
+       select REGMAP_I2C
+       select CRC8
+       depends on I2C
+       help
+         If you say yes here you get support for the TPS6594 series of
+         PM chips with I2C interface.
+
+         This driver can also be built as a module.  If so, the module
+         will be called tps6594-i2c.
+
+config MFD_TPS6594_SPI
+       tristate "TI TPS6594 Power Management chip with SPI"
+       select MFD_TPS6594
+       select REGMAP_SPI
+       select CRC8
+       depends on SPI_MASTER
+       help
+         If you say yes here you get support for the TPS6594 series of
+         PM chips with SPI interface.
+
+         This driver can also be built as a module.  If so, the module
+         will be called tps6594-spi.
+
 config TWL4030_CORE
        bool "TI TWL4030/TWL5030/TWL6030/TPS659x0 Support"
        depends on I2C=y
index 1d2392f..39c4615 100644 (file)
@@ -96,6 +96,9 @@ obj-$(CONFIG_MFD_TPS65910)    += tps65910.o
 obj-$(CONFIG_MFD_TPS65912)     += tps65912-core.o
 obj-$(CONFIG_MFD_TPS65912_I2C) += tps65912-i2c.o
 obj-$(CONFIG_MFD_TPS65912_SPI)  += tps65912-spi.o
+obj-$(CONFIG_MFD_TPS6594)      += tps6594-core.o
+obj-$(CONFIG_MFD_TPS6594_I2C)  += tps6594-i2c.o
+obj-$(CONFIG_MFD_TPS6594_SPI)  += tps6594-spi.o
 obj-$(CONFIG_MENELAUS)         += menelaus.o
 
 obj-$(CONFIG_TWL4030_CORE)     += twl-core.o twl4030-irq.o twl6030-irq.o
@@ -214,7 +217,9 @@ obj-$(CONFIG_MFD_PALMAS)    += palmas.o
 obj-$(CONFIG_MFD_VIPERBOARD)    += viperboard.o
 obj-$(CONFIG_MFD_NTXEC)                += ntxec.o
 obj-$(CONFIG_MFD_RC5T583)      += rc5t583.o rc5t583-irq.o
-obj-$(CONFIG_MFD_RK808)                += rk808.o
+obj-$(CONFIG_MFD_RK8XX)                += rk8xx-core.o
+obj-$(CONFIG_MFD_RK8XX_I2C)    += rk8xx-i2c.o
+obj-$(CONFIG_MFD_RK8XX_SPI)    += rk8xx-spi.o
 obj-$(CONFIG_MFD_RN5T618)      += rn5t618.o
 obj-$(CONFIG_MFD_SEC_CORE)     += sec-core.o sec-irq.o
 obj-$(CONFIG_MFD_SYSCON)       += syscon.o
index b4f5cb4..a49e5e2 100644 (file)
@@ -63,6 +63,7 @@ static const struct of_device_id axp20x_i2c_of_match[] = {
        { .compatible = "x-powers,axp209", .data = (void *)AXP209_ID },
        { .compatible = "x-powers,axp221", .data = (void *)AXP221_ID },
        { .compatible = "x-powers,axp223", .data = (void *)AXP223_ID },
+       { .compatible = "x-powers,axp313a", .data = (void *)AXP313A_ID },
        { .compatible = "x-powers,axp803", .data = (void *)AXP803_ID },
        { .compatible = "x-powers,axp806", .data = (void *)AXP806_ID },
        { .compatible = "x-powers,axp15060", .data = (void *)AXP15060_ID },
@@ -77,6 +78,7 @@ static const struct i2c_device_id axp20x_i2c_id[] = {
        { "axp209", 0 },
        { "axp221", 0 },
        { "axp223", 0 },
+       { "axp313a", 0 },
        { "axp803", 0 },
        { "axp806", 0 },
        { "axp15060", 0 },
index 72b87aa..07a846e 100644 (file)
@@ -39,6 +39,7 @@ static const char * const axp20x_model_names[] = {
        "AXP221",
        "AXP223",
        "AXP288",
+       "AXP313a",
        "AXP803",
        "AXP806",
        "AXP809",
@@ -156,6 +157,25 @@ static const struct regmap_range axp806_writeable_ranges[] = {
        regmap_reg_range(AXP806_REG_ADDR_EXT, AXP806_REG_ADDR_EXT),
 };
 
+static const struct regmap_range axp313a_writeable_ranges[] = {
+       regmap_reg_range(AXP313A_ON_INDICATE, AXP313A_IRQ_STATE),
+};
+
+static const struct regmap_range axp313a_volatile_ranges[] = {
+       regmap_reg_range(AXP313A_SHUTDOWN_CTRL, AXP313A_SHUTDOWN_CTRL),
+       regmap_reg_range(AXP313A_IRQ_STATE, AXP313A_IRQ_STATE),
+};
+
+static const struct regmap_access_table axp313a_writeable_table = {
+       .yes_ranges = axp313a_writeable_ranges,
+       .n_yes_ranges = ARRAY_SIZE(axp313a_writeable_ranges),
+};
+
+static const struct regmap_access_table axp313a_volatile_table = {
+       .yes_ranges = axp313a_volatile_ranges,
+       .n_yes_ranges = ARRAY_SIZE(axp313a_volatile_ranges),
+};
+
 static const struct regmap_range axp806_volatile_ranges[] = {
        regmap_reg_range(AXP20X_IRQ1_STATE, AXP20X_IRQ2_STATE),
 };
@@ -248,6 +268,11 @@ static const struct resource axp288_fuel_gauge_resources[] = {
        DEFINE_RES_IRQ(AXP288_IRQ_WL1),
 };
 
+static const struct resource axp313a_pek_resources[] = {
+       DEFINE_RES_IRQ_NAMED(AXP313A_IRQ_PEK_RIS_EDGE, "PEK_DBR"),
+       DEFINE_RES_IRQ_NAMED(AXP313A_IRQ_PEK_FAL_EDGE, "PEK_DBF"),
+};
+
 static const struct resource axp803_pek_resources[] = {
        DEFINE_RES_IRQ_NAMED(AXP803_IRQ_PEK_RIS_EDGE, "PEK_DBR"),
        DEFINE_RES_IRQ_NAMED(AXP803_IRQ_PEK_FAL_EDGE, "PEK_DBF"),
@@ -304,6 +329,15 @@ static const struct regmap_config axp288_regmap_config = {
        .cache_type     = REGCACHE_RBTREE,
 };
 
+static const struct regmap_config axp313a_regmap_config = {
+       .reg_bits = 8,
+       .val_bits = 8,
+       .wr_table = &axp313a_writeable_table,
+       .volatile_table = &axp313a_volatile_table,
+       .max_register = AXP313A_IRQ_STATE,
+       .cache_type = REGCACHE_RBTREE,
+};
+
 static const struct regmap_config axp806_regmap_config = {
        .reg_bits       = 8,
        .val_bits       = 8,
@@ -456,6 +490,16 @@ static const struct regmap_irq axp288_regmap_irqs[] = {
        INIT_REGMAP_IRQ(AXP288, BC_USB_CHNG,            5, 1),
 };
 
+static const struct regmap_irq axp313a_regmap_irqs[] = {
+       INIT_REGMAP_IRQ(AXP313A, PEK_RIS_EDGE,          0, 7),
+       INIT_REGMAP_IRQ(AXP313A, PEK_FAL_EDGE,          0, 6),
+       INIT_REGMAP_IRQ(AXP313A, PEK_SHORT,             0, 5),
+       INIT_REGMAP_IRQ(AXP313A, PEK_LONG,              0, 4),
+       INIT_REGMAP_IRQ(AXP313A, DCDC3_V_LOW,           0, 3),
+       INIT_REGMAP_IRQ(AXP313A, DCDC2_V_LOW,           0, 2),
+       INIT_REGMAP_IRQ(AXP313A, DIE_TEMP_HIGH,         0, 0),
+};
+
 static const struct regmap_irq axp803_regmap_irqs[] = {
        INIT_REGMAP_IRQ(AXP803, ACIN_OVER_V,            0, 7),
        INIT_REGMAP_IRQ(AXP803, ACIN_PLUGIN,            0, 6),
@@ -606,6 +650,17 @@ static const struct regmap_irq_chip axp288_regmap_irq_chip = {
 
 };
 
+static const struct regmap_irq_chip axp313a_regmap_irq_chip = {
+       .name                   = "axp313a_irq_chip",
+       .status_base            = AXP313A_IRQ_STATE,
+       .ack_base               = AXP313A_IRQ_STATE,
+       .unmask_base            = AXP313A_IRQ_EN,
+       .init_ack_masked        = true,
+       .irqs                   = axp313a_regmap_irqs,
+       .num_irqs               = ARRAY_SIZE(axp313a_regmap_irqs),
+       .num_regs               = 1,
+};
+
 static const struct regmap_irq_chip axp803_regmap_irq_chip = {
        .name                   = "axp803",
        .status_base            = AXP20X_IRQ1_STATE,
@@ -745,6 +800,11 @@ static const struct mfd_cell axp152_cells[] = {
        },
 };
 
+static struct mfd_cell axp313a_cells[] = {
+       MFD_CELL_NAME("axp20x-regulator"),
+       MFD_CELL_RES("axp313a-pek", axp313a_pek_resources),
+};
+
 static const struct resource axp288_adc_resources[] = {
        DEFINE_RES_IRQ_NAMED(AXP288_IRQ_GPADC, "GPADC"),
 };
@@ -914,8 +974,18 @@ static const struct mfd_cell axp_regulator_only_cells[] = {
 static int axp20x_power_off(struct sys_off_data *data)
 {
        struct axp20x_dev *axp20x = data->cb_data;
+       unsigned int shutdown_reg;
 
-       regmap_write(axp20x->regmap, AXP20X_OFF_CTRL, AXP20X_OFF);
+       switch (axp20x->variant) {
+       case AXP313A_ID:
+               shutdown_reg = AXP313A_SHUTDOWN_CTRL;
+               break;
+       default:
+               shutdown_reg = AXP20X_OFF_CTRL;
+               break;
+       }
+
+       regmap_write(axp20x->regmap, shutdown_reg, AXP20X_OFF);
 
        /* Give capacitors etc. time to drain to avoid kernel panic msg. */
        mdelay(500);
@@ -978,6 +1048,12 @@ int axp20x_match_device(struct axp20x_dev *axp20x)
                axp20x->regmap_irq_chip = &axp288_regmap_irq_chip;
                axp20x->irq_flags = IRQF_TRIGGER_LOW;
                break;
+       case AXP313A_ID:
+               axp20x->nr_cells = ARRAY_SIZE(axp313a_cells);
+               axp20x->cells = axp313a_cells;
+               axp20x->regmap_cfg = &axp313a_regmap_config;
+               axp20x->regmap_irq_chip = &axp313a_regmap_irq_chip;
+               break;
        case AXP803_ID:
                axp20x->nr_cells = ARRAY_SIZE(axp803_cells);
                axp20x->cells = axp803_cells;
similarity index 71%
rename from drivers/mfd/rk808.c
rename to drivers/mfd/rk8xx-core.c
index 0f22ef6..e8fc9e2 100644
@@ -1,18 +1,15 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
- * MFD core driver for Rockchip RK808/RK818
+ * MFD core driver for Rockchip RK8XX
  *
  * Copyright (c) 2014, Fuzhou Rockchip Electronics Co., Ltd
+ * Copyright (C) 2016 PHYTEC Messtechnik GmbH
  *
  * Author: Chris Zhong <zyw@rock-chips.com>
  * Author: Zhang Qing <zhangqing@rock-chips.com>
- *
- * Copyright (C) 2016 PHYTEC Messtechnik GmbH
- *
  * Author: Wadim Egorov <w.egorov@phytec.de>
  */
 
-#include <linux/i2c.h>
 #include <linux/interrupt.h>
 #include <linux/mfd/rk808.h>
 #include <linux/mfd/core.h>
@@ -27,92 +24,6 @@ struct rk808_reg_data {
        int value;
 };
 
-static bool rk808_is_volatile_reg(struct device *dev, unsigned int reg)
-{
-       /*
-        * Notes:
-        * - Technically the ROUND_30s bit makes RTC_CTRL_REG volatile, but
-        *   we don't use that feature.  It's better to cache.
-        * - It's unlikely we care that RK808_DEVCTRL_REG is volatile since
-        *   bits are cleared in case when we shutoff anyway, but better safe.
-        */
-
-       switch (reg) {
-       case RK808_SECONDS_REG ... RK808_WEEKS_REG:
-       case RK808_RTC_STATUS_REG:
-       case RK808_VB_MON_REG:
-       case RK808_THERMAL_REG:
-       case RK808_DCDC_UV_STS_REG:
-       case RK808_LDO_UV_STS_REG:
-       case RK808_DCDC_PG_REG:
-       case RK808_LDO_PG_REG:
-       case RK808_DEVCTRL_REG:
-       case RK808_INT_STS_REG1:
-       case RK808_INT_STS_REG2:
-               return true;
-       }
-
-       return false;
-}
-
-static bool rk817_is_volatile_reg(struct device *dev, unsigned int reg)
-{
-       /*
-        * Notes:
-        * - Technically the ROUND_30s bit makes RTC_CTRL_REG volatile, but
-        *   we don't use that feature.  It's better to cache.
-        */
-
-       switch (reg) {
-       case RK817_SECONDS_REG ... RK817_WEEKS_REG:
-       case RK817_RTC_STATUS_REG:
-       case RK817_CODEC_DTOP_LPT_SRST:
-       case RK817_GAS_GAUGE_ADC_CONFIG0 ... RK817_GAS_GAUGE_CUR_ADC_K0:
-       case RK817_PMIC_CHRG_STS:
-       case RK817_PMIC_CHRG_OUT:
-       case RK817_PMIC_CHRG_IN:
-       case RK817_INT_STS_REG0:
-       case RK817_INT_STS_REG1:
-       case RK817_INT_STS_REG2:
-       case RK817_SYS_STS:
-               return true;
-       }
-
-       return false;
-}
-
-static const struct regmap_config rk818_regmap_config = {
-       .reg_bits = 8,
-       .val_bits = 8,
-       .max_register = RK818_USB_CTRL_REG,
-       .cache_type = REGCACHE_RBTREE,
-       .volatile_reg = rk808_is_volatile_reg,
-};
-
-static const struct regmap_config rk805_regmap_config = {
-       .reg_bits = 8,
-       .val_bits = 8,
-       .max_register = RK805_OFF_SOURCE_REG,
-       .cache_type = REGCACHE_RBTREE,
-       .volatile_reg = rk808_is_volatile_reg,
-};
-
-static const struct regmap_config rk808_regmap_config = {
-       .reg_bits = 8,
-       .val_bits = 8,
-       .max_register = RK808_IO_POL_REG,
-       .cache_type = REGCACHE_RBTREE,
-       .volatile_reg = rk808_is_volatile_reg,
-};
-
-static const struct regmap_config rk817_regmap_config = {
-       .reg_bits = 8,
-       .val_bits = 8,
-       .max_register = RK817_GPIO_INT_CFG,
-       .cache_type = REGCACHE_NONE,
-       .volatile_reg = rk817_is_volatile_reg,
-};
-
 static const struct resource rtc_resources[] = {
        DEFINE_RES_IRQ(RK808_IRQ_RTC_ALARM),
 };
@@ -126,6 +37,11 @@ static const struct resource rk805_key_resources[] = {
        DEFINE_RES_IRQ(RK805_IRQ_PWRON_FALL),
 };
 
+static struct resource rk806_pwrkey_resources[] = {
+       DEFINE_RES_IRQ(RK806_IRQ_PWRON_FALL),
+       DEFINE_RES_IRQ(RK806_IRQ_PWRON_RISE),
+};
+
 static const struct resource rk817_pwrkey_resources[] = {
        DEFINE_RES_IRQ(RK817_IRQ_PWRON_RISE),
        DEFINE_RES_IRQ(RK817_IRQ_PWRON_FALL),
@@ -153,6 +69,17 @@ static const struct mfd_cell rk805s[] = {
        },
 };
 
+static const struct mfd_cell rk806s[] = {
+       { .name = "rk805-pinctrl", .id = PLATFORM_DEVID_AUTO, },
+       { .name = "rk808-regulator", .id = PLATFORM_DEVID_AUTO, },
+       {
+               .name = "rk805-pwrkey",
+               .resources = rk806_pwrkey_resources,
+               .num_resources = ARRAY_SIZE(rk806_pwrkey_resources),
+               .id = PLATFORM_DEVID_AUTO,
+       },
+};
+
 static const struct mfd_cell rk808s[] = {
        { .name = "rk808-clkout", .id = PLATFORM_DEVID_NONE, },
        { .name = "rk808-regulator", .id = PLATFORM_DEVID_NONE, },
@@ -212,6 +139,12 @@ static const struct rk808_reg_data rk805_pre_init_reg[] = {
        {RK805_THERMAL_REG, TEMP_HOTDIE_MSK, TEMP115C},
 };
 
+static const struct rk808_reg_data rk806_pre_init_reg[] = {
+       { RK806_GPIO_INT_CONFIG, RK806_INT_POL_MSK, RK806_INT_POL_L },
+       { RK806_SYS_CFG3, RK806_SLAVE_RESTART_FUN_MSK, RK806_SLAVE_RESTART_FUN_EN },
+       { RK806_SYS_OPTION, RK806_SYS_ENB2_2M_MSK, RK806_SYS_ENB2_2M_EN },
+};
+
 static const struct rk808_reg_data rk808_pre_init_reg[] = {
        { RK808_BUCK3_CONFIG_REG, BUCK_ILMIN_MASK,  BUCK_ILMIN_150MA },
        { RK808_BUCK4_CONFIG_REG, BUCK_ILMIN_MASK,  BUCK_ILMIN_200MA },
@@ -362,6 +295,27 @@ static const struct regmap_irq rk805_irqs[] = {
        },
 };
 
+static const struct regmap_irq rk806_irqs[] = {
+       /* INT_STS0 IRQs */
+       REGMAP_IRQ_REG(RK806_IRQ_PWRON_FALL, 0, RK806_INT_STS_PWRON_FALL),
+       REGMAP_IRQ_REG(RK806_IRQ_PWRON_RISE, 0, RK806_INT_STS_PWRON_RISE),
+       REGMAP_IRQ_REG(RK806_IRQ_PWRON, 0, RK806_INT_STS_PWRON),
+       REGMAP_IRQ_REG(RK806_IRQ_PWRON_LP, 0, RK806_INT_STS_PWRON_LP),
+       REGMAP_IRQ_REG(RK806_IRQ_HOTDIE, 0, RK806_INT_STS_HOTDIE),
+       REGMAP_IRQ_REG(RK806_IRQ_VDC_RISE, 0, RK806_INT_STS_VDC_RISE),
+       REGMAP_IRQ_REG(RK806_IRQ_VDC_FALL, 0, RK806_INT_STS_VDC_FALL),
+       REGMAP_IRQ_REG(RK806_IRQ_VB_LO, 0, RK806_INT_STS_VB_LO),
+       /* INT_STS1 IRQs */
+       REGMAP_IRQ_REG(RK806_IRQ_REV0, 1, RK806_INT_STS_REV0),
+       REGMAP_IRQ_REG(RK806_IRQ_REV1, 1, RK806_INT_STS_REV1),
+       REGMAP_IRQ_REG(RK806_IRQ_REV2, 1, RK806_INT_STS_REV2),
+       REGMAP_IRQ_REG(RK806_IRQ_CRC_ERROR, 1, RK806_INT_STS_CRC_ERROR),
+       REGMAP_IRQ_REG(RK806_IRQ_SLP3_GPIO, 1, RK806_INT_STS_SLP3_GPIO),
+       REGMAP_IRQ_REG(RK806_IRQ_SLP2_GPIO, 1, RK806_INT_STS_SLP2_GPIO),
+       REGMAP_IRQ_REG(RK806_IRQ_SLP1_GPIO, 1, RK806_INT_STS_SLP1_GPIO),
+       REGMAP_IRQ_REG(RK806_IRQ_WDT, 1, RK806_INT_STS_WDT),
+};
+
 static const struct regmap_irq rk808_irqs[] = {
        /* INT_STS */
        [RK808_IRQ_VOUT_LO] = {
@@ -512,6 +466,18 @@ static struct regmap_irq_chip rk805_irq_chip = {
        .init_ack_masked = true,
 };
 
+static struct regmap_irq_chip rk806_irq_chip = {
+       .name = "rk806",
+       .irqs = rk806_irqs,
+       .num_irqs = ARRAY_SIZE(rk806_irqs),
+       .num_regs = 2,
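+       /* consecutive status (and mask) registers sit two addresses apart */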
+       .irq_reg_stride = 2,
+       .mask_base = RK806_INT_MSK0,
+       .status_base = RK806_INT_STS0,
+       .ack_base = RK806_INT_STS0,
+       .init_ack_masked = true,
+};
+
 static const struct regmap_irq_chip rk808_irq_chip = {
        .name = "rk808",
        .irqs = rk808_irqs,
@@ -548,13 +514,11 @@ static const struct regmap_irq_chip rk818_irq_chip = {
        .init_ack_masked = true,
 };
 
-static struct i2c_client *rk808_i2c_client;
-
-static void rk808_pm_power_off(void)
+static int rk808_power_off(struct sys_off_data *data)
 {
+       struct rk808 *rk808 = data->cb_data;
        int ret;
        unsigned int reg, bit;
-       struct rk808 *rk808 = i2c_get_clientdata(rk808_i2c_client);
 
        switch (rk808->variant) {
        case RK805_ID:
@@ -575,16 +539,18 @@ static void rk808_pm_power_off(void)
                bit = DEV_OFF;
                break;
        default:
-               return;
+               return NOTIFY_DONE;
        }
        ret = regmap_update_bits(rk808->regmap, reg, bit, bit);
        if (ret)
-               dev_err(&rk808_i2c_client->dev, "Failed to shutdown device!\n");
+               dev_err(rk808->dev, "Failed to shutdown device!\n");
+
+       return NOTIFY_DONE;
 }
 
-static int rk808_restart_notify(struct notifier_block *this, unsigned long mode, void *cmd)
+static int rk808_restart(struct sys_off_data *data)
 {
-       struct rk808 *rk808 = i2c_get_clientdata(rk808_i2c_client);
+       struct rk808 *rk808 = data->cb_data;
        unsigned int reg, bit;
        int ret;
 
@@ -600,19 +566,14 @@ static int rk808_restart_notify(struct notifier_block *this, unsigned long mode,
        }
        ret = regmap_update_bits(rk808->regmap, reg, bit, bit);
        if (ret)
-               dev_err(&rk808_i2c_client->dev, "Failed to restart device!\n");
+               dev_err(rk808->dev, "Failed to restart device!\n");
 
        return NOTIFY_DONE;
 }
 
-static struct notifier_block rk808_restart_handler = {
-       .notifier_call = rk808_restart_notify,
-       .priority = 192,
-};
-
-static void rk8xx_shutdown(struct i2c_client *client)
+void rk8xx_shutdown(struct device *dev)
 {
-       struct rk808 *rk808 = i2c_get_clientdata(client);
+       struct rk808 *rk808 = dev_get_drvdata(dev);
        int ret;
 
        switch (rk808->variant) {
@@ -633,75 +594,47 @@ static void rk8xx_shutdown(struct i2c_client *client)
                return;
        }
        if (ret)
-               dev_warn(&client->dev,
+               dev_warn(dev,
                         "Cannot switch to power down function\n");
 }
+EXPORT_SYMBOL_GPL(rk8xx_shutdown);
 
-static const struct of_device_id rk808_of_match[] = {
-       { .compatible = "rockchip,rk805" },
-       { .compatible = "rockchip,rk808" },
-       { .compatible = "rockchip,rk809" },
-       { .compatible = "rockchip,rk817" },
-       { .compatible = "rockchip,rk818" },
-       { },
-};
-MODULE_DEVICE_TABLE(of, rk808_of_match);
-
-static int rk808_probe(struct i2c_client *client)
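+/*
+ * Bus-agnostic probe: the I2C and SPI front-ends (rk8xx-i2c.c, rk8xx-spi.c)
+ * identify the variant, set up a regmap for their bus and then call in here.
+ */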
+int rk8xx_probe(struct device *dev, int variant, unsigned int irq, struct regmap *regmap)
 {
-       struct device_node *np = client->dev.of_node;
        struct rk808 *rk808;
        const struct rk808_reg_data *pre_init_reg;
        const struct mfd_cell *cells;
+       int dual_support = 0;
        int nr_pre_init_regs;
        int nr_cells;
-       int msb, lsb;
-       unsigned char pmic_id_msb, pmic_id_lsb;
        int ret;
        int i;
 
-       rk808 = devm_kzalloc(&client->dev, sizeof(*rk808), GFP_KERNEL);
+       rk808 = devm_kzalloc(dev, sizeof(*rk808), GFP_KERNEL);
        if (!rk808)
                return -ENOMEM;
-
-       if (of_device_is_compatible(np, "rockchip,rk817") ||
-           of_device_is_compatible(np, "rockchip,rk809")) {
-               pmic_id_msb = RK817_ID_MSB;
-               pmic_id_lsb = RK817_ID_LSB;
-       } else {
-               pmic_id_msb = RK808_ID_MSB;
-               pmic_id_lsb = RK808_ID_LSB;
-       }
-
-       /* Read chip variant */
-       msb = i2c_smbus_read_byte_data(client, pmic_id_msb);
-       if (msb < 0) {
-               dev_err(&client->dev, "failed to read the chip id at 0x%x\n",
-                       RK808_ID_MSB);
-               return msb;
-       }
-
-       lsb = i2c_smbus_read_byte_data(client, pmic_id_lsb);
-       if (lsb < 0) {
-               dev_err(&client->dev, "failed to read the chip id at 0x%x\n",
-                       RK808_ID_LSB);
-               return lsb;
-       }
-
-       rk808->variant = ((msb << 8) | lsb) & RK8XX_ID_MSK;
-       dev_info(&client->dev, "chip id: 0x%x\n", (unsigned int)rk808->variant);
+       rk808->dev = dev;
+       rk808->variant = variant;
+       rk808->regmap = regmap;
+       dev_set_drvdata(dev, rk808);
 
        switch (rk808->variant) {
        case RK805_ID:
-               rk808->regmap_cfg = &rk805_regmap_config;
                rk808->regmap_irq_chip = &rk805_irq_chip;
                pre_init_reg = rk805_pre_init_reg;
                nr_pre_init_regs = ARRAY_SIZE(rk805_pre_init_reg);
                cells = rk805s;
                nr_cells = ARRAY_SIZE(rk805s);
                break;
+       case RK806_ID:
+               rk808->regmap_irq_chip = &rk806_irq_chip;
+               pre_init_reg = rk806_pre_init_reg;
+               nr_pre_init_regs = ARRAY_SIZE(rk806_pre_init_reg);
+               cells = rk806s;
+               nr_cells = ARRAY_SIZE(rk806s);
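+               /*
+                * RK806 may be used in dual-PMIC setups where both chips share
+                * one interrupt line, hence the IRQ is requested as shareable.
+                */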
+               dual_support = IRQF_SHARED;
+               break;
        case RK808_ID:
-               rk808->regmap_cfg = &rk808_regmap_config;
                rk808->regmap_irq_chip = &rk808_irq_chip;
                pre_init_reg = rk808_pre_init_reg;
                nr_pre_init_regs = ARRAY_SIZE(rk808_pre_init_reg);
@@ -709,7 +642,6 @@ static int rk808_probe(struct i2c_client *client)
                nr_cells = ARRAY_SIZE(rk808s);
                break;
        case RK818_ID:
-               rk808->regmap_cfg = &rk818_regmap_config;
                rk808->regmap_irq_chip = &rk818_irq_chip;
                pre_init_reg = rk818_pre_init_reg;
                nr_pre_init_regs = ARRAY_SIZE(rk818_pre_init_reg);
@@ -718,7 +650,6 @@ static int rk808_probe(struct i2c_client *client)
                break;
        case RK809_ID:
        case RK817_ID:
-               rk808->regmap_cfg = &rk817_regmap_config;
                rk808->regmap_irq_chip = &rk817_irq_chip;
                pre_init_reg = rk817_pre_init_reg;
                nr_pre_init_regs = ARRAY_SIZE(rk817_pre_init_reg);
@@ -726,97 +657,64 @@ static int rk808_probe(struct i2c_client *client)
                nr_cells = ARRAY_SIZE(rk817s);
                break;
        default:
-               dev_err(&client->dev, "Unsupported RK8XX ID %lu\n",
-                       rk808->variant);
+               dev_err(dev, "Unsupported RK8XX ID %lu\n", rk808->variant);
                return -EINVAL;
        }
 
-       rk808->i2c = client;
-       i2c_set_clientdata(client, rk808);
-
-       rk808->regmap = devm_regmap_init_i2c(client, rk808->regmap_cfg);
-       if (IS_ERR(rk808->regmap)) {
-               dev_err(&client->dev, "regmap initialization failed\n");
-               return PTR_ERR(rk808->regmap);
-       }
-
-       if (!client->irq) {
-               dev_err(&client->dev, "No interrupt support, no core IRQ\n");
-               return -EINVAL;
-       }
+       if (!irq)
+               return dev_err_probe(dev, -EINVAL, "No interrupt support, no core IRQ\n");
 
-       ret = regmap_add_irq_chip(rk808->regmap, client->irq,
-                                 IRQF_ONESHOT, -1,
-                                 rk808->regmap_irq_chip, &rk808->irq_data);
-       if (ret) {
-               dev_err(&client->dev, "Failed to add irq_chip %d\n", ret);
-               return ret;
-       }
+       ret = devm_regmap_add_irq_chip(dev, rk808->regmap, irq,
+                                      IRQF_ONESHOT | dual_support, -1,
+                                      rk808->regmap_irq_chip, &rk808->irq_data);
+       if (ret)
+               return dev_err_probe(dev, ret, "Failed to add irq_chip\n");
 
        for (i = 0; i < nr_pre_init_regs; i++) {
                ret = regmap_update_bits(rk808->regmap,
                                        pre_init_reg[i].addr,
                                        pre_init_reg[i].mask,
                                        pre_init_reg[i].value);
-               if (ret) {
-                       dev_err(&client->dev,
-                               "0x%x write err\n",
-                               pre_init_reg[i].addr);
-                       return ret;
-               }
+               if (ret)
+                       return dev_err_probe(dev, ret, "0x%x write err\n",
+                                            pre_init_reg[i].addr);
        }
 
-       ret = devm_mfd_add_devices(&client->dev, PLATFORM_DEVID_NONE,
-                             cells, nr_cells, NULL, 0,
+       ret = devm_mfd_add_devices(dev, 0, cells, nr_cells, NULL, 0,
                              regmap_irq_get_domain(rk808->irq_data));
-       if (ret) {
-               dev_err(&client->dev, "failed to add MFD devices %d\n", ret);
-               goto err_irq;
-       }
+       if (ret)
+               return dev_err_probe(dev, ret, "failed to add MFD devices\n");
 
-       if (of_property_read_bool(np, "rockchip,system-power-controller")) {
-               rk808_i2c_client = client;
-               pm_power_off = rk808_pm_power_off;
+       if (device_property_read_bool(dev, "rockchip,system-power-controller")) {
+               ret = devm_register_sys_off_handler(dev,
+                                   SYS_OFF_MODE_POWER_OFF_PREPARE, SYS_OFF_PRIO_HIGH,
+                                   &rk808_power_off, rk808);
+               if (ret)
+                       return dev_err_probe(dev, ret,
+                                            "failed to register poweroff handler\n");
 
                switch (rk808->variant) {
                case RK809_ID:
                case RK817_ID:
-                       ret = register_restart_handler(&rk808_restart_handler);
+                       ret = devm_register_sys_off_handler(dev,
+                                                           SYS_OFF_MODE_RESTART, SYS_OFF_PRIO_HIGH,
+                                                           &rk808_restart, rk808);
                        if (ret)
-                               dev_warn(&client->dev, "failed to register rst handler, %d\n", ret);
+                               dev_warn(dev, "failed to register rst handler, %d\n", ret);
                        break;
                default:
-                       dev_dbg(&client->dev, "pmic controlled board reset not supported\n");
+                       dev_dbg(dev, "pmic controlled board reset not supported\n");
                        break;
                }
        }
 
        return 0;
-
-err_irq:
-       regmap_del_irq_chip(client->irq, rk808->irq_data);
-       return ret;
 }
+EXPORT_SYMBOL_GPL(rk8xx_probe);
 
-static void rk808_remove(struct i2c_client *client)
+int rk8xx_suspend(struct device *dev)
 {
-       struct rk808 *rk808 = i2c_get_clientdata(client);
-
-       regmap_del_irq_chip(client->irq, rk808->irq_data);
-
-       /**
-        * pm_power_off may points to a function from another module.
-        * Check if the pointer is set by us and only then overwrite it.
-        */
-       if (pm_power_off == rk808_pm_power_off)
-               pm_power_off = NULL;
-
-       unregister_restart_handler(&rk808_restart_handler);
-}
-
-static int __maybe_unused rk8xx_suspend(struct device *dev)
-{
-       struct rk808 *rk808 = i2c_get_clientdata(to_i2c_client(dev));
+       struct rk808 *rk808 = dev_get_drvdata(dev);
        int ret = 0;
 
        switch (rk808->variant) {
@@ -839,10 +737,11 @@ static int __maybe_unused rk8xx_suspend(struct device *dev)
 
        return ret;
 }
+EXPORT_SYMBOL_GPL(rk8xx_suspend);
 
-static int __maybe_unused rk8xx_resume(struct device *dev)
+int rk8xx_resume(struct device *dev)
 {
-       struct rk808 *rk808 = i2c_get_clientdata(to_i2c_client(dev));
+       struct rk808 *rk808 = dev_get_drvdata(dev);
        int ret = 0;
 
        switch (rk808->variant) {
@@ -859,23 +758,10 @@ static int __maybe_unused rk8xx_resume(struct device *dev)
 
        return ret;
 }
-static SIMPLE_DEV_PM_OPS(rk8xx_pm_ops, rk8xx_suspend, rk8xx_resume);
-
-static struct i2c_driver rk808_i2c_driver = {
-       .driver = {
-               .name = "rk808",
-               .of_match_table = rk808_of_match,
-               .pm = &rk8xx_pm_ops,
-       },
-       .probe_new = rk808_probe,
-       .remove   = rk808_remove,
-       .shutdown = rk8xx_shutdown,
-};
-
-module_i2c_driver(rk808_i2c_driver);
+EXPORT_SYMBOL_GPL(rk8xx_resume);
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Chris Zhong <zyw@rock-chips.com>");
 MODULE_AUTHOR("Zhang Qing <zhangqing@rock-chips.com>");
 MODULE_AUTHOR("Wadim Egorov <w.egorov@phytec.de>");
-MODULE_DESCRIPTION("RK808/RK818 PMIC driver");
+MODULE_DESCRIPTION("RK8xx PMIC core");
diff --git a/drivers/mfd/rk8xx-i2c.c b/drivers/mfd/rk8xx-i2c.c
new file mode 100644
index 0000000..2822bfa
--- /dev/null
@@ -0,0 +1,185 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Rockchip RK808/RK818 Core (I2C) driver
+ *
+ * Copyright (c) 2014, Fuzhou Rockchip Electronics Co., Ltd
+ * Copyright (C) 2016 PHYTEC Messtechnik GmbH
+ *
+ * Author: Chris Zhong <zyw@rock-chips.com>
+ * Author: Zhang Qing <zhangqing@rock-chips.com>
+ * Author: Wadim Egorov <w.egorov@phytec.de>
+ */
+
+#include <linux/i2c.h>
+#include <linux/mfd/rk808.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/regmap.h>
+
+struct rk8xx_i2c_platform_data {
+       const struct regmap_config *regmap_cfg;
+       int variant;
+};
+
+static bool rk808_is_volatile_reg(struct device *dev, unsigned int reg)
+{
+       /*
+        * Notes:
+        * - Technically the ROUND_30s bit makes RTC_CTRL_REG volatile, but
+        *   we don't use that feature.  It's better to cache.
+        * - It's unlikely we care that RK808_DEVCTRL_REG is volatile, since
+        *   its bits are cleared when we shut off anyway, but better safe.
+        */
+
+       switch (reg) {
+       case RK808_SECONDS_REG ... RK808_WEEKS_REG:
+       case RK808_RTC_STATUS_REG:
+       case RK808_VB_MON_REG:
+       case RK808_THERMAL_REG:
+       case RK808_DCDC_UV_STS_REG:
+       case RK808_LDO_UV_STS_REG:
+       case RK808_DCDC_PG_REG:
+       case RK808_LDO_PG_REG:
+       case RK808_DEVCTRL_REG:
+       case RK808_INT_STS_REG1:
+       case RK808_INT_STS_REG2:
+               return true;
+       }
+
+       return false;
+}
+
+static bool rk817_is_volatile_reg(struct device *dev, unsigned int reg)
+{
+       /*
+        * Notes:
+        * - Technically the ROUND_30s bit makes RTC_CTRL_REG volatile, but
+        *   we don't use that feature.  It's better to cache.
+        */
+
+       switch (reg) {
+       case RK817_SECONDS_REG ... RK817_WEEKS_REG:
+       case RK817_RTC_STATUS_REG:
+       case RK817_CODEC_DTOP_LPT_SRST:
+       case RK817_GAS_GAUGE_ADC_CONFIG0 ... RK817_GAS_GAUGE_CUR_ADC_K0:
+       case RK817_PMIC_CHRG_STS:
+       case RK817_PMIC_CHRG_OUT:
+       case RK817_PMIC_CHRG_IN:
+       case RK817_INT_STS_REG0:
+       case RK817_INT_STS_REG1:
+       case RK817_INT_STS_REG2:
+       case RK817_SYS_STS:
+               return true;
+       }
+
+       return false;
+}
+
+static const struct regmap_config rk818_regmap_config = {
+       .reg_bits = 8,
+       .val_bits = 8,
+       .max_register = RK818_USB_CTRL_REG,
+       .cache_type = REGCACHE_RBTREE,
+       .volatile_reg = rk808_is_volatile_reg,
+};
+
+static const struct regmap_config rk805_regmap_config = {
+       .reg_bits = 8,
+       .val_bits = 8,
+       .max_register = RK805_OFF_SOURCE_REG,
+       .cache_type = REGCACHE_RBTREE,
+       .volatile_reg = rk808_is_volatile_reg,
+};
+
+static const struct regmap_config rk808_regmap_config = {
+       .reg_bits = 8,
+       .val_bits = 8,
+       .max_register = RK808_IO_POL_REG,
+       .cache_type = REGCACHE_RBTREE,
+       .volatile_reg = rk808_is_volatile_reg,
+};
+
+static const struct regmap_config rk817_regmap_config = {
+       .reg_bits = 8,
+       .val_bits = 8,
+       .max_register = RK817_GPIO_INT_CFG,
+       .cache_type = REGCACHE_NONE,
+       .volatile_reg = rk817_is_volatile_reg,
+};
+
+static const struct rk8xx_i2c_platform_data rk805_data = {
+       .regmap_cfg = &rk805_regmap_config,
+       .variant = RK805_ID,
+};
+
+static const struct rk8xx_i2c_platform_data rk808_data = {
+       .regmap_cfg = &rk808_regmap_config,
+       .variant = RK808_ID,
+};
+
+static const struct rk8xx_i2c_platform_data rk809_data = {
+       .regmap_cfg = &rk817_regmap_config,
+       .variant = RK809_ID,
+};
+
+static const struct rk8xx_i2c_platform_data rk817_data = {
+       .regmap_cfg = &rk817_regmap_config,
+       .variant = RK817_ID,
+};
+
+static const struct rk8xx_i2c_platform_data rk818_data = {
+       .regmap_cfg = &rk818_regmap_config,
+       .variant = RK818_ID,
+};
+
+static int rk8xx_i2c_probe(struct i2c_client *client)
+{
+       const struct rk8xx_i2c_platform_data *data;
+       struct regmap *regmap;
+
+       data = device_get_match_data(&client->dev);
+       if (!data)
+               return -ENODEV;
+
+       regmap = devm_regmap_init_i2c(client, data->regmap_cfg);
+       if (IS_ERR(regmap))
+               return dev_err_probe(&client->dev, PTR_ERR(regmap),
+                                    "regmap initialization failed\n");
+
+       return rk8xx_probe(&client->dev, data->variant, client->irq, regmap);
+}
+
+static void rk8xx_i2c_shutdown(struct i2c_client *client)
+{
+       rk8xx_shutdown(&client->dev);
+}
+
+static SIMPLE_DEV_PM_OPS(rk8xx_i2c_pm_ops, rk8xx_suspend, rk8xx_resume);
+
+static const struct of_device_id rk8xx_i2c_of_match[] = {
+       { .compatible = "rockchip,rk805", .data = &rk805_data },
+       { .compatible = "rockchip,rk808", .data = &rk808_data },
+       { .compatible = "rockchip,rk809", .data = &rk809_data },
+       { .compatible = "rockchip,rk817", .data = &rk817_data },
+       { .compatible = "rockchip,rk818", .data = &rk818_data },
+       { },
+};
+MODULE_DEVICE_TABLE(of, rk8xx_i2c_of_match);
+
+static struct i2c_driver rk8xx_i2c_driver = {
+       .driver = {
+               .name = "rk8xx-i2c",
+               .of_match_table = rk8xx_i2c_of_match,
+               .pm = &rk8xx_i2c_pm_ops,
+       },
+       .probe_new = rk8xx_i2c_probe,
+       .shutdown  = rk8xx_i2c_shutdown,
+};
+module_i2c_driver(rk8xx_i2c_driver);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Chris Zhong <zyw@rock-chips.com>");
+MODULE_AUTHOR("Zhang Qing <zhangqing@rock-chips.com>");
+MODULE_AUTHOR("Wadim Egorov <w.egorov@phytec.de>");
+MODULE_DESCRIPTION("RK8xx I2C PMIC driver");
diff --git a/drivers/mfd/rk8xx-spi.c b/drivers/mfd/rk8xx-spi.c
new file mode 100644
index 0000000..fd137f3
--- /dev/null
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Rockchip RK806 Core (SPI) driver
+ *
+ * Copyright (c) 2021 Rockchip Electronics Co., Ltd.
+ * Copyright (c) 2023 Collabora Ltd.
+ *
+ * Author: Xu Shengfei <xsf@rock-chips.com>
+ * Author: Sebastian Reichel <sebastian.reichel@collabora.com>
+ */
+
+#include <linux/interrupt.h>
+#include <linux/mfd/core.h>
+#include <linux/mfd/rk808.h>
+#include <linux/module.h>
+#include <linux/regmap.h>
+#include <linux/spi/spi.h>
+
+#define RK806_ADDR_SIZE 2
+#define RK806_CMD_WITH_SIZE(CMD, VALUE_BYTES) \
+       (RK806_CMD_##CMD | RK806_CMD_CRC_DIS | (VALUE_BYTES - 1))
+
+static const struct regmap_range rk806_volatile_ranges[] = {
+       regmap_reg_range(RK806_POWER_EN0, RK806_POWER_EN5),
+       regmap_reg_range(RK806_DVS_START_CTRL, RK806_INT_MSK1),
+};
+
+static const struct regmap_access_table rk806_volatile_table = {
+       .yes_ranges = rk806_volatile_ranges,
+       .n_yes_ranges = ARRAY_SIZE(rk806_volatile_ranges),
+};
+
+static const struct regmap_config rk806_regmap_config_spi = {
+       .reg_bits = 16,
+       .val_bits = 8,
+       .max_register = RK806_BUCK_RSERVE_REG5,
+       .cache_type = REGCACHE_RBTREE,
+       .volatile_table = &rk806_volatile_table,
+};
+
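+/*
+ * Frame layout as constructed below (see the RK806 datasheet for the
+ * authoritative description):
+ *   write: [CMD_WRITE | CRC_DIS | (len - 1)] [addr lo] [addr hi] [data ...]
+ *   read:  [CMD_READ  | CRC_DIS | (len - 1)] [addr lo] [addr hi], then the
+ *          requested number of value bytes is clocked back.
+ * The 16-bit register address goes out low byte first, matching the
+ * little-endian register format declared for this regmap bus.
+ */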
+static int rk806_spi_bus_write(void *context, const void *vdata, size_t count)
+{
+       struct device *dev = context;
+       struct spi_device *spi = to_spi_device(dev);
+       struct spi_transfer xfer[2] = { 0 };
+       /* data and thus count includes the register address */
+       size_t val_size = count - RK806_ADDR_SIZE;
+       char cmd;
+
+       if (val_size < 1 || val_size > (RK806_CMD_LEN_MSK + 1))
+               return -EINVAL;
+
+       cmd = RK806_CMD_WITH_SIZE(WRITE, val_size);
+
+       xfer[0].tx_buf = &cmd;
+       xfer[0].len = sizeof(cmd);
+       xfer[1].tx_buf = vdata;
+       xfer[1].len = count;
+
+       return spi_sync_transfer(spi, xfer, ARRAY_SIZE(xfer));
+}
+
+static int rk806_spi_bus_read(void *context, const void *vreg, size_t reg_size,
+                             void *val, size_t val_size)
+{
+       struct device *dev = context;
+       struct spi_device *spi = to_spi_device(dev);
+       char txbuf[3] = { 0 };
+
+       if (reg_size != RK806_ADDR_SIZE ||
+           val_size < 1 || val_size > (RK806_CMD_LEN_MSK + 1))
+               return -EINVAL;
+
+       /* TX buffer contains command byte followed by two address bytes */
+       txbuf[0] = RK806_CMD_WITH_SIZE(READ, val_size);
+       memcpy(txbuf+1, vreg, reg_size);
+
+       return spi_write_then_read(spi, txbuf, sizeof(txbuf), val, val_size);
+}
+
+static const struct regmap_bus rk806_regmap_bus_spi = {
+       .write = rk806_spi_bus_write,
+       .read = rk806_spi_bus_read,
+       .reg_format_endian_default = REGMAP_ENDIAN_LITTLE,
+};
+
+static int rk8xx_spi_probe(struct spi_device *spi)
+{
+       struct regmap *regmap;
+
+       regmap = devm_regmap_init(&spi->dev, &rk806_regmap_bus_spi,
+                                 &spi->dev, &rk806_regmap_config_spi);
+       if (IS_ERR(regmap))
+               return dev_err_probe(&spi->dev, PTR_ERR(regmap),
+                                    "Failed to init regmap\n");
+
+       return rk8xx_probe(&spi->dev, RK806_ID, spi->irq, regmap);
+}
+
+static const struct of_device_id rk8xx_spi_of_match[] = {
+       { .compatible = "rockchip,rk806", },
+       { }
+};
+MODULE_DEVICE_TABLE(of, rk8xx_spi_of_match);
+
+static const struct spi_device_id rk8xx_spi_id_table[] = {
+       { "rk806", 0 },
+       { }
+};
+MODULE_DEVICE_TABLE(spi, rk8xx_spi_id_table);
+
+static struct spi_driver rk8xx_spi_driver = {
+       .driver         = {
+               .name   = "rk8xx-spi",
+               .of_match_table = rk8xx_spi_of_match,
+       },
+       .probe          = rk8xx_spi_probe,
+       .id_table       = rk8xx_spi_id_table,
+};
+module_spi_driver(rk8xx_spi_driver);
+
+MODULE_AUTHOR("Xu Shengfei <xsf@rock-chips.com>");
+MODULE_DESCRIPTION("RK8xx SPI PMIC driver");
+MODULE_LICENSE("GPL");
diff --git a/drivers/mfd/tps6594-core.c b/drivers/mfd/tps6594-core.c
new file mode 100644
index 0000000..15f3148
--- /dev/null
@@ -0,0 +1,462 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Core functions for TI TPS6594/TPS6593/LP8764 PMICs
+ *
+ * Copyright (C) 2023 BayLibre Incorporated - https://www.baylibre.com/
+ */
+
+#include <linux/completion.h>
+#include <linux/delay.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+#include <linux/of_device.h>
+
+#include <linux/mfd/core.h>
+#include <linux/mfd/tps6594.h>
+
+#define TPS6594_CRC_SYNC_TIMEOUT_MS 150
+
+/* Completion to synchronize CRC feature enabling on all PMICs */
+static DECLARE_COMPLETION(tps6594_crc_comp);
+
+static const struct resource tps6594_regulator_resources[] = {
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK1_OV, TPS6594_IRQ_NAME_BUCK1_OV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK1_UV, TPS6594_IRQ_NAME_BUCK1_UV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK1_SC, TPS6594_IRQ_NAME_BUCK1_SC),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK1_ILIM, TPS6594_IRQ_NAME_BUCK1_ILIM),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK2_OV, TPS6594_IRQ_NAME_BUCK2_OV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK2_UV, TPS6594_IRQ_NAME_BUCK2_UV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK2_SC, TPS6594_IRQ_NAME_BUCK2_SC),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK2_ILIM, TPS6594_IRQ_NAME_BUCK2_ILIM),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK3_OV, TPS6594_IRQ_NAME_BUCK3_OV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK3_UV, TPS6594_IRQ_NAME_BUCK3_UV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK3_SC, TPS6594_IRQ_NAME_BUCK3_SC),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK3_ILIM, TPS6594_IRQ_NAME_BUCK3_ILIM),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK4_OV, TPS6594_IRQ_NAME_BUCK4_OV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK4_UV, TPS6594_IRQ_NAME_BUCK4_UV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK4_SC, TPS6594_IRQ_NAME_BUCK4_SC),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK4_ILIM, TPS6594_IRQ_NAME_BUCK4_ILIM),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK5_OV, TPS6594_IRQ_NAME_BUCK5_OV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK5_UV, TPS6594_IRQ_NAME_BUCK5_UV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK5_SC, TPS6594_IRQ_NAME_BUCK5_SC),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BUCK5_ILIM, TPS6594_IRQ_NAME_BUCK5_ILIM),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO1_OV, TPS6594_IRQ_NAME_LDO1_OV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO1_UV, TPS6594_IRQ_NAME_LDO1_UV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO1_SC, TPS6594_IRQ_NAME_LDO1_SC),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO1_ILIM, TPS6594_IRQ_NAME_LDO1_ILIM),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO2_OV, TPS6594_IRQ_NAME_LDO2_OV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO2_UV, TPS6594_IRQ_NAME_LDO2_UV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO2_SC, TPS6594_IRQ_NAME_LDO2_SC),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO2_ILIM, TPS6594_IRQ_NAME_LDO2_ILIM),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO3_OV, TPS6594_IRQ_NAME_LDO3_OV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO3_UV, TPS6594_IRQ_NAME_LDO3_UV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO3_SC, TPS6594_IRQ_NAME_LDO3_SC),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO3_ILIM, TPS6594_IRQ_NAME_LDO3_ILIM),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO4_OV, TPS6594_IRQ_NAME_LDO4_OV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO4_UV, TPS6594_IRQ_NAME_LDO4_UV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO4_SC, TPS6594_IRQ_NAME_LDO4_SC),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_LDO4_ILIM, TPS6594_IRQ_NAME_LDO4_ILIM),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_VCCA_OV, TPS6594_IRQ_NAME_VCCA_OV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_VCCA_UV, TPS6594_IRQ_NAME_VCCA_UV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_VMON1_OV, TPS6594_IRQ_NAME_VMON1_OV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_VMON1_UV, TPS6594_IRQ_NAME_VMON1_UV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_VMON1_RV, TPS6594_IRQ_NAME_VMON1_RV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_VMON2_OV, TPS6594_IRQ_NAME_VMON2_OV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_VMON2_UV, TPS6594_IRQ_NAME_VMON2_UV),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_VMON2_RV, TPS6594_IRQ_NAME_VMON2_RV),
+};
+
+static const struct resource tps6594_pinctrl_resources[] = {
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_GPIO9, TPS6594_IRQ_NAME_GPIO9),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_GPIO10, TPS6594_IRQ_NAME_GPIO10),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_GPIO11, TPS6594_IRQ_NAME_GPIO11),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_GPIO1, TPS6594_IRQ_NAME_GPIO1),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_GPIO2, TPS6594_IRQ_NAME_GPIO2),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_GPIO3, TPS6594_IRQ_NAME_GPIO3),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_GPIO4, TPS6594_IRQ_NAME_GPIO4),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_GPIO5, TPS6594_IRQ_NAME_GPIO5),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_GPIO6, TPS6594_IRQ_NAME_GPIO6),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_GPIO7, TPS6594_IRQ_NAME_GPIO7),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_GPIO8, TPS6594_IRQ_NAME_GPIO8),
+};
+
+static const struct resource tps6594_pfsm_resources[] = {
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_NPWRON_START, TPS6594_IRQ_NAME_NPWRON_START),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_ENABLE, TPS6594_IRQ_NAME_ENABLE),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_FSD, TPS6594_IRQ_NAME_FSD),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_SOFT_REBOOT, TPS6594_IRQ_NAME_SOFT_REBOOT),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BIST_PASS, TPS6594_IRQ_NAME_BIST_PASS),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_EXT_CLK, TPS6594_IRQ_NAME_EXT_CLK),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_TWARN, TPS6594_IRQ_NAME_TWARN),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_TSD_ORD, TPS6594_IRQ_NAME_TSD_ORD),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_BIST_FAIL, TPS6594_IRQ_NAME_BIST_FAIL),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_REG_CRC_ERR, TPS6594_IRQ_NAME_REG_CRC_ERR),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_RECOV_CNT, TPS6594_IRQ_NAME_RECOV_CNT),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_SPMI_ERR, TPS6594_IRQ_NAME_SPMI_ERR),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_NPWRON_LONG, TPS6594_IRQ_NAME_NPWRON_LONG),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_NINT_READBACK, TPS6594_IRQ_NAME_NINT_READBACK),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_NRSTOUT_READBACK, TPS6594_IRQ_NAME_NRSTOUT_READBACK),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_TSD_IMM, TPS6594_IRQ_NAME_TSD_IMM),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_VCCA_OVP, TPS6594_IRQ_NAME_VCCA_OVP),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_PFSM_ERR, TPS6594_IRQ_NAME_PFSM_ERR),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_IMM_SHUTDOWN, TPS6594_IRQ_NAME_IMM_SHUTDOWN),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_ORD_SHUTDOWN, TPS6594_IRQ_NAME_ORD_SHUTDOWN),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_MCU_PWR_ERR, TPS6594_IRQ_NAME_MCU_PWR_ERR),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_SOC_PWR_ERR, TPS6594_IRQ_NAME_SOC_PWR_ERR),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_COMM_FRM_ERR, TPS6594_IRQ_NAME_COMM_FRM_ERR),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_COMM_CRC_ERR, TPS6594_IRQ_NAME_COMM_CRC_ERR),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_COMM_ADR_ERR, TPS6594_IRQ_NAME_COMM_ADR_ERR),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_EN_DRV_READBACK, TPS6594_IRQ_NAME_EN_DRV_READBACK),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_NRSTOUT_SOC_READBACK,
+                            TPS6594_IRQ_NAME_NRSTOUT_SOC_READBACK),
+};
+
+static const struct resource tps6594_esm_resources[] = {
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_ESM_SOC_PIN, TPS6594_IRQ_NAME_ESM_SOC_PIN),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_ESM_SOC_FAIL, TPS6594_IRQ_NAME_ESM_SOC_FAIL),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_ESM_SOC_RST, TPS6594_IRQ_NAME_ESM_SOC_RST),
+};
+
+static const struct resource tps6594_rtc_resources[] = {
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_TIMER, TPS6594_IRQ_NAME_TIMER),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_ALARM, TPS6594_IRQ_NAME_ALARM),
+       DEFINE_RES_IRQ_NAMED(TPS6594_IRQ_POWER_UP, TPS6594_IRQ_NAME_POWERUP),
+};
+
+static const struct mfd_cell tps6594_common_cells[] = {
+       MFD_CELL_RES("tps6594-regulator", tps6594_regulator_resources),
+       MFD_CELL_RES("tps6594-pinctrl", tps6594_pinctrl_resources),
+       MFD_CELL_RES("tps6594-pfsm", tps6594_pfsm_resources),
+       MFD_CELL_RES("tps6594-esm", tps6594_esm_resources),
+};
+
+static const struct mfd_cell tps6594_rtc_cells[] = {
+       MFD_CELL_RES("tps6594-rtc", tps6594_rtc_resources),
+};
+
+static const struct regmap_irq tps6594_irqs[] = {
+       /* INT_BUCK1_2 register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK1_OV, 0, TPS6594_BIT_BUCKX_OV_INT(0)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK1_UV, 0, TPS6594_BIT_BUCKX_UV_INT(0)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK1_SC, 0, TPS6594_BIT_BUCKX_SC_INT(0)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK1_ILIM, 0, TPS6594_BIT_BUCKX_ILIM_INT(0)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK2_OV, 0, TPS6594_BIT_BUCKX_OV_INT(1)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK2_UV, 0, TPS6594_BIT_BUCKX_UV_INT(1)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK2_SC, 0, TPS6594_BIT_BUCKX_SC_INT(1)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK2_ILIM, 0, TPS6594_BIT_BUCKX_ILIM_INT(1)),
+
+       /* INT_BUCK3_4 register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK3_OV, 1, TPS6594_BIT_BUCKX_OV_INT(2)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK3_UV, 1, TPS6594_BIT_BUCKX_UV_INT(2)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK3_SC, 1, TPS6594_BIT_BUCKX_SC_INT(2)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK3_ILIM, 1, TPS6594_BIT_BUCKX_ILIM_INT(2)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK4_OV, 1, TPS6594_BIT_BUCKX_OV_INT(3)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK4_UV, 1, TPS6594_BIT_BUCKX_UV_INT(3)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK4_SC, 1, TPS6594_BIT_BUCKX_SC_INT(3)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK4_ILIM, 1, TPS6594_BIT_BUCKX_ILIM_INT(3)),
+
+       /* INT_BUCK5 register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK5_OV, 2, TPS6594_BIT_BUCKX_OV_INT(4)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK5_UV, 2, TPS6594_BIT_BUCKX_UV_INT(4)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK5_SC, 2, TPS6594_BIT_BUCKX_SC_INT(4)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BUCK5_ILIM, 2, TPS6594_BIT_BUCKX_ILIM_INT(4)),
+
+       /* INT_LDO1_2 register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO1_OV, 3, TPS6594_BIT_LDOX_OV_INT(0)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO1_UV, 3, TPS6594_BIT_LDOX_UV_INT(0)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO1_SC, 3, TPS6594_BIT_LDOX_SC_INT(0)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO1_ILIM, 3, TPS6594_BIT_LDOX_ILIM_INT(0)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO2_OV, 3, TPS6594_BIT_LDOX_OV_INT(1)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO2_UV, 3, TPS6594_BIT_LDOX_UV_INT(1)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO2_SC, 3, TPS6594_BIT_LDOX_SC_INT(1)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO2_ILIM, 3, TPS6594_BIT_LDOX_ILIM_INT(1)),
+
+       /* INT_LDO3_4 register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO3_OV, 4, TPS6594_BIT_LDOX_OV_INT(2)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO3_UV, 4, TPS6594_BIT_LDOX_UV_INT(2)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO3_SC, 4, TPS6594_BIT_LDOX_SC_INT(2)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO3_ILIM, 4, TPS6594_BIT_LDOX_ILIM_INT(2)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO4_OV, 4, TPS6594_BIT_LDOX_OV_INT(3)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO4_UV, 4, TPS6594_BIT_LDOX_UV_INT(3)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO4_SC, 4, TPS6594_BIT_LDOX_SC_INT(3)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_LDO4_ILIM, 4, TPS6594_BIT_LDOX_ILIM_INT(3)),
+
+       /* INT_VMON register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_VCCA_OV, 5, TPS6594_BIT_VCCA_OV_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_VCCA_UV, 5, TPS6594_BIT_VCCA_UV_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_VMON1_OV, 5, TPS6594_BIT_VMON1_OV_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_VMON1_UV, 5, TPS6594_BIT_VMON1_UV_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_VMON1_RV, 5, TPS6594_BIT_VMON1_RV_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_VMON2_OV, 5, TPS6594_BIT_VMON2_OV_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_VMON2_UV, 5, TPS6594_BIT_VMON2_UV_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_VMON2_RV, 5, TPS6594_BIT_VMON2_RV_INT),
+
+       /* INT_GPIO register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_GPIO9, 6, TPS6594_BIT_GPIO9_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_GPIO10, 6, TPS6594_BIT_GPIO10_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_GPIO11, 6, TPS6594_BIT_GPIO11_INT),
+
+       /* INT_GPIO1_8 register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_GPIO1, 7, TPS6594_BIT_GPIOX_INT(0)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_GPIO2, 7, TPS6594_BIT_GPIOX_INT(1)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_GPIO3, 7, TPS6594_BIT_GPIOX_INT(2)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_GPIO4, 7, TPS6594_BIT_GPIOX_INT(3)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_GPIO5, 7, TPS6594_BIT_GPIOX_INT(4)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_GPIO6, 7, TPS6594_BIT_GPIOX_INT(5)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_GPIO7, 7, TPS6594_BIT_GPIOX_INT(6)),
+       REGMAP_IRQ_REG(TPS6594_IRQ_GPIO8, 7, TPS6594_BIT_GPIOX_INT(7)),
+
+       /* INT_STARTUP register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_NPWRON_START, 8, TPS6594_BIT_NPWRON_START_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_ENABLE, 8, TPS6594_BIT_ENABLE_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_FSD, 8, TPS6594_BIT_FSD_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_SOFT_REBOOT, 8, TPS6594_BIT_SOFT_REBOOT_INT),
+
+       /* INT_MISC register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_BIST_PASS, 9, TPS6594_BIT_BIST_PASS_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_EXT_CLK, 9, TPS6594_BIT_EXT_CLK_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_TWARN, 9, TPS6594_BIT_TWARN_INT),
+
+       /* INT_MODERATE_ERR register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_TSD_ORD, 10, TPS6594_BIT_TSD_ORD_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_BIST_FAIL, 10, TPS6594_BIT_BIST_FAIL_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_REG_CRC_ERR, 10, TPS6594_BIT_REG_CRC_ERR_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_RECOV_CNT, 10, TPS6594_BIT_RECOV_CNT_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_SPMI_ERR, 10, TPS6594_BIT_SPMI_ERR_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_NPWRON_LONG, 10, TPS6594_BIT_NPWRON_LONG_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_NINT_READBACK, 10, TPS6594_BIT_NINT_READBACK_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_NRSTOUT_READBACK, 10, TPS6594_BIT_NRSTOUT_READBACK_INT),
+
+       /* INT_SEVERE_ERR register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_TSD_IMM, 11, TPS6594_BIT_TSD_IMM_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_VCCA_OVP, 11, TPS6594_BIT_VCCA_OVP_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_PFSM_ERR, 11, TPS6594_BIT_PFSM_ERR_INT),
+
+       /* INT_FSM_ERR register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_IMM_SHUTDOWN, 12, TPS6594_BIT_IMM_SHUTDOWN_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_ORD_SHUTDOWN, 12, TPS6594_BIT_ORD_SHUTDOWN_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_MCU_PWR_ERR, 12, TPS6594_BIT_MCU_PWR_ERR_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_SOC_PWR_ERR, 12, TPS6594_BIT_SOC_PWR_ERR_INT),
+
+       /* INT_COMM_ERR register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_COMM_FRM_ERR, 13, TPS6594_BIT_COMM_FRM_ERR_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_COMM_CRC_ERR, 13, TPS6594_BIT_COMM_CRC_ERR_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_COMM_ADR_ERR, 13, TPS6594_BIT_COMM_ADR_ERR_INT),
+
+       /* INT_READBACK_ERR register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_EN_DRV_READBACK, 14, TPS6594_BIT_EN_DRV_READBACK_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_NRSTOUT_SOC_READBACK, 14, TPS6594_BIT_NRSTOUT_SOC_READBACK_INT),
+
+       /* INT_ESM register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_ESM_SOC_PIN, 15, TPS6594_BIT_ESM_SOC_PIN_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_ESM_SOC_FAIL, 15, TPS6594_BIT_ESM_SOC_FAIL_INT),
+       REGMAP_IRQ_REG(TPS6594_IRQ_ESM_SOC_RST, 15, TPS6594_BIT_ESM_SOC_RST_INT),
+
+       /* RTC_STATUS register */
+       REGMAP_IRQ_REG(TPS6594_IRQ_TIMER, 16, TPS6594_BIT_TIMER),
+       REGMAP_IRQ_REG(TPS6594_IRQ_ALARM, 16, TPS6594_BIT_ALARM),
+       REGMAP_IRQ_REG(TPS6594_IRQ_POWER_UP, 16, TPS6594_BIT_POWER_UP),
+};
+
+static const unsigned int tps6594_irq_reg[] = {
+       TPS6594_REG_INT_BUCK1_2,
+       TPS6594_REG_INT_BUCK3_4,
+       TPS6594_REG_INT_BUCK5,
+       TPS6594_REG_INT_LDO1_2,
+       TPS6594_REG_INT_LDO3_4,
+       TPS6594_REG_INT_VMON,
+       TPS6594_REG_INT_GPIO,
+       TPS6594_REG_INT_GPIO1_8,
+       TPS6594_REG_INT_STARTUP,
+       TPS6594_REG_INT_MISC,
+       TPS6594_REG_INT_MODERATE_ERR,
+       TPS6594_REG_INT_SEVERE_ERR,
+       TPS6594_REG_INT_FSM_ERR,
+       TPS6594_REG_INT_COMM_ERR,
+       TPS6594_REG_INT_READBACK_ERR,
+       TPS6594_REG_INT_ESM,
+       TPS6594_REG_RTC_STATUS,
+};
+
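+/*
+ * The interrupt status registers are not evenly spaced (the last one is
+ * RTC_STATUS), so map the regmap-irq register index through the table above
+ * instead of using a fixed stride from a base register.
+ */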
+static inline unsigned int tps6594_get_irq_reg(struct regmap_irq_chip_data *data,
+                                              unsigned int base, int index)
+{
+       return tps6594_irq_reg[index];
+}
+
+static int tps6594_handle_post_irq(void *irq_drv_data)
+{
+       struct tps6594 *tps = irq_drv_data;
+       int ret = 0;
+
+       /*
+        * When CRC is enabled, writing to a read-only bit triggers an error
+        * and sets the COMM_ADR_ERR_INT bit. In addition, interrupt bits
+        * (which must be cleared) and read-only bits are sometimes grouped
+        * in the same register.
+        * Since regmap clears interrupts with one write per register, clearing
+        * an interrupt bit in a register that also contains a read-only bit
+        * sets COMM_ADR_ERR_INT. Clear that bit immediately to avoid raising
+        * a new interrupt.
+        */
+       if (tps->use_crc)
+               ret = regmap_write_bits(tps->regmap, TPS6594_REG_INT_COMM_ERR,
+                                       TPS6594_BIT_COMM_ADR_ERR_INT,
+                                       TPS6594_BIT_COMM_ADR_ERR_INT);
+
+       return ret;
+}
+
+static struct regmap_irq_chip tps6594_irq_chip = {
+       .ack_base = TPS6594_REG_INT_BUCK1_2,
+       .ack_invert = 1,
+       .clear_ack = 1,
+       .init_ack_masked = 1,
+       .num_regs = ARRAY_SIZE(tps6594_irq_reg),
+       .irqs = tps6594_irqs,
+       .num_irqs = ARRAY_SIZE(tps6594_irqs),
+       .get_irq_reg = tps6594_get_irq_reg,
+       .handle_post_irq = tps6594_handle_post_irq,
+};
+
+bool tps6594_is_volatile_reg(struct device *dev, unsigned int reg)
+{
+       return (reg >= TPS6594_REG_INT_TOP && reg <= TPS6594_REG_STAT_READBACK_ERR) ||
+              reg == TPS6594_REG_RTC_STATUS;
+}
+EXPORT_SYMBOL_GPL(tps6594_is_volatile_reg);
+
+static int tps6594_check_crc_mode(struct tps6594 *tps, bool primary_pmic)
+{
+       int ret;
+
+       /*
+        * Check if CRC is enabled.
+        * Once CRC is enabled, it can't be disabled until next power cycle.
+        */
+       tps->use_crc = true;
+       ret = regmap_test_bits(tps->regmap, TPS6594_REG_SERIAL_IF_CONFIG,
+                              TPS6594_BIT_I2C1_SPI_CRC_EN);
+       if (ret == 0) {
+               ret = -EIO;
+       } else if (ret > 0) {
+               dev_info(tps->dev, "CRC feature enabled on %s PMIC\n",
+                        primary_pmic ? "primary" : "secondary");
+               ret = 0;
+       }
+
+       return ret;
+}
+
+static int tps6594_set_crc_feature(struct tps6594 *tps)
+{
+       int ret;
+
+       ret = tps6594_check_crc_mode(tps, true);
+       if (ret) {
+               /*
+                * If CRC is not already enabled, force PFSM I2C_2 trigger to enable it
+                * on primary PMIC.
+                */
+               tps->use_crc = false;
+               ret = regmap_write_bits(tps->regmap, TPS6594_REG_FSM_I2C_TRIGGERS,
+                                       TPS6594_BIT_TRIGGER_I2C(2), TPS6594_BIT_TRIGGER_I2C(2));
+               if (ret)
+                       return ret;
+
+               /*
+                * Wait for PFSM to process trigger.
+                * The datasheet indicates 2 ms, and clock specification is +/-5%.
+                * 4 ms should provide sufficient margin.
+                */
+               usleep_range(4000, 5000);
+
+               ret = tps6594_check_crc_mode(tps, true);
+       }
+
+       return ret;
+}
+
+static int tps6594_enable_crc(struct tps6594 *tps)
+{
+       struct device *dev = tps->dev;
+       unsigned int is_primary;
+       unsigned long timeout = msecs_to_jiffies(TPS6594_CRC_SYNC_TIMEOUT_MS);
+       int ret;
+
+       /*
+        * CRC mode can be used with I2C or SPI protocols.
+        * If this mode is specified for primary PMIC, it will also be applied to secondary PMICs
+        * through SPMI serial interface.
+        * In this multi-PMIC synchronization scheme, the primary PMIC is the controller device
+        * on the SPMI bus, and the secondary PMICs are the target devices on the SPMI bus.
+        */
+       is_primary = of_property_read_bool(dev->of_node, "ti,primary-pmic");
+       if (is_primary) {
+               /* Enable CRC feature on primary PMIC */
+               ret = tps6594_set_crc_feature(tps);
+               if (ret)
+                       return ret;
+
+               /* Notify secondary PMICs that CRC feature is enabled */
+               complete_all(&tps6594_crc_comp);
+       } else {
+               /* Wait for CRC feature enabling event from primary PMIC */
+               ret = wait_for_completion_interruptible_timeout(&tps6594_crc_comp, timeout);
+               if (ret == 0)
+                       ret = -ETIMEDOUT;
+               else if (ret > 0)
+                       ret = tps6594_check_crc_mode(tps, false);
+       }
+
+       return ret;
+}
+
+int tps6594_device_init(struct tps6594 *tps, bool enable_crc)
+{
+       struct device *dev = tps->dev;
+       int ret;
+
+       if (enable_crc) {
+               ret = tps6594_enable_crc(tps);
+               if (ret)
+                       return dev_err_probe(dev, ret, "Failed to enable CRC\n");
+       }
+
+       /* Keep PMIC in ACTIVE state */
+       ret = regmap_set_bits(tps->regmap, TPS6594_REG_FSM_NSLEEP_TRIGGERS,
+                             TPS6594_BIT_NSLEEP1B | TPS6594_BIT_NSLEEP2B);
+       if (ret)
+               return dev_err_probe(dev, ret, "Failed to set PMIC state\n");
+
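+       /* name the irq_chip per PMIC instance: <driver>-<chip id>-<address> */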
+       tps6594_irq_chip.irq_drv_data = tps;
+       tps6594_irq_chip.name = devm_kasprintf(dev, GFP_KERNEL, "%s-%ld-0x%02x",
+                                              dev->driver->name, tps->chip_id, tps->reg);
+
+       ret = devm_regmap_add_irq_chip(dev, tps->regmap, tps->irq, IRQF_SHARED | IRQF_ONESHOT,
+                                      0, &tps6594_irq_chip, &tps->irq_data);
+       if (ret)
+               return dev_err_probe(dev, ret, "Failed to add regmap IRQ\n");
+
+       ret = devm_mfd_add_devices(dev, PLATFORM_DEVID_AUTO, tps6594_common_cells,
+                                  ARRAY_SIZE(tps6594_common_cells), NULL, 0,
+                                  regmap_irq_get_domain(tps->irq_data));
+       if (ret)
+               return dev_err_probe(dev, ret, "Failed to add common child devices\n");
+
+       /* No RTC for LP8764 */
+       if (tps->chip_id != LP8764) {
+               ret = devm_mfd_add_devices(dev, PLATFORM_DEVID_AUTO, tps6594_rtc_cells,
+                                          ARRAY_SIZE(tps6594_rtc_cells), NULL, 0,
+                                          regmap_irq_get_domain(tps->irq_data));
+               if (ret)
+                       return dev_err_probe(dev, ret, "Failed to add RTC child device\n");
+       }
+
+       return 0;
+}
+EXPORT_SYMBOL_GPL(tps6594_device_init);
+
+MODULE_AUTHOR("Julien Panis <jpanis@baylibre.com>");
+MODULE_DESCRIPTION("TPS6594 Driver");
+MODULE_LICENSE("GPL");
diff --git a/drivers/mfd/tps6594-i2c.c b/drivers/mfd/tps6594-i2c.c
new file mode 100644
index 0000000..449d5c6
--- /dev/null
@@ -0,0 +1,244 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * I2C access driver for TI TPS6594/TPS6593/LP8764 PMICs
+ *
+ * Copyright (C) 2023 BayLibre Incorporated - https://www.baylibre.com/
+ */
+
+#include <linux/crc8.h>
+#include <linux/i2c.h>
+#include <linux/module.h>
+#include <linux/mod_devicetable.h>
+#include <linux/of_device.h>
+#include <linux/regmap.h>
+
+#include <linux/mfd/tps6594.h>
+
+static bool enable_crc;
+module_param(enable_crc, bool, 0444);
+MODULE_PARM_DESC(enable_crc, "Enable CRC feature for I2C interface");
+
+DECLARE_CRC8_TABLE(tps6594_i2c_crc_table);
+
+static int tps6594_i2c_transfer(struct i2c_adapter *adap, struct i2c_msg *msgs, int num)
+{
+       int ret = i2c_transfer(adap, msgs, num);
+
+       if (ret == num)
+               return 0;
+       else if (ret < 0)
+               return ret;
+       else
+               return -EIO;
+}
+
+static int tps6594_i2c_reg_read_with_crc(struct i2c_client *client, u8 page, u8 reg, u8 *val)
+{
+       struct i2c_msg msgs[2];
+       u8 buf_rx[] = { 0, 0 };
+       /* I2C address = I2C base address + Page index */
+       const u8 addr = client->addr + page;
+       /*
+        * CRC is calculated from every bit included in the protocol
+        * except the ACK bits from the target. Byte stream is:
+        * - B0: (I2C_addr_7bits << 1) | WR_bit, with WR_bit = 0
+        * - B1: reg
+        * - B2: (I2C_addr_7bits << 1) | RD_bit, with RD_bit = 1
+        * - B3: val
+        * - B4: CRC from B0-B1-B2-B3
+        */
+       u8 crc_data[] = { addr << 1, reg, addr << 1 | 1, 0 };
+       int ret;
+
+       /* Write register */
+       msgs[0].addr = addr;
+       msgs[0].flags = 0;
+       msgs[0].len = 1;
+       msgs[0].buf = &reg;
+
+       /* Read data and CRC */
+       msgs[1].addr = msgs[0].addr;
+       msgs[1].flags = I2C_M_RD;
+       msgs[1].len = 2;
+       msgs[1].buf = buf_rx;
+
+       ret = tps6594_i2c_transfer(client->adapter, msgs, 2);
+       if (ret < 0)
+               return ret;
+
+       crc_data[sizeof(crc_data) - 1] = *val = buf_rx[0];
+       if (buf_rx[1] != crc8(tps6594_i2c_crc_table, crc_data, sizeof(crc_data), CRC8_INIT_VALUE))
+               return -EIO;
+
+       return ret;
+}
+
+static int tps6594_i2c_reg_write_with_crc(struct i2c_client *client, u8 page, u8 reg, u8 val)
+{
+       struct i2c_msg msg;
+       u8 buf[] = { reg, val, 0 };
+       /* I2C address = I2C base address + Page index */
+       const u8 addr = client->addr + page;
+       /*
+        * CRC is calculated from every bit included in the protocol
+        * except the ACK bits from the target. Byte stream is:
+        * - B0: (I2C_addr_7bits << 1) | WR_bit, with WR_bit = 0
+        * - B1: reg
+        * - B2: val
+        * - B3: CRC from B0-B1-B2
+        */
+       const u8 crc_data[] = { addr << 1, reg, val };
+
+       /* Write register, data and CRC */
+       msg.addr = addr;
+       msg.flags = client->flags & I2C_M_TEN;
+       msg.len = sizeof(buf);
+       msg.buf = buf;
+
+       buf[msg.len - 1] = crc8(tps6594_i2c_crc_table, crc_data, sizeof(crc_data), CRC8_INIT_VALUE);
+
+       return tps6594_i2c_transfer(client->adapter, &msg, 1);
+}
+
+static int tps6594_i2c_read(void *context, const void *reg_buf, size_t reg_size,
+                           void *val_buf, size_t val_size)
+{
+       struct i2c_client *client = context;
+       struct tps6594 *tps = i2c_get_clientdata(client);
+       struct i2c_msg msgs[2];
+       const u8 *reg_bytes = reg_buf;
+       u8 *val_bytes = val_buf;
+       const u8 page = reg_bytes[1];
+       u8 reg = reg_bytes[0];
+       int ret = 0;
+       int i;
+
+       if (tps->use_crc) {
+               /*
+                * Auto-increment feature does not support CRC protocol.
+                * Converts the bulk read operation into a series of single read operations.
+                */
+               for (i = 0 ; ret == 0 && i < val_size ; i++)
+                       ret = tps6594_i2c_reg_read_with_crc(client, page, reg + i, val_bytes + i);
+
+               return ret;
+       }
+
+       /* Write register: I2C address = I2C base address + Page index */
+       msgs[0].addr = client->addr + page;
+       msgs[0].flags = 0;
+       msgs[0].len = 1;
+       msgs[0].buf = &reg;
+
+       /* Read data */
+       msgs[1].addr = msgs[0].addr;
+       msgs[1].flags = I2C_M_RD;
+       msgs[1].len = val_size;
+       msgs[1].buf = val_bytes;
+
+       return tps6594_i2c_transfer(client->adapter, msgs, 2);
+}
+
+static int tps6594_i2c_write(void *context, const void *data, size_t count)
+{
+       struct i2c_client *client = context;
+       struct tps6594 *tps = i2c_get_clientdata(client);
+       struct i2c_msg msg;
+       const u8 *bytes = data;
+       u8 *buf;
+       const u8 page = bytes[1];
+       const u8 reg = bytes[0];
+       int ret = 0;
+       int i;
+
+       if (tps->use_crc) {
+               /*
+                * Auto-increment feature does not support CRC protocol.
+                * Converts the bulk write operation into a series of single write operations.
+                */
+               for (i = 0 ; ret == 0 && i < count - 2 ; i++)
+                       ret = tps6594_i2c_reg_write_with_crc(client, page, reg + i, bytes[i + 2]);
+
+               return ret;
+       }
+
+       /* Setup buffer: page byte is not sent */
+       buf = kzalloc(--count, GFP_KERNEL);
+       if (!buf)
+               return -ENOMEM;
+
+       buf[0] = reg;
+       for (i = 0 ; i < count - 1 ; i++)
+               buf[i + 1] = bytes[i + 2];
+
+       /* Write register and data: I2C address = I2C base address + Page index */
+       msg.addr = client->addr + page;
+       msg.flags = client->flags & I2C_M_TEN;
+       msg.len = count;
+       msg.buf = buf;
+
+       ret = tps6594_i2c_transfer(client->adapter, &msg, 1);
+
+       kfree(buf);
+       return ret;
+}
+
+static const struct regmap_config tps6594_i2c_regmap_config = {
+       .reg_bits = 16,
+       .val_bits = 8,
+       .max_register = TPS6594_REG_DWD_FAIL_CNT_REG,
+       .volatile_reg = tps6594_is_volatile_reg,
+       .read = tps6594_i2c_read,
+       .write = tps6594_i2c_write,
+};
+
+static const struct of_device_id tps6594_i2c_of_match_table[] = {
+       { .compatible = "ti,tps6594-q1", .data = (void *)TPS6594, },
+       { .compatible = "ti,tps6593-q1", .data = (void *)TPS6593, },
+       { .compatible = "ti,lp8764-q1",  .data = (void *)LP8764,  },
+       {}
+};
+MODULE_DEVICE_TABLE(of, tps6594_i2c_of_match_table);
+
+static int tps6594_i2c_probe(struct i2c_client *client)
+{
+       struct device *dev = &client->dev;
+       struct tps6594 *tps;
+       const struct of_device_id *match;
+
+       tps = devm_kzalloc(dev, sizeof(*tps), GFP_KERNEL);
+       if (!tps)
+               return -ENOMEM;
+
+       i2c_set_clientdata(client, tps);
+
+       tps->dev = dev;
+       tps->reg = client->addr;
+       tps->irq = client->irq;
+
+       tps->regmap = devm_regmap_init(dev, NULL, client, &tps6594_i2c_regmap_config);
+       if (IS_ERR(tps->regmap))
+               return dev_err_probe(dev, PTR_ERR(tps->regmap), "Failed to init regmap\n");
+
+       match = of_match_device(tps6594_i2c_of_match_table, dev);
+       if (!match)
+               return dev_err_probe(dev, -ENODEV, "Failed to find matching chip ID\n");
+       tps->chip_id = (unsigned long)match->data;
+
+       crc8_populate_msb(tps6594_i2c_crc_table, TPS6594_CRC8_POLYNOMIAL);
+
+       return tps6594_device_init(tps, enable_crc);
+}
+
+static struct i2c_driver tps6594_i2c_driver = {
+       .driver = {
+               .name = "tps6594",
+               .of_match_table = tps6594_i2c_of_match_table,
+       },
+       .probe_new = tps6594_i2c_probe,
+};
+module_i2c_driver(tps6594_i2c_driver);
+
+MODULE_AUTHOR("Julien Panis <jpanis@baylibre.com>");
+MODULE_DESCRIPTION("TPS6594 I2C Interface Driver");
+MODULE_LICENSE("GPL");
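The CRC framing used by tps6594_i2c_reg_read_with_crc() and tps6594_i2c_reg_write_with_crc() above can be reproduced with a few lines of userspace C. This is a minimal sketch, not driver code: the polynomial (0x07) and initial value (0xff) are placeholders standing in for TPS6594_CRC8_POLYNOMIAL and CRC8_INIT_VALUE, and the address/register/value bytes are arbitrary examples.

#include <stdint.h>
#include <stdio.h>

static uint8_t crc8_msb(uint8_t poly, uint8_t init, const uint8_t *buf, int len)
{
	uint8_t crc = init;

	while (len--) {
		crc ^= *buf++;
		for (int i = 0; i < 8; i++)
			crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ poly) : (uint8_t)(crc << 1);
	}
	return crc;
}

int main(void)
{
	uint8_t base = 0x48, page = 0, reg = 0x01, val = 0xa5;	/* arbitrary example values */
	uint8_t addr = base + page;	/* I2C address = I2C base address + Page index */
	/* B0..B3 exactly as documented in tps6594_i2c_reg_read_with_crc() */
	uint8_t stream[] = { addr << 1, reg, (addr << 1) | 1, val };

	printf("CRC byte expected from the PMIC: 0x%02x\n",
	       crc8_msb(0x07, 0xff, stream, sizeof(stream)));
	return 0;
}

The same helper covers the write frame by hashing { addr << 1, reg, val } instead, matching the three-byte stream documented in tps6594_i2c_reg_write_with_crc().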
diff --git a/drivers/mfd/tps6594-spi.c b/drivers/mfd/tps6594-spi.c
new file mode 100644 (file)
index 0000000..a938a19
--- /dev/null
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * SPI access driver for TI TPS6594/TPS6593/LP8764 PMICs
+ *
+ * Copyright (C) 2023 BayLibre Incorporated - https://www.baylibre.com/
+ */
+
+#include <linux/crc8.h>
+#include <linux/module.h>
+#include <linux/mod_devicetable.h>
+#include <linux/of_device.h>
+#include <linux/regmap.h>
+#include <linux/spi/spi.h>
+
+#include <linux/mfd/tps6594.h>
+
+#define TPS6594_SPI_PAGE_SHIFT 5
+#define TPS6594_SPI_READ_BIT   BIT(4)
+
+static bool enable_crc;
+module_param(enable_crc, bool, 0444);
+MODULE_PARM_DESC(enable_crc, "Enable CRC feature for SPI interface");
+
+DECLARE_CRC8_TABLE(tps6594_spi_crc_table);
+
+static int tps6594_spi_reg_read(void *context, unsigned int reg, unsigned int *val)
+{
+       struct spi_device *spi = context;
+       struct tps6594 *tps = spi_get_drvdata(spi);
+       u8 buf[4] = { 0 };
+       size_t count_rx = 1;
+       int ret;
+
+       buf[0] = reg;
+       buf[1] = TPS6594_REG_TO_PAGE(reg) << TPS6594_SPI_PAGE_SHIFT | TPS6594_SPI_READ_BIT;
+
+       if (tps->use_crc)
+               count_rx++;
+
+       ret = spi_write_then_read(spi, buf, 2, buf + 2, count_rx);
+       if (ret < 0)
+               return ret;
+
+       if (tps->use_crc && buf[3] != crc8(tps6594_spi_crc_table, buf, 3, CRC8_INIT_VALUE))
+               return -EIO;
+
+       *val = buf[2];
+
+       return 0;
+}
+
+static int tps6594_spi_reg_write(void *context, unsigned int reg, unsigned int val)
+{
+       struct spi_device *spi = context;
+       struct tps6594 *tps = spi_get_drvdata(spi);
+       u8 buf[4] = { 0 };
+       size_t count = 3;
+
+       buf[0] = reg;
+       buf[1] = TPS6594_REG_TO_PAGE(reg) << TPS6594_SPI_PAGE_SHIFT;
+       buf[2] = val;
+
+       if (tps->use_crc)
+               buf[3] = crc8(tps6594_spi_crc_table, buf, count++, CRC8_INIT_VALUE);
+
+       return spi_write(spi, buf, count);
+}
+
+static const struct regmap_config tps6594_spi_regmap_config = {
+       .reg_bits = 16,
+       .val_bits = 8,
+       .max_register = TPS6594_REG_DWD_FAIL_CNT_REG,
+       .volatile_reg = tps6594_is_volatile_reg,
+       .reg_read = tps6594_spi_reg_read,
+       .reg_write = tps6594_spi_reg_write,
+       .use_single_read = true,
+       .use_single_write = true,
+};
+
+static const struct of_device_id tps6594_spi_of_match_table[] = {
+       { .compatible = "ti,tps6594-q1", .data = (void *)TPS6594, },
+       { .compatible = "ti,tps6593-q1", .data = (void *)TPS6593, },
+       { .compatible = "ti,lp8764-q1",  .data = (void *)LP8764,  },
+       {}
+};
+MODULE_DEVICE_TABLE(of, tps6594_spi_of_match_table);
+
+static int tps6594_spi_probe(struct spi_device *spi)
+{
+       struct device *dev = &spi->dev;
+       struct tps6594 *tps;
+       const struct of_device_id *match;
+
+       tps = devm_kzalloc(dev, sizeof(*tps), GFP_KERNEL);
+       if (!tps)
+               return -ENOMEM;
+
+       spi_set_drvdata(spi, tps);
+
+       tps->dev = dev;
+       tps->reg = spi->chip_select;
+       tps->irq = spi->irq;
+
+       tps->regmap = devm_regmap_init(dev, NULL, spi, &tps6594_spi_regmap_config);
+       if (IS_ERR(tps->regmap))
+               return dev_err_probe(dev, PTR_ERR(tps->regmap), "Failed to init regmap\n");
+
+       match = of_match_device(tps6594_spi_of_match_table, dev);
+       if (!match)
+               return dev_err_probe(dev, -ENODEV, "Failed to find matching chip ID\n");
+       tps->chip_id = (unsigned long)match->data;
+
+       crc8_populate_msb(tps6594_spi_crc_table, TPS6594_CRC8_POLYNOMIAL);
+
+       return tps6594_device_init(tps, enable_crc);
+}
+
+static struct spi_driver tps6594_spi_driver = {
+       .driver = {
+               .name = "tps6594",
+               .of_match_table = tps6594_spi_of_match_table,
+       },
+       .probe = tps6594_spi_probe,
+};
+module_spi_driver(tps6594_spi_driver);
+
+MODULE_AUTHOR("Julien Panis <jpanis@baylibre.com>");
+MODULE_DESCRIPTION("TPS6594 SPI Interface Driver");
+MODULE_LICENSE("GPL");
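The SPI helpers above pack the 16-bit regmap register into a two-byte command: the low byte carries the register offset, the high bits carry the page (shifted by TPS6594_SPI_PAGE_SHIFT) plus the read bit. Below is a minimal sketch of that packing; it assumes TPS6594_REG_TO_PAGE() takes the page from the high byte of the register, which is an assumption, not verified against the header.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	unsigned int reg = 0x011a;	/* hypothetical register: page 1, offset 0x1a */
	uint8_t cmd[2];

	cmd[0] = reg & 0xff;			/* buf[0]: register offset          */
	cmd[1] = ((reg >> 8) << 5) | (1 << 4);	/* buf[1]: page << 5 | read bit (4) */

	printf("command bytes: 0x%02x 0x%02x\n", cmd[0], cmd[1]);	/* 0x1a 0x30 */
	return 0;
}

For a write, the read bit stays clear and a third byte carries the value; with CRC enabled, a fourth byte holds crc8() over the first three bytes, as done in tps6594_spi_reg_write().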
index 48821f4..3c95600 100644 (file)
@@ -309,7 +309,7 @@ static void lkdtm_OVERFLOW_UNSIGNED(void)
 struct array_bounds_flex_array {
        int one;
        int two;
-       char data[1];
+       char data[];
 };
 
 struct array_bounds {
@@ -341,7 +341,7 @@ static void lkdtm_ARRAY_BOUNDS(void)
         * For the uninstrumented flex array member, also touch 1 byte
         * beyond to verify it is correctly uninstrumented.
         */
-       for (i = 0; i < sizeof(not_checked->data) + 1; i++)
+       for (i = 0; i < 2; i++)
                not_checked->data[i] = 'A';
 
        pr_info("Array access beyond bounds ...\n");
@@ -487,6 +487,7 @@ static void lkdtm_UNSET_SMEP(void)
         * the cr4 writing instruction.
         */
        insn = (unsigned char *)native_write_cr4;
+       OPTIMIZER_HIDE_VAR(insn);
        for (i = 0; i < MOV_CR4_DEPTH; i++) {
                /* mov %rdi, %cr4 */
                if (insn[i] == 0x0f && insn[i+1] == 0x22 && insn[i+2] == 0xe7)
index b836936..629edb6 100644 (file)
@@ -185,7 +185,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
 #else
        *pageshift = PAGE_SHIFT;
 #endif
-       if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page, NULL) <= 0)
+       if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page) <= 0)
                return -EFAULT;
        *paddr = page_to_phys(page);
        put_page(page);
@@ -228,7 +228,7 @@ static int atomic_pte_lookup(struct vm_area_struct *vma, unsigned long vaddr,
                goto err;
 #ifdef CONFIG_X86_64
        if (unlikely(pmd_large(*pmdp)))
-               pte = *(pte_t *) pmdp;
+               pte = ptep_get((pte_t *)pmdp);
        else
 #endif
                pte = *pte_offset_kernel(pmdp, vaddr);
index d920c41..f701efb 100644 (file)
@@ -178,6 +178,7 @@ static void mmc_blk_rw_rq_prep(struct mmc_queue_req *mqrq,
                               int recovery_mode,
                               struct mmc_queue *mq);
 static void mmc_blk_hsq_req_done(struct mmc_request *mrq);
+static int mmc_spi_err_check(struct mmc_card *card);
 
 static struct mmc_blk_data *mmc_blk_get(struct gendisk *disk)
 {
@@ -358,15 +359,15 @@ static const struct attribute_group *mmc_disk_attr_groups[] = {
        NULL,
 };
 
-static int mmc_blk_open(struct block_device *bdev, fmode_t mode)
+static int mmc_blk_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct mmc_blk_data *md = mmc_blk_get(bdev->bd_disk);
+       struct mmc_blk_data *md = mmc_blk_get(disk);
        int ret = -ENXIO;
 
        mutex_lock(&block_mutex);
        if (md) {
                ret = 0;
-               if ((mode & FMODE_WRITE) && md->read_only) {
+               if ((mode & BLK_OPEN_WRITE) && md->read_only) {
                        mmc_blk_put(md);
                        ret = -EROFS;
                }
@@ -376,7 +377,7 @@ static int mmc_blk_open(struct block_device *bdev, fmode_t mode)
        return ret;
 }
 
-static void mmc_blk_release(struct gendisk *disk, fmode_t mode)
+static void mmc_blk_release(struct gendisk *disk)
 {
        struct mmc_blk_data *md = disk->private_data;
 
@@ -608,6 +609,11 @@ static int __mmc_blk_ioctl_cmd(struct mmc_card *card, struct mmc_blk_data *md,
        if ((card->host->caps & MMC_CAP_WAIT_WHILE_BUSY) && use_r1b_resp)
                return 0;
 
+       if (mmc_host_is_spi(card->host)) {
+               if (idata->ic.write_flag || r1b_resp || cmd.flags & MMC_RSP_SPI_BUSY)
+                       return mmc_spi_err_check(card);
+               return err;
+       }
        /* Ensure RPMB/R1B command has completed by polling with CMD13. */
        if (idata->rpmb || r1b_resp)
                err = mmc_poll_for_busy(card, busy_timeout_ms, false,
@@ -757,7 +763,7 @@ static int mmc_blk_check_blkdev(struct block_device *bdev)
        return 0;
 }
 
-static int mmc_blk_ioctl(struct block_device *bdev, fmode_t mode,
+static int mmc_blk_ioctl(struct block_device *bdev, blk_mode_t mode,
        unsigned int cmd, unsigned long arg)
 {
        struct mmc_blk_data *md;
@@ -794,7 +800,7 @@ static int mmc_blk_ioctl(struct block_device *bdev, fmode_t mode,
 }
 
 #ifdef CONFIG_COMPAT
-static int mmc_blk_compat_ioctl(struct block_device *bdev, fmode_t mode,
+static int mmc_blk_compat_ioctl(struct block_device *bdev, blk_mode_t mode,
        unsigned int cmd, unsigned long arg)
 {
        return mmc_blk_ioctl(bdev, mode, cmd, (unsigned long) compat_ptr(arg));
@@ -2505,9 +2511,9 @@ static struct mmc_blk_data *mmc_blk_alloc_req(struct mmc_card *card,
 
        string_get_size((u64)size, 512, STRING_UNITS_2,
                        cap_str, sizeof(cap_str));
-       pr_info("%s: %s %s %s %s\n",
+       pr_info("%s: %s %s %s%s\n",
                md->disk->disk_name, mmc_card_id(card), mmc_card_name(card),
-               cap_str, md->read_only ? "(ro)" : "");
+               cap_str, md->read_only ? " (ro)" : "");
 
        /* used in ->open, must be set before add_disk: */
        if (area_type == MMC_BLK_DATA_AREA_MAIN)
@@ -2899,12 +2905,12 @@ static const struct file_operations mmc_dbg_ext_csd_fops = {
        .llseek         = default_llseek,
 };
 
-static int mmc_blk_add_debugfs(struct mmc_card *card, struct mmc_blk_data *md)
+static void mmc_blk_add_debugfs(struct mmc_card *card, struct mmc_blk_data *md)
 {
        struct dentry *root;
 
        if (!card->debugfs_root)
-               return 0;
+               return;
 
        root = card->debugfs_root;
 
@@ -2913,19 +2919,13 @@ static int mmc_blk_add_debugfs(struct mmc_card *card, struct mmc_blk_data *md)
                        debugfs_create_file_unsafe("status", 0400, root,
                                                   card,
                                                   &mmc_dbg_card_status_fops);
-               if (!md->status_dentry)
-                       return -EIO;
        }
 
        if (mmc_card_mmc(card)) {
                md->ext_csd_dentry =
                        debugfs_create_file("ext_csd", S_IRUSR, root, card,
                                            &mmc_dbg_ext_csd_fops);
-               if (!md->ext_csd_dentry)
-                       return -EIO;
        }
-
-       return 0;
 }
 
 static void mmc_blk_remove_debugfs(struct mmc_card *card,
@@ -2934,22 +2934,17 @@ static void mmc_blk_remove_debugfs(struct mmc_card *card,
        if (!card->debugfs_root)
                return;
 
-       if (!IS_ERR_OR_NULL(md->status_dentry)) {
-               debugfs_remove(md->status_dentry);
-               md->status_dentry = NULL;
-       }
+       debugfs_remove(md->status_dentry);
+       md->status_dentry = NULL;
 
-       if (!IS_ERR_OR_NULL(md->ext_csd_dentry)) {
-               debugfs_remove(md->ext_csd_dentry);
-               md->ext_csd_dentry = NULL;
-       }
+       debugfs_remove(md->ext_csd_dentry);
+       md->ext_csd_dentry = NULL;
 }
 
 #else
 
-static int mmc_blk_add_debugfs(struct mmc_card *card, struct mmc_blk_data *md)
+static void mmc_blk_add_debugfs(struct mmc_card *card, struct mmc_blk_data *md)
 {
-       return 0;
 }
 
 static void mmc_blk_remove_debugfs(struct mmc_card *card,
index cfdd1ff..4edf905 100644 (file)
@@ -53,6 +53,10 @@ struct mmc_fixup {
        unsigned int manfid;
        unsigned short oemid;
 
+       /* Manufacturing date */
+       unsigned short year;
+       unsigned char month;
+
        /* SDIO-specific fields. You can use SDIO_ANY_ID here of course */
        u16 cis_vendor, cis_device;
 
@@ -68,6 +72,8 @@ struct mmc_fixup {
 
 #define CID_MANFID_ANY (-1u)
 #define CID_OEMID_ANY ((unsigned short) -1)
+#define CID_YEAR_ANY ((unsigned short) -1)
+#define CID_MONTH_ANY ((unsigned char) -1)
 #define CID_NAME_ANY (NULL)
 
 #define EXT_CSD_REV_ANY (-1u)
@@ -81,17 +87,21 @@ struct mmc_fixup {
 #define CID_MANFID_APACER       0x27
 #define CID_MANFID_KINGSTON     0x70
 #define CID_MANFID_HYNIX       0x90
+#define CID_MANFID_KINGSTON_SD 0x9F
 #define CID_MANFID_NUMONYX     0xFE
 
 #define END_FIXUP { NULL }
 
-#define _FIXUP_EXT(_name, _manfid, _oemid, _rev_start, _rev_end,       \
-                  _cis_vendor, _cis_device,                            \
-                  _fixup, _data, _ext_csd_rev)                         \
+#define _FIXUP_EXT(_name, _manfid, _oemid, _year, _month,      \
+                  _rev_start, _rev_end,                        \
+                  _cis_vendor, _cis_device,                    \
+                  _fixup, _data, _ext_csd_rev)                 \
        {                                               \
                .name = (_name),                        \
                .manfid = (_manfid),                    \
                .oemid = (_oemid),                      \
+               .year = (_year),                        \
+               .month = (_month),                      \
                .rev_start = (_rev_start),              \
                .rev_end = (_rev_end),                  \
                .cis_vendor = (_cis_vendor),            \
@@ -103,8 +113,8 @@ struct mmc_fixup {
 
 #define MMC_FIXUP_REV(_name, _manfid, _oemid, _rev_start, _rev_end,    \
                      _fixup, _data, _ext_csd_rev)                      \
-       _FIXUP_EXT(_name, _manfid,                                      \
-                  _oemid, _rev_start, _rev_end,                        \
+       _FIXUP_EXT(_name, _manfid, _oemid, CID_YEAR_ANY, CID_MONTH_ANY, \
+                  _rev_start, _rev_end,                                \
                   SDIO_ANY_ID, SDIO_ANY_ID,                            \
                   _fixup, _data, _ext_csd_rev)                         \
 
@@ -118,8 +128,9 @@ struct mmc_fixup {
                      _ext_csd_rev)
 
 #define SDIO_FIXUP(_vendor, _device, _fixup, _data)                    \
-       _FIXUP_EXT(CID_NAME_ANY, CID_MANFID_ANY,                        \
-                   CID_OEMID_ANY, 0, -1ull,                            \
+       _FIXUP_EXT(CID_NAME_ANY, CID_MANFID_ANY, CID_OEMID_ANY,         \
+                  CID_YEAR_ANY, CID_MONTH_ANY,                         \
+                  0, -1ull,                                            \
                   _vendor, _device,                                    \
                   _fixup, _data, EXT_CSD_REV_ANY)                      \
 
@@ -264,4 +275,9 @@ static inline int mmc_card_broken_sd_discard(const struct mmc_card *c)
        return c->quirks & MMC_QUIRK_BROKEN_SD_DISCARD;
 }
 
+static inline int mmc_card_broken_sd_cache(const struct mmc_card *c)
+{
+       return c->quirks & MMC_QUIRK_BROKEN_SD_CACHE;
+}
+
 #endif
index 3d3e0ca..ec4108a 100644 (file)
@@ -2199,10 +2199,8 @@ int mmc_card_alternative_gpt_sector(struct mmc_card *card, sector_t *gpt_sector)
 }
 EXPORT_SYMBOL(mmc_card_alternative_gpt_sector);
 
-void mmc_rescan(struct work_struct *work)
+static void __mmc_rescan(struct mmc_host *host)
 {
-       struct mmc_host *host =
-               container_of(work, struct mmc_host, detect.work);
        int i;
 
        if (host->rescan_disable)
@@ -2274,6 +2272,14 @@ void mmc_rescan(struct work_struct *work)
                mmc_schedule_delayed_work(&host->detect, HZ);
 }
 
+void mmc_rescan(struct work_struct *work)
+{
+       struct mmc_host *host =
+               container_of(work, struct mmc_host, detect.work);
+
+       __mmc_rescan(host);
+}
+
 void mmc_start_host(struct mmc_host *host)
 {
        host->f_init = max(min(freqs[0], host->f_max), host->f_min);
@@ -2286,7 +2292,8 @@ void mmc_start_host(struct mmc_host *host)
        }
 
        mmc_gpiod_request_cd_irq(host);
-       _mmc_detect_change(host, 0, false);
+       host->detect_change = 1;
+       __mmc_rescan(host);
 }
 
 void __mmc_stop_host(struct mmc_host *host)
index 29b9497..32b64b5 100644 (file)
@@ -54,6 +54,15 @@ static const struct mmc_fixup __maybe_unused mmc_blk_fixups[] = {
                  MMC_QUIRK_BLK_NO_CMD23),
 
        /*
+        * Kingston Canvas Go! Plus microSD cards never finish SD cache flush.
+        * This has so far only been observed on cards manufactured in 11/2019,
+        * while newer cards from 05/2023 do not exhibit this behavior.
+        */
+       _FIXUP_EXT("SD64G", CID_MANFID_KINGSTON_SD, 0x5449, 2019, 11,
+                  0, -1ull, SDIO_ANY_ID, SDIO_ANY_ID, add_quirk_sd,
+                  MMC_QUIRK_BROKEN_SD_CACHE, EXT_CSD_REV_ANY),
+
+       /*
         * Some SD cards lockup while using CMD23 multiblock transfers.
         */
        MMC_FIXUP("AF SD", CID_MANFID_ATP, CID_OEMID_ANY, add_quirk_sd,
@@ -101,6 +110,20 @@ static const struct mmc_fixup __maybe_unused mmc_blk_fixups[] = {
                  MMC_QUIRK_TRIM_BROKEN),
 
        /*
+        * Micron MTFC4GACAJCN-1M advertises TRIM but it does not seem to
+        * support being used to offload WRITE_ZEROES.
+        */
+       MMC_FIXUP("Q2J54A", CID_MANFID_MICRON, 0x014e, add_quirk_mmc,
+                 MMC_QUIRK_TRIM_BROKEN),
+
+       /*
+        * Kingston EMMC04G-M627 advertises TRIM but it does not seem to
+        * support being used to offload WRITE_ZEROES.
+        */
+       MMC_FIXUP("M62704", CID_MANFID_KINGSTON, 0x0100, add_quirk_mmc,
+                 MMC_QUIRK_TRIM_BROKEN),
+
+       /*
         * Some SD cards report discard support while they don't
         */
        MMC_FIXUP(CID_NAME_ANY, CID_MANFID_SANDISK_SD, 0x5344, add_quirk_sd,
@@ -209,6 +232,10 @@ static inline void mmc_fixup_device(struct mmc_card *card,
                if (f->of_compatible &&
                    !mmc_fixup_of_compatible_match(card, f->of_compatible))
                        continue;
+               if (f->year != CID_YEAR_ANY && f->year != card->cid.year)
+                       continue;
+               if (f->month != CID_MONTH_ANY && f->month != card->cid.month)
+                       continue;
 
                dev_dbg(&card->dev, "calling %ps\n", f->vendor_fixup);
                f->vendor_fixup(card, f->data);
index 72b664e..246ce02 100644 (file)
@@ -1170,7 +1170,7 @@ static int sd_parse_ext_reg_perf(struct mmc_card *card, u8 fno, u8 page,
                card->ext_perf.feature_support |= SD_EXT_PERF_HOST_MAINT;
 
        /* Cache support at bit 0. */
-       if (reg_buf[4] & BIT(0))
+       if ((reg_buf[4] & BIT(0)) && !mmc_card_broken_sd_cache(card))
                card->ext_perf.feature_support |= SD_EXT_PERF_CACHE;
 
        /* Command queue support indicated via queue depth bits (0 to 4). */
index 9f79389..159a3e9 100644 (file)
@@ -550,7 +550,7 @@ config MMC_SDHCI_MSM
        depends on MMC_SDHCI_PLTFM
        select MMC_SDHCI_IO_ACCESSORS
        select MMC_CQHCI
-       select QCOM_SCM if MMC_CRYPTO
+       select QCOM_INLINE_CRYPTO_ENGINE if MMC_CRYPTO
        help
          This selects the Secure Digital Host Controller Interface (SDHCI)
          support present in Qualcomm SOCs. The controller supports
index ba9387e..1a12e40 100644 (file)
@@ -5,6 +5,7 @@
 #define LINUX_MMC_CQHCI_H
 
 #include <linux/compiler.h>
+#include <linux/bitfield.h>
 #include <linux/bitops.h>
 #include <linux/spinlock_types.h>
 #include <linux/types.h>
@@ -23,6 +24,8 @@
 /* capabilities */
 #define CQHCI_CAP                      0x04
 #define CQHCI_CAP_CS                   0x10000000 /* Crypto Support */
+#define CQHCI_CAP_ITCFMUL              GENMASK(15, 12)
+#define CQHCI_ITCFMUL(x)               FIELD_GET(CQHCI_CAP_ITCFMUL, (x))
 
 /* configuration */
 #define CQHCI_CFG                      0x08
index 10baf12..4747e56 100644 (file)
@@ -52,7 +52,7 @@ static int dw_mci_bluefield_probe(struct platform_device *pdev)
 
 static struct platform_driver dw_mci_bluefield_pltfm_driver = {
        .probe          = dw_mci_bluefield_probe,
-       .remove         = dw_mci_pltfm_remove,
+       .remove_new     = dw_mci_pltfm_remove,
        .driver         = {
                .name           = "dwmmc_bluefield",
                .probe_type     = PROBE_PREFER_ASYNCHRONOUS,
index 0311a37..e8ee7c4 100644 (file)
@@ -470,7 +470,7 @@ static const struct dev_pm_ops dw_mci_k3_dev_pm_ops = {
 
 static struct platform_driver dw_mci_k3_pltfm_driver = {
        .probe          = dw_mci_k3_probe,
-       .remove         = dw_mci_pltfm_remove,
+       .remove_new     = dw_mci_pltfm_remove,
        .driver         = {
                .name           = "dwmmc_k3",
                .probe_type     = PROBE_PREFER_ASYNCHRONOUS,
index 48b7da2..2353fad 100644 (file)
@@ -121,18 +121,17 @@ static int dw_mci_pltfm_probe(struct platform_device *pdev)
        return dw_mci_pltfm_register(pdev, drv_data);
 }
 
-int dw_mci_pltfm_remove(struct platform_device *pdev)
+void dw_mci_pltfm_remove(struct platform_device *pdev)
 {
        struct dw_mci *host = platform_get_drvdata(pdev);
 
        dw_mci_remove(host);
-       return 0;
 }
 EXPORT_SYMBOL_GPL(dw_mci_pltfm_remove);
 
 static struct platform_driver dw_mci_pltfm_driver = {
        .probe          = dw_mci_pltfm_probe,
-       .remove         = dw_mci_pltfm_remove,
+       .remove_new     = dw_mci_pltfm_remove,
        .driver         = {
                .name           = "dw_mmc",
                .probe_type     = PROBE_PREFER_ASYNCHRONOUS,
index 2d50d7d..64cf752 100644 (file)
@@ -10,7 +10,7 @@
 
 extern int dw_mci_pltfm_register(struct platform_device *pdev,
                                const struct dw_mci_drv_data *drv_data);
-extern int dw_mci_pltfm_remove(struct platform_device *pdev);
+extern void dw_mci_pltfm_remove(struct platform_device *pdev);
 extern const struct dev_pm_ops dw_mci_pltfm_pmops;
 
 #endif /* _DW_MMC_PLTFM_H_ */
index dab1508..fd05a64 100644 (file)
@@ -172,7 +172,7 @@ static int dw_mci_starfive_probe(struct platform_device *pdev)
 
 static struct platform_driver dw_mci_starfive_driver = {
        .probe = dw_mci_starfive_probe,
-       .remove = dw_mci_pltfm_remove,
+       .remove_new = dw_mci_pltfm_remove,
        .driver = {
                .name = "dwmmc_starfive",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
index da85c2f..97168cd 100644 (file)
@@ -776,6 +776,11 @@ static void meson_mx_sdhc_init_hw(struct mmc_host *mmc)
        regmap_write(host->regmap, MESON_SDHC_ISTA, MESON_SDHC_ISTA_ALL_IRQS);
 }
 
+static void meson_mx_mmc_free_host(void *data)
+{
+       mmc_free_host(data);
+}
+
 static int meson_mx_sdhc_probe(struct platform_device *pdev)
 {
        struct device *dev = &pdev->dev;
@@ -788,8 +793,7 @@ static int meson_mx_sdhc_probe(struct platform_device *pdev)
        if (!mmc)
                return -ENOMEM;
 
-       ret = devm_add_action_or_reset(dev, (void(*)(void *))mmc_free_host,
-                                      mmc);
+       ret = devm_add_action_or_reset(dev, meson_mx_mmc_free_host, mmc);
        if (ret) {
                dev_err(dev, "Failed to register mmc_free_host action\n");
                return ret;
index 696cbef..769b34a 100644 (file)
@@ -37,6 +37,7 @@
 #include <linux/pinctrl/consumer.h>
 #include <linux/reset.h>
 #include <linux/gpio/consumer.h>
+#include <linux/workqueue.h>
 
 #include <asm/div64.h>
 #include <asm/io.h>
@@ -270,6 +271,7 @@ static struct variant_data variant_stm32_sdmmc = {
        .datactrl_any_blocksz   = true,
        .datactrl_mask_sdio     = MCI_DPSM_ST_SDIOEN,
        .stm32_idmabsize_mask   = GENMASK(12, 5),
+       .stm32_idmabsize_align  = BIT(5),
        .busy_timeout           = true,
        .busy_detect            = true,
        .busy_detect_flag       = MCI_STM32_BUSYD0,
@@ -296,6 +298,35 @@ static struct variant_data variant_stm32_sdmmcv2 = {
        .datactrl_any_blocksz   = true,
        .datactrl_mask_sdio     = MCI_DPSM_ST_SDIOEN,
        .stm32_idmabsize_mask   = GENMASK(16, 5),
+       .stm32_idmabsize_align  = BIT(5),
+       .dma_lli                = true,
+       .busy_timeout           = true,
+       .busy_detect            = true,
+       .busy_detect_flag       = MCI_STM32_BUSYD0,
+       .busy_detect_mask       = MCI_STM32_BUSYD0ENDMASK,
+       .init                   = sdmmc_variant_init,
+};
+
+static struct variant_data variant_stm32_sdmmcv3 = {
+       .fifosize               = 256 * 4,
+       .fifohalfsize           = 128 * 4,
+       .f_max                  = 267000000,
+       .stm32_clkdiv           = true,
+       .cmdreg_cpsm_enable     = MCI_CPSM_STM32_ENABLE,
+       .cmdreg_lrsp_crc        = MCI_CPSM_STM32_LRSP_CRC,
+       .cmdreg_srsp_crc        = MCI_CPSM_STM32_SRSP_CRC,
+       .cmdreg_srsp            = MCI_CPSM_STM32_SRSP,
+       .cmdreg_stop            = MCI_CPSM_STM32_CMDSTOP,
+       .data_cmd_enable        = MCI_CPSM_STM32_CMDTRANS,
+       .irq_pio_mask           = MCI_IRQ_PIO_STM32_MASK,
+       .datactrl_first         = true,
+       .datacnt_useless        = true,
+       .datalength_bits        = 25,
+       .datactrl_blocksz       = 14,
+       .datactrl_any_blocksz   = true,
+       .datactrl_mask_sdio     = MCI_DPSM_ST_SDIOEN,
+       .stm32_idmabsize_mask   = GENMASK(16, 6),
+       .stm32_idmabsize_align  = BIT(6),
        .dma_lli                = true,
        .busy_timeout           = true,
        .busy_detect            = true,
@@ -654,10 +685,52 @@ static u32 ux500v2_get_dctrl_cfg(struct mmci_host *host)
        return MCI_DPSM_ENABLE | (host->data->blksz << 16);
 }
 
-static bool ux500_busy_complete(struct mmci_host *host, u32 status, u32 err_msk)
+static void ux500_busy_clear_mask_done(struct mmci_host *host)
 {
        void __iomem *base = host->base;
 
+       writel(host->variant->busy_detect_mask, base + MMCICLEAR);
+       writel(readl(base + MMCIMASK0) &
+              ~host->variant->busy_detect_mask, base + MMCIMASK0);
+       host->busy_state = MMCI_BUSY_DONE;
+       host->busy_status = 0;
+}
+
+/*
+ * ux500_busy_complete() - wait until the busy status goes off,
+ * saving any status bits that occur in the meantime into
+ * host->busy_status, until we know the card is no longer busy.
+ * The function returns true when busy detection has ended and
+ * we should continue processing the command.
+ *
+ * The Ux500 typically fires two IRQs over a busy cycle like this:
+ *
+ *  DAT0 busy          +-----------------+
+ *                     |                 |
+ *  DAT0 not busy  ----+                 +--------
+ *
+ *                     ^                 ^
+ *                     |                 |
+ *                    IRQ1              IRQ2
+ */
+static bool ux500_busy_complete(struct mmci_host *host, struct mmc_command *cmd,
+                               u32 status, u32 err_msk)
+{
+       void __iomem *base = host->base;
+       int retries = 10;
+
+       if (status & err_msk) {
+               /* Stop any ongoing busy detection if an error occurs */
+               ux500_busy_clear_mask_done(host);
+               goto out_ret_state;
+       }
+
+       /*
+        * The state transitions are encoded in a state machine crossing
+        * the edges in this switch statement.
+        */
+       switch (host->busy_state) {
+
        /*
         * Before unmasking for the busy end IRQ, confirm that the
         * command was sent successfully. To keep track of having a
@@ -667,19 +740,33 @@ static bool ux500_busy_complete(struct mmci_host *host, u32 status, u32 err_msk)
         * Note that, the card may need a couple of clock cycles before
         * it starts signaling busy on DAT0, hence re-read the
         * MMCISTATUS register here, to allow the busy bit to be set.
-        * Potentially we may even need to poll the register for a
-        * while, to allow it to be set, but tests indicates that it
-        * isn't needed.
         */
-       if (!host->busy_status && !(status & err_msk) &&
-           (readl(base + MMCISTATUS) & host->variant->busy_detect_flag)) {
-               writel(readl(base + MMCIMASK0) |
-                      host->variant->busy_detect_mask,
-                      base + MMCIMASK0);
-
+       case MMCI_BUSY_DONE:
+               /*
+                * Save the first status register read to be sure to catch
+                * all bits that may be lost while retrying. If the command
+                * is still busy, this will result in assigning 0 to
+                * host->busy_status, which is what it should be in the
+                * MMCI_BUSY_DONE state.
+                */
                host->busy_status = status & (MCI_CMDSENT | MCI_CMDRESPEND);
-               return false;
-       }
+               while (retries) {
+                       status = readl(base + MMCISTATUS);
+                       /* Keep accumulating status bits */
+                       host->busy_status |= status & (MCI_CMDSENT | MCI_CMDRESPEND);
+                       if (status & host->variant->busy_detect_flag) {
+                               writel(readl(base + MMCIMASK0) |
+                                      host->variant->busy_detect_mask,
+                                      base + MMCIMASK0);
+                               host->busy_state = MMCI_BUSY_WAITING_FOR_START_IRQ;
+                               schedule_delayed_work(&host->ux500_busy_timeout_work,
+                                     msecs_to_jiffies(cmd->busy_timeout));
+                               goto out_ret_state;
+                       }
+                       retries--;
+               }
+               dev_dbg(mmc_dev(host->mmc), "no busy signalling in time\n");
+               ux500_busy_clear_mask_done(host);
+               break;
 
        /*
         * If there is a command in-progress that has been successfully
@@ -692,27 +779,39 @@ static bool ux500_busy_complete(struct mmci_host *host, u32 status, u32 err_msk)
         * both the start and the end interrupts needs to be cleared,
         * one after the other. So, clear the busy start IRQ here.
         */
-       if (host->busy_status &&
-           (status & host->variant->busy_detect_flag)) {
-               writel(host->variant->busy_detect_mask, base + MMCICLEAR);
-               return false;
-       }
+       case MMCI_BUSY_WAITING_FOR_START_IRQ:
+               if (status & host->variant->busy_detect_flag) {
+                       host->busy_status |= status & (MCI_CMDSENT | MCI_CMDRESPEND);
+                       writel(host->variant->busy_detect_mask, base + MMCICLEAR);
+                       host->busy_state = MMCI_BUSY_WAITING_FOR_END_IRQ;
+               } else {
+                       dev_dbg(mmc_dev(host->mmc),
+                               "lost busy status when waiting for busy start IRQ\n");
+                       cancel_delayed_work(&host->ux500_busy_timeout_work);
+                       ux500_busy_clear_mask_done(host);
+               }
+               break;
 
-       /*
-        * If there is a command in-progress that has been successfully
-        * sent and the busy bit isn't set, it means we have received
-        * the busy end IRQ. Clear and mask the IRQ, then continue to
-        * process the command.
-        */
-       if (host->busy_status) {
-               writel(host->variant->busy_detect_mask, base + MMCICLEAR);
+       case MMCI_BUSY_WAITING_FOR_END_IRQ:
+               if (!(status & host->variant->busy_detect_flag)) {
+                       host->busy_status |= status & (MCI_CMDSENT | MCI_CMDRESPEND);
+                       writel(host->variant->busy_detect_mask, base + MMCICLEAR);
+                       cancel_delayed_work(&host->ux500_busy_timeout_work);
+                       ux500_busy_clear_mask_done(host);
+               } else {
+                       dev_dbg(mmc_dev(host->mmc),
+                               "busy status still asserted when handling busy end IRQ - will keep waiting\n");
+               }
+               break;
 
-               writel(readl(base + MMCIMASK0) &
-                      ~host->variant->busy_detect_mask, base + MMCIMASK0);
-               host->busy_status = 0;
+       default:
+               dev_dbg(mmc_dev(host->mmc), "fell through on state %d\n",
+                       host->busy_state);
+               break;
        }
 
-       return true;
+out_ret_state:
+       return (host->busy_state == MMCI_BUSY_DONE);
 }
 
 /*
@@ -1214,6 +1313,7 @@ static void
 mmci_start_command(struct mmci_host *host, struct mmc_command *cmd, u32 c)
 {
        void __iomem *base = host->base;
+       bool busy_resp = cmd->flags & MMC_RSP_BUSY;
        unsigned long long clks;
 
        dev_dbg(mmc_dev(host->mmc), "op %02x arg %08x flags %08x\n",
@@ -1238,10 +1338,14 @@ mmci_start_command(struct mmci_host *host, struct mmc_command *cmd, u32 c)
                        c |= host->variant->cmdreg_srsp;
        }
 
-       if (host->variant->busy_timeout && cmd->flags & MMC_RSP_BUSY) {
-               if (!cmd->busy_timeout)
-                       cmd->busy_timeout = 10 * MSEC_PER_SEC;
+       host->busy_status = 0;
+       host->busy_state = MMCI_BUSY_DONE;
 
+       /* Assign a default timeout if the core does not provide one */
+       if (busy_resp && !cmd->busy_timeout)
+               cmd->busy_timeout = 10 * MSEC_PER_SEC;
+
+       if (busy_resp && host->variant->busy_timeout) {
                if (cmd->busy_timeout > host->mmc->max_busy_timeout)
                        clks = (unsigned long long)host->mmc->max_busy_timeout * host->cclk;
                else
@@ -1382,7 +1486,7 @@ mmci_cmd_irq(struct mmci_host *host, struct mmc_command *cmd,
 
        /* Handle busy detection on DAT0 if the variant supports it. */
        if (busy_resp && host->variant->busy_detect)
-               if (!host->ops->busy_complete(host, status, err_msk))
+               if (!host->ops->busy_complete(host, cmd, status, err_msk))
                        return;
 
        host->cmd = NULL;
@@ -1429,6 +1533,34 @@ mmci_cmd_irq(struct mmci_host *host, struct mmc_command *cmd,
        }
 }
 
+/*
+ * This busy timeout worker is used to "kick" the command IRQ if a
+ * busy detect IRQ fails to appear in reasonable time. Only used on
+ * variants with busy detection IRQ delivery.
+ */
+static void ux500_busy_timeout_work(struct work_struct *work)
+{
+       struct mmci_host *host = container_of(work, struct mmci_host,
+                                       ux500_busy_timeout_work.work);
+       unsigned long flags;
+       u32 status;
+
+       spin_lock_irqsave(&host->lock, flags);
+
+       if (host->cmd) {
+               dev_dbg(mmc_dev(host->mmc), "timeout waiting for busy IRQ\n");
+
+               /* If we are still busy let's tag on a cmd-timeout error. */
+               status = readl(host->base + MMCISTATUS);
+               if (status & host->variant->busy_detect_flag)
+                       status |= MCI_CMDTIMEOUT;
+
+               mmci_cmd_irq(host, host->cmd, status);
+       }
+
+       spin_unlock_irqrestore(&host->lock, flags);
+}
+
 static int mmci_get_rx_fifocnt(struct mmci_host *host, u32 status, int remain)
 {
        return remain - (readl(host->base + MMCIFIFOCNT) << 2);
@@ -2243,6 +2375,10 @@ static int mmci_probe(struct amba_device *dev,
                        goto clk_disable;
        }
 
+       if (host->variant->busy_detect)
+               INIT_DELAYED_WORK(&host->ux500_busy_timeout_work,
+                                 ux500_busy_timeout_work);
+
        writel(MCI_IRQENABLE | variant->start_err, host->base + MMCIMASK0);
 
        amba_set_drvdata(dev, mmc);
@@ -2441,6 +2577,11 @@ static const struct amba_id mmci_ids[] = {
                .mask   = 0xf0ffffff,
                .data   = &variant_stm32_sdmmcv2,
        },
+       {
+               .id     = 0x00353180,
+               .mask   = 0xf0ffffff,
+               .data   = &variant_stm32_sdmmcv3,
+       },
        /* Qualcomm variants */
        {
                .id     = 0x00051180,
@@ -2456,6 +2597,7 @@ static struct amba_driver mmci_driver = {
        .drv            = {
                .name   = DRIVER_NAME,
                .pm     = &mmci_dev_pm_ops,
+               .probe_type = PROBE_PREFER_ASYNCHRONOUS,
        },
        .probe          = mmci_probe,
        .remove         = mmci_remove,
index e1a9b96..253197f 100644 (file)
 #define MCI_STM32_BUSYD0ENDMASK        BIT(21)
 
 #define MMCIMASK1              0x040
+
+/* STM32 sdmmc data FIFO threshold register */
+#define MMCI_STM32_FIFOTHRR    0x044
+#define MMCI_STM32_THR_MASK    GENMASK(3, 0)
+
 #define MMCIFIFOCNT            0x048
 #define MMCIFIFO               0x080 /* to 0x0bc */
 
 #define MMCI_STM32_IDMALLIEN   BIT(1)
 
 #define MMCI_STM32_IDMABSIZER          0x054
-#define MMCI_STM32_IDMABNDT_SHIFT      5
-#define MMCI_STM32_IDMABNDT_MASK       GENMASK(12, 5)
 
 #define MMCI_STM32_IDMABASE0R  0x058
 
@@ -262,6 +265,19 @@ struct dma_chan;
 struct mmci_host;
 
 /**
+ * enum mmci_busy_state - enumerate the busy detect wait states
+ *
+ * This is used by the state machine waiting for the busy detect
+ * interrupts on hardware that signals the start and the end of the
+ * busy detect phase on DAT0 through the same IRQ.
+ */
+enum mmci_busy_state {
+       MMCI_BUSY_WAITING_FOR_START_IRQ,
+       MMCI_BUSY_WAITING_FOR_END_IRQ,
+       MMCI_BUSY_DONE,
+};
+
+/**
  * struct variant_data - MMCI variant-specific quirks
  * @clkreg: default value for MCICLOCK register
  * @clkreg_enable: enable value for MMCICLOCK register
@@ -361,6 +377,7 @@ struct variant_data {
        u32                     opendrain;
        u8                      dma_lli:1;
        u32                     stm32_idmabsize_mask;
+       u32                     stm32_idmabsize_align;
        void (*init)(struct mmci_host *host);
 };
 
@@ -380,7 +397,7 @@ struct mmci_host_ops {
        void (*dma_error)(struct mmci_host *host);
        void (*set_clkreg)(struct mmci_host *host, unsigned int desired);
        void (*set_pwrreg)(struct mmci_host *host, unsigned int pwr);
-       bool (*busy_complete)(struct mmci_host *host, u32 status, u32 err_msk);
+       bool (*busy_complete)(struct mmci_host *host, struct mmc_command *cmd, u32 status, u32 err_msk);
        void (*pre_sig_volt_switch)(struct mmci_host *host);
        int (*post_sig_volt_switch)(struct mmci_host *host, struct mmc_ios *ios);
 };
@@ -409,6 +426,7 @@ struct mmci_host {
        u32                     clk_reg;
        u32                     clk_reg_add;
        u32                     datactrl_reg;
+       enum mmci_busy_state    busy_state;
        u32                     busy_status;
        u32                     mask1_reg;
        u8                      vqmmc_enabled:1;
@@ -437,6 +455,7 @@ struct mmci_host {
        void                    *dma_priv;
 
        s32                     next_cookie;
+       struct delayed_work     ux500_busy_timeout_work;
 };
 
 #define dma_inprogress(host)   ((host)->dma_in_progress)
index 60bca78..35067e1 100644 (file)
@@ -15,7 +15,6 @@
 #include "mmci.h"
 
 #define SDMMC_LLI_BUF_LEN      PAGE_SIZE
-#define SDMMC_IDMA_BURST       BIT(MMCI_STM32_IDMABNDT_SHIFT)
 
 #define DLYB_CR                        0x0
 #define DLYB_CR_DEN            BIT(0)
 #define DLYB_LNG_TIMEOUT_US    1000
 #define SDMMC_VSWEND_TIMEOUT_US 10000
 
+#define SYSCFG_DLYBSD_CR       0x0
+#define DLYBSD_CR_EN           BIT(0)
+#define DLYBSD_CR_RXTAPSEL_MASK        GENMASK(6, 1)
+#define DLYBSD_TAPSEL_NB       32
+#define DLYBSD_BYP_EN          BIT(16)
+#define DLYBSD_BYP_CMD         GENMASK(21, 17)
+#define DLYBSD_ANTIGLITCH_EN   BIT(22)
+
+#define SYSCFG_DLYBSD_SR       0x4
+#define DLYBSD_SR_LOCK         BIT(0)
+#define DLYBSD_SR_RXTAPSEL_ACK BIT(1)
+
+#define DLYBSD_TIMEOUT_1S_IN_US        1000000
+
 struct sdmmc_lli_desc {
        u32 idmalar;
        u32 idmabase;
@@ -48,10 +61,21 @@ struct sdmmc_idma {
        bool use_bounce_buffer;
 };
 
+struct sdmmc_dlyb;
+
+struct sdmmc_tuning_ops {
+       int (*dlyb_enable)(struct sdmmc_dlyb *dlyb);
+       void (*set_input_ck)(struct sdmmc_dlyb *dlyb);
+       int (*tuning_prepare)(struct mmci_host *host);
+       int (*set_cfg)(struct sdmmc_dlyb *dlyb, int unit __maybe_unused,
+                      int phase, bool sampler __maybe_unused);
+};
+
 struct sdmmc_dlyb {
        void __iomem *base;
        u32 unit;
        u32 max;
+       struct sdmmc_tuning_ops *ops;
 };
 
 static int sdmmc_idma_validate_data(struct mmci_host *host,
@@ -69,7 +93,8 @@ static int sdmmc_idma_validate_data(struct mmci_host *host,
        idma->use_bounce_buffer = false;
        for_each_sg(data->sg, sg, data->sg_len - 1, i) {
                if (!IS_ALIGNED(sg->offset, sizeof(u32)) ||
-                   !IS_ALIGNED(sg->length, SDMMC_IDMA_BURST)) {
+                   !IS_ALIGNED(sg->length,
+                               host->variant->stm32_idmabsize_align)) {
                        dev_dbg(mmc_dev(host->mmc),
                                "unaligned scatterlist: ofst:%x length:%d\n",
                                data->sg->offset, data->sg->length);
@@ -293,23 +318,13 @@ static void mmci_sdmmc_set_clkreg(struct mmci_host *host, unsigned int desired)
        clk |= host->clk_reg_add;
        clk |= ddr;
 
-       /*
-        * SDMMC_FBCK is selected when an external Delay Block is needed
-        * with SDR104 or HS200.
-        */
-       if (host->mmc->ios.timing >= MMC_TIMING_UHS_SDR50) {
+       if (host->mmc->ios.timing >= MMC_TIMING_UHS_SDR50)
                clk |= MCI_STM32_CLK_BUSSPEED;
-               if (host->mmc->ios.timing == MMC_TIMING_UHS_SDR104 ||
-                   host->mmc->ios.timing == MMC_TIMING_MMC_HS200) {
-                       clk &= ~MCI_STM32_CLK_SEL_MSK;
-                       clk |= MCI_STM32_CLK_SELFBCK;
-               }
-       }
 
        mmci_write_clkreg(host, clk);
 }
 
-static void sdmmc_dlyb_input_ck(struct sdmmc_dlyb *dlyb)
+static void sdmmc_dlyb_mp15_input_ck(struct sdmmc_dlyb *dlyb)
 {
        if (!dlyb || !dlyb->base)
                return;
@@ -326,7 +341,8 @@ static void mmci_sdmmc_set_pwrreg(struct mmci_host *host, unsigned int pwr)
        /* adds OF options */
        pwr = host->pwr_reg_add;
 
-       sdmmc_dlyb_input_ck(dlyb);
+       if (dlyb && dlyb->ops->set_input_ck)
+               dlyb->ops->set_input_ck(dlyb);
 
        if (ios.power_mode == MMC_POWER_OFF) {
                /* Only a reset could power-off sdmmc */
@@ -371,6 +387,19 @@ static u32 sdmmc_get_dctrl_cfg(struct mmci_host *host)
 
        datactrl = mmci_dctrl_blksz(host);
 
+       if (host->hw_revision >= 3) {
+               u32 thr = 0;
+
+               if (host->mmc->ios.timing == MMC_TIMING_UHS_SDR104 ||
+                   host->mmc->ios.timing == MMC_TIMING_MMC_HS200) {
+                       thr = ffs(min_t(unsigned int, host->data->blksz,
+                                       host->variant->fifosize));
+                       thr = min_t(u32, thr, MMCI_STM32_THR_MASK);
+               }
+
+               writel_relaxed(thr, host->base + MMCI_STM32_FIFOTHRR);
+       }
+
        if (host->mmc->card && mmc_card_sdio(host->mmc->card) &&
            host->data->blocks == 1)
                datactrl |= MCI_DPSM_STM32_MODE_SDIO;
@@ -382,7 +411,8 @@ static u32 sdmmc_get_dctrl_cfg(struct mmci_host *host)
        return datactrl;
 }
 
-static bool sdmmc_busy_complete(struct mmci_host *host, u32 status, u32 err_msk)
+static bool sdmmc_busy_complete(struct mmci_host *host, struct mmc_command *cmd,
+                               u32 status, u32 err_msk)
 {
        void __iomem *base = host->base;
        u32 busy_d0, busy_d0end, mask, sdmmc_status;
@@ -423,8 +453,15 @@ complete:
        return true;
 }
 
-static void sdmmc_dlyb_set_cfgr(struct sdmmc_dlyb *dlyb,
-                               int unit, int phase, bool sampler)
+static int sdmmc_dlyb_mp15_enable(struct sdmmc_dlyb *dlyb)
+{
+       writel_relaxed(DLYB_CR_DEN, dlyb->base + DLYB_CR);
+
+       return 0;
+}
+
+static int sdmmc_dlyb_mp15_set_cfg(struct sdmmc_dlyb *dlyb,
+                                  int unit, int phase, bool sampler)
 {
        u32 cfgr;
 
@@ -436,16 +473,18 @@ static void sdmmc_dlyb_set_cfgr(struct sdmmc_dlyb *dlyb,
 
        if (!sampler)
                writel_relaxed(DLYB_CR_DEN, dlyb->base + DLYB_CR);
+
+       return 0;
 }
 
-static int sdmmc_dlyb_lng_tuning(struct mmci_host *host)
+static int sdmmc_dlyb_mp15_prepare(struct mmci_host *host)
 {
        struct sdmmc_dlyb *dlyb = host->variant_priv;
        u32 cfgr;
        int i, lng, ret;
 
        for (i = 0; i <= DLYB_CFGR_UNIT_MAX; i++) {
-               sdmmc_dlyb_set_cfgr(dlyb, i, DLYB_CFGR_SEL_MAX, true);
+               dlyb->ops->set_cfg(dlyb, i, DLYB_CFGR_SEL_MAX, true);
 
                ret = readl_relaxed_poll_timeout(dlyb->base + DLYB_CFGR, cfgr,
                                                 (cfgr & DLYB_CFGR_LNGF),
@@ -471,14 +510,58 @@ static int sdmmc_dlyb_lng_tuning(struct mmci_host *host)
        return 0;
 }
 
+static int sdmmc_dlyb_mp25_enable(struct sdmmc_dlyb *dlyb)
+{
+       u32 cr, sr;
+
+       cr = readl_relaxed(dlyb->base + SYSCFG_DLYBSD_CR);
+       cr |= DLYBSD_CR_EN;
+
+       writel_relaxed(cr, dlyb->base + SYSCFG_DLYBSD_CR);
+
+       return readl_relaxed_poll_timeout(dlyb->base + SYSCFG_DLYBSD_SR,
+                                          sr, sr & DLYBSD_SR_LOCK, 1,
+                                          DLYBSD_TIMEOUT_1S_IN_US);
+}
+
+static int sdmmc_dlyb_mp25_set_cfg(struct sdmmc_dlyb *dlyb,
+                                  int unit __maybe_unused, int phase,
+                                  bool sampler __maybe_unused)
+{
+       u32 cr, sr;
+
+       cr = readl_relaxed(dlyb->base + SYSCFG_DLYBSD_CR);
+       cr &= ~DLYBSD_CR_RXTAPSEL_MASK;
+       cr |= FIELD_PREP(DLYBSD_CR_RXTAPSEL_MASK, phase);
+
+       writel_relaxed(cr, dlyb->base + SYSCFG_DLYBSD_CR);
+
+       return readl_relaxed_poll_timeout(dlyb->base + SYSCFG_DLYBSD_SR,
+                                         sr, sr & DLYBSD_SR_RXTAPSEL_ACK, 1,
+                                         DLYBSD_TIMEOUT_1S_IN_US);
+}
+
+static int sdmmc_dlyb_mp25_prepare(struct mmci_host *host)
+{
+       struct sdmmc_dlyb *dlyb = host->variant_priv;
+
+       dlyb->max = DLYBSD_TAPSEL_NB;
+
+       return 0;
+}
+
 static int sdmmc_dlyb_phase_tuning(struct mmci_host *host, u32 opcode)
 {
        struct sdmmc_dlyb *dlyb = host->variant_priv;
        int cur_len = 0, max_len = 0, end_of_len = 0;
-       int phase;
+       int phase, ret;
 
        for (phase = 0; phase <= dlyb->max; phase++) {
-               sdmmc_dlyb_set_cfgr(dlyb, dlyb->unit, phase, false);
+               ret = dlyb->ops->set_cfg(dlyb, dlyb->unit, phase, false);
+               if (ret) {
+                       dev_err(mmc_dev(host->mmc), "tuning config failed\n");
+                       return ret;
+               }
 
                if (mmc_send_tuning(host->mmc, opcode, NULL)) {
                        cur_len = 0;
@@ -496,10 +579,15 @@ static int sdmmc_dlyb_phase_tuning(struct mmci_host *host, u32 opcode)
                return -EINVAL;
        }
 
-       writel_relaxed(0, dlyb->base + DLYB_CR);
+       if (dlyb->ops->set_input_ck)
+               dlyb->ops->set_input_ck(dlyb);
 
        phase = end_of_len - max_len / 2;
-       sdmmc_dlyb_set_cfgr(dlyb, dlyb->unit, phase, false);
+       ret = dlyb->ops->set_cfg(dlyb, dlyb->unit, phase, false);
+       if (ret) {
+               dev_err(mmc_dev(host->mmc), "tuning reconfig failed\n");
+               return ret;
+       }
 
        dev_dbg(mmc_dev(host->mmc), "unit:%d max_dly:%d phase:%d\n",
                dlyb->unit, dlyb->max, phase);
@@ -511,12 +599,33 @@ static int sdmmc_execute_tuning(struct mmc_host *mmc, u32 opcode)
 {
        struct mmci_host *host = mmc_priv(mmc);
        struct sdmmc_dlyb *dlyb = host->variant_priv;
+       u32 clk;
+       int ret;
+
+       if ((host->mmc->ios.timing != MMC_TIMING_UHS_SDR104 &&
+            host->mmc->ios.timing != MMC_TIMING_MMC_HS200) ||
+           host->mmc->actual_clock <= 50000000)
+               return 0;
 
        if (!dlyb || !dlyb->base)
                return -EINVAL;
 
-       if (sdmmc_dlyb_lng_tuning(host))
-               return -EINVAL;
+       ret = dlyb->ops->dlyb_enable(dlyb);
+       if (ret)
+               return ret;
+
+       /*
+        * SDMMC_FBCK is selected when an external Delay Block is needed
+        * with SDR104 or HS200.
+        */
+       clk = host->clk_reg;
+       clk &= ~MCI_STM32_CLK_SEL_MSK;
+       clk |= MCI_STM32_CLK_SELFBCK;
+       mmci_write_clkreg(host, clk);
+
+       ret = dlyb->ops->tuning_prepare(host);
+       if (ret)
+               return ret;
 
        return sdmmc_dlyb_phase_tuning(host, opcode);
 }
@@ -574,6 +683,19 @@ static struct mmci_host_ops sdmmc_variant_ops = {
        .post_sig_volt_switch = sdmmc_post_sig_volt_switch,
 };
 
+static struct sdmmc_tuning_ops dlyb_tuning_mp15_ops = {
+       .dlyb_enable = sdmmc_dlyb_mp15_enable,
+       .set_input_ck = sdmmc_dlyb_mp15_input_ck,
+       .tuning_prepare = sdmmc_dlyb_mp15_prepare,
+       .set_cfg = sdmmc_dlyb_mp15_set_cfg,
+};
+
+static struct sdmmc_tuning_ops dlyb_tuning_mp25_ops = {
+       .dlyb_enable = sdmmc_dlyb_mp25_enable,
+       .tuning_prepare = sdmmc_dlyb_mp25_prepare,
+       .set_cfg = sdmmc_dlyb_mp25_set_cfg,
+};
+
 void sdmmc_variant_init(struct mmci_host *host)
 {
        struct device_node *np = host->mmc->parent->of_node;
@@ -592,6 +714,11 @@ void sdmmc_variant_init(struct mmci_host *host)
                return;
 
        dlyb->base = base_dlyb;
+       if (of_device_is_compatible(np, "st,stm32mp25-sdmmc2"))
+               dlyb->ops = &dlyb_tuning_mp25_ops;
+       else
+               dlyb->ops = &dlyb_tuning_mp15_ops;
+
        host->variant_priv = dlyb;
        host->mmc_ops->execute_tuning = sdmmc_execute_tuning;
 }
index 9785ec9..02403ff 100644 (file)
@@ -473,6 +473,7 @@ struct msdc_host {
        struct msdc_tune_para def_tune_para; /* default tune setting */
        struct msdc_tune_para saved_tune_para; /* tune result of CMD21/CMD19 */
        struct cqhci_host *cq_host;
+       u32 cq_ssc1_time;
 };
 
 static const struct mtk_mmc_compatible mt2701_compat = {
@@ -2450,9 +2451,49 @@ static void msdc_hs400_enhanced_strobe(struct mmc_host *mmc,
        }
 }
 
+static void msdc_cqe_cit_cal(struct msdc_host *host, u64 timer_ns)
+{
+       struct mmc_host *mmc = mmc_from_priv(host);
+       struct cqhci_host *cq_host = mmc->cqe_private;
+       u8 itcfmul;
+       u64 hclk_freq, value;
+
+       /*
+        * On MediaTek SoCs the MSDC controller's CQE uses msdc_hclk as ITCFVAL
+        * so we multiply/divide the HCLK frequency by ITCFMUL to calculate the
+        * Send Status Command Idle Timer (CIT) value.
+        */
+       hclk_freq = (u64)clk_get_rate(host->h_clk);
+       itcfmul = CQHCI_ITCFMUL(cqhci_readl(cq_host, CQHCI_CAP));
+       switch (itcfmul) {
+       case 0x0:
+               do_div(hclk_freq, 1000);
+               break;
+       case 0x1:
+               do_div(hclk_freq, 100);
+               break;
+       case 0x2:
+               do_div(hclk_freq, 10);
+               break;
+       case 0x3:
+               break;
+       case 0x4:
+               hclk_freq = hclk_freq * 10;
+               break;
+       default:
+               host->cq_ssc1_time = 0x40;
+               return;
+       }
+
+       value = hclk_freq * timer_ns;
+       do_div(value, 1000000000);
+       host->cq_ssc1_time = value;
+}
+
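A worked example of the conversion in msdc_cqe_cit_cal(): the 273 MHz hclk rate and ITCFMUL = 0x2 below are assumed for illustration only; the real rate comes from clk_get_rate(host->h_clk) and the real ITCFMUL from CQHCI_CAP.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t hclk = 273000000;	/* assumed msdc_hclk rate, SoC-specific   */
	uint64_t timer_ns = 2350;	/* value passed in from msdc_drv_probe()  */

	hclk /= 10;			/* ITCFMUL = 0x2 -> timer runs at hclk/10 */
	printf("CIT = 0x%llx\n",	/* prints 0x40, i.e. roughly 2.35us       */
	       (unsigned long long)(hclk * timer_ns / 1000000000ULL));
	return 0;
}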
 static void msdc_cqe_enable(struct mmc_host *mmc)
 {
        struct msdc_host *host = mmc_priv(mmc);
+       struct cqhci_host *cq_host = mmc->cqe_private;
 
        /* enable cmdq irq */
        writel(MSDC_INT_CMDQ, host->base + MSDC_INTEN);
@@ -2462,6 +2503,9 @@ static void msdc_cqe_enable(struct mmc_host *mmc)
        msdc_set_busy_timeout(host, 20 * 1000000000ULL, 0);
        /* default read data timeout 1s */
        msdc_set_timeout(host, 1000000000ULL, 0);
+
+       /* Set the send status command idle timer */
+       cqhci_writel(cq_host, host->cq_ssc1_time, CQHCI_SSC1);
 }
 
 static void msdc_cqe_disable(struct mmc_host *mmc, bool recovery)
@@ -2707,7 +2751,7 @@ static int msdc_drv_probe(struct platform_device *pdev)
 
        /* Support for SDIO eint irq ? */
        if ((mmc->pm_caps & MMC_PM_WAKE_SDIO_IRQ) && (mmc->pm_caps & MMC_PM_KEEP_POWER)) {
-               host->eint_irq = platform_get_irq_byname(pdev, "sdio_wakeup");
+               host->eint_irq = platform_get_irq_byname_optional(pdev, "sdio_wakeup");
                if (host->eint_irq > 0) {
                        host->pins_eint = pinctrl_lookup_state(host->pinctrl, "state_eint");
                        if (IS_ERR(host->pins_eint)) {
@@ -2803,6 +2847,8 @@ static int msdc_drv_probe(struct platform_device *pdev)
                /* cqhci 16bit length */
                /* 0 size, means 65536 so we don't have to -1 here */
                mmc->max_seg_size = 64 * 1024;
+               /* Reduce CIT to 0x40, which corresponds to 2.35us */
+               msdc_cqe_cit_cal(host, 2350);
        }
 
        ret = devm_request_irq(&pdev->dev, host->irq, msdc_irq,
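A worked example of the msdc_cqe_cit_cal() arithmetic above; the 273 MHz hclk and the ITCFMUL value are illustrative assumptions, not figures taken from the patch:

	/*
	 * hclk = 273 MHz, ITCFMUL = 0x2  ->  timer clock = 273 MHz / 10 = 27.3 MHz
	 * CIT  = 27.3 MHz * 2350 ns / 10^9  ~=  64 = 0x40
	 * i.e. 0x40 ticks of a ~36.6 ns timer clock ~= 2.35 us, which is what the
	 * "Reduce CIT to 0x40" comment in msdc_drv_probe() refers to.
	 */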
index 1877d58..1c935b5 100644 (file)
 #include <linux/pm_opp.h>
 #include <linux/slab.h>
 #include <linux/iopoll.h>
-#include <linux/firmware/qcom/qcom_scm.h>
 #include <linux/regulator/consumer.h>
 #include <linux/interconnect.h>
 #include <linux/pinctrl/consumer.h>
 #include <linux/reset.h>
 
+#include <soc/qcom/ice.h>
+
 #include "sdhci-cqhci.h"
 #include "sdhci-pltfm.h"
 #include "cqhci.h"
@@ -258,12 +259,14 @@ struct sdhci_msm_variant_info {
 struct sdhci_msm_host {
        struct platform_device *pdev;
        void __iomem *core_mem; /* MSM SDCC mapped address */
-       void __iomem *ice_mem;  /* MSM ICE mapped address (if available) */
        int pwr_irq;            /* power irq */
        struct clk *bus_clk;    /* SDHC bus voter clock */
        struct clk *xo_clk;     /* TCXO clk needed for FLL feature of cm_dll*/
-       /* core, iface, cal, sleep, and ice clocks */
-       struct clk_bulk_data bulk_clks[5];
+       /* core, iface, cal and sleep clocks */
+       struct clk_bulk_data bulk_clks[4];
+#ifdef CONFIG_MMC_CRYPTO
+       struct qcom_ice *ice;
+#endif
        unsigned long clk_rate;
        struct mmc_host *mmc;
        bool use_14lpp_dll_reset;
@@ -1804,164 +1807,51 @@ out:
 
 #ifdef CONFIG_MMC_CRYPTO
 
-#define AES_256_XTS_KEY_SIZE                   64
-
-/* QCOM ICE registers */
-
-#define QCOM_ICE_REG_VERSION                   0x0008
-
-#define QCOM_ICE_REG_FUSE_SETTING              0x0010
-#define QCOM_ICE_FUSE_SETTING_MASK             0x1
-#define QCOM_ICE_FORCE_HW_KEY0_SETTING_MASK    0x2
-#define QCOM_ICE_FORCE_HW_KEY1_SETTING_MASK    0x4
-
-#define QCOM_ICE_REG_BIST_STATUS               0x0070
-#define QCOM_ICE_BIST_STATUS_MASK              0xF0000000
-
-#define QCOM_ICE_REG_ADVANCED_CONTROL          0x1000
-
-#define sdhci_msm_ice_writel(host, val, reg)   \
-       writel((val), (host)->ice_mem + (reg))
-#define sdhci_msm_ice_readl(host, reg) \
-       readl((host)->ice_mem + (reg))
-
-static bool sdhci_msm_ice_supported(struct sdhci_msm_host *msm_host)
-{
-       struct device *dev = mmc_dev(msm_host->mmc);
-       u32 regval = sdhci_msm_ice_readl(msm_host, QCOM_ICE_REG_VERSION);
-       int major = regval >> 24;
-       int minor = (regval >> 16) & 0xFF;
-       int step = regval & 0xFFFF;
-
-       /* For now this driver only supports ICE version 3. */
-       if (major != 3) {
-               dev_warn(dev, "Unsupported ICE version: v%d.%d.%d\n",
-                        major, minor, step);
-               return false;
-       }
-
-       dev_info(dev, "Found QC Inline Crypto Engine (ICE) v%d.%d.%d\n",
-                major, minor, step);
-
-       /* If fuses are blown, ICE might not work in the standard way. */
-       regval = sdhci_msm_ice_readl(msm_host, QCOM_ICE_REG_FUSE_SETTING);
-       if (regval & (QCOM_ICE_FUSE_SETTING_MASK |
-                     QCOM_ICE_FORCE_HW_KEY0_SETTING_MASK |
-                     QCOM_ICE_FORCE_HW_KEY1_SETTING_MASK)) {
-               dev_warn(dev, "Fuses are blown; ICE is unusable!\n");
-               return false;
-       }
-       return true;
-}
-
-static inline struct clk *sdhci_msm_ice_get_clk(struct device *dev)
-{
-       return devm_clk_get(dev, "ice");
-}
-
 static int sdhci_msm_ice_init(struct sdhci_msm_host *msm_host,
                              struct cqhci_host *cq_host)
 {
        struct mmc_host *mmc = msm_host->mmc;
        struct device *dev = mmc_dev(mmc);
-       struct resource *res;
+       struct qcom_ice *ice;
 
        if (!(cqhci_readl(cq_host, CQHCI_CAP) & CQHCI_CAP_CS))
                return 0;
 
-       res = platform_get_resource_byname(msm_host->pdev, IORESOURCE_MEM,
-                                          "ice");
-       if (!res) {
-               dev_warn(dev, "ICE registers not found\n");
-               goto disable;
-       }
-
-       if (!qcom_scm_ice_available()) {
-               dev_warn(dev, "ICE SCM interface not found\n");
-               goto disable;
+       ice = of_qcom_ice_get(dev);
+       if (ice == ERR_PTR(-EOPNOTSUPP)) {
+               dev_warn(dev, "Disabling inline encryption support\n");
+               ice = NULL;
        }
 
-       msm_host->ice_mem = devm_ioremap_resource(dev, res);
-       if (IS_ERR(msm_host->ice_mem))
-               return PTR_ERR(msm_host->ice_mem);
-
-       if (!sdhci_msm_ice_supported(msm_host))
-               goto disable;
+       if (IS_ERR_OR_NULL(ice))
+               return PTR_ERR_OR_ZERO(ice);
 
+       msm_host->ice = ice;
        mmc->caps2 |= MMC_CAP2_CRYPTO;
-       return 0;
 
-disable:
-       dev_warn(dev, "Disabling inline encryption support\n");
        return 0;
 }
 
-static void sdhci_msm_ice_low_power_mode_enable(struct sdhci_msm_host *msm_host)
-{
-       u32 regval;
-
-       regval = sdhci_msm_ice_readl(msm_host, QCOM_ICE_REG_ADVANCED_CONTROL);
-       /*
-        * Enable low power mode sequence
-        * [0]-0, [1]-0, [2]-0, [3]-E, [4]-0, [5]-0, [6]-0, [7]-0
-        */
-       regval |= 0x7000;
-       sdhci_msm_ice_writel(msm_host, regval, QCOM_ICE_REG_ADVANCED_CONTROL);
-}
-
-static void sdhci_msm_ice_optimization_enable(struct sdhci_msm_host *msm_host)
+static void sdhci_msm_ice_enable(struct sdhci_msm_host *msm_host)
 {
-       u32 regval;
-
-       /* ICE Optimizations Enable Sequence */
-       regval = sdhci_msm_ice_readl(msm_host, QCOM_ICE_REG_ADVANCED_CONTROL);
-       regval |= 0xD807100;
-       /* ICE HPG requires delay before writing */
-       udelay(5);
-       sdhci_msm_ice_writel(msm_host, regval, QCOM_ICE_REG_ADVANCED_CONTROL);
-       udelay(5);
+       if (msm_host->mmc->caps2 & MMC_CAP2_CRYPTO)
+               qcom_ice_enable(msm_host->ice);
 }
 
-/*
- * Wait until the ICE BIST (built-in self-test) has completed.
- *
- * This may be necessary before ICE can be used.
- *
- * Note that we don't really care whether the BIST passed or failed; we really
- * just want to make sure that it isn't still running.  This is because (a) the
- * BIST is a FIPS compliance thing that never fails in practice, (b) ICE is
- * documented to reject crypto requests if the BIST fails, so we needn't do it
- * in software too, and (c) properly testing storage encryption requires testing
- * the full storage stack anyway, and not relying on hardware-level self-tests.
- */
-static int sdhci_msm_ice_wait_bist_status(struct sdhci_msm_host *msm_host)
+static __maybe_unused int sdhci_msm_ice_resume(struct sdhci_msm_host *msm_host)
 {
-       u32 regval;
-       int err;
+       if (msm_host->mmc->caps2 & MMC_CAP2_CRYPTO)
+               return qcom_ice_resume(msm_host->ice);
 
-       err = readl_poll_timeout(msm_host->ice_mem + QCOM_ICE_REG_BIST_STATUS,
-                                regval, !(regval & QCOM_ICE_BIST_STATUS_MASK),
-                                50, 5000);
-       if (err)
-               dev_err(mmc_dev(msm_host->mmc),
-                       "Timed out waiting for ICE self-test to complete\n");
-       return err;
+       return 0;
 }
 
-static void sdhci_msm_ice_enable(struct sdhci_msm_host *msm_host)
+static __maybe_unused int sdhci_msm_ice_suspend(struct sdhci_msm_host *msm_host)
 {
-       if (!(msm_host->mmc->caps2 & MMC_CAP2_CRYPTO))
-               return;
-       sdhci_msm_ice_low_power_mode_enable(msm_host);
-       sdhci_msm_ice_optimization_enable(msm_host);
-       sdhci_msm_ice_wait_bist_status(msm_host);
-}
+       if (msm_host->mmc->caps2 & MMC_CAP2_CRYPTO)
+               return qcom_ice_suspend(msm_host->ice);
 
-static int __maybe_unused sdhci_msm_ice_resume(struct sdhci_msm_host *msm_host)
-{
-       if (!(msm_host->mmc->caps2 & MMC_CAP2_CRYPTO))
-               return 0;
-       return sdhci_msm_ice_wait_bist_status(msm_host);
+       return 0;
 }
 
 /*
@@ -1972,48 +1862,28 @@ static int sdhci_msm_program_key(struct cqhci_host *cq_host,
                                 const union cqhci_crypto_cfg_entry *cfg,
                                 int slot)
 {
-       struct device *dev = mmc_dev(cq_host->mmc);
+       struct sdhci_host *host = mmc_priv(cq_host->mmc);
+       struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host);
+       struct sdhci_msm_host *msm_host = sdhci_pltfm_priv(pltfm_host);
        union cqhci_crypto_cap_entry cap;
-       union {
-               u8 bytes[AES_256_XTS_KEY_SIZE];
-               u32 words[AES_256_XTS_KEY_SIZE / sizeof(u32)];
-       } key;
-       int i;
-       int err;
-
-       if (!(cfg->config_enable & CQHCI_CRYPTO_CONFIGURATION_ENABLE))
-               return qcom_scm_ice_invalidate_key(slot);
 
        /* Only AES-256-XTS has been tested so far. */
        cap = cq_host->crypto_cap_array[cfg->crypto_cap_idx];
        if (cap.algorithm_id != CQHCI_CRYPTO_ALG_AES_XTS ||
-           cap.key_size != CQHCI_CRYPTO_KEY_SIZE_256) {
-               dev_err_ratelimited(dev,
-                                   "Unhandled crypto capability; algorithm_id=%d, key_size=%d\n",
-                                   cap.algorithm_id, cap.key_size);
+               cap.key_size != CQHCI_CRYPTO_KEY_SIZE_256)
                return -EINVAL;
-       }
 
-       memcpy(key.bytes, cfg->crypto_key, AES_256_XTS_KEY_SIZE);
-
-       /*
-        * The SCM call byte-swaps the 32-bit words of the key.  So we have to
-        * do the same, in order for the final key be correct.
-        */
-       for (i = 0; i < ARRAY_SIZE(key.words); i++)
-               __cpu_to_be32s(&key.words[i]);
-
-       err = qcom_scm_ice_set_key(slot, key.bytes, AES_256_XTS_KEY_SIZE,
-                                  QCOM_SCM_ICE_CIPHER_AES_256_XTS,
-                                  cfg->data_unit_size);
-       memzero_explicit(&key, sizeof(key));
-       return err;
+       if (cfg->config_enable & CQHCI_CRYPTO_CONFIGURATION_ENABLE)
+               return qcom_ice_program_key(msm_host->ice,
+                                           QCOM_ICE_CRYPTO_ALG_AES_XTS,
+                                           QCOM_ICE_CRYPTO_KEY_SIZE_256,
+                                           cfg->crypto_key,
+                                           cfg->data_unit_size, slot);
+       else
+               return qcom_ice_evict_key(msm_host->ice, slot);
 }
+
 #else /* CONFIG_MMC_CRYPTO */
-static inline struct clk *sdhci_msm_ice_get_clk(struct device *dev)
-{
-       return NULL;
-}
 
 static inline int sdhci_msm_ice_init(struct sdhci_msm_host *msm_host,
                                     struct cqhci_host *cq_host)
@@ -2025,11 +1895,17 @@ static inline void sdhci_msm_ice_enable(struct sdhci_msm_host *msm_host)
 {
 }
 
-static inline int __maybe_unused
+static inline __maybe_unused int
 sdhci_msm_ice_resume(struct sdhci_msm_host *msm_host)
 {
        return 0;
 }
+
+static inline __maybe_unused int
+sdhci_msm_ice_suspend(struct sdhci_msm_host *msm_host)
+{
+       return 0;
+}
 #endif /* !CONFIG_MMC_CRYPTO */
 
 /*****************************************************************************\
@@ -2633,11 +2509,6 @@ static int sdhci_msm_probe(struct platform_device *pdev)
                clk = NULL;
        msm_host->bulk_clks[3].clk = clk;
 
-       clk = sdhci_msm_ice_get_clk(&pdev->dev);
-       if (IS_ERR(clk))
-               clk = NULL;
-       msm_host->bulk_clks[4].clk = clk;
-
        ret = clk_bulk_prepare_enable(ARRAY_SIZE(msm_host->bulk_clks),
                                      msm_host->bulk_clks);
        if (ret)
@@ -2830,7 +2701,7 @@ static __maybe_unused int sdhci_msm_runtime_suspend(struct device *dev)
        clk_bulk_disable_unprepare(ARRAY_SIZE(msm_host->bulk_clks),
                                   msm_host->bulk_clks);
 
-       return 0;
+       return sdhci_msm_ice_suspend(msm_host);
 }
 
 static __maybe_unused int sdhci_msm_runtime_resume(struct device *dev)
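The conversion above delegates all ICE register handling to the shared qcom-ice driver; the only subtle part left in sdhci-msm is the optional-resource handling in sdhci_msm_ice_init(). A minimal sketch of that IS_ERR_OR_NULL()/PTR_ERR_OR_ZERO() idiom (illustrative function, not part of the patch):

	static int example_optional_ice(struct device *dev, struct qcom_ice **out)
	{
		struct qcom_ice *ice = of_qcom_ice_get(dev);

		if (IS_ERR_OR_NULL(ice))
			return PTR_ERR_OR_ZERO(ice);	/* 0 when no ICE, error code otherwise */

		*out = ice;	/* only a valid handle ends up enabling MMC_CAP2_CRYPTO */
		return 0;
	}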
index 01975d1..1c2572c 100644 (file)
@@ -1903,6 +1903,7 @@ static const struct pci_device_id pci_ids[] = {
        SDHCI_PCI_DEVICE(GLI, 9750, gl9750),
        SDHCI_PCI_DEVICE(GLI, 9755, gl9755),
        SDHCI_PCI_DEVICE(GLI, 9763E, gl9763e),
+       SDHCI_PCI_DEVICE(GLI, 9767, gl9767),
        SDHCI_PCI_DEVICE_CLASS(AMD, SYSTEM_SDHCI, PCI_CLASS_MASK, amd),
        /* Generic SD host controller */
        {PCI_DEVICE_CLASS(SYSTEM_SDHCI, PCI_CLASS_MASK)},
index 633a8ee..ae8c307 100644 (file)
 #define PCI_GLI_9755_PM_CTRL     0xFC
 #define   PCI_GLI_9755_PM_STATE    GENMASK(1, 0)
 
+#define SDHCI_GLI_9767_GM_BURST_SIZE                   0x510
+#define   SDHCI_GLI_9767_GM_BURST_SIZE_AXI_ALWAYS_SET    BIT(8)
+
+#define PCIE_GLI_9767_VHS      0x884
+#define   GLI_9767_VHS_REV       GENMASK(19, 16)
+#define   GLI_9767_VHS_REV_R     0x0
+#define   GLI_9767_VHS_REV_M     0x1
+#define   GLI_9767_VHS_REV_W     0x2
+
+#define PCIE_GLI_9767_COM_MAILBOX              0x888
+#define   PCIE_GLI_9767_COM_MAILBOX_SSC_EN       BIT(1)
+
+#define PCIE_GLI_9767_CFG              0x8A0
+#define   PCIE_GLI_9767_CFG_LOW_PWR_OFF          BIT(12)
+
+#define PCIE_GLI_9767_COMBO_MUX_CTL                    0x8C8
+#define   PCIE_GLI_9767_COMBO_MUX_CTL_RST_EN             BIT(6)
+#define   PCIE_GLI_9767_COMBO_MUX_CTL_WAIT_PERST_EN      BIT(10)
+
+#define PCIE_GLI_9767_PWR_MACRO_CTL                                    0x8D0
+#define   PCIE_GLI_9767_PWR_MACRO_CTL_LOW_VOLTAGE                        GENMASK(3, 0)
+#define   PCIE_GLI_9767_PWR_MACRO_CTL_LD0_LOW_OUTPUT_VOLTAGE             GENMASK(15, 12)
+#define   PCIE_GLI_9767_PWR_MACRO_CTL_LD0_LOW_OUTPUT_VOLTAGE_VALUE       0x7
+#define   PCIE_GLI_9767_PWR_MACRO_CTL_RCLK_AMPLITUDE_CTL                 GENMASK(29, 28)
+#define   PCIE_GLI_9767_PWR_MACRO_CTL_RCLK_AMPLITUDE_CTL_VALUE           0x3
+
+#define PCIE_GLI_9767_SCR                              0x8E0
+#define   PCIE_GLI_9767_SCR_AUTO_AXI_W_BURST             BIT(6)
+#define   PCIE_GLI_9767_SCR_AUTO_AXI_R_BURST             BIT(7)
+#define   PCIE_GLI_9767_SCR_AXI_REQ                      BIT(9)
+#define   PCIE_GLI_9767_SCR_CARD_DET_PWR_SAVING_EN       BIT(10)
+#define   PCIE_GLI_9767_SCR_SYSTEM_CLK_SELECT_MODE0      BIT(16)
+#define   PCIE_GLI_9767_SCR_SYSTEM_CLK_SELECT_MODE1      BIT(17)
+#define   PCIE_GLI_9767_SCR_CORE_PWR_D3_OFF              BIT(21)
+#define   PCIE_GLI_9767_SCR_CFG_RST_DATA_LINK_DOWN       BIT(30)
+
+#define PCIE_GLI_9767_SDHC_CAP                 0x91C
+#define   PCIE_GLI_9767_SDHC_CAP_SDEI_RESULT     BIT(5)
+
+#define PCIE_GLI_9767_SD_PLL_CTL                       0x938
+#define   PCIE_GLI_9767_SD_PLL_CTL_PLL_LDIV              GENMASK(9, 0)
+#define   PCIE_GLI_9767_SD_PLL_CTL_PLL_PDIV              GENMASK(15, 12)
+#define   PCIE_GLI_9767_SD_PLL_CTL_PLL_DIR_EN            BIT(16)
+#define   PCIE_GLI_9767_SD_PLL_CTL_SSC_EN                BIT(19)
+#define   PCIE_GLI_9767_SD_PLL_CTL_SSC_STEP_SETTING      GENMASK(28, 24)
+
+#define PCIE_GLI_9767_SD_PLL_CTL2              0x93C
+#define   PCIE_GLI_9767_SD_PLL_CTL2_PLLSSC_PPM   GENMASK(31, 16)
+
+#define PCIE_GLI_9767_SD_EXPRESS_CTL                   0x940
+#define   PCIE_GLI_9767_SD_EXPRESS_CTL_SDEI_EXE                  BIT(0)
+#define   PCIE_GLI_9767_SD_EXPRESS_CTL_SD_EXPRESS_MODE   BIT(1)
+
+#define PCIE_GLI_9767_SD_DATA_MULTI_CTL                                0x944
+#define   PCIE_GLI_9767_SD_DATA_MULTI_CTL_DISCONNECT_TIME        GENMASK(23, 16)
+#define   PCIE_GLI_9767_SD_DATA_MULTI_CTL_DISCONNECT_TIME_VALUE          0x64
+
+#define PCIE_GLI_9767_NORMAL_ERR_INT_STATUS_REG2                       0x950
+#define   PCIE_GLI_9767_NORMAL_ERR_INT_STATUS_REG2_SDEI_COMPLETE         BIT(0)
+
+#define PCIE_GLI_9767_NORMAL_ERR_INT_STATUS_EN_REG2                            0x954
+#define   PCIE_GLI_9767_NORMAL_ERR_INT_STATUS_EN_REG2_SDEI_COMPLETE_STATUS_EN    BIT(0)
+
+#define PCIE_GLI_9767_NORMAL_ERR_INT_SIGNAL_EN_REG2                            0x958
+#define   PCIE_GLI_9767_NORMAL_ERR_INT_SIGNAL_EN_REG2_SDEI_COMPLETE_SIGNAL_EN    BIT(0)
+
 #define GLI_MAX_TUNING_LOOP 40
 
 /* Genesys Logic chipset */
@@ -693,6 +759,293 @@ static void gl9755_hw_setting(struct sdhci_pci_slot *slot)
        gl9755_wt_off(pdev);
 }
 
+static inline void gl9767_vhs_read(struct pci_dev *pdev)
+{
+       u32 vhs_enable;
+       u32 vhs_value;
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_VHS, &vhs_value);
+       vhs_enable = FIELD_GET(GLI_9767_VHS_REV, vhs_value);
+
+       if (vhs_enable == GLI_9767_VHS_REV_R)
+               return;
+
+       vhs_value &= ~GLI_9767_VHS_REV;
+       vhs_value |= FIELD_PREP(GLI_9767_VHS_REV, GLI_9767_VHS_REV_R);
+
+       pci_write_config_dword(pdev, PCIE_GLI_9767_VHS, vhs_value);
+}
+
+static inline void gl9767_vhs_write(struct pci_dev *pdev)
+{
+       u32 vhs_enable;
+       u32 vhs_value;
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_VHS, &vhs_value);
+       vhs_enable = FIELD_GET(GLI_9767_VHS_REV, vhs_value);
+
+       if (vhs_enable == GLI_9767_VHS_REV_W)
+               return;
+
+       vhs_value &= ~GLI_9767_VHS_REV;
+       vhs_value |= FIELD_PREP(GLI_9767_VHS_REV, GLI_9767_VHS_REV_W);
+
+       pci_write_config_dword(pdev, PCIE_GLI_9767_VHS, vhs_value);
+}
+
+static bool gl9767_ssc_enable(struct pci_dev *pdev)
+{
+       u32 value;
+       u8 enable;
+
+       gl9767_vhs_write(pdev);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_COM_MAILBOX, &value);
+       enable = FIELD_GET(PCIE_GLI_9767_COM_MAILBOX_SSC_EN, value);
+
+       gl9767_vhs_read(pdev);
+
+       return enable;
+}
+
+static void gl9767_set_ssc(struct pci_dev *pdev, u8 enable, u8 step, u16 ppm)
+{
+       u32 pll;
+       u32 ssc;
+
+       gl9767_vhs_write(pdev);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_SD_PLL_CTL, &pll);
+       pci_read_config_dword(pdev, PCIE_GLI_9767_SD_PLL_CTL2, &ssc);
+       pll &= ~(PCIE_GLI_9767_SD_PLL_CTL_SSC_STEP_SETTING |
+                PCIE_GLI_9767_SD_PLL_CTL_SSC_EN);
+       ssc &= ~PCIE_GLI_9767_SD_PLL_CTL2_PLLSSC_PPM;
+       pll |= FIELD_PREP(PCIE_GLI_9767_SD_PLL_CTL_SSC_STEP_SETTING, step) |
+              FIELD_PREP(PCIE_GLI_9767_SD_PLL_CTL_SSC_EN, enable);
+       ssc |= FIELD_PREP(PCIE_GLI_9767_SD_PLL_CTL2_PLLSSC_PPM, ppm);
+       pci_write_config_dword(pdev, PCIE_GLI_9767_SD_PLL_CTL2, ssc);
+       pci_write_config_dword(pdev, PCIE_GLI_9767_SD_PLL_CTL, pll);
+
+       gl9767_vhs_read(pdev);
+}
+
+static void gl9767_set_pll(struct pci_dev *pdev, u8 dir, u16 ldiv, u8 pdiv)
+{
+       u32 pll;
+
+       gl9767_vhs_write(pdev);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_SD_PLL_CTL, &pll);
+       pll &= ~(PCIE_GLI_9767_SD_PLL_CTL_PLL_LDIV |
+                PCIE_GLI_9767_SD_PLL_CTL_PLL_PDIV |
+                PCIE_GLI_9767_SD_PLL_CTL_PLL_DIR_EN);
+       pll |= FIELD_PREP(PCIE_GLI_9767_SD_PLL_CTL_PLL_LDIV, ldiv) |
+              FIELD_PREP(PCIE_GLI_9767_SD_PLL_CTL_PLL_PDIV, pdiv) |
+              FIELD_PREP(PCIE_GLI_9767_SD_PLL_CTL_PLL_DIR_EN, dir);
+       pci_write_config_dword(pdev, PCIE_GLI_9767_SD_PLL_CTL, pll);
+
+       gl9767_vhs_read(pdev);
+
+       /* wait for pll stable */
+       usleep_range(1000, 1100);
+}
+
+static void gl9767_set_ssc_pll_205mhz(struct pci_dev *pdev)
+{
+       bool enable = gl9767_ssc_enable(pdev);
+
+       /* set pll to 205MHz and ssc */
+       gl9767_set_ssc(pdev, enable, 0x1F, 0xF5C3);
+       gl9767_set_pll(pdev, 0x1, 0x246, 0x0);
+}
+
+static void gl9767_disable_ssc_pll(struct pci_dev *pdev)
+{
+       u32 pll;
+
+       gl9767_vhs_write(pdev);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_SD_PLL_CTL, &pll);
+       pll &= ~(PCIE_GLI_9767_SD_PLL_CTL_PLL_DIR_EN | PCIE_GLI_9767_SD_PLL_CTL_SSC_EN);
+       pci_write_config_dword(pdev, PCIE_GLI_9767_SD_PLL_CTL, pll);
+
+       gl9767_vhs_read(pdev);
+}
+
+static void sdhci_gl9767_set_clock(struct sdhci_host *host, unsigned int clock)
+{
+       struct sdhci_pci_slot *slot = sdhci_priv(host);
+       struct mmc_ios *ios = &host->mmc->ios;
+       struct pci_dev *pdev;
+       u32 value;
+       u16 clk;
+
+       pdev = slot->chip->pdev;
+       host->mmc->actual_clock = 0;
+
+       gl9767_vhs_write(pdev);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_CFG, &value);
+       value |= PCIE_GLI_9767_CFG_LOW_PWR_OFF;
+       pci_write_config_dword(pdev, PCIE_GLI_9767_CFG, value);
+
+       gl9767_disable_ssc_pll(pdev);
+       sdhci_writew(host, 0, SDHCI_CLOCK_CONTROL);
+
+       if (clock == 0)
+               return;
+
+       clk = sdhci_calc_clk(host, clock, &host->mmc->actual_clock);
+       if (clock == 200000000 && ios->timing == MMC_TIMING_UHS_SDR104) {
+               host->mmc->actual_clock = 205000000;
+               gl9767_set_ssc_pll_205mhz(pdev);
+       }
+
+       sdhci_enable_clk(host, clk);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_CFG, &value);
+       value &= ~PCIE_GLI_9767_CFG_LOW_PWR_OFF;
+       pci_write_config_dword(pdev, PCIE_GLI_9767_CFG, value);
+
+       gl9767_vhs_read(pdev);
+}
+
+static void gli_set_9767(struct sdhci_host *host)
+{
+       u32 value;
+
+       value = sdhci_readl(host, SDHCI_GLI_9767_GM_BURST_SIZE);
+       value &= ~SDHCI_GLI_9767_GM_BURST_SIZE_AXI_ALWAYS_SET;
+       sdhci_writel(host, value, SDHCI_GLI_9767_GM_BURST_SIZE);
+}
+
+static void gl9767_hw_setting(struct sdhci_pci_slot *slot)
+{
+       struct pci_dev *pdev = slot->chip->pdev;
+       u32 value;
+
+       gl9767_vhs_write(pdev);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_PWR_MACRO_CTL, &value);
+       value &= ~(PCIE_GLI_9767_PWR_MACRO_CTL_LOW_VOLTAGE |
+                  PCIE_GLI_9767_PWR_MACRO_CTL_LD0_LOW_OUTPUT_VOLTAGE |
+                  PCIE_GLI_9767_PWR_MACRO_CTL_RCLK_AMPLITUDE_CTL);
+
+       value |= PCIE_GLI_9767_PWR_MACRO_CTL_LOW_VOLTAGE |
+                FIELD_PREP(PCIE_GLI_9767_PWR_MACRO_CTL_LD0_LOW_OUTPUT_VOLTAGE,
+                           PCIE_GLI_9767_PWR_MACRO_CTL_LD0_LOW_OUTPUT_VOLTAGE_VALUE) |
+                FIELD_PREP(PCIE_GLI_9767_PWR_MACRO_CTL_RCLK_AMPLITUDE_CTL,
+                           PCIE_GLI_9767_PWR_MACRO_CTL_RCLK_AMPLITUDE_CTL_VALUE);
+       pci_write_config_dword(pdev, PCIE_GLI_9767_PWR_MACRO_CTL, value);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_SCR, &value);
+       value &= ~(PCIE_GLI_9767_SCR_SYSTEM_CLK_SELECT_MODE0 |
+                  PCIE_GLI_9767_SCR_SYSTEM_CLK_SELECT_MODE1 |
+                  PCIE_GLI_9767_SCR_CFG_RST_DATA_LINK_DOWN);
+
+       value |= PCIE_GLI_9767_SCR_AUTO_AXI_W_BURST |
+                PCIE_GLI_9767_SCR_AUTO_AXI_R_BURST |
+                PCIE_GLI_9767_SCR_AXI_REQ |
+                PCIE_GLI_9767_SCR_CARD_DET_PWR_SAVING_EN |
+                PCIE_GLI_9767_SCR_CORE_PWR_D3_OFF;
+       pci_write_config_dword(pdev, PCIE_GLI_9767_SCR, value);
+
+       gl9767_vhs_read(pdev);
+}
+
+static void sdhci_gl9767_reset(struct sdhci_host *host, u8 mask)
+{
+       sdhci_reset(host, mask);
+       gli_set_9767(host);
+}
+
+static int gl9767_init_sd_express(struct mmc_host *mmc, struct mmc_ios *ios)
+{
+       struct sdhci_host *host = mmc_priv(mmc);
+       struct sdhci_pci_slot *slot = sdhci_priv(host);
+       struct pci_dev *pdev;
+       u32 value;
+       int i;
+
+       pdev = slot->chip->pdev;
+
+       if (mmc->ops->get_ro(mmc)) {
+               mmc->ios.timing &= ~(MMC_TIMING_SD_EXP | MMC_TIMING_SD_EXP_1_2V);
+               return 0;
+       }
+
+       gl9767_vhs_write(pdev);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_COMBO_MUX_CTL, &value);
+       value &= ~(PCIE_GLI_9767_COMBO_MUX_CTL_RST_EN | PCIE_GLI_9767_COMBO_MUX_CTL_WAIT_PERST_EN);
+       pci_write_config_dword(pdev, PCIE_GLI_9767_COMBO_MUX_CTL, value);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_SD_DATA_MULTI_CTL, &value);
+       value &= ~PCIE_GLI_9767_SD_DATA_MULTI_CTL_DISCONNECT_TIME;
+       value |= FIELD_PREP(PCIE_GLI_9767_SD_DATA_MULTI_CTL_DISCONNECT_TIME,
+                           PCIE_GLI_9767_SD_DATA_MULTI_CTL_DISCONNECT_TIME_VALUE);
+       pci_write_config_dword(pdev, PCIE_GLI_9767_SD_DATA_MULTI_CTL, value);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_NORMAL_ERR_INT_STATUS_REG2, &value);
+       value |= PCIE_GLI_9767_NORMAL_ERR_INT_STATUS_REG2_SDEI_COMPLETE;
+       pci_write_config_dword(pdev, PCIE_GLI_9767_NORMAL_ERR_INT_STATUS_REG2, value);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_NORMAL_ERR_INT_STATUS_EN_REG2, &value);
+       value |= PCIE_GLI_9767_NORMAL_ERR_INT_STATUS_EN_REG2_SDEI_COMPLETE_STATUS_EN;
+       pci_write_config_dword(pdev, PCIE_GLI_9767_NORMAL_ERR_INT_STATUS_EN_REG2, value);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_NORMAL_ERR_INT_SIGNAL_EN_REG2, &value);
+       value |= PCIE_GLI_9767_NORMAL_ERR_INT_SIGNAL_EN_REG2_SDEI_COMPLETE_SIGNAL_EN;
+       pci_write_config_dword(pdev, PCIE_GLI_9767_NORMAL_ERR_INT_SIGNAL_EN_REG2, value);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_CFG, &value);
+       value |= PCIE_GLI_9767_CFG_LOW_PWR_OFF;
+       pci_write_config_dword(pdev, PCIE_GLI_9767_CFG, value);
+
+       value = sdhci_readw(host, SDHCI_CLOCK_CONTROL);
+       value &= ~(SDHCI_CLOCK_CARD_EN | SDHCI_CLOCK_PLL_EN);
+       sdhci_writew(host, value, SDHCI_CLOCK_CONTROL);
+
+       value = sdhci_readb(host, SDHCI_POWER_CONTROL);
+       value |= (SDHCI_VDD2_POWER_180 | SDHCI_VDD2_POWER_ON);
+       sdhci_writeb(host, value, SDHCI_POWER_CONTROL);
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_SD_EXPRESS_CTL, &value);
+       value |= PCIE_GLI_9767_SD_EXPRESS_CTL_SDEI_EXE;
+       pci_write_config_dword(pdev, PCIE_GLI_9767_SD_EXPRESS_CTL, value);
+
+       for (i = 0; i < 2; i++) {
+               usleep_range(10000, 10100);
+               pci_read_config_dword(pdev, PCIE_GLI_9767_NORMAL_ERR_INT_STATUS_REG2, &value);
+               if (value & PCIE_GLI_9767_NORMAL_ERR_INT_STATUS_REG2_SDEI_COMPLETE) {
+                       pci_write_config_dword(pdev, PCIE_GLI_9767_NORMAL_ERR_INT_STATUS_REG2,
+                                              value);
+                       break;
+               }
+       }
+
+       pci_read_config_dword(pdev, PCIE_GLI_9767_SDHC_CAP, &value);
+       if (value & PCIE_GLI_9767_SDHC_CAP_SDEI_RESULT) {
+               pci_read_config_dword(pdev, PCIE_GLI_9767_SD_EXPRESS_CTL, &value);
+               value |= PCIE_GLI_9767_SD_EXPRESS_CTL_SD_EXPRESS_MODE;
+               pci_write_config_dword(pdev, PCIE_GLI_9767_SD_EXPRESS_CTL, value);
+       } else {
+               mmc->ios.timing &= ~(MMC_TIMING_SD_EXP | MMC_TIMING_SD_EXP_1_2V);
+
+               value = sdhci_readb(host, SDHCI_POWER_CONTROL);
+               value &= ~(SDHCI_VDD2_POWER_180 | SDHCI_VDD2_POWER_ON);
+               sdhci_writeb(host, value, SDHCI_POWER_CONTROL);
+
+               value = sdhci_readw(host, SDHCI_CLOCK_CONTROL);
+               value |= (SDHCI_CLOCK_CARD_EN | SDHCI_CLOCK_PLL_EN);
+               sdhci_writew(host, value, SDHCI_CLOCK_CONTROL);
+       }
+
+       gl9767_vhs_read(pdev);
+
+       return 0;
+}
+
 static int gli_probe_slot_gl9750(struct sdhci_pci_slot *slot)
 {
        struct sdhci_host *host = slot->host;
@@ -717,6 +1070,21 @@ static int gli_probe_slot_gl9755(struct sdhci_pci_slot *slot)
        return 0;
 }
 
+static int gli_probe_slot_gl9767(struct sdhci_pci_slot *slot)
+{
+       struct sdhci_host *host = slot->host;
+
+       gli_set_9767(host);
+       gl9767_hw_setting(slot);
+       gli_pcie_enable_msi(slot);
+       slot->host->mmc->caps2 |= MMC_CAP2_NO_SDIO;
+       host->mmc->caps2 |= MMC_CAP2_SD_EXP;
+       host->mmc_host_ops.init_sd_express = gl9767_init_sd_express;
+       sdhci_enable_v4_mode(host);
+
+       return 0;
+}
+
 static void sdhci_gli_voltage_switch(struct sdhci_host *host)
 {
        /*
@@ -740,6 +1108,25 @@ static void sdhci_gli_voltage_switch(struct sdhci_host *host)
        usleep_range(100000, 110000);
 }
 
+static void sdhci_gl9767_voltage_switch(struct sdhci_host *host)
+{
+       /*
+        * According to Section 3.6.1 signal voltage switch procedure in
+        * SD Host Controller Simplified Spec. 4.20, steps 6~8 are as
+        * follows:
+        * (6) Set 1.8V Signal Enable in the Host Control 2 register.
+        * (7) Wait 5ms. 1.8V voltage regulator shall be stable within this
+        *     period.
+        * (8) If 1.8V Signal Enable is cleared by Host Controller, go to
+        *     step (12).
+        *
+        * Wait 5ms after setting 1.8V Signal Enable in the Host Control 2
+        * register to ensure that the bit has been latched by the GL9767.
+        *
+        */
+       usleep_range(5000, 5500);
+}
+
 static void sdhci_gl9750_reset(struct sdhci_host *host, u8 mask)
 {
        sdhci_reset(host, mask);
@@ -1150,3 +1537,22 @@ const struct sdhci_pci_fixes sdhci_gl9763e = {
 #endif
        .add_host       = gl9763e_add_host,
 };
+
+static const struct sdhci_ops sdhci_gl9767_ops = {
+       .set_clock               = sdhci_gl9767_set_clock,
+       .enable_dma              = sdhci_pci_enable_dma,
+       .set_bus_width           = sdhci_set_bus_width,
+       .reset                   = sdhci_gl9767_reset,
+       .set_uhs_signaling       = sdhci_set_uhs_signaling,
+       .voltage_switch          = sdhci_gl9767_voltage_switch,
+};
+
+const struct sdhci_pci_fixes sdhci_gl9767 = {
+       .quirks         = SDHCI_QUIRK_NO_ENDATTR_IN_NOPDESC,
+       .quirks2        = SDHCI_QUIRK2_BROKEN_DDR50,
+       .probe_slot     = gli_probe_slot_gl9767,
+       .ops            = &sdhci_gl9767_ops,
+#ifdef CONFIG_PM_SLEEP
+       .resume         = sdhci_pci_gli_resume,
+#endif
+};
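A recurring pattern in the GL9767 additions above: the vendor-specific configuration registers are write-protected behind the VHS revision field, so every update is bracketed by gl9767_vhs_write() and gl9767_vhs_read(). A minimal sketch of that bracket (the register touched is illustrative, borrowed from sdhci_gl9767_set_clock()):

	static void example_gl9767_vendor_update(struct pci_dev *pdev)
	{
		u32 value;

		gl9767_vhs_write(pdev);		/* switch VHS to the writable revision */

		pci_read_config_dword(pdev, PCIE_GLI_9767_CFG, &value);
		value |= PCIE_GLI_9767_CFG_LOW_PWR_OFF;	/* vendor-register changes go here */
		pci_write_config_dword(pdev, PCIE_GLI_9767_CFG, value);

		gl9767_vhs_read(pdev);		/* switch VHS back to read-only */
	}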
index 3661a22..9c88639 100644 (file)
@@ -76,6 +76,7 @@
 #define PCI_DEVICE_ID_GLI_9755         0x9755
 #define PCI_DEVICE_ID_GLI_9750         0x9750
 #define PCI_DEVICE_ID_GLI_9763E                0xe763
+#define PCI_DEVICE_ID_GLI_9767         0x9767
 
 /*
  * PCI device class and mask
@@ -195,5 +196,6 @@ extern const struct sdhci_pci_fixes sdhci_o2;
 extern const struct sdhci_pci_fixes sdhci_gl9750;
 extern const struct sdhci_pci_fixes sdhci_gl9755;
 extern const struct sdhci_pci_fixes sdhci_gl9763e;
+extern const struct sdhci_pci_fixes sdhci_gl9767;
 
 #endif /* __SDHCI_PCI_H */
index 3241916..ff41aa5 100644 (file)
@@ -1167,6 +1167,8 @@ static void sdhci_prepare_data(struct sdhci_host *host, struct mmc_command *cmd)
                }
        }
 
+       sdhci_config_dma(host);
+
        if (host->flags & SDHCI_REQ_USE_DMA) {
                int sg_cnt = sdhci_pre_dma_transfer(host, data, COOKIE_MAPPED);
 
@@ -1186,8 +1188,6 @@ static void sdhci_prepare_data(struct sdhci_host *host, struct mmc_command *cmd)
                }
        }
 
-       sdhci_config_dma(host);
-
        if (!(host->flags & SDHCI_REQ_USE_DMA)) {
                int flags;
 
index f4f2085..f219bde 100644 (file)
 #define  SDHCI_POWER_180       0x0A
 #define  SDHCI_POWER_300       0x0C
 #define  SDHCI_POWER_330       0x0E
+/*
+ * VDD2 - UHS2 or PCIe/NVMe
+ * VDD2 power on/off and voltage select
+ */
+#define  SDHCI_VDD2_POWER_ON   0x10
+#define  SDHCI_VDD2_POWER_120  0x80
+#define  SDHCI_VDD2_POWER_180  0xA0
 
 #define SDHCI_BLOCK_GAP_CONTROL        0x2A
 
index 27b0c92..36d8c91 100644 (file)
@@ -204,7 +204,7 @@ static ssize_t mdev_link_device_store(struct config_item *item,
 {
        struct mdev_link *mdev_link = to_mdev_link(item);
 
-       strlcpy(mdev_link->device, page, sizeof(mdev_link->device));
+       strscpy(mdev_link->device, page, sizeof(mdev_link->device));
        strim(mdev_link->device);
        return count;
 }
@@ -219,7 +219,7 @@ static ssize_t mdev_link_channel_store(struct config_item *item,
 {
        struct mdev_link *mdev_link = to_mdev_link(item);
 
-       strlcpy(mdev_link->channel, page, sizeof(mdev_link->channel));
+       strscpy(mdev_link->channel, page, sizeof(mdev_link->channel));
        strim(mdev_link->channel);
        return count;
 }
@@ -234,7 +234,7 @@ static ssize_t mdev_link_comp_store(struct config_item *item,
 {
        struct mdev_link *mdev_link = to_mdev_link(item);
 
-       strlcpy(mdev_link->comp, page, sizeof(mdev_link->comp));
+       strscpy(mdev_link->comp, page, sizeof(mdev_link->comp));
        strim(mdev_link->comp);
        return count;
 }
@@ -250,7 +250,7 @@ static ssize_t mdev_link_comp_params_store(struct config_item *item,
 {
        struct mdev_link *mdev_link = to_mdev_link(item);
 
-       strlcpy(mdev_link->comp_params, page, sizeof(mdev_link->comp_params));
+       strscpy(mdev_link->comp_params, page, sizeof(mdev_link->comp_params));
        strim(mdev_link->comp_params);
        return count;
 }
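The strlcpy() -> strscpy() swap matters because strscpy() never walks the source string beyond what fits in the destination and signals truncation with -E2BIG instead of returning the full source length. A small sketch of the idiom (buffer and function names are illustrative, not from the driver):

	static ssize_t example_copy_name(char *dst, const char *src, size_t size)
	{
		ssize_t len;

		len = strscpy(dst, src, size);	/* copies at most size - 1 bytes plus a NUL */
		if (len == -E2BIG)
			pr_warn("name truncated\n");	/* strlcpy() only hints at this via its return value */

		return len;
	}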
index 54f92d0..11b06fe 100644 (file)
@@ -1,8 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Common Flash Interface support:
  *   Intel Extended Vendor Command Set (ID 0x0001)
  *
- * (C) 2000 Red Hat. GPL'd
+ * (C) 2000 Red Hat.
  *
  *
  * 10/10/2000  Nicolas Pitre <nico@fluxnic.net>
index 67453f5..153fb8d 100644 (file)
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Common Flash Interface support:
  *   AMD & Fujitsu Standard Vendor Command Set (ID 0x0002)
@@ -16,8 +17,6 @@
  * 25/09/2008 Christopher Moore: TopBottom fixup for many Macronix with CFI V1.0
  *
  * Occasionally maintained by Thayne Harbaugh tharbaugh at lnxi dot com
- *
- * This code is GPL
  */
 
 #include <linux/module.h>
index d35df52..60c7f6f 100644 (file)
@@ -1,8 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Common Flash Interface support:
  *   ST Advanced Architecture Command Set (ID 0x0020)
  *
- * (C) 2000 Red Hat. GPL'd
+ * (C) 2000 Red Hat.
  *
  * 10/10/2000  Nicolas Pitre <nico@fluxnic.net>
  *     - completely revamped method functions so they are aware and
index cf42695..a04b617 100644 (file)
@@ -1,6 +1,7 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
    Common Flash Interface probe code.
-   (C) 2000 Red Hat. GPL'd.
+   (C) 2000 Red Hat.
 */
 
 #include <linux/module.h>
index 6a6a2a2..140c69a 100644 (file)
@@ -1,11 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Common Flash Interface support:
  *   Generic utility functions not dependent on command set
  *
  * Copyright (C) 2002 Red Hat
  * Copyright (C) 2003 STMicroelectronics Limited
- *
- * This code is covered by the GPL.
  */
 
 #include <linux/module.h>
index 4d4f978..9e53fcd 100644 (file)
@@ -1,7 +1,7 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Routines common to all CFI-type probes.
  * (C) 2001-2003 Red Hat, Inc.
- * GPL'd
  */
 
 #include <linux/kernel.h>
index 6f7e7e1..23c32fe 100644 (file)
@@ -1,6 +1,7 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
    Common Flash Interface probe code.
-   (C) 2000 Red Hat. GPL'd.
+   (C) 2000 Red Hat.
    See JEDEC (http://www.jedec.org/) standard JESD21C (section 3.5)
    for the standard this probe goes back to.
 
index c37fce9..e8dd649 100644 (file)
@@ -1,6 +1,7 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Common code to handle map devices which are simple RAM
- * (C) 2000 Red Hat. GPL'd.
+ * (C) 2000 Red Hat.
  */
 
 #include <linux/module.h>
index 20e3604..0823b15 100644 (file)
@@ -1,6 +1,7 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Common code to handle map devices which are simple ROM
- * (C) 2000 Red Hat. GPL'd.
+ * (C) 2000 Red Hat.
  */
 
 #include <linux/module.h>
index 4cd37ec..be106dc 100644 (file)
@@ -209,40 +209,34 @@ static void block2mtd_free_device(struct block2mtd_dev *dev)
        if (dev->blkdev) {
                invalidate_mapping_pages(dev->blkdev->bd_inode->i_mapping,
                                        0, -1);
-               blkdev_put(dev->blkdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
+               blkdev_put(dev->blkdev, NULL);
        }
 
        kfree(dev);
 }
 
-
-static struct block2mtd_dev *add_device(char *devname, int erase_size,
-               char *label, int timeout)
+/*
+ * This function is marked __ref because it calls the __init marked
+ * early_lookup_bdev when called from the early boot code.
+ */
+static struct block_device __ref *mdtblock_early_get_bdev(const char *devname,
+               blk_mode_t mode, int timeout, struct block2mtd_dev *dev)
 {
+       struct block_device *bdev = ERR_PTR(-ENODEV);
 #ifndef MODULE
        int i;
-#endif
-       const fmode_t mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
-       struct block_device *bdev;
-       struct block2mtd_dev *dev;
-       char *name;
 
-       if (!devname)
-               return NULL;
-
-       dev = kzalloc(sizeof(struct block2mtd_dev), GFP_KERNEL);
-       if (!dev)
-               return NULL;
-
-       /* Get a handle on the device */
-       bdev = blkdev_get_by_path(devname, mode, dev);
+       /*
+        * We can't use early_lookup_bdev from a running system.
+        */
+       if (system_state >= SYSTEM_RUNNING)
+               return bdev;
 
-#ifndef MODULE
        /*
         * We might not have the root device mounted at this point.
         * Try to resolve the device name by other means.
         */
-       for (i = 0; IS_ERR(bdev) && i <= timeout; i++) {
+       for (i = 0; i <= timeout; i++) {
                dev_t devt;
 
                if (i)
@@ -254,13 +248,35 @@ static struct block2mtd_dev *add_device(char *devname, int erase_size,
                        msleep(1000);
                wait_for_device_probe();
 
-               devt = name_to_dev_t(devname);
-               if (!devt)
-                       continue;
-               bdev = blkdev_get_by_dev(devt, mode, dev);
+               if (!early_lookup_bdev(devname, &devt)) {
+                       bdev = blkdev_get_by_dev(devt, mode, dev, NULL);
+                       if (!IS_ERR(bdev))
+                               break;
+               }
        }
 #endif
+       return bdev;
+}
 
+static struct block2mtd_dev *add_device(char *devname, int erase_size,
+               char *label, int timeout)
+{
+       const blk_mode_t mode = BLK_OPEN_READ | BLK_OPEN_WRITE;
+       struct block_device *bdev;
+       struct block2mtd_dev *dev;
+       char *name;
+
+       if (!devname)
+               return NULL;
+
+       dev = kzalloc(sizeof(struct block2mtd_dev), GFP_KERNEL);
+       if (!dev)
+               return NULL;
+
+       /* Get a handle on the device */
+       bdev = blkdev_get_by_path(devname, mode, dev, NULL);
+       if (IS_ERR(bdev))
+               bdev = mdtblock_early_get_bdev(devname, mode, timeout, dev);
        if (IS_ERR(bdev)) {
                pr_err("error: cannot open device %s\n", devname);
                goto err_free_block2mtd;
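Condensed, the open path after this change works as follows (this merely restates the code above; the blk_holder_ops argument stays NULL exactly as in the patch):

	/*
	 * add_device(devname, ...):
	 *   bdev = blkdev_get_by_path(devname, BLK_OPEN_READ | BLK_OPEN_WRITE, dev, NULL);
	 *   if (IS_ERR(bdev))
	 *           bdev = mdtblock_early_get_bdev(devname, mode, timeout, dev);
	 *               -> only before SYSTEM_RUNNING: retry early_lookup_bdev() +
	 *                  blkdev_get_by_dev() for up to 'timeout' seconds while
	 *                  late-probing block devices appear;
	 *   if (IS_ERR(bdev))
	 *           fail with "cannot open device".
	 */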
index 54861d8..3dbb1aa 100644 (file)
@@ -2046,34 +2046,26 @@ static int stfsm_probe(struct platform_device *pdev)
                return PTR_ERR(fsm->base);
        }
 
-       fsm->clk = devm_clk_get(&pdev->dev, NULL);
+       fsm->clk = devm_clk_get_enabled(&pdev->dev, NULL);
        if (IS_ERR(fsm->clk)) {
                dev_err(fsm->dev, "Couldn't find EMI clock.\n");
                return PTR_ERR(fsm->clk);
        }
 
-       ret = clk_prepare_enable(fsm->clk);
-       if (ret) {
-               dev_err(fsm->dev, "Failed to enable EMI clock.\n");
-               return ret;
-       }
-
        mutex_init(&fsm->lock);
 
        ret = stfsm_init(fsm);
        if (ret) {
                dev_err(&pdev->dev, "Failed to initialise FSM Controller\n");
-               goto err_clk_unprepare;
+               return ret;
        }
 
        stfsm_fetch_platform_configs(pdev);
 
        /* Detect SPI FLASH device */
        info = stfsm_jedec_probe(fsm);
-       if (!info) {
-               ret = -ENODEV;
-               goto err_clk_unprepare;
-       }
+       if (!info)
+               return -ENODEV;
        fsm->info = info;
 
        /* Use device size to determine address width */
@@ -2089,7 +2081,7 @@ static int stfsm_probe(struct platform_device *pdev)
        else
                ret = stfsm_prepare_rwe_seqs_default(fsm);
        if (ret)
-               goto err_clk_unprepare;
+               return ret;
 
        fsm->mtd.name           = info->name;
        fsm->mtd.dev.parent     = &pdev->dev;
@@ -2112,13 +2104,7 @@ static int stfsm_probe(struct platform_device *pdev)
                (long long)fsm->mtd.size, (long long)(fsm->mtd.size >> 20),
                fsm->mtd.erasesize, (fsm->mtd.erasesize >> 10));
 
-       ret = mtd_device_register(&fsm->mtd, NULL, 0);
-       if (ret) {
-err_clk_unprepare:
-               clk_disable_unprepare(fsm->clk);
-       }
-
-       return ret;
+       return mtd_device_register(&fsm->mtd, NULL, 0);
 }
 
 static int stfsm_remove(struct platform_device *pdev)
@@ -2127,8 +2113,6 @@ static int stfsm_remove(struct platform_device *pdev)
 
        WARN_ON(mtd_device_unregister(&fsm->mtd));
 
-       clk_disable_unprepare(fsm->clk);
-
        return 0;
 }
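The stfsm cleanup above leans on devm_clk_get_enabled(), which combines devm_clk_get() with clk_prepare_enable() and registers the matching disable/unprepare as a devres action, so none of the removed clk_disable_unprepare() calls are needed in error paths or in remove(). A minimal sketch of the idiom (function name is illustrative; the usual kernel clk/platform headers are assumed):

	static int example_probe(struct platform_device *pdev)
	{
		struct clk *clk;

		/* get + prepare + enable; undone automatically when the device detaches */
		clk = devm_clk_get_enabled(&pdev->dev, NULL);
		if (IS_ERR(clk))
			return PTR_ERR(clk);

		/* ... rest of probe, no manual clock unwinding required ... */
		return 0;
	}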
 
index 3e0fff3..ecf6892 100644 (file)
@@ -259,7 +259,7 @@ static struct i2c_driver pismo_driver = {
        .driver = {
                .name   = "pismo",
        },
-       .probe_new      = pismo_probe,
+       .probe          = pismo_probe,
        .remove         = pismo_remove,
        .id_table       = pismo_id,
 };
index 60b2227..ff18636 100644 (file)
@@ -182,9 +182,9 @@ static blk_status_t mtd_queue_rq(struct blk_mq_hw_ctx *hctx,
        return BLK_STS_OK;
 }
 
-static int blktrans_open(struct block_device *bdev, fmode_t mode)
+static int blktrans_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct mtd_blktrans_dev *dev = bdev->bd_disk->private_data;
+       struct mtd_blktrans_dev *dev = disk->private_data;
        int ret = 0;
 
        kref_get(&dev->ref);
@@ -208,7 +208,7 @@ static int blktrans_open(struct block_device *bdev, fmode_t mode)
        ret = __get_mtd_device(dev->mtd);
        if (ret)
                goto error_release;
-       dev->file_mode = mode;
+       dev->writable = mode & BLK_OPEN_WRITE;
 
 unlock:
        dev->open++;
@@ -225,7 +225,7 @@ error_put:
        return ret;
 }
 
-static void blktrans_release(struct gendisk *disk, fmode_t mode)
+static void blktrans_release(struct gendisk *disk)
 {
        struct mtd_blktrans_dev *dev = disk->private_data;
 
index a0a1194..fa476fb 100644 (file)
@@ -294,7 +294,7 @@ static void mtdblock_release(struct mtd_blktrans_dev *mbd)
                 * It was the last usage. Free the cache, but only sync if
                 * opened for writing.
                 */
-               if (mbd->file_mode & FMODE_WRITE)
+               if (mbd->writable)
                        mtd_sync(mbd->mtd);
                vfree(mtdblk->cache_data);
        }
index 60670b2..e00b12a 100644 (file)
@@ -23,6 +23,7 @@
 #include <linux/idr.h>
 #include <linux/backing-dev.h>
 #include <linux/gfp.h>
+#include <linux/random.h>
 #include <linux/slab.h>
 #include <linux/reboot.h>
 #include <linux/leds.h>
@@ -966,6 +967,26 @@ static int mtd_otp_nvmem_add(struct mtd_info *mtd)
                }
 
                if (size > 0) {
+                       /*
+                        * The factory OTP contains things such as a unique serial
+                        * number and is small, so let's read it out and put it
+                        * into the entropy pool.
+                        */
+                       void *otp;
+
+                       otp = kmalloc(size, GFP_KERNEL);
+                       if (!otp) {
+                               err = -ENOMEM;
+                               goto err;
+                       }
+                       err = mtd_nvmem_fact_otp_reg_read(mtd, 0, otp, size);
+                       if (err < 0) {
+                               kfree(otp);
+                               goto err;
+                       }
+                       add_device_randomness(otp, err);
+                       kfree(otp);
+
                        nvmem = mtd_otp_nvmem_register(mtd, "factory-otp", size,
                                                       mtd_nvmem_fact_otp_reg_read);
                        if (IS_ERR(nvmem)) {
index 85f5ee6..a46affb 100644 (file)
@@ -326,7 +326,6 @@ static int __mtd_del_partition(struct mtd_info *mtd)
 static int __del_mtd_partitions(struct mtd_info *mtd)
 {
        struct mtd_info *child, *next;
-       LIST_HEAD(tmp_list);
        int ret, err = 0;
 
        list_for_each_entry_safe(child, next, &mtd->partitions, part.node) {
index 917cdfb..d93e861 100644 (file)
@@ -67,5 +67,6 @@ nand-objs += nand_esmt.o
 nand-objs += nand_hynix.o
 nand-objs += nand_macronix.o
 nand-objs += nand_micron.o
+nand-objs += nand_sandisk.o
 nand-objs += nand_samsung.o
 nand-objs += nand_toshiba.o
index d513d2d..906eef7 100644 (file)
@@ -973,21 +973,6 @@ static int anfc_setup_interface(struct nand_chip *chip, int target,
                nvddr = nand_get_nvddr_timings(conf);
                if (IS_ERR(nvddr))
                        return PTR_ERR(nvddr);
-
-               /*
-                * The controller only supports data payload requests which are
-                * a multiple of 4. In practice, most data accesses are 4-byte
-                * aligned and this is not an issue. However, rounding up will
-                * simply be refused by the controller if we reached the end of
-                * the device *and* we are using the NV-DDR interface(!). In
-                * this situation, unaligned data requests ending at the device
-                * boundary will confuse the controller and cannot be performed.
-                *
-                * This is something that happens in nand_read_subpage() when
-                * selecting software ECC support and must be avoided.
-                */
-               if (chip->ecc.engine_type == NAND_ECC_ENGINE_TYPE_SOFT)
-                       return -ENOTSUPP;
        } else {
                sdr = nand_get_sdr_timings(conf);
                if (IS_ERR(sdr))
index 7016e0f..e9932da 100644 (file)
@@ -73,6 +73,7 @@ extern const struct nand_manufacturer_ops hynix_nand_manuf_ops;
 extern const struct nand_manufacturer_ops macronix_nand_manuf_ops;
 extern const struct nand_manufacturer_ops micron_nand_manuf_ops;
 extern const struct nand_manufacturer_ops samsung_nand_manuf_ops;
+extern const struct nand_manufacturer_ops sandisk_nand_manuf_ops;
 extern const struct nand_manufacturer_ops toshiba_nand_manuf_ops;
 
 /* MLC pairing schemes */
index 1feea7d..d3faf80 100644 (file)
@@ -38,6 +38,7 @@
 #define NFC_CMD_SCRAMBLER_DISABLE      0
 #define NFC_CMD_SHORTMODE_DISABLE      0
 #define NFC_CMD_RB_INT         BIT(14)
+#define NFC_CMD_RB_INT_NO_PIN  ((0xb << 10) | BIT(18) | BIT(16))
 
 #define NFC_CMD_GET_SIZE(x)    (((x) >> 22) & GENMASK(4, 0))
 
@@ -76,6 +77,7 @@
 #define GENCMDIADDRH(aih, addr)                ((aih) | (((addr) >> 16) & 0xffff))
 
 #define DMA_DIR(dir)           ((dir) ? NFC_CMD_N2M : NFC_CMD_M2N)
+#define DMA_ADDR_ALIGN         8
 
 #define ECC_CHECK_RETURN_FF    (-1)
 
 
 #define PER_INFO_BYTE          8
 
+#define NFC_CMD_RAW_LEN        GENMASK(13, 0)
+
+#define NFC_COLUMN_ADDR_0      0
+#define NFC_COLUMN_ADDR_1      0
+
 struct meson_nfc_nand_chip {
        struct list_head node;
        struct nand_chip nand;
@@ -179,6 +186,7 @@ struct meson_nfc {
        u32 info_bytes;
 
        unsigned long assigned_cs;
+       bool no_rb_pin;
 };
 
 enum {
@@ -280,7 +288,7 @@ static void meson_nfc_cmd_access(struct nand_chip *nand, int raw, bool dir,
 
        if (raw) {
                len = mtd->writesize + mtd->oobsize;
-               cmd = (len & GENMASK(13, 0)) | scrambler | DMA_DIR(dir);
+               cmd = len | scrambler | DMA_DIR(dir);
                writel(cmd, nfc->reg_base + NFC_REG_CMD);
                return;
        }
@@ -392,7 +400,42 @@ static void meson_nfc_set_data_oob(struct nand_chip *nand,
        }
 }
 
-static int meson_nfc_queue_rb(struct meson_nfc *nfc, int timeout_ms)
+static int meson_nfc_wait_no_rb_pin(struct meson_nfc *nfc, int timeout_ms,
+                                   bool need_cmd_read0)
+{
+       u32 cmd, cfg;
+
+       meson_nfc_cmd_idle(nfc, nfc->timing.twb);
+       meson_nfc_drain_cmd(nfc);
+       meson_nfc_wait_cmd_finish(nfc, CMD_FIFO_EMPTY_TIMEOUT);
+
+       cfg = readl(nfc->reg_base + NFC_REG_CFG);
+       cfg |= NFC_RB_IRQ_EN;
+       writel(cfg, nfc->reg_base + NFC_REG_CFG);
+
+       reinit_completion(&nfc->completion);
+       cmd = nfc->param.chip_select | NFC_CMD_CLE | NAND_CMD_STATUS;
+       writel(cmd, nfc->reg_base + NFC_REG_CMD);
+
+       /* use the max erase time as the maximum clock for waiting R/B */
+       cmd = NFC_CMD_RB | NFC_CMD_RB_INT_NO_PIN | nfc->timing.tbers_max;
+       writel(cmd, nfc->reg_base + NFC_REG_CMD);
+
+       if (!wait_for_completion_timeout(&nfc->completion,
+                                        msecs_to_jiffies(timeout_ms)))
+               return -ETIMEDOUT;
+
+       if (need_cmd_read0) {
+               cmd = nfc->param.chip_select | NFC_CMD_CLE | NAND_CMD_READ0;
+               writel(cmd, nfc->reg_base + NFC_REG_CMD);
+               meson_nfc_drain_cmd(nfc);
+               meson_nfc_wait_cmd_finish(nfc, CMD_FIFO_EMPTY_TIMEOUT);
+       }
+
+       return 0;
+}
+
+static int meson_nfc_wait_rb_pin(struct meson_nfc *nfc, int timeout_ms)
 {
        u32 cmd, cfg;
        int ret = 0;
@@ -420,6 +463,27 @@ static int meson_nfc_queue_rb(struct meson_nfc *nfc, int timeout_ms)
        return ret;
 }
 
+static int meson_nfc_queue_rb(struct meson_nfc *nfc, int timeout_ms,
+                             bool need_cmd_read0)
+{
+       if (nfc->no_rb_pin) {
+               /* This mode is used when there is no wired R/B pin.
+                * It works like nand_soft_waitrdy(), but instead of
+                * polling the NAND_CMD_STATUS bit in a software loop,
+                * it waits for an interrupt: the controller watches
+                * the IO bus and raises an interrupt when it detects
+                * NAND_CMD_STATUS on it. After the interrupt,
+                * NAND_CMD_READ0 is sent to terminate the ready-wait
+                * procedure if needed (for all cases except page
+                * programming - hence the 'need_cmd_read0' flag).
+                */
+               return meson_nfc_wait_no_rb_pin(nfc, timeout_ms,
+                                               need_cmd_read0);
+       } else {
+               return meson_nfc_wait_rb_pin(nfc, timeout_ms);
+       }
+}
+
 static void meson_nfc_set_user_byte(struct nand_chip *nand, u8 *oob_buf)
 {
        struct meson_nfc_nand_chip *meson_chip = to_meson_nand(nand);
@@ -544,7 +608,7 @@ static int meson_nfc_read_buf(struct nand_chip *nand, u8 *buf, int len)
        if (ret)
                goto out;
 
-       cmd = NFC_CMD_N2M | (len & GENMASK(13, 0));
+       cmd = NFC_CMD_N2M | len;
        writel(cmd, nfc->reg_base + NFC_REG_CMD);
 
        meson_nfc_drain_cmd(nfc);
@@ -568,7 +632,7 @@ static int meson_nfc_write_buf(struct nand_chip *nand, u8 *buf, int len)
        if (ret)
                return ret;
 
-       cmd = NFC_CMD_M2N | (len & GENMASK(13, 0));
+       cmd = NFC_CMD_M2N | len;
        writel(cmd, nfc->reg_base + NFC_REG_CMD);
 
        meson_nfc_drain_cmd(nfc);
@@ -595,12 +659,12 @@ static int meson_nfc_rw_cmd_prepare_and_execute(struct nand_chip *nand,
        cmd0 = in ? NAND_CMD_READ0 : NAND_CMD_SEQIN;
        nfc->cmdfifo.rw.cmd0 = cs | NFC_CMD_CLE | cmd0;
 
-       addrs[0] = cs | NFC_CMD_ALE | 0;
+       addrs[0] = cs | NFC_CMD_ALE | NFC_COLUMN_ADDR_0;
        if (mtd->writesize <= 512) {
                cmd_num--;
                row_start = 1;
        } else {
-               addrs[1] = cs | NFC_CMD_ALE | 0;
+               addrs[1] = cs | NFC_CMD_ALE | NFC_COLUMN_ADDR_1;
                row_start = 2;
        }
 
@@ -623,7 +687,7 @@ static int meson_nfc_rw_cmd_prepare_and_execute(struct nand_chip *nand,
        if (in) {
                nfc->cmdfifo.rw.cmd1 = cs | NFC_CMD_CLE | NAND_CMD_READSTART;
                writel(nfc->cmdfifo.rw.cmd1, nfc->reg_base + NFC_REG_CMD);
-               meson_nfc_queue_rb(nfc, PSEC_TO_MSEC(sdr->tR_max));
+               meson_nfc_queue_rb(nfc, PSEC_TO_MSEC(sdr->tR_max), true);
        } else {
                meson_nfc_cmd_idle(nfc, nfc->timing.tadl);
        }
@@ -669,7 +733,7 @@ static int meson_nfc_write_page_sub(struct nand_chip *nand,
 
        cmd = nfc->param.chip_select | NFC_CMD_CLE | NAND_CMD_PAGEPROG;
        writel(cmd, nfc->reg_base + NFC_REG_CMD);
-       meson_nfc_queue_rb(nfc, PSEC_TO_MSEC(sdr->tPROG_max));
+       meson_nfc_queue_rb(nfc, PSEC_TO_MSEC(sdr->tPROG_max), false);
 
        meson_nfc_dma_buffer_release(nand, data_len, info_len, DMA_TO_DEVICE);
 
@@ -842,6 +906,9 @@ static int meson_nfc_read_oob(struct nand_chip *nand, int page)
 
 static bool meson_nfc_is_buffer_dma_safe(const void *buffer)
 {
+       if ((uintptr_t)buffer % DMA_ADDR_ALIGN)
+               return false;
+
        if (virt_addr_valid(buffer) && (!object_is_on_stack(buffer)))
                return true;
        return false;
@@ -899,6 +966,31 @@ meson_nand_op_put_dma_safe_output_buf(const struct nand_op_instr *instr,
                kfree(buf);
 }
 
+static int meson_nfc_check_op(struct nand_chip *chip,
+                             const struct nand_operation *op)
+{
+       int op_id;
+
+       for (op_id = 0; op_id < op->ninstrs; op_id++) {
+               const struct nand_op_instr *instr;
+
+               instr = &op->instrs[op_id];
+
+               switch (instr->type) {
+               case NAND_OP_DATA_IN_INSTR:
+               case NAND_OP_DATA_OUT_INSTR:
+                       if (instr->ctx.data.len > NFC_CMD_RAW_LEN)
+                               return -ENOTSUPP;
+
+                       break;
+               default:
+                       break;
+               }
+       }
+
+       return 0;
+}
+
 static int meson_nfc_exec_op(struct nand_chip *nand,
                             const struct nand_operation *op, bool check_only)
 {
@@ -907,8 +999,13 @@ static int meson_nfc_exec_op(struct nand_chip *nand,
        const struct nand_op_instr *instr = NULL;
        void *buf;
        u32 op_id, delay_idle, cmd;
+       int err;
        int i;
 
+       err = meson_nfc_check_op(nand, op);
+       if (err)
+               return err;
+
        if (check_only)
                return 0;
 
@@ -952,7 +1049,8 @@ static int meson_nfc_exec_op(struct nand_chip *nand,
                        break;
 
                case NAND_OP_WAITRDY_INSTR:
-                       meson_nfc_queue_rb(nfc, instr->ctx.waitrdy.timeout_ms);
+                       meson_nfc_queue_rb(nfc, instr->ctx.waitrdy.timeout_ms,
+                                          true);
                        if (instr->delay_ns)
                                meson_nfc_cmd_idle(nfc, delay_idle);
                        break;
@@ -1181,6 +1279,7 @@ static int meson_nand_attach_chip(struct nand_chip *nand)
        struct meson_nfc_nand_chip *meson_chip = to_meson_nand(nand);
        struct mtd_info *mtd = nand_to_mtd(nand);
        int nsectors = mtd->writesize / 1024;
+       int raw_writesize;
        int ret;
 
        if (!mtd->name) {
@@ -1192,6 +1291,13 @@ static int meson_nand_attach_chip(struct nand_chip *nand)
                        return -ENOMEM;
        }
 
+       raw_writesize = mtd->writesize + mtd->oobsize;
+       if (raw_writesize > NFC_CMD_RAW_LEN) {
+               dev_err(nfc->dev, "too big write size in raw mode: %d > %ld\n",
+                       raw_writesize, NFC_CMD_RAW_LEN);
+               return -EINVAL;
+       }
+
        if (nand->bbt_options & NAND_BBT_USE_FLASH)
                nand->bbt_options |= NAND_BBT_NO_OOB;
 
@@ -1248,6 +1354,7 @@ meson_nfc_nand_chip_init(struct device *dev,
        struct mtd_info *mtd;
        int ret, i;
        u32 tmp, nsels;
+       u32 nand_rb_val = 0;
 
        nsels = of_property_count_elems_of_size(np, "reg", sizeof(u32));
        if (!nsels || nsels > MAX_CE_NUM) {
@@ -1287,6 +1394,15 @@ meson_nfc_nand_chip_init(struct device *dev,
        mtd->owner = THIS_MODULE;
        mtd->dev.parent = dev;
 
+       ret = of_property_read_u32(np, "nand-rb", &nand_rb_val);
+       if (ret == -EINVAL)
+               nfc->no_rb_pin = true;
+       else if (ret)
+               return ret;
+
+       if (nand_rb_val)
+               return -EINVAL;
+
        ret = nand_scan(nand, nsels);
        if (ret)
                return ret;
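The nand-rb handling in meson_nfc_nand_chip_init() above is easy to misread, so here is the same logic restated (no new behaviour is implied):

	/*
	 * of_property_read_u32(np, "nand-rb", &nand_rb_val):
	 *   -EINVAL (property absent)  -> nfc->no_rb_pin = true, use the
	 *                                 interrupt-on-STATUS wait path;
	 *   any other error            -> propagate it;
	 *   property present           -> only the value 0 is accepted (single
	 *                                 wired R/B line), anything else -EINVAL.
	 */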
index dacc552..650351c 100644 (file)
@@ -44,6 +44,9 @@ struct nand_flash_dev nand_flash_ids[] = {
        {"TC58NVG6D2 64G 3.3V 8-bit",
                { .id = {0x98, 0xde, 0x94, 0x82, 0x76, 0x56, 0x04, 0x20} },
                  SZ_8K, SZ_8K, SZ_2M, 0, 8, 640, NAND_ECC_INFO(40, SZ_1K) },
+       {"SDTNQGAMA 64G 3.3V 8-bit",
+               { .id = {0x45, 0xde, 0x94, 0x93, 0x76, 0x57} },
+                 SZ_16K, SZ_8K, SZ_4M, 0, 6, 1280, NAND_ECC_INFO(40, SZ_1K) },
        {"SDTNRGAMA 64G 3.3V 8-bit",
                { .id = {0x45, 0xde, 0x94, 0x93, 0x76, 0x50} },
                  SZ_16K, SZ_8K, SZ_4M, 0, 6, 1280, NAND_ECC_INFO(40, SZ_1K) },
@@ -188,7 +191,7 @@ static const struct nand_manufacturer_desc nand_manufacturer_descs[] = {
        {NAND_MFR_NATIONAL, "National"},
        {NAND_MFR_RENESAS, "Renesas"},
        {NAND_MFR_SAMSUNG, "Samsung", &samsung_nand_manuf_ops},
-       {NAND_MFR_SANDISK, "SanDisk"},
+       {NAND_MFR_SANDISK, "SanDisk", &sandisk_nand_manuf_ops},
        {NAND_MFR_STMICRO, "ST Micro"},
        {NAND_MFR_TOSHIBA, "Toshiba", &toshiba_nand_manuf_ops},
        {NAND_MFR_WINBOND, "Winbond"},
index 385957e..e229de3 100644 (file)
@@ -6,6 +6,7 @@
  * Author: Boris Brezillon <boris.brezillon@free-electrons.com>
  */
 
+#include <linux/slab.h>
 #include "linux/delay.h"
 #include "internals.h"
 
 
 #define MXIC_CMD_POWER_DOWN 0xB9
 
+#define ONFI_FEATURE_ADDR_30LFXG18AC_OTP       0x90
+#define MACRONIX_30LFXG18AC_OTP_START_PAGE     2
+#define MACRONIX_30LFXG18AC_OTP_PAGES          30
+#define MACRONIX_30LFXG18AC_OTP_PAGE_SIZE      2112
+#define MACRONIX_30LFXG18AC_OTP_SIZE_BYTES     \
+       (MACRONIX_30LFXG18AC_OTP_PAGES *        \
+        MACRONIX_30LFXG18AC_OTP_PAGE_SIZE)
+
+#define MACRONIX_30LFXG18AC_OTP_EN             BIT(0)
+
 struct nand_onfi_vendor_macronix {
        u8 reserved;
        u8 reliability_func;
@@ -315,6 +326,161 @@ static void macronix_nand_deep_power_down_support(struct nand_chip *chip)
        chip->ops.resume = mxic_nand_resume;
 }
 
+static int macronix_30lfxg18ac_get_otp_info(struct mtd_info *mtd, size_t len,
+                                           size_t *retlen,
+                                           struct otp_info *buf)
+{
+       if (len < sizeof(*buf))
+               return -EINVAL;
+
+       /* Always report that OTP is unlocked. This type of flash
+        * chip provides no way to check whether OTP is locked or
+        * not: the subfeature parameter is implemented as a
+        * volatile register. Technically the OTP region could be
+        * locked and become read-only, but as there is no way to
+        * check that, locking is not allowed (the
+        * '_lock_user_prot_reg' callback always returns
+        * -EOPNOTSUPP) and OTP is always reported as unlocked.
+        */
+       buf->locked = 0;
+       buf->start = 0;
+       buf->length = MACRONIX_30LFXG18AC_OTP_SIZE_BYTES;
+
+       *retlen = sizeof(*buf);
+
+       return 0;
+}
+
+static int macronix_30lfxg18ac_otp_enable(struct nand_chip *nand)
+{
+       u8 feature_buf[ONFI_SUBFEATURE_PARAM_LEN] = { 0 };
+
+       feature_buf[0] = MACRONIX_30LFXG18AC_OTP_EN;
+       return nand_set_features(nand, ONFI_FEATURE_ADDR_30LFXG18AC_OTP,
+                                feature_buf);
+}
+
+static int macronix_30lfxg18ac_otp_disable(struct nand_chip *nand)
+{
+       u8 feature_buf[ONFI_SUBFEATURE_PARAM_LEN] = { 0 };
+
+       return nand_set_features(nand, ONFI_FEATURE_ADDR_30LFXG18AC_OTP,
+                                feature_buf);
+}
+
+static int __macronix_30lfxg18ac_rw_otp(struct mtd_info *mtd,
+                                       loff_t offs_in_flash,
+                                       size_t len, size_t *retlen,
+                                       u_char *buf, bool write)
+{
+       struct nand_chip *nand;
+       size_t bytes_handled;
+       off_t offs_in_page;
+       u64 page;
+       int ret;
+
+       nand = mtd_to_nand(mtd);
+       nand_select_target(nand, 0);
+
+       ret = macronix_30lfxg18ac_otp_enable(nand);
+       if (ret)
+               goto out_otp;
+
+       page = offs_in_flash;
+       /* 'page' will hold the result of the division. */
+       offs_in_page = do_div(page, MACRONIX_30LFXG18AC_OTP_PAGE_SIZE);
+       bytes_handled = 0;
+
+       while (bytes_handled < len &&
+              page < MACRONIX_30LFXG18AC_OTP_PAGES) {
+               size_t bytes_to_handle;
+               u64 phys_page = page + MACRONIX_30LFXG18AC_OTP_START_PAGE;
+
+               bytes_to_handle = min_t(size_t, len - bytes_handled,
+                                       MACRONIX_30LFXG18AC_OTP_PAGE_SIZE -
+                                       offs_in_page);
+
+               if (write)
+                       ret = nand_prog_page_op(nand, phys_page, offs_in_page,
+                                               &buf[bytes_handled], bytes_to_handle);
+               else
+                       ret = nand_read_page_op(nand, phys_page, offs_in_page,
+                                               &buf[bytes_handled], bytes_to_handle);
+               if (ret)
+                       goto out_otp;
+
+               bytes_handled += bytes_to_handle;
+               offs_in_page = 0;
+               page++;
+       }
+
+       *retlen = bytes_handled;
+
+out_otp:
+       if (ret)
+               dev_err(&mtd->dev, "failed to perform OTP IO: %i\n", ret);
+
+       ret = macronix_30lfxg18ac_otp_disable(nand);
+       if (ret)
+               dev_err(&mtd->dev, "failed to leave OTP mode after %s\n",
+                       write ? "write" : "read");
+
+       nand_deselect_target(nand);
+
+       return ret;
+}
+
+static int macronix_30lfxg18ac_write_otp(struct mtd_info *mtd, loff_t to,
+                                        size_t len, size_t *rlen,
+                                        const u_char *buf)
+{
+       return __macronix_30lfxg18ac_rw_otp(mtd, to, len, rlen, (u_char *)buf,
+                                           true);
+}
+
+static int macronix_30lfxg18ac_read_otp(struct mtd_info *mtd, loff_t from,
+                                       size_t len, size_t *rlen,
+                                       u_char *buf)
+{
+       return __macronix_30lfxg18ac_rw_otp(mtd, from, len, rlen, buf, false);
+}
+
+static int macronix_30lfxg18ac_lock_otp(struct mtd_info *mtd, loff_t from,
+                                       size_t len)
+{
+       /* See comment in 'macronix_30lfxg18ac_get_otp_info()'. */
+       return -EOPNOTSUPP;
+}
+
+static void macronix_nand_setup_otp(struct nand_chip *chip)
+{
+       static const char * const supported_otp_models[] = {
+               "MX30LF1G18AC",
+               "MX30LF2G18AC",
+               "MX30LF4G18AC",
+       };
+       struct mtd_info *mtd;
+
+       if (match_string(supported_otp_models,
+                        ARRAY_SIZE(supported_otp_models),
+                        chip->parameters.model) < 0)
+               return;
+
+       if (!chip->parameters.supports_set_get_features)
+               return;
+
+       bitmap_set(chip->parameters.get_feature_list,
+                  ONFI_FEATURE_ADDR_30LFXG18AC_OTP, 1);
+       bitmap_set(chip->parameters.set_feature_list,
+                  ONFI_FEATURE_ADDR_30LFXG18AC_OTP, 1);
+
+       mtd = nand_to_mtd(chip);
+       mtd->_get_user_prot_info = macronix_30lfxg18ac_get_otp_info;
+       mtd->_read_user_prot_reg = macronix_30lfxg18ac_read_otp;
+       mtd->_write_user_prot_reg = macronix_30lfxg18ac_write_otp;
+       mtd->_lock_user_prot_reg = macronix_30lfxg18ac_lock_otp;
+}
+
 static int macronix_nand_init(struct nand_chip *chip)
 {
        if (nand_is_slc(chip))
@@ -324,6 +490,7 @@ static int macronix_nand_init(struct nand_chip *chip)
        macronix_nand_onfi_init(chip);
        macronix_nand_block_protection_support(chip);
        macronix_nand_deep_power_down_support(chip);
+       macronix_nand_setup_otp(chip);
 
        return 0;
 }
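
The OTP helpers above lean on do_div() semantics: the macro divides its u64 first argument in place and returns the 32-bit remainder, which is why 'page' ends up holding the page number while the return value is the offset within the page. A small stand-alone sketch (the foo_ name is illustrative only):

#include <asm/div64.h>
#include <linux/types.h>

/* Split a byte offset into a page number and an in-page offset.
 * After do_div(), *page holds offs / page_size and the returned
 * value is the remainder offs % page_size.
 */
static inline u32 foo_split_offset(u64 *page, u64 offs, u32 page_size)
{
	*page = offs;
	return do_div(*page, page_size);
}
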
diff --git a/drivers/mtd/nand/raw/nand_sandisk.c b/drivers/mtd/nand/raw/nand_sandisk.c
new file mode 100644 (file)
index 0000000..7c66e41
--- /dev/null
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include "internals.h"
+
+static int
+sdtnqgama_choose_interface_config(struct nand_chip *chip,
+                                 struct nand_interface_config *iface)
+{
+       onfi_fill_interface_config(chip, iface, NAND_SDR_IFACE, 0);
+
+       return nand_choose_best_sdr_timings(chip, iface, NULL);
+}
+
+static int sandisk_nand_init(struct nand_chip *chip)
+{
+       if (!strncmp("SDTNQGAMA", chip->parameters.model,
+                    sizeof("SDTNQGAMA") - 1))
+               chip->ops.choose_interface_config =
+                       &sdtnqgama_choose_interface_config;
+
+       return 0;
+}
+
+const struct nand_manufacturer_ops sandisk_nand_manuf_ops = {
+       .init = sandisk_nand_init,
+};
index 6b043e2..cfd7c3b 100644 (file)
@@ -501,6 +501,16 @@ static const struct spinand_info gigadevice_spinand_table[] = {
                     SPINAND_HAS_QE_BIT,
                     SPINAND_ECCINFO(&gd5fxgqx_variant2_ooblayout,
                                     gd5fxgq4uexxg_ecc_get_status)),
+       SPINAND_INFO("GD5F2GQ5xExxH",
+                    SPINAND_ID(SPINAND_READID_METHOD_OPCODE_DUMMY, 0x22),
+                    NAND_MEMORG(1, 2048, 64, 64, 2048, 40, 1, 1, 1),
+                    NAND_ECCREQ(4, 512),
+                    SPINAND_INFO_OP_VARIANTS(&read_cache_variants_2gq5,
+                                             &write_cache_variants,
+                                             &update_cache_variants),
+                    SPINAND_HAS_QE_BIT,
+                    SPINAND_ECCINFO(&gd5fxgqx_variant2_ooblayout,
+                                    gd5fxgq4uexxg_ecc_get_status)),
 };
 
 static const struct spinand_manufacturer_ops gigadevice_spinand_manuf_ops = {
index 722a973..3dfc7e1 100644 (file)
@@ -299,6 +299,26 @@ static const struct spinand_info macronix_spinand_table[] = {
                     SPINAND_ECCINFO(&mx35lfxge4ab_ooblayout,
                                     mx35lf1ge4ab_ecc_get_status)),
 
+       SPINAND_INFO("MX31LF2GE4BC",
+                    SPINAND_ID(SPINAND_READID_METHOD_OPCODE_DUMMY, 0x2e),
+                    NAND_MEMORG(1, 2048, 64, 64, 2048, 40, 1, 1, 1),
+                    NAND_ECCREQ(8, 512),
+                    SPINAND_INFO_OP_VARIANTS(&read_cache_variants,
+                                             &write_cache_variants,
+                                             &update_cache_variants),
+                    SPINAND_HAS_QE_BIT,
+                    SPINAND_ECCINFO(&mx35lfxge4ab_ooblayout,
+                                    mx35lf1ge4ab_ecc_get_status)),
+       SPINAND_INFO("MX3UF2GE4BC",
+                    SPINAND_ID(SPINAND_READID_METHOD_OPCODE_DUMMY, 0xae),
+                    NAND_MEMORG(1, 2048, 64, 64, 2048, 40, 1, 1, 1),
+                    NAND_ECCREQ(8, 512),
+                    SPINAND_INFO_OP_VARIANTS(&read_cache_variants,
+                                             &write_cache_variants,
+                                             &update_cache_variants),
+                    SPINAND_HAS_QE_BIT,
+                    SPINAND_ECCINFO(&mx35lfxge4ab_ooblayout,
+                                    mx35lf1ge4ab_ecc_get_status)),
 };
 
 static const struct spinand_manufacturer_ops macronix_spinand_manuf_ops = {
index 4cfec3b..b5b3c4c 100644 (file)
@@ -981,7 +981,7 @@ restart:
        /* Update the FTL table */
        zone->lba_to_phys_table[ftl->cache_block] = write_sector;
 
-       /* Write succesfull, so erase and free the old block */
+       /* Write successful, so erase and free the old block */
        if (block_num > 0)
                sm_erase_block(ftl, zone_num, block_num, 1);
 
index 3711d7f..437c5b8 100644 (file)
@@ -227,9 +227,9 @@ static blk_status_t ubiblock_read(struct request *req)
        return BLK_STS_OK;
 }
 
-static int ubiblock_open(struct block_device *bdev, fmode_t mode)
+static int ubiblock_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct ubiblock *dev = bdev->bd_disk->private_data;
+       struct ubiblock *dev = disk->private_data;
        int ret;
 
        mutex_lock(&dev->dev_mutex);
@@ -246,11 +246,10 @@ static int ubiblock_open(struct block_device *bdev, fmode_t mode)
         * It's just a paranoid check, as write requests will get rejected
         * in any case.
         */
-       if (mode & FMODE_WRITE) {
+       if (mode & BLK_OPEN_WRITE) {
                ret = -EROFS;
                goto out_unlock;
        }
-
        dev->desc = ubi_open_volume(dev->ubi_num, dev->vol_id, UBI_READONLY);
        if (IS_ERR(dev->desc)) {
                dev_err(disk_to_dev(dev->gd), "failed to open ubi volume %d_%d",
@@ -270,7 +269,7 @@ out_unlock:
        return ret;
 }
 
-static void ubiblock_release(struct gendisk *gd, fmode_t mode)
+static void ubiblock_release(struct gendisk *gd)
 {
        struct ubiblock *dev = gd->private_data;
 
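
The ubiblock conversion above reflects the 6.5 block-layer interface change: ->open() now takes the gendisk plus a blk_mode_t, and ->release() loses its mode argument. A hedged, stand-alone sketch with made-up foo_* names (not the ubiblock code):

#include <linux/blkdev.h>
#include <linux/module.h>

static int foo_open(struct gendisk *disk, blk_mode_t mode)
{
	/* driver state is still reached through the gendisk */
	void *dev = disk->private_data;

	if (mode & BLK_OPEN_WRITE)
		return -EROFS;		/* read-only device in this sketch */

	return dev ? 0 : -ENODEV;
}

static void foo_release(struct gendisk *disk)
{
	/* no mode argument any more; nothing to undo in this sketch */
}

static const struct block_device_operations foo_fops = {
	.owner		= THIS_MODULE,
	.open		= foo_open,
	.release	= foo_release,
};
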
index 7eb2ddb..a317feb 100644 (file)
@@ -1126,8 +1126,7 @@ static int bgx_lmac_enable(struct bgx *bgx, u8 lmacid)
        }
 
 poll:
-       lmac->check_link = alloc_workqueue("check_link", WQ_UNBOUND |
-                                          WQ_MEM_RECLAIM, 1);
+       lmac->check_link = alloc_ordered_workqueue("check_link", WQ_MEM_RECLAIM);
        if (!lmac->check_link)
                return -ENOMEM;
        INIT_DELAYED_WORK(&lmac->dwork, bgx_poll_for_link);
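
This hunk and several later ones apply one recurring conversion: alloc_workqueue(name, WQ_UNBOUND | flags, 1) becomes alloc_ordered_workqueue(name, flags), which keeps the one-item-at-a-time, strictly ordered behaviour but states it explicitly. A minimal before/after sketch (the foo names are illustrative only):

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *foo_wq;

static int foo_init_wq(void)
{
	/* old idiom: unbound workqueue limited to one in-flight work item */
	/* foo_wq = alloc_workqueue("foo_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 1); */

	/* new idiom: same ordering guarantee, expressed directly */
	foo_wq = alloc_ordered_workqueue("foo_wq", WQ_MEM_RECLAIM);
	return foo_wq ? 0 : -ENOMEM;
}
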
index 37eadb3..41acfe2 100644 (file)
@@ -185,7 +185,7 @@ struct ice_buf_hdr {
 
 #define ICE_MAX_ENTRIES_IN_BUF(hd_sz, ent_sz)                                 \
        ((ICE_PKG_BUF_SIZE -                                                  \
-         struct_size((struct ice_buf_hdr *)0, section_entry, 1) - (hd_sz)) / \
+         struct_size_t(struct ice_buf_hdr,  section_entry, 1) - (hd_sz)) / \
         (ent_sz))
 
 /* ice package section IDs */
@@ -297,7 +297,7 @@ struct ice_label_section {
 };
 
 #define ICE_MAX_LABELS_IN_BUF                                             \
-       ICE_MAX_ENTRIES_IN_BUF(struct_size((struct ice_label_section *)0, \
+       ICE_MAX_ENTRIES_IN_BUF(struct_size_t(struct ice_label_section,  \
                                           label, 1) -                    \
                                       sizeof(struct ice_label),          \
                               sizeof(struct ice_label))
@@ -352,7 +352,7 @@ struct ice_boost_tcam_section {
 };
 
 #define ICE_MAX_BST_TCAMS_IN_BUF                                               \
-       ICE_MAX_ENTRIES_IN_BUF(struct_size((struct ice_boost_tcam_section *)0, \
+       ICE_MAX_ENTRIES_IN_BUF(struct_size_t(struct ice_boost_tcam_section,  \
                                           tcam, 1) -                          \
                                       sizeof(struct ice_boost_tcam_entry),    \
                               sizeof(struct ice_boost_tcam_entry))
@@ -372,8 +372,7 @@ struct ice_marker_ptype_tcam_section {
 };
 
 #define ICE_MAX_MARKER_PTYPE_TCAMS_IN_BUF                                    \
-       ICE_MAX_ENTRIES_IN_BUF(                                              \
-               struct_size((struct ice_marker_ptype_tcam_section *)0, tcam, \
+       ICE_MAX_ENTRIES_IN_BUF(struct_size_t(struct ice_marker_ptype_tcam_section,  tcam, \
                            1) -                                             \
                        sizeof(struct ice_marker_ptype_tcam_entry),          \
                sizeof(struct ice_marker_ptype_tcam_entry))
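
struct_size_t(), used in the ice hunks above, is the type-based variant of struct_size(): it takes the struct type itself, so constant expressions no longer need the (struct foo *)0 null-pointer cast visible on the removed lines. A toy illustration with a hypothetical struct (not from the ice driver):

#include <linux/overflow.h>
#include <linux/types.h>

struct foo_section {
	__le16 count;
	__le16 entry[];			/* flexible array member */
};

/* old idiom, as on the removed lines: cast a null pointer to name the type */
#define FOO_SECTION_SZ_OLD	struct_size((struct foo_section *)0, entry, 1)

/* new helper, as on the added lines: pass the type directly */
#define FOO_SECTION_SZ_NEW	struct_size_t(struct foo_section, entry, 1)
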
index 9f673bd..0069e60 100644 (file)
@@ -3044,9 +3044,8 @@ static int rvu_flr_init(struct rvu *rvu)
                            cfg | BIT_ULL(22));
        }
 
-       rvu->flr_wq = alloc_workqueue("rvu_afpf_flr",
-                                     WQ_UNBOUND | WQ_HIGHPRI | WQ_MEM_RECLAIM,
-                                      1);
+       rvu->flr_wq = alloc_ordered_workqueue("rvu_afpf_flr",
+                                             WQ_HIGHPRI | WQ_MEM_RECLAIM);
        if (!rvu->flr_wq)
                return -ENOMEM;
 
index db3fcab..fe8ea4e 100644 (file)
@@ -272,8 +272,7 @@ static int otx2_pf_flr_init(struct otx2_nic *pf, int num_vfs)
 {
        int vf;
 
-       pf->flr_wq = alloc_workqueue("otx2_pf_flr_wq",
-                                    WQ_UNBOUND | WQ_HIGHPRI, 1);
+       pf->flr_wq = alloc_ordered_workqueue("otx2_pf_flr_wq", WQ_HIGHPRI);
        if (!pf->flr_wq)
                return -ENOMEM;
 
@@ -594,9 +593,8 @@ static int otx2_pfvf_mbox_init(struct otx2_nic *pf, int numvfs)
        if (!pf->mbox_pfvf)
                return -ENOMEM;
 
-       pf->mbox_pfvf_wq = alloc_workqueue("otx2_pfvf_mailbox",
-                                          WQ_UNBOUND | WQ_HIGHPRI |
-                                          WQ_MEM_RECLAIM, 1);
+       pf->mbox_pfvf_wq = alloc_ordered_workqueue("otx2_pfvf_mailbox",
+                                                  WQ_HIGHPRI | WQ_MEM_RECLAIM);
        if (!pf->mbox_pfvf_wq)
                return -ENOMEM;
 
@@ -1060,9 +1058,8 @@ static int otx2_pfaf_mbox_init(struct otx2_nic *pf)
        int err;
 
        mbox->pfvf = pf;
-       pf->mbox_wq = alloc_workqueue("otx2_pfaf_mailbox",
-                                     WQ_UNBOUND | WQ_HIGHPRI |
-                                     WQ_MEM_RECLAIM, 1);
+       pf->mbox_wq = alloc_ordered_workqueue("otx2_pfaf_mailbox",
+                                             WQ_HIGHPRI | WQ_MEM_RECLAIM);
        if (!pf->mbox_wq)
                return -ENOMEM;
 
index 3734c79..35e0604 100644 (file)
@@ -293,9 +293,8 @@ static int otx2vf_vfaf_mbox_init(struct otx2_nic *vf)
        int err;
 
        mbox->pfvf = vf;
-       vf->mbox_wq = alloc_workqueue("otx2_vfaf_mailbox",
-                                     WQ_UNBOUND | WQ_HIGHPRI |
-                                     WQ_MEM_RECLAIM, 1);
+       vf->mbox_wq = alloc_ordered_workqueue("otx2_vfaf_mailbox",
+                                             WQ_HIGHPRI | WQ_MEM_RECLAIM);
        if (!vf->mbox_wq)
                return -ENOMEM;
 
index e47fa6f..20bb5eb 100644 (file)
@@ -45,7 +45,7 @@ static int mlx5_thermal_get_mtmp_temp(struct mlx5_core_dev *mdev, u32 id, int *p
 static int mlx5_thermal_get_temp(struct thermal_zone_device *tzdev,
                                 int *p_temp)
 {
-       struct mlx5_thermal *thermal = tzdev->devdata;
+       struct mlx5_thermal *thermal = thermal_zone_device_priv(tzdev);
        struct mlx5_core_dev *mdev = thermal->mdev;
        int err;
 
@@ -81,12 +81,13 @@ int mlx5_thermal_init(struct mlx5_core_dev *mdev)
                return -ENOMEM;
 
        thermal->mdev = mdev;
-       thermal->tzdev = thermal_zone_device_register(data,
-                                                     MLX5_THERMAL_NUM_TRIPS,
-                                                     MLX5_THERMAL_TRIP_MASK,
-                                                     thermal,
-                                                     &mlx5_thermal_ops,
-                                                     NULL, 0, MLX5_THERMAL_POLL_INT_MSEC);
+       thermal->tzdev = thermal_zone_device_register_with_trips(data,
+                                                                NULL,
+                                                                MLX5_THERMAL_NUM_TRIPS,
+                                                                MLX5_THERMAL_TRIP_MASK,
+                                                                thermal,
+                                                                &mlx5_thermal_ops,
+                                                                NULL, 0, MLX5_THERMAL_POLL_INT_MSEC);
        if (IS_ERR(thermal->tzdev)) {
                dev_err(mdev->device, "Failed to register thermal zone device (%s) %ld\n",
                        data, PTR_ERR(thermal->tzdev));
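
The mlx5 thermal hunk stops reaching into tzdev->devdata directly and goes through the thermal core accessor instead. A small hedged sketch of a ->get_temp() callback built on that helper (the foo_* names are made up):

#include <linux/thermal.h>

struct foo_thermal {
	int cached_mdeg;	/* last reading, millidegrees Celsius */
};

static int foo_thermal_get_temp(struct thermal_zone_device *tzdev, int *p_temp)
{
	struct foo_thermal *thermal = thermal_zone_device_priv(tzdev);

	*p_temp = thermal->cached_mdeg;
	return 0;
}
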
index 038c590..52c1a3d 100644 (file)
@@ -1082,8 +1082,7 @@ int ath10k_qmi_init(struct ath10k *ar, u32 msa_size)
        if (ret)
                goto err;
 
-       qmi->event_wq = alloc_workqueue("ath10k_qmi_driver_event",
-                                       WQ_UNBOUND, 1);
+       qmi->event_wq = alloc_ordered_workqueue("ath10k_qmi_driver_event", 0);
        if (!qmi->event_wq) {
                ath10k_err(ar, "failed to allocate workqueue\n");
                ret = -EFAULT;
index 5d1c03b..d4eaf7d 100644 (file)
@@ -3269,8 +3269,7 @@ int ath11k_qmi_init_service(struct ath11k_base *ab)
                return ret;
        }
 
-       ab->qmi.event_wq = alloc_workqueue("ath11k_qmi_driver_event",
-                                          WQ_UNBOUND, 1);
+       ab->qmi.event_wq = alloc_ordered_workqueue("ath11k_qmi_driver_event", 0);
        if (!ab->qmi.event_wq) {
                ath11k_err(ab, "failed to allocate workqueue\n");
                return -EFAULT;
index 4afba76..b510c2d 100644 (file)
@@ -3058,8 +3058,7 @@ int ath12k_qmi_init_service(struct ath12k_base *ab)
                return ret;
        }
 
-       ab->qmi.event_wq = alloc_workqueue("ath12k_qmi_driver_event",
-                                          WQ_UNBOUND, 1);
+       ab->qmi.event_wq = alloc_ordered_workqueue("ath12k_qmi_driver_event", 0);
        if (!ab->qmi.event_wq) {
                ath12k_err(ab, "failed to allocate workqueue\n");
                return -EFAULT;
index 3e88bbd..eacbbdb 100644 (file)
@@ -3597,7 +3597,7 @@ struct iwl_trans *iwl_trans_pcie_alloc(struct pci_dev *pdev,
        init_waitqueue_head(&trans_pcie->imr_waitq);
 
        trans_pcie->rba.alloc_wq = alloc_workqueue("rb_allocator",
-                                                  WQ_HIGHPRI | WQ_UNBOUND, 1);
+                                                  WQ_HIGHPRI | WQ_UNBOUND, 0);
        if (!trans_pcie->rba.alloc_wq) {
                ret = -ENOMEM;
                goto out_free_trans;
index 1ef89cd..813d1cb 100644 (file)
@@ -3127,7 +3127,7 @@ struct wireless_dev *mwifiex_add_virtual_intf(struct wiphy *wiphy,
        priv->dfs_cac_workqueue = alloc_workqueue("MWIFIEX_DFS_CAC%s",
                                                  WQ_HIGHPRI |
                                                  WQ_MEM_RECLAIM |
-                                                 WQ_UNBOUND, 1, name);
+                                                 WQ_UNBOUND, 0, name);
        if (!priv->dfs_cac_workqueue) {
                mwifiex_dbg(adapter, ERROR, "cannot alloc DFS CAC queue\n");
                ret = -ENOMEM;
@@ -3138,7 +3138,7 @@ struct wireless_dev *mwifiex_add_virtual_intf(struct wiphy *wiphy,
 
        priv->dfs_chan_sw_workqueue = alloc_workqueue("MWIFIEX_DFS_CHSW%s",
                                                      WQ_HIGHPRI | WQ_UNBOUND |
-                                                     WQ_MEM_RECLAIM, 1, name);
+                                                     WQ_MEM_RECLAIM, 0, name);
        if (!priv->dfs_chan_sw_workqueue) {
                mwifiex_dbg(adapter, ERROR, "cannot alloc DFS channel sw queue\n");
                ret = -ENOMEM;
index ea22a08..1cd9d20 100644 (file)
@@ -1547,7 +1547,7 @@ mwifiex_reinit_sw(struct mwifiex_adapter *adapter)
 
        adapter->workqueue =
                alloc_workqueue("MWIFIEX_WORK_QUEUE",
-                               WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+                               WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
        if (!adapter->workqueue)
                goto err_kmalloc;
 
@@ -1557,7 +1557,7 @@ mwifiex_reinit_sw(struct mwifiex_adapter *adapter)
                adapter->rx_workqueue = alloc_workqueue("MWIFIEX_RX_WORK_QUEUE",
                                                        WQ_HIGHPRI |
                                                        WQ_MEM_RECLAIM |
-                                                       WQ_UNBOUND, 1);
+                                                       WQ_UNBOUND, 0);
                if (!adapter->rx_workqueue)
                        goto err_kmalloc;
                INIT_WORK(&adapter->rx_work, mwifiex_rx_work_queue);
@@ -1702,7 +1702,7 @@ mwifiex_add_card(void *card, struct completion *fw_done,
 
        adapter->workqueue =
                alloc_workqueue("MWIFIEX_WORK_QUEUE",
-                               WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+                               WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
        if (!adapter->workqueue)
                goto err_kmalloc;
 
@@ -1712,7 +1712,7 @@ mwifiex_add_card(void *card, struct completion *fw_done,
                adapter->rx_workqueue = alloc_workqueue("MWIFIEX_RX_WORK_QUEUE",
                                                        WQ_HIGHPRI |
                                                        WQ_MEM_RECLAIM |
-                                                       WQ_UNBOUND, 1);
+                                                       WQ_UNBOUND, 0);
                if (!adapter->rx_workqueue)
                        goto err_kmalloc;
 
index aec3a18..7162bf3 100644 (file)
@@ -1293,9 +1293,9 @@ int t7xx_cldma_init(struct cldma_ctrl *md_ctrl)
        for (i = 0; i < CLDMA_TXQ_NUM; i++) {
                md_cd_queue_struct_init(&md_ctrl->txq[i], md_ctrl, MTK_TX, i);
                md_ctrl->txq[i].worker =
-                       alloc_workqueue("md_hif%d_tx%d_worker",
-                                       WQ_UNBOUND | WQ_MEM_RECLAIM | (i ? 0 : WQ_HIGHPRI),
-                                       1, md_ctrl->hif_id, i);
+                       alloc_ordered_workqueue("md_hif%d_tx%d_worker",
+                                       WQ_MEM_RECLAIM | (i ? 0 : WQ_HIGHPRI),
+                                       md_ctrl->hif_id, i);
                if (!md_ctrl->txq[i].worker)
                        goto err_workqueue;
 
@@ -1306,9 +1306,10 @@ int t7xx_cldma_init(struct cldma_ctrl *md_ctrl)
                md_cd_queue_struct_init(&md_ctrl->rxq[i], md_ctrl, MTK_RX, i);
                INIT_WORK(&md_ctrl->rxq[i].cldma_work, t7xx_cldma_rx_done);
 
-               md_ctrl->rxq[i].worker = alloc_workqueue("md_hif%d_rx%d_worker",
-                                                        WQ_UNBOUND | WQ_MEM_RECLAIM,
-                                                        1, md_ctrl->hif_id, i);
+               md_ctrl->rxq[i].worker =
+                       alloc_ordered_workqueue("md_hif%d_rx%d_worker",
+                                               WQ_MEM_RECLAIM,
+                                               md_ctrl->hif_id, i);
                if (!md_ctrl->rxq[i].worker)
                        goto err_workqueue;
        }
index 4651420..8dab025 100644 (file)
@@ -618,8 +618,9 @@ int t7xx_dpmaif_txq_init(struct dpmaif_tx_queue *txq)
                return ret;
        }
 
-       txq->worker = alloc_workqueue("md_dpmaif_tx%d_worker", WQ_UNBOUND | WQ_MEM_RECLAIM |
-                                     (txq->index ? 0 : WQ_HIGHPRI), 1, txq->index);
+       txq->worker = alloc_ordered_workqueue("md_dpmaif_tx%d_worker",
+                               WQ_MEM_RECLAIM | (txq->index ? 0 : WQ_HIGHPRI),
+                               txq->index);
        if (!txq->worker)
                return -ENOMEM;
 
index f70ba58..ab0f32b 100644 (file)
 
 /* Globals */
 
+/* The "nubus.populate_procfs" parameter makes slot resources available in
+ * procfs. It's deprecated and disabled by default because procfs is no longer
+ * thought to be suitable for that and some board ROMs make it too expensive.
+ */
+bool nubus_populate_procfs;
+module_param_named(populate_procfs, nubus_populate_procfs, bool, 0);
+
 LIST_HEAD(nubus_func_rsrcs);
 
 /* Meaning of "bytelanes":
@@ -572,9 +579,9 @@ nubus_get_functional_resource(struct nubus_board *board, int slot,
                        nubus_proc_add_rsrc(dir.procdir, &ent);
                        break;
                default:
-                       /* Local/Private resources have their own
-                          function */
-                       nubus_get_private_resource(fres, dir.procdir, &ent);
+                       if (nubus_populate_procfs)
+                               nubus_get_private_resource(fres, dir.procdir,
+                                                          &ent);
                }
        }
 
index 1fd6678..e7a347d 100644 (file)
@@ -55,7 +55,7 @@ struct proc_dir_entry *nubus_proc_add_board(struct nubus_board *board)
 {
        char name[2];
 
-       if (!proc_bus_nubus_dir)
+       if (!proc_bus_nubus_dir || !nubus_populate_procfs)
                return NULL;
        snprintf(name, sizeof(name), "%x", board->slot);
        return proc_mkdir(name, proc_bus_nubus_dir);
@@ -72,9 +72,10 @@ struct proc_dir_entry *nubus_proc_add_rsrc_dir(struct proc_dir_entry *procdir,
        char name[9];
        int lanes = board->lanes;
 
-       if (!procdir)
+       if (!procdir || !nubus_populate_procfs)
                return NULL;
        snprintf(name, sizeof(name), "%x", ent->type);
+       remove_proc_subtree(name, procdir);
        return proc_mkdir_data(name, 0555, procdir, (void *)lanes);
 }
 
@@ -137,6 +138,18 @@ static int nubus_proc_rsrc_show(struct seq_file *m, void *v)
        return 0;
 }
 
+static int nubus_rsrc_proc_open(struct inode *inode, struct file *file)
+{
+       return single_open(file, nubus_proc_rsrc_show, inode);
+}
+
+static const struct proc_ops nubus_rsrc_proc_ops = {
+       .proc_open      = nubus_rsrc_proc_open,
+       .proc_read      = seq_read,
+       .proc_lseek     = seq_lseek,
+       .proc_release   = single_release,
+};
+
 void nubus_proc_add_rsrc_mem(struct proc_dir_entry *procdir,
                             const struct nubus_dirent *ent,
                             unsigned int size)
@@ -144,7 +157,7 @@ void nubus_proc_add_rsrc_mem(struct proc_dir_entry *procdir,
        char name[9];
        struct nubus_proc_pde_data *pded;
 
-       if (!procdir)
+       if (!procdir || !nubus_populate_procfs)
                return;
 
        snprintf(name, sizeof(name), "%x", ent->type);
@@ -152,8 +165,9 @@ void nubus_proc_add_rsrc_mem(struct proc_dir_entry *procdir,
                pded = nubus_proc_alloc_pde_data(nubus_dirptr(ent), size);
        else
                pded = NULL;
-       proc_create_single_data(name, S_IFREG | 0444, procdir,
-                       nubus_proc_rsrc_show, pded);
+       remove_proc_subtree(name, procdir);
+       proc_create_data(name, S_IFREG | 0444, procdir,
+                        &nubus_rsrc_proc_ops, pded);
 }
 
 void nubus_proc_add_rsrc(struct proc_dir_entry *procdir,
@@ -162,13 +176,14 @@ void nubus_proc_add_rsrc(struct proc_dir_entry *procdir,
        char name[9];
        unsigned char *data = (unsigned char *)ent->data;
 
-       if (!procdir)
+       if (!procdir || !nubus_populate_procfs)
                return;
 
        snprintf(name, sizeof(name), "%x", ent->type);
-       proc_create_single_data(name, S_IFREG | 0444, procdir,
-                       nubus_proc_rsrc_show,
-                       nubus_proc_alloc_pde_data(data, 0));
+       remove_proc_subtree(name, procdir);
+       proc_create_data(name, S_IFREG | 0444, procdir,
+                        &nubus_rsrc_proc_ops,
+                        nubus_proc_alloc_pde_data(data, 0));
 }
 
 /*
index e27202d..d3fc506 100644 (file)
@@ -10,7 +10,7 @@ obj-$(CONFIG_NVME_FC)                 += nvme-fc.o
 obj-$(CONFIG_NVME_TCP)                 += nvme-tcp.o
 obj-$(CONFIG_NVME_APPLE)               += nvme-apple.o
 
-nvme-core-y                            += core.o ioctl.o
+nvme-core-y                            += core.o ioctl.o sysfs.o
 nvme-core-$(CONFIG_NVME_VERBOSE_ERRORS)        += constants.o
 nvme-core-$(CONFIG_TRACING)            += trace.o
 nvme-core-$(CONFIG_NVME_MULTIPATH)     += multipath.o
index ea16a0a..daf5d14 100644 (file)
@@ -30,18 +30,18 @@ struct nvme_dhchap_queue_context {
        u32 s2;
        u16 transaction;
        u8 status;
+       u8 dhgroup_id;
        u8 hash_id;
        size_t hash_len;
-       u8 dhgroup_id;
        u8 c1[64];
        u8 c2[64];
        u8 response[64];
        u8 *host_response;
        u8 *ctrl_key;
-       int ctrl_key_len;
        u8 *host_key;
-       int host_key_len;
        u8 *sess_key;
+       int ctrl_key_len;
+       int host_key_len;
        int sess_key_len;
 };
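
The field reordering above appears to be a plain layout cleanup: keeping the u8 identifiers together and moving the int lengths next to each other should reduce alignment holes on 64-bit builds without changing behaviour. A toy illustration of the effect (not the real structure):

#include <linux/types.h>

struct foo_padded {		/* 24 bytes on a typical LP64 build */
	u8 flag_a;		/* 1 byte + 7 bytes of padding */
	void *buf;		/* 8 bytes */
	u8 flag_b;		/* 1 byte + 7 bytes of padding */
};

struct foo_packed {		/* 16 bytes on the same build */
	void *buf;		/* 8 bytes */
	u8 flag_a;		/* 1 byte */
	u8 flag_b;		/* 1 byte + 6 bytes of padding */
};
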
 
index 3ec38e2..fdfcf27 100644 (file)
@@ -237,7 +237,7 @@ int nvme_delete_ctrl(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_delete_ctrl);
 
-static void nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl)
+void nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl)
 {
        /*
         * Keep a reference until nvme_do_delete_ctrl() complete,
@@ -1635,12 +1635,12 @@ static void nvme_ns_release(struct nvme_ns *ns)
        nvme_put_ns(ns);
 }
 
-static int nvme_open(struct block_device *bdev, fmode_t mode)
+static int nvme_open(struct gendisk *disk, blk_mode_t mode)
 {
-       return nvme_ns_open(bdev->bd_disk->private_data);
+       return nvme_ns_open(disk->private_data);
 }
 
-static void nvme_release(struct gendisk *disk, fmode_t mode)
+static void nvme_release(struct gendisk *disk)
 {
        nvme_ns_release(disk->private_data);
 }
@@ -1879,7 +1879,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
                struct nvme_ns *ns, struct nvme_id_ns *id)
 {
        sector_t capacity = nvme_lba_to_sect(ns, le64_to_cpu(id->nsze));
-       unsigned short bs = 1 << ns->lba_shift;
+       u32 bs = 1U << ns->lba_shift;
        u32 atomic_bs, phys_bs, io_opt = 0;
 
        /*
@@ -2300,7 +2300,7 @@ static int nvme_report_zones(struct gendisk *disk, sector_t sector,
 #define nvme_report_zones      NULL
 #endif /* CONFIG_BLK_DEV_ZONED */
 
-static const struct block_device_operations nvme_bdev_ops = {
+const struct block_device_operations nvme_bdev_ops = {
        .owner          = THIS_MODULE,
        .ioctl          = nvme_ioctl,
        .compat_ioctl   = blkdev_compat_ptr_ioctl,
@@ -2835,75 +2835,6 @@ static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
        return NULL;
 }
 
-#define SUBSYS_ATTR_RO(_name, _mode, _show)                    \
-       struct device_attribute subsys_attr_##_name = \
-               __ATTR(_name, _mode, _show, NULL)
-
-static ssize_t nvme_subsys_show_nqn(struct device *dev,
-                                   struct device_attribute *attr,
-                                   char *buf)
-{
-       struct nvme_subsystem *subsys =
-               container_of(dev, struct nvme_subsystem, dev);
-
-       return sysfs_emit(buf, "%s\n", subsys->subnqn);
-}
-static SUBSYS_ATTR_RO(subsysnqn, S_IRUGO, nvme_subsys_show_nqn);
-
-static ssize_t nvme_subsys_show_type(struct device *dev,
-                                   struct device_attribute *attr,
-                                   char *buf)
-{
-       struct nvme_subsystem *subsys =
-               container_of(dev, struct nvme_subsystem, dev);
-
-       switch (subsys->subtype) {
-       case NVME_NQN_DISC:
-               return sysfs_emit(buf, "discovery\n");
-       case NVME_NQN_NVME:
-               return sysfs_emit(buf, "nvm\n");
-       default:
-               return sysfs_emit(buf, "reserved\n");
-       }
-}
-static SUBSYS_ATTR_RO(subsystype, S_IRUGO, nvme_subsys_show_type);
-
-#define nvme_subsys_show_str_function(field)                           \
-static ssize_t subsys_##field##_show(struct device *dev,               \
-                           struct device_attribute *attr, char *buf)   \
-{                                                                      \
-       struct nvme_subsystem *subsys =                                 \
-               container_of(dev, struct nvme_subsystem, dev);          \
-       return sysfs_emit(buf, "%.*s\n",                                \
-                          (int)sizeof(subsys->field), subsys->field);  \
-}                                                                      \
-static SUBSYS_ATTR_RO(field, S_IRUGO, subsys_##field##_show);
-
-nvme_subsys_show_str_function(model);
-nvme_subsys_show_str_function(serial);
-nvme_subsys_show_str_function(firmware_rev);
-
-static struct attribute *nvme_subsys_attrs[] = {
-       &subsys_attr_model.attr,
-       &subsys_attr_serial.attr,
-       &subsys_attr_firmware_rev.attr,
-       &subsys_attr_subsysnqn.attr,
-       &subsys_attr_subsystype.attr,
-#ifdef CONFIG_NVME_MULTIPATH
-       &subsys_attr_iopolicy.attr,
-#endif
-       NULL,
-};
-
-static const struct attribute_group nvme_subsys_attrs_group = {
-       .attrs = nvme_subsys_attrs,
-};
-
-static const struct attribute_group *nvme_subsys_attrs_groups[] = {
-       &nvme_subsys_attrs_group,
-       NULL,
-};
-
 static inline bool nvme_discovery_ctrl(struct nvme_ctrl *ctrl)
 {
        return ctrl->opts && ctrl->opts->discovery_nqn;
@@ -3108,7 +3039,8 @@ static int nvme_init_non_mdts_limits(struct nvme_ctrl *ctrl)
                ctrl->max_zeroes_sectors = 0;
 
        if (ctrl->subsys->subtype != NVME_NQN_NVME ||
-           nvme_ctrl_limited_cns(ctrl))
+           nvme_ctrl_limited_cns(ctrl) ||
+           test_bit(NVME_CTRL_SKIP_ID_CNS_CS, &ctrl->flags))
                return 0;
 
        id = kzalloc(sizeof(*id), GFP_KERNEL);
@@ -3130,6 +3062,8 @@ static int nvme_init_non_mdts_limits(struct nvme_ctrl *ctrl)
                ctrl->max_zeroes_sectors = nvme_mps_to_sectors(ctrl, id->wzsl);
 
 free_data:
+       if (ret > 0)
+               set_bit(NVME_CTRL_SKIP_ID_CNS_CS, &ctrl->flags);
        kfree(id);
        return ret;
 }
@@ -3437,586 +3371,6 @@ static const struct file_operations nvme_dev_fops = {
        .uring_cmd      = nvme_dev_uring_cmd,
 };
 
-static ssize_t nvme_sysfs_reset(struct device *dev,
-                               struct device_attribute *attr, const char *buf,
-                               size_t count)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-       int ret;
-
-       ret = nvme_reset_ctrl_sync(ctrl);
-       if (ret < 0)
-               return ret;
-       return count;
-}
-static DEVICE_ATTR(reset_controller, S_IWUSR, NULL, nvme_sysfs_reset);
-
-static ssize_t nvme_sysfs_rescan(struct device *dev,
-                               struct device_attribute *attr, const char *buf,
-                               size_t count)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-
-       nvme_queue_scan(ctrl);
-       return count;
-}
-static DEVICE_ATTR(rescan_controller, S_IWUSR, NULL, nvme_sysfs_rescan);
-
-static inline struct nvme_ns_head *dev_to_ns_head(struct device *dev)
-{
-       struct gendisk *disk = dev_to_disk(dev);
-
-       if (disk->fops == &nvme_bdev_ops)
-               return nvme_get_ns_from_dev(dev)->head;
-       else
-               return disk->private_data;
-}
-
-static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
-               char *buf)
-{
-       struct nvme_ns_head *head = dev_to_ns_head(dev);
-       struct nvme_ns_ids *ids = &head->ids;
-       struct nvme_subsystem *subsys = head->subsys;
-       int serial_len = sizeof(subsys->serial);
-       int model_len = sizeof(subsys->model);
-
-       if (!uuid_is_null(&ids->uuid))
-               return sysfs_emit(buf, "uuid.%pU\n", &ids->uuid);
-
-       if (memchr_inv(ids->nguid, 0, sizeof(ids->nguid)))
-               return sysfs_emit(buf, "eui.%16phN\n", ids->nguid);
-
-       if (memchr_inv(ids->eui64, 0, sizeof(ids->eui64)))
-               return sysfs_emit(buf, "eui.%8phN\n", ids->eui64);
-
-       while (serial_len > 0 && (subsys->serial[serial_len - 1] == ' ' ||
-                                 subsys->serial[serial_len - 1] == '\0'))
-               serial_len--;
-       while (model_len > 0 && (subsys->model[model_len - 1] == ' ' ||
-                                subsys->model[model_len - 1] == '\0'))
-               model_len--;
-
-       return sysfs_emit(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
-               serial_len, subsys->serial, model_len, subsys->model,
-               head->ns_id);
-}
-static DEVICE_ATTR_RO(wwid);
-
-static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
-               char *buf)
-{
-       return sysfs_emit(buf, "%pU\n", dev_to_ns_head(dev)->ids.nguid);
-}
-static DEVICE_ATTR_RO(nguid);
-
-static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
-               char *buf)
-{
-       struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;
-
-       /* For backward compatibility expose the NGUID to userspace if
-        * we have no UUID set
-        */
-       if (uuid_is_null(&ids->uuid)) {
-               dev_warn_ratelimited(dev,
-                       "No UUID available providing old NGUID\n");
-               return sysfs_emit(buf, "%pU\n", ids->nguid);
-       }
-       return sysfs_emit(buf, "%pU\n", &ids->uuid);
-}
-static DEVICE_ATTR_RO(uuid);
-
-static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
-               char *buf)
-{
-       return sysfs_emit(buf, "%8ph\n", dev_to_ns_head(dev)->ids.eui64);
-}
-static DEVICE_ATTR_RO(eui);
-
-static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
-               char *buf)
-{
-       return sysfs_emit(buf, "%d\n", dev_to_ns_head(dev)->ns_id);
-}
-static DEVICE_ATTR_RO(nsid);
-
-static struct attribute *nvme_ns_id_attrs[] = {
-       &dev_attr_wwid.attr,
-       &dev_attr_uuid.attr,
-       &dev_attr_nguid.attr,
-       &dev_attr_eui.attr,
-       &dev_attr_nsid.attr,
-#ifdef CONFIG_NVME_MULTIPATH
-       &dev_attr_ana_grpid.attr,
-       &dev_attr_ana_state.attr,
-#endif
-       NULL,
-};
-
-static umode_t nvme_ns_id_attrs_are_visible(struct kobject *kobj,
-               struct attribute *a, int n)
-{
-       struct device *dev = container_of(kobj, struct device, kobj);
-       struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;
-
-       if (a == &dev_attr_uuid.attr) {
-               if (uuid_is_null(&ids->uuid) &&
-                   !memchr_inv(ids->nguid, 0, sizeof(ids->nguid)))
-                       return 0;
-       }
-       if (a == &dev_attr_nguid.attr) {
-               if (!memchr_inv(ids->nguid, 0, sizeof(ids->nguid)))
-                       return 0;
-       }
-       if (a == &dev_attr_eui.attr) {
-               if (!memchr_inv(ids->eui64, 0, sizeof(ids->eui64)))
-                       return 0;
-       }
-#ifdef CONFIG_NVME_MULTIPATH
-       if (a == &dev_attr_ana_grpid.attr || a == &dev_attr_ana_state.attr) {
-               if (dev_to_disk(dev)->fops != &nvme_bdev_ops) /* per-path attr */
-                       return 0;
-               if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl))
-                       return 0;
-       }
-#endif
-       return a->mode;
-}
-
-static const struct attribute_group nvme_ns_id_attr_group = {
-       .attrs          = nvme_ns_id_attrs,
-       .is_visible     = nvme_ns_id_attrs_are_visible,
-};
-
-const struct attribute_group *nvme_ns_id_attr_groups[] = {
-       &nvme_ns_id_attr_group,
-       NULL,
-};
-
-#define nvme_show_str_function(field)                                          \
-static ssize_t  field##_show(struct device *dev,                               \
-                           struct device_attribute *attr, char *buf)           \
-{                                                                              \
-        struct nvme_ctrl *ctrl = dev_get_drvdata(dev);                         \
-        return sysfs_emit(buf, "%.*s\n",                                       \
-               (int)sizeof(ctrl->subsys->field), ctrl->subsys->field);         \
-}                                                                              \
-static DEVICE_ATTR(field, S_IRUGO, field##_show, NULL);
-
-nvme_show_str_function(model);
-nvme_show_str_function(serial);
-nvme_show_str_function(firmware_rev);
-
-#define nvme_show_int_function(field)                                          \
-static ssize_t  field##_show(struct device *dev,                               \
-                           struct device_attribute *attr, char *buf)           \
-{                                                                              \
-        struct nvme_ctrl *ctrl = dev_get_drvdata(dev);                         \
-        return sysfs_emit(buf, "%d\n", ctrl->field);                           \
-}                                                                              \
-static DEVICE_ATTR(field, S_IRUGO, field##_show, NULL);
-
-nvme_show_int_function(cntlid);
-nvme_show_int_function(numa_node);
-nvme_show_int_function(queue_count);
-nvme_show_int_function(sqsize);
-nvme_show_int_function(kato);
-
-static ssize_t nvme_sysfs_delete(struct device *dev,
-                               struct device_attribute *attr, const char *buf,
-                               size_t count)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-
-       if (!test_bit(NVME_CTRL_STARTED_ONCE, &ctrl->flags))
-               return -EBUSY;
-
-       if (device_remove_file_self(dev, attr))
-               nvme_delete_ctrl_sync(ctrl);
-       return count;
-}
-static DEVICE_ATTR(delete_controller, S_IWUSR, NULL, nvme_sysfs_delete);
-
-static ssize_t nvme_sysfs_show_transport(struct device *dev,
-                                        struct device_attribute *attr,
-                                        char *buf)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-
-       return sysfs_emit(buf, "%s\n", ctrl->ops->name);
-}
-static DEVICE_ATTR(transport, S_IRUGO, nvme_sysfs_show_transport, NULL);
-
-static ssize_t nvme_sysfs_show_state(struct device *dev,
-                                    struct device_attribute *attr,
-                                    char *buf)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-       static const char *const state_name[] = {
-               [NVME_CTRL_NEW]         = "new",
-               [NVME_CTRL_LIVE]        = "live",
-               [NVME_CTRL_RESETTING]   = "resetting",
-               [NVME_CTRL_CONNECTING]  = "connecting",
-               [NVME_CTRL_DELETING]    = "deleting",
-               [NVME_CTRL_DELETING_NOIO]= "deleting (no IO)",
-               [NVME_CTRL_DEAD]        = "dead",
-       };
-
-       if ((unsigned)ctrl->state < ARRAY_SIZE(state_name) &&
-           state_name[ctrl->state])
-               return sysfs_emit(buf, "%s\n", state_name[ctrl->state]);
-
-       return sysfs_emit(buf, "unknown state\n");
-}
-
-static DEVICE_ATTR(state, S_IRUGO, nvme_sysfs_show_state, NULL);
-
-static ssize_t nvme_sysfs_show_subsysnqn(struct device *dev,
-                                        struct device_attribute *attr,
-                                        char *buf)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-
-       return sysfs_emit(buf, "%s\n", ctrl->subsys->subnqn);
-}
-static DEVICE_ATTR(subsysnqn, S_IRUGO, nvme_sysfs_show_subsysnqn, NULL);
-
-static ssize_t nvme_sysfs_show_hostnqn(struct device *dev,
-                                       struct device_attribute *attr,
-                                       char *buf)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-
-       return sysfs_emit(buf, "%s\n", ctrl->opts->host->nqn);
-}
-static DEVICE_ATTR(hostnqn, S_IRUGO, nvme_sysfs_show_hostnqn, NULL);
-
-static ssize_t nvme_sysfs_show_hostid(struct device *dev,
-                                       struct device_attribute *attr,
-                                       char *buf)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-
-       return sysfs_emit(buf, "%pU\n", &ctrl->opts->host->id);
-}
-static DEVICE_ATTR(hostid, S_IRUGO, nvme_sysfs_show_hostid, NULL);
-
-static ssize_t nvme_sysfs_show_address(struct device *dev,
-                                        struct device_attribute *attr,
-                                        char *buf)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-
-       return ctrl->ops->get_address(ctrl, buf, PAGE_SIZE);
-}
-static DEVICE_ATTR(address, S_IRUGO, nvme_sysfs_show_address, NULL);
-
-static ssize_t nvme_ctrl_loss_tmo_show(struct device *dev,
-               struct device_attribute *attr, char *buf)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-       struct nvmf_ctrl_options *opts = ctrl->opts;
-
-       if (ctrl->opts->max_reconnects == -1)
-               return sysfs_emit(buf, "off\n");
-       return sysfs_emit(buf, "%d\n",
-                         opts->max_reconnects * opts->reconnect_delay);
-}
-
-static ssize_t nvme_ctrl_loss_tmo_store(struct device *dev,
-               struct device_attribute *attr, const char *buf, size_t count)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-       struct nvmf_ctrl_options *opts = ctrl->opts;
-       int ctrl_loss_tmo, err;
-
-       err = kstrtoint(buf, 10, &ctrl_loss_tmo);
-       if (err)
-               return -EINVAL;
-
-       if (ctrl_loss_tmo < 0)
-               opts->max_reconnects = -1;
-       else
-               opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo,
-                                               opts->reconnect_delay);
-       return count;
-}
-static DEVICE_ATTR(ctrl_loss_tmo, S_IRUGO | S_IWUSR,
-       nvme_ctrl_loss_tmo_show, nvme_ctrl_loss_tmo_store);
-
-static ssize_t nvme_ctrl_reconnect_delay_show(struct device *dev,
-               struct device_attribute *attr, char *buf)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-
-       if (ctrl->opts->reconnect_delay == -1)
-               return sysfs_emit(buf, "off\n");
-       return sysfs_emit(buf, "%d\n", ctrl->opts->reconnect_delay);
-}
-
-static ssize_t nvme_ctrl_reconnect_delay_store(struct device *dev,
-               struct device_attribute *attr, const char *buf, size_t count)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-       unsigned int v;
-       int err;
-
-       err = kstrtou32(buf, 10, &v);
-       if (err)
-               return err;
-
-       ctrl->opts->reconnect_delay = v;
-       return count;
-}
-static DEVICE_ATTR(reconnect_delay, S_IRUGO | S_IWUSR,
-       nvme_ctrl_reconnect_delay_show, nvme_ctrl_reconnect_delay_store);
-
-static ssize_t nvme_ctrl_fast_io_fail_tmo_show(struct device *dev,
-               struct device_attribute *attr, char *buf)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-
-       if (ctrl->opts->fast_io_fail_tmo == -1)
-               return sysfs_emit(buf, "off\n");
-       return sysfs_emit(buf, "%d\n", ctrl->opts->fast_io_fail_tmo);
-}
-
-static ssize_t nvme_ctrl_fast_io_fail_tmo_store(struct device *dev,
-               struct device_attribute *attr, const char *buf, size_t count)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-       struct nvmf_ctrl_options *opts = ctrl->opts;
-       int fast_io_fail_tmo, err;
-
-       err = kstrtoint(buf, 10, &fast_io_fail_tmo);
-       if (err)
-               return -EINVAL;
-
-       if (fast_io_fail_tmo < 0)
-               opts->fast_io_fail_tmo = -1;
-       else
-               opts->fast_io_fail_tmo = fast_io_fail_tmo;
-       return count;
-}
-static DEVICE_ATTR(fast_io_fail_tmo, S_IRUGO | S_IWUSR,
-       nvme_ctrl_fast_io_fail_tmo_show, nvme_ctrl_fast_io_fail_tmo_store);
-
-static ssize_t cntrltype_show(struct device *dev,
-                             struct device_attribute *attr, char *buf)
-{
-       static const char * const type[] = {
-               [NVME_CTRL_IO] = "io\n",
-               [NVME_CTRL_DISC] = "discovery\n",
-               [NVME_CTRL_ADMIN] = "admin\n",
-       };
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-
-       if (ctrl->cntrltype > NVME_CTRL_ADMIN || !type[ctrl->cntrltype])
-               return sysfs_emit(buf, "reserved\n");
-
-       return sysfs_emit(buf, type[ctrl->cntrltype]);
-}
-static DEVICE_ATTR_RO(cntrltype);
-
-static ssize_t dctype_show(struct device *dev,
-                          struct device_attribute *attr, char *buf)
-{
-       static const char * const type[] = {
-               [NVME_DCTYPE_NOT_REPORTED] = "none\n",
-               [NVME_DCTYPE_DDC] = "ddc\n",
-               [NVME_DCTYPE_CDC] = "cdc\n",
-       };
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-
-       if (ctrl->dctype > NVME_DCTYPE_CDC || !type[ctrl->dctype])
-               return sysfs_emit(buf, "reserved\n");
-
-       return sysfs_emit(buf, type[ctrl->dctype]);
-}
-static DEVICE_ATTR_RO(dctype);
-
-#ifdef CONFIG_NVME_AUTH
-static ssize_t nvme_ctrl_dhchap_secret_show(struct device *dev,
-               struct device_attribute *attr, char *buf)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-       struct nvmf_ctrl_options *opts = ctrl->opts;
-
-       if (!opts->dhchap_secret)
-               return sysfs_emit(buf, "none\n");
-       return sysfs_emit(buf, "%s\n", opts->dhchap_secret);
-}
-
-static ssize_t nvme_ctrl_dhchap_secret_store(struct device *dev,
-               struct device_attribute *attr, const char *buf, size_t count)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-       struct nvmf_ctrl_options *opts = ctrl->opts;
-       char *dhchap_secret;
-
-       if (!ctrl->opts->dhchap_secret)
-               return -EINVAL;
-       if (count < 7)
-               return -EINVAL;
-       if (memcmp(buf, "DHHC-1:", 7))
-               return -EINVAL;
-
-       dhchap_secret = kzalloc(count + 1, GFP_KERNEL);
-       if (!dhchap_secret)
-               return -ENOMEM;
-       memcpy(dhchap_secret, buf, count);
-       nvme_auth_stop(ctrl);
-       if (strcmp(dhchap_secret, opts->dhchap_secret)) {
-               struct nvme_dhchap_key *key, *host_key;
-               int ret;
-
-               ret = nvme_auth_generate_key(dhchap_secret, &key);
-               if (ret)
-                       return ret;
-               kfree(opts->dhchap_secret);
-               opts->dhchap_secret = dhchap_secret;
-               host_key = ctrl->host_key;
-               mutex_lock(&ctrl->dhchap_auth_mutex);
-               ctrl->host_key = key;
-               mutex_unlock(&ctrl->dhchap_auth_mutex);
-               nvme_auth_free_key(host_key);
-       }
-       /* Start re-authentication */
-       dev_info(ctrl->device, "re-authenticating controller\n");
-       queue_work(nvme_wq, &ctrl->dhchap_auth_work);
-
-       return count;
-}
-static DEVICE_ATTR(dhchap_secret, S_IRUGO | S_IWUSR,
-       nvme_ctrl_dhchap_secret_show, nvme_ctrl_dhchap_secret_store);
-
-static ssize_t nvme_ctrl_dhchap_ctrl_secret_show(struct device *dev,
-               struct device_attribute *attr, char *buf)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-       struct nvmf_ctrl_options *opts = ctrl->opts;
-
-       if (!opts->dhchap_ctrl_secret)
-               return sysfs_emit(buf, "none\n");
-       return sysfs_emit(buf, "%s\n", opts->dhchap_ctrl_secret);
-}
-
-static ssize_t nvme_ctrl_dhchap_ctrl_secret_store(struct device *dev,
-               struct device_attribute *attr, const char *buf, size_t count)
-{
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-       struct nvmf_ctrl_options *opts = ctrl->opts;
-       char *dhchap_secret;
-
-       if (!ctrl->opts->dhchap_ctrl_secret)
-               return -EINVAL;
-       if (count < 7)
-               return -EINVAL;
-       if (memcmp(buf, "DHHC-1:", 7))
-               return -EINVAL;
-
-       dhchap_secret = kzalloc(count + 1, GFP_KERNEL);
-       if (!dhchap_secret)
-               return -ENOMEM;
-       memcpy(dhchap_secret, buf, count);
-       nvme_auth_stop(ctrl);
-       if (strcmp(dhchap_secret, opts->dhchap_ctrl_secret)) {
-               struct nvme_dhchap_key *key, *ctrl_key;
-               int ret;
-
-               ret = nvme_auth_generate_key(dhchap_secret, &key);
-               if (ret)
-                       return ret;
-               kfree(opts->dhchap_ctrl_secret);
-               opts->dhchap_ctrl_secret = dhchap_secret;
-               ctrl_key = ctrl->ctrl_key;
-               mutex_lock(&ctrl->dhchap_auth_mutex);
-               ctrl->ctrl_key = key;
-               mutex_unlock(&ctrl->dhchap_auth_mutex);
-               nvme_auth_free_key(ctrl_key);
-       }
-       /* Start re-authentication */
-       dev_info(ctrl->device, "re-authenticating controller\n");
-       queue_work(nvme_wq, &ctrl->dhchap_auth_work);
-
-       return count;
-}
-static DEVICE_ATTR(dhchap_ctrl_secret, S_IRUGO | S_IWUSR,
-       nvme_ctrl_dhchap_ctrl_secret_show, nvme_ctrl_dhchap_ctrl_secret_store);
-#endif
-
-static struct attribute *nvme_dev_attrs[] = {
-       &dev_attr_reset_controller.attr,
-       &dev_attr_rescan_controller.attr,
-       &dev_attr_model.attr,
-       &dev_attr_serial.attr,
-       &dev_attr_firmware_rev.attr,
-       &dev_attr_cntlid.attr,
-       &dev_attr_delete_controller.attr,
-       &dev_attr_transport.attr,
-       &dev_attr_subsysnqn.attr,
-       &dev_attr_address.attr,
-       &dev_attr_state.attr,
-       &dev_attr_numa_node.attr,
-       &dev_attr_queue_count.attr,
-       &dev_attr_sqsize.attr,
-       &dev_attr_hostnqn.attr,
-       &dev_attr_hostid.attr,
-       &dev_attr_ctrl_loss_tmo.attr,
-       &dev_attr_reconnect_delay.attr,
-       &dev_attr_fast_io_fail_tmo.attr,
-       &dev_attr_kato.attr,
-       &dev_attr_cntrltype.attr,
-       &dev_attr_dctype.attr,
-#ifdef CONFIG_NVME_AUTH
-       &dev_attr_dhchap_secret.attr,
-       &dev_attr_dhchap_ctrl_secret.attr,
-#endif
-       NULL
-};
-
-static umode_t nvme_dev_attrs_are_visible(struct kobject *kobj,
-               struct attribute *a, int n)
-{
-       struct device *dev = container_of(kobj, struct device, kobj);
-       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-
-       if (a == &dev_attr_delete_controller.attr && !ctrl->ops->delete_ctrl)
-               return 0;
-       if (a == &dev_attr_address.attr && !ctrl->ops->get_address)
-               return 0;
-       if (a == &dev_attr_hostnqn.attr && !ctrl->opts)
-               return 0;
-       if (a == &dev_attr_hostid.attr && !ctrl->opts)
-               return 0;
-       if (a == &dev_attr_ctrl_loss_tmo.attr && !ctrl->opts)
-               return 0;
-       if (a == &dev_attr_reconnect_delay.attr && !ctrl->opts)
-               return 0;
-       if (a == &dev_attr_fast_io_fail_tmo.attr && !ctrl->opts)
-               return 0;
-#ifdef CONFIG_NVME_AUTH
-       if (a == &dev_attr_dhchap_secret.attr && !ctrl->opts)
-               return 0;
-       if (a == &dev_attr_dhchap_ctrl_secret.attr && !ctrl->opts)
-               return 0;
-#endif
-
-       return a->mode;
-}
-
-const struct attribute_group nvme_dev_attrs_group = {
-       .attrs          = nvme_dev_attrs,
-       .is_visible     = nvme_dev_attrs_are_visible,
-};
-EXPORT_SYMBOL_GPL(nvme_dev_attrs_group);
-
-static const struct attribute_group *nvme_dev_attr_groups[] = {
-       &nvme_dev_attrs_group,
-       NULL,
-};
-
 static struct nvme_ns_head *nvme_find_ns_head(struct nvme_ctrl *ctrl,
                unsigned nsid)
 {
@@ -4256,7 +3610,7 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info)
                        goto out_put_ns_head;
                }
 
-               if (!multipath && !list_empty(&head->list)) {
+               if (!multipath) {
                        dev_warn(ctrl->device,
                                "Found shared namespace %d, but multipathing not supported.\n",
                                info->nsid);
@@ -4357,7 +3711,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
         * instance as shared namespaces will show up as multiple block
         * devices.
         */
-       if (ns->head->disk) {
+       if (nvme_ns_head_multipath(ns->head)) {
                sprintf(disk->disk_name, "nvme%dc%dn%d", ctrl->subsys->instance,
                        ctrl->instance, ns->head->instance);
                disk->flags |= GENHD_FL_HIDDEN;
@@ -5243,6 +4597,8 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 
        return 0;
 out_free_cdev:
+       nvme_fault_inject_fini(&ctrl->fault_inject);
+       dev_pm_qos_hide_latency_tolerance(ctrl->device);
        cdev_device_del(&ctrl->cdev, ctrl->device);
 out_free_name:
        nvme_put_ctrl(ctrl);
index 0069ebf..8175d49 100644
@@ -21,35 +21,60 @@ static DEFINE_MUTEX(nvmf_hosts_mutex);
 
 static struct nvmf_host *nvmf_default_host;
 
-static struct nvmf_host *__nvmf_host_find(const char *hostnqn)
+static struct nvmf_host *nvmf_host_alloc(const char *hostnqn, uuid_t *id)
 {
        struct nvmf_host *host;
 
-       list_for_each_entry(host, &nvmf_hosts, list) {
-               if (!strcmp(host->nqn, hostnqn))
-                       return host;
-       }
+       host = kmalloc(sizeof(*host), GFP_KERNEL);
+       if (!host)
+               return NULL;
 
-       return NULL;
+       kref_init(&host->ref);
+       uuid_copy(&host->id, id);
+       strscpy(host->nqn, hostnqn, NVMF_NQN_SIZE);
+
+       return host;
 }
 
-static struct nvmf_host *nvmf_host_add(const char *hostnqn)
+static struct nvmf_host *nvmf_host_add(const char *hostnqn, uuid_t *id)
 {
        struct nvmf_host *host;
 
        mutex_lock(&nvmf_hosts_mutex);
-       host = __nvmf_host_find(hostnqn);
-       if (host) {
-               kref_get(&host->ref);
-               goto out_unlock;
+
+       /*
+        * We have defined a host as how it is perceived by the target.
+        * Therefore, we don't allow different Host NQNs with the same Host ID.
+        * Similarly, we do not allow the usage of the same Host NQN with
+        * different Host IDs. This'll maintain unambiguous host identification.
+        */
+       list_for_each_entry(host, &nvmf_hosts, list) {
+               bool same_hostnqn = !strcmp(host->nqn, hostnqn);
+               bool same_hostid = uuid_equal(&host->id, id);
+
+               if (same_hostnqn && same_hostid) {
+                       kref_get(&host->ref);
+                       goto out_unlock;
+               }
+               if (same_hostnqn) {
+                       pr_err("found same hostnqn %s but different hostid %pUb\n",
+                              hostnqn, id);
+                       host = ERR_PTR(-EINVAL);
+                       goto out_unlock;
+               }
+               if (same_hostid) {
+                       pr_err("found same hostid %pUb but different hostnqn %s\n",
+                              id, hostnqn);
+                       host = ERR_PTR(-EINVAL);
+                       goto out_unlock;
+               }
        }
 
-       host = kmalloc(sizeof(*host), GFP_KERNEL);
-       if (!host)
+       host = nvmf_host_alloc(hostnqn, id);
+       if (!host) {
+               host = ERR_PTR(-ENOMEM);
                goto out_unlock;
-
-       kref_init(&host->ref);
-       strscpy(host->nqn, hostnqn, NVMF_NQN_SIZE);
+       }
 
        list_add_tail(&host->list, &nvmf_hosts);
 out_unlock:
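
The pairing rule enforced by the loop above boils down to: an (hostnqn, hostid) pair is reused only on an exact match, rejected if it collides with an existing entry on just one of the two identifiers, and allocated fresh otherwise. A minimal userspace sketch of that decision table (the struct and helper below are illustrative, not kernel API):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct demo_host { const char *nqn; const char *id; };

/* 1: reuse existing host, -1: ambiguous pairing (reject), 0: allocate new */
static int demo_host_match(const struct demo_host *hosts, size_t n,
                           const char *nqn, const char *id)
{
        for (size_t i = 0; i < n; i++) {
                bool same_nqn = !strcmp(hosts[i].nqn, nqn);
                bool same_id  = !strcmp(hosts[i].id, id);

                if (same_nqn && same_id)
                        return 1;
                if (same_nqn || same_id)
                        return -1;
        }
        return 0;
}

int main(void)
{
        const struct demo_host hosts[] = {
                { "nqn.2014-08.org.nvmexpress:uuid:aaaa", "hostid-1" },
        };

        printf("%d\n", demo_host_match(hosts, 1, "nqn.2014-08.org.nvmexpress:uuid:aaaa", "hostid-1")); /* 1 */
        printf("%d\n", demo_host_match(hosts, 1, "nqn.2014-08.org.nvmexpress:uuid:bbbb", "hostid-1")); /* -1 */
        printf("%d\n", demo_host_match(hosts, 1, "nqn.2014-08.org.nvmexpress:uuid:bbbb", "hostid-2")); /* 0 */
        return 0;
}
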
@@ -60,16 +85,17 @@ out_unlock:
 static struct nvmf_host *nvmf_host_default(void)
 {
        struct nvmf_host *host;
+       char nqn[NVMF_NQN_SIZE];
+       uuid_t id;
 
-       host = kmalloc(sizeof(*host), GFP_KERNEL);
+       uuid_gen(&id);
+       snprintf(nqn, NVMF_NQN_SIZE,
+               "nqn.2014-08.org.nvmexpress:uuid:%pUb", &id);
+
+       host = nvmf_host_alloc(nqn, &id);
        if (!host)
                return NULL;
 
-       kref_init(&host->ref);
-       uuid_gen(&host->id);
-       snprintf(host->nqn, NVMF_NQN_SIZE,
-               "nqn.2014-08.org.nvmexpress:uuid:%pUb", &host->id);
-
        mutex_lock(&nvmf_hosts_mutex);
        list_add_tail(&host->list, &nvmf_hosts);
        mutex_unlock(&nvmf_hosts_mutex);
@@ -349,6 +375,45 @@ static void nvmf_log_connect_error(struct nvme_ctrl *ctrl,
        }
 }
 
+static struct nvmf_connect_data *nvmf_connect_data_prep(struct nvme_ctrl *ctrl,
+               u16 cntlid)
+{
+       struct nvmf_connect_data *data;
+
+       data = kzalloc(sizeof(*data), GFP_KERNEL);
+       if (!data)
+               return NULL;
+
+       uuid_copy(&data->hostid, &ctrl->opts->host->id);
+       data->cntlid = cpu_to_le16(cntlid);
+       strncpy(data->subsysnqn, ctrl->opts->subsysnqn, NVMF_NQN_SIZE);
+       strncpy(data->hostnqn, ctrl->opts->host->nqn, NVMF_NQN_SIZE);
+
+       return data;
+}
+
+static void nvmf_connect_cmd_prep(struct nvme_ctrl *ctrl, u16 qid,
+               struct nvme_command *cmd)
+{
+       cmd->connect.opcode = nvme_fabrics_command;
+       cmd->connect.fctype = nvme_fabrics_type_connect;
+       cmd->connect.qid = cpu_to_le16(qid);
+
+       if (qid) {
+               cmd->connect.sqsize = cpu_to_le16(ctrl->sqsize);
+       } else {
+               cmd->connect.sqsize = cpu_to_le16(NVME_AQ_DEPTH - 1);
+
+               /*
+                * set keep-alive timeout in seconds granularity (ms * 1000)
+                */
+               cmd->connect.kato = cpu_to_le32(ctrl->kato * 1000);
+       }
+
+       if (ctrl->opts->disable_sqflow)
+               cmd->connect.cattr |= NVME_CONNECT_DISABLE_SQFLOW;
+}
+
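
As a rough illustration of the fields the new prep helper chooses: an admin connect (qid 0) advertises a fixed admin queue depth and carries the keep-alive timeout converted from seconds to milliseconds, while an I/O connect advertises the controller's sqsize. The constants and struct in this hedged userspace sketch are stand-ins, not the on-the-wire SQE layout:

#include <stdio.h>

#define DEMO_AQ_DEPTH 32                /* stand-in for NVME_AQ_DEPTH */

struct demo_connect {
        unsigned int qid;
        unsigned int sqsize;            /* zero's based queue depth */
        unsigned int kato_ms;           /* keep-alive, admin connect only */
};

static struct demo_connect demo_connect_prep(unsigned int qid,
                                             unsigned int io_sqsize,
                                             unsigned int kato_sec)
{
        struct demo_connect c = { .qid = qid };

        if (qid) {
                c.sqsize = io_sqsize;
        } else {
                c.sqsize = DEMO_AQ_DEPTH - 1;
                c.kato_ms = kato_sec * 1000;    /* seconds -> milliseconds */
        }
        return c;
}

int main(void)
{
        struct demo_connect admin = demo_connect_prep(0, 127, 5);
        struct demo_connect io = demo_connect_prep(1, 127, 5);

        printf("admin connect: sqsize=%u kato=%ums\n", admin.sqsize, admin.kato_ms);
        printf("io connect:    sqsize=%u kato=%ums\n", io.sqsize, io.kato_ms);
        return 0;
}
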
 /**
  * nvmf_connect_admin_queue() - NVMe Fabrics Admin Queue "Connect"
  *                             API function.
@@ -377,28 +442,12 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl)
        int ret;
        u32 result;
 
-       cmd.connect.opcode = nvme_fabrics_command;
-       cmd.connect.fctype = nvme_fabrics_type_connect;
-       cmd.connect.qid = 0;
-       cmd.connect.sqsize = cpu_to_le16(NVME_AQ_DEPTH - 1);
-
-       /*
-        * Set keep-alive timeout in seconds granularity (ms * 1000)
-        */
-       cmd.connect.kato = cpu_to_le32(ctrl->kato * 1000);
-
-       if (ctrl->opts->disable_sqflow)
-               cmd.connect.cattr |= NVME_CONNECT_DISABLE_SQFLOW;
+       nvmf_connect_cmd_prep(ctrl, 0, &cmd);
 
-       data = kzalloc(sizeof(*data), GFP_KERNEL);
+       data = nvmf_connect_data_prep(ctrl, 0xffff);
        if (!data)
                return -ENOMEM;
 
-       uuid_copy(&data->hostid, &ctrl->opts->host->id);
-       data->cntlid = cpu_to_le16(0xffff);
-       strncpy(data->subsysnqn, ctrl->opts->subsysnqn, NVMF_NQN_SIZE);
-       strncpy(data->hostnqn, ctrl->opts->host->nqn, NVMF_NQN_SIZE);
-
        ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res,
                        data, sizeof(*data), NVME_QID_ANY, 1,
                        BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT);
@@ -468,23 +517,12 @@ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid)
        int ret;
        u32 result;
 
-       cmd.connect.opcode = nvme_fabrics_command;
-       cmd.connect.fctype = nvme_fabrics_type_connect;
-       cmd.connect.qid = cpu_to_le16(qid);
-       cmd.connect.sqsize = cpu_to_le16(ctrl->sqsize);
+       nvmf_connect_cmd_prep(ctrl, qid, &cmd);
 
-       if (ctrl->opts->disable_sqflow)
-               cmd.connect.cattr |= NVME_CONNECT_DISABLE_SQFLOW;
-
-       data = kzalloc(sizeof(*data), GFP_KERNEL);
+       data = nvmf_connect_data_prep(ctrl, ctrl->cntlid);
        if (!data)
                return -ENOMEM;
 
-       uuid_copy(&data->hostid, &ctrl->opts->host->id);
-       data->cntlid = cpu_to_le16(ctrl->cntlid);
-       strncpy(data->subsysnqn, ctrl->opts->subsysnqn, NVMF_NQN_SIZE);
-       strncpy(data->hostnqn, ctrl->opts->host->nqn, NVMF_NQN_SIZE);
-
        ret = __nvme_submit_sync_cmd(ctrl->connect_q, &cmd, &res,
                        data, sizeof(*data), qid, 1,
                        BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT);
@@ -621,6 +659,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
        size_t nqnlen  = 0;
        int ctrl_loss_tmo = NVMF_DEF_CTRL_LOSS_TMO;
        uuid_t hostid;
+       char hostnqn[NVMF_NQN_SIZE];
 
        /* Set defaults */
        opts->queue_size = NVMF_DEF_QUEUE_SIZE;
@@ -637,7 +676,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
        if (!options)
                return -ENOMEM;
 
-       uuid_gen(&hostid);
+       /* use default host if not given by user space */
+       uuid_copy(&hostid, &nvmf_default_host->id);
+       strscpy(hostnqn, nvmf_default_host->nqn, NVMF_NQN_SIZE);
 
        while ((p = strsep(&o, ",\n")) != NULL) {
                if (!*p)
@@ -783,12 +824,8 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
                                ret = -EINVAL;
                                goto out;
                        }
-                       opts->host = nvmf_host_add(p);
+                       strscpy(hostnqn, p, NVMF_NQN_SIZE);
                        kfree(p);
-                       if (!opts->host) {
-                               ret = -ENOMEM;
-                               goto out;
-                       }
                        break;
                case NVMF_OPT_RECONNECT_DELAY:
                        if (match_int(args, &token)) {
@@ -945,18 +982,94 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
                                opts->fast_io_fail_tmo, ctrl_loss_tmo);
        }
 
-       if (!opts->host) {
-               kref_get(&nvmf_default_host->ref);
-               opts->host = nvmf_default_host;
+       opts->host = nvmf_host_add(hostnqn, &hostid);
+       if (IS_ERR(opts->host)) {
+               ret = PTR_ERR(opts->host);
+               opts->host = NULL;
+               goto out;
        }
 
-       uuid_copy(&opts->host->id, &hostid);
-
 out:
        kfree(options);
        return ret;
 }
 
+void nvmf_set_io_queues(struct nvmf_ctrl_options *opts, u32 nr_io_queues,
+                       u32 io_queues[HCTX_MAX_TYPES])
+{
+       if (opts->nr_write_queues && opts->nr_io_queues < nr_io_queues) {
+               /*
+                * separate read/write queues
+                * hand out dedicated default queues only after we have
+                * sufficient read queues.
+                */
+               io_queues[HCTX_TYPE_READ] = opts->nr_io_queues;
+               nr_io_queues -= io_queues[HCTX_TYPE_READ];
+               io_queues[HCTX_TYPE_DEFAULT] =
+                       min(opts->nr_write_queues, nr_io_queues);
+               nr_io_queues -= io_queues[HCTX_TYPE_DEFAULT];
+       } else {
+               /*
+                * shared read/write queues
+                * either no write queues were requested, or we don't have
+                * sufficient queue count to have dedicated default queues.
+                */
+               io_queues[HCTX_TYPE_DEFAULT] =
+                       min(opts->nr_io_queues, nr_io_queues);
+               nr_io_queues -= io_queues[HCTX_TYPE_DEFAULT];
+       }
+
+       if (opts->nr_poll_queues && nr_io_queues) {
+               /* map dedicated poll queues only if we have queues left */
+               io_queues[HCTX_TYPE_POLL] =
+                       min(opts->nr_poll_queues, nr_io_queues);
+       }
+}
+EXPORT_SYMBOL_GPL(nvmf_set_io_queues);
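
The split performed by nvmf_set_io_queues() hands out dedicated read queues first (when write queues were requested and enough queues are available), then default queues for writes, and finally poll queues from whatever remains. A small userspace sketch of that arithmetic, using hypothetical option values:

#include <stdio.h>

#define DEMO_MIN(a, b) ((a) < (b) ? (a) : (b))

struct demo_split { unsigned int def, read, poll; };

static struct demo_split demo_set_io_queues(unsigned int nr_io_queues,
                                            unsigned int want_read,
                                            unsigned int want_write,
                                            unsigned int want_poll)
{
        struct demo_split s = { 0, 0, 0 };

        if (want_write && want_read < nr_io_queues) {
                /* dedicated read queues first, then default (write) queues */
                s.read = want_read;
                nr_io_queues -= s.read;
                s.def = DEMO_MIN(want_write, nr_io_queues);
                nr_io_queues -= s.def;
        } else {
                /* shared read/write queues */
                s.def = DEMO_MIN(want_read, nr_io_queues);
                nr_io_queues -= s.def;
        }
        if (want_poll && nr_io_queues)
                s.poll = DEMO_MIN(want_poll, nr_io_queues);
        return s;
}

int main(void)
{
        /* e.g. 8 queues granted, user asked for 4 read, 2 write, 2 poll */
        struct demo_split s = demo_set_io_queues(8, 4, 2, 2);

        printf("default=%u read=%u poll=%u\n", s.def, s.read, s.poll); /* 2/4/2 */
        return 0;
}
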
+
+void nvmf_map_queues(struct blk_mq_tag_set *set, struct nvme_ctrl *ctrl,
+                    u32 io_queues[HCTX_MAX_TYPES])
+{
+       struct nvmf_ctrl_options *opts = ctrl->opts;
+
+       if (opts->nr_write_queues && io_queues[HCTX_TYPE_READ]) {
+               /* separate read/write queues */
+               set->map[HCTX_TYPE_DEFAULT].nr_queues =
+                       io_queues[HCTX_TYPE_DEFAULT];
+               set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;
+               set->map[HCTX_TYPE_READ].nr_queues =
+                       io_queues[HCTX_TYPE_READ];
+               set->map[HCTX_TYPE_READ].queue_offset =
+                       io_queues[HCTX_TYPE_DEFAULT];
+       } else {
+               /* shared read/write queues */
+               set->map[HCTX_TYPE_DEFAULT].nr_queues =
+                       io_queues[HCTX_TYPE_DEFAULT];
+               set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;
+               set->map[HCTX_TYPE_READ].nr_queues =
+                       io_queues[HCTX_TYPE_DEFAULT];
+               set->map[HCTX_TYPE_READ].queue_offset = 0;
+       }
+
+       blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
+       blk_mq_map_queues(&set->map[HCTX_TYPE_READ]);
+       if (opts->nr_poll_queues && io_queues[HCTX_TYPE_POLL]) {
+               /* map dedicated poll queues only if we have queues left */
+               set->map[HCTX_TYPE_POLL].nr_queues = io_queues[HCTX_TYPE_POLL];
+               set->map[HCTX_TYPE_POLL].queue_offset =
+                       io_queues[HCTX_TYPE_DEFAULT] +
+                       io_queues[HCTX_TYPE_READ];
+               blk_mq_map_queues(&set->map[HCTX_TYPE_POLL]);
+       }
+
+       dev_info(ctrl->device,
+               "mapped %d/%d/%d default/read/poll queues.\n",
+               io_queues[HCTX_TYPE_DEFAULT],
+               io_queues[HCTX_TYPE_READ],
+               io_queues[HCTX_TYPE_POLL]);
+}
+EXPORT_SYMBOL_GPL(nvmf_map_queues);
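
The companion mapping lays the three queue groups out back to back in the tag set: default queues start at offset 0, dedicated read queues follow them (or alias offset 0 when shared), and poll queues sit after both. A hedged sketch of the offsets, reusing the split from the previous example:

#include <stdio.h>

int main(void)
{
        /* counts from the split in the previous example */
        unsigned int def = 2, read = 4, poll = 2;
        int dedicated_read = read != 0;

        unsigned int def_offset  = 0;
        unsigned int read_offset = dedicated_read ? def : 0;
        unsigned int poll_offset = def + read;

        printf("default: %u queues at offset %u\n", def, def_offset);
        printf("read:    %u queues at offset %u\n",
               dedicated_read ? read : def, read_offset);
        printf("poll:    %u queues at offset %u\n", poll, poll_offset);
        return 0;
}
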
+
 static int nvmf_check_required_opts(struct nvmf_ctrl_options *opts,
                unsigned int required_opts)
 {
index dcac3df..82e7a27 100644
@@ -77,6 +77,9 @@ enum {
  *                           with the parsing opts enum.
  * @mask:      Used by the fabrics library to parse through sysfs options
  *             on adding a NVMe controller.
+ * @max_reconnects: maximum number of allowed reconnect attempts before removing
+ *             the controller, (-1) means reconnect forever, zero means remove
+ *             immediately;
  * @transport: Holds the fabric transport "technology name" (for a lack of
  *             better description) that will be used by an NVMe controller
  *             being added.
@@ -96,9 +99,6 @@ enum {
  * @discovery_nqn: indicates if the subsysnqn is the well-known discovery NQN.
  * @kato:      Keep-alive timeout.
  * @host:      Virtual NVMe host, contains the NQN and Host ID.
- * @max_reconnects: maximum number of allowed reconnect attempts before removing
- *              the controller, (-1) means reconnect forever, zero means remove
- *              immediately;
  * @dhchap_secret: DH-HMAC-CHAP secret
  * @dhchap_ctrl_secret: DH-HMAC-CHAP controller secret for bi-directional
  *              authentication
@@ -112,6 +112,7 @@ enum {
  */
 struct nvmf_ctrl_options {
        unsigned                mask;
+       int                     max_reconnects;
        char                    *transport;
        char                    *subsysnqn;
        char                    *traddr;
@@ -125,7 +126,6 @@ struct nvmf_ctrl_options {
        bool                    duplicate_connect;
        unsigned int            kato;
        struct nvmf_host        *host;
-       int                     max_reconnects;
        char                    *dhchap_secret;
        char                    *dhchap_ctrl_secret;
        bool                    disable_sqflow;
@@ -181,7 +181,7 @@ nvmf_ctlr_matches_baseopts(struct nvme_ctrl *ctrl,
            ctrl->state == NVME_CTRL_DEAD ||
            strcmp(opts->subsysnqn, ctrl->opts->subsysnqn) ||
            strcmp(opts->host->nqn, ctrl->opts->host->nqn) ||
-           memcmp(&opts->host->id, &ctrl->opts->host->id, sizeof(uuid_t)))
+           !uuid_equal(&opts->host->id, &ctrl->opts->host->id))
                return false;
 
        return true;
@@ -203,6 +203,13 @@ static inline void nvmf_complete_timed_out_request(struct request *rq)
        }
 }
 
+static inline unsigned int nvmf_nr_io_queues(struct nvmf_ctrl_options *opts)
+{
+       return min(opts->nr_io_queues, num_online_cpus()) +
+               min(opts->nr_write_queues, num_online_cpus()) +
+               min(opts->nr_poll_queues, num_online_cpus());
+}
+
 int nvmf_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val);
 int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val);
@@ -215,5 +222,9 @@ int nvmf_get_address(struct nvme_ctrl *ctrl, char *buf, int size);
 bool nvmf_should_reconnect(struct nvme_ctrl *ctrl);
 bool nvmf_ip_options_match(struct nvme_ctrl *ctrl,
                struct nvmf_ctrl_options *opts);
+void nvmf_set_io_queues(struct nvmf_ctrl_options *opts, u32 nr_io_queues,
+                       u32 io_queues[HCTX_MAX_TYPES]);
+void nvmf_map_queues(struct blk_mq_tag_set *set, struct nvme_ctrl *ctrl,
+                    u32 io_queues[HCTX_MAX_TYPES]);
 
 #endif /* _NVME_FABRICS_H */
index 2ed7592..691f2df 100644
@@ -2917,8 +2917,8 @@ nvme_fc_create_io_queues(struct nvme_fc_ctrl *ctrl)
 
        ret = nvme_alloc_io_tag_set(&ctrl->ctrl, &ctrl->tag_set,
                        &nvme_fc_mq_ops, 1,
-                       struct_size((struct nvme_fcp_op_w_sgl *)NULL, priv,
-                                   ctrl->lport->ops->fcprqst_priv_sz));
+                       struct_size_t(struct nvme_fcp_op_w_sgl, priv,
+                                     ctrl->lport->ops->fcprqst_priv_sz));
        if (ret)
                return ret;
 
@@ -3536,8 +3536,8 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 
        ret = nvme_alloc_admin_tag_set(&ctrl->ctrl, &ctrl->admin_tag_set,
                        &nvme_fc_admin_mq_ops,
-                       struct_size((struct nvme_fcp_op_w_sgl *)NULL, priv,
-                                   ctrl->lport->ops->fcprqst_priv_sz));
+                       struct_size_t(struct nvme_fcp_op_w_sgl, priv,
+                                     ctrl->lport->ops->fcprqst_priv_sz));
        if (ret)
                goto fail_ctrl;
 
index f15e733..2130ad6 100644
@@ -14,7 +14,7 @@ enum {
 };
 
 static bool nvme_cmd_allowed(struct nvme_ns *ns, struct nvme_command *c,
-               unsigned int flags, fmode_t mode)
+               unsigned int flags, bool open_for_write)
 {
        u32 effects;
 
@@ -80,7 +80,7 @@ static bool nvme_cmd_allowed(struct nvme_ns *ns, struct nvme_command *c,
         * writing.
         */
        if (nvme_is_write(c) || (effects & NVME_CMD_EFFECTS_LBCC))
-               return mode & FMODE_WRITE;
+               return open_for_write;
        return true;
 }
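
The permission check above reduces to: any command that writes data, or whose command effects report logical block content changes (LBCC), requires the handle to have been opened for writing; everything else is allowed. A small userspace sketch of that rule (names and the effects bit below are illustrative only):

#include <stdbool.h>
#include <stdio.h>

#define DEMO_EFFECTS_LBCC (1u << 1)     /* illustrative bit position */

static bool demo_cmd_allowed(bool is_write_cmd, unsigned int effects,
                             bool open_for_write)
{
        if (is_write_cmd || (effects & DEMO_EFFECTS_LBCC))
                return open_for_write;
        return true;
}

int main(void)
{
        printf("%d\n", demo_cmd_allowed(true, 0, false));                 /* 0: rejected */
        printf("%d\n", demo_cmd_allowed(false, DEMO_EFFECTS_LBCC, true)); /* 1 */
        printf("%d\n", demo_cmd_allowed(false, 0, false));                /* 1: read-only ok */
        return 0;
}
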
 
@@ -337,7 +337,7 @@ static bool nvme_validate_passthru_nsid(struct nvme_ctrl *ctrl,
 
 static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
                struct nvme_passthru_cmd __user *ucmd, unsigned int flags,
-               fmode_t mode)
+               bool open_for_write)
 {
        struct nvme_passthru_cmd cmd;
        struct nvme_command c;
@@ -365,7 +365,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
        c.common.cdw14 = cpu_to_le32(cmd.cdw14);
        c.common.cdw15 = cpu_to_le32(cmd.cdw15);
 
-       if (!nvme_cmd_allowed(ns, &c, 0, mode))
+       if (!nvme_cmd_allowed(ns, &c, 0, open_for_write))
                return -EACCES;
 
        if (cmd.timeout_ms)
@@ -385,7 +385,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 
 static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
                struct nvme_passthru_cmd64 __user *ucmd, unsigned int flags,
-               fmode_t mode)
+               bool open_for_write)
 {
        struct nvme_passthru_cmd64 cmd;
        struct nvme_command c;
@@ -412,7 +412,7 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
        c.common.cdw14 = cpu_to_le32(cmd.cdw14);
        c.common.cdw15 = cpu_to_le32(cmd.cdw15);
 
-       if (!nvme_cmd_allowed(ns, &c, flags, mode))
+       if (!nvme_cmd_allowed(ns, &c, flags, open_for_write))
                return -EACCES;
 
        if (cmd.timeout_ms)
@@ -521,7 +521,7 @@ static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
        if (cookie != NULL && blk_rq_is_poll(req))
                nvme_uring_task_cb(ioucmd, IO_URING_F_UNLOCKED);
        else
-               io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
+               io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
 
        return RQ_END_IO_FREE;
 }
@@ -543,7 +543,7 @@ static enum rq_end_io_ret nvme_uring_cmd_end_io_meta(struct request *req,
        if (cookie != NULL && blk_rq_is_poll(req))
                nvme_uring_task_meta_cb(ioucmd, IO_URING_F_UNLOCKED);
        else
-               io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_meta_cb);
+               io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_meta_cb);
 
        return RQ_END_IO_NONE;
 }
@@ -583,7 +583,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
        c.common.cdw14 = cpu_to_le32(READ_ONCE(cmd->cdw14));
        c.common.cdw15 = cpu_to_le32(READ_ONCE(cmd->cdw15));
 
-       if (!nvme_cmd_allowed(ns, &c, 0, ioucmd->file->f_mode))
+       if (!nvme_cmd_allowed(ns, &c, 0, ioucmd->file->f_mode & FMODE_WRITE))
                return -EACCES;
 
        d.metadata = READ_ONCE(cmd->metadata);
@@ -649,13 +649,13 @@ static bool is_ctrl_ioctl(unsigned int cmd)
 }
 
 static int nvme_ctrl_ioctl(struct nvme_ctrl *ctrl, unsigned int cmd,
-               void __user *argp, fmode_t mode)
+               void __user *argp, bool open_for_write)
 {
        switch (cmd) {
        case NVME_IOCTL_ADMIN_CMD:
-               return nvme_user_cmd(ctrl, NULL, argp, 0, mode);
+               return nvme_user_cmd(ctrl, NULL, argp, 0, open_for_write);
        case NVME_IOCTL_ADMIN64_CMD:
-               return nvme_user_cmd64(ctrl, NULL, argp, 0, mode);
+               return nvme_user_cmd64(ctrl, NULL, argp, 0, open_for_write);
        default:
                return sed_ioctl(ctrl->opal_dev, cmd, argp);
        }
@@ -680,14 +680,14 @@ struct nvme_user_io32 {
 #endif /* COMPAT_FOR_U64_ALIGNMENT */
 
 static int nvme_ns_ioctl(struct nvme_ns *ns, unsigned int cmd,
-               void __user *argp, unsigned int flags, fmode_t mode)
+               void __user *argp, unsigned int flags, bool open_for_write)
 {
        switch (cmd) {
        case NVME_IOCTL_ID:
                force_successful_syscall_return();
                return ns->head->ns_id;
        case NVME_IOCTL_IO_CMD:
-               return nvme_user_cmd(ns->ctrl, ns, argp, flags, mode);
+               return nvme_user_cmd(ns->ctrl, ns, argp, flags, open_for_write);
        /*
         * struct nvme_user_io can have different padding on some 32-bit ABIs.
         * Just accept the compat version as all fields that are used are the
@@ -702,16 +702,18 @@ static int nvme_ns_ioctl(struct nvme_ns *ns, unsigned int cmd,
                flags |= NVME_IOCTL_VEC;
                fallthrough;
        case NVME_IOCTL_IO64_CMD:
-               return nvme_user_cmd64(ns->ctrl, ns, argp, flags, mode);
+               return nvme_user_cmd64(ns->ctrl, ns, argp, flags,
+                                      open_for_write);
        default:
                return -ENOTTY;
        }
 }
 
-int nvme_ioctl(struct block_device *bdev, fmode_t mode,
+int nvme_ioctl(struct block_device *bdev, blk_mode_t mode,
                unsigned int cmd, unsigned long arg)
 {
        struct nvme_ns *ns = bdev->bd_disk->private_data;
+       bool open_for_write = mode & BLK_OPEN_WRITE;
        void __user *argp = (void __user *)arg;
        unsigned int flags = 0;
 
@@ -719,19 +721,20 @@ int nvme_ioctl(struct block_device *bdev, fmode_t mode,
                flags |= NVME_IOCTL_PARTITION;
 
        if (is_ctrl_ioctl(cmd))
-               return nvme_ctrl_ioctl(ns->ctrl, cmd, argp, mode);
-       return nvme_ns_ioctl(ns, cmd, argp, flags, mode);
+               return nvme_ctrl_ioctl(ns->ctrl, cmd, argp, open_for_write);
+       return nvme_ns_ioctl(ns, cmd, argp, flags, open_for_write);
 }
 
 long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
        struct nvme_ns *ns =
                container_of(file_inode(file)->i_cdev, struct nvme_ns, cdev);
+       bool open_for_write = file->f_mode & FMODE_WRITE;
        void __user *argp = (void __user *)arg;
 
        if (is_ctrl_ioctl(cmd))
-               return nvme_ctrl_ioctl(ns->ctrl, cmd, argp, file->f_mode);
-       return nvme_ns_ioctl(ns, cmd, argp, 0, file->f_mode);
+               return nvme_ctrl_ioctl(ns->ctrl, cmd, argp, open_for_write);
+       return nvme_ns_ioctl(ns, cmd, argp, 0, open_for_write);
 }
 
 static int nvme_uring_cmd_checks(unsigned int issue_flags)
@@ -800,7 +803,7 @@ int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
 #ifdef CONFIG_NVME_MULTIPATH
 static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
                void __user *argp, struct nvme_ns_head *head, int srcu_idx,
-               fmode_t mode)
+               bool open_for_write)
        __releases(&head->srcu)
 {
        struct nvme_ctrl *ctrl = ns->ctrl;
@@ -808,16 +811,17 @@ static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
 
        nvme_get_ctrl(ns->ctrl);
        srcu_read_unlock(&head->srcu, srcu_idx);
-       ret = nvme_ctrl_ioctl(ns->ctrl, cmd, argp, mode);
+       ret = nvme_ctrl_ioctl(ns->ctrl, cmd, argp, open_for_write);
 
        nvme_put_ctrl(ctrl);
        return ret;
 }
 
-int nvme_ns_head_ioctl(struct block_device *bdev, fmode_t mode,
+int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
                unsigned int cmd, unsigned long arg)
 {
        struct nvme_ns_head *head = bdev->bd_disk->private_data;
+       bool open_for_write = mode & BLK_OPEN_WRITE;
        void __user *argp = (void __user *)arg;
        struct nvme_ns *ns;
        int srcu_idx, ret = -EWOULDBLOCK;
@@ -838,9 +842,9 @@ int nvme_ns_head_ioctl(struct block_device *bdev, fmode_t mode,
         */
        if (is_ctrl_ioctl(cmd))
                return nvme_ns_head_ctrl_ioctl(ns, cmd, argp, head, srcu_idx,
-                                       mode);
+                                              open_for_write);
 
-       ret = nvme_ns_ioctl(ns, cmd, argp, flags, mode);
+       ret = nvme_ns_ioctl(ns, cmd, argp, flags, open_for_write);
 out_unlock:
        srcu_read_unlock(&head->srcu, srcu_idx);
        return ret;
@@ -849,6 +853,7 @@ out_unlock:
 long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
                unsigned long arg)
 {
+       bool open_for_write = file->f_mode & FMODE_WRITE;
        struct cdev *cdev = file_inode(file)->i_cdev;
        struct nvme_ns_head *head =
                container_of(cdev, struct nvme_ns_head, cdev);
@@ -863,9 +868,9 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
 
        if (is_ctrl_ioctl(cmd))
                return nvme_ns_head_ctrl_ioctl(ns, cmd, argp, head, srcu_idx,
-                               file->f_mode);
+                               open_for_write);
 
-       ret = nvme_ns_ioctl(ns, cmd, argp, 0, file->f_mode);
+       ret = nvme_ns_ioctl(ns, cmd, argp, 0, open_for_write);
 out_unlock:
        srcu_read_unlock(&head->srcu, srcu_idx);
        return ret;
@@ -940,7 +945,7 @@ int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
 }
 
 static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp,
-               fmode_t mode)
+               bool open_for_write)
 {
        struct nvme_ns *ns;
        int ret;
@@ -964,7 +969,7 @@ static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp,
        kref_get(&ns->kref);
        up_read(&ctrl->namespaces_rwsem);
 
-       ret = nvme_user_cmd(ctrl, ns, argp, 0, mode);
+       ret = nvme_user_cmd(ctrl, ns, argp, 0, open_for_write);
        nvme_put_ns(ns);
        return ret;
 
@@ -976,16 +981,17 @@ out_unlock:
 long nvme_dev_ioctl(struct file *file, unsigned int cmd,
                unsigned long arg)
 {
+       bool open_for_write = file->f_mode & FMODE_WRITE;
        struct nvme_ctrl *ctrl = file->private_data;
        void __user *argp = (void __user *)arg;
 
        switch (cmd) {
        case NVME_IOCTL_ADMIN_CMD:
-               return nvme_user_cmd(ctrl, NULL, argp, 0, file->f_mode);
+               return nvme_user_cmd(ctrl, NULL, argp, 0, open_for_write);
        case NVME_IOCTL_ADMIN64_CMD:
-               return nvme_user_cmd64(ctrl, NULL, argp, 0, file->f_mode);
+               return nvme_user_cmd64(ctrl, NULL, argp, 0, open_for_write);
        case NVME_IOCTL_IO_CMD:
-               return nvme_dev_user_cmd(ctrl, argp, file->f_mode);
+               return nvme_dev_user_cmd(ctrl, argp, open_for_write);
        case NVME_IOCTL_RESET:
                if (!capable(CAP_SYS_ADMIN))
                        return -EACCES;
index 2bc159a..98001ee 100644
@@ -402,14 +402,14 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
        srcu_read_unlock(&head->srcu, srcu_idx);
 }
 
-static int nvme_ns_head_open(struct block_device *bdev, fmode_t mode)
+static int nvme_ns_head_open(struct gendisk *disk, blk_mode_t mode)
 {
-       if (!nvme_tryget_ns_head(bdev->bd_disk->private_data))
+       if (!nvme_tryget_ns_head(disk->private_data))
                return -ENXIO;
        return 0;
 }
 
-static void nvme_ns_head_release(struct gendisk *disk, fmode_t mode)
+static void nvme_ns_head_release(struct gendisk *disk)
 {
        nvme_put_ns_head(disk->private_data);
 }
index 8657811..9a98c14 100644
@@ -247,12 +247,13 @@ enum nvme_ctrl_flags {
        NVME_CTRL_ADMIN_Q_STOPPED       = 1,
        NVME_CTRL_STARTED_ONCE          = 2,
        NVME_CTRL_STOPPED               = 3,
+       NVME_CTRL_SKIP_ID_CNS_CS        = 4,
 };
 
 struct nvme_ctrl {
        bool comp_seen;
-       enum nvme_ctrl_state state;
        bool identified;
+       enum nvme_ctrl_state state;
        spinlock_t lock;
        struct mutex scan_lock;
        const struct nvme_ctrl_ops *ops;
@@ -284,8 +285,8 @@ struct nvme_ctrl {
        char name[12];
        u16 cntlid;
 
-       u32 ctrl_config;
        u16 mtfa;
+       u32 ctrl_config;
        u32 queue_count;
 
        u64 cap;
@@ -359,10 +360,10 @@ struct nvme_ctrl {
        bool apst_enabled;
 
        /* PCIe only: */
+       u16 hmmaxd;
        u32 hmpre;
        u32 hmmin;
        u32 hmminds;
-       u16 hmmaxd;
 
        /* Fabrics only */
        u32 ioccsz;
@@ -842,10 +843,10 @@ void nvme_put_ns_head(struct nvme_ns_head *head);
 int nvme_cdev_add(struct cdev *cdev, struct device *cdev_device,
                const struct file_operations *fops, struct module *owner);
 void nvme_cdev_del(struct cdev *cdev, struct device *cdev_device);
-int nvme_ioctl(struct block_device *bdev, fmode_t mode,
+int nvme_ioctl(struct block_device *bdev, blk_mode_t mode,
                unsigned int cmd, unsigned long arg);
 long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
-int nvme_ns_head_ioctl(struct block_device *bdev, fmode_t mode,
+int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
                unsigned int cmd, unsigned long arg);
 long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
                unsigned long arg);
@@ -866,7 +867,11 @@ extern const struct attribute_group *nvme_ns_id_attr_groups[];
 extern const struct pr_ops nvme_pr_ops;
 extern const struct block_device_operations nvme_ns_head_ops;
 extern const struct attribute_group nvme_dev_attrs_group;
+extern const struct attribute_group *nvme_subsys_attrs_groups[];
+extern const struct attribute_group *nvme_dev_attr_groups[];
+extern const struct block_device_operations nvme_bdev_ops;
 
+void nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl);
 struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
 #ifdef CONFIG_NVME_MULTIPATH
 static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
index 492f319..48c60f7 100644
@@ -420,10 +420,9 @@ static int nvme_pci_init_request(struct blk_mq_tag_set *set,
                struct request *req, unsigned int hctx_idx,
                unsigned int numa_node)
 {
-       struct nvme_dev *dev = to_nvme_dev(set->driver_data);
        struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 
-       nvme_req(req)->ctrl = &dev->ctrl;
+       nvme_req(req)->ctrl = set->driver_data;
        nvme_req(req)->cmd = &iod->cmd;
        return 0;
 }
index 0eb7969..d433b2e 100644
@@ -501,7 +501,7 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
        }
        ibdev = queue->device->dev;
 
-       /* +1 for ib_stop_cq */
+       /* +1 for ib_drain_qp */
        queue->cq_size = cq_factor * queue->queue_size + 1;
 
        ret = nvme_rdma_create_cq(ibdev, queue);
@@ -713,18 +713,10 @@ out_stop_queues:
 static int nvme_rdma_alloc_io_queues(struct nvme_rdma_ctrl *ctrl)
 {
        struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;
-       struct ib_device *ibdev = ctrl->device->dev;
-       unsigned int nr_io_queues, nr_default_queues;
-       unsigned int nr_read_queues, nr_poll_queues;
+       unsigned int nr_io_queues;
        int i, ret;
 
-       nr_read_queues = min_t(unsigned int, ibdev->num_comp_vectors,
-                               min(opts->nr_io_queues, num_online_cpus()));
-       nr_default_queues =  min_t(unsigned int, ibdev->num_comp_vectors,
-                               min(opts->nr_write_queues, num_online_cpus()));
-       nr_poll_queues = min(opts->nr_poll_queues, num_online_cpus());
-       nr_io_queues = nr_read_queues + nr_default_queues + nr_poll_queues;
-
+       nr_io_queues = nvmf_nr_io_queues(opts);
        ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
        if (ret)
                return ret;
@@ -739,34 +731,7 @@ static int nvme_rdma_alloc_io_queues(struct nvme_rdma_ctrl *ctrl)
        dev_info(ctrl->ctrl.device,
                "creating %d I/O queues.\n", nr_io_queues);
 
-       if (opts->nr_write_queues && nr_read_queues < nr_io_queues) {
-               /*
-                * separate read/write queues
-                * hand out dedicated default queues only after we have
-                * sufficient read queues.
-                */
-               ctrl->io_queues[HCTX_TYPE_READ] = nr_read_queues;
-               nr_io_queues -= ctrl->io_queues[HCTX_TYPE_READ];
-               ctrl->io_queues[HCTX_TYPE_DEFAULT] =
-                       min(nr_default_queues, nr_io_queues);
-               nr_io_queues -= ctrl->io_queues[HCTX_TYPE_DEFAULT];
-       } else {
-               /*
-                * shared read/write queues
-                * either no write queues were requested, or we don't have
-                * sufficient queue count to have dedicated default queues.
-                */
-               ctrl->io_queues[HCTX_TYPE_DEFAULT] =
-                       min(nr_read_queues, nr_io_queues);
-               nr_io_queues -= ctrl->io_queues[HCTX_TYPE_DEFAULT];
-       }
-
-       if (opts->nr_poll_queues && nr_io_queues) {
-               /* map dedicated poll queues only if we have queues left */
-               ctrl->io_queues[HCTX_TYPE_POLL] =
-                       min(nr_poll_queues, nr_io_queues);
-       }
-
+       nvmf_set_io_queues(opts, nr_io_queues, ctrl->io_queues);
        for (i = 1; i < ctrl->ctrl.queue_count; i++) {
                ret = nvme_rdma_alloc_queue(ctrl, i,
                                ctrl->ctrl.sqsize + 1);
@@ -2138,44 +2103,8 @@ static void nvme_rdma_complete_rq(struct request *rq)
 static void nvme_rdma_map_queues(struct blk_mq_tag_set *set)
 {
        struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(set->driver_data);
-       struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;
 
-       if (opts->nr_write_queues && ctrl->io_queues[HCTX_TYPE_READ]) {
-               /* separate read/write queues */
-               set->map[HCTX_TYPE_DEFAULT].nr_queues =
-                       ctrl->io_queues[HCTX_TYPE_DEFAULT];
-               set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;
-               set->map[HCTX_TYPE_READ].nr_queues =
-                       ctrl->io_queues[HCTX_TYPE_READ];
-               set->map[HCTX_TYPE_READ].queue_offset =
-                       ctrl->io_queues[HCTX_TYPE_DEFAULT];
-       } else {
-               /* shared read/write queues */
-               set->map[HCTX_TYPE_DEFAULT].nr_queues =
-                       ctrl->io_queues[HCTX_TYPE_DEFAULT];
-               set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;
-               set->map[HCTX_TYPE_READ].nr_queues =
-                       ctrl->io_queues[HCTX_TYPE_DEFAULT];
-               set->map[HCTX_TYPE_READ].queue_offset = 0;
-       }
-       blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
-       blk_mq_map_queues(&set->map[HCTX_TYPE_READ]);
-
-       if (opts->nr_poll_queues && ctrl->io_queues[HCTX_TYPE_POLL]) {
-               /* map dedicated poll queues only if we have queues left */
-               set->map[HCTX_TYPE_POLL].nr_queues =
-                               ctrl->io_queues[HCTX_TYPE_POLL];
-               set->map[HCTX_TYPE_POLL].queue_offset =
-                       ctrl->io_queues[HCTX_TYPE_DEFAULT] +
-                       ctrl->io_queues[HCTX_TYPE_READ];
-               blk_mq_map_queues(&set->map[HCTX_TYPE_POLL]);
-       }
-
-       dev_info(ctrl->ctrl.device,
-               "mapped %d/%d/%d default/read/poll queues.\n",
-               ctrl->io_queues[HCTX_TYPE_DEFAULT],
-               ctrl->io_queues[HCTX_TYPE_READ],
-               ctrl->io_queues[HCTX_TYPE_POLL]);
+       nvmf_map_queues(set, &ctrl->ctrl, ctrl->io_queues);
 }
 
 static const struct blk_mq_ops nvme_rdma_mq_ops = {
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
new file mode 100644
index 0000000..45e9181
--- /dev/null
@@ -0,0 +1,668 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Sysfs interface for the NVMe core driver.
+ *
+ * Copyright (c) 2011-2014, Intel Corporation.
+ */
+
+#include <linux/nvme-auth.h>
+
+#include "nvme.h"
+#include "fabrics.h"
+
+static ssize_t nvme_sysfs_reset(struct device *dev,
+                               struct device_attribute *attr, const char *buf,
+                               size_t count)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+       int ret;
+
+       ret = nvme_reset_ctrl_sync(ctrl);
+       if (ret < 0)
+               return ret;
+       return count;
+}
+static DEVICE_ATTR(reset_controller, S_IWUSR, NULL, nvme_sysfs_reset);
+
+static ssize_t nvme_sysfs_rescan(struct device *dev,
+                               struct device_attribute *attr, const char *buf,
+                               size_t count)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+       nvme_queue_scan(ctrl);
+       return count;
+}
+static DEVICE_ATTR(rescan_controller, S_IWUSR, NULL, nvme_sysfs_rescan);
+
+static inline struct nvme_ns_head *dev_to_ns_head(struct device *dev)
+{
+       struct gendisk *disk = dev_to_disk(dev);
+
+       if (disk->fops == &nvme_bdev_ops)
+               return nvme_get_ns_from_dev(dev)->head;
+       else
+               return disk->private_data;
+}
+
+static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
+               char *buf)
+{
+       struct nvme_ns_head *head = dev_to_ns_head(dev);
+       struct nvme_ns_ids *ids = &head->ids;
+       struct nvme_subsystem *subsys = head->subsys;
+       int serial_len = sizeof(subsys->serial);
+       int model_len = sizeof(subsys->model);
+
+       if (!uuid_is_null(&ids->uuid))
+               return sysfs_emit(buf, "uuid.%pU\n", &ids->uuid);
+
+       if (memchr_inv(ids->nguid, 0, sizeof(ids->nguid)))
+               return sysfs_emit(buf, "eui.%16phN\n", ids->nguid);
+
+       if (memchr_inv(ids->eui64, 0, sizeof(ids->eui64)))
+               return sysfs_emit(buf, "eui.%8phN\n", ids->eui64);
+
+       while (serial_len > 0 && (subsys->serial[serial_len - 1] == ' ' ||
+                                 subsys->serial[serial_len - 1] == '\0'))
+               serial_len--;
+       while (model_len > 0 && (subsys->model[model_len - 1] == ' ' ||
+                                subsys->model[model_len - 1] == '\0'))
+               model_len--;
+
+       return sysfs_emit(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
+               serial_len, subsys->serial, model_len, subsys->model,
+               head->ns_id);
+}
+static DEVICE_ATTR_RO(wwid);
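
wwid_show() falls back through the available namespace identifiers: UUID first, then NGUID, then EUI-64, and finally a vendor/serial/model/nsid composite. A hedged userspace sketch of that fallback order, with made-up identifier values:

#include <stdio.h>

int main(void)
{
        /* made-up identifiers; a NULL pointer stands for "not reported" */
        const char *uuid  = NULL;
        const char *nguid = "0025385b21403c84a0b1c2d3e4f50617";
        const char *eui64 = "0025385b21403c84";
        char wwid[128];

        if (uuid)
                snprintf(wwid, sizeof(wwid), "uuid.%s", uuid);
        else if (nguid)
                snprintf(wwid, sizeof(wwid), "eui.%s", nguid);
        else if (eui64)
                snprintf(wwid, sizeof(wwid), "eui.%s", eui64);
        else
                snprintf(wwid, sizeof(wwid), "nvme.%04x-%s-%s-%08x",
                         0x1234, "SERIAL", "MODEL", 1);

        printf("%s\n", wwid);   /* eui.0025385b21403c84a0b1c2d3e4f50617 */
        return 0;
}
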
+
+static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
+               char *buf)
+{
+       return sysfs_emit(buf, "%pU\n", dev_to_ns_head(dev)->ids.nguid);
+}
+static DEVICE_ATTR_RO(nguid);
+
+static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
+               char *buf)
+{
+       struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;
+
+       /* For backward compatibility expose the NGUID to userspace if
+        * we have no UUID set
+        */
+       if (uuid_is_null(&ids->uuid)) {
+               dev_warn_ratelimited(dev,
+                       "No UUID available providing old NGUID\n");
+               return sysfs_emit(buf, "%pU\n", ids->nguid);
+       }
+       return sysfs_emit(buf, "%pU\n", &ids->uuid);
+}
+static DEVICE_ATTR_RO(uuid);
+
+static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
+               char *buf)
+{
+       return sysfs_emit(buf, "%8ph\n", dev_to_ns_head(dev)->ids.eui64);
+}
+static DEVICE_ATTR_RO(eui);
+
+static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
+               char *buf)
+{
+       return sysfs_emit(buf, "%d\n", dev_to_ns_head(dev)->ns_id);
+}
+static DEVICE_ATTR_RO(nsid);
+
+static struct attribute *nvme_ns_id_attrs[] = {
+       &dev_attr_wwid.attr,
+       &dev_attr_uuid.attr,
+       &dev_attr_nguid.attr,
+       &dev_attr_eui.attr,
+       &dev_attr_nsid.attr,
+#ifdef CONFIG_NVME_MULTIPATH
+       &dev_attr_ana_grpid.attr,
+       &dev_attr_ana_state.attr,
+#endif
+       NULL,
+};
+
+static umode_t nvme_ns_id_attrs_are_visible(struct kobject *kobj,
+               struct attribute *a, int n)
+{
+       struct device *dev = container_of(kobj, struct device, kobj);
+       struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;
+
+       if (a == &dev_attr_uuid.attr) {
+               if (uuid_is_null(&ids->uuid) &&
+                   !memchr_inv(ids->nguid, 0, sizeof(ids->nguid)))
+                       return 0;
+       }
+       if (a == &dev_attr_nguid.attr) {
+               if (!memchr_inv(ids->nguid, 0, sizeof(ids->nguid)))
+                       return 0;
+       }
+       if (a == &dev_attr_eui.attr) {
+               if (!memchr_inv(ids->eui64, 0, sizeof(ids->eui64)))
+                       return 0;
+       }
+#ifdef CONFIG_NVME_MULTIPATH
+       if (a == &dev_attr_ana_grpid.attr || a == &dev_attr_ana_state.attr) {
+               if (dev_to_disk(dev)->fops != &nvme_bdev_ops) /* per-path attr */
+                       return 0;
+               if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl))
+                       return 0;
+       }
+#endif
+       return a->mode;
+}
+
+static const struct attribute_group nvme_ns_id_attr_group = {
+       .attrs          = nvme_ns_id_attrs,
+       .is_visible     = nvme_ns_id_attrs_are_visible,
+};
+
+const struct attribute_group *nvme_ns_id_attr_groups[] = {
+       &nvme_ns_id_attr_group,
+       NULL,
+};
+
+#define nvme_show_str_function(field)                                          \
+static ssize_t  field##_show(struct device *dev,                               \
+                           struct device_attribute *attr, char *buf)           \
+{                                                                              \
+        struct nvme_ctrl *ctrl = dev_get_drvdata(dev);                         \
+        return sysfs_emit(buf, "%.*s\n",                                       \
+               (int)sizeof(ctrl->subsys->field), ctrl->subsys->field);         \
+}                                                                              \
+static DEVICE_ATTR(field, S_IRUGO, field##_show, NULL);
+
+nvme_show_str_function(model);
+nvme_show_str_function(serial);
+nvme_show_str_function(firmware_rev);
+
+#define nvme_show_int_function(field)                                          \
+static ssize_t  field##_show(struct device *dev,                               \
+                           struct device_attribute *attr, char *buf)           \
+{                                                                              \
+        struct nvme_ctrl *ctrl = dev_get_drvdata(dev);                         \
+        return sysfs_emit(buf, "%d\n", ctrl->field);                           \
+}                                                                              \
+static DEVICE_ATTR(field, S_IRUGO, field##_show, NULL);
+
+nvme_show_int_function(cntlid);
+nvme_show_int_function(numa_node);
+nvme_show_int_function(queue_count);
+nvme_show_int_function(sqsize);
+nvme_show_int_function(kato);
+
+static ssize_t nvme_sysfs_delete(struct device *dev,
+                               struct device_attribute *attr, const char *buf,
+                               size_t count)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+       if (!test_bit(NVME_CTRL_STARTED_ONCE, &ctrl->flags))
+               return -EBUSY;
+
+       if (device_remove_file_self(dev, attr))
+               nvme_delete_ctrl_sync(ctrl);
+       return count;
+}
+static DEVICE_ATTR(delete_controller, S_IWUSR, NULL, nvme_sysfs_delete);
+
+static ssize_t nvme_sysfs_show_transport(struct device *dev,
+                                        struct device_attribute *attr,
+                                        char *buf)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+       return sysfs_emit(buf, "%s\n", ctrl->ops->name);
+}
+static DEVICE_ATTR(transport, S_IRUGO, nvme_sysfs_show_transport, NULL);
+
+static ssize_t nvme_sysfs_show_state(struct device *dev,
+                                    struct device_attribute *attr,
+                                    char *buf)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+       static const char *const state_name[] = {
+               [NVME_CTRL_NEW]         = "new",
+               [NVME_CTRL_LIVE]        = "live",
+               [NVME_CTRL_RESETTING]   = "resetting",
+               [NVME_CTRL_CONNECTING]  = "connecting",
+               [NVME_CTRL_DELETING]    = "deleting",
+               [NVME_CTRL_DELETING_NOIO]= "deleting (no IO)",
+               [NVME_CTRL_DEAD]        = "dead",
+       };
+
+       if ((unsigned)ctrl->state < ARRAY_SIZE(state_name) &&
+           state_name[ctrl->state])
+               return sysfs_emit(buf, "%s\n", state_name[ctrl->state]);
+
+       return sysfs_emit(buf, "unknown state\n");
+}
+
+static DEVICE_ATTR(state, S_IRUGO, nvme_sysfs_show_state, NULL);
+
+static ssize_t nvme_sysfs_show_subsysnqn(struct device *dev,
+                                        struct device_attribute *attr,
+                                        char *buf)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+       return sysfs_emit(buf, "%s\n", ctrl->subsys->subnqn);
+}
+static DEVICE_ATTR(subsysnqn, S_IRUGO, nvme_sysfs_show_subsysnqn, NULL);
+
+static ssize_t nvme_sysfs_show_hostnqn(struct device *dev,
+                                       struct device_attribute *attr,
+                                       char *buf)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+       return sysfs_emit(buf, "%s\n", ctrl->opts->host->nqn);
+}
+static DEVICE_ATTR(hostnqn, S_IRUGO, nvme_sysfs_show_hostnqn, NULL);
+
+static ssize_t nvme_sysfs_show_hostid(struct device *dev,
+                                       struct device_attribute *attr,
+                                       char *buf)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+       return sysfs_emit(buf, "%pU\n", &ctrl->opts->host->id);
+}
+static DEVICE_ATTR(hostid, S_IRUGO, nvme_sysfs_show_hostid, NULL);
+
+static ssize_t nvme_sysfs_show_address(struct device *dev,
+                                        struct device_attribute *attr,
+                                        char *buf)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+       return ctrl->ops->get_address(ctrl, buf, PAGE_SIZE);
+}
+static DEVICE_ATTR(address, S_IRUGO, nvme_sysfs_show_address, NULL);
+
+static ssize_t nvme_ctrl_loss_tmo_show(struct device *dev,
+               struct device_attribute *attr, char *buf)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+       struct nvmf_ctrl_options *opts = ctrl->opts;
+
+       if (ctrl->opts->max_reconnects == -1)
+               return sysfs_emit(buf, "off\n");
+       return sysfs_emit(buf, "%d\n",
+                         opts->max_reconnects * opts->reconnect_delay);
+}
+
+static ssize_t nvme_ctrl_loss_tmo_store(struct device *dev,
+               struct device_attribute *attr, const char *buf, size_t count)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+       struct nvmf_ctrl_options *opts = ctrl->opts;
+       int ctrl_loss_tmo, err;
+
+       err = kstrtoint(buf, 10, &ctrl_loss_tmo);
+       if (err)
+               return -EINVAL;
+
+       if (ctrl_loss_tmo < 0)
+               opts->max_reconnects = -1;
+       else
+               opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo,
+                                               opts->reconnect_delay);
+       return count;
+}
+static DEVICE_ATTR(ctrl_loss_tmo, S_IRUGO | S_IWUSR,
+       nvme_ctrl_loss_tmo_show, nvme_ctrl_loss_tmo_store);
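
ctrl_loss_tmo is stored as a reconnect attempt count: the written timeout is divided by reconnect_delay and rounded up, and the show path multiplies back, so the displayed value can round up to the next multiple of reconnect_delay. A quick userspace sketch of that conversion with hypothetical values:

#include <stdio.h>

int main(void)
{
        int reconnect_delay = 10;       /* seconds between reconnect attempts */
        int ctrl_loss_tmo = 95;         /* value written to the sysfs file */

        /* store path: round up to a whole number of reconnect attempts */
        int max_reconnects = (ctrl_loss_tmo + reconnect_delay - 1) / reconnect_delay;
        /* show path: convert back to seconds */
        int shown = max_reconnects * reconnect_delay;

        printf("max_reconnects=%d, ctrl_loss_tmo reads back as %d\n",
               max_reconnects, shown);  /* 10 attempts, reads back as 100 */
        return 0;
}
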
+
+static ssize_t nvme_ctrl_reconnect_delay_show(struct device *dev,
+               struct device_attribute *attr, char *buf)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+       if (ctrl->opts->reconnect_delay == -1)
+               return sysfs_emit(buf, "off\n");
+       return sysfs_emit(buf, "%d\n", ctrl->opts->reconnect_delay);
+}
+
+static ssize_t nvme_ctrl_reconnect_delay_store(struct device *dev,
+               struct device_attribute *attr, const char *buf, size_t count)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+       unsigned int v;
+       int err;
+
+       err = kstrtou32(buf, 10, &v);
+       if (err)
+               return err;
+
+       ctrl->opts->reconnect_delay = v;
+       return count;
+}
+static DEVICE_ATTR(reconnect_delay, S_IRUGO | S_IWUSR,
+       nvme_ctrl_reconnect_delay_show, nvme_ctrl_reconnect_delay_store);
+
+static ssize_t nvme_ctrl_fast_io_fail_tmo_show(struct device *dev,
+               struct device_attribute *attr, char *buf)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+       if (ctrl->opts->fast_io_fail_tmo == -1)
+               return sysfs_emit(buf, "off\n");
+       return sysfs_emit(buf, "%d\n", ctrl->opts->fast_io_fail_tmo);
+}
+
+static ssize_t nvme_ctrl_fast_io_fail_tmo_store(struct device *dev,
+               struct device_attribute *attr, const char *buf, size_t count)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+       struct nvmf_ctrl_options *opts = ctrl->opts;
+       int fast_io_fail_tmo, err;
+
+       err = kstrtoint(buf, 10, &fast_io_fail_tmo);
+       if (err)
+               return -EINVAL;
+
+       if (fast_io_fail_tmo < 0)
+               opts->fast_io_fail_tmo = -1;
+       else
+               opts->fast_io_fail_tmo = fast_io_fail_tmo;
+       return count;
+}
+static DEVICE_ATTR(fast_io_fail_tmo, S_IRUGO | S_IWUSR,
+       nvme_ctrl_fast_io_fail_tmo_show, nvme_ctrl_fast_io_fail_tmo_store);
+
+static ssize_t cntrltype_show(struct device *dev,
+                             struct device_attribute *attr, char *buf)
+{
+       static const char * const type[] = {
+               [NVME_CTRL_IO] = "io\n",
+               [NVME_CTRL_DISC] = "discovery\n",
+               [NVME_CTRL_ADMIN] = "admin\n",
+       };
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+       if (ctrl->cntrltype > NVME_CTRL_ADMIN || !type[ctrl->cntrltype])
+               return sysfs_emit(buf, "reserved\n");
+
+       return sysfs_emit(buf, type[ctrl->cntrltype]);
+}
+static DEVICE_ATTR_RO(cntrltype);
+
+static ssize_t dctype_show(struct device *dev,
+                          struct device_attribute *attr, char *buf)
+{
+       static const char * const type[] = {
+               [NVME_DCTYPE_NOT_REPORTED] = "none\n",
+               [NVME_DCTYPE_DDC] = "ddc\n",
+               [NVME_DCTYPE_CDC] = "cdc\n",
+       };
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+       if (ctrl->dctype > NVME_DCTYPE_CDC || !type[ctrl->dctype])
+               return sysfs_emit(buf, "reserved\n");
+
+       return sysfs_emit(buf, type[ctrl->dctype]);
+}
+static DEVICE_ATTR_RO(dctype);
+
+#ifdef CONFIG_NVME_AUTH
+static ssize_t nvme_ctrl_dhchap_secret_show(struct device *dev,
+               struct device_attribute *attr, char *buf)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+       struct nvmf_ctrl_options *opts = ctrl->opts;
+
+       if (!opts->dhchap_secret)
+               return sysfs_emit(buf, "none\n");
+       return sysfs_emit(buf, "%s\n", opts->dhchap_secret);
+}
+
+static ssize_t nvme_ctrl_dhchap_secret_store(struct device *dev,
+               struct device_attribute *attr, const char *buf, size_t count)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+       struct nvmf_ctrl_options *opts = ctrl->opts;
+       char *dhchap_secret;
+
+       if (!ctrl->opts->dhchap_secret)
+               return -EINVAL;
+       if (count < 7)
+               return -EINVAL;
+       if (memcmp(buf, "DHHC-1:", 7))
+               return -EINVAL;
+
+       dhchap_secret = kzalloc(count + 1, GFP_KERNEL);
+       if (!dhchap_secret)
+               return -ENOMEM;
+       memcpy(dhchap_secret, buf, count);
+       nvme_auth_stop(ctrl);
+       if (strcmp(dhchap_secret, opts->dhchap_secret)) {
+               struct nvme_dhchap_key *key, *host_key;
+               int ret;
+
+               ret = nvme_auth_generate_key(dhchap_secret, &key);
+               if (ret) {
+                       kfree(dhchap_secret);
+                       return ret;
+               }
+               kfree(opts->dhchap_secret);
+               opts->dhchap_secret = dhchap_secret;
+               host_key = ctrl->host_key;
+               mutex_lock(&ctrl->dhchap_auth_mutex);
+               ctrl->host_key = key;
+               mutex_unlock(&ctrl->dhchap_auth_mutex);
+               nvme_auth_free_key(host_key);
+       } else
+               kfree(dhchap_secret);
+       /* Start re-authentication */
+       dev_info(ctrl->device, "re-authenticating controller\n");
+       queue_work(nvme_wq, &ctrl->dhchap_auth_work);
+
+       return count;
+}
+
+static DEVICE_ATTR(dhchap_secret, S_IRUGO | S_IWUSR,
+       nvme_ctrl_dhchap_secret_show, nvme_ctrl_dhchap_secret_store);
+
+static ssize_t nvme_ctrl_dhchap_ctrl_secret_show(struct device *dev,
+               struct device_attribute *attr, char *buf)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+       struct nvmf_ctrl_options *opts = ctrl->opts;
+
+       if (!opts->dhchap_ctrl_secret)
+               return sysfs_emit(buf, "none\n");
+       return sysfs_emit(buf, "%s\n", opts->dhchap_ctrl_secret);
+}
+
+static ssize_t nvme_ctrl_dhchap_ctrl_secret_store(struct device *dev,
+               struct device_attribute *attr, const char *buf, size_t count)
+{
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+       struct nvmf_ctrl_options *opts = ctrl->opts;
+       char *dhchap_secret;
+
+       if (!ctrl->opts->dhchap_ctrl_secret)
+               return -EINVAL;
+       if (count < 7)
+               return -EINVAL;
+       if (memcmp(buf, "DHHC-1:", 7))
+               return -EINVAL;
+
+       dhchap_secret = kzalloc(count + 1, GFP_KERNEL);
+       if (!dhchap_secret)
+               return -ENOMEM;
+       memcpy(dhchap_secret, buf, count);
+       nvme_auth_stop(ctrl);
+       if (strcmp(dhchap_secret, opts->dhchap_ctrl_secret)) {
+               struct nvme_dhchap_key *key, *ctrl_key;
+               int ret;
+
+               ret = nvme_auth_generate_key(dhchap_secret, &key);
+               if (ret) {
+                       kfree(dhchap_secret);
+                       return ret;
+               }
+               kfree(opts->dhchap_ctrl_secret);
+               opts->dhchap_ctrl_secret = dhchap_secret;
+               ctrl_key = ctrl->ctrl_key;
+               mutex_lock(&ctrl->dhchap_auth_mutex);
+               ctrl->ctrl_key = key;
+               mutex_unlock(&ctrl->dhchap_auth_mutex);
+               nvme_auth_free_key(ctrl_key);
+       } else
+               kfree(dhchap_secret);
+       /* Start re-authentication */
+       dev_info(ctrl->device, "re-authenticating controller\n");
+       queue_work(nvme_wq, &ctrl->dhchap_auth_work);
+
+       return count;
+}
+
+static DEVICE_ATTR(dhchap_ctrl_secret, S_IRUGO | S_IWUSR,
+       nvme_ctrl_dhchap_ctrl_secret_show, nvme_ctrl_dhchap_ctrl_secret_store);
+#endif
+
+static struct attribute *nvme_dev_attrs[] = {
+       &dev_attr_reset_controller.attr,
+       &dev_attr_rescan_controller.attr,
+       &dev_attr_model.attr,
+       &dev_attr_serial.attr,
+       &dev_attr_firmware_rev.attr,
+       &dev_attr_cntlid.attr,
+       &dev_attr_delete_controller.attr,
+       &dev_attr_transport.attr,
+       &dev_attr_subsysnqn.attr,
+       &dev_attr_address.attr,
+       &dev_attr_state.attr,
+       &dev_attr_numa_node.attr,
+       &dev_attr_queue_count.attr,
+       &dev_attr_sqsize.attr,
+       &dev_attr_hostnqn.attr,
+       &dev_attr_hostid.attr,
+       &dev_attr_ctrl_loss_tmo.attr,
+       &dev_attr_reconnect_delay.attr,
+       &dev_attr_fast_io_fail_tmo.attr,
+       &dev_attr_kato.attr,
+       &dev_attr_cntrltype.attr,
+       &dev_attr_dctype.attr,
+#ifdef CONFIG_NVME_AUTH
+       &dev_attr_dhchap_secret.attr,
+       &dev_attr_dhchap_ctrl_secret.attr,
+#endif
+       NULL
+};
+
+static umode_t nvme_dev_attrs_are_visible(struct kobject *kobj,
+               struct attribute *a, int n)
+{
+       struct device *dev = container_of(kobj, struct device, kobj);
+       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+       if (a == &dev_attr_delete_controller.attr && !ctrl->ops->delete_ctrl)
+               return 0;
+       if (a == &dev_attr_address.attr && !ctrl->ops->get_address)
+               return 0;
+       if (a == &dev_attr_hostnqn.attr && !ctrl->opts)
+               return 0;
+       if (a == &dev_attr_hostid.attr && !ctrl->opts)
+               return 0;
+       if (a == &dev_attr_ctrl_loss_tmo.attr && !ctrl->opts)
+               return 0;
+       if (a == &dev_attr_reconnect_delay.attr && !ctrl->opts)
+               return 0;
+       if (a == &dev_attr_fast_io_fail_tmo.attr && !ctrl->opts)
+               return 0;
+#ifdef CONFIG_NVME_AUTH
+       if (a == &dev_attr_dhchap_secret.attr && !ctrl->opts)
+               return 0;
+       if (a == &dev_attr_dhchap_ctrl_secret.attr && !ctrl->opts)
+               return 0;
+#endif
+
+       return a->mode;
+}
+
+const struct attribute_group nvme_dev_attrs_group = {
+       .attrs          = nvme_dev_attrs,
+       .is_visible     = nvme_dev_attrs_are_visible,
+};
+EXPORT_SYMBOL_GPL(nvme_dev_attrs_group);
+
+const struct attribute_group *nvme_dev_attr_groups[] = {
+       &nvme_dev_attrs_group,
+       NULL,
+};
+
+#define SUBSYS_ATTR_RO(_name, _mode, _show)                    \
+       struct device_attribute subsys_attr_##_name = \
+               __ATTR(_name, _mode, _show, NULL)
+
+static ssize_t nvme_subsys_show_nqn(struct device *dev,
+                                   struct device_attribute *attr,
+                                   char *buf)
+{
+       struct nvme_subsystem *subsys =
+               container_of(dev, struct nvme_subsystem, dev);
+
+       return sysfs_emit(buf, "%s\n", subsys->subnqn);
+}
+static SUBSYS_ATTR_RO(subsysnqn, S_IRUGO, nvme_subsys_show_nqn);
+
+static ssize_t nvme_subsys_show_type(struct device *dev,
+                                   struct device_attribute *attr,
+                                   char *buf)
+{
+       struct nvme_subsystem *subsys =
+               container_of(dev, struct nvme_subsystem, dev);
+
+       switch (subsys->subtype) {
+       case NVME_NQN_DISC:
+               return sysfs_emit(buf, "discovery\n");
+       case NVME_NQN_NVME:
+               return sysfs_emit(buf, "nvm\n");
+       default:
+               return sysfs_emit(buf, "reserved\n");
+       }
+}
+static SUBSYS_ATTR_RO(subsystype, S_IRUGO, nvme_subsys_show_type);
+
+#define nvme_subsys_show_str_function(field)                           \
+static ssize_t subsys_##field##_show(struct device *dev,               \
+                           struct device_attribute *attr, char *buf)   \
+{                                                                      \
+       struct nvme_subsystem *subsys =                                 \
+               container_of(dev, struct nvme_subsystem, dev);          \
+       return sysfs_emit(buf, "%.*s\n",                                \
+                          (int)sizeof(subsys->field), subsys->field);  \
+}                                                                      \
+static SUBSYS_ATTR_RO(field, S_IRUGO, subsys_##field##_show);
+
+nvme_subsys_show_str_function(model);
+nvme_subsys_show_str_function(serial);
+nvme_subsys_show_str_function(firmware_rev);
+
+static struct attribute *nvme_subsys_attrs[] = {
+       &subsys_attr_model.attr,
+       &subsys_attr_serial.attr,
+       &subsys_attr_firmware_rev.attr,
+       &subsys_attr_subsysnqn.attr,
+       &subsys_attr_subsystype.attr,
+#ifdef CONFIG_NVME_MULTIPATH
+       &subsys_attr_iopolicy.attr,
+#endif
+       NULL,
+};
+
+static const struct attribute_group nvme_subsys_attrs_group = {
+       .attrs = nvme_subsys_attrs,
+};
+
+const struct attribute_group *nvme_subsys_attrs_groups[] = {
+       &nvme_subsys_attrs_group,
+       NULL,
+};
index 47ae17f..3e7dd6f 100644 (file)
@@ -1807,58 +1807,12 @@ out_free_queues:
        return ret;
 }
 
-static unsigned int nvme_tcp_nr_io_queues(struct nvme_ctrl *ctrl)
-{
-       unsigned int nr_io_queues;
-
-       nr_io_queues = min(ctrl->opts->nr_io_queues, num_online_cpus());
-       nr_io_queues += min(ctrl->opts->nr_write_queues, num_online_cpus());
-       nr_io_queues += min(ctrl->opts->nr_poll_queues, num_online_cpus());
-
-       return nr_io_queues;
-}
-
-static void nvme_tcp_set_io_queues(struct nvme_ctrl *nctrl,
-               unsigned int nr_io_queues)
-{
-       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
-       struct nvmf_ctrl_options *opts = nctrl->opts;
-
-       if (opts->nr_write_queues && opts->nr_io_queues < nr_io_queues) {
-               /*
-                * separate read/write queues
-                * hand out dedicated default queues only after we have
-                * sufficient read queues.
-                */
-               ctrl->io_queues[HCTX_TYPE_READ] = opts->nr_io_queues;
-               nr_io_queues -= ctrl->io_queues[HCTX_TYPE_READ];
-               ctrl->io_queues[HCTX_TYPE_DEFAULT] =
-                       min(opts->nr_write_queues, nr_io_queues);
-               nr_io_queues -= ctrl->io_queues[HCTX_TYPE_DEFAULT];
-       } else {
-               /*
-                * shared read/write queues
-                * either no write queues were requested, or we don't have
-                * sufficient queue count to have dedicated default queues.
-                */
-               ctrl->io_queues[HCTX_TYPE_DEFAULT] =
-                       min(opts->nr_io_queues, nr_io_queues);
-               nr_io_queues -= ctrl->io_queues[HCTX_TYPE_DEFAULT];
-       }
-
-       if (opts->nr_poll_queues && nr_io_queues) {
-               /* map dedicated poll queues only if we have queues left */
-               ctrl->io_queues[HCTX_TYPE_POLL] =
-                       min(opts->nr_poll_queues, nr_io_queues);
-       }
-}
-
 static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
 {
        unsigned int nr_io_queues;
        int ret;
 
-       nr_io_queues = nvme_tcp_nr_io_queues(ctrl);
+       nr_io_queues = nvmf_nr_io_queues(ctrl->opts);
        ret = nvme_set_queue_count(ctrl, &nr_io_queues);
        if (ret)
                return ret;
@@ -1873,8 +1827,8 @@ static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
        dev_info(ctrl->device,
                "creating %d I/O queues.\n", nr_io_queues);
 
-       nvme_tcp_set_io_queues(ctrl, nr_io_queues);
-
+       nvmf_set_io_queues(ctrl->opts, nr_io_queues,
+                          to_tcp_ctrl(ctrl)->io_queues);
        return __nvme_tcp_alloc_io_queues(ctrl);
 }
 
@@ -2454,44 +2408,8 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
 static void nvme_tcp_map_queues(struct blk_mq_tag_set *set)
 {
        struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(set->driver_data);
-       struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;
-
-       if (opts->nr_write_queues && ctrl->io_queues[HCTX_TYPE_READ]) {
-               /* separate read/write queues */
-               set->map[HCTX_TYPE_DEFAULT].nr_queues =
-                       ctrl->io_queues[HCTX_TYPE_DEFAULT];
-               set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;
-               set->map[HCTX_TYPE_READ].nr_queues =
-                       ctrl->io_queues[HCTX_TYPE_READ];
-               set->map[HCTX_TYPE_READ].queue_offset =
-                       ctrl->io_queues[HCTX_TYPE_DEFAULT];
-       } else {
-               /* shared read/write queues */
-               set->map[HCTX_TYPE_DEFAULT].nr_queues =
-                       ctrl->io_queues[HCTX_TYPE_DEFAULT];
-               set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;
-               set->map[HCTX_TYPE_READ].nr_queues =
-                       ctrl->io_queues[HCTX_TYPE_DEFAULT];
-               set->map[HCTX_TYPE_READ].queue_offset = 0;
-       }
-       blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
-       blk_mq_map_queues(&set->map[HCTX_TYPE_READ]);
-
-       if (opts->nr_poll_queues && ctrl->io_queues[HCTX_TYPE_POLL]) {
-               /* map dedicated poll queues only if we have queues left */
-               set->map[HCTX_TYPE_POLL].nr_queues =
-                               ctrl->io_queues[HCTX_TYPE_POLL];
-               set->map[HCTX_TYPE_POLL].queue_offset =
-                       ctrl->io_queues[HCTX_TYPE_DEFAULT] +
-                       ctrl->io_queues[HCTX_TYPE_READ];
-               blk_mq_map_queues(&set->map[HCTX_TYPE_POLL]);
-       }
-
-       dev_info(ctrl->ctrl.device,
-               "mapped %d/%d/%d default/read/poll queues.\n",
-               ctrl->io_queues[HCTX_TYPE_DEFAULT],
-               ctrl->io_queues[HCTX_TYPE_READ],
-               ctrl->io_queues[HCTX_TYPE_POLL]);
+
+       nvmf_map_queues(set, &ctrl->ctrl, ctrl->io_queues);
 }
 
 static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
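
The shared fabrics helpers called above (nvmf_nr_io_queues(), nvmf_set_io_queues() and
nvmf_map_queues()) are not part of this excerpt; they absorb the per-driver queue-count
logic deleted from nvme-tcp. A minimal sketch of the first two, reconstructed from the
removed code with assumed parameter types, would be:

static unsigned int nvmf_nr_io_queues(struct nvmf_ctrl_options *opts)
{
	/* one queue per online CPU for each requested queue class */
	return min(opts->nr_io_queues, num_online_cpus()) +
	       min(opts->nr_write_queues, num_online_cpus()) +
	       min(opts->nr_poll_queues, num_online_cpus());
}

static void nvmf_set_io_queues(struct nvmf_ctrl_options *opts,
			       unsigned int nr_io_queues,
			       u32 io_queues[HCTX_MAX_TYPES])
{
	if (opts->nr_write_queues && opts->nr_io_queues < nr_io_queues) {
		/*
		 * Separate read/write queues: hand out dedicated default
		 * queues only once the read queues are fully covered.
		 */
		io_queues[HCTX_TYPE_READ] = opts->nr_io_queues;
		nr_io_queues -= io_queues[HCTX_TYPE_READ];
		io_queues[HCTX_TYPE_DEFAULT] =
			min(opts->nr_write_queues, nr_io_queues);
		nr_io_queues -= io_queues[HCTX_TYPE_DEFAULT];
	} else {
		/* shared read/write queues */
		io_queues[HCTX_TYPE_DEFAULT] =
			min(opts->nr_io_queues, nr_io_queues);
		nr_io_queues -= io_queues[HCTX_TYPE_DEFAULT];
	}

	if (opts->nr_poll_queues && nr_io_queues)
		/* map dedicated poll queues only if we have queues left */
		io_queues[HCTX_TYPE_POLL] =
			min(opts->nr_poll_queues, nr_io_queues);
}

nvmf_map_queues() likewise takes over the blk_mq_map_queues() bookkeeping removed from
nvme_tcp_map_queues() above, so the transport driver only passes in its tag set,
controller and io_queues array.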
index 7970a76..586458f 100644 (file)
@@ -295,13 +295,11 @@ void nvmet_execute_auth_send(struct nvmet_req *req)
                        status = 0;
                }
                goto done_kfree;
-               break;
        case NVME_AUTH_DHCHAP_MESSAGE_SUCCESS2:
                req->sq->authenticated = true;
                pr_debug("%s: ctrl %d qid %d ctrl authenticated\n",
                         __func__, ctrl->cntlid, req->sq->qid);
                goto done_kfree;
-               break;
        case NVME_AUTH_DHCHAP_MESSAGE_FAILURE2:
                status = nvmet_auth_failure2(d);
                if (status) {
@@ -312,7 +310,6 @@ void nvmet_execute_auth_send(struct nvmet_req *req)
                        status = 0;
                }
                goto done_kfree;
-               break;
        default:
                req->sq->dhchap_status =
                        NVME_AUTH_DHCHAP_FAILURE_INCORRECT_MESSAGE;
@@ -320,7 +317,6 @@ void nvmet_execute_auth_send(struct nvmet_req *req)
                        NVME_AUTH_DHCHAP_MESSAGE_FAILURE2;
                req->sq->authenticated = false;
                goto done_kfree;
-               break;
        }
 done_failure1:
        req->sq->dhchap_status = NVME_AUTH_DHCHAP_FAILURE_INCORRECT_MESSAGE;
@@ -483,15 +479,6 @@ void nvmet_execute_auth_receive(struct nvmet_req *req)
                        status = NVME_SC_INTERNAL;
                        break;
                }
-               if (status) {
-                       req->sq->dhchap_status = status;
-                       nvmet_auth_failure1(req, d, al);
-                       pr_warn("ctrl %d qid %d: challenge status (%x)\n",
-                               ctrl->cntlid, req->sq->qid,
-                               req->sq->dhchap_status);
-                       status = 0;
-                       break;
-               }
                req->sq->dhchap_step = NVME_AUTH_DHCHAP_MESSAGE_REPLY;
                break;
        case NVME_AUTH_DHCHAP_MESSAGE_SUCCESS1:
index e940a7d..c65a734 100644 (file)
@@ -645,8 +645,6 @@ fcloop_fcp_recv_work(struct work_struct *work)
        }
        if (ret)
                fcloop_call_host_done(fcpreq, tfcp_req, ret);
-
-       return;
 }
 
 static void
@@ -1168,7 +1166,8 @@ __wait_localport_unreg(struct fcloop_lport *lport)
 
        ret = nvme_fc_unregister_localport(lport->localport);
 
-       wait_for_completion(&lport->unreg_done);
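+       /* Only wait for the completion if the localport unregister call succeeded */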
+       if (!ret)
+               wait_for_completion(&lport->unreg_done);
 
        kfree(lport);
 
index c2d6cea..2733e01 100644 (file)
@@ -51,7 +51,7 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
 {
        if (ns->bdev) {
-               blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ);
+               blkdev_put(ns->bdev, NULL);
                ns->bdev = NULL;
        }
 }
@@ -85,7 +85,7 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
                return -ENOTBLK;
 
        ns->bdev = blkdev_get_by_path(ns->device_path,
-                       FMODE_READ | FMODE_WRITE, NULL);
+                       BLK_OPEN_READ | BLK_OPEN_WRITE, NULL, NULL);
        if (IS_ERR(ns->bdev)) {
                ret = PTR_ERR(ns->bdev);
                if (ret != -ENOTBLK) {
index dc60a22..6cf723b 100644 (file)
@@ -109,8 +109,8 @@ struct nvmet_sq {
        u32                     sqhd;
        bool                    sqhd_disabled;
 #ifdef CONFIG_NVME_TARGET_AUTH
-       struct delayed_work     auth_expired_work;
        bool                    authenticated;
+       struct delayed_work     auth_expired_work;
        u16                     dhchap_tid;
        u16                     dhchap_status;
        int                     dhchap_step;
index d740eba..4e5b972 100644 (file)
 #define PARPORT_MAX_TIMESLICE_VALUE ((unsigned long) HZ)
 #define PARPORT_MIN_SPINTIME_VALUE 1
 #define PARPORT_MAX_SPINTIME_VALUE 1000
+/*
+ * PARPORT_BASE_* is the size of the known parts of the sysctl path
+ * in dev/parport/%s/devices/%s: "dev/parport/" (12), "/devices/" (9)
+ * and the terminating NUL (1), i.e. 13 and 22 bytes respectively.
+ */
+#define PARPORT_BASE_PATH_SIZE 13
+#define PARPORT_BASE_DEVICES_PATH_SIZE 22
 
 static int do_active_device(struct ctl_table *table, int write,
                      void *result, size_t *lenp, loff_t *ppos)
@@ -236,13 +243,6 @@ do {                                                                       \
        return 0;
 }
 
-#define PARPORT_PORT_DIR(CHILD) { .procname = NULL, .mode = 0555, .child = CHILD }
-#define PARPORT_PARPORT_DIR(CHILD) { .procname = "parport", \
-                                     .mode = 0555, .child = CHILD }
-#define PARPORT_DEV_DIR(CHILD) { .procname = "dev", .mode = 0555, .child = CHILD }
-#define PARPORT_DEVICES_ROOT_DIR  {  .procname = "devices", \
-                                    .mode = 0555, .child = NULL }
-
 static const unsigned long parport_min_timeslice_value =
 PARPORT_MIN_TIMESLICE_VALUE;
 
@@ -257,17 +257,16 @@ PARPORT_MAX_SPINTIME_VALUE;
 
 
 struct parport_sysctl_table {
-       struct ctl_table_header *sysctl_header;
+       struct ctl_table_header *port_header;
+       struct ctl_table_header *devices_header;
        struct ctl_table vars[12];
        struct ctl_table device_dir[2];
-       struct ctl_table port_dir[2];
-       struct ctl_table parport_dir[2];
-       struct ctl_table dev_dir[2];
 };
 
 static const struct parport_sysctl_table parport_sysctl_template = {
-       .sysctl_header = NULL,
-        {
+       .port_header = NULL,
+       .devices_header = NULL,
+       {
                {
                        .procname       = "spintime",
                        .data           = NULL,
@@ -305,7 +304,6 @@ static const struct parport_sysctl_table parport_sysctl_template = {
                        .mode           = 0444,
                        .proc_handler   = do_hardware_modes
                },
-               PARPORT_DEVICES_ROOT_DIR,
 #ifdef CONFIG_PARPORT_1284
                {
                        .procname       = "autoprobe",
@@ -355,18 +353,6 @@ static const struct parport_sysctl_table parport_sysctl_template = {
                },
                {}
        },
-       {
-               PARPORT_PORT_DIR(NULL),
-               {}
-       },
-       {
-               PARPORT_PARPORT_DIR(NULL),
-               {}
-       },
-       {
-               PARPORT_DEV_DIR(NULL),
-               {}
-       }
 };
 
 struct parport_device_sysctl_table
@@ -393,6 +379,7 @@ parport_device_sysctl_template = {
                        .extra1         = (void*) &parport_min_timeslice_value,
                        .extra2         = (void*) &parport_max_timeslice_value
                },
+               {}
        },
        {
                {
@@ -400,25 +387,8 @@ parport_device_sysctl_template = {
                        .data           = NULL,
                        .maxlen         = 0,
                        .mode           = 0555,
-                       .child          = NULL
                },
                {}
-       },
-       {
-               PARPORT_DEVICES_ROOT_DIR,
-               {}
-       },
-       {
-               PARPORT_PORT_DIR(NULL),
-               {}
-       },
-       {
-               PARPORT_PARPORT_DIR(NULL),
-               {}
-       },
-       {
-               PARPORT_DEV_DIR(NULL),
-               {}
        }
 };
 
@@ -454,30 +424,15 @@ parport_default_sysctl_table = {
                        .extra2         = (void*) &parport_max_spintime_value
                },
                {}
-       },
-       {
-               {
-                       .procname       = "default",
-                       .mode           = 0555,
-                       .child          = parport_default_sysctl_table.vars
-               },
-               {}
-       },
-       {
-               PARPORT_PARPORT_DIR(parport_default_sysctl_table.default_dir),
-               {}
-       },
-       {
-               PARPORT_DEV_DIR(parport_default_sysctl_table.parport_dir),
-               {}
        }
 };
 
-
 int parport_proc_register(struct parport *port)
 {
        struct parport_sysctl_table *t;
-       int i;
+       char *tmp_dir_path;
+       size_t tmp_path_len, port_name_len;
+       int bytes_written, i, err = 0;
 
        t = kmemdup(&parport_sysctl_template, sizeof(*t), GFP_KERNEL);
        if (t == NULL)
@@ -485,28 +440,64 @@ int parport_proc_register(struct parport *port)
 
        t->device_dir[0].extra1 = port;
 
-       for (i = 0; i < 5; i++)
+       t->vars[0].data = &port->spintime;
+       for (i = 0; i < 5; i++) {
                t->vars[i].extra1 = port;
+               t->vars[5 + i].extra2 = &port->probe_info[i];
+       }
 
-       t->vars[0].data = &port->spintime;
-       t->vars[5].child = t->device_dir;
-       
-       for (i = 0; i < 5; i++)
-               t->vars[6 + i].extra2 = &port->probe_info[i];
+       port_name_len = strnlen(port->name, PARPORT_NAME_MAX_LEN);
+       /*
+        * Allocate a buffer for two paths: dev/parport/PORT and dev/parport/PORT/devices.
+        * We calculate for the second as that will give us enough for the first.
+        */
+       tmp_path_len = PARPORT_BASE_DEVICES_PATH_SIZE + port_name_len;
+       tmp_dir_path = kzalloc(tmp_path_len, GFP_KERNEL);
+       if (!tmp_dir_path) {
+               err = -ENOMEM;
+               goto exit_free_t;
+       }
 
-       t->port_dir[0].procname = port->name;
+       bytes_written = snprintf(tmp_dir_path, tmp_path_len,
+                                "dev/parport/%s/devices", port->name);
+       if (tmp_path_len <= bytes_written) {
+               err = -ENOENT;
+               goto exit_free_tmp_dir_path;
+       }
+       t->devices_header = register_sysctl(tmp_dir_path, t->device_dir);
+       if (t->devices_header == NULL) {
+               err = -ENOENT;
+               goto exit_free_tmp_dir_path;
+       }
 
-       t->port_dir[0].child = t->vars;
-       t->parport_dir[0].child = t->port_dir;
-       t->dev_dir[0].child = t->parport_dir;
+       tmp_path_len = PARPORT_BASE_PATH_SIZE + port_name_len;
+       bytes_written = snprintf(tmp_dir_path, tmp_path_len,
+                                "dev/parport/%s", port->name);
+       if (tmp_path_len <= bytes_written) {
+               err = -ENOENT;
+               goto unregister_devices_h;
+       }
 
-       t->sysctl_header = register_sysctl_table(t->dev_dir);
-       if (t->sysctl_header == NULL) {
-               kfree(t);
-               t = NULL;
+       t->port_header = register_sysctl(tmp_dir_path, t->vars);
+       if (t->port_header == NULL) {
+               err = -ENOENT;
+               goto unregister_devices_h;
        }
+
        port->sysctl_table = t;
+
+       kfree(tmp_dir_path);
        return 0;
+
+unregister_devices_h:
+       unregister_sysctl_table(t->devices_header);
+
+exit_free_tmp_dir_path:
+       kfree(tmp_dir_path);
+
+exit_free_t:
+       kfree(t);
+       return err;
 }
 
 int parport_proc_unregister(struct parport *port)
@@ -514,7 +505,8 @@ int parport_proc_unregister(struct parport *port)
        if (port->sysctl_table) {
                struct parport_sysctl_table *t = port->sysctl_table;
                port->sysctl_table = NULL;
-               unregister_sysctl_table(t->sysctl_header);
+               unregister_sysctl_table(t->devices_header);
+               unregister_sysctl_table(t->port_header);
                kfree(t);
        }
        return 0;
@@ -522,30 +514,53 @@ int parport_proc_unregister(struct parport *port)
 
 int parport_device_proc_register(struct pardevice *device)
 {
+       int bytes_written, err = 0;
        struct parport_device_sysctl_table *t;
        struct parport * port = device->port;
+       size_t port_name_len, device_name_len, tmp_dir_path_len;
+       char *tmp_dir_path;
        
        t = kmemdup(&parport_device_sysctl_template, sizeof(*t), GFP_KERNEL);
        if (t == NULL)
                return -ENOMEM;
 
-       t->dev_dir[0].child = t->parport_dir;
-       t->parport_dir[0].child = t->port_dir;
-       t->port_dir[0].procname = port->name;
-       t->port_dir[0].child = t->devices_root_dir;
-       t->devices_root_dir[0].child = t->device_dir;
+       port_name_len = strnlen(port->name, PARPORT_NAME_MAX_LEN);
+       device_name_len = strnlen(device->name, PATH_MAX);
+
+       /* Allocate a buffer for the path dev/parport/PORT/devices/DEVICE. */
+       tmp_dir_path_len = PARPORT_BASE_DEVICES_PATH_SIZE + port_name_len + device_name_len;
+       tmp_dir_path = kzalloc(tmp_dir_path_len, GFP_KERNEL);
+       if (!tmp_dir_path) {
+               err = -ENOMEM;
+               goto exit_free_t;
+       }
+
+       bytes_written = snprintf(tmp_dir_path, tmp_dir_path_len, "dev/parport/%s/devices/%s",
+                                port->name, device->name);
+       if (tmp_dir_path_len <= bytes_written) {
+               err = -ENOENT;
+               goto exit_free_path;
+       }
 
-       t->device_dir[0].procname = device->name;
-       t->device_dir[0].child = t->vars;
        t->vars[0].data = &device->timeslice;
 
-       t->sysctl_header = register_sysctl_table(t->dev_dir);
+       t->sysctl_header = register_sysctl(tmp_dir_path, t->vars);
        if (t->sysctl_header == NULL) {
                kfree(t);
                t = NULL;
        }
        device->sysctl_table = t;
+
+       kfree(tmp_dir_path);
        return 0;
+
+exit_free_path:
+       kfree(tmp_dir_path);
+
+exit_free_t:
+       kfree(t);
+
+       return err;
 }
 
 int parport_device_proc_unregister(struct pardevice *device)
@@ -564,7 +579,7 @@ static int __init parport_default_proc_register(void)
        int ret;
 
        parport_default_sysctl_table.sysctl_header =
-               register_sysctl_table(parport_default_sysctl_table.dev_dir);
+               register_sysctl("dev/parport/default", parport_default_sysctl_table.vars);
        if (!parport_default_sysctl_table.sysctl_header)
                return -ENOMEM;
        ret = parport_bus_init();
index 62f8407..2d46b1d 100644 (file)
@@ -467,7 +467,7 @@ struct parport *parport_register_port(unsigned long base, int irq, int dma,
        atomic_set(&tmp->ref_count, 1);
        INIT_LIST_HEAD(&tmp->full_list);
 
-       name = kmalloc(15, GFP_KERNEL);
+       name = kmalloc(PARPORT_NAME_MAX_LEN, GFP_KERNEL);
        if (!name) {
                kfree(tmp);
                return NULL;
index 9309f24..3c07d8d 100644 (file)
@@ -168,6 +168,7 @@ config PCI_P2PDMA
        #
        depends on 64BIT
        select GENERIC_ALLOCATOR
+       select NEED_SG_DMA_FLAGS
        help
          Enables drivers to do PCI peer-to-peer transactions to and from
          BARs that are exposed in other devices that are part of
index 711f824..4c07d71 100644 (file)
@@ -127,6 +127,14 @@ config FSL_IMX8_DDR_PMU
          can give information about memory throughput and other related
          events.
 
+config FSL_IMX9_DDR_PMU
+       tristate "Freescale i.MX9 DDR perf monitor"
+       depends on ARCH_MXC
+       help
+         Provides support for the DDR performance monitor in i.MX9, which
+         can give information about memory throughput and other related
+         events.
+
 config QCOM_L2_PMU
        bool "Qualcomm Technologies L2-cache PMU"
        depends on ARCH_QCOM && ARM64 && ACPI
index dabc859..5cfe895 100644 (file)
@@ -8,6 +8,7 @@ obj-$(CONFIG_ARM_PMU_ACPI) += arm_pmu_acpi.o
 obj-$(CONFIG_ARM_PMUV3) += arm_pmuv3.o
 obj-$(CONFIG_ARM_SMMU_V3_PMU) += arm_smmuv3_pmu.o
 obj-$(CONFIG_FSL_IMX8_DDR_PMU) += fsl_imx8_ddr_perf.o
+obj-$(CONFIG_FSL_IMX9_DDR_PMU) += fsl_imx9_ddr_perf.o
 obj-$(CONFIG_HISI_PMU) += hisilicon/
 obj-$(CONFIG_QCOM_L2_PMU)      += qcom_l2_pmu.o
 obj-$(CONFIG_QCOM_L3_PMU) += qcom_l3_pmu.o
index 8574c6e..cd2de44 100644 (file)
@@ -493,6 +493,17 @@ static int m1_pmu_map_event(struct perf_event *event)
        return armpmu_map_event(event, &m1_pmu_perf_map, NULL, M1_PMU_CFG_EVENT);
 }
 
+static int m2_pmu_map_event(struct perf_event *event)
+{
+       /*
+        * Same deal as the above, except that M2 has 64bit counters.
+        * Which, as far as we're concerned, actually means 63 bits.
+        * Yes, this is getting awkward.
+        */
+       event->hw.flags |= ARMPMU_EVT_63BIT;
+       return armpmu_map_event(event, &m1_pmu_perf_map, NULL, M1_PMU_CFG_EVENT);
+}
+
 static void m1_pmu_reset(void *info)
 {
        int i;
@@ -525,7 +536,7 @@ static int m1_pmu_set_event_filter(struct hw_perf_event *event,
        return 0;
 }
 
-static int m1_pmu_init(struct arm_pmu *cpu_pmu)
+static int m1_pmu_init(struct arm_pmu *cpu_pmu, u32 flags)
 {
        cpu_pmu->handle_irq       = m1_pmu_handle_irq;
        cpu_pmu->enable           = m1_pmu_enable_event;
@@ -536,7 +547,14 @@ static int m1_pmu_init(struct arm_pmu *cpu_pmu)
        cpu_pmu->clear_event_idx  = m1_pmu_clear_event_idx;
        cpu_pmu->start            = m1_pmu_start;
        cpu_pmu->stop             = m1_pmu_stop;
-       cpu_pmu->map_event        = m1_pmu_map_event;
+
+       if (flags & ARMPMU_EVT_47BIT)
+               cpu_pmu->map_event = m1_pmu_map_event;
+       else if (flags & ARMPMU_EVT_63BIT)
+               cpu_pmu->map_event = m2_pmu_map_event;
+       else
+               return WARN_ON(-EINVAL);
+
        cpu_pmu->reset            = m1_pmu_reset;
        cpu_pmu->set_event_filter = m1_pmu_set_event_filter;
 
@@ -550,25 +568,25 @@ static int m1_pmu_init(struct arm_pmu *cpu_pmu)
 static int m1_pmu_ice_init(struct arm_pmu *cpu_pmu)
 {
        cpu_pmu->name = "apple_icestorm_pmu";
-       return m1_pmu_init(cpu_pmu);
+       return m1_pmu_init(cpu_pmu, ARMPMU_EVT_47BIT);
 }
 
 static int m1_pmu_fire_init(struct arm_pmu *cpu_pmu)
 {
        cpu_pmu->name = "apple_firestorm_pmu";
-       return m1_pmu_init(cpu_pmu);
+       return m1_pmu_init(cpu_pmu, ARMPMU_EVT_47BIT);
 }
 
 static int m2_pmu_avalanche_init(struct arm_pmu *cpu_pmu)
 {
        cpu_pmu->name = "apple_avalanche_pmu";
-       return m1_pmu_init(cpu_pmu);
+       return m1_pmu_init(cpu_pmu, ARMPMU_EVT_63BIT);
 }
 
 static int m2_pmu_blizzard_init(struct arm_pmu *cpu_pmu)
 {
        cpu_pmu->name = "apple_blizzard_pmu";
-       return m1_pmu_init(cpu_pmu);
+       return m1_pmu_init(cpu_pmu, ARMPMU_EVT_63BIT);
 }
 
 static const struct of_device_id m1_pmu_of_device_ids[] = {
index 03b1309..998259f 100644 (file)
@@ -645,7 +645,7 @@ static void cci_pmu_sync_counters(struct cci_pmu *cci_pmu)
        struct cci_pmu_hw_events *cci_hw = &cci_pmu->hw_events;
        DECLARE_BITMAP(mask, HW_CNTRS_MAX);
 
-       bitmap_zero(mask, cci_pmu->num_cntrs);
+       bitmap_zero(mask, HW_CNTRS_MAX);
        for_each_set_bit(i, cci_pmu->hw_events.used_mask, cci_pmu->num_cntrs) {
                struct perf_event *event = cci_hw->events[i];
 
@@ -656,7 +656,7 @@ static void cci_pmu_sync_counters(struct cci_pmu *cci_pmu)
                if (event->hw.state & PERF_HES_STOPPED)
                        continue;
                if (event->hw.state & PERF_HES_ARCH) {
-                       set_bit(i, mask);
+                       __set_bit(i, mask);
                        event->hw.state &= ~PERF_HES_ARCH;
                }
        }
index 47d359f..b8c1587 100644 (file)
 #define CMN_MAX_DTMS                   (CMN_MAX_XPS + (CMN_MAX_DIMENSION - 1) * 4)
 
 /* The CFG node has various info besides the discovery tree */
-#define CMN_CFGM_PERIPH_ID_2           0x0010
-#define CMN_CFGM_PID2_REVISION         GENMASK(7, 4)
+#define CMN_CFGM_PERIPH_ID_01          0x0008
+#define CMN_CFGM_PID0_PART_0           GENMASK_ULL(7, 0)
+#define CMN_CFGM_PID1_PART_1           GENMASK_ULL(35, 32)
+#define CMN_CFGM_PERIPH_ID_23          0x0010
+#define CMN_CFGM_PID2_REVISION         GENMASK_ULL(7, 4)
 
 #define CMN_CFGM_INFO_GLOBAL           0x900
 #define CMN_INFO_MULTIPLE_DTM_EN       BIT_ULL(63)
 #define CMN_WP_DOWN                    2
 
 
+/* Internal values for encoding event support */
 enum cmn_model {
        CMN600 = 1,
        CMN650 = 2,
@@ -197,26 +201,34 @@ enum cmn_model {
        CMN_650ON = CMN650 | CMN700,
 };
 
+/* Actual part numbers and revision IDs defined by the hardware */
+enum cmn_part {
+       PART_CMN600 = 0x434,
+       PART_CMN650 = 0x436,
+       PART_CMN700 = 0x43c,
+       PART_CI700 = 0x43a,
+};
+
 /* CMN-600 r0px shouldn't exist in silicon, thankfully */
 enum cmn_revision {
-       CMN600_R1P0,
-       CMN600_R1P1,
-       CMN600_R1P2,
-       CMN600_R1P3,
-       CMN600_R2P0,
-       CMN600_R3P0,
-       CMN600_R3P1,
-       CMN650_R0P0 = 0,
-       CMN650_R1P0,
-       CMN650_R1P1,
-       CMN650_R2P0,
-       CMN650_R1P2,
-       CMN700_R0P0 = 0,
-       CMN700_R1P0,
-       CMN700_R2P0,
-       CI700_R0P0 = 0,
-       CI700_R1P0,
-       CI700_R2P0,
+       REV_CMN600_R1P0,
+       REV_CMN600_R1P1,
+       REV_CMN600_R1P2,
+       REV_CMN600_R1P3,
+       REV_CMN600_R2P0,
+       REV_CMN600_R3P0,
+       REV_CMN600_R3P1,
+       REV_CMN650_R0P0 = 0,
+       REV_CMN650_R1P0,
+       REV_CMN650_R1P1,
+       REV_CMN650_R2P0,
+       REV_CMN650_R1P2,
+       REV_CMN700_R0P0 = 0,
+       REV_CMN700_R1P0,
+       REV_CMN700_R2P0,
+       REV_CI700_R0P0 = 0,
+       REV_CI700_R1P0,
+       REV_CI700_R2P0,
 };
 
 enum cmn_node_type {
@@ -306,7 +318,7 @@ struct arm_cmn {
        unsigned int state;
 
        enum cmn_revision rev;
-       enum cmn_model model;
+       enum cmn_part part;
        u8 mesh_x;
        u8 mesh_y;
        u16 num_xps;
@@ -394,19 +406,35 @@ static struct arm_cmn_node *arm_cmn_node(const struct arm_cmn *cmn,
        return NULL;
 }
 
+static enum cmn_model arm_cmn_model(const struct arm_cmn *cmn)
+{
+       switch (cmn->part) {
+       case PART_CMN600:
+               return CMN600;
+       case PART_CMN650:
+               return CMN650;
+       case PART_CMN700:
+               return CMN700;
+       case PART_CI700:
+               return CI700;
+       default:
+               return 0;
+       }
+}
+
 static u32 arm_cmn_device_connect_info(const struct arm_cmn *cmn,
                                       const struct arm_cmn_node *xp, int port)
 {
        int offset = CMN_MXP__CONNECT_INFO(port);
 
        if (port >= 2) {
-               if (cmn->model & (CMN600 | CMN650))
+               if (cmn->part == PART_CMN600 || cmn->part == PART_CMN650)
                        return 0;
                /*
                 * CI-700 may have extra ports, but still has the
                 * mesh_port_connect_info registers in the way.
                 */
-               if (cmn->model == CI700)
+               if (cmn->part == PART_CI700)
                        offset += CI700_CONNECT_INFO_P2_5_OFFSET;
        }
 
@@ -640,7 +668,7 @@ static umode_t arm_cmn_event_attr_is_visible(struct kobject *kobj,
 
        eattr = container_of(attr, typeof(*eattr), attr.attr);
 
-       if (!(eattr->model & cmn->model))
+       if (!(eattr->model & arm_cmn_model(cmn)))
                return 0;
 
        type = eattr->type;
@@ -658,7 +686,7 @@ static umode_t arm_cmn_event_attr_is_visible(struct kobject *kobj,
                if ((intf & 4) && !(cmn->ports_used & BIT(intf & 3)))
                        return 0;
 
-               if (chan == 4 && cmn->model == CMN600)
+               if (chan == 4 && cmn->part == PART_CMN600)
                        return 0;
 
                if ((chan == 5 && cmn->rsp_vc_num < 2) ||
@@ -669,19 +697,19 @@ static umode_t arm_cmn_event_attr_is_visible(struct kobject *kobj,
        }
 
        /* Revision-specific differences */
-       if (cmn->model == CMN600) {
-               if (cmn->rev < CMN600_R1P3) {
+       if (cmn->part == PART_CMN600) {
+               if (cmn->rev < REV_CMN600_R1P3) {
                        if (type == CMN_TYPE_CXRA && eventid > 0x10)
                                return 0;
                }
-               if (cmn->rev < CMN600_R1P2) {
+               if (cmn->rev < REV_CMN600_R1P2) {
                        if (type == CMN_TYPE_HNF && eventid == 0x1b)
                                return 0;
                        if (type == CMN_TYPE_CXRA || type == CMN_TYPE_CXHA)
                                return 0;
                }
-       } else if (cmn->model == CMN650) {
-               if (cmn->rev < CMN650_R2P0 || cmn->rev == CMN650_R1P2) {
+       } else if (cmn->part == PART_CMN650) {
+               if (cmn->rev < REV_CMN650_R2P0 || cmn->rev == REV_CMN650_R1P2) {
                        if (type == CMN_TYPE_HNF && eventid > 0x22)
                                return 0;
                        if (type == CMN_TYPE_SBSX && eventid == 0x17)
@@ -689,8 +717,8 @@ static umode_t arm_cmn_event_attr_is_visible(struct kobject *kobj,
                        if (type == CMN_TYPE_RNI && eventid > 0x10)
                                return 0;
                }
-       } else if (cmn->model == CMN700) {
-               if (cmn->rev < CMN700_R2P0) {
+       } else if (cmn->part == PART_CMN700) {
+               if (cmn->rev < REV_CMN700_R2P0) {
                        if (type == CMN_TYPE_HNF && eventid > 0x2c)
                                return 0;
                        if (type == CMN_TYPE_CCHA && eventid > 0x74)
@@ -698,7 +726,7 @@ static umode_t arm_cmn_event_attr_is_visible(struct kobject *kobj,
                        if (type == CMN_TYPE_CCLA && eventid > 0x27)
                                return 0;
                }
-               if (cmn->rev < CMN700_R1P0) {
+               if (cmn->rev < REV_CMN700_R1P0) {
                        if (type == CMN_TYPE_HNF && eventid > 0x2b)
                                return 0;
                }
@@ -1171,19 +1199,31 @@ static ssize_t arm_cmn_cpumask_show(struct device *dev,
 static struct device_attribute arm_cmn_cpumask_attr =
                __ATTR(cpumask, 0444, arm_cmn_cpumask_show, NULL);
 
-static struct attribute *arm_cmn_cpumask_attrs[] = {
+static ssize_t arm_cmn_identifier_show(struct device *dev,
+                                      struct device_attribute *attr, char *buf)
+{
+       struct arm_cmn *cmn = to_cmn(dev_get_drvdata(dev));
+
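+       /* Part number then revision, e.g. part 0x43c at revision 2 reads back as "43c02" */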
+       return sysfs_emit(buf, "%03x%02x\n", cmn->part, cmn->rev);
+}
+
+static struct device_attribute arm_cmn_identifier_attr =
+               __ATTR(identifier, 0444, arm_cmn_identifier_show, NULL);
+
+static struct attribute *arm_cmn_other_attrs[] = {
        &arm_cmn_cpumask_attr.attr,
+       &arm_cmn_identifier_attr.attr,
        NULL,
 };
 
-static const struct attribute_group arm_cmn_cpumask_attr_group = {
-       .attrs = arm_cmn_cpumask_attrs,
+static const struct attribute_group arm_cmn_other_attrs_group = {
+       .attrs = arm_cmn_other_attrs,
 };
 
 static const struct attribute_group *arm_cmn_attr_groups[] = {
        &arm_cmn_event_attrs_group,
        &arm_cmn_format_attrs_group,
-       &arm_cmn_cpumask_attr_group,
+       &arm_cmn_other_attrs_group,
        NULL
 };
 
@@ -1200,7 +1240,7 @@ static u32 arm_cmn_wp_config(struct perf_event *event)
        u32 grp = CMN_EVENT_WP_GRP(event);
        u32 exc = CMN_EVENT_WP_EXCLUSIVE(event);
        u32 combine = CMN_EVENT_WP_COMBINE(event);
-       bool is_cmn600 = to_cmn(event->pmu)->model == CMN600;
+       bool is_cmn600 = to_cmn(event->pmu)->part == PART_CMN600;
 
        config = FIELD_PREP(CMN_DTM_WPn_CONFIG_WP_DEV_SEL, dev) |
                 FIELD_PREP(CMN_DTM_WPn_CONFIG_WP_CHN_SEL, chn) |
@@ -1520,14 +1560,14 @@ done:
        return ret;
 }
 
-static enum cmn_filter_select arm_cmn_filter_sel(enum cmn_model model,
+static enum cmn_filter_select arm_cmn_filter_sel(const struct arm_cmn *cmn,
                                                 enum cmn_node_type type,
                                                 unsigned int eventid)
 {
        struct arm_cmn_event_attr *e;
-       int i;
+       enum cmn_model model = arm_cmn_model(cmn);
 
-       for (i = 0; i < ARRAY_SIZE(arm_cmn_event_attrs) - 1; i++) {
+       for (int i = 0; i < ARRAY_SIZE(arm_cmn_event_attrs) - 1; i++) {
                e = container_of(arm_cmn_event_attrs[i], typeof(*e), attr.attr);
                if (e->model & model && e->type == type && e->eventid == eventid)
                        return e->fsel;
@@ -1570,12 +1610,12 @@ static int arm_cmn_event_init(struct perf_event *event)
                /* ...but the DTM may depend on which port we're watching */
                if (cmn->multi_dtm)
                        hw->dtm_offset = CMN_EVENT_WP_DEV_SEL(event) / 2;
-       } else if (type == CMN_TYPE_XP && cmn->model == CMN700) {
+       } else if (type == CMN_TYPE_XP && cmn->part == PART_CMN700) {
                hw->wide_sel = true;
        }
 
        /* This is sufficiently annoying to recalculate, so cache it */
-       hw->filter_sel = arm_cmn_filter_sel(cmn->model, type, eventid);
+       hw->filter_sel = arm_cmn_filter_sel(cmn, type, eventid);
 
        bynodeid = CMN_EVENT_BYNODEID(event);
        nodeid = CMN_EVENT_NODEID(event);
@@ -1899,9 +1939,10 @@ static int arm_cmn_init_dtc(struct arm_cmn *cmn, struct arm_cmn_node *dn, int id
        if (dtc->irq < 0)
                return dtc->irq;
 
-       writel_relaxed(0, dtc->base + CMN_DT_PMCR);
+       writel_relaxed(CMN_DT_DTC_CTL_DT_EN, dtc->base + CMN_DT_DTC_CTL);
+       writel_relaxed(CMN_DT_PMCR_PMU_EN | CMN_DT_PMCR_OVFL_INTR_EN, dtc->base + CMN_DT_PMCR);
+       writeq_relaxed(0, dtc->base + CMN_DT_PMCCNTR);
        writel_relaxed(0x1ff, dtc->base + CMN_DT_PMOVSR_CLR);
-       writel_relaxed(CMN_DT_PMCR_OVFL_INTR_EN, dtc->base + CMN_DT_PMCR);
 
        return 0;
 }
@@ -1961,7 +2002,7 @@ static int arm_cmn_init_dtcs(struct arm_cmn *cmn)
                        dn->type = CMN_TYPE_CCLA;
        }
 
-       writel_relaxed(CMN_DT_DTC_CTL_DT_EN, cmn->dtc[0].base + CMN_DT_DTC_CTL);
+       arm_cmn_set_state(cmn, CMN_STATE_DISABLED);
 
        return 0;
 }
@@ -2006,6 +2047,7 @@ static int arm_cmn_discover(struct arm_cmn *cmn, unsigned int rgn_offset)
        void __iomem *cfg_region;
        struct arm_cmn_node cfg, *dn;
        struct arm_cmn_dtm *dtm;
+       enum cmn_part part;
        u16 child_count, child_poff;
        u32 xp_offset[CMN_MAX_XPS];
        u64 reg;
@@ -2017,7 +2059,19 @@ static int arm_cmn_discover(struct arm_cmn *cmn, unsigned int rgn_offset)
                return -ENODEV;
 
        cfg_region = cmn->base + rgn_offset;
-       reg = readl_relaxed(cfg_region + CMN_CFGM_PERIPH_ID_2);
+
+       reg = readq_relaxed(cfg_region + CMN_CFGM_PERIPH_ID_01);
+       part = FIELD_GET(CMN_CFGM_PID0_PART_0, reg);
+       part |= FIELD_GET(CMN_CFGM_PID1_PART_1, reg) << 8;
+       if (cmn->part && cmn->part != part)
+               dev_warn(cmn->dev,
+                        "Firmware binding mismatch: expected part number 0x%x, found 0x%x\n",
+                        cmn->part, part);
+       cmn->part = part;
+       if (!arm_cmn_model(cmn))
+               dev_warn(cmn->dev, "Unknown part number: 0x%x\n", part);
+
+       reg = readl_relaxed(cfg_region + CMN_CFGM_PERIPH_ID_23);
        cmn->rev = FIELD_GET(CMN_CFGM_PID2_REVISION, reg);
 
        reg = readq_relaxed(cfg_region + CMN_CFGM_INFO_GLOBAL);
@@ -2081,7 +2135,7 @@ static int arm_cmn_discover(struct arm_cmn *cmn, unsigned int rgn_offset)
                if (xp->id == (1 << 3))
                        cmn->mesh_x = xp->logid;
 
-               if (cmn->model == CMN600)
+               if (cmn->part == PART_CMN600)
                        xp->dtc = 0xf;
                else
                        xp->dtc = 1 << readl_relaxed(xp_region + CMN_DTM_UNIT_INFO);
@@ -2201,7 +2255,7 @@ static int arm_cmn_discover(struct arm_cmn *cmn, unsigned int rgn_offset)
        if (cmn->num_xps == 1)
                dev_warn(cmn->dev, "1x1 config not fully supported, translate XP events manually\n");
 
-       dev_dbg(cmn->dev, "model %d, periph_id_2 revision %d\n", cmn->model, cmn->rev);
+       dev_dbg(cmn->dev, "periph_id part 0x%03x revision %d\n", cmn->part, cmn->rev);
        reg = cmn->ports_used;
        dev_dbg(cmn->dev, "mesh %dx%d, ID width %d, ports %6pbl%s\n",
                cmn->mesh_x, cmn->mesh_y, arm_cmn_xyidbits(cmn), &reg,
@@ -2256,17 +2310,17 @@ static int arm_cmn_probe(struct platform_device *pdev)
                return -ENOMEM;
 
        cmn->dev = &pdev->dev;
-       cmn->model = (unsigned long)device_get_match_data(cmn->dev);
+       cmn->part = (unsigned long)device_get_match_data(cmn->dev);
        platform_set_drvdata(pdev, cmn);
 
-       if (cmn->model == CMN600 && has_acpi_companion(cmn->dev)) {
+       if (cmn->part == PART_CMN600 && has_acpi_companion(cmn->dev)) {
                rootnode = arm_cmn600_acpi_probe(pdev, cmn);
        } else {
                rootnode = 0;
                cmn->base = devm_platform_ioremap_resource(pdev, 0);
                if (IS_ERR(cmn->base))
                        return PTR_ERR(cmn->base);
-               if (cmn->model == CMN600)
+               if (cmn->part == PART_CMN600)
                        rootnode = arm_cmn600_of_probe(pdev->dev.of_node);
        }
        if (rootnode < 0)
@@ -2335,10 +2389,10 @@ static int arm_cmn_remove(struct platform_device *pdev)
 
 #ifdef CONFIG_OF
 static const struct of_device_id arm_cmn_of_match[] = {
-       { .compatible = "arm,cmn-600", .data = (void *)CMN600 },
-       { .compatible = "arm,cmn-650", .data = (void *)CMN650 },
-       { .compatible = "arm,cmn-700", .data = (void *)CMN700 },
-       { .compatible = "arm,ci-700", .data = (void *)CI700 },
+       { .compatible = "arm,cmn-600", .data = (void *)PART_CMN600 },
+       { .compatible = "arm,cmn-650" },
+       { .compatible = "arm,cmn-700" },
+       { .compatible = "arm,ci-700" },
        {}
 };
 MODULE_DEVICE_TABLE(of, arm_cmn_of_match);
@@ -2346,9 +2400,9 @@ MODULE_DEVICE_TABLE(of, arm_cmn_of_match);
 
 #ifdef CONFIG_ACPI
 static const struct acpi_device_id arm_cmn_acpi_match[] = {
-       { "ARMHC600", CMN600 },
-       { "ARMHC650", CMN650 },
-       { "ARMHC700", CMN700 },
+       { "ARMHC600", PART_CMN600 },
+       { "ARMHC650" },
+       { "ARMHC700" },
        {}
 };
 MODULE_DEVICE_TABLE(acpi, arm_cmn_acpi_match);
index 0b316fe..25d25de 100644 (file)
@@ -4,8 +4,7 @@
 
 config ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU
        tristate "ARM Coresight Architecture PMU"
-       depends on ARM64 && ACPI
-       depends on ACPI_APMT || COMPILE_TEST
+       depends on ARM64 || COMPILE_TEST
        help
          Provides support for performance monitoring unit (PMU) devices
          based on ARM CoreSight PMU architecture. Note that this PMU
index a3f1c41..e2b7827 100644 (file)
@@ -28,7 +28,6 @@
 #include <linux/module.h>
 #include <linux/perf_event.h>
 #include <linux/platform_device.h>
-#include <acpi/processor.h>
 
 #include "arm_cspmu.h"
 #include "nvidia_cspmu.h"
 #define ARM_CSPMU_ACTIVE_CPU_MASK              0x0
 #define ARM_CSPMU_ASSOCIATED_CPU_MASK          0x1
 
-/* Check if field f in flags is set with value v */
-#define CHECK_APMT_FLAG(flags, f, v) \
-       ((flags & (ACPI_APMT_FLAGS_ ## f)) == (ACPI_APMT_FLAGS_ ## f ## _ ## v))
-
 /* Check and use default if implementer doesn't provide attribute callback */
 #define CHECK_DEFAULT_IMPL_OPS(ops, callback)                  \
        do {                                                    \
 
 static unsigned long arm_cspmu_cpuhp_state;
 
+static struct acpi_apmt_node *arm_cspmu_apmt_node(struct device *dev)
+{
+       return *(struct acpi_apmt_node **)dev_get_platdata(dev);
+}
+
 /*
  * In CoreSight PMU architecture, all of the MMIO registers are 32-bit except
  * counter register. The counter register can be implemented as 32-bit or 64-bit
@@ -156,12 +156,6 @@ static u64 read_reg64_hilohi(const void __iomem *addr, u32 max_poll_count)
        return val;
 }
 
-/* Check if PMU supports 64-bit single copy atomic. */
-static inline bool supports_64bit_atomics(const struct arm_cspmu *cspmu)
-{
-       return CHECK_APMT_FLAG(cspmu->apmt_node->flags, ATOMIC, SUPP);
-}
-
 /* Check if cycle counter is supported. */
 static inline bool supports_cycle_counter(const struct arm_cspmu *cspmu)
 {
@@ -189,10 +183,10 @@ static inline bool use_64b_counter_reg(const struct arm_cspmu *cspmu)
 ssize_t arm_cspmu_sysfs_event_show(struct device *dev,
                                struct device_attribute *attr, char *buf)
 {
-       struct dev_ext_attribute *eattr =
-               container_of(attr, struct dev_ext_attribute, attr);
-       return sysfs_emit(buf, "event=0x%llx\n",
-                         (unsigned long long)eattr->var);
+       struct perf_pmu_events_attr *pmu_attr;
+
+       pmu_attr = container_of(attr, typeof(*pmu_attr), attr);
+       return sysfs_emit(buf, "event=0x%llx\n", pmu_attr->id);
 }
 EXPORT_SYMBOL_GPL(arm_cspmu_sysfs_event_show);
 
@@ -320,7 +314,7 @@ static const char *arm_cspmu_get_name(const struct arm_cspmu *cspmu)
        static atomic_t pmu_idx[ACPI_APMT_NODE_TYPE_COUNT] = { 0 };
 
        dev = cspmu->dev;
-       apmt_node = cspmu->apmt_node;
+       apmt_node = arm_cspmu_apmt_node(dev);
        pmu_type = apmt_node->type;
 
        if (pmu_type >= ACPI_APMT_NODE_TYPE_COUNT) {
@@ -397,8 +391,8 @@ static const struct impl_match impl_match[] = {
 static int arm_cspmu_init_impl_ops(struct arm_cspmu *cspmu)
 {
        int ret;
-       struct acpi_apmt_node *apmt_node = cspmu->apmt_node;
        struct arm_cspmu_impl_ops *impl_ops = &cspmu->impl.ops;
+       struct acpi_apmt_node *apmt_node = arm_cspmu_apmt_node(cspmu->dev);
        const struct impl_match *match = impl_match;
 
        /*
@@ -720,7 +714,7 @@ static u64 arm_cspmu_read_counter(struct perf_event *event)
                offset = counter_offset(sizeof(u64), event->hw.idx);
                counter_addr = cspmu->base1 + offset;
 
-               return supports_64bit_atomics(cspmu) ?
+               return cspmu->has_atomic_dword ?
                               readq(counter_addr) :
                               read_reg64_hilohi(counter_addr, HILOHI_MAX_POLL);
        }
@@ -911,24 +905,18 @@ static struct arm_cspmu *arm_cspmu_alloc(struct platform_device *pdev)
 {
        struct acpi_apmt_node *apmt_node;
        struct arm_cspmu *cspmu;
-       struct device *dev;
-
-       dev = &pdev->dev;
-       apmt_node = *(struct acpi_apmt_node **)dev_get_platdata(dev);
-       if (!apmt_node) {
-               dev_err(dev, "failed to get APMT node\n");
-               return NULL;
-       }
+       struct device *dev = &pdev->dev;
 
        cspmu = devm_kzalloc(dev, sizeof(*cspmu), GFP_KERNEL);
        if (!cspmu)
                return NULL;
 
        cspmu->dev = dev;
-       cspmu->apmt_node = apmt_node;
-
        platform_set_drvdata(pdev, cspmu);
 
+       apmt_node = arm_cspmu_apmt_node(dev);
+       cspmu->has_atomic_dword = apmt_node->flags & ACPI_APMT_FLAGS_ATOMIC;
+
        return cspmu;
 }
 
@@ -936,11 +924,9 @@ static int arm_cspmu_init_mmio(struct arm_cspmu *cspmu)
 {
        struct device *dev;
        struct platform_device *pdev;
-       struct acpi_apmt_node *apmt_node;
 
        dev = cspmu->dev;
        pdev = to_platform_device(dev);
-       apmt_node = cspmu->apmt_node;
 
        /* Base address for page 0. */
        cspmu->base0 = devm_platform_ioremap_resource(pdev, 0);
@@ -951,7 +937,7 @@ static int arm_cspmu_init_mmio(struct arm_cspmu *cspmu)
 
        /* Base address for page 1 if supported. Otherwise point to page 0. */
        cspmu->base1 = cspmu->base0;
-       if (CHECK_APMT_FLAG(apmt_node->flags, DUAL_PAGE, SUPP)) {
+       if (platform_get_resource(pdev, IORESOURCE_MEM, 1)) {
                cspmu->base1 = devm_platform_ioremap_resource(pdev, 1);
                if (IS_ERR(cspmu->base1)) {
                        dev_err(dev, "ioremap failed for page-1 resource\n");
@@ -1048,19 +1034,14 @@ static int arm_cspmu_request_irq(struct arm_cspmu *cspmu)
        int irq, ret;
        struct device *dev;
        struct platform_device *pdev;
-       struct acpi_apmt_node *apmt_node;
 
        dev = cspmu->dev;
        pdev = to_platform_device(dev);
-       apmt_node = cspmu->apmt_node;
 
        /* Skip IRQ request if the PMU does not support overflow interrupt. */
-       if (apmt_node->ovflw_irq == 0)
-               return 0;
-
-       irq = platform_get_irq(pdev, 0);
+       irq = platform_get_irq_optional(pdev, 0);
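+       /* -ENXIO from the optional lookup means no overflow interrupt is described */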
        if (irq < 0)
-               return irq;
+               return irq == -ENXIO ? 0 : irq;
 
        ret = devm_request_irq(dev, irq, arm_cspmu_handle_irq,
                               IRQF_NOBALANCING | IRQF_NO_THREAD, dev_name(dev),
@@ -1075,6 +1056,9 @@ static int arm_cspmu_request_irq(struct arm_cspmu *cspmu)
        return 0;
 }
 
+#if defined(CONFIG_ACPI) && defined(CONFIG_ARM64)
+#include <acpi/processor.h>
+
 static inline int arm_cspmu_find_cpu_container(int cpu, u32 container_uid)
 {
        u32 acpi_uid;
@@ -1099,15 +1083,13 @@ static inline int arm_cspmu_find_cpu_container(int cpu, u32 container_uid)
        return -ENODEV;
 }
 
-static int arm_cspmu_get_cpus(struct arm_cspmu *cspmu)
+static int arm_cspmu_acpi_get_cpus(struct arm_cspmu *cspmu)
 {
-       struct device *dev;
        struct acpi_apmt_node *apmt_node;
        int affinity_flag;
        int cpu;
 
-       dev = cspmu->pmu.dev;
-       apmt_node = cspmu->apmt_node;
+       apmt_node = arm_cspmu_apmt_node(cspmu->dev);
        affinity_flag = apmt_node->flags & ACPI_APMT_FLAGS_AFFINITY;
 
        if (affinity_flag == ACPI_APMT_FLAGS_AFFINITY_PROC) {
@@ -1129,12 +1111,23 @@ static int arm_cspmu_get_cpus(struct arm_cspmu *cspmu)
        }
 
        if (cpumask_empty(&cspmu->associated_cpus)) {
-               dev_dbg(dev, "No cpu associated with the PMU\n");
+               dev_dbg(cspmu->dev, "No cpu associated with the PMU\n");
                return -ENODEV;
        }
 
        return 0;
 }
+#else
+static int arm_cspmu_acpi_get_cpus(struct arm_cspmu *cspmu)
+{
+       return -ENODEV;
+}
+#endif
+
+static int arm_cspmu_get_cpus(struct arm_cspmu *cspmu)
+{
+       return arm_cspmu_acpi_get_cpus(cspmu);
+}
 
 static int arm_cspmu_register_pmu(struct arm_cspmu *cspmu)
 {
@@ -1220,6 +1213,12 @@ static int arm_cspmu_device_remove(struct platform_device *pdev)
        return 0;
 }
 
+static const struct platform_device_id arm_cspmu_id[] = {
+       {DRVNAME, 0},
+       { },
+};
+MODULE_DEVICE_TABLE(platform, arm_cspmu_id);
+
 static struct platform_driver arm_cspmu_driver = {
        .driver = {
                        .name = DRVNAME,
@@ -1227,12 +1226,14 @@ static struct platform_driver arm_cspmu_driver = {
                },
        .probe = arm_cspmu_device_probe,
        .remove = arm_cspmu_device_remove,
+       .id_table = arm_cspmu_id,
 };
 
 static void arm_cspmu_set_active_cpu(int cpu, struct arm_cspmu *cspmu)
 {
        cpumask_set_cpu(cpu, &cspmu->active_cpu);
-       WARN_ON(irq_set_affinity(cspmu->irq, &cspmu->active_cpu));
+       if (cspmu->irq)
+               WARN_ON(irq_set_affinity(cspmu->irq, &cspmu->active_cpu));
 }
 
 static int arm_cspmu_cpu_online(unsigned int cpu, struct hlist_node *node)
index 51323b1..83df53d 100644 (file)
@@ -8,7 +8,6 @@
 #ifndef __ARM_CSPMU_H__
 #define __ARM_CSPMU_H__
 
-#include <linux/acpi.h>
 #include <linux/bitfield.h>
 #include <linux/cpumask.h>
 #include <linux/device.h>
@@ -118,16 +117,16 @@ struct arm_cspmu_impl {
 struct arm_cspmu {
        struct pmu pmu;
        struct device *dev;
-       struct acpi_apmt_node *apmt_node;
        const char *name;
        const char *identifier;
        void __iomem *base0;
        void __iomem *base1;
-       int irq;
        cpumask_t associated_cpus;
        cpumask_t active_cpu;
        struct hlist_node cpuhp_node;
+       int irq;
 
+       bool has_atomic_dword;
        u32 pmcfgr;
        u32 num_logical_ctrs;
        u32 num_set_clr_reg;
index 5de06f9..9d0f01c 100644 (file)
@@ -227,9 +227,31 @@ static const struct attribute_group dmc620_pmu_format_attr_group = {
        .attrs  = dmc620_pmu_formats_attrs,
 };
 
+static ssize_t dmc620_pmu_cpumask_show(struct device *dev,
+                                      struct device_attribute *attr, char *buf)
+{
+       struct dmc620_pmu *dmc620_pmu = to_dmc620_pmu(dev_get_drvdata(dev));
+
+       return cpumap_print_to_pagebuf(true, buf,
+                                      cpumask_of(dmc620_pmu->irq->cpu));
+}
+
+static struct device_attribute dmc620_pmu_cpumask_attr =
+       __ATTR(cpumask, 0444, dmc620_pmu_cpumask_show, NULL);
+
+static struct attribute *dmc620_pmu_cpumask_attrs[] = {
+       &dmc620_pmu_cpumask_attr.attr,
+       NULL,
+};
+
+static const struct attribute_group dmc620_pmu_cpumask_attr_group = {
+       .attrs = dmc620_pmu_cpumask_attrs,
+};
+
 static const struct attribute_group *dmc620_pmu_attr_groups[] = {
        &dmc620_pmu_events_attr_group,
        &dmc620_pmu_format_attr_group,
+       &dmc620_pmu_cpumask_attr_group,
        NULL,
 };
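Reviewer note: the new cpumask attribute tells userspace which CPU the DMC-620 events are serviced on (the CPU the overflow interrupt is affine to). A minimal sketch of consuming it; the instance name in the path is only an example and varies per system:

#include <stdio.h>

int main(void)
{
	char mask[64] = "";
	/* Instance name below is illustrative; it differs per system. */
	FILE *f = fopen("/sys/bus/event_source/devices/arm_dmc620_10008000/cpumask", "r");

	if (!f)
		return 1;
	if (fgets(mask, sizeof(mask), f))
		printf("bind dmc620 events to CPU(s): %s", mask);
	fclose(f);
	return 0;
}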
 
index 15bd1e3..f6ccb2c 100644 (file)
@@ -109,6 +109,8 @@ static inline u64 arm_pmu_event_max_period(struct perf_event *event)
 {
        if (event->hw.flags & ARMPMU_EVT_64BIT)
                return GENMASK_ULL(63, 0);
+       else if (event->hw.flags & ARMPMU_EVT_63BIT)
+               return GENMASK_ULL(62, 0);
        else if (event->hw.flags & ARMPMU_EVT_47BIT)
                return GENMASK_ULL(46, 0);
        else
@@ -687,6 +689,11 @@ static int armpmu_get_cpu_irq(struct arm_pmu *pmu, int cpu)
        return per_cpu(hw_events->irq, cpu);
 }
 
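+/* Report whether PMU interrupts are delivered as NMIs. */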
+bool arm_pmu_irq_is_nmi(void)
+{
+       return has_nmi;
+}
+
 /*
  * PMU hardware loses all context when a CPU goes offline.
  * When a CPU is hotplugged back in, since some hardware registers are
index 93b7edb..08b3a1b 100644 (file)
@@ -22,6 +22,7 @@
 #include <linux/platform_device.h>
 #include <linux/sched_clock.h>
 #include <linux/smp.h>
+#include <linux/nmi.h>
 
 #include <asm/arm_pmuv3.h>
 
@@ -1363,10 +1364,17 @@ static struct platform_driver armv8_pmu_driver = {
 
 static int __init armv8_pmu_driver_init(void)
 {
+       int ret;
+
        if (acpi_disabled)
-               return platform_driver_register(&armv8_pmu_driver);
+               ret = platform_driver_register(&armv8_pmu_driver);
        else
-               return arm_pmu_acpi_probe(armv8_pmuv3_pmu_init);
+               ret = arm_pmu_acpi_probe(armv8_pmuv3_pmu_init);
+
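+       /*
+        * The hard lockup detector may depend on the PMU that has only just
+        * been registered, so give it another chance to initialise.
+        */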
+       if (!ret)
+               lockup_detector_retry_init();
+
+       return ret;
 }
 device_initcall(armv8_pmu_driver_init)
 
diff --git a/drivers/perf/fsl_imx9_ddr_perf.c b/drivers/perf/fsl_imx9_ddr_perf.c
new file mode 100644 (file)
index 0000000..71d5b07
--- /dev/null
@@ -0,0 +1,711 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright 2023 NXP
+
+#include <linux/bitfield.h>
+#include <linux/init.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/of_address.h>
+#include <linux/of_device.h>
+#include <linux/of_irq.h>
+#include <linux/perf_event.h>
+
+/* Performance monitor configuration */
+#define PMCFG1                         0x00
+#define PMCFG1_RD_TRANS_FILT_EN        BIT(31)
+#define PMCFG1_WR_TRANS_FILT_EN        BIT(30)
+#define PMCFG1_RD_BT_FILT_EN           BIT(29)
+#define PMCFG1_ID_MASK                 GENMASK(17, 0)
+
+#define PMCFG2                         0x04
+#define PMCFG2_ID                      GENMASK(17, 0)
+
+/* Global control register affects all counters and takes priority over local control registers */
+#define PMGC0          0x40
+/* Global control register bits */
+#define PMGC0_FAC      BIT(31)
+#define PMGC0_PMIE     BIT(30)
+#define PMGC0_FCECE    BIT(29)
+
+/*
+ * The 64-bit counter 0 is exclusively dedicated to counting cycles.
+ * The 32-bit counters monitor counter-specific events in addition to
+ * counting the reference events.
+ */
+#define PMLCA(n)       (0x40 + 0x10 + (0x10 * (n)))
+#define PMLCB(n)       (0x40 + 0x14 + (0x10 * (n)))
+#define PMC(n)         (0x40 + 0x18 + (0x10 * (n)))
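+/*
+ * e.g. counter 0: PMLCA at 0x50, PMLCB at 0x54, PMC at 0x58; each counter's
+ * register block occupies a 0x10-byte stride.
+ */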
+/* Local control register bits */
+#define PMLCA_FC       BIT(31)
+#define PMLCA_CE       BIT(26)
+#define PMLCA_EVENT    GENMASK(22, 16)
+
+#define NUM_COUNTERS           11
+#define CYCLES_COUNTER         0
+
+#define to_ddr_pmu(p)          container_of(p, struct ddr_pmu, pmu)
+
+#define DDR_PERF_DEV_NAME      "imx9_ddr"
+#define DDR_CPUHP_CB_NAME      DDR_PERF_DEV_NAME "_perf_pmu"
+
+static DEFINE_IDA(ddr_ida);
+
+struct imx_ddr_devtype_data {
+       const char *identifier;         /* system PMU identifier for userspace */
+};
+
+struct ddr_pmu {
+       struct pmu pmu;
+       void __iomem *base;
+       unsigned int cpu;
+       struct hlist_node node;
+       struct device *dev;
+       struct perf_event *events[NUM_COUNTERS];
+       int active_events;
+       enum cpuhp_state cpuhp_state;
+       const struct imx_ddr_devtype_data *devtype_data;
+       int irq;
+       int id;
+};
+
+static const struct imx_ddr_devtype_data imx93_devtype_data = {
+       .identifier = "imx93",
+};
+
+static const struct of_device_id imx_ddr_pmu_dt_ids[] = {
+       {.compatible = "fsl,imx93-ddr-pmu", .data = &imx93_devtype_data},
+       { /* sentinel */ }
+};
+MODULE_DEVICE_TABLE(of, imx_ddr_pmu_dt_ids);
+
+static ssize_t ddr_perf_identifier_show(struct device *dev,
+                                       struct device_attribute *attr,
+                                       char *page)
+{
+       struct ddr_pmu *pmu = dev_get_drvdata(dev);
+
+       return sysfs_emit(page, "%s\n", pmu->devtype_data->identifier);
+}
+
+static struct device_attribute ddr_perf_identifier_attr =
+       __ATTR(identifier, 0444, ddr_perf_identifier_show, NULL);
+
+static struct attribute *ddr_perf_identifier_attrs[] = {
+       &ddr_perf_identifier_attr.attr,
+       NULL,
+};
+
+static struct attribute_group ddr_perf_identifier_attr_group = {
+       .attrs = ddr_perf_identifier_attrs,
+};
+
+static ssize_t ddr_perf_cpumask_show(struct device *dev,
+                                    struct device_attribute *attr, char *buf)
+{
+       struct ddr_pmu *pmu = dev_get_drvdata(dev);
+
+       return cpumap_print_to_pagebuf(true, buf, cpumask_of(pmu->cpu));
+}
+
+static struct device_attribute ddr_perf_cpumask_attr =
+       __ATTR(cpumask, 0444, ddr_perf_cpumask_show, NULL);
+
+static struct attribute *ddr_perf_cpumask_attrs[] = {
+       &ddr_perf_cpumask_attr.attr,
+       NULL,
+};
+
+static const struct attribute_group ddr_perf_cpumask_attr_group = {
+       .attrs = ddr_perf_cpumask_attrs,
+};
+
+static ssize_t ddr_pmu_event_show(struct device *dev,
+                                 struct device_attribute *attr, char *page)
+{
+       struct perf_pmu_events_attr *pmu_attr;
+
+       pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr);
+       return sysfs_emit(page, "event=0x%02llx\n", pmu_attr->id);
+}
+
+#define IMX9_DDR_PMU_EVENT_ATTR(_name, _id)                            \
+       (&((struct perf_pmu_events_attr[]) {                            \
+               { .attr = __ATTR(_name, 0444, ddr_pmu_event_show, NULL),\
+                 .id = _id, }                                          \
+       })[0].attr.attr)
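+/*
+ * Each use instantiates an unnamed perf_pmu_events_attr via a compound
+ * literal and evaluates to a pointer to its embedded struct attribute.
+ */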
+
+static struct attribute *ddr_perf_events_attrs[] = {
+       /* counter0 cycles event */
+       IMX9_DDR_PMU_EVENT_ATTR(cycles, 0),
+
+       /* reference events for all normal counters; the DEBUG19[21] bit must be asserted */
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ddrc1_rmw_for_ecc, 12),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_rreorder, 13),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_wreorder, 14),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_0, 15),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_1, 16),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_2, 17),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_3, 18),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_4, 19),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_5, 22),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_6, 23),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_7, 24),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_8, 25),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_9, 26),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_10, 27),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_11, 28),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_12, 31),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_13, 59),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_15, 61),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_pm_29, 63),
+
+       /* counter1 specific events */
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_riq_0, 64),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_riq_1, 65),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_riq_2, 66),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_riq_3, 67),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_riq_4, 68),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_riq_5, 69),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_riq_6, 70),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_riq_7, 71),
+
+       /* counter2 specific events */
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_wiq_0, 64),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_wiq_1, 65),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_wiq_2, 66),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_wiq_3, 67),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_wiq_4, 68),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_wiq_5, 69),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_wiq_6, 70),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_ld_wiq_7, 71),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_empty, 72),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pm_rd_trans_filt, 73),
+
+       /* counter3 specific events */
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_collision_0, 64),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_collision_1, 65),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_collision_2, 66),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_collision_3, 67),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_collision_4, 68),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_collision_5, 69),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_collision_6, 70),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_collision_7, 71),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_full, 72),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pm_wr_trans_filt, 73),
+
+       /* counter4 specific events */
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_open_0, 64),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_open_1, 65),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_open_2, 66),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_open_3, 67),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_open_4, 68),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_open_5, 69),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_open_6, 70),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_row_open_7, 71),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_ld_rdq2_rmw, 72),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pm_rd_beat_filt, 73),
+
+       /* counter5 specific events */
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_valid_start_0, 64),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_valid_start_1, 65),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_valid_start_2, 66),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_valid_start_3, 67),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_valid_start_4, 68),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_valid_start_5, 69),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_valid_start_6, 70),
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_valid_start_7, 71),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_ld_rdq1, 72),
+
+       /* counter6 specific events */
+       IMX9_DDR_PMU_EVENT_ATTR(ddrc_qx_valid_end_0, 64),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_ld_rdq2, 72),
+
+       /* counter7 specific events */
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_1_2_full, 64),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_ld_wrq0, 65),
+
+       /* counter8 specific events */
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_bias_switched, 64),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_1_4_full, 65),
+
+       /* counter9 specific events */
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_ld_wrq1, 65),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_3_4_full, 66),
+
+       /* counter10 specific events */
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_misc_mrk, 65),
+       IMX9_DDR_PMU_EVENT_ATTR(eddrtq_pmon_ld_rdq0, 66),
+       NULL,
+};
+
+static const struct attribute_group ddr_perf_events_attr_group = {
+       .name = "events",
+       .attrs = ddr_perf_events_attrs,
+};
+
+PMU_FORMAT_ATTR(event, "config:0-7");
+PMU_FORMAT_ATTR(counter, "config:8-15");
+PMU_FORMAT_ATTR(axi_id, "config1:0-17");
+PMU_FORMAT_ATTR(axi_mask, "config2:0-17");
+
+static struct attribute *ddr_perf_format_attrs[] = {
+       &format_attr_event.attr,
+       &format_attr_counter.attr,
+       &format_attr_axi_id.attr,
+       &format_attr_axi_mask.attr,
+       NULL,
+};
+
+static const struct attribute_group ddr_perf_format_attr_group = {
+       .name = "format",
+       .attrs = ddr_perf_format_attrs,
+};
+
+static const struct attribute_group *attr_groups[] = {
+       &ddr_perf_identifier_attr_group,
+       &ddr_perf_cpumask_attr_group,
+       &ddr_perf_events_attr_group,
+       &ddr_perf_format_attr_group,
+       NULL,
+};
+
+static void ddr_perf_clear_counter(struct ddr_pmu *pmu, int counter)
+{
+       if (counter == CYCLES_COUNTER) {
+               writel(0, pmu->base + PMC(counter) + 0x4);
+               writel(0, pmu->base + PMC(counter));
+       } else {
+               writel(0, pmu->base + PMC(counter));
+       }
+}
+
+static u64 ddr_perf_read_counter(struct ddr_pmu *pmu, int counter)
+{
+       u32 val_lower, val_upper;
+       u64 val;
+
+       if (counter != CYCLES_COUNTER) {
+               val = readl_relaxed(pmu->base + PMC(counter));
+               goto out;
+       }
+
+       /* Special handling for reading the 64-bit cycle counter: re-read the
+        * high word to catch a low-word rollover between the two 32-bit reads */
+       do {
+               val_upper = readl_relaxed(pmu->base + PMC(counter) + 0x4);
+               val_lower = readl_relaxed(pmu->base + PMC(counter));
+       } while (val_upper != readl_relaxed(pmu->base + PMC(counter) + 0x4));
+
+       val = ((u64)val_upper << 32) | val_lower;
+out:
+       return val;
+}
+
+static void ddr_perf_counter_global_config(struct ddr_pmu *pmu, bool enable)
+{
+       u32 ctrl;
+
+       ctrl = readl_relaxed(pmu->base + PMGC0);
+
+       if (enable) {
+               /*
+                * The performance monitor must be reset before event counting
+                * sequences. The performance monitor can be reset by first freezing
+                * one or more counters and then clearing the freeze condition to
+                * allow the counters to count according to the settings in the
+                * performance monitor registers. Counters can be frozen individually
+                * by setting PMLCAn[FC] bits, or simultaneously by setting PMGC0[FAC].
+                * Simply clearing these freeze bits will then allow the performance
+                * monitor to begin counting based on the register settings.
+                */
+               ctrl |= PMGC0_FAC;
+               writel(ctrl, pmu->base + PMGC0);
+
+               /*
+                * Clear the freeze-all bit, enable the PMU interrupt, and
+                * enable freezing of the counters when an enabled condition
+                * or event occurs (FCECE).
+                */
+               ctrl &= ~PMGC0_FAC;
+               ctrl |= PMGC0_PMIE | PMGC0_FCECE;
+               writel(ctrl, pmu->base + PMGC0);
+       } else {
+               ctrl |= PMGC0_FAC;
+               ctrl &= ~(PMGC0_PMIE | PMGC0_FCECE);
+               writel(ctrl, pmu->base + PMGC0);
+       }
+}
+
+static void ddr_perf_counter_local_config(struct ddr_pmu *pmu, int config,
+                                   int counter, bool enable)
+{
+       u32 ctrl_a;
+
+       ctrl_a = readl_relaxed(pmu->base + PMLCA(counter));
+
+       if (enable) {
+               ctrl_a |= PMLCA_FC;
+               writel(ctrl_a, pmu->base + PMLCA(counter));
+
+               ddr_perf_clear_counter(pmu, counter);
+
+               /* Unfreeze the counter, enable its condition (CE) and program the event. */
+               ctrl_a &= ~PMLCA_FC;
+               ctrl_a |= PMLCA_CE;
+               ctrl_a &= ~FIELD_PREP(PMLCA_EVENT, 0x7F);
+               ctrl_a |= FIELD_PREP(PMLCA_EVENT, (config & 0x000000FF));
+               writel(ctrl_a, pmu->base + PMLCA(counter));
+       } else {
+               /* Freeze counter. */
+               ctrl_a |= PMLCA_FC;
+               writel(ctrl_a, pmu->base + PMLCA(counter));
+       }
+}
+
+static void ddr_perf_monitor_config(struct ddr_pmu *pmu, int cfg, int cfg1, int cfg2)
+{
+       u32 pmcfg1, pmcfg2;
+       int event, counter;
+
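+       /*
+        * Event 73 on counters 2, 3 and 4 selects the AXI-filtered read
+        * transaction, write transaction and read beat events; the filter ID
+        * comes from config1 (programmed into PMCFG2) and the mask from
+        * config2 (programmed into PMCFG1).
+        */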
+       event = cfg & 0x000000FF;
+       counter = (cfg & 0x0000FF00) >> 8;
+
+       pmcfg1 = readl_relaxed(pmu->base + PMCFG1);
+
+       if (counter == 2 && event == 73)
+               pmcfg1 |= PMCFG1_RD_TRANS_FILT_EN;
+       else if (counter == 2 && event != 73)
+               pmcfg1 &= ~PMCFG1_RD_TRANS_FILT_EN;
+
+       if (counter == 3 && event == 73)
+               pmcfg1 |= PMCFG1_WR_TRANS_FILT_EN;
+       else if (counter == 3 && event != 73)
+               pmcfg1 &= ~PMCFG1_WR_TRANS_FILT_EN;
+
+       if (counter == 4 && event == 73)
+               pmcfg1 |= PMCFG1_RD_BT_FILT_EN;
+       else if (counter == 4 && event != 73)
+               pmcfg1 &= ~PMCFG1_RD_BT_FILT_EN;
+
+       pmcfg1 &= ~FIELD_PREP(PMCFG1_ID_MASK, 0x3FFFF);
+       pmcfg1 |= FIELD_PREP(PMCFG1_ID_MASK, cfg2);
+       writel(pmcfg1, pmu->base + PMCFG1);
+
+       pmcfg2 = readl_relaxed(pmu->base + PMCFG2);
+       pmcfg2 &= ~FIELD_PREP(PMCFG2_ID, 0x3FFFF);
+       pmcfg2 |= FIELD_PREP(PMCFG2_ID, cfg1);
+       writel(pmcfg2, pmu->base + PMCFG2);
+}
+
+static void ddr_perf_event_update(struct perf_event *event)
+{
+       struct ddr_pmu *pmu = to_ddr_pmu(event->pmu);
+       struct hw_perf_event *hwc = &event->hw;
+       int counter = hwc->idx;
+       u64 new_raw_count;
+
+       new_raw_count = ddr_perf_read_counter(pmu, counter);
+       local64_add(new_raw_count, &event->count);
+
+       /* clear counter's value every time */
+       ddr_perf_clear_counter(pmu, counter);
+}
+
+static int ddr_perf_event_init(struct perf_event *event)
+{
+       struct ddr_pmu *pmu = to_ddr_pmu(event->pmu);
+       struct hw_perf_event *hwc = &event->hw;
+       struct perf_event *sibling;
+
+       if (event->attr.type != event->pmu->type)
+               return -ENOENT;
+
+       if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK)
+               return -EOPNOTSUPP;
+
+       if (event->cpu < 0) {
+               dev_warn(pmu->dev, "Can't provide per-task data!\n");
+               return -EOPNOTSUPP;
+       }
+
+       /*
+        * We must NOT create groups containing mixed PMUs, although software
+        * events are acceptable (for example, to create a group that is read
+        * periodically when an hrtimer, aka cpu-clock, leader triggers).
+        */
+       if (event->group_leader->pmu != event->pmu &&
+                       !is_software_event(event->group_leader))
+               return -EINVAL;
+
+       for_each_sibling_event(sibling, event->group_leader) {
+               if (sibling->pmu != event->pmu &&
+                               !is_software_event(sibling))
+                       return -EINVAL;
+       }
+
+       event->cpu = pmu->cpu;
+       hwc->idx = -1;
+
+       return 0;
+}
+
+static void ddr_perf_event_start(struct perf_event *event, int flags)
+{
+       struct ddr_pmu *pmu = to_ddr_pmu(event->pmu);
+       struct hw_perf_event *hwc = &event->hw;
+       int counter = hwc->idx;
+
+       local64_set(&hwc->prev_count, 0);
+
+       ddr_perf_counter_local_config(pmu, event->attr.config, counter, true);
+       hwc->state = 0;
+}
+
+static int ddr_perf_event_add(struct perf_event *event, int flags)
+{
+       struct ddr_pmu *pmu = to_ddr_pmu(event->pmu);
+       struct hw_perf_event *hwc = &event->hw;
+       int cfg = event->attr.config;
+       int cfg1 = event->attr.config1;
+       int cfg2 = event->attr.config2;
+       int counter;
+
+       counter = (cfg & 0x0000FF00) >> 8;
+
+       pmu->events[counter] = event;
+       pmu->active_events++;
+       hwc->idx = counter;
+       hwc->state |= PERF_HES_STOPPED;
+
+       if (flags & PERF_EF_START)
+               ddr_perf_event_start(event, flags);
+
+       /* read trans, write trans, read beat */
+       ddr_perf_monitor_config(pmu, cfg, cfg1, cfg2);
+
+       return 0;
+}
+
+static void ddr_perf_event_stop(struct perf_event *event, int flags)
+{
+       struct ddr_pmu *pmu = to_ddr_pmu(event->pmu);
+       struct hw_perf_event *hwc = &event->hw;
+       int counter = hwc->idx;
+
+       ddr_perf_counter_local_config(pmu, event->attr.config, counter, false);
+       ddr_perf_event_update(event);
+
+       hwc->state |= PERF_HES_STOPPED;
+}
+
+static void ddr_perf_event_del(struct perf_event *event, int flags)
+{
+       struct ddr_pmu *pmu = to_ddr_pmu(event->pmu);
+       struct hw_perf_event *hwc = &event->hw;
+
+       ddr_perf_event_stop(event, PERF_EF_UPDATE);
+
+       pmu->active_events--;
+       hwc->idx = -1;
+}
+
+static void ddr_perf_pmu_enable(struct pmu *pmu)
+{
+       struct ddr_pmu *ddr_pmu = to_ddr_pmu(pmu);
+
+       ddr_perf_counter_global_config(ddr_pmu, true);
+}
+
+static void ddr_perf_pmu_disable(struct pmu *pmu)
+{
+       struct ddr_pmu *ddr_pmu = to_ddr_pmu(pmu);
+
+       ddr_perf_counter_global_config(ddr_pmu, false);
+}
+
+static void ddr_perf_init(struct ddr_pmu *pmu, void __iomem *base,
+                        struct device *dev)
+{
+       *pmu = (struct ddr_pmu) {
+               .pmu = (struct pmu) {
+                       .module       = THIS_MODULE,
+                       .capabilities = PERF_PMU_CAP_NO_EXCLUDE,
+                       .task_ctx_nr  = perf_invalid_context,
+                       .attr_groups  = attr_groups,
+                       .event_init   = ddr_perf_event_init,
+                       .add          = ddr_perf_event_add,
+                       .del          = ddr_perf_event_del,
+                       .start        = ddr_perf_event_start,
+                       .stop         = ddr_perf_event_stop,
+                       .read         = ddr_perf_event_update,
+                       .pmu_enable   = ddr_perf_pmu_enable,
+                       .pmu_disable  = ddr_perf_pmu_disable,
+               },
+               .base = base,
+               .dev = dev,
+       };
+}
+
+static irqreturn_t ddr_perf_irq_handler(int irq, void *p)
+{
+       struct ddr_pmu *pmu = (struct ddr_pmu *)p;
+       struct perf_event *event;
+       int i;
+
+       /*
+        * Counters can generate an interrupt on an overflow when the MSB of a
+        * counter changes from 0 to 1. For the interrupt to be signalled, the
+        * following conditions must be satisfied:
+        * PMGC0[PMIE] = 1, PMGC0[FCECE] = 1, PMLCAn[CE] = 1
+        * When an interrupt is signalled, PMGC0[FAC] is set by hardware and
+        * all of the registers are frozen.
+        * Software can clear the interrupt condition by resetting the
+        * performance monitor and clearing the most significant bit of the
+        * counter that generated the overflow.
+        */
+       for (i = 0; i < NUM_COUNTERS; i++) {
+               if (!pmu->events[i])
+                       continue;
+
+               event = pmu->events[i];
+
+               ddr_perf_event_update(event);
+       }
+
+       ddr_perf_counter_global_config(pmu, true);
+
+       return IRQ_HANDLED;
+}
+
+static int ddr_perf_offline_cpu(unsigned int cpu, struct hlist_node *node)
+{
+       struct ddr_pmu *pmu = hlist_entry_safe(node, struct ddr_pmu, node);
+       int target;
+
+       if (cpu != pmu->cpu)
+               return 0;
+
+       target = cpumask_any_but(cpu_online_mask, cpu);
+       if (target >= nr_cpu_ids)
+               return 0;
+
+       perf_pmu_migrate_context(&pmu->pmu, cpu, target);
+       pmu->cpu = target;
+
+       WARN_ON(irq_set_affinity(pmu->irq, cpumask_of(pmu->cpu)));
+
+       return 0;
+}
+
+static int ddr_perf_probe(struct platform_device *pdev)
+{
+       struct ddr_pmu *pmu;
+       void __iomem *base;
+       int ret, irq;
+       char *name;
+
+       base = devm_platform_ioremap_resource(pdev, 0);
+       if (IS_ERR(base))
+               return PTR_ERR(base);
+
+       pmu = devm_kzalloc(&pdev->dev, sizeof(*pmu), GFP_KERNEL);
+       if (!pmu)
+               return -ENOMEM;
+
+       ddr_perf_init(pmu, base, &pdev->dev);
+
+       pmu->devtype_data = of_device_get_match_data(&pdev->dev);
+
+       platform_set_drvdata(pdev, pmu);
+
+       pmu->id = ida_simple_get(&ddr_ida, 0, 0, GFP_KERNEL);
+       name = devm_kasprintf(&pdev->dev, GFP_KERNEL, DDR_PERF_DEV_NAME "%d", pmu->id);
+       if (!name) {
+               ret = -ENOMEM;
+               goto format_string_err;
+       }
+
+       pmu->cpu = raw_smp_processor_id();
+       ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, DDR_CPUHP_CB_NAME,
+                                     NULL, ddr_perf_offline_cpu);
+       if (ret < 0) {
+               dev_err(&pdev->dev, "Failed to add callbacks for multi state\n");
+               goto cpuhp_state_err;
+       }
+       pmu->cpuhp_state = ret;
+
+       /* Register the pmu instance for cpu hotplug */
+       ret = cpuhp_state_add_instance_nocalls(pmu->cpuhp_state, &pmu->node);
+       if (ret) {
+               dev_err(&pdev->dev, "Error %d registering hotplug\n", ret);
+               goto cpuhp_instance_err;
+       }
+
+       /* Request irq */
+       irq = platform_get_irq(pdev, 0);
+       if (irq < 0) {
+               ret = irq;
+               goto ddr_perf_err;
+       }
+
+       ret = devm_request_irq(&pdev->dev, irq, ddr_perf_irq_handler,
+                              IRQF_NOBALANCING | IRQF_NO_THREAD,
+                              DDR_CPUHP_CB_NAME, pmu);
+       if (ret < 0) {
+               dev_err(&pdev->dev, "Request irq failed: %d", ret);
+               goto ddr_perf_err;
+       }
+
+       pmu->irq = irq;
+       ret = irq_set_affinity(pmu->irq, cpumask_of(pmu->cpu));
+       if (ret) {
+               dev_err(pmu->dev, "Failed to set interrupt affinity\n");
+               goto ddr_perf_err;
+       }
+
+       ret = perf_pmu_register(&pmu->pmu, name, -1);
+       if (ret)
+               goto ddr_perf_err;
+
+       return 0;
+
+ddr_perf_err:
+       cpuhp_state_remove_instance_nocalls(pmu->cpuhp_state, &pmu->node);
+cpuhp_instance_err:
+       cpuhp_remove_multi_state(pmu->cpuhp_state);
+cpuhp_state_err:
+format_string_err:
+       ida_simple_remove(&ddr_ida, pmu->id);
+       dev_warn(&pdev->dev, "i.MX9 DDR Perf PMU failed (%d), disabled\n", ret);
+       return ret;
+}
+
+static int ddr_perf_remove(struct platform_device *pdev)
+{
+       struct ddr_pmu *pmu = platform_get_drvdata(pdev);
+
+       cpuhp_state_remove_instance_nocalls(pmu->cpuhp_state, &pmu->node);
+       cpuhp_remove_multi_state(pmu->cpuhp_state);
+
+       perf_pmu_unregister(&pmu->pmu);
+
+       ida_simple_remove(&ddr_ida, pmu->id);
+
+       return 0;
+}
+
+static struct platform_driver imx_ddr_pmu_driver = {
+       .driver         = {
+               .name                = "imx9-ddr-pmu",
+               .of_match_table      = imx_ddr_pmu_dt_ids,
+               .suppress_bind_attrs = true,
+       },
+       .probe          = ddr_perf_probe,
+       .remove         = ddr_perf_remove,
+};
+module_platform_driver(imx_ddr_pmu_driver);
+
+MODULE_AUTHOR("Xu Yang <xu.yang_2@nxp.com>");
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("DDRC PerfMon for i.MX9 SoCs");
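Reviewer note: a minimal userspace sketch of how the format fields above (event in config:0-7, counter in config:8-15, AXI id/mask in config1/config2) translate into a perf_event_open() call. The "imx9_ddr0" sysfs name (first IDA instance of DDR_PERF_DEV_NAME) and the use of CPU 0 are illustrative assumptions, not something the patch mandates:

/* Illustrative only: count DDR cycles (event 0 on counter 0) for one second. */
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr = { 0 };
	uint64_t count;
	int type, fd;
	FILE *f;

	/* Every registered PMU exports its dynamic type id here. */
	f = fopen("/sys/bus/event_source/devices/imx9_ddr0/type", "r");
	if (!f || fscanf(f, "%d", &type) != 1)
		return 1;
	fclose(f);

	attr.type = type;
	attr.size = sizeof(attr);
	attr.config = 0;	/* event = 0 (cycles), counter = 0 */
	attr.disabled = 1;

	/* Uncore PMU: pid = -1, cpu ideally taken from .../cpumask (0 here). */
	fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	if (fd < 0)
		return 1;

	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	sleep(1);
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("imx9 ddr cycles: %llu\n", (unsigned long long)count);
	close(fd);
	return 0;
}

Note the design choice visible in ddr_perf_event_add(): the counter index is chosen by the user through config:8-15 rather than being allocated by the driver.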
index 4d2c9ab..48dcc83 100644 (file)
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o \
                          hisi_uncore_hha_pmu.o hisi_uncore_ddrc_pmu.o hisi_uncore_sllc_pmu.o \
-                         hisi_uncore_pa_pmu.o hisi_uncore_cpa_pmu.o
+                         hisi_uncore_pa_pmu.o hisi_uncore_cpa_pmu.o hisi_uncore_uc_pmu.o
 
 obj-$(CONFIG_HISI_PCIE_PMU) += hisi_pcie_pmu.o
 obj-$(CONFIG_HNS3_PMU) += hns3_pmu.o
index 6fee0b6..e10fc7c 100644 (file)
@@ -683,7 +683,7 @@ static int hisi_pcie_pmu_offline_cpu(unsigned int cpu, struct hlist_node *node)
 
        pcie_pmu->on_cpu = -1;
        /* Choose a new CPU from all online cpus. */
-       target = cpumask_first(cpu_online_mask);
+       target = cpumask_any_but(cpu_online_mask, cpu);
        if (target >= nr_cpu_ids) {
                pci_err(pcie_pmu->pdev, "There is no CPU to set\n");
                return 0;
index 71b6687..d941e74 100644 (file)
 #define PA_TT_CTRL                     0x1c08
 #define PA_TGTID_CTRL                  0x1c14
 #define PA_SRCID_CTRL                  0x1c18
+
+/* H32 PA interrupt registers */
 #define PA_INT_MASK                    0x1c70
 #define PA_INT_STATUS                  0x1c78
 #define PA_INT_CLEAR                   0x1c7c
+
+#define H60PA_INT_STATUS               0x1c70
+#define H60PA_INT_MASK                 0x1c74
+
 #define PA_EVENT_TYPE0                 0x1c80
 #define PA_PMU_VERSION                 0x1cf0
 #define PA_EVENT_CNT0_L                        0x1d00
@@ -46,6 +52,12 @@ HISI_PMU_EVENT_ATTR_EXTRACTOR(srcid_cmd, config1, 32, 22);
 HISI_PMU_EVENT_ATTR_EXTRACTOR(srcid_msk, config1, 43, 33);
 HISI_PMU_EVENT_ATTR_EXTRACTOR(tracetag_en, config1, 44, 44);
 
+struct hisi_pa_pmu_int_regs {
+       u32 mask_offset;
+       u32 clear_offset;
+       u32 status_offset;
+};
+
 static void hisi_pa_pmu_enable_tracetag(struct perf_event *event)
 {
        struct hisi_pmu *pa_pmu = to_hisi_pmu(event->pmu);
@@ -219,40 +231,40 @@ static void hisi_pa_pmu_disable_counter(struct hisi_pmu *pa_pmu,
 static void hisi_pa_pmu_enable_counter_int(struct hisi_pmu *pa_pmu,
                                           struct hw_perf_event *hwc)
 {
+       struct hisi_pa_pmu_int_regs *regs = pa_pmu->dev_info->private;
        u32 val;
 
        /* Write 0 to enable interrupt */
-       val = readl(pa_pmu->base + PA_INT_MASK);
+       val = readl(pa_pmu->base + regs->mask_offset);
        val &= ~(1 << hwc->idx);
-       writel(val, pa_pmu->base + PA_INT_MASK);
+       writel(val, pa_pmu->base + regs->mask_offset);
 }
 
 static void hisi_pa_pmu_disable_counter_int(struct hisi_pmu *pa_pmu,
                                            struct hw_perf_event *hwc)
 {
+       struct hisi_pa_pmu_int_regs *regs = pa_pmu->dev_info->private;
        u32 val;
 
        /* Write 1 to mask interrupt */
-       val = readl(pa_pmu->base + PA_INT_MASK);
+       val = readl(pa_pmu->base + regs->mask_offset);
        val |= 1 << hwc->idx;
-       writel(val, pa_pmu->base + PA_INT_MASK);
+       writel(val, pa_pmu->base + regs->mask_offset);
 }
 
 static u32 hisi_pa_pmu_get_int_status(struct hisi_pmu *pa_pmu)
 {
-       return readl(pa_pmu->base + PA_INT_STATUS);
+       struct hisi_pa_pmu_int_regs *regs = pa_pmu->dev_info->private;
+
+       return readl(pa_pmu->base + regs->status_offset);
 }
 
 static void hisi_pa_pmu_clear_int_status(struct hisi_pmu *pa_pmu, int idx)
 {
-       writel(1 << idx, pa_pmu->base + PA_INT_CLEAR);
-}
+       struct hisi_pa_pmu_int_regs *regs = pa_pmu->dev_info->private;
 
-static const struct acpi_device_id hisi_pa_pmu_acpi_match[] = {
-       { "HISI0273", },
-       {}
-};
-MODULE_DEVICE_TABLE(acpi, hisi_pa_pmu_acpi_match);
+       writel(1 << idx, pa_pmu->base + regs->clear_offset);
+}
 
 static int hisi_pa_pmu_init_data(struct platform_device *pdev,
                                   struct hisi_pmu *pa_pmu)
@@ -276,6 +288,10 @@ static int hisi_pa_pmu_init_data(struct platform_device *pdev,
        pa_pmu->ccl_id = -1;
        pa_pmu->sccl_id = -1;
 
+       pa_pmu->dev_info = device_get_match_data(&pdev->dev);
+       if (!pa_pmu->dev_info)
+               return -ENODEV;
+
        pa_pmu->base = devm_platform_ioremap_resource(pdev, 0);
        if (IS_ERR(pa_pmu->base)) {
                dev_err(&pdev->dev, "ioremap failed for pa_pmu resource.\n");
@@ -314,6 +330,32 @@ static const struct attribute_group hisi_pa_pmu_v2_events_group = {
        .attrs = hisi_pa_pmu_v2_events_attr,
 };
 
+static struct attribute *hisi_pa_pmu_v3_events_attr[] = {
+       HISI_PMU_EVENT_ATTR(tx_req,     0x0),
+       HISI_PMU_EVENT_ATTR(tx_dat,     0x1),
+       HISI_PMU_EVENT_ATTR(tx_snp,     0x2),
+       HISI_PMU_EVENT_ATTR(rx_req,     0x7),
+       HISI_PMU_EVENT_ATTR(rx_dat,     0x8),
+       HISI_PMU_EVENT_ATTR(rx_snp,     0x9),
+       NULL
+};
+
+static const struct attribute_group hisi_pa_pmu_v3_events_group = {
+       .name = "events",
+       .attrs = hisi_pa_pmu_v3_events_attr,
+};
+
+static struct attribute *hisi_h60pa_pmu_events_attr[] = {
+       HISI_PMU_EVENT_ATTR(rx_flit,    0x50),
+       HISI_PMU_EVENT_ATTR(tx_flit,    0x65),
+       NULL
+};
+
+static const struct attribute_group hisi_h60pa_pmu_events_group = {
+       .name = "events",
+       .attrs = hisi_h60pa_pmu_events_attr,
+};
+
 static DEVICE_ATTR(cpumask, 0444, hisi_cpumask_sysfs_show, NULL);
 
 static struct attribute *hisi_pa_pmu_cpumask_attrs[] = {
@@ -337,6 +379,12 @@ static const struct attribute_group hisi_pa_pmu_identifier_group = {
        .attrs = hisi_pa_pmu_identifier_attrs,
 };
 
+static struct hisi_pa_pmu_int_regs hisi_pa_pmu_regs = {
+       .mask_offset = PA_INT_MASK,
+       .clear_offset = PA_INT_CLEAR,
+       .status_offset = PA_INT_STATUS,
+};
+
 static const struct attribute_group *hisi_pa_pmu_v2_attr_groups[] = {
        &hisi_pa_pmu_v2_format_group,
        &hisi_pa_pmu_v2_events_group,
@@ -345,6 +393,46 @@ static const struct attribute_group *hisi_pa_pmu_v2_attr_groups[] = {
        NULL
 };
 
+static const struct hisi_pmu_dev_info hisi_h32pa_v2 = {
+       .name = "pa",
+       .attr_groups = hisi_pa_pmu_v2_attr_groups,
+       .private = &hisi_pa_pmu_regs,
+};
+
+static const struct attribute_group *hisi_pa_pmu_v3_attr_groups[] = {
+       &hisi_pa_pmu_v2_format_group,
+       &hisi_pa_pmu_v3_events_group,
+       &hisi_pa_pmu_cpumask_attr_group,
+       &hisi_pa_pmu_identifier_group,
+       NULL
+};
+
+static const struct hisi_pmu_dev_info hisi_h32pa_v3 = {
+       .name = "pa",
+       .attr_groups = hisi_pa_pmu_v3_attr_groups,
+       .private = &hisi_pa_pmu_regs,
+};
+
+static struct hisi_pa_pmu_int_regs hisi_h60pa_pmu_regs = {
+       .mask_offset = H60PA_INT_MASK,
+       .clear_offset = H60PA_INT_STATUS, /* Clear on write */
+       .status_offset = H60PA_INT_STATUS,
+};
+
+static const struct attribute_group *hisi_h60pa_pmu_attr_groups[] = {
+       &hisi_pa_pmu_v2_format_group,
+       &hisi_h60pa_pmu_events_group,
+       &hisi_pa_pmu_cpumask_attr_group,
+       &hisi_pa_pmu_identifier_group,
+       NULL
+};
+
+static const struct hisi_pmu_dev_info hisi_h60pa = {
+       .name = "h60pa",
+       .attr_groups = hisi_h60pa_pmu_attr_groups,
+       .private = &hisi_h60pa_pmu_regs,
+};
+
 static const struct hisi_uncore_ops hisi_uncore_pa_ops = {
        .write_evtype           = hisi_pa_pmu_write_evtype,
        .get_event_idx          = hisi_uncore_pmu_get_event_idx,
@@ -375,7 +463,7 @@ static int hisi_pa_pmu_dev_probe(struct platform_device *pdev,
        if (ret)
                return ret;
 
-       pa_pmu->pmu_events.attr_groups = hisi_pa_pmu_v2_attr_groups;
+       pa_pmu->pmu_events.attr_groups = pa_pmu->dev_info->attr_groups;
        pa_pmu->num_counters = PA_NR_COUNTERS;
        pa_pmu->ops = &hisi_uncore_pa_ops;
        pa_pmu->check_event = 0xB0;
@@ -400,8 +488,9 @@ static int hisi_pa_pmu_probe(struct platform_device *pdev)
        if (ret)
                return ret;
 
-       name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "hisi_sicl%u_pa%u",
-                             pa_pmu->sicl_id, pa_pmu->index_id);
+       name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "hisi_sicl%d_%s%u",
+                             pa_pmu->sicl_id, pa_pmu->dev_info->name,
+                             pa_pmu->index_id);
        if (!name)
                return -ENOMEM;
 
@@ -435,6 +524,14 @@ static int hisi_pa_pmu_remove(struct platform_device *pdev)
        return 0;
 }
 
+static const struct acpi_device_id hisi_pa_pmu_acpi_match[] = {
+       { "HISI0273", (kernel_ulong_t)&hisi_h32pa_v2 },
+       { "HISI0275", (kernel_ulong_t)&hisi_h32pa_v3 },
+       { "HISI0274", (kernel_ulong_t)&hisi_h60pa },
+       {}
+};
+MODULE_DEVICE_TABLE(acpi, hisi_pa_pmu_acpi_match);
+
 static struct platform_driver hisi_pa_pmu_driver = {
        .driver = {
                .name = "hisi_pa_pmu",
index 2823f38..0403145 100644 (file)
@@ -20,7 +20,6 @@
 
 #include "hisi_uncore_pmu.h"
 
-#define HISI_GET_EVENTID(ev) (ev->hw.config_base & 0xff)
 #define HISI_MAX_PERIOD(nr) (GENMASK_ULL((nr) - 1, 0))
 
 /*
@@ -226,6 +225,9 @@ int hisi_uncore_pmu_event_init(struct perf_event *event)
        hwc->idx                = -1;
        hwc->config_base        = event->attr.config;
 
+       if (hisi_pmu->ops->check_filter && hisi_pmu->ops->check_filter(event))
+               return -EINVAL;
+
        /* Enforce to use the same CPU for all events in this PMU */
        event->cpu = hisi_pmu->on_cpu;
 
index 07890a8..92402aa 100644 (file)
                return FIELD_GET(GENMASK_ULL(hi, lo), event->attr.config);  \
        }
 
+#define HISI_GET_EVENTID(ev) (ev->hw.config_base & 0xff)
+
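+/*
+ * Each 32-bit event-select register packs four 8-bit event codes;
+ * HISI_PMU_EVTYPE_SHIFT() locates the field of counter "idx" within
+ * its register.
+ */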
+#define HISI_PMU_EVTYPE_BITS           8
+#define HISI_PMU_EVTYPE_SHIFT(idx)     ((idx) % 4 * HISI_PMU_EVTYPE_BITS)
+
 struct hisi_pmu;
 
 struct hisi_uncore_ops {
+       int (*check_filter)(struct perf_event *event);
        void (*write_evtype)(struct hisi_pmu *, int, u32);
        int (*get_event_idx)(struct perf_event *);
        u64 (*read_counter)(struct hisi_pmu *, struct hw_perf_event *);
@@ -62,6 +68,13 @@ struct hisi_uncore_ops {
        void (*disable_filter)(struct perf_event *event);
 };
 
+/* Describes the HISI PMU chip features information */
+struct hisi_pmu_dev_info {
+       const char *name;
+       const struct attribute_group **attr_groups;
+       void *private;
+};
+
 struct hisi_pmu_hwevents {
        struct perf_event *hw_events[HISI_MAX_COUNTERS];
        DECLARE_BITMAP(used_mask, HISI_MAX_COUNTERS);
@@ -72,6 +85,7 @@ struct hisi_pmu_hwevents {
 struct hisi_pmu {
        struct pmu pmu;
        const struct hisi_uncore_ops *ops;
+       const struct hisi_pmu_dev_info *dev_info;
        struct hisi_pmu_hwevents pmu_events;
        /* associated_cpus: All CPUs associated with the PMU */
        cpumask_t associated_cpus;
diff --git a/drivers/perf/hisilicon/hisi_uncore_uc_pmu.c b/drivers/perf/hisilicon/hisi_uncore_uc_pmu.c
new file mode 100644 (file)
index 0000000..63da05e
--- /dev/null
@@ -0,0 +1,578 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * HiSilicon SoC UC (unified cache) uncore Hardware event counters support
+ *
+ * Copyright (C) 2023 HiSilicon Limited
+ *
+ * This code is based on the uncore PMUs like hisi_uncore_l3c_pmu.
+ */
+#include <linux/cpuhotplug.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/list.h>
+#include <linux/mod_devicetable.h>
+#include <linux/property.h>
+
+#include "hisi_uncore_pmu.h"
+
+/* Dynamic CPU hotplug state used by UC PMU */
+static enum cpuhp_state hisi_uc_pmu_online;
+
+/* UC register definition */
+#define HISI_UC_INT_MASK_REG           0x0800
+#define HISI_UC_INT_STS_REG            0x0808
+#define HISI_UC_INT_CLEAR_REG          0x080c
+#define HISI_UC_TRACETAG_CTRL_REG      0x1b2c
+#define HISI_UC_TRACETAG_REQ_MSK       GENMASK(9, 7)
+#define HISI_UC_TRACETAG_MARK_EN       BIT(0)
+#define HISI_UC_TRACETAG_REQ_EN                (HISI_UC_TRACETAG_MARK_EN | BIT(2))
+#define HISI_UC_TRACETAG_SRCID_EN      BIT(3)
+#define HISI_UC_SRCID_CTRL_REG         0x1b40
+#define HISI_UC_SRCID_MSK              GENMASK(14, 1)
+#define HISI_UC_EVENT_CTRL_REG         0x1c00
+#define HISI_UC_EVENT_TRACETAG_EN      BIT(29)
+#define HISI_UC_EVENT_URING_MSK                GENMASK(28, 27)
+#define HISI_UC_EVENT_GLB_EN           BIT(26)
+#define HISI_UC_VERSION_REG            0x1cf0
+#define HISI_UC_EVTYPE_REGn(n)         (0x1d00 + (n) * 4)
+#define HISI_UC_EVTYPE_MASK            GENMASK(7, 0)
+#define HISI_UC_CNTR_REGn(n)           (0x1e00 + (n) * 8)
+
+#define HISI_UC_NR_COUNTERS            0x8
+#define HISI_UC_V2_NR_EVENTS           0xFF
+#define HISI_UC_CNTR_REG_BITS          64
+
+#define HISI_UC_RD_REQ_TRACETAG                0x4
+#define HISI_UC_URING_EVENT_MIN                0x47
+#define HISI_UC_URING_EVENT_MAX                0x59
+
+HISI_PMU_EVENT_ATTR_EXTRACTOR(rd_req_en, config1, 0, 0);
+HISI_PMU_EVENT_ATTR_EXTRACTOR(uring_channel, config1, 5, 4);
+HISI_PMU_EVENT_ATTR_EXTRACTOR(srcid, config1, 19, 6);
+HISI_PMU_EVENT_ATTR_EXTRACTOR(srcid_en, config1, 20, 20);
+
+static int hisi_uc_pmu_check_filter(struct perf_event *event)
+{
+       struct hisi_pmu *uc_pmu = to_hisi_pmu(event->pmu);
+
+       if (hisi_get_srcid_en(event) && !hisi_get_rd_req_en(event)) {
+               dev_err(uc_pmu->dev,
+                       "srcid_en depends on rd_req_en being enabled!\n");
+               return -EINVAL;
+       }
+
+       if (!hisi_get_uring_channel(event))
+               return 0;
+
+       if ((HISI_GET_EVENTID(event) < HISI_UC_URING_EVENT_MIN) ||
+           (HISI_GET_EVENTID(event) > HISI_UC_URING_EVENT_MAX))
+               dev_warn(uc_pmu->dev,
+                        "Only events: [%#x ~ %#x] support channel filtering!",
+                        HISI_UC_URING_EVENT_MIN, HISI_UC_URING_EVENT_MAX);
+
+       return 0;
+}
+
+static void hisi_uc_pmu_config_req_tracetag(struct perf_event *event)
+{
+       struct hisi_pmu *uc_pmu = to_hisi_pmu(event->pmu);
+       u32 val;
+
+       if (!hisi_get_rd_req_en(event))
+               return;
+
+       val = readl(uc_pmu->base + HISI_UC_TRACETAG_CTRL_REG);
+
+       /* The request-type has been configured */
+       if (FIELD_GET(HISI_UC_TRACETAG_REQ_MSK, val) == HISI_UC_RD_REQ_TRACETAG)
+               return;
+
+       /* Set request-type for tracetag, only read request is supported! */
+       val &= ~HISI_UC_TRACETAG_REQ_MSK;
+       val |= FIELD_PREP(HISI_UC_TRACETAG_REQ_MSK, HISI_UC_RD_REQ_TRACETAG);
+       val |= HISI_UC_TRACETAG_REQ_EN;
+       writel(val, uc_pmu->base + HISI_UC_TRACETAG_CTRL_REG);
+}
+
+static void hisi_uc_pmu_clear_req_tracetag(struct perf_event *event)
+{
+       struct hisi_pmu *uc_pmu = to_hisi_pmu(event->pmu);
+       u32 val;
+
+       if (!hisi_get_rd_req_en(event))
+               return;
+
+       val = readl(uc_pmu->base + HISI_UC_TRACETAG_CTRL_REG);
+
+       /* Do nothing, the request-type tracetag has been cleaned up */
+       if (FIELD_GET(HISI_UC_TRACETAG_REQ_MSK, val) == 0)
+               return;
+
+       /* Clear request-type */
+       val &= ~HISI_UC_TRACETAG_REQ_MSK;
+       val &= ~HISI_UC_TRACETAG_REQ_EN;
+       writel(val, uc_pmu->base + HISI_UC_TRACETAG_CTRL_REG);
+}
+
+static void hisi_uc_pmu_config_srcid_tracetag(struct perf_event *event)
+{
+       struct hisi_pmu *uc_pmu = to_hisi_pmu(event->pmu);
+       u32 val;
+
+       if (!hisi_get_srcid_en(event))
+               return;
+
+       val = readl(uc_pmu->base + HISI_UC_TRACETAG_CTRL_REG);
+
+       /* Do nothing, the source id has been configured */
+       if (FIELD_GET(HISI_UC_TRACETAG_SRCID_EN, val))
+               return;
+
+       /* Enable source id tracetag */
+       val |= HISI_UC_TRACETAG_SRCID_EN;
+       writel(val, uc_pmu->base + HISI_UC_TRACETAG_CTRL_REG);
+
+       val = readl(uc_pmu->base + HISI_UC_SRCID_CTRL_REG);
+       val &= ~HISI_UC_SRCID_MSK;
+       val |= FIELD_PREP(HISI_UC_SRCID_MSK, hisi_get_srcid(event));
+       writel(val, uc_pmu->base + HISI_UC_SRCID_CTRL_REG);
+
+       /* Depend on request-type tracetag enabled */
+       hisi_uc_pmu_config_req_tracetag(event);
+}
+
+static void hisi_uc_pmu_clear_srcid_tracetag(struct perf_event *event)
+{
+       struct hisi_pmu *uc_pmu = to_hisi_pmu(event->pmu);
+       u32 val;
+
+       if (!hisi_get_srcid_en(event))
+               return;
+
+       val = readl(uc_pmu->base + HISI_UC_TRACETAG_CTRL_REG);
+
+       /* Do nothing, the source id has been cleaned up */
+       if (FIELD_GET(HISI_UC_TRACETAG_SRCID_EN, val) == 0)
+               return;
+
+       hisi_uc_pmu_clear_req_tracetag(event);
+
+       /* Disable source id tracetag */
+       val &= ~HISI_UC_TRACETAG_SRCID_EN;
+       writel(val, uc_pmu->base + HISI_UC_TRACETAG_CTRL_REG);
+
+       val = readl(uc_pmu->base + HISI_UC_SRCID_CTRL_REG);
+       val &= ~HISI_UC_SRCID_MSK;
+       writel(val, uc_pmu->base + HISI_UC_SRCID_CTRL_REG);
+}
+
+static void hisi_uc_pmu_config_uring_channel(struct perf_event *event)
+{
+       struct hisi_pmu *uc_pmu = to_hisi_pmu(event->pmu);
+       u32 uring_channel = hisi_get_uring_channel(event);
+       u32 val;
+
+       /* Do nothing if it is not set, or is explicitly set to zero (the default) */
+       if (uring_channel == 0)
+               return;
+
+       val = readl(uc_pmu->base + HISI_UC_EVENT_CTRL_REG);
+
+       /* Do nothing, the uring_channel has been configured */
+       if (uring_channel == FIELD_GET(HISI_UC_EVENT_URING_MSK, val))
+               return;
+
+       val &= ~HISI_UC_EVENT_URING_MSK;
+       val |= FIELD_PREP(HISI_UC_EVENT_URING_MSK, uring_channel);
+       writel(val, uc_pmu->base + HISI_UC_EVENT_CTRL_REG);
+}
+
+static void hisi_uc_pmu_clear_uring_channel(struct perf_event *event)
+{
+       struct hisi_pmu *uc_pmu = to_hisi_pmu(event->pmu);
+       u32 val;
+
+       /* Do nothing if it is not set, or is explicitly set to zero (the default) */
+       if (hisi_get_uring_channel(event) == 0)
+               return;
+
+       val = readl(uc_pmu->base + HISI_UC_EVENT_CTRL_REG);
+
+       /* Do nothing, the uring_channel has been cleaned up */
+       if (FIELD_GET(HISI_UC_EVENT_URING_MSK, val) == 0)
+               return;
+
+       val &= ~HISI_UC_EVENT_URING_MSK;
+       writel(val, uc_pmu->base + HISI_UC_EVENT_CTRL_REG);
+}
+
+static void hisi_uc_pmu_enable_filter(struct perf_event *event)
+{
+       if (event->attr.config1 == 0)
+               return;
+
+       hisi_uc_pmu_config_uring_channel(event);
+       hisi_uc_pmu_config_req_tracetag(event);
+       hisi_uc_pmu_config_srcid_tracetag(event);
+}
+
+static void hisi_uc_pmu_disable_filter(struct perf_event *event)
+{
+       if (event->attr.config1 == 0)
+               return;
+
+       hisi_uc_pmu_clear_srcid_tracetag(event);
+       hisi_uc_pmu_clear_req_tracetag(event);
+       hisi_uc_pmu_clear_uring_channel(event);
+}
+
+static void hisi_uc_pmu_write_evtype(struct hisi_pmu *uc_pmu, int idx, u32 type)
+{
+       u32 val;
+
+       /*
+        * Select the appropriate event select register.
+        * There are 2 32-bit event select registers for the
+        * 8 hardware counters, each event code is 8-bit wide.
+        */
+       val = readl(uc_pmu->base + HISI_UC_EVTYPE_REGn(idx / 4));
+       val &= ~(HISI_UC_EVTYPE_MASK << HISI_PMU_EVTYPE_SHIFT(idx));
+       val |= (type << HISI_PMU_EVTYPE_SHIFT(idx));
+       writel(val, uc_pmu->base + HISI_UC_EVTYPE_REGn(idx / 4));
+}
+
+static void hisi_uc_pmu_start_counters(struct hisi_pmu *uc_pmu)
+{
+       u32 val;
+
+       val = readl(uc_pmu->base + HISI_UC_EVENT_CTRL_REG);
+       val |= HISI_UC_EVENT_GLB_EN;
+       writel(val, uc_pmu->base + HISI_UC_EVENT_CTRL_REG);
+}
+
+static void hisi_uc_pmu_stop_counters(struct hisi_pmu *uc_pmu)
+{
+       u32 val;
+
+       val = readl(uc_pmu->base + HISI_UC_EVENT_CTRL_REG);
+       val &= ~HISI_UC_EVENT_GLB_EN;
+       writel(val, uc_pmu->base + HISI_UC_EVENT_CTRL_REG);
+}
+
+static void hisi_uc_pmu_enable_counter(struct hisi_pmu *uc_pmu,
+                                       struct hw_perf_event *hwc)
+{
+       u32 val;
+
+       /* Enable counter index */
+       val = readl(uc_pmu->base + HISI_UC_EVENT_CTRL_REG);
+       val |= (1 << hwc->idx);
+       writel(val, uc_pmu->base + HISI_UC_EVENT_CTRL_REG);
+}
+
+static void hisi_uc_pmu_disable_counter(struct hisi_pmu *uc_pmu,
+                                       struct hw_perf_event *hwc)
+{
+       u32 val;
+
+       /* Clear counter index */
+       val = readl(uc_pmu->base + HISI_UC_EVENT_CTRL_REG);
+       val &= ~(1 << hwc->idx);
+       writel(val, uc_pmu->base + HISI_UC_EVENT_CTRL_REG);
+}
+
+static u64 hisi_uc_pmu_read_counter(struct hisi_pmu *uc_pmu,
+                                   struct hw_perf_event *hwc)
+{
+       return readq(uc_pmu->base + HISI_UC_CNTR_REGn(hwc->idx));
+}
+
+static void hisi_uc_pmu_write_counter(struct hisi_pmu *uc_pmu,
+                                     struct hw_perf_event *hwc, u64 val)
+{
+       writeq(val, uc_pmu->base + HISI_UC_CNTR_REGn(hwc->idx));
+}
+
+static void hisi_uc_pmu_enable_counter_int(struct hisi_pmu *uc_pmu,
+                                          struct hw_perf_event *hwc)
+{
+       u32 val;
+
+       val = readl(uc_pmu->base + HISI_UC_INT_MASK_REG);
+       val &= ~(1 << hwc->idx);
+       writel(val, uc_pmu->base + HISI_UC_INT_MASK_REG);
+}
+
+static void hisi_uc_pmu_disable_counter_int(struct hisi_pmu *uc_pmu,
+                                           struct hw_perf_event *hwc)
+{
+       u32 val;
+
+       val = readl(uc_pmu->base + HISI_UC_INT_MASK_REG);
+       val |= (1 << hwc->idx);
+       writel(val, uc_pmu->base + HISI_UC_INT_MASK_REG);
+}
+
+static u32 hisi_uc_pmu_get_int_status(struct hisi_pmu *uc_pmu)
+{
+       return readl(uc_pmu->base + HISI_UC_INT_STS_REG);
+}
+
+static void hisi_uc_pmu_clear_int_status(struct hisi_pmu *uc_pmu, int idx)
+{
+       writel(1 << idx, uc_pmu->base + HISI_UC_INT_CLEAR_REG);
+}
+
+static int hisi_uc_pmu_init_data(struct platform_device *pdev,
+                                struct hisi_pmu *uc_pmu)
+{
+       /*
+        * Use SCCL (Super CPU Cluster) ID and CCL (CPU Cluster) ID to
+        * identify the topology information of UC PMU devices in the chip.
+        * Each SCCL contains several CCLs, and each CCL has four UC PMUs.
+        */
+       if (device_property_read_u32(&pdev->dev, "hisilicon,scl-id",
+                                    &uc_pmu->sccl_id)) {
+               dev_err(&pdev->dev, "Can not read uc sccl-id!\n");
+               return -EINVAL;
+       }
+
+       if (device_property_read_u32(&pdev->dev, "hisilicon,ccl-id",
+                                    &uc_pmu->ccl_id)) {
+               dev_err(&pdev->dev, "Can not read uc ccl-id!\n");
+               return -EINVAL;
+       }
+
+       if (device_property_read_u32(&pdev->dev, "hisilicon,sub-id",
+                                    &uc_pmu->sub_id)) {
+               dev_err(&pdev->dev, "Can not read sub-id!\n");
+               return -EINVAL;
+       }
+
+       uc_pmu->base = devm_platform_ioremap_resource(pdev, 0);
+       if (IS_ERR(uc_pmu->base)) {
+               dev_err(&pdev->dev, "ioremap failed for uc_pmu resource\n");
+               return PTR_ERR(uc_pmu->base);
+       }
+
+       uc_pmu->identifier = readl(uc_pmu->base + HISI_UC_VERSION_REG);
+
+       return 0;
+}
+
+static struct attribute *hisi_uc_pmu_format_attr[] = {
+       HISI_PMU_FORMAT_ATTR(event, "config:0-7"),
+       HISI_PMU_FORMAT_ATTR(rd_req_en, "config1:0-0"),
+       HISI_PMU_FORMAT_ATTR(uring_channel, "config1:4-5"),
+       HISI_PMU_FORMAT_ATTR(srcid, "config1:6-19"),
+       HISI_PMU_FORMAT_ATTR(srcid_en, "config1:20-20"),
+       NULL
+};
+
+static const struct attribute_group hisi_uc_pmu_format_group = {
+       .name = "format",
+       .attrs = hisi_uc_pmu_format_attr,
+};
+
+static struct attribute *hisi_uc_pmu_events_attr[] = {
+       HISI_PMU_EVENT_ATTR(sq_time,            0x00),
+       HISI_PMU_EVENT_ATTR(pq_time,            0x01),
+       HISI_PMU_EVENT_ATTR(hbm_time,           0x02),
+       HISI_PMU_EVENT_ATTR(iq_comp_time_cring, 0x03),
+       HISI_PMU_EVENT_ATTR(iq_comp_time_uring, 0x05),
+       HISI_PMU_EVENT_ATTR(cpu_rd,             0x10),
+       HISI_PMU_EVENT_ATTR(cpu_rd64,           0x17),
+       HISI_PMU_EVENT_ATTR(cpu_rs64,           0x19),
+       HISI_PMU_EVENT_ATTR(cpu_mru,            0x1a),
+       HISI_PMU_EVENT_ATTR(cycles,             0x9c),
+       HISI_PMU_EVENT_ATTR(spipe_hit,          0xb3),
+       HISI_PMU_EVENT_ATTR(hpipe_hit,          0xdb),
+       HISI_PMU_EVENT_ATTR(cring_rxdat_cnt,    0xfa),
+       HISI_PMU_EVENT_ATTR(cring_txdat_cnt,    0xfb),
+       HISI_PMU_EVENT_ATTR(uring_rxdat_cnt,    0xfc),
+       HISI_PMU_EVENT_ATTR(uring_txdat_cnt,    0xfd),
+       NULL
+};
+
+static const struct attribute_group hisi_uc_pmu_events_group = {
+       .name = "events",
+       .attrs = hisi_uc_pmu_events_attr,
+};
+
+static DEVICE_ATTR(cpumask, 0444, hisi_cpumask_sysfs_show, NULL);
+
+static struct attribute *hisi_uc_pmu_cpumask_attrs[] = {
+       &dev_attr_cpumask.attr,
+       NULL,
+};
+
+static const struct attribute_group hisi_uc_pmu_cpumask_attr_group = {
+       .attrs = hisi_uc_pmu_cpumask_attrs,
+};
+
+static struct device_attribute hisi_uc_pmu_identifier_attr =
+       __ATTR(identifier, 0444, hisi_uncore_pmu_identifier_attr_show, NULL);
+
+static struct attribute *hisi_uc_pmu_identifier_attrs[] = {
+       &hisi_uc_pmu_identifier_attr.attr,
+       NULL
+};
+
+static const struct attribute_group hisi_uc_pmu_identifier_group = {
+       .attrs = hisi_uc_pmu_identifier_attrs,
+};
+
+static const struct attribute_group *hisi_uc_pmu_attr_groups[] = {
+       &hisi_uc_pmu_format_group,
+       &hisi_uc_pmu_events_group,
+       &hisi_uc_pmu_cpumask_attr_group,
+       &hisi_uc_pmu_identifier_group,
+       NULL
+};
+
+static const struct hisi_uncore_ops hisi_uncore_uc_pmu_ops = {
+       .check_filter           = hisi_uc_pmu_check_filter,
+       .write_evtype           = hisi_uc_pmu_write_evtype,
+       .get_event_idx          = hisi_uncore_pmu_get_event_idx,
+       .start_counters         = hisi_uc_pmu_start_counters,
+       .stop_counters          = hisi_uc_pmu_stop_counters,
+       .enable_counter         = hisi_uc_pmu_enable_counter,
+       .disable_counter        = hisi_uc_pmu_disable_counter,
+       .enable_counter_int     = hisi_uc_pmu_enable_counter_int,
+       .disable_counter_int    = hisi_uc_pmu_disable_counter_int,
+       .write_counter          = hisi_uc_pmu_write_counter,
+       .read_counter           = hisi_uc_pmu_read_counter,
+       .get_int_status         = hisi_uc_pmu_get_int_status,
+       .clear_int_status       = hisi_uc_pmu_clear_int_status,
+       .enable_filter          = hisi_uc_pmu_enable_filter,
+       .disable_filter         = hisi_uc_pmu_disable_filter,
+};
+
+static int hisi_uc_pmu_dev_probe(struct platform_device *pdev,
+                                struct hisi_pmu *uc_pmu)
+{
+       int ret;
+
+       ret = hisi_uc_pmu_init_data(pdev, uc_pmu);
+       if (ret)
+               return ret;
+
+       ret = hisi_uncore_pmu_init_irq(uc_pmu, pdev);
+       if (ret)
+               return ret;
+
+       uc_pmu->pmu_events.attr_groups = hisi_uc_pmu_attr_groups;
+       uc_pmu->check_event = HISI_UC_EVTYPE_MASK;
+       uc_pmu->ops = &hisi_uncore_uc_pmu_ops;
+       uc_pmu->counter_bits = HISI_UC_CNTR_REG_BITS;
+       uc_pmu->num_counters = HISI_UC_NR_COUNTERS;
+       uc_pmu->dev = &pdev->dev;
+       uc_pmu->on_cpu = -1;
+
+       return 0;
+}
+
+static void hisi_uc_pmu_remove_cpuhp_instance(void *hotplug_node)
+{
+       cpuhp_state_remove_instance_nocalls(hisi_uc_pmu_online, hotplug_node);
+}
+
+static void hisi_uc_pmu_unregister_pmu(void *pmu)
+{
+       perf_pmu_unregister(pmu);
+}
+
+static int hisi_uc_pmu_probe(struct platform_device *pdev)
+{
+       struct hisi_pmu *uc_pmu;
+       char *name;
+       int ret;
+
+       uc_pmu = devm_kzalloc(&pdev->dev, sizeof(*uc_pmu), GFP_KERNEL);
+       if (!uc_pmu)
+               return -ENOMEM;
+
+       platform_set_drvdata(pdev, uc_pmu);
+
+       ret = hisi_uc_pmu_dev_probe(pdev, uc_pmu);
+       if (ret)
+               return ret;
+
+       name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "hisi_sccl%d_uc%d_%u",
+                             uc_pmu->sccl_id, uc_pmu->ccl_id, uc_pmu->sub_id);
+       if (!name)
+               return -ENOMEM;
+
+       ret = cpuhp_state_add_instance(hisi_uc_pmu_online, &uc_pmu->node);
+       if (ret)
+               return dev_err_probe(&pdev->dev, ret, "Error registering hotplug\n");
+
+       ret = devm_add_action_or_reset(&pdev->dev,
+                                      hisi_uc_pmu_remove_cpuhp_instance,
+                                      &uc_pmu->node);
+       if (ret)
+               return ret;
+
+       hisi_pmu_init(uc_pmu, THIS_MODULE);
+
+       ret = perf_pmu_register(&uc_pmu->pmu, name, -1);
+       if (ret)
+               return ret;
+
+       return devm_add_action_or_reset(&pdev->dev,
+                                       hisi_uc_pmu_unregister_pmu,
+                                       &uc_pmu->pmu);
+}
+
+static const struct acpi_device_id hisi_uc_pmu_acpi_match[] = {
+       { "HISI0291", },
+       {}
+};
+MODULE_DEVICE_TABLE(acpi, hisi_uc_pmu_acpi_match);
+
+static struct platform_driver hisi_uc_pmu_driver = {
+       .driver = {
+               .name = "hisi_uc_pmu",
+               .acpi_match_table = hisi_uc_pmu_acpi_match,
+               /*
+                * We have not worked out a safe bind/unbind process;
+                * forcefully unbinding during sampling will lead to a
+                * kernel panic, so this is not supported yet.
+                */
+               .suppress_bind_attrs = true,
+       },
+       .probe = hisi_uc_pmu_probe,
+};
+
+static int __init hisi_uc_pmu_module_init(void)
+{
+       int ret;
+
+       ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
+                                     "perf/hisi/uc:online",
+                                     hisi_uncore_pmu_online_cpu,
+                                     hisi_uncore_pmu_offline_cpu);
+       if (ret < 0) {
+               pr_err("UC PMU: Error setting up hotplug, ret = %d\n", ret);
+               return ret;
+       }
+       hisi_uc_pmu_online = ret;
+
+       ret = platform_driver_register(&hisi_uc_pmu_driver);
+       if (ret)
+               cpuhp_remove_multi_state(hisi_uc_pmu_online);
+
+       return ret;
+}
+module_init(hisi_uc_pmu_module_init);
+
+static void __exit hisi_uc_pmu_module_exit(void)
+{
+       platform_driver_unregister(&hisi_uc_pmu_driver);
+       cpuhp_remove_multi_state(hisi_uc_pmu_online);
+}
+module_exit(hisi_uc_pmu_module_exit);
+
+MODULE_DESCRIPTION("HiSilicon SoC UC uncore PMU driver");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Junhao He <hejunhao3@huawei.com>");
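
The probe path above leans on devm_add_action_or_reset() so that each teardown step is queued right after the matching setup step and runs in reverse order on a later probe failure or unbind. A minimal sketch of that pattern, assuming only the devm API used above (example_setup()/example_teardown() are illustrative helpers, not part of the driver):

    static void example_teardown(void *data)
    {
            /* undo whatever example_setup() did */
    }

    static int example_probe(struct platform_device *pdev)
    {
            int ret = example_setup(&pdev->dev);    /* hypothetical setup step */

            if (ret)
                    return ret;

            /* queued teardown runs automatically on later failure or unbind */
            return devm_add_action_or_reset(&pdev->dev, example_teardown,
                                            &pdev->dev);
    }
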
index aaca6db..3f9a98c 100644 (file)
@@ -857,7 +857,6 @@ static int l2_cache_pmu_probe_cluster(struct device *dev, void *data)
                return -ENOMEM;
 
        INIT_LIST_HEAD(&cluster->next);
-       list_add(&cluster->next, &l2cache_pmu->clusters);
        cluster->cluster_id = fw_cluster_id;
 
        irq = platform_get_irq(sdev, 0);
@@ -883,6 +882,7 @@ static int l2_cache_pmu_probe_cluster(struct device *dev, void *data)
 
        spin_lock_init(&cluster->pmu_lock);
 
+       list_add(&cluster->next, &l2cache_pmu->clusters);
        l2cache_pmu->num_pmus++;
 
        return 0;
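
The hunk above defers list_add() until the cluster is fully initialised, so the shared clusters list never exposes a half-constructed entry to code walking it concurrently. A hedged sketch of that "publish last" ordering (init_cluster() is an illustrative helper, not from the driver):

    struct cluster *c = kzalloc(sizeof(*c), GFP_KERNEL);

    if (!c)
            return -ENOMEM;

    init_cluster(c);                        /* fill in every field first */
    list_add(&c->next, &pmu->clusters);     /* publish only when complete */
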
index 5787c57..77ff9a6 100644 (file)
@@ -407,7 +407,7 @@ config PINCTRL_PISTACHIO
 
 config PINCTRL_RK805
        tristate "Pinctrl and GPIO driver for RK805 PMIC"
-       depends on MFD_RK808
+       depends on MFD_RK8XX
        select GPIOLIB
        select PINMUX
        select GENERIC_PINCONF
index f279b36..43d3530 100644 (file)
@@ -30,6 +30,7 @@
 #include <linux/pinctrl/pinconf.h>
 #include <linux/pinctrl/pinconf-generic.h>
 #include <linux/pinctrl/pinmux.h>
+#include <linux/suspend.h>
 
 #include "core.h"
 #include "pinctrl-utils.h"
@@ -636,9 +637,8 @@ static bool do_amd_gpio_irq_handler(int irq, void *dev_id)
                        regval = readl(regs + i);
 
                        if (regval & PIN_IRQ_PENDING)
-                               dev_dbg(&gpio_dev->pdev->dev,
-                                       "GPIO %d is active: 0x%x",
-                                       irqnr + i, regval);
+                               pm_pr_dbg("GPIO %d is active: 0x%x",
+                                         irqnr + i, regval);
 
                        /* caused wake on resume context for shared IRQ */
                        if (irq < 0 && (regval & BIT(WAKE_STS_OFF)))
index 7c1f740..2639a9e 100644 (file)
@@ -1,10 +1,12 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
 /*
- * Pinctrl driver for Rockchip RK805 PMIC
+ * Pinctrl driver for Rockchip RK805/RK806 PMIC
  *
  * Copyright (c) 2017, Fuzhou Rockchip Electronics Co., Ltd
+ * Copyright (c) 2021 Rockchip Electronics Co., Ltd.
  *
  * Author: Joseph Chen <chenjh@rock-chips.com>
+ * Author: Xu Shengfei <xsf@rock-chips.com>
  *
  * Based on the pinctrl-as3722 driver
  */
@@ -44,6 +46,7 @@ struct rk805_pin_group {
 
 /*
  * @reg: gpio setting register;
+ * @fun_reg: functions select register;
  * @fun_mask: functions select mask value, when set is gpio;
  * @dir_mask: input or output mask value, when set is output, otherwise input;
  * @val_mask: gpio set value, when set is level high, otherwise low;
@@ -56,6 +59,7 @@ struct rk805_pin_group {
  */
 struct rk805_pin_config {
        u8 reg;
+       u8 fun_reg;
        u8 fun_msk;
        u8 dir_msk;
        u8 val_msk;
@@ -80,22 +84,50 @@ enum rk805_pinmux_option {
        RK805_PINMUX_GPIO,
 };
 
+enum rk806_pinmux_option {
+       RK806_PINMUX_FUN0 = 0,
+       RK806_PINMUX_FUN1,
+       RK806_PINMUX_FUN2,
+       RK806_PINMUX_FUN3,
+       RK806_PINMUX_FUN4,
+       RK806_PINMUX_FUN5,
+};
+
 enum {
        RK805_GPIO0,
        RK805_GPIO1,
 };
 
+enum {
+       RK806_GPIO_DVS1,
+       RK806_GPIO_DVS2,
+       RK806_GPIO_DVS3
+};
+
 static const char *const rk805_gpio_groups[] = {
        "gpio0",
        "gpio1",
 };
 
+static const char *const rk806_gpio_groups[] = {
+       "gpio_pwrctrl1",
+       "gpio_pwrctrl2",
+       "gpio_pwrctrl3",
+};
+
 /* RK805: 2 output only GPIOs */
 static const struct pinctrl_pin_desc rk805_pins_desc[] = {
        PINCTRL_PIN(RK805_GPIO0, "gpio0"),
        PINCTRL_PIN(RK805_GPIO1, "gpio1"),
 };
 
+/* RK806 */
+static const struct pinctrl_pin_desc rk806_pins_desc[] = {
+       PINCTRL_PIN(RK806_GPIO_DVS1, "gpio_pwrctrl1"),
+       PINCTRL_PIN(RK806_GPIO_DVS2, "gpio_pwrctrl2"),
+       PINCTRL_PIN(RK806_GPIO_DVS3, "gpio_pwrctrl3"),
+};
+
 static const struct rk805_pin_function rk805_pin_functions[] = {
        {
                .name = "gpio",
@@ -105,6 +137,45 @@ static const struct rk805_pin_function rk805_pin_functions[] = {
        },
 };
 
+static const struct rk805_pin_function rk806_pin_functions[] = {
+       {
+               .name = "pin_fun0",
+               .groups = rk806_gpio_groups,
+               .ngroups = ARRAY_SIZE(rk806_gpio_groups),
+               .mux_option = RK806_PINMUX_FUN0,
+       },
+       {
+               .name = "pin_fun1",
+               .groups = rk806_gpio_groups,
+               .ngroups = ARRAY_SIZE(rk806_gpio_groups),
+               .mux_option = RK806_PINMUX_FUN1,
+       },
+       {
+               .name = "pin_fun2",
+               .groups = rk806_gpio_groups,
+               .ngroups = ARRAY_SIZE(rk806_gpio_groups),
+               .mux_option = RK806_PINMUX_FUN2,
+       },
+       {
+               .name = "pin_fun3",
+               .groups = rk806_gpio_groups,
+               .ngroups = ARRAY_SIZE(rk806_gpio_groups),
+               .mux_option = RK806_PINMUX_FUN3,
+       },
+       {
+               .name = "pin_fun4",
+               .groups = rk806_gpio_groups,
+               .ngroups = ARRAY_SIZE(rk806_gpio_groups),
+               .mux_option = RK806_PINMUX_FUN4,
+       },
+       {
+               .name = "pin_fun5",
+               .groups = rk806_gpio_groups,
+               .ngroups = ARRAY_SIZE(rk806_gpio_groups),
+               .mux_option = RK806_PINMUX_FUN5,
+       },
+};
+
 static const struct rk805_pin_group rk805_pin_groups[] = {
        {
                .name = "gpio0",
@@ -118,6 +189,24 @@ static const struct rk805_pin_group rk805_pin_groups[] = {
        },
 };
 
+static const struct rk805_pin_group rk806_pin_groups[] = {
+       {
+               .name = "gpio_pwrctrl1",
+               .pins = { RK806_GPIO_DVS1 },
+               .npins = 1,
+       },
+       {
+               .name = "gpio_pwrctrl2",
+               .pins = { RK806_GPIO_DVS2 },
+               .npins = 1,
+       },
+       {
+               .name = "gpio_pwrctrl3",
+               .pins = { RK806_GPIO_DVS3 },
+               .npins = 1,
+       }
+};
+
 #define RK805_GPIO0_VAL_MSK    BIT(0)
 #define RK805_GPIO1_VAL_MSK    BIT(1)
 
@@ -132,6 +221,40 @@ static const struct rk805_pin_config rk805_gpio_cfgs[] = {
        },
 };
 
+#define RK806_PWRCTRL1_DR      BIT(0)
+#define RK806_PWRCTRL2_DR      BIT(1)
+#define RK806_PWRCTRL3_DR      BIT(2)
+#define RK806_PWRCTRL1_DATA    BIT(4)
+#define RK806_PWRCTRL2_DATA    BIT(5)
+#define RK806_PWRCTRL3_DATA    BIT(6)
+#define RK806_PWRCTRL1_FUN     GENMASK(2, 0)
+#define RK806_PWRCTRL2_FUN     GENMASK(6, 4)
+#define RK806_PWRCTRL3_FUN     GENMASK(2, 0)
+
+static struct rk805_pin_config rk806_gpio_cfgs[] = {
+       {
+               .fun_reg = RK806_SLEEP_CONFIG0,
+               .fun_msk = RK806_PWRCTRL1_FUN,
+               .reg = RK806_SLEEP_GPIO,
+               .val_msk = RK806_PWRCTRL1_DATA,
+               .dir_msk = RK806_PWRCTRL1_DR,
+       },
+       {
+               .fun_reg = RK806_SLEEP_CONFIG0,
+               .fun_msk = RK806_PWRCTRL2_FUN,
+               .reg = RK806_SLEEP_GPIO,
+               .val_msk = RK806_PWRCTRL2_DATA,
+               .dir_msk = RK806_PWRCTRL2_DR,
+       },
+       {
+               .fun_reg = RK806_SLEEP_CONFIG1,
+               .fun_msk = RK806_PWRCTRL3_FUN,
+               .reg = RK806_SLEEP_GPIO,
+               .val_msk = RK806_PWRCTRL3_DATA,
+               .dir_msk = RK806_PWRCTRL3_DR,
+       }
+};
+
 /* generic gpio chip */
 static int rk805_gpio_get(struct gpio_chip *chip, unsigned int offset)
 {
@@ -289,19 +412,13 @@ static int _rk805_pinctrl_set_mux(struct pinctrl_dev *pctldev,
        if (!pci->pin_cfg[offset].fun_msk)
                return 0;
 
-       if (mux == RK805_PINMUX_GPIO) {
-               ret = regmap_update_bits(pci->rk808->regmap,
-                                        pci->pin_cfg[offset].reg,
-                                        pci->pin_cfg[offset].fun_msk,
-                                        pci->pin_cfg[offset].fun_msk);
-               if (ret) {
-                       dev_err(pci->dev, "set gpio%d GPIO failed\n", offset);
-                       return ret;
-               }
-       } else {
-               dev_err(pci->dev, "Couldn't find function mux %d\n", mux);
-               return -EINVAL;
-       }
+       mux <<= ffs(pci->pin_cfg[offset].fun_msk) - 1;
+       ret = regmap_update_bits(pci->rk808->regmap,
+                                pci->pin_cfg[offset].fun_reg,
+                                pci->pin_cfg[offset].fun_msk, mux);
+
+       if (ret)
+               dev_err(pci->dev, "set gpio%d func%d failed\n", offset, mux);
 
        return 0;
 }
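
The rewritten mux path above shifts the requested function index into the field selected by fun_msk using ffs(). A worked instance, taking the RK806_PWRCTRL2_FUN mask added later in this patch (values here are purely illustrative):

    /*
     * fun_msk = GENMASK(6, 4) = 0x70  ->  ffs(0x70) = 5
     * mux     = RK806_PINMUX_FUN5 = 5
     * mux <<= ffs(fun_msk) - 1        ->  5 << 4 = 0x50
     *
     * regmap_update_bits() then rewrites only bits 6:4 of fun_reg.
     */
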
@@ -317,6 +434,22 @@ static int rk805_pinctrl_set_mux(struct pinctrl_dev *pctldev,
        return _rk805_pinctrl_set_mux(pctldev, offset, mux);
 }
 
+static int rk805_pinctrl_gpio_request_enable(struct pinctrl_dev *pctldev,
+                                            struct pinctrl_gpio_range *range,
+                                            unsigned int offset)
+{
+       struct rk805_pctrl_info *pci = pinctrl_dev_get_drvdata(pctldev);
+
+       switch (pci->rk808->variant) {
+       case RK805_ID:
+               return _rk805_pinctrl_set_mux(pctldev, offset, RK805_PINMUX_GPIO);
+       case RK806_ID:
+               return _rk805_pinctrl_set_mux(pctldev, offset, RK806_PINMUX_FUN5);
+       }
+
+       return -ENOTSUPP;
+}
+
 static int rk805_pmx_gpio_set_direction(struct pinctrl_dev *pctldev,
                                        struct pinctrl_gpio_range *range,
                                        unsigned int offset, bool input)
@@ -324,13 +457,6 @@ static int rk805_pmx_gpio_set_direction(struct pinctrl_dev *pctldev,
        struct rk805_pctrl_info *pci = pinctrl_dev_get_drvdata(pctldev);
        int ret;
 
-       /* switch to gpio function */
-       ret = _rk805_pinctrl_set_mux(pctldev, offset, RK805_PINMUX_GPIO);
-       if (ret) {
-               dev_err(pci->dev, "set gpio%d mux failed\n", offset);
-               return ret;
-       }
-
        /* set direction */
        if (!pci->pin_cfg[offset].dir_msk)
                return 0;
@@ -352,6 +478,7 @@ static const struct pinmux_ops rk805_pinmux_ops = {
        .get_function_name      = rk805_pinctrl_get_func_name,
        .get_function_groups    = rk805_pinctrl_get_func_groups,
        .set_mux                = rk805_pinctrl_set_mux,
+       .gpio_request_enable    = rk805_pinctrl_gpio_request_enable,
        .gpio_set_direction     = rk805_pmx_gpio_set_direction,
 };
 
@@ -364,6 +491,7 @@ static int rk805_pinconf_get(struct pinctrl_dev *pctldev,
 
        switch (param) {
        case PIN_CONFIG_OUTPUT:
+       case PIN_CONFIG_INPUT_ENABLE:
                arg = rk805_gpio_get(&pci->gpio_chip, pin);
                break;
        default:
@@ -393,6 +521,12 @@ static int rk805_pinconf_set(struct pinctrl_dev *pctldev,
                        rk805_gpio_set(&pci->gpio_chip, pin, arg);
                        rk805_pmx_gpio_set_direction(pctldev, NULL, pin, false);
                        break;
+               case PIN_CONFIG_INPUT_ENABLE:
+                       if (pci->rk808->variant != RK805_ID && arg) {
+                               rk805_pmx_gpio_set_direction(pctldev, NULL, pin, true);
+                               break;
+                       }
+                       fallthrough;
                default:
                        dev_err(pci->dev, "Properties not supported\n");
                        return -ENOTSUPP;
@@ -448,6 +582,18 @@ static int rk805_pinctrl_probe(struct platform_device *pdev)
                pci->pin_cfg = rk805_gpio_cfgs;
                pci->gpio_chip.ngpio = ARRAY_SIZE(rk805_gpio_cfgs);
                break;
+       case RK806_ID:
+               pci->pins = rk806_pins_desc;
+               pci->num_pins = ARRAY_SIZE(rk806_pins_desc);
+               pci->functions = rk806_pin_functions;
+               pci->num_functions = ARRAY_SIZE(rk806_pin_functions);
+               pci->groups = rk806_pin_groups;
+               pci->num_pin_groups = ARRAY_SIZE(rk806_pin_groups);
+               pci->pinctrl_desc.pins = rk806_pins_desc;
+               pci->pinctrl_desc.npins = ARRAY_SIZE(rk806_pins_desc);
+               pci->pin_cfg = rk806_gpio_cfgs;
+               pci->gpio_chip.ngpio = ARRAY_SIZE(rk806_gpio_cfgs);
+               break;
        default:
                dev_err(&pdev->dev, "unsupported RK805 ID %lu\n",
                        pci->rk808->variant);
@@ -488,5 +634,6 @@ static struct platform_driver rk805_pinctrl_driver = {
 module_platform_driver(rk805_pinctrl_driver);
 
 MODULE_DESCRIPTION("RK805 pin control and GPIO driver");
+MODULE_AUTHOR("Xu Shengfei <xsf@rock-chips.com>");
 MODULE_AUTHOR("Joseph Chen <chenjh@rock-chips.com>");
 MODULE_LICENSE("GPL v2");
index dbe698f..e29c51c 100644 (file)
@@ -372,7 +372,7 @@ static struct i2c_driver cros_ec_driver = {
                .of_match_table = of_match_ptr(cros_ec_i2c_of_match),
                .pm     = &cros_ec_i2c_pm_ops,
        },
-       .probe_new      = cros_ec_i2c_probe,
+       .probe          = cros_ec_i2c_probe,
        .remove         = cros_ec_i2c_remove,
        .id_table       = cros_ec_i2c_id,
 };
index 68bba0f..500a61b 100644 (file)
@@ -16,6 +16,7 @@
 #include <linux/delay.h>
 #include <linux/io.h>
 #include <linux/interrupt.h>
+#include <linux/kobject.h>
 #include <linux/module.h>
 #include <linux/platform_data/cros_ec_commands.h>
 #include <linux/platform_data/cros_ec_proto.h>
@@ -315,6 +316,7 @@ static int cros_ec_lpc_readmem(struct cros_ec_device *ec, unsigned int offset,
 
 static void cros_ec_lpc_acpi_notify(acpi_handle device, u32 value, void *data)
 {
+       static const char *env[] = { "ERROR=PANIC", NULL };
        struct cros_ec_device *ec_dev = data;
        bool ec_has_more_events;
        int ret;
@@ -324,6 +326,7 @@ static void cros_ec_lpc_acpi_notify(acpi_handle device, u32 value, void *data)
        if (value == ACPI_NOTIFY_CROS_EC_PANIC) {
                dev_emerg(ec_dev->dev, "CrOS EC Panic Reported. Shutdown is imminent!");
                blocking_notifier_call_chain(&ec_dev->panic_notifier, 0, ec_dev);
+               kobject_uevent_env(&ec_dev->dev->kobj, KOBJ_CHANGE, (char **)env);
                /* Begin orderly shutdown. Force shutdown after 1 second. */
                hw_protection_shutdown("CrOS EC Panic", 1000);
                /* Do not query for other events after a panic is reported */
@@ -543,23 +546,25 @@ static const struct dmi_system_id cros_ec_lpc_dmi_table[] __initconst = {
 MODULE_DEVICE_TABLE(dmi, cros_ec_lpc_dmi_table);
 
 #ifdef CONFIG_PM_SLEEP
-static int cros_ec_lpc_suspend(struct device *dev)
+static int cros_ec_lpc_prepare(struct device *dev)
 {
        struct cros_ec_device *ec_dev = dev_get_drvdata(dev);
 
        return cros_ec_suspend(ec_dev);
 }
 
-static int cros_ec_lpc_resume(struct device *dev)
+static void cros_ec_lpc_complete(struct device *dev)
 {
        struct cros_ec_device *ec_dev = dev_get_drvdata(dev);
-
-       return cros_ec_resume(ec_dev);
+       cros_ec_resume(ec_dev);
 }
 #endif
 
 static const struct dev_pm_ops cros_ec_lpc_pm_ops = {
-       SET_LATE_SYSTEM_SLEEP_PM_OPS(cros_ec_lpc_suspend, cros_ec_lpc_resume)
+#ifdef CONFIG_PM_SLEEP
+       .prepare = cros_ec_lpc_prepare,
+       .complete = cros_ec_lpc_complete
+#endif
 };
 
 static struct platform_driver cros_ec_lpc_driver = {
index 21143db..3e88cc9 100644 (file)
@@ -104,13 +104,7 @@ static void debug_packet(struct device *dev, const char *name, u8 *ptr,
                         int len)
 {
 #ifdef DEBUG
-       int i;
-
-       dev_dbg(dev, "%s: ", name);
-       for (i = 0; i < len; i++)
-               pr_cont(" %02x", ptr[i]);
-
-       pr_cont("\n");
+       dev_dbg(dev, "%s: %*ph\n", name, len, ptr);
 #endif
 }
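
The %*ph conversion used above is printk's built-in hex-dump specifier for small buffers (documented as handling up to 64 bytes), replacing the hand-rolled per-byte loop. A minimal sketch of its behaviour:

    u8 buf[4] = { 0xde, 0xad, 0xbe, 0xef };

    /* emits: "pkt: de ad be ef" */
    dev_dbg(dev, "%s: %*ph\n", "pkt", (int)sizeof(buf), buf);
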
 
index 62ccb1a..b313130 100644 (file)
@@ -143,7 +143,7 @@ MODULE_DEVICE_TABLE(acpi, hps_acpi_id);
 #endif /* CONFIG_ACPI */
 
 static struct i2c_driver hps_i2c_driver = {
-       .probe_new = hps_i2c_probe,
+       .probe = hps_i2c_probe,
        .remove = hps_i2c_remove,
        .id_table = hps_i2c_id,
        .driver = {
index 7527204..0eefdcf 100644 (file)
@@ -51,13 +51,18 @@ static int cros_typec_cmd_mux_set(struct cros_typec_switch_data *sdata, int port
 static int cros_typec_get_mux_state(unsigned long mode, struct typec_altmode *alt)
 {
        int ret = -EOPNOTSUPP;
+       u8 pin_assign;
 
-       if (mode == TYPEC_STATE_SAFE)
+       if (mode == TYPEC_STATE_SAFE) {
                ret = USB_PD_MUX_SAFE_MODE;
-       else if (mode == TYPEC_STATE_USB)
+       } else if (mode == TYPEC_STATE_USB) {
                ret = USB_PD_MUX_USB_ENABLED;
-       else if (alt && alt->svid == USB_TYPEC_DP_SID)
+       } else if (alt && alt->svid == USB_TYPEC_DP_SID) {
                ret = USB_PD_MUX_DP_ENABLED;
+               pin_assign = mode - TYPEC_STATE_MODAL;
+               if (pin_assign & DP_PIN_ASSIGN_D)
+                       ret |= USB_PD_MUX_USB_ENABLED;
+       }
 
        return ret;
 }
index 4279057..1304cd6 100644 (file)
@@ -543,7 +543,7 @@ static int amd_pmc_idlemask_read(struct amd_pmc_dev *pdev, struct device *dev,
        }
 
        if (dev)
-               dev_dbg(pdev->dev, "SMU idlemask s0i3: 0x%x\n", val);
+               pm_pr_dbg("SMU idlemask s0i3: 0x%x\n", val);
 
        if (s)
                seq_printf(s, "SMU idlemask : 0x%x\n", val);
@@ -769,7 +769,7 @@ static int amd_pmc_verify_czn_rtc(struct amd_pmc_dev *pdev, u32 *arg)
 
        *arg |= (duration << 16);
        rc = rtc_alarm_irq_enable(rtc_device, 0);
-       dev_dbg(pdev->dev, "wakeup timer programmed for %lld seconds\n", duration);
+       pm_pr_dbg("wakeup timer programmed for %lld seconds\n", duration);
 
        return rc;
 }
index c78be9f..4a5e8e1 100644 (file)
@@ -706,7 +706,7 @@ config CHARGER_BQ256XX
 
 config CHARGER_RK817
        tristate "Rockchip RK817 PMIC Battery Charger"
-       depends on MFD_RK808
+       depends on MFD_RK8XX
        help
          Say Y to include support for Rockchip RK817 Battery Charger.
 
index 90d33cd..69ef8d0 100644 (file)
@@ -18,10 +18,12 @@ if POWERCAP
 # Client driver configurations go here.
 config INTEL_RAPL_CORE
        tristate
+       depends on PCI
+       select IOSF_MBI
 
 config INTEL_RAPL
        tristate "Intel RAPL Support via MSR Interface"
-       depends on X86 && IOSF_MBI
+       depends on X86 && PCI
        select INTEL_RAPL_CORE
        help
          This enables support for the Intel Running Average Power Limit (RAPL)
@@ -33,6 +35,20 @@ config INTEL_RAPL
          controller, CPU core (Power Plane 0), graphics uncore (Power Plane
          1), etc.
 
+config INTEL_RAPL_TPMI
+       tristate "Intel RAPL Support via TPMI Interface"
+       depends on X86
+       depends on INTEL_TPMI
+       select INTEL_RAPL_CORE
+       help
+         This enables support for the Intel Running Average Power Limit (RAPL)
+         technology via the TPMI interface, which allows power limits to be enforced
+         and monitored.
+
+         In RAPL, the platform level settings are divided into domains for
+         fine-grained control. These domains include processor package, DRAM
+         controller, platform, etc.
+
 config IDLE_INJECT
        bool "Idle injection framework"
        depends on CPU_IDLE
index 4474201..5ab0dce 100644 (file)
@@ -5,5 +5,6 @@ obj-$(CONFIG_DTPM_DEVFREQ) += dtpm_devfreq.o
 obj-$(CONFIG_POWERCAP) += powercap_sys.o
 obj-$(CONFIG_INTEL_RAPL_CORE) += intel_rapl_common.o
 obj-$(CONFIG_INTEL_RAPL) += intel_rapl_msr.o
+obj-$(CONFIG_INTEL_RAPL_TPMI) += intel_rapl_tpmi.o
 obj-$(CONFIG_IDLE_INJECT) += idle_inject.o
 obj-$(CONFIG_ARM_SCMI_POWERCAP) += arm_scmi_powercap.o
index 8970c7b..4e646e5 100644 (file)
 #define PSYS_TIME_WINDOW1_MASK       (0x7FULL<<19)
 #define PSYS_TIME_WINDOW2_MASK       (0x7FULL<<51)
 
+/* bitmasks for RAPL TPMI, used by primitive access functions */
+#define TPMI_POWER_LIMIT_MASK  0x3FFFF
+#define TPMI_POWER_LIMIT_ENABLE        BIT_ULL(62)
+#define TPMI_TIME_WINDOW_MASK  (0x7FULL<<18)
+#define TPMI_INFO_SPEC_MASK    0x3FFFF
+#define TPMI_INFO_MIN_MASK     (0x3FFFFULL << 18)
+#define TPMI_INFO_MAX_MASK     (0x3FFFFULL << 36)
+#define TPMI_INFO_MAX_TIME_WIN_MASK    (0x7FULL << 54)
+
 /* Non HW constants */
 #define RAPL_PRIMITIVE_DERIVED       BIT(1)    /* not from raw data */
 #define RAPL_PRIMITIVE_DUMMY         BIT(2)
@@ -94,26 +103,120 @@ enum unit_type {
 
 #define        DOMAIN_STATE_INACTIVE           BIT(0)
 #define        DOMAIN_STATE_POWER_LIMIT_SET    BIT(1)
-#define DOMAIN_STATE_BIOS_LOCKED        BIT(2)
 
-static const char pl1_name[] = "long_term";
-static const char pl2_name[] = "short_term";
-static const char pl4_name[] = "peak_power";
+static const char *pl_names[NR_POWER_LIMITS] = {
+       [POWER_LIMIT1] = "long_term",
+       [POWER_LIMIT2] = "short_term",
+       [POWER_LIMIT4] = "peak_power",
+};
+
+enum pl_prims {
+       PL_ENABLE,
+       PL_CLAMP,
+       PL_LIMIT,
+       PL_TIME_WINDOW,
+       PL_MAX_POWER,
+       PL_LOCK,
+};
+
+static bool is_pl_valid(struct rapl_domain *rd, int pl)
+{
+       if (pl < POWER_LIMIT1 || pl > POWER_LIMIT4)
+               return false;
+       return rd->rpl[pl].name ? true : false;
+}
+
+static int get_pl_lock_prim(struct rapl_domain *rd, int pl)
+{
+       if (rd->rp->priv->type == RAPL_IF_TPMI) {
+               if (pl == POWER_LIMIT1)
+                       return PL1_LOCK;
+               if (pl == POWER_LIMIT2)
+                       return PL2_LOCK;
+               if (pl == POWER_LIMIT4)
+                       return PL4_LOCK;
+       }
+
+       /* MSR/MMIO Interface doesn't have Lock bit for PL4 */
+       if (pl == POWER_LIMIT4)
+               return -EINVAL;
+
+       /*
+        * A Power Limit register that supports two power limits has a
+        * different bit position for the Lock bit.
+        */
+       if (rd->rp->priv->limits[rd->id] & BIT(POWER_LIMIT2))
+               return FW_HIGH_LOCK;
+       return FW_LOCK;
+}
+
+static int get_pl_prim(struct rapl_domain *rd, int pl, enum pl_prims prim)
+{
+       switch (pl) {
+       case POWER_LIMIT1:
+               if (prim == PL_ENABLE)
+                       return PL1_ENABLE;
+               if (prim == PL_CLAMP && rd->rp->priv->type != RAPL_IF_TPMI)
+                       return PL1_CLAMP;
+               if (prim == PL_LIMIT)
+                       return POWER_LIMIT1;
+               if (prim == PL_TIME_WINDOW)
+                       return TIME_WINDOW1;
+               if (prim == PL_MAX_POWER)
+                       return THERMAL_SPEC_POWER;
+               if (prim == PL_LOCK)
+                       return get_pl_lock_prim(rd, pl);
+               return -EINVAL;
+       case POWER_LIMIT2:
+               if (prim == PL_ENABLE)
+                       return PL2_ENABLE;
+               if (prim == PL_CLAMP && rd->rp->priv->type != RAPL_IF_TPMI)
+                       return PL2_CLAMP;
+               if (prim == PL_LIMIT)
+                       return POWER_LIMIT2;
+               if (prim == PL_TIME_WINDOW)
+                       return TIME_WINDOW2;
+               if (prim == PL_MAX_POWER)
+                       return MAX_POWER;
+               if (prim == PL_LOCK)
+                       return get_pl_lock_prim(rd, pl);
+               return -EINVAL;
+       case POWER_LIMIT4:
+               if (prim == PL_LIMIT)
+                       return POWER_LIMIT4;
+               if (prim == PL_ENABLE)
+                       return PL4_ENABLE;
+               /* PL4 would be around two times PL2, so use the same prim as PL2. */
+               if (prim == PL_MAX_POWER)
+                       return MAX_POWER;
+               if (prim == PL_LOCK)
+                       return get_pl_lock_prim(rd, pl);
+               return -EINVAL;
+       default:
+               return -EINVAL;
+       }
+}
 
 #define power_zone_to_rapl_domain(_zone) \
        container_of(_zone, struct rapl_domain, power_zone)
 
 struct rapl_defaults {
        u8 floor_freq_reg_addr;
-       int (*check_unit)(struct rapl_package *rp, int cpu);
+       int (*check_unit)(struct rapl_domain *rd);
        void (*set_floor_freq)(struct rapl_domain *rd, bool mode);
-       u64 (*compute_time_window)(struct rapl_package *rp, u64 val,
+       u64 (*compute_time_window)(struct rapl_domain *rd, u64 val,
                                    bool to_raw);
        unsigned int dram_domain_energy_unit;
        unsigned int psys_domain_energy_unit;
        bool spr_psys_bits;
 };
-static struct rapl_defaults *rapl_defaults;
+static struct rapl_defaults *defaults_msr;
+static const struct rapl_defaults defaults_tpmi;
+
+static struct rapl_defaults *get_defaults(struct rapl_package *rp)
+{
+       return rp->priv->defaults;
+}
 
 /* Sideband MBI registers */
 #define IOSF_CPU_POWER_BUDGET_CTL_BYT (0x2)
@@ -150,6 +253,12 @@ static int rapl_read_data_raw(struct rapl_domain *rd,
 static int rapl_write_data_raw(struct rapl_domain *rd,
                               enum rapl_primitives prim,
                               unsigned long long value);
+static int rapl_read_pl_data(struct rapl_domain *rd, int pl,
+                             enum pl_prims pl_prim,
+                             bool xlate, u64 *data);
+static int rapl_write_pl_data(struct rapl_domain *rd, int pl,
+                              enum pl_prims pl_prim,
+                              unsigned long long value);
 static u64 rapl_unit_xlate(struct rapl_domain *rd,
                           enum unit_type type, u64 value, int to_raw);
 static void package_power_limit_irq_save(struct rapl_package *rp);
@@ -217,7 +326,7 @@ static int find_nr_power_limit(struct rapl_domain *rd)
        int i, nr_pl = 0;
 
        for (i = 0; i < NR_POWER_LIMITS; i++) {
-               if (rd->rpl[i].name)
+               if (is_pl_valid(rd, i))
                        nr_pl++;
        }
 
@@ -227,37 +336,35 @@ static int find_nr_power_limit(struct rapl_domain *rd)
 static int set_domain_enable(struct powercap_zone *power_zone, bool mode)
 {
        struct rapl_domain *rd = power_zone_to_rapl_domain(power_zone);
-
-       if (rd->state & DOMAIN_STATE_BIOS_LOCKED)
-               return -EACCES;
+       struct rapl_defaults *defaults = get_defaults(rd->rp);
+       int ret;
 
        cpus_read_lock();
-       rapl_write_data_raw(rd, PL1_ENABLE, mode);
-       if (rapl_defaults->set_floor_freq)
-               rapl_defaults->set_floor_freq(rd, mode);
+       ret = rapl_write_pl_data(rd, POWER_LIMIT1, PL_ENABLE, mode);
+       if (!ret && defaults->set_floor_freq)
+               defaults->set_floor_freq(rd, mode);
        cpus_read_unlock();
 
-       return 0;
+       return ret;
 }
 
 static int get_domain_enable(struct powercap_zone *power_zone, bool *mode)
 {
        struct rapl_domain *rd = power_zone_to_rapl_domain(power_zone);
        u64 val;
+       int ret;
 
-       if (rd->state & DOMAIN_STATE_BIOS_LOCKED) {
+       if (rd->rpl[POWER_LIMIT1].locked) {
                *mode = false;
                return 0;
        }
        cpus_read_lock();
-       if (rapl_read_data_raw(rd, PL1_ENABLE, true, &val)) {
-               cpus_read_unlock();
-               return -EIO;
-       }
-       *mode = val;
+       ret = rapl_read_pl_data(rd, POWER_LIMIT1, PL_ENABLE, true, &val);
+       if (!ret)
+               *mode = val;
        cpus_read_unlock();
 
-       return 0;
+       return ret;
 }
 
 /* per RAPL domain ops, in the order of rapl_domain_type */
@@ -313,8 +420,8 @@ static int contraint_to_pl(struct rapl_domain *rd, int cid)
 {
        int i, j;
 
-       for (i = 0, j = 0; i < NR_POWER_LIMITS; i++) {
-               if ((rd->rpl[i].name) && j++ == cid) {
+       for (i = POWER_LIMIT1, j = 0; i < NR_POWER_LIMITS; i++) {
+               if (is_pl_valid(rd, i) && j++ == cid) {
                        pr_debug("%s: index %d\n", __func__, i);
                        return i;
                }
@@ -335,36 +442,11 @@ static int set_power_limit(struct powercap_zone *power_zone, int cid,
        cpus_read_lock();
        rd = power_zone_to_rapl_domain(power_zone);
        id = contraint_to_pl(rd, cid);
-       if (id < 0) {
-               ret = id;
-               goto set_exit;
-       }
-
        rp = rd->rp;
 
-       if (rd->state & DOMAIN_STATE_BIOS_LOCKED) {
-               dev_warn(&power_zone->dev,
-                        "%s locked by BIOS, monitoring only\n", rd->name);
-               ret = -EACCES;
-               goto set_exit;
-       }
-
-       switch (rd->rpl[id].prim_id) {
-       case PL1_ENABLE:
-               rapl_write_data_raw(rd, POWER_LIMIT1, power_limit);
-               break;
-       case PL2_ENABLE:
-               rapl_write_data_raw(rd, POWER_LIMIT2, power_limit);
-               break;
-       case PL4_ENABLE:
-               rapl_write_data_raw(rd, POWER_LIMIT4, power_limit);
-               break;
-       default:
-               ret = -EINVAL;
-       }
+       ret = rapl_write_pl_data(rd, id, PL_LIMIT, power_limit);
        if (!ret)
                package_power_limit_irq_save(rp);
-set_exit:
        cpus_read_unlock();
        return ret;
 }
@@ -374,38 +456,17 @@ static int get_current_power_limit(struct powercap_zone *power_zone, int cid,
 {
        struct rapl_domain *rd;
        u64 val;
-       int prim;
        int ret = 0;
        int id;
 
        cpus_read_lock();
        rd = power_zone_to_rapl_domain(power_zone);
        id = contraint_to_pl(rd, cid);
-       if (id < 0) {
-               ret = id;
-               goto get_exit;
-       }
 
-       switch (rd->rpl[id].prim_id) {
-       case PL1_ENABLE:
-               prim = POWER_LIMIT1;
-               break;
-       case PL2_ENABLE:
-               prim = POWER_LIMIT2;
-               break;
-       case PL4_ENABLE:
-               prim = POWER_LIMIT4;
-               break;
-       default:
-               cpus_read_unlock();
-               return -EINVAL;
-       }
-       if (rapl_read_data_raw(rd, prim, true, &val))
-               ret = -EIO;
-       else
+       ret = rapl_read_pl_data(rd, id, PL_LIMIT, true, &val);
+       if (!ret)
                *data = val;
 
-get_exit:
        cpus_read_unlock();
 
        return ret;
@@ -421,23 +482,9 @@ static int set_time_window(struct powercap_zone *power_zone, int cid,
        cpus_read_lock();
        rd = power_zone_to_rapl_domain(power_zone);
        id = contraint_to_pl(rd, cid);
-       if (id < 0) {
-               ret = id;
-               goto set_time_exit;
-       }
 
-       switch (rd->rpl[id].prim_id) {
-       case PL1_ENABLE:
-               rapl_write_data_raw(rd, TIME_WINDOW1, window);
-               break;
-       case PL2_ENABLE:
-               rapl_write_data_raw(rd, TIME_WINDOW2, window);
-               break;
-       default:
-               ret = -EINVAL;
-       }
+       ret = rapl_write_pl_data(rd, id, PL_TIME_WINDOW, window);
 
-set_time_exit:
        cpus_read_unlock();
        return ret;
 }
@@ -453,33 +500,11 @@ static int get_time_window(struct powercap_zone *power_zone, int cid,
        cpus_read_lock();
        rd = power_zone_to_rapl_domain(power_zone);
        id = contraint_to_pl(rd, cid);
-       if (id < 0) {
-               ret = id;
-               goto get_time_exit;
-       }
 
-       switch (rd->rpl[id].prim_id) {
-       case PL1_ENABLE:
-               ret = rapl_read_data_raw(rd, TIME_WINDOW1, true, &val);
-               break;
-       case PL2_ENABLE:
-               ret = rapl_read_data_raw(rd, TIME_WINDOW2, true, &val);
-               break;
-       case PL4_ENABLE:
-               /*
-                * Time window parameter is not applicable for PL4 entry
-                * so assigining '0' as default value.
-                */
-               val = 0;
-               break;
-       default:
-               cpus_read_unlock();
-               return -EINVAL;
-       }
+       ret = rapl_read_pl_data(rd, id, PL_TIME_WINDOW, true, &val);
        if (!ret)
                *data = val;
 
-get_time_exit:
        cpus_read_unlock();
 
        return ret;
@@ -499,36 +524,23 @@ static const char *get_constraint_name(struct powercap_zone *power_zone,
        return NULL;
 }
 
-static int get_max_power(struct powercap_zone *power_zone, int id, u64 *data)
+static int get_max_power(struct powercap_zone *power_zone, int cid, u64 *data)
 {
        struct rapl_domain *rd;
        u64 val;
-       int prim;
        int ret = 0;
+       int id;
 
        cpus_read_lock();
        rd = power_zone_to_rapl_domain(power_zone);
-       switch (rd->rpl[id].prim_id) {
-       case PL1_ENABLE:
-               prim = THERMAL_SPEC_POWER;
-               break;
-       case PL2_ENABLE:
-               prim = MAX_POWER;
-               break;
-       case PL4_ENABLE:
-               prim = MAX_POWER;
-               break;
-       default:
-               cpus_read_unlock();
-               return -EINVAL;
-       }
-       if (rapl_read_data_raw(rd, prim, true, &val))
-               ret = -EIO;
-       else
+       id = contraint_to_pl(rd, cid);
+
+       ret = rapl_read_pl_data(rd, id, PL_MAX_POWER, true, &val);
+       if (!ret)
                *data = val;
 
        /* As a generalization rule, PL4 would be around two times PL2. */
-       if (rd->rpl[id].prim_id == PL4_ENABLE)
+       if (id == POWER_LIMIT4)
                *data = *data * 2;
 
        cpus_read_unlock();
@@ -545,6 +557,12 @@ static const struct powercap_zone_constraint_ops constraint_ops = {
        .get_name = get_constraint_name,
 };
 
+/* Return the id used for read_raw/write_raw callback */
+static int get_rid(struct rapl_package *rp)
+{
+       return rp->lead_cpu >= 0 ? rp->lead_cpu : rp->id;
+}
+
 /* called after domain detection and package level data are set */
 static void rapl_init_domains(struct rapl_package *rp)
 {
@@ -554,6 +572,7 @@ static void rapl_init_domains(struct rapl_package *rp)
 
        for (i = 0; i < RAPL_DOMAIN_MAX; i++) {
                unsigned int mask = rp->domain_map & (1 << i);
+               int t;
 
                if (!mask)
                        continue;
@@ -562,51 +581,26 @@ static void rapl_init_domains(struct rapl_package *rp)
 
                if (i == RAPL_DOMAIN_PLATFORM && rp->id > 0) {
                        snprintf(rd->name, RAPL_DOMAIN_NAME_LENGTH, "psys-%d",
-                               topology_physical_package_id(rp->lead_cpu));
-               } else
+                               rp->lead_cpu >= 0 ? topology_physical_package_id(rp->lead_cpu) :
+                               rp->id);
+               } else {
                        snprintf(rd->name, RAPL_DOMAIN_NAME_LENGTH, "%s",
                                rapl_domain_names[i]);
+               }
 
                rd->id = i;
-               rd->rpl[0].prim_id = PL1_ENABLE;
-               rd->rpl[0].name = pl1_name;
 
-               /*
-                * The PL2 power domain is applicable for limits two
-                * and limits three
-                */
-               if (rp->priv->limits[i] >= 2) {
-                       rd->rpl[1].prim_id = PL2_ENABLE;
-                       rd->rpl[1].name = pl2_name;
-               }
+               /* PL1 is supported by default */
+               rp->priv->limits[i] |= BIT(POWER_LIMIT1);
 
-               /* Enable PL4 domain if the total power limits are three */
-               if (rp->priv->limits[i] == 3) {
-                       rd->rpl[2].prim_id = PL4_ENABLE;
-                       rd->rpl[2].name = pl4_name;
+               for (t = POWER_LIMIT1; t < NR_POWER_LIMITS; t++) {
+                       if (rp->priv->limits[i] & BIT(t))
+                               rd->rpl[t].name = pl_names[t];
                }
 
                for (j = 0; j < RAPL_DOMAIN_REG_MAX; j++)
                        rd->regs[j] = rp->priv->regs[i][j];
 
-               switch (i) {
-               case RAPL_DOMAIN_DRAM:
-                       rd->domain_energy_unit =
-                           rapl_defaults->dram_domain_energy_unit;
-                       if (rd->domain_energy_unit)
-                               pr_info("DRAM domain energy unit %dpj\n",
-                                       rd->domain_energy_unit);
-                       break;
-               case RAPL_DOMAIN_PLATFORM:
-                       rd->domain_energy_unit =
-                           rapl_defaults->psys_domain_energy_unit;
-                       if (rd->domain_energy_unit)
-                               pr_info("Platform domain energy unit %dpj\n",
-                                       rd->domain_energy_unit);
-                       break;
-               default:
-                       break;
-               }
                rd++;
        }
 }
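
With the change above, the power limits a domain exposes are driven purely by the bits in rp->priv->limits[]; PL1 is always forced on, and a limit whose bit is clear keeps a NULL name, so is_pl_valid() rejects it. A worked example under that scheme (bit positions assumed from the enum order used in this file):

    /*
     * limits[i] = BIT(POWER_LIMIT1) | BIT(POWER_LIMIT2)
     *   -> rpl[POWER_LIMIT1].name = "long_term"
     *   -> rpl[POWER_LIMIT2].name = "short_term"
     *   -> rpl[POWER_LIMIT4].name stays NULL, so is_pl_valid(rd, POWER_LIMIT4)
     *      returns false and PL4 is not exposed for this domain.
     */
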
@@ -615,23 +609,19 @@ static u64 rapl_unit_xlate(struct rapl_domain *rd, enum unit_type type,
                           u64 value, int to_raw)
 {
        u64 units = 1;
-       struct rapl_package *rp = rd->rp;
+       struct rapl_defaults *defaults = get_defaults(rd->rp);
        u64 scale = 1;
 
        switch (type) {
        case POWER_UNIT:
-               units = rp->power_unit;
+               units = rd->power_unit;
                break;
        case ENERGY_UNIT:
                scale = ENERGY_UNIT_SCALE;
-               /* per domain unit takes precedence */
-               if (rd->domain_energy_unit)
-                       units = rd->domain_energy_unit;
-               else
-                       units = rp->energy_unit;
+               units = rd->energy_unit;
                break;
        case TIME_UNIT:
-               return rapl_defaults->compute_time_window(rp, value, to_raw);
+               return defaults->compute_time_window(rd, value, to_raw);
        case ARBITRARY_UNIT:
        default:
                return value;
@@ -645,67 +635,141 @@ static u64 rapl_unit_xlate(struct rapl_domain *rd, enum unit_type type,
        return div64_u64(value, scale);
 }
 
-/* in the order of enum rapl_primitives */
-static struct rapl_primitive_info rpi[] = {
+/* RAPL primitives for MSR and MMIO I/F */
+static struct rapl_primitive_info rpi_msr[NR_RAPL_PRIMITIVES] = {
        /* name, mask, shift, msr index, unit divisor */
-       PRIMITIVE_INFO_INIT(ENERGY_COUNTER, ENERGY_STATUS_MASK, 0,
-                           RAPL_DOMAIN_REG_STATUS, ENERGY_UNIT, 0),
-       PRIMITIVE_INFO_INIT(POWER_LIMIT1, POWER_LIMIT1_MASK, 0,
+       [POWER_LIMIT1] = PRIMITIVE_INFO_INIT(POWER_LIMIT1, POWER_LIMIT1_MASK, 0,
                            RAPL_DOMAIN_REG_LIMIT, POWER_UNIT, 0),
-       PRIMITIVE_INFO_INIT(POWER_LIMIT2, POWER_LIMIT2_MASK, 32,
+       [POWER_LIMIT2] = PRIMITIVE_INFO_INIT(POWER_LIMIT2, POWER_LIMIT2_MASK, 32,
                            RAPL_DOMAIN_REG_LIMIT, POWER_UNIT, 0),
-       PRIMITIVE_INFO_INIT(POWER_LIMIT4, POWER_LIMIT4_MASK, 0,
+       [POWER_LIMIT4] = PRIMITIVE_INFO_INIT(POWER_LIMIT4, POWER_LIMIT4_MASK, 0,
                                RAPL_DOMAIN_REG_PL4, POWER_UNIT, 0),
-       PRIMITIVE_INFO_INIT(FW_LOCK, POWER_LOW_LOCK, 31,
+       [ENERGY_COUNTER] = PRIMITIVE_INFO_INIT(ENERGY_COUNTER, ENERGY_STATUS_MASK, 0,
+                           RAPL_DOMAIN_REG_STATUS, ENERGY_UNIT, 0),
+       [FW_LOCK] = PRIMITIVE_INFO_INIT(FW_LOCK, POWER_LOW_LOCK, 31,
                            RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0),
-       PRIMITIVE_INFO_INIT(PL1_ENABLE, POWER_LIMIT1_ENABLE, 15,
+       [FW_HIGH_LOCK] = PRIMITIVE_INFO_INIT(FW_LOCK, POWER_HIGH_LOCK, 63,
                            RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0),
-       PRIMITIVE_INFO_INIT(PL1_CLAMP, POWER_LIMIT1_CLAMP, 16,
+       [PL1_ENABLE] = PRIMITIVE_INFO_INIT(PL1_ENABLE, POWER_LIMIT1_ENABLE, 15,
                            RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0),
-       PRIMITIVE_INFO_INIT(PL2_ENABLE, POWER_LIMIT2_ENABLE, 47,
+       [PL1_CLAMP] = PRIMITIVE_INFO_INIT(PL1_CLAMP, POWER_LIMIT1_CLAMP, 16,
                            RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0),
-       PRIMITIVE_INFO_INIT(PL2_CLAMP, POWER_LIMIT2_CLAMP, 48,
+       [PL2_ENABLE] = PRIMITIVE_INFO_INIT(PL2_ENABLE, POWER_LIMIT2_ENABLE, 47,
                            RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0),
-       PRIMITIVE_INFO_INIT(PL4_ENABLE, POWER_LIMIT4_MASK, 0,
+       [PL2_CLAMP] = PRIMITIVE_INFO_INIT(PL2_CLAMP, POWER_LIMIT2_CLAMP, 48,
+                           RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0),
+       [PL4_ENABLE] = PRIMITIVE_INFO_INIT(PL4_ENABLE, POWER_LIMIT4_MASK, 0,
                                RAPL_DOMAIN_REG_PL4, ARBITRARY_UNIT, 0),
-       PRIMITIVE_INFO_INIT(TIME_WINDOW1, TIME_WINDOW1_MASK, 17,
+       [TIME_WINDOW1] = PRIMITIVE_INFO_INIT(TIME_WINDOW1, TIME_WINDOW1_MASK, 17,
                            RAPL_DOMAIN_REG_LIMIT, TIME_UNIT, 0),
-       PRIMITIVE_INFO_INIT(TIME_WINDOW2, TIME_WINDOW2_MASK, 49,
+       [TIME_WINDOW2] = PRIMITIVE_INFO_INIT(TIME_WINDOW2, TIME_WINDOW2_MASK, 49,
                            RAPL_DOMAIN_REG_LIMIT, TIME_UNIT, 0),
-       PRIMITIVE_INFO_INIT(THERMAL_SPEC_POWER, POWER_INFO_THERMAL_SPEC_MASK,
+       [THERMAL_SPEC_POWER] = PRIMITIVE_INFO_INIT(THERMAL_SPEC_POWER, POWER_INFO_THERMAL_SPEC_MASK,
                            0, RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0),
-       PRIMITIVE_INFO_INIT(MAX_POWER, POWER_INFO_MAX_MASK, 32,
+       [MAX_POWER] = PRIMITIVE_INFO_INIT(MAX_POWER, POWER_INFO_MAX_MASK, 32,
                            RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0),
-       PRIMITIVE_INFO_INIT(MIN_POWER, POWER_INFO_MIN_MASK, 16,
+       [MIN_POWER] = PRIMITIVE_INFO_INIT(MIN_POWER, POWER_INFO_MIN_MASK, 16,
                            RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0),
-       PRIMITIVE_INFO_INIT(MAX_TIME_WINDOW, POWER_INFO_MAX_TIME_WIN_MASK, 48,
+       [MAX_TIME_WINDOW] = PRIMITIVE_INFO_INIT(MAX_TIME_WINDOW, POWER_INFO_MAX_TIME_WIN_MASK, 48,
                            RAPL_DOMAIN_REG_INFO, TIME_UNIT, 0),
-       PRIMITIVE_INFO_INIT(THROTTLED_TIME, PERF_STATUS_THROTTLE_TIME_MASK, 0,
+       [THROTTLED_TIME] = PRIMITIVE_INFO_INIT(THROTTLED_TIME, PERF_STATUS_THROTTLE_TIME_MASK, 0,
                            RAPL_DOMAIN_REG_PERF, TIME_UNIT, 0),
-       PRIMITIVE_INFO_INIT(PRIORITY_LEVEL, PP_POLICY_MASK, 0,
+       [PRIORITY_LEVEL] = PRIMITIVE_INFO_INIT(PRIORITY_LEVEL, PP_POLICY_MASK, 0,
                            RAPL_DOMAIN_REG_POLICY, ARBITRARY_UNIT, 0),
-       PRIMITIVE_INFO_INIT(PSYS_POWER_LIMIT1, PSYS_POWER_LIMIT1_MASK, 0,
+       [PSYS_POWER_LIMIT1] = PRIMITIVE_INFO_INIT(PSYS_POWER_LIMIT1, PSYS_POWER_LIMIT1_MASK, 0,
                            RAPL_DOMAIN_REG_LIMIT, POWER_UNIT, 0),
-       PRIMITIVE_INFO_INIT(PSYS_POWER_LIMIT2, PSYS_POWER_LIMIT2_MASK, 32,
+       [PSYS_POWER_LIMIT2] = PRIMITIVE_INFO_INIT(PSYS_POWER_LIMIT2, PSYS_POWER_LIMIT2_MASK, 32,
                            RAPL_DOMAIN_REG_LIMIT, POWER_UNIT, 0),
-       PRIMITIVE_INFO_INIT(PSYS_PL1_ENABLE, PSYS_POWER_LIMIT1_ENABLE, 17,
+       [PSYS_PL1_ENABLE] = PRIMITIVE_INFO_INIT(PSYS_PL1_ENABLE, PSYS_POWER_LIMIT1_ENABLE, 17,
                            RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0),
-       PRIMITIVE_INFO_INIT(PSYS_PL2_ENABLE, PSYS_POWER_LIMIT2_ENABLE, 49,
+       [PSYS_PL2_ENABLE] = PRIMITIVE_INFO_INIT(PSYS_PL2_ENABLE, PSYS_POWER_LIMIT2_ENABLE, 49,
                            RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0),
-       PRIMITIVE_INFO_INIT(PSYS_TIME_WINDOW1, PSYS_TIME_WINDOW1_MASK, 19,
+       [PSYS_TIME_WINDOW1] = PRIMITIVE_INFO_INIT(PSYS_TIME_WINDOW1, PSYS_TIME_WINDOW1_MASK, 19,
                            RAPL_DOMAIN_REG_LIMIT, TIME_UNIT, 0),
-       PRIMITIVE_INFO_INIT(PSYS_TIME_WINDOW2, PSYS_TIME_WINDOW2_MASK, 51,
+       [PSYS_TIME_WINDOW2] = PRIMITIVE_INFO_INIT(PSYS_TIME_WINDOW2, PSYS_TIME_WINDOW2_MASK, 51,
                            RAPL_DOMAIN_REG_LIMIT, TIME_UNIT, 0),
        /* non-hardware */
-       PRIMITIVE_INFO_INIT(AVERAGE_POWER, 0, 0, 0, POWER_UNIT,
+       [AVERAGE_POWER] = PRIMITIVE_INFO_INIT(AVERAGE_POWER, 0, 0, 0, POWER_UNIT,
                            RAPL_PRIMITIVE_DERIVED),
-       {NULL, 0, 0, 0},
 };
 
+/* RAPL primitives for TPMI I/F */
+static struct rapl_primitive_info rpi_tpmi[NR_RAPL_PRIMITIVES] = {
+       /* name, mask, shift, msr index, unit divisor */
+       [POWER_LIMIT1] = PRIMITIVE_INFO_INIT(POWER_LIMIT1, TPMI_POWER_LIMIT_MASK, 0,
+               RAPL_DOMAIN_REG_LIMIT, POWER_UNIT, 0),
+       [POWER_LIMIT2] = PRIMITIVE_INFO_INIT(POWER_LIMIT2, TPMI_POWER_LIMIT_MASK, 0,
+               RAPL_DOMAIN_REG_PL2, POWER_UNIT, 0),
+       [POWER_LIMIT4] = PRIMITIVE_INFO_INIT(POWER_LIMIT4, TPMI_POWER_LIMIT_MASK, 0,
+               RAPL_DOMAIN_REG_PL4, POWER_UNIT, 0),
+       [ENERGY_COUNTER] = PRIMITIVE_INFO_INIT(ENERGY_COUNTER, ENERGY_STATUS_MASK, 0,
+               RAPL_DOMAIN_REG_STATUS, ENERGY_UNIT, 0),
+       [PL1_LOCK] = PRIMITIVE_INFO_INIT(PL1_LOCK, POWER_HIGH_LOCK, 63,
+               RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0),
+       [PL2_LOCK] = PRIMITIVE_INFO_INIT(PL2_LOCK, POWER_HIGH_LOCK, 63,
+               RAPL_DOMAIN_REG_PL2, ARBITRARY_UNIT, 0),
+       [PL4_LOCK] = PRIMITIVE_INFO_INIT(PL4_LOCK, POWER_HIGH_LOCK, 63,
+               RAPL_DOMAIN_REG_PL4, ARBITRARY_UNIT, 0),
+       [PL1_ENABLE] = PRIMITIVE_INFO_INIT(PL1_ENABLE, TPMI_POWER_LIMIT_ENABLE, 62,
+               RAPL_DOMAIN_REG_LIMIT, ARBITRARY_UNIT, 0),
+       [PL2_ENABLE] = PRIMITIVE_INFO_INIT(PL2_ENABLE, TPMI_POWER_LIMIT_ENABLE, 62,
+               RAPL_DOMAIN_REG_PL2, ARBITRARY_UNIT, 0),
+       [PL4_ENABLE] = PRIMITIVE_INFO_INIT(PL4_ENABLE, TPMI_POWER_LIMIT_ENABLE, 62,
+               RAPL_DOMAIN_REG_PL4, ARBITRARY_UNIT, 0),
+       [TIME_WINDOW1] = PRIMITIVE_INFO_INIT(TIME_WINDOW1, TPMI_TIME_WINDOW_MASK, 18,
+               RAPL_DOMAIN_REG_LIMIT, TIME_UNIT, 0),
+       [TIME_WINDOW2] = PRIMITIVE_INFO_INIT(TIME_WINDOW2, TPMI_TIME_WINDOW_MASK, 18,
+               RAPL_DOMAIN_REG_PL2, TIME_UNIT, 0),
+       [THERMAL_SPEC_POWER] = PRIMITIVE_INFO_INIT(THERMAL_SPEC_POWER, TPMI_INFO_SPEC_MASK, 0,
+               RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0),
+       [MAX_POWER] = PRIMITIVE_INFO_INIT(MAX_POWER, TPMI_INFO_MAX_MASK, 36,
+               RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0),
+       [MIN_POWER] = PRIMITIVE_INFO_INIT(MIN_POWER, TPMI_INFO_MIN_MASK, 18,
+               RAPL_DOMAIN_REG_INFO, POWER_UNIT, 0),
+       [MAX_TIME_WINDOW] = PRIMITIVE_INFO_INIT(MAX_TIME_WINDOW, TPMI_INFO_MAX_TIME_WIN_MASK, 54,
+               RAPL_DOMAIN_REG_INFO, TIME_UNIT, 0),
+       [THROTTLED_TIME] = PRIMITIVE_INFO_INIT(THROTTLED_TIME, PERF_STATUS_THROTTLE_TIME_MASK, 0,
+               RAPL_DOMAIN_REG_PERF, TIME_UNIT, 0),
+       /* non-hardware */
+       [AVERAGE_POWER] = PRIMITIVE_INFO_INIT(AVERAGE_POWER, 0, 0, 0,
+               POWER_UNIT, RAPL_PRIMITIVE_DERIVED),
+};
+
+static struct rapl_primitive_info *get_rpi(struct rapl_package *rp, int prim)
+{
+       struct rapl_primitive_info *rpi = rp->priv->rpi;
+
+       if (prim < 0 || prim > NR_RAPL_PRIMITIVES || !rpi)
+               return NULL;
+
+       return &rpi[prim];
+}
+
+static int rapl_config(struct rapl_package *rp)
+{
+       switch (rp->priv->type) {
+       /* MMIO I/F shares the same register layout as MSR registers */
+       case RAPL_IF_MMIO:
+       case RAPL_IF_MSR:
+               rp->priv->defaults = (void *)defaults_msr;
+               rp->priv->rpi = (void *)rpi_msr;
+               break;
+       case RAPL_IF_TPMI:
+               rp->priv->defaults = (void *)&defaults_tpmi;
+               rp->priv->rpi = (void *)rpi_tpmi;
+               break;
+       default:
+               return -EINVAL;
+       }
+       return 0;
+}
+
 static enum rapl_primitives
 prim_fixups(struct rapl_domain *rd, enum rapl_primitives prim)
 {
-       if (!rapl_defaults->spr_psys_bits)
+       struct rapl_defaults *defaults = get_defaults(rd->rp);
+
+       if (!defaults->spr_psys_bits)
                return prim;
 
        if (rd->id != RAPL_DOMAIN_PLATFORM)
@@ -747,41 +811,33 @@ static int rapl_read_data_raw(struct rapl_domain *rd,
 {
        u64 value;
        enum rapl_primitives prim_fixed = prim_fixups(rd, prim);
-       struct rapl_primitive_info *rp = &rpi[prim_fixed];
+       struct rapl_primitive_info *rpi = get_rpi(rd->rp, prim_fixed);
        struct reg_action ra;
-       int cpu;
 
-       if (!rp->name || rp->flag & RAPL_PRIMITIVE_DUMMY)
+       if (!rpi || !rpi->name || rpi->flag & RAPL_PRIMITIVE_DUMMY)
                return -EINVAL;
 
-       ra.reg = rd->regs[rp->id];
+       ra.reg = rd->regs[rpi->id];
        if (!ra.reg)
                return -EINVAL;
 
-       cpu = rd->rp->lead_cpu;
-
-       /* domain with 2 limits has different bit */
-       if (prim == FW_LOCK && rd->rp->priv->limits[rd->id] == 2) {
-               rp->mask = POWER_HIGH_LOCK;
-               rp->shift = 63;
-       }
        /* non-hardware data are collected by the polling thread */
-       if (rp->flag & RAPL_PRIMITIVE_DERIVED) {
+       if (rpi->flag & RAPL_PRIMITIVE_DERIVED) {
                *data = rd->rdd.primitives[prim];
                return 0;
        }
 
-       ra.mask = rp->mask;
+       ra.mask = rpi->mask;
 
-       if (rd->rp->priv->read_raw(cpu, &ra)) {
-               pr_debug("failed to read reg 0x%llx on cpu %d\n", ra.reg, cpu);
+       if (rd->rp->priv->read_raw(get_rid(rd->rp), &ra)) {
+               pr_debug("failed to read reg 0x%llx for %s:%s\n", ra.reg, rd->rp->name, rd->name);
                return -EIO;
        }
 
-       value = ra.value >> rp->shift;
+       value = ra.value >> rpi->shift;
 
        if (xlate)
-               *data = rapl_unit_xlate(rd, rp->unit, value, 0);
+               *data = rapl_unit_xlate(rd, rpi->unit, value, 0);
        else
                *data = value;
 
@@ -794,28 +850,56 @@ static int rapl_write_data_raw(struct rapl_domain *rd,
                               unsigned long long value)
 {
        enum rapl_primitives prim_fixed = prim_fixups(rd, prim);
-       struct rapl_primitive_info *rp = &rpi[prim_fixed];
-       int cpu;
+       struct rapl_primitive_info *rpi = get_rpi(rd->rp, prim_fixed);
        u64 bits;
        struct reg_action ra;
        int ret;
 
-       cpu = rd->rp->lead_cpu;
-       bits = rapl_unit_xlate(rd, rp->unit, value, 1);
-       bits <<= rp->shift;
-       bits &= rp->mask;
+       if (!rpi || !rpi->name || rpi->flag & RAPL_PRIMITIVE_DUMMY)
+               return -EINVAL;
+
+       bits = rapl_unit_xlate(rd, rpi->unit, value, 1);
+       bits <<= rpi->shift;
+       bits &= rpi->mask;
 
        memset(&ra, 0, sizeof(ra));
 
-       ra.reg = rd->regs[rp->id];
-       ra.mask = rp->mask;
+       ra.reg = rd->regs[rpi->id];
+       ra.mask = rpi->mask;
        ra.value = bits;
 
-       ret = rd->rp->priv->write_raw(cpu, &ra);
+       ret = rd->rp->priv->write_raw(get_rid(rd->rp), &ra);
 
        return ret;
 }
 
+static int rapl_read_pl_data(struct rapl_domain *rd, int pl,
+                             enum pl_prims pl_prim, bool xlate, u64 *data)
+{
+       enum rapl_primitives prim = get_pl_prim(rd, pl, pl_prim);
+
+       if (!is_pl_valid(rd, pl))
+               return -EINVAL;
+
+       return rapl_read_data_raw(rd, prim, xlate, data);
+}
+
+static int rapl_write_pl_data(struct rapl_domain *rd, int pl,
+                              enum pl_prims pl_prim,
+                              unsigned long long value)
+{
+       enum rapl_primitives prim = get_pl_prim(rd, pl, pl_prim);
+
+       if (!is_pl_valid(rd, pl))
+               return -EINVAL;
+
+       if (rd->rpl[pl].locked) {
+               pr_warn("%s:%s:%s locked by BIOS\n", rd->rp->name, rd->name, pl_names[pl]);
+               return -EACCES;
+       }
+
+       return rapl_write_data_raw(rd, prim, value);
+}
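
A hedged sketch of how the new wrappers above are meant to be used: callers name a power limit (POWER_LIMIT1..POWER_LIMIT4) plus an abstract field (PL_LIMIT, PL_TIME_WINDOW, ...), and get_pl_prim() resolves that pair to the interface-specific primitive, e.g. (POWER_LIMIT2, PL_TIME_WINDOW) becomes TIME_WINDOW2 backed by RAPL_DOMAIN_REG_PL2 on TPMI. Illustrative call site (rd is assumed to be a valid struct rapl_domain pointer):

    u64 window;

    if (!rapl_read_pl_data(rd, POWER_LIMIT2, PL_TIME_WINDOW, true, &window))
            rapl_write_pl_data(rd, POWER_LIMIT2, PL_TIME_WINDOW, window * 2);
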
 /*
  * Raw RAPL data stored in MSRs are in certain scales. We need to
  * convert them into standard units based on the units reported in
@@ -827,58 +911,58 @@ static int rapl_write_data_raw(struct rapl_domain *rd,
  * power unit : microWatts  : Represented in milliWatts by default
  * time unit  : microseconds: Represented in seconds by default
  */
-static int rapl_check_unit_core(struct rapl_package *rp, int cpu)
+static int rapl_check_unit_core(struct rapl_domain *rd)
 {
        struct reg_action ra;
        u32 value;
 
-       ra.reg = rp->priv->reg_unit;
+       ra.reg = rd->regs[RAPL_DOMAIN_REG_UNIT];
        ra.mask = ~0;
-       if (rp->priv->read_raw(cpu, &ra)) {
-               pr_err("Failed to read power unit REG 0x%llx on CPU %d, exit.\n",
-                      rp->priv->reg_unit, cpu);
+       if (rd->rp->priv->read_raw(get_rid(rd->rp), &ra)) {
+               pr_err("Failed to read power unit REG 0x%llx on %s:%s, exit.\n",
+                       ra.reg, rd->rp->name, rd->name);
                return -ENODEV;
        }
 
        value = (ra.value & ENERGY_UNIT_MASK) >> ENERGY_UNIT_OFFSET;
-       rp->energy_unit = ENERGY_UNIT_SCALE * 1000000 / (1 << value);
+       rd->energy_unit = ENERGY_UNIT_SCALE * 1000000 / (1 << value);
 
        value = (ra.value & POWER_UNIT_MASK) >> POWER_UNIT_OFFSET;
-       rp->power_unit = 1000000 / (1 << value);
+       rd->power_unit = 1000000 / (1 << value);
 
        value = (ra.value & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET;
-       rp->time_unit = 1000000 / (1 << value);
+       rd->time_unit = 1000000 / (1 << value);
 
-       pr_debug("Core CPU %s energy=%dpJ, time=%dus, power=%duW\n",
-                rp->name, rp->energy_unit, rp->time_unit, rp->power_unit);
+       pr_debug("Core CPU %s:%s energy=%dpJ, time=%dus, power=%duW\n",
+                rd->rp->name, rd->name, rd->energy_unit, rd->time_unit, rd->power_unit);
 
        return 0;
 }
 
-static int rapl_check_unit_atom(struct rapl_package *rp, int cpu)
+static int rapl_check_unit_atom(struct rapl_domain *rd)
 {
        struct reg_action ra;
        u32 value;
 
-       ra.reg = rp->priv->reg_unit;
+       ra.reg = rd->regs[RAPL_DOMAIN_REG_UNIT];
        ra.mask = ~0;
-       if (rp->priv->read_raw(cpu, &ra)) {
-               pr_err("Failed to read power unit REG 0x%llx on CPU %d, exit.\n",
-                      rp->priv->reg_unit, cpu);
+       if (rd->rp->priv->read_raw(get_rid(rd->rp), &ra)) {
+               pr_err("Failed to read power unit REG 0x%llx on %s:%s, exit.\n",
+                       ra.reg, rd->rp->name, rd->name);
                return -ENODEV;
        }
 
        value = (ra.value & ENERGY_UNIT_MASK) >> ENERGY_UNIT_OFFSET;
-       rp->energy_unit = ENERGY_UNIT_SCALE * 1 << value;
+       rd->energy_unit = ENERGY_UNIT_SCALE * 1 << value;
 
        value = (ra.value & POWER_UNIT_MASK) >> POWER_UNIT_OFFSET;
-       rp->power_unit = (1 << value) * 1000;
+       rd->power_unit = (1 << value) * 1000;
 
        value = (ra.value & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET;
-       rp->time_unit = 1000000 / (1 << value);
+       rd->time_unit = 1000000 / (1 << value);
 
-       pr_debug("Atom %s energy=%dpJ, time=%dus, power=%duW\n",
-                rp->name, rp->energy_unit, rp->time_unit, rp->power_unit);
+       pr_debug("Atom %s:%s energy=%dpJ, time=%dus, power=%duW\n",
+                rd->rp->name, rd->name, rd->energy_unit, rd->time_unit, rd->power_unit);
 
        return 0;
 }
@@ -910,6 +994,9 @@ static void power_limit_irq_save_cpu(void *info)
 
 static void package_power_limit_irq_save(struct rapl_package *rp)
 {
+       if (rp->lead_cpu < 0)
+               return;
+
        if (!boot_cpu_has(X86_FEATURE_PTS) || !boot_cpu_has(X86_FEATURE_PLN))
                return;
 
@@ -924,6 +1011,9 @@ static void package_power_limit_irq_restore(struct rapl_package *rp)
 {
        u32 l, h;
 
+       if (rp->lead_cpu < 0)
+               return;
+
        if (!boot_cpu_has(X86_FEATURE_PTS) || !boot_cpu_has(X86_FEATURE_PLN))
                return;
 
@@ -943,33 +1033,33 @@ static void package_power_limit_irq_restore(struct rapl_package *rp)
 
 static void set_floor_freq_default(struct rapl_domain *rd, bool mode)
 {
-       int nr_powerlimit = find_nr_power_limit(rd);
+       int i;
 
        /* always enable clamp such that p-state can go below OS requested
         * range. power capping priority over guranteed frequency.
         */
-       rapl_write_data_raw(rd, PL1_CLAMP, mode);
+       rapl_write_pl_data(rd, POWER_LIMIT1, PL_CLAMP, mode);
 
-       /* some domains have pl2 */
-       if (nr_powerlimit > 1) {
-               rapl_write_data_raw(rd, PL2_ENABLE, mode);
-               rapl_write_data_raw(rd, PL2_CLAMP, mode);
+       for (i = POWER_LIMIT2; i < NR_POWER_LIMITS; i++) {
+               rapl_write_pl_data(rd, i, PL_ENABLE, mode);
+               rapl_write_pl_data(rd, i, PL_CLAMP, mode);
        }
 }
 
 static void set_floor_freq_atom(struct rapl_domain *rd, bool enable)
 {
        static u32 power_ctrl_orig_val;
+       struct rapl_defaults *defaults = get_defaults(rd->rp);
        u32 mdata;
 
-       if (!rapl_defaults->floor_freq_reg_addr) {
+       if (!defaults->floor_freq_reg_addr) {
                pr_err("Invalid floor frequency config register\n");
                return;
        }
 
        if (!power_ctrl_orig_val)
                iosf_mbi_read(BT_MBI_UNIT_PMC, MBI_CR_READ,
-                             rapl_defaults->floor_freq_reg_addr,
+                             defaults->floor_freq_reg_addr,
                              &power_ctrl_orig_val);
        mdata = power_ctrl_orig_val;
        if (enable) {
@@ -977,10 +1067,10 @@ static void set_floor_freq_atom(struct rapl_domain *rd, bool enable)
                mdata |= 1 << 8;
        }
        iosf_mbi_write(BT_MBI_UNIT_PMC, MBI_CR_WRITE,
-                      rapl_defaults->floor_freq_reg_addr, mdata);
+                      defaults->floor_freq_reg_addr, mdata);
 }
 
-static u64 rapl_compute_time_window_core(struct rapl_package *rp, u64 value,
+static u64 rapl_compute_time_window_core(struct rapl_domain *rd, u64 value,
                                         bool to_raw)
 {
        u64 f, y;               /* fraction and exp. used for time unit */
@@ -992,12 +1082,12 @@ static u64 rapl_compute_time_window_core(struct rapl_package *rp, u64 value,
        if (!to_raw) {
                f = (value & 0x60) >> 5;
                y = value & 0x1f;
-               value = (1 << y) * (4 + f) * rp->time_unit / 4;
+               value = (1 << y) * (4 + f) * rd->time_unit / 4;
        } else {
-               if (value < rp->time_unit)
+               if (value < rd->time_unit)
                        return 0;
 
-               do_div(value, rp->time_unit);
+               do_div(value, rd->time_unit);
                y = ilog2(value);
 
                /*
@@ -1013,7 +1103,7 @@ static u64 rapl_compute_time_window_core(struct rapl_package *rp, u64 value,
        return value;
 }
 
-static u64 rapl_compute_time_window_atom(struct rapl_package *rp, u64 value,
+static u64 rapl_compute_time_window_atom(struct rapl_domain *rd, u64 value,
                                         bool to_raw)
 {
        /*
@@ -1021,13 +1111,56 @@ static u64 rapl_compute_time_window_atom(struct rapl_package *rp, u64 value,
         * where time_unit defaults to 1 sec. Never 0.
         */
        if (!to_raw)
-               return (value) ? value * rp->time_unit : rp->time_unit;
+               return (value) ? value * rd->time_unit : rd->time_unit;
 
-       value = div64_u64(value, rp->time_unit);
+       value = div64_u64(value, rd->time_unit);
 
        return value;
 }
 
+/* TPMI Unit register has different layout */
+#define TPMI_POWER_UNIT_OFFSET POWER_UNIT_OFFSET
+#define TPMI_POWER_UNIT_MASK   POWER_UNIT_MASK
+#define TPMI_ENERGY_UNIT_OFFSET        0x06
+#define TPMI_ENERGY_UNIT_MASK  0x7C0
+#define TPMI_TIME_UNIT_OFFSET  0x0C
+#define TPMI_TIME_UNIT_MASK    0xF000
+
+static int rapl_check_unit_tpmi(struct rapl_domain *rd)
+{
+       struct reg_action ra;
+       u32 value;
+
+       ra.reg = rd->regs[RAPL_DOMAIN_REG_UNIT];
+       ra.mask = ~0;
+       if (rd->rp->priv->read_raw(get_rid(rd->rp), &ra)) {
+               pr_err("Failed to read power unit REG 0x%llx on %s:%s, exit.\n",
+                       ra.reg, rd->rp->name, rd->name);
+               return -ENODEV;
+       }
+
+       value = (ra.value & TPMI_ENERGY_UNIT_MASK) >> TPMI_ENERGY_UNIT_OFFSET;
+       rd->energy_unit = ENERGY_UNIT_SCALE * 1000000 / (1 << value);
+
+       value = (ra.value & TPMI_POWER_UNIT_MASK) >> TPMI_POWER_UNIT_OFFSET;
+       rd->power_unit = 1000000 / (1 << value);
+
+       value = (ra.value & TPMI_TIME_UNIT_MASK) >> TPMI_TIME_UNIT_OFFSET;
+       rd->time_unit = 1000000 / (1 << value);
+
+       pr_debug("Core CPU %s:%s energy=%dpJ, time=%dus, power=%duW\n",
+                rd->rp->name, rd->name, rd->energy_unit, rd->time_unit, rd->power_unit);
+
+       return 0;
+}
+
+static const struct rapl_defaults defaults_tpmi = {
+       .check_unit = rapl_check_unit_tpmi,
+       /* Reuse existing logic, ignore the PL_CLAMP failures and enable all Power Limits */
+       .set_floor_freq = set_floor_freq_default,
+       .compute_time_window = rapl_compute_time_window_core,
+};
+
 static const struct rapl_defaults rapl_defaults_core = {
        .floor_freq_reg_addr = 0,
        .check_unit = rapl_check_unit_core,
@@ -1159,8 +1292,10 @@ static void rapl_update_domain_data(struct rapl_package *rp)
                         rp->domains[dmn].name);
                /* exclude non-raw primitives */
                for (prim = 0; prim < NR_RAW_PRIMITIVES; prim++) {
+                       struct rapl_primitive_info *rpi = get_rpi(rp, prim);
+
                        if (!rapl_read_data_raw(&rp->domains[dmn], prim,
-                                               rpi[prim].unit, &val))
+                                               rpi->unit, &val))
                                rp->domains[dmn].rdd.primitives[prim] = val;
                }
        }
@@ -1239,7 +1374,7 @@ err_cleanup:
        return ret;
 }
 
-static int rapl_check_domain(int cpu, int domain, struct rapl_package *rp)
+static int rapl_check_domain(int domain, struct rapl_package *rp)
 {
        struct reg_action ra;
 
@@ -1260,9 +1395,43 @@ static int rapl_check_domain(int cpu, int domain, struct rapl_package *rp)
         */
 
        ra.mask = ENERGY_STATUS_MASK;
-       if (rp->priv->read_raw(cpu, &ra) || !ra.value)
+       if (rp->priv->read_raw(get_rid(rp), &ra) || !ra.value)
+               return -ENODEV;
+
+       return 0;
+}
+
+/*
+ * Get per domain energy/power/time unit.
+ * RAPL Interfaces without per domain unit register will use the package
+ * scope unit register to set per domain units.
+ */
+static int rapl_get_domain_unit(struct rapl_domain *rd)
+{
+       struct rapl_defaults *defaults = get_defaults(rd->rp);
+       int ret;
+
+       if (!rd->regs[RAPL_DOMAIN_REG_UNIT]) {
+               if (!rd->rp->priv->reg_unit) {
+                       pr_err("No valid Unit register found\n");
+                       return -ENODEV;
+               }
+               rd->regs[RAPL_DOMAIN_REG_UNIT] = rd->rp->priv->reg_unit;
+       }
+
+       if (!defaults->check_unit) {
+               pr_err("missing .check_unit() callback\n");
                return -ENODEV;
+       }
+
+       ret = defaults->check_unit(rd);
+       if (ret)
+               return ret;
 
+       if (rd->id == RAPL_DOMAIN_DRAM && defaults->dram_domain_energy_unit)
+               rd->energy_unit = defaults->dram_domain_energy_unit;
+       if (rd->id == RAPL_DOMAIN_PLATFORM && defaults->psys_domain_energy_unit)
+               rd->energy_unit = defaults->psys_domain_energy_unit;
        return 0;
 }
 
@@ -1280,19 +1449,16 @@ static void rapl_detect_powerlimit(struct rapl_domain *rd)
        u64 val64;
        int i;
 
-       /* check if the domain is locked by BIOS, ignore if MSR doesn't exist */
-       if (!rapl_read_data_raw(rd, FW_LOCK, false, &val64)) {
-               if (val64) {
-                       pr_info("RAPL %s domain %s locked by BIOS\n",
-                               rd->rp->name, rd->name);
-                       rd->state |= DOMAIN_STATE_BIOS_LOCKED;
+       for (i = POWER_LIMIT1; i < NR_POWER_LIMITS; i++) {
+               if (!rapl_read_pl_data(rd, i, PL_LOCK, false, &val64)) {
+                       if (val64) {
+                               rd->rpl[i].locked = true;
+                               pr_info("%s:%s:%s locked by BIOS\n",
+                                       rd->rp->name, rd->name, pl_names[i]);
+                       }
                }
-       }
-       /* check if power limit MSR exists, otherwise domain is monitoring only */
-       for (i = 0; i < NR_POWER_LIMITS; i++) {
-               int prim = rd->rpl[i].prim_id;
 
-               if (rapl_read_data_raw(rd, prim, false, &val64))
+               if (rapl_read_pl_data(rd, i, PL_ENABLE, false, &val64))
                        rd->rpl[i].name = NULL;
        }
 }
@@ -1300,14 +1466,14 @@ static void rapl_detect_powerlimit(struct rapl_domain *rd)
 /* Detect active and valid domains for the given CPU, caller must
  * ensure the CPU belongs to the targeted package and CPU hotplug is disabled.
  */
-static int rapl_detect_domains(struct rapl_package *rp, int cpu)
+static int rapl_detect_domains(struct rapl_package *rp)
 {
        struct rapl_domain *rd;
        int i;
 
        for (i = 0; i < RAPL_DOMAIN_MAX; i++) {
                /* use physical package id to read counters */
-               if (!rapl_check_domain(cpu, i, rp)) {
+               if (!rapl_check_domain(i, rp)) {
                        rp->domain_map |= 1 << i;
                        pr_info("Found RAPL domain %s\n", rapl_domain_names[i]);
                }
@@ -1326,8 +1492,10 @@ static int rapl_detect_domains(struct rapl_package *rp, int cpu)
 
        rapl_init_domains(rp);
 
-       for (rd = rp->domains; rd < rp->domains + rp->nr_domains; rd++)
+       for (rd = rp->domains; rd < rp->domains + rp->nr_domains; rd++) {
+               rapl_get_domain_unit(rd);
                rapl_detect_powerlimit(rd);
+       }
 
        return 0;
 }
@@ -1340,13 +1508,13 @@ void rapl_remove_package(struct rapl_package *rp)
        package_power_limit_irq_restore(rp);
 
        for (rd = rp->domains; rd < rp->domains + rp->nr_domains; rd++) {
-               rapl_write_data_raw(rd, PL1_ENABLE, 0);
-               rapl_write_data_raw(rd, PL1_CLAMP, 0);
-               if (find_nr_power_limit(rd) > 1) {
-                       rapl_write_data_raw(rd, PL2_ENABLE, 0);
-                       rapl_write_data_raw(rd, PL2_CLAMP, 0);
-                       rapl_write_data_raw(rd, PL4_ENABLE, 0);
+               int i;
+
+               for (i = POWER_LIMIT1; i < NR_POWER_LIMITS; i++) {
+                       rapl_write_pl_data(rd, i, PL_ENABLE, 0);
+                       rapl_write_pl_data(rd, i, PL_CLAMP, 0);
                }
+
                if (rd->id == RAPL_DOMAIN_PACKAGE) {
                        rd_package = rd;
                        continue;
@@ -1365,13 +1533,18 @@ void rapl_remove_package(struct rapl_package *rp)
 EXPORT_SYMBOL_GPL(rapl_remove_package);
 
 /* caller to ensure CPU hotplug lock is held */
-struct rapl_package *rapl_find_package_domain(int cpu, struct rapl_if_priv *priv)
+struct rapl_package *rapl_find_package_domain(int id, struct rapl_if_priv *priv, bool id_is_cpu)
 {
-       int id = topology_logical_die_id(cpu);
        struct rapl_package *rp;
+       int uid;
+
+       if (id_is_cpu)
+               uid = topology_logical_die_id(id);
+       else
+               uid = id;
 
        list_for_each_entry(rp, &rapl_packages, plist) {
-               if (rp->id == id
+               if (rp->id == uid
                    && rp->priv->control_type == priv->control_type)
                        return rp;
        }
@@ -1381,34 +1554,37 @@ struct rapl_package *rapl_find_package_domain(int cpu, struct rapl_if_priv *priv
 EXPORT_SYMBOL_GPL(rapl_find_package_domain);
 
 /* called from CPU hotplug notifier, hotplug lock held */
-struct rapl_package *rapl_add_package(int cpu, struct rapl_if_priv *priv)
+struct rapl_package *rapl_add_package(int id, struct rapl_if_priv *priv, bool id_is_cpu)
 {
-       int id = topology_logical_die_id(cpu);
        struct rapl_package *rp;
        int ret;
 
-       if (!rapl_defaults)
-               return ERR_PTR(-ENODEV);
-
        rp = kzalloc(sizeof(struct rapl_package), GFP_KERNEL);
        if (!rp)
                return ERR_PTR(-ENOMEM);
 
-       /* add the new package to the list */
-       rp->id = id;
-       rp->lead_cpu = cpu;
-       rp->priv = priv;
+       if (id_is_cpu) {
+               rp->id = topology_logical_die_id(id);
+               rp->lead_cpu = id;
+               if (topology_max_die_per_package() > 1)
+                       snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH, "package-%d-die-%d",
+                                topology_physical_package_id(id), topology_die_id(id));
+               else
+                       snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH, "package-%d",
+                                topology_physical_package_id(id));
+       } else {
+               rp->id = id;
+               rp->lead_cpu = -1;
+               snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH, "package-%d", id);
+       }
 
-       if (topology_max_die_per_package() > 1)
-               snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH,
-                        "package-%d-die-%d",
-                        topology_physical_package_id(cpu), topology_die_id(cpu));
-       else
-               snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH, "package-%d",
-                        topology_physical_package_id(cpu));
+       rp->priv = priv;
+       ret = rapl_config(rp);
+       if (ret)
+               goto err_free_package;
 
        /* check if the package contains valid domains */
-       if (rapl_detect_domains(rp, cpu) || rapl_defaults->check_unit(rp, cpu)) {
+       if (rapl_detect_domains(rp)) {
                ret = -ENODEV;
                goto err_free_package;
        }
@@ -1430,38 +1606,18 @@ static void power_limit_state_save(void)
 {
        struct rapl_package *rp;
        struct rapl_domain *rd;
-       int nr_pl, ret, i;
+       int ret, i;
 
        cpus_read_lock();
        list_for_each_entry(rp, &rapl_packages, plist) {
                if (!rp->power_zone)
                        continue;
                rd = power_zone_to_rapl_domain(rp->power_zone);
-               nr_pl = find_nr_power_limit(rd);
-               for (i = 0; i < nr_pl; i++) {
-                       switch (rd->rpl[i].prim_id) {
-                       case PL1_ENABLE:
-                               ret = rapl_read_data_raw(rd,
-                                                POWER_LIMIT1, true,
-                                                &rd->rpl[i].last_power_limit);
-                               if (ret)
-                                       rd->rpl[i].last_power_limit = 0;
-                               break;
-                       case PL2_ENABLE:
-                               ret = rapl_read_data_raw(rd,
-                                                POWER_LIMIT2, true,
+               for (i = POWER_LIMIT1; i < NR_POWER_LIMITS; i++) {
+                       ret = rapl_read_pl_data(rd, i, PL_LIMIT, true,
                                                 &rd->rpl[i].last_power_limit);
-                               if (ret)
-                                       rd->rpl[i].last_power_limit = 0;
-                               break;
-                       case PL4_ENABLE:
-                               ret = rapl_read_data_raw(rd,
-                                                POWER_LIMIT4, true,
-                                                &rd->rpl[i].last_power_limit);
-                               if (ret)
-                                       rd->rpl[i].last_power_limit = 0;
-                               break;
-                       }
+                       if (ret)
+                               rd->rpl[i].last_power_limit = 0;
                }
        }
        cpus_read_unlock();
@@ -1471,33 +1627,17 @@ static void power_limit_state_restore(void)
 {
        struct rapl_package *rp;
        struct rapl_domain *rd;
-       int nr_pl, i;
+       int i;
 
        cpus_read_lock();
        list_for_each_entry(rp, &rapl_packages, plist) {
                if (!rp->power_zone)
                        continue;
                rd = power_zone_to_rapl_domain(rp->power_zone);
-               nr_pl = find_nr_power_limit(rd);
-               for (i = 0; i < nr_pl; i++) {
-                       switch (rd->rpl[i].prim_id) {
-                       case PL1_ENABLE:
-                               if (rd->rpl[i].last_power_limit)
-                                       rapl_write_data_raw(rd, POWER_LIMIT1,
-                                           rd->rpl[i].last_power_limit);
-                               break;
-                       case PL2_ENABLE:
-                               if (rd->rpl[i].last_power_limit)
-                                       rapl_write_data_raw(rd, POWER_LIMIT2,
-                                           rd->rpl[i].last_power_limit);
-                               break;
-                       case PL4_ENABLE:
-                               if (rd->rpl[i].last_power_limit)
-                                       rapl_write_data_raw(rd, POWER_LIMIT4,
-                                           rd->rpl[i].last_power_limit);
-                               break;
-                       }
-               }
+               for (i = POWER_LIMIT1; i < NR_POWER_LIMITS; i++)
+                       if (rd->rpl[i].last_power_limit)
+                               rapl_write_pl_data(rd, i, PL_LIMIT,
+                                              rd->rpl[i].last_power_limit);
        }
        cpus_read_unlock();
 }
@@ -1528,32 +1668,25 @@ static int __init rapl_init(void)
        int ret;
 
        id = x86_match_cpu(rapl_ids);
-       if (!id) {
-               pr_err("driver does not support CPU family %d model %d\n",
-                      boot_cpu_data.x86, boot_cpu_data.x86_model);
-
-               return -ENODEV;
-       }
+       if (id) {
+               defaults_msr = (struct rapl_defaults *)id->driver_data;
 
-       rapl_defaults = (struct rapl_defaults *)id->driver_data;
-
-       ret = register_pm_notifier(&rapl_pm_notifier);
-       if (ret)
-               return ret;
+               rapl_msr_platdev = platform_device_alloc("intel_rapl_msr", 0);
+               if (!rapl_msr_platdev)
+                       return -ENOMEM;
 
-       rapl_msr_platdev = platform_device_alloc("intel_rapl_msr", 0);
-       if (!rapl_msr_platdev) {
-               ret = -ENOMEM;
-               goto end;
+               ret = platform_device_add(rapl_msr_platdev);
+               if (ret) {
+                       platform_device_put(rapl_msr_platdev);
+                       return ret;
+               }
        }
 
-       ret = platform_device_add(rapl_msr_platdev);
-       if (ret)
+       ret = register_pm_notifier(&rapl_pm_notifier);
+       if (ret && rapl_msr_platdev) {
+               platform_device_del(rapl_msr_platdev);
                platform_device_put(rapl_msr_platdev);
-
-end:
-       if (ret)
-               unregister_pm_notifier(&rapl_pm_notifier);
+       }
 
        return ret;
 }
index a276737..569e25e 100644 (file)
@@ -22,7 +22,6 @@
 #include <linux/processor.h>
 #include <linux/platform_device.h>
 
-#include <asm/iosf_mbi.h>
 #include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 
@@ -34,6 +33,7 @@
 static struct rapl_if_priv *rapl_msr_priv;
 
 static struct rapl_if_priv rapl_msr_priv_intel = {
+       .type = RAPL_IF_MSR,
        .reg_unit = MSR_RAPL_POWER_UNIT,
        .regs[RAPL_DOMAIN_PACKAGE] = {
                MSR_PKG_POWER_LIMIT, MSR_PKG_ENERGY_STATUS, MSR_PKG_PERF_STATUS, 0, MSR_PKG_POWER_INFO },
@@ -45,11 +45,12 @@ static struct rapl_if_priv rapl_msr_priv_intel = {
                MSR_DRAM_POWER_LIMIT, MSR_DRAM_ENERGY_STATUS, MSR_DRAM_PERF_STATUS, 0, MSR_DRAM_POWER_INFO },
        .regs[RAPL_DOMAIN_PLATFORM] = {
                MSR_PLATFORM_POWER_LIMIT, MSR_PLATFORM_ENERGY_STATUS, 0, 0, 0},
-       .limits[RAPL_DOMAIN_PACKAGE] = 2,
-       .limits[RAPL_DOMAIN_PLATFORM] = 2,
+       .limits[RAPL_DOMAIN_PACKAGE] = BIT(POWER_LIMIT2),
+       .limits[RAPL_DOMAIN_PLATFORM] = BIT(POWER_LIMIT2),
 };
 
 static struct rapl_if_priv rapl_msr_priv_amd = {
+       .type = RAPL_IF_MSR,
        .reg_unit = MSR_AMD_RAPL_POWER_UNIT,
        .regs[RAPL_DOMAIN_PACKAGE] = {
                0, MSR_AMD_PKG_ENERGY_STATUS, 0, 0, 0 },
@@ -68,9 +69,9 @@ static int rapl_cpu_online(unsigned int cpu)
 {
        struct rapl_package *rp;
 
-       rp = rapl_find_package_domain(cpu, rapl_msr_priv);
+       rp = rapl_find_package_domain(cpu, rapl_msr_priv, true);
        if (!rp) {
-               rp = rapl_add_package(cpu, rapl_msr_priv);
+               rp = rapl_add_package(cpu, rapl_msr_priv, true);
                if (IS_ERR(rp))
                        return PTR_ERR(rp);
        }
@@ -83,7 +84,7 @@ static int rapl_cpu_down_prep(unsigned int cpu)
        struct rapl_package *rp;
        int lead_cpu;
 
-       rp = rapl_find_package_domain(cpu, rapl_msr_priv);
+       rp = rapl_find_package_domain(cpu, rapl_msr_priv, true);
        if (!rp)
                return 0;
 
@@ -137,14 +138,14 @@ static int rapl_msr_write_raw(int cpu, struct reg_action *ra)
 
 /* List of verified CPUs. */
 static const struct x86_cpu_id pl4_support_ids[] = {
-       { X86_VENDOR_INTEL, 6, INTEL_FAM6_TIGERLAKE_L, X86_FEATURE_ANY },
-       { X86_VENDOR_INTEL, 6, INTEL_FAM6_ALDERLAKE, X86_FEATURE_ANY },
-       { X86_VENDOR_INTEL, 6, INTEL_FAM6_ALDERLAKE_L, X86_FEATURE_ANY },
-       { X86_VENDOR_INTEL, 6, INTEL_FAM6_ALDERLAKE_N, X86_FEATURE_ANY },
-       { X86_VENDOR_INTEL, 6, INTEL_FAM6_RAPTORLAKE, X86_FEATURE_ANY },
-       { X86_VENDOR_INTEL, 6, INTEL_FAM6_RAPTORLAKE_P, X86_FEATURE_ANY },
-       { X86_VENDOR_INTEL, 6, INTEL_FAM6_METEORLAKE, X86_FEATURE_ANY },
-       { X86_VENDOR_INTEL, 6, INTEL_FAM6_METEORLAKE_L, X86_FEATURE_ANY },
+       X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE_L, NULL),
+       X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE, NULL),
+       X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE_L, NULL),
+       X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE_N, NULL),
+       X86_MATCH_INTEL_FAM6_MODEL(RAPTORLAKE, NULL),
+       X86_MATCH_INTEL_FAM6_MODEL(RAPTORLAKE_P, NULL),
+       X86_MATCH_INTEL_FAM6_MODEL(METEORLAKE, NULL),
+       X86_MATCH_INTEL_FAM6_MODEL(METEORLAKE_L, NULL),
        {}
 };
 
@@ -169,7 +170,7 @@ static int rapl_msr_probe(struct platform_device *pdev)
        rapl_msr_priv->write_raw = rapl_msr_write_raw;
 
        if (id) {
-               rapl_msr_priv->limits[RAPL_DOMAIN_PACKAGE] = 3;
+               rapl_msr_priv->limits[RAPL_DOMAIN_PACKAGE] |= BIT(POWER_LIMIT4);
                rapl_msr_priv->regs[RAPL_DOMAIN_PACKAGE][RAPL_DOMAIN_REG_PL4] =
                        MSR_VR_CURRENT_CONFIG;
                pr_info("PL4 support detected.\n");
diff --git a/drivers/powercap/intel_rapl_tpmi.c b/drivers/powercap/intel_rapl_tpmi.c
new file mode 100644 (file)
index 0000000..4f4f13d
--- /dev/null
@@ -0,0 +1,325 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * intel_rapl_tpmi: Intel RAPL driver via TPMI interface
+ *
+ * Copyright (c) 2023, Intel Corporation.
+ * All Rights Reserved.
+ *
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/auxiliary_bus.h>
+#include <linux/io.h>
+#include <linux/intel_tpmi.h>
+#include <linux/intel_rapl.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+#define TPMI_RAPL_VERSION 1
+
+/* 1 header + 10 registers + 5 reserved. 8 bytes for each. */
+#define TPMI_RAPL_DOMAIN_SIZE 128
+
+enum tpmi_rapl_domain_type {
+       TPMI_RAPL_DOMAIN_INVALID,
+       TPMI_RAPL_DOMAIN_SYSTEM,
+       TPMI_RAPL_DOMAIN_PACKAGE,
+       TPMI_RAPL_DOMAIN_RESERVED,
+       TPMI_RAPL_DOMAIN_MEMORY,
+       TPMI_RAPL_DOMAIN_MAX,
+};
+
+enum tpmi_rapl_register {
+       TPMI_RAPL_REG_HEADER,
+       TPMI_RAPL_REG_UNIT,
+       TPMI_RAPL_REG_PL1,
+       TPMI_RAPL_REG_PL2,
+       TPMI_RAPL_REG_PL3,
+       TPMI_RAPL_REG_PL4,
+       TPMI_RAPL_REG_RESERVED,
+       TPMI_RAPL_REG_ENERGY_STATUS,
+       TPMI_RAPL_REG_PERF_STATUS,
+       TPMI_RAPL_REG_POWER_INFO,
+       TPMI_RAPL_REG_INTERRUPT,
+       TPMI_RAPL_REG_MAX = 15,
+};
+
+struct tpmi_rapl_package {
+       struct rapl_if_priv priv;
+       struct intel_tpmi_plat_info *tpmi_info;
+       struct rapl_package *rp;
+       void __iomem *base;
+       struct list_head node;
+};
+
+static LIST_HEAD(tpmi_rapl_packages);
+static DEFINE_MUTEX(tpmi_rapl_lock);
+
+static struct powercap_control_type *tpmi_control_type;
+
+static int tpmi_rapl_read_raw(int id, struct reg_action *ra)
+{
+       if (!ra->reg)
+               return -EINVAL;
+
+       ra->value = readq((void __iomem *)ra->reg);
+
+       ra->value &= ra->mask;
+       return 0;
+}
+
+static int tpmi_rapl_write_raw(int id, struct reg_action *ra)
+{
+       u64 val;
+
+       if (!ra->reg)
+               return -EINVAL;
+
+       val = readq((void __iomem *)ra->reg);
+
+       val &= ~ra->mask;
+       val |= ra->value;
+
+       writeq(val, (void __iomem *)ra->reg);
+       return 0;
+}
+
+static struct tpmi_rapl_package *trp_alloc(int pkg_id)
+{
+       struct tpmi_rapl_package *trp;
+       int ret;
+
+       mutex_lock(&tpmi_rapl_lock);
+
+       if (list_empty(&tpmi_rapl_packages)) {
+               tpmi_control_type = powercap_register_control_type(NULL, "intel-rapl", NULL);
+               if (IS_ERR(tpmi_control_type)) {
+                       ret = PTR_ERR(tpmi_control_type);
+                       goto err_unlock;
+               }
+       }
+
+       trp = kzalloc(sizeof(*trp), GFP_KERNEL);
+       if (!trp) {
+               ret = -ENOMEM;
+               goto err_del_powercap;
+       }
+
+       list_add(&trp->node, &tpmi_rapl_packages);
+
+       mutex_unlock(&tpmi_rapl_lock);
+       return trp;
+
+err_del_powercap:
+       if (list_empty(&tpmi_rapl_packages))
+               powercap_unregister_control_type(tpmi_control_type);
+err_unlock:
+       mutex_unlock(&tpmi_rapl_lock);
+       return ERR_PTR(ret);
+}
+
+static void trp_release(struct tpmi_rapl_package *trp)
+{
+       mutex_lock(&tpmi_rapl_lock);
+       list_del(&trp->node);
+
+       if (list_empty(&tpmi_rapl_packages))
+               powercap_unregister_control_type(tpmi_control_type);
+
+       kfree(trp);
+       mutex_unlock(&tpmi_rapl_lock);
+}
+
+static int parse_one_domain(struct tpmi_rapl_package *trp, u32 offset)
+{
+       u8 tpmi_domain_version;
+       enum rapl_domain_type domain_type;
+       enum tpmi_rapl_domain_type tpmi_domain_type;
+       enum tpmi_rapl_register reg_index;
+       enum rapl_domain_reg_id reg_id;
+       int tpmi_domain_size, tpmi_domain_flags;
+       u64 *tpmi_rapl_regs = trp->base + offset;
+       u64 tpmi_domain_header = readq((void __iomem *)tpmi_rapl_regs);
+
+       /* Domain Parent bits are ignored for now */
+       tpmi_domain_version = tpmi_domain_header & 0xff;
+       tpmi_domain_type = tpmi_domain_header >> 8 & 0xff;
+       tpmi_domain_size = tpmi_domain_header >> 16 & 0xff;
+       tpmi_domain_flags = tpmi_domain_header >> 32 & 0xffff;
+
+       if (tpmi_domain_version != TPMI_RAPL_VERSION) {
+               pr_warn(FW_BUG "Unsupported version:%d\n", tpmi_domain_version);
+               return -ENODEV;
+       }
+
+       /* Domain size: in unit of 128 Bytes */
+       if (tpmi_domain_size != 1) {
+               pr_warn(FW_BUG "Invalid Domain size %d\n", tpmi_domain_size);
+               return -EINVAL;
+       }
+
+       /* Unit register and Energy Status register are mandatory for each domain */
+       if (!(tpmi_domain_flags & BIT(TPMI_RAPL_REG_UNIT)) ||
+           !(tpmi_domain_flags & BIT(TPMI_RAPL_REG_ENERGY_STATUS))) {
+               pr_warn(FW_BUG "Invalid Domain flag 0x%x\n", tpmi_domain_flags);
+               return -EINVAL;
+       }
+
+       switch (tpmi_domain_type) {
+       case TPMI_RAPL_DOMAIN_PACKAGE:
+               domain_type = RAPL_DOMAIN_PACKAGE;
+               break;
+       case TPMI_RAPL_DOMAIN_SYSTEM:
+               domain_type = RAPL_DOMAIN_PLATFORM;
+               break;
+       case TPMI_RAPL_DOMAIN_MEMORY:
+               domain_type = RAPL_DOMAIN_DRAM;
+               break;
+       default:
+               pr_warn(FW_BUG "Unsupported Domain type %d\n", tpmi_domain_type);
+               return -EINVAL;
+       }
+
+       if (trp->priv.regs[domain_type][RAPL_DOMAIN_REG_UNIT]) {
+               pr_warn(FW_BUG "Duplicate Domain type %d\n", tpmi_domain_type);
+               return -EINVAL;
+       }
+
+       reg_index = TPMI_RAPL_REG_HEADER;
+       while (++reg_index != TPMI_RAPL_REG_MAX) {
+               if (!(tpmi_domain_flags & BIT(reg_index)))
+                       continue;
+
+               switch (reg_index) {
+               case TPMI_RAPL_REG_UNIT:
+                       reg_id = RAPL_DOMAIN_REG_UNIT;
+                       break;
+               case TPMI_RAPL_REG_PL1:
+                       reg_id = RAPL_DOMAIN_REG_LIMIT;
+                       trp->priv.limits[domain_type] |= BIT(POWER_LIMIT1);
+                       break;
+               case TPMI_RAPL_REG_PL2:
+                       reg_id = RAPL_DOMAIN_REG_PL2;
+                       trp->priv.limits[domain_type] |= BIT(POWER_LIMIT2);
+                       break;
+               case TPMI_RAPL_REG_PL4:
+                       reg_id = RAPL_DOMAIN_REG_PL4;
+                       trp->priv.limits[domain_type] |= BIT(POWER_LIMIT4);
+                       break;
+               case TPMI_RAPL_REG_ENERGY_STATUS:
+                       reg_id = RAPL_DOMAIN_REG_STATUS;
+                       break;
+               case TPMI_RAPL_REG_PERF_STATUS:
+                       reg_id = RAPL_DOMAIN_REG_PERF;
+                       break;
+               case TPMI_RAPL_REG_POWER_INFO:
+                       reg_id = RAPL_DOMAIN_REG_INFO;
+                       break;
+               default:
+                       continue;
+               }
+               trp->priv.regs[domain_type][reg_id] = (u64)&tpmi_rapl_regs[reg_index];
+       }
+
+       return 0;
+}
+
+static int intel_rapl_tpmi_probe(struct auxiliary_device *auxdev,
+                                const struct auxiliary_device_id *id)
+{
+       struct tpmi_rapl_package *trp;
+       struct intel_tpmi_plat_info *info;
+       struct resource *res;
+       u32 offset;
+       int ret;
+
+       info = tpmi_get_platform_data(auxdev);
+       if (!info)
+               return -ENODEV;
+
+       trp = trp_alloc(info->package_id);
+       if (IS_ERR(trp))
+               return PTR_ERR(trp);
+
+       if (tpmi_get_resource_count(auxdev) > 1) {
+               dev_err(&auxdev->dev, "does not support multiple resources\n");
+               ret = -EINVAL;
+               goto err;
+       }
+
+       res = tpmi_get_resource_at_index(auxdev, 0);
+       if (!res) {
+               dev_err(&auxdev->dev, "can't fetch device resource info\n");
+               ret = -EIO;
+               goto err;
+       }
+
+       trp->base = devm_ioremap_resource(&auxdev->dev, res);
+       if (IS_ERR(trp->base)) {
+               ret = PTR_ERR(trp->base);
+               goto err;
+       }
+
+       for (offset = 0; offset < resource_size(res); offset += TPMI_RAPL_DOMAIN_SIZE) {
+               ret = parse_one_domain(trp, offset);
+               if (ret)
+                       goto err;
+       }
+
+       trp->tpmi_info = info;
+       trp->priv.type = RAPL_IF_TPMI;
+       trp->priv.read_raw = tpmi_rapl_read_raw;
+       trp->priv.write_raw = tpmi_rapl_write_raw;
+       trp->priv.control_type = tpmi_control_type;
+
+       /* RAPL TPMI I/F is per physical package */
+       trp->rp = rapl_find_package_domain(info->package_id, &trp->priv, false);
+       if (trp->rp) {
+               dev_err(&auxdev->dev, "Domain for Package%d already exists\n", info->package_id);
+               ret = -EEXIST;
+               goto err;
+       }
+
+       trp->rp = rapl_add_package(info->package_id, &trp->priv, false);
+       if (IS_ERR(trp->rp)) {
+               dev_err(&auxdev->dev, "Failed to add RAPL Domain for Package%d, %ld\n",
+                       info->package_id, PTR_ERR(trp->rp));
+               ret = PTR_ERR(trp->rp);
+               goto err;
+       }
+
+       auxiliary_set_drvdata(auxdev, trp);
+
+       return 0;
+err:
+       trp_release(trp);
+       return ret;
+}
+
+static void intel_rapl_tpmi_remove(struct auxiliary_device *auxdev)
+{
+       struct tpmi_rapl_package *trp = auxiliary_get_drvdata(auxdev);
+
+       rapl_remove_package(trp->rp);
+       trp_release(trp);
+}
+
+static const struct auxiliary_device_id intel_rapl_tpmi_ids[] = {
+       {.name = "intel_vsec.tpmi-rapl" },
+       { }
+};
+
+MODULE_DEVICE_TABLE(auxiliary, intel_rapl_tpmi_ids);
+
+static struct auxiliary_driver intel_rapl_tpmi_driver = {
+       .probe = intel_rapl_tpmi_probe,
+       .remove = intel_rapl_tpmi_remove,
+       .id_table = intel_rapl_tpmi_ids,
+};
+
+module_auxiliary_driver(intel_rapl_tpmi_driver)
+
+MODULE_IMPORT_NS(INTEL_TPMI);
+
+MODULE_DESCRIPTION("Intel RAPL TPMI Driver");
+MODULE_LICENSE("GPL");
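
For illustration only (not part of the patch): with the reworked interface, a RAPL interface driver that is not indexed by CPU, such as this TPMI driver, registers its package by passing its own package id with id_is_cpu set to false, mirroring intel_rapl_tpmi_probe() above. A minimal sketch, with error handling trimmed and names purely illustrative:

	static int example_register_rapl_package(int package_id, struct rapl_if_priv *priv)
	{
		struct rapl_package *rp;

		/* refuse duplicate registration for the same physical package */
		rp = rapl_find_package_domain(package_id, priv, false);
		if (rp)
			return -EEXIST;

		/* domains, per-domain units and power limits are detected internally */
		rp = rapl_add_package(package_id, priv, false);
		if (IS_ERR(rp))
			return PTR_ERR(rp);

		return 0;
	}
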
index 0c567d9..5f7d286 100644 (file)
@@ -6,7 +6,7 @@
  *              Bo Shen <voice.shen@atmel.com>
  *
  * Links to reference manuals for the supported PWM chips can be found in
- * Documentation/arm/microchip.rst.
+ * Documentation/arch/arm/microchip.rst.
  *
  * Limitations:
  * - Periods start with the inactive level.
index 46ed668..762429d 100644 (file)
@@ -8,7 +8,7 @@
  *             eric miao <eric.miao@marvell.com>
  *
  * Links to reference manuals for some of the supported PWM chips can be found
- * in Documentation/arm/marvell.rst.
+ * in Documentation/arch/arm/marvell.rst.
  *
  * Limitations:
  * - When PWM is stopped, the current PWM period stops abruptly at the next
index f0a6391..ffb973c 100644 (file)
@@ -46,7 +46,7 @@ int __init ras_add_daemon_trace(void)
 
        fentry = debugfs_create_file("daemon_active", S_IRUSR, ras_debugfs_dir,
                                     NULL, &trace_fops);
-       if (!fentry)
+       if (IS_ERR(fentry))
                return -ENODEV;
 
        return 0;
index 74275b6..e6598e7 100644 (file)
@@ -104,7 +104,7 @@ static struct i2c_driver pg86x_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(pg86x_dt_ids),
        },
-       .probe_new = pg86x_i2c_probe,
+       .probe = pg86x_i2c_probe,
        .id_table = pg86x_i2c_id,
 };
 
index e5f3613..2c24050 100644 (file)
@@ -1033,6 +1033,13 @@ config REGULATOR_QCOM_USB_VBUS
          Say M here if you want to include support for enabling the VBUS output
          as a module. The module will be named "qcom_usb_vbus_regulator".
 
+config REGULATOR_RAA215300
+       tristate "Renesas RAA215300 driver"
+       select REGMAP_I2C
+       depends on I2C
+       help
+         Support for the Renesas RAA215300 PMIC.
+
 config REGULATOR_RASPBERRYPI_TOUCHSCREEN_ATTINY
        tristate "Raspberry Pi 7-inch touchscreen panel ATTINY regulator"
        depends on BACKLIGHT_CLASS_DEVICE
@@ -1056,7 +1063,7 @@ config REGULATOR_RC5T583
 
 config REGULATOR_RK808
        tristate "Rockchip RK805/RK808/RK809/RK817/RK818 Power regulators"
-       depends on MFD_RK808
+       depends on MFD_RK8XX
        help
          Select this option to enable the power regulator of ROCKCHIP
          PMIC RK805,RK809&RK817,RK808 and RK818.
@@ -1397,6 +1404,17 @@ config REGULATOR_TPS6286X
          high-frequency synchronous step-down converters with an I2C
          interface.
 
+config REGULATOR_TPS6287X
+       tristate "TI TPS6287x Power Regulator"
+       depends on I2C && OF
+       select REGMAP_I2C
+       help
+         This driver supports TPS6287x voltage regulator chips. These are
+         pin-to-pin high-frequency synchronous step-down dc-dc converters
+         with an I2C interface.
+
+         If built as a module it will be called tps6287x-regulator.
+
 config REGULATOR_TPS65023
        tristate "TI TPS65023 Power regulators"
        depends on I2C
@@ -1463,6 +1481,19 @@ config REGULATOR_TPS65219
          voltage regulators. It supports software based voltage control
          for different voltage domains.
 
+config REGULATOR_TPS6594
+       tristate "TI TPS6594 Power regulators"
+       depends on MFD_TPS6594 && OF
+       default MFD_TPS6594
+       help
+         This driver supports TPS6594 voltage regulator chips.
+         The TPS6594 series of PMICs has 5 BUCK and 4 LDO
+         voltage regulators.
+         BUCKs 1,2,3,4 can be used in single-phase or multiphase mode.
+         The part number defines which mode is used.
+         It supports software-based voltage control
+         for different voltage domains.
+
 config REGULATOR_TPS6524X
        tristate "TI TPS6524X Power regulators"
        depends on SPI
index 58dfe01..ebfa753 100644 (file)
@@ -124,6 +124,7 @@ obj-$(CONFIG_REGULATOR_TPS51632) += tps51632-regulator.o
 obj-$(CONFIG_REGULATOR_PBIAS) += pbias-regulator.o
 obj-$(CONFIG_REGULATOR_PCAP) += pcap-regulator.o
 obj-$(CONFIG_REGULATOR_PCF50633) += pcf50633-regulator.o
+obj-$(CONFIG_REGULATOR_RAA215300) += raa215300.o
 obj-$(CONFIG_REGULATOR_RASPBERRYPI_TOUCHSCREEN_ATTINY)  += rpi-panel-attiny-regulator.o
 obj-$(CONFIG_REGULATOR_RC5T583)  += rc5t583-regulator.o
 obj-$(CONFIG_REGULATOR_RK808)   += rk808-regulator.o
@@ -163,6 +164,7 @@ obj-$(CONFIG_REGULATOR_TI_ABB) += ti-abb-regulator.o
 obj-$(CONFIG_REGULATOR_TPS6105X) += tps6105x-regulator.o
 obj-$(CONFIG_REGULATOR_TPS62360) += tps62360-regulator.o
 obj-$(CONFIG_REGULATOR_TPS6286X) += tps6286x-regulator.o
+obj-$(CONFIG_REGULATOR_TPS6287X) += tps6287x-regulator.o
 obj-$(CONFIG_REGULATOR_TPS65023) += tps65023-regulator.o
 obj-$(CONFIG_REGULATOR_TPS6507X) += tps6507x-regulator.o
 obj-$(CONFIG_REGULATOR_TPS65086) += tps65086-regulator.o
@@ -174,6 +176,7 @@ obj-$(CONFIG_REGULATOR_TPS6524X) += tps6524x-regulator.o
 obj-$(CONFIG_REGULATOR_TPS6586X) += tps6586x-regulator.o
 obj-$(CONFIG_REGULATOR_TPS65910) += tps65910-regulator.o
 obj-$(CONFIG_REGULATOR_TPS65912) += tps65912-regulator.o
+obj-$(CONFIG_REGULATOR_TPS6594) += tps6594-regulator.o
 obj-$(CONFIG_REGULATOR_TPS65132) += tps65132-regulator.o
 obj-$(CONFIG_REGULATOR_TPS68470) += tps68470-regulator.o
 obj-$(CONFIG_REGULATOR_TWL4030) += twl-regulator.o twl6030-regulator.o
index 5c409ff..a504b01 100644 (file)
@@ -791,7 +791,7 @@ static struct i2c_driver act8865_pmic_driver = {
                .name   = "act8865",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
        },
-       .probe_new      = act8865_pmic_probe,
+       .probe          = act8865_pmic_probe,
        .id_table       = act8865_ids,
 };
 
index c228cf6..40f7dba 100644 (file)
@@ -254,7 +254,7 @@ static int ad5398_probe(struct i2c_client *client)
 }
 
 static struct i2c_driver ad5398_driver = {
-       .probe_new = ad5398_probe,
+       .probe = ad5398_probe,
        .driver         = {
                .name   = "ad5398",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
index 943172b..810f90f 100644 (file)
 #define AXP22X_PWR_OUT_DLDO4_MASK      BIT_MASK(6)
 #define AXP22X_PWR_OUT_ALDO3_MASK      BIT_MASK(7)
 
+#define AXP313A_DCDC1_NUM_VOLTAGES     107
+#define AXP313A_DCDC23_NUM_VOLTAGES    88
+#define AXP313A_DCDC_V_OUT_MASK                GENMASK(6, 0)
+#define AXP313A_LDO_V_OUT_MASK         GENMASK(4, 0)
+
 #define AXP803_PWR_OUT_DCDC1_MASK      BIT_MASK(0)
 #define AXP803_PWR_OUT_DCDC2_MASK      BIT_MASK(1)
 #define AXP803_PWR_OUT_DCDC3_MASK      BIT_MASK(2)
 
 #define AXP813_PWR_OUT_DCDC7_MASK      BIT_MASK(6)
 
+#define AXP15060_DCDC1_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_DCDC2_V_CTRL_MASK             GENMASK(6, 0)
+#define AXP15060_DCDC3_V_CTRL_MASK             GENMASK(6, 0)
+#define AXP15060_DCDC4_V_CTRL_MASK             GENMASK(6, 0)
+#define AXP15060_DCDC5_V_CTRL_MASK             GENMASK(6, 0)
+#define AXP15060_DCDC6_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_ALDO1_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_ALDO2_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_ALDO3_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_ALDO4_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_ALDO5_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_BLDO1_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_BLDO2_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_BLDO3_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_BLDO4_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_BLDO5_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_CLDO1_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_CLDO2_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_CLDO3_V_CTRL_MASK             GENMASK(4, 0)
+#define AXP15060_CLDO4_V_CTRL_MASK             GENMASK(5, 0)
+#define AXP15060_CPUSLDO_V_CTRL_MASK           GENMASK(3, 0)
+
+#define AXP15060_PWR_OUT_DCDC1_MASK    BIT_MASK(0)
+#define AXP15060_PWR_OUT_DCDC2_MASK    BIT_MASK(1)
+#define AXP15060_PWR_OUT_DCDC3_MASK    BIT_MASK(2)
+#define AXP15060_PWR_OUT_DCDC4_MASK    BIT_MASK(3)
+#define AXP15060_PWR_OUT_DCDC5_MASK    BIT_MASK(4)
+#define AXP15060_PWR_OUT_DCDC6_MASK    BIT_MASK(5)
+#define AXP15060_PWR_OUT_ALDO1_MASK    BIT_MASK(0)
+#define AXP15060_PWR_OUT_ALDO2_MASK    BIT_MASK(1)
+#define AXP15060_PWR_OUT_ALDO3_MASK    BIT_MASK(2)
+#define AXP15060_PWR_OUT_ALDO4_MASK    BIT_MASK(3)
+#define AXP15060_PWR_OUT_ALDO5_MASK    BIT_MASK(4)
+#define AXP15060_PWR_OUT_BLDO1_MASK    BIT_MASK(5)
+#define AXP15060_PWR_OUT_BLDO2_MASK    BIT_MASK(6)
+#define AXP15060_PWR_OUT_BLDO3_MASK    BIT_MASK(7)
+#define AXP15060_PWR_OUT_BLDO4_MASK    BIT_MASK(0)
+#define AXP15060_PWR_OUT_BLDO5_MASK    BIT_MASK(1)
+#define AXP15060_PWR_OUT_CLDO1_MASK    BIT_MASK(2)
+#define AXP15060_PWR_OUT_CLDO2_MASK    BIT_MASK(3)
+#define AXP15060_PWR_OUT_CLDO3_MASK    BIT_MASK(4)
+#define AXP15060_PWR_OUT_CLDO4_MASK    BIT_MASK(5)
+#define AXP15060_PWR_OUT_CPUSLDO_MASK  BIT_MASK(6)
+#define AXP15060_PWR_OUT_SW_MASK               BIT_MASK(7)
+
+#define AXP15060_DCDC23_POLYPHASE_DUAL_MASK            BIT_MASK(6)
+#define AXP15060_DCDC46_POLYPHASE_DUAL_MASK            BIT_MASK(7)
+
+#define AXP15060_DCDC234_500mV_START   0x00
+#define AXP15060_DCDC234_500mV_STEPS   70
+#define AXP15060_DCDC234_500mV_END             \
+       (AXP15060_DCDC234_500mV_START + AXP15060_DCDC234_500mV_STEPS)
+#define AXP15060_DCDC234_1220mV_START  0x47
+#define AXP15060_DCDC234_1220mV_STEPS  16
+#define AXP15060_DCDC234_1220mV_END            \
+       (AXP15060_DCDC234_1220mV_START + AXP15060_DCDC234_1220mV_STEPS)
+#define AXP15060_DCDC234_NUM_VOLTAGES  88
+
+#define AXP15060_DCDC5_800mV_START     0x00
+#define AXP15060_DCDC5_800mV_STEPS     32
+#define AXP15060_DCDC5_800mV_END               \
+       (AXP15060_DCDC5_800mV_START + AXP15060_DCDC5_800mV_STEPS)
+#define AXP15060_DCDC5_1140mV_START    0x21
+#define AXP15060_DCDC5_1140mV_STEPS    35
+#define AXP15060_DCDC5_1140mV_END              \
+       (AXP15060_DCDC5_1140mV_START + AXP15060_DCDC5_1140mV_STEPS)
+#define AXP15060_DCDC5_NUM_VOLTAGES    69
+
 #define AXP_DESC_IO(_family, _id, _match, _supply, _min, _max, _step, _vreg,   \
                    _vmask, _ereg, _emask, _enable_val, _disable_val)           \
        [_family##_##_id] = {                                                   \
@@ -638,6 +711,48 @@ static const struct regulator_desc axp22x_drivevbus_regulator = {
        .ops            = &axp20x_ops_sw,
 };
 
+static const struct linear_range axp313a_dcdc1_ranges[] = {
+       REGULATOR_LINEAR_RANGE(500000,   0,  70,  10000),
+       REGULATOR_LINEAR_RANGE(1220000, 71,  87,  20000),
+       REGULATOR_LINEAR_RANGE(1600000, 88, 106, 100000),
+};
+
+static const struct linear_range axp313a_dcdc2_ranges[] = {
+       REGULATOR_LINEAR_RANGE(500000,   0, 70, 10000),
+       REGULATOR_LINEAR_RANGE(1220000, 71, 87, 20000),
+};
+
+/*
+ * This is deviating from the datasheet. The values here are taken from the
+ * BSP driver and have been confirmed by measurements.
+ */
+static const struct linear_range axp313a_dcdc3_ranges[] = {
+       REGULATOR_LINEAR_RANGE(500000,   0,  70, 10000),
+       REGULATOR_LINEAR_RANGE(1220000, 71, 102, 20000),
+};
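+
+/*
+ * For illustration (not part of this table): each
+ * REGULATOR_LINEAR_RANGE(min_uV, min_sel, max_sel, step_uV) entry maps a
+ * selector s in [min_sel, max_sel] to min_uV + (s - min_sel) * step_uV.
+ * For the DCDC1 ranges above this gives 0.5-1.2 V in 10 mV steps
+ * (selectors 0-70), 1.22-1.54 V in 20 mV steps (71-87) and 1.6-3.4 V in
+ * 100 mV steps (88-106), i.e. 107 selectors in total, matching
+ * AXP313A_DCDC1_NUM_VOLTAGES.
+ */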
+
+static const struct regulator_desc axp313a_regulators[] = {
+       AXP_DESC_RANGES(AXP313A, DCDC1, "dcdc1", "vin1",
+                       axp313a_dcdc1_ranges, AXP313A_DCDC1_NUM_VOLTAGES,
+                       AXP313A_DCDC1_CONRTOL, AXP313A_DCDC_V_OUT_MASK,
+                       AXP313A_OUTPUT_CONTROL, BIT(0)),
+       AXP_DESC_RANGES(AXP313A, DCDC2, "dcdc2", "vin2",
+                       axp313a_dcdc2_ranges, AXP313A_DCDC23_NUM_VOLTAGES,
+                       AXP313A_DCDC2_CONRTOL, AXP313A_DCDC_V_OUT_MASK,
+                       AXP313A_OUTPUT_CONTROL, BIT(1)),
+       AXP_DESC_RANGES(AXP313A, DCDC3, "dcdc3", "vin3",
+                       axp313a_dcdc3_ranges, AXP313A_DCDC23_NUM_VOLTAGES,
+                       AXP313A_DCDC3_CONRTOL, AXP313A_DCDC_V_OUT_MASK,
+                       AXP313A_OUTPUT_CONTROL, BIT(2)),
+       AXP_DESC(AXP313A, ALDO1, "aldo1", "vin1", 500, 3500, 100,
+                AXP313A_ALDO1_CONRTOL, AXP313A_LDO_V_OUT_MASK,
+                AXP313A_OUTPUT_CONTROL, BIT(3)),
+       AXP_DESC(AXP313A, DLDO1, "dldo1", "vin1", 500, 3500, 100,
+                AXP313A_DLDO1_CONRTOL, AXP313A_LDO_V_OUT_MASK,
+                AXP313A_OUTPUT_CONTROL, BIT(4)),
+       AXP_DESC_FIXED(AXP313A, RTC_LDO, "rtc-ldo", "vin1", 1800),
+};
+
 /* DCDC ranges shared with AXP813 */
 static const struct linear_range axp803_dcdc234_ranges[] = {
        REGULATOR_LINEAR_RANGE(500000,
@@ -1001,6 +1116,104 @@ static const struct regulator_desc axp813_regulators[] = {
                    AXP22X_PWR_OUT_CTRL2, AXP22X_PWR_OUT_DC1SW_MASK),
 };
 
+static const struct linear_range axp15060_dcdc234_ranges[] = {
+       REGULATOR_LINEAR_RANGE(500000,
+                              AXP15060_DCDC234_500mV_START,
+                              AXP15060_DCDC234_500mV_END,
+                              10000),
+       REGULATOR_LINEAR_RANGE(1220000,
+                              AXP15060_DCDC234_1220mV_START,
+                              AXP15060_DCDC234_1220mV_END,
+                              20000),
+};
+
+static const struct linear_range axp15060_dcdc5_ranges[] = {
+       REGULATOR_LINEAR_RANGE(800000,
+                              AXP15060_DCDC5_800mV_START,
+                              AXP15060_DCDC5_800mV_END,
+                              10000),
+       REGULATOR_LINEAR_RANGE(1140000,
+                              AXP15060_DCDC5_1140mV_START,
+                              AXP15060_DCDC5_1140mV_END,
+                              20000),
+};
+
+static const struct regulator_desc axp15060_regulators[] = {
+       AXP_DESC(AXP15060, DCDC1, "dcdc1", "vin1", 1500, 3400, 100,
+                AXP15060_DCDC1_V_CTRL, AXP15060_DCDC1_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL1, AXP15060_PWR_OUT_DCDC1_MASK),
+       AXP_DESC_RANGES(AXP15060, DCDC2, "dcdc2", "vin2",
+                       axp15060_dcdc234_ranges, AXP15060_DCDC234_NUM_VOLTAGES,
+                       AXP15060_DCDC2_V_CTRL, AXP15060_DCDC2_V_CTRL_MASK,
+                       AXP15060_PWR_OUT_CTRL1, AXP15060_PWR_OUT_DCDC2_MASK),
+       AXP_DESC_RANGES(AXP15060, DCDC3, "dcdc3", "vin3",
+                       axp15060_dcdc234_ranges, AXP15060_DCDC234_NUM_VOLTAGES,
+                       AXP15060_DCDC3_V_CTRL, AXP15060_DCDC3_V_CTRL_MASK,
+                       AXP15060_PWR_OUT_CTRL1, AXP15060_PWR_OUT_DCDC3_MASK),
+       AXP_DESC_RANGES(AXP15060, DCDC4, "dcdc4", "vin4",
+                       axp15060_dcdc234_ranges, AXP15060_DCDC234_NUM_VOLTAGES,
+                       AXP15060_DCDC4_V_CTRL, AXP15060_DCDC4_V_CTRL_MASK,
+                       AXP15060_PWR_OUT_CTRL1, AXP15060_PWR_OUT_DCDC4_MASK),
+       AXP_DESC_RANGES(AXP15060, DCDC5, "dcdc5", "vin5",
+                       axp15060_dcdc5_ranges, AXP15060_DCDC5_NUM_VOLTAGES,
+                       AXP15060_DCDC5_V_CTRL, AXP15060_DCDC5_V_CTRL_MASK,
+                       AXP15060_PWR_OUT_CTRL1, AXP15060_PWR_OUT_DCDC5_MASK),
+       AXP_DESC(AXP15060, DCDC6, "dcdc6", "vin6", 500, 3400, 100,
+                AXP15060_DCDC6_V_CTRL, AXP15060_DCDC6_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL1, AXP15060_PWR_OUT_DCDC6_MASK),
+       AXP_DESC(AXP15060, ALDO1, "aldo1", "aldoin", 700, 3300, 100,
+                AXP15060_ALDO1_V_CTRL, AXP15060_ALDO1_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL2, AXP15060_PWR_OUT_ALDO1_MASK),
+       AXP_DESC(AXP15060, ALDO2, "aldo2", "aldoin", 700, 3300, 100,
+                AXP15060_ALDO2_V_CTRL, AXP15060_ALDO2_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL2, AXP15060_PWR_OUT_ALDO2_MASK),
+       AXP_DESC(AXP15060, ALDO3, "aldo3", "aldoin", 700, 3300, 100,
+                AXP15060_ALDO3_V_CTRL, AXP15060_ALDO3_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL2, AXP15060_PWR_OUT_ALDO3_MASK),
+       AXP_DESC(AXP15060, ALDO4, "aldo4", "aldoin", 700, 3300, 100,
+                AXP15060_ALDO4_V_CTRL, AXP15060_ALDO4_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL2, AXP15060_PWR_OUT_ALDO4_MASK),
+       AXP_DESC(AXP15060, ALDO5, "aldo5", "aldoin", 700, 3300, 100,
+                AXP15060_ALDO5_V_CTRL, AXP15060_ALDO5_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL2, AXP15060_PWR_OUT_ALDO5_MASK),
+       AXP_DESC(AXP15060, BLDO1, "bldo1", "bldoin", 700, 3300, 100,
+                AXP15060_BLDO1_V_CTRL, AXP15060_BLDO1_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL2, AXP15060_PWR_OUT_BLDO1_MASK),
+       AXP_DESC(AXP15060, BLDO2, "bldo2", "bldoin", 700, 3300, 100,
+                AXP15060_BLDO2_V_CTRL, AXP15060_BLDO2_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL2, AXP15060_PWR_OUT_BLDO2_MASK),
+       AXP_DESC(AXP15060, BLDO3, "bldo3", "bldoin", 700, 3300, 100,
+                AXP15060_BLDO3_V_CTRL, AXP15060_BLDO3_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL2, AXP15060_PWR_OUT_BLDO3_MASK),
+       AXP_DESC(AXP15060, BLDO4, "bldo4", "bldoin", 700, 3300, 100,
+                AXP15060_BLDO4_V_CTRL, AXP15060_BLDO4_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL3, AXP15060_PWR_OUT_BLDO4_MASK),
+       AXP_DESC(AXP15060, BLDO5, "bldo5", "bldoin", 700, 3300, 100,
+                AXP15060_BLDO5_V_CTRL, AXP15060_BLDO5_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL3, AXP15060_PWR_OUT_BLDO5_MASK),
+       AXP_DESC(AXP15060, CLDO1, "cldo1", "cldoin", 700, 3300, 100,
+                AXP15060_CLDO1_V_CTRL, AXP15060_CLDO1_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL3, AXP15060_PWR_OUT_CLDO1_MASK),
+       AXP_DESC(AXP15060, CLDO2, "cldo2", "cldoin", 700, 3300, 100,
+                AXP15060_CLDO2_V_CTRL, AXP15060_CLDO2_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL3, AXP15060_PWR_OUT_CLDO2_MASK),
+       AXP_DESC(AXP15060, CLDO3, "cldo3", "cldoin", 700, 3300, 100,
+                AXP15060_CLDO3_V_CTRL, AXP15060_CLDO3_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL3, AXP15060_PWR_OUT_CLDO3_MASK),
+       AXP_DESC(AXP15060, CLDO4, "cldo4", "cldoin", 700, 4200, 100,
+                AXP15060_CLDO4_V_CTRL, AXP15060_CLDO4_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL3, AXP15060_PWR_OUT_CLDO4_MASK),
+       /* Supply comes from DCDC5 */
+       AXP_DESC(AXP15060, CPUSLDO, "cpusldo", NULL, 700, 1400, 50,
+                AXP15060_CPUSLDO_V_CTRL, AXP15060_CPUSLDO_V_CTRL_MASK,
+                AXP15060_PWR_OUT_CTRL3, AXP15060_PWR_OUT_CPUSLDO_MASK),
+       /* Supply comes from DCDC1 */
+       AXP_DESC_SW(AXP15060, SW, "sw", NULL,
+                   AXP15060_PWR_OUT_CTRL3, AXP15060_PWR_OUT_SW_MASK),
+       /* Supply comes from ALDO1 */
+       AXP_DESC_FIXED(AXP15060, RTC_LDO, "rtc-ldo", NULL, 1800),
+};
+
 static int axp20x_set_dcdc_freq(struct platform_device *pdev, u32 dcdcfreq)
 {
        struct axp20x_dev *axp20x = dev_get_drvdata(pdev->dev.parent);
@@ -1040,6 +1253,16 @@ static int axp20x_set_dcdc_freq(struct platform_device *pdev, u32 dcdcfreq)
                def = 3000;
                step = 150;
                break;
+       case AXP313A_ID:
+       case AXP15060_ID:
+               /* The DCDC PWM frequency seems to be fixed to 3 MHz. */
+               if (dcdcfreq != 0) {
+                       dev_err(&pdev->dev,
+                               "DCDC frequency on this PMIC is fixed to 3 MHz.\n");
+                       return -EINVAL;
+               }
+
+               return 0;
        default:
                dev_err(&pdev->dev,
                        "Setting DCDC frequency for unsupported AXP variant\n");
@@ -1145,6 +1368,15 @@ static int axp20x_set_dcdc_workmode(struct regulator_dev *rdev, int id, u32 work
                workmode <<= id - AXP813_DCDC1;
                break;
 
+       case AXP15060_ID:
+               reg = AXP15060_DCDC_MODE_CTRL2;
+               if (id < AXP15060_DCDC1 || id > AXP15060_DCDC6)
+                       return -EINVAL;
+
+               mask = AXP22X_WORKMODE_DCDCX_MASK(id - AXP15060_DCDC1);
+               workmode <<= id - AXP15060_DCDC1;
+               break;
+
        default:
                /* should not happen */
                WARN_ON(1);
@@ -1164,7 +1396,7 @@ static bool axp20x_is_polyphase_slave(struct axp20x_dev *axp20x, int id)
 
        /*
         * Currently in our supported AXP variants, only AXP803, AXP806,
-        * and AXP813 have polyphase regulators.
+        * AXP813 and AXP15060 have polyphase regulators.
         */
        switch (axp20x->variant) {
        case AXP803_ID:
@@ -1196,6 +1428,17 @@ static bool axp20x_is_polyphase_slave(struct axp20x_dev *axp20x, int id)
                }
                break;
 
+       case AXP15060_ID:
+               regmap_read(axp20x->regmap, AXP15060_DCDC_MODE_CTRL1, &reg);
+
+               switch (id) {
+               case AXP15060_DCDC3:
+                       return !!(reg & AXP15060_DCDC23_POLYPHASE_DUAL_MASK);
+               case AXP15060_DCDC6:
+                       return !!(reg & AXP15060_DCDC46_POLYPHASE_DUAL_MASK);
+               }
+               break;
+
        default:
                return false;
        }
@@ -1217,6 +1460,7 @@ static int axp20x_regulator_probe(struct platform_device *pdev)
        u32 workmode;
        const char *dcdc1_name = axp22x_regulators[AXP22X_DCDC1].name;
        const char *dcdc5_name = axp22x_regulators[AXP22X_DCDC5].name;
+       const char *aldo1_name = axp15060_regulators[AXP15060_ALDO1].name;
        bool drivevbus = false;
 
        switch (axp20x->variant) {
@@ -1232,6 +1476,10 @@ static int axp20x_regulator_probe(struct platform_device *pdev)
                drivevbus = of_property_read_bool(pdev->dev.parent->of_node,
                                                  "x-powers,drive-vbus-en");
                break;
+       case AXP313A_ID:
+               regulators = axp313a_regulators;
+               nregulators = AXP313A_REG_ID_MAX;
+               break;
        case AXP803_ID:
                regulators = axp803_regulators;
                nregulators = AXP803_REG_ID_MAX;
@@ -1252,6 +1500,10 @@ static int axp20x_regulator_probe(struct platform_device *pdev)
                drivevbus = of_property_read_bool(pdev->dev.parent->of_node,
                                                  "x-powers,drive-vbus-en");
                break;
+       case AXP15060_ID:
+               regulators = axp15060_regulators;
+               nregulators = AXP15060_REG_ID_MAX;
+               break;
        default:
                dev_err(&pdev->dev, "Unsupported AXP variant: %ld\n",
                        axp20x->variant);
@@ -1278,8 +1530,9 @@ static int axp20x_regulator_probe(struct platform_device *pdev)
                        continue;
 
                /*
-                * Regulators DC1SW and DC5LDO are connected internally,
-                * so we have to handle their supply names separately.
+                * Regulators DC1SW, DC5LDO and RTCLDO on AXP15060 are
+                * connected internally, so we have to handle their supply
+                * names separately.
                 *
                 * We always register the regulators in proper sequence,
                 * so the supply names are correctly read. See the last
@@ -1288,7 +1541,8 @@ static int axp20x_regulator_probe(struct platform_device *pdev)
                 */
                if ((regulators == axp22x_regulators && i == AXP22X_DC1SW) ||
                    (regulators == axp803_regulators && i == AXP803_DC1SW) ||
-                   (regulators == axp809_regulators && i == AXP809_DC1SW)) {
+                   (regulators == axp809_regulators && i == AXP809_DC1SW) ||
+                   (regulators == axp15060_regulators && i == AXP15060_SW)) {
                        new_desc = devm_kzalloc(&pdev->dev, sizeof(*desc),
                                                GFP_KERNEL);
                        if (!new_desc)
@@ -1300,7 +1554,8 @@ static int axp20x_regulator_probe(struct platform_device *pdev)
                }
 
                if ((regulators == axp22x_regulators && i == AXP22X_DC5LDO) ||
-                   (regulators == axp809_regulators && i == AXP809_DC5LDO)) {
+                   (regulators == axp809_regulators && i == AXP809_DC5LDO) ||
+                   (regulators == axp15060_regulators && i == AXP15060_CPUSLDO)) {
                        new_desc = devm_kzalloc(&pdev->dev, sizeof(*desc),
                                                GFP_KERNEL);
                        if (!new_desc)
@@ -1311,6 +1566,18 @@ static int axp20x_regulator_probe(struct platform_device *pdev)
                        desc = new_desc;
                }
 
+
+               if (regulators == axp15060_regulators && i == AXP15060_RTC_LDO) {
+                       new_desc = devm_kzalloc(&pdev->dev, sizeof(*desc),
+                                               GFP_KERNEL);
+                       if (!new_desc)
+                               return -ENOMEM;
+
+                       *new_desc = regulators[i];
+                       new_desc->supply_name = aldo1_name;
+                       desc = new_desc;
+               }
+
                rdev = devm_regulator_register(&pdev->dev, desc, &config);
                if (IS_ERR(rdev)) {
                        dev_err(&pdev->dev, "Failed to register %s\n",
@@ -1329,19 +1596,26 @@ static int axp20x_regulator_probe(struct platform_device *pdev)
                }
 
                /*
-                * Save AXP22X DCDC1 / DCDC5 regulator names for later.
+                * Save AXP22X DCDC1 / DCDC5 and AXP15060 ALDO1 regulator names for later.
                 */
                if ((regulators == axp22x_regulators && i == AXP22X_DCDC1) ||
-                   (regulators == axp809_regulators && i == AXP809_DCDC1))
+                   (regulators == axp809_regulators && i == AXP809_DCDC1) ||
+                   (regulators == axp15060_regulators && i == AXP15060_DCDC1))
                        of_property_read_string(rdev->dev.of_node,
                                                "regulator-name",
                                                &dcdc1_name);
 
                if ((regulators == axp22x_regulators && i == AXP22X_DCDC5) ||
-                   (regulators == axp809_regulators && i == AXP809_DCDC5))
+                   (regulators == axp809_regulators && i == AXP809_DCDC5) ||
+                   (regulators == axp15060_regulators && i == AXP15060_DCDC5))
                        of_property_read_string(rdev->dev.of_node,
                                                "regulator-name",
                                                &dcdc5_name);
+
+               if (regulators == axp15060_regulators && i == AXP15060_ALDO1)
+                       of_property_read_string(rdev->dev.of_node,
+                                               "regulator-name",
+                                               &aldo1_name);
        }
 
        if (drivevbus) {
index 698ab7f..d8e1caa 100644 (file)
@@ -1911,19 +1911,17 @@ static struct regulator *create_regulator(struct regulator_dev *rdev,
 
        if (err != -EEXIST)
                regulator->debugfs = debugfs_create_dir(supply_name, rdev->debugfs);
-       if (!regulator->debugfs) {
+       if (IS_ERR(regulator->debugfs))
                rdev_dbg(rdev, "Failed to create debugfs directory\n");
-       } else {
-               debugfs_create_u32("uA_load", 0444, regulator->debugfs,
-                                  &regulator->uA_load);
-               debugfs_create_u32("min_uV", 0444, regulator->debugfs,
-                                  &regulator->voltage[PM_SUSPEND_ON].min_uV);
-               debugfs_create_u32("max_uV", 0444, regulator->debugfs,
-                                  &regulator->voltage[PM_SUSPEND_ON].max_uV);
-               debugfs_create_file("constraint_flags", 0444,
-                                   regulator->debugfs, regulator,
-                                   &constraint_flags_fops);
-       }
+
+       debugfs_create_u32("uA_load", 0444, regulator->debugfs,
+                          &regulator->uA_load);
+       debugfs_create_u32("min_uV", 0444, regulator->debugfs,
+                          &regulator->voltage[PM_SUSPEND_ON].min_uV);
+       debugfs_create_u32("max_uV", 0444, regulator->debugfs,
+                          &regulator->voltage[PM_SUSPEND_ON].max_uV);
+       debugfs_create_file("constraint_flags", 0444, regulator->debugfs,
+                           regulator, &constraint_flags_fops);
 
        /*
         * Check now if the regulator is an always on regulator - if
@@ -5256,10 +5254,8 @@ static void rdev_init_debugfs(struct regulator_dev *rdev)
        }
 
        rdev->debugfs = debugfs_create_dir(rname, debugfs_root);
-       if (IS_ERR(rdev->debugfs)) {
-               rdev_warn(rdev, "Failed to create debugfs directory\n");
-               return;
-       }
+       if (IS_ERR(rdev->debugfs))
+               rdev_dbg(rdev, "Failed to create debugfs directory\n");
 
        debugfs_create_u32("use_count", 0444, rdev->debugfs,
                           &rdev->use_count);
@@ -6179,7 +6175,7 @@ static int __init regulator_init(void)
 
        debugfs_root = debugfs_create_dir("regulator", NULL);
        if (IS_ERR(debugfs_root))
-               pr_warn("regulator: Failed to create debugfs directory\n");
+               pr_debug("regulator: Failed to create debugfs directory\n");
 
 #ifdef CONFIG_DEBUG_FS
        debugfs_create_file("supply_map", 0444, debugfs_root, NULL,
index 6ce0fdc..1221249 100644 (file)
@@ -1197,7 +1197,7 @@ static struct i2c_driver da9121_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(da9121_dt_ids),
        },
-       .probe_new = da9121_i2c_probe,
+       .probe = da9121_i2c_probe,
        .remove = da9121_i2c_remove,
        .id_table = da9121_i2c_id,
 };
index 4332a3b..252f74a 100644 (file)
@@ -224,7 +224,7 @@ static struct i2c_driver da9210_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(da9210_dt_ids),
        },
-       .probe_new = da9210_i2c_probe,
+       .probe = da9210_i2c_probe,
        .id_table = da9210_i2c_id,
 };
 
index a2b4f6f..af383ff 100644 (file)
@@ -555,7 +555,7 @@ static struct i2c_driver da9211_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(da9211_dt_ids),
        },
-       .probe_new = da9211_i2c_probe,
+       .probe = da9211_i2c_probe,
        .id_table = da9211_i2c_id,
 };
 
index 130f3db..289c06e 100644 (file)
@@ -775,7 +775,7 @@ static struct i2c_driver fan53555_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(fan53555_dt_ids),
        },
-       .probe_new = fan53555_regulator_probe,
+       .probe = fan53555_regulator_probe,
        .id_table = fan53555_id,
 };
 
index a3bebde..6cb5656 100644 (file)
@@ -175,7 +175,7 @@ static struct i2c_driver fan53880_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = fan53880_dt_ids,
        },
-       .probe_new = fan53880_i2c_probe,
+       .probe = fan53880_i2c_probe,
        .id_table = fan53880_i2c_id,
 };
 module_i2c_driver(fan53880_regulator_driver);
index ad2237a..e6c999b 100644 (file)
@@ -902,8 +902,21 @@ bool regulator_is_equal(struct regulator *reg1, struct regulator *reg2)
 }
 EXPORT_SYMBOL_GPL(regulator_is_equal);
 
-static int find_closest_bigger(unsigned int target, const unsigned int *table,
-                              unsigned int num_sel, unsigned int *sel)
+/**
+ * regulator_find_closest_bigger - helper to find offset in ramp delay table
+ *
+ * @target: targeted ramp_delay
+ * @table: table with supported ramp delays
+ * @num_sel: number of entries in the table
+ * @sel: Pointer to store table offset
+ *
+ * This is the internal helper used by regulator_set_ramp_delay_regmap to
+ * map ramp delay to register value. It should only be used directly if
+ * regulator_set_ramp_delay_regmap cannot handle a specific device setup
+ * (e.g. because the value is split over multiple registers).
+ */
+int regulator_find_closest_bigger(unsigned int target, const unsigned int *table,
+                                 unsigned int num_sel, unsigned int *sel)
 {
        unsigned int s, tmp, max, maxsel = 0;
        bool found = false;
@@ -933,11 +946,13 @@ static int find_closest_bigger(unsigned int target, const unsigned int *table,
 
        return 0;
 }
+EXPORT_SYMBOL_GPL(regulator_find_closest_bigger);
 
 /**
  * regulator_set_ramp_delay_regmap - set_ramp_delay() helper
  *
  * @rdev: regulator to operate on
+ * @ramp_delay: ramp-rate value given in units V/S (uV/uS)
  *
  * Regulators that use regmap for their register I/O can set the ramp_reg
  * and ramp_mask fields in their descriptor and then use this as their
@@ -951,8 +966,8 @@ int regulator_set_ramp_delay_regmap(struct regulator_dev *rdev, int ramp_delay)
        if (WARN_ON(!rdev->desc->n_ramp_values || !rdev->desc->ramp_delay_table))
                return -EINVAL;
 
-       ret = find_closest_bigger(ramp_delay, rdev->desc->ramp_delay_table,
-                                 rdev->desc->n_ramp_values, &sel);
+       ret = regulator_find_closest_bigger(ramp_delay, rdev->desc->ramp_delay_table,
+                                           rdev->desc->n_ramp_values, &sel);
 
        if (ret) {
                dev_warn(rdev_get_dev(rdev),
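With regulator_find_closest_bigger() exported, a driver whose ramp selector cannot be written with regulator_set_ramp_delay_regmap() (for example because it is spread across several registers, as the rk806 changes further down do) can call the helper itself. A hedged sketch with an invented table and callback, not taken from any driver in this series:

    static const unsigned int example_ramp_table[] = { 12500, 6250, 3125 };

    static int example_set_ramp_delay(struct regulator_dev *rdev, int ramp_delay)
    {
            unsigned int sel;
            int ret;

            ret = regulator_find_closest_bigger(ramp_delay, example_ramp_table,
                                                ARRAY_SIZE(example_ramp_table), &sel);
            if (ret)
                    dev_warn(rdev_get_dev(rdev),
                             "ramp delay %d not supported, using %u\n",
                             ramp_delay, example_ramp_table[sel]);

            /* 'sel' would then be written to the device-specific register(s). */
            return 0;
    }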
index 3c37c4d..69b4afe 100644 (file)
@@ -149,7 +149,7 @@ static struct i2c_driver isl6271a_i2c_driver = {
                .name = "isl6271a",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
        },
-       .probe_new = isl6271a_probe,
+       .probe = isl6271a_probe,
        .id_table = isl6271a_id,
 };
 
index 90bc8d0..0f75600 100644 (file)
@@ -198,7 +198,7 @@ static struct i2c_driver isl9305_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(isl9305_dt_ids),
        },
-       .probe_new = isl9305_i2c_probe,
+       .probe = isl9305_i2c_probe,
        .id_table = isl9305_i2c_id,
 };
 
index e06f2a0..e1b5c45 100644 (file)
@@ -449,7 +449,7 @@ static struct i2c_driver lp3971_i2c_driver = {
                .name = "LP3971",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
        },
-       .probe_new = lp3971_i2c_probe,
+       .probe = lp3971_i2c_probe,
        .id_table = lp3971_i2c_id,
 };
 
index edacca8..7bd6f05 100644 (file)
@@ -547,7 +547,7 @@ static struct i2c_driver lp3972_i2c_driver = {
                .name = "lp3972",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
        },
-       .probe_new = lp3972_i2c_probe,
+       .probe = lp3972_i2c_probe,
        .id_table = lp3972_i2c_id,
 };
 
index a8b0969..63aa227 100644 (file)
@@ -947,7 +947,7 @@ static struct i2c_driver lp872x_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(lp872x_dt_ids),
        },
-       .probe_new = lp872x_probe,
+       .probe = lp872x_probe,
        .id_table = lp872x_ids,
 };
 
index 37b51b9..4bc310f 100644 (file)
@@ -442,7 +442,7 @@ static struct i2c_driver lp8755_i2c_driver = {
                   .name = LP8755_NAME,
                   .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                   },
-       .probe_new = lp8755_probe,
+       .probe = lp8755_probe,
        .remove = lp8755_remove,
        .id_table = lp8755_id,
 };
index 359b534..e9751c2 100644 (file)
@@ -348,7 +348,7 @@ static const struct regmap_config ltc3589_regmap_config = {
        .num_reg_defaults = ARRAY_SIZE(ltc3589_reg_defaults),
        .use_single_read = true,
        .use_single_write = true,
-       .cache_type = REGCACHE_RBTREE,
+       .cache_type = REGCACHE_MAPLE,
 };
 
 static irqreturn_t ltc3589_isr(int irq, void *dev_id)
@@ -477,7 +477,7 @@ static struct i2c_driver ltc3589_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(ltc3589_of_match),
        },
-       .probe_new = ltc3589_probe,
+       .probe = ltc3589_probe,
        .id_table = ltc3589_i2c_id,
 };
 module_i2c_driver(ltc3589_driver);
index a28e6c3..73d511e 100644 (file)
@@ -261,7 +261,7 @@ static const struct regmap_config ltc3676_regmap_config = {
        .max_register = LTC3676_CLIRQ,
        .use_single_read = true,
        .use_single_write = true,
-       .cache_type = REGCACHE_RBTREE,
+       .cache_type = REGCACHE_MAPLE,
 };
 
 static irqreturn_t ltc3676_isr(int irq, void *dev_id)
@@ -374,7 +374,7 @@ static struct i2c_driver ltc3676_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(ltc3676_of_match),
        },
-       .probe_new = ltc3676_regulator_probe,
+       .probe = ltc3676_regulator_probe,
        .id_table = ltc3676_i2c_id,
 };
 module_i2c_driver(ltc3676_driver);
index 5d8852b..90aa5b7 100644 (file)
@@ -289,7 +289,7 @@ static const struct i2c_device_id max1586_id[] = {
 MODULE_DEVICE_TABLE(i2c, max1586_id);
 
 static struct i2c_driver max1586_pmic_driver = {
-       .probe_new = max1586_pmic_probe,
+       .probe = max1586_pmic_probe,
        .driver         = {
                .name   = "max1586",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
index ace1d58..fad31f5 100644 (file)
@@ -323,7 +323,7 @@ static struct i2c_driver max20086_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(max20086_dt_ids),
        },
-       .probe_new = max20086_i2c_probe,
+       .probe = max20086_i2c_probe,
        .id_table = max20086_i2c_id,
 };
 
index be8169b..8c09dc7 100644 (file)
@@ -156,7 +156,7 @@ static struct i2c_driver max20411_i2c_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_max20411_match_tbl,
        },
-       .probe_new = max20411_probe,
+       .probe = max20411_probe,
        .id_table = max20411_id,
 };
 module_i2c_driver(max20411_i2c_driver);
index ea5d4b1..3855f5e 100644 (file)
@@ -292,7 +292,7 @@ static struct i2c_driver max77826_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(max77826_of_match),
        },
-       .probe_new = max77826_i2c_probe,
+       .probe = max77826_i2c_probe,
        .id_table = max77826_id,
 };
 module_i2c_driver(max77826_regulator_driver);
index a517fb4..24e1dfb 100644 (file)
@@ -246,7 +246,7 @@ static const struct i2c_device_id max8649_id[] = {
 MODULE_DEVICE_TABLE(i2c, max8649_id);
 
 static struct i2c_driver max8649_driver = {
-       .probe_new      = max8649_regulator_probe,
+       .probe          = max8649_regulator_probe,
        .driver         = {
                .name   = "max8649",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
index d6b89f0..ede1709 100644 (file)
@@ -503,7 +503,7 @@ static const struct i2c_device_id max8660_id[] = {
 MODULE_DEVICE_TABLE(i2c, max8660_id);
 
 static struct i2c_driver max8660_driver = {
-       .probe_new = max8660_probe,
+       .probe = max8660_probe,
        .driver         = {
                .name   = "max8660",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
index 10ffd77..cb0e729 100644 (file)
@@ -168,7 +168,7 @@ static const struct i2c_device_id max8893_ids[] = {
 MODULE_DEVICE_TABLE(i2c, max8893_ids);
 
 static struct i2c_driver max8893_driver = {
-       .probe_new      = max8893_probe_new,
+       .probe          = max8893_probe_new,
        .driver         = {
                .name   = "max8893",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
index 8ad8fe7..0b0b841 100644 (file)
@@ -313,7 +313,7 @@ static const struct i2c_device_id max8952_ids[] = {
 MODULE_DEVICE_TABLE(i2c, max8952_ids);
 
 static struct i2c_driver max8952_pmic_driver = {
-       .probe_new      = max8952_pmic_probe,
+       .probe          = max8952_pmic_probe,
        .driver         = {
                .name   = "max8952",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
index a991a88..8d51932 100644 (file)
@@ -807,7 +807,7 @@ static struct i2c_driver max8973_i2c_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_max8973_match_tbl,
        },
-       .probe_new = max8973_probe,
+       .probe = max8973_probe,
        .id_table = max8973_id,
 };
 
index 3a6d795..6c6f5a2 100644 (file)
@@ -584,7 +584,7 @@ static const struct i2c_device_id mcp16502_i2c_id[] = {
 MODULE_DEVICE_TABLE(i2c, mcp16502_i2c_id);
 
 static struct i2c_driver mcp16502_drv = {
-       .probe_new      = mcp16502_probe,
+       .probe          = mcp16502_probe,
        .driver         = {
                .name   = "mcp16502-regulator",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
index 91e9019..3886b25 100644 (file)
@@ -240,7 +240,7 @@ static struct i2c_driver mp5416_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(mp5416_of_match),
        },
-       .probe_new = mp5416_i2c_probe,
+       .probe = mp5416_i2c_probe,
        .id_table = mp5416_id,
 };
 module_i2c_driver(mp5416_regulator_driver);
index b968a68..b820bd6 100644 (file)
@@ -147,7 +147,7 @@ static struct i2c_driver mp8859_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(mp8859_dt_id),
        },
-       .probe_new = mp8859_i2c_probe,
+       .probe = mp8859_i2c_probe,
        .id_table = mp8859_i2c_id,
 };
 
index 250c27e..ede1b1e 100644 (file)
@@ -365,7 +365,7 @@ static struct i2c_driver mp886x_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = mp886x_dt_ids,
        },
-       .probe_new = mp886x_i2c_probe,
+       .probe = mp886x_i2c_probe,
        .id_table = mp886x_id,
 };
 module_i2c_driver(mp886x_regulator_driver);
index 544d41b..bf677c5 100644 (file)
@@ -321,7 +321,7 @@ static struct i2c_driver mpq7920_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(mpq7920_of_match),
        },
-       .probe_new = mpq7920_i2c_probe,
+       .probe = mpq7920_i2c_probe,
        .id_table = mpq7920_id,
 };
 module_i2c_driver(mpq7920_regulator_driver);
index a9f0c9f..b077177 100644 (file)
@@ -154,7 +154,7 @@ static struct i2c_driver mt6311_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(mt6311_dt_ids),
        },
-       .probe_new = mt6311_i2c_probe,
+       .probe = mt6311_i2c_probe,
        .id_table = mt6311_i2c_id,
 };
 
index c9e16bd..31a16fb 100644 (file)
@@ -34,8 +34,10 @@ struct mt6358_regulator_info {
        u32 modeset_mask;
 };
 
+#define to_regulator_info(x) container_of((x), struct mt6358_regulator_info, desc)
+
 #define MT6358_BUCK(match, vreg, min, max, step,               \
-       volt_ranges, vosel_mask, _da_vsel_reg, _da_vsel_mask,   \
+       vosel_mask, _da_vsel_reg, _da_vsel_mask,        \
        _modeset_reg, _modeset_shift)           \
 [MT6358_ID_##vreg] = { \
        .desc = {       \
@@ -46,8 +48,8 @@ struct mt6358_regulator_info {
                .id = MT6358_ID_##vreg,         \
                .owner = THIS_MODULE,           \
                .n_voltages = ((max) - (min)) / (step) + 1,     \
-               .linear_ranges = volt_ranges,           \
-               .n_linear_ranges = ARRAY_SIZE(volt_ranges),     \
+               .min_uV = (min),                \
+               .uV_step = (step),              \
                .vsel_reg = MT6358_BUCK_##vreg##_ELR0,  \
                .vsel_mask = vosel_mask,        \
                .enable_reg = MT6358_BUCK_##vreg##_CON0,        \
@@ -87,7 +89,7 @@ struct mt6358_regulator_info {
 }
 
 #define MT6358_LDO1(match, vreg, min, max, step,       \
-       volt_ranges, _da_vsel_reg, _da_vsel_mask,       \
+       _da_vsel_reg, _da_vsel_mask,    \
        vosel, vosel_mask)      \
 [MT6358_ID_##vreg] = { \
        .desc = {       \
@@ -98,8 +100,8 @@ struct mt6358_regulator_info {
                .id = MT6358_ID_##vreg, \
                .owner = THIS_MODULE,   \
                .n_voltages = ((max) - (min)) / (step) + 1,     \
-               .linear_ranges = volt_ranges,   \
-               .n_linear_ranges = ARRAY_SIZE(volt_ranges),     \
+               .min_uV = (min),                \
+               .uV_step = (step),              \
                .vsel_reg = vosel,      \
                .vsel_mask = vosel_mask,        \
                .enable_reg = MT6358_LDO_##vreg##_CON0, \
@@ -131,7 +133,7 @@ struct mt6358_regulator_info {
 }
 
 #define MT6366_BUCK(match, vreg, min, max, step,               \
-       volt_ranges, vosel_mask, _da_vsel_reg, _da_vsel_mask,   \
+       vosel_mask, _da_vsel_reg, _da_vsel_mask,        \
        _modeset_reg, _modeset_shift)           \
 [MT6366_ID_##vreg] = { \
        .desc = {       \
@@ -142,8 +144,8 @@ struct mt6358_regulator_info {
                .id = MT6366_ID_##vreg,         \
                .owner = THIS_MODULE,           \
                .n_voltages = ((max) - (min)) / (step) + 1,     \
-               .linear_ranges = volt_ranges,           \
-               .n_linear_ranges = ARRAY_SIZE(volt_ranges),     \
+               .min_uV = (min),                \
+               .uV_step = (step),              \
                .vsel_reg = MT6358_BUCK_##vreg##_ELR0,  \
                .vsel_mask = vosel_mask,        \
                .enable_reg = MT6358_BUCK_##vreg##_CON0,        \
@@ -183,7 +185,7 @@ struct mt6358_regulator_info {
 }
 
 #define MT6366_LDO1(match, vreg, min, max, step,       \
-       volt_ranges, _da_vsel_reg, _da_vsel_mask,       \
+       _da_vsel_reg, _da_vsel_mask,    \
        vosel, vosel_mask)      \
 [MT6366_ID_##vreg] = { \
        .desc = {       \
@@ -194,8 +196,8 @@ struct mt6358_regulator_info {
                .id = MT6366_ID_##vreg, \
                .owner = THIS_MODULE,   \
                .n_voltages = ((max) - (min)) / (step) + 1,     \
-               .linear_ranges = volt_ranges,   \
-               .n_linear_ranges = ARRAY_SIZE(volt_ranges),     \
+               .min_uV = (min),                \
+               .uV_step = (step),              \
                .vsel_reg = vosel,      \
                .vsel_mask = vosel_mask,        \
                .enable_reg = MT6358_LDO_##vreg##_CON0, \
@@ -226,21 +228,6 @@ struct mt6358_regulator_info {
        .qi = BIT(15),                                                  \
 }
 
-static const struct linear_range buck_volt_range1[] = {
-       REGULATOR_LINEAR_RANGE(500000, 0, 0x7f, 6250),
-};
-
-static const struct linear_range buck_volt_range2[] = {
-       REGULATOR_LINEAR_RANGE(500000, 0, 0x7f, 12500),
-};
-
-static const struct linear_range buck_volt_range3[] = {
-       REGULATOR_LINEAR_RANGE(500000, 0, 0x3f, 50000),
-};
-
-static const struct linear_range buck_volt_range4[] = {
-       REGULATOR_LINEAR_RANGE(1000000, 0, 0x7f, 12500),
-};
 
 static const unsigned int vdram2_voltages[] = {
        600000, 1800000,
@@ -277,7 +264,7 @@ static const unsigned int vcama_voltages[] = {
        2800000, 2900000, 3000000,
 };
 
-static const unsigned int vcn33_bt_wifi_voltages[] = {
+static const unsigned int vcn33_voltages[] = {
        3300000, 3400000, 3500000,
 };
 
@@ -321,7 +308,7 @@ static const u32 vcama_idx[] = {
        0, 7, 9, 10, 11, 12,
 };
 
-static const u32 vcn33_bt_wifi_idx[] = {
+static const u32 vcn33_idx[] = {
        1, 2, 3,
 };
 
@@ -342,9 +329,9 @@ static unsigned int mt6358_map_mode(unsigned int mode)
 static int mt6358_set_voltage_sel(struct regulator_dev *rdev,
                                  unsigned int selector)
 {
+       const struct mt6358_regulator_info *info = to_regulator_info(rdev->desc);
        int idx, ret;
        const u32 *pvol;
-       struct mt6358_regulator_info *info = rdev_get_drvdata(rdev);
 
        pvol = info->index_table;
 
@@ -358,9 +345,9 @@ static int mt6358_set_voltage_sel(struct regulator_dev *rdev,
 
 static int mt6358_get_voltage_sel(struct regulator_dev *rdev)
 {
+       const struct mt6358_regulator_info *info = to_regulator_info(rdev->desc);
        int idx, ret;
        u32 selector;
-       struct mt6358_regulator_info *info = rdev_get_drvdata(rdev);
        const u32 *pvol;
 
        ret = regmap_read(rdev->regmap, info->desc.vsel_reg, &selector);
@@ -384,8 +371,8 @@ static int mt6358_get_voltage_sel(struct regulator_dev *rdev)
 
 static int mt6358_get_buck_voltage_sel(struct regulator_dev *rdev)
 {
+       const struct mt6358_regulator_info *info = to_regulator_info(rdev->desc);
        int ret, regval;
-       struct mt6358_regulator_info *info = rdev_get_drvdata(rdev);
 
        ret = regmap_read(rdev->regmap, info->da_vsel_reg, &regval);
        if (ret != 0) {
@@ -402,9 +389,9 @@ static int mt6358_get_buck_voltage_sel(struct regulator_dev *rdev)
 
 static int mt6358_get_status(struct regulator_dev *rdev)
 {
+       const struct mt6358_regulator_info *info = to_regulator_info(rdev->desc);
        int ret;
        u32 regval;
-       struct mt6358_regulator_info *info = rdev_get_drvdata(rdev);
 
        ret = regmap_read(rdev->regmap, info->status_reg, &regval);
        if (ret != 0) {
@@ -418,7 +405,7 @@ static int mt6358_get_status(struct regulator_dev *rdev)
 static int mt6358_regulator_set_mode(struct regulator_dev *rdev,
                                     unsigned int mode)
 {
-       struct mt6358_regulator_info *info = rdev_get_drvdata(rdev);
+       const struct mt6358_regulator_info *info = to_regulator_info(rdev->desc);
        int val;
 
        switch (mode) {
@@ -443,7 +430,7 @@ static int mt6358_regulator_set_mode(struct regulator_dev *rdev,
 
 static unsigned int mt6358_regulator_get_mode(struct regulator_dev *rdev)
 {
-       struct mt6358_regulator_info *info = rdev_get_drvdata(rdev);
+       const struct mt6358_regulator_info *info = to_regulator_info(rdev->desc);
        int ret, regval;
 
        ret = regmap_read(rdev->regmap, info->modeset_reg, &regval);
@@ -464,8 +451,8 @@ static unsigned int mt6358_regulator_get_mode(struct regulator_dev *rdev)
 }
 
 static const struct regulator_ops mt6358_volt_range_ops = {
-       .list_voltage = regulator_list_voltage_linear_range,
-       .map_voltage = regulator_map_voltage_linear_range,
+       .list_voltage = regulator_list_voltage_linear,
+       .map_voltage = regulator_map_voltage_linear,
        .set_voltage_sel = regulator_set_voltage_sel_regmap,
        .get_voltage_sel = mt6358_get_buck_voltage_sel,
        .set_voltage_time_sel = regulator_set_voltage_time_sel,
@@ -498,37 +485,25 @@ static const struct regulator_ops mt6358_volt_fixed_ops = {
 };
 
 /* The array is indexed by id(MT6358_ID_XXX) */
-static struct mt6358_regulator_info mt6358_regulators[] = {
+static const struct mt6358_regulator_info mt6358_regulators[] = {
        MT6358_BUCK("buck_vdram1", VDRAM1, 500000, 2087500, 12500,
-                   buck_volt_range2, 0x7f, MT6358_BUCK_VDRAM1_DBG0, 0x7f,
-                   MT6358_VDRAM1_ANA_CON0, 8),
+                   0x7f, MT6358_BUCK_VDRAM1_DBG0, 0x7f, MT6358_VDRAM1_ANA_CON0, 8),
        MT6358_BUCK("buck_vcore", VCORE, 500000, 1293750, 6250,
-                   buck_volt_range1, 0x7f, MT6358_BUCK_VCORE_DBG0, 0x7f,
-                   MT6358_VCORE_VGPU_ANA_CON0, 1),
-       MT6358_BUCK("buck_vcore_sshub", VCORE_SSHUB, 500000, 1293750, 6250,
-                   buck_volt_range1, 0x7f, MT6358_BUCK_VCORE_SSHUB_ELR0, 0x7f,
-                   MT6358_VCORE_VGPU_ANA_CON0, 1),
+                   0x7f, MT6358_BUCK_VCORE_DBG0, 0x7f, MT6358_VCORE_VGPU_ANA_CON0, 1),
        MT6358_BUCK("buck_vpa", VPA, 500000, 3650000, 50000,
-                   buck_volt_range3, 0x3f, MT6358_BUCK_VPA_DBG0, 0x3f,
-                   MT6358_VPA_ANA_CON0, 3),
+                   0x3f, MT6358_BUCK_VPA_DBG0, 0x3f, MT6358_VPA_ANA_CON0, 3),
        MT6358_BUCK("buck_vproc11", VPROC11, 500000, 1293750, 6250,
-                   buck_volt_range1, 0x7f, MT6358_BUCK_VPROC11_DBG0, 0x7f,
-                   MT6358_VPROC_ANA_CON0, 1),
+                   0x7f, MT6358_BUCK_VPROC11_DBG0, 0x7f, MT6358_VPROC_ANA_CON0, 1),
        MT6358_BUCK("buck_vproc12", VPROC12, 500000, 1293750, 6250,
-                   buck_volt_range1, 0x7f, MT6358_BUCK_VPROC12_DBG0, 0x7f,
-                   MT6358_VPROC_ANA_CON0, 2),
+                   0x7f, MT6358_BUCK_VPROC12_DBG0, 0x7f, MT6358_VPROC_ANA_CON0, 2),
        MT6358_BUCK("buck_vgpu", VGPU, 500000, 1293750, 6250,
-                   buck_volt_range1, 0x7f, MT6358_BUCK_VGPU_ELR0, 0x7f,
-                   MT6358_VCORE_VGPU_ANA_CON0, 2),
+                   0x7f, MT6358_BUCK_VGPU_ELR0, 0x7f, MT6358_VCORE_VGPU_ANA_CON0, 2),
        MT6358_BUCK("buck_vs2", VS2, 500000, 2087500, 12500,
-                   buck_volt_range2, 0x7f, MT6358_BUCK_VS2_DBG0, 0x7f,
-                   MT6358_VS2_ANA_CON0, 8),
+                   0x7f, MT6358_BUCK_VS2_DBG0, 0x7f, MT6358_VS2_ANA_CON0, 8),
        MT6358_BUCK("buck_vmodem", VMODEM, 500000, 1293750, 6250,
-                   buck_volt_range1, 0x7f, MT6358_BUCK_VMODEM_DBG0, 0x7f,
-                   MT6358_VMODEM_ANA_CON0, 8),
+                   0x7f, MT6358_BUCK_VMODEM_DBG0, 0x7f, MT6358_VMODEM_ANA_CON0, 8),
        MT6358_BUCK("buck_vs1", VS1, 1000000, 2587500, 12500,
-                   buck_volt_range4, 0x7f, MT6358_BUCK_VS1_DBG0, 0x7f,
-                   MT6358_VS1_ANA_CON0, 8),
+                   0x7f, MT6358_BUCK_VS1_DBG0, 0x7f, MT6358_VS1_ANA_CON0, 8),
        MT6358_REG_FIXED("ldo_vrf12", VRF12,
                         MT6358_LDO_VRF12_CON0, 0, 1200000),
        MT6358_REG_FIXED("ldo_vio18", VIO18,
@@ -566,12 +541,8 @@ static struct mt6358_regulator_info mt6358_regulators[] = {
                   MT6358_LDO_VCAMA1_CON0, 0, MT6358_VCAMA1_ANA_CON0, 0xf00),
        MT6358_LDO("ldo_vemc", VEMC, vmch_vemc_voltages, vmch_vemc_idx,
                   MT6358_LDO_VEMC_CON0, 0, MT6358_VEMC_ANA_CON0, 0x700),
-       MT6358_LDO("ldo_vcn33_bt", VCN33_BT, vcn33_bt_wifi_voltages,
-                  vcn33_bt_wifi_idx, MT6358_LDO_VCN33_CON0_0,
-                  0, MT6358_VCN33_ANA_CON0, 0x300),
-       MT6358_LDO("ldo_vcn33_wifi", VCN33_WIFI, vcn33_bt_wifi_voltages,
-                  vcn33_bt_wifi_idx, MT6358_LDO_VCN33_CON0_1,
-                  0, MT6358_VCN33_ANA_CON0, 0x300),
+       MT6358_LDO("ldo_vcn33", VCN33, vcn33_voltages, vcn33_idx,
+                  MT6358_LDO_VCN33_CON0_0, 0, MT6358_VCN33_ANA_CON0, 0x300),
        MT6358_LDO("ldo_vcama2", VCAMA2, vcama_voltages, vcama_idx,
                   MT6358_LDO_VCAMA2_CON0, 0, MT6358_VCAMA2_ANA_CON0, 0xf00),
        MT6358_LDO("ldo_vmc", VMC, vmc_voltages, vmc_idx,
@@ -582,55 +553,35 @@ static struct mt6358_regulator_info mt6358_regulators[] = {
        MT6358_LDO("ldo_vsim2", VSIM2, vsim_voltages, vsim_idx,
                   MT6358_LDO_VSIM2_CON0, 0, MT6358_VSIM2_ANA_CON0, 0xf00),
        MT6358_LDO1("ldo_vsram_proc11", VSRAM_PROC11, 500000, 1293750, 6250,
-                   buck_volt_range1, MT6358_LDO_VSRAM_PROC11_DBG0, 0x7f00,
-                   MT6358_LDO_VSRAM_CON0, 0x7f),
+                   MT6358_LDO_VSRAM_PROC11_DBG0, 0x7f00, MT6358_LDO_VSRAM_CON0, 0x7f),
        MT6358_LDO1("ldo_vsram_others", VSRAM_OTHERS, 500000, 1293750, 6250,
-                   buck_volt_range1, MT6358_LDO_VSRAM_OTHERS_DBG0, 0x7f00,
-                   MT6358_LDO_VSRAM_CON2, 0x7f),
-       MT6358_LDO1("ldo_vsram_others_sshub", VSRAM_OTHERS_SSHUB, 500000,
-                   1293750, 6250, buck_volt_range1,
-                   MT6358_LDO_VSRAM_OTHERS_SSHUB_CON1, 0x7f,
-                   MT6358_LDO_VSRAM_OTHERS_SSHUB_CON1, 0x7f),
+                   MT6358_LDO_VSRAM_OTHERS_DBG0, 0x7f00, MT6358_LDO_VSRAM_CON2, 0x7f),
        MT6358_LDO1("ldo_vsram_gpu", VSRAM_GPU, 500000, 1293750, 6250,
-                   buck_volt_range1, MT6358_LDO_VSRAM_GPU_DBG0, 0x7f00,
-                   MT6358_LDO_VSRAM_CON3, 0x7f),
+                   MT6358_LDO_VSRAM_GPU_DBG0, 0x7f00, MT6358_LDO_VSRAM_CON3, 0x7f),
        MT6358_LDO1("ldo_vsram_proc12", VSRAM_PROC12, 500000, 1293750, 6250,
-                   buck_volt_range1, MT6358_LDO_VSRAM_PROC12_DBG0, 0x7f00,
-                   MT6358_LDO_VSRAM_CON1, 0x7f),
+                   MT6358_LDO_VSRAM_PROC12_DBG0, 0x7f00, MT6358_LDO_VSRAM_CON1, 0x7f),
 };
 
 /* The array is indexed by id(MT6366_ID_XXX) */
-static struct mt6358_regulator_info mt6366_regulators[] = {
+static const struct mt6358_regulator_info mt6366_regulators[] = {
        MT6366_BUCK("buck_vdram1", VDRAM1, 500000, 2087500, 12500,
-                   buck_volt_range2, 0x7f, MT6358_BUCK_VDRAM1_DBG0, 0x7f,
-                   MT6358_VDRAM1_ANA_CON0, 8),
+                   0x7f, MT6358_BUCK_VDRAM1_DBG0, 0x7f, MT6358_VDRAM1_ANA_CON0, 8),
        MT6366_BUCK("buck_vcore", VCORE, 500000, 1293750, 6250,
-                   buck_volt_range1, 0x7f, MT6358_BUCK_VCORE_DBG0, 0x7f,
-                   MT6358_VCORE_VGPU_ANA_CON0, 1),
-       MT6366_BUCK("buck_vcore_sshub", VCORE_SSHUB, 500000, 1293750, 6250,
-                   buck_volt_range1, 0x7f, MT6358_BUCK_VCORE_SSHUB_ELR0, 0x7f,
-                   MT6358_VCORE_VGPU_ANA_CON0, 1),
+                   0x7f, MT6358_BUCK_VCORE_DBG0, 0x7f, MT6358_VCORE_VGPU_ANA_CON0, 1),
        MT6366_BUCK("buck_vpa", VPA, 500000, 3650000, 50000,
-                   buck_volt_range3, 0x3f, MT6358_BUCK_VPA_DBG0, 0x3f,
-                   MT6358_VPA_ANA_CON0, 3),
+                   0x3f, MT6358_BUCK_VPA_DBG0, 0x3f, MT6358_VPA_ANA_CON0, 3),
        MT6366_BUCK("buck_vproc11", VPROC11, 500000, 1293750, 6250,
-                   buck_volt_range1, 0x7f, MT6358_BUCK_VPROC11_DBG0, 0x7f,
-                   MT6358_VPROC_ANA_CON0, 1),
+                   0x7f, MT6358_BUCK_VPROC11_DBG0, 0x7f, MT6358_VPROC_ANA_CON0, 1),
        MT6366_BUCK("buck_vproc12", VPROC12, 500000, 1293750, 6250,
-                   buck_volt_range1, 0x7f, MT6358_BUCK_VPROC12_DBG0, 0x7f,
-                   MT6358_VPROC_ANA_CON0, 2),
+                   0x7f, MT6358_BUCK_VPROC12_DBG0, 0x7f, MT6358_VPROC_ANA_CON0, 2),
        MT6366_BUCK("buck_vgpu", VGPU, 500000, 1293750, 6250,
-                   buck_volt_range1, 0x7f, MT6358_BUCK_VGPU_ELR0, 0x7f,
-                   MT6358_VCORE_VGPU_ANA_CON0, 2),
+                   0x7f, MT6358_BUCK_VGPU_ELR0, 0x7f, MT6358_VCORE_VGPU_ANA_CON0, 2),
        MT6366_BUCK("buck_vs2", VS2, 500000, 2087500, 12500,
-                   buck_volt_range2, 0x7f, MT6358_BUCK_VS2_DBG0, 0x7f,
-                   MT6358_VS2_ANA_CON0, 8),
+                   0x7f, MT6358_BUCK_VS2_DBG0, 0x7f, MT6358_VS2_ANA_CON0, 8),
        MT6366_BUCK("buck_vmodem", VMODEM, 500000, 1293750, 6250,
-                   buck_volt_range1, 0x7f, MT6358_BUCK_VMODEM_DBG0, 0x7f,
-                   MT6358_VMODEM_ANA_CON0, 8),
+                   0x7f, MT6358_BUCK_VMODEM_DBG0, 0x7f, MT6358_VMODEM_ANA_CON0, 8),
        MT6366_BUCK("buck_vs1", VS1, 1000000, 2587500, 12500,
-                   buck_volt_range4, 0x7f, MT6358_BUCK_VS1_DBG0, 0x7f,
-                   MT6358_VS1_ANA_CON0, 8),
+                   0x7f, MT6358_BUCK_VS1_DBG0, 0x7f, MT6358_VS1_ANA_CON0, 8),
        MT6366_REG_FIXED("ldo_vrf12", VRF12,
                         MT6358_LDO_VRF12_CON0, 0, 1200000),
        MT6366_REG_FIXED("ldo_vio18", VIO18,
@@ -662,41 +613,72 @@ static struct mt6358_regulator_info mt6366_regulators[] = {
                   MT6358_LDO_VMCH_CON0, 0, MT6358_VMCH_ANA_CON0, 0x700),
        MT6366_LDO("ldo_vemc", VEMC, vmch_vemc_voltages, vmch_vemc_idx,
                   MT6358_LDO_VEMC_CON0, 0, MT6358_VEMC_ANA_CON0, 0x700),
-       MT6366_LDO("ldo_vcn33_bt", VCN33_BT, vcn33_bt_wifi_voltages,
-                  vcn33_bt_wifi_idx, MT6358_LDO_VCN33_CON0_0,
-                  0, MT6358_VCN33_ANA_CON0, 0x300),
-       MT6366_LDO("ldo_vcn33_wifi", VCN33_WIFI, vcn33_bt_wifi_voltages,
-                  vcn33_bt_wifi_idx, MT6358_LDO_VCN33_CON0_1,
-                  0, MT6358_VCN33_ANA_CON0, 0x300),
+       MT6366_LDO("ldo_vcn33", VCN33, vcn33_voltages, vcn33_idx,
+                  MT6358_LDO_VCN33_CON0_0, 0, MT6358_VCN33_ANA_CON0, 0x300),
        MT6366_LDO("ldo_vmc", VMC, vmc_voltages, vmc_idx,
                   MT6358_LDO_VMC_CON0, 0, MT6358_VMC_ANA_CON0, 0xf00),
        MT6366_LDO("ldo_vsim2", VSIM2, vsim_voltages, vsim_idx,
                   MT6358_LDO_VSIM2_CON0, 0, MT6358_VSIM2_ANA_CON0, 0xf00),
        MT6366_LDO1("ldo_vsram_proc11", VSRAM_PROC11, 500000, 1293750, 6250,
-                   buck_volt_range1, MT6358_LDO_VSRAM_PROC11_DBG0, 0x7f00,
-                   MT6358_LDO_VSRAM_CON0, 0x7f),
+                   MT6358_LDO_VSRAM_PROC11_DBG0, 0x7f00, MT6358_LDO_VSRAM_CON0, 0x7f),
        MT6366_LDO1("ldo_vsram_others", VSRAM_OTHERS, 500000, 1293750, 6250,
-                   buck_volt_range1, MT6358_LDO_VSRAM_OTHERS_DBG0, 0x7f00,
-                   MT6358_LDO_VSRAM_CON2, 0x7f),
-       MT6366_LDO1("ldo_vsram_others_sshub", VSRAM_OTHERS_SSHUB, 500000,
-                   1293750, 6250, buck_volt_range1,
-                   MT6358_LDO_VSRAM_OTHERS_SSHUB_CON1, 0x7f,
-                   MT6358_LDO_VSRAM_OTHERS_SSHUB_CON1, 0x7f),
+                   MT6358_LDO_VSRAM_OTHERS_DBG0, 0x7f00, MT6358_LDO_VSRAM_CON2, 0x7f),
        MT6366_LDO1("ldo_vsram_gpu", VSRAM_GPU, 500000, 1293750, 6250,
-                   buck_volt_range1, MT6358_LDO_VSRAM_GPU_DBG0, 0x7f00,
-                   MT6358_LDO_VSRAM_CON3, 0x7f),
+                   MT6358_LDO_VSRAM_GPU_DBG0, 0x7f00, MT6358_LDO_VSRAM_CON3, 0x7f),
        MT6366_LDO1("ldo_vsram_proc12", VSRAM_PROC12, 500000, 1293750, 6250,
-                   buck_volt_range1, MT6358_LDO_VSRAM_PROC12_DBG0, 0x7f00,
-                   MT6358_LDO_VSRAM_CON1, 0x7f),
+                   MT6358_LDO_VSRAM_PROC12_DBG0, 0x7f00, MT6358_LDO_VSRAM_CON1, 0x7f),
 };
 
+static int mt6358_sync_vcn33_setting(struct device *dev)
+{
+       struct mt6397_chip *mt6397 = dev_get_drvdata(dev->parent);
+       unsigned int val;
+       int ret;
+
+       /*
+        * VCN33_WIFI and VCN33_BT are two separate enable bits for the same
+        * regulator. They share the same voltage setting and output pin.
+        * Instead of having two potentially conflicting regulators, just have
+        * one VCN33 regulator. Sync the two enable bits and only use one in
+        * the regulator device.
+        */
+       ret = regmap_read(mt6397->regmap, MT6358_LDO_VCN33_CON0_1, &val);
+       if (ret) {
+               dev_err(dev, "Failed to read VCN33_WIFI setting\n");
+               return ret;
+       }
+
+       if (!(val & BIT(0)))
+               return 0;
+
+       /* Sync VCN33_WIFI enable status to VCN33_BT */
+       ret = regmap_update_bits(mt6397->regmap, MT6358_LDO_VCN33_CON0_0, BIT(0), BIT(0));
+       if (ret) {
+               dev_err(dev, "Failed to sync VCN33_WIFI setting to VCN33_BT\n");
+               return ret;
+       }
+
+       /* Disable VCN33_WIFI */
+       ret = regmap_update_bits(mt6397->regmap, MT6358_LDO_VCN33_CON0_1, BIT(0), 0);
+       if (ret) {
+               dev_err(dev, "Failed to disable VCN33_WIFI\n");
+               return ret;
+       }
+
+       return 0;
+}
+
 static int mt6358_regulator_probe(struct platform_device *pdev)
 {
        struct mt6397_chip *mt6397 = dev_get_drvdata(pdev->dev.parent);
        struct regulator_config config = {};
        struct regulator_dev *rdev;
-       struct mt6358_regulator_info *mt6358_info;
-       int i, max_regulator;
+       const struct mt6358_regulator_info *mt6358_info;
+       int i, max_regulator, ret;
+
+       ret = mt6358_sync_vcn33_setting(&pdev->dev);
+       if (ret)
+               return ret;
 
        if (mt6397->chip_id == MT6366_CHIP_ID) {
                max_regulator = MT6366_MAX_REGULATOR;
@@ -708,7 +690,6 @@ static int mt6358_regulator_probe(struct platform_device *pdev)
 
        for (i = 0; i < max_regulator; i++) {
                config.dev = &pdev->dev;
-               config.driver_data = &mt6358_info[i];
                config.regmap = mt6397->regmap;
 
                rdev = devm_regulator_register(&pdev->dev,
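The mt6358 conversion drops config.driver_data and instead recovers the driver-private wrapper from rdev->desc via container_of(), which is what allows the regulator tables to become const. A minimal sketch of the pattern with invented names, assuming the generic regulator_desc is embedded directly in the wrapper:

    struct example_regulator_info {
            struct regulator_desc desc;
            u32 status_reg;
    };

    #define to_example_info(d) container_of((d), struct example_regulator_info, desc)

    static int example_get_status(struct regulator_dev *rdev)
    {
            const struct example_regulator_info *info = to_example_info(rdev->desc);
            unsigned int val;
            int ret;

            ret = regmap_read(rdev->regmap, info->status_reg, &val);
            if (ret)
                    return ret;

            return (val & BIT(0)) ? REGULATOR_STATUS_ON : REGULATOR_STATUS_OFF;
    }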
index e75dd92..91bfb7e 100644 (file)
@@ -875,7 +875,7 @@ static struct i2c_driver pca9450_i2c_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = pca9450_of_match,
        },
-       .probe_new = pca9450_i2c_probe,
+       .probe = pca9450_i2c_probe,
 };
 
 module_i2c_driver(pca9450_i2c_driver);
index 99a15c3..b0781d9 100644 (file)
@@ -610,7 +610,7 @@ static struct i2c_driver pf8x00_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = pf8x00_dt_ids,
        },
-       .probe_new = pf8x00_i2c_probe,
+       .probe = pf8x00_i2c_probe,
 };
 module_i2c_driver(pf8x00_regulator_driver);
 
index a9fcf6a..8d7e6c3 100644 (file)
@@ -848,7 +848,7 @@ static struct i2c_driver pfuze_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = pfuze_dt_ids,
        },
-       .probe_new = pfuze100_regulator_probe,
+       .probe = pfuze100_regulator_probe,
 };
 module_i2c_driver(pfuze_driver);
 
index f170e0d..aa90360 100644 (file)
@@ -379,7 +379,7 @@ static struct i2c_driver pv88060_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(pv88060_dt_ids),
        },
-       .probe_new = pv88060_i2c_probe,
+       .probe = pv88060_i2c_probe,
        .id_table = pv88060_i2c_id,
 };
 
index 133b89d..7ab3e4a 100644 (file)
@@ -560,7 +560,7 @@ static struct i2c_driver pv88080_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(pv88080_dt_ids),
        },
-       .probe_new = pv88080_i2c_probe,
+       .probe = pv88080_i2c_probe,
        .id_table = pv88080_i2c_id,
 };
 
index 1bc33bc..f4acde4 100644 (file)
@@ -400,7 +400,7 @@ static struct i2c_driver pv88090_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(pv88090_dt_ids),
        },
-       .probe_new = pv88090_i2c_probe,
+       .probe = pv88090_i2c_probe,
        .id_table = pv88090_i2c_id,
 };
 
diff --git a/drivers/regulator/raa215300.c b/drivers/regulator/raa215300.c
new file mode 100644 (file)
index 0000000..24a1c89
--- /dev/null
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Renesas RAA215300 PMIC driver
+//
+// Copyright (C) 2023 Renesas Electronics Corporation
+//
+
+#include <linux/clk.h>
+#include <linux/clkdev.h>
+#include <linux/clk-provider.h>
+#include <linux/err.h>
+#include <linux/i2c.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/regmap.h>
+
+#define RAA215300_FAULT_LATCHED_STATUS_1       0x59
+#define RAA215300_FAULT_LATCHED_STATUS_2       0x5a
+#define RAA215300_FAULT_LATCHED_STATUS_3       0x5b
+#define RAA215300_FAULT_LATCHED_STATUS_4       0x5c
+#define RAA215300_FAULT_LATCHED_STATUS_6       0x5e
+
+#define RAA215300_INT_MASK_1   0x64
+#define RAA215300_INT_MASK_2   0x65
+#define RAA215300_INT_MASK_3   0x66
+#define RAA215300_INT_MASK_4   0x67
+#define RAA215300_INT_MASK_6   0x68
+
+#define RAA215300_REG_BLOCK_EN 0x6c
+#define RAA215300_HW_REV       0xf8
+
+#define RAA215300_INT_MASK_1_ALL       GENMASK(5, 0)
+#define RAA215300_INT_MASK_2_ALL       GENMASK(3, 0)
+#define RAA215300_INT_MASK_3_ALL       GENMASK(5, 0)
+#define RAA215300_INT_MASK_4_ALL       BIT(0)
+#define RAA215300_INT_MASK_6_ALL       GENMASK(7, 0)
+
+#define RAA215300_REG_BLOCK_EN_RTC_EN  BIT(6)
+#define RAA215300_RTC_DEFAULT_ADDR     0x6f
+
+static const char * const clkin_name = "clkin";
+static const char * const xin_name = "xin";
+static struct clk *clk;
+
+static const struct regmap_config raa215300_regmap_config = {
+       .reg_bits = 8,
+       .val_bits = 8,
+       .max_register = 0xff,
+};
+
+static void raa215300_rtc_unregister_device(void *data)
+{
+       i2c_unregister_device(data);
+       if (clk) {
+               clk_unregister_fixed_rate(clk);
+               clk = NULL;
+       }
+}
+
+static int raa215300_clk_present(struct i2c_client *client, const char *name)
+{
+       struct clk *clk;
+
+       clk = devm_clk_get_optional(&client->dev, name);
+       if (IS_ERR(clk))
+               return PTR_ERR(clk);
+
+       return !!clk;
+}
+
+static int raa215300_i2c_probe(struct i2c_client *client)
+{
+       struct device *dev = &client->dev;
+       const char *clk_name = xin_name;
+       unsigned int pmic_version, val;
+       struct regmap *regmap;
+       int ret;
+
+       regmap = devm_regmap_init_i2c(client, &raa215300_regmap_config);
+       if (IS_ERR(regmap))
+               return dev_err_probe(dev, PTR_ERR(regmap),
+                                    "regmap i2c init failed\n");
+
+       ret = regmap_read(regmap, RAA215300_HW_REV, &pmic_version);
+       if (ret < 0)
+               return dev_err_probe(dev, ret, "HW rev read failed\n");
+
+       dev_dbg(dev, "RAA215300 PMIC version 0x%04x\n", pmic_version);
+
+       /* Clear all blocks except RTC, if enabled */
+       regmap_read(regmap, RAA215300_REG_BLOCK_EN, &val);
+       val &= RAA215300_REG_BLOCK_EN_RTC_EN;
+       regmap_write(regmap, RAA215300_REG_BLOCK_EN, val);
+
+       /* Clear the latched registers */
+       regmap_read(regmap, RAA215300_FAULT_LATCHED_STATUS_1, &val);
+       regmap_write(regmap, RAA215300_FAULT_LATCHED_STATUS_1, val);
+       regmap_read(regmap, RAA215300_FAULT_LATCHED_STATUS_2, &val);
+       regmap_write(regmap, RAA215300_FAULT_LATCHED_STATUS_2, val);
+       regmap_read(regmap, RAA215300_FAULT_LATCHED_STATUS_3, &val);
+       regmap_write(regmap, RAA215300_FAULT_LATCHED_STATUS_3, val);
+       regmap_read(regmap, RAA215300_FAULT_LATCHED_STATUS_4, &val);
+       regmap_write(regmap, RAA215300_FAULT_LATCHED_STATUS_4, val);
+       regmap_read(regmap, RAA215300_FAULT_LATCHED_STATUS_6, &val);
+       regmap_write(regmap, RAA215300_FAULT_LATCHED_STATUS_6, val);
+
+       /* Mask all the PMIC interrupts */
+       regmap_write(regmap, RAA215300_INT_MASK_1, RAA215300_INT_MASK_1_ALL);
+       regmap_write(regmap, RAA215300_INT_MASK_2, RAA215300_INT_MASK_2_ALL);
+       regmap_write(regmap, RAA215300_INT_MASK_3, RAA215300_INT_MASK_3_ALL);
+       regmap_write(regmap, RAA215300_INT_MASK_4, RAA215300_INT_MASK_4_ALL);
+       regmap_write(regmap, RAA215300_INT_MASK_6, RAA215300_INT_MASK_6_ALL);
+
+       ret = raa215300_clk_present(client, xin_name);
+       if (ret < 0) {
+               return ret;
+       } else if (!ret) {
+               ret = raa215300_clk_present(client, clkin_name);
+               if (ret < 0)
+                       return ret;
+
+               clk_name = clkin_name;
+       }
+
+       if (ret) {
+               char *name = pmic_version >= 0x12 ? "isl1208" : "raa215300_a0";
+               struct device_node *np = client->dev.of_node;
+               u32 addr = RAA215300_RTC_DEFAULT_ADDR;
+               struct i2c_board_info info = {};
+               struct i2c_client *rtc_client;
+               ssize_t size;
+
+               clk = clk_register_fixed_rate(NULL, clk_name, NULL, 0, 32000);
+               clk_register_clkdev(clk, clk_name, NULL);
+
+               if (np) {
+                       int i;
+
+                       i = of_property_match_string(np, "reg-names", "rtc");
+                       if (i >= 0)
+                               of_property_read_u32_index(np, "reg", i, &addr);
+               }
+
+               info.addr = addr;
+               if (client->irq > 0)
+                       info.irq = client->irq;
+
+               size = strscpy(info.type, name, sizeof(info.type));
+               if (size < 0)
+                       return dev_err_probe(dev, size,
+                                            "Invalid device name: %s\n", name);
+
+               /* Enable RTC block */
+               regmap_update_bits(regmap, RAA215300_REG_BLOCK_EN,
+                                  RAA215300_REG_BLOCK_EN_RTC_EN,
+                                  RAA215300_REG_BLOCK_EN_RTC_EN);
+
+               rtc_client = i2c_new_client_device(client->adapter, &info);
+               if (IS_ERR(rtc_client))
+                       return PTR_ERR(rtc_client);
+
+               ret = devm_add_action_or_reset(dev,
+                                              raa215300_rtc_unregister_device,
+                                              rtc_client);
+               if (ret < 0)
+                       return ret;
+       }
+
+       return 0;
+}
+
+static const struct of_device_id raa215300_dt_match[] = {
+       { .compatible = "renesas,raa215300" },
+       { /* sentinel */ }
+};
+MODULE_DEVICE_TABLE(of, raa215300_dt_match);
+
+static struct i2c_driver raa215300_i2c_driver = {
+       .driver = {
+               .name = "raa215300",
+               .of_match_table = raa215300_dt_match,
+       },
+       .probe_new = raa215300_i2c_probe,
+};
+module_i2c_driver(raa215300_i2c_driver);
+
+MODULE_DESCRIPTION("Renesas RAA215300 PMIC driver");
+MODULE_AUTHOR("Fabrizio Castro <fabrizio.castro.jz@renesas.com>");
+MODULE_AUTHOR("Biju Das <biju.das.jz@bp.renesas.com>");
+MODULE_LICENSE("GPL");
index 3637e81..460525e 100644 (file)
@@ -3,9 +3,11 @@
  * Regulator driver for Rockchip RK805/RK808/RK818
  *
  * Copyright (c) 2014, Fuzhou Rockchip Electronics Co., Ltd
+ * Copyright (c) 2021 Rockchip Electronics Co., Ltd.
  *
  * Author: Chris Zhong <zyw@rock-chips.com>
  * Author: Zhang Qing <zhangqing@rock-chips.com>
+ * Author: Xu Shengfei <xsf@rock-chips.com>
  *
  * Copyright (C) 2016 PHYTEC Messtechnik GmbH
  *
 #define RK818_LDO3_ON_VSEL_MASK                0xf
 #define RK818_BOOST_ON_VSEL_MASK       0xe0
 
+#define RK806_DCDC_SLP_REG_OFFSET      0x0A
+#define RK806_NLDO_SLP_REG_OFFSET      0x05
+#define RK806_PLDO_SLP_REG_OFFSET      0x06
+
+#define RK806_BUCK_SEL_CNT             0xff
+#define RK806_LDO_SEL_CNT              0xff
+
 /* Ramp rate definitions for buck1 / buck2 only */
 #define RK808_RAMP_RATE_OFFSET         3
 #define RK808_RAMP_RATE_MASK           (3 << RK808_RAMP_RATE_OFFSET)
        RK8XX_DESC_COM(_id, _match, _supply, _min, _max, _step, _vreg,  \
        _vmask, _ereg, _emask, 0, 0, _etime, &rk805_reg_ops)
 
+#define RK806_REGULATOR(_name, _supply_name, _id, _ops,\
+                       _n_voltages, _vr, _er, _lr, ctrl_bit,\
+                       _rr, _rm, _rt)\
+[_id] = {\
+               .name = _name,\
+               .supply_name = _supply_name,\
+               .of_match = of_match_ptr(_name),\
+               .regulators_node = of_match_ptr("regulators"),\
+               .id = _id,\
+               .ops = &_ops,\
+               .type = REGULATOR_VOLTAGE,\
+               .n_voltages = _n_voltages,\
+               .linear_ranges = _lr,\
+               .n_linear_ranges = ARRAY_SIZE(_lr),\
+               .vsel_reg = _vr,\
+               .vsel_mask = 0xff,\
+               .enable_reg = _er,\
+               .enable_mask = ENABLE_MASK(ctrl_bit),\
+               .enable_val = ENABLE_MASK(ctrl_bit),\
+               .disable_val = DISABLE_VAL(ctrl_bit),\
+               .of_map_mode = rk8xx_regulator_of_map_mode,\
+               .ramp_reg = _rr,\
+               .ramp_mask = _rm,\
+               .ramp_delay_table = _rt, \
+               .n_ramp_values = ARRAY_SIZE(_rt), \
+               .owner = THIS_MODULE,\
+       }
+
 #define RK8XX_DESC(_id, _match, _supply, _min, _max, _step, _vreg,     \
        _vmask, _ereg, _emask, _etime)                                  \
        RK8XX_DESC_COM(_id, _match, _supply, _min, _max, _step, _vreg,  \
        RKXX_DESC_SWITCH_COM(_id, _match, _supply, _ereg, _emask,       \
        0, 0, &rk808_switch_ops)
 
+struct rk8xx_register_bit {
+       u8 reg;
+       u8 bit;
+};
+
+#define RK8XX_REG_BIT(_reg, _bit)                                      \
+       {                                                               \
+               .reg = _reg,                                            \
+               .bit = BIT(_bit),                                               \
+       }
+
 struct rk808_regulator_data {
        struct gpio_desc *dvs_gpio[2];
 };
@@ -216,6 +264,133 @@ static const unsigned int rk817_buck1_4_ramp_table[] = {
        3000, 6300, 12500, 25000
 };
 
+static int rk806_set_mode_dcdc(struct regulator_dev *rdev, unsigned int mode)
+{
+       int rid = rdev_get_id(rdev);
+       int ctr_bit, reg;
+
+       reg = RK806_POWER_FPWM_EN0 + rid / 8;
+       ctr_bit = rid % 8;
+
+       switch (mode) {
+       case REGULATOR_MODE_FAST:
+               return regmap_update_bits(rdev->regmap, reg,
+                                         PWM_MODE_MSK << ctr_bit,
+                                         FPWM_MODE << ctr_bit);
+       case REGULATOR_MODE_NORMAL:
+               return regmap_update_bits(rdev->regmap, reg,
+                                         PWM_MODE_MSK << ctr_bit,
+                                         AUTO_PWM_MODE << ctr_bit);
+       default:
+               dev_err(rdev_get_dev(rdev), "mode unsupported: %u\n", mode);
+               return -EINVAL;
+       }
+
+       return 0;
+}
+
+static unsigned int rk806_get_mode_dcdc(struct regulator_dev *rdev)
+{
+       int rid = rdev_get_id(rdev);
+       int ctr_bit, reg;
+       unsigned int val;
+       int err;
+
+       reg = RK806_POWER_FPWM_EN0 + rid / 8;
+       ctr_bit = rid % 8;
+
+       err = regmap_read(rdev->regmap, reg, &val);
+       if (err)
+               return err;
+
+       if ((val >> ctr_bit) & FPWM_MODE)
+               return REGULATOR_MODE_FAST;
+       else
+               return REGULATOR_MODE_NORMAL;
+}
+
+static const struct rk8xx_register_bit rk806_dcdc_rate2[] = {
+       RK8XX_REG_BIT(0xEB, 0),
+       RK8XX_REG_BIT(0xEB, 1),
+       RK8XX_REG_BIT(0xEB, 2),
+       RK8XX_REG_BIT(0xEB, 3),
+       RK8XX_REG_BIT(0xEB, 4),
+       RK8XX_REG_BIT(0xEB, 5),
+       RK8XX_REG_BIT(0xEB, 6),
+       RK8XX_REG_BIT(0xEB, 7),
+       RK8XX_REG_BIT(0xEA, 0),
+       RK8XX_REG_BIT(0xEA, 1),
+};
+
+static const unsigned int rk806_ramp_delay_table_dcdc[] = {
+       50000, 25000, 12500, 6250, 3125, 1560, 961, 390
+};
+
+static int rk806_set_ramp_delay_dcdc(struct regulator_dev *rdev, int ramp_delay)
+{
+       int rid = rdev_get_id(rdev);
+       int regval, ramp_value, ret;
+
+       ret = regulator_find_closest_bigger(ramp_delay, rdev->desc->ramp_delay_table,
+                                           rdev->desc->n_ramp_values, &ramp_value);
+       if (ret) {
+               dev_warn(rdev_get_dev(rdev),
+                        "Can't set ramp-delay %u, setting %u\n", ramp_delay,
+                        rdev->desc->ramp_delay_table[ramp_value]);
+       }
+
+       regval = ramp_value << (ffs(rdev->desc->ramp_mask) - 1);
+
+       ret = regmap_update_bits(rdev->regmap, rdev->desc->ramp_reg,
+                                rdev->desc->ramp_mask, regval);
+       if (ret)
+               return ret;
+
+       /*
+        * The above is effectively a copy of regulator_set_ramp_delay_regmap(),
+        * but that only stores the lower 2 bits for rk806 DCDC ramp. The MSB must
+        * be stored in a separate register, so this open codes the implementation
+        * to have access to the ramp_value.
+        */
+
+       regval = (ramp_value >> 2) & 0x1 ? rk806_dcdc_rate2[rid].bit : 0;
+       return regmap_update_bits(rdev->regmap, rk806_dcdc_rate2[rid].reg,
+                                 rk806_dcdc_rate2[rid].bit,
+                                 regval);
+}
+
+static const unsigned int rk806_ramp_delay_table_ldo[] = {
+       100000, 50000, 25000, 12500, 6280, 3120, 1900, 780
+};
+
+static int rk806_set_suspend_voltage_range(struct regulator_dev *rdev, int reg_offset, int uv)
+{
+       int sel = regulator_map_voltage_linear_range(rdev, uv, uv);
+       unsigned int reg;
+
+       if (sel < 0)
+               return -EINVAL;
+
+       reg = rdev->desc->vsel_reg + reg_offset;
+
+       return regmap_update_bits(rdev->regmap, reg, rdev->desc->vsel_mask, sel);
+}
+
+static int rk806_set_suspend_voltage_range_dcdc(struct regulator_dev *rdev, int uv)
+{
+       return rk806_set_suspend_voltage_range(rdev, RK806_DCDC_SLP_REG_OFFSET, uv);
+}
+
+static int rk806_set_suspend_voltage_range_nldo(struct regulator_dev *rdev, int uv)
+{
+       return rk806_set_suspend_voltage_range(rdev, RK806_NLDO_SLP_REG_OFFSET, uv);
+}
+
+static int rk806_set_suspend_voltage_range_pldo(struct regulator_dev *rdev, int uv)
+{
+       return rk806_set_suspend_voltage_range(rdev, RK806_PLDO_SLP_REG_OFFSET, uv);
+}
+
 static int rk808_buck1_2_get_voltage_sel_regmap(struct regulator_dev *rdev)
 {
        struct rk808_regulator_data *pdata = rdev_get_drvdata(rdev);
@@ -393,6 +568,47 @@ static int rk805_set_suspend_disable(struct regulator_dev *rdev)
                                  0);
 }
 
+static const struct rk8xx_register_bit rk806_suspend_bits[] = {
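+       /* Per-regulator sleep-enable bit, indexed by the value from rdev_get_id() */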
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN0, 0),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN0, 1),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN0, 2),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN0, 3),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN0, 4),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN0, 5),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN0, 6),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN0, 7),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN1, 6),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN1, 7),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN1, 0),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN1, 1),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN1, 2),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN1, 3),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN1, 4),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN2, 1),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN2, 2),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN2, 3),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN2, 4),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN2, 5),
+       RK8XX_REG_BIT(RK806_POWER_SLP_EN2, 0),
+};
+
+static int rk806_set_suspend_enable(struct regulator_dev *rdev)
+{
+       int rid = rdev_get_id(rdev);
+
+       return regmap_update_bits(rdev->regmap, rk806_suspend_bits[rid].reg,
+                                 rk806_suspend_bits[rid].bit,
+                                 rk806_suspend_bits[rid].bit);
+}
+
+static int rk806_set_suspend_disable(struct regulator_dev *rdev)
+{
+       int rid = rdev_get_id(rdev);
+
+       return regmap_update_bits(rdev->regmap, rk806_suspend_bits[rid].reg,
+                                 rk806_suspend_bits[rid].bit, 0);
+}
+
 static int rk808_set_suspend_enable(struct regulator_dev *rdev)
 {
        unsigned int reg;
@@ -561,6 +777,64 @@ static const struct regulator_ops rk805_switch_ops = {
        .set_suspend_disable    = rk805_set_suspend_disable,
 };
 
+static const struct regulator_ops rk806_ops_dcdc = {
+       .list_voltage           = regulator_list_voltage_linear_range,
+       .map_voltage            = regulator_map_voltage_linear_range,
+       .get_voltage_sel        = regulator_get_voltage_sel_regmap,
+       .set_voltage_sel        = regulator_set_voltage_sel_regmap,
+       .set_voltage_time_sel   = regulator_set_voltage_time_sel,
+       .set_mode               = rk806_set_mode_dcdc,
+       .get_mode               = rk806_get_mode_dcdc,
+
+       .enable                 = regulator_enable_regmap,
+       .disable                = regulator_disable_regmap,
+       .is_enabled             = rk8xx_is_enabled_wmsk_regmap,
+
+       .set_suspend_mode       = rk806_set_mode_dcdc,
+       .set_ramp_delay         = rk806_set_ramp_delay_dcdc,
+
+       .set_suspend_voltage    = rk806_set_suspend_voltage_range_dcdc,
+       .set_suspend_enable     = rk806_set_suspend_enable,
+       .set_suspend_disable    = rk806_set_suspend_disable,
+};
+
+static const struct regulator_ops rk806_ops_nldo = {
+       .list_voltage           = regulator_list_voltage_linear_range,
+       .map_voltage            = regulator_map_voltage_linear_range,
+       .get_voltage_sel        = regulator_get_voltage_sel_regmap,
+       .set_voltage_sel        = regulator_set_voltage_sel_regmap,
+       .set_voltage_time_sel   = regulator_set_voltage_time_sel,
+
+       .enable                 = regulator_enable_regmap,
+       .disable                = regulator_disable_regmap,
+       .is_enabled             = regulator_is_enabled_regmap,
+
+       .set_ramp_delay         = regulator_set_ramp_delay_regmap,
+
+       .set_suspend_voltage    = rk806_set_suspend_voltage_range_nldo,
+       .set_suspend_enable     = rk806_set_suspend_enable,
+       .set_suspend_disable    = rk806_set_suspend_disable,
+};
+
+static const struct regulator_ops rk806_ops_pldo = {
+       .list_voltage           = regulator_list_voltage_linear_range,
+       .map_voltage            = regulator_map_voltage_linear_range,
+
+       .get_voltage_sel        = regulator_get_voltage_sel_regmap,
+       .set_voltage_sel        = regulator_set_voltage_sel_regmap,
+       .set_voltage_time_sel   = regulator_set_voltage_time_sel,
+
+       .enable                 = regulator_enable_regmap,
+       .disable                = regulator_disable_regmap,
+       .is_enabled             = regulator_is_enabled_regmap,
+
+       .set_ramp_delay         = regulator_set_ramp_delay_regmap,
+
+       .set_suspend_voltage    = rk806_set_suspend_voltage_range_pldo,
+       .set_suspend_enable     = rk806_set_suspend_enable,
+       .set_suspend_disable    = rk806_set_suspend_disable,
+};
+
 static const struct regulator_ops rk808_buck1_2_ops = {
        .list_voltage           = regulator_list_voltage_linear,
        .map_voltage            = regulator_map_voltage_linear,
@@ -743,6 +1017,112 @@ static const struct regulator_desc rk805_reg[] = {
                BIT(2), 400),
 };
 
+static const struct linear_range rk806_buck_voltage_ranges[] = {
+       REGULATOR_LINEAR_RANGE(500000, 0, 160, 6250), /* 500mV ~ 1500mV */
+       REGULATOR_LINEAR_RANGE(1500000, 161, 237, 25000), /* 1500mV ~ 3400mV */
+       REGULATOR_LINEAR_RANGE(3400000, 238, 255, 0),
+};
+
+static const struct linear_range rk806_ldo_voltage_ranges[] = {
+       REGULATOR_LINEAR_RANGE(500000, 0, 232, 12500), /* 500mV ~ 3400mV */
+       REGULATOR_LINEAR_RANGE(3400000, 233, 255, 0), /* 3400mV fixed */
+};
+
+static const struct regulator_desc rk806_reg[] = {
+       RK806_REGULATOR("dcdc-reg1", "vcc1", RK806_ID_DCDC1, rk806_ops_dcdc,
+                       RK806_BUCK_SEL_CNT, RK806_BUCK1_ON_VSEL,
+                       RK806_POWER_EN0, rk806_buck_voltage_ranges, 0,
+                       RK806_BUCK1_CONFIG, 0xc0, rk806_ramp_delay_table_dcdc),
+       RK806_REGULATOR("dcdc-reg2", "vcc2", RK806_ID_DCDC2, rk806_ops_dcdc,
+                       RK806_BUCK_SEL_CNT, RK806_BUCK2_ON_VSEL,
+                       RK806_POWER_EN0, rk806_buck_voltage_ranges, 1,
+                       RK806_BUCK2_CONFIG, 0xc0, rk806_ramp_delay_table_dcdc),
+       RK806_REGULATOR("dcdc-reg3", "vcc3", RK806_ID_DCDC3, rk806_ops_dcdc,
+                       RK806_BUCK_SEL_CNT, RK806_BUCK3_ON_VSEL,
+                       RK806_POWER_EN0, rk806_buck_voltage_ranges, 2,
+                       RK806_BUCK3_CONFIG, 0xc0, rk806_ramp_delay_table_dcdc),
+       RK806_REGULATOR("dcdc-reg4", "vcc4", RK806_ID_DCDC4, rk806_ops_dcdc,
+                       RK806_BUCK_SEL_CNT, RK806_BUCK4_ON_VSEL,
+                       RK806_POWER_EN0, rk806_buck_voltage_ranges, 3,
+                       RK806_BUCK4_CONFIG, 0xc0, rk806_ramp_delay_table_dcdc),
+
+       RK806_REGULATOR("dcdc-reg5", "vcc5", RK806_ID_DCDC5, rk806_ops_dcdc,
+                       RK806_BUCK_SEL_CNT, RK806_BUCK5_ON_VSEL,
+                       RK806_POWER_EN1, rk806_buck_voltage_ranges, 0,
+                       RK806_BUCK5_CONFIG, 0xc0, rk806_ramp_delay_table_dcdc),
+       RK806_REGULATOR("dcdc-reg6", "vcc6", RK806_ID_DCDC6, rk806_ops_dcdc,
+                       RK806_BUCK_SEL_CNT, RK806_BUCK6_ON_VSEL,
+                       RK806_POWER_EN1, rk806_buck_voltage_ranges, 1,
+                       RK806_BUCK6_CONFIG, 0xc0, rk806_ramp_delay_table_dcdc),
+       RK806_REGULATOR("dcdc-reg7", "vcc7", RK806_ID_DCDC7, rk806_ops_dcdc,
+                       RK806_BUCK_SEL_CNT, RK806_BUCK7_ON_VSEL,
+                       RK806_POWER_EN1, rk806_buck_voltage_ranges, 2,
+                       RK806_BUCK7_CONFIG, 0xc0, rk806_ramp_delay_table_dcdc),
+       RK806_REGULATOR("dcdc-reg8", "vcc8", RK806_ID_DCDC8, rk806_ops_dcdc,
+                       RK806_BUCK_SEL_CNT, RK806_BUCK8_ON_VSEL,
+                       RK806_POWER_EN1, rk806_buck_voltage_ranges, 3,
+                       RK806_BUCK8_CONFIG, 0xc0, rk806_ramp_delay_table_dcdc),
+
+       RK806_REGULATOR("dcdc-reg9", "vcc9", RK806_ID_DCDC9, rk806_ops_dcdc,
+                       RK806_BUCK_SEL_CNT, RK806_BUCK9_ON_VSEL,
+                       RK806_POWER_EN2, rk806_buck_voltage_ranges, 0,
+                       RK806_BUCK9_CONFIG, 0xc0, rk806_ramp_delay_table_dcdc),
+       RK806_REGULATOR("dcdc-reg10", "vcc10", RK806_ID_DCDC10, rk806_ops_dcdc,
+                       RK806_BUCK_SEL_CNT, RK806_BUCK10_ON_VSEL,
+                       RK806_POWER_EN2, rk806_buck_voltage_ranges, 1,
+                       RK806_BUCK10_CONFIG, 0xc0, rk806_ramp_delay_table_dcdc),
+
+       RK806_REGULATOR("nldo-reg1", "vcc13", RK806_ID_NLDO1, rk806_ops_nldo,
+                       RK806_LDO_SEL_CNT, RK806_NLDO1_ON_VSEL,
+                       RK806_POWER_EN3, rk806_ldo_voltage_ranges, 0,
+                       0xEA, 0x38, rk806_ramp_delay_table_ldo),
+       RK806_REGULATOR("nldo-reg2", "vcc13", RK806_ID_NLDO2, rk806_ops_nldo,
+                       RK806_LDO_SEL_CNT, RK806_NLDO2_ON_VSEL,
+                       RK806_POWER_EN3, rk806_ldo_voltage_ranges, 1,
+                       0xEA, 0x38, rk806_ramp_delay_table_ldo),
+       RK806_REGULATOR("nldo-reg3", "vcc13", RK806_ID_NLDO3, rk806_ops_nldo,
+                       RK806_LDO_SEL_CNT, RK806_NLDO3_ON_VSEL,
+                       RK806_POWER_EN3, rk806_ldo_voltage_ranges, 2,
+                       0xEA, 0x38, rk806_ramp_delay_table_ldo),
+       RK806_REGULATOR("nldo-reg4", "vcc14", RK806_ID_NLDO4, rk806_ops_nldo,
+                       RK806_LDO_SEL_CNT, RK806_NLDO4_ON_VSEL,
+                       RK806_POWER_EN3, rk806_ldo_voltage_ranges, 3,
+                       0xEA, 0x38, rk806_ramp_delay_table_ldo),
+
+       RK806_REGULATOR("nldo-reg5", "vcc14", RK806_ID_NLDO5, rk806_ops_nldo,
+                       RK806_LDO_SEL_CNT, RK806_NLDO5_ON_VSEL,
+                       RK806_POWER_EN5, rk806_ldo_voltage_ranges, 2,
+                       0xEA, 0x38, rk806_ramp_delay_table_ldo),
+
+       RK806_REGULATOR("pldo-reg1", "vcc11", RK806_ID_PLDO1, rk806_ops_pldo,
+                       RK806_LDO_SEL_CNT, RK806_PLDO1_ON_VSEL,
+                       RK806_POWER_EN4, rk806_ldo_voltage_ranges, 1,
+                       0xEA, 0x38, rk806_ramp_delay_table_ldo),
+       RK806_REGULATOR("pldo-reg2", "vcc11", RK806_ID_PLDO2, rk806_ops_pldo,
+                       RK806_LDO_SEL_CNT, RK806_PLDO2_ON_VSEL,
+                       RK806_POWER_EN4, rk806_ldo_voltage_ranges, 2,
+                       0xEA, 0x38, rk806_ramp_delay_table_ldo),
+       RK806_REGULATOR("pldo-reg3", "vcc11", RK806_ID_PLDO3, rk806_ops_pldo,
+                       RK806_LDO_SEL_CNT, RK806_PLDO3_ON_VSEL,
+                       RK806_POWER_EN4, rk806_ldo_voltage_ranges, 3,
+                       0xEA, 0x38, rk806_ramp_delay_table_ldo),
+
+       RK806_REGULATOR("pldo-reg4", "vcc12", RK806_ID_PLDO4, rk806_ops_pldo,
+                       RK806_LDO_SEL_CNT, RK806_PLDO4_ON_VSEL,
+                       RK806_POWER_EN5, rk806_ldo_voltage_ranges, 0,
+                       0xEA, 0x38, rk806_ramp_delay_table_ldo),
+       RK806_REGULATOR("pldo-reg5", "vcc12", RK806_ID_PLDO5, rk806_ops_pldo,
+                       RK806_LDO_SEL_CNT, RK806_PLDO5_ON_VSEL,
+                       RK806_POWER_EN5, rk806_ldo_voltage_ranges, 1,
+                       0xEA, 0x38, rk806_ramp_delay_table_ldo),
+
+       RK806_REGULATOR("pldo-reg6", "vcca", RK806_ID_PLDO6, rk806_ops_pldo,
+                       RK806_LDO_SEL_CNT, RK806_PLDO6_ON_VSEL,
+                       RK806_POWER_EN4, rk806_ldo_voltage_ranges, 0,
+                       0xEA, 0x38, rk806_ramp_delay_table_ldo),
+};
+
 static const struct regulator_desc rk808_reg[] = {
        {
                .name = "DCDC_REG1",
@@ -1245,20 +1625,19 @@ static const struct regulator_desc rk818_reg[] = {
 };
 
 static int rk808_regulator_dt_parse_pdata(struct device *dev,
-                                  struct device *client_dev,
                                   struct regmap *map,
                                   struct rk808_regulator_data *pdata)
 {
        struct device_node *np;
        int tmp, ret = 0, i;
 
-       np = of_get_child_by_name(client_dev->of_node, "regulators");
+       np = of_get_child_by_name(dev->of_node, "regulators");
        if (!np)
                return -ENXIO;
 
        for (i = 0; i < ARRAY_SIZE(pdata->dvs_gpio); i++) {
                pdata->dvs_gpio[i] =
-                       devm_gpiod_get_index_optional(client_dev, "dvs", i,
+                       devm_gpiod_get_index_optional(dev, "dvs", i,
                                                      GPIOD_OUT_LOW);
                if (IS_ERR(pdata->dvs_gpio[i])) {
                        ret = PTR_ERR(pdata->dvs_gpio[i]);
@@ -1292,6 +1671,9 @@ static int rk808_regulator_probe(struct platform_device *pdev)
        struct regmap *regmap;
        int ret, i, nregulators;
 
+       pdev->dev.of_node = pdev->dev.parent->of_node;
+       pdev->dev.of_node_reused = true;
+
        regmap = dev_get_regmap(pdev->dev.parent, NULL);
        if (!regmap)
                return -ENODEV;
@@ -1300,8 +1682,7 @@ static int rk808_regulator_probe(struct platform_device *pdev)
        if (!pdata)
                return -ENOMEM;
 
-       ret = rk808_regulator_dt_parse_pdata(&pdev->dev, pdev->dev.parent,
-                                            regmap, pdata);
+       ret = rk808_regulator_dt_parse_pdata(&pdev->dev, regmap, pdata);
        if (ret < 0)
                return ret;
 
@@ -1312,6 +1693,10 @@ static int rk808_regulator_probe(struct platform_device *pdev)
                regulators = rk805_reg;
                nregulators = RK805_NUM_REGULATORS;
                break;
+       case RK806_ID:
+               regulators = rk806_reg;
+               nregulators = ARRAY_SIZE(rk806_reg);
+               break;
        case RK808_ID:
                regulators = rk808_reg;
                nregulators = RK808_NUM_REGULATORS;
@@ -1335,7 +1720,6 @@ static int rk808_regulator_probe(struct platform_device *pdev)
        }
 
        config.dev = &pdev->dev;
-       config.dev->of_node = pdev->dev.parent->of_node;
        config.driver_data = pdata;
        config.regmap = regmap;
 
@@ -1355,7 +1739,7 @@ static struct platform_driver rk808_regulator_driver = {
        .probe = rk808_regulator_probe,
        .driver = {
                .name = "rk808-regulator",
-               .probe_type = PROBE_PREFER_ASYNCHRONOUS,
+               .probe_type = PROBE_FORCE_SYNCHRONOUS,
        },
 };
 
@@ -1366,5 +1750,6 @@ MODULE_AUTHOR("Tony xie <tony.xie@rock-chips.com>");
 MODULE_AUTHOR("Chris Zhong <zyw@rock-chips.com>");
 MODULE_AUTHOR("Zhang Qing <zhangqing@rock-chips.com>");
 MODULE_AUTHOR("Wadim Egorov <w.egorov@phytec.de>");
+MODULE_AUTHOR("Xu Shengfei <xsf@rock-chips.com>");
 MODULE_LICENSE("GPL");
 MODULE_ALIAS("platform:rk808-regulator");
index 9afe961..e9719a3 100644 (file)
@@ -399,7 +399,7 @@ static struct i2c_driver attiny_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(attiny_dt_ids),
        },
-       .probe_new = attiny_i2c_probe,
+       .probe = attiny_i2c_probe,
        .remove = attiny_i2c_remove,
 };
 
index be3dc98..4955bfe 100644 (file)
@@ -242,7 +242,7 @@ static struct i2c_driver rt4801_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(rt4801_of_id),
        },
-       .probe_new = rt4801_probe,
+       .probe = rt4801_probe,
 };
 module_i2c_driver(rt4801_driver);
 
index f6c12f8..a53ed52 100644 (file)
@@ -508,7 +508,7 @@ static struct i2c_driver rt5190a_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = rt5190a_device_table,
        },
-       .probe_new = rt5190a_probe,
+       .probe = rt5190a_probe,
 };
 module_i2c_driver(rt5190a_driver);
 
index 74fc5bf..0ce6a16 100644 (file)
@@ -282,7 +282,7 @@ static struct i2c_driver rt5739_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = rt5739_device_table,
        },
-       .probe_new = rt5739_probe,
+       .probe = rt5739_probe,
 };
 module_i2c_driver(rt5739_driver);
 
index d5a42ad..90555a9 100644 (file)
@@ -362,7 +362,7 @@ static struct i2c_driver rt5759_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(rt5759_device_table),
        },
-       .probe_new = rt5759_probe,
+       .probe = rt5759_probe,
 };
 module_i2c_driver(rt5759_driver);
 
index 8990dac..e2a0eee 100644 (file)
@@ -311,7 +311,7 @@ static struct i2c_driver rt6160_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = rt6160_of_match_table,
        },
-       .probe_new = rt6160_probe,
+       .probe = rt6160_probe,
 };
 module_i2c_driver(rt6160_driver);
 
index ca91a1f..3883440 100644 (file)
@@ -487,7 +487,7 @@ static struct i2c_driver rt6190_driver = {
                .of_match_table = rt6190_of_dev_table,
                .pm = pm_ptr(&rt6190_dev_pm),
        },
-       .probe_new = rt6190_probe,
+       .probe = rt6190_probe,
 };
 module_i2c_driver(rt6190_driver);
 
index 8721d11..1843ece 100644 (file)
@@ -246,7 +246,7 @@ static struct i2c_driver rt6245_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = rt6245_of_match_table,
        },
-       .probe_new = rt6245_probe,
+       .probe = rt6245_probe,
 };
 module_i2c_driver(rt6245_driver);
 
index 7cbb812..dfd1522 100644 (file)
@@ -429,7 +429,7 @@ static struct i2c_driver rtmv20_driver = {
                .of_match_table = of_match_ptr(rtmv20_of_id),
                .pm = &rtmv20_pm,
        },
-       .probe_new = rtmv20_probe,
+       .probe = rtmv20_probe,
 };
 module_i2c_driver(rtmv20_driver);
 
index ee1577d..b7372cb 100644 (file)
@@ -366,7 +366,7 @@ static struct i2c_driver rtq2134_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = rtq2134_device_tables,
        },
-       .probe_new = rtq2134_probe,
+       .probe = rtq2134_probe,
 };
 module_i2c_driver(rtq2134_driver);
 
index 8559a26..8176e5a 100644 (file)
@@ -281,7 +281,7 @@ static struct i2c_driver rtq6752_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = rtq6752_device_table,
        },
-       .probe_new = rtq6752_probe,
+       .probe = rtq6752_probe,
 };
 module_i2c_driver(rtq6752_driver);
 
index 559ae03..59aa168 100644 (file)
@@ -507,7 +507,7 @@ static struct i2c_driver slg51000_regulator_driver = {
                .name = "slg51000-regulator",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
        },
-       .probe_new = slg51000_i2c_probe,
+       .probe = slg51000_i2c_probe,
        .id_table = slg51000_i2c_id,
 };
 
index 0e101df..4c60edd 100644 (file)
@@ -93,7 +93,7 @@ static int stm32_pwr_reg_disable(struct regulator_dev *rdev)
        writel_relaxed(val, priv->base + REG_PWR_CR3);
 
        /* use an arbitrary timeout of 20ms */
-       ret = readx_poll_timeout(stm32_pwr_reg_is_ready, rdev, val, !val,
+       ret = readx_poll_timeout(stm32_pwr_reg_is_enabled, rdev, val, !val,
                                 100, 20 * 1000);
        if (ret)
                dev_err(&rdev->dev, "regulator disable timed out!\n");
index e3c7539..1bcfdd6 100644 (file)
@@ -141,7 +141,7 @@ static struct i2c_driver sy8106a_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = sy8106a_i2c_of_match,
        },
-       .probe_new = sy8106a_i2c_probe,
+       .probe = sy8106a_i2c_probe,
        .id_table = sy8106a_i2c_id,
 };
 
index c327ad6..d070310 100644 (file)
@@ -236,7 +236,7 @@ static struct i2c_driver sy8824_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = sy8824_dt_ids,
        },
-       .probe_new = sy8824_i2c_probe,
+       .probe = sy8824_i2c_probe,
        .id_table = sy8824_id,
 };
 module_i2c_driver(sy8824_regulator_driver);
index 99ca08c..433959b 100644 (file)
@@ -190,7 +190,7 @@ static struct i2c_driver sy8827n_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = sy8827n_dt_ids,
        },
-       .probe_new = sy8827n_i2c_probe,
+       .probe = sy8827n_i2c_probe,
        .id_table = sy8827n_id,
 };
 module_i2c_driver(sy8827n_regulator_driver);
index 9bd4e72..d8a856c 100644 (file)
@@ -354,7 +354,7 @@ static struct i2c_driver tps51632_i2c_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(tps51632_of_match),
        },
-       .probe_new = tps51632_probe,
+       .probe = tps51632_probe,
        .id_table = tps51632_id,
 };
 
index 65cc08d..32e1a05 100644 (file)
@@ -491,7 +491,7 @@ static struct i2c_driver tps62360_i2c_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(tps62360_of_match),
        },
-       .probe_new = tps62360_probe,
+       .probe = tps62360_probe,
        .shutdown = tps62360_shutdown,
        .id_table = tps62360_id,
 };
index f92e764..b1c4b51 100644 (file)
@@ -150,7 +150,7 @@ static struct i2c_driver tps6286x_regulator_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(tps6286x_dt_ids),
        },
-       .probe_new = tps6286x_i2c_probe,
+       .probe = tps6286x_i2c_probe,
        .id_table = tps6286x_i2c_id,
 };
 
diff --git a/drivers/regulator/tps6287x-regulator.c b/drivers/regulator/tps6287x-regulator.c
new file mode 100644 (file)
index 0000000..b1c0963
--- /dev/null
@@ -0,0 +1,189 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Axis Communications AB
+ *
+ * Driver for Texas Instruments TPS6287x PMIC.
+ * Datasheet: https://www.ti.com/lit/ds/symlink/tps62873.pdf
+ */
+
+#include <linux/err.h>
+#include <linux/i2c.h>
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/regmap.h>
+#include <linux/regulator/of_regulator.h>
+#include <linux/regulator/machine.h>
+#include <linux/regulator/driver.h>
+#include <linux/bitfield.h>
+#include <linux/linear_range.h>
+
+#define TPS6287X_VSET          0x00
+#define TPS6287X_CTRL1         0x01
+#define TPS6287X_CTRL1_VRAMP   GENMASK(1, 0)
+#define TPS6287X_CTRL1_FPWMEN  BIT(4)
+#define TPS6287X_CTRL1_SWEN    BIT(5)
+#define TPS6287X_CTRL2         0x02
+#define TPS6287X_CTRL2_VRANGE  GENMASK(3, 2)
+#define TPS6287X_CTRL3         0x03
+#define TPS6287X_STATUS                0x04
+
+static const struct regmap_config tps6287x_regmap_config = {
+       .reg_bits = 8,
+       .val_bits = 8,
+       .max_register = TPS6287X_STATUS,
+};
+
+static const struct linear_range tps6287x_voltage_ranges[] = {
+       LINEAR_RANGE(400000, 0, 0xFF, 1250),
+       LINEAR_RANGE(400000, 0, 0xFF, 2500),
+       LINEAR_RANGE(400000, 0, 0xFF, 5000),
+       LINEAR_RANGE(800000, 0, 0xFF, 10000),
+};
+
+static const unsigned int tps6287x_voltage_range_sel[] = {
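+       /* TPS6287X_CTRL2_VRANGE register values, one per linear range above */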
+       0x0, 0x4, 0x8, 0xC
+};
+
+static const unsigned int tps6287x_ramp_table[] = {
+       10000, 5000, 1250, 500
+};
+
+static int tps6287x_set_mode(struct regulator_dev *rdev, unsigned int mode)
+{
+       unsigned int val;
+
+       switch (mode) {
+       case REGULATOR_MODE_NORMAL:
+               val = 0;
+               break;
+       case REGULATOR_MODE_FAST:
+               val = TPS6287X_CTRL1_FPWMEN;
+               break;
+       default:
+               return -EINVAL;
+       }
+
+       return regmap_update_bits(rdev->regmap, TPS6287X_CTRL1,
+                                 TPS6287X_CTRL1_FPWMEN, val);
+}
+
+static unsigned int tps6287x_get_mode(struct regulator_dev *rdev)
+{
+       unsigned int val;
+       int ret;
+
+       ret = regmap_read(rdev->regmap, TPS6287X_CTRL1, &val);
+       if (ret < 0)
+               return 0;
+
+       return (val & TPS6287X_CTRL1_FPWMEN) ? REGULATOR_MODE_FAST :
+           REGULATOR_MODE_NORMAL;
+}
+
+static unsigned int tps6287x_of_map_mode(unsigned int mode)
+{
+       switch (mode) {
+       case REGULATOR_MODE_NORMAL:
+       case REGULATOR_MODE_FAST:
+               return mode;
+       default:
+               return REGULATOR_MODE_INVALID;
+       }
+}
+
+static const struct regulator_ops tps6287x_regulator_ops = {
+       .enable = regulator_enable_regmap,
+       .disable = regulator_disable_regmap,
+       .set_mode = tps6287x_set_mode,
+       .get_mode = tps6287x_get_mode,
+       .is_enabled = regulator_is_enabled_regmap,
+       .get_voltage_sel = regulator_get_voltage_sel_pickable_regmap,
+       .set_voltage_sel = regulator_set_voltage_sel_pickable_regmap,
+       .list_voltage = regulator_list_voltage_pickable_linear_range,
+       .set_ramp_delay = regulator_set_ramp_delay_regmap,
+};
+
+static struct regulator_desc tps6287x_reg = {
+       .name = "tps6287x",
+       .owner = THIS_MODULE,
+       .ops = &tps6287x_regulator_ops,
+       .of_map_mode = tps6287x_of_map_mode,
+       .type = REGULATOR_VOLTAGE,
+       .enable_reg = TPS6287X_CTRL1,
+       .enable_mask = TPS6287X_CTRL1_SWEN,
+       .vsel_reg = TPS6287X_VSET,
+       .vsel_mask = 0xFF,
+       .vsel_range_reg = TPS6287X_CTRL2,
+       .vsel_range_mask = TPS6287X_CTRL2_VRANGE,
+       .ramp_reg = TPS6287X_CTRL1,
+       .ramp_mask = TPS6287X_CTRL1_VRAMP,
+       .ramp_delay_table = tps6287x_ramp_table,
+       .n_ramp_values = ARRAY_SIZE(tps6287x_ramp_table),
+       .n_voltages = 256,
+       .linear_ranges = tps6287x_voltage_ranges,
+       .n_linear_ranges = ARRAY_SIZE(tps6287x_voltage_ranges),
+       .linear_range_selectors = tps6287x_voltage_range_sel,
+};
+
+static int tps6287x_i2c_probe(struct i2c_client *i2c)
+{
+       struct device *dev = &i2c->dev;
+       struct regulator_config config = {};
+       struct regulator_dev *rdev;
+
+       config.regmap = devm_regmap_init_i2c(i2c, &tps6287x_regmap_config);
+       if (IS_ERR(config.regmap)) {
+               dev_err(dev, "Failed to init i2c\n");
+               return PTR_ERR(config.regmap);
+       }
+
+       config.dev = dev;
+       config.of_node = dev->of_node;
+       config.init_data = of_get_regulator_init_data(dev, dev->of_node,
+                                                     &tps6287x_reg);
+
+       rdev = devm_regulator_register(dev, &tps6287x_reg, &config);
+       if (IS_ERR(rdev)) {
+               dev_err(dev, "Failed to register regulator\n");
+               return PTR_ERR(rdev);
+       }
+
+       dev_dbg(dev, "Probed regulator\n");
+
+       return 0;
+}
+
+static const struct of_device_id tps6287x_dt_ids[] = {
+       { .compatible = "ti,tps62870", },
+       { .compatible = "ti,tps62871", },
+       { .compatible = "ti,tps62872", },
+       { .compatible = "ti,tps62873", },
+       { }
+};
+
+MODULE_DEVICE_TABLE(of, tps6287x_dt_ids);
+
+static const struct i2c_device_id tps6287x_i2c_id[] = {
+       { "tps62870", 0 },
+       { "tps62871", 0 },
+       { "tps62872", 0 },
+       { "tps62873", 0 },
+       {},
+};
+
+MODULE_DEVICE_TABLE(i2c, tps6287x_i2c_id);
+
+static struct i2c_driver tps6287x_regulator_driver = {
+       .driver = {
+               .name = "tps6287x",
+               .of_match_table = tps6287x_dt_ids,
+       },
+       .probe = tps6287x_i2c_probe,
+       .id_table = tps6287x_i2c_id,
+};
+
+module_i2c_driver(tps6287x_regulator_driver);
+
+MODULE_AUTHOR("Mårten Lindahl <marten.lindahl@axis.com>");
+MODULE_DESCRIPTION("Regulator driver for TI TPS6287X PMIC");
+MODULE_LICENSE("GPL");
index d87cac6..d5757fd 100644 (file)
@@ -337,7 +337,7 @@ static struct i2c_driver tps_65023_i2c_driver = {
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
                .of_match_table = of_match_ptr(tps65023_of_match),
        },
-       .probe_new = tps_65023_probe,
+       .probe = tps_65023_probe,
        .id_table = tps_65023_id,
 };
 
index d4b02ee..a06f5f2 100644 (file)
@@ -272,7 +272,7 @@ static struct i2c_driver tps65132_i2c_driver = {
                .name = "tps65132",
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
        },
-       .probe_new = tps65132_probe,
+       .probe = tps65132_probe,
        .id_table = tps65132_id,
 };
 
index b1719ee..8971b50 100644 (file)
@@ -289,13 +289,13 @@ static irqreturn_t tps65219_regulator_irq_handler(int irq, void *data)
 
 static int tps65219_get_rdev_by_name(const char *regulator_name,
                                     struct regulator_dev *rdevtbl[7],
-                                    struct regulator_dev *dev)
+                                    struct regulator_dev **dev)
 {
        int i;
 
        for (i = 0; i < ARRAY_SIZE(regulators); i++) {
                if (strcmp(regulator_name, regulators[i].name) == 0) {
-                       dev = rdevtbl[i];
+                       *dev = rdevtbl[i];
                        return 0;
                }
        }
@@ -348,7 +348,7 @@ static int tps65219_regulator_probe(struct platform_device *pdev)
                irq_data[i].dev = tps->dev;
                irq_data[i].type = irq_type;
 
-               tps65219_get_rdev_by_name(irq_type->regulator_name, rdevtbl, rdev);
+               tps65219_get_rdev_by_name(irq_type->regulator_name, rdevtbl, &rdev);
                if (IS_ERR(rdev)) {
                        dev_err(tps->dev, "Failed to get rdev for %s\n",
                                irq_type->regulator_name);
diff --git a/drivers/regulator/tps6594-regulator.c b/drivers/regulator/tps6594-regulator.c
new file mode 100644 (file)
index 0000000..d5a574e
--- /dev/null
@@ -0,0 +1,615 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Regulator driver for tps6594 PMIC
+//
+// Copyright (C) 2023 BayLibre Incorporated - https://www.baylibre.com/
+
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/platform_device.h>
+#include <linux/regmap.h>
+#include <linux/regulator/driver.h>
+#include <linux/regulator/machine.h>
+#include <linux/regulator/of_regulator.h>
+
+#include <linux/mfd/tps6594.h>
+
+#define BUCK_NB                5
+#define LDO_NB         4
+#define MULTI_PHASE_NB 4
+#define REGS_INT_NB    4
+
+enum tps6594_regulator_id {
+       /* DCDC's */
+       TPS6594_BUCK_1,
+       TPS6594_BUCK_2,
+       TPS6594_BUCK_3,
+       TPS6594_BUCK_4,
+       TPS6594_BUCK_5,
+
+       /* LDOs */
+       TPS6594_LDO_1,
+       TPS6594_LDO_2,
+       TPS6594_LDO_3,
+       TPS6594_LDO_4,
+};
+
+enum tps6594_multi_regulator_id {
+       /* Multi-phase DCDC's */
+       TPS6594_BUCK_12,
+       TPS6594_BUCK_34,
+       TPS6594_BUCK_123,
+       TPS6594_BUCK_1234,
+};
+
+struct tps6594_regulator_irq_type {
+       const char *irq_name;
+       const char *regulator_name;
+       const char *event_name;
+       unsigned long event;
+};
+
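+/* VCCA and VMON monitor interrupts; the VMON entries are requested only on LP8764 (see probe) */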
+static struct tps6594_regulator_irq_type tps6594_ext_regulator_irq_types[] = {
+       { TPS6594_IRQ_NAME_VCCA_OV, "VCCA", "overvoltage", REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+       { TPS6594_IRQ_NAME_VCCA_UV, "VCCA", "undervoltage", REGULATOR_EVENT_UNDER_VOLTAGE },
+       { TPS6594_IRQ_NAME_VMON1_OV, "VMON1", "overvoltage", REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+       { TPS6594_IRQ_NAME_VMON1_UV, "VMON1", "undervoltage", REGULATOR_EVENT_UNDER_VOLTAGE },
+       { TPS6594_IRQ_NAME_VMON1_RV, "VMON1", "residual voltage",
+         REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+       { TPS6594_IRQ_NAME_VMON2_OV, "VMON2", "overvoltage", REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+       { TPS6594_IRQ_NAME_VMON2_UV, "VMON2", "undervoltage", REGULATOR_EVENT_UNDER_VOLTAGE },
+       { TPS6594_IRQ_NAME_VMON2_RV, "VMON2", "residual voltage",
+         REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+};
+
+struct tps6594_regulator_irq_data {
+       struct device *dev;
+       struct tps6594_regulator_irq_type *type;
+       struct regulator_dev *rdev;
+};
+
+struct tps6594_ext_regulator_irq_data {
+       struct device *dev;
+       struct tps6594_regulator_irq_type *type;
+};
+
+#define TPS6594_REGULATOR(_name, _of, _id, _type, _ops, _n, _vr, _vm, _er, \
+                          _em, _cr, _cm, _lr, _nlr, _delay, _fuv, \
+                          _ct, _ncl, _bpm) \
+       {                                                               \
+               .name                   = _name,                        \
+               .of_match               = _of,                          \
+               .regulators_node        = of_match_ptr("regulators"),   \
+               .supply_name            = _of,                          \
+               .id                     = _id,                          \
+               .ops                    = &(_ops),                      \
+               .n_voltages             = _n,                           \
+               .type                   = _type,                        \
+               .owner                  = THIS_MODULE,                  \
+               .vsel_reg               = _vr,                          \
+               .vsel_mask              = _vm,                          \
+               .csel_reg               = _cr,                          \
+               .csel_mask              = _cm,                          \
+               .curr_table             = _ct,                          \
+               .n_current_limits       = _ncl,                         \
+               .enable_reg             = _er,                          \
+               .enable_mask            = _em,                          \
+               .volt_table             = NULL,                         \
+               .linear_ranges          = _lr,                          \
+               .n_linear_ranges        = _nlr,                         \
+               .ramp_delay             = _delay,                       \
+               .fixed_uV               = _fuv,                         \
+               .bypass_reg             = _vr,                          \
+               .bypass_mask            = _bpm,                         \
+       }                                                               \
+
+static const struct linear_range bucks_ranges[] = {
+       REGULATOR_LINEAR_RANGE(300000, 0x0, 0xe, 20000),
+       REGULATOR_LINEAR_RANGE(600000, 0xf, 0x72, 5000),
+       REGULATOR_LINEAR_RANGE(1100000, 0x73, 0xaa, 10000),
+       REGULATOR_LINEAR_RANGE(1660000, 0xab, 0xff, 20000),
+};
+
+static const struct linear_range ldos_1_2_3_ranges[] = {
+       REGULATOR_LINEAR_RANGE(600000, 0x4, 0x3a, 50000),
+};
+
+static const struct linear_range ldos_4_ranges[] = {
+       REGULATOR_LINEAR_RANGE(1200000, 0x20, 0x74, 25000),
+};
+
+/* Operations permitted on BUCK1/2/3/4/5 */
+static const struct regulator_ops tps6594_bucks_ops = {
+       .is_enabled             = regulator_is_enabled_regmap,
+       .enable                 = regulator_enable_regmap,
+       .disable                = regulator_disable_regmap,
+       .get_voltage_sel        = regulator_get_voltage_sel_regmap,
+       .set_voltage_sel        = regulator_set_voltage_sel_regmap,
+       .list_voltage           = regulator_list_voltage_linear_range,
+       .map_voltage            = regulator_map_voltage_linear_range,
+       .set_voltage_time_sel   = regulator_set_voltage_time_sel,
+
+};
+
+/* Operations permitted on LDO1/2/3 */
+static const struct regulator_ops tps6594_ldos_1_2_3_ops = {
+       .is_enabled             = regulator_is_enabled_regmap,
+       .enable                 = regulator_enable_regmap,
+       .disable                = regulator_disable_regmap,
+       .get_voltage_sel        = regulator_get_voltage_sel_regmap,
+       .set_voltage_sel        = regulator_set_voltage_sel_regmap,
+       .list_voltage           = regulator_list_voltage_linear_range,
+       .map_voltage            = regulator_map_voltage_linear_range,
+       .set_bypass             = regulator_set_bypass_regmap,
+       .get_bypass             = regulator_get_bypass_regmap,
+};
+
+/* Operations permitted on LDO4 */
+static const struct regulator_ops tps6594_ldos_4_ops = {
+       .is_enabled             = regulator_is_enabled_regmap,
+       .enable                 = regulator_enable_regmap,
+       .disable                = regulator_disable_regmap,
+       .get_voltage_sel        = regulator_get_voltage_sel_regmap,
+       .set_voltage_sel        = regulator_set_voltage_sel_regmap,
+       .list_voltage           = regulator_list_voltage_linear_range,
+       .map_voltage            = regulator_map_voltage_linear_range,
+};
+
+static const struct regulator_desc buck_regs[] = {
+       TPS6594_REGULATOR("BUCK1", "buck1", TPS6594_BUCK_1,
+                         REGULATOR_VOLTAGE, tps6594_bucks_ops, TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_VOUT_1(0),
+                         TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_CTRL(0),
+                         TPS6594_BIT_BUCK_EN, 0, 0, bucks_ranges,
+                         4, 0, 0, NULL, 0, 0),
+       TPS6594_REGULATOR("BUCK2", "buck2", TPS6594_BUCK_2,
+                         REGULATOR_VOLTAGE, tps6594_bucks_ops, TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_VOUT_1(1),
+                         TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_CTRL(1),
+                         TPS6594_BIT_BUCK_EN, 0, 0, bucks_ranges,
+                         4, 0, 0, NULL, 0, 0),
+       TPS6594_REGULATOR("BUCK3", "buck3", TPS6594_BUCK_3,
+                         REGULATOR_VOLTAGE, tps6594_bucks_ops, TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_VOUT_1(2),
+                         TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_CTRL(2),
+                         TPS6594_BIT_BUCK_EN, 0, 0, bucks_ranges,
+                         4, 0, 0, NULL, 0, 0),
+       TPS6594_REGULATOR("BUCK4", "buck4", TPS6594_BUCK_4,
+                         REGULATOR_VOLTAGE, tps6594_bucks_ops, TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_VOUT_1(3),
+                         TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_CTRL(3),
+                         TPS6594_BIT_BUCK_EN, 0, 0, bucks_ranges,
+                         4, 0, 0, NULL, 0, 0),
+       TPS6594_REGULATOR("BUCK5", "buck5", TPS6594_BUCK_5,
+                         REGULATOR_VOLTAGE, tps6594_bucks_ops, TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_VOUT_1(4),
+                         TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_CTRL(4),
+                         TPS6594_BIT_BUCK_EN, 0, 0, bucks_ranges,
+                         4, 0, 0, NULL, 0, 0),
+};
+
+static struct tps6594_regulator_irq_type tps6594_buck1_irq_types[] = {
+       { TPS6594_IRQ_NAME_BUCK1_OV, "BUCK1", "overvoltage", REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+       { TPS6594_IRQ_NAME_BUCK1_UV, "BUCK1", "undervoltage", REGULATOR_EVENT_UNDER_VOLTAGE },
+       { TPS6594_IRQ_NAME_BUCK1_SC, "BUCK1", "short circuit", REGULATOR_EVENT_REGULATION_OUT },
+       { TPS6594_IRQ_NAME_BUCK1_ILIM, "BUCK1", "reach ilim, overcurrent",
+         REGULATOR_EVENT_OVER_CURRENT },
+};
+
+static struct tps6594_regulator_irq_type tps6594_buck2_irq_types[] = {
+       { TPS6594_IRQ_NAME_BUCK2_OV, "BUCK2", "overvoltage", REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+       { TPS6594_IRQ_NAME_BUCK2_UV, "BUCK2", "undervoltage", REGULATOR_EVENT_UNDER_VOLTAGE },
+       { TPS6594_IRQ_NAME_BUCK2_SC, "BUCK2", "short circuit", REGULATOR_EVENT_REGULATION_OUT },
+       { TPS6594_IRQ_NAME_BUCK2_ILIM, "BUCK2", "reach ilim, overcurrent",
+         REGULATOR_EVENT_OVER_CURRENT },
+};
+
+static struct tps6594_regulator_irq_type tps6594_buck3_irq_types[] = {
+       { TPS6594_IRQ_NAME_BUCK3_OV, "BUCK3", "overvoltage", REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+       { TPS6594_IRQ_NAME_BUCK3_UV, "BUCK3", "undervoltage", REGULATOR_EVENT_UNDER_VOLTAGE },
+       { TPS6594_IRQ_NAME_BUCK3_SC, "BUCK3", "short circuit", REGULATOR_EVENT_REGULATION_OUT },
+       { TPS6594_IRQ_NAME_BUCK3_ILIM, "BUCK3", "reach ilim, overcurrent",
+         REGULATOR_EVENT_OVER_CURRENT },
+};
+
+static struct tps6594_regulator_irq_type tps6594_buck4_irq_types[] = {
+       { TPS6594_IRQ_NAME_BUCK4_OV, "BUCK4", "overvoltage", REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+       { TPS6594_IRQ_NAME_BUCK4_UV, "BUCK4", "undervoltage", REGULATOR_EVENT_UNDER_VOLTAGE },
+       { TPS6594_IRQ_NAME_BUCK4_SC, "BUCK4", "short circuit", REGULATOR_EVENT_REGULATION_OUT },
+       { TPS6594_IRQ_NAME_BUCK4_ILIM, "BUCK4", "reach ilim, overcurrent",
+         REGULATOR_EVENT_OVER_CURRENT },
+};
+
+static struct tps6594_regulator_irq_type tps6594_buck5_irq_types[] = {
+       { TPS6594_IRQ_NAME_BUCK5_OV, "BUCK5", "overvoltage", REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+       { TPS6594_IRQ_NAME_BUCK5_UV, "BUCK5", "undervoltage", REGULATOR_EVENT_UNDER_VOLTAGE },
+       { TPS6594_IRQ_NAME_BUCK5_SC, "BUCK5", "short circuit", REGULATOR_EVENT_REGULATION_OUT },
+       { TPS6594_IRQ_NAME_BUCK5_ILIM, "BUCK5", "reach ilim, overcurrent",
+         REGULATOR_EVENT_OVER_CURRENT },
+};
+
+static struct tps6594_regulator_irq_type tps6594_ldo1_irq_types[] = {
+       { TPS6594_IRQ_NAME_LDO1_OV, "LDO1", "overvoltage", REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+       { TPS6594_IRQ_NAME_LDO1_UV, "LDO1", "undervoltage", REGULATOR_EVENT_UNDER_VOLTAGE },
+       { TPS6594_IRQ_NAME_LDO1_SC, "LDO1", "short circuit", REGULATOR_EVENT_REGULATION_OUT },
+       { TPS6594_IRQ_NAME_LDO1_ILIM, "LDO1", "reach ilim, overcurrent",
+         REGULATOR_EVENT_OVER_CURRENT },
+};
+
+static struct tps6594_regulator_irq_type tps6594_ldo2_irq_types[] = {
+       { TPS6594_IRQ_NAME_LDO2_OV, "LDO2", "overvoltage", REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+       { TPS6594_IRQ_NAME_LDO2_UV, "LDO2", "undervoltage", REGULATOR_EVENT_UNDER_VOLTAGE },
+       { TPS6594_IRQ_NAME_LDO2_SC, "LDO2", "short circuit", REGULATOR_EVENT_REGULATION_OUT },
+       { TPS6594_IRQ_NAME_LDO2_ILIM, "LDO2", "reach ilim, overcurrent",
+         REGULATOR_EVENT_OVER_CURRENT },
+};
+
+static struct tps6594_regulator_irq_type tps6594_ldo3_irq_types[] = {
+       { TPS6594_IRQ_NAME_LDO3_OV, "LDO3", "overvoltage", REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+       { TPS6594_IRQ_NAME_LDO3_UV, "LDO3", "undervoltage", REGULATOR_EVENT_UNDER_VOLTAGE },
+       { TPS6594_IRQ_NAME_LDO3_SC, "LDO3", "short circuit", REGULATOR_EVENT_REGULATION_OUT },
+       { TPS6594_IRQ_NAME_LDO3_ILIM, "LDO3", "reach ilim, overcurrent",
+         REGULATOR_EVENT_OVER_CURRENT },
+};
+
+static struct tps6594_regulator_irq_type tps6594_ldo4_irq_types[] = {
+       { TPS6594_IRQ_NAME_LDO4_OV, "LDO4", "overvoltage", REGULATOR_EVENT_OVER_VOLTAGE_WARN },
+       { TPS6594_IRQ_NAME_LDO4_UV, "LDO4", "undervoltage", REGULATOR_EVENT_UNDER_VOLTAGE },
+       { TPS6594_IRQ_NAME_LDO4_SC, "LDO4", "short circuit", REGULATOR_EVENT_REGULATION_OUT },
+       { TPS6594_IRQ_NAME_LDO4_ILIM, "LDO4", "reach ilim, overcurrent",
+         REGULATOR_EVENT_OVER_CURRENT },
+};
+
+static struct tps6594_regulator_irq_type *tps6594_bucks_irq_types[] = {
+       tps6594_buck1_irq_types,
+       tps6594_buck2_irq_types,
+       tps6594_buck3_irq_types,
+       tps6594_buck4_irq_types,
+       tps6594_buck5_irq_types,
+};
+
+static struct tps6594_regulator_irq_type *tps6594_ldos_irq_types[] = {
+       tps6594_ldo1_irq_types,
+       tps6594_ldo2_irq_types,
+       tps6594_ldo3_irq_types,
+       tps6594_ldo4_irq_types,
+};
+
+static const struct regulator_desc multi_regs[] = {
+       TPS6594_REGULATOR("BUCK12", "buck12", TPS6594_BUCK_1,
+                         REGULATOR_VOLTAGE, tps6594_bucks_ops, TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_VOUT_1(1),
+                         TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_CTRL(1),
+                         TPS6594_BIT_BUCK_EN, 0, 0, bucks_ranges,
+                         4, 4000, 0, NULL, 0, 0),
+       TPS6594_REGULATOR("BUCK34", "buck34", TPS6594_BUCK_3,
+                         REGULATOR_VOLTAGE, tps6594_bucks_ops, TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_VOUT_1(3),
+                         TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_CTRL(3),
+                         TPS6594_BIT_BUCK_EN, 0, 0, bucks_ranges,
+                         4, 0, 0, NULL, 0, 0),
+       TPS6594_REGULATOR("BUCK123", "buck123", TPS6594_BUCK_1,
+                         REGULATOR_VOLTAGE, tps6594_bucks_ops, TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_VOUT_1(1),
+                         TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_CTRL(1),
+                         TPS6594_BIT_BUCK_EN, 0, 0, bucks_ranges,
+                         4, 4000, 0, NULL, 0, 0),
+       TPS6594_REGULATOR("BUCK1234", "buck1234", TPS6594_BUCK_1,
+                         REGULATOR_VOLTAGE, tps6594_bucks_ops, TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_VOUT_1(1),
+                         TPS6594_MASK_BUCKS_VSET,
+                         TPS6594_REG_BUCKX_CTRL(1),
+                         TPS6594_BIT_BUCK_EN, 0, 0, bucks_ranges,
+                         4, 4000, 0, NULL, 0, 0),
+};
+
+static const struct regulator_desc ldo_regs[] = {
+       TPS6594_REGULATOR("LDO1", "ldo1", TPS6594_LDO_1,
+                         REGULATOR_VOLTAGE, tps6594_ldos_1_2_3_ops, TPS6594_MASK_LDO123_VSET,
+                         TPS6594_REG_LDOX_VOUT(0),
+                         TPS6594_MASK_LDO123_VSET,
+                         TPS6594_REG_LDOX_CTRL(0),
+                         TPS6594_BIT_LDO_EN, 0, 0, ldos_1_2_3_ranges,
+                         1, 0, 0, NULL, 0, TPS6594_BIT_LDO_BYPASS),
+       TPS6594_REGULATOR("LDO2", "ldo2", TPS6594_LDO_2,
+                         REGULATOR_VOLTAGE, tps6594_ldos_1_2_3_ops, TPS6594_MASK_LDO123_VSET,
+                         TPS6594_REG_LDOX_VOUT(1),
+                         TPS6594_MASK_LDO123_VSET,
+                         TPS6594_REG_LDOX_CTRL(1),
+                         TPS6594_BIT_LDO_EN, 0, 0, ldos_1_2_3_ranges,
+                         1, 0, 0, NULL, 0, TPS6594_BIT_LDO_BYPASS),
+       TPS6594_REGULATOR("LDO3", "ldo3", TPS6594_LDO_3,
+                         REGULATOR_VOLTAGE, tps6594_ldos_1_2_3_ops, TPS6594_MASK_LDO123_VSET,
+                         TPS6594_REG_LDOX_VOUT(2),
+                         TPS6594_MASK_LDO123_VSET,
+                         TPS6594_REG_LDOX_CTRL(2),
+                         TPS6594_BIT_LDO_EN, 0, 0, ldos_1_2_3_ranges,
+                         1, 0, 0, NULL, 0, TPS6594_BIT_LDO_BYPASS),
+       TPS6594_REGULATOR("LDO4", "ldo4", TPS6594_LDO_4,
+                         REGULATOR_VOLTAGE, tps6594_ldos_4_ops, TPS6594_MASK_LDO4_VSET >> 1,
+                         TPS6594_REG_LDOX_VOUT(3),
+                         TPS6594_MASK_LDO4_VSET,
+                         TPS6594_REG_LDOX_CTRL(3),
+                         TPS6594_BIT_LDO_EN, 0, 0, ldos_4_ranges,
+                         1, 0, 0, NULL, 0, 0),
+};
+
+static irqreturn_t tps6594_regulator_irq_handler(int irq, void *data)
+{
+       struct tps6594_regulator_irq_data *irq_data = data;
+
+       if (irq_data->type->event_name[0] == '\0') {
+               /* This is the timeout interrupt, not tied to a specific regulator */
+               dev_err(irq_data->dev,
+                       "System was put in shutdown due to timeout during an active or standby transition.\n");
+               return IRQ_HANDLED;
+       }
+
+       dev_err(irq_data->dev, "Error IRQ trap %s for %s\n",
+               irq_data->type->event_name, irq_data->type->regulator_name);
+
+       regulator_notifier_call_chain(irq_data->rdev,
+                                     irq_data->type->event, NULL);
+
+       return IRQ_HANDLED;
+}
+
+static int tps6594_request_reg_irqs(struct platform_device *pdev,
+                                   struct regulator_dev *rdev,
+                                   struct tps6594_regulator_irq_data *irq_data,
+                                   struct tps6594_regulator_irq_type *tps6594_regs_irq_types,
+                                   int *irq_idx)
+{
+       struct tps6594_regulator_irq_type *irq_type;
+       struct tps6594 *tps = dev_get_drvdata(pdev->dev.parent);
+       int j;
+       int irq;
+       int error;
+
+       for (j = 0; j < REGS_INT_NB; j++) {
+               irq_type = &tps6594_regs_irq_types[j];
+               irq = platform_get_irq_byname(pdev, irq_type->irq_name);
+               if (irq < 0)
+                       return -EINVAL;
+
+               irq_data[*irq_idx + j].dev = tps->dev;
+               irq_data[*irq_idx + j].type = irq_type;
+               irq_data[*irq_idx + j].rdev = rdev;
+
+               error = devm_request_threaded_irq(tps->dev, irq, NULL,
+                                                 tps6594_regulator_irq_handler,
+                                                 IRQF_ONESHOT,
+                                                 irq_type->irq_name,
+                                                 &irq_data[*irq_idx]);
+               (*irq_idx)++;
+               if (error) {
+                       dev_err(tps->dev, "tps6594 failed to request %s IRQ %d: %d\n",
+                               irq_type->irq_name, irq, error);
+                       return error;
+               }
+       }
+       return 0;
+}
+
+static int tps6594_regulator_probe(struct platform_device *pdev)
+{
+       struct tps6594 *tps = dev_get_drvdata(pdev->dev.parent);
+       struct regulator_dev *rdev;
+       struct device_node *np = NULL;
+       struct device_node *np_pmic_parent = NULL;
+       struct regulator_config config = {};
+       struct tps6594_regulator_irq_data *irq_data;
+       struct tps6594_ext_regulator_irq_data *irq_ext_reg_data;
+       struct tps6594_regulator_irq_type *irq_type;
+       u8 buck_configured[BUCK_NB] = { 0 };
+       u8 buck_multi[MULTI_PHASE_NB] = { 0 };
+       static const char * const multiphases[] = {"buck12", "buck123", "buck1234", "buck34"};
+       static const char *npname;
+       int error, i, irq, multi, delta;
+       int irq_idx = 0;
+       int buck_idx = 0;
+       int ext_reg_irq_nb = 2;
+
+       enum {
+               MULTI_BUCK12,
+               MULTI_BUCK123,
+               MULTI_BUCK1234,
+               MULTI_BUCK12_34,
+               MULTI_FIRST = MULTI_BUCK12,
+               MULTI_LAST = MULTI_BUCK12_34,
+               MULTI_NUM = MULTI_LAST - MULTI_FIRST + 1
+       };
+
+       config.dev = tps->dev;
+       config.driver_data = tps;
+       config.regmap = tps->regmap;
+
+       /*
+        * The switch below handles the possible multi-phase configurations,
+        * identified by the DT buck node names, so the node names must be
+        * chosen accordingly. The default is no multi-phase buck. For a
+        * multi-phase configuration, buck_configured is set for each phase
+        * so that the individual bucks are not registered separately.
+        */
+       for (multi = MULTI_FIRST; multi < MULTI_NUM; multi++) {
+               np = of_find_node_by_name(tps->dev->of_node, multiphases[multi]);
+               npname = of_node_full_name(np);
+               np_pmic_parent = of_get_parent(of_get_parent(np));
+               if (of_node_cmp(of_node_full_name(np_pmic_parent), tps->dev->of_node->full_name))
+                       continue;
+               delta = strcmp(npname, multiphases[multi]);
+               if (!delta) {
+                       switch (multi) {
+                       case MULTI_BUCK12:
+                               buck_multi[0] = 1;
+                               buck_configured[0] = 1;
+                               buck_configured[1] = 1;
+                               break;
+                       /* multiphase buck34 is supported only with buck12 */
+                       case MULTI_BUCK12_34:
+                               buck_multi[0] = 1;
+                               buck_multi[1] = 1;
+                               buck_configured[0] = 1;
+                               buck_configured[1] = 1;
+                               buck_configured[2] = 1;
+                               buck_configured[3] = 1;
+                               break;
+                       case MULTI_BUCK123:
+                               buck_multi[2] = 1;
+                               buck_configured[0] = 1;
+                               buck_configured[1] = 1;
+                               buck_configured[2] = 1;
+                               break;
+                       case MULTI_BUCK1234:
+                               buck_multi[3] = 1;
+                               buck_configured[0] = 1;
+                               buck_configured[1] = 1;
+                               buck_configured[2] = 1;
+                               buck_configured[3] = 1;
+                               break;
+                       }
+               }
+       }
+
+       if (tps->chip_id == LP8764)
+               /* There are only 4 bucks on the LP8764 */
+               buck_configured[4] = 1;
+
+       irq_data = devm_kmalloc_array(tps->dev,
+                               REGS_INT_NB * sizeof(struct tps6594_regulator_irq_data),
+                               ARRAY_SIZE(tps6594_bucks_irq_types) +
+                               ARRAY_SIZE(tps6594_ldos_irq_types),
+                               GFP_KERNEL);
+       if (!irq_data)
+               return -ENOMEM;
+
+       for (i = 0; i < MULTI_PHASE_NB; i++) {
+               if (buck_multi[i] == 0)
+                       continue;
+
+               rdev = devm_regulator_register(&pdev->dev, &multi_regs[i], &config);
+               if (IS_ERR(rdev))
+                       return dev_err_probe(tps->dev, PTR_ERR(rdev),
+                                            "failed to register %s regulator\n",
+                                            pdev->name);
+
+               /* config multiphase buck12+buck34 */
+               if (i == 1)
+                       buck_idx = 2;
+               error = tps6594_request_reg_irqs(pdev, rdev, irq_data,
+                                                tps6594_bucks_irq_types[buck_idx], &irq_idx);
+               if (error)
+                       return error;
+               error = tps6594_request_reg_irqs(pdev, rdev, irq_data,
+                                                tps6594_bucks_irq_types[buck_idx + 1], &irq_idx);
+               if (error)
+                       return error;
+
+               if (i == 2 || i == 3) {
+                       error = tps6594_request_reg_irqs(pdev, rdev, irq_data,
+                                                        tps6594_bucks_irq_types[buck_idx + 2],
+                                                        &irq_idx);
+                       if (error)
+                               return error;
+               }
+               if (i == 3) {
+                       error = tps6594_request_reg_irqs(pdev, rdev, irq_data,
+                                                        tps6594_bucks_irq_types[buck_idx + 3],
+                                                        &irq_idx);
+                       if (error)
+                               return error;
+               }
+       }
+
+       for (i = 0; i < BUCK_NB; i++) {
+               if (buck_configured[i] == 1)
+                       continue;
+
+               rdev = devm_regulator_register(&pdev->dev, &buck_regs[i], &config);
+               if (IS_ERR(rdev))
+                       return dev_err_probe(tps->dev, PTR_ERR(rdev),
+                                            "failed to register %s regulator\n",
+                                            pdev->name);
+
+               error = tps6594_request_reg_irqs(pdev, rdev, irq_data,
+                                                tps6594_bucks_irq_types[i], &irq_idx);
+               if (error)
+                       return error;
+       }
+
+       /* The LP8764 doesn't have LDOs */
+       if (tps->chip_id != LP8764) {
+               for (i = 0; i < ARRAY_SIZE(ldo_regs); i++) {
+                       rdev = devm_regulator_register(&pdev->dev, &ldo_regs[i], &config);
+                       if (IS_ERR(rdev))
+                               return dev_err_probe(tps->dev, PTR_ERR(rdev),
+                                                    "failed to register %s regulator\n",
+                                                    pdev->name);
+
+                       error = tps6594_request_reg_irqs(pdev, rdev, irq_data,
+                                                        tps6594_ldos_irq_types[i],
+                                                        &irq_idx);
+                       if (error)
+                               return error;
+               }
+       }
+
+       if (tps->chip_id == LP8764)
+               ext_reg_irq_nb = ARRAY_SIZE(tps6594_ext_regulator_irq_types);
+
+       irq_ext_reg_data = devm_kmalloc_array(tps->dev,
+                                       ext_reg_irq_nb,
+                                       sizeof(struct tps6594_ext_regulator_irq_data),
+                                       GFP_KERNEL);
+       if (!irq_ext_reg_data)
+               return -ENOMEM;
+
+       for (i = 0; i < ext_reg_irq_nb; ++i) {
+               irq_type = &tps6594_ext_regulator_irq_types[i];
+
+               irq = platform_get_irq_byname(pdev, irq_type->irq_name);
+               if (irq < 0)
+                       return -EINVAL;
+
+               irq_ext_reg_data[i].dev = tps->dev;
+               irq_ext_reg_data[i].type = irq_type;
+
+               error = devm_request_threaded_irq(tps->dev, irq, NULL,
+                                                 tps6594_regulator_irq_handler,
+                                                 IRQF_ONESHOT,
+                                                 irq_type->irq_name,
+                                                 &irq_ext_reg_data[i]);
+               if (error)
+                       return dev_err_probe(tps->dev, error,
+                                            "failed to request %s IRQ %d\n",
+                                            irq_type->irq_name, irq);
+       }
+       return 0;
+}
+
+static struct platform_driver tps6594_regulator_driver = {
+       .driver = {
+               .name = "tps6594-regulator",
+       },
+       .probe = tps6594_regulator_probe,
+};
+
+module_platform_driver(tps6594_regulator_driver);
+
+MODULE_ALIAS("platform:tps6594-regulator");
+MODULE_AUTHOR("Jerome Neanne <jneanne@baylibre.com>");
+MODULE_DESCRIPTION("TPS6594 voltage regulator driver");
+MODULE_LICENSE("GPL");
index 7538724..ffca9a8 100644 (file)
@@ -395,7 +395,7 @@ config RTC_DRV_NCT3018Y
 
 config RTC_DRV_RK808
        tristate "Rockchip RK805/RK808/RK809/RK817/RK818 RTC"
-       depends on MFD_RK808
+       depends on MFD_RK8XX
        help
          If you say yes here you will get support for the
          RTC of RK805, RK809 and RK817, RK808 and RK818 PMIC.
index 9fbfce7..4578895 100644 (file)
@@ -3234,12 +3234,12 @@ struct blk_mq_ops dasd_mq_ops = {
        .exit_hctx = dasd_exit_hctx,
 };
 
-static int dasd_open(struct block_device *bdev, fmode_t mode)
+static int dasd_open(struct gendisk *disk, blk_mode_t mode)
 {
        struct dasd_device *base;
        int rc;
 
-       base = dasd_device_from_gendisk(bdev->bd_disk);
+       base = dasd_device_from_gendisk(disk);
        if (!base)
                return -ENODEV;
 
@@ -3268,14 +3268,12 @@ static int dasd_open(struct block_device *bdev, fmode_t mode)
                rc = -ENODEV;
                goto out;
        }
-
-       if ((mode & FMODE_WRITE) &&
+       if ((mode & BLK_OPEN_WRITE) &&
            (test_bit(DASD_FLAG_DEVICE_RO, &base->flags) ||
             (base->features & DASD_FEATURE_READONLY))) {
                rc = -EROFS;
                goto out;
        }
-
        dasd_put_device(base);
        return 0;
 
@@ -3287,7 +3285,7 @@ unlock:
        return rc;
 }
 
-static void dasd_release(struct gendisk *disk, fmode_t mode)
+static void dasd_release(struct gendisk *disk)
 {
        struct dasd_device *base = dasd_device_from_gendisk(disk);
        if (base) {
index 998a961..fe5108a 100644 (file)
@@ -130,7 +130,8 @@ int dasd_scan_partitions(struct dasd_block *block)
        struct block_device *bdev;
        int rc;
 
-       bdev = blkdev_get_by_dev(disk_devt(block->gdp), FMODE_READ, NULL);
+       bdev = blkdev_get_by_dev(disk_devt(block->gdp), BLK_OPEN_READ, NULL,
+                                NULL);
        if (IS_ERR(bdev)) {
                DBF_DEV_EVENT(DBF_ERR, block->base,
                              "scan partitions error, blkdev_get returned %ld",
@@ -179,7 +180,7 @@ void dasd_destroy_partitions(struct dasd_block *block)
        mutex_unlock(&bdev->bd_disk->open_mutex);
 
        /* Matching blkdev_put to the blkdev_get in dasd_scan_partitions. */
-       blkdev_put(bdev, FMODE_READ);
+       blkdev_put(bdev, NULL);
 }
 
 int dasd_gendisk_init(void)
index 33f812f..0aa5635 100644 (file)
@@ -965,7 +965,8 @@ int dasd_scan_partitions(struct dasd_block *);
 void dasd_destroy_partitions(struct dasd_block *);
 
 /* externals in dasd_ioctl.c */
-int dasd_ioctl(struct block_device *, fmode_t, unsigned int, unsigned long);
+int dasd_ioctl(struct block_device *bdev, blk_mode_t mode, unsigned int cmd,
+               unsigned long arg);
 int dasd_set_read_only(struct block_device *bdev, bool ro);
 
 /* externals in dasd_proc.c */
index 8fca725..513a7e6 100644 (file)
@@ -612,7 +612,7 @@ static int dasd_ioctl_readall_cmb(struct dasd_block *block, unsigned int cmd,
        return ret;
 }
 
-int dasd_ioctl(struct block_device *bdev, fmode_t mode,
+int dasd_ioctl(struct block_device *bdev, blk_mode_t mode,
               unsigned int cmd, unsigned long arg)
 {
        struct dasd_block *block;
index c09f2e0..200f88f 100644 (file)
@@ -28,8 +28,8 @@
 #define DCSSBLK_PARM_LEN 400
 #define DCSS_BUS_ID_SIZE 20
 
-static int dcssblk_open(struct block_device *bdev, fmode_t mode);
-static void dcssblk_release(struct gendisk *disk, fmode_t mode);
+static int dcssblk_open(struct gendisk *disk, blk_mode_t mode);
+static void dcssblk_release(struct gendisk *disk);
 static void dcssblk_submit_bio(struct bio *bio);
 static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
                long nr_pages, enum dax_access_mode mode, void **kaddr,
@@ -809,12 +809,11 @@ out_buf:
 }
 
 static int
-dcssblk_open(struct block_device *bdev, fmode_t mode)
+dcssblk_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct dcssblk_dev_info *dev_info;
+       struct dcssblk_dev_info *dev_info = disk->private_data;
        int rc;
 
-       dev_info = bdev->bd_disk->private_data;
        if (NULL == dev_info) {
                rc = -ENODEV;
                goto out;
@@ -826,7 +825,7 @@ out:
 }
 
 static void
-dcssblk_release(struct gendisk *disk, fmode_t mode)
+dcssblk_release(struct gendisk *disk)
 {
        struct dcssblk_dev_info *dev_info = disk->private_data;
        struct segment_info *entry;
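
The dasd and dcssblk changes above follow the tree-wide block layer conversion in this cycle: ->open() now takes the gendisk and a blk_mode_t instead of a block_device and fmode_t, ->release() loses its mode argument, blkdev_get_by_dev() takes BLK_OPEN_* flags plus holder/holder-ops arguments, and blkdev_put() takes the holder rather than a mode. A minimal sketch of the new block_device_operations signatures for a made-up "foo" driver (struct foo_dev and its read_only flag are purely illustrative, not part of these patches):

#include <linux/blkdev.h>
#include <linux/module.h>

struct foo_dev {
        bool read_only;                 /* hypothetical per-device flag */
};

/* ->open() now receives the gendisk directly and a blk_mode_t mode */
static int foo_open(struct gendisk *disk, blk_mode_t mode)
{
        struct foo_dev *foo = disk->private_data;

        if (!foo)
                return -ENODEV;
        /* BLK_OPEN_WRITE replaces the old FMODE_WRITE check */
        if ((mode & BLK_OPEN_WRITE) && foo->read_only)
                return -EROFS;
        return 0;
}

/* ->release() no longer receives a mode argument */
static void foo_release(struct gendisk *disk)
{
}

static const struct block_device_operations foo_fops = {
        .owner   = THIS_MODULE,
        .open    = foo_open,
        .release = foo_release,
};
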
index 599f547..942c73a 100644 (file)
@@ -51,6 +51,7 @@ static struct dentry *zcore_dir;
 static struct dentry *zcore_reipl_file;
 static struct dentry *zcore_hsa_file;
 static struct ipl_parameter_block *zcore_ipl_block;
+static unsigned long os_info_flags;
 
 static DEFINE_MUTEX(hsa_buf_mutex);
 static char hsa_buf[PAGE_SIZE] __aligned(PAGE_SIZE);
@@ -139,7 +140,13 @@ static ssize_t zcore_reipl_write(struct file *filp, const char __user *buf,
 {
        if (zcore_ipl_block) {
                diag308(DIAG308_SET, zcore_ipl_block);
-               diag308(DIAG308_LOAD_CLEAR, NULL);
+               if (os_info_flags & OS_INFO_FLAG_REIPL_CLEAR)
+                       diag308(DIAG308_LOAD_CLEAR, NULL);
+               /* Use special diag308 subcode for CCW normal ipl */
+               if (zcore_ipl_block->pb0_hdr.pbt == IPL_PBT_CCW)
+                       diag308(DIAG308_LOAD_NORMAL_DUMP, NULL);
+               else
+                       diag308(DIAG308_LOAD_NORMAL, NULL);
        }
        return count;
 }
@@ -212,7 +219,10 @@ static int __init check_sdias(void)
  */
 static int __init zcore_reipl_init(void)
 {
+       struct os_info_entry *entry;
        struct ipib_info ipib_info;
+       unsigned long os_info_addr;
+       struct os_info *os_info;
        int rc;
 
        rc = memcpy_hsa_kernel(&ipib_info, __LC_DUMP_REIPL, sizeof(ipib_info));
@@ -234,6 +244,35 @@ static int __init zcore_reipl_init(void)
                free_page((unsigned long) zcore_ipl_block);
                zcore_ipl_block = NULL;
        }
+       /*
+        * Read the bit-flags field from os_info flags entry.
+        * Return zero even for os_info read or entry checksum errors in order
+        * to continue dump processing, considering that os_info could be
+        * corrupted on the panicked system.
+        */
+       os_info = (void *)__get_free_page(GFP_KERNEL);
+       if (!os_info)
+               return -ENOMEM;
+       rc = memcpy_hsa_kernel(&os_info_addr, __LC_OS_INFO, sizeof(os_info_addr));
+       if (rc)
+               goto out;
+       if (os_info_addr < sclp.hsa_size)
+               rc = memcpy_hsa_kernel(os_info, os_info_addr, PAGE_SIZE);
+       else
+               rc = memcpy_real(os_info, os_info_addr, PAGE_SIZE);
+       if (rc || os_info_csum(os_info) != os_info->csum)
+               goto out;
+       entry = &os_info->entry[OS_INFO_FLAGS_ENTRY];
+       if (entry->addr && entry->size) {
+               if (entry->addr < sclp.hsa_size)
+                       rc = memcpy_hsa_kernel(&os_info_flags, entry->addr, sizeof(os_info_flags));
+               else
+                       rc = memcpy_real(&os_info_flags, entry->addr, sizeof(os_info_flags));
+               if (rc || (__force u32)csum_partial(&os_info_flags, entry->size, 0) != entry->csum)
+                       os_info_flags = 0;
+       }
+out:
+       free_page((unsigned long)os_info);
        return 0;
 }
 
index ff538a0..4360181 100644 (file)
@@ -171,7 +171,7 @@ static int vfio_ccw_sch_probe(struct subchannel *sch)
                return -ENODEV;
        }
 
-       parent = kzalloc(sizeof(*parent), GFP_KERNEL);
+       parent = kzalloc(struct_size(parent, mdev_types, 1), GFP_KERNEL);
        if (!parent)
                return -ENOMEM;
 
index b441ae6..b62bbc5 100644 (file)
@@ -79,7 +79,7 @@ struct vfio_ccw_parent {
 
        struct mdev_parent      parent;
        struct mdev_type        mdev_type;
-       struct mdev_type        *mdev_types[1];
+       struct mdev_type        *mdev_types[];
 };
 
 /**
index a8def50..e58bfd2 100644 (file)
@@ -2,7 +2,8 @@
 /*
  *  pkey device driver
  *
- *  Copyright IBM Corp. 2017,2019
+ *  Copyright IBM Corp. 2017, 2023
+ *
  *  Author(s): Harald Freudenberger
  */
 
@@ -32,8 +33,10 @@ MODULE_AUTHOR("IBM Corporation");
 MODULE_DESCRIPTION("s390 protected key interface");
 
 #define KEYBLOBBUFSIZE 8192    /* key buffer size used for internal processing */
+#define MINKEYBLOBBUFSIZE (sizeof(struct keytoken_header))
 #define PROTKEYBLOBBUFSIZE 256 /* protected key buffer size used internal */
 #define MAXAPQNSINLIST 64      /* max 64 apqns within a apqn list */
+#define AES_WK_VP_SIZE 32      /* Size of WK VP block appended to a prot key */
 
 /*
  * debug feature data and functions
@@ -71,49 +74,106 @@ struct protaeskeytoken {
 } __packed;
 
 /* inside view of a clear key token (type 0x00 version 0x02) */
-struct clearaeskeytoken {
-       u8  type;        /* 0x00 for PAES specific key tokens */
+struct clearkeytoken {
+       u8  type;       /* 0x00 for PAES specific key tokens */
        u8  res0[3];
-       u8  version;     /* 0x02 for clear AES key token */
+       u8  version;    /* 0x02 for clear key token */
        u8  res1[3];
-       u32 keytype;     /* key type, one of the PKEY_KEYTYPE values */
-       u32 len;         /* bytes actually stored in clearkey[] */
+       u32 keytype;    /* key type, one of the PKEY_KEYTYPE_* values */
+       u32 len;        /* bytes actually stored in clearkey[] */
        u8  clearkey[]; /* clear key value */
 } __packed;
 
+/* helper function which translates the PKEY_KEYTYPE_AES_* to their keysize */
+static inline u32 pkey_keytype_aes_to_size(u32 keytype)
+{
+       switch (keytype) {
+       case PKEY_KEYTYPE_AES_128:
+               return 16;
+       case PKEY_KEYTYPE_AES_192:
+               return 24;
+       case PKEY_KEYTYPE_AES_256:
+               return 32;
+       default:
+               return 0;
+       }
+}
+
 /*
- * Create a protected key from a clear key value.
+ * Create a protected key from a clear key value via PCKMO instruction.
  */
-static int pkey_clr2protkey(u32 keytype,
-                           const struct pkey_clrkey *clrkey,
-                           struct pkey_protkey *protkey)
+static int pkey_clr2protkey(u32 keytype, const u8 *clrkey,
+                           u8 *protkey, u32 *protkeylen, u32 *protkeytype)
 {
        /* mask of available pckmo subfunctions */
        static cpacf_mask_t pckmo_functions;
 
-       long fc;
+       u8 paramblock[112];
+       u32 pkeytype;
        int keysize;
-       u8 paramblock[64];
+       long fc;
 
        switch (keytype) {
        case PKEY_KEYTYPE_AES_128:
+               /* 16 byte key, 32 byte aes wkvp, total 48 bytes */
                keysize = 16;
+               pkeytype = keytype;
                fc = CPACF_PCKMO_ENC_AES_128_KEY;
                break;
        case PKEY_KEYTYPE_AES_192:
+               /* 24 byte key, 32 byte aes wkvp, total 56 bytes */
                keysize = 24;
+               pkeytype = keytype;
                fc = CPACF_PCKMO_ENC_AES_192_KEY;
                break;
        case PKEY_KEYTYPE_AES_256:
+               /* 32 byte key, 32 byte aes wkvp, total 64 bytes */
                keysize = 32;
+               pkeytype = keytype;
                fc = CPACF_PCKMO_ENC_AES_256_KEY;
                break;
+       case PKEY_KEYTYPE_ECC_P256:
+               /* 32 byte key, 32 byte aes wkvp, total 64 bytes */
+               keysize = 32;
+               pkeytype = PKEY_KEYTYPE_ECC;
+               fc = CPACF_PCKMO_ENC_ECC_P256_KEY;
+               break;
+       case PKEY_KEYTYPE_ECC_P384:
+               /* 48 byte key, 32 byte aes wkvp, total 80 bytes */
+               keysize = 48;
+               pkeytype = PKEY_KEYTYPE_ECC;
+               fc = CPACF_PCKMO_ENC_ECC_P384_KEY;
+               break;
+       case PKEY_KEYTYPE_ECC_P521:
+               /* 80 byte key, 32 byte aes wkvp, total 112 bytes */
+               keysize = 80;
+               pkeytype = PKEY_KEYTYPE_ECC;
+               fc = CPACF_PCKMO_ENC_ECC_P521_KEY;
+               break;
+       case PKEY_KEYTYPE_ECC_ED25519:
+               /* 32 byte key, 32 byte aes wkvp, total 64 bytes */
+               keysize = 32;
+               pkeytype = PKEY_KEYTYPE_ECC;
+               fc = CPACF_PCKMO_ENC_ECC_ED25519_KEY;
+               break;
+       case PKEY_KEYTYPE_ECC_ED448:
+               /* 64 byte key, 32 byte aes wkvp, total 96 bytes */
+               keysize = 64;
+               pkeytype = PKEY_KEYTYPE_ECC;
+               fc = CPACF_PCKMO_ENC_ECC_ED448_KEY;
+               break;
        default:
-               DEBUG_ERR("%s unknown/unsupported keytype %d\n",
+               DEBUG_ERR("%s unknown/unsupported keytype %u\n",
                          __func__, keytype);
                return -EINVAL;
        }
 
+       if (*protkeylen < keysize + AES_WK_VP_SIZE) {
+               DEBUG_ERR("%s prot key buffer size too small: %u < %d\n",
+                         __func__, *protkeylen, keysize + AES_WK_VP_SIZE);
+               return -EINVAL;
+       }
+
        /* Did we already check for PCKMO ? */
        if (!pckmo_functions.bytes[0]) {
                /* no, so check now */
@@ -128,15 +188,15 @@ static int pkey_clr2protkey(u32 keytype,
 
        /* prepare param block */
        memset(paramblock, 0, sizeof(paramblock));
-       memcpy(paramblock, clrkey->clrkey, keysize);
+       memcpy(paramblock, clrkey, keysize);
 
        /* call the pckmo instruction */
        cpacf_pckmo(fc, paramblock);
 
-       /* copy created protected key */
-       protkey->type = keytype;
-       protkey->len = keysize + 32;
-       memcpy(protkey->protkey, paramblock, keysize + 32);
+       /* copy created protected key to key buffer including the wkvp block */
+       *protkeylen = keysize + AES_WK_VP_SIZE;
+       memcpy(protkey, paramblock, *protkeylen);
+       *protkeytype = pkeytype;
 
        return 0;
 }
@@ -144,11 +204,12 @@ static int pkey_clr2protkey(u32 keytype,
 /*
  * Find card and transform secure key into protected key.
  */
-static int pkey_skey2pkey(const u8 *key, struct pkey_protkey *pkey)
+static int pkey_skey2pkey(const u8 *key, u8 *protkey,
+                         u32 *protkeylen, u32 *protkeytype)
 {
-       int rc, verify;
-       u16 cardnr, domain;
        struct keytoken_header *hdr = (struct keytoken_header *)key;
+       u16 cardnr, domain;
+       int rc, verify;
 
        zcrypt_wait_api_operational();
 
@@ -167,14 +228,13 @@ static int pkey_skey2pkey(const u8 *key, struct pkey_protkey *pkey)
                        continue;
                switch (hdr->version) {
                case TOKVER_CCA_AES:
-                       rc = cca_sec2protkey(cardnr, domain,
-                                            key, pkey->protkey,
-                                            &pkey->len, &pkey->type);
+                       rc = cca_sec2protkey(cardnr, domain, key,
+                                            protkey, protkeylen, protkeytype);
                        break;
                case TOKVER_CCA_VLSC:
-                       rc = cca_cipher2protkey(cardnr, domain,
-                                               key, pkey->protkey,
-                                               &pkey->len, &pkey->type);
+                       rc = cca_cipher2protkey(cardnr, domain, key,
+                                               protkey, protkeylen,
+                                               protkeytype);
                        break;
                default:
                        return -EINVAL;
@@ -195,9 +255,9 @@ static int pkey_skey2pkey(const u8 *key, struct pkey_protkey *pkey)
 static int pkey_clr2ep11key(const u8 *clrkey, size_t clrkeylen,
                            u8 *keybuf, size_t *keybuflen)
 {
-       int i, rc;
-       u16 card, dom;
        u32 nr_apqns, *apqns = NULL;
+       u16 card, dom;
+       int i, rc;
 
        zcrypt_wait_api_operational();
 
@@ -227,12 +287,13 @@ out:
 /*
  * Find card and transform EP11 secure key into protected key.
  */
-static int pkey_ep11key2pkey(const u8 *key, struct pkey_protkey *pkey)
+static int pkey_ep11key2pkey(const u8 *key, u8 *protkey,
+                            u32 *protkeylen, u32 *protkeytype)
 {
-       int i, rc;
-       u16 card, dom;
-       u32 nr_apqns, *apqns = NULL;
        struct ep11keyblob *kb = (struct ep11keyblob *)key;
+       u32 nr_apqns, *apqns = NULL;
+       u16 card, dom;
+       int i, rc;
 
        zcrypt_wait_api_operational();
 
@@ -246,9 +307,8 @@ static int pkey_ep11key2pkey(const u8 *key, struct pkey_protkey *pkey)
        for (rc = -ENODEV, i = 0; i < nr_apqns; i++) {
                card = apqns[i] >> 16;
                dom = apqns[i] & 0xFFFF;
-               pkey->len = sizeof(pkey->protkey);
                rc = ep11_kblob2protkey(card, dom, key, kb->head.len,
-                                       pkey->protkey, &pkey->len, &pkey->type);
+                                       protkey, protkeylen, protkeytype);
                if (rc == 0)
                        break;
        }
@@ -306,38 +366,31 @@ out:
 /*
  * Generate a random protected key
  */
-static int pkey_genprotkey(u32 keytype, struct pkey_protkey *protkey)
+static int pkey_genprotkey(u32 keytype, u8 *protkey,
+                          u32 *protkeylen, u32 *protkeytype)
 {
-       struct pkey_clrkey clrkey;
+       u8 clrkey[32];
        int keysize;
        int rc;
 
-       switch (keytype) {
-       case PKEY_KEYTYPE_AES_128:
-               keysize = 16;
-               break;
-       case PKEY_KEYTYPE_AES_192:
-               keysize = 24;
-               break;
-       case PKEY_KEYTYPE_AES_256:
-               keysize = 32;
-               break;
-       default:
+       keysize = pkey_keytype_aes_to_size(keytype);
+       if (!keysize) {
                DEBUG_ERR("%s unknown/unsupported keytype %d\n", __func__,
                          keytype);
                return -EINVAL;
        }
 
        /* generate a dummy random clear key */
-       get_random_bytes(clrkey.clrkey, keysize);
+       get_random_bytes(clrkey, keysize);
 
        /* convert it to a dummy protected key */
-       rc = pkey_clr2protkey(keytype, &clrkey, protkey);
+       rc = pkey_clr2protkey(keytype, clrkey,
+                             protkey, protkeylen, protkeytype);
        if (rc)
                return rc;
 
        /* replace the key part of the protected key with random bytes */
-       get_random_bytes(protkey->protkey, keysize);
+       get_random_bytes(protkey, keysize);
 
        return 0;
 }
@@ -345,37 +398,46 @@ static int pkey_genprotkey(u32 keytype, struct pkey_protkey *protkey)
 /*
  * Verify if a protected key is still valid
  */
-static int pkey_verifyprotkey(const struct pkey_protkey *protkey)
+static int pkey_verifyprotkey(const u8 *protkey, u32 protkeylen,
+                             u32 protkeytype)
 {
-       unsigned long fc;
        struct {
                u8 iv[AES_BLOCK_SIZE];
                u8 key[MAXPROTKEYSIZE];
        } param;
        u8 null_msg[AES_BLOCK_SIZE];
        u8 dest_buf[AES_BLOCK_SIZE];
-       unsigned int k;
+       unsigned int k, pkeylen;
+       unsigned long fc;
 
-       switch (protkey->type) {
+       switch (protkeytype) {
        case PKEY_KEYTYPE_AES_128:
+               pkeylen = 16 + AES_WK_VP_SIZE;
                fc = CPACF_KMC_PAES_128;
                break;
        case PKEY_KEYTYPE_AES_192:
+               pkeylen = 24 + AES_WK_VP_SIZE;
                fc = CPACF_KMC_PAES_192;
                break;
        case PKEY_KEYTYPE_AES_256:
+               pkeylen = 32 + AES_WK_VP_SIZE;
                fc = CPACF_KMC_PAES_256;
                break;
        default:
-               DEBUG_ERR("%s unknown/unsupported keytype %d\n", __func__,
-                         protkey->type);
+               DEBUG_ERR("%s unknown/unsupported keytype %u\n", __func__,
+                         protkeytype);
+               return -EINVAL;
+       }
+       if (protkeylen != pkeylen) {
+               DEBUG_ERR("%s invalid protected key size %u for keytype %u\n",
+                         __func__, protkeylen, protkeytype);
                return -EINVAL;
        }
 
        memset(null_msg, 0, sizeof(null_msg));
 
        memset(param.iv, 0, sizeof(param.iv));
-       memcpy(param.key, protkey->protkey, sizeof(param.key));
+       memcpy(param.key, protkey, protkeylen);
 
        k = cpacf_kmc(fc | CPACF_ENCRYPT, &param, null_msg, dest_buf,
                      sizeof(null_msg));
@@ -387,15 +449,119 @@ static int pkey_verifyprotkey(const struct pkey_protkey *protkey)
        return 0;
 }
 
+/* Helper for pkey_nonccatok2pkey, handles aes clear key token */
+static int nonccatokaes2pkey(const struct clearkeytoken *t,
+                            u8 *protkey, u32 *protkeylen, u32 *protkeytype)
+{
+       size_t tmpbuflen = max_t(size_t, SECKEYBLOBSIZE, MAXEP11AESKEYBLOBSIZE);
+       u8 *tmpbuf = NULL;
+       u32 keysize;
+       int rc;
+
+       keysize = pkey_keytype_aes_to_size(t->keytype);
+       if (!keysize) {
+               DEBUG_ERR("%s unknown/unsupported keytype %u\n",
+                         __func__, t->keytype);
+               return -EINVAL;
+       }
+       if (t->len != keysize) {
+               DEBUG_ERR("%s non clear key aes token: invalid key len %u\n",
+                         __func__, t->len);
+               return -EINVAL;
+       }
+
+       /* try direct way with the PCKMO instruction */
+       rc = pkey_clr2protkey(t->keytype, t->clearkey,
+                             protkey, protkeylen, protkeytype);
+       if (!rc)
+               goto out;
+
+       /* PCKMO failed, so try the CCA secure key way */
+       tmpbuf = kmalloc(tmpbuflen, GFP_ATOMIC);
+       if (!tmpbuf)
+               return -ENOMEM;
+       zcrypt_wait_api_operational();
+       rc = cca_clr2seckey(0xFFFF, 0xFFFF, t->keytype, t->clearkey, tmpbuf);
+       if (rc)
+               goto try_via_ep11;
+       rc = pkey_skey2pkey(tmpbuf,
+                           protkey, protkeylen, protkeytype);
+       if (!rc)
+               goto out;
+
+try_via_ep11:
+       /* if the CCA way also failed, let's try via EP11 */
+       rc = pkey_clr2ep11key(t->clearkey, t->len,
+                             tmpbuf, &tmpbuflen);
+       if (rc)
+               goto failure;
+       rc = pkey_ep11key2pkey(tmpbuf,
+                              protkey, protkeylen, protkeytype);
+       if (!rc)
+               goto out;
+
+failure:
+       DEBUG_ERR("%s unable to build protected key from clear", __func__);
+
+out:
+       kfree(tmpbuf);
+       return rc;
+}
+
+/* Helper for pkey_nonccatok2pkey, handles ecc clear key token */
+static int nonccatokecc2pkey(const struct clearkeytoken *t,
+                            u8 *protkey, u32 *protkeylen, u32 *protkeytype)
+{
+       u32 keylen;
+       int rc;
+
+       switch (t->keytype) {
+       case PKEY_KEYTYPE_ECC_P256:
+               keylen = 32;
+               break;
+       case PKEY_KEYTYPE_ECC_P384:
+               keylen = 48;
+               break;
+       case PKEY_KEYTYPE_ECC_P521:
+               keylen = 80;
+               break;
+       case PKEY_KEYTYPE_ECC_ED25519:
+               keylen = 32;
+               break;
+       case PKEY_KEYTYPE_ECC_ED448:
+               keylen = 64;
+               break;
+       default:
+               DEBUG_ERR("%s unknown/unsupported keytype %u\n",
+                         __func__, t->keytype);
+               return -EINVAL;
+       }
+
+       if (t->len != keylen) {
+               DEBUG_ERR("%s non clear key ecc token: invalid key len %u\n",
+                         __func__, t->len);
+               return -EINVAL;
+       }
+
+       /* only one path possible: via PCKMO instruction */
+       rc = pkey_clr2protkey(t->keytype, t->clearkey,
+                             protkey, protkeylen, protkeytype);
+       if (rc) {
+               DEBUG_ERR("%s unable to build protected key from clear",
+                         __func__);
+       }
+
+       return rc;
+}
+
 /*
  * Transform a non-CCA key token into a protected key
  */
 static int pkey_nonccatok2pkey(const u8 *key, u32 keylen,
-                              struct pkey_protkey *protkey)
+                              u8 *protkey, u32 *protkeylen, u32 *protkeytype)
 {
-       int rc = -EINVAL;
-       u8 *tmpbuf = NULL;
        struct keytoken_header *hdr = (struct keytoken_header *)key;
+       int rc = -EINVAL;
 
        switch (hdr->version) {
        case TOKVER_PROTECTED_KEY: {
@@ -404,59 +570,40 @@ static int pkey_nonccatok2pkey(const u8 *key, u32 keylen,
                if (keylen != sizeof(struct protaeskeytoken))
                        goto out;
                t = (struct protaeskeytoken *)key;
-               protkey->len = t->len;
-               protkey->type = t->keytype;
-               memcpy(protkey->protkey, t->protkey,
-                      sizeof(protkey->protkey));
-               rc = pkey_verifyprotkey(protkey);
+               rc = pkey_verifyprotkey(t->protkey, t->len, t->keytype);
+               if (rc)
+                       goto out;
+               memcpy(protkey, t->protkey, t->len);
+               *protkeylen = t->len;
+               *protkeytype = t->keytype;
                break;
        }
        case TOKVER_CLEAR_KEY: {
-               struct clearaeskeytoken *t;
-               struct pkey_clrkey ckey;
-               union u_tmpbuf {
-                       u8 skey[SECKEYBLOBSIZE];
-                       u8 ep11key[MAXEP11AESKEYBLOBSIZE];
-               };
-               size_t tmpbuflen = sizeof(union u_tmpbuf);
-
-               if (keylen < sizeof(struct clearaeskeytoken))
-                       goto out;
-               t = (struct clearaeskeytoken *)key;
-               if (keylen != sizeof(*t) + t->len)
-                       goto out;
-               if ((t->keytype == PKEY_KEYTYPE_AES_128 && t->len == 16) ||
-                   (t->keytype == PKEY_KEYTYPE_AES_192 && t->len == 24) ||
-                   (t->keytype == PKEY_KEYTYPE_AES_256 && t->len == 32))
-                       memcpy(ckey.clrkey, t->clearkey, t->len);
-               else
-                       goto out;
-               /* alloc temp key buffer space */
-               tmpbuf = kmalloc(tmpbuflen, GFP_ATOMIC);
-               if (!tmpbuf) {
-                       rc = -ENOMEM;
+               struct clearkeytoken *t = (struct clearkeytoken *)key;
+
+               if (keylen < sizeof(struct clearkeytoken) ||
+                   keylen != sizeof(*t) + t->len)
                        goto out;
-               }
-               /* try direct way with the PCKMO instruction */
-               rc = pkey_clr2protkey(t->keytype, &ckey, protkey);
-               if (rc == 0)
+               switch (t->keytype) {
+               case PKEY_KEYTYPE_AES_128:
+               case PKEY_KEYTYPE_AES_192:
+               case PKEY_KEYTYPE_AES_256:
+                       rc = nonccatokaes2pkey(t, protkey,
+                                              protkeylen, protkeytype);
                        break;
-               /* PCKMO failed, so try the CCA secure key way */
-               zcrypt_wait_api_operational();
-               rc = cca_clr2seckey(0xFFFF, 0xFFFF, t->keytype,
-                                   ckey.clrkey, tmpbuf);
-               if (rc == 0)
-                       rc = pkey_skey2pkey(tmpbuf, protkey);
-               if (rc == 0)
+               case PKEY_KEYTYPE_ECC_P256:
+               case PKEY_KEYTYPE_ECC_P384:
+               case PKEY_KEYTYPE_ECC_P521:
+               case PKEY_KEYTYPE_ECC_ED25519:
+               case PKEY_KEYTYPE_ECC_ED448:
+                       rc = nonccatokecc2pkey(t, protkey,
+                                              protkeylen, protkeytype);
                        break;
-               /* if the CCA way also failed, let's try via EP11 */
-               rc = pkey_clr2ep11key(ckey.clrkey, t->len,
-                                     tmpbuf, &tmpbuflen);
-               if (rc == 0)
-                       rc = pkey_ep11key2pkey(tmpbuf, protkey);
-               /* now we should really have an protected key */
-               DEBUG_ERR("%s unable to build protected key from clear",
-                         __func__);
+               default:
+                       DEBUG_ERR("%s unknown/unsupported non cca clear key type %u\n",
+                                 __func__, t->keytype);
+                       return -EINVAL;
+               }
                break;
        }
        case TOKVER_EP11_AES: {
@@ -464,7 +611,8 @@ static int pkey_nonccatok2pkey(const u8 *key, u32 keylen,
                rc = ep11_check_aes_key(debug_info, 3, key, keylen, 1);
                if (rc)
                        goto out;
-               rc = pkey_ep11key2pkey(key, protkey);
+               rc = pkey_ep11key2pkey(key,
+                                      protkey, protkeylen, protkeytype);
                break;
        }
        case TOKVER_EP11_AES_WITH_HEADER:
@@ -473,16 +621,14 @@ static int pkey_nonccatok2pkey(const u8 *key, u32 keylen,
                if (rc)
                        goto out;
                rc = pkey_ep11key2pkey(key + sizeof(struct ep11kblob_header),
-                                      protkey);
+                                      protkey, protkeylen, protkeytype);
                break;
        default:
                DEBUG_ERR("%s unknown/unsupported non-CCA token version %d\n",
                          __func__, hdr->version);
-               rc = -EINVAL;
        }
 
 out:
-       kfree(tmpbuf);
        return rc;
 }
 
@@ -490,7 +636,7 @@ out:
  * Transform a CCA internal key token into a protected key
  */
 static int pkey_ccainttok2pkey(const u8 *key, u32 keylen,
-                              struct pkey_protkey *protkey)
+                              u8 *protkey, u32 *protkeylen, u32 *protkeytype)
 {
        struct keytoken_header *hdr = (struct keytoken_header *)key;
 
@@ -509,17 +655,17 @@ static int pkey_ccainttok2pkey(const u8 *key, u32 keylen,
                return -EINVAL;
        }
 
-       return pkey_skey2pkey(key, protkey);
+       return pkey_skey2pkey(key, protkey, protkeylen, protkeytype);
 }
 
 /*
  * Transform a key blob (of any type) into a protected key
  */
 int pkey_keyblob2pkey(const u8 *key, u32 keylen,
-                     struct pkey_protkey *protkey)
+                     u8 *protkey, u32 *protkeylen, u32 *protkeytype)
 {
-       int rc;
        struct keytoken_header *hdr = (struct keytoken_header *)key;
+       int rc;
 
        if (keylen < sizeof(struct keytoken_header)) {
                DEBUG_ERR("%s invalid keylen %d\n", __func__, keylen);
@@ -528,10 +674,12 @@ int pkey_keyblob2pkey(const u8 *key, u32 keylen,
 
        switch (hdr->type) {
        case TOKTYPE_NON_CCA:
-               rc = pkey_nonccatok2pkey(key, keylen, protkey);
+               rc = pkey_nonccatok2pkey(key, keylen,
+                                        protkey, protkeylen, protkeytype);
                break;
        case TOKTYPE_CCA_INTERNAL:
-               rc = pkey_ccainttok2pkey(key, keylen, protkey);
+               rc = pkey_ccainttok2pkey(key, keylen,
+                                        protkey, protkeylen, protkeytype);
                break;
        default:
                DEBUG_ERR("%s unknown/unsupported blob type %d\n",
@@ -663,9 +811,9 @@ static int pkey_verifykey2(const u8 *key, size_t keylen,
                           enum pkey_key_type *ktype,
                           enum pkey_key_size *ksize, u32 *flags)
 {
-       int rc;
-       u32 _nr_apqns, *_apqns = NULL;
        struct keytoken_header *hdr = (struct keytoken_header *)key;
+       u32 _nr_apqns, *_apqns = NULL;
+       int rc;
 
        if (keylen < sizeof(struct keytoken_header))
                return -EINVAL;
@@ -771,10 +919,10 @@ out:
 
 static int pkey_keyblob2pkey2(const struct pkey_apqn *apqns, size_t nr_apqns,
                              const u8 *key, size_t keylen,
-                             struct pkey_protkey *pkey)
+                             u8 *protkey, u32 *protkeylen, u32 *protkeytype)
 {
-       int i, card, dom, rc;
        struct keytoken_header *hdr = (struct keytoken_header *)key;
+       int i, card, dom, rc;
 
        /* check for at least one apqn given */
        if (!apqns || !nr_apqns)
@@ -806,7 +954,9 @@ static int pkey_keyblob2pkey2(const struct pkey_apqn *apqns, size_t nr_apqns,
                        if (ep11_check_aes_key(debug_info, 3, key, keylen, 1))
                                return -EINVAL;
                } else {
-                       return pkey_nonccatok2pkey(key, keylen, pkey);
+                       return pkey_nonccatok2pkey(key, keylen,
+                                                  protkey, protkeylen,
+                                                  protkeytype);
                }
        } else {
                DEBUG_ERR("%s unknown/unsupported blob type %d\n",
@@ -822,20 +972,20 @@ static int pkey_keyblob2pkey2(const struct pkey_apqn *apqns, size_t nr_apqns,
                dom = apqns[i].domain;
                if (hdr->type == TOKTYPE_CCA_INTERNAL &&
                    hdr->version == TOKVER_CCA_AES) {
-                       rc = cca_sec2protkey(card, dom, key, pkey->protkey,
-                                            &pkey->len, &pkey->type);
+                       rc = cca_sec2protkey(card, dom, key,
+                                            protkey, protkeylen, protkeytype);
                } else if (hdr->type == TOKTYPE_CCA_INTERNAL &&
                           hdr->version == TOKVER_CCA_VLSC) {
-                       rc = cca_cipher2protkey(card, dom, key, pkey->protkey,
-                                               &pkey->len, &pkey->type);
+                       rc = cca_cipher2protkey(card, dom, key,
+                                               protkey, protkeylen,
+                                               protkeytype);
                } else {
                        /* EP11 AES secure key blob */
                        struct ep11keyblob *kb = (struct ep11keyblob *)key;
 
-                       pkey->len = sizeof(pkey->protkey);
                        rc = ep11_kblob2protkey(card, dom, key, kb->head.len,
-                                               pkey->protkey, &pkey->len,
-                                               &pkey->type);
+                                               protkey, protkeylen,
+                                               protkeytype);
                }
                if (rc == 0)
                        break;
@@ -847,9 +997,9 @@ static int pkey_keyblob2pkey2(const struct pkey_apqn *apqns, size_t nr_apqns,
 static int pkey_apqns4key(const u8 *key, size_t keylen, u32 flags,
                          struct pkey_apqn *apqns, size_t *nr_apqns)
 {
-       int rc;
-       u32 _nr_apqns, *_apqns = NULL;
        struct keytoken_header *hdr = (struct keytoken_header *)key;
+       u32 _nr_apqns, *_apqns = NULL;
+       int rc;
 
        if (keylen < sizeof(struct keytoken_header) || flags == 0)
                return -EINVAL;
@@ -860,9 +1010,9 @@ static int pkey_apqns4key(const u8 *key, size_t keylen, u32 flags,
            (hdr->version == TOKVER_EP11_AES_WITH_HEADER ||
             hdr->version == TOKVER_EP11_ECC_WITH_HEADER) &&
            is_ep11_keyblob(key + sizeof(struct ep11kblob_header))) {
-               int minhwtype = 0, api = 0;
                struct ep11keyblob *kb = (struct ep11keyblob *)
                        (key + sizeof(struct ep11kblob_header));
+               int minhwtype = 0, api = 0;
 
                if (flags != PKEY_FLAGS_MATCH_CUR_MKVP)
                        return -EINVAL;
@@ -877,8 +1027,8 @@ static int pkey_apqns4key(const u8 *key, size_t keylen, u32 flags,
        } else if (hdr->type == TOKTYPE_NON_CCA &&
                   hdr->version == TOKVER_EP11_AES &&
                   is_ep11_keyblob(key)) {
-               int minhwtype = 0, api = 0;
                struct ep11keyblob *kb = (struct ep11keyblob *)key;
+               int minhwtype = 0, api = 0;
 
                if (flags != PKEY_FLAGS_MATCH_CUR_MKVP)
                        return -EINVAL;
@@ -891,8 +1041,8 @@ static int pkey_apqns4key(const u8 *key, size_t keylen, u32 flags,
                if (rc)
                        goto out;
        } else if (hdr->type == TOKTYPE_CCA_INTERNAL) {
-               int minhwtype = ZCRYPT_CEX3C;
                u64 cur_mkvp = 0, old_mkvp = 0;
+               int minhwtype = ZCRYPT_CEX3C;
 
                if (hdr->version == TOKVER_CCA_AES) {
                        struct secaeskeytoken *t = (struct secaeskeytoken *)key;
@@ -919,8 +1069,8 @@ static int pkey_apqns4key(const u8 *key, size_t keylen, u32 flags,
                if (rc)
                        goto out;
        } else if (hdr->type == TOKTYPE_CCA_INTERNAL_PKA) {
-               u64 cur_mkvp = 0, old_mkvp = 0;
                struct eccprivkeytoken *t = (struct eccprivkeytoken *)key;
+               u64 cur_mkvp = 0, old_mkvp = 0;
 
                if (t->secid == 0x20) {
                        if (flags & PKEY_FLAGS_MATCH_CUR_MKVP)
@@ -957,8 +1107,8 @@ static int pkey_apqns4keytype(enum pkey_key_type ktype,
                              u8 cur_mkvp[32], u8 alt_mkvp[32], u32 flags,
                              struct pkey_apqn *apqns, size_t *nr_apqns)
 {
-       int rc;
        u32 _nr_apqns, *_apqns = NULL;
+       int rc;
 
        zcrypt_wait_api_operational();
 
@@ -1020,11 +1170,11 @@ out:
 }
 
 static int pkey_keyblob2pkey3(const struct pkey_apqn *apqns, size_t nr_apqns,
-                             const u8 *key, size_t keylen, u32 *protkeytype,
-                             u8 *protkey, u32 *protkeylen)
+                             const u8 *key, size_t keylen,
+                             u8 *protkey, u32 *protkeylen, u32 *protkeytype)
 {
-       int i, card, dom, rc;
        struct keytoken_header *hdr = (struct keytoken_header *)key;
+       int i, card, dom, rc;
 
        /* check for at least one apqn given */
        if (!apqns || !nr_apqns)
@@ -1076,15 +1226,8 @@ static int pkey_keyblob2pkey3(const struct pkey_apqn *apqns, size_t nr_apqns,
                if (cca_check_sececckeytoken(debug_info, 3, key, keylen, 1))
                        return -EINVAL;
        } else if (hdr->type == TOKTYPE_NON_CCA) {
-               struct pkey_protkey pkey;
-
-               rc = pkey_nonccatok2pkey(key, keylen, &pkey);
-               if (rc)
-                       return rc;
-               memcpy(protkey, pkey.protkey, pkey.len);
-               *protkeylen = pkey.len;
-               *protkeytype = pkey.type;
-               return 0;
+               return pkey_nonccatok2pkey(key, keylen,
+                                          protkey, protkeylen, protkeytype);
        } else {
                DEBUG_ERR("%s unknown/unsupported blob type %d\n",
                          __func__, hdr->type);
@@ -1130,7 +1273,7 @@ static int pkey_keyblob2pkey3(const struct pkey_apqn *apqns, size_t nr_apqns,
 
 static void *_copy_key_from_user(void __user *ukey, size_t keylen)
 {
-       if (!ukey || keylen < MINKEYBLOBSIZE || keylen > KEYBLOBBUFSIZE)
+       if (!ukey || keylen < MINKEYBLOBBUFSIZE || keylen > KEYBLOBBUFSIZE)
                return ERR_PTR(-EINVAL);
 
        return memdup_user(ukey, keylen);
@@ -1187,6 +1330,7 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
 
                if (copy_from_user(&ksp, usp, sizeof(ksp)))
                        return -EFAULT;
+               ksp.protkey.len = sizeof(ksp.protkey.protkey);
                rc = cca_sec2protkey(ksp.cardnr, ksp.domain,
                                     ksp.seckey.seckey, ksp.protkey.protkey,
                                     &ksp.protkey.len, &ksp.protkey.type);
@@ -1203,8 +1347,10 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
 
                if (copy_from_user(&kcp, ucp, sizeof(kcp)))
                        return -EFAULT;
-               rc = pkey_clr2protkey(kcp.keytype,
-                                     &kcp.clrkey, &kcp.protkey);
+               kcp.protkey.len = sizeof(kcp.protkey.protkey);
+               rc = pkey_clr2protkey(kcp.keytype, kcp.clrkey.clrkey,
+                                     kcp.protkey.protkey,
+                                     &kcp.protkey.len, &kcp.protkey.type);
                DEBUG_DBG("%s pkey_clr2protkey()=%d\n", __func__, rc);
                if (rc)
                        break;
@@ -1234,7 +1380,9 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
 
                if (copy_from_user(&ksp, usp, sizeof(ksp)))
                        return -EFAULT;
-               rc = pkey_skey2pkey(ksp.seckey.seckey, &ksp.protkey);
+               ksp.protkey.len = sizeof(ksp.protkey.protkey);
+               rc = pkey_skey2pkey(ksp.seckey.seckey, ksp.protkey.protkey,
+                                   &ksp.protkey.len, &ksp.protkey.type);
                DEBUG_DBG("%s pkey_skey2pkey()=%d\n", __func__, rc);
                if (rc)
                        break;
@@ -1263,7 +1411,9 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
 
                if (copy_from_user(&kgp, ugp, sizeof(kgp)))
                        return -EFAULT;
-               rc = pkey_genprotkey(kgp.keytype, &kgp.protkey);
+               kgp.protkey.len = sizeof(kgp.protkey.protkey);
+               rc = pkey_genprotkey(kgp.keytype, kgp.protkey.protkey,
+                                    &kgp.protkey.len, &kgp.protkey.type);
                DEBUG_DBG("%s pkey_genprotkey()=%d\n", __func__, rc);
                if (rc)
                        break;
@@ -1277,7 +1427,8 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
 
                if (copy_from_user(&kvp, uvp, sizeof(kvp)))
                        return -EFAULT;
-               rc = pkey_verifyprotkey(&kvp.protkey);
+               rc = pkey_verifyprotkey(kvp.protkey.protkey,
+                                       kvp.protkey.len, kvp.protkey.type);
                DEBUG_DBG("%s pkey_verifyprotkey()=%d\n", __func__, rc);
                break;
        }
@@ -1291,7 +1442,9 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
                kkey = _copy_key_from_user(ktp.key, ktp.keylen);
                if (IS_ERR(kkey))
                        return PTR_ERR(kkey);
-               rc = pkey_keyblob2pkey(kkey, ktp.keylen, &ktp.protkey);
+               ktp.protkey.len = sizeof(ktp.protkey.protkey);
+               rc = pkey_keyblob2pkey(kkey, ktp.keylen, ktp.protkey.protkey,
+                                      &ktp.protkey.len, &ktp.protkey.type);
                DEBUG_DBG("%s pkey_keyblob2pkey()=%d\n", __func__, rc);
                memzero_explicit(kkey, ktp.keylen);
                kfree(kkey);
@@ -1303,9 +1456,9 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
        }
        case PKEY_GENSECK2: {
                struct pkey_genseck2 __user *ugs = (void __user *)arg;
+               size_t klen = KEYBLOBBUFSIZE;
                struct pkey_genseck2 kgs;
                struct pkey_apqn *apqns;
-               size_t klen = KEYBLOBBUFSIZE;
                u8 *kkey;
 
                if (copy_from_user(&kgs, ugs, sizeof(kgs)))
@@ -1345,9 +1498,9 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
        }
        case PKEY_CLR2SECK2: {
                struct pkey_clr2seck2 __user *ucs = (void __user *)arg;
+               size_t klen = KEYBLOBBUFSIZE;
                struct pkey_clr2seck2 kcs;
                struct pkey_apqn *apqns;
-               size_t klen = KEYBLOBBUFSIZE;
                u8 *kkey;
 
                if (copy_from_user(&kcs, ucs, sizeof(kcs)))
@@ -1409,8 +1562,8 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
        }
        case PKEY_KBLOB2PROTK2: {
                struct pkey_kblob2pkey2 __user *utp = (void __user *)arg;
-               struct pkey_kblob2pkey2 ktp;
                struct pkey_apqn *apqns = NULL;
+               struct pkey_kblob2pkey2 ktp;
                u8 *kkey;
 
                if (copy_from_user(&ktp, utp, sizeof(ktp)))
@@ -1423,8 +1576,11 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
                        kfree(apqns);
                        return PTR_ERR(kkey);
                }
+               ktp.protkey.len = sizeof(ktp.protkey.protkey);
                rc = pkey_keyblob2pkey2(apqns, ktp.apqn_entries,
-                                       kkey, ktp.keylen, &ktp.protkey);
+                                       kkey, ktp.keylen,
+                                       ktp.protkey.protkey, &ktp.protkey.len,
+                                       &ktp.protkey.type);
                DEBUG_DBG("%s pkey_keyblob2pkey2()=%d\n", __func__, rc);
                kfree(apqns);
                memzero_explicit(kkey, ktp.keylen);
@@ -1437,8 +1593,8 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
        }
        case PKEY_APQNS4K: {
                struct pkey_apqns4key __user *uak = (void __user *)arg;
-               struct pkey_apqns4key kak;
                struct pkey_apqn *apqns = NULL;
+               struct pkey_apqns4key kak;
                size_t nr_apqns, len;
                u8 *kkey;
 
@@ -1486,8 +1642,8 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
        }
        case PKEY_APQNS4KT: {
                struct pkey_apqns4keytype __user *uat = (void __user *)arg;
-               struct pkey_apqns4keytype kat;
                struct pkey_apqn *apqns = NULL;
+               struct pkey_apqns4keytype kat;
                size_t nr_apqns, len;
 
                if (copy_from_user(&kat, uat, sizeof(kat)))
@@ -1528,9 +1684,9 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
        }
        case PKEY_KBLOB2PROTK3: {
                struct pkey_kblob2pkey3 __user *utp = (void __user *)arg;
-               struct pkey_kblob2pkey3 ktp;
-               struct pkey_apqn *apqns = NULL;
                u32 protkeylen = PROTKEYBLOBBUFSIZE;
+               struct pkey_apqn *apqns = NULL;
+               struct pkey_kblob2pkey3 ktp;
                u8 *kkey, *protkey;
 
                if (copy_from_user(&ktp, utp, sizeof(ktp)))
@@ -1549,9 +1705,9 @@ static long pkey_unlocked_ioctl(struct file *filp, unsigned int cmd,
                        kfree(kkey);
                        return -ENOMEM;
                }
-               rc = pkey_keyblob2pkey3(apqns, ktp.apqn_entries, kkey,
-                                       ktp.keylen, &ktp.pkeytype,
-                                       protkey, &protkeylen);
+               rc = pkey_keyblob2pkey3(apqns, ktp.apqn_entries,
+                                       kkey, ktp.keylen,
+                                       protkey, &protkeylen, &ktp.pkeytype);
                DEBUG_DBG("%s pkey_keyblob2pkey3()=%d\n", __func__, rc);
                kfree(apqns);
                memzero_explicit(kkey, ktp.keylen);
@@ -1612,7 +1768,9 @@ static ssize_t pkey_protkey_aes_attr_read(u32 keytype, bool is_xts, char *buf,
        protkeytoken.version = TOKVER_PROTECTED_KEY;
        protkeytoken.keytype = keytype;
 
-       rc = pkey_genprotkey(protkeytoken.keytype, &protkey);
+       protkey.len = sizeof(protkey.protkey);
+       rc = pkey_genprotkey(protkeytoken.keytype,
+                            protkey.protkey, &protkey.len, &protkey.type);
        if (rc)
                return rc;
 
@@ -1622,7 +1780,10 @@ static ssize_t pkey_protkey_aes_attr_read(u32 keytype, bool is_xts, char *buf,
        memcpy(buf, &protkeytoken, sizeof(protkeytoken));
 
        if (is_xts) {
-               rc = pkey_genprotkey(protkeytoken.keytype, &protkey);
+               /* xts needs a second protected key, reuse protkey struct */
+               protkey.len = sizeof(protkey.protkey);
+               rc = pkey_genprotkey(protkeytoken.keytype,
+                                    protkey.protkey, &protkey.len, &protkey.type);
                if (rc)
                        return rc;
 
@@ -1717,8 +1878,8 @@ static struct attribute_group protkey_attr_group = {
 static ssize_t pkey_ccadata_aes_attr_read(u32 keytype, bool is_xts, char *buf,
                                          loff_t off, size_t count)
 {
-       int rc;
        struct pkey_seckey *seckey = (struct pkey_seckey *)buf;
+       int rc;
 
        if (off != 0 || count < sizeof(struct secaeskeytoken))
                return -EINVAL;
@@ -1824,9 +1985,9 @@ static ssize_t pkey_ccacipher_aes_attr_read(enum pkey_key_size keybits,
                                            bool is_xts, char *buf, loff_t off,
                                            size_t count)
 {
-       int i, rc, card, dom;
-       u32 nr_apqns, *apqns = NULL;
        size_t keysize = CCACIPHERTOKENSIZE;
+       u32 nr_apqns, *apqns = NULL;
+       int i, rc, card, dom;
 
        if (off != 0 || count < CCACIPHERTOKENSIZE)
                return -EINVAL;
@@ -1947,9 +2108,9 @@ static ssize_t pkey_ep11_aes_attr_read(enum pkey_key_size keybits,
                                       bool is_xts, char *buf, loff_t off,
                                       size_t count)
 {
-       int i, rc, card, dom;
-       u32 nr_apqns, *apqns = NULL;
        size_t keysize = MAXEP11AESKEYBLOBSIZE;
+       u32 nr_apqns, *apqns = NULL;
+       int i, rc, card, dom;
 
        if (off != 0 || count < MAXEP11AESKEYBLOBSIZE)
                return -EINVAL;
index cfbcb86..a8f58e1 100644 (file)
@@ -716,6 +716,7 @@ static int vfio_ap_mdev_probe(struct mdev_device *mdev)
        ret = vfio_register_emulated_iommu_dev(&matrix_mdev->vdev);
        if (ret)
                goto err_put_vdev;
+       matrix_mdev->req_trigger = NULL;
        dev_set_drvdata(&mdev->dev, matrix_mdev);
        mutex_lock(&matrix_dev->mdevs_lock);
        list_add(&matrix_mdev->node, &matrix_dev->mdev_list);
@@ -1735,6 +1736,26 @@ static void vfio_ap_mdev_close_device(struct vfio_device *vdev)
        vfio_ap_mdev_unset_kvm(matrix_mdev);
 }
 
+static void vfio_ap_mdev_request(struct vfio_device *vdev, unsigned int count)
+{
+       struct device *dev = vdev->dev;
+       struct ap_matrix_mdev *matrix_mdev;
+
+       matrix_mdev = container_of(vdev, struct ap_matrix_mdev, vdev);
+
+       if (matrix_mdev->req_trigger) {
+               if (!(count % 10))
+                       dev_notice_ratelimited(dev,
+                                              "Relaying device request to user (#%u)\n",
+                                              count);
+
+               eventfd_signal(matrix_mdev->req_trigger, 1);
+       } else if (count == 0) {
+               dev_notice(dev,
+                          "No device request registered, blocked until released by user\n");
+       }
+}
+
 static int vfio_ap_mdev_get_device_info(unsigned long arg)
 {
        unsigned long minsz;
@@ -1750,11 +1771,115 @@ static int vfio_ap_mdev_get_device_info(unsigned long arg)
 
        info.flags = VFIO_DEVICE_FLAGS_AP | VFIO_DEVICE_FLAGS_RESET;
        info.num_regions = 0;
-       info.num_irqs = 0;
+       info.num_irqs = VFIO_AP_NUM_IRQS;
 
        return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
 }
 
+static ssize_t vfio_ap_get_irq_info(unsigned long arg)
+{
+       unsigned long minsz;
+       struct vfio_irq_info info;
+
+       minsz = offsetofend(struct vfio_irq_info, count);
+
+       if (copy_from_user(&info, (void __user *)arg, minsz))
+               return -EFAULT;
+
+       if (info.argsz < minsz || info.index >= VFIO_AP_NUM_IRQS)
+               return -EINVAL;
+
+       switch (info.index) {
+       case VFIO_AP_REQ_IRQ_INDEX:
+               info.count = 1;
+               info.flags = VFIO_IRQ_INFO_EVENTFD;
+               break;
+       default:
+               return -EINVAL;
+       }
+
+       return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+}
+
+static int vfio_ap_irq_set_init(struct vfio_irq_set *irq_set, unsigned long arg)
+{
+       int ret;
+       size_t data_size;
+       unsigned long minsz;
+
+       minsz = offsetofend(struct vfio_irq_set, count);
+
+       if (copy_from_user(irq_set, (void __user *)arg, minsz))
+               return -EFAULT;
+
+       ret = vfio_set_irqs_validate_and_prepare(irq_set, 1, VFIO_AP_NUM_IRQS,
+                                                &data_size);
+       if (ret)
+               return ret;
+
+       if (!(irq_set->flags & VFIO_IRQ_SET_ACTION_TRIGGER))
+               return -EINVAL;
+
+       return 0;
+}
+
+static int vfio_ap_set_request_irq(struct ap_matrix_mdev *matrix_mdev,
+                                  unsigned long arg)
+{
+       s32 fd;
+       void __user *data;
+       unsigned long minsz;
+       struct eventfd_ctx *req_trigger;
+
+       minsz = offsetofend(struct vfio_irq_set, count);
+       data = (void __user *)(arg + minsz);
+
+       if (get_user(fd, (s32 __user *)data))
+               return -EFAULT;
+
+       if (fd == -1) {
+               if (matrix_mdev->req_trigger)
+                       eventfd_ctx_put(matrix_mdev->req_trigger);
+               matrix_mdev->req_trigger = NULL;
+       } else if (fd >= 0) {
+               req_trigger = eventfd_ctx_fdget(fd);
+               if (IS_ERR(req_trigger))
+                       return PTR_ERR(req_trigger);
+
+               if (matrix_mdev->req_trigger)
+                       eventfd_ctx_put(matrix_mdev->req_trigger);
+
+               matrix_mdev->req_trigger = req_trigger;
+       } else {
+               return -EINVAL;
+       }
+
+       return 0;
+}
+
+static int vfio_ap_set_irqs(struct ap_matrix_mdev *matrix_mdev,
+                           unsigned long arg)
+{
+       int ret;
+       struct vfio_irq_set irq_set;
+
+       ret = vfio_ap_irq_set_init(&irq_set, arg);
+       if (ret)
+               return ret;
+
+       switch (irq_set.flags & VFIO_IRQ_SET_DATA_TYPE_MASK) {
+       case VFIO_IRQ_SET_DATA_EVENTFD:
+               switch (irq_set.index) {
+               case VFIO_AP_REQ_IRQ_INDEX:
+                       return vfio_ap_set_request_irq(matrix_mdev, arg);
+               default:
+                       return -EINVAL;
+               }
+       default:
+               return -EINVAL;
+       }
+}
+
 static ssize_t vfio_ap_mdev_ioctl(struct vfio_device *vdev,
                                    unsigned int cmd, unsigned long arg)
 {
@@ -1770,6 +1895,12 @@ static ssize_t vfio_ap_mdev_ioctl(struct vfio_device *vdev,
        case VFIO_DEVICE_RESET:
                ret = vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
                break;
+       case VFIO_DEVICE_GET_IRQ_INFO:
+               ret = vfio_ap_get_irq_info(arg);
+               break;
+       case VFIO_DEVICE_SET_IRQS:
+               ret = vfio_ap_set_irqs(matrix_mdev, arg);
+               break;
        default:
                ret = -EOPNOTSUPP;
                break;
@@ -1844,6 +1975,7 @@ static const struct vfio_device_ops vfio_ap_matrix_dev_ops = {
        .bind_iommufd = vfio_iommufd_emulated_bind,
        .unbind_iommufd = vfio_iommufd_emulated_unbind,
        .attach_ioas = vfio_iommufd_emulated_attach_ioas,
+       .request = vfio_ap_mdev_request
 };
 
 static struct mdev_driver vfio_ap_matrix_driver = {
index 976a65f..4642bbd 100644 (file)
@@ -15,6 +15,7 @@
 #include <linux/types.h>
 #include <linux/mdev.h>
 #include <linux/delay.h>
+#include <linux/eventfd.h>
 #include <linux/mutex.h>
 #include <linux/kvm_host.h>
 #include <linux/vfio.h>
@@ -103,6 +104,7 @@ struct ap_queue_table {
  *             PQAP(AQIC) instruction.
  * @mdev:      the mediated device
  * @qtable:    table of queues (struct vfio_ap_queue) assigned to the mdev
+ * @req_trigger: eventfd ctx for signaling userspace to return a device
  * @apm_add:   bitmap of APIDs added to the host's AP configuration
  * @aqm_add:   bitmap of APQIs added to the host's AP configuration
  * @adm_add:   bitmap of control domain numbers added to the host's AP
@@ -117,6 +119,7 @@ struct ap_matrix_mdev {
        crypto_hook pqap_hook;
        struct mdev_device *mdev;
        struct ap_queue_table qtable;
+       struct eventfd_ctx *req_trigger;
        DECLARE_BITMAP(apm_add, AP_DEVICES);
        DECLARE_BITMAP(aqm_add, AP_DOMAINS);
        DECLARE_BITMAP(adm_add, AP_DOMAINS);
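
For orientation only, a minimal userspace sketch (not part of this patch) of how a VFIO client would be expected to consume the new device-request plumbing: query the IRQ with VFIO_DEVICE_GET_IRQ_INFO, then register an eventfd via VFIO_DEVICE_SET_IRQS so the kernel's .request callback can signal it. VFIO_AP_REQ_IRQ_INDEX and the vfio_irq_* structures come from the UAPI header added alongside this change; error handling is trimmed and names of local variables are illustrative.

	#include <stdint.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/eventfd.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	static int ap_register_req_eventfd(int device_fd)
	{
		struct vfio_irq_info irq_info = {
			.argsz = sizeof(irq_info),
			.index = VFIO_AP_REQ_IRQ_INDEX,
		};
		size_t argsz = sizeof(struct vfio_irq_set) + sizeof(int32_t);
		struct vfio_irq_set *irq_set;
		int efd;

		if (ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info) < 0 ||
		    !(irq_info.flags & VFIO_IRQ_INFO_EVENTFD))
			return -1;

		efd = eventfd(0, 0);
		if (efd < 0)
			return -1;

		irq_set = calloc(1, argsz);
		irq_set->argsz = argsz;
		irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
				 VFIO_IRQ_SET_ACTION_TRIGGER;
		irq_set->index = VFIO_AP_REQ_IRQ_INDEX;
		irq_set->start = 0;
		irq_set->count = 1;
		memcpy(irq_set->data, &efd, sizeof(int32_t));

		if (ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set) < 0) {
			free(irq_set);
			return -1;
		}
		free(irq_set);

		/* poll/read efd; the kernel signals it from the .request callback */
		return efd;
	}

Passing fd == -1 through the same ioctl clears the trigger, matching the fd == -1 branch of vfio_ap_set_request_irq() above.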
index 38d20a6..f925f86 100644 (file)
@@ -617,7 +617,7 @@ static int twa_check_srl(TW_Device_Extension *tw_dev, int *flashed)
        }
 
        /* Load rest of compatibility struct */
-       strlcpy(tw_dev->tw_compat_info.driver_version, TW_DRIVER_VERSION,
+       strscpy(tw_dev->tw_compat_info.driver_version, TW_DRIVER_VERSION,
                sizeof(tw_dev->tw_compat_info.driver_version));
        tw_dev->tw_compat_info.driver_srl_high = TW_CURRENT_DRIVER_SRL;
        tw_dev->tw_compat_info.driver_branch_high = TW_CURRENT_DRIVER_BRANCH;
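
The strlcpy() -> strscpy() conversions in this pull are mechanical, but the return semantics differ; a small illustrative sketch (not from the patch):

	char buf[8];
	ssize_t n;

	/* strscpy() always NUL-terminates and reports truncation: */
	n = strscpy(buf, "driver-version-string", sizeof(buf));
	/* n == -E2BIG, buf now holds "driver-" plus the terminating NUL */

	/*
	 * strlcpy() would return strlen(src) instead, which both reads the
	 * whole (possibly unterminated) source and is easy to misuse as a
	 * copied-length value.
	 */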
index ca85bdd..cea3a79 100644 (file)
@@ -417,7 +417,7 @@ static int NCR5380_init(struct Scsi_Host *instance, int flags)
        INIT_WORK(&hostdata->main_task, NCR5380_main);
        hostdata->work_q = alloc_workqueue("ncr5380_%d",
                                WQ_UNBOUND | WQ_MEM_RECLAIM,
-                               1, instance->host_no);
+                               0, instance->host_no);
        if (!hostdata->work_q)
                return -ENOMEM;
 
index 24c049e..70e1cac 100644 (file)
@@ -3289,7 +3289,7 @@ static int query_disk(struct aac_dev *dev, void __user *arg)
        else
                qd.unmapped = 0;
 
-       strlcpy(qd.name, fsa_dev_ptr[qd.cnum].devname,
+       strscpy(qd.name, fsa_dev_ptr[qd.cnum].devname,
          min(sizeof(qd.name), sizeof(fsa_dev_ptr[qd.cnum].devname) + 1));
 
        if (copy_to_user(arg, &qd, sizeof (struct aac_query_disk)))
index 2b3f0c1..872ad37 100644 (file)
@@ -383,7 +383,7 @@ int bnx2i_get_stats(void *handle)
        if (!stats)
                return -ENOMEM;
 
-       strlcpy(stats->version, DRV_MODULE_VERSION, sizeof(stats->version));
+       strscpy(stats->version, DRV_MODULE_VERSION, sizeof(stats->version));
        memcpy(stats->mac_add1 + 2, hba->cnic->mac_addr, ETH_ALEN);
 
        stats->max_frame_size = hba->netdev->mtu;
index ac648bb..cb0a399 100644 (file)
@@ -877,7 +877,8 @@ static long ch_ioctl(struct file *file,
        }
 
        default:
-               return scsi_ioctl(ch->device, file->f_mode, cmd, argp);
+               return scsi_ioctl(ch->device, file->f_mode & FMODE_WRITE, cmd,
+                                 argp);
 
        }
 }
index 06ccb51..f5334cc 100644 (file)
@@ -1394,8 +1394,8 @@ static int hptiop_probe(struct pci_dev *pcidev, const struct pci_device_id *id)
        host->cmd_per_lun = le32_to_cpu(iop_config.max_requests);
        host->max_cmd_len = 16;
 
-       req_size = struct_size((struct hpt_iop_request_scsi_command *)0,
-                              sg_list, hba->max_sg_descriptors);
+       req_size = struct_size_t(struct hpt_iop_request_scsi_command,
+                                sg_list, hba->max_sg_descriptors);
        if ((req_size & 0x1f) != 0)
                req_size = (req_size + 0x1f) & ~0x1f;
 
index 63f32f8..5959929 100644 (file)
@@ -250,7 +250,7 @@ static void gather_partition_info(void)
 
        ppartition_name = of_get_property(of_root, "ibm,partition-name", NULL);
        if (ppartition_name)
-               strlcpy(partition_name, ppartition_name,
+               strscpy(partition_name, ppartition_name,
                                sizeof(partition_name));
        p_number_ptr = of_get_property(of_root, "ibm,partition-no", NULL);
        if (p_number_ptr)
@@ -1282,12 +1282,12 @@ static void send_mad_capabilities(struct ibmvscsi_host_data *hostdata)
        if (hostdata->client_migrated)
                hostdata->caps.flags |= cpu_to_be32(CLIENT_MIGRATED);
 
-       strlcpy(hostdata->caps.name, dev_name(&hostdata->host->shost_gendev),
+       strscpy(hostdata->caps.name, dev_name(&hostdata->host->shost_gendev),
                sizeof(hostdata->caps.name));
 
        location = of_get_property(of_node, "ibm,loc-code", NULL);
        location = location ? location : dev_name(hostdata->dev);
-       strlcpy(hostdata->caps.loc, location, sizeof(hostdata->caps.loc));
+       strscpy(hostdata->caps.loc, location, sizeof(hostdata->caps.loc));
 
        req->common.type = cpu_to_be32(VIOSRP_CAPABILITIES_TYPE);
        req->buffer = cpu_to_be64(hostdata->caps_addr);
index 317c944..050eed8 100644 (file)
@@ -5153,8 +5153,8 @@ static void megasas_update_ext_vd_details(struct megasas_instance *instance)
                fusion->max_map_sz = ventura_map_sz;
        } else {
                fusion->old_map_sz =
-                       struct_size((struct MR_FW_RAID_MAP *)0, ldSpanMap,
-                                   instance->fw_supported_vd_count);
+                       struct_size_t(struct MR_FW_RAID_MAP, ldSpanMap,
+                                     instance->fw_supported_vd_count);
                fusion->new_map_sz =  sizeof(struct MR_FW_RAID_MAP_EXT);
 
                fusion->max_map_sz =
@@ -5789,8 +5789,8 @@ megasas_setup_jbod_map(struct megasas_instance *instance)
        struct fusion_context *fusion = instance->ctrl_context;
        size_t pd_seq_map_sz;
 
-       pd_seq_map_sz = struct_size((struct MR_PD_CFG_SEQ_NUM_SYNC *)0, seq,
-                                   MAX_PHYSICAL_DEVICES);
+       pd_seq_map_sz = struct_size_t(struct MR_PD_CFG_SEQ_NUM_SYNC, seq,
+                                     MAX_PHYSICAL_DEVICES);
 
        instance->use_seqnum_jbod_fp =
                instance->support_seqnum_jbod_fp;
@@ -8033,8 +8033,8 @@ skip_firing_dcmds:
        if (instance->adapter_type != MFI_SERIES) {
                megasas_release_fusion(instance);
                pd_seq_map_sz =
-                       struct_size((struct MR_PD_CFG_SEQ_NUM_SYNC *)0,
-                                   seq, MAX_PHYSICAL_DEVICES);
+                       struct_size_t(struct MR_PD_CFG_SEQ_NUM_SYNC,
+                                     seq, MAX_PHYSICAL_DEVICES);
                for (i = 0; i < 2 ; i++) {
                        if (fusion->ld_map[i])
                                dma_free_coherent(&instance->pdev->dev,
index 4463a53..b8b388a 100644 (file)
@@ -326,9 +326,9 @@ u8 MR_ValidateMapInfo(struct megasas_instance *instance, u64 map_id)
        else if (instance->supportmax256vd)
                expected_size = sizeof(struct MR_FW_RAID_MAP_EXT);
        else
-               expected_size = struct_size((struct MR_FW_RAID_MAP *)0,
-                                           ldSpanMap,
-                                           le16_to_cpu(pDrvRaidMap->ldCount));
+               expected_size = struct_size_t(struct MR_FW_RAID_MAP,
+                                             ldSpanMap,
+                                             le16_to_cpu(pDrvRaidMap->ldCount));
 
        if (le32_to_cpu(pDrvRaidMap->totalSize) != expected_size) {
                dev_dbg(&instance->pdev->dev, "megasas: map info structure size 0x%x",
index 45d3595..450522b 100644 (file)
@@ -2593,7 +2593,7 @@ retry_probe:
        sp_params.drv_minor = QEDI_DRIVER_MINOR_VER;
        sp_params.drv_rev = QEDI_DRIVER_REV_VER;
        sp_params.drv_eng = QEDI_DRIVER_ENG_VER;
-       strlcpy(sp_params.name, "qedi iSCSI", QED_DRV_VER_STR_SIZE);
+       strscpy(sp_params.name, "qedi iSCSI", QED_DRV_VER_STR_SIZE);
        rc = qedi_ops->common->slowpath_start(qedi->cdev, &sp_params);
        if (rc) {
                QEDI_ERR(&qedi->dbg_ctx, "Cannot start slowpath\n");
index 96ee352..a9a9ec0 100644 (file)
@@ -10,7 +10,7 @@
 #define uptr64(val) ((void __user *)(uintptr_t)(val))
 
 static int scsi_bsg_sg_io_fn(struct request_queue *q, struct sg_io_v4 *hdr,
-               fmode_t mode, unsigned int timeout)
+               bool open_for_write, unsigned int timeout)
 {
        struct scsi_cmnd *scmd;
        struct request *rq;
@@ -42,7 +42,7 @@ static int scsi_bsg_sg_io_fn(struct request_queue *q, struct sg_io_v4 *hdr,
        if (copy_from_user(scmd->cmnd, uptr64(hdr->request), scmd->cmd_len))
                goto out_put_request;
        ret = -EPERM;
-       if (!scsi_cmd_allowed(scmd->cmnd, mode))
+       if (!scsi_cmd_allowed(scmd->cmnd, open_for_write))
                goto out_put_request;
 
        ret = 0;
index e3b31d3..6f6c597 100644 (file)
@@ -248,7 +248,7 @@ static int scsi_send_start_stop(struct scsi_device *sdev, int data)
  * Only a subset of commands are allowed for unprivileged users. Commands used
  * to format the media, update the firmware, etc. are not permitted.
  */
-bool scsi_cmd_allowed(unsigned char *cmd, fmode_t mode)
+bool scsi_cmd_allowed(unsigned char *cmd, bool open_for_write)
 {
        /* root can do any command. */
        if (capable(CAP_SYS_RAWIO))
@@ -338,7 +338,7 @@ bool scsi_cmd_allowed(unsigned char *cmd, fmode_t mode)
        case GPCMD_SET_READ_AHEAD:
        /* ZBC */
        case ZBC_OUT:
-               return (mode & FMODE_WRITE);
+               return open_for_write;
        default:
                return false;
        }
@@ -346,7 +346,7 @@ bool scsi_cmd_allowed(unsigned char *cmd, fmode_t mode)
 EXPORT_SYMBOL(scsi_cmd_allowed);
 
 static int scsi_fill_sghdr_rq(struct scsi_device *sdev, struct request *rq,
-               struct sg_io_hdr *hdr, fmode_t mode)
+               struct sg_io_hdr *hdr, bool open_for_write)
 {
        struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(rq);
 
@@ -354,7 +354,7 @@ static int scsi_fill_sghdr_rq(struct scsi_device *sdev, struct request *rq,
                return -EMSGSIZE;
        if (copy_from_user(scmd->cmnd, hdr->cmdp, hdr->cmd_len))
                return -EFAULT;
-       if (!scsi_cmd_allowed(scmd->cmnd, mode))
+       if (!scsi_cmd_allowed(scmd->cmnd, open_for_write))
                return -EPERM;
        scmd->cmd_len = hdr->cmd_len;
 
@@ -407,7 +407,8 @@ static int scsi_complete_sghdr_rq(struct request *rq, struct sg_io_hdr *hdr,
        return ret;
 }
 
-static int sg_io(struct scsi_device *sdev, struct sg_io_hdr *hdr, fmode_t mode)
+static int sg_io(struct scsi_device *sdev, struct sg_io_hdr *hdr,
+               bool open_for_write)
 {
        unsigned long start_time;
        ssize_t ret = 0;
@@ -448,7 +449,7 @@ static int sg_io(struct scsi_device *sdev, struct sg_io_hdr *hdr, fmode_t mode)
                goto out_put_request;
        }
 
-       ret = scsi_fill_sghdr_rq(sdev, rq, hdr, mode);
+       ret = scsi_fill_sghdr_rq(sdev, rq, hdr, open_for_write);
        if (ret < 0)
                goto out_put_request;
 
@@ -477,8 +478,7 @@ out_put_request:
 /**
  * sg_scsi_ioctl  --  handle deprecated SCSI_IOCTL_SEND_COMMAND ioctl
  * @q:         request queue to send scsi commands down
- * @mode:      mode used to open the file through which the ioctl has been
- *             submitted
+ * @open_for_write: is the file / block device opened for writing?
  * @sic:       userspace structure describing the command to perform
  *
  * Send down the scsi command described by @sic to the device below
@@ -501,7 +501,7 @@ out_put_request:
  *      Positive numbers returned are the compacted SCSI error codes (4
  *      bytes in one int) where the lowest byte is the SCSI status.
  */
-static int sg_scsi_ioctl(struct request_queue *q, fmode_t mode,
+static int sg_scsi_ioctl(struct request_queue *q, bool open_for_write,
                struct scsi_ioctl_command __user *sic)
 {
        struct request *rq;
@@ -554,7 +554,7 @@ static int sg_scsi_ioctl(struct request_queue *q, fmode_t mode,
                goto error;
 
        err = -EPERM;
-       if (!scsi_cmd_allowed(scmd->cmnd, mode))
+       if (!scsi_cmd_allowed(scmd->cmnd, open_for_write))
                goto error;
 
        /* default.  possible overridden later */
@@ -776,7 +776,7 @@ static int scsi_put_cdrom_generic_arg(const struct cdrom_generic_command *cgc,
        return 0;
 }
 
-static int scsi_cdrom_send_packet(struct scsi_device *sdev, fmode_t mode,
+static int scsi_cdrom_send_packet(struct scsi_device *sdev, bool open_for_write,
                void __user *arg)
 {
        struct cdrom_generic_command cgc;
@@ -817,7 +817,7 @@ static int scsi_cdrom_send_packet(struct scsi_device *sdev, fmode_t mode,
        hdr.cmdp = ((struct cdrom_generic_command __user *) arg)->cmd;
        hdr.cmd_len = sizeof(cgc.cmd);
 
-       err = sg_io(sdev, &hdr, mode);
+       err = sg_io(sdev, &hdr, open_for_write);
        if (err == -EFAULT)
                return -EFAULT;
 
@@ -832,7 +832,7 @@ static int scsi_cdrom_send_packet(struct scsi_device *sdev, fmode_t mode,
        return err;
 }
 
-static int scsi_ioctl_sg_io(struct scsi_device *sdev, fmode_t mode,
+static int scsi_ioctl_sg_io(struct scsi_device *sdev, bool open_for_write,
                void __user *argp)
 {
        struct sg_io_hdr hdr;
@@ -841,7 +841,7 @@ static int scsi_ioctl_sg_io(struct scsi_device *sdev, fmode_t mode,
        error = get_sg_io_hdr(&hdr, argp);
        if (error)
                return error;
-       error = sg_io(sdev, &hdr, mode);
+       error = sg_io(sdev, &hdr, open_for_write);
        if (error == -EFAULT)
                return error;
        if (put_sg_io_hdr(&hdr, argp))
@@ -852,7 +852,7 @@ static int scsi_ioctl_sg_io(struct scsi_device *sdev, fmode_t mode,
 /**
  * scsi_ioctl - Dispatch ioctl to scsi device
  * @sdev: scsi device receiving ioctl
- * @mode: mode the block/char device is opened with
+ * @open_for_write: is the file / block device opened for writing?
  * @cmd: which ioctl is it
  * @arg: data associated with ioctl
  *
@@ -860,7 +860,7 @@ static int scsi_ioctl_sg_io(struct scsi_device *sdev, fmode_t mode,
  * does not take a major/minor number as the dev field.  Rather, it takes
  * a pointer to a &struct scsi_device.
  */
-int scsi_ioctl(struct scsi_device *sdev, fmode_t mode, int cmd,
+int scsi_ioctl(struct scsi_device *sdev, bool open_for_write, int cmd,
                void __user *arg)
 {
        struct request_queue *q = sdev->request_queue;
@@ -896,11 +896,11 @@ int scsi_ioctl(struct scsi_device *sdev, fmode_t mode, int cmd,
        case SG_EMULATED_HOST:
                return sg_emulated_host(q, arg);
        case SG_IO:
-               return scsi_ioctl_sg_io(sdev, mode, arg);
+               return scsi_ioctl_sg_io(sdev, open_for_write, arg);
        case SCSI_IOCTL_SEND_COMMAND:
-               return sg_scsi_ioctl(q, mode, arg);
+               return sg_scsi_ioctl(q, open_for_write, arg);
        case CDROM_SEND_PACKET:
-               return scsi_cdrom_send_packet(sdev, mode, arg);
+               return scsi_cdrom_send_packet(sdev, open_for_write, arg);
        case CDROMCLOSETRAY:
                return scsi_send_start_stop(sdev, 3);
        case CDROMEJECT:
index 1624d52..ab21697 100644 (file)
@@ -1280,11 +1280,10 @@ static void sd_uninit_command(struct scsi_cmnd *SCpnt)
                mempool_free(rq->special_vec.bv_page, sd_page_pool);
 }
 
-static bool sd_need_revalidate(struct block_device *bdev,
-               struct scsi_disk *sdkp)
+static bool sd_need_revalidate(struct gendisk *disk, struct scsi_disk *sdkp)
 {
        if (sdkp->device->removable || sdkp->write_prot) {
-               if (bdev_check_media_change(bdev))
+               if (disk_check_media_change(disk))
                        return true;
        }
 
@@ -1293,13 +1292,13 @@ static bool sd_need_revalidate(struct block_device *bdev,
         * nothing to do with partitions, BLKRRPART is used to force a full
         * revalidate after things like a format for historical reasons.
         */
-       return test_bit(GD_NEED_PART_SCAN, &bdev->bd_disk->state);
+       return test_bit(GD_NEED_PART_SCAN, &disk->state);
 }
 
 /**
  *     sd_open - open a scsi disk device
- *     @bdev: Block device of the scsi disk to open
- *     @mode: FMODE_* mask
+ *     @disk: disk to open
+ *     @mode: open mode
  *
  *     Returns 0 if successful. Returns a negated errno value in case 
  *     of error.
@@ -1309,11 +1308,11 @@ static bool sd_need_revalidate(struct block_device *bdev,
  *     In the latter case @inode and @filp carry an abridged amount
  *     of information as noted above.
  *
- *     Locking: called with bdev->bd_disk->open_mutex held.
+ *     Locking: called with disk->open_mutex held.
  **/
-static int sd_open(struct block_device *bdev, fmode_t mode)
+static int sd_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct scsi_disk *sdkp = scsi_disk(bdev->bd_disk);
+       struct scsi_disk *sdkp = scsi_disk(disk);
        struct scsi_device *sdev = sdkp->device;
        int retval;
 
@@ -1330,14 +1329,15 @@ static int sd_open(struct block_device *bdev, fmode_t mode)
        if (!scsi_block_when_processing_errors(sdev))
                goto error_out;
 
-       if (sd_need_revalidate(bdev, sdkp))
-               sd_revalidate_disk(bdev->bd_disk);
+       if (sd_need_revalidate(disk, sdkp))
+               sd_revalidate_disk(disk);
 
        /*
         * If the drive is empty, just let the open fail.
         */
        retval = -ENOMEDIUM;
-       if (sdev->removable && !sdkp->media_present && !(mode & FMODE_NDELAY))
+       if (sdev->removable && !sdkp->media_present &&
+           !(mode & BLK_OPEN_NDELAY))
                goto error_out;
 
        /*
@@ -1345,7 +1345,7 @@ static int sd_open(struct block_device *bdev, fmode_t mode)
         * if the user expects to be able to write to the thing.
         */
        retval = -EROFS;
-       if (sdkp->write_prot && (mode & FMODE_WRITE))
+       if (sdkp->write_prot && (mode & BLK_OPEN_WRITE))
                goto error_out;
 
        /*
@@ -1374,16 +1374,15 @@ error_out:
  *     sd_release - invoked when the (last) close(2) is called on this
  *     scsi disk.
  *     @disk: disk to release
- *     @mode: FMODE_* mask
  *
  *     Returns 0. 
  *
  *     Note: may block (uninterruptible) if error recovery is underway
  *     on this disk.
  *
- *     Locking: called with bdev->bd_disk->open_mutex held.
+ *     Locking: called with disk->open_mutex held.
  **/
-static void sd_release(struct gendisk *disk, fmode_t mode)
+static void sd_release(struct gendisk *disk)
 {
        struct scsi_disk *sdkp = scsi_disk(disk);
        struct scsi_device *sdev = sdkp->device;
@@ -1426,7 +1425,7 @@ static int sd_getgeo(struct block_device *bdev, struct hd_geometry *geo)
 /**
  *     sd_ioctl - process an ioctl
  *     @bdev: target block device
- *     @mode: FMODE_* mask
+ *     @mode: open mode
  *     @cmd: ioctl command number
  *     @arg: this is third argument given to ioctl(2) system call.
  *     Often contains a pointer.
@@ -1437,7 +1436,7 @@ static int sd_getgeo(struct block_device *bdev, struct hd_geometry *geo)
  *     Note: most ioctls are forwarded onto the block subsystem or further
  *     down in the scsi subsystem.
  **/
-static int sd_ioctl(struct block_device *bdev, fmode_t mode,
+static int sd_ioctl(struct block_device *bdev, blk_mode_t mode,
                    unsigned int cmd, unsigned long arg)
 {
        struct gendisk *disk = bdev->bd_disk;
@@ -1459,13 +1458,13 @@ static int sd_ioctl(struct block_device *bdev, fmode_t mode,
         * access to the device is prohibited.
         */
        error = scsi_ioctl_block_when_processing_errors(sdp, cmd,
-                       (mode & FMODE_NDELAY) != 0);
+                       (mode & BLK_OPEN_NDELAY));
        if (error)
                return error;
 
        if (is_sed_ioctl(cmd))
                return sed_ioctl(sdkp->opal_dev, cmd, p);
-       return scsi_ioctl(sdp, mode, cmd, p);
+       return scsi_ioctl(sdp, mode & BLK_OPEN_WRITE, cmd, p);
 }
 
 static void set_media_not_present(struct scsi_disk *sdkp)
index 037f8c9..dcb7378 100644 (file)
@@ -237,7 +237,7 @@ static int sg_allow_access(struct file *filp, unsigned char *cmd)
 
        if (sfp->parentdp->device->type == TYPE_SCANNER)
                return 0;
-       if (!scsi_cmd_allowed(cmd, filp->f_mode))
+       if (!scsi_cmd_allowed(cmd, filp->f_mode & FMODE_WRITE))
                return -EPERM;
        return 0;
 }
@@ -1103,7 +1103,8 @@ sg_ioctl_common(struct file *filp, Sg_device *sdp, Sg_fd *sfp,
        case SCSI_IOCTL_SEND_COMMAND:
                if (atomic_read(&sdp->detaching))
                        return -ENODEV;
-               return scsi_ioctl(sdp->device, filp->f_mode, cmd_in, p);
+               return scsi_ioctl(sdp->device, filp->f_mode & FMODE_WRITE,
+                                 cmd_in, p);
        case SG_SET_DEBUG:
                result = get_user(val, ip);
                if (result)
@@ -1159,7 +1160,7 @@ sg_ioctl(struct file *filp, unsigned int cmd_in, unsigned long arg)
        ret = sg_ioctl_common(filp, sdp, sfp, cmd_in, p);
        if (ret != -ENOIOCTLCMD)
                return ret;
-       return scsi_ioctl(sdp->device, filp->f_mode, cmd_in, p);
+       return scsi_ioctl(sdp->device, filp->f_mode & FMODE_WRITE, cmd_in, p);
 }
 
 static __poll_t
@@ -1496,6 +1497,10 @@ sg_add_device(struct device *cl_dev)
        int error;
        unsigned long iflags;
 
+       error = blk_get_queue(scsidp->request_queue);
+       if (error)
+               return error;
+
        error = -ENOMEM;
        cdev = cdev_alloc();
        if (!cdev) {
@@ -1553,6 +1558,7 @@ cdev_add_err:
 out:
        if (cdev)
                cdev_del(cdev);
+       blk_put_queue(scsidp->request_queue);
        return error;
 }
 
@@ -1560,6 +1566,7 @@ static void
 sg_device_destroy(struct kref *kref)
 {
        struct sg_device *sdp = container_of(kref, struct sg_device, d_ref);
+       struct request_queue *q = sdp->device->request_queue;
        unsigned long flags;
 
        /* CAUTION!  Note that the device can still be found via idr_find()
@@ -1567,6 +1574,9 @@ sg_device_destroy(struct kref *kref)
         * any other cleanup.
         */
 
+       blk_trace_remove(q);
+       blk_put_queue(q);
+
        write_lock_irqsave(&sg_index_lock, flags);
        idr_remove(&sg_index_idr, sdp->index);
        write_unlock_irqrestore(&sg_index_lock, flags);
index 03de97c..f4e0aa2 100644 (file)
@@ -5015,7 +5015,7 @@ static int pqi_create_queues(struct pqi_ctrl_info *ctrl_info)
 }
 
 #define PQI_REPORT_EVENT_CONFIG_BUFFER_LENGTH  \
-       struct_size((struct pqi_event_config *)0, descriptors, PQI_MAX_EVENT_DESCRIPTORS)
+       struct_size_t(struct pqi_event_config, descriptors, PQI_MAX_EVENT_DESCRIPTORS)
 
 static int pqi_configure_events(struct pqi_ctrl_info *ctrl_info,
        bool enable_events)
index 12869e6..ce886c8 100644 (file)
@@ -484,9 +484,9 @@ static void sr_revalidate_disk(struct scsi_cd *cd)
        get_sectorsize(cd);
 }
 
-static int sr_block_open(struct block_device *bdev, fmode_t mode)
+static int sr_block_open(struct gendisk *disk, blk_mode_t mode)
 {
-       struct scsi_cd *cd = scsi_cd(bdev->bd_disk);
+       struct scsi_cd *cd = scsi_cd(disk);
        struct scsi_device *sdev = cd->device;
        int ret;
 
@@ -494,11 +494,11 @@ static int sr_block_open(struct block_device *bdev, fmode_t mode)
                return -ENXIO;
 
        scsi_autopm_get_device(sdev);
-       if (bdev_check_media_change(bdev))
+       if (disk_check_media_change(disk))
                sr_revalidate_disk(cd);
 
        mutex_lock(&cd->lock);
-       ret = cdrom_open(&cd->cdi, bdev, mode);
+       ret = cdrom_open(&cd->cdi, mode);
        mutex_unlock(&cd->lock);
 
        scsi_autopm_put_device(sdev);
@@ -507,19 +507,19 @@ static int sr_block_open(struct block_device *bdev, fmode_t mode)
        return ret;
 }
 
-static void sr_block_release(struct gendisk *disk, fmode_t mode)
+static void sr_block_release(struct gendisk *disk)
 {
        struct scsi_cd *cd = scsi_cd(disk);
 
        mutex_lock(&cd->lock);
-       cdrom_release(&cd->cdi, mode);
+       cdrom_release(&cd->cdi);
        mutex_unlock(&cd->lock);
 
        scsi_device_put(cd->device);
 }
 
-static int sr_block_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
-                         unsigned long arg)
+static int sr_block_ioctl(struct block_device *bdev, blk_mode_t mode,
+               unsigned cmd, unsigned long arg)
 {
        struct scsi_cd *cd = scsi_cd(bdev->bd_disk);
        struct scsi_device *sdev = cd->device;
@@ -532,18 +532,18 @@ static int sr_block_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
        mutex_lock(&cd->lock);
 
        ret = scsi_ioctl_block_when_processing_errors(sdev, cmd,
-                       (mode & FMODE_NDELAY) != 0);
+                       (mode & BLK_OPEN_NDELAY));
        if (ret)
                goto out;
 
        scsi_autopm_get_device(sdev);
 
        if (cmd != CDROMCLOSETRAY && cmd != CDROMEJECT) {
-               ret = cdrom_ioctl(&cd->cdi, bdev, mode, cmd, arg);
+               ret = cdrom_ioctl(&cd->cdi, bdev, cmd, arg);
                if (ret != -ENOSYS)
                        goto put;
        }
-       ret = scsi_ioctl(sdev, mode, cmd, argp);
+       ret = scsi_ioctl(sdev, mode & BLK_OPEN_WRITE, cmd, argp);
 
 put:
        scsi_autopm_put_device(sdev);
index b90a440..14d7981 100644 (file)
@@ -3832,7 +3832,7 @@ static long st_ioctl(struct file *file, unsigned int cmd_in, unsigned long arg)
                break;
        }
 
-       retval = scsi_ioctl(STp->device, file->f_mode, cmd_in, p);
+       retval = scsi_ioctl(STp->device, file->f_mode & FMODE_WRITE, cmd_in, p);
        if (!retval && cmd_in == SCSI_IOCTL_STOP_UNIT) {
                /* unload */
                STp->rew_at_close = 0;
index 795a2e1..dd50a25 100644 (file)
@@ -682,6 +682,30 @@ EXPORT_SYMBOL(geni_se_clk_freq_match);
 #define GENI_SE_DMA_EOT_EN BIT(1)
 #define GENI_SE_DMA_AHB_ERR_EN BIT(2)
 #define GENI_SE_DMA_EOT_BUF BIT(0)
+
+/**
+ * geni_se_tx_init_dma() - Initiate TX DMA transfer on the serial engine
+ * @se:                        Pointer to the concerned serial engine.
+ * @iova:              Mapped DMA address.
+ * @len:               Length of the TX buffer.
+ *
+ * This function is used to initiate DMA TX transfer.
+ */
+void geni_se_tx_init_dma(struct geni_se *se, dma_addr_t iova, size_t len)
+{
+       u32 val;
+
+       val = GENI_SE_DMA_DONE_EN;
+       val |= GENI_SE_DMA_EOT_EN;
+       val |= GENI_SE_DMA_AHB_ERR_EN;
+       writel_relaxed(val, se->base + SE_DMA_TX_IRQ_EN_SET);
+       writel_relaxed(lower_32_bits(iova), se->base + SE_DMA_TX_PTR_L);
+       writel_relaxed(upper_32_bits(iova), se->base + SE_DMA_TX_PTR_H);
+       writel_relaxed(GENI_SE_DMA_EOT_BUF, se->base + SE_DMA_TX_ATTR);
+       writel(len, se->base + SE_DMA_TX_LEN);
+}
+EXPORT_SYMBOL(geni_se_tx_init_dma);
+
 /**
  * geni_se_tx_dma_prep() - Prepare the serial engine for TX DMA transfer
  * @se:                        Pointer to the concerned serial engine.
@@ -697,7 +721,6 @@ int geni_se_tx_dma_prep(struct geni_se *se, void *buf, size_t len,
                        dma_addr_t *iova)
 {
        struct geni_wrapper *wrapper = se->wrapper;
-       u32 val;
 
        if (!wrapper)
                return -EINVAL;
@@ -706,17 +729,34 @@ int geni_se_tx_dma_prep(struct geni_se *se, void *buf, size_t len,
        if (dma_mapping_error(wrapper->dev, *iova))
                return -EIO;
 
+       geni_se_tx_init_dma(se, *iova, len);
+       return 0;
+}
+EXPORT_SYMBOL(geni_se_tx_dma_prep);
+
+/**
+ * geni_se_rx_init_dma() - Initiate RX DMA transfer on the serial engine
+ * @se:                        Pointer to the concerned serial engine.
+ * @iova:              Mapped DMA address.
+ * @len:               Length of the RX buffer.
+ *
+ * This function is used to initiate DMA RX transfer.
+ */
+void geni_se_rx_init_dma(struct geni_se *se, dma_addr_t iova, size_t len)
+{
+       u32 val;
+
        val = GENI_SE_DMA_DONE_EN;
        val |= GENI_SE_DMA_EOT_EN;
        val |= GENI_SE_DMA_AHB_ERR_EN;
-       writel_relaxed(val, se->base + SE_DMA_TX_IRQ_EN_SET);
-       writel_relaxed(lower_32_bits(*iova), se->base + SE_DMA_TX_PTR_L);
-       writel_relaxed(upper_32_bits(*iova), se->base + SE_DMA_TX_PTR_H);
-       writel_relaxed(GENI_SE_DMA_EOT_BUF, se->base + SE_DMA_TX_ATTR);
-       writel(len, se->base + SE_DMA_TX_LEN);
-       return 0;
+       writel_relaxed(val, se->base + SE_DMA_RX_IRQ_EN_SET);
+       writel_relaxed(lower_32_bits(iova), se->base + SE_DMA_RX_PTR_L);
+       writel_relaxed(upper_32_bits(iova), se->base + SE_DMA_RX_PTR_H);
+       /* RX does not have EOT buffer type bit. So just reset RX_ATTR */
+       writel_relaxed(0, se->base + SE_DMA_RX_ATTR);
+       writel(len, se->base + SE_DMA_RX_LEN);
 }
-EXPORT_SYMBOL(geni_se_tx_dma_prep);
+EXPORT_SYMBOL(geni_se_rx_init_dma);
 
 /**
  * geni_se_rx_dma_prep() - Prepare the serial engine for RX DMA transfer
@@ -733,7 +773,6 @@ int geni_se_rx_dma_prep(struct geni_se *se, void *buf, size_t len,
                        dma_addr_t *iova)
 {
        struct geni_wrapper *wrapper = se->wrapper;
-       u32 val;
 
        if (!wrapper)
                return -EINVAL;
@@ -742,15 +781,7 @@ int geni_se_rx_dma_prep(struct geni_se *se, void *buf, size_t len,
        if (dma_mapping_error(wrapper->dev, *iova))
                return -EIO;
 
-       val = GENI_SE_DMA_DONE_EN;
-       val |= GENI_SE_DMA_EOT_EN;
-       val |= GENI_SE_DMA_AHB_ERR_EN;
-       writel_relaxed(val, se->base + SE_DMA_RX_IRQ_EN_SET);
-       writel_relaxed(lower_32_bits(*iova), se->base + SE_DMA_RX_PTR_L);
-       writel_relaxed(upper_32_bits(*iova), se->base + SE_DMA_RX_PTR_H);
-       /* RX does not have EOT buffer type bit. So just reset RX_ATTR */
-       writel_relaxed(0, se->base + SE_DMA_RX_ATTR);
-       writel(len, se->base + SE_DMA_RX_LEN);
+       geni_se_rx_init_dma(se, *iova, len);
        return 0;
 }
 EXPORT_SYMBOL(geni_se_rx_dma_prep);
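
With mapping split from programming, a client that already holds a DMA address (for example a buffer the SPI core pre-mapped into xfer->tx_sg / xfer->rx_sg) can program the engine directly; a minimal sketch of the intended call pattern, mirroring the spi-geni-qcom conversion later in this series (tx_dma is an assumed local dma_addr_t):

	/* Buffer already DMA-mapped by the caller (e.g. the SPI core): */
	if (xfer->tx_buf)
		geni_se_tx_init_dma(se, sg_dma_address(xfer->tx_sg.sgl),
				    sg_dma_len(xfer->tx_sg.sgl));
	if (xfer->rx_buf)
		geni_se_rx_init_dma(se, sg_dma_address(xfer->rx_sg.sgl),
				    sg_dma_len(xfer->rx_sg.sgl));

	/*
	 * Unmapped buffers keep using the prep helpers, which map the buffer
	 * and then call the *_init_dma() functions themselves:
	 */
	ret = geni_se_tx_dma_prep(se, (void *)xfer->tx_buf, xfer->len, &tx_dma);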
index 3de2ebe..abbd1fb 100644 (file)
@@ -825,6 +825,12 @@ config SPI_RSPI
        help
          SPI driver for Renesas RSPI and QSPI blocks.
 
+config SPI_RZV2M_CSI
+       tristate "Renesas RZV2M CSI controller"
+       depends on ARCH_RENESAS || COMPILE_TEST
+       help
+         SPI driver for Renesas RZ/V2M Clocked Serial Interface (CSI)
+
 config SPI_QCOM_QSPI
        tristate "QTI QSPI controller"
        depends on ARCH_QCOM || COMPILE_TEST
@@ -936,6 +942,7 @@ config SPI_SPRD_ADI
 config SPI_STM32
        tristate "STMicroelectronics STM32 SPI controller"
        depends on ARCH_STM32 || COMPILE_TEST
+       select SPI_SLAVE
        help
          SPI driver for STMicroelectronics STM32 SoCs.
 
index 28c4817..080c2c1 100644 (file)
@@ -113,6 +113,7 @@ obj-$(CONFIG_SPI_RB4XX)                     += spi-rb4xx.o
 obj-$(CONFIG_MACH_REALTEK_RTL)         += spi-realtek-rtl.o
 obj-$(CONFIG_SPI_RPCIF)                        += spi-rpc-if.o
 obj-$(CONFIG_SPI_RSPI)                 += spi-rspi.o
+obj-$(CONFIG_SPI_RZV2M_CSI)            += spi-rzv2m-csi.o
 obj-$(CONFIG_SPI_S3C64XX)              += spi-s3c64xx.o
 obj-$(CONFIG_SPI_SC18IS602)            += spi-sc18is602.o
 obj-$(CONFIG_SPI_SH)                   += spi-sh.o
index 7f06305..152cd67 100644 (file)
  */
 #define DMA_MIN_BYTES  16
 
-#define SPI_DMA_TIMEOUT                (msecs_to_jiffies(1000))
+#define SPI_DMA_MIN_TIMEOUT    (msecs_to_jiffies(1000))
+#define SPI_DMA_TIMEOUT_PER_10K        (msecs_to_jiffies(4))
 
 #define AUTOSUSPEND_TIMEOUT    2000
 
@@ -1279,7 +1280,8 @@ static int atmel_spi_one_transfer(struct spi_controller *host,
        struct atmel_spi_device *asd;
        int                     timeout;
        int                     ret;
-       unsigned long           dma_timeout;
+       unsigned int            dma_timeout;
+       long                    ret_timeout;
 
        as = spi_controller_get_devdata(host);
 
@@ -1333,11 +1335,13 @@ static int atmel_spi_one_transfer(struct spi_controller *host,
                        atmel_spi_unlock(as);
                }
 
-               dma_timeout = wait_for_completion_timeout(&as->xfer_completion,
-                                                         SPI_DMA_TIMEOUT);
-               if (WARN_ON(dma_timeout == 0)) {
-                       dev_err(&spi->dev, "spi transfer timeout\n");
-                       as->done_status = -EIO;
+               dma_timeout = msecs_to_jiffies(spi_controller_xfer_timeout(host, xfer));
+               ret_timeout = wait_for_completion_interruptible_timeout(&as->xfer_completion,
+                                                                       dma_timeout);
+               if (ret_timeout <= 0) {
+                       dev_err(&spi->dev, "spi transfer %s\n",
+                               !ret_timeout ? "timeout" : "canceled");
+                       as->done_status = ret_timeout < 0 ? ret_timeout : -EIO;
                }
 
                if (as->done_status)
index 32449be..abf10f9 100644 (file)
@@ -40,6 +40,7 @@
 #define CQSPI_SUPPORT_EXTERNAL_DMA     BIT(2)
 #define CQSPI_NO_SUPPORT_WR_COMPLETION BIT(3)
 #define CQSPI_SLOW_SRAM                BIT(4)
+#define CQSPI_NEEDS_APB_AHB_HAZARD_WAR BIT(5)
 
 /* Capabilities */
 #define CQSPI_SUPPORTS_OCTAL           BIT(0)
@@ -90,6 +91,7 @@ struct cqspi_st {
        u32                     pd_dev_id;
        bool                    wr_completion;
        bool                    slow_sram;
+       bool                    apb_ahb_hazard;
 };
 
 struct cqspi_driver_platdata {
@@ -1027,6 +1029,13 @@ static int cqspi_indirect_write_execute(struct cqspi_flash_pdata *f_pdata,
        if (cqspi->wr_delay)
                ndelay(cqspi->wr_delay);
 
+       /*
+        * If a hazard exists between the APB and AHB interfaces, perform a
+        * dummy readback from the controller to ensure synchronization.
+        */
+       if (cqspi->apb_ahb_hazard)
+               readl(reg_base + CQSPI_REG_INDIRECTWR);
+
        while (remaining > 0) {
                size_t write_words, mod_bytes;
 
@@ -1754,6 +1763,8 @@ static int cqspi_probe(struct platform_device *pdev)
                        cqspi->wr_completion = false;
                if (ddata->quirks & CQSPI_SLOW_SRAM)
                        cqspi->slow_sram = true;
+               if (ddata->quirks & CQSPI_NEEDS_APB_AHB_HAZARD_WAR)
+                       cqspi->apb_ahb_hazard = true;
 
                if (of_device_is_compatible(pdev->dev.of_node,
                                            "xlnx,versal-ospi-1.0")) {
@@ -1888,6 +1899,10 @@ static const struct cqspi_driver_platdata jh7110_qspi = {
        .quirks = CQSPI_DISABLE_DAC_MODE,
 };
 
+static const struct cqspi_driver_platdata pensando_cdns_qspi = {
+       .quirks = CQSPI_NEEDS_APB_AHB_HAZARD_WAR | CQSPI_DISABLE_DAC_MODE,
+};
+
 static const struct of_device_id cqspi_dt_ids[] = {
        {
                .compatible = "cdns,qspi-nor",
@@ -1917,6 +1932,10 @@ static const struct of_device_id cqspi_dt_ids[] = {
                .compatible = "starfive,jh7110-qspi",
                .data = &jh7110_qspi,
        },
+       {
+               .compatible = "amd,pensando-elba-qspi",
+               .data = &pensando_cdns_qspi,
+       },
        { /* end of table */ }
 };
 
index 26e6633..de8fe3c 100644 (file)
  * @regs:              Virtual address of the SPI controller registers
  * @ref_clk:           Pointer to the peripheral clock
  * @pclk:              Pointer to the APB clock
+ * @clk_rate:          Reference clock frequency, taken from @ref_clk
  * @speed_hz:          Current SPI bus clock speed in Hz
  * @txbuf:             Pointer to the TX buffer
  * @rxbuf:             Pointer to the RX buffer
index ae3108c..a8ba41a 100644 (file)
@@ -57,21 +57,17 @@ static const struct debugfs_reg32 dw_spi_dbgfs_regs[] = {
        DW_SPI_DBGFS_REG("RX_SAMPLE_DLY", DW_SPI_RX_SAMPLE_DLY),
 };
 
-static int dw_spi_debugfs_init(struct dw_spi *dws)
+static void dw_spi_debugfs_init(struct dw_spi *dws)
 {
        char name[32];
 
        snprintf(name, 32, "dw_spi%d", dws->master->bus_num);
        dws->debugfs = debugfs_create_dir(name, NULL);
-       if (!dws->debugfs)
-               return -ENOMEM;
 
        dws->regset.regs = dw_spi_dbgfs_regs;
        dws->regset.nregs = ARRAY_SIZE(dw_spi_dbgfs_regs);
        dws->regset.base = dws->regs;
        debugfs_create_regset32("registers", 0400, dws->debugfs, &dws->regset);
-
-       return 0;
 }
 
 static void dw_spi_debugfs_remove(struct dw_spi *dws)
@@ -80,9 +76,8 @@ static void dw_spi_debugfs_remove(struct dw_spi *dws)
 }
 
 #else
-static inline int dw_spi_debugfs_init(struct dw_spi *dws)
+static inline void dw_spi_debugfs_init(struct dw_spi *dws)
 {
-       return 0;
 }
 
 static inline void dw_spi_debugfs_remove(struct dw_spi *dws)
@@ -426,7 +421,10 @@ static int dw_spi_transfer_one(struct spi_controller *master,
        int ret;
 
        dws->dma_mapped = 0;
-       dws->n_bytes = DIV_ROUND_UP(transfer->bits_per_word, BITS_PER_BYTE);
+       dws->n_bytes =
+               roundup_pow_of_two(DIV_ROUND_UP(transfer->bits_per_word,
+                                               BITS_PER_BYTE));
+
        dws->tx = (void *)transfer->tx_buf;
        dws->tx_len = transfer->len / dws->n_bytes;
        dws->rx = transfer->rx_buf;
index ababb91..df81965 100644 (file)
@@ -72,12 +72,22 @@ static void dw_spi_dma_maxburst_init(struct dw_spi *dws)
        dw_writel(dws, DW_SPI_DMATDLR, dws->txburst);
 }
 
-static void dw_spi_dma_sg_burst_init(struct dw_spi *dws)
+static int dw_spi_dma_caps_init(struct dw_spi *dws)
 {
-       struct dma_slave_caps tx = {0}, rx = {0};
+       struct dma_slave_caps tx, rx;
+       int ret;
+
+       ret = dma_get_slave_caps(dws->txchan, &tx);
+       if (ret)
+               return ret;
 
-       dma_get_slave_caps(dws->txchan, &tx);
-       dma_get_slave_caps(dws->rxchan, &rx);
+       ret = dma_get_slave_caps(dws->rxchan, &rx);
+       if (ret)
+               return ret;
+
+       if (!(tx.directions & BIT(DMA_MEM_TO_DEV) &&
+             rx.directions & BIT(DMA_DEV_TO_MEM)))
+               return -ENXIO;
 
        if (tx.max_sg_burst > 0 && rx.max_sg_burst > 0)
                dws->dma_sg_burst = min(tx.max_sg_burst, rx.max_sg_burst);
@@ -87,6 +97,15 @@ static void dw_spi_dma_sg_burst_init(struct dw_spi *dws)
                dws->dma_sg_burst = rx.max_sg_burst;
        else
                dws->dma_sg_burst = 0;
+
+       /*
+        * Assuming both channels belong to the same DMA controller, the
+        * peripheral-side address width capabilities most likely are the
+        * same.
+        */
+       dws->dma_addr_widths = tx.dst_addr_widths & rx.src_addr_widths;
+
+       return 0;
 }
 
 static int dw_spi_dma_init_mfld(struct device *dev, struct dw_spi *dws)
@@ -95,6 +114,7 @@ static int dw_spi_dma_init_mfld(struct device *dev, struct dw_spi *dws)
        struct dw_dma_slave dma_rx = { .src_id = 0 }, *rx = &dma_rx;
        struct pci_dev *dma_dev;
        dma_cap_mask_t mask;
+       int ret = -EBUSY;
 
        /*
         * Get pci device for DMA controller, currently it could only
@@ -124,20 +144,25 @@ static int dw_spi_dma_init_mfld(struct device *dev, struct dw_spi *dws)
 
        init_completion(&dws->dma_completion);
 
-       dw_spi_dma_maxburst_init(dws);
+       ret = dw_spi_dma_caps_init(dws);
+       if (ret)
+               goto free_txchan;
 
-       dw_spi_dma_sg_burst_init(dws);
+       dw_spi_dma_maxburst_init(dws);
 
        pci_dev_put(dma_dev);
 
        return 0;
 
+free_txchan:
+       dma_release_channel(dws->txchan);
+       dws->txchan = NULL;
 free_rxchan:
        dma_release_channel(dws->rxchan);
        dws->rxchan = NULL;
 err_exit:
        pci_dev_put(dma_dev);
-       return -EBUSY;
+       return ret;
 }
 
 static int dw_spi_dma_init_generic(struct device *dev, struct dw_spi *dws)
@@ -163,12 +188,17 @@ static int dw_spi_dma_init_generic(struct device *dev, struct dw_spi *dws)
 
        init_completion(&dws->dma_completion);
 
-       dw_spi_dma_maxburst_init(dws);
+       ret = dw_spi_dma_caps_init(dws);
+       if (ret)
+               goto free_txchan;
 
-       dw_spi_dma_sg_burst_init(dws);
+       dw_spi_dma_maxburst_init(dws);
 
        return 0;
 
+free_txchan:
+       dma_release_channel(dws->txchan);
+       dws->txchan = NULL;
 free_rxchan:
        dma_release_channel(dws->rxchan);
        dws->rxchan = NULL;
@@ -198,22 +228,32 @@ static irqreturn_t dw_spi_dma_transfer_handler(struct dw_spi *dws)
        return IRQ_HANDLED;
 }
 
+static enum dma_slave_buswidth dw_spi_dma_convert_width(u8 n_bytes)
+{
+       switch (n_bytes) {
+       case 1:
+               return DMA_SLAVE_BUSWIDTH_1_BYTE;
+       case 2:
+               return DMA_SLAVE_BUSWIDTH_2_BYTES;
+       case 4:
+               return DMA_SLAVE_BUSWIDTH_4_BYTES;
+       default:
+               return DMA_SLAVE_BUSWIDTH_UNDEFINED;
+       }
+}
+
 static bool dw_spi_can_dma(struct spi_controller *master,
                           struct spi_device *spi, struct spi_transfer *xfer)
 {
        struct dw_spi *dws = spi_controller_get_devdata(master);
+       enum dma_slave_buswidth dma_bus_width;
 
-       return xfer->len > dws->fifo_len;
-}
+       if (xfer->len <= dws->fifo_len)
+               return false;
 
-static enum dma_slave_buswidth dw_spi_dma_convert_width(u8 n_bytes)
-{
-       if (n_bytes == 1)
-               return DMA_SLAVE_BUSWIDTH_1_BYTE;
-       else if (n_bytes == 2)
-               return DMA_SLAVE_BUSWIDTH_2_BYTES;
+       dma_bus_width = dw_spi_dma_convert_width(dws->n_bytes);
 
-       return DMA_SLAVE_BUSWIDTH_UNDEFINED;
+       return dws->dma_addr_widths & BIT(dma_bus_width);
 }
 
 static int dw_spi_dma_wait(struct dw_spi *dws, unsigned int len, u32 speed)
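
can_dma() now couples the transfer word size to the DMA channels' reported address widths; a short worked trace of the new check (the numbers are illustrative, not from the patch):

	/*
	 * Example: 16-bit words (dws->n_bytes == 2), a 64-byte transfer and
	 * fifo_len of 16:
	 *
	 *   xfer->len (64) > dws->fifo_len (16)  -> large enough for DMA
	 *   dw_spi_dma_convert_width(2)          -> DMA_SLAVE_BUSWIDTH_2_BYTES
	 *   dws->dma_addr_widths & BIT(2)        -> true only if both the TX
	 *     and RX channels advertised 2-byte address width when
	 *     dw_spi_dma_caps_init() intersected their capabilities.
	 */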
index 15f5e9c..a963bc9 100644 (file)
@@ -236,6 +236,24 @@ static int dw_spi_intel_init(struct platform_device *pdev,
        return 0;
 }
 
+/*
+ * DMA-based mem ops are not configured for this device and are not tested.
+ */
+static int dw_spi_mountevans_imc_init(struct platform_device *pdev,
+                                     struct dw_spi_mmio *dwsmmio)
+{
+       /*
+        * The Intel Mount Evans SoC's Integrated Management Complex DW
+        * apb_ssi_v4.02a controller has an errata where a full TX FIFO can
+        * result in data corruption. The suggested workaround is to never
+        * completely fill the FIFO. The TX FIFO has a size of 32 so the
+        * fifo_len is set to 31.
+        */
+       dwsmmio->dws.fifo_len = 31;
+
+       return 0;
+}
+
 static int dw_spi_canaan_k210_init(struct platform_device *pdev,
                                   struct dw_spi_mmio *dwsmmio)
 {
@@ -405,6 +423,10 @@ static const struct of_device_id dw_spi_mmio_of_match[] = {
        { .compatible = "snps,dwc-ssi-1.01a", .data = dw_spi_hssi_init},
        { .compatible = "intel,keembay-ssi", .data = dw_spi_intel_init},
        { .compatible = "intel,thunderbay-ssi", .data = dw_spi_intel_init},
+       {
+               .compatible = "intel,mountevans-imc-ssi",
+               .data = dw_spi_mountevans_imc_init,
+       },
        { .compatible = "microchip,sparx5-spi", dw_spi_mscc_sparx5_init},
        { .compatible = "canaan,k210-spi", dw_spi_canaan_k210_init},
        { .compatible = "amd,pensando-elba-spi", .data = dw_spi_elba_init},
index 9e8eb2b..3962e6d 100644 (file)
@@ -190,6 +190,7 @@ struct dw_spi {
        struct dma_chan         *rxchan;
        u32                     rxburst;
        u32                     dma_sg_burst;
+       u32                     dma_addr_widths;
        unsigned long           dma_chan_busy;
        dma_addr_t              dma_addr; /* phy address of the Data register */
        const struct dw_spi_dma_ops *dma_ops;
index 4b70038..fb68c72 100644 (file)
@@ -303,6 +303,12 @@ static int fsl_lpspi_set_bitrate(struct fsl_lpspi_data *fsl_lpspi)
 
        perclk_rate = clk_get_rate(fsl_lpspi->clk_per);
 
+       if (!config.speed_hz) {
+               dev_err(fsl_lpspi->dev,
+                       "error: the transmission speed provided is 0!\n");
+               return -EINVAL;
+       }
+
        if (config.speed_hz > perclk_rate / 2) {
                dev_err(fsl_lpspi->dev,
                      "per-clk should be at least two times of transfer speed");
@@ -911,7 +917,7 @@ static int fsl_lpspi_probe(struct platform_device *pdev)
        if (ret == -EPROBE_DEFER)
                goto out_pm_get;
        if (ret < 0)
-               dev_err(&pdev->dev, "dma setup error %d, use pio\n", ret);
+               dev_warn(&pdev->dev, "dma setup error %d, use pio\n", ret);
        else
                /*
                 * disable LPSPI module IRQ when enable DMA mode successfully,
index b293428..26ce959 100644 (file)
@@ -35,7 +35,7 @@
 #define CS_DEMUX_OUTPUT_SEL    GENMASK(3, 0)
 
 #define SE_SPI_TRANS_CFG       0x25c
-#define CS_TOGGLE              BIT(0)
+#define CS_TOGGLE              BIT(1)
 
 #define SE_SPI_WORD_LEN                0x268
 #define WORD_LEN_MSK           GENMASK(9, 0)
@@ -97,8 +97,6 @@ struct spi_geni_master {
        struct dma_chan *tx;
        struct dma_chan *rx;
        int cur_xfer_mode;
-       dma_addr_t tx_se_dma;
-       dma_addr_t rx_se_dma;
 };
 
 static int get_spi_clk_cfg(unsigned int speed_hz,
@@ -174,7 +172,7 @@ static void handle_se_timeout(struct spi_master *spi,
 unmap_if_dma:
        if (mas->cur_xfer_mode == GENI_SE_DMA) {
                if (xfer) {
-                       if (xfer->tx_buf && mas->tx_se_dma) {
+                       if (xfer->tx_buf) {
                                spin_lock_irq(&mas->lock);
                                reinit_completion(&mas->tx_reset_done);
                                writel(1, se->base + SE_DMA_TX_FSM_RST);
@@ -182,9 +180,8 @@ unmap_if_dma:
                                time_left = wait_for_completion_timeout(&mas->tx_reset_done, HZ);
                                if (!time_left)
                                        dev_err(mas->dev, "DMA TX RESET failed\n");
-                               geni_se_tx_dma_unprep(se, mas->tx_se_dma, xfer->len);
                        }
-                       if (xfer->rx_buf && mas->rx_se_dma) {
+                       if (xfer->rx_buf) {
                                spin_lock_irq(&mas->lock);
                                reinit_completion(&mas->rx_reset_done);
                                writel(1, se->base + SE_DMA_RX_FSM_RST);
@@ -192,7 +189,6 @@ unmap_if_dma:
                                time_left = wait_for_completion_timeout(&mas->rx_reset_done, HZ);
                                if (!time_left)
                                        dev_err(mas->dev, "DMA RX RESET failed\n");
-                               geni_se_rx_dma_unprep(se, mas->rx_se_dma, xfer->len);
                        }
                } else {
                        /*
@@ -523,17 +519,36 @@ static int setup_gsi_xfer(struct spi_transfer *xfer, struct spi_geni_master *mas
        return 1;
 }
 
+static u32 get_xfer_len_in_words(struct spi_transfer *xfer,
+                               struct spi_geni_master *mas)
+{
+       u32 len;
+
+       if (!(mas->cur_bits_per_word % MIN_WORD_LEN))
+               len = xfer->len * BITS_PER_BYTE / mas->cur_bits_per_word;
+       else
+               len = xfer->len / (mas->cur_bits_per_word / BITS_PER_BYTE + 1);
+       len &= TRANS_LEN_MSK;
+
+       return len;
+}
+
 static bool geni_can_dma(struct spi_controller *ctlr,
                         struct spi_device *slv, struct spi_transfer *xfer)
 {
        struct spi_geni_master *mas = spi_master_get_devdata(slv->master);
+       u32 len, fifo_size;
 
-       /*
-        * Return true if transfer needs to be mapped prior to
-        * calling transfer_one which is the case only for GPI_DMA.
-        * For SE_DMA mode, map/unmap is done in geni_se_*x_dma_prep.
-        */
-       return mas->cur_xfer_mode == GENI_GPI_DMA;
+       if (mas->cur_xfer_mode == GENI_GPI_DMA)
+               return true;
+
+       len = get_xfer_len_in_words(xfer, mas);
+       fifo_size = mas->tx_fifo_depth * mas->fifo_width_bits / mas->cur_bits_per_word;
+
+       return len > fifo_size;
 }
 
 static int spi_geni_prepare_message(struct spi_master *spi,
@@ -774,7 +789,7 @@ static int setup_se_xfer(struct spi_transfer *xfer,
                                u16 mode, struct spi_master *spi)
 {
        u32 m_cmd = 0;
-       u32 len, fifo_size;
+       u32 len;
        struct geni_se *se = &mas->se;
        int ret;
 
@@ -806,11 +821,7 @@ static int setup_se_xfer(struct spi_transfer *xfer,
        mas->tx_rem_bytes = 0;
        mas->rx_rem_bytes = 0;
 
-       if (!(mas->cur_bits_per_word % MIN_WORD_LEN))
-               len = xfer->len * BITS_PER_BYTE / mas->cur_bits_per_word;
-       else
-               len = xfer->len / (mas->cur_bits_per_word / BITS_PER_BYTE + 1);
-       len &= TRANS_LEN_MSK;
+       len = get_xfer_len_in_words(xfer, mas);
 
        mas->cur_xfer = xfer;
        if (xfer->tx_buf) {
@@ -825,9 +836,20 @@ static int setup_se_xfer(struct spi_transfer *xfer,
                mas->rx_rem_bytes = xfer->len;
        }
 
-       /* Select transfer mode based on transfer length */
-       fifo_size = mas->tx_fifo_depth * mas->fifo_width_bits / mas->cur_bits_per_word;
-       mas->cur_xfer_mode = (len <= fifo_size) ? GENI_SE_FIFO : GENI_SE_DMA;
+       /*
+        * Select DMA mode if an sg table is present, and only with a single
+        * entry. This is not a serious limitation because the xfer buffers
+        * are expected to fit into a single entry almost always, and if any
+        * doesn't for any reason we fall back to FIFO mode anyway.
+        */
+       if (!xfer->tx_sg.nents && !xfer->rx_sg.nents)
+               mas->cur_xfer_mode = GENI_SE_FIFO;
+       else if (xfer->tx_sg.nents > 1 || xfer->rx_sg.nents > 1) {
+               dev_warn_once(mas->dev, "Doing FIFO, cannot handle tx_nents-%d, rx_nents-%d\n",
+                       xfer->tx_sg.nents, xfer->rx_sg.nents);
+               mas->cur_xfer_mode = GENI_SE_FIFO;
+       } else
+               mas->cur_xfer_mode = GENI_SE_DMA;
        geni_se_select_mode(se, mas->cur_xfer_mode);
 
        /*
@@ -838,35 +860,17 @@ static int setup_se_xfer(struct spi_transfer *xfer,
        geni_se_setup_m_cmd(se, m_cmd, FRAGMENTATION);
 
        if (mas->cur_xfer_mode == GENI_SE_DMA) {
-               if (m_cmd & SPI_RX_ONLY) {
-                       ret =  geni_se_rx_dma_prep(se, xfer->rx_buf,
-                               xfer->len, &mas->rx_se_dma);
-                       if (ret) {
-                               dev_err(mas->dev, "Failed to setup Rx dma %d\n", ret);
-                               mas->rx_se_dma = 0;
-                               goto unlock_and_return;
-                       }
-               }
-               if (m_cmd & SPI_TX_ONLY) {
-                       ret =  geni_se_tx_dma_prep(se, (void *)xfer->tx_buf,
-                               xfer->len, &mas->tx_se_dma);
-                       if (ret) {
-                               dev_err(mas->dev, "Failed to setup Tx dma %d\n", ret);
-                               mas->tx_se_dma = 0;
-                               if (m_cmd & SPI_RX_ONLY) {
-                                       /* Unmap rx buffer if duplex transfer */
-                                       geni_se_rx_dma_unprep(se, mas->rx_se_dma, xfer->len);
-                                       mas->rx_se_dma = 0;
-                               }
-                               goto unlock_and_return;
-                       }
-               }
+               if (m_cmd & SPI_RX_ONLY)
+                       geni_se_rx_init_dma(se, sg_dma_address(xfer->rx_sg.sgl),
+                               sg_dma_len(xfer->rx_sg.sgl));
+               if (m_cmd & SPI_TX_ONLY)
+                       geni_se_tx_init_dma(se, sg_dma_address(xfer->tx_sg.sgl),
+                               sg_dma_len(xfer->tx_sg.sgl));
        } else if (m_cmd & SPI_TX_ONLY) {
                if (geni_spi_handle_tx(mas))
                        writel(mas->tx_wm, se->base + SE_GENI_TX_WATERMARK_REG);
        }
 
-unlock_and_return:
        spin_unlock_irq(&mas->lock);
        return ret;
 }
@@ -967,14 +971,6 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
                if (dma_rx_status & RX_RESET_DONE)
                        complete(&mas->rx_reset_done);
                if (!mas->tx_rem_bytes && !mas->rx_rem_bytes && xfer) {
-                       if (xfer->tx_buf && mas->tx_se_dma) {
-                               geni_se_tx_dma_unprep(se, mas->tx_se_dma, xfer->len);
-                               mas->tx_se_dma = 0;
-                       }
-                       if (xfer->rx_buf && mas->rx_se_dma) {
-                               geni_se_rx_dma_unprep(se, mas->rx_se_dma, xfer->len);
-                               mas->rx_se_dma = 0;
-                       }
                        spi_finalize_current_transfer(spi);
                        mas->cur_xfer = NULL;
                }
@@ -1059,6 +1055,7 @@ static int spi_geni_probe(struct platform_device *pdev)
        spi->bits_per_word_mask = SPI_BPW_RANGE_MASK(4, 32);
        spi->num_chipselect = 4;
        spi->max_speed_hz = 50000000;
+       spi->max_dma_len = 0xffff0; /* 24 bits for tx/rx dma length */
        spi->prepare_message = spi_geni_prepare_message;
        spi->transfer_one = spi_geni_transfer_one;
        spi->can_dma = geni_can_dma;
index 524eadb..2b4b3d2 100644 (file)
@@ -169,7 +169,7 @@ static int hisi_spi_debugfs_init(struct hisi_spi *hs)
        master = container_of(hs->dev, struct spi_controller, dev);
        snprintf(name, 32, "hisi_spi%d", master->bus_num);
        hs->debugfs = debugfs_create_dir(name, NULL);
-       if (!hs->debugfs)
+       if (IS_ERR(hs->debugfs))
                return -ENOMEM;
 
        hs->regset.regs = hisi_spi_regs;
index 34e5f81..528ae46 100644 (file)
@@ -281,6 +281,7 @@ static bool spi_imx_can_dma(struct spi_controller *controller, struct spi_device
 #define MX51_ECSPI_CONFIG_SCLKPOL(cs)  (1 << ((cs & 3) +  4))
 #define MX51_ECSPI_CONFIG_SBBCTRL(cs)  (1 << ((cs & 3) +  8))
 #define MX51_ECSPI_CONFIG_SSBPOL(cs)   (1 << ((cs & 3) + 12))
+#define MX51_ECSPI_CONFIG_DATACTL(cs)  (1 << ((cs & 3) + 16))
 #define MX51_ECSPI_CONFIG_SCLKCTL(cs)  (1 << ((cs & 3) + 20))
 
 #define MX51_ECSPI_INT         0x10
@@ -516,6 +517,13 @@ static void mx51_ecspi_disable(struct spi_imx_data *spi_imx)
        writel(ctrl, spi_imx->base + MX51_ECSPI_CTRL);
 }
 
+static int mx51_ecspi_channel(const struct spi_device *spi)
+{
+       if (!spi_get_csgpiod(spi, 0))
+               return spi_get_chipselect(spi, 0);
+       return spi->controller->unused_native_cs;
+}
+
 static int mx51_ecspi_prepare_message(struct spi_imx_data *spi_imx,
                                      struct spi_message *msg)
 {
@@ -526,6 +534,7 @@ static int mx51_ecspi_prepare_message(struct spi_imx_data *spi_imx,
        u32 testreg, delay;
        u32 cfg = readl(spi_imx->base + MX51_ECSPI_CONFIG);
        u32 current_cfg = cfg;
+       int channel = mx51_ecspi_channel(spi);
 
        /* set Master or Slave mode */
        if (spi_imx->slave_mode)
@@ -540,7 +549,7 @@ static int mx51_ecspi_prepare_message(struct spi_imx_data *spi_imx,
                ctrl |= MX51_ECSPI_CTRL_DRCTL(spi_imx->spi_drctl);
 
        /* set chip select to use */
-       ctrl |= MX51_ECSPI_CTRL_CS(spi_get_chipselect(spi, 0));
+       ctrl |= MX51_ECSPI_CTRL_CS(channel);
 
        /*
         * The ctrl register must be written first, with the EN bit set other
@@ -561,22 +570,27 @@ static int mx51_ecspi_prepare_message(struct spi_imx_data *spi_imx,
         * BURST_LENGTH + 1 bits are received
         */
        if (spi_imx->slave_mode && is_imx53_ecspi(spi_imx))
-               cfg &= ~MX51_ECSPI_CONFIG_SBBCTRL(spi_get_chipselect(spi, 0));
+               cfg &= ~MX51_ECSPI_CONFIG_SBBCTRL(channel);
        else
-               cfg |= MX51_ECSPI_CONFIG_SBBCTRL(spi_get_chipselect(spi, 0));
+               cfg |= MX51_ECSPI_CONFIG_SBBCTRL(channel);
 
        if (spi->mode & SPI_CPOL) {
-               cfg |= MX51_ECSPI_CONFIG_SCLKPOL(spi_get_chipselect(spi, 0));
-               cfg |= MX51_ECSPI_CONFIG_SCLKCTL(spi_get_chipselect(spi, 0));
+               cfg |= MX51_ECSPI_CONFIG_SCLKPOL(channel);
+               cfg |= MX51_ECSPI_CONFIG_SCLKCTL(channel);
        } else {
-               cfg &= ~MX51_ECSPI_CONFIG_SCLKPOL(spi_get_chipselect(spi, 0));
-               cfg &= ~MX51_ECSPI_CONFIG_SCLKCTL(spi_get_chipselect(spi, 0));
+               cfg &= ~MX51_ECSPI_CONFIG_SCLKPOL(channel);
+               cfg &= ~MX51_ECSPI_CONFIG_SCLKCTL(channel);
        }
 
+       if (spi->mode & SPI_MOSI_IDLE_LOW)
+               cfg |= MX51_ECSPI_CONFIG_DATACTL(channel);
+       else
+               cfg &= ~MX51_ECSPI_CONFIG_DATACTL(channel);
+
        if (spi->mode & SPI_CS_HIGH)
-               cfg |= MX51_ECSPI_CONFIG_SSBPOL(spi_get_chipselect(spi, 0));
+               cfg |= MX51_ECSPI_CONFIG_SSBPOL(channel);
        else
-               cfg &= ~MX51_ECSPI_CONFIG_SSBPOL(spi_get_chipselect(spi, 0));
+               cfg &= ~MX51_ECSPI_CONFIG_SSBPOL(channel);
 
        if (cfg == current_cfg)
                return 0;
@@ -621,14 +635,15 @@ static void mx51_configure_cpha(struct spi_imx_data *spi_imx,
        bool cpha = (spi->mode & SPI_CPHA);
        bool flip_cpha = (spi->mode & SPI_RX_CPHA_FLIP) && spi_imx->rx_only;
        u32 cfg = readl(spi_imx->base + MX51_ECSPI_CONFIG);
+       int channel = mx51_ecspi_channel(spi);
 
        /* Flip cpha logical value iff flip_cpha */
        cpha ^= flip_cpha;
 
        if (cpha)
-               cfg |= MX51_ECSPI_CONFIG_SCLKPHA(spi_get_chipselect(spi, 0));
+               cfg |= MX51_ECSPI_CONFIG_SCLKPHA(channel);
        else
-               cfg &= ~MX51_ECSPI_CONFIG_SCLKPHA(spi_get_chipselect(spi, 0));
+               cfg &= ~MX51_ECSPI_CONFIG_SCLKPHA(channel);
 
        writel(cfg, spi_imx->base + MX51_ECSPI_CONFIG);
 }
@@ -1737,20 +1752,21 @@ static int spi_imx_probe(struct platform_device *pdev)
        else
                controller->num_chipselect = 3;
 
-       spi_imx->controller->transfer_one = spi_imx_transfer_one;
-       spi_imx->controller->setup = spi_imx_setup;
-       spi_imx->controller->cleanup = spi_imx_cleanup;
-       spi_imx->controller->prepare_message = spi_imx_prepare_message;
-       spi_imx->controller->unprepare_message = spi_imx_unprepare_message;
-       spi_imx->controller->slave_abort = spi_imx_slave_abort;
-       spi_imx->controller->mode_bits = SPI_CPOL | SPI_CPHA | SPI_CS_HIGH | SPI_NO_CS;
+       controller->transfer_one = spi_imx_transfer_one;
+       controller->setup = spi_imx_setup;
+       controller->cleanup = spi_imx_cleanup;
+       controller->prepare_message = spi_imx_prepare_message;
+       controller->unprepare_message = spi_imx_unprepare_message;
+       controller->slave_abort = spi_imx_slave_abort;
+       controller->mode_bits = SPI_CPOL | SPI_CPHA | SPI_CS_HIGH | SPI_NO_CS |
+                               SPI_MOSI_IDLE_LOW;
 
        if (is_imx35_cspi(spi_imx) || is_imx51_ecspi(spi_imx) ||
            is_imx53_ecspi(spi_imx))
-               spi_imx->controller->mode_bits |= SPI_LOOP | SPI_READY;
+               controller->mode_bits |= SPI_LOOP | SPI_READY;
 
        if (is_imx51_ecspi(spi_imx) || is_imx53_ecspi(spi_imx))
-               spi_imx->controller->mode_bits |= SPI_RX_CPHA_FLIP;
+               controller->mode_bits |= SPI_RX_CPHA_FLIP;
 
        if (is_imx51_ecspi(spi_imx) &&
            device_property_read_u32(&pdev->dev, "cs-gpios", NULL))
@@ -1759,7 +1775,12 @@ static int spi_imx_probe(struct platform_device *pdev)
                 * setting the burst length to the word size. This is
                 * considerably faster than manually controlling the CS.
                 */
-               spi_imx->controller->mode_bits |= SPI_CS_WORD;
+               controller->mode_bits |= SPI_CS_WORD;
+
+       if (is_imx51_ecspi(spi_imx) || is_imx53_ecspi(spi_imx)) {
+               controller->max_native_cs = 4;
+               controller->flags |= SPI_MASTER_GPIO_SS;
+       }
 
        spi_imx->spi_drctl = spi_drctl;
 
index d7432e2..39272ad 100644 (file)
@@ -1144,7 +1144,8 @@ static int mtk_spi_probe(struct platform_device *pdev)
        if (mdata->dev_comp->must_tx)
                master->flags = SPI_MASTER_MUST_TX;
        if (mdata->dev_comp->ipm_design)
-               master->mode_bits |= SPI_LOOP;
+               master->mode_bits |= SPI_LOOP | SPI_RX_DUAL | SPI_TX_DUAL |
+                                    SPI_RX_QUAD | SPI_TX_QUAD;
 
        if (mdata->dev_comp->ipm_design) {
                mdata->dev = dev;
@@ -1269,7 +1270,7 @@ static int mtk_spi_probe(struct platform_device *pdev)
        return 0;
 }
 
-static int mtk_spi_remove(struct platform_device *pdev)
+static void mtk_spi_remove(struct platform_device *pdev)
 {
        struct spi_master *master = platform_get_drvdata(pdev);
        struct mtk_spi *mdata = spi_master_get_devdata(master);
@@ -1278,21 +1279,25 @@ static int mtk_spi_remove(struct platform_device *pdev)
        if (mdata->use_spimem && !completion_done(&mdata->spimem_done))
                complete(&mdata->spimem_done);
 
-       ret = pm_runtime_resume_and_get(&pdev->dev);
-       if (ret < 0)
-               return ret;
-
-       mtk_spi_reset(mdata);
+       ret = pm_runtime_get_sync(&pdev->dev);
+       if (ret < 0) {
+               dev_warn(&pdev->dev, "Failed to resume hardware (%pe)\n", ERR_PTR(ret));
+       } else {
+               /*
+                * If pm runtime resume failed, clks are disabled and
+                * unprepared. So don't access the hardware and skip clk
+                * unpreparing.
+                */
+               mtk_spi_reset(mdata);
 
-       if (mdata->dev_comp->no_need_unprepare) {
-               clk_unprepare(mdata->spi_clk);
-               clk_unprepare(mdata->spi_hclk);
+               if (mdata->dev_comp->no_need_unprepare) {
+                       clk_unprepare(mdata->spi_clk);
+                       clk_unprepare(mdata->spi_hclk);
+               }
        }
 
        pm_runtime_put_noidle(&pdev->dev);
        pm_runtime_disable(&pdev->dev);
-
-       return 0;
 }
 
 #ifdef CONFIG_PM_SLEEP
@@ -1311,7 +1316,7 @@ static int mtk_spi_suspend(struct device *dev)
                clk_disable_unprepare(mdata->spi_hclk);
        }
 
-       return ret;
+       return 0;
 }
 
 static int mtk_spi_resume(struct device *dev)
@@ -1412,7 +1417,7 @@ static struct platform_driver mtk_spi_driver = {
                .of_match_table = mtk_spi_of_match,
        },
        .probe = mtk_spi_probe,
-       .remove = mtk_spi_remove,
+       .remove_new = mtk_spi_remove,
 };
 
 module_platform_driver(mtk_spi_driver);
index 982407b..1af75ef 100644 (file)
@@ -2217,8 +2217,8 @@ static int pl022_probe(struct amba_device *adev, const struct amba_id *id)
        amba_set_drvdata(adev, pl022);
        status = devm_spi_register_master(&adev->dev, master);
        if (status != 0) {
-               dev_err(&adev->dev,
-                       "probe - problem registering spi master\n");
+               dev_err_probe(&adev->dev, status,
+                             "problem registering spi master\n");
                goto err_spi_register;
        }
        dev_dbg(dev, "probe succeeded\n");
index fab1553..a8a683d 100644 (file)
@@ -2,6 +2,8 @@
 // Copyright (c) 2017-2018, The Linux foundation. All rights reserved.
 
 #include <linux/clk.h>
+#include <linux/dmapool.h>
+#include <linux/dma-mapping.h>
 #include <linux/interconnect.h>
 #include <linux/interrupt.h>
 #include <linux/io.h>
@@ -62,6 +64,7 @@
 #define WR_FIFO_FULL           BIT(10)
 #define WR_FIFO_OVERRUN                BIT(11)
 #define TRANSACTION_DONE       BIT(16)
+#define DMA_CHAIN_DONE         BIT(31)
 #define QSPI_ERR_IRQS          (RESP_FIFO_UNDERRUN | HRESP_FROM_NOC_ERR | \
                                 WR_FIFO_OVERRUN)
 #define QSPI_ALL_IRQS          (QSPI_ERR_IRQS | RESP_FIFO_RDY | \
 #define RD_FIFO_RESET          0x0030
 #define RESET_FIFO             BIT(0)
 
+#define NEXT_DMA_DESC_ADDR     0x0040
+#define CURRENT_DMA_DESC_ADDR  0x0044
+#define CURRENT_MEM_ADDR       0x0048
+
 #define CUR_MEM_ADDR           0x0048
 #define HW_VERSION             0x004c
 #define RD_FIFO                        0x0050
 #define SAMPLING_CLK_CFG       0x0090
 #define SAMPLING_CLK_STATUS    0x0094
 
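+/* DMA buffer addresses handed to the controller must be aligned to this */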
+#define QSPI_ALIGN_REQ 32
 
 enum qspi_dir {
        QSPI_READ,
        QSPI_WRITE,
 };
 
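+/*
+ * Command descriptor as consumed by the controller's DMA engine: a 32-bit
+ * data buffer address, a 32-bit link to the next descriptor, and a control
+ * word with the direction, multi-IO mode, fragment flag and 16-bit length.
+ */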
+struct qspi_cmd_desc {
+       u32 data_address;
+       u32 next_descriptor;
+       u32 direction:1;
+       u32 multi_io_mode:3;
+       u32 reserved1:4;
+       u32 fragment:1;
+       u32 reserved2:7;
+       u32 length:16;
+};
+
 struct qspi_xfer {
        union {
                const void *tx_buf;
@@ -137,11 +156,23 @@ enum qspi_clocks {
        QSPI_NUM_CLKS
 };
 
+/*
+ * Number of entries in the sgt returned from the SPI framework that
+ * will be supported. Can be modified as required.
+ * In practice, given that max_dma_len is 64KB, the number of
+ * entries is not expected to exceed 1.
+ */
+#define QSPI_MAX_SG 5
+
 struct qcom_qspi {
        void __iomem *base;
        struct device *dev;
        struct clk_bulk_data *clks;
        struct qspi_xfer xfer;
+       struct dma_pool *dma_cmd_pool;
+       dma_addr_t dma_cmd_desc[QSPI_MAX_SG];
+       void *virt_cmd_desc[QSPI_MAX_SG];
+       unsigned int n_cmd_desc;
        struct icc_path *icc_path_cpu_to_qspi;
        unsigned long last_speed;
        /* Lock to protect data accessed by IRQs */
@@ -153,21 +184,22 @@ static u32 qspi_buswidth_to_iomode(struct qcom_qspi *ctrl,
 {
        switch (buswidth) {
        case 1:
-               return SDR_1BIT << MULTI_IO_MODE_SHFT;
+               return SDR_1BIT;
        case 2:
-               return SDR_2BIT << MULTI_IO_MODE_SHFT;
+               return SDR_2BIT;
        case 4:
-               return SDR_4BIT << MULTI_IO_MODE_SHFT;
+               return SDR_4BIT;
        default:
                dev_warn_once(ctrl->dev,
                                "Unexpected bus width: %u\n", buswidth);
-               return SDR_1BIT << MULTI_IO_MODE_SHFT;
+               return SDR_1BIT;
        }
 }
 
 static void qcom_qspi_pio_xfer_cfg(struct qcom_qspi *ctrl)
 {
        u32 pio_xfer_cfg;
+       u32 iomode;
        const struct qspi_xfer *xfer;
 
        xfer = &ctrl->xfer;
@@ -179,7 +211,8 @@ static void qcom_qspi_pio_xfer_cfg(struct qcom_qspi *ctrl)
        else
                pio_xfer_cfg |= TRANSFER_FRAGMENT;
        pio_xfer_cfg &= ~MULTI_IO_MODE_MSK;
-       pio_xfer_cfg |= qspi_buswidth_to_iomode(ctrl, xfer->buswidth);
+       iomode = qspi_buswidth_to_iomode(ctrl, xfer->buswidth);
+       pio_xfer_cfg |= iomode << MULTI_IO_MODE_SHFT;
 
        writel(pio_xfer_cfg, ctrl->base + PIO_XFER_CFG);
 }
@@ -217,12 +250,22 @@ static void qcom_qspi_pio_xfer(struct qcom_qspi *ctrl)
 static void qcom_qspi_handle_err(struct spi_master *master,
                                 struct spi_message *msg)
 {
+       u32 int_status;
        struct qcom_qspi *ctrl = spi_master_get_devdata(master);
        unsigned long flags;
+       int i;
 
        spin_lock_irqsave(&ctrl->lock, flags);
        writel(0, ctrl->base + MSTR_INT_EN);
+       int_status = readl(ctrl->base + MSTR_INT_STATUS);
+       writel(int_status, ctrl->base + MSTR_INT_STATUS);
        ctrl->xfer.rem_bytes = 0;
+
+       /* free cmd descriptors if they are around (DMA mode) */
+       for (i = 0; i < ctrl->n_cmd_desc; i++)
+               dma_pool_free(ctrl->dma_cmd_pool, ctrl->virt_cmd_desc[i],
+                                 ctrl->dma_cmd_desc[i]);
+       ctrl->n_cmd_desc = 0;
        spin_unlock_irqrestore(&ctrl->lock, flags);
 }
 
@@ -242,7 +285,7 @@ static int qcom_qspi_set_speed(struct qcom_qspi *ctrl, unsigned long speed_hz)
        }
 
        /*
-        * Set BW quota for CPU as driver supports FIFO mode only.
+        * Set BW quota for CPU.
         * We don't have explicit peak requirement so keep it equal to avg_bw.
         */
        avg_bw_cpu = Bps_to_icc(speed_hz);
@@ -258,6 +301,102 @@ static int qcom_qspi_set_speed(struct qcom_qspi *ctrl, unsigned long speed_hz)
        return 0;
 }
 
+static int qcom_qspi_alloc_desc(struct qcom_qspi *ctrl, dma_addr_t dma_ptr,
+                       uint32_t n_bytes)
+{
+       struct qspi_cmd_desc *virt_cmd_desc, *prev;
+       dma_addr_t dma_cmd_desc;
+
+       /* allocate a DMA command descriptor from the pool */
+       virt_cmd_desc = dma_pool_alloc(ctrl->dma_cmd_pool, GFP_KERNEL | __GFP_ZERO, &dma_cmd_desc);
+       if (!virt_cmd_desc)
+               return -ENOMEM;
+
+       ctrl->virt_cmd_desc[ctrl->n_cmd_desc] = virt_cmd_desc;
+       ctrl->dma_cmd_desc[ctrl->n_cmd_desc] = dma_cmd_desc;
+       ctrl->n_cmd_desc++;
+
+       /* setup cmd descriptor */
+       virt_cmd_desc->data_address = dma_ptr;
+       virt_cmd_desc->direction = ctrl->xfer.dir;
+       virt_cmd_desc->multi_io_mode = qspi_buswidth_to_iomode(ctrl, ctrl->xfer.buswidth);
+       virt_cmd_desc->fragment = !ctrl->xfer.is_last;
+       virt_cmd_desc->length = n_bytes;
+
+       /* update previous descriptor */
+       if (ctrl->n_cmd_desc >= 2) {
+               prev = (ctrl->virt_cmd_desc)[ctrl->n_cmd_desc - 2];
+               prev->next_descriptor = dma_cmd_desc;
+               prev->fragment = 1;
+       }
+
+       return 0;
+}
+
+static int qcom_qspi_setup_dma_desc(struct qcom_qspi *ctrl,
+                               struct spi_transfer *xfer)
+{
+       int ret;
+       struct sg_table *sgt;
+       dma_addr_t dma_ptr_sg;
+       unsigned int dma_len_sg;
+       int i;
+
+       if (ctrl->n_cmd_desc) {
+               dev_err(ctrl->dev, "Leftover DMA cmd descriptors, n_cmd_desc=%d\n", ctrl->n_cmd_desc);
+               return -EIO;
+       }
+
+       sgt = (ctrl->xfer.dir == QSPI_READ) ? &xfer->rx_sg : &xfer->tx_sg;
+       if (!sgt->nents || sgt->nents > QSPI_MAX_SG) {
+               dev_warn_once(ctrl->dev, "Cannot handle %d entries in scatter list\n", sgt->nents);
+               return -EAGAIN;
+       }
+
+       for (i = 0; i < sgt->nents; i++) {
+               dma_ptr_sg = sg_dma_address(sgt->sgl + i);
+               if (!IS_ALIGNED(dma_ptr_sg, QSPI_ALIGN_REQ)) {
+                       dev_warn_once(ctrl->dev, "dma_address not aligned to %d\n", QSPI_ALIGN_REQ);
+                       return -EAGAIN;
+               }
+       }
+
+       for (i = 0; i < sgt->nents; i++) {
+               dma_ptr_sg = sg_dma_address(sgt->sgl + i);
+               dma_len_sg = sg_dma_len(sgt->sgl + i);
+
+               ret = qcom_qspi_alloc_desc(ctrl, dma_ptr_sg, dma_len_sg);
+               if (ret)
+                       goto cleanup;
+       }
+       return 0;
+
+cleanup:
+       for (i = 0; i < ctrl->n_cmd_desc; i++)
+               dma_pool_free(ctrl->dma_cmd_pool, ctrl->virt_cmd_desc[i],
+                                 ctrl->dma_cmd_desc[i]);
+       ctrl->n_cmd_desc = 0;
+       return ret;
+}
+
+static void qcom_qspi_dma_xfer(struct qcom_qspi *ctrl)
+{
+       /* Setup new interrupts */
+       writel(DMA_CHAIN_DONE, ctrl->base + MSTR_INT_EN);
+
+       /* kick off transfer */
+       writel((u32)((ctrl->dma_cmd_desc)[0]), ctrl->base + NEXT_DMA_DESC_ADDR);
+}
+
+/* Switch to DMA if transfer length exceeds this */
+#define QSPI_MAX_BYTES_FIFO 64
+
+static bool qcom_qspi_can_dma(struct spi_controller *ctlr,
+                        struct spi_device *slv, struct spi_transfer *xfer)
+{
+       return xfer->len > QSPI_MAX_BYTES_FIFO;
+}
+
 static int qcom_qspi_transfer_one(struct spi_master *master,
                                  struct spi_device *slv,
                                  struct spi_transfer *xfer)
@@ -266,6 +405,7 @@ static int qcom_qspi_transfer_one(struct spi_master *master,
        int ret;
        unsigned long speed_hz;
        unsigned long flags;
+       u32 mstr_cfg;
 
        speed_hz = slv->max_speed_hz;
        if (xfer->speed_hz)
@@ -276,6 +416,7 @@ static int qcom_qspi_transfer_one(struct spi_master *master,
                return ret;
 
        spin_lock_irqsave(&ctrl->lock, flags);
+       mstr_cfg = readl(ctrl->base + MSTR_CONFIG);
 
        /* We are half duplex, so either rx or tx will be set */
        if (xfer->rx_buf) {
@@ -290,10 +431,36 @@ static int qcom_qspi_transfer_one(struct spi_master *master,
        ctrl->xfer.is_last = list_is_last(&xfer->transfer_list,
                                          &master->cur_msg->transfers);
        ctrl->xfer.rem_bytes = xfer->len;
+
+       if (xfer->rx_sg.nents || xfer->tx_sg.nents) {
+               /* do DMA transfer */
+               if (!(mstr_cfg & DMA_ENABLE)) {
+                       mstr_cfg |= DMA_ENABLE;
+                       writel(mstr_cfg, ctrl->base + MSTR_CONFIG);
+               }
+
+               ret = qcom_qspi_setup_dma_desc(ctrl, xfer);
+               if (ret != -EAGAIN) {
+                       if (!ret)
+                               qcom_qspi_dma_xfer(ctrl);
+                       goto exit;
+               }
+               dev_warn_once(ctrl->dev, "DMA failure, falling back to PIO\n");
+               ret = 0; /* We'll retry w/ PIO */
+       }
+
+       if (mstr_cfg & DMA_ENABLE) {
+               mstr_cfg &= ~DMA_ENABLE;
+               writel(mstr_cfg, ctrl->base + MSTR_CONFIG);
+       }
        qcom_qspi_pio_xfer(ctrl);
 
+exit:
        spin_unlock_irqrestore(&ctrl->lock, flags);
 
+       if (ret)
+               return ret;
+
        /* We'll call spi_finalize_current_transfer() when done */
        return 1;
 }
@@ -328,6 +495,16 @@ static int qcom_qspi_prepare_message(struct spi_master *master,
        return 0;
 }
 
+static int qcom_qspi_alloc_dma(struct qcom_qspi *ctrl)
+{
+       ctrl->dma_cmd_pool = dmam_pool_create("qspi cmd desc pool",
+               ctrl->dev, sizeof(struct qspi_cmd_desc), 0, 0);
+       if (!ctrl->dma_cmd_pool)
+               return -ENOMEM;
+
+       return 0;
+}
+
 static irqreturn_t pio_read(struct qcom_qspi *ctrl)
 {
        u32 rd_fifo_status;
@@ -426,6 +603,7 @@ static irqreturn_t qcom_qspi_irq(int irq, void *dev_id)
        int_status = readl(ctrl->base + MSTR_INT_STATUS);
        writel(int_status, ctrl->base + MSTR_INT_STATUS);
 
+       /* PIO mode handling */
        if (ctrl->xfer.dir == QSPI_WRITE) {
                if (int_status & WR_FIFO_EMPTY)
                        ret = pio_write(ctrl);
@@ -449,6 +627,22 @@ static irqreturn_t qcom_qspi_irq(int irq, void *dev_id)
                spi_finalize_current_transfer(dev_get_drvdata(ctrl->dev));
        }
 
+       /* DMA mode handling */
+       if (int_status & DMA_CHAIN_DONE) {
+               int i;
+
+               writel(0, ctrl->base + MSTR_INT_EN);
+               ctrl->xfer.rem_bytes = 0;
+
+               for (i = 0; i < ctrl->n_cmd_desc; i++)
+                       dma_pool_free(ctrl->dma_cmd_pool, ctrl->virt_cmd_desc[i],
+                                         ctrl->dma_cmd_desc[i]);
+               ctrl->n_cmd_desc = 0;
+
+               ret = IRQ_HANDLED;
+               spi_finalize_current_transfer(dev_get_drvdata(ctrl->dev));
+       }
+
        spin_unlock(&ctrl->lock);
        return ret;
 }
@@ -517,7 +711,13 @@ static int qcom_qspi_probe(struct platform_device *pdev)
                return ret;
        }
 
+       ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
+       if (ret)
+               return dev_err_probe(dev, ret, "could not set DMA mask\n");
+
        master->max_speed_hz = 300000000;
+       master->max_dma_len = 65536; /* as per HPG */
+       master->dma_alignment = QSPI_ALIGN_REQ;
        master->num_chipselect = QSPI_NUM_CS;
        master->bus_num = -1;
        master->dev.of_node = pdev->dev.of_node;
@@ -528,6 +728,8 @@ static int qcom_qspi_probe(struct platform_device *pdev)
        master->prepare_message = qcom_qspi_prepare_message;
        master->transfer_one = qcom_qspi_transfer_one;
        master->handle_err = qcom_qspi_handle_err;
+       if (of_property_read_bool(pdev->dev.of_node, "iommus"))
+               master->can_dma = qcom_qspi_can_dma;
        master->auto_runtime_pm = true;
 
        ret = devm_pm_opp_set_clkname(&pdev->dev, "core");
@@ -540,6 +742,10 @@ static int qcom_qspi_probe(struct platform_device *pdev)
                return ret;
        }
 
+       ret = qcom_qspi_alloc_dma(ctrl);
+       if (ret)
+               return ret;
+
        pm_runtime_use_autosuspend(dev);
        pm_runtime_set_autosuspend_delay(dev, 250);
        pm_runtime_enable(dev);
diff --git a/drivers/spi/spi-rzv2m-csi.c b/drivers/spi/spi-rzv2m-csi.c
new file mode 100644 (file)
index 0000000..14ad65d
--- /dev/null
@@ -0,0 +1,667 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Renesas RZ/V2M Clocked Serial Interface (CSI) driver
+ *
+ * Copyright (C) 2023 Renesas Electronics Corporation
+ */
+
+#include <linux/clk.h>
+#include <linux/count_zeros.h>
+#include <linux/interrupt.h>
+#include <linux/iopoll.h>
+#include <linux/platform_device.h>
+#include <linux/reset.h>
+#include <linux/spi/spi.h>
+
+/* Registers */
+#define CSI_MODE               0x00    /* CSI mode control */
+#define CSI_CLKSEL             0x04    /* CSI clock select */
+#define CSI_CNT                        0x08    /* CSI control */
+#define CSI_INT                        0x0C    /* CSI interrupt status */
+#define CSI_IFIFOL             0x10    /* CSI receive FIFO level display */
+#define CSI_OFIFOL             0x14    /* CSI transmit FIFO level display */
+#define CSI_IFIFO              0x18    /* CSI receive window */
+#define CSI_OFIFO              0x1C    /* CSI transmit window */
+#define CSI_FIFOTRG            0x20    /* CSI FIFO trigger level */
+
+/* CSI_MODE */
+#define CSI_MODE_CSIE          BIT(7)
+#define CSI_MODE_TRMD          BIT(6)
+#define CSI_MODE_CCL           BIT(5)
+#define CSI_MODE_DIR           BIT(4)
+#define CSI_MODE_CSOT          BIT(0)
+
+#define CSI_MODE_SETUP         0x00000040
+
+/* CSI_CLKSEL */
+#define CSI_CLKSEL_CKP         BIT(17)
+#define CSI_CLKSEL_DAP         BIT(16)
+#define CSI_CLKSEL_SLAVE       BIT(15)
+#define CSI_CLKSEL_CKS         GENMASK(14, 1)
+
+/* CSI_CNT */
+#define CSI_CNT_CSIRST         BIT(28)
+#define CSI_CNT_R_TRGEN                BIT(19)
+#define CSI_CNT_UNDER_E                BIT(13)
+#define CSI_CNT_OVERF_E                BIT(12)
+#define CSI_CNT_TREND_E                BIT(9)
+#define CSI_CNT_CSIEND_E       BIT(8)
+#define CSI_CNT_T_TRGR_E       BIT(4)
+#define CSI_CNT_R_TRGR_E       BIT(0)
+
+/* CSI_INT */
+#define CSI_INT_UNDER          BIT(13)
+#define CSI_INT_OVERF          BIT(12)
+#define CSI_INT_TREND          BIT(9)
+#define CSI_INT_CSIEND         BIT(8)
+#define CSI_INT_T_TRGR         BIT(4)
+#define CSI_INT_R_TRGR         BIT(0)
+
+/* CSI_FIFOTRG */
+#define CSI_FIFOTRG_R_TRG       GENMASK(2, 0)
+
+#define CSI_FIFO_SIZE_BYTES    32
+#define CSI_FIFO_HALF_SIZE     16
+#define CSI_EN_DIS_TIMEOUT_US  100
+#define CSI_CKS_MAX            0x3FFF
+
+#define UNDERRUN_ERROR         BIT(0)
+#define OVERFLOW_ERROR         BIT(1)
+#define TX_TIMEOUT_ERROR       BIT(2)
+#define RX_TIMEOUT_ERROR       BIT(3)
+
+#define CSI_MAX_SPI_SCKO       8000000
+
+struct rzv2m_csi_priv {
+       void __iomem *base;
+       struct clk *csiclk;
+       struct clk *pclk;
+       struct device *dev;
+       struct spi_controller *controller;
+       const u8 *txbuf;
+       u8 *rxbuf;
+       int buffer_len;
+       int bytes_sent;
+       int bytes_received;
+       int bytes_to_transfer;
+       int words_to_transfer;
+       unsigned char bytes_per_word;
+       wait_queue_head_t wait;
+       u8 errors;
+       u32 status;
+};
+
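+/*
+ * Lookup tables indexed by (number of words - 1): x_trg_words maps a word
+ * count to the largest supported FIFO trigger level that does not exceed
+ * it, and x_trg maps the same word count to that level's register encoding.
+ */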
+static const unsigned char x_trg[] = {
+       0, 1, 1, 2, 2, 2, 2, 3,
+       3, 3, 3, 3, 3, 3, 3, 4,
+       4, 4, 4, 4, 4, 4, 4, 4,
+       4, 4, 4, 4, 4, 4, 4, 5
+};
+
+static const unsigned char x_trg_words[] = {
+       1,  2,  2,  4,  4,  4,  4,  8,
+       8,  8,  8,  8,  8,  8,  8,  16,
+       16, 16, 16, 16, 16, 16, 16, 16,
+       16, 16, 16, 16, 16, 16, 16, 32
+};
+
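+/*
+ * Read-modify-write helper: shift @value into the register field
+ * selected by @bit_mask and update only those bits.
+ */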
+static void rzv2m_csi_reg_write_bit(const struct rzv2m_csi_priv *csi,
+                                   int reg_offs, int bit_mask, u32 value)
+{
+       int nr_zeros;
+       u32 tmp;
+
+       nr_zeros = count_trailing_zeros(bit_mask);
+       value <<= nr_zeros;
+
+       tmp = (readl(csi->base + reg_offs) & ~bit_mask) | value;
+       writel(tmp, csi->base + reg_offs);
+}
+
+static int rzv2m_csi_sw_reset(struct rzv2m_csi_priv *csi, int assert)
+{
+       u32 reg;
+
+       rzv2m_csi_reg_write_bit(csi, CSI_CNT, CSI_CNT_CSIRST, assert);
+
+       if (assert) {
+               return readl_poll_timeout(csi->base + CSI_MODE, reg,
+                                         !(reg & CSI_MODE_CSOT), 0,
+                                         CSI_EN_DIS_TIMEOUT_US);
+       }
+
+       return 0;
+}
+
+static int rzv2m_csi_start_stop_operation(const struct rzv2m_csi_priv *csi,
+                                         int enable, bool wait)
+{
+       u32 reg;
+
+       rzv2m_csi_reg_write_bit(csi, CSI_MODE, CSI_MODE_CSIE, enable);
+
+       if (!enable && wait)
+               return readl_poll_timeout(csi->base + CSI_MODE, reg,
+                                         !(reg & CSI_MODE_CSOT), 0,
+                                         CSI_EN_DIS_TIMEOUT_US);
+
+       return 0;
+}
+
+static int rzv2m_csi_fill_txfifo(struct rzv2m_csi_priv *csi)
+{
+       int i;
+
+       if (readl(csi->base + CSI_OFIFOL))
+               return -EIO;
+
+       if (csi->bytes_per_word == 2) {
+               u16 *buf = (u16 *)csi->txbuf;
+
+               for (i = 0; i < csi->words_to_transfer; i++)
+                       writel(buf[i], csi->base + CSI_OFIFO);
+       } else {
+               u8 *buf = (u8 *)csi->txbuf;
+
+               for (i = 0; i < csi->words_to_transfer; i++)
+                       writel(buf[i], csi->base + CSI_OFIFO);
+       }
+
+       csi->txbuf += csi->bytes_to_transfer;
+       csi->bytes_sent += csi->bytes_to_transfer;
+
+       return 0;
+}
+
+static int rzv2m_csi_read_rxfifo(struct rzv2m_csi_priv *csi)
+{
+       int i;
+
+       if (readl(csi->base + CSI_IFIFOL) != csi->bytes_to_transfer)
+               return -EIO;
+
+       if (csi->bytes_per_word == 2) {
+               u16 *buf = (u16 *)csi->rxbuf;
+
+               for (i = 0; i < csi->words_to_transfer; i++)
+                       buf[i] = (u16)readl(csi->base + CSI_IFIFO);
+       } else {
+               u8 *buf = (u8 *)csi->rxbuf;
+
+               for (i = 0; i < csi->words_to_transfer; i++)
+                       buf[i] = (u8)readl(csi->base + CSI_IFIFO);
+       }
+
+       csi->rxbuf += csi->bytes_to_transfer;
+       csi->bytes_received += csi->bytes_to_transfer;
+
+       return 0;
+}
+
+static inline void rzv2m_csi_calc_current_transfer(struct rzv2m_csi_priv *csi)
+{
+       int bytes_transferred = max_t(int, csi->bytes_received, csi->bytes_sent);
+       int bytes_remaining = csi->buffer_len - bytes_transferred;
+       int to_transfer;
+
+       if (csi->txbuf)
+               /*
+                * Leaving a little bit of headroom in the FIFOs makes it very
+                * when the IP transmits and receives at the same time).
+                * when IP transmits and receives at the same time).
+                */
+               to_transfer = min_t(int, CSI_FIFO_HALF_SIZE, bytes_remaining);
+       else
+               to_transfer = min_t(int, CSI_FIFO_SIZE_BYTES, bytes_remaining);
+
+       if (csi->bytes_per_word == 2)
+               to_transfer >>= 1;
+
+       /*
+        * We can only choose a trigger level from a predefined set of values.
+        * Pick the greatest supported level that does not exceed the number
+        * of words we still need to transfer. This may result in multiple
+        * smaller transfers.
+        */
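+       /*
+        * e.g. an RX-only request of 20 bytes with 8-bit words transfers
+        * 16 bytes in this iteration and leaves 4 for the next one.
+        */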
+       csi->words_to_transfer = x_trg_words[to_transfer - 1];
+
+       if (csi->bytes_per_word == 2)
+               csi->bytes_to_transfer = csi->words_to_transfer << 1;
+       else
+               csi->bytes_to_transfer = csi->words_to_transfer;
+}
+
+static inline void rzv2m_csi_set_rx_fifo_trigger_level(struct rzv2m_csi_priv *csi)
+{
+       rzv2m_csi_reg_write_bit(csi, CSI_FIFOTRG, CSI_FIFOTRG_R_TRG,
+                               x_trg[csi->words_to_transfer - 1]);
+}
+
+static inline void rzv2m_csi_enable_rx_trigger(struct rzv2m_csi_priv *csi,
+                                              bool enable)
+{
+       rzv2m_csi_reg_write_bit(csi, CSI_CNT, CSI_CNT_R_TRGEN, enable);
+}
+
+static void rzv2m_csi_disable_irqs(const struct rzv2m_csi_priv *csi,
+                                  u32 enable_bits)
+{
+       u32 cnt = readl(csi->base + CSI_CNT);
+
+       writel(cnt & ~enable_bits, csi->base + CSI_CNT);
+}
+
+static void rzv2m_csi_disable_all_irqs(struct rzv2m_csi_priv *csi)
+{
+       rzv2m_csi_disable_irqs(csi, CSI_CNT_R_TRGR_E | CSI_CNT_T_TRGR_E |
+                              CSI_CNT_CSIEND_E | CSI_CNT_TREND_E |
+                              CSI_CNT_OVERF_E | CSI_CNT_UNDER_E);
+}
+
+static inline void rzv2m_csi_clear_irqs(struct rzv2m_csi_priv *csi, u32 irqs)
+{
+       writel(irqs, csi->base + CSI_INT);
+}
+
+static void rzv2m_csi_clear_all_irqs(struct rzv2m_csi_priv *csi)
+{
+       rzv2m_csi_clear_irqs(csi, CSI_INT_UNDER | CSI_INT_OVERF |
+                            CSI_INT_TREND | CSI_INT_CSIEND |  CSI_INT_T_TRGR |
+                            CSI_INT_R_TRGR);
+}
+
+static void rzv2m_csi_enable_irqs(struct rzv2m_csi_priv *csi, u32 enable_bits)
+{
+       u32 cnt = readl(csi->base + CSI_CNT);
+
+       writel(cnt | enable_bits, csi->base + CSI_CNT);
+}
+
+static int rzv2m_csi_wait_for_interrupt(struct rzv2m_csi_priv *csi,
+                                       u32 wait_mask, u32 enable_bits)
+{
+       int ret;
+
+       rzv2m_csi_enable_irqs(csi, enable_bits);
+
+       ret = wait_event_timeout(csi->wait,
+                                ((csi->status & wait_mask) == wait_mask) ||
+                                csi->errors, HZ);
+
+       rzv2m_csi_disable_irqs(csi, enable_bits);
+
+       if (csi->errors)
+               return -EIO;
+
+       if (!ret)
+               return -ETIMEDOUT;
+
+       return 0;
+}
+
+static int rzv2m_csi_wait_for_tx_empty(struct rzv2m_csi_priv *csi)
+{
+       int ret;
+
+       if (readl(csi->base + CSI_OFIFOL) == 0)
+               return 0;
+
+       ret = rzv2m_csi_wait_for_interrupt(csi, CSI_INT_TREND, CSI_CNT_TREND_E);
+
+       if (ret == -ETIMEDOUT)
+               csi->errors |= TX_TIMEOUT_ERROR;
+
+       return ret;
+}
+
+static inline int rzv2m_csi_wait_for_rx_ready(struct rzv2m_csi_priv *csi)
+{
+       int ret;
+
+       if (readl(csi->base + CSI_IFIFOL) == csi->bytes_to_transfer)
+               return 0;
+
+       ret = rzv2m_csi_wait_for_interrupt(csi, CSI_INT_R_TRGR,
+                                          CSI_CNT_R_TRGR_E);
+
+       if (ret == -ETIMEDOUT)
+               csi->errors |= RX_TIMEOUT_ERROR;
+
+       return ret;
+}
+
+static irqreturn_t rzv2m_csi_irq_handler(int irq, void *data)
+{
+       struct rzv2m_csi_priv *csi = (struct rzv2m_csi_priv *)data;
+
+       csi->status = readl(csi->base + CSI_INT);
+       rzv2m_csi_disable_irqs(csi, csi->status);
+
+       if (csi->status & CSI_INT_OVERF)
+               csi->errors |= OVERFLOW_ERROR;
+       if (csi->status & CSI_INT_UNDER)
+               csi->errors |= UNDERRUN_ERROR;
+
+       wake_up(&csi->wait);
+
+       return IRQ_HANDLED;
+}
+
+static void rzv2m_csi_setup_clock(struct rzv2m_csi_priv *csi, u32 spi_hz)
+{
+       unsigned long csiclk_rate = clk_get_rate(csi->csiclk);
+       unsigned long pclk_rate = clk_get_rate(csi->pclk);
+       unsigned long csiclk_rate_limit = pclk_rate >> 1;
+       u32 cks;
+
+       /*
+        * There is a restriction on the frequency of CSICLK: it has to be <=
+        * PCLK / 2.
+        */
+       if (csiclk_rate > csiclk_rate_limit) {
+               clk_set_rate(csi->csiclk, csiclk_rate >> 1);
+               csiclk_rate = clk_get_rate(csi->csiclk);
+       } else if ((csiclk_rate << 1) <= csiclk_rate_limit) {
+               clk_set_rate(csi->csiclk, csiclk_rate << 1);
+               csiclk_rate = clk_get_rate(csi->csiclk);
+       }
+
+       spi_hz = spi_hz > CSI_MAX_SPI_SCKO ? CSI_MAX_SPI_SCKO : spi_hz;
+
+       cks = DIV_ROUND_UP(csiclk_rate, spi_hz << 1);
+       if (cks > CSI_CKS_MAX)
+               cks = CSI_CKS_MAX;
+
+       dev_dbg(csi->dev, "SPI clk rate is %ldHz\n", csiclk_rate / (cks << 1));
+
+       rzv2m_csi_reg_write_bit(csi, CSI_CLKSEL, CSI_CLKSEL_CKS, cks);
+}
+
+static void rzv2m_csi_setup_operating_mode(struct rzv2m_csi_priv *csi,
+                                          struct spi_transfer *t)
+{
+       if (t->rx_buf && !t->tx_buf)
+               /* Reception-only mode */
+               rzv2m_csi_reg_write_bit(csi, CSI_MODE, CSI_MODE_TRMD, 0);
+       else
+               /* Send and receive mode */
+               rzv2m_csi_reg_write_bit(csi, CSI_MODE, CSI_MODE_TRMD, 1);
+
+       csi->bytes_per_word = t->bits_per_word / 8;
+       rzv2m_csi_reg_write_bit(csi, CSI_MODE, CSI_MODE_CCL,
+                               csi->bytes_per_word == 2);
+}
+
+static int rzv2m_csi_setup(struct spi_device *spi)
+{
+       struct rzv2m_csi_priv *csi = spi_controller_get_devdata(spi->controller);
+       int ret;
+
+       rzv2m_csi_sw_reset(csi, 0);
+
+       writel(CSI_MODE_SETUP, csi->base + CSI_MODE);
+
+       /* Setup clock polarity and phase timing */
+       rzv2m_csi_reg_write_bit(csi, CSI_CLKSEL, CSI_CLKSEL_CKP,
+                               !(spi->mode & SPI_CPOL));
+       rzv2m_csi_reg_write_bit(csi, CSI_CLKSEL, CSI_CLKSEL_DAP,
+                               !(spi->mode & SPI_CPHA));
+
+       /* Setup serial data order */
+       rzv2m_csi_reg_write_bit(csi, CSI_MODE, CSI_MODE_DIR,
+                               !!(spi->mode & SPI_LSB_FIRST));
+
+       /* Set the operation mode as master */
+       rzv2m_csi_reg_write_bit(csi, CSI_CLKSEL, CSI_CLKSEL_SLAVE, 0);
+
+       /* Give the IP a SW reset */
+       ret = rzv2m_csi_sw_reset(csi, 1);
+       if (ret)
+               return ret;
+       rzv2m_csi_sw_reset(csi, 0);
+
+       /*
+        * We need to enable the communication so that the clock will settle
+        * for the right polarity before enabling the CS.
+        */
+       rzv2m_csi_start_stop_operation(csi, 1, false);
+       udelay(10);
+       rzv2m_csi_start_stop_operation(csi, 0, false);
+
+       return 0;
+}
+
+static int rzv2m_csi_pio_transfer(struct rzv2m_csi_priv *csi)
+{
+       bool tx_completed = csi->txbuf ? false : true;
+       bool rx_completed = csi->rxbuf ? false : true;
+       int ret = 0;
+
+       /* Make sure the TX FIFO is empty */
+       writel(0, csi->base + CSI_OFIFOL);
+
+       csi->bytes_sent = 0;
+       csi->bytes_received = 0;
+       csi->errors = 0;
+
+       rzv2m_csi_disable_all_irqs(csi);
+       rzv2m_csi_clear_all_irqs(csi);
+       rzv2m_csi_enable_rx_trigger(csi, true);
+
+       while (!tx_completed || !rx_completed) {
+               /*
+                * Decide how many words we are going to transfer during
+                * this cycle (for both TX and RX), then set the RX FIFO trigger
+                * level accordingly. No need to set a trigger level for the
+                * TX FIFO, as this IP comes with an interrupt that fires when
+                * the TX FIFO is empty.
+                */
+               rzv2m_csi_calc_current_transfer(csi);
+               rzv2m_csi_set_rx_fifo_trigger_level(csi);
+
+               rzv2m_csi_enable_irqs(csi, CSI_INT_OVERF | CSI_INT_UNDER);
+
+               /* Make sure the RX FIFO is empty */
+               writel(0, csi->base + CSI_IFIFOL);
+
+               writel(readl(csi->base + CSI_INT), csi->base + CSI_INT);
+               csi->status = 0;
+
+               rzv2m_csi_start_stop_operation(csi, 1, false);
+
+               /* TX */
+               if (csi->txbuf) {
+                       ret = rzv2m_csi_fill_txfifo(csi);
+                       if (ret)
+                               break;
+
+                       ret = rzv2m_csi_wait_for_tx_empty(csi);
+                       if (ret)
+                               break;
+
+                       if (csi->bytes_sent == csi->buffer_len)
+                               tx_completed = true;
+               }
+
+               /*
+                * Make sure the RX FIFO contains the desired number of words.
+                * We then either flush its contents or copy them into
+                * csi->rxbuf.
+                */
+               ret = rzv2m_csi_wait_for_rx_ready(csi);
+               if (ret)
+                       break;
+
+               /* RX */
+               if (csi->rxbuf) {
+                       rzv2m_csi_start_stop_operation(csi, 0, false);
+
+                       ret = rzv2m_csi_read_rxfifo(csi);
+                       if (ret)
+                               break;
+
+                       if (csi->bytes_received == csi->buffer_len)
+                               rx_completed = true;
+               }
+
+               ret = rzv2m_csi_start_stop_operation(csi, 0, true);
+               if (ret)
+                       goto pio_quit;
+
+               if (csi->errors) {
+                       ret = -EIO;
+                       goto pio_quit;
+               }
+       }
+
+       rzv2m_csi_start_stop_operation(csi, 0, true);
+
+pio_quit:
+       rzv2m_csi_disable_all_irqs(csi);
+       rzv2m_csi_enable_rx_trigger(csi, false);
+       rzv2m_csi_clear_all_irqs(csi);
+
+       return ret;
+}
+
+static int rzv2m_csi_transfer_one(struct spi_controller *controller,
+                                 struct spi_device *spi,
+                                 struct spi_transfer *transfer)
+{
+       struct rzv2m_csi_priv *csi = spi_controller_get_devdata(controller);
+       struct device *dev = csi->dev;
+       int ret;
+
+       csi->txbuf = transfer->tx_buf;
+       csi->rxbuf = transfer->rx_buf;
+       csi->buffer_len = transfer->len;
+
+       rzv2m_csi_setup_operating_mode(csi, transfer);
+
+       rzv2m_csi_setup_clock(csi, transfer->speed_hz);
+
+       ret = rzv2m_csi_pio_transfer(csi);
+       if (ret) {
+               if (csi->errors & UNDERRUN_ERROR)
+                       dev_err(dev, "Underrun error\n");
+               if (csi->errors & OVERFLOW_ERROR)
+                       dev_err(dev, "Overflow error\n");
+               if (csi->errors & TX_TIMEOUT_ERROR)
+                       dev_err(dev, "TX timeout error\n");
+               if (csi->errors & RX_TIMEOUT_ERROR)
+                       dev_err(dev, "RX timeout error\n");
+       }
+
+       return ret;
+}
+
+static int rzv2m_csi_probe(struct platform_device *pdev)
+{
+       struct spi_controller *controller;
+       struct device *dev = &pdev->dev;
+       struct rzv2m_csi_priv *csi;
+       struct reset_control *rstc;
+       int irq;
+       int ret;
+
+       controller = devm_spi_alloc_master(dev, sizeof(*csi));
+       if (!controller)
+               return -ENOMEM;
+
+       csi = spi_controller_get_devdata(controller);
+       platform_set_drvdata(pdev, csi);
+
+       csi->dev = dev;
+       csi->controller = controller;
+
+       csi->base = devm_platform_ioremap_resource(pdev, 0);
+       if (IS_ERR(csi->base))
+               return PTR_ERR(csi->base);
+
+       irq = platform_get_irq(pdev, 0);
+       if (irq < 0)
+               return irq;
+
+       csi->csiclk = devm_clk_get(dev, "csiclk");
+       if (IS_ERR(csi->csiclk))
+               return dev_err_probe(dev, PTR_ERR(csi->csiclk),
+                                    "could not get csiclk\n");
+
+       csi->pclk = devm_clk_get(dev, "pclk");
+       if (IS_ERR(csi->pclk))
+               return dev_err_probe(dev, PTR_ERR(csi->pclk),
+                                    "could not get pclk\n");
+
+       rstc = devm_reset_control_get_shared(dev, NULL);
+       if (IS_ERR(rstc))
+               return dev_err_probe(dev, PTR_ERR(rstc), "Missing reset ctrl\n");
+
+       init_waitqueue_head(&csi->wait);
+
+       controller->mode_bits = SPI_CPOL | SPI_CPHA | SPI_LSB_FIRST;
+       controller->dev.of_node = pdev->dev.of_node;
+       controller->bits_per_word_mask = SPI_BPW_MASK(16) | SPI_BPW_MASK(8);
+       controller->setup = rzv2m_csi_setup;
+       controller->transfer_one = rzv2m_csi_transfer_one;
+       controller->use_gpio_descriptors = true;
+
+       ret = devm_request_irq(dev, irq, rzv2m_csi_irq_handler, 0,
+                              dev_name(dev), csi);
+       if (ret)
+               return dev_err_probe(dev, ret, "cannot request IRQ\n");
+
+       /*
+        * The reset also affects other HW that is not under the control
+        * of Linux. Therefore, all we can do is make sure the reset is
+        * deasserted.
+        */
+       reset_control_deassert(rstc);
+
+       /* Make sure the IP is in SW reset state */
+       ret = rzv2m_csi_sw_reset(csi, 1);
+       if (ret)
+               return ret;
+
+       ret = clk_prepare_enable(csi->csiclk);
+       if (ret)
+               return dev_err_probe(dev, ret, "could not enable csiclk\n");
+
+       ret = spi_register_controller(controller);
+       if (ret) {
+               clk_disable_unprepare(csi->csiclk);
+               return dev_err_probe(dev, ret, "register controller failed\n");
+       }
+
+       return 0;
+}
+
+static int rzv2m_csi_remove(struct platform_device *pdev)
+{
+       struct rzv2m_csi_priv *csi = platform_get_drvdata(pdev);
+
+       spi_unregister_controller(csi->controller);
+       rzv2m_csi_sw_reset(csi, 1);
+       clk_disable_unprepare(csi->csiclk);
+
+       return 0;
+}
+
+static const struct of_device_id rzv2m_csi_match[] = {
+       { .compatible = "renesas,rzv2m-csi" },
+       { /* sentinel */ }
+};
+MODULE_DEVICE_TABLE(of, rzv2m_csi_match);
+
+static struct platform_driver rzv2m_csi_drv = {
+       .probe = rzv2m_csi_probe,
+       .remove = rzv2m_csi_remove,
+       .driver = {
+               .name = "rzv2m_csi",
+               .of_match_table = rzv2m_csi_match,
+       },
+};
+module_platform_driver(rzv2m_csi_drv);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Fabrizio Castro <castro.fabrizio.jz@renesas.com>");
+MODULE_DESCRIPTION("Clocked Serial Interface Driver");
index 7ac17f0..fd55697 100644 (file)
@@ -19,7 +19,6 @@
 #include <linux/platform_data/spi-s3c64xx.h>
 
 #define MAX_SPI_PORTS          12
-#define S3C64XX_SPI_QUIRK_POLL         (1 << 0)
 #define S3C64XX_SPI_QUIRK_CS_AUTO      (1 << 1)
 #define AUTOSUSPEND_TIMEOUT    2000
 
@@ -59,6 +58,8 @@
 #define S3C64XX_SPI_MODE_BUS_TSZ_HALFWORD      (1<<17)
 #define S3C64XX_SPI_MODE_BUS_TSZ_WORD          (2<<17)
 #define S3C64XX_SPI_MODE_BUS_TSZ_MASK          (3<<17)
+#define S3C64XX_SPI_MODE_RX_RDY_LVL            GENMASK(16, 11)
+#define S3C64XX_SPI_MODE_RX_RDY_LVL_SHIFT      11
 #define S3C64XX_SPI_MODE_SELF_LOOPBACK         (1<<3)
 #define S3C64XX_SPI_MODE_RXDMA_ON              (1<<2)
 #define S3C64XX_SPI_MODE_TXDMA_ON              (1<<1)
 
 #define S3C64XX_SPI_TRAILCNT           S3C64XX_SPI_MAX_TRAILCNT
 
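+/* PIO transfers above this size use the RX-FIFO-ready IRQ instead of polling */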
+#define S3C64XX_SPI_POLLING_SIZE       32
+
 #define msecs_to_loops(t) (loops_per_jiffy / 1000 * HZ * t)
-#define is_polling(x)  (x->port_conf->quirks & S3C64XX_SPI_QUIRK_POLL)
+#define is_polling(x)  (x->cntrlr_info->polling)
 
 #define RXBUSY    (1<<2)
 #define TXBUSY    (1<<3)
@@ -553,7 +556,7 @@ static int s3c64xx_wait_for_dma(struct s3c64xx_spi_driver_data *sdd,
 }
 
 static int s3c64xx_wait_for_pio(struct s3c64xx_spi_driver_data *sdd,
-                               struct spi_transfer *xfer)
+                               struct spi_transfer *xfer, bool use_irq)
 {
        void __iomem *regs = sdd->regs;
        unsigned long val;
@@ -562,11 +565,24 @@ static int s3c64xx_wait_for_pio(struct s3c64xx_spi_driver_data *sdd,
        u32 cpy_len;
        u8 *buf;
        int ms;
+       unsigned long time_us;
 
-       /* millisecs to xfer 'len' bytes @ 'cur_speed' */
-       ms = xfer->len * 8 * 1000 / sdd->cur_speed;
+       /* microsecs to xfer 'len' bytes @ 'cur_speed' */
+       time_us = (xfer->len * 8 * 1000 * 1000) / sdd->cur_speed;
+       ms = (time_us / 1000);
        ms += 10; /* some tolerance */
 
+       /* sleep for roughly the time the transfer takes on the wire */
+       status = readl(regs + S3C64XX_SPI_STATUS);
+       if (RX_FIFO_LVL(status, sdd) < xfer->len)
+               usleep_range(time_us / 2, time_us);
+
+       if (use_irq) {
+               val = msecs_to_jiffies(ms);
+               if (!wait_for_completion_timeout(&sdd->xfer_completion, val))
+                       return -EIO;
+       }
+
        val = msecs_to_loops(ms);
        do {
                status = readl(regs + S3C64XX_SPI_STATUS);
@@ -729,10 +745,13 @@ static int s3c64xx_spi_transfer_one(struct spi_master *master,
        void *rx_buf = NULL;
        int target_len = 0, origin_len = 0;
        int use_dma = 0;
+       bool use_irq = false;
        int status;
        u32 speed;
        u8 bpw;
        unsigned long flags;
+       u32 rdy_lv;
+       u32 val;
 
        reinit_completion(&sdd->xfer_completion);
 
@@ -753,17 +772,46 @@ static int s3c64xx_spi_transfer_one(struct spi_master *master,
            sdd->rx_dma.ch && sdd->tx_dma.ch) {
                use_dma = 1;
 
-       } else if (xfer->len > fifo_len) {
+       } else if (xfer->len >= fifo_len) {
                tx_buf = xfer->tx_buf;
                rx_buf = xfer->rx_buf;
                origin_len = xfer->len;
-
                target_len = xfer->len;
-               if (xfer->len > fifo_len)
-                       xfer->len = fifo_len;
+               xfer->len = fifo_len - 1;
        }
 
        do {
+               /* transfers larger than 32 bytes switch from polling to IRQ mode */
+               if (!use_dma && xfer->len > S3C64XX_SPI_POLLING_SIZE)
+                       use_irq = true;
+
+               if (use_irq) {
+                       reinit_completion(&sdd->xfer_completion);
+
+                       rdy_lv = xfer->len;
+                       /*
+                        * Set up the RDY_FIFO trigger level. The granularity
+                        * of RDY_LVL depends on the FIFO depth:
+                        *   up to 64-byte FIFO -> trigger at RDY_LVL bytes
+                        *   128-byte FIFO      -> trigger at RDY_LVL * 2 bytes
+                        *   256-byte FIFO      -> trigger at RDY_LVL * 4 bytes
+                        */
+                       if (fifo_len == 128)
+                               rdy_lv /= 2;
+                       else if (fifo_len == 256)
+                               rdy_lv /= 4;
+
+                       val = readl(sdd->regs + S3C64XX_SPI_MODE_CFG);
+                       val &= ~S3C64XX_SPI_MODE_RX_RDY_LVL;
+                       val |= (rdy_lv << S3C64XX_SPI_MODE_RX_RDY_LVL_SHIFT);
+                       writel(val, sdd->regs + S3C64XX_SPI_MODE_CFG);
+
+                       /* Enable FIFO_RDY_EN IRQ */
+                       val = readl(sdd->regs + S3C64XX_SPI_INT_EN);
+                       writel((val | S3C64XX_SPI_INT_RX_FIFORDY_EN),
+                                       sdd->regs + S3C64XX_SPI_INT_EN);
+
+               }
+
                spin_lock_irqsave(&sdd->lock, flags);
 
                /* Pending only which is to be done */
@@ -785,7 +833,7 @@ static int s3c64xx_spi_transfer_one(struct spi_master *master,
                if (use_dma)
                        status = s3c64xx_wait_for_dma(sdd, xfer);
                else
-                       status = s3c64xx_wait_for_pio(sdd, xfer);
+                       status = s3c64xx_wait_for_pio(sdd, xfer, use_irq);
 
                if (status) {
                        dev_err(&spi->dev,
@@ -824,8 +872,8 @@ static int s3c64xx_spi_transfer_one(struct spi_master *master,
                        if (xfer->rx_buf)
                                xfer->rx_buf += xfer->len;
 
-                       if (target_len > fifo_len)
-                               xfer->len = fifo_len;
+                       if (target_len >= fifo_len)
+                               xfer->len = fifo_len - 1;
                        else
                                xfer->len = target_len;
                }
@@ -995,6 +1043,14 @@ static irqreturn_t s3c64xx_spi_irq(int irq, void *data)
                dev_err(&spi->dev, "TX underrun\n");
        }
 
+       if (val & S3C64XX_SPI_ST_RX_FIFORDY) {
+               complete(&sdd->xfer_completion);
+               /*
+                * There is no pending-clear bit for this IRQ, so turn off
+                * INT_EN_RX_FIFO_RDY instead.
+                */
+               val = readl(sdd->regs + S3C64XX_SPI_INT_EN);
+               writel((val & ~S3C64XX_SPI_INT_RX_FIFORDY_EN),
+                               sdd->regs + S3C64XX_SPI_INT_EN);
+       }
+
        /* Clear the pending irq by setting and then clearing it */
        writel(clr, sdd->regs + S3C64XX_SPI_PENDING_CLR);
        writel(0, sdd->regs + S3C64XX_SPI_PENDING_CLR);
@@ -1068,6 +1124,7 @@ static struct s3c64xx_spi_info *s3c64xx_spi_parse_dt(struct device *dev)
        }
 
        sci->no_cs = of_property_read_bool(dev->of_node, "no-cs-readback");
+       sci->polling = !of_property_present(dev->of_node, "dmas");
 
        return sci;
 }
@@ -1103,29 +1160,23 @@ static int s3c64xx_spi_probe(struct platform_device *pdev)
                        return PTR_ERR(sci);
        }
 
-       if (!sci) {
-               dev_err(&pdev->dev, "platform_data missing!\n");
-               return -ENODEV;
-       }
+       if (!sci)
+               return dev_err_probe(&pdev->dev, -ENODEV,
+                                    "Platform_data missing!\n");
 
        mem_res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-       if (mem_res == NULL) {
-               dev_err(&pdev->dev, "Unable to get SPI MEM resource\n");
-               return -ENXIO;
-       }
+       if (!mem_res)
+               return dev_err_probe(&pdev->dev, -ENXIO,
+                                    "Unable to get SPI MEM resource\n");
 
        irq = platform_get_irq(pdev, 0);
-       if (irq < 0) {
-               dev_warn(&pdev->dev, "Failed to get IRQ: %d\n", irq);
-               return irq;
-       }
+       if (irq < 0)
+               return dev_err_probe(&pdev->dev, irq, "Failed to get IRQ\n");
 
-       master = spi_alloc_master(&pdev->dev,
-                               sizeof(struct s3c64xx_spi_driver_data));
-       if (master == NULL) {
-               dev_err(&pdev->dev, "Unable to allocate SPI Master\n");
-               return -ENOMEM;
-       }
+       master = devm_spi_alloc_master(&pdev->dev, sizeof(*sdd));
+       if (!master)
+               return dev_err_probe(&pdev->dev, -ENOMEM,
+                                    "Unable to allocate SPI Master\n");
 
        platform_set_drvdata(pdev, master);
 
@@ -1137,11 +1188,9 @@ static int s3c64xx_spi_probe(struct platform_device *pdev)
        sdd->sfr_start = mem_res->start;
        if (pdev->dev.of_node) {
                ret = of_alias_get_id(pdev->dev.of_node, "spi");
-               if (ret < 0) {
-                       dev_err(&pdev->dev, "failed to get alias id, errno %d\n",
-                               ret);
-                       goto err_deref_master;
-               }
+               if (ret < 0)
+                       return dev_err_probe(&pdev->dev, ret,
+                                            "Failed to get alias id\n");
                sdd->port_id = ret;
        } else {
                sdd->port_id = pdev->id;
@@ -1175,59 +1224,31 @@ static int s3c64xx_spi_probe(struct platform_device *pdev)
                master->can_dma = s3c64xx_spi_can_dma;
 
        sdd->regs = devm_ioremap_resource(&pdev->dev, mem_res);
-       if (IS_ERR(sdd->regs)) {
-               ret = PTR_ERR(sdd->regs);
-               goto err_deref_master;
-       }
+       if (IS_ERR(sdd->regs))
+               return PTR_ERR(sdd->regs);
 
-       if (sci->cfg_gpio && sci->cfg_gpio()) {
-               dev_err(&pdev->dev, "Unable to config gpio\n");
-               ret = -EBUSY;
-               goto err_deref_master;
-       }
+       if (sci->cfg_gpio && sci->cfg_gpio())
+               return dev_err_probe(&pdev->dev, -EBUSY,
+                                    "Unable to config gpio\n");
 
        /* Setup clocks */
-       sdd->clk = devm_clk_get(&pdev->dev, "spi");
-       if (IS_ERR(sdd->clk)) {
-               dev_err(&pdev->dev, "Unable to acquire clock 'spi'\n");
-               ret = PTR_ERR(sdd->clk);
-               goto err_deref_master;
-       }
-
-       ret = clk_prepare_enable(sdd->clk);
-       if (ret) {
-               dev_err(&pdev->dev, "Couldn't enable clock 'spi'\n");
-               goto err_deref_master;
-       }
+       sdd->clk = devm_clk_get_enabled(&pdev->dev, "spi");
+       if (IS_ERR(sdd->clk))
+               return dev_err_probe(&pdev->dev, PTR_ERR(sdd->clk),
+                                    "Unable to acquire clock 'spi'\n");
 
        sprintf(clk_name, "spi_busclk%d", sci->src_clk_nr);
-       sdd->src_clk = devm_clk_get(&pdev->dev, clk_name);
-       if (IS_ERR(sdd->src_clk)) {
-               dev_err(&pdev->dev,
-                       "Unable to acquire clock '%s'\n", clk_name);
-               ret = PTR_ERR(sdd->src_clk);
-               goto err_disable_clk;
-       }
-
-       ret = clk_prepare_enable(sdd->src_clk);
-       if (ret) {
-               dev_err(&pdev->dev, "Couldn't enable clock '%s'\n", clk_name);
-               goto err_disable_clk;
-       }
+       sdd->src_clk = devm_clk_get_enabled(&pdev->dev, clk_name);
+       if (IS_ERR(sdd->src_clk))
+               return dev_err_probe(&pdev->dev, PTR_ERR(sdd->src_clk),
+                                    "Unable to acquire clock '%s'\n",
+                                    clk_name);
 
        if (sdd->port_conf->clk_ioclk) {
-               sdd->ioclk = devm_clk_get(&pdev->dev, "spi_ioclk");
-               if (IS_ERR(sdd->ioclk)) {
-                       dev_err(&pdev->dev, "Unable to acquire 'ioclk'\n");
-                       ret = PTR_ERR(sdd->ioclk);
-                       goto err_disable_src_clk;
-               }
-
-               ret = clk_prepare_enable(sdd->ioclk);
-               if (ret) {
-                       dev_err(&pdev->dev, "Couldn't enable clock 'ioclk'\n");
-                       goto err_disable_src_clk;
-               }
+               sdd->ioclk = devm_clk_get_enabled(&pdev->dev, "spi_ioclk");
+               if (IS_ERR(sdd->ioclk))
+                       return dev_err_probe(&pdev->dev, PTR_ERR(sdd->ioclk),
+                                            "Unable to acquire 'ioclk'\n");
        }
 
        pm_runtime_set_autosuspend_delay(&pdev->dev, AUTOSUSPEND_TIMEOUT);
@@ -1275,14 +1296,6 @@ err_pm_put:
        pm_runtime_disable(&pdev->dev);
        pm_runtime_set_suspended(&pdev->dev);
 
-       clk_disable_unprepare(sdd->ioclk);
-err_disable_src_clk:
-       clk_disable_unprepare(sdd->src_clk);
-err_disable_clk:
-       clk_disable_unprepare(sdd->clk);
-err_deref_master:
-       spi_master_put(master);
-
        return ret;
 }
 
@@ -1300,12 +1313,6 @@ static void s3c64xx_spi_remove(struct platform_device *pdev)
                dma_release_channel(sdd->tx_dma.ch);
        }
 
-       clk_disable_unprepare(sdd->ioclk);
-
-       clk_disable_unprepare(sdd->src_clk);
-
-       clk_disable_unprepare(sdd->clk);
-
        pm_runtime_put_noidle(&pdev->dev);
        pm_runtime_disable(&pdev->dev);
        pm_runtime_set_suspended(&pdev->dev);
index 7001233..d52ed67 100644 (file)
@@ -337,7 +337,7 @@ static struct i2c_driver sc18is602_driver = {
                .name = "sc18is602",
                .of_match_table = of_match_ptr(sc18is602_of_match),
        },
-       .probe_new = sc18is602_probe,
+       .probe = sc18is602_probe,
        .id_table = sc18is602_id,
 };
 
index a2bd9dc..d64d3f7 100644 (file)
@@ -526,7 +526,7 @@ static int f_ospi_exec_op(struct spi_mem *mem, const struct spi_mem_op *op)
 static bool f_ospi_supports_op_width(struct spi_mem *mem,
                                     const struct spi_mem_op *op)
 {
-       u8 width_available[] = { 0, 1, 2, 4, 8 };
+       static const u8 width_available[] = { 0, 1, 2, 4, 8 };
        u8 width_op[] = { op->cmd.buswidth, op->addr.buswidth,
                          op->dummy.buswidth, op->data.buswidth };
        bool is_match_found;
@@ -566,7 +566,7 @@ static bool f_ospi_supports_op(struct spi_mem *mem,
 
 static int f_ospi_adjust_op_size(struct spi_mem *mem, struct spi_mem_op *op)
 {
-       op->data.nbytes = min((int)op->data.nbytes, (int)(OSPI_DAT_SIZE_MAX));
+       op->data.nbytes = min_t(int, op->data.nbytes, OSPI_DAT_SIZE_MAX);
 
        return 0;
 }
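min_t() in the hunk above does the same clamping as the old open-coded casts but names the comparison type once, so both operands are promoted consistently. A tiny illustrative use with made-up values:

        u32 requested = 4096;
        int limit = 256;

        /* Both sides are evaluated as int before the comparison. */
        int nbytes = min_t(int, requested, limit);      /* nbytes == 256 */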
@@ -634,18 +634,12 @@ static int f_ospi_probe(struct platform_device *pdev)
                goto err_put_ctlr;
        }
 
-       ospi->clk = devm_clk_get(dev, NULL);
+       ospi->clk = devm_clk_get_enabled(dev, NULL);
        if (IS_ERR(ospi->clk)) {
                ret = PTR_ERR(ospi->clk);
                goto err_put_ctlr;
        }
 
-       ret = clk_prepare_enable(ospi->clk);
-       if (ret) {
-               dev_err(dev, "Failed to enable the clock\n");
-               goto err_disable_clk;
-       }
-
        mutex_init(&ospi->mlock);
 
        ret = f_ospi_init(ospi);
@@ -661,9 +655,6 @@ static int f_ospi_probe(struct platform_device *pdev)
 err_destroy_mutex:
        mutex_destroy(&ospi->mlock);
 
-err_disable_clk:
-       clk_disable_unprepare(ospi->clk);
-
 err_put_ctlr:
        spi_controller_put(ctlr);
 
@@ -674,8 +665,6 @@ static void f_ospi_remove(struct platform_device *pdev)
 {
        struct f_ospi *ospi = platform_get_drvdata(pdev);
 
-       clk_disable_unprepare(ospi->clk);
-
        mutex_destroy(&ospi->mlock);
 }
 
index d6598e4..6d10fa4 100644 (file)
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 //
-// STMicroelectronics STM32 SPI Controller driver (master mode only)
+// STMicroelectronics STM32 SPI Controller driver
 //
 // Copyright (C) 2017, STMicroelectronics - All Rights Reserved
 // Author(s): Amelie Delaunay <amelie.delaunay@st.com> for STMicroelectronics.
 #define STM32H7_SPI_CFG2_CPHA          BIT(24)
 #define STM32H7_SPI_CFG2_CPOL          BIT(25)
 #define STM32H7_SPI_CFG2_SSM           BIT(26)
+#define STM32H7_SPI_CFG2_SSIOP         BIT(28)
 #define STM32H7_SPI_CFG2_AFCNTR                BIT(31)
 
 /* STM32H7_SPI_IER bit fields */
  */
 #define SPI_DMA_MIN_BYTES      16
 
+/* STM32 SPI driver helpers */
+#define STM32_SPI_MASTER_MODE(stm32_spi) (!(stm32_spi)->device_mode)
+#define STM32_SPI_DEVICE_MODE(stm32_spi) ((stm32_spi)->device_mode)
+
 /**
  * struct stm32_spi_reg - stm32 SPI register & bitfield desc
  * @reg:               register offset
@@ -190,6 +195,7 @@ struct stm32_spi_reg {
  * @cpol: clock polarity register and polarity bit
  * @cpha: clock phase register and phase bit
  * @lsb_first: LSB transmitted first register and bit
+ * @cs_high: chip select active value
  * @br: baud rate register and bitfields
  * @rx: SPI RX data register
  * @tx: SPI TX data register
@@ -201,6 +207,7 @@ struct stm32_spi_regspec {
        const struct stm32_spi_reg cpol;
        const struct stm32_spi_reg cpha;
        const struct stm32_spi_reg lsb_first;
+       const struct stm32_spi_reg cs_high;
        const struct stm32_spi_reg br;
        const struct stm32_spi_reg rx;
        const struct stm32_spi_reg tx;
@@ -258,7 +265,7 @@ struct stm32_spi_cfg {
 /**
  * struct stm32_spi - private data of the SPI controller
  * @dev: driver model representation of the controller
- * @master: controller master interface
+ * @ctrl: controller interface
  * @cfg: compatible configuration data
  * @base: virtual memory area
  * @clk: hw kernel clock feeding the SPI clock generator
@@ -280,10 +287,11 @@ struct stm32_spi_cfg {
  * @dma_tx: dma channel for TX transfer
  * @dma_rx: dma channel for RX transfer
  * @phys_addr: SPI registers physical base address
+ * @device_mode: the controller is configured as SPI device
  */
 struct stm32_spi {
        struct device *dev;
-       struct spi_master *master;
+       struct spi_controller *ctrl;
        const struct stm32_spi_cfg *cfg;
        void __iomem *base;
        struct clk *clk;
@@ -307,6 +315,8 @@ struct stm32_spi {
        struct dma_chan *dma_tx;
        struct dma_chan *dma_rx;
        dma_addr_t phys_addr;
+
+       bool device_mode;
 };
 
 static const struct stm32_spi_regspec stm32f4_spi_regspec = {
@@ -318,6 +328,7 @@ static const struct stm32_spi_regspec stm32f4_spi_regspec = {
        .cpol = { STM32F4_SPI_CR1, STM32F4_SPI_CR1_CPOL },
        .cpha = { STM32F4_SPI_CR1, STM32F4_SPI_CR1_CPHA },
        .lsb_first = { STM32F4_SPI_CR1, STM32F4_SPI_CR1_LSBFRST },
+       .cs_high = {},
        .br = { STM32F4_SPI_CR1, STM32F4_SPI_CR1_BR, STM32F4_SPI_CR1_BR_SHIFT },
 
        .rx = { STM32F4_SPI_DR },
@@ -336,6 +347,7 @@ static const struct stm32_spi_regspec stm32h7_spi_regspec = {
        .cpol = { STM32H7_SPI_CFG2, STM32H7_SPI_CFG2_CPOL },
        .cpha = { STM32H7_SPI_CFG2, STM32H7_SPI_CFG2_CPHA },
        .lsb_first = { STM32H7_SPI_CFG2, STM32H7_SPI_CFG2_LSBFRST },
+       .cs_high = { STM32H7_SPI_CFG2, STM32H7_SPI_CFG2_SSIOP },
        .br = { STM32H7_SPI_CFG1, STM32H7_SPI_CFG1_MBR,
                STM32H7_SPI_CFG1_MBR_SHIFT },
 
@@ -437,9 +449,9 @@ static int stm32_spi_prepare_mbr(struct stm32_spi *spi, u32 speed_hz,
        div = DIV_ROUND_CLOSEST(spi->clk_rate & ~0x1, speed_hz);
 
        /*
-        * SPI framework set xfer->speed_hz to master->max_speed_hz if
-        * xfer->speed_hz is greater than master->max_speed_hz, and it returns
-        * an error when xfer->speed_hz is lower than master->min_speed_hz, so
+        * SPI framework sets xfer->speed_hz to ctrl->max_speed_hz if
+        * xfer->speed_hz is greater than ctrl->max_speed_hz, and it returns
+        * an error when xfer->speed_hz is lower than ctrl->min_speed_hz, so
         * no need to check it there.
         * However, we need to ensure the following calculations.
         */
@@ -657,9 +669,9 @@ static void stm32f4_spi_disable(struct stm32_spi *spi)
        }
 
        if (spi->cur_usedma && spi->dma_tx)
-               dmaengine_terminate_all(spi->dma_tx);
+               dmaengine_terminate_async(spi->dma_tx);
        if (spi->cur_usedma && spi->dma_rx)
-               dmaengine_terminate_all(spi->dma_rx);
+               dmaengine_terminate_async(spi->dma_rx);
 
        stm32_spi_clr_bits(spi, STM32F4_SPI_CR1, STM32F4_SPI_CR1_SPE);
 
@@ -696,9 +708,9 @@ static void stm32h7_spi_disable(struct stm32_spi *spi)
        }
 
        if (spi->cur_usedma && spi->dma_tx)
-               dmaengine_terminate_all(spi->dma_tx);
+               dmaengine_terminate_async(spi->dma_tx);
        if (spi->cur_usedma && spi->dma_rx)
-               dmaengine_terminate_all(spi->dma_rx);
+               dmaengine_terminate_async(spi->dma_rx);
 
        stm32_spi_clr_bits(spi, STM32H7_SPI_CR1, STM32H7_SPI_CR1_SPE);
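The dmaengine_terminate_all() conversions above pick between the two modern helpers: dmaengine_terminate_async() only requests termination and is safe in atomic context, presumably chosen because these disable paths can run under the controller's spinlock or from interrupt handling, while dmaengine_terminate_sync() (used later in the DMA submit error path) additionally waits for in-flight descriptor callbacks and may therefore sleep. A rough sketch of the split, not literal driver code:

        /* Atomic context: request termination, do not wait for callbacks. */
        dmaengine_terminate_async(chan);

        /* Sleepable teardown/error path: terminate and wait for callbacks. */
        dmaengine_terminate_sync(chan);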
 
@@ -714,19 +726,19 @@ static void stm32h7_spi_disable(struct stm32_spi *spi)
 
 /**
  * stm32_spi_can_dma - Determine if the transfer is eligible for DMA use
- * @master: controller master interface
+ * @ctrl: controller interface
  * @spi_dev: pointer to the spi device
  * @transfer: pointer to spi transfer
  *
  * If driver has fifo and the current transfer size is greater than fifo size,
  * use DMA. Otherwise use DMA for transfer longer than defined DMA min bytes.
  */
-static bool stm32_spi_can_dma(struct spi_master *master,
+static bool stm32_spi_can_dma(struct spi_controller *ctrl,
                              struct spi_device *spi_dev,
                              struct spi_transfer *transfer)
 {
        unsigned int dma_size;
-       struct stm32_spi *spi = spi_master_get_devdata(master);
+       struct stm32_spi *spi = spi_controller_get_devdata(ctrl);
 
        if (spi->cfg->has_fifo)
                dma_size = spi->fifo_size;
@@ -742,12 +754,12 @@ static bool stm32_spi_can_dma(struct spi_master *master,
 /**
  * stm32f4_spi_irq_event - Interrupt handler for SPI controller events
  * @irq: interrupt line
- * @dev_id: SPI controller master interface
+ * @dev_id: SPI controller interface
  */
 static irqreturn_t stm32f4_spi_irq_event(int irq, void *dev_id)
 {
-       struct spi_master *master = dev_id;
-       struct stm32_spi *spi = spi_master_get_devdata(master);
+       struct spi_controller *ctrl = dev_id;
+       struct stm32_spi *spi = spi_controller_get_devdata(ctrl);
        u32 sr, mask = 0;
        bool end = false;
 
@@ -830,14 +842,14 @@ end_irq:
 /**
  * stm32f4_spi_irq_thread - Thread of interrupt handler for SPI controller
  * @irq: interrupt line
- * @dev_id: SPI controller master interface
+ * @dev_id: SPI controller interface
  */
 static irqreturn_t stm32f4_spi_irq_thread(int irq, void *dev_id)
 {
-       struct spi_master *master = dev_id;
-       struct stm32_spi *spi = spi_master_get_devdata(master);
+       struct spi_controller *ctrl = dev_id;
+       struct stm32_spi *spi = spi_controller_get_devdata(ctrl);
 
-       spi_finalize_current_transfer(master);
+       spi_finalize_current_transfer(ctrl);
        stm32f4_spi_disable(spi);
 
        return IRQ_HANDLED;
@@ -846,12 +858,12 @@ static irqreturn_t stm32f4_spi_irq_thread(int irq, void *dev_id)
 /**
  * stm32h7_spi_irq_thread - Thread of interrupt handler for SPI controller
  * @irq: interrupt line
- * @dev_id: SPI controller master interface
+ * @dev_id: SPI controller interface
  */
 static irqreturn_t stm32h7_spi_irq_thread(int irq, void *dev_id)
 {
-       struct spi_master *master = dev_id;
-       struct stm32_spi *spi = spi_master_get_devdata(master);
+       struct spi_controller *ctrl = dev_id;
+       struct stm32_spi *spi = spi_controller_get_devdata(ctrl);
        u32 sr, ier, mask;
        unsigned long flags;
        bool end = false;
@@ -931,7 +943,7 @@ static irqreturn_t stm32h7_spi_irq_thread(int irq, void *dev_id)
 
        if (end) {
                stm32h7_spi_disable(spi);
-               spi_finalize_current_transfer(master);
+               spi_finalize_current_transfer(ctrl);
        }
 
        return IRQ_HANDLED;
@@ -939,13 +951,13 @@ static irqreturn_t stm32h7_spi_irq_thread(int irq, void *dev_id)
 
 /**
  * stm32_spi_prepare_msg - set up the controller to transfer a single message
- * @master: controller master interface
+ * @ctrl: controller interface
  * @msg: pointer to spi message
  */
-static int stm32_spi_prepare_msg(struct spi_master *master,
+static int stm32_spi_prepare_msg(struct spi_controller *ctrl,
                                 struct spi_message *msg)
 {
-       struct stm32_spi *spi = spi_master_get_devdata(master);
+       struct stm32_spi *spi = spi_controller_get_devdata(ctrl);
        struct spi_device *spi_dev = msg->spi;
        struct device_node *np = spi_dev->dev.of_node;
        unsigned long flags;
@@ -971,6 +983,11 @@ static int stm32_spi_prepare_msg(struct spi_master *master,
        else
                clrb |= spi->cfg->regs->lsb_first.mask;
 
+       if (STM32_SPI_DEVICE_MODE(spi) && spi_dev->mode & SPI_CS_HIGH)
+               setb |= spi->cfg->regs->cs_high.mask;
+       else
+               clrb |= spi->cfg->regs->cs_high.mask;
+
        dev_dbg(spi->dev, "cpol=%d cpha=%d lsb_first=%d cs_high=%d\n",
                !!(spi_dev->mode & SPI_CPOL),
                !!(spi_dev->mode & SPI_CPHA),
@@ -984,9 +1001,9 @@ static int stm32_spi_prepare_msg(struct spi_master *master,
        if (spi->cfg->set_number_of_data) {
                int ret;
 
-               ret = spi_split_transfers_maxwords(master, msg,
-                                                  STM32H7_SPI_TSIZE_MAX,
-                                                  GFP_KERNEL | GFP_DMA);
+               ret = spi_split_transfers_maxsize(ctrl, msg,
+                                                 STM32H7_SPI_TSIZE_MAX,
+                                                 GFP_KERNEL | GFP_DMA);
                if (ret)
                        return ret;
        }
@@ -1016,7 +1033,7 @@ static void stm32f4_spi_dma_tx_cb(void *data)
        struct stm32_spi *spi = data;
 
        if (spi->cur_comm == SPI_SIMPLEX_TX || spi->cur_comm == SPI_3WIRE_TX) {
-               spi_finalize_current_transfer(spi->master);
+               spi_finalize_current_transfer(spi->ctrl);
                stm32f4_spi_disable(spi);
        }
 }
@@ -1031,7 +1048,7 @@ static void stm32_spi_dma_rx_cb(void *data)
 {
        struct stm32_spi *spi = data;
 
-       spi_finalize_current_transfer(spi->master);
+       spi_finalize_current_transfer(spi->ctrl);
        spi->cfg->disable(spi);
 }
 
@@ -1161,7 +1178,8 @@ static int stm32h7_spi_transfer_one_irq(struct stm32_spi *spi)
        if (spi->tx_buf)
                stm32h7_spi_write_txfifo(spi);
 
-       stm32_spi_set_bits(spi, STM32H7_SPI_CR1, STM32H7_SPI_CR1_CSTART);
+       if (STM32_SPI_MASTER_MODE(spi))
+               stm32_spi_set_bits(spi, STM32H7_SPI_CR1, STM32H7_SPI_CR1_CSTART);
 
        writel_relaxed(ier, spi->base + STM32H7_SPI_IER);
 
@@ -1208,7 +1226,8 @@ static void stm32h7_spi_transfer_one_dma_start(struct stm32_spi *spi)
 
        stm32_spi_enable(spi);
 
-       stm32_spi_set_bits(spi, STM32H7_SPI_CR1, STM32H7_SPI_CR1_CSTART);
+       if (STM32_SPI_MASTER_MODE(spi))
+               stm32_spi_set_bits(spi, STM32H7_SPI_CR1, STM32H7_SPI_CR1_CSTART);
 }
 
 /**
@@ -1302,7 +1321,7 @@ static int stm32_spi_transfer_one_dma(struct stm32_spi *spi,
 
 dma_submit_error:
        if (spi->dma_rx)
-               dmaengine_terminate_all(spi->dma_rx);
+               dmaengine_terminate_sync(spi->dma_rx);
 
 dma_desc_error:
        stm32_spi_clr_bits(spi, spi->cfg->regs->dma_rx_en.reg,
@@ -1536,16 +1555,18 @@ static int stm32_spi_transfer_one_setup(struct stm32_spi *spi,
        spi->cfg->set_bpw(spi);
 
        /* Update spi->cur_speed with real clock speed */
-       mbr = stm32_spi_prepare_mbr(spi, transfer->speed_hz,
-                                   spi->cfg->baud_rate_div_min,
-                                   spi->cfg->baud_rate_div_max);
-       if (mbr < 0) {
-               ret = mbr;
-               goto out;
-       }
+       if (STM32_SPI_MASTER_MODE(spi)) {
+               mbr = stm32_spi_prepare_mbr(spi, transfer->speed_hz,
+                                           spi->cfg->baud_rate_div_min,
+                                           spi->cfg->baud_rate_div_max);
+               if (mbr < 0) {
+                       ret = mbr;
+                       goto out;
+               }
 
-       transfer->speed_hz = spi->cur_speed;
-       stm32_spi_set_mbr(spi, mbr);
+               transfer->speed_hz = spi->cur_speed;
+               stm32_spi_set_mbr(spi, mbr);
+       }
 
        comm_type = stm32_spi_communication_type(spi_dev, transfer);
        ret = spi->cfg->set_mode(spi, comm_type);
@@ -1554,7 +1575,7 @@ static int stm32_spi_transfer_one_setup(struct stm32_spi *spi,
 
        spi->cur_comm = comm_type;
 
-       if (spi->cfg->set_data_idleness)
+       if (STM32_SPI_MASTER_MODE(spi) && spi->cfg->set_data_idleness)
                spi->cfg->set_data_idleness(spi, transfer->len);
 
        if (spi->cur_bpw <= 8)
@@ -1575,7 +1596,8 @@ static int stm32_spi_transfer_one_setup(struct stm32_spi *spi,
        dev_dbg(spi->dev,
                "data frame of %d-bit, data packet of %d data frames\n",
                spi->cur_bpw, spi->cur_fthlv);
-       dev_dbg(spi->dev, "speed set to %dHz\n", spi->cur_speed);
+       if (STM32_SPI_MASTER_MODE(spi))
+               dev_dbg(spi->dev, "speed set to %dHz\n", spi->cur_speed);
        dev_dbg(spi->dev, "transfer of %d bytes (%d data frames)\n",
                spi->cur_xferlen, nb_words);
        dev_dbg(spi->dev, "dma %s\n",
@@ -1589,18 +1611,18 @@ out:
 
 /**
  * stm32_spi_transfer_one - transfer a single spi_transfer
- * @master: controller master interface
+ * @ctrl: controller interface
  * @spi_dev: pointer to the spi device
  * @transfer: pointer to spi transfer
  *
  * It must return 0 if the transfer is finished or 1 if the transfer is still
  * in progress.
  */
-static int stm32_spi_transfer_one(struct spi_master *master,
+static int stm32_spi_transfer_one(struct spi_controller *ctrl,
                                  struct spi_device *spi_dev,
                                  struct spi_transfer *transfer)
 {
-       struct stm32_spi *spi = spi_master_get_devdata(master);
+       struct stm32_spi *spi = spi_controller_get_devdata(ctrl);
        int ret;
 
        spi->tx_buf = transfer->tx_buf;
@@ -1608,8 +1630,8 @@ static int stm32_spi_transfer_one(struct spi_master *master,
        spi->tx_len = spi->tx_buf ? transfer->len : 0;
        spi->rx_len = spi->rx_buf ? transfer->len : 0;
 
-       spi->cur_usedma = (master->can_dma &&
-                          master->can_dma(master, spi_dev, transfer));
+       spi->cur_usedma = (ctrl->can_dma &&
+                          ctrl->can_dma(ctrl, spi_dev, transfer));
 
        ret = stm32_spi_transfer_one_setup(spi, spi_dev, transfer);
        if (ret) {
@@ -1625,13 +1647,13 @@ static int stm32_spi_transfer_one(struct spi_master *master,
 
 /**
  * stm32_spi_unprepare_msg - relax the hardware
- * @master: controller master interface
+ * @ctrl: controller interface
  * @msg: pointer to the spi message
  */
-static int stm32_spi_unprepare_msg(struct spi_master *master,
+static int stm32_spi_unprepare_msg(struct spi_controller *ctrl,
                                   struct spi_message *msg)
 {
-       struct stm32_spi *spi = spi_master_get_devdata(master);
+       struct stm32_spi *spi = spi_controller_get_devdata(ctrl);
 
        spi->cfg->disable(spi);
 
@@ -1670,12 +1692,13 @@ static int stm32f4_spi_config(struct stm32_spi *spi)
 }
 
 /**
- * stm32h7_spi_config - Configure SPI controller as SPI master
+ * stm32h7_spi_config - Configure SPI controller
  * @spi: pointer to the spi controller data structure
  */
 static int stm32h7_spi_config(struct stm32_spi *spi)
 {
        unsigned long flags;
+       u32 cr1 = 0, cfg2 = 0;
 
        spin_lock_irqsave(&spi->lock, flags);
 
@@ -1683,24 +1706,28 @@ static int stm32h7_spi_config(struct stm32_spi *spi)
        stm32_spi_clr_bits(spi, STM32H7_SPI_I2SCFGR,
                           STM32H7_SPI_I2SCFGR_I2SMOD);
 
-       /*
-        * - SS input value high
-        * - transmitter half duplex direction
-        * - automatic communication suspend when RX-Fifo is full
-        */
-       stm32_spi_set_bits(spi, STM32H7_SPI_CR1, STM32H7_SPI_CR1_SSI |
-                                                STM32H7_SPI_CR1_HDDIR |
-                                                STM32H7_SPI_CR1_MASRX);
+       if (STM32_SPI_DEVICE_MODE(spi)) {
+               /* Use native device select */
+               cfg2 &= ~STM32H7_SPI_CFG2_SSM;
+       } else {
+               /*
+                * - Transmitter half duplex direction
+                * - Automatic communication suspend when RX-Fifo is full
+                * - SS input value high
+                */
+               cr1 |= STM32H7_SPI_CR1_HDDIR | STM32H7_SPI_CR1_MASRX | STM32H7_SPI_CR1_SSI;
 
-       /*
-        * - Set the master mode (default Motorola mode)
-        * - Consider 1 master/n slaves configuration and
-        *   SS input value is determined by the SSI bit
-        * - keep control of all associated GPIOs
-        */
-       stm32_spi_set_bits(spi, STM32H7_SPI_CFG2, STM32H7_SPI_CFG2_MASTER |
-                                                 STM32H7_SPI_CFG2_SSM |
-                                                 STM32H7_SPI_CFG2_AFCNTR);
+               /*
+                * - Set the master mode (default Motorola mode)
+                * - Consider 1 master/n devices configuration and
+                *   SS input value is determined by the SSI bit
+                * - keep control of all associated GPIOs
+                */
+               cfg2 |= STM32H7_SPI_CFG2_MASTER | STM32H7_SPI_CFG2_SSM | STM32H7_SPI_CFG2_AFCNTR;
+       }
+
+       stm32_spi_set_bits(spi, STM32H7_SPI_CR1, cr1);
+       stm32_spi_set_bits(spi, STM32H7_SPI_CFG2, cfg2);
 
        spin_unlock_irqrestore(&spi->lock, flags);
 
@@ -1756,24 +1783,38 @@ static const struct of_device_id stm32_spi_of_match[] = {
 };
 MODULE_DEVICE_TABLE(of, stm32_spi_of_match);
 
+static int stm32h7_spi_device_abort(struct spi_controller *ctrl)
+{
+       spi_finalize_current_transfer(ctrl);
+       return 0;
+}
+
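stm32h7_spi_device_abort() backs the slave_abort hook wired up in probe below: when the remote bus controller never clocks out a queued transfer, a target-mode client can bail out and this callback finalizes the stuck transfer. A hedged sketch of the client side, with a hypothetical completion and timeout:

        /* Hypothetical target-mode client: give up if the host never clocks us. */
        ret = spi_async(spi_dev, &msg);
        if (ret)
                return ret;
        if (!wait_for_completion_timeout(&done, msecs_to_jiffies(1000)))
                spi_slave_abort(spi_dev);   /* ends up in stm32h7_spi_device_abort() */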
 static int stm32_spi_probe(struct platform_device *pdev)
 {
-       struct spi_master *master;
+       struct spi_controller *ctrl;
        struct stm32_spi *spi;
        struct resource *res;
        struct reset_control *rst;
+       struct device_node *np = pdev->dev.of_node;
+       bool device_mode;
        int ret;
 
-       master = devm_spi_alloc_master(&pdev->dev, sizeof(struct stm32_spi));
-       if (!master) {
-               dev_err(&pdev->dev, "spi master allocation failed\n");
+       device_mode = of_property_read_bool(np, "spi-slave");
+
+       if (device_mode)
+               ctrl = devm_spi_alloc_slave(&pdev->dev, sizeof(struct stm32_spi));
+       else
+               ctrl = devm_spi_alloc_master(&pdev->dev, sizeof(struct stm32_spi));
+       if (!ctrl) {
+               dev_err(&pdev->dev, "spi controller allocation failed\n");
                return -ENOMEM;
        }
-       platform_set_drvdata(pdev, master);
+       platform_set_drvdata(pdev, ctrl);
 
-       spi = spi_master_get_devdata(master);
+       spi = spi_controller_get_devdata(ctrl);
        spi->dev = &pdev->dev;
-       spi->master = master;
+       spi->ctrl = ctrl;
+       spi->device_mode = device_mode;
        spin_lock_init(&spi->lock);
 
        spi->cfg = (const struct stm32_spi_cfg *)
@@ -1794,7 +1835,7 @@ static int stm32_spi_probe(struct platform_device *pdev)
        ret = devm_request_threaded_irq(&pdev->dev, spi->irq,
                                        spi->cfg->irq_handler_event,
                                        spi->cfg->irq_handler_thread,
-                                       IRQF_ONESHOT, pdev->name, master);
+                                       IRQF_ONESHOT, pdev->name, ctrl);
        if (ret) {
                dev_err(&pdev->dev, "irq%d request failed: %d\n", spi->irq,
                        ret);
@@ -1843,19 +1884,21 @@ static int stm32_spi_probe(struct platform_device *pdev)
                goto err_clk_disable;
        }
 
-       master->dev.of_node = pdev->dev.of_node;
-       master->auto_runtime_pm = true;
-       master->bus_num = pdev->id;
-       master->mode_bits = SPI_CPHA | SPI_CPOL | SPI_CS_HIGH | SPI_LSB_FIRST |
-                           SPI_3WIRE;
-       master->bits_per_word_mask = spi->cfg->get_bpw_mask(spi);
-       master->max_speed_hz = spi->clk_rate / spi->cfg->baud_rate_div_min;
-       master->min_speed_hz = spi->clk_rate / spi->cfg->baud_rate_div_max;
-       master->use_gpio_descriptors = true;
-       master->prepare_message = stm32_spi_prepare_msg;
-       master->transfer_one = stm32_spi_transfer_one;
-       master->unprepare_message = stm32_spi_unprepare_msg;
-       master->flags = spi->cfg->flags;
+       ctrl->dev.of_node = pdev->dev.of_node;
+       ctrl->auto_runtime_pm = true;
+       ctrl->bus_num = pdev->id;
+       ctrl->mode_bits = SPI_CPHA | SPI_CPOL | SPI_CS_HIGH | SPI_LSB_FIRST |
+                         SPI_3WIRE;
+       ctrl->bits_per_word_mask = spi->cfg->get_bpw_mask(spi);
+       ctrl->max_speed_hz = spi->clk_rate / spi->cfg->baud_rate_div_min;
+       ctrl->min_speed_hz = spi->clk_rate / spi->cfg->baud_rate_div_max;
+       ctrl->use_gpio_descriptors = true;
+       ctrl->prepare_message = stm32_spi_prepare_msg;
+       ctrl->transfer_one = stm32_spi_transfer_one;
+       ctrl->unprepare_message = stm32_spi_unprepare_msg;
+       ctrl->flags = spi->cfg->flags;
+       if (STM32_SPI_DEVICE_MODE(spi))
+               ctrl->slave_abort = stm32h7_spi_device_abort;
 
        spi->dma_tx = dma_request_chan(spi->dev, "tx");
        if (IS_ERR(spi->dma_tx)) {
@@ -1866,7 +1909,7 @@ static int stm32_spi_probe(struct platform_device *pdev)
 
                dev_warn(&pdev->dev, "failed to request tx dma channel\n");
        } else {
-               master->dma_tx = spi->dma_tx;
+               ctrl->dma_tx = spi->dma_tx;
        }
 
        spi->dma_rx = dma_request_chan(spi->dev, "rx");
@@ -1878,11 +1921,11 @@ static int stm32_spi_probe(struct platform_device *pdev)
 
                dev_warn(&pdev->dev, "failed to request rx dma channel\n");
        } else {
-               master->dma_rx = spi->dma_rx;
+               ctrl->dma_rx = spi->dma_rx;
        }
 
        if (spi->dma_tx || spi->dma_rx)
-               master->can_dma = stm32_spi_can_dma;
+               ctrl->can_dma = stm32_spi_can_dma;
 
        pm_runtime_set_autosuspend_delay(&pdev->dev,
                                         STM32_SPI_AUTOSUSPEND_DELAY);
@@ -1891,9 +1934,9 @@ static int stm32_spi_probe(struct platform_device *pdev)
        pm_runtime_get_noresume(&pdev->dev);
        pm_runtime_enable(&pdev->dev);
 
-       ret = spi_register_master(master);
+       ret = spi_register_controller(ctrl);
        if (ret) {
-               dev_err(&pdev->dev, "spi master registration failed: %d\n",
+               dev_err(&pdev->dev, "spi controller registration failed: %d\n",
                        ret);
                goto err_pm_disable;
        }
@@ -1901,7 +1944,8 @@ static int stm32_spi_probe(struct platform_device *pdev)
        pm_runtime_mark_last_busy(&pdev->dev);
        pm_runtime_put_autosuspend(&pdev->dev);
 
-       dev_info(&pdev->dev, "driver initialized\n");
+       dev_info(&pdev->dev, "driver initialized (%s mode)\n",
+                STM32_SPI_MASTER_MODE(spi) ? "master" : "device");
 
        return 0;
 
@@ -1923,12 +1967,12 @@ err_clk_disable:
 
 static void stm32_spi_remove(struct platform_device *pdev)
 {
-       struct spi_master *master = platform_get_drvdata(pdev);
-       struct stm32_spi *spi = spi_master_get_devdata(master);
+       struct spi_controller *ctrl = platform_get_drvdata(pdev);
+       struct stm32_spi *spi = spi_controller_get_devdata(ctrl);
 
        pm_runtime_get_sync(&pdev->dev);
 
-       spi_unregister_master(master);
+       spi_unregister_controller(ctrl);
        spi->cfg->disable(spi);
 
        pm_runtime_disable(&pdev->dev);
@@ -1936,10 +1980,10 @@ static void stm32_spi_remove(struct platform_device *pdev)
        pm_runtime_set_suspended(&pdev->dev);
        pm_runtime_dont_use_autosuspend(&pdev->dev);
 
-       if (master->dma_tx)
-               dma_release_channel(master->dma_tx);
-       if (master->dma_rx)
-               dma_release_channel(master->dma_rx);
+       if (ctrl->dma_tx)
+               dma_release_channel(ctrl->dma_tx);
+       if (ctrl->dma_rx)
+               dma_release_channel(ctrl->dma_rx);
 
        clk_disable_unprepare(spi->clk);
 
@@ -1949,8 +1993,8 @@ static void stm32_spi_remove(struct platform_device *pdev)
 
 static int __maybe_unused stm32_spi_runtime_suspend(struct device *dev)
 {
-       struct spi_master *master = dev_get_drvdata(dev);
-       struct stm32_spi *spi = spi_master_get_devdata(master);
+       struct spi_controller *ctrl = dev_get_drvdata(dev);
+       struct stm32_spi *spi = spi_controller_get_devdata(ctrl);
 
        clk_disable_unprepare(spi->clk);
 
@@ -1959,8 +2003,8 @@ static int __maybe_unused stm32_spi_runtime_suspend(struct device *dev)
 
 static int __maybe_unused stm32_spi_runtime_resume(struct device *dev)
 {
-       struct spi_master *master = dev_get_drvdata(dev);
-       struct stm32_spi *spi = spi_master_get_devdata(master);
+       struct spi_controller *ctrl = dev_get_drvdata(dev);
+       struct stm32_spi *spi = spi_controller_get_devdata(ctrl);
        int ret;
 
        ret = pinctrl_pm_select_default_state(dev);
@@ -1972,10 +2016,10 @@ static int __maybe_unused stm32_spi_runtime_resume(struct device *dev)
 
 static int __maybe_unused stm32_spi_suspend(struct device *dev)
 {
-       struct spi_master *master = dev_get_drvdata(dev);
+       struct spi_controller *ctrl = dev_get_drvdata(dev);
        int ret;
 
-       ret = spi_master_suspend(master);
+       ret = spi_controller_suspend(ctrl);
        if (ret)
                return ret;
 
@@ -1984,15 +2028,15 @@ static int __maybe_unused stm32_spi_suspend(struct device *dev)
 
 static int __maybe_unused stm32_spi_resume(struct device *dev)
 {
-       struct spi_master *master = dev_get_drvdata(dev);
-       struct stm32_spi *spi = spi_master_get_devdata(master);
+       struct spi_controller *ctrl = dev_get_drvdata(dev);
+       struct stm32_spi *spi = spi_controller_get_devdata(ctrl);
        int ret;
 
        ret = pm_runtime_force_resume(dev);
        if (ret)
                return ret;
 
-       ret = spi_master_resume(master);
+       ret = spi_controller_resume(ctrl);
        if (ret) {
                clk_disable_unprepare(spi->clk);
                return ret;
index 7532c85..30d5416 100644 (file)
@@ -42,7 +42,9 @@
 #define SUN6I_TFR_CTL_CS_MANUAL                        BIT(6)
 #define SUN6I_TFR_CTL_CS_LEVEL                 BIT(7)
 #define SUN6I_TFR_CTL_DHB                      BIT(8)
+#define SUN6I_TFR_CTL_SDC                      BIT(11)
 #define SUN6I_TFR_CTL_FBS                      BIT(12)
+#define SUN6I_TFR_CTL_SDM                      BIT(13)
 #define SUN6I_TFR_CTL_XCH                      BIT(31)
 
 #define SUN6I_INT_CTL_REG              0x10
 #define SUN6I_TXDATA_REG               0x200
 #define SUN6I_RXDATA_REG               0x300
 
+struct sun6i_spi_cfg {
+       unsigned long           fifo_depth;
+       bool                    has_clk_ctl;
+};
+
 struct sun6i_spi {
        struct spi_master       *master;
        void __iomem            *base_addr;
@@ -99,7 +106,7 @@ struct sun6i_spi {
        const u8                *tx_buf;
        u8                      *rx_buf;
        int                     len;
-       unsigned long           fifo_depth;
+       const struct sun6i_spi_cfg *cfg;
 };
 
 static inline u32 sun6i_spi_read(struct sun6i_spi *sspi, u32 reg)
@@ -156,7 +163,7 @@ static inline void sun6i_spi_fill_fifo(struct sun6i_spi *sspi)
        u8 byte;
 
        /* See how much data we can fit */
-       cnt = sspi->fifo_depth - sun6i_spi_get_tx_fifo_count(sspi);
+       cnt = sspi->cfg->fifo_depth - sun6i_spi_get_tx_fifo_count(sspi);
 
        len = min((int)cnt, sspi->len);
 
@@ -256,7 +263,7 @@ static int sun6i_spi_transfer_one(struct spi_master *master,
                                  struct spi_transfer *tfr)
 {
        struct sun6i_spi *sspi = spi_master_get_devdata(master);
-       unsigned int mclk_rate, div, div_cdr1, div_cdr2, timeout;
+       unsigned int div, div_cdr1, div_cdr2, timeout;
        unsigned int start, end, tx_time;
        unsigned int trig_level;
        unsigned int tx_len = 0, rx_len = 0;
@@ -289,14 +296,14 @@ static int sun6i_spi_transfer_one(struct spi_master *master,
                 * the hardcoded value used in old generation of Allwinner
                 * SPI controller. (See spi-sun4i.c)
                 */
-               trig_level = sspi->fifo_depth / 4 * 3;
+               trig_level = sspi->cfg->fifo_depth / 4 * 3;
        } else {
                /*
                 * Setup FIFO DMA request trigger level
                 * We choose 1/2 of the full fifo depth, that value will
                 * be used as DMA burst length.
                 */
-               trig_level = sspi->fifo_depth / 2;
+               trig_level = sspi->cfg->fifo_depth / 2;
 
                if (tfr->tx_buf)
                        reg |= SUN6I_FIFO_CTL_TF_DRQ_EN;
@@ -346,39 +353,65 @@ static int sun6i_spi_transfer_one(struct spi_master *master,
 
        sun6i_spi_write(sspi, SUN6I_TFR_CTL_REG, reg);
 
-       /* Ensure that we have a parent clock fast enough */
-       mclk_rate = clk_get_rate(sspi->mclk);
-       if (mclk_rate < (2 * tfr->speed_hz)) {
-               clk_set_rate(sspi->mclk, 2 * tfr->speed_hz);
-               mclk_rate = clk_get_rate(sspi->mclk);
-       }
+       if (sspi->cfg->has_clk_ctl) {
+               unsigned int mclk_rate = clk_get_rate(sspi->mclk);
 
-       /*
-        * Setup clock divider.
-        *
-        * We have two choices there. Either we can use the clock
-        * divide rate 1, which is calculated thanks to this formula:
-        * SPI_CLK = MOD_CLK / (2 ^ cdr)
-        * Or we can use CDR2, which is calculated with the formula:
-        * SPI_CLK = MOD_CLK / (2 * (cdr + 1))
-        * Wether we use the former or the latter is set through the
-        * DRS bit.
-        *
-        * First try CDR2, and if we can't reach the expected
-        * frequency, fall back to CDR1.
-        */
-       div_cdr1 = DIV_ROUND_UP(mclk_rate, tfr->speed_hz);
-       div_cdr2 = DIV_ROUND_UP(div_cdr1, 2);
-       if (div_cdr2 <= (SUN6I_CLK_CTL_CDR2_MASK + 1)) {
-               reg = SUN6I_CLK_CTL_CDR2(div_cdr2 - 1) | SUN6I_CLK_CTL_DRS;
-               tfr->effective_speed_hz = mclk_rate / (2 * div_cdr2);
+               /* Ensure that we have a parent clock fast enough */
+               if (mclk_rate < (2 * tfr->speed_hz)) {
+                       clk_set_rate(sspi->mclk, 2 * tfr->speed_hz);
+                       mclk_rate = clk_get_rate(sspi->mclk);
+               }
+
+               /*
+                * Setup clock divider.
+                *
+                * We have two choices there. Either we can use the clock
+                * divide rate 1, which is calculated thanks to this formula:
+                * SPI_CLK = MOD_CLK / (2 ^ cdr)
+                * Or we can use CDR2, which is calculated with the formula:
+                * SPI_CLK = MOD_CLK / (2 * (cdr + 1))
+                * Whether we use the former or the latter is set through the
+                * DRS bit.
+                *
+                * First try CDR2, and if we can't reach the expected
+                * frequency, fall back to CDR1.
+                */
+               div_cdr1 = DIV_ROUND_UP(mclk_rate, tfr->speed_hz);
+               div_cdr2 = DIV_ROUND_UP(div_cdr1, 2);
+               if (div_cdr2 <= (SUN6I_CLK_CTL_CDR2_MASK + 1)) {
+                       reg = SUN6I_CLK_CTL_CDR2(div_cdr2 - 1) | SUN6I_CLK_CTL_DRS;
+                       tfr->effective_speed_hz = mclk_rate / (2 * div_cdr2);
+               } else {
+                       div = min(SUN6I_CLK_CTL_CDR1_MASK, order_base_2(div_cdr1));
+                       reg = SUN6I_CLK_CTL_CDR1(div);
+                       tfr->effective_speed_hz = mclk_rate / (1 << div);
+               }
+
+               sun6i_spi_write(sspi, SUN6I_CLK_CTL_REG, reg);
        } else {
-               div = min(SUN6I_CLK_CTL_CDR1_MASK, order_base_2(div_cdr1));
-               reg = SUN6I_CLK_CTL_CDR1(div);
-               tfr->effective_speed_hz = mclk_rate / (1 << div);
+               clk_set_rate(sspi->mclk, tfr->speed_hz);
+               tfr->effective_speed_hz = clk_get_rate(sspi->mclk);
+
+               /*
+                * Configure work mode.
+                *
+                * There are three work modes depending on the controller clock
+                * frequency:
+                * - normal sample mode           : CLK <= 24MHz SDM=1 SDC=0
+                * - delay half-cycle sample mode : CLK <= 40MHz SDM=0 SDC=0
+                * - delay one-cycle sample mode  : CLK >= 80MHz SDM=0 SDC=1
+                */
+               reg = sun6i_spi_read(sspi, SUN6I_TFR_CTL_REG);
+               reg &= ~(SUN6I_TFR_CTL_SDM | SUN6I_TFR_CTL_SDC);
+
+               if (tfr->effective_speed_hz <= 24000000)
+                       reg |= SUN6I_TFR_CTL_SDM;
+               else if (tfr->effective_speed_hz >= 80000000)
+                       reg |= SUN6I_TFR_CTL_SDC;
+
+               sun6i_spi_write(sspi, SUN6I_TFR_CTL_REG, reg);
        }
 
-       sun6i_spi_write(sspi, SUN6I_CLK_CTL_REG, reg);
        /* Finally enable the bus - doing so before might raise SCK to HIGH */
        reg = sun6i_spi_read(sspi, SUN6I_GBL_CTL_REG);
        reg |= SUN6I_GBL_CTL_BUS_ENABLE;
@@ -410,9 +443,9 @@ static int sun6i_spi_transfer_one(struct spi_master *master,
        reg = SUN6I_INT_CTL_TC;
 
        if (!use_dma) {
-               if (rx_len > sspi->fifo_depth)
+               if (rx_len > sspi->cfg->fifo_depth)
                        reg |= SUN6I_INT_CTL_RF_RDY;
-               if (tx_len > sspi->fifo_depth)
+               if (tx_len > sspi->cfg->fifo_depth)
                        reg |= SUN6I_INT_CTL_TF_ERQ;
        }
 
@@ -422,7 +455,7 @@ static int sun6i_spi_transfer_one(struct spi_master *master,
        reg = sun6i_spi_read(sspi, SUN6I_TFR_CTL_REG);
        sun6i_spi_write(sspi, SUN6I_TFR_CTL_REG, reg | SUN6I_TFR_CTL_XCH);
 
-       tx_time = max(tfr->len * 8 * 2 / (tfr->speed_hz / 1000), 100U);
+       tx_time = spi_controller_xfer_timeout(master, tfr);
        start = jiffies;
        timeout = wait_for_completion_timeout(&sspi->done,
                                              msecs_to_jiffies(tx_time));
@@ -543,7 +576,7 @@ static bool sun6i_spi_can_dma(struct spi_master *master,
         * the fifo length we can just fill the fifo and wait for a single
         * irq, so don't bother setting up dma
         */
-       return xfer->len > sspi->fifo_depth;
+       return xfer->len > sspi->cfg->fifo_depth;
 }
 
 static int sun6i_spi_probe(struct platform_device *pdev)
@@ -582,7 +615,7 @@ static int sun6i_spi_probe(struct platform_device *pdev)
        }
 
        sspi->master = master;
-       sspi->fifo_depth = (unsigned long)of_device_get_match_data(&pdev->dev);
+       sspi->cfg = of_device_get_match_data(&pdev->dev);
 
        master->max_speed_hz = 100 * 1000 * 1000;
        master->min_speed_hz = 3 * 1000;
@@ -695,9 +728,27 @@ static void sun6i_spi_remove(struct platform_device *pdev)
                dma_release_channel(master->dma_rx);
 }
 
+static const struct sun6i_spi_cfg sun6i_a31_spi_cfg = {
+       .fifo_depth     = SUN6I_FIFO_DEPTH,
+       .has_clk_ctl    = true,
+};
+
+static const struct sun6i_spi_cfg sun8i_h3_spi_cfg = {
+       .fifo_depth     = SUN8I_FIFO_DEPTH,
+       .has_clk_ctl    = true,
+};
+
+static const struct sun6i_spi_cfg sun50i_r329_spi_cfg = {
+       .fifo_depth     = SUN8I_FIFO_DEPTH,
+};
+
 static const struct of_device_id sun6i_spi_match[] = {
-       { .compatible = "allwinner,sun6i-a31-spi", .data = (void *)SUN6I_FIFO_DEPTH },
-       { .compatible = "allwinner,sun8i-h3-spi",  .data = (void *)SUN8I_FIFO_DEPTH },
+       { .compatible = "allwinner,sun6i-a31-spi", .data = &sun6i_a31_spi_cfg },
+       { .compatible = "allwinner,sun8i-h3-spi",  .data = &sun8i_h3_spi_cfg },
+       {
+               .compatible = "allwinner,sun50i-r329-spi",
+               .data = &sun50i_r329_spi_cfg
+       },
        {}
 };
 MODULE_DEVICE_TABLE(of, sun6i_spi_match);
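For the older controllers that keep has_clk_ctl, the CDR2-first divider selection above can be followed with concrete numbers. Assuming, purely for illustration, a 24 MHz module clock and a 10 MHz transfer request:

        /* Illustrative values only, not from the patch. */
        div_cdr1 = DIV_ROUND_UP(24000000, 10000000);    /* = 3 */
        div_cdr2 = DIV_ROUND_UP(div_cdr1, 2);           /* = 2, fits in CDR2 */

        /* CDR2 formula: SPI_CLK = MOD_CLK / (2 * (cdr + 1)), with cdr = div_cdr2 - 1 */
        reg = SUN6I_CLK_CTL_CDR2(div_cdr2 - 1) | SUN6I_CLK_CTL_DRS;
        tfr->effective_speed_hz = 24000000 / (2 * div_cdr2);   /* = 6 MHz, never above the request */

For the new sun50i-r329 variant there is no divider register, so the driver instead asks the clock framework for the rate and only picks the sample mode (SDM/SDC bits) from the resulting frequency, as shown in the else branch above.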
index 5d23411..ae6218b 100644 (file)
@@ -241,7 +241,7 @@ static struct i2c_driver spi_xcomm_driver = {
                .name   = "spi-xcomm",
        },
        .id_table       = spi_xcomm_ids,
-       .probe_new      = spi_xcomm_probe,
+       .probe          = spi_xcomm_probe,
 };
 module_i2c_driver(spi_xcomm_driver);
 
index 39d94c8..d13dc15 100644 (file)
@@ -64,7 +64,8 @@ static_assert(N_SPI_MINORS > 0 && N_SPI_MINORS <= 256);
                                | SPI_NO_CS | SPI_READY | SPI_TX_DUAL \
                                | SPI_TX_QUAD | SPI_TX_OCTAL | SPI_RX_DUAL \
                                | SPI_RX_QUAD | SPI_RX_OCTAL \
-                               | SPI_RX_CPHA_FLIP)
+                               | SPI_RX_CPHA_FLIP | SPI_3WIRE_HIZ \
+                               | SPI_MOSI_IDLE_LOW)
 
 struct spidev_data {
        dev_t                   devt;
@@ -237,7 +238,7 @@ static int spidev_message(struct spidev_data *spidev,
                /* Ensure that also following allocations from rx_buf/tx_buf will meet
                 * DMA alignment requirements.
                 */
-               unsigned int len_aligned = ALIGN(u_tmp->len, ARCH_KMALLOC_MINALIGN);
+               unsigned int len_aligned = ALIGN(u_tmp->len, ARCH_DMA_MINALIGN);
 
                k_tmp->len = u_tmp->len;
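The spidev change above aligns the per-transfer kernel buffers to ARCH_DMA_MINALIGN rather than ARCH_KMALLOC_MINALIGN, since the kmalloc minimum alignment can now be smaller than the DMA alignment on some architectures and these buffers may be mapped for DMA by the controller. A rough sketch of the layout math, with an illustrative 64-byte DMA alignment:

        /* Hypothetical 13-byte user transfer, 64-byte ARCH_DMA_MINALIGN: */
        unsigned int len_aligned = ALIGN(13, 64);       /* = 64 */
        /* so the next transfer's buffer starts on its own DMA-safe boundary */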
 
index cc838ff..3c462d6 100644 (file)
@@ -90,7 +90,7 @@ static int iblock_configure_device(struct se_device *dev)
        struct request_queue *q;
        struct block_device *bd = NULL;
        struct blk_integrity *bi;
-       fmode_t mode;
+       blk_mode_t mode = BLK_OPEN_READ;
        unsigned int max_write_zeroes_sectors;
        int ret;
 
@@ -108,13 +108,12 @@ static int iblock_configure_device(struct se_device *dev)
        pr_debug( "IBLOCK: Claiming struct block_device: %s\n",
                        ib_dev->ibd_udev_path);
 
-       mode = FMODE_READ|FMODE_EXCL;
        if (!ib_dev->ibd_readonly)
-               mode |= FMODE_WRITE;
+               mode |= BLK_OPEN_WRITE;
        else
                dev->dev_flags |= DF_READ_ONLY;
 
-       bd = blkdev_get_by_path(ib_dev->ibd_udev_path, mode, ib_dev);
+       bd = blkdev_get_by_path(ib_dev->ibd_udev_path, mode, ib_dev, NULL);
        if (IS_ERR(bd)) {
                ret = PTR_ERR(bd);
                goto out_free_bioset;
@@ -175,7 +174,7 @@ static int iblock_configure_device(struct se_device *dev)
        return 0;
 
 out_blkdev_put:
-       blkdev_put(ib_dev->ibd_bd, FMODE_WRITE|FMODE_READ|FMODE_EXCL);
+       blkdev_put(ib_dev->ibd_bd, ib_dev);
 out_free_bioset:
        bioset_exit(&ib_dev->ibd_bio_set);
 out:
@@ -201,7 +200,7 @@ static void iblock_destroy_device(struct se_device *dev)
        struct iblock_dev *ib_dev = IBLOCK_DEV(dev);
 
        if (ib_dev->ibd_bd != NULL)
-               blkdev_put(ib_dev->ibd_bd, FMODE_WRITE|FMODE_READ|FMODE_EXCL);
+               blkdev_put(ib_dev->ibd_bd, ib_dev);
        bioset_exit(&ib_dev->ibd_bio_set);
 }
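The iblock hunks above and the pscsi ones right below track the block layer's switch from FMODE_* open flags to the dedicated blk_mode_t type; blkdev_put() now identifies the opener by the same holder cookie that was passed at open time rather than by the mode. A condensed sketch of the new pairing, with a placeholder path and holder:

        /* Sketch: "holder" is whatever cookie identifies this opener. */
        blk_mode_t mode = BLK_OPEN_READ | BLK_OPEN_WRITE;
        struct block_device *bd;

        bd = blkdev_get_by_path("/dev/sdX", mode, holder, NULL);
        if (IS_ERR(bd))
                return PTR_ERR(bd);
        /* ... */
        blkdev_put(bd, holder);         /* must match the holder used at open time */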
 
index e742554..0d4f096 100644 (file)
@@ -366,8 +366,8 @@ static int pscsi_create_type_disk(struct se_device *dev, struct scsi_device *sd)
         * Claim exclusive struct block_device access to struct scsi_device
         * for TYPE_DISK and TYPE_ZBC using supplied udev_path
         */
-       bd = blkdev_get_by_path(dev->udev_path,
-                               FMODE_WRITE|FMODE_READ|FMODE_EXCL, pdv);
+       bd = blkdev_get_by_path(dev->udev_path, BLK_OPEN_WRITE | BLK_OPEN_READ,
+                               pdv, NULL);
        if (IS_ERR(bd)) {
                pr_err("pSCSI: blkdev_get_by_path() failed\n");
                scsi_device_put(sd);
@@ -377,7 +377,7 @@ static int pscsi_create_type_disk(struct se_device *dev, struct scsi_device *sd)
 
        ret = pscsi_add_device_to_list(dev, sd);
        if (ret) {
-               blkdev_put(pdv->pdv_bd, FMODE_WRITE|FMODE_READ|FMODE_EXCL);
+               blkdev_put(pdv->pdv_bd, pdv);
                scsi_device_put(sd);
                return ret;
        }
@@ -565,8 +565,7 @@ static void pscsi_destroy_device(struct se_device *dev)
                 */
                if ((sd->type == TYPE_DISK || sd->type == TYPE_ZBC) &&
                    pdv->pdv_bd) {
-                       blkdev_put(pdv->pdv_bd,
-                                  FMODE_WRITE|FMODE_READ|FMODE_EXCL);
+                       blkdev_put(pdv->pdv_bd, pdv);
                        pdv->pdv_bd = NULL;
                }
                /*
index 4cd7ab7..19a4b33 100644 (file)
@@ -130,6 +130,14 @@ config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
          system and device power allocation. This governor can only
          operate on cooling devices that implement the power API.
 
+config THERMAL_DEFAULT_GOV_BANG_BANG
+       bool "bang_bang"
+       depends on THERMAL_GOV_BANG_BANG
+       help
+         Use the bang_bang governor as default. This throttles the
+         devices one step at a time, taking into account the trip
+         point hysteresis.
+
 endchoice
 
 config THERMAL_GOV_FAIR_SHARE
index 3abc2dc..756b218 100644 (file)
@@ -282,8 +282,7 @@ static int amlogic_thermal_probe(struct platform_device *pdev)
                return ret;
        }
 
-       if (devm_thermal_add_hwmon_sysfs(&pdev->dev, pdata->tzd))
-               dev_warn(&pdev->dev, "Failed to add hwmon sysfs attributes\n");
+       devm_thermal_add_hwmon_sysfs(&pdev->dev, pdata->tzd);
 
        ret = amlogic_thermal_initialize(pdata);
        if (ret)
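This hunk, like the imx8mm, imx_sc and k3 ones below, drops the caller-side warning because devm_thermal_add_hwmon_sysfs() now reports its own failure and hwmon registration is treated as best-effort. The call shrinks to a fire-and-forget line, roughly:

        /* Sketch: no return-value check, the helper warns on failure itself. */
        devm_thermal_add_hwmon_sysfs(&pdev->dev, tzd);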
index 0e8dfa6..9f6dc4f 100644 (file)
@@ -231,7 +231,7 @@ static void armada380_init(struct platform_device *pdev,
        regmap_write(priv->syscon, data->syscon_control0_off, reg);
 }
 
-static void armada_ap806_init(struct platform_device *pdev,
+static void armada_ap80x_init(struct platform_device *pdev,
                              struct armada_thermal_priv *priv)
 {
        struct armada_thermal_data *data = priv->data;
@@ -614,7 +614,7 @@ static const struct armada_thermal_data armada380_data = {
 };
 
 static const struct armada_thermal_data armada_ap806_data = {
-       .init = armada_ap806_init,
+       .init = armada_ap80x_init,
        .is_valid_bit = BIT(16),
        .temp_shift = 0,
        .temp_mask = 0x3ff,
@@ -637,6 +637,30 @@ static const struct armada_thermal_data armada_ap806_data = {
        .cpu_nr = 4,
 };
 
+static const struct armada_thermal_data armada_ap807_data = {
+       .init = armada_ap80x_init,
+       .is_valid_bit = BIT(16),
+       .temp_shift = 0,
+       .temp_mask = 0x3ff,
+       .thresh_shift = 3,
+       .hyst_shift = 19,
+       .hyst_mask = 0x3,
+       .coef_b = -128900LL,
+       .coef_m = 394ULL,
+       .coef_div = 1,
+       .inverted = true,
+       .signed_sample = true,
+       .syscon_control0_off = 0x84,
+       .syscon_control1_off = 0x88,
+       .syscon_status_off = 0x8C,
+       .dfx_irq_cause_off = 0x108,
+       .dfx_irq_mask_off = 0x10C,
+       .dfx_overheat_irq = BIT(22),
+       .dfx_server_irq_mask_off = 0x104,
+       .dfx_server_irq_en = BIT(1),
+       .cpu_nr = 4,
+};
+
 static const struct armada_thermal_data armada_cp110_data = {
        .init = armada_cp110_init,
        .is_valid_bit = BIT(10),
@@ -681,6 +705,10 @@ static const struct of_device_id armada_thermal_id_table[] = {
                .data       = &armada_ap806_data,
        },
        {
+               .compatible = "marvell,armada-ap807-thermal",
+               .data       = &armada_ap807_data,
+       },
+       {
                .compatible = "marvell,armada-cp110-thermal",
                .data       = &armada_cp110_data,
        },
index d8005e9..d4b4086 100644 (file)
@@ -343,8 +343,7 @@ static int imx8mm_tmu_probe(struct platform_device *pdev)
                }
                tmu->sensors[i].hw_id = i;
 
-               if (devm_thermal_add_hwmon_sysfs(&pdev->dev, tmu->sensors[i].tzd))
-                       dev_warn(&pdev->dev, "failed to add hwmon sysfs attributes\n");
+               devm_thermal_add_hwmon_sysfs(&pdev->dev, tmu->sensors[i].tzd);
        }
 
        platform_set_drvdata(pdev, tmu);
index 839bb99..8d6b4ef 100644 (file)
@@ -116,8 +116,7 @@ static int imx_sc_thermal_probe(struct platform_device *pdev)
                        return ret;
                }
 
-               if (devm_thermal_add_hwmon_sysfs(&pdev->dev, sensor->tzd))
-                       dev_warn(&pdev->dev, "failed to add hwmon sysfs attributes\n");
+               devm_thermal_add_hwmon_sysfs(&pdev->dev, sensor->tzd);
        }
 
        return 0;
index 01b8033..dc519a6 100644 (file)
@@ -203,6 +203,151 @@ end:
 }
 EXPORT_SYMBOL(acpi_parse_art);
 
+/*
+ * acpi_parse_psvt - parse the Passive Table (PSVT) for passive cooling
+ *
+ * @handle: ACPI handle of the device which contains PSVT
+ * @psvt_count: the number of valid entries resulting from parsing PSVT
+ * @psvtp: pointer to array of psvt entries
+ *
+ */
+static int acpi_parse_psvt(acpi_handle handle, int *psvt_count, struct psvt **psvtp)
+{
+       struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
+       int nr_bad_entries = 0, revision = 0;
+       union acpi_object *p;
+       acpi_status status;
+       int i, result = 0;
+       struct psvt *psvts;
+
+       if (!acpi_has_method(handle, "PSVT"))
+               return -ENODEV;
+
+       status = acpi_evaluate_object(handle, "PSVT", NULL, &buffer);
+       if (ACPI_FAILURE(status))
+               return -ENODEV;
+
+       p = buffer.pointer;
+       if (!p || (p->type != ACPI_TYPE_PACKAGE)) {
+               result = -EFAULT;
+               goto end;
+       }
+
+       /* first package is the revision number */
+       if (p->package.count > 0) {
+               union acpi_object *prev = &(p->package.elements[0]);
+
+               if (prev->type == ACPI_TYPE_INTEGER)
+                       revision = (int)prev->integer.value;
+       } else {
+               result = -EFAULT;
+               goto end;
+       }
+
+       /* Support only version 2 */
+       if (revision != 2) {
+               result = -EFAULT;
+               goto end;
+       }
+
+       *psvt_count = p->package.count - 1;
+       if (!*psvt_count) {
+               result = -EFAULT;
+               goto end;
+       }
+
+       psvts = kcalloc(*psvt_count, sizeof(*psvts), GFP_KERNEL);
+       if (!psvts) {
+               result = -ENOMEM;
+               goto end;
+       }
+
+       /* Start index is 1 because the first package is the revision number */
+       for (i = 1; i < p->package.count; i++) {
+               struct acpi_buffer psvt_int_format = { sizeof("RRNNNNNNNNNN"), "RRNNNNNNNNNN" };
+               struct acpi_buffer psvt_str_format = { sizeof("RRNNNNNSNNNN"), "RRNNNNNSNNNN" };
+               union acpi_object *package = &(p->package.elements[i]);
+               struct psvt *psvt = &psvts[i - 1 - nr_bad_entries];
+               struct acpi_buffer *psvt_format = &psvt_int_format;
+               struct acpi_buffer element = { 0, NULL };
+               union acpi_object *knob;
+               struct acpi_device *res;
+               struct psvt *psvt_ptr;
+
+               element.length = ACPI_ALLOCATE_BUFFER;
+               element.pointer = NULL;
+
+               if (package->package.count >= ACPI_NR_PSVT_ELEMENTS) {
+                       knob = &(package->package.elements[ACPI_PSVT_CONTROL_KNOB]);
+               } else {
+                       nr_bad_entries++;
+                       pr_info("PSVT package %d is invalid, ignored\n", i);
+                       continue;
+               }
+
+               if (knob->type == ACPI_TYPE_STRING) {
+                       psvt_format = &psvt_str_format;
+                       if (knob->string.length > ACPI_LIMIT_STR_MAX_LEN - 1) {
+                               pr_info("PSVT package %d limit string len exceeds max\n", i);
+                               knob->string.length = ACPI_LIMIT_STR_MAX_LEN - 1;
+                       }
+               }
+
+               status = acpi_extract_package(&(p->package.elements[i]), psvt_format, &element);
+               if (ACPI_FAILURE(status)) {
+                       nr_bad_entries++;
+                       pr_info("PSVT package %d is invalid, ignored\n", i);
+                       continue;
+               }
+
+               psvt_ptr = (struct psvt *)element.pointer;
+
+               memcpy(psvt, psvt_ptr, sizeof(*psvt));
+
+               /* The limit element can be string or U64 */
+               psvt->control_knob_type = (u64)knob->type;
+
+               if (knob->type == ACPI_TYPE_STRING) {
+                       memset(&psvt->limit, 0, sizeof(u64));
+                       strncpy(psvt->limit.string, psvt_ptr->limit.str_ptr, knob->string.length);
+               } else {
+                       psvt->limit.integer = psvt_ptr->limit.integer;
+               }
+
+               kfree(element.pointer);
+
+               res = acpi_fetch_acpi_dev(psvt->source);
+               if (!res) {
+                       nr_bad_entries++;
+                       pr_info("Failed to get source ACPI device\n");
+                       continue;
+               }
+
+               res = acpi_fetch_acpi_dev(psvt->target);
+               if (!res) {
+                       nr_bad_entries++;
+                       pr_info("Failed to get target ACPI device\n");
+                       continue;
+               }
+       }
+
+       /* don't count bad entries */
+       *psvt_count -= nr_bad_entries;
+
+       if (!*psvt_count) {
+               result = -EFAULT;
+               kfree(psvts);
+               goto end;
+       }
+
+       *psvtp = psvts;
+
+       return 0;
+
+end:
+       kfree(buffer.pointer);
+       return result;
+}
 
 /* get device name from acpi handle */
 static void get_single_name(acpi_handle handle, char *name)
@@ -289,6 +434,57 @@ free_trt:
        return ret;
 }
 
+static int fill_psvt(char __user *ubuf)
+{
+       int i, ret, count, psvt_len;
+       union psvt_object *psvt_user;
+       struct psvt *psvts;
+
+       ret = acpi_parse_psvt(acpi_thermal_rel_handle, &count, &psvts);
+       if (ret)
+               return ret;
+
+       psvt_len = count * sizeof(*psvt_user);
+
+       psvt_user = kzalloc(psvt_len, GFP_KERNEL);
+       if (!psvt_user) {
+               ret = -ENOMEM;
+               goto free_psvt;
+       }
+
+       /* now fill in user psvt data */
+       for (i = 0; i < count; i++) {
+               /* userspace psvt needs device name instead of acpi reference */
+               get_single_name(psvts[i].source, psvt_user[i].source_device);
+               get_single_name(psvts[i].target, psvt_user[i].target_device);
+
+               psvt_user[i].priority = psvts[i].priority;
+               psvt_user[i].sample_period = psvts[i].sample_period;
+               psvt_user[i].passive_temp = psvts[i].passive_temp;
+               psvt_user[i].source_domain = psvts[i].source_domain;
+               psvt_user[i].control_knob = psvts[i].control_knob;
+               psvt_user[i].step_size = psvts[i].step_size;
+               psvt_user[i].limit_coeff = psvts[i].limit_coeff;
+               psvt_user[i].unlimit_coeff = psvts[i].unlimit_coeff;
+               psvt_user[i].control_knob_type = psvts[i].control_knob_type;
+               if (psvt_user[i].control_knob_type == ACPI_TYPE_STRING)
+                       strncpy(psvt_user[i].limit.string, psvts[i].limit.string,
+                               ACPI_LIMIT_STR_MAX_LEN);
+               else
+                       psvt_user[i].limit.integer = psvts[i].limit.integer;
+
+       }
+
+       if (copy_to_user(ubuf, psvt_user, psvt_len))
+               ret = -EFAULT;
+
+       kfree(psvt_user);
+
+free_psvt:
+       kfree(psvts);
+       return ret;
+}
+
 static long acpi_thermal_rel_ioctl(struct file *f, unsigned int cmd,
                                   unsigned long __arg)
 {
@@ -298,6 +494,7 @@ static long acpi_thermal_rel_ioctl(struct file *f, unsigned int cmd,
        char __user *arg = (void __user *)__arg;
        struct trt *trts = NULL;
        struct art *arts = NULL;
+       struct psvt *psvts;
 
        switch (cmd) {
        case ACPI_THERMAL_GET_TRT_COUNT:
@@ -336,6 +533,27 @@ static long acpi_thermal_rel_ioctl(struct file *f, unsigned int cmd,
        case ACPI_THERMAL_GET_ART:
                return fill_art(arg);
 
+       case ACPI_THERMAL_GET_PSVT_COUNT:
+               ret = acpi_parse_psvt(acpi_thermal_rel_handle, &count, &psvts);
+               if (!ret) {
+                       kfree(psvts);
+                       return put_user(count, (unsigned long __user *)__arg);
+               }
+               return ret;
+
+       case ACPI_THERMAL_GET_PSVT_LEN:
+               /* total length of the data retrieved (count * PSVT entry size) */
+               ret = acpi_parse_psvt(acpi_thermal_rel_handle, &count, &psvts);
+               length = count * sizeof(union psvt_object);
+               if (!ret) {
+                       kfree(psvts);
+                       return put_user(length, (unsigned long __user *)__arg);
+               }
+               return ret;
+
+       case ACPI_THERMAL_GET_PSVT:
+               return fill_psvt(arg);
+
        default:
                return -ENOTTY;
        }
index 78d9424..ac376d8 100644 (file)
 #define ACPI_THERMAL_GET_TRT   _IOR(ACPI_THERMAL_MAGIC, 5, unsigned long)
 #define ACPI_THERMAL_GET_ART   _IOR(ACPI_THERMAL_MAGIC, 6, unsigned long)
 
+/*
+ * ACPI_THERMAL_GET_PSVT_COUNT = Number of PSVT entries
+ * ACPI_THERMAL_GET_PSVT_LEN = Total return data size (PSVT count x each
+ * PSVT entry size)
+ * ACPI_THERMAL_GET_PSVT = Get the data as an array of psvt_objects
+ */
+#define ACPI_THERMAL_GET_PSVT_LEN _IOR(ACPI_THERMAL_MAGIC, 7, unsigned long)
+#define ACPI_THERMAL_GET_PSVT_COUNT _IOR(ACPI_THERMAL_MAGIC, 8, unsigned long)
+#define ACPI_THERMAL_GET_PSVT  _IOR(ACPI_THERMAL_MAGIC, 9, unsigned long)
+
 struct art {
        acpi_handle source;
        acpi_handle target;
@@ -43,6 +53,32 @@ struct trt {
        u64 reserved4;
 } __packed;
 
+#define ACPI_NR_PSVT_ELEMENTS  12
+#define ACPI_PSVT_CONTROL_KNOB 7
+#define ACPI_LIMIT_STR_MAX_LEN 8
+
+struct psvt {
+       acpi_handle source;
+       acpi_handle target;
+       u64 priority;
+       u64 sample_period;
+       u64 passive_temp;
+       u64 source_domain;
+       u64 control_knob;
+       union {
+               /* For limit_type = ACPI_TYPE_INTEGER */
+               u64 integer;
+               /* For limit_type = ACPI_TYPE_STRING */
+               char string[ACPI_LIMIT_STR_MAX_LEN];
+               char *str_ptr;
+       } limit;
+       u64 step_size;
+       u64 limit_coeff;
+       u64 unlimit_coeff;
+       /* Spec calls this field reserved, so we borrow it for type info */
+       u64 control_knob_type; /* ACPI_TYPE_STRING or ACPI_TYPE_INTEGER */
+} __packed;
+
 #define ACPI_NR_ART_ELEMENTS 13
 /* for usrspace */
 union art_object {
@@ -77,6 +113,27 @@ union trt_object {
        u64 __data[8];
 };
 
+union psvt_object {
+       struct {
+               char source_device[8];
+               char target_device[8];
+               u64 priority;
+               u64 sample_period;
+               u64 passive_temp;
+               u64 source_domain;
+               u64 control_knob;
+               union {
+                       u64 integer;
+                       char string[ACPI_LIMIT_STR_MAX_LEN];
+               } limit;
+               u64 step_size;
+               u64 limit_coeff;
+               u64 unlimit_coeff;
+               u64 control_knob_type;
+       };
+       u64 __data[ACPI_NR_PSVT_ELEMENTS];
+};
+
 #ifdef __KERNEL__
 int acpi_thermal_rel_misc_device_add(acpi_handle handle);
 int acpi_thermal_rel_misc_device_remove(acpi_handle handle);
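A minimal userspace sketch of how the three new PSVT ioctls might be driven. Only the ioctl numbers and the union psvt_object layout come from the patch above; the device node name, the error handling, and the assumption that these definitions have been mirrored into the userspace build are all illustrative.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int dump_psvt(void)
{
        unsigned long count = 0, len = 0;
        union psvt_object *objs = NULL;  /* layout mirrored from the header above */
        unsigned long i;
        int fd;

        fd = open("/dev/acpi_thermal_rel", O_RDONLY);   /* assumed node name */
        if (fd < 0)
                return -1;

        /* count first, then total size, then the array itself */
        if (ioctl(fd, ACPI_THERMAL_GET_PSVT_COUNT, &count) ||
            ioctl(fd, ACPI_THERMAL_GET_PSVT_LEN, &len))
                goto err;

        objs = malloc(len);
        if (!objs || ioctl(fd, ACPI_THERMAL_GET_PSVT, objs))
                goto err;

        for (i = 0; i < count; i++)
                printf("%.8s -> %.8s, knob %llu\n", objs[i].source_device,
                       objs[i].target_device,
                       (unsigned long long)objs[i].control_knob);

        free(objs);
        close(fd);
        return 0;
err:
        free(objs);
        close(fd);
        return -1;
}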
index a205221..013f163 100644
@@ -15,8 +15,8 @@ static const struct rapl_mmio_regs rapl_mmio_default = {
        .reg_unit = 0x5938,
        .regs[RAPL_DOMAIN_PACKAGE] = { 0x59a0, 0x593c, 0x58f0, 0, 0x5930},
        .regs[RAPL_DOMAIN_DRAM] = { 0x58e0, 0x58e8, 0x58ec, 0, 0},
-       .limits[RAPL_DOMAIN_PACKAGE] = 2,
-       .limits[RAPL_DOMAIN_DRAM] = 2,
+       .limits[RAPL_DOMAIN_PACKAGE] = BIT(POWER_LIMIT2),
+       .limits[RAPL_DOMAIN_DRAM] = BIT(POWER_LIMIT2),
 };
 
 static int rapl_mmio_cpu_online(unsigned int cpu)
@@ -27,9 +27,9 @@ static int rapl_mmio_cpu_online(unsigned int cpu)
        if (topology_physical_package_id(cpu))
                return 0;
 
-       rp = rapl_find_package_domain(cpu, &rapl_mmio_priv);
+       rp = rapl_find_package_domain(cpu, &rapl_mmio_priv, true);
        if (!rp) {
-               rp = rapl_add_package(cpu, &rapl_mmio_priv);
+               rp = rapl_add_package(cpu, &rapl_mmio_priv, true);
                if (IS_ERR(rp))
                        return PTR_ERR(rp);
        }
@@ -42,7 +42,7 @@ static int rapl_mmio_cpu_down_prep(unsigned int cpu)
        struct rapl_package *rp;
        int lead_cpu;
 
-       rp = rapl_find_package_domain(cpu, &rapl_mmio_priv);
+       rp = rapl_find_package_domain(cpu, &rapl_mmio_priv, true);
        if (!rp)
                return 0;
 
@@ -97,6 +97,7 @@ int proc_thermal_rapl_add(struct pci_dev *pdev, struct proc_thermal_device *proc
                                                rapl_regs->regs[domain][reg];
                rapl_mmio_priv.limits[domain] = rapl_regs->limits[domain];
        }
+       rapl_mmio_priv.type = RAPL_IF_MMIO;
        rapl_mmio_priv.reg_unit = (u64)proc_priv->mmio_base + rapl_regs->reg_unit;
 
        rapl_mmio_priv.read_raw = rapl_mmio_read_raw;
index 7912104..1c3e590 100644
@@ -222,8 +222,7 @@ static int k3_bandgap_probe(struct platform_device *pdev)
                        goto err_alloc;
                }
 
-               if (devm_thermal_add_hwmon_sysfs(dev, data[id].tzd))
-                       dev_warn(dev, "Failed to add hwmon sysfs attributes\n");
+               devm_thermal_add_hwmon_sysfs(dev, data[id].tzd);
        }
 
        platform_set_drvdata(pdev, bgp);
index 0b55288..f59d36d 100644
@@ -1222,12 +1222,7 @@ static int mtk_thermal_probe(struct platform_device *pdev)
                return -ENODEV;
        }
 
-       auxadc_base = devm_of_iomap(&pdev->dev, auxadc, 0, NULL);
-       if (IS_ERR(auxadc_base)) {
-               of_node_put(auxadc);
-               return PTR_ERR(auxadc_base);
-       }
-
+       auxadc_base = of_iomap(auxadc, 0);
        auxadc_phys_base = of_get_phys_base(auxadc);
 
        of_node_put(auxadc);
@@ -1243,12 +1238,7 @@ static int mtk_thermal_probe(struct platform_device *pdev)
                return -ENODEV;
        }
 
-       apmixed_base = devm_of_iomap(&pdev->dev, apmixedsys, 0, NULL);
-       if (IS_ERR(apmixed_base)) {
-               of_node_put(apmixedsys);
-               return PTR_ERR(apmixed_base);
-       }
-
+       apmixed_base = of_iomap(apmixedsys, 0);
        apmixed_phys_base = of_get_phys_base(apmixedsys);
 
        of_node_put(apmixedsys);
index d0a3f95..b693fac 100644
@@ -19,6 +19,8 @@
 #include <linux/thermal.h>
 #include <dt-bindings/thermal/mediatek,lvts-thermal.h>
 
+#include "../thermal_hwmon.h"
+
 #define LVTS_MONCTL0(__base)   (__base + 0x0000)
 #define LVTS_MONCTL1(__base)   (__base + 0x0004)
 #define LVTS_MONCTL2(__base)   (__base + 0x0008)
@@ -996,6 +998,8 @@ static int lvts_ctrl_start(struct device *dev, struct lvts_ctrl *lvts_ctrl)
                        return PTR_ERR(tz);
                }
 
+               devm_thermal_add_hwmon_sysfs(dev, tz);
+
                /*
                 * The thermal zone pointer will be needed in the
                 * interrupt handler, we store it in the sensor
index 5749149..5ddc39b 100644
@@ -689,9 +689,7 @@ static int adc_tm5_register_tzd(struct adc_tm5_chip *adc_tm)
                        return PTR_ERR(tzd);
                }
                adc_tm->channels[i].tzd = tzd;
-               if (devm_thermal_add_hwmon_sysfs(adc_tm->dev, tzd))
-                       dev_warn(adc_tm->dev,
-                                "Failed to add hwmon sysfs attributes\n");
+               devm_thermal_add_hwmon_sysfs(adc_tm->dev, tzd);
        }
 
        return 0;
index 0f88e98..0e8ebfc 100644
@@ -411,22 +411,19 @@ static int qpnp_tm_probe(struct platform_device *pdev)
        chip->base = res;
 
        ret = qpnp_tm_read(chip, QPNP_TM_REG_TYPE, &type);
-       if (ret < 0) {
-               dev_err(&pdev->dev, "could not read type\n");
-               return ret;
-       }
+       if (ret < 0)
+               return dev_err_probe(&pdev->dev, ret,
+                                    "could not read type\n");
 
        ret = qpnp_tm_read(chip, QPNP_TM_REG_SUBTYPE, &subtype);
-       if (ret < 0) {
-               dev_err(&pdev->dev, "could not read subtype\n");
-               return ret;
-       }
+       if (ret < 0)
+               return dev_err_probe(&pdev->dev, ret,
+                                    "could not read subtype\n");
 
        ret = qpnp_tm_read(chip, QPNP_TM_REG_DIG_MAJOR, &dig_major);
-       if (ret < 0) {
-               dev_err(&pdev->dev, "could not read dig_major\n");
-               return ret;
-       }
+       if (ret < 0)
+               return dev_err_probe(&pdev->dev, ret,
+                                    "could not read dig_major\n");
 
        if (type != QPNP_TM_TYPE || (subtype != QPNP_TM_SUBTYPE_GEN1
                                     && subtype != QPNP_TM_SUBTYPE_GEN2)) {
@@ -448,20 +445,15 @@ static int qpnp_tm_probe(struct platform_device *pdev)
         */
        chip->tz_dev = devm_thermal_of_zone_register(
                &pdev->dev, 0, chip, &qpnp_tm_sensor_ops);
-       if (IS_ERR(chip->tz_dev)) {
-               dev_err(&pdev->dev, "failed to register sensor\n");
-               return PTR_ERR(chip->tz_dev);
-       }
+       if (IS_ERR(chip->tz_dev))
+               return dev_err_probe(&pdev->dev, PTR_ERR(chip->tz_dev),
+                                    "failed to register sensor\n");
 
        ret = qpnp_tm_init(chip);
-       if (ret < 0) {
-               dev_err(&pdev->dev, "init failed\n");
-               return ret;
-       }
+       if (ret < 0)
+               return dev_err_probe(&pdev->dev, ret, "init failed\n");
 
-       if (devm_thermal_add_hwmon_sysfs(&pdev->dev, chip->tz_dev))
-               dev_warn(&pdev->dev,
-                        "Failed to add hwmon sysfs attributes\n");
+       devm_thermal_add_hwmon_sysfs(&pdev->dev, chip->tz_dev);
 
        ret = devm_request_threaded_irq(&pdev->dev, irq, NULL, qpnp_tm_isr,
                                        IRQF_ONESHOT, node->name, chip);
index e89c6f3..a941b42 100644
@@ -39,26 +39,6 @@ struct tsens_legacy_calibration_format tsens_8916_nvmem = {
        },
 };
 
-struct tsens_legacy_calibration_format tsens_8939_nvmem = {
-       .base_len = 8,
-       .base_shift = 2,
-       .sp_len = 6,
-       .mode = { 12, 0 },
-       .invalid = { 12, 2 },
-       .base = { { 0, 0 }, { 1, 24 } },
-       .sp = {
-               { { 12, 3 },  { 12, 9 } },
-               { { 12, 15 }, { 12, 21 } },
-               { { 12, 27 }, { 13, 1 } },
-               { { 13, 7 },  { 13, 13 } },
-               { { 13, 19 }, { 13, 25 } },
-               { { 0, 8 },   { 0, 14 } },
-               { { 0, 20 },  { 0, 26 } },
-               { { 1, 0 },   { 1, 6 } },
-               { { 1, 12 },  { 1, 18 } },
-       },
-};
-
 struct tsens_legacy_calibration_format tsens_8974_nvmem = {
        .base_len = 8,
        .base_shift = 2,
@@ -103,22 +83,6 @@ struct tsens_legacy_calibration_format tsens_8974_backup_nvmem = {
        },
 };
 
-struct tsens_legacy_calibration_format tsens_9607_nvmem = {
-       .base_len = 8,
-       .base_shift = 2,
-       .sp_len = 6,
-       .mode = { 2, 20 },
-       .invalid = { 2, 22 },
-       .base = { { 0, 0 }, { 2, 12 } },
-       .sp = {
-               { { 0, 8 },  { 0, 14 } },
-               { { 0, 20 }, { 0, 26 } },
-               { { 1, 0 },  { 1, 6 } },
-               { { 1, 12 }, { 1, 18 } },
-               { { 2, 0 },  { 2, 6 } },
-       },
-};
-
 static int calibrate_8916(struct tsens_priv *priv)
 {
        u32 p1[5], p2[5];
@@ -243,6 +207,39 @@ static int calibrate_8974(struct tsens_priv *priv)
        return 0;
 }
 
+static int __init init_8226(struct tsens_priv *priv)
+{
+       priv->sensor[0].slope = 2901;
+       priv->sensor[1].slope = 2846;
+       priv->sensor[2].slope = 3038;
+       priv->sensor[3].slope = 2955;
+       priv->sensor[4].slope = 2901;
+       priv->sensor[5].slope = 2846;
+
+       return init_common(priv);
+}
+
+static int __init init_8909(struct tsens_priv *priv)
+{
+       int i;
+
+       for (i = 0; i < priv->num_sensors; ++i)
+               priv->sensor[i].slope = 3000;
+
+       priv->sensor[0].p1_calib_offset = 0;
+       priv->sensor[0].p2_calib_offset = 0;
+       priv->sensor[1].p1_calib_offset = -10;
+       priv->sensor[1].p2_calib_offset = -6;
+       priv->sensor[2].p1_calib_offset = 0;
+       priv->sensor[2].p2_calib_offset = 0;
+       priv->sensor[3].p1_calib_offset = -9;
+       priv->sensor[3].p2_calib_offset = -9;
+       priv->sensor[4].p1_calib_offset = -8;
+       priv->sensor[4].p2_calib_offset = -10;
+
+       return init_common(priv);
+}
+
 static int __init init_8939(struct tsens_priv *priv) {
        priv->sensor[0].slope = 2911;
        priv->sensor[1].slope = 2789;
@@ -258,7 +255,28 @@ static int __init init_8939(struct tsens_priv *priv) {
        return init_common(priv);
 }
 
-/* v0.1: 8916, 8939, 8974, 9607 */
+static int __init init_9607(struct tsens_priv *priv)
+{
+       int i;
+
+       for (i = 0; i < priv->num_sensors; ++i)
+               priv->sensor[i].slope = 3000;
+
+       priv->sensor[0].p1_calib_offset = 1;
+       priv->sensor[0].p2_calib_offset = 1;
+       priv->sensor[1].p1_calib_offset = -4;
+       priv->sensor[1].p2_calib_offset = -2;
+       priv->sensor[2].p1_calib_offset = 4;
+       priv->sensor[2].p2_calib_offset = 8;
+       priv->sensor[3].p1_calib_offset = -3;
+       priv->sensor[3].p2_calib_offset = -5;
+       priv->sensor[4].p1_calib_offset = -4;
+       priv->sensor[4].p2_calib_offset = -4;
+
+       return init_common(priv);
+}
+
+/* v0.1: 8226, 8909, 8916, 8939, 8974, 9607 */
 
 static struct tsens_features tsens_v0_1_feat = {
        .ver_major      = VER_0_1,
@@ -313,6 +331,32 @@ static const struct tsens_ops ops_v0_1 = {
        .get_temp       = get_temp_common,
 };
 
+static const struct tsens_ops ops_8226 = {
+       .init           = init_8226,
+       .calibrate      = tsens_calibrate_common,
+       .get_temp       = get_temp_common,
+};
+
+struct tsens_plat_data data_8226 = {
+       .num_sensors    = 6,
+       .ops            = &ops_8226,
+       .feat           = &tsens_v0_1_feat,
+       .fields = tsens_v0_1_regfields,
+};
+
+static const struct tsens_ops ops_8909 = {
+       .init           = init_8909,
+       .calibrate      = tsens_calibrate_common,
+       .get_temp       = get_temp_common,
+};
+
+struct tsens_plat_data data_8909 = {
+       .num_sensors    = 5,
+       .ops            = &ops_8909,
+       .feat           = &tsens_v0_1_feat,
+       .fields = tsens_v0_1_regfields,
+};
+
 static const struct tsens_ops ops_8916 = {
        .init           = init_common,
        .calibrate      = calibrate_8916,
@@ -356,9 +400,15 @@ struct tsens_plat_data data_8974 = {
        .fields = tsens_v0_1_regfields,
 };
 
+static const struct tsens_ops ops_9607 = {
+       .init           = init_9607,
+       .calibrate      = tsens_calibrate_common,
+       .get_temp       = get_temp_common,
+};
+
 struct tsens_plat_data data_9607 = {
        .num_sensors    = 5,
-       .ops            = &ops_v0_1,
+       .ops            = &ops_9607,
        .feat           = &tsens_v0_1_feat,
        .fields = tsens_v0_1_regfields,
 };
index b822a42..5132243 100644
@@ -42,28 +42,6 @@ struct tsens_legacy_calibration_format tsens_qcs404_nvmem = {
        },
 };
 
-struct tsens_legacy_calibration_format tsens_8976_nvmem = {
-       .base_len = 8,
-       .base_shift = 2,
-       .sp_len = 6,
-       .mode = { 4, 0 },
-       .invalid = { 4, 2 },
-       .base = { { 0, 0 }, { 2, 8 } },
-       .sp = {
-               { { 0, 8 },  { 0, 14 } },
-               { { 0, 20 }, { 0, 26 } },
-               { { 1, 0 },  { 1, 6 } },
-               { { 1, 12 }, { 1, 18 } },
-               { { 2, 8 },  { 2, 14 } },
-               { { 2, 20 }, { 2, 26 } },
-               { { 3, 0 },  { 3, 6 } },
-               { { 3, 12 }, { 3, 18 } },
-               { { 4, 2 },  { 4, 9 } },
-               { { 4, 14 }, { 4, 21 } },
-               { { 4, 26 }, { 5, 1 } },
-       },
-};
-
 static int calibrate_v1(struct tsens_priv *priv)
 {
        u32 p1[10], p2[10];
index d321812..98c356a 100644
@@ -134,10 +134,12 @@ int tsens_read_calibration(struct tsens_priv *priv, int shift, u32 *p1, u32 *p2,
                        p1[i] = p1[i] + (base1 << shift);
                break;
        case TWO_PT_CALIB:
+       case TWO_PT_CALIB_NO_OFFSET:
                for (i = 0; i < priv->num_sensors; i++)
                        p2[i] = (p2[i] + base2) << shift;
                fallthrough;
        case ONE_PT_CALIB2:
+       case ONE_PT_CALIB2_NO_OFFSET:
                for (i = 0; i < priv->num_sensors; i++)
                        p1[i] = (p1[i] + base1) << shift;
                break;
@@ -149,6 +151,18 @@ int tsens_read_calibration(struct tsens_priv *priv, int shift, u32 *p1, u32 *p2,
                }
        }
 
+       /* Apply calibration offset workaround except for _NO_OFFSET modes */
+       switch (mode) {
+       case TWO_PT_CALIB:
+               for (i = 0; i < priv->num_sensors; i++)
+                       p2[i] += priv->sensor[i].p2_calib_offset;
+               fallthrough;
+       case ONE_PT_CALIB2:
+               for (i = 0; i < priv->num_sensors; i++)
+                       p1[i] += priv->sensor[i].p1_calib_offset;
+               break;
+       }
+
        return mode;
 }
 
@@ -254,7 +268,7 @@ void compute_intercept_slope(struct tsens_priv *priv, u32 *p1,
 
                if (!priv->sensor[i].slope)
                        priv->sensor[i].slope = SLOPE_DEFAULT;
-               if (mode == TWO_PT_CALIB) {
+               if (mode == TWO_PT_CALIB || mode == TWO_PT_CALIB_NO_OFFSET) {
                        /*
                         * slope (m) = adc_code2 - adc_code1 (y2 - y1)/
                         *      temp_120_degc - temp_30_degc (x2 - x1)
@@ -1096,6 +1110,12 @@ static const struct of_device_id tsens_table[] = {
                .compatible = "qcom,mdm9607-tsens",
                .data = &data_9607,
        }, {
+               .compatible = "qcom,msm8226-tsens",
+               .data = &data_8226,
+       }, {
+               .compatible = "qcom,msm8909-tsens",
+               .data = &data_8909,
+       }, {
                .compatible = "qcom,msm8916-tsens",
                .data = &data_8916,
        }, {
@@ -1189,9 +1209,7 @@ static int tsens_register(struct tsens_priv *priv)
                if (priv->ops->enable)
                        priv->ops->enable(priv, i);
 
-               if (devm_thermal_add_hwmon_sysfs(priv->dev, tzd))
-                       dev_warn(priv->dev,
-                                "Failed to add hwmon sysfs attributes\n");
+               devm_thermal_add_hwmon_sysfs(priv->dev, tzd);
        }
 
        /* VER_0 require to set MIN and MAX THRESH
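Worked example of how the new per-sensor calibration offsets combine with the pre-existing base/shift adjustment in tsens_read_calibration(); the raw fuse values are made up, the -10 offset is the msm8909 sensor-1 value added above, and the two new *_NO_OFFSET modes simply skip the second step.

        u32 p1 = 25, base1 = 500;       /* hypothetical fused values */
        int shift = 2;

        p1 = (p1 + base1) << shift;     /* existing ONE_PT_CALIB2/TWO_PT_CALIB step: 2100 */
        p1 += -10;                      /* new p1_calib_offset for msm8909 sensor 1: 2090 */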
index dba9cd3..2805de1 100644
@@ -10,6 +10,8 @@
 #define ONE_PT_CALIB           0x1
 #define ONE_PT_CALIB2          0x2
 #define TWO_PT_CALIB           0x3
+#define ONE_PT_CALIB2_NO_OFFSET        0x6
+#define TWO_PT_CALIB_NO_OFFSET 0x7
 #define CAL_DEGC_PT1           30
 #define CAL_DEGC_PT2           120
 #define SLOPE_FACTOR           1000
@@ -57,6 +59,8 @@ struct tsens_sensor {
        unsigned int                    hw_id;
        int                             slope;
        u32                             status;
+       int                             p1_calib_offset;
+       int                             p2_calib_offset;
 };
 
 /**
@@ -635,7 +639,7 @@ int get_temp_common(const struct tsens_sensor *s, int *temp);
 extern struct tsens_plat_data data_8960;
 
 /* TSENS v0.1 targets */
-extern struct tsens_plat_data data_8916, data_8939, data_8974, data_9607;
+extern struct tsens_plat_data data_8226, data_8909, data_8916, data_8939, data_8974, data_9607;
 
 /* TSENS v1 targets */
 extern struct tsens_plat_data data_tsens_v1, data_8976, data_8956;
index e587563..ccc2eea 100644
@@ -31,7 +31,6 @@
 #define TMR_DISABLE    0x0
 #define TMR_ME         0x80000000
 #define TMR_ALPF       0x0c000000
-#define TMR_MSITE_ALL  GENMASK(15, 0)
 
 #define REGS_TMTMIR    0x008   /* Temperature measurement interval Register */
 #define TMTMIR_DEFAULT 0x0000000f
@@ -51,6 +50,7 @@
                                            * Site Register
                                            */
 #define TRITSR_V       BIT(31)
+#define TRITSR_TP5     BIT(9)
 #define REGS_V2_TMSAR(n)       (0x304 + 16 * (n))      /* TMU monitoring
                                                * site adjustment register
                                                */
@@ -105,6 +105,11 @@ static int tmu_get_temp(struct thermal_zone_device *tz, int *temp)
         * within sensor range. TEMP is an 9 bit value representing
         * temperature in KelVin.
         */
+
+       regmap_read(qdata->regmap, REGS_TMR, &val);
+       if (!(val & TMR_ME))
+               return -EAGAIN;
+
        if (regmap_read_poll_timeout(qdata->regmap,
                                     REGS_TRITSR(qsensor->id),
                                     val,
@@ -113,10 +118,15 @@ static int tmu_get_temp(struct thermal_zone_device *tz, int *temp)
                                     10 * USEC_PER_MSEC))
                return -ENODATA;
 
-       if (qdata->ver == TMU_VER1)
+       if (qdata->ver == TMU_VER1) {
                *temp = (val & GENMASK(7, 0)) * MILLIDEGREE_PER_DEGREE;
-       else
-               *temp = kelvin_to_millicelsius(val & GENMASK(8, 0));
+       } else {
+               if (val & TRITSR_TP5)
+                       *temp = milli_kelvin_to_millicelsius((val & GENMASK(8, 0)) *
+                                                            MILLIDEGREE_PER_DEGREE + 500);
+               else
+                       *temp = kelvin_to_millicelsius(val & GENMASK(8, 0));
+       }
 
        return 0;
 }
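Worked example of the new TMU v2 decode path; the register value is made up, while the helpers are the ones used above. TRITSR_TP5 indicates an extra half degree on top of the integer kelvin field in bits 8:0.

        u32 val = TRITSR_TP5 | 300;     /* hypothetical reading: 300.5 K */
        int temp;

        if (val & TRITSR_TP5)
                temp = milli_kelvin_to_millicelsius((val & GENMASK(8, 0)) *
                                                    MILLIDEGREE_PER_DEGREE + 500);
        else
                temp = kelvin_to_millicelsius(val & GENMASK(8, 0));
        /* temp == 27350, i.e. 27.35 degrees Celsius */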
@@ -128,15 +138,7 @@ static const struct thermal_zone_device_ops tmu_tz_ops = {
 static int qoriq_tmu_register_tmu_zone(struct device *dev,
                                       struct qoriq_tmu_data *qdata)
 {
-       int id;
-
-       if (qdata->ver == TMU_VER1) {
-               regmap_write(qdata->regmap, REGS_TMR,
-                            TMR_MSITE_ALL | TMR_ME | TMR_ALPF);
-       } else {
-               regmap_write(qdata->regmap, REGS_V2_TMSR, TMR_MSITE_ALL);
-               regmap_write(qdata->regmap, REGS_TMR, TMR_ME | TMR_ALPF_V2);
-       }
+       int id, sites = 0;
 
        for (id = 0; id < SITES_MAX; id++) {
                struct thermal_zone_device *tzd;
@@ -153,14 +155,24 @@ static int qoriq_tmu_register_tmu_zone(struct device *dev,
                        if (ret == -ENODEV)
                                continue;
 
-                       regmap_write(qdata->regmap, REGS_TMR, TMR_DISABLE);
                        return ret;
                }
 
-               if (devm_thermal_add_hwmon_sysfs(dev, tzd))
-                       dev_warn(dev,
-                                "Failed to add hwmon sysfs attributes\n");
+               if (qdata->ver == TMU_VER1)
+                       sites |= 0x1 << (15 - id);
+               else
+                       sites |= 0x1 << id;
+
+               devm_thermal_add_hwmon_sysfs(dev, tzd);
+       }
 
+       if (sites) {
+               if (qdata->ver == TMU_VER1) {
+                       regmap_write(qdata->regmap, REGS_TMR, TMR_ME | TMR_ALPF | sites);
+               } else {
+                       regmap_write(qdata->regmap, REGS_V2_TMSR, sites);
+                       regmap_write(qdata->regmap, REGS_TMR, TMR_ME | TMR_ALPF_V2);
+               }
        }
 
        return 0;
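Small illustration of the per-version site mask built above (the registered sensor IDs are hypothetical): TMU v1 maps site n to bit (15 - n) of TMR, while v2 uses bit n of TMSR.

        /* sensors 0 and 1 registered successfully */
        u32 sites_v1 = BIT(15 - 0) | BIT(15 - 1);       /* 0xc000, ORed into TMR with TMR_ME | TMR_ALPF */
        u32 sites_v2 = BIT(0) | BIT(1);                 /* 0x0003, written to REGS_V2_TMSR */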
@@ -208,8 +220,6 @@ static int qoriq_tmu_calibration(struct device *dev,
 
 static void qoriq_tmu_init_device(struct qoriq_tmu_data *data)
 {
-       int i;
-
        /* Disable interrupt, using polling instead */
        regmap_write(data->regmap, REGS_TIER, TIER_DISABLE);
 
@@ -220,8 +230,6 @@ static void qoriq_tmu_init_device(struct qoriq_tmu_data *data)
        } else {
                regmap_write(data->regmap, REGS_V2_TMTMIR, TMTMIR_DEFAULT);
                regmap_write(data->regmap, REGS_V2_TEUMR(0), TEUMR0_V2);
-               for (i = 0; i < SITES_MAX; i++)
-                       regmap_write(data->regmap, REGS_V2_TMSAR(i), TMSARA_V2);
        }
 
        /* Disable monitoring */
@@ -230,7 +238,7 @@ static void qoriq_tmu_init_device(struct qoriq_tmu_data *data)
 
 static const struct regmap_range qoriq_yes_ranges[] = {
        regmap_reg_range(REGS_TMR, REGS_TSCFGR),
-       regmap_reg_range(REGS_TTRnCR(0), REGS_TTRnCR(3)),
+       regmap_reg_range(REGS_TTRnCR(0), REGS_TTRnCR(15)),
        regmap_reg_range(REGS_V2_TEUMR(0), REGS_V2_TEUMR(2)),
        regmap_reg_range(REGS_V2_TMSAR(0), REGS_V2_TMSAR(15)),
        regmap_reg_range(REGS_IPBRR(0), REGS_IPBRR(1)),
index 42a4724..9029d01 100644
 #define REG_GEN3_PTAT2         0x60
 #define REG_GEN3_PTAT3         0x64
 #define REG_GEN3_THSCP         0x68
+#define REG_GEN4_THSFMON00     0x180
+#define REG_GEN4_THSFMON01     0x184
+#define REG_GEN4_THSFMON02     0x188
+#define REG_GEN4_THSFMON15     0x1BC
+#define REG_GEN4_THSFMON16     0x1C0
+#define REG_GEN4_THSFMON17     0x1C4
 
 /* IRQ{STR,MSK,EN} bits */
 #define IRQ_TEMP1              BIT(0)
@@ -55,6 +61,7 @@
 
 #define MCELSIUS(temp) ((temp) * 1000)
 #define GEN3_FUSE_MASK 0xFFF
+#define GEN4_FUSE_MASK 0xFFF
 
 #define TSC_MAX_NUM    5
 
@@ -66,6 +73,13 @@ struct equation_coefs {
        int b2;
 };
 
+struct rcar_gen3_thermal_priv;
+
+struct rcar_thermal_info {
+       int ths_tj_1;
+       void (*read_fuses)(struct rcar_gen3_thermal_priv *priv);
+};
+
 struct rcar_gen3_thermal_tsc {
        void __iomem *base;
        struct thermal_zone_device *zone;
@@ -79,6 +93,7 @@ struct rcar_gen3_thermal_priv {
        struct thermal_zone_device_ops ops;
        unsigned int num_tscs;
        int ptat[3];
+       const struct rcar_thermal_info *info;
 };
 
 static inline u32 rcar_gen3_thermal_read(struct rcar_gen3_thermal_tsc *tsc,
@@ -236,6 +251,62 @@ static irqreturn_t rcar_gen3_thermal_irq(int irq, void *data)
        return IRQ_HANDLED;
 }
 
+static void rcar_gen3_thermal_read_fuses_gen3(struct rcar_gen3_thermal_priv *priv)
+{
+       unsigned int i;
+
+       /*
+        * Set the pseudo calibration points with fused values.
+        * PTAT is shared between all TSCs but only fused for the first
+        * TSC while THCODEs are fused for each TSC.
+        */
+       priv->ptat[0] = rcar_gen3_thermal_read(priv->tscs[0], REG_GEN3_PTAT1) &
+               GEN3_FUSE_MASK;
+       priv->ptat[1] = rcar_gen3_thermal_read(priv->tscs[0], REG_GEN3_PTAT2) &
+               GEN3_FUSE_MASK;
+       priv->ptat[2] = rcar_gen3_thermal_read(priv->tscs[0], REG_GEN3_PTAT3) &
+               GEN3_FUSE_MASK;
+
+       for (i = 0; i < priv->num_tscs; i++) {
+               struct rcar_gen3_thermal_tsc *tsc = priv->tscs[i];
+
+               tsc->thcode[0] = rcar_gen3_thermal_read(tsc, REG_GEN3_THCODE1) &
+                       GEN3_FUSE_MASK;
+               tsc->thcode[1] = rcar_gen3_thermal_read(tsc, REG_GEN3_THCODE2) &
+                       GEN3_FUSE_MASK;
+               tsc->thcode[2] = rcar_gen3_thermal_read(tsc, REG_GEN3_THCODE3) &
+                       GEN3_FUSE_MASK;
+       }
+}
+
+static void rcar_gen3_thermal_read_fuses_gen4(struct rcar_gen3_thermal_priv *priv)
+{
+       unsigned int i;
+
+       /*
+        * Set the pseudo calibration points with fused values.
+        * PTAT is shared between all TSCs but only fused for the first
+        * TSC while THCODEs are fused for each TSC.
+        */
+       priv->ptat[0] = rcar_gen3_thermal_read(priv->tscs[0], REG_GEN4_THSFMON16) &
+               GEN4_FUSE_MASK;
+       priv->ptat[1] = rcar_gen3_thermal_read(priv->tscs[0], REG_GEN4_THSFMON17) &
+               GEN4_FUSE_MASK;
+       priv->ptat[2] = rcar_gen3_thermal_read(priv->tscs[0], REG_GEN4_THSFMON15) &
+               GEN4_FUSE_MASK;
+
+       for (i = 0; i < priv->num_tscs; i++) {
+               struct rcar_gen3_thermal_tsc *tsc = priv->tscs[i];
+
+               tsc->thcode[0] = rcar_gen3_thermal_read(tsc, REG_GEN4_THSFMON01) &
+                       GEN4_FUSE_MASK;
+               tsc->thcode[1] = rcar_gen3_thermal_read(tsc, REG_GEN4_THSFMON02) &
+                       GEN4_FUSE_MASK;
+               tsc->thcode[2] = rcar_gen3_thermal_read(tsc, REG_GEN4_THSFMON00) &
+                       GEN4_FUSE_MASK;
+       }
+}
+
 static bool rcar_gen3_thermal_read_fuses(struct rcar_gen3_thermal_priv *priv)
 {
        unsigned int i;
@@ -243,7 +314,8 @@ static bool rcar_gen3_thermal_read_fuses(struct rcar_gen3_thermal_priv *priv)
 
        /* If fuses are not set, fallback to pseudo values. */
        thscp = rcar_gen3_thermal_read(priv->tscs[0], REG_GEN3_THSCP);
-       if ((thscp & THSCP_COR_PARA_VLD) != THSCP_COR_PARA_VLD) {
+       if (!priv->info->read_fuses ||
+           (thscp & THSCP_COR_PARA_VLD) != THSCP_COR_PARA_VLD) {
                /* Default THCODE values in case FUSEs are not set. */
                static const int thcodes[TSC_MAX_NUM][3] = {
                        { 3397, 2800, 2221 },
@@ -268,29 +340,7 @@ static bool rcar_gen3_thermal_read_fuses(struct rcar_gen3_thermal_priv *priv)
                return false;
        }
 
-       /*
-        * Set the pseudo calibration points with fused values.
-        * PTAT is shared between all TSCs but only fused for the first
-        * TSC while THCODEs are fused for each TSC.
-        */
-       priv->ptat[0] = rcar_gen3_thermal_read(priv->tscs[0], REG_GEN3_PTAT1) &
-               GEN3_FUSE_MASK;
-       priv->ptat[1] = rcar_gen3_thermal_read(priv->tscs[0], REG_GEN3_PTAT2) &
-               GEN3_FUSE_MASK;
-       priv->ptat[2] = rcar_gen3_thermal_read(priv->tscs[0], REG_GEN3_PTAT3) &
-               GEN3_FUSE_MASK;
-
-       for (i = 0; i < priv->num_tscs; i++) {
-               struct rcar_gen3_thermal_tsc *tsc = priv->tscs[i];
-
-               tsc->thcode[0] = rcar_gen3_thermal_read(tsc, REG_GEN3_THCODE1) &
-                       GEN3_FUSE_MASK;
-               tsc->thcode[1] = rcar_gen3_thermal_read(tsc, REG_GEN3_THCODE2) &
-                       GEN3_FUSE_MASK;
-               tsc->thcode[2] = rcar_gen3_thermal_read(tsc, REG_GEN3_THCODE3) &
-                       GEN3_FUSE_MASK;
-       }
-
+       priv->info->read_fuses(priv);
        return true;
 }
 
@@ -318,52 +368,65 @@ static void rcar_gen3_thermal_init(struct rcar_gen3_thermal_priv *priv,
        usleep_range(1000, 2000);
 }
 
-static const int rcar_gen3_ths_tj_1 = 126;
-static const int rcar_gen3_ths_tj_1_m3_w = 116;
+static const struct rcar_thermal_info rcar_m3w_thermal_info = {
+       .ths_tj_1 = 116,
+       .read_fuses = rcar_gen3_thermal_read_fuses_gen3,
+};
+
+static const struct rcar_thermal_info rcar_gen3_thermal_info = {
+       .ths_tj_1 = 126,
+       .read_fuses = rcar_gen3_thermal_read_fuses_gen3,
+};
+
+static const struct rcar_thermal_info rcar_gen4_thermal_info = {
+       .ths_tj_1 = 126,
+       .read_fuses = rcar_gen3_thermal_read_fuses_gen4,
+};
+
 static const struct of_device_id rcar_gen3_thermal_dt_ids[] = {
        {
                .compatible = "renesas,r8a774a1-thermal",
-               .data = &rcar_gen3_ths_tj_1_m3_w,
+               .data = &rcar_m3w_thermal_info,
        },
        {
                .compatible = "renesas,r8a774b1-thermal",
-               .data = &rcar_gen3_ths_tj_1,
+               .data = &rcar_gen3_thermal_info,
        },
        {
                .compatible = "renesas,r8a774e1-thermal",
-               .data = &rcar_gen3_ths_tj_1,
+               .data = &rcar_gen3_thermal_info,
        },
        {
                .compatible = "renesas,r8a7795-thermal",
-               .data = &rcar_gen3_ths_tj_1,
+               .data = &rcar_gen3_thermal_info,
        },
        {
                .compatible = "renesas,r8a7796-thermal",
-               .data = &rcar_gen3_ths_tj_1_m3_w,
+               .data = &rcar_m3w_thermal_info,
        },
        {
                .compatible = "renesas,r8a77961-thermal",
-               .data = &rcar_gen3_ths_tj_1_m3_w,
+               .data = &rcar_m3w_thermal_info,
        },
        {
                .compatible = "renesas,r8a77965-thermal",
-               .data = &rcar_gen3_ths_tj_1,
+               .data = &rcar_gen3_thermal_info,
        },
        {
                .compatible = "renesas,r8a77980-thermal",
-               .data = &rcar_gen3_ths_tj_1,
+               .data = &rcar_gen3_thermal_info,
        },
        {
                .compatible = "renesas,r8a779a0-thermal",
-               .data = &rcar_gen3_ths_tj_1,
+               .data = &rcar_gen3_thermal_info,
        },
        {
                .compatible = "renesas,r8a779f0-thermal",
-               .data = &rcar_gen3_ths_tj_1,
+               .data = &rcar_gen4_thermal_info,
        },
        {
                .compatible = "renesas,r8a779g0-thermal",
-               .data = &rcar_gen3_ths_tj_1,
+               .data = &rcar_gen4_thermal_info,
        },
        {},
 };
@@ -418,7 +481,6 @@ static int rcar_gen3_thermal_probe(struct platform_device *pdev)
 {
        struct rcar_gen3_thermal_priv *priv;
        struct device *dev = &pdev->dev;
-       const int *ths_tj_1 = of_device_get_match_data(dev);
        struct resource *res;
        struct thermal_zone_device *zone;
        unsigned int i;
@@ -430,6 +492,7 @@ static int rcar_gen3_thermal_probe(struct platform_device *pdev)
 
        priv->ops = rcar_gen3_tz_of_ops;
 
+       priv->info = of_device_get_match_data(dev);
        platform_set_drvdata(pdev, priv);
 
        if (rcar_gen3_thermal_request_irqs(priv, pdev))
@@ -469,7 +532,7 @@ static int rcar_gen3_thermal_probe(struct platform_device *pdev)
                struct rcar_gen3_thermal_tsc *tsc = priv->tscs[i];
 
                rcar_gen3_thermal_init(priv, tsc);
-               rcar_gen3_thermal_calc_coefs(priv, tsc, *ths_tj_1);
+               rcar_gen3_thermal_calc_coefs(priv, tsc, priv->info->ths_tj_1);
 
                zone = devm_thermal_of_zone_register(dev, i, tsc, &priv->ops);
                if (IS_ERR(zone)) {
index 2d30420..0d6249b 100644
@@ -227,14 +227,12 @@ sensor_off:
 }
 EXPORT_SYMBOL_GPL(st_thermal_register);
 
-int st_thermal_unregister(struct platform_device *pdev)
+void st_thermal_unregister(struct platform_device *pdev)
 {
        struct st_thermal_sensor *sensor = platform_get_drvdata(pdev);
 
        st_thermal_sensor_off(sensor);
        thermal_zone_device_unregister(sensor->thermal_dev);
-
-       return 0;
 }
 EXPORT_SYMBOL_GPL(st_thermal_unregister);
 
index d661b2f..75a84e6 100644
@@ -94,7 +94,7 @@ struct st_thermal_sensor {
 
 extern int st_thermal_register(struct platform_device *pdev,
                               const struct of_device_id *st_thermal_of_match);
-extern int st_thermal_unregister(struct platform_device *pdev);
+extern void st_thermal_unregister(struct platform_device *pdev);
 extern const struct dev_pm_ops st_thermal_pm_ops;
 
 #endif /* __STI_RESET_SYSCFG_H */
index d68596c..e8cfa83 100644
@@ -172,9 +172,9 @@ static int st_mmap_probe(struct platform_device *pdev)
        return st_thermal_register(pdev,  st_mmap_thermal_of_match);
 }
 
-static int st_mmap_remove(struct platform_device *pdev)
+static void st_mmap_remove(struct platform_device *pdev)
 {
-       return st_thermal_unregister(pdev);
+       st_thermal_unregister(pdev);
 }
 
 static struct platform_driver st_mmap_thermal_driver = {
@@ -184,7 +184,7 @@ static struct platform_driver st_mmap_thermal_driver = {
                .of_match_table = st_mmap_thermal_of_match,
        },
        .probe          = st_mmap_probe,
-       .remove         = st_mmap_remove,
+       .remove_new     = st_mmap_remove,
 };
 
 module_platform_driver(st_mmap_thermal_driver);
index 793ddce..195f3c5 100644
@@ -319,6 +319,11 @@ out:
        return ret;
 }
 
+static void sun8i_ths_reset_control_assert(void *data)
+{
+       reset_control_assert(data);
+}
+
 static int sun8i_ths_resource_init(struct ths_device *tmdev)
 {
        struct device *dev = tmdev->dev;
@@ -339,47 +344,35 @@ static int sun8i_ths_resource_init(struct ths_device *tmdev)
                if (IS_ERR(tmdev->reset))
                        return PTR_ERR(tmdev->reset);
 
-               tmdev->bus_clk = devm_clk_get(&pdev->dev, "bus");
+               ret = reset_control_deassert(tmdev->reset);
+               if (ret)
+                       return ret;
+
+               ret = devm_add_action_or_reset(dev, sun8i_ths_reset_control_assert,
+                                              tmdev->reset);
+               if (ret)
+                       return ret;
+
+               tmdev->bus_clk = devm_clk_get_enabled(&pdev->dev, "bus");
                if (IS_ERR(tmdev->bus_clk))
                        return PTR_ERR(tmdev->bus_clk);
        }
 
        if (tmdev->chip->has_mod_clk) {
-               tmdev->mod_clk = devm_clk_get(&pdev->dev, "mod");
+               tmdev->mod_clk = devm_clk_get_enabled(&pdev->dev, "mod");
                if (IS_ERR(tmdev->mod_clk))
                        return PTR_ERR(tmdev->mod_clk);
        }
 
-       ret = reset_control_deassert(tmdev->reset);
-       if (ret)
-               return ret;
-
-       ret = clk_prepare_enable(tmdev->bus_clk);
-       if (ret)
-               goto assert_reset;
-
        ret = clk_set_rate(tmdev->mod_clk, 24000000);
        if (ret)
-               goto bus_disable;
-
-       ret = clk_prepare_enable(tmdev->mod_clk);
-       if (ret)
-               goto bus_disable;
+               return ret;
 
        ret = sun8i_ths_calibrate(tmdev);
        if (ret)
-               goto mod_disable;
+               return ret;
 
        return 0;
-
-mod_disable:
-       clk_disable_unprepare(tmdev->mod_clk);
-bus_disable:
-       clk_disable_unprepare(tmdev->bus_clk);
-assert_reset:
-       reset_control_assert(tmdev->reset);
-
-       return ret;
 }
 
 static int sun8i_h3_thermal_init(struct ths_device *tmdev)
@@ -475,9 +468,7 @@ static int sun8i_ths_register(struct ths_device *tmdev)
                if (IS_ERR(tmdev->sensor[i].tzd))
                        return PTR_ERR(tmdev->sensor[i].tzd);
 
-               if (devm_thermal_add_hwmon_sysfs(tmdev->dev, tmdev->sensor[i].tzd))
-                       dev_warn(tmdev->dev,
-                                "Failed to add hwmon sysfs attributes\n");
+               devm_thermal_add_hwmon_sysfs(tmdev->dev, tmdev->sensor[i].tzd);
        }
 
        return 0;
@@ -530,17 +521,6 @@ static int sun8i_ths_probe(struct platform_device *pdev)
        return 0;
 }
 
-static int sun8i_ths_remove(struct platform_device *pdev)
-{
-       struct ths_device *tmdev = platform_get_drvdata(pdev);
-
-       clk_disable_unprepare(tmdev->mod_clk);
-       clk_disable_unprepare(tmdev->bus_clk);
-       reset_control_assert(tmdev->reset);
-
-       return 0;
-}
-
 static const struct ths_thermal_chip sun8i_a83t_ths = {
        .sensor_num = 3,
        .scale = 705,
@@ -642,7 +622,6 @@ MODULE_DEVICE_TABLE(of, of_ths_match);
 
 static struct platform_driver ths_driver = {
        .probe = sun8i_ths_probe,
-       .remove = sun8i_ths_remove,
        .driver = {
                .name = "sun8i-thermal",
                .of_match_table = of_ths_match,
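A note on the implied teardown order after this conversion (a reading of devres semantics, not something stated in the patch): devres callbacks run in reverse registration order, so on unbind the managed resources unwind exactly as the deleted sun8i_ths_remove() did by hand.

        /*
         * Registration order in sun8i_ths_resource_init():
         *   1. reset deasserted + devm_add_action_or_reset(assert)
         *   2. devm_clk_get_enabled("bus")
         *   3. devm_clk_get_enabled("mod")
         * Implied release order on unbind:
         *   mod clock off -> bus clock off -> reset asserted
         */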
index cb584a5..c243e9d 100644
@@ -523,8 +523,7 @@ static int tegra_tsensor_register_channel(struct tegra_tsensor *ts,
                return 0;
        }
 
-       if (devm_thermal_add_hwmon_sysfs(ts->dev, tsc->tzd))
-               dev_warn(ts->dev, "failed to add hwmon sysfs attributes\n");
+       devm_thermal_add_hwmon_sysfs(ts->dev, tsc->tzd);
 
        return 0;
 }
index 017b0ce..f4f1a04 100644
@@ -13,6 +13,8 @@
 #include <linux/slab.h>
 #include <linux/thermal.h>
 
+#include "thermal_hwmon.h"
+
 struct gadc_thermal_info {
        struct device *dev;
        struct thermal_zone_device *tz_dev;
@@ -153,6 +155,8 @@ static int gadc_thermal_probe(struct platform_device *pdev)
                return ret;
        }
 
+       devm_thermal_add_hwmon_sysfs(&pdev->dev, gti->tz_dev);
+
        return 0;
 }
 
index 3d4a787..17c1bbe 100644
@@ -23,6 +23,8 @@
 #define DEFAULT_THERMAL_GOVERNOR       "user_space"
 #elif defined(CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR)
 #define DEFAULT_THERMAL_GOVERNOR       "power_allocator"
+#elif defined(CONFIG_THERMAL_DEFAULT_GOV_BANG_BANG)
+#define DEFAULT_THERMAL_GOVERNOR       "bang_bang"
 #endif
 
 /* Initial state of a cooling device during binding */
index fbe5550..c3ae446 100644
@@ -271,11 +271,14 @@ int devm_thermal_add_hwmon_sysfs(struct device *dev, struct thermal_zone_device
 
        ptr = devres_alloc(devm_thermal_hwmon_release, sizeof(*ptr),
                           GFP_KERNEL);
-       if (!ptr)
+       if (!ptr) {
+               dev_warn(dev, "Failed to allocate device resource data\n");
                return -ENOMEM;
+       }
 
        ret = thermal_add_hwmon_sysfs(tz);
        if (ret) {
+               dev_warn(dev, "Failed to add hwmon sysfs attributes\n");
                devres_free(ptr);
                return ret;
        }
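This is the change that lets the driver hunks above and below drop their local "Failed to add hwmon sysfs attributes" warnings: the helper now reports its own failures, so callers can treat hwmon bridging as best effort. Generic caller sketch (names are placeholders, not from any specific driver):

        tzd = devm_thermal_of_zone_register(dev, 0, priv, &my_tz_ops);
        if (IS_ERR(tzd))
                return PTR_ERR(tzd);

        /* best effort; the helper logs a warning on failure */
        devm_thermal_add_hwmon_sysfs(dev, tzd);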
index 6a53359..d414a4b 100644
@@ -182,8 +182,7 @@ int ti_thermal_expose_sensor(struct ti_bandgap *bgp, int id,
        ti_bandgap_write_update_interval(bgp, data->sensor_id,
                                         TI_BANDGAP_UPDATE_INTERVAL_MS);
 
-       if (devm_thermal_add_hwmon_sysfs(bgp->dev, data->ti_thermal))
-               dev_warn(bgp->dev, "failed to add hwmon sysfs attributes\n");
+       devm_thermal_add_hwmon_sysfs(bgp->dev, data->ti_thermal);
 
        return 0;
 }
index 3e3fb37..71a7a3e 100644
@@ -450,8 +450,8 @@ config SERIAL_SA1100
        help
          If you have a machine based on a SA1100/SA1110 StrongARM(R) CPU you
          can enable its onboard serial port by enabling this option.
-         Please read <file:Documentation/arm/sa1100/serial_uart.rst> for further
-         info.
+         Please read <file:Documentation/arch/arm/sa1100/serial_uart.rst> for
+         further info.
 
 config SERIAL_SA1100_CONSOLE
        bool "Console on SA1100 serial port"
index c84be40..4737a8f 100644
@@ -466,7 +466,7 @@ static const struct file_operations tty_fops = {
        .llseek         = no_llseek,
        .read_iter      = tty_read,
        .write_iter     = tty_write,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = copy_splice_read,
        .splice_write   = iter_file_splice_write,
        .poll           = tty_poll,
        .unlocked_ioctl = tty_ioctl,
@@ -481,7 +481,7 @@ static const struct file_operations console_fops = {
        .llseek         = no_llseek,
        .read_iter      = tty_read,
        .write_iter     = redirected_tty_write,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = copy_splice_read,
        .splice_write   = iter_file_splice_write,
        .poll           = tty_poll,
        .unlocked_ioctl = tty_ioctl,
index 268ccbe..8723086 100644
@@ -34,13 +34,13 @@ void __init usb_init_pool_max(void)
 {
        /*
         * The pool_max values must never be smaller than
-        * ARCH_KMALLOC_MINALIGN.
+        * ARCH_DMA_MINALIGN.
         */
-       if (ARCH_KMALLOC_MINALIGN <= 32)
+       if (ARCH_DMA_MINALIGN <= 32)
                ;                       /* Original value is okay */
-       else if (ARCH_KMALLOC_MINALIGN <= 64)
+       else if (ARCH_DMA_MINALIGN <= 64)
                pool_max[0] = 64;
-       else if (ARCH_KMALLOC_MINALIGN <= 128)
+       else if (ARCH_DMA_MINALIGN <= 128)
                pool_max[0] = 0;        /* Don't use this pool */
        else
                BUILD_BUG();            /* We don't allow this */
index 5f5c216..4619b4a 100644
@@ -1052,7 +1052,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
                goto out;
 
        pinned = pin_user_pages(uaddr, npages, FOLL_LONGTERM | FOLL_WRITE,
-                               page_list, NULL);
+                               page_list);
        if (pinned != npages) {
                ret = pinned < 0 ? pinned : -ENOMEM;
                goto out;
index 0d2f805..ebe0ad3 100644
@@ -514,6 +514,7 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
                            bool write_fault)
 {
        pte_t *ptep;
+       pte_t pte;
        spinlock_t *ptl;
        int ret;
 
@@ -536,10 +537,12 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
                        return ret;
        }
 
-       if (write_fault && !pte_write(*ptep))
+       pte = ptep_get(ptep);
+
+       if (write_fault && !pte_write(pte))
                ret = -EFAULT;
        else
-               *pfn = pte_pfn(*ptep);
+               *pfn = pte_pfn(pte);
 
        pte_unmap_unlock(ptep, ptl);
        return ret;
@@ -562,7 +565,7 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 
        mmap_read_lock(mm);
        ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
-                                   pages, NULL, NULL);
+                                   pages, NULL);
        if (ret > 0) {
                int i;
 
index bf77924..b43e868 100644
@@ -1009,7 +1009,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
        while (npages) {
                sz2pin = min_t(unsigned long, npages, list_size);
                pinned = pin_user_pages(cur_base, sz2pin,
-                                       gup_flags, page_list, NULL);
+                                       gup_flags, page_list);
                if (sz2pin != pinned) {
                        if (pinned < 0) {
                                ret = pinned;
index d75ab3f..cecdc1c 100644
@@ -576,8 +576,8 @@ static void ioreq_resume(void)
 int acrn_ioreq_intr_setup(void)
 {
        acrn_setup_intr_handler(ioreq_intr_handler);
-       ioreq_wq = alloc_workqueue("ioreq_wq",
-                                  WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+       ioreq_wq = alloc_ordered_workqueue("ioreq_wq",
+                                          WQ_HIGHPRI | WQ_MEM_RECLAIM);
        if (!ioreq_wq) {
                dev_err(acrn_dev.this_device, "Failed to alloc workqueue!\n");
                acrn_remove_intr_handler();
index f9db079..da2d7ca 100644
@@ -2,6 +2,7 @@ config SEV_GUEST
        tristate "AMD SEV Guest driver"
        default m
        depends on AMD_MEM_ENCRYPT
+       select CRYPTO
        select CRYPTO_AEAD2
        select CRYPTO_GCM
        help
index e2f580e..f447cd3 100644
@@ -949,7 +949,7 @@ static int privcmd_mmap(struct file *file, struct vm_area_struct *vma)
  */
 static int is_mapped_fn(pte_t *pte, unsigned long addr, void *data)
 {
-       return pte_none(*pte) ? 0 : -EBUSY;
+       return pte_none(ptep_get(pte)) ? 0 : -EBUSY;
 }
 
 static int privcmd_vma_range_is_mapped(
index 7beaf2c..d525934 100644
@@ -363,7 +363,7 @@ static struct sock_mapping *pvcalls_new_active_socket(
        map->data.in = map->bytes;
        map->data.out = map->bytes + XEN_FLEX_RING_SIZE(map->ring_order);
 
-       map->ioworker.wq = alloc_workqueue("pvcalls_io", WQ_UNBOUND, 1);
+       map->ioworker.wq = alloc_ordered_workqueue("pvcalls_io", 0);
        if (!map->ioworker.wq)
                goto out;
        atomic_set(&map->io, 1);
@@ -636,7 +636,7 @@ static int pvcalls_back_bind(struct xenbus_device *dev,
 
        INIT_WORK(&map->register_work, __pvcalls_back_accept);
        spin_lock_init(&map->copy_lock);
-       map->wq = alloc_workqueue("pvcalls_wq", WQ_UNBOUND, 1);
+       map->wq = alloc_ordered_workqueue("pvcalls_wq", 0);
        if (!map->wq) {
                ret = -ENOMEM;
                goto out;
index 6c31b8c..2996fb0 100644
@@ -374,6 +374,28 @@ v9fs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
        return ret;
 }
 
+/*
+ * v9fs_file_splice_read - splice-read from a file
+ * @in: The 9p file to read from
+ * @ppos: Where to find/update the file position
+ * @pipe: The pipe to splice into
+ * @len: The maximum amount of data to splice
+ * @flags: SPLICE_F_* flags
+ */
+static ssize_t v9fs_file_splice_read(struct file *in, loff_t *ppos,
+                                    struct pipe_inode_info *pipe,
+                                    size_t len, unsigned int flags)
+{
+       struct p9_fid *fid = in->private_data;
+
+       p9_debug(P9_DEBUG_VFS, "fid %d count %zu offset %lld\n",
+                fid->fid, len, *ppos);
+
+       if (fid->mode & P9L_DIRECT)
+               return copy_splice_read(in, ppos, pipe, len, flags);
+       return filemap_splice_read(in, ppos, pipe, len, flags);
+}
+
 /**
  * v9fs_file_write_iter - write to a file
  * @iocb: The operation parameters
@@ -569,7 +591,7 @@ const struct file_operations v9fs_file_operations = {
        .release = v9fs_dir_release,
        .lock = v9fs_file_lock,
        .mmap = generic_file_readonly_mmap,
-       .splice_read = generic_file_splice_read,
+       .splice_read = v9fs_file_splice_read,
        .splice_write = iter_file_splice_write,
        .fsync = v9fs_file_fsync,
 };
@@ -583,7 +605,7 @@ const struct file_operations v9fs_file_operations_dotl = {
        .lock = v9fs_file_lock_dotl,
        .flock = v9fs_file_flock_dotl,
        .mmap = v9fs_file_mmap,
-       .splice_read = generic_file_splice_read,
+       .splice_read = v9fs_file_splice_read,
        .splice_write = iter_file_splice_write,
        .fsync = v9fs_file_fsync_dotl,
 };
index 5bfdbf0..e513aae 100644
@@ -17,14 +17,8 @@ obj-y :=     open.o read_write.o file_table.o super.o \
                fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
                kernel_read_file.o mnt_idmapping.o remap_range.o
 
-ifeq ($(CONFIG_BLOCK),y)
-obj-y +=       buffer.o mpage.o
-else
-obj-y +=       no-block.o
-endif
-
-obj-$(CONFIG_PROC_FS) += proc_namespace.o
-
+obj-$(CONFIG_BLOCK)            += buffer.o mpage.o
+obj-$(CONFIG_PROC_FS)          += proc_namespace.o
 obj-$(CONFIG_LEGACY_DIRECT_IO) += direct-io.o
 obj-y                          += notify/
 obj-$(CONFIG_EPOLL)            += eventpoll.o
index 754afb1..ee80718 100644
@@ -28,7 +28,7 @@ const struct file_operations adfs_file_operations = {
        .mmap           = generic_file_mmap,
        .fsync          = generic_file_fsync,
        .write_iter     = generic_file_write_iter,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
 };
 
 const struct inode_operations adfs_file_inode_operations = {
index 8daeed3..e43f2f0 100644
@@ -1001,7 +1001,7 @@ const struct file_operations affs_file_operations = {
        .open           = affs_file_open,
        .release        = affs_file_release,
        .fsync          = affs_file_fsync,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
 };
 
 const struct inode_operations affs_file_inode_operations = {
index 719b313..d37dd20 100644
@@ -25,6 +25,9 @@ static void afs_invalidate_folio(struct folio *folio, size_t offset,
 static bool afs_release_folio(struct folio *folio, gfp_t gfp_flags);
 
 static ssize_t afs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+static ssize_t afs_file_splice_read(struct file *in, loff_t *ppos,
+                                   struct pipe_inode_info *pipe,
+                                   size_t len, unsigned int flags);
 static void afs_vm_open(struct vm_area_struct *area);
 static void afs_vm_close(struct vm_area_struct *area);
 static vm_fault_t afs_vm_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff);
@@ -36,7 +39,7 @@ const struct file_operations afs_file_operations = {
        .read_iter      = afs_file_read_iter,
        .write_iter     = afs_file_write,
        .mmap           = afs_file_mmap,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = afs_file_splice_read,
        .splice_write   = iter_file_splice_write,
        .fsync          = afs_fsync,
        .lock           = afs_lock,
@@ -587,3 +590,18 @@ static ssize_t afs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 
        return generic_file_read_iter(iocb, iter);
 }
+
+static ssize_t afs_file_splice_read(struct file *in, loff_t *ppos,
+                                   struct pipe_inode_info *pipe,
+                                   size_t len, unsigned int flags)
+{
+       struct afs_vnode *vnode = AFS_FS_I(file_inode(in));
+       struct afs_file *af = in->private_data;
+       int ret;
+
+       ret = afs_validate(vnode, af->key);
+       if (ret < 0)
+               return ret;
+
+       return filemap_splice_read(in, ppos, pipe, len, flags);
+}
index 8750b99..9c7fd6f 100644
@@ -465,7 +465,7 @@ static void afs_extend_writeback(struct address_space *mapping,
                                 bool caching,
                                 unsigned int *_len)
 {
-       struct pagevec pvec;
+       struct folio_batch fbatch;
        struct folio *folio;
        unsigned long priv;
        unsigned int psize, filler = 0;
@@ -476,7 +476,7 @@ static void afs_extend_writeback(struct address_space *mapping,
        unsigned int i;
 
        XA_STATE(xas, &mapping->i_pages, index);
-       pagevec_init(&pvec);
+       folio_batch_init(&fbatch);
 
        do {
                /* Firstly, we gather up a batch of contiguous dirty pages
@@ -535,7 +535,7 @@ static void afs_extend_writeback(struct address_space *mapping,
                                stop = false;
 
                        index += folio_nr_pages(folio);
-                       if (!pagevec_add(&pvec, &folio->page))
+                       if (!folio_batch_add(&fbatch, folio))
                                break;
                        if (stop)
                                break;
@@ -545,14 +545,14 @@ static void afs_extend_writeback(struct address_space *mapping,
                        xas_pause(&xas);
                rcu_read_unlock();
 
-               /* Now, if we obtained any pages, we can shift them to being
+               /* Now, if we obtained any folios, we can shift them to being
                 * writable and mark them for caching.
                 */
-               if (!pagevec_count(&pvec))
+               if (!folio_batch_count(&fbatch))
                        break;
 
-               for (i = 0; i < pagevec_count(&pvec); i++) {
-                       folio = page_folio(pvec.pages[i]);
+               for (i = 0; i < folio_batch_count(&fbatch); i++) {
+                       folio = fbatch.folios[i];
                        trace_afs_folio_dirty(vnode, tracepoint_string("store+"), folio);
 
                        if (!folio_clear_dirty_for_io(folio))
@@ -565,7 +565,7 @@ static void afs_extend_writeback(struct address_space *mapping,
                        folio_unlock(folio);
                }
 
-               pagevec_release(&pvec);
+               folio_batch_release(&fbatch);
                cond_resched();
        } while (!stop);
 
index b0b17bd..77e3361 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -530,7 +530,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
        for (i = 0; i < nr_pages; i++) {
                struct page *page;
                page = find_or_create_page(file->f_mapping,
-                                          i, GFP_HIGHUSER | __GFP_ZERO);
+                                          i, GFP_USER | __GFP_ZERO);
                if (!page)
                        break;
                pr_debug("pid(%d) page[%d]->count=%d\n",
@@ -571,7 +571,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
        ctx->user_id = ctx->mmap_base;
        ctx->nr_events = nr_events; /* trusted copy */
 
-       ring = kmap_atomic(ctx->ring_pages[0]);
+       ring = page_address(ctx->ring_pages[0]);
        ring->nr = nr_events;   /* user copy */
        ring->id = ~0U;
        ring->head = ring->tail = 0;
@@ -579,7 +579,6 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
        ring->compat_features = AIO_RING_COMPAT_FEATURES;
        ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
        ring->header_length = sizeof(struct aio_ring);
-       kunmap_atomic(ring);
        flush_dcache_page(ctx->ring_pages[0]);
 
        return 0;
@@ -682,9 +681,8 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm)
                                         * we are protected from page migration
                                         * changes ring_pages by ->ring_lock.
                                         */
-                                       ring = kmap_atomic(ctx->ring_pages[0]);
+                                       ring = page_address(ctx->ring_pages[0]);
                                        ring->id = ctx->id;
-                                       kunmap_atomic(ring);
                                        return 0;
                                }
 
@@ -1025,9 +1023,8 @@ static void user_refill_reqs_available(struct kioctx *ctx)
                 * against ctx->completed_events below will make sure we do the
                 * safe/right thing.
                 */
-               ring = kmap_atomic(ctx->ring_pages[0]);
+               ring = page_address(ctx->ring_pages[0]);
                head = ring->head;
-               kunmap_atomic(ring);
 
                refill_reqs_available(ctx, head, ctx->tail);
        }
@@ -1133,12 +1130,11 @@ static void aio_complete(struct aio_kiocb *iocb)
        if (++tail >= ctx->nr_events)
                tail = 0;
 
-       ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
+       ev_page = page_address(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
        event = ev_page + pos % AIO_EVENTS_PER_PAGE;
 
        *event = iocb->ki_res;
 
-       kunmap_atomic(ev_page);
        flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
 
        pr_debug("%p[%u]: %p: %p %Lx %Lx %Lx\n", ctx, tail, iocb,
@@ -1152,10 +1148,9 @@ static void aio_complete(struct aio_kiocb *iocb)
 
        ctx->tail = tail;
 
-       ring = kmap_atomic(ctx->ring_pages[0]);
+       ring = page_address(ctx->ring_pages[0]);
        head = ring->head;
        ring->tail = tail;
-       kunmap_atomic(ring);
        flush_dcache_page(ctx->ring_pages[0]);
 
        ctx->completed_events++;
@@ -1215,10 +1210,9 @@ static long aio_read_events_ring(struct kioctx *ctx,
        mutex_lock(&ctx->ring_lock);
 
        /* Access to ->ring_pages here is protected by ctx->ring_lock. */
-       ring = kmap_atomic(ctx->ring_pages[0]);
+       ring = page_address(ctx->ring_pages[0]);
        head = ring->head;
        tail = ring->tail;
-       kunmap_atomic(ring);
 
        /*
         * Ensure that once we've read the current tail pointer, that
@@ -1250,10 +1244,9 @@ static long aio_read_events_ring(struct kioctx *ctx,
                avail = min(avail, nr - ret);
                avail = min_t(long, avail, AIO_EVENTS_PER_PAGE - pos);
 
-               ev = kmap(page);
+               ev = page_address(page);
                copy_ret = copy_to_user(event + ret, ev + pos,
                                        sizeof(*ev) * avail);
-               kunmap(page);
 
                if (unlikely(copy_ret)) {
                        ret = -EFAULT;
@@ -1265,9 +1258,8 @@ static long aio_read_events_ring(struct kioctx *ctx,
                head %= ctx->nr_events;
        }
 
-       ring = kmap_atomic(ctx->ring_pages[0]);
+       ring = page_address(ctx->ring_pages[0]);
        ring->head = head;
-       kunmap_atomic(ring);
        flush_dcache_page(ctx->ring_pages[0]);
 
        pr_debug("%li  h%u t%u\n", ret, head, tail);
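The aio hunks above pair two changes: the ring pages are now allocated with GFP_USER rather than GFP_HIGHUSER, so they always have a permanent lowmem mapping, which is what makes it safe to replace each kmap_atomic()/kunmap_atomic() pair with a bare page_address(). A minimal sketch of the resulting access pattern, using a hypothetical helper name that is not part of the patch:

static void aio_ring_write_head(struct page *ring_page, unsigned int head)
{
	/* GFP_USER pages never come from highmem, so this mapping is permanent. */
	struct aio_ring *ring = page_address(ring_page);

	ring->head = head;
	/* Still needed on architectures with non-coherent data caches. */
	flush_dcache_page(ring_page);
}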
index 6baf90b..93046c9 100644 (file)
@@ -600,7 +600,7 @@ static int autofs_dir_symlink(struct mnt_idmap *idmap,
        p_ino = autofs_dentry_ino(dentry->d_parent);
        p_ino->count++;
 
-       dir->i_mtime = current_time(dir);
+       dir->i_mtime = dir->i_ctime = current_time(dir);
 
        return 0;
 }
@@ -633,7 +633,7 @@ static int autofs_dir_unlink(struct inode *dir, struct dentry *dentry)
        d_inode(dentry)->i_size = 0;
        clear_nlink(d_inode(dentry));
 
-       dir->i_mtime = current_time(dir);
+       dir->i_mtime = dir->i_ctime = current_time(dir);
 
        spin_lock(&sbi->lookup_lock);
        __autofs_add_expiring(dentry);
@@ -749,7 +749,7 @@ static int autofs_dir_mkdir(struct mnt_idmap *idmap,
        p_ino = autofs_dentry_ino(dentry->d_parent);
        p_ino->count++;
        inc_nlink(dir);
-       dir->i_mtime = current_time(dir);
+       dir->i_mtime = dir->i_ctime = current_time(dir);
 
        return 0;
 }
index 1b7e0f7..53b36aa 100644 (file)
@@ -500,7 +500,7 @@ befs_btree_read(struct super_block *sb, const befs_data_stream *ds,
                goto error_alloc;
        }
 
-       strlcpy(keybuf, keystart, keylen + 1);
+       strscpy(keybuf, keystart, keylen + 1);
        *value = fs64_to_cpu(sb, valarray[cur_key]);
        *keysize = keylen;
 
index 32749fc..eee9237 100644 (file)
@@ -374,7 +374,7 @@ static struct inode *befs_iget(struct super_block *sb, unsigned long ino)
        if (S_ISLNK(inode->i_mode) && !(befs_ino->i_flags & BEFS_LONG_SYMLINK)){
                inode->i_size = 0;
                inode->i_blocks = befs_sb->block_size / VFS_BLOCK_SIZE;
-               strlcpy(befs_ino->i_data.symlink, raw_inode->data.symlink,
+               strscpy(befs_ino->i_data.symlink, raw_inode->data.symlink,
                        BEFS_SYMLINK_LEN);
        } else {
                int num_blks;
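Both befs conversions above swap strlcpy() for strscpy(). The practical difference is that strscpy() reads at most the destination size from the source and signals truncation by returning -E2BIG, whereas strlcpy() returned strlen(src) and could read past a short source buffer. A minimal sketch with a hypothetical wrapper, not part of the patch:

static int befs_copy_string(char *dst, const char *src, size_t dst_size)
{
	ssize_t ret = strscpy(dst, src, dst_size);

	/* ret == -E2BIG means src did not fit; dst is still NUL-terminated. */
	return (ret < 0) ? -ENAMETOOLONG : 0;
}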
index 57ae5ee..adc2230 100644 (file)
@@ -27,7 +27,7 @@ const struct file_operations bfs_file_operations = {
        .read_iter      = generic_file_read_iter,
        .write_iter     = generic_file_write_iter,
        .mmap           = generic_file_mmap,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
 };
 
 static int bfs_move_block(unsigned long from, unsigned long to,
index 1033fbd..983ce34 100644 (file)
@@ -1517,7 +1517,7 @@ static void fill_elf_note_phdr(struct elf_phdr *phdr, int sz, loff_t offset)
        phdr->p_filesz = sz;
        phdr->p_memsz = 0;
        phdr->p_flags = 0;
-       phdr->p_align = 0;
+       phdr->p_align = 4;
 }
 
 static void fill_note(struct memelfnote *note, const char *name, int type,
@@ -1773,7 +1773,7 @@ static int fill_thread_core_info(struct elf_thread_core_info *t,
        /*
         * NT_PRSTATUS is the one special case, because the regset data
         * goes into the pr_reg field inside the note contents, rather
-        * than being the whole note contents.  We fill the reset in here.
+        * than being the whole note contents.  We fill the regset in here.
         * We assume that regset 0 is NT_PRSTATUS.
         */
        fill_prstatus(&t->prstatus.common, t->task, signr);
index 05a1471..1c6c583 100644 (file)
@@ -743,12 +743,12 @@ static int elf_fdpic_map_file(struct elf_fdpic_params *params,
        struct elf32_fdpic_loadmap *loadmap;
 #ifdef CONFIG_MMU
        struct elf32_fdpic_loadseg *mseg;
+       unsigned long load_addr;
 #endif
        struct elf32_fdpic_loadseg *seg;
        struct elf32_phdr *phdr;
-       unsigned long load_addr, stop;
        unsigned nloads, tmp;
-       size_t size;
+       unsigned long stop;
        int loop, ret;
 
        /* allocate a load map table */
@@ -760,8 +760,7 @@ static int elf_fdpic_map_file(struct elf_fdpic_params *params,
        if (nloads == 0)
                return -ELIBBAD;
 
-       size = sizeof(*loadmap) + nloads * sizeof(*seg);
-       loadmap = kzalloc(size, GFP_KERNEL);
+       loadmap = kzalloc(struct_size(loadmap, segs, nloads), GFP_KERNEL);
        if (!loadmap)
                return -ENOMEM;
 
@@ -770,9 +769,6 @@ static int elf_fdpic_map_file(struct elf_fdpic_params *params,
        loadmap->version = ELF32_FDPIC_LOADMAP_VERSION;
        loadmap->nsegs = nloads;
 
-       load_addr = params->load_addr;
-       seg = loadmap->segs;
-
        /* map the requested LOADs into the memory space */
        switch (params->flags & ELF_FDPIC_FLAG_ARRANGEMENT) {
        case ELF_FDPIC_FLAG_CONSTDISP:
@@ -1269,7 +1265,7 @@ static inline void fill_elf_note_phdr(struct elf_phdr *phdr, int sz, loff_t offs
        phdr->p_filesz = sz;
        phdr->p_memsz = 0;
        phdr->p_flags = 0;
-       phdr->p_align = 0;
+       phdr->p_align = 4;
        return;
 }
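The loadmap allocation earlier in this file now uses struct_size() instead of the open-coded sizeof arithmetic. The helper relies on segs[] being a flexible array member and, unlike the manual expression, its arithmetic is overflow-checked and saturates to SIZE_MAX, so an absurd nloads makes kzalloc() fail rather than return an undersized buffer. A rough sketch of the equivalence (simplified, not code from the patch):

/* struct_size(loadmap, segs, nloads) computes, with overflow checking: */
size_t bytes = sizeof(*loadmap) + nloads * sizeof(loadmap->segs[0]);

/* so the allocation becomes: */
loadmap = kzalloc(struct_size(loadmap, segs, nloads), GFP_KERNEL);
if (!loadmap)
	return -ENOMEM;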
 
index aac2404..ce083e9 100644 (file)
@@ -71,6 +71,16 @@ bool btrfs_workqueue_normal_congested(const struct btrfs_workqueue *wq)
        return atomic_read(&wq->pending) > wq->thresh * 2;
 }
 
+static void btrfs_init_workqueue(struct btrfs_workqueue *wq,
+                                struct btrfs_fs_info *fs_info)
+{
+       wq->fs_info = fs_info;
+       atomic_set(&wq->pending, 0);
+       INIT_LIST_HEAD(&wq->ordered_list);
+       spin_lock_init(&wq->list_lock);
+       spin_lock_init(&wq->thres_lock);
+}
+
 struct btrfs_workqueue *btrfs_alloc_workqueue(struct btrfs_fs_info *fs_info,
                                              const char *name, unsigned int flags,
                                              int limit_active, int thresh)
@@ -80,9 +90,9 @@ struct btrfs_workqueue *btrfs_alloc_workqueue(struct btrfs_fs_info *fs_info,
        if (!ret)
                return NULL;
 
-       ret->fs_info = fs_info;
+       btrfs_init_workqueue(ret, fs_info);
+
        ret->limit_active = limit_active;
-       atomic_set(&ret->pending, 0);
        if (thresh == 0)
                thresh = DFT_THRESHOLD;
        /* For low threshold, disabling threshold is a better choice */
@@ -106,9 +116,33 @@ struct btrfs_workqueue *btrfs_alloc_workqueue(struct btrfs_fs_info *fs_info,
                return NULL;
        }
 
-       INIT_LIST_HEAD(&ret->ordered_list);
-       spin_lock_init(&ret->list_lock);
-       spin_lock_init(&ret->thres_lock);
+       trace_btrfs_workqueue_alloc(ret, name);
+       return ret;
+}
+
+struct btrfs_workqueue *btrfs_alloc_ordered_workqueue(
+                               struct btrfs_fs_info *fs_info, const char *name,
+                               unsigned int flags)
+{
+       struct btrfs_workqueue *ret;
+
+       ret = kzalloc(sizeof(*ret), GFP_KERNEL);
+       if (!ret)
+               return NULL;
+
+       btrfs_init_workqueue(ret, fs_info);
+
+       /* Ordered workqueues don't allow @max_active adjustments. */
+       ret->limit_active = 1;
+       ret->current_active = 1;
+       ret->thresh = NO_THRESHOLD;
+
+       ret->normal_wq = alloc_ordered_workqueue("btrfs-%s", flags, name);
+       if (!ret->normal_wq) {
+               kfree(ret);
+               return NULL;
+       }
+
        trace_btrfs_workqueue_alloc(ret, name);
        return ret;
 }
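btrfs_alloc_ordered_workqueue() added above is a thin wrapper around alloc_ordered_workqueue(): items execute strictly one at a time in queueing order, which is why limit_active/current_active are pinned to 1 and the threshold machinery is disabled. A minimal usage sketch; the queue name and work item are hypothetical, not taken from this patch:

struct btrfs_workqueue *wq;

wq = btrfs_alloc_ordered_workqueue(fs_info, "fixup", 0);
if (!wq)
	return -ENOMEM;

/* Work queued here runs one item at a time, in submission order. */
btrfs_queue_work(wq, &work);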
index 6e2596d..30f66c5 100644 (file)
@@ -31,6 +31,9 @@ struct btrfs_workqueue *btrfs_alloc_workqueue(struct btrfs_fs_info *fs_info,
                                              unsigned int flags,
                                              int limit_active,
                                              int thresh);
+struct btrfs_workqueue *btrfs_alloc_ordered_workqueue(
+                               struct btrfs_fs_info *fs_info, const char *name,
+                               unsigned int flags);
 void btrfs_init_work(struct btrfs_work *work, btrfs_func_t func,
                     btrfs_func_t ordered_func, btrfs_func_t ordered_free);
 void btrfs_queue_work(struct btrfs_workqueue *wq,
index b3ad0f5..12b1244 100644 (file)
@@ -27,6 +27,17 @@ struct btrfs_failed_bio {
        atomic_t repair_count;
 };
 
+/* Is this a data path I/O that needs storage layer checksum and repair? */
+static inline bool is_data_bbio(struct btrfs_bio *bbio)
+{
+       return bbio->inode && is_data_inode(&bbio->inode->vfs_inode);
+}
+
+static bool bbio_has_ordered_extent(struct btrfs_bio *bbio)
+{
+       return is_data_bbio(bbio) && btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE;
+}
+
 /*
  * Initialize a btrfs_bio structure.  This skips the embedded bio itself as it
  * is already initialized by the block layer.
@@ -61,20 +72,6 @@ struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
        return bbio;
 }
 
-static blk_status_t btrfs_bio_extract_ordered_extent(struct btrfs_bio *bbio)
-{
-       struct btrfs_ordered_extent *ordered;
-       int ret;
-
-       ordered = btrfs_lookup_ordered_extent(bbio->inode, bbio->file_offset);
-       if (WARN_ON_ONCE(!ordered))
-               return BLK_STS_IOERR;
-       ret = btrfs_extract_ordered_extent(bbio, ordered);
-       btrfs_put_ordered_extent(ordered);
-
-       return errno_to_blk_status(ret);
-}
-
 static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
                                         struct btrfs_bio *orig_bbio,
                                         u64 map_length, bool use_append)
@@ -95,13 +92,41 @@ static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
        btrfs_bio_init(bbio, fs_info, NULL, orig_bbio);
        bbio->inode = orig_bbio->inode;
        bbio->file_offset = orig_bbio->file_offset;
-       if (!(orig_bbio->bio.bi_opf & REQ_BTRFS_ONE_ORDERED))
-               orig_bbio->file_offset += map_length;
-
+       orig_bbio->file_offset += map_length;
+       if (bbio_has_ordered_extent(bbio)) {
+               refcount_inc(&orig_bbio->ordered->refs);
+               bbio->ordered = orig_bbio->ordered;
+       }
        atomic_inc(&orig_bbio->pending_ios);
        return bbio;
 }
 
+/* Free a bio that was never submitted to the underlying device. */
+static void btrfs_cleanup_bio(struct btrfs_bio *bbio)
+{
+       if (bbio_has_ordered_extent(bbio))
+               btrfs_put_ordered_extent(bbio->ordered);
+       bio_put(&bbio->bio);
+}
+
+static void __btrfs_bio_end_io(struct btrfs_bio *bbio)
+{
+       if (bbio_has_ordered_extent(bbio)) {
+               struct btrfs_ordered_extent *ordered = bbio->ordered;
+
+               bbio->end_io(bbio);
+               btrfs_put_ordered_extent(ordered);
+       } else {
+               bbio->end_io(bbio);
+       }
+}
+
+void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
+{
+       bbio->bio.bi_status = status;
+       __btrfs_bio_end_io(bbio);
+}
+
 static void btrfs_orig_write_end_io(struct bio *bio);
 
 static void btrfs_bbio_propagate_error(struct btrfs_bio *bbio,
@@ -130,12 +155,12 @@ static void btrfs_orig_bbio_end_io(struct btrfs_bio *bbio)
 
                if (bbio->bio.bi_status)
                        btrfs_bbio_propagate_error(bbio, orig_bbio);
-               bio_put(&bbio->bio);
+               btrfs_cleanup_bio(bbio);
                bbio = orig_bbio;
        }
 
        if (atomic_dec_and_test(&bbio->pending_ios))
-               bbio->end_io(bbio);
+               __btrfs_bio_end_io(bbio);
 }
 
 static int next_repair_mirror(struct btrfs_failed_bio *fbio, int cur_mirror)
@@ -327,7 +352,7 @@ static void btrfs_end_bio_work(struct work_struct *work)
        struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
 
        /* Metadata reads are checked and repaired by the submitter. */
-       if (bbio->inode && !(bbio->bio.bi_opf & REQ_META))
+       if (is_data_bbio(bbio))
                btrfs_check_read_bio(bbio, bbio->bio.bi_private);
        else
                btrfs_orig_bbio_end_io(bbio);
@@ -348,7 +373,7 @@ static void btrfs_simple_end_io(struct bio *bio)
                INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
                queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->end_io_work);
        } else {
-               if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+               if (bio_op(bio) == REQ_OP_ZONE_APPEND && !bio->bi_status)
                        btrfs_record_physical_zoned(bbio);
                btrfs_orig_bbio_end_io(bbio);
        }
@@ -361,8 +386,7 @@ static void btrfs_raid56_end_io(struct bio *bio)
 
        btrfs_bio_counter_dec(bioc->fs_info);
        bbio->mirror_num = bioc->mirror_num;
-       if (bio_op(bio) == REQ_OP_READ && bbio->inode &&
-           !(bbio->bio.bi_opf & REQ_META))
+       if (bio_op(bio) == REQ_OP_READ && is_data_bbio(bbio))
                btrfs_check_read_bio(bbio, NULL);
        else
                btrfs_orig_bbio_end_io(bbio);
@@ -472,13 +496,12 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
 static void __btrfs_submit_bio(struct bio *bio, struct btrfs_io_context *bioc,
                               struct btrfs_io_stripe *smap, int mirror_num)
 {
-       /* Do not leak our private flag into the block layer. */
-       bio->bi_opf &= ~REQ_BTRFS_ONE_ORDERED;
-
        if (!bioc) {
                /* Single mirror read/write fast path. */
                btrfs_bio(bio)->mirror_num = mirror_num;
                bio->bi_iter.bi_sector = smap->physical >> SECTOR_SHIFT;
+               if (bio_op(bio) != REQ_OP_READ)
+                       btrfs_bio(bio)->orig_physical = smap->physical;
                bio->bi_private = smap->dev;
                bio->bi_end_io = btrfs_simple_end_io;
                btrfs_submit_dev_bio(smap->dev, bio);
@@ -574,27 +597,20 @@ static void run_one_async_free(struct btrfs_work *work)
 
 static bool should_async_write(struct btrfs_bio *bbio)
 {
-       /*
-        * If the I/O is not issued by fsync and friends, (->sync_writers != 0),
-        * then try to defer the submission to a workqueue to parallelize the
-        * checksum calculation.
-        */
-       if (atomic_read(&bbio->inode->sync_writers))
+       /* Submit synchronously if the checksum implementation is fast. */
+       if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &bbio->fs_info->flags))
                return false;
 
        /*
-        * Submit metadata writes synchronously if the checksum implementation
-        * is fast, or we are on a zoned device that wants I/O to be submitted
-        * in order.
+        * Try to defer the submission to a workqueue to parallelize the
+        * checksum calculation unless the I/O is issued synchronously.
         */
-       if (bbio->bio.bi_opf & REQ_META) {
-               struct btrfs_fs_info *fs_info = bbio->fs_info;
+       if (op_is_sync(bbio->bio.bi_opf))
+               return false;
 
-               if (btrfs_is_zoned(fs_info))
-                       return false;
-               if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags))
-                       return false;
-       }
+       /* Zoned devices require I/O to be submitted in order. */
+       if ((bbio->bio.bi_opf & REQ_META) && btrfs_is_zoned(bbio->fs_info))
+               return false;
 
        return true;
 }
@@ -622,10 +638,7 @@ static bool btrfs_wq_submit_bio(struct btrfs_bio *bbio,
 
        btrfs_init_work(&async->work, run_one_async_start, run_one_async_done,
                        run_one_async_free);
-       if (op_is_sync(bbio->bio.bi_opf))
-               btrfs_queue_work(fs_info->hipri_workers, &async->work);
-       else
-               btrfs_queue_work(fs_info->workers, &async->work);
+       btrfs_queue_work(fs_info->workers, &async->work);
        return true;
 }
 
@@ -635,7 +648,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
        struct btrfs_fs_info *fs_info = bbio->fs_info;
        struct btrfs_bio *orig_bbio = bbio;
        struct bio *bio = &bbio->bio;
-       u64 logical = bio->bi_iter.bi_sector << 9;
+       u64 logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
        u64 length = bio->bi_iter.bi_size;
        u64 map_length = length;
        bool use_append = btrfs_use_zone_append(bbio);
@@ -645,8 +658,8 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
        int error;
 
        btrfs_bio_counter_inc_blocked(fs_info);
-       error = __btrfs_map_block(fs_info, btrfs_op(bio), logical, &map_length,
-                                 &bioc, &smap, &mirror_num, 1);
+       error = btrfs_map_block(fs_info, btrfs_op(bio), logical, &map_length,
+                               &bioc, &smap, &mirror_num, 1);
        if (error) {
                ret = errno_to_blk_status(error);
                goto fail;
@@ -665,7 +678,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
         * Save the iter for the end_io handler and preload the checksums for
         * data reads.
         */
-       if (bio_op(bio) == REQ_OP_READ && inode && !(bio->bi_opf & REQ_META)) {
+       if (bio_op(bio) == REQ_OP_READ && is_data_bbio(bbio)) {
                bbio->saved_iter = bio->bi_iter;
                ret = btrfs_lookup_bio_sums(bbio);
                if (ret)
@@ -676,9 +689,6 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
                if (use_append) {
                        bio->bi_opf &= ~REQ_OP_WRITE;
                        bio->bi_opf |= REQ_OP_ZONE_APPEND;
-                       ret = btrfs_bio_extract_ordered_extent(bbio);
-                       if (ret)
-                               goto fail_put_bio;
                }
 
                /*
@@ -695,6 +705,10 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
                        ret = btrfs_bio_csum(bbio);
                        if (ret)
                                goto fail_put_bio;
+               } else if (use_append) {
+                       ret = btrfs_alloc_dummy_sum(bbio);
+                       if (ret)
+                               goto fail_put_bio;
                }
        }
 
@@ -704,7 +718,7 @@ done:
 
 fail_put_bio:
        if (map_length < length)
-               bio_put(bio);
+               btrfs_cleanup_bio(bbio);
 fail:
        btrfs_bio_counter_dec(fs_info);
        btrfs_bio_end_io(orig_bbio, ret);
index a8eca3a..ca79dec 100644 (file)
@@ -39,8 +39,8 @@ struct btrfs_bio {
 
        union {
                /*
-                * Data checksumming and original I/O information for internal
-                * use in the btrfs_submit_bio machinery.
+                * For data reads: checksumming and original I/O information.
+                * (for internal use in the btrfs_submit_bio machinery only)
                 */
                struct {
                        u8 *csum;
@@ -48,7 +48,20 @@ struct btrfs_bio {
                        struct bvec_iter saved_iter;
                };
 
-               /* For metadata parentness verification. */
+               /*
+                * For data writes:
+                * - ordered extent covering the bio
+                * - pointer to the checksums for this bio
+                * - original physical address from the allocator
+                *   (for zone append only)
+                */
+               struct {
+                       struct btrfs_ordered_extent *ordered;
+                       struct btrfs_ordered_sum *sums;
+                       u64 orig_physical;
+               };
+
+               /* For metadata reads: parentness verification. */
                struct btrfs_tree_parent_check parent_check;
        };
 
@@ -84,15 +97,7 @@ void btrfs_bio_init(struct btrfs_bio *bbio, struct btrfs_fs_info *fs_info,
 struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
                                  struct btrfs_fs_info *fs_info,
                                  btrfs_bio_end_io_t end_io, void *private);
-
-static inline void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
-{
-       bbio->bio.bi_status = status;
-       bbio->end_io(bbio);
-}
-
-/* Bio only refers to one ordered extent. */
-#define REQ_BTRFS_ONE_ORDERED                  REQ_DRV
+void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status);
 
 /* Submit using blkcg_punt_bio_submit. */
 #define REQ_BTRFS_CGROUP_PUNT                  REQ_FS_PRIVATE
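With the exported btrfs_bio_end_io() and the extended union above, a btrfs_bio now has three mutually exclusive views: data reads (csum, saved_iter), data writes (ordered, sums, orig_physical) and metadata reads (parent_check); which view is valid follows from the inode and the bio operation. A minimal sketch of a helper that picks the write-side member (the helper itself is hypothetical, not part of the patch):

static struct btrfs_ordered_extent *bbio_write_ordered(struct btrfs_bio *bbio)
{
	/* Only data writes populate this union member. */
	ASSERT(bbio->inode && btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
	return bbio->ordered;
}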
index 590b035..48ae509 100644 (file)
@@ -95,14 +95,21 @@ static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info, u64 flags)
        }
        allowed &= flags;
 
-       if (allowed & BTRFS_BLOCK_GROUP_RAID6)
+       /* Select the highest-redundancy RAID level. */
+       if (allowed & BTRFS_BLOCK_GROUP_RAID1C4)
+               allowed = BTRFS_BLOCK_GROUP_RAID1C4;
+       else if (allowed & BTRFS_BLOCK_GROUP_RAID6)
                allowed = BTRFS_BLOCK_GROUP_RAID6;
+       else if (allowed & BTRFS_BLOCK_GROUP_RAID1C3)
+               allowed = BTRFS_BLOCK_GROUP_RAID1C3;
        else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
                allowed = BTRFS_BLOCK_GROUP_RAID5;
        else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
                allowed = BTRFS_BLOCK_GROUP_RAID10;
        else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
                allowed = BTRFS_BLOCK_GROUP_RAID1;
+       else if (allowed & BTRFS_BLOCK_GROUP_DUP)
+               allowed = BTRFS_BLOCK_GROUP_DUP;
        else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
                allowed = BTRFS_BLOCK_GROUP_RAID0;
 
@@ -1633,11 +1640,14 @@ void btrfs_mark_bg_unused(struct btrfs_block_group *bg)
 {
        struct btrfs_fs_info *fs_info = bg->fs_info;
 
+       trace_btrfs_add_unused_block_group(bg);
        spin_lock(&fs_info->unused_bgs_lock);
        if (list_empty(&bg->bg_list)) {
                btrfs_get_block_group(bg);
-               trace_btrfs_add_unused_block_group(bg);
                list_add_tail(&bg->bg_list, &fs_info->unused_bgs);
+       } else {
+               /* Pull out the block group from the reclaim_bgs list. */
+               list_move_tail(&bg->bg_list, &fs_info->unused_bgs);
        }
        spin_unlock(&fs_info->unused_bgs_lock);
 }
@@ -1791,8 +1801,15 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
                }
                spin_unlock(&bg->lock);
 
-               /* Get out fast, in case we're unmounting the filesystem */
-               if (btrfs_fs_closing(fs_info)) {
+               /*
+                * Get out fast, in case we're read-only or unmounting the
+                * filesystem. It is OK to drop block groups from the list even
+                * for the read-only case. As we did sb_start_write(),
+                * "mount -o remount,ro" won't happen, and a read-only filesystem
+                * means it was forced read-only due to a fatal error. So it
+                * never gets back to read-write to let us reclaim again.
+                */
+               if (btrfs_need_cleaner_sleep(fs_info)) {
                        up_write(&space_info->groups_sem);
                        goto next;
                }
@@ -1823,11 +1840,27 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
                }
 
 next:
+               if (ret)
+                       btrfs_mark_bg_to_reclaim(bg);
                btrfs_put_block_group(bg);
+
+               mutex_unlock(&fs_info->reclaim_bgs_lock);
+               /*
+                * Reclaiming all the block groups in the list can take a
+                * really long time.  Prioritize cleaning up unused block groups.
+                */
+               btrfs_delete_unused_bgs(fs_info);
+               /*
+                * If we are interrupted by a balance, we can just bail out. The
+                * cleaner thread will restart it if necessary.
+                */
+               if (!mutex_trylock(&fs_info->reclaim_bgs_lock))
+                       goto end;
                spin_lock(&fs_info->unused_bgs_lock);
        }
        spin_unlock(&fs_info->unused_bgs_lock);
        mutex_unlock(&fs_info->reclaim_bgs_lock);
+end:
        btrfs_exclop_finish(fs_info);
        sb_end_write(fs_info->sb);
 }
@@ -1973,7 +2006,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
 
        /* For RAID5/6 adjust to a full IO stripe length */
        if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
-               io_stripe_size = nr_data_stripes(map) << BTRFS_STRIPE_LEN_SHIFT;
+               io_stripe_size = btrfs_stripe_nr_to_offset(nr_data_stripes(map));
 
        buf = kcalloc(map->num_stripes, sizeof(u64), GFP_NOFS);
        if (!buf) {
@@ -3521,9 +3554,9 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
                        spin_unlock(&cache->lock);
                        spin_unlock(&space_info->lock);
 
-                       set_extent_dirty(&trans->transaction->pinned_extents,
-                                        bytenr, bytenr + num_bytes - 1,
-                                        GFP_NOFS | __GFP_NOFAIL);
+                       set_extent_bit(&trans->transaction->pinned_extents,
+                                      bytenr, bytenr + num_bytes - 1,
+                                      EXTENT_DIRTY, NULL);
                }
 
                spin_lock(&trans->transaction->dirty_bgs_lock);
index cc0e4b3..f204add 100644 (file)
@@ -162,7 +162,14 @@ struct btrfs_block_group {
         */
        struct list_head cluster_list;
 
-       /* For delayed block group creation or deletion of empty block groups */
+       /*
+        * Used for several lists:
+        *
+        * 1) struct btrfs_fs_info::unused_bgs
+        * 2) struct btrfs_fs_info::reclaim_bgs
+        * 3) struct btrfs_transaction::deleted_bgs
+        * 4) struct btrfs_trans_handle::new_bgs
+        */
        struct list_head bg_list;
 
        /* For read-only block groups */
index ac18c43..6279d20 100644 (file)
@@ -541,3 +541,22 @@ try_reserve:
 
        return ERR_PTR(ret);
 }
+
+int btrfs_check_trunc_cache_free_space(struct btrfs_fs_info *fs_info,
+                                      struct btrfs_block_rsv *rsv)
+{
+       u64 needed_bytes;
+       int ret;
+
+       /* 1 for slack space, 1 for updating the inode */
+       needed_bytes = btrfs_calc_insert_metadata_size(fs_info, 1) +
+               btrfs_calc_metadata_size(fs_info, 1);
+
+       spin_lock(&rsv->lock);
+       if (rsv->reserved < needed_bytes)
+               ret = -ENOSPC;
+       else
+               ret = 0;
+       spin_unlock(&rsv->lock);
+       return ret;
+}
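btrfs_check_trunc_cache_free_space() only inspects the reserve under its spinlock and reserves nothing itself, so a caller still has to have filled the rsv before truncating a free space cache inode. A minimal usage sketch (hypothetical call site, not from this patch):

ret = btrfs_check_trunc_cache_free_space(fs_info, &fs_info->global_block_rsv);
if (ret)	/* -ENOSPC: not enough reserved metadata space for the truncate */
	goto out;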
index 6dc7817..b0bd12b 100644 (file)
@@ -82,6 +82,8 @@ void btrfs_release_global_block_rsv(struct btrfs_fs_info *fs_info);
 struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans,
                                            struct btrfs_root *root,
                                            u32 blocksize);
+int btrfs_check_trunc_cache_free_space(struct btrfs_fs_info *fs_info,
+                                      struct btrfs_block_rsv *rsv);
 static inline void btrfs_unuse_block_rsv(struct btrfs_fs_info *fs_info,
                                         struct btrfs_block_rsv *block_rsv,
                                         u32 blocksize)
index ec2ae44..d47a927 100644 (file)
@@ -116,9 +116,6 @@ struct btrfs_inode {
 
        unsigned long runtime_flags;
 
-       /* Keep track of who's O_SYNC/fsyncing currently */
-       atomic_t sync_writers;
-
        /* full 64 bit generation number, struct vfs_inode doesn't have a big
         * enough field for this.
         */
@@ -335,7 +332,7 @@ static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
        if (btrfs_is_free_space_inode(inode))
                return;
        trace_btrfs_inode_mod_outstanding_extents(inode->root, btrfs_ino(inode),
-                                                 mod);
+                                                 mod, inode->outstanding_extents);
 }
 
 /*
@@ -407,30 +404,12 @@ static inline bool btrfs_inode_can_compress(const struct btrfs_inode *inode)
        return true;
 }
 
-/*
- * btrfs_inode_item stores flags in a u64, btrfs_inode stores them in two
- * separate u32s. These two functions convert between the two representations.
- */
-static inline u64 btrfs_inode_combine_flags(u32 flags, u32 ro_flags)
-{
-       return (flags | ((u64)ro_flags << 32));
-}
-
-static inline void btrfs_inode_split_flags(u64 inode_item_flags,
-                                          u32 *flags, u32 *ro_flags)
-{
-       *flags = (u32)inode_item_flags;
-       *ro_flags = (u32)(inode_item_flags >> 32);
-}
-
 /* Array of bytes with variable length, hexadecimal format 0x1234 */
 #define CSUM_FMT                               "0x%*phN"
 #define CSUM_FMT_VALUE(size, bytes)            size, bytes
 
 int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
                            u32 pgoff, u8 *csum, const u8 * const csum_expected);
-int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
-                                struct btrfs_ordered_extent *ordered);
 bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
                        u32 bio_offset, struct bio_vec *bv);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
index 82e49d9..3caf339 100644 (file)
@@ -1459,13 +1459,13 @@ static int btrfsic_map_block(struct btrfsic_state *state, u64 bytenr, u32 len,
        struct btrfs_fs_info *fs_info = state->fs_info;
        int ret;
        u64 length;
-       struct btrfs_io_context *multi = NULL;
+       struct btrfs_io_context *bioc = NULL;
+       struct btrfs_io_stripe smap, *map;
        struct btrfs_device *device;
 
        length = len;
-       ret = btrfs_map_block(fs_info, BTRFS_MAP_READ,
-                             bytenr, &length, &multi, mirror_num);
-
+       ret = btrfs_map_block(fs_info, BTRFS_MAP_READ, bytenr, &length, &bioc,
+                             NULL, &mirror_num, 0);
        if (ret) {
                block_ctx_out->start = 0;
                block_ctx_out->dev_bytenr = 0;
@@ -1478,21 +1478,26 @@ static int btrfsic_map_block(struct btrfsic_state *state, u64 bytenr, u32 len,
                return ret;
        }
 
-       device = multi->stripes[0].dev;
+       if (bioc)
+               map = &bioc->stripes[0];
+       else
+               map = &smap;
+
+       device = map->dev;
        if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state) ||
            !device->bdev || !device->name)
                block_ctx_out->dev = NULL;
        else
                block_ctx_out->dev = btrfsic_dev_state_lookup(
                                                        device->bdev->bd_dev);
-       block_ctx_out->dev_bytenr = multi->stripes[0].physical;
+       block_ctx_out->dev_bytenr = map->physical;
        block_ctx_out->start = bytenr;
        block_ctx_out->len = len;
        block_ctx_out->datav = NULL;
        block_ctx_out->pagev = NULL;
        block_ctx_out->mem_to_free = NULL;
 
-       kfree(multi);
+       kfree(bioc);
        if (NULL == block_ctx_out->dev) {
                ret = -ENXIO;
                pr_info("btrfsic: error, cannot lookup dev (#1)!\n");
@@ -1565,7 +1570,7 @@ static int btrfsic_read_block(struct btrfsic_state *state,
 
                bio = bio_alloc(block_ctx->dev->bdev, num_pages - i,
                                REQ_OP_READ, GFP_NOFS);
-               bio->bi_iter.bi_sector = dev_bytenr >> 9;
+               bio->bi_iter.bi_sector = dev_bytenr >> SECTOR_SHIFT;
 
                for (j = i; j < num_pages; j++) {
                        ret = bio_add_page(bio, block_ctx->pagev[j],
index 2d0493f..8818ed5 100644 (file)
@@ -37,7 +37,7 @@
 #include "file-item.h"
 #include "super.h"
 
-struct bio_set btrfs_compressed_bioset;
+static struct bio_set btrfs_compressed_bioset;
 
 static const char* const btrfs_compress_types[] = { "", "zlib", "lzo", "zstd" };
 
@@ -211,8 +211,6 @@ static noinline void end_compressed_writeback(const struct compressed_bio *cb)
                for (i = 0; i < ret; i++) {
                        struct folio *folio = fbatch.folios[i];
 
-                       if (errno)
-                               folio_set_error(folio);
                        btrfs_page_clamp_clear_writeback(fs_info, &folio->page,
                                                         cb->start, cb->len);
                }
@@ -226,13 +224,8 @@ static void btrfs_finish_compressed_write_work(struct work_struct *work)
        struct compressed_bio *cb =
                container_of(work, struct compressed_bio, write_end_work);
 
-       /*
-        * Ok, we're the last bio for this extent, step one is to call back
-        * into the FS and do all the end_io operations.
-        */
-       btrfs_writepage_endio_finish_ordered(cb->bbio.inode, NULL,
-                       cb->start, cb->start + cb->len - 1,
-                       cb->bbio.bio.bi_status == BLK_STS_OK);
+       btrfs_finish_ordered_extent(cb->bbio.ordered, NULL, cb->start, cb->len,
+                                   cb->bbio.bio.bi_status == BLK_STS_OK);
 
        if (cb->writeback)
                end_compressed_writeback(cb);
@@ -281,32 +274,31 @@ static void btrfs_add_compressed_bio_pages(struct compressed_bio *cb)
  * This also checksums the file bytes and gets things ready for
  * the end io hooks.
  */
-void btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
-                                unsigned int len, u64 disk_start,
-                                unsigned int compressed_len,
-                                struct page **compressed_pages,
-                                unsigned int nr_pages,
-                                blk_opf_t write_flags,
-                                bool writeback)
+void btrfs_submit_compressed_write(struct btrfs_ordered_extent *ordered,
+                                  struct page **compressed_pages,
+                                  unsigned int nr_pages,
+                                  blk_opf_t write_flags,
+                                  bool writeback)
 {
+       struct btrfs_inode *inode = BTRFS_I(ordered->inode);
        struct btrfs_fs_info *fs_info = inode->root->fs_info;
        struct compressed_bio *cb;
 
-       ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
-              IS_ALIGNED(len, fs_info->sectorsize));
-
-       write_flags |= REQ_BTRFS_ONE_ORDERED;
+       ASSERT(IS_ALIGNED(ordered->file_offset, fs_info->sectorsize));
+       ASSERT(IS_ALIGNED(ordered->num_bytes, fs_info->sectorsize));
 
-       cb = alloc_compressed_bio(inode, start, REQ_OP_WRITE | write_flags,
+       cb = alloc_compressed_bio(inode, ordered->file_offset,
+                                 REQ_OP_WRITE | write_flags,
                                  end_compressed_bio_write);
-       cb->start = start;
-       cb->len = len;
+       cb->start = ordered->file_offset;
+       cb->len = ordered->num_bytes;
        cb->compressed_pages = compressed_pages;
-       cb->compressed_len = compressed_len;
+       cb->compressed_len = ordered->disk_num_bytes;
        cb->writeback = writeback;
        INIT_WORK(&cb->write_end_work, btrfs_finish_compressed_write_work);
        cb->nr_pages = nr_pages;
-       cb->bbio.bio.bi_iter.bi_sector = disk_start >> SECTOR_SHIFT;
+       cb->bbio.bio.bi_iter.bi_sector = ordered->disk_bytenr >> SECTOR_SHIFT;
+       cb->bbio.ordered = ordered;
        btrfs_add_compressed_bio_pages(cb);
 
        btrfs_submit_bio(&cb->bbio, 0);
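After this change btrfs_submit_compressed_write() pulls the file offset, length and on-disk location straight from the ordered extent, so a caller only passes the ordered extent, the compressed pages and the write flags. A minimal usage sketch (the flag value is illustrative, not from this patch):

btrfs_submit_compressed_write(ordered, compressed_pages, nr_pages,
			      REQ_BACKGROUND, true /* writeback */);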
@@ -421,7 +413,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
                 */
                if (!em || cur < em->start ||
                    (cur + fs_info->sectorsize > extent_map_end(em)) ||
-                   (em->block_start >> 9) != orig_bio->bi_iter.bi_sector) {
+                   (em->block_start >> SECTOR_SHIFT) != orig_bio->bi_iter.bi_sector) {
                        free_extent_map(em);
                        unlock_extent(tree, cur, page_end, NULL);
                        unlock_page(page);
@@ -472,7 +464,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
  * After the compressed pages are read, we copy the bytes into the
  * bio we were passed and then call the bio end_io calls
  */
-void btrfs_submit_compressed_read(struct btrfs_bio *bbio, int mirror_num)
+void btrfs_submit_compressed_read(struct btrfs_bio *bbio)
 {
        struct btrfs_inode *inode = bbio->inode;
        struct btrfs_fs_info *fs_info = inode->root->fs_info;
@@ -538,7 +530,7 @@ void btrfs_submit_compressed_read(struct btrfs_bio *bbio, int mirror_num)
        if (memstall)
                psi_memstall_leave(&pflags);
 
-       btrfs_submit_bio(&cb->bbio, mirror_num);
+       btrfs_submit_bio(&cb->bbio, 0);
        return;
 
 out_free_compressed_pages:
index 19ab2ab..03bb9d1 100644 (file)
@@ -10,6 +10,7 @@
 #include "bio.h"
 
 struct btrfs_inode;
+struct btrfs_ordered_extent;
 
 /*
  * We want to make sure that amount of RAM required to uncompress an extent is
@@ -86,14 +87,12 @@ int btrfs_decompress(int type, const u8 *data_in, struct page *dest_page,
 int btrfs_decompress_buf2page(const char *buf, u32 buf_len,
                              struct compressed_bio *cb, u32 decompressed);
 
-void btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
-                                 unsigned int len, u64 disk_start,
-                                 unsigned int compressed_len,
+void btrfs_submit_compressed_write(struct btrfs_ordered_extent *ordered,
                                  struct page **compressed_pages,
                                  unsigned int nr_pages,
                                  blk_opf_t write_flags,
                                  bool writeback);
-void btrfs_submit_compressed_read(struct btrfs_bio *bbio, int mirror_num);
+void btrfs_submit_compressed_read(struct btrfs_bio *bbio);
 
 unsigned int btrfs_compress_str2level(unsigned int type, const char *str);
 
index 2ff2961..a4cb4b6 100644 (file)
@@ -37,8 +37,6 @@ static int push_node_left(struct btrfs_trans_handle *trans,
 static int balance_node_right(struct btrfs_trans_handle *trans,
                              struct extent_buffer *dst_buf,
                              struct extent_buffer *src_buf);
-static void del_ptr(struct btrfs_root *root, struct btrfs_path *path,
-                   int level, int slot);
 
 static const struct btrfs_csums {
        u16             size;
@@ -150,13 +148,19 @@ static inline void copy_leaf_items(const struct extent_buffer *dst,
                              nr_items * sizeof(struct btrfs_item));
 }
 
+/* This exists for btrfs-progs usage. */
+u16 btrfs_csum_type_size(u16 type)
+{
+       return btrfs_csums[type].size;
+}
+
 int btrfs_super_csum_size(const struct btrfs_super_block *s)
 {
        u16 t = btrfs_super_csum_type(s);
        /*
         * csum type is validated at mount time
         */
-       return btrfs_csums[t].size;
+       return btrfs_csum_type_size(t);
 }
 
 const char *btrfs_super_csum_name(u16 csum_type)
@@ -417,9 +421,13 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
                                               &refs, &flags);
                if (ret)
                        return ret;
-               if (refs == 0) {
-                       ret = -EROFS;
-                       btrfs_handle_fs_error(fs_info, ret, NULL);
+               if (unlikely(refs == 0)) {
+                       btrfs_crit(fs_info,
+               "found 0 references for tree block at bytenr %llu level %d root %llu",
+                                  buf->start, btrfs_header_level(buf),
+                                  btrfs_root_id(root));
+                       ret = -EUCLEAN;
+                       btrfs_abort_transaction(trans, ret);
                        return ret;
                }
        } else {
@@ -464,10 +472,7 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
                                return ret;
                }
                if (new_flags != 0) {
-                       int level = btrfs_header_level(buf);
-
-                       ret = btrfs_set_disk_extent_flags(trans, buf,
-                                                         new_flags, level);
+                       ret = btrfs_set_disk_extent_flags(trans, buf, new_flags);
                        if (ret)
                                return ret;
                }
@@ -583,9 +588,14 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
                    btrfs_header_backref_rev(buf) < BTRFS_MIXED_BACKREF_REV)
                        parent_start = buf->start;
 
-               atomic_inc(&cow->refs);
                ret = btrfs_tree_mod_log_insert_root(root->node, cow, true);
-               BUG_ON(ret < 0);
+               if (ret < 0) {
+                       btrfs_tree_unlock(cow);
+                       free_extent_buffer(cow);
+                       btrfs_abort_transaction(trans, ret);
+                       return ret;
+               }
+               atomic_inc(&cow->refs);
                rcu_assign_pointer(root->node, cow);
 
                btrfs_free_tree_block(trans, btrfs_root_id(root), buf,
@@ -594,8 +604,14 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
                add_root_to_dirty_list(root);
        } else {
                WARN_ON(trans->transid != btrfs_header_generation(parent));
-               btrfs_tree_mod_log_insert_key(parent, parent_slot,
-                                             BTRFS_MOD_LOG_KEY_REPLACE);
+               ret = btrfs_tree_mod_log_insert_key(parent, parent_slot,
+                                                   BTRFS_MOD_LOG_KEY_REPLACE);
+               if (ret) {
+                       btrfs_tree_unlock(cow);
+                       free_extent_buffer(cow);
+                       btrfs_abort_transaction(trans, ret);
+                       return ret;
+               }
                btrfs_set_node_blockptr(parent, parent_slot,
                                        cow->start);
                btrfs_set_node_ptr_generation(parent, parent_slot,
@@ -1028,8 +1044,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
                child = btrfs_read_node_slot(mid, 0);
                if (IS_ERR(child)) {
                        ret = PTR_ERR(child);
-                       btrfs_handle_fs_error(fs_info, ret, NULL);
-                       goto enospc;
+                       goto out;
                }
 
                btrfs_tree_lock(child);
@@ -1038,11 +1053,16 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
                if (ret) {
                        btrfs_tree_unlock(child);
                        free_extent_buffer(child);
-                       goto enospc;
+                       goto out;
                }
 
                ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
-               BUG_ON(ret < 0);
+               if (ret < 0) {
+                       btrfs_tree_unlock(child);
+                       free_extent_buffer(child);
+                       btrfs_abort_transaction(trans, ret);
+                       goto out;
+               }
                rcu_assign_pointer(root->node, child);
 
                add_root_to_dirty_list(root);
@@ -1070,7 +1090,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
                if (IS_ERR(left)) {
                        ret = PTR_ERR(left);
                        left = NULL;
-                       goto enospc;
+                       goto out;
                }
 
                __btrfs_tree_lock(left, BTRFS_NESTING_LEFT);
@@ -1079,7 +1099,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
                                       BTRFS_NESTING_LEFT_COW);
                if (wret) {
                        ret = wret;
-                       goto enospc;
+                       goto out;
                }
        }
 
@@ -1088,7 +1108,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
                if (IS_ERR(right)) {
                        ret = PTR_ERR(right);
                        right = NULL;
-                       goto enospc;
+                       goto out;
                }
 
                __btrfs_tree_lock(right, BTRFS_NESTING_RIGHT);
@@ -1097,7 +1117,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
                                       BTRFS_NESTING_RIGHT_COW);
                if (wret) {
                        ret = wret;
-                       goto enospc;
+                       goto out;
                }
        }
 
@@ -1119,7 +1139,12 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
                if (btrfs_header_nritems(right) == 0) {
                        btrfs_clear_buffer_dirty(trans, right);
                        btrfs_tree_unlock(right);
-                       del_ptr(root, path, level + 1, pslot + 1);
+                       ret = btrfs_del_ptr(trans, root, path, level + 1, pslot + 1);
+                       if (ret < 0) {
+                               free_extent_buffer_stale(right);
+                               right = NULL;
+                               goto out;
+                       }
                        root_sub_used(root, right->len);
                        btrfs_free_tree_block(trans, btrfs_root_id(root), right,
                                              0, 1);
@@ -1130,7 +1155,10 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
                        btrfs_node_key(right, &right_key, 0);
                        ret = btrfs_tree_mod_log_insert_key(parent, pslot + 1,
                                        BTRFS_MOD_LOG_KEY_REPLACE);
-                       BUG_ON(ret < 0);
+                       if (ret < 0) {
+                               btrfs_abort_transaction(trans, ret);
+                               goto out;
+                       }
                        btrfs_set_node_key(parent, &right_key, pslot + 1);
                        btrfs_mark_buffer_dirty(parent);
                }
@@ -1145,15 +1173,19 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
                 * otherwise we would have pulled some pointers from the
                 * right
                 */
-               if (!left) {
-                       ret = -EROFS;
-                       btrfs_handle_fs_error(fs_info, ret, NULL);
-                       goto enospc;
+               if (unlikely(!left)) {
+                       btrfs_crit(fs_info,
+"missing left child when middle child only has 1 item, parent bytenr %llu level %d mid bytenr %llu root %llu",
+                                  parent->start, btrfs_header_level(parent),
+                                  mid->start, btrfs_root_id(root));
+                       ret = -EUCLEAN;
+                       btrfs_abort_transaction(trans, ret);
+                       goto out;
                }
                wret = balance_node_right(trans, mid, left);
                if (wret < 0) {
                        ret = wret;
-                       goto enospc;
+                       goto out;
                }
                if (wret == 1) {
                        wret = push_node_left(trans, left, mid, 1);
@@ -1165,7 +1197,12 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
        if (btrfs_header_nritems(mid) == 0) {
                btrfs_clear_buffer_dirty(trans, mid);
                btrfs_tree_unlock(mid);
-               del_ptr(root, path, level + 1, pslot);
+               ret = btrfs_del_ptr(trans, root, path, level + 1, pslot);
+               if (ret < 0) {
+                       free_extent_buffer_stale(mid);
+                       mid = NULL;
+                       goto out;
+               }
                root_sub_used(root, mid->len);
                btrfs_free_tree_block(trans, btrfs_root_id(root), mid, 0, 1);
                free_extent_buffer_stale(mid);
@@ -1176,7 +1213,10 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
                btrfs_node_key(mid, &mid_key, 0);
                ret = btrfs_tree_mod_log_insert_key(parent, pslot,
                                                    BTRFS_MOD_LOG_KEY_REPLACE);
-               BUG_ON(ret < 0);
+               if (ret < 0) {
+                       btrfs_abort_transaction(trans, ret);
+                       goto out;
+               }
                btrfs_set_node_key(parent, &mid_key, pslot);
                btrfs_mark_buffer_dirty(parent);
        }
@@ -1202,7 +1242,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
        if (orig_ptr !=
            btrfs_node_blockptr(path->nodes[level], path->slots[level]))
                BUG();
-enospc:
+out:
        if (right) {
                btrfs_tree_unlock(right);
                free_extent_buffer(right);
@@ -1278,7 +1318,12 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
                        btrfs_node_key(mid, &disk_key, 0);
                        ret = btrfs_tree_mod_log_insert_key(parent, pslot,
                                        BTRFS_MOD_LOG_KEY_REPLACE);
-                       BUG_ON(ret < 0);
+                       if (ret < 0) {
+                               btrfs_tree_unlock(left);
+                               free_extent_buffer(left);
+                               btrfs_abort_transaction(trans, ret);
+                               return ret;
+                       }
                        btrfs_set_node_key(parent, &disk_key, pslot);
                        btrfs_mark_buffer_dirty(parent);
                        if (btrfs_header_nritems(left) > orig_slot) {
@@ -1333,7 +1378,12 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
                        btrfs_node_key(right, &disk_key, 0);
                        ret = btrfs_tree_mod_log_insert_key(parent, pslot + 1,
                                        BTRFS_MOD_LOG_KEY_REPLACE);
-                       BUG_ON(ret < 0);
+                       if (ret < 0) {
+                               btrfs_tree_unlock(right);
+                               free_extent_buffer(right);
+                               btrfs_abort_transaction(trans, ret);
+                               return ret;
+                       }
                        btrfs_set_node_key(parent, &disk_key, pslot + 1);
                        btrfs_mark_buffer_dirty(parent);
 
@@ -2379,6 +2429,87 @@ done:
 }
 
 /*
+ * Search the tree again to find a leaf with smaller keys.
+ * Returns 0 if it found something.
+ * Returns 1 if there are no smaller keys.
+ * Returns < 0 on error.
+ *
+ * This may release the path, and so you may lose any locks held at the
+ * time you call it.
+ */
+static int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path)
+{
+       struct btrfs_key key;
+       struct btrfs_key orig_key;
+       struct btrfs_disk_key found_key;
+       int ret;
+
+       btrfs_item_key_to_cpu(path->nodes[0], &key, 0);
+       orig_key = key;
+
+       if (key.offset > 0) {
+               key.offset--;
+       } else if (key.type > 0) {
+               key.type--;
+               key.offset = (u64)-1;
+       } else if (key.objectid > 0) {
+               key.objectid--;
+               key.type = (u8)-1;
+               key.offset = (u64)-1;
+       } else {
+               return 1;
+       }
+
+       btrfs_release_path(path);
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       if (ret <= 0)
+               return ret;
+
+       /*
+        * Previous key not found. Even if we were at slot 0 of the leaf we had
+        * before releasing the path and calling btrfs_search_slot(), we now may
+        * be in a slot pointing to the same original key - this can happen if
+        * after we released the path, one or more items were moved from a
+        * sibling leaf into the front of the leaf we had due to an insertion
+        * (see push_leaf_right()).
+        * If we hit this case and our slot is > 0, just decrement the slot so
+        * that the caller does not process the same key again; processing it
+        * twice may or may not break the caller, depending on its logic.
+        */
+       if (path->slots[0] < btrfs_header_nritems(path->nodes[0])) {
+               btrfs_item_key(path->nodes[0], &found_key, path->slots[0]);
+               ret = comp_keys(&found_key, &orig_key);
+               if (ret == 0) {
+                       if (path->slots[0] > 0) {
+                               path->slots[0]--;
+                               return 0;
+                       }
+                       /*
+                        * At slot 0, same key as before, it means orig_key is
+                        * the lowest, leftmost, key in the tree. We're done.
+                        */
+                       return 1;
+               }
+       }
+
+       btrfs_item_key(path->nodes[0], &found_key, 0);
+       ret = comp_keys(&found_key, &key);
+       /*
+        * We might have had an item with the previous key in the tree right
+        * before we released our path. And after we released our path, that
+        * item might have been pushed to the first slot (0) of the leaf we
+        * were holding due to a tree balance. Alternatively, an item with the
+        * previous key can exist as the only element of a leaf (big fat item).
+        * Therefore account for these 2 cases, so that our callers (like
+        * btrfs_previous_item) don't miss an existing item with a key matching
+        * the previous key we computed above.
+        */
+       if (ret <= 0)
+               return 0;
+       return 1;
+}
+
+/*
  * helper to use instead of search slot if no exact match is needed but
  * instead the next or previous item should be returned.
  * When find_higher is true, the next higher item is returned, the next lower
@@ -2552,6 +2683,7 @@ void btrfs_set_item_key_safe(struct btrfs_fs_info *fs_info,
        if (slot > 0) {
                btrfs_item_key(eb, &disk_key, slot - 1);
                if (unlikely(comp_keys(&disk_key, new_key) >= 0)) {
+                       btrfs_print_leaf(eb);
                        btrfs_crit(fs_info,
                "slot %u key (%llu %u %llu) new key (%llu %u %llu)",
                                   slot, btrfs_disk_key_objectid(&disk_key),
@@ -2559,13 +2691,13 @@ void btrfs_set_item_key_safe(struct btrfs_fs_info *fs_info,
                                   btrfs_disk_key_offset(&disk_key),
                                   new_key->objectid, new_key->type,
                                   new_key->offset);
-                       btrfs_print_leaf(eb);
                        BUG();
                }
        }
        if (slot < btrfs_header_nritems(eb) - 1) {
                btrfs_item_key(eb, &disk_key, slot + 1);
                if (unlikely(comp_keys(&disk_key, new_key) <= 0)) {
+                       btrfs_print_leaf(eb);
                        btrfs_crit(fs_info,
                "slot %u key (%llu %u %llu) new key (%llu %u %llu)",
                                   slot, btrfs_disk_key_objectid(&disk_key),
@@ -2573,7 +2705,6 @@ void btrfs_set_item_key_safe(struct btrfs_fs_info *fs_info,
                                   btrfs_disk_key_offset(&disk_key),
                                   new_key->objectid, new_key->type,
                                   new_key->offset);
-                       btrfs_print_leaf(eb);
                        BUG();
                }
        }
@@ -2626,7 +2757,7 @@ static bool check_sibling_keys(struct extent_buffer *left,
                btrfs_item_key_to_cpu(right, &right_first, 0);
        }
 
-       if (btrfs_comp_cpu_keys(&left_last, &right_first) >= 0) {
+       if (unlikely(btrfs_comp_cpu_keys(&left_last, &right_first) >= 0)) {
                btrfs_crit(left->fs_info, "left extent buffer:");
                btrfs_print_tree(left, false);
                btrfs_crit(left->fs_info, "right extent buffer:");
@@ -2703,8 +2834,8 @@ static int push_node_left(struct btrfs_trans_handle *trans,
 
        if (push_items < src_nritems) {
                /*
-                * Don't call btrfs_tree_mod_log_insert_move() here, key removal
-                * was already fully logged by btrfs_tree_mod_log_eb_copy() above.
+                * btrfs_tree_mod_log_eb_copy handles logging the move, so we
+                * don't need to do an explicit tree mod log operation for it.
                 */
                memmove_extent_buffer(src, btrfs_node_key_ptr_offset(src, 0),
                                      btrfs_node_key_ptr_offset(src, push_items),
@@ -2765,8 +2896,11 @@ static int balance_node_right(struct btrfs_trans_handle *trans,
                btrfs_abort_transaction(trans, ret);
                return ret;
        }
-       ret = btrfs_tree_mod_log_insert_move(dst, push_items, 0, dst_nritems);
-       BUG_ON(ret < 0);
+
+       /*
+        * btrfs_tree_mod_log_eb_copy handles logging the move, so we don't
+        * need to do an explicit tree mod log operation for it.
+        */
        memmove_extent_buffer(dst, btrfs_node_key_ptr_offset(dst, push_items),
                                      btrfs_node_key_ptr_offset(dst, 0),
                                      (dst_nritems) *
@@ -2840,7 +2974,12 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans,
 
        old = root->node;
        ret = btrfs_tree_mod_log_insert_root(root->node, c, false);
-       BUG_ON(ret < 0);
+       if (ret < 0) {
+               btrfs_free_tree_block(trans, btrfs_root_id(root), c, 0, 1);
+               btrfs_tree_unlock(c);
+               free_extent_buffer(c);
+               return ret;
+       }
        rcu_assign_pointer(root->node, c);
 
        /* the super has an extra ref to root->node */
@@ -2861,10 +3000,10 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans,
  * slot and level indicate where you want the key to go, and
  * blocknr is the block the key points to.
  */
-static void insert_ptr(struct btrfs_trans_handle *trans,
-                      struct btrfs_path *path,
-                      struct btrfs_disk_key *key, u64 bytenr,
-                      int slot, int level)
+static int insert_ptr(struct btrfs_trans_handle *trans,
+                     struct btrfs_path *path,
+                     struct btrfs_disk_key *key, u64 bytenr,
+                     int slot, int level)
 {
        struct extent_buffer *lower;
        int nritems;
@@ -2880,7 +3019,10 @@ static void insert_ptr(struct btrfs_trans_handle *trans,
                if (level) {
                        ret = btrfs_tree_mod_log_insert_move(lower, slot + 1,
                                        slot, nritems - slot);
-                       BUG_ON(ret < 0);
+                       if (ret < 0) {
+                               btrfs_abort_transaction(trans, ret);
+                               return ret;
+                       }
                }
                memmove_extent_buffer(lower,
                              btrfs_node_key_ptr_offset(lower, slot + 1),
@@ -2890,7 +3032,10 @@ static void insert_ptr(struct btrfs_trans_handle *trans,
        if (level) {
                ret = btrfs_tree_mod_log_insert_key(lower, slot,
                                                    BTRFS_MOD_LOG_KEY_ADD);
-               BUG_ON(ret < 0);
+               if (ret < 0) {
+                       btrfs_abort_transaction(trans, ret);
+                       return ret;
+               }
        }
        btrfs_set_node_key(lower, key, slot);
        btrfs_set_node_blockptr(lower, slot, bytenr);
@@ -2898,6 +3043,8 @@ static void insert_ptr(struct btrfs_trans_handle *trans,
        btrfs_set_node_ptr_generation(lower, slot, trans->transid);
        btrfs_set_header_nritems(lower, nritems + 1);
        btrfs_mark_buffer_dirty(lower);
+
+       return 0;
 }
 
 /*
@@ -2962,6 +3109,8 @@ static noinline int split_node(struct btrfs_trans_handle *trans,
 
        ret = btrfs_tree_mod_log_eb_copy(split, c, 0, mid, c_nritems - mid);
        if (ret) {
+               btrfs_tree_unlock(split);
+               free_extent_buffer(split);
                btrfs_abort_transaction(trans, ret);
                return ret;
        }
@@ -2975,8 +3124,13 @@ static noinline int split_node(struct btrfs_trans_handle *trans,
        btrfs_mark_buffer_dirty(c);
        btrfs_mark_buffer_dirty(split);
 
-       insert_ptr(trans, path, &disk_key, split->start,
-                  path->slots[level + 1] + 1, level + 1);
+       ret = insert_ptr(trans, path, &disk_key, split->start,
+                        path->slots[level + 1] + 1, level + 1);
+       if (ret < 0) {
+               btrfs_tree_unlock(split);
+               free_extent_buffer(split);
+               return ret;
+       }
 
        if (path->slots[level] >= mid) {
                path->slots[level] -= mid;
@@ -2996,7 +3150,7 @@ static noinline int split_node(struct btrfs_trans_handle *trans,
  * and nr indicate which items in the leaf to check.  This totals up the
  * space used both by the item structs and the item data
  */
-static int leaf_space_used(struct extent_buffer *l, int start, int nr)
+static int leaf_space_used(const struct extent_buffer *l, int start, int nr)
 {
        int data_len;
        int nritems = btrfs_header_nritems(l);
@@ -3016,7 +3170,7 @@ static int leaf_space_used(struct extent_buffer *l, int start, int nr)
  * the start of the leaf data.  IOW, how much room
  * the leaf has left for both items and data
  */
-noinline int btrfs_leaf_free_space(struct extent_buffer *leaf)
+int btrfs_leaf_free_space(const struct extent_buffer *leaf)
 {
        struct btrfs_fs_info *fs_info = leaf->fs_info;
        int nritems = btrfs_header_nritems(leaf);
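The two helpers in this hunk only gain const qualifiers here, but their relationship is what the split/push paths rely on. A sketch of the accounting, assuming the usual btrfs leaf layout where item headers grow from the front of the data area and item payloads grow from the back (illustrative, not the kernel's exact code):

        static int leaf_free_space_sketch(const struct extent_buffer *leaf)
        {
                int nritems = btrfs_header_nritems(leaf);

                /*
                 * leaf_space_used() = nritems * sizeof(struct btrfs_item)
                 *                     + bytes of item data those slots consume;
                 * whatever is left of the data area is the free space.
                 */
                return BTRFS_LEAF_DATA_SIZE(leaf->fs_info) -
                       leaf_space_used(leaf, 0, nritems);
        }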
@@ -3453,16 +3607,17 @@ out:
  * split the path's leaf in two, making sure there is at least data_size
  * available for the resulting leaf level of the path.
  */
-static noinline void copy_for_split(struct btrfs_trans_handle *trans,
-                                   struct btrfs_path *path,
-                                   struct extent_buffer *l,
-                                   struct extent_buffer *right,
-                                   int slot, int mid, int nritems)
+static noinline int copy_for_split(struct btrfs_trans_handle *trans,
+                                  struct btrfs_path *path,
+                                  struct extent_buffer *l,
+                                  struct extent_buffer *right,
+                                  int slot, int mid, int nritems)
 {
        struct btrfs_fs_info *fs_info = trans->fs_info;
        int data_copy_size;
        int rt_data_off;
        int i;
+       int ret;
        struct btrfs_disk_key disk_key;
        struct btrfs_map_token token;
 
@@ -3487,7 +3642,9 @@ static noinline void copy_for_split(struct btrfs_trans_handle *trans,
 
        btrfs_set_header_nritems(l, mid);
        btrfs_item_key(right, &disk_key, 0);
-       insert_ptr(trans, path, &disk_key, right->start, path->slots[1] + 1, 1);
+       ret = insert_ptr(trans, path, &disk_key, right->start, path->slots[1] + 1, 1);
+       if (ret < 0)
+               return ret;
 
        btrfs_mark_buffer_dirty(right);
        btrfs_mark_buffer_dirty(l);
@@ -3505,6 +3662,8 @@ static noinline void copy_for_split(struct btrfs_trans_handle *trans,
        }
 
        BUG_ON(path->slots[0] < 0);
+
+       return 0;
 }
 
 /*
@@ -3703,8 +3862,13 @@ again:
        if (split == 0) {
                if (mid <= slot) {
                        btrfs_set_header_nritems(right, 0);
-                       insert_ptr(trans, path, &disk_key,
-                                  right->start, path->slots[1] + 1, 1);
+                       ret = insert_ptr(trans, path, &disk_key,
+                                        right->start, path->slots[1] + 1, 1);
+                       if (ret < 0) {
+                               btrfs_tree_unlock(right);
+                               free_extent_buffer(right);
+                               return ret;
+                       }
                        btrfs_tree_unlock(path->nodes[0]);
                        free_extent_buffer(path->nodes[0]);
                        path->nodes[0] = right;
@@ -3712,8 +3876,13 @@ again:
                        path->slots[1] += 1;
                } else {
                        btrfs_set_header_nritems(right, 0);
-                       insert_ptr(trans, path, &disk_key,
-                                  right->start, path->slots[1], 1);
+                       ret = insert_ptr(trans, path, &disk_key,
+                                        right->start, path->slots[1], 1);
+                       if (ret < 0) {
+                               btrfs_tree_unlock(right);
+                               free_extent_buffer(right);
+                               return ret;
+                       }
                        btrfs_tree_unlock(path->nodes[0]);
                        free_extent_buffer(path->nodes[0]);
                        path->nodes[0] = right;
@@ -3729,7 +3898,12 @@ again:
                return ret;
        }
 
-       copy_for_split(trans, path, l, right, slot, mid, nritems);
+       ret = copy_for_split(trans, path, l, right, slot, mid, nritems);
+       if (ret < 0) {
+               btrfs_tree_unlock(right);
+               free_extent_buffer(right);
+               return ret;
+       }
 
        if (split == 2) {
                BUG_ON(num_doubles != 0);
@@ -3826,7 +4000,12 @@ static noinline int split_item(struct btrfs_path *path,
        struct btrfs_disk_key disk_key;
 
        leaf = path->nodes[0];
-       BUG_ON(btrfs_leaf_free_space(leaf) < sizeof(struct btrfs_item));
+       /*
+        * Shouldn't happen because the caller must have previously called
+        * setup_leaf_for_split() to make room for the new item in the leaf.
+        */
+       if (WARN_ON(btrfs_leaf_free_space(leaf) < sizeof(struct btrfs_item)))
+               return -ENOSPC;
 
        orig_slot = path->slots[0];
        orig_offset = btrfs_item_offset(leaf, path->slots[0]);
@@ -4273,9 +4452,11 @@ int btrfs_duplicate_item(struct btrfs_trans_handle *trans,
  *
  * the tree should have been previously balanced so the deletion does not
  * empty a node.
+ *
+ * This is exported for use inside btrfs-progs, don't un-export it.
  */
-static void del_ptr(struct btrfs_root *root, struct btrfs_path *path,
-                   int level, int slot)
+int btrfs_del_ptr(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+                 struct btrfs_path *path, int level, int slot)
 {
        struct extent_buffer *parent = path->nodes[level];
        u32 nritems;
@@ -4286,7 +4467,10 @@ static void del_ptr(struct btrfs_root *root, struct btrfs_path *path,
                if (level) {
                        ret = btrfs_tree_mod_log_insert_move(parent, slot,
                                        slot + 1, nritems - slot - 1);
-                       BUG_ON(ret < 0);
+                       if (ret < 0) {
+                               btrfs_abort_transaction(trans, ret);
+                               return ret;
+                       }
                }
                memmove_extent_buffer(parent,
                              btrfs_node_key_ptr_offset(parent, slot),
@@ -4296,7 +4480,10 @@ static void del_ptr(struct btrfs_root *root, struct btrfs_path *path,
        } else if (level) {
                ret = btrfs_tree_mod_log_insert_key(parent, slot,
                                                    BTRFS_MOD_LOG_KEY_REMOVE);
-               BUG_ON(ret < 0);
+               if (ret < 0) {
+                       btrfs_abort_transaction(trans, ret);
+                       return ret;
+               }
        }
 
        nritems--;
@@ -4312,6 +4499,7 @@ static void del_ptr(struct btrfs_root *root, struct btrfs_path *path,
                fixup_low_keys(path, &disk_key, level + 1);
        }
        btrfs_mark_buffer_dirty(parent);
+       return 0;
 }
 
 /*
@@ -4324,13 +4512,17 @@ static void del_ptr(struct btrfs_root *root, struct btrfs_path *path,
  * The path must have already been setup for deleting the leaf, including
  * all the proper balancing.  path->nodes[1] must be locked.
  */
-static noinline void btrfs_del_leaf(struct btrfs_trans_handle *trans,
-                                   struct btrfs_root *root,
-                                   struct btrfs_path *path,
-                                   struct extent_buffer *leaf)
+static noinline int btrfs_del_leaf(struct btrfs_trans_handle *trans,
+                                  struct btrfs_root *root,
+                                  struct btrfs_path *path,
+                                  struct extent_buffer *leaf)
 {
+       int ret;
+
        WARN_ON(btrfs_header_generation(leaf) != trans->transid);
-       del_ptr(root, path, 1, path->slots[1]);
+       ret = btrfs_del_ptr(trans, root, path, 1, path->slots[1]);
+       if (ret < 0)
+               return ret;
 
        /*
         * btrfs_free_extent is expensive, we want to make sure we
@@ -4343,6 +4535,7 @@ static noinline void btrfs_del_leaf(struct btrfs_trans_handle *trans,
        atomic_inc(&leaf->refs);
        btrfs_free_tree_block(trans, btrfs_root_id(root), leaf, 0, 1);
        free_extent_buffer_stale(leaf);
+       return 0;
 }
 /*
  * delete the item at the leaf level in path.  If that empties
@@ -4392,7 +4585,9 @@ int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
                        btrfs_set_header_level(leaf, 0);
                } else {
                        btrfs_clear_buffer_dirty(trans, leaf);
-                       btrfs_del_leaf(trans, root, path, leaf);
+                       ret = btrfs_del_leaf(trans, root, path, leaf);
+                       if (ret < 0)
+                               return ret;
                }
        } else {
                int used = leaf_space_used(leaf, 0, nritems);
@@ -4416,7 +4611,7 @@ int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 
                        /* push_leaf_left fixes the path.
                         * make sure the path still points to our leaf
-                        * for possible call to del_ptr below
+                        * for possible call to btrfs_del_ptr below
                         */
                        slot = path->slots[1];
                        atomic_inc(&leaf->refs);
@@ -4453,7 +4648,9 @@ int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 
                        if (btrfs_header_nritems(leaf) == 0) {
                                path->slots[1] = slot;
-                               btrfs_del_leaf(trans, root, path, leaf);
+                               ret = btrfs_del_leaf(trans, root, path, leaf);
+                               if (ret < 0)
+                                       return ret;
                                free_extent_buffer(leaf);
                                ret = 0;
                        } else {
@@ -4474,86 +4671,6 @@ int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 }
 
 /*
- * search the tree again to find a leaf with lesser keys
- * returns 0 if it found something or 1 if there are no lesser leaves.
- * returns < 0 on io errors.
- *
- * This may release the path, and so you may lose any locks held at the
- * time you call it.
- */
-int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path)
-{
-       struct btrfs_key key;
-       struct btrfs_key orig_key;
-       struct btrfs_disk_key found_key;
-       int ret;
-
-       btrfs_item_key_to_cpu(path->nodes[0], &key, 0);
-       orig_key = key;
-
-       if (key.offset > 0) {
-               key.offset--;
-       } else if (key.type > 0) {
-               key.type--;
-               key.offset = (u64)-1;
-       } else if (key.objectid > 0) {
-               key.objectid--;
-               key.type = (u8)-1;
-               key.offset = (u64)-1;
-       } else {
-               return 1;
-       }
-
-       btrfs_release_path(path);
-       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
-       if (ret <= 0)
-               return ret;
-
-       /*
-        * Previous key not found. Even if we were at slot 0 of the leaf we had
-        * before releasing the path and calling btrfs_search_slot(), we now may
-        * be in a slot pointing to the same original key - this can happen if
-        * after we released the path, one of more items were moved from a
-        * sibling leaf into the front of the leaf we had due to an insertion
-        * (see push_leaf_right()).
-        * If we hit this case and our slot is > 0 and just decrement the slot
-        * so that the caller does not process the same key again, which may or
-        * may not break the caller, depending on its logic.
-        */
-       if (path->slots[0] < btrfs_header_nritems(path->nodes[0])) {
-               btrfs_item_key(path->nodes[0], &found_key, path->slots[0]);
-               ret = comp_keys(&found_key, &orig_key);
-               if (ret == 0) {
-                       if (path->slots[0] > 0) {
-                               path->slots[0]--;
-                               return 0;
-                       }
-                       /*
-                        * At slot 0, same key as before, it means orig_key is
-                        * the lowest, leftmost, key in the tree. We're done.
-                        */
-                       return 1;
-               }
-       }
-
-       btrfs_item_key(path->nodes[0], &found_key, 0);
-       ret = comp_keys(&found_key, &key);
-       /*
-        * We might have had an item with the previous key in the tree right
-        * before we released our path. And after we released our path, that
-        * item might have been pushed to the first slot (0) of the leaf we
-        * were holding due to a tree balance. Alternatively, an item with the
-        * previous key can exist as the only element of a leaf (big fat item).
-        * Therefore account for these 2 cases, so that our callers (like
-        * btrfs_previous_item) don't miss an existing item with a key matching
-        * the previous key we computed above.
-        */
-       if (ret <= 0)
-               return 0;
-       return 1;
-}
-
-/*
  * A helper function to walk down the tree starting at min_key, and looking
  * for nodes or leaves that are have a minimum transaction id.
  * This is used by the btree defrag code, and tree logging
index 4c1986c..f2d2b31 100644
index 4c1986c..f2d2b31 100644
@@ -541,6 +541,8 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
                      struct extent_buffer **cow_ret, u64 new_root_objectid);
 int btrfs_block_can_be_shared(struct btrfs_root *root,
                              struct extent_buffer *buf);
+int btrfs_del_ptr(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+                 struct btrfs_path *path, int level, int slot);
 void btrfs_extend_item(struct btrfs_path *path, u32 data_size);
 void btrfs_truncate_item(struct btrfs_path *path, u32 new_size, int from_end);
 int btrfs_split_item(struct btrfs_trans_handle *trans,
@@ -633,7 +635,6 @@ static inline int btrfs_insert_empty_item(struct btrfs_trans_handle *trans,
        return btrfs_insert_empty_items(trans, root, path, &batch);
 }
 
-int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path);
 int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
                        u64 time_seq);
 
@@ -686,7 +687,7 @@ static inline int btrfs_next_item(struct btrfs_root *root, struct btrfs_path *p)
 {
        return btrfs_next_old_item(root, p, 0);
 }
-int btrfs_leaf_free_space(struct extent_buffer *leaf);
+int btrfs_leaf_free_space(const struct extent_buffer *leaf);
 
 static inline int is_fstree(u64 rootid)
 {
@@ -702,6 +703,7 @@ static inline bool btrfs_is_data_reloc_root(const struct btrfs_root *root)
        return root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID;
 }
 
+u16 btrfs_csum_type_size(u16 type);
 int btrfs_super_csum_size(const struct btrfs_super_block *s);
 const char *btrfs_super_csum_name(u16 csum_type);
 const char *btrfs_super_csum_driver(u16 csum_type);
index 8065341..f2ff4cb 100644
@@ -1040,7 +1040,8 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
        clear_extent_bit(&inode->io_tree, start, start + len - 1,
                         EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
                         EXTENT_DEFRAG, cached_state);
-       set_extent_defrag(&inode->io_tree, start, start + len - 1, cached_state);
+       set_extent_bit(&inode->io_tree, start, start + len - 1,
+                      EXTENT_DELALLOC | EXTENT_DEFRAG, cached_state);
 
        /* Update the page status */
        for (i = start_index - first_index; i <= last_index - first_index; i++) {
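The open-coded set_extent_bit() above stands in for the removed set_extent_defrag() helper; assuming that helper was, at this point in the series, a thin wrapper setting exactly EXTENT_DELALLOC | EXTENT_DEFRAG (which the one-for-one replacement suggests), an equivalent would look like:

        static inline int set_extent_defrag(struct extent_io_tree *tree, u64 start,
                                            u64 end, struct extent_state **cached_state)
        {
                return set_extent_bit(tree, start, end,
                                      EXTENT_DELALLOC | EXTENT_DEFRAG, cached_state);
        }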
index 0b32432..6a13cf0 100644
@@ -407,7 +407,6 @@ static inline void drop_delayed_ref(struct btrfs_delayed_ref_root *delayed_refs,
        RB_CLEAR_NODE(&ref->ref_node);
        if (!list_empty(&ref->add_list))
                list_del(&ref->add_list);
-       ref->in_tree = 0;
        btrfs_put_delayed_ref(ref);
        atomic_dec(&delayed_refs->num_entries);
 }
@@ -507,6 +506,7 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
 {
        struct btrfs_delayed_ref_head *head;
 
+       lockdep_assert_held(&delayed_refs->lock);
 again:
        head = find_ref_head(delayed_refs, delayed_refs->run_delayed_start,
                             true);
@@ -531,7 +531,7 @@ again:
                                href_node);
        }
 
-       head->processing = 1;
+       head->processing = true;
        WARN_ON(delayed_refs->num_heads_ready == 0);
        delayed_refs->num_heads_ready--;
        delayed_refs->run_delayed_start = head->bytenr +
@@ -549,31 +549,35 @@ void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
        RB_CLEAR_NODE(&head->href_node);
        atomic_dec(&delayed_refs->num_entries);
        delayed_refs->num_heads--;
-       if (head->processing == 0)
+       if (!head->processing)
                delayed_refs->num_heads_ready--;
 }
 
 /*
  * Helper to insert the ref_node to the tail or merge with tail.
  *
- * Return 0 for insert.
- * Return >0 for merge.
+ * Return false if the ref was inserted.
+ * Return true if the ref was merged into an existing one (and therefore can be
+ * freed by the caller).
  */
-static int insert_delayed_ref(struct btrfs_delayed_ref_root *root,
-                             struct btrfs_delayed_ref_head *href,
-                             struct btrfs_delayed_ref_node *ref)
+static bool insert_delayed_ref(struct btrfs_delayed_ref_root *root,
+                              struct btrfs_delayed_ref_head *href,
+                              struct btrfs_delayed_ref_node *ref)
 {
        struct btrfs_delayed_ref_node *exist;
        int mod;
-       int ret = 0;
 
        spin_lock(&href->lock);
        exist = tree_insert(&href->ref_tree, ref);
-       if (!exist)
-               goto inserted;
+       if (!exist) {
+               if (ref->action == BTRFS_ADD_DELAYED_REF)
+                       list_add_tail(&ref->add_list, &href->ref_add_list);
+               atomic_inc(&root->num_entries);
+               spin_unlock(&href->lock);
+               return false;
+       }
 
        /* Now we are sure we can merge */
-       ret = 1;
        if (exist->action == ref->action) {
                mod = ref->ref_mod;
        } else {
@@ -600,13 +604,7 @@ static int insert_delayed_ref(struct btrfs_delayed_ref_root *root,
        if (exist->ref_mod == 0)
                drop_delayed_ref(root, href, exist);
        spin_unlock(&href->lock);
-       return ret;
-inserted:
-       if (ref->action == BTRFS_ADD_DELAYED_REF)
-               list_add_tail(&ref->add_list, &href->ref_add_list);
-       atomic_inc(&root->num_entries);
-       spin_unlock(&href->lock);
-       return ret;
+       return true;
 }
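A worked example of the merge arithmetic above, with illustrative ref_mod values (a newly initialized ref always starts at ref_mod == 1, see init_delayed_ref_common() later in this patch):

        /*
         *   existing ADD (ref_mod 2) + incoming ADD  (ref_mod 1) -> exist->ref_mod = 3
         *   existing ADD (ref_mod 1) + incoming DROP (ref_mod 1) -> exist->ref_mod = 0,
         *                                                           drop_delayed_ref(exist)
         *
         * In both cases insert_delayed_ref() now returns true and the caller
         * frees the incoming node it allocated, as the tree/data ref callers
         * below do with kmem_cache_free().
         */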
 
 /*
@@ -699,34 +697,38 @@ static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
                                  bool is_system)
 {
        int count_mod = 1;
-       int must_insert_reserved = 0;
+       bool must_insert_reserved = false;
 
        /* If reserved is provided, it must be a data extent. */
        BUG_ON(!is_data && reserved);
 
-       /*
-        * The head node stores the sum of all the mods, so dropping a ref
-        * should drop the sum in the head node by one.
-        */
-       if (action == BTRFS_UPDATE_DELAYED_HEAD)
+       switch (action) {
+       case BTRFS_UPDATE_DELAYED_HEAD:
                count_mod = 0;
-       else if (action == BTRFS_DROP_DELAYED_REF)
+               break;
+       case BTRFS_DROP_DELAYED_REF:
+               /*
+                * The head node stores the sum of all the mods, so dropping a ref
+                * should drop the sum in the head node by one.
+                */
                count_mod = -1;
-
-       /*
-        * BTRFS_ADD_DELAYED_EXTENT means that we need to update the reserved
-        * accounting when the extent is finally added, or if a later
-        * modification deletes the delayed ref without ever inserting the
-        * extent into the extent allocation tree.  ref->must_insert_reserved
-        * is the flag used to record that accounting mods are required.
-        *
-        * Once we record must_insert_reserved, switch the action to
-        * BTRFS_ADD_DELAYED_REF because other special casing is not required.
-        */
-       if (action == BTRFS_ADD_DELAYED_EXTENT)
-               must_insert_reserved = 1;
-       else
-               must_insert_reserved = 0;
+               break;
+       case BTRFS_ADD_DELAYED_EXTENT:
+               /*
+                * BTRFS_ADD_DELAYED_EXTENT means that we need to update the
+                * reserved accounting when the extent is finally added, or if a
+                * later modification deletes the delayed ref without ever
+                * inserting the extent into the extent allocation tree.
+                * ref->must_insert_reserved is the flag used to record that
+                * accounting mods are required.
+                *
+                * Once we record must_insert_reserved, switch the action to
+                * BTRFS_ADD_DELAYED_REF because other special casing is not
+                * required.
+                */
+               must_insert_reserved = true;
+               break;
+       }
 
        refcount_set(&head_ref->refs, 1);
        head_ref->bytenr = bytenr;
@@ -738,7 +740,7 @@ static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
        head_ref->ref_tree = RB_ROOT_CACHED;
        INIT_LIST_HEAD(&head_ref->ref_add_list);
        RB_CLEAR_NODE(&head_ref->href_node);
-       head_ref->processing = 0;
+       head_ref->processing = false;
        head_ref->total_ref_mod = count_mod;
        spin_lock_init(&head_ref->lock);
        mutex_init(&head_ref->mutex);
@@ -763,11 +765,11 @@ static noinline struct btrfs_delayed_ref_head *
 add_delayed_ref_head(struct btrfs_trans_handle *trans,
                     struct btrfs_delayed_ref_head *head_ref,
                     struct btrfs_qgroup_extent_record *qrecord,
-                    int action, int *qrecord_inserted_ret)
+                    int action, bool *qrecord_inserted_ret)
 {
        struct btrfs_delayed_ref_head *existing;
        struct btrfs_delayed_ref_root *delayed_refs;
-       int qrecord_inserted = 0;
+       bool qrecord_inserted = false;
 
        delayed_refs = &trans->transaction->delayed_refs;
 
@@ -777,7 +779,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
                                        delayed_refs, qrecord))
                        kfree(qrecord);
                else
-                       qrecord_inserted = 1;
+                       qrecord_inserted = true;
        }
 
        trace_add_delayed_ref_head(trans->fs_info, head_ref, action);
@@ -853,8 +855,6 @@ static void init_delayed_ref_common(struct btrfs_fs_info *fs_info,
        ref->num_bytes = num_bytes;
        ref->ref_mod = 1;
        ref->action = action;
-       ref->is_head = 0;
-       ref->in_tree = 1;
        ref->seq = seq;
        ref->type = ref_type;
        RB_CLEAR_NODE(&ref->ref_node);
@@ -875,11 +875,11 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
        struct btrfs_delayed_ref_head *head_ref;
        struct btrfs_delayed_ref_root *delayed_refs;
        struct btrfs_qgroup_extent_record *record = NULL;
-       int qrecord_inserted;
+       bool qrecord_inserted;
        bool is_system;
+       bool merged;
        int action = generic_ref->action;
        int level = generic_ref->tree_ref.level;
-       int ret;
        u64 bytenr = generic_ref->bytenr;
        u64 num_bytes = generic_ref->len;
        u64 parent = generic_ref->parent;
@@ -935,7 +935,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
        head_ref = add_delayed_ref_head(trans, head_ref, record,
                                        action, &qrecord_inserted);
 
-       ret = insert_delayed_ref(delayed_refs, head_ref, &ref->node);
+       merged = insert_delayed_ref(delayed_refs, head_ref, &ref->node);
        spin_unlock(&delayed_refs->lock);
 
        /*
@@ -947,7 +947,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
        trace_add_delayed_tree_ref(fs_info, &ref->node, ref,
                                   action == BTRFS_ADD_DELAYED_EXTENT ?
                                   BTRFS_ADD_DELAYED_REF : action);
-       if (ret > 0)
+       if (merged)
                kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
 
        if (qrecord_inserted)
@@ -968,9 +968,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
        struct btrfs_delayed_ref_head *head_ref;
        struct btrfs_delayed_ref_root *delayed_refs;
        struct btrfs_qgroup_extent_record *record = NULL;
-       int qrecord_inserted;
+       bool qrecord_inserted;
        int action = generic_ref->action;
-       int ret;
+       bool merged;
        u64 bytenr = generic_ref->bytenr;
        u64 num_bytes = generic_ref->len;
        u64 parent = generic_ref->parent;
@@ -1027,7 +1027,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
        head_ref = add_delayed_ref_head(trans, head_ref, record,
                                        action, &qrecord_inserted);
 
-       ret = insert_delayed_ref(delayed_refs, head_ref, &ref->node);
+       merged = insert_delayed_ref(delayed_refs, head_ref, &ref->node);
        spin_unlock(&delayed_refs->lock);
 
        /*
@@ -1039,7 +1039,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
        trace_add_delayed_data_ref(trans->fs_info, &ref->node, ref,
                                   action == BTRFS_ADD_DELAYED_EXTENT ?
                                   BTRFS_ADD_DELAYED_REF : action);
-       if (ret > 0)
+       if (merged)
                kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
 
 
index b54261f..b8e14b0 100644
@@ -48,9 +48,6 @@ struct btrfs_delayed_ref_node {
 
        unsigned int action:8;
        unsigned int type:8;
-       /* is this node still in the rbtree? */
-       unsigned int is_head:1;
-       unsigned int in_tree:1;
 };
 
 struct btrfs_delayed_extent_op {
@@ -70,20 +67,26 @@ struct btrfs_delayed_extent_op {
 struct btrfs_delayed_ref_head {
        u64 bytenr;
        u64 num_bytes;
-       refcount_t refs;
+       /*
+        * For insertion into struct btrfs_delayed_ref_root::href_root.
+        * Keep it in the same cache line as 'bytenr' for more efficient
+        * searches in the rbtree.
+        */
+       struct rb_node href_node;
        /*
         * the mutex is held while running the refs, and it is also
         * held when checking the sum of reference modifications.
         */
        struct mutex mutex;
 
+       refcount_t refs;
+
+       /* Protects 'ref_tree' and 'ref_add_list'. */
        spinlock_t lock;
        struct rb_root_cached ref_tree;
        /* accumulate add BTRFS_ADD_DELAYED_REF nodes to this ref_add_list. */
        struct list_head ref_add_list;
 
-       struct rb_node href_node;
-
        struct btrfs_delayed_extent_op *extent_op;
 
        /*
@@ -113,10 +116,10 @@ struct btrfs_delayed_ref_head {
         * we need to update the in ram accounting to properly reflect
         * the free has happened.
         */
-       unsigned int must_insert_reserved:1;
-       unsigned int is_data:1;
-       unsigned int is_system:1;
-       unsigned int processing:1;
+       bool must_insert_reserved;
+       bool is_data;
+       bool is_system;
+       bool processing;
 };
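The comment about keeping href_node next to bytenr is easiest to see from the shape of an rbtree lookup: every node visited during the walk is reached through its rb_node and immediately has its bytenr compared, so co-locating the two fields touches one cache line per visited head instead of two. A hypothetical, simplified lookup (find_ref_head() in delayed-ref.c is the real thing):

        static struct btrfs_delayed_ref_head *
        find_head_sketch(struct rb_root_cached *root, u64 bytenr)
        {
                struct rb_node *n = root->rb_root.rb_node;

                while (n) {
                        struct btrfs_delayed_ref_head *h;

                        h = rb_entry(n, struct btrfs_delayed_ref_head, href_node);
                        if (bytenr < h->bytenr)
                                n = n->rb_left;
                        else if (bytenr > h->bytenr)
                                n = n->rb_right;
                        else
                                return h;
                }
                return NULL;
        }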
 
 struct btrfs_delayed_tree_ref {
@@ -337,7 +340,7 @@ static inline void btrfs_put_delayed_ref(struct btrfs_delayed_ref_node *ref)
 {
        WARN_ON(refcount_read(&ref->refs) == 0);
        if (refcount_dec_and_test(&ref->refs)) {
-               WARN_ON(ref->in_tree);
+               WARN_ON(!RB_EMPTY_NODE(&ref->ref_node));
                switch (ref->type) {
                case BTRFS_TREE_BLOCK_REF_KEY:
                case BTRFS_SHARED_BLOCK_REF_KEY:
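The WARN_ON(!RB_EMPTY_NODE(...)) above relies on a lifetime invariant whose pieces are all visible elsewhere in this patch; summarized here for reference:

        /*
         *   init_delayed_ref_common():  RB_CLEAR_NODE(&ref->ref_node)  - not linked yet
         *   tree_insert():              links ref_node into href->ref_tree
         *   drop_delayed_ref():         RB_CLEAR_NODE(&ref->ref_node)  - detached again
         *
         * So by the time the final reference is put, RB_EMPTY_NODE() must be
         * true, which is what makes the old in_tree bit redundant (is_head is
         * dropped in the same cleanup).
         */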
index 78696d3..5f10965 100644
@@ -41,7 +41,7 @@
  *   All new writes will be written to both target and source devices, so even
  *   if replace gets canceled, sources device still contains up-to-date data.
  *
- *   Location:         handle_ops_on_dev_replace() from __btrfs_map_block()
+ *   Location:         handle_ops_on_dev_replace() from btrfs_map_block()
  *   Start:            btrfs_dev_replace_start()
  *   End:              btrfs_dev_replace_finishing()
  *   Content:          Latest data/metadata
@@ -257,8 +257,8 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
                return -EINVAL;
        }
 
-       bdev = blkdev_get_by_path(device_path, FMODE_WRITE | FMODE_EXCL,
-                                 fs_info->bdev_holder);
+       bdev = blkdev_get_by_path(device_path, BLK_OPEN_WRITE,
+                                 fs_info->bdev_holder, NULL);
        if (IS_ERR(bdev)) {
                btrfs_err(fs_info, "target device %s is invalid!", device_path);
                return PTR_ERR(bdev);
@@ -315,7 +315,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
        device->bdev = bdev;
        set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
        set_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
-       device->mode = FMODE_EXCL;
+       device->holder = fs_info->bdev_holder;
        device->dev_stats_valid = 1;
        set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
        device->fs_devices = fs_devices;
@@ -334,7 +334,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
        return 0;
 
 error:
-       blkdev_put(bdev, FMODE_EXCL);
+       blkdev_put(bdev, fs_info->bdev_holder);
        return ret;
 }
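The error path above keeps the open/put pairing of the reworked block-open API used earlier in this function: the holder cookie passed to blkdev_get_by_path() is the one handed back to blkdev_put() (and, per the 6.5 block layer convention, a non-NULL holder is what requests the exclusive claim that FMODE_EXCL used to). Sketch of the pairing:

        bdev = blkdev_get_by_path(device_path, BLK_OPEN_WRITE,
                                  fs_info->bdev_holder, NULL /* holder ops */);
        if (IS_ERR(bdev))
                return PTR_ERR(bdev);
        /* ... use the device ... */
        blkdev_put(bdev, fs_info->bdev_holder);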
 
@@ -795,8 +795,8 @@ static int btrfs_set_target_alloc_state(struct btrfs_device *srcdev,
        while (!find_first_extent_bit(&srcdev->alloc_state, start,
                                      &found_start, &found_end,
                                      CHUNK_ALLOCATED, &cached_state)) {
-               ret = set_extent_bits(&tgtdev->alloc_state, found_start,
-                                     found_end, CHUNK_ALLOCATED);
+               ret = set_extent_bit(&tgtdev->alloc_state, found_start,
+                                    found_end, CHUNK_ALLOCATED, NULL);
                if (ret)
                        break;
                start = found_end + 1;
index a6d77fe..944a734 100644
@@ -73,6 +73,23 @@ static struct list_head *get_discard_list(struct btrfs_discard_ctl *discard_ctl,
        return &discard_ctl->discard_list[block_group->discard_index];
 }
 
+/*
+ * Determine if async discard should be running.
+ *
+ * @discard_ctl: discard control
+ *
+ * Check if the file system is writeable and BTRFS_FS_DISCARD_RUNNING is set.
+ */
+static bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl)
+{
+       struct btrfs_fs_info *fs_info = container_of(discard_ctl,
+                                                    struct btrfs_fs_info,
+                                                    discard_ctl);
+
+       return (!(fs_info->sb->s_flags & SB_RDONLY) &&
+               test_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags));
+}
+
 static void __add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
                                  struct btrfs_block_group *block_group)
 {
@@ -545,23 +562,6 @@ static void btrfs_discard_workfn(struct work_struct *work)
 }
 
 /*
- * Determine if async discard should be running.
- *
- * @discard_ctl: discard control
- *
- * Check if the file system is writeable and BTRFS_FS_DISCARD_RUNNING is set.
- */
-bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl)
-{
-       struct btrfs_fs_info *fs_info = container_of(discard_ctl,
-                                                    struct btrfs_fs_info,
-                                                    discard_ctl);
-
-       return (!(fs_info->sb->s_flags & SB_RDONLY) &&
-               test_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags));
-}
-
-/*
  * Recalculate the base delay.
  *
  * @discard_ctl: discard control
index 57b9202..dddb0f9 100644
@@ -24,7 +24,6 @@ void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
                              struct btrfs_block_group *block_group);
 void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
                                 bool override);
-bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl);
 
 /* Update operations */
 void btrfs_discard_calc_delay(struct btrfs_discard_ctl *discard_ctl);
index dabc79c..7513388 100644
                                 BTRFS_SUPER_FLAG_METADUMP |\
                                 BTRFS_SUPER_FLAG_METADUMP_V2)
 
-static void btrfs_destroy_ordered_extents(struct btrfs_root *root);
-static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
-                                     struct btrfs_fs_info *fs_info);
-static void btrfs_destroy_delalloc_inodes(struct btrfs_root *root);
-static int btrfs_destroy_marked_extents(struct btrfs_fs_info *fs_info,
-                                       struct extent_io_tree *dirty_pages,
-                                       int mark);
-static int btrfs_destroy_pinned_extent(struct btrfs_fs_info *fs_info,
-                                      struct extent_io_tree *pinned_extents);
 static int btrfs_cleanup_transaction(struct btrfs_fs_info *fs_info);
 static void btrfs_error_commit_super(struct btrfs_fs_info *fs_info);
 
@@ -110,35 +101,27 @@ static void csum_tree_block(struct extent_buffer *buf, u8 *result)
  * detect blocks that either didn't get written at all or got written
  * in the wrong place.
  */
-static int verify_parent_transid(struct extent_io_tree *io_tree,
-                                struct extent_buffer *eb, u64 parent_transid,
-                                int atomic)
+int btrfs_buffer_uptodate(struct extent_buffer *eb, u64 parent_transid, int atomic)
 {
-       struct extent_state *cached_state = NULL;
-       int ret;
+       if (!extent_buffer_uptodate(eb))
+               return 0;
 
        if (!parent_transid || btrfs_header_generation(eb) == parent_transid)
-               return 0;
+               return 1;
 
        if (atomic)
                return -EAGAIN;
 
-       lock_extent(io_tree, eb->start, eb->start + eb->len - 1, &cached_state);
-       if (extent_buffer_uptodate(eb) &&
-           btrfs_header_generation(eb) == parent_transid) {
-               ret = 0;
-               goto out;
-       }
-       btrfs_err_rl(eb->fs_info,
+       if (!extent_buffer_uptodate(eb) ||
+           btrfs_header_generation(eb) != parent_transid) {
+               btrfs_err_rl(eb->fs_info,
 "parent transid verify failed on logical %llu mirror %u wanted %llu found %llu",
                        eb->start, eb->read_mirror,
                        parent_transid, btrfs_header_generation(eb));
-       ret = 1;
-       clear_extent_buffer_uptodate(eb);
-out:
-       unlock_extent(io_tree, eb->start, eb->start + eb->len - 1,
-                     &cached_state);
-       return ret;
+               clear_extent_buffer_uptodate(eb);
+               return 0;
+       }
+       return 1;
 }
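A sketch of how a caller is expected to treat the three possible results of btrfs_buffer_uptodate() after this rewrite (1: usable, 0: not uptodate or wrong transid, -EAGAIN: only when 'atomic' is set and the answer needs the blocking path); the surrounding variables are assumptions, not a copy of an existing caller:

        ret = btrfs_buffer_uptodate(eb, parent_transid, 1 /* atomic */);
        if (ret == -EAGAIN) {
                /* cannot decide without blocking: drop spinning locks, retry with atomic == 0 */
        } else if (ret == 0) {
                /* stale or unread copy: reread the block, possibly from another mirror */
        }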
 
 static bool btrfs_supported_super_csum(u16 csum_type)
@@ -180,64 +163,6 @@ int btrfs_check_super_csum(struct btrfs_fs_info *fs_info,
        return 0;
 }
 
-int btrfs_verify_level_key(struct extent_buffer *eb, int level,
-                          struct btrfs_key *first_key, u64 parent_transid)
-{
-       struct btrfs_fs_info *fs_info = eb->fs_info;
-       int found_level;
-       struct btrfs_key found_key;
-       int ret;
-
-       found_level = btrfs_header_level(eb);
-       if (found_level != level) {
-               WARN(IS_ENABLED(CONFIG_BTRFS_DEBUG),
-                    KERN_ERR "BTRFS: tree level check failed\n");
-               btrfs_err(fs_info,
-"tree level mismatch detected, bytenr=%llu level expected=%u has=%u",
-                         eb->start, level, found_level);
-               return -EIO;
-       }
-
-       if (!first_key)
-               return 0;
-
-       /*
-        * For live tree block (new tree blocks in current transaction),
-        * we need proper lock context to avoid race, which is impossible here.
-        * So we only checks tree blocks which is read from disk, whose
-        * generation <= fs_info->last_trans_committed.
-        */
-       if (btrfs_header_generation(eb) > fs_info->last_trans_committed)
-               return 0;
-
-       /* We have @first_key, so this @eb must have at least one item */
-       if (btrfs_header_nritems(eb) == 0) {
-               btrfs_err(fs_info,
-               "invalid tree nritems, bytenr=%llu nritems=0 expect >0",
-                         eb->start);
-               WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
-               return -EUCLEAN;
-       }
-
-       if (found_level)
-               btrfs_node_key_to_cpu(eb, &found_key, 0);
-       else
-               btrfs_item_key_to_cpu(eb, &found_key, 0);
-       ret = btrfs_comp_cpu_keys(first_key, &found_key);
-
-       if (ret) {
-               WARN(IS_ENABLED(CONFIG_BTRFS_DEBUG),
-                    KERN_ERR "BTRFS: tree first key check failed\n");
-               btrfs_err(fs_info,
-"tree first key mismatch detected, bytenr=%llu parent_transid=%llu key expected=(%llu,%u,%llu) has=(%llu,%u,%llu)",
-                         eb->start, parent_transid, first_key->objectid,
-                         first_key->type, first_key->offset,
-                         found_key.objectid, found_key.type,
-                         found_key.offset);
-       }
-       return ret;
-}
-
 static int btrfs_repair_eb_io_failure(const struct extent_buffer *eb,
                                      int mirror_num)
 {
@@ -312,12 +237,34 @@ int btrfs_read_extent_buffer(struct extent_buffer *eb,
        return ret;
 }
 
-static int csum_one_extent_buffer(struct extent_buffer *eb)
+/*
+ * Checksum a dirty tree block before IO.
+ */
+blk_status_t btree_csum_one_bio(struct btrfs_bio *bbio)
 {
+       struct extent_buffer *eb = bbio->private;
        struct btrfs_fs_info *fs_info = eb->fs_info;
+       u64 found_start = btrfs_header_bytenr(eb);
        u8 result[BTRFS_CSUM_SIZE];
        int ret;
 
+       /* Btree blocks are always contiguous on disk. */
+       if (WARN_ON_ONCE(bbio->file_offset != eb->start))
+               return BLK_STS_IOERR;
+       if (WARN_ON_ONCE(bbio->bio.bi_iter.bi_size != eb->len))
+               return BLK_STS_IOERR;
+
+       if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags)) {
+               WARN_ON_ONCE(found_start != 0);
+               return BLK_STS_OK;
+       }
+
+       if (WARN_ON_ONCE(found_start != eb->start))
+               return BLK_STS_IOERR;
+       if (WARN_ON(!btrfs_page_test_uptodate(fs_info, eb->pages[0], eb->start,
+                                             eb->len)))
+               return BLK_STS_IOERR;
+
        ASSERT(memcmp_extent_buffer(eb, fs_info->fs_devices->metadata_uuid,
                                    offsetof(struct btrfs_header, fsid),
                                    BTRFS_FSID_SIZE) == 0);
@@ -326,7 +273,7 @@ static int csum_one_extent_buffer(struct extent_buffer *eb)
        if (btrfs_header_level(eb))
                ret = btrfs_check_node(eb);
        else
-               ret = btrfs_check_leaf_full(eb);
+               ret = btrfs_check_leaf(eb);
 
        if (ret < 0)
                goto error;
@@ -344,8 +291,7 @@ static int csum_one_extent_buffer(struct extent_buffer *eb)
                goto error;
        }
        write_extent_buffer(eb, result, 0, fs_info->csum_size);
-
-       return 0;
+       return BLK_STS_OK;
 
 error:
        btrfs_print_tree(eb, 0);
@@ -359,103 +305,10 @@ error:
         */
        WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG) ||
                btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID);
-       return ret;
-}
-
-/* Checksum all dirty extent buffers in one bio_vec */
-static int csum_dirty_subpage_buffers(struct btrfs_fs_info *fs_info,
-                                     struct bio_vec *bvec)
-{
-       struct page *page = bvec->bv_page;
-       u64 bvec_start = page_offset(page) + bvec->bv_offset;
-       u64 cur;
-       int ret = 0;
-
-       for (cur = bvec_start; cur < bvec_start + bvec->bv_len;
-            cur += fs_info->nodesize) {
-               struct extent_buffer *eb;
-               bool uptodate;
-
-               eb = find_extent_buffer(fs_info, cur);
-               uptodate = btrfs_subpage_test_uptodate(fs_info, page, cur,
-                                                      fs_info->nodesize);
-
-               /* A dirty eb shouldn't disappear from buffer_radix */
-               if (WARN_ON(!eb))
-                       return -EUCLEAN;
-
-               if (WARN_ON(cur != btrfs_header_bytenr(eb))) {
-                       free_extent_buffer(eb);
-                       return -EUCLEAN;
-               }
-               if (WARN_ON(!uptodate)) {
-                       free_extent_buffer(eb);
-                       return -EUCLEAN;
-               }
-
-               ret = csum_one_extent_buffer(eb);
-               free_extent_buffer(eb);
-               if (ret < 0)
-                       return ret;
-       }
-       return ret;
-}
-
-/*
- * Checksum a dirty tree block before IO.  This has extra checks to make sure
- * we only fill in the checksum field in the first page of a multi-page block.
- * For subpage extent buffers we need bvec to also read the offset in the page.
- */
-static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct bio_vec *bvec)
-{
-       struct page *page = bvec->bv_page;
-       u64 start = page_offset(page);
-       u64 found_start;
-       struct extent_buffer *eb;
-
-       if (fs_info->nodesize < PAGE_SIZE)
-               return csum_dirty_subpage_buffers(fs_info, bvec);
-
-       eb = (struct extent_buffer *)page->private;
-       if (page != eb->pages[0])
-               return 0;
-
-       found_start = btrfs_header_bytenr(eb);
-
-       if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags)) {
-               WARN_ON(found_start != 0);
-               return 0;
-       }
-
-       /*
-        * Please do not consolidate these warnings into a single if.
-        * It is useful to know what went wrong.
-        */
-       if (WARN_ON(found_start != start))
-               return -EUCLEAN;
-       if (WARN_ON(!PageUptodate(page)))
-               return -EUCLEAN;
-
-       return csum_one_extent_buffer(eb);
-}
-
-blk_status_t btree_csum_one_bio(struct btrfs_bio *bbio)
-{
-       struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
-       struct bvec_iter iter;
-       struct bio_vec bv;
-       int ret = 0;
-
-       bio_for_each_segment(bv, &bbio->bio, iter) {
-               ret = csum_dirty_buffer(fs_info, &bv);
-               if (ret)
-                       break;
-       }
-
        return errno_to_blk_status(ret);
 }
 
-static int check_tree_block_fsid(struct extent_buffer *eb)
+static bool check_tree_block_fsid(struct extent_buffer *eb)
 {
        struct btrfs_fs_info *fs_info = eb->fs_info;
        struct btrfs_fs_devices *fs_devices = fs_info->fs_devices, *seed_devs;
@@ -475,18 +328,18 @@ static int check_tree_block_fsid(struct extent_buffer *eb)
                metadata_uuid = fs_devices->fsid;
 
        if (!memcmp(fsid, metadata_uuid, BTRFS_FSID_SIZE))
-               return 0;
+               return false;
 
        list_for_each_entry(seed_devs, &fs_devices->seed_list, seed_list)
                if (!memcmp(fsid, seed_devs->fsid, BTRFS_FSID_SIZE))
-                       return 0;
+                       return false;
 
-       return 1;
+       return true;
 }
 
 /* Do basic extent buffer checks at read time */
-static int validate_extent_buffer(struct extent_buffer *eb,
-                                 struct btrfs_tree_parent_check *check)
+int btrfs_validate_extent_buffer(struct extent_buffer *eb,
+                                struct btrfs_tree_parent_check *check)
 {
        struct btrfs_fs_info *fs_info = eb->fs_info;
        u64 found_start;
@@ -583,7 +436,7 @@ static int validate_extent_buffer(struct extent_buffer *eb,
         * that we don't try and read the other copies of this block, just
         * return -EIO.
         */
-       if (found_level == 0 && btrfs_check_leaf_full(eb)) {
+       if (found_level == 0 && btrfs_check_leaf(eb)) {
                set_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
                ret = -EIO;
        }
@@ -591,9 +444,7 @@ static int validate_extent_buffer(struct extent_buffer *eb,
        if (found_level > 0 && btrfs_check_node(eb))
                ret = -EIO;
 
-       if (!ret)
-               set_extent_buffer_uptodate(eb);
-       else
+       if (ret)
                btrfs_err(fs_info,
                "read time tree block corruption detected on logical %llu mirror %u",
                          eb->start, eb->read_mirror);
@@ -601,105 +452,6 @@ out:
        return ret;
 }
 
-static int validate_subpage_buffer(struct page *page, u64 start, u64 end,
-                                  int mirror, struct btrfs_tree_parent_check *check)
-{
-       struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
-       struct extent_buffer *eb;
-       bool reads_done;
-       int ret = 0;
-
-       ASSERT(check);
-
-       /*
-        * We don't allow bio merge for subpage metadata read, so we should
-        * only get one eb for each endio hook.
-        */
-       ASSERT(end == start + fs_info->nodesize - 1);
-       ASSERT(PagePrivate(page));
-
-       eb = find_extent_buffer(fs_info, start);
-       /*
-        * When we are reading one tree block, eb must have been inserted into
-        * the radix tree. If not, something is wrong.
-        */
-       ASSERT(eb);
-
-       reads_done = atomic_dec_and_test(&eb->io_pages);
-       /* Subpage read must finish in page read */
-       ASSERT(reads_done);
-
-       eb->read_mirror = mirror;
-       if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
-               ret = -EIO;
-               goto err;
-       }
-       ret = validate_extent_buffer(eb, check);
-       if (ret < 0)
-               goto err;
-
-       set_extent_buffer_uptodate(eb);
-
-       free_extent_buffer(eb);
-       return ret;
-err:
-       /*
-        * end_bio_extent_readpage decrements io_pages in case of error,
-        * make sure it has something to decrement.
-        */
-       atomic_inc(&eb->io_pages);
-       clear_extent_buffer_uptodate(eb);
-       free_extent_buffer(eb);
-       return ret;
-}
-
-int btrfs_validate_metadata_buffer(struct btrfs_bio *bbio,
-                                  struct page *page, u64 start, u64 end,
-                                  int mirror)
-{
-       struct extent_buffer *eb;
-       int ret = 0;
-       int reads_done;
-
-       ASSERT(page->private);
-
-       if (btrfs_sb(page->mapping->host->i_sb)->nodesize < PAGE_SIZE)
-               return validate_subpage_buffer(page, start, end, mirror,
-                                              &bbio->parent_check);
-
-       eb = (struct extent_buffer *)page->private;
-
-       /*
-        * The pending IO might have been the only thing that kept this buffer
-        * in memory.  Make sure we have a ref for all this other checks
-        */
-       atomic_inc(&eb->refs);
-
-       reads_done = atomic_dec_and_test(&eb->io_pages);
-       if (!reads_done)
-               goto err;
-
-       eb->read_mirror = mirror;
-       if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
-               ret = -EIO;
-               goto err;
-       }
-       ret = validate_extent_buffer(eb, &bbio->parent_check);
-err:
-       if (ret) {
-               /*
-                * our io error hook is going to dec the io pages
-                * again, we have to make sure it has something
-                * to decrement
-                */
-               atomic_inc(&eb->io_pages);
-               clear_extent_buffer_uptodate(eb);
-       }
-       free_extent_buffer(eb);
-
-       return ret;
-}
-
 #ifdef CONFIG_MIGRATION
 static int btree_migrate_folio(struct address_space *mapping,
                struct folio *dst, struct folio *src, enum migrate_mode mode)
@@ -1396,8 +1148,7 @@ static struct btrfs_root *btrfs_lookup_fs_root(struct btrfs_fs_info *fs_info,
        spin_lock(&fs_info->fs_roots_radix_lock);
        root = radix_tree_lookup(&fs_info->fs_roots_radix,
                                 (unsigned long)root_id);
-       if (root)
-               root = btrfs_grab_root(root);
+       root = btrfs_grab_root(root);
        spin_unlock(&fs_info->fs_roots_radix_lock);
        return root;
 }
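Dropping the explicit NULL check here assumes btrfs_grab_root() itself tolerates a NULL root and returns NULL (that helper is not part of this hunk, so treat this as an assumption); with that, a lookup path can collapse "look up, then grab" into a single expression:

        spin_lock(&fs_info->fs_roots_radix_lock);
        root = btrfs_grab_root(radix_tree_lookup(&fs_info->fs_roots_radix,
                                                 (unsigned long)root_id));
        spin_unlock(&fs_info->fs_roots_radix_lock);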
@@ -1411,31 +1162,28 @@ static struct btrfs_root *btrfs_get_global_root(struct btrfs_fs_info *fs_info,
                .offset = 0,
        };
 
-       if (objectid == BTRFS_ROOT_TREE_OBJECTID)
+       switch (objectid) {
+       case BTRFS_ROOT_TREE_OBJECTID:
                return btrfs_grab_root(fs_info->tree_root);
-       if (objectid == BTRFS_EXTENT_TREE_OBJECTID)
+       case BTRFS_EXTENT_TREE_OBJECTID:
                return btrfs_grab_root(btrfs_global_root(fs_info, &key));
-       if (objectid == BTRFS_CHUNK_TREE_OBJECTID)
+       case BTRFS_CHUNK_TREE_OBJECTID:
                return btrfs_grab_root(fs_info->chunk_root);
-       if (objectid == BTRFS_DEV_TREE_OBJECTID)
+       case BTRFS_DEV_TREE_OBJECTID:
                return btrfs_grab_root(fs_info->dev_root);
-       if (objectid == BTRFS_CSUM_TREE_OBJECTID)
+       case BTRFS_CSUM_TREE_OBJECTID:
+               return btrfs_grab_root(btrfs_global_root(fs_info, &key));
+       case BTRFS_QUOTA_TREE_OBJECTID:
+               return btrfs_grab_root(fs_info->quota_root);
+       case BTRFS_UUID_TREE_OBJECTID:
+               return btrfs_grab_root(fs_info->uuid_root);
+       case BTRFS_BLOCK_GROUP_TREE_OBJECTID:
+               return btrfs_grab_root(fs_info->block_group_root);
+       case BTRFS_FREE_SPACE_TREE_OBJECTID:
                return btrfs_grab_root(btrfs_global_root(fs_info, &key));
-       if (objectid == BTRFS_QUOTA_TREE_OBJECTID)
-               return btrfs_grab_root(fs_info->quota_root) ?
-                       fs_info->quota_root : ERR_PTR(-ENOENT);
-       if (objectid == BTRFS_UUID_TREE_OBJECTID)
-               return btrfs_grab_root(fs_info->uuid_root) ?
-                       fs_info->uuid_root : ERR_PTR(-ENOENT);
-       if (objectid == BTRFS_BLOCK_GROUP_TREE_OBJECTID)
-               return btrfs_grab_root(fs_info->block_group_root) ?
-                       fs_info->block_group_root : ERR_PTR(-ENOENT);
-       if (objectid == BTRFS_FREE_SPACE_TREE_OBJECTID) {
-               struct btrfs_root *root = btrfs_global_root(fs_info, &key);
-
-               return btrfs_grab_root(root) ? root : ERR_PTR(-ENOENT);
-       }
-       return NULL;
+       default:
+               return NULL;
+       }
 }
 
 int btrfs_insert_fs_root(struct btrfs_fs_info *fs_info,
@@ -1991,7 +1739,6 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 {
        btrfs_destroy_workqueue(fs_info->fixup_workers);
        btrfs_destroy_workqueue(fs_info->delalloc_workers);
-       btrfs_destroy_workqueue(fs_info->hipri_workers);
        btrfs_destroy_workqueue(fs_info->workers);
        if (fs_info->endio_workers)
                destroy_workqueue(fs_info->endio_workers);
@@ -2183,12 +1930,10 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 {
        u32 max_active = fs_info->thread_pool_size;
        unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND;
+       unsigned int ordered_flags = WQ_MEM_RECLAIM | WQ_FREEZABLE;
 
        fs_info->workers =
                btrfs_alloc_workqueue(fs_info, "worker", flags, max_active, 16);
-       fs_info->hipri_workers =
-               btrfs_alloc_workqueue(fs_info, "worker-high",
-                                     flags | WQ_HIGHPRI, max_active, 16);
 
        fs_info->delalloc_workers =
                btrfs_alloc_workqueue(fs_info, "delalloc",
@@ -2202,7 +1947,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
                btrfs_alloc_workqueue(fs_info, "cache", flags, max_active, 0);
 
        fs_info->fixup_workers =
-               btrfs_alloc_workqueue(fs_info, "fixup", flags, 1, 0);
+               btrfs_alloc_ordered_workqueue(fs_info, "fixup", ordered_flags);
 
        fs_info->endio_workers =
                alloc_workqueue("btrfs-endio", flags, max_active);
@@ -2221,11 +1966,12 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
                btrfs_alloc_workqueue(fs_info, "delayed-meta", flags,
                                      max_active, 0);
        fs_info->qgroup_rescan_workers =
-               btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
+               btrfs_alloc_ordered_workqueue(fs_info, "qgroup-rescan",
+                                             ordered_flags);
        fs_info->discard_ctl.discard_workers =
-               alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
+               alloc_ordered_workqueue("btrfs_discard", WQ_FREEZABLE);
 
-       if (!(fs_info->workers && fs_info->hipri_workers &&
+       if (!(fs_info->workers &&
              fs_info->delalloc_workers && fs_info->flush_workers &&
              fs_info->endio_workers && fs_info->endio_meta_workers &&
              fs_info->compressed_write_workers &&
@@ -2265,6 +2011,9 @@ static int btrfs_init_csum_hash(struct btrfs_fs_info *fs_info, u16 csum_type)
                if (!strstr(crypto_shash_driver_name(csum_shash), "generic"))
                        set_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags);
                break;
+       case BTRFS_CSUM_TYPE_XXHASH:
+               set_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags);
+               break;
        default:
                break;
        }
@@ -2642,6 +2391,14 @@ int btrfs_validate_super(struct btrfs_fs_info *fs_info,
                ret = -EINVAL;
        }
 
+       if (memcmp(fs_info->fs_devices->metadata_uuid, sb->dev_item.fsid,
+                  BTRFS_FSID_SIZE) != 0) {
+               btrfs_err(fs_info,
+                       "dev_item UUID does not match metadata fsid: %pU != %pU",
+                       fs_info->fs_devices->metadata_uuid, sb->dev_item.fsid);
+               ret = -EINVAL;
+       }
+
        /*
         * Artificial requirement for block-group-tree to force newer features
         * (free-space-tree, no-holes) so the test matrix is smaller.
@@ -2654,14 +2411,6 @@ int btrfs_validate_super(struct btrfs_fs_info *fs_info,
                ret = -EINVAL;
        }
 
-       if (memcmp(fs_info->fs_devices->metadata_uuid, sb->dev_item.fsid,
-                  BTRFS_FSID_SIZE) != 0) {
-               btrfs_err(fs_info,
-                       "dev_item UUID does not match metadata fsid: %pU != %pU",
-                       fs_info->fs_devices->metadata_uuid, sb->dev_item.fsid);
-               ret = -EINVAL;
-       }
-
        /*
         * Hint to catch really bogus numbers, bitflips or so, more exact checks are
         * done later
@@ -4662,28 +4411,10 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
        btrfs_close_devices(fs_info->fs_devices);
 }
 
-int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
-                         int atomic)
-{
-       int ret;
-       struct inode *btree_inode = buf->pages[0]->mapping->host;
-
-       ret = extent_buffer_uptodate(buf);
-       if (!ret)
-               return ret;
-
-       ret = verify_parent_transid(&BTRFS_I(btree_inode)->io_tree, buf,
-                                   parent_transid, atomic);
-       if (ret == -EAGAIN)
-               return ret;
-       return !ret;
-}
-
 void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
 {
        struct btrfs_fs_info *fs_info = buf->fs_info;
        u64 transid = btrfs_header_generation(buf);
-       int was_dirty;
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
        /*
@@ -4698,19 +4429,13 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
        if (transid != fs_info->generation)
                WARN(1, KERN_CRIT "btrfs transid mismatch buffer %llu, found %llu running %llu\n",
                        buf->start, transid, fs_info->generation);
-       was_dirty = set_extent_buffer_dirty(buf);
-       if (!was_dirty)
-               percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
-                                        buf->len,
-                                        fs_info->dirty_metadata_batch);
+       set_extent_buffer_dirty(buf);
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
        /*
-        * Since btrfs_mark_buffer_dirty() can be called with item pointer set
-        * but item data not updated.
-        * So here we should only check item pointers, not item data.
+        * btrfs_check_leaf() won't check item data if we don't have WRITTEN
+        * set, so this will only validate the basic structure of the items.
         */
-       if (btrfs_header_level(buf) == 0 &&
-           btrfs_check_leaf_relaxed(buf)) {
+       if (btrfs_header_level(buf) == 0 && btrfs_check_leaf(buf)) {
                btrfs_print_leaf(buf);
                ASSERT(0);
        }
@@ -4840,13 +4565,12 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info)
        btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 }
 
-static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
-                                     struct btrfs_fs_info *fs_info)
+static void btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
+                                      struct btrfs_fs_info *fs_info)
 {
        struct rb_node *node;
        struct btrfs_delayed_ref_root *delayed_refs;
        struct btrfs_delayed_ref_node *ref;
-       int ret = 0;
 
        delayed_refs = &trans->delayed_refs;
 
@@ -4854,7 +4578,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
        if (atomic_read(&delayed_refs->num_entries) == 0) {
                spin_unlock(&delayed_refs->lock);
                btrfs_debug(fs_info, "delayed_refs has NO entry");
-               return ret;
+               return;
        }
 
        while ((node = rb_first_cached(&delayed_refs->href_root)) != NULL) {
@@ -4871,7 +4595,6 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
                while ((n = rb_first_cached(&head->ref_tree)) != NULL) {
                        ref = rb_entry(n, struct btrfs_delayed_ref_node,
                                       ref_node);
-                       ref->in_tree = 0;
                        rb_erase_cached(&ref->ref_node, &head->ref_tree);
                        RB_CLEAR_NODE(&ref->ref_node);
                        if (!list_empty(&ref->add_list))
@@ -4916,8 +4639,6 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
        btrfs_qgroup_destroy_extent_records(trans);
 
        spin_unlock(&delayed_refs->lock);
-
-       return ret;
 }
 
 static void btrfs_destroy_delalloc_inodes(struct btrfs_root *root)
@@ -5142,8 +4863,6 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
                                     EXTENT_DIRTY);
        btrfs_destroy_pinned_extent(fs_info, &cur_trans->pinned_extents);
 
-       btrfs_free_redirty_list(cur_trans);
-
        cur_trans->state = TRANS_STATE_COMPLETED;
        wake_up(&cur_trans->commit_wait);
 }
index 4d57723..b03767f 100644 (file)
@@ -31,8 +31,6 @@ struct btrfs_tree_parent_check;
 
 void btrfs_check_leaked_roots(struct btrfs_fs_info *fs_info);
 void btrfs_init_fs_info(struct btrfs_fs_info *fs_info);
-int btrfs_verify_level_key(struct extent_buffer *eb, int level,
-                          struct btrfs_key *first_key, u64 parent_transid);
 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
                                      struct btrfs_tree_parent_check *check);
 struct extent_buffer *btrfs_find_create_tree_block(
@@ -84,9 +82,8 @@ void btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info);
 void btrfs_btree_balance_dirty_nodelay(struct btrfs_fs_info *fs_info);
 void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
                                 struct btrfs_root *root);
-int btrfs_validate_metadata_buffer(struct btrfs_bio *bbio,
-                                  struct page *page, u64 start, u64 end,
-                                  int mirror);
+int btrfs_validate_extent_buffer(struct extent_buffer *eb,
+                                struct btrfs_tree_parent_check *check);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 struct btrfs_root *btrfs_alloc_dummy_root(struct btrfs_fs_info *fs_info);
 #endif
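A hedged caller-side sketch of the reworked read-verification entry point; the btrfs_tree_parent_check fields shown here (owner_root, transid, level) are recalled from this kernel series and should be treated as assumptions rather than a definitive layout.

	struct btrfs_tree_parent_check check = {
		.owner_root = root_id,		/* expected owning tree, illustrative */
		.transid = parent_transid,	/* expected generation */
		.level = level,			/* expected tree level */
	};

	ret = btrfs_validate_extent_buffer(eb, &check);
	if (ret)
		return ret;	/* csum, fsid, level or generation mismatch */
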
index 29a2258..a2315a4 100644 (file)
@@ -533,6 +533,16 @@ static struct extent_state *clear_state_bit(struct extent_io_tree *tree,
 }
 
 /*
+ * Detect if extent bits request NOWAIT semantics, set the gfp mask accordingly,
+ * and clear the EXTENT_NOWAIT bit from the passed-in bits.
+ */
+static void set_gfp_mask_from_bits(u32 *bits, gfp_t *mask)
+{
+       *mask = (*bits & EXTENT_NOWAIT ? GFP_NOWAIT : GFP_NOFS);
+       *bits &= EXTENT_NOWAIT - 1;
+}
+
+/*
  * Clear some bits on a range in the tree.  This may require splitting or
  * inserting elements in the tree, so the gfp mask is used to indicate which
  * allocations or sleeping are allowed.
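A short illustration of the new helper (caller values are illustrative, not from the patch): because EXTENT_NOWAIT is defined as the highest bit of the extent-state enum, EXTENT_NOWAIT - 1 is a mask of every lower bit, so the request flag is stripped while the real state bits survive.

	u32 bits = EXTENT_DELALLOC | EXTENT_NOWAIT;
	gfp_t mask;

	set_gfp_mask_from_bits(&bits, &mask);
	/* now mask == GFP_NOWAIT and bits == EXTENT_DELALLOC */
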
@@ -546,7 +556,7 @@ static struct extent_state *clear_state_bit(struct extent_io_tree *tree,
  */
 int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
                       u32 bits, struct extent_state **cached_state,
-                      gfp_t mask, struct extent_changeset *changeset)
+                      struct extent_changeset *changeset)
 {
        struct extent_state *state;
        struct extent_state *cached;
@@ -556,7 +566,9 @@ int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
        int clear = 0;
        int wake;
        int delete = (bits & EXTENT_CLEAR_ALL_BITS);
+       gfp_t mask;
 
+       set_gfp_mask_from_bits(&bits, &mask);
        btrfs_debug_check_extent_io_range(tree, start, end);
        trace_btrfs_clear_extent_bit(tree, start, end - start + 1, bits);
 
@@ -953,7 +965,8 @@ out:
 
 /*
  * Set some bits on a range in the tree.  This may require allocations or
- * sleeping, so the gfp mask is used to indicate what is allowed.
+ * sleeping. By default all allocations use GFP_NOFS; pass EXTENT_NOWAIT in the
+ * bits to request GFP_NOWAIT instead.
  *
  * If any of the exclusive bits are set, this will fail with -EEXIST if some
  * part of the range already has the desired bits set.  The extent_state of the
@@ -968,7 +981,7 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
                            u32 bits, u64 *failed_start,
                            struct extent_state **failed_state,
                            struct extent_state **cached_state,
-                           struct extent_changeset *changeset, gfp_t mask)
+                           struct extent_changeset *changeset)
 {
        struct extent_state *state;
        struct extent_state *prealloc = NULL;
@@ -978,7 +991,9 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
        u64 last_start;
        u64 last_end;
        u32 exclusive_bits = (bits & EXTENT_LOCKED);
+       gfp_t mask;
 
+       set_gfp_mask_from_bits(&bits, &mask);
        btrfs_debug_check_extent_io_range(tree, start, end);
        trace_btrfs_set_extent_bit(tree, start, end - start + 1, bits);
 
@@ -1188,10 +1203,10 @@ out:
 }
 
 int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-                  u32 bits, struct extent_state **cached_state, gfp_t mask)
+                  u32 bits, struct extent_state **cached_state)
 {
        return __set_extent_bit(tree, start, end, bits, NULL, NULL,
-                               cached_state, NULL, mask);
+                               cached_state, NULL);
 }
 
 /*
@@ -1687,8 +1702,7 @@ int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
         */
        ASSERT(!(bits & EXTENT_LOCKED));
 
-       return __set_extent_bit(tree, start, end, bits, NULL, NULL, NULL,
-                               changeset, GFP_NOFS);
+       return __set_extent_bit(tree, start, end, bits, NULL, NULL, NULL, changeset);
 }
 
 int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
@@ -1700,8 +1714,7 @@ int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
         */
        ASSERT(!(bits & EXTENT_LOCKED));
 
-       return __clear_extent_bit(tree, start, end, bits, NULL, GFP_NOFS,
-                                 changeset);
+       return __clear_extent_bit(tree, start, end, bits, NULL, changeset);
 }
 
 int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
@@ -1711,7 +1724,7 @@ int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
        u64 failed_start;
 
        err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, &failed_start,
-                              NULL, cached, NULL, GFP_NOFS);
+                              NULL, cached, NULL);
        if (err == -EEXIST) {
                if (failed_start > start)
                        clear_extent_bit(tree, start, failed_start - 1,
@@ -1733,7 +1746,7 @@ int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
        u64 failed_start;
 
        err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, &failed_start,
-                              &failed_state, cached_state, NULL, GFP_NOFS);
+                              &failed_state, cached_state, NULL);
        while (err == -EEXIST) {
                if (failed_start != start)
                        clear_extent_bit(tree, start, failed_start - 1,
@@ -1743,7 +1756,7 @@ int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
                                &failed_state);
                err = __set_extent_bit(tree, start, end, EXTENT_LOCKED,
                                       &failed_start, &failed_state,
-                                      cached_state, NULL, GFP_NOFS);
+                                      cached_state, NULL);
        }
        return err;
 }
index 21766e4..fbd3b27 100644 (file)
@@ -43,6 +43,15 @@ enum {
         * want the extent states to go away.
         */
        ENUM_BIT(EXTENT_CLEAR_ALL_BITS),
+
+       /*
+        * This must be last.
+        *
+        * This bit does not represent an extent state; it requests NOWAIT
+        * semantics (e.g. for memory allocations) and must be masked out
+        * from the other bits.
+        */
+       ENUM_BIT(EXTENT_NOWAIT)
 };
 
 #define EXTENT_DO_ACCOUNTING    (EXTENT_CLEAR_META_RESV | \
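The "must be last" requirement exists because set_gfp_mask_from_bits() strips the flag with bits &= EXTENT_NOWAIT - 1, which is only a clean mask of all state bits while no state bit sits above EXTENT_NOWAIT. A purely illustrative compile-time guard for that invariant (not part of the patch):

	static_assert((EXTENT_CLEAR_ALL_BITS & ~(EXTENT_NOWAIT - 1)) == 0,
		      "EXTENT_NOWAIT must stay the highest extent state bit");
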
@@ -127,22 +136,20 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
 int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
                             u32 bits, struct extent_changeset *changeset);
 int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-                      u32 bits, struct extent_state **cached, gfp_t mask,
+                      u32 bits, struct extent_state **cached,
                       struct extent_changeset *changeset);
 
 static inline int clear_extent_bit(struct extent_io_tree *tree, u64 start,
                                   u64 end, u32 bits,
                                   struct extent_state **cached)
 {
-       return __clear_extent_bit(tree, start, end, bits, cached,
-                                 GFP_NOFS, NULL);
+       return __clear_extent_bit(tree, start, end, bits, cached, NULL);
 }
 
 static inline int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end,
                                struct extent_state **cached)
 {
-       return __clear_extent_bit(tree, start, end, EXTENT_LOCKED, cached,
-                                 GFP_NOFS, NULL);
+       return __clear_extent_bit(tree, start, end, EXTENT_LOCKED, cached, NULL);
 }
 
 static inline int clear_extent_bits(struct extent_io_tree *tree, u64 start,
@@ -154,31 +161,13 @@ static inline int clear_extent_bits(struct extent_io_tree *tree, u64 start,
 int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
                           u32 bits, struct extent_changeset *changeset);
 int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-                  u32 bits, struct extent_state **cached_state, gfp_t mask);
-
-static inline int set_extent_bits_nowait(struct extent_io_tree *tree, u64 start,
-                                        u64 end, u32 bits)
-{
-       return set_extent_bit(tree, start, end, bits, NULL, GFP_NOWAIT);
-}
-
-static inline int set_extent_bits(struct extent_io_tree *tree, u64 start,
-               u64 end, u32 bits)
-{
-       return set_extent_bit(tree, start, end, bits, NULL, GFP_NOFS);
-}
+                  u32 bits, struct extent_state **cached_state);
 
 static inline int clear_extent_uptodate(struct extent_io_tree *tree, u64 start,
                u64 end, struct extent_state **cached_state)
 {
        return __clear_extent_bit(tree, start, end, EXTENT_UPTODATE,
-                                 cached_state, GFP_NOFS, NULL);
-}
-
-static inline int set_extent_dirty(struct extent_io_tree *tree, u64 start,
-               u64 end, gfp_t mask)
-{
-       return set_extent_bit(tree, start, end, EXTENT_DIRTY, NULL, mask);
+                                 cached_state, NULL);
 }
 
 static inline int clear_extent_dirty(struct extent_io_tree *tree, u64 start,
@@ -193,29 +182,6 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
                       u32 bits, u32 clear_bits,
                       struct extent_state **cached_state);
 
-static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 start,
-                                     u64 end, u32 extra_bits,
-                                     struct extent_state **cached_state)
-{
-       return set_extent_bit(tree, start, end,
-                             EXTENT_DELALLOC | extra_bits,
-                             cached_state, GFP_NOFS);
-}
-
-static inline int set_extent_defrag(struct extent_io_tree *tree, u64 start,
-               u64 end, struct extent_state **cached_state)
-{
-       return set_extent_bit(tree, start, end,
-                             EXTENT_DELALLOC | EXTENT_DEFRAG,
-                             cached_state, GFP_NOFS);
-}
-
-static inline int set_extent_new(struct extent_io_tree *tree, u64 start,
-               u64 end)
-{
-       return set_extent_bit(tree, start, end, EXTENT_NEW, NULL, GFP_NOFS);
-}
-
 int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
                          u64 *start_ret, u64 *end_ret, u32 bits,
                          struct extent_state **cached_state);
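With the convenience wrappers gone, call sites spell out the bits and allocation behaviour directly; a hedged before/after sketch with illustrative arguments:

	/* before: dedicated helpers picked the bits and gfp mask */
	set_extent_delalloc(tree, start, end, extra_bits, &cached);
	set_extent_bits_nowait(tree, start, end, bits);

	/* after: one entry point; EXTENT_NOWAIT requests GFP_NOWAIT allocations */
	set_extent_bit(tree, start, end, EXTENT_DELALLOC | extra_bits, &cached);
	set_extent_bit(tree, start, end, bits | EXTENT_NOWAIT, NULL);
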
index 5cd289d..911908e 100644 (file)
@@ -73,8 +73,8 @@ int btrfs_add_excluded_extent(struct btrfs_fs_info *fs_info,
                              u64 start, u64 num_bytes)
 {
        u64 end = start + num_bytes - 1;
-       set_extent_bits(&fs_info->excluded_extents, start, end,
-                       EXTENT_UPTODATE);
+       set_extent_bit(&fs_info->excluded_extents, start, end,
+                      EXTENT_UPTODATE, NULL);
        return 0;
 }
 
@@ -402,7 +402,7 @@ int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
                }
        }
 
-       btrfs_print_leaf((struct extent_buffer *)eb);
+       btrfs_print_leaf(eb);
        btrfs_err(eb->fs_info,
                  "eb %llu iref 0x%lx invalid extent inline ref type %d",
                  eb->start, (unsigned long)iref, type);
@@ -1164,15 +1164,10 @@ int insert_inline_extent_backref(struct btrfs_trans_handle *trans,
                 * should not happen at all.
                 */
                if (owner < BTRFS_FIRST_FREE_OBJECTID) {
+                       btrfs_print_leaf(path->nodes[0]);
                        btrfs_crit(trans->fs_info,
-"adding refs to an existing tree ref, bytenr %llu num_bytes %llu root_objectid %llu",
-                                  bytenr, num_bytes, root_objectid);
-                       if (IS_ENABLED(CONFIG_BTRFS_DEBUG)) {
-                               WARN_ON(1);
-                               btrfs_crit(trans->fs_info,
-                       "path->slots[0]=%d path->nodes[0]:", path->slots[0]);
-                               btrfs_print_leaf(path->nodes[0]);
-                       }
+"adding refs to an existing tree ref, bytenr %llu num_bytes %llu root_objectid %llu slot %u",
+                                  bytenr, num_bytes, root_objectid, path->slots[0]);
                        return -EUCLEAN;
                }
                update_inline_extent_backref(path, iref, refs_to_add, extent_op);
@@ -1208,11 +1203,11 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 {
        int j, ret = 0;
        u64 bytes_left, end;
-       u64 aligned_start = ALIGN(start, 1 << 9);
+       u64 aligned_start = ALIGN(start, 1 << SECTOR_SHIFT);
 
        if (WARN_ON(start != aligned_start)) {
                len -= aligned_start - start;
-               len = round_down(len, 1 << 9);
+               len = round_down(len, 1 << SECTOR_SHIFT);
                start = aligned_start;
        }
 
@@ -1250,7 +1245,8 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
                }
 
                if (size) {
-                       ret = blkdev_issue_discard(bdev, start >> 9, size >> 9,
+                       ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
+                                                  size >> SECTOR_SHIFT,
                                                   GFP_NOFS);
                        if (!ret)
                                *discarded_bytes += size;
@@ -1267,7 +1263,8 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
        }
 
        if (bytes_left) {
-               ret = blkdev_issue_discard(bdev, start >> 9, bytes_left >> 9,
+               ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
+                                          bytes_left >> SECTOR_SHIFT,
                                           GFP_NOFS);
                if (!ret)
                        *discarded_bytes += bytes_left;
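Replacing the open-coded 9 with SECTOR_SHIFT keeps behaviour identical: the block layer defines SECTOR_SHIFT as 9, i.e. 512-byte sectors. A quick illustration of the byte/sector conversions used by the discard calls above (values are made up):

	u64 start = 1 * 1024 * 1024;			/* 1 MiB into the device */
	sector_t sector = start >> SECTOR_SHIFT;	/* 2048 512-byte sectors */
	u64 bytes = (u64)sector << SECTOR_SHIFT;	/* back to 1 MiB */
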
@@ -1500,7 +1497,7 @@ out:
 static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
                                struct btrfs_delayed_ref_node *node,
                                struct btrfs_delayed_extent_op *extent_op,
-                               int insert_reserved)
+                               bool insert_reserved)
 {
        int ret = 0;
        struct btrfs_delayed_data_ref *ref;
@@ -1650,7 +1647,7 @@ out:
 static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
                                struct btrfs_delayed_ref_node *node,
                                struct btrfs_delayed_extent_op *extent_op,
-                               int insert_reserved)
+                               bool insert_reserved)
 {
        int ret = 0;
        struct btrfs_delayed_tree_ref *ref;
@@ -1690,7 +1687,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
 static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
                               struct btrfs_delayed_ref_node *node,
                               struct btrfs_delayed_extent_op *extent_op,
-                              int insert_reserved)
+                              bool insert_reserved)
 {
        int ret = 0;
 
@@ -1748,7 +1745,7 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref
                                      struct btrfs_delayed_ref_head *head)
 {
        spin_lock(&delayed_refs->lock);
-       head->processing = 0;
+       head->processing = false;
        delayed_refs->num_heads_ready++;
        spin_unlock(&delayed_refs->lock);
        btrfs_delayed_ref_unlock(head);
@@ -1900,7 +1897,7 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
        struct btrfs_delayed_ref_root *delayed_refs;
        struct btrfs_delayed_extent_op *extent_op;
        struct btrfs_delayed_ref_node *ref;
-       int must_insert_reserved = 0;
+       bool must_insert_reserved;
        int ret;
 
        delayed_refs = &trans->transaction->delayed_refs;
@@ -1916,7 +1913,6 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
                        return -EAGAIN;
                }
 
-               ref->in_tree = 0;
                rb_erase_cached(&ref->ref_node, &locked_ref->ref_tree);
                RB_CLEAR_NODE(&ref->ref_node);
                if (!list_empty(&ref->add_list))
@@ -1943,7 +1939,7 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
                 * spin lock.
                 */
                must_insert_reserved = locked_ref->must_insert_reserved;
-               locked_ref->must_insert_reserved = 0;
+               locked_ref->must_insert_reserved = false;
 
                extent_op = locked_ref->extent_op;
                locked_ref->extent_op = NULL;
@@ -2155,10 +2151,10 @@ out:
 }
 
 int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
-                               struct extent_buffer *eb, u64 flags,
-                               int level)
+                               struct extent_buffer *eb, u64 flags)
 {
        struct btrfs_delayed_extent_op *extent_op;
+       int level = btrfs_header_level(eb);
        int ret;
 
        extent_op = btrfs_alloc_delayed_extent_op();
@@ -2510,8 +2506,8 @@ static int pin_down_extent(struct btrfs_trans_handle *trans,
        spin_unlock(&cache->lock);
        spin_unlock(&cache->space_info->lock);
 
-       set_extent_dirty(&trans->transaction->pinned_extents, bytenr,
-                        bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
+       set_extent_bit(&trans->transaction->pinned_extents, bytenr,
+                      bytenr + num_bytes - 1, EXTENT_DIRTY, NULL);
        return 0;
 }
 
@@ -2838,6 +2834,13 @@ static int do_free_extent_accounting(struct btrfs_trans_handle *trans,
        return ret;
 }
 
+#define abort_and_dump(trans, path, fmt, args...)      \
+({                                                     \
+       btrfs_abort_transaction(trans, -EUCLEAN);       \
+       btrfs_print_leaf(path->nodes[0]);               \
+       btrfs_crit(trans->fs_info, fmt, ##args);        \
+})
+
 /*
  * Drop one or more refs of @node.
  *
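abort_and_dump() uses a GNU statement expression, so it acts as a single statement at the call site: abort the transaction with -EUCLEAN, dump the offending leaf, then log the printf-style message. A hedged usage sketch (the condition and message here are illustrative; real call sites follow in the hunks below):

	if (unlikely(refs < refs_to_drop)) {
		abort_and_dump(trans, path,
			       "trying to drop %d refs but only %llu exist, slot %u",
			       refs_to_drop, refs, path->slots[0]);
		ret = -EUCLEAN;
		goto out;
	}
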
@@ -2978,10 +2981,11 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 
                if (!found_extent) {
                        if (iref) {
-                               btrfs_crit(info,
-"invalid iref, no EXTENT/METADATA_ITEM found but has inline extent ref");
-                               btrfs_abort_transaction(trans, -EUCLEAN);
-                               goto err_dump;
+                               abort_and_dump(trans, path,
+"invalid iref slot %u, no EXTENT/METADATA_ITEM found but has inline extent ref",
+                                          path->slots[0]);
+                               ret = -EUCLEAN;
+                               goto out;
                        }
                        /* Must be SHARED_* item, remove the backref first */
                        ret = remove_extent_backref(trans, extent_root, path,
@@ -3029,11 +3033,11 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
                        }
 
                        if (ret) {
-                               btrfs_err(info,
-                                         "umm, got %d back from search, was looking for %llu",
-                                         ret, bytenr);
                                if (ret > 0)
                                        btrfs_print_leaf(path->nodes[0]);
+                               btrfs_err(info,
+                       "umm, got %d back from search, was looking for %llu, slot %d",
+                                         ret, bytenr, path->slots[0]);
                        }
                        if (ret < 0) {
                                btrfs_abort_transaction(trans, ret);
@@ -3042,12 +3046,10 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
                        extent_slot = path->slots[0];
                }
        } else if (WARN_ON(ret == -ENOENT)) {
-               btrfs_print_leaf(path->nodes[0]);
-               btrfs_err(info,
-                       "unable to find ref byte nr %llu parent %llu root %llu  owner %llu offset %llu",
-                       bytenr, parent, root_objectid, owner_objectid,
-                       owner_offset);
-               btrfs_abort_transaction(trans, ret);
+               abort_and_dump(trans, path,
+"unable to find ref byte nr %llu parent %llu root %llu owner %llu offset %llu slot %d",
+                              bytenr, parent, root_objectid, owner_objectid,
+                              owner_offset, path->slots[0]);
                goto out;
        } else {
                btrfs_abort_transaction(trans, ret);
@@ -3067,14 +3069,15 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
        if (owner_objectid < BTRFS_FIRST_FREE_OBJECTID &&
            key.type == BTRFS_EXTENT_ITEM_KEY) {
                struct btrfs_tree_block_info *bi;
+
                if (item_size < sizeof(*ei) + sizeof(*bi)) {
-                       btrfs_crit(info,
-"invalid extent item size for key (%llu, %u, %llu) owner %llu, has %u expect >= %zu",
-                                  key.objectid, key.type, key.offset,
-                                  owner_objectid, item_size,
-                                  sizeof(*ei) + sizeof(*bi));
-                       btrfs_abort_transaction(trans, -EUCLEAN);
-                       goto err_dump;
+                       abort_and_dump(trans, path,
+"invalid extent item size for key (%llu, %u, %llu) slot %u owner %llu, has %u expect >= %zu",
+                                      key.objectid, key.type, key.offset,
+                                      path->slots[0], owner_objectid, item_size,
+                                      sizeof(*ei) + sizeof(*bi));
+                       ret = -EUCLEAN;
+                       goto out;
                }
                bi = (struct btrfs_tree_block_info *)(ei + 1);
                WARN_ON(owner_objectid != btrfs_tree_block_level(leaf, bi));
@@ -3082,11 +3085,11 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 
        refs = btrfs_extent_refs(leaf, ei);
        if (refs < refs_to_drop) {
-               btrfs_crit(info,
-               "trying to drop %d refs but we only have %llu for bytenr %llu",
-                         refs_to_drop, refs, bytenr);
-               btrfs_abort_transaction(trans, -EUCLEAN);
-               goto err_dump;
+               abort_and_dump(trans, path,
+               "trying to drop %d refs but we only have %llu for bytenr %llu slot %u",
+                              refs_to_drop, refs, bytenr, path->slots[0]);
+               ret = -EUCLEAN;
+               goto out;
        }
        refs -= refs_to_drop;
 
@@ -3099,10 +3102,11 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
                 */
                if (iref) {
                        if (!found_extent) {
-                               btrfs_crit(info,
-"invalid iref, got inlined extent ref but no EXTENT/METADATA_ITEM found");
-                               btrfs_abort_transaction(trans, -EUCLEAN);
-                               goto err_dump;
+                               abort_and_dump(trans, path,
+"invalid iref, got inlined extent ref but no EXTENT/METADATA_ITEM found, slot %u",
+                                              path->slots[0]);
+                               ret = -EUCLEAN;
+                               goto out;
                        }
                } else {
                        btrfs_set_extent_refs(leaf, ei, refs);
@@ -3121,21 +3125,21 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
                if (found_extent) {
                        if (is_data && refs_to_drop !=
                            extent_data_ref_count(path, iref)) {
-                               btrfs_crit(info,
-               "invalid refs_to_drop, current refs %u refs_to_drop %u",
-                                          extent_data_ref_count(path, iref),
-                                          refs_to_drop);
-                               btrfs_abort_transaction(trans, -EUCLEAN);
-                               goto err_dump;
+                               abort_and_dump(trans, path,
+               "invalid refs_to_drop, current refs %u refs_to_drop %u slot %u",
+                                              extent_data_ref_count(path, iref),
+                                              refs_to_drop, path->slots[0]);
+                               ret = -EUCLEAN;
+                               goto out;
                        }
                        if (iref) {
                                if (path->slots[0] != extent_slot) {
-                                       btrfs_crit(info,
-"invalid iref, extent item key (%llu %u %llu) doesn't have wanted iref",
-                                                  key.objectid, key.type,
-                                                  key.offset);
-                                       btrfs_abort_transaction(trans, -EUCLEAN);
-                                       goto err_dump;
+                                       abort_and_dump(trans, path,
+"invalid iref, extent item key (%llu %u %llu) slot %u doesn't have wanted iref",
+                                                      key.objectid, key.type,
+                                                      key.offset, path->slots[0]);
+                                       ret = -EUCLEAN;
+                                       goto out;
                                }
                        } else {
                                /*
@@ -3145,10 +3149,11 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
                                 * [ EXTENT/METADATA_ITEM ][ SHARED_* ITEM ]
                                 */
                                if (path->slots[0] != extent_slot + 1) {
-                                       btrfs_crit(info,
-       "invalid SHARED_* item, previous item is not EXTENT/METADATA_ITEM");
-                                       btrfs_abort_transaction(trans, -EUCLEAN);
-                                       goto err_dump;
+                                       abort_and_dump(trans, path,
+       "invalid SHARED_* item slot %u, previous item is not EXTENT/METADATA_ITEM",
+                                                      path->slots[0]);
+                                       ret = -EUCLEAN;
+                                       goto out;
                                }
                                path->slots[0] = extent_slot;
                                num_to_del = 2;
@@ -3170,19 +3175,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 out:
        btrfs_free_path(path);
        return ret;
-err_dump:
-       /*
-        * Leaf dump can take up a lot of log buffer, so we only do full leaf
-        * dump for debug build.
-        */
-       if (IS_ENABLED(CONFIG_BTRFS_DEBUG)) {
-               btrfs_crit(info, "path->slots[0]=%d extent_slot=%d",
-                          path->slots[0], extent_slot);
-               btrfs_print_leaf(path->nodes[0]);
-       }
-
-       btrfs_free_path(path);
-       return -EUCLEAN;
 }
 
 /*
@@ -3219,7 +3211,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
                goto out;
 
        btrfs_delete_ref_head(delayed_refs, head);
-       head->processing = 0;
+       head->processing = false;
 
        spin_unlock(&head->lock);
        spin_unlock(&delayed_refs->lock);
@@ -4804,7 +4796,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
            !test_bit(BTRFS_ROOT_RESET_LOCKDEP_CLASS, &root->state))
                lockdep_owner = BTRFS_FS_TREE_OBJECTID;
 
-       /* btrfs_clean_tree_block() accesses generation field. */
+       /* btrfs_clear_buffer_dirty() accesses generation field. */
        btrfs_set_header_generation(buf, trans->transid);
 
        /*
@@ -4836,15 +4828,17 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
                 * EXTENT bit to differentiate dirty pages.
                 */
                if (buf->log_index == 0)
-                       set_extent_dirty(&root->dirty_log_pages, buf->start,
-                                       buf->start + buf->len - 1, GFP_NOFS);
+                       set_extent_bit(&root->dirty_log_pages, buf->start,
+                                      buf->start + buf->len - 1,
+                                      EXTENT_DIRTY, NULL);
                else
-                       set_extent_new(&root->dirty_log_pages, buf->start,
-                                       buf->start + buf->len - 1);
+                       set_extent_bit(&root->dirty_log_pages, buf->start,
+                                      buf->start + buf->len - 1,
+                                      EXTENT_NEW, NULL);
        } else {
                buf->log_index = -1;
-               set_extent_dirty(&trans->transaction->dirty_pages, buf->start,
-                        buf->start + buf->len - 1, GFP_NOFS);
+               set_extent_bit(&trans->transaction->dirty_pages, buf->start,
+                              buf->start + buf->len - 1, EXTENT_DIRTY, NULL);
        }
        /* this returns a buffer locked for blocking */
        return buf;
@@ -5102,8 +5096,7 @@ static noinline int walk_down_proc(struct btrfs_trans_handle *trans,
                BUG_ON(ret); /* -ENOMEM */
                ret = btrfs_dec_ref(trans, root, eb, 0);
                BUG_ON(ret); /* -ENOMEM */
-               ret = btrfs_set_disk_extent_flags(trans, eb, flag,
-                                                 btrfs_header_level(eb));
+               ret = btrfs_set_disk_extent_flags(trans, eb, flag);
                BUG_ON(ret); /* -ENOMEM */
                wc->flags[level] |= flag;
        }
@@ -5985,9 +5978,8 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
                ret = btrfs_issue_discard(device->bdev, start, len,
                                          &bytes);
                if (!ret)
-                       set_extent_bits(&device->alloc_state, start,
-                                       start + bytes - 1,
-                                       CHUNK_TRIMMED);
+                       set_extent_bit(&device->alloc_state, start,
+                                      start + bytes - 1, CHUNK_TRIMMED, NULL);
                mutex_unlock(&fs_info->chunk_mutex);
 
                if (ret)
index 0c958fc..429d5c5 100644 (file)
@@ -141,7 +141,7 @@ int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
                  struct extent_buffer *buf, int full_backref);
 int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
-                               struct extent_buffer *eb, u64 flags, int level);
+                               struct extent_buffer *eb, u64 flags);
 int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref);
 
 int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
index a1adadd..a91d5ad 100644 (file)
@@ -98,33 +98,16 @@ void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info)
  */
 struct btrfs_bio_ctrl {
        struct btrfs_bio *bbio;
-       int mirror_num;
        enum btrfs_compression_type compress_type;
        u32 len_to_oe_boundary;
        blk_opf_t opf;
        btrfs_bio_end_io_t end_io_func;
        struct writeback_control *wbc;
-
-       /*
-        * This is for metadata read, to provide the extra needed verification
-        * info.  This has to be provided for submit_one_bio(), as
-        * submit_one_bio() can submit a bio if it ends at stripe boundary.  If
-        * no such parent_check is provided, the metadata can hit false alert at
-        * endio time.
-        */
-       struct btrfs_tree_parent_check *parent_check;
-
-       /*
-        * Tell writepage not to lock the state bits for this range, it still
-        * does the unlocking.
-        */
-       bool extent_locked;
 };
 
 static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
 {
        struct btrfs_bio *bbio = bio_ctrl->bbio;
-       int mirror_num = bio_ctrl->mirror_num;
 
        if (!bbio)
                return;
@@ -132,25 +115,11 @@ static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
        /* Caller should ensure the bio has at least some range added */
        ASSERT(bbio->bio.bi_iter.bi_size);
 
-       if (!is_data_inode(&bbio->inode->vfs_inode)) {
-               if (btrfs_op(&bbio->bio) != BTRFS_MAP_WRITE) {
-                       /*
-                        * For metadata read, we should have the parent_check,
-                        * and copy it to bbio for metadata verification.
-                        */
-                       ASSERT(bio_ctrl->parent_check);
-                       memcpy(&bbio->parent_check,
-                              bio_ctrl->parent_check,
-                              sizeof(struct btrfs_tree_parent_check));
-               }
-               bbio->bio.bi_opf |= REQ_META;
-       }
-
        if (btrfs_op(&bbio->bio) == BTRFS_MAP_READ &&
            bio_ctrl->compress_type != BTRFS_COMPRESS_NONE)
-               btrfs_submit_compressed_read(bbio, mirror_num);
+               btrfs_submit_compressed_read(bbio);
        else
-               btrfs_submit_bio(bbio, mirror_num);
+               btrfs_submit_bio(bbio, 0);
 
        /* The bbio is owned by the end_io handler now */
        bio_ctrl->bbio = NULL;
@@ -248,8 +217,6 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
 
        if (page_ops & PAGE_SET_ORDERED)
                btrfs_page_clamp_set_ordered(fs_info, page, start, len);
-       if (page_ops & PAGE_SET_ERROR)
-               btrfs_page_clamp_set_error(fs_info, page, start, len);
        if (page_ops & PAGE_START_WRITEBACK) {
                btrfs_page_clamp_clear_dirty(fs_info, page, start, len);
                btrfs_page_clamp_set_writeback(fs_info, page, start, len);
@@ -295,9 +262,6 @@ static int __process_pages_contig(struct address_space *mapping,
                ASSERT(processed_end && *processed_end == start);
        }
 
-       if ((page_ops & PAGE_SET_ERROR) && start_index <= end_index)
-               mapping_set_error(mapping, -EIO);
-
        folio_batch_init(&fbatch);
        while (index <= end_index) {
                int found_folios;
@@ -506,6 +470,15 @@ void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
                               start, end, page_ops, NULL);
 }
 
+static bool btrfs_verify_page(struct page *page, u64 start)
+{
+       if (!fsverity_active(page->mapping->host) ||
+           PageUptodate(page) ||
+           start >= i_size_read(page->mapping->host))
+               return true;
+       return fsverity_verify_page(page);
+}
+
 static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 {
        struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
@@ -513,20 +486,10 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
        ASSERT(page_offset(page) <= start &&
               start + len <= page_offset(page) + PAGE_SIZE);
 
-       if (uptodate) {
-               if (fsverity_active(page->mapping->host) &&
-                   !PageError(page) &&
-                   !PageUptodate(page) &&
-                   start < i_size_read(page->mapping->host) &&
-                   !fsverity_verify_page(page)) {
-                       btrfs_page_set_error(fs_info, page, start, len);
-               } else {
-                       btrfs_page_set_uptodate(fs_info, page, start, len);
-               }
-       } else {
+       if (uptodate && btrfs_verify_page(page, start))
+               btrfs_page_set_uptodate(fs_info, page, start, len);
+       else
                btrfs_page_clear_uptodate(fs_info, page, start, len);
-               btrfs_page_set_error(fs_info, page, start, len);
-       }
 
        if (!btrfs_is_subpage(fs_info, page))
                unlock_page(page);
@@ -554,7 +517,6 @@ void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
                len = end + 1 - start;
 
                btrfs_page_clear_uptodate(fs_info, page, start, len);
-               btrfs_page_set_error(fs_info, page, start, len);
                ret = err < 0 ? err : -EIO;
                mapping_set_error(page->mapping, ret);
        }
@@ -574,8 +536,6 @@ static void end_bio_extent_writepage(struct btrfs_bio *bbio)
        struct bio *bio = &bbio->bio;
        int error = blk_status_to_errno(bio->bi_status);
        struct bio_vec *bvec;
-       u64 start;
-       u64 end;
        struct bvec_iter_all iter_all;
 
        ASSERT(!bio_flagged(bio, BIO_CLONED));
@@ -584,6 +544,8 @@ static void end_bio_extent_writepage(struct btrfs_bio *bbio)
                struct inode *inode = page->mapping->host;
                struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
                const u32 sectorsize = fs_info->sectorsize;
+               u64 start = page_offset(page) + bvec->bv_offset;
+               u32 len = bvec->bv_len;
 
                /* Our read/write should always be sector aligned. */
                if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
@@ -595,12 +557,12 @@ static void end_bio_extent_writepage(struct btrfs_bio *bbio)
                "incomplete page write with offset %u and length %u",
                                   bvec->bv_offset, bvec->bv_len);
 
-               start = page_offset(page) + bvec->bv_offset;
-               end = start + bvec->bv_len - 1;
-
-               end_extent_writepage(page, error, start, end);
-
-               btrfs_page_clear_writeback(fs_info, page, start, bvec->bv_len);
+               btrfs_finish_ordered_extent(bbio->ordered, page, start, len, !error);
+               if (error) {
+                       btrfs_page_clear_uptodate(fs_info, page, start, len);
+                       mapping_set_error(page->mapping, error);
+               }
+               btrfs_page_clear_writeback(fs_info, page, start, len);
        }
 
        bio_put(bio);
@@ -686,35 +648,6 @@ static void begin_page_read(struct btrfs_fs_info *fs_info, struct page *page)
 }
 
 /*
- * Find extent buffer for a given bytenr.
- *
- * This is for end_bio_extent_readpage(), thus we can't do any unsafe locking
- * in endio context.
- */
-static struct extent_buffer *find_extent_buffer_readpage(
-               struct btrfs_fs_info *fs_info, struct page *page, u64 bytenr)
-{
-       struct extent_buffer *eb;
-
-       /*
-        * For regular sectorsize, we can use page->private to grab extent
-        * buffer
-        */
-       if (fs_info->nodesize >= PAGE_SIZE) {
-               ASSERT(PagePrivate(page) && page->private);
-               return (struct extent_buffer *)page->private;
-       }
-
-       /* For subpage case, we need to lookup buffer radix tree */
-       rcu_read_lock();
-       eb = radix_tree_lookup(&fs_info->buffer_radix,
-                              bytenr >> fs_info->sectorsize_bits);
-       rcu_read_unlock();
-       ASSERT(eb);
-       return eb;
-}
-
-/*
  * after a readpage IO is done, we need to:
  * clear the uptodate bits on error
  * set the uptodate bits if things worked
@@ -735,7 +668,6 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
         * larger than UINT_MAX, u32 here is enough.
         */
        u32 bio_offset = 0;
-       int mirror;
        struct bvec_iter_all iter_all;
 
        ASSERT(!bio_flagged(bio, BIO_CLONED));
@@ -775,11 +707,6 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
                end = start + bvec->bv_len - 1;
                len = bvec->bv_len;
 
-               mirror = bbio->mirror_num;
-               if (uptodate && !is_data_inode(inode) &&
-                   btrfs_validate_metadata_buffer(bbio, page, start, end, mirror))
-                       uptodate = false;
-
                if (likely(uptodate)) {
                        loff_t i_size = i_size_read(inode);
                        pgoff_t end_index = i_size >> PAGE_SHIFT;
@@ -800,19 +727,12 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
                                zero_user_segment(page, zero_start,
                                                  offset_in_page(end) + 1);
                        }
-               } else if (!is_data_inode(inode)) {
-                       struct extent_buffer *eb;
-
-                       eb = find_extent_buffer_readpage(fs_info, page, start);
-                       set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
-                       eb->read_mirror = mirror;
-                       atomic_dec(&eb->io_pages);
                }
 
                /* Update page status and unlock. */
                end_page_read(page, uptodate, start, len);
                endio_readpage_release_extent(&processed, BTRFS_I(inode),
-                                             start, end, PageUptodate(page));
+                                             start, end, uptodate);
 
                ASSERT(bio_offset + len > bio_offset);
                bio_offset += len;
@@ -906,13 +826,8 @@ static void alloc_new_bio(struct btrfs_inode *inode,
        bio_ctrl->bbio = bbio;
        bio_ctrl->len_to_oe_boundary = U32_MAX;
 
-       /*
-        * Limit the extent to the ordered boundary for Zone Append.
-        * Compressed bios aren't submitted directly, so it doesn't apply to
-        * them.
-        */
-       if (bio_ctrl->compress_type == BTRFS_COMPRESS_NONE &&
-           btrfs_use_zone_append(bbio)) {
+       /* Limit data write bios to the ordered boundary. */
+       if (bio_ctrl->wbc) {
                struct btrfs_ordered_extent *ordered;
 
                ordered = btrfs_lookup_ordered_extent(inode, file_offset);
@@ -920,11 +835,9 @@ static void alloc_new_bio(struct btrfs_inode *inode,
                        bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
                                        ordered->file_offset +
                                        ordered->disk_num_bytes - file_offset);
-                       btrfs_put_ordered_extent(ordered);
+                       bbio->ordered = ordered;
                }
-       }
 
-       if (bio_ctrl->wbc) {
                /*
                 * Pick the last added device to support cgroup writeback.  For
                 * multi-device file systems this means blk-cgroup policies have
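A worked example of the boundary clamp above (numbers are made up): if an ordered extent covers 1 MiB on disk starting at file offset 0 and a write bio begins at file offset 768 KiB, len_to_oe_boundary becomes 256 KiB, so the bio cannot cross the ordered extent and the whole ordered extent (now referenced via bbio->ordered) can be completed in one go at end-io time.

	u64 file_offset = 768 * 1024;
	u64 oe_start = 0, oe_disk_num_bytes = 1024 * 1024;	/* illustrative ordered extent */
	u32 len_to_oe_boundary = min_t(u32, U32_MAX,
				       oe_start + oe_disk_num_bytes - file_offset);
	/* len_to_oe_boundary == 256 * 1024 */
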
@@ -1125,7 +1038,6 @@ static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
        ret = set_page_extent_mapped(page);
        if (ret < 0) {
                unlock_extent(tree, start, end, NULL);
-               btrfs_page_set_error(fs_info, page, start, PAGE_SIZE);
                unlock_page(page);
                return ret;
        }
@@ -1329,11 +1241,9 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
                }
                ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
                                delalloc_end, &page_started, &nr_written, wbc);
-               if (ret) {
-                       btrfs_page_set_error(inode->root->fs_info, page,
-                                            page_offset(page), PAGE_SIZE);
+               if (ret)
                        return ret;
-               }
+
                /*
                 * delalloc_end is already one less than the total length, so
                 * we don't subtract one from PAGE_SIZE
@@ -1438,7 +1348,6 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
        struct extent_map *em;
        int ret = 0;
        int nr = 0;
-       bool compressed;
 
        ret = btrfs_writepage_cow_fixup(page);
        if (ret) {
@@ -1448,12 +1357,6 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
                return 1;
        }
 
-       /*
-        * we don't want to touch the inode after unlocking the page,
-        * so we update the mapping writeback index now
-        */
-       bio_ctrl->wbc->nr_to_write--;
-
        bio_ctrl->end_io_func = end_bio_extent_writepage;
        while (cur <= end) {
                u64 disk_bytenr;
@@ -1486,7 +1389,6 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 
                em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
                if (IS_ERR(em)) {
-                       btrfs_page_set_error(fs_info, page, cur, end - cur + 1);
                        ret = PTR_ERR_OR_ZERO(em);
                        goto out_error;
                }
@@ -1497,10 +1399,14 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
                ASSERT(cur < end);
                ASSERT(IS_ALIGNED(em->start, fs_info->sectorsize));
                ASSERT(IS_ALIGNED(em->len, fs_info->sectorsize));
+
                block_start = em->block_start;
-               compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
                disk_bytenr = em->block_start + extent_offset;
 
+               ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
+               ASSERT(block_start != EXTENT_MAP_HOLE);
+               ASSERT(block_start != EXTENT_MAP_INLINE);
+
                /*
                 * Note that em_end from extent_map_end() and dirty_range_end from
                 * find_next_dirty_byte() are all exclusive
@@ -1509,22 +1415,6 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
                free_extent_map(em);
                em = NULL;
 
-               /*
-                * compressed and inline extents are written through other
-                * paths in the FS
-                */
-               if (compressed || block_start == EXTENT_MAP_HOLE ||
-                   block_start == EXTENT_MAP_INLINE) {
-                       if (compressed)
-                               nr++;
-                       else
-                               btrfs_writepage_endio_finish_ordered(inode,
-                                               page, cur, cur + iosize - 1, true);
-                       btrfs_page_clear_dirty(fs_info, page, cur, iosize);
-                       cur += iosize;
-                       continue;
-               }
-
                btrfs_set_range_writeback(inode, cur, cur + iosize - 1);
                if (!PageWriteback(page)) {
                        btrfs_err(inode->root->fs_info,
@@ -1572,7 +1462,6 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
 {
        struct folio *folio = page_folio(page);
        struct inode *inode = page->mapping->host;
-       struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
        const u64 page_start = page_offset(page);
        const u64 page_end = page_start + PAGE_SIZE - 1;
        int ret;
@@ -1585,9 +1474,6 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
 
        WARN_ON(!PageLocked(page));
 
-       btrfs_page_clear_error(btrfs_sb(inode->i_sb), page,
-                              page_offset(page), PAGE_SIZE);
-
        pg_offset = offset_in_page(i_size);
        if (page->index > end_index ||
           (page->index == end_index && !pg_offset)) {
@@ -1600,77 +1486,30 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
                memzero_page(page, pg_offset, PAGE_SIZE - pg_offset);
 
        ret = set_page_extent_mapped(page);
-       if (ret < 0) {
-               SetPageError(page);
+       if (ret < 0)
                goto done;
-       }
 
-       if (!bio_ctrl->extent_locked) {
-               ret = writepage_delalloc(BTRFS_I(inode), page, bio_ctrl->wbc);
-               if (ret == 1)
-                       return 0;
-               if (ret)
-                       goto done;
-       }
+       ret = writepage_delalloc(BTRFS_I(inode), page, bio_ctrl->wbc);
+       if (ret == 1)
+               return 0;
+       if (ret)
+               goto done;
 
        ret = __extent_writepage_io(BTRFS_I(inode), page, bio_ctrl, i_size, &nr);
        if (ret == 1)
                return 0;
 
+       bio_ctrl->wbc->nr_to_write--;
+
 done:
        if (nr == 0) {
                /* make sure the mapping tag for page dirty gets cleared */
                set_page_writeback(page);
                end_page_writeback(page);
        }
-       /*
-        * Here we used to have a check for PageError() and then set @ret and
-        * call end_extent_writepage().
-        *
-        * But in fact setting @ret here will cause different error paths
-        * between subpage and regular sectorsize.
-        *
-        * For regular page size, we never submit current page, but only add
-        * current page to current bio.
-        * The bio submission can only happen in next page.
-        * Thus if we hit the PageError() branch, @ret is already set to
-        * non-zero value and will not get updated for regular sectorsize.
-        *
-        * But for subpage case, it's possible we submit part of current page,
-        * thus can get PageError() set by submitted bio of the same page,
-        * while our @ret is still 0.
-        *
-        * So here we unify the behavior and don't set @ret.
-        * Error can still be properly passed to higher layer as page will
-        * be set error, here we just don't handle the IO failure.
-        *
-        * NOTE: This is just a hotfix for subpage.
-        * The root fix will be properly ending ordered extent when we hit
-        * an error during writeback.
-        *
-        * But that needs a bigger refactoring, as we not only need to grab the
-        * submitted OE, but also need to know exactly at which bytenr we hit
-        * the error.
-        * Currently the full page based __extent_writepage_io() is not
-        * capable of that.
-        */
-       if (PageError(page))
+       if (ret)
                end_extent_writepage(page, ret, page_start, page_end);
-       if (bio_ctrl->extent_locked) {
-               struct writeback_control *wbc = bio_ctrl->wbc;
-
-               /*
-                * If bio_ctrl->extent_locked, it's from extent_write_locked_range(),
-                * the page can either be locked by lock_page() or
-                * process_one_page().
-                * Let btrfs_page_unlock_writer() handle both cases.
-                */
-               ASSERT(wbc);
-               btrfs_page_unlock_writer(fs_info, page, wbc->range_start,
-                                        wbc->range_end + 1 - wbc->range_start);
-       } else {
-               unlock_page(page);
-       }
+       unlock_page(page);
        ASSERT(ret <= 0);
        return ret;
 }
@@ -1681,52 +1520,26 @@ void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
                       TASK_UNINTERRUPTIBLE);
 }
 
-static void end_extent_buffer_writeback(struct extent_buffer *eb)
-{
-       clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
-       smp_mb__after_atomic();
-       wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
-}
-
 /*
  * Lock extent buffer status and pages for writeback.
  *
- * May try to flush write bio if we can't get the lock.
- *
- * Return  0 if the extent buffer doesn't need to be submitted.
- *           (E.g. the extent buffer is not dirty)
- * Return >0 is the extent buffer is submitted to bio.
- * Return <0 if something went wrong, no page is locked.
+ * Return %false if the extent buffer doesn't need to be submitted (e.g. the
+ * extent buffer is not dirty)
+ * Return %true if the extent buffer is submitted to bio.
  */
-static noinline_for_stack int lock_extent_buffer_for_io(struct extent_buffer *eb,
-                         struct btrfs_bio_ctrl *bio_ctrl)
+static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *eb,
+                         struct writeback_control *wbc)
 {
        struct btrfs_fs_info *fs_info = eb->fs_info;
-       int i, num_pages;
-       int flush = 0;
-       int ret = 0;
+       bool ret = false;
 
-       if (!btrfs_try_tree_write_lock(eb)) {
-               submit_write_bio(bio_ctrl, 0);
-               flush = 1;
-               btrfs_tree_lock(eb);
-       }
-
-       if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
+       btrfs_tree_lock(eb);
+       while (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
                btrfs_tree_unlock(eb);
-               if (bio_ctrl->wbc->sync_mode != WB_SYNC_ALL)
-                       return 0;
-               if (!flush) {
-                       submit_write_bio(bio_ctrl, 0);
-                       flush = 1;
-               }
-               while (1) {
-                       wait_on_extent_buffer_writeback(eb);
-                       btrfs_tree_lock(eb);
-                       if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags))
-                               break;
-                       btrfs_tree_unlock(eb);
-               }
+               if (wbc->sync_mode != WB_SYNC_ALL)
+                       return false;
+               wait_on_extent_buffer_writeback(eb);
+               btrfs_tree_lock(eb);
        }
 
        /*
@@ -1742,45 +1555,19 @@ static noinline_for_stack int lock_extent_buffer_for_io(struct extent_buffer *eb
                percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
                                         -eb->len,
                                         fs_info->dirty_metadata_batch);
-               ret = 1;
+               ret = true;
        } else {
                spin_unlock(&eb->refs_lock);
        }
-
        btrfs_tree_unlock(eb);
-
-       /*
-        * Either we don't need to submit any tree block, or we're submitting
-        * subpage eb.
-        * Subpage metadata doesn't use page locking at all, so we can skip
-        * the page locking.
-        */
-       if (!ret || fs_info->nodesize < PAGE_SIZE)
-               return ret;
-
-       num_pages = num_extent_pages(eb);
-       for (i = 0; i < num_pages; i++) {
-               struct page *p = eb->pages[i];
-
-               if (!trylock_page(p)) {
-                       if (!flush) {
-                               submit_write_bio(bio_ctrl, 0);
-                               flush = 1;
-                       }
-                       lock_page(p);
-               }
-       }
-
        return ret;
 }
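
Illustration (not part of the patch): with the bool return and the page locking removed, callers are expected to submit the buffer only on a true return and to drop their reference in either case, as submit_eb_subpage()/submit_eb_page() do later in this diff. A minimal sketch of that calling pattern:

	if (lock_extent_buffer_for_io(eb, wbc))
		write_one_eb(eb, wbc);	/* only dirty buffers are submitted */
	free_extent_buffer(eb);		/* drop the reference either way */
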
 
-static void set_btree_ioerr(struct page *page, struct extent_buffer *eb)
+static void set_btree_ioerr(struct extent_buffer *eb)
 {
        struct btrfs_fs_info *fs_info = eb->fs_info;
 
-       btrfs_page_set_error(fs_info, page, eb->start, eb->len);
-       if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
-               return;
+       set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 
        /*
         * A read may stumble upon this buffer later, make sure that it gets an
@@ -1794,7 +1581,7 @@ static void set_btree_ioerr(struct page *page, struct extent_buffer *eb)
         * return a 0 because we are readonly if we don't modify the err seq for
         * the superblock.
         */
-       mapping_set_error(page->mapping, -EIO);
+       mapping_set_error(eb->fs_info->btree_inode->i_mapping, -EIO);
 
        /*
         * If writeback for a btree extent that doesn't belong to a log tree
@@ -1869,101 +1656,34 @@ static struct extent_buffer *find_extent_buffer_nolock(
        return NULL;
 }
 
-/*
- * The endio function for subpage extent buffer write.
- *
- * Unlike end_bio_extent_buffer_writepage(), we only call end_page_writeback()
- * after all extent buffers in the page has finished their writeback.
- */
-static void end_bio_subpage_eb_writepage(struct btrfs_bio *bbio)
+static void extent_buffer_write_end_io(struct btrfs_bio *bbio)
 {
-       struct bio *bio = &bbio->bio;
-       struct btrfs_fs_info *fs_info;
-       struct bio_vec *bvec;
+       struct extent_buffer *eb = bbio->private;
+       struct btrfs_fs_info *fs_info = eb->fs_info;
+       bool uptodate = !bbio->bio.bi_status;
        struct bvec_iter_all iter_all;
+       struct bio_vec *bvec;
+       u32 bio_offset = 0;
 
-       fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
-       ASSERT(fs_info->nodesize < PAGE_SIZE);
+       if (!uptodate)
+               set_btree_ioerr(eb);
 
-       ASSERT(!bio_flagged(bio, BIO_CLONED));
-       bio_for_each_segment_all(bvec, bio, iter_all) {
+       bio_for_each_segment_all(bvec, &bbio->bio, iter_all) {
+               u64 start = eb->start + bio_offset;
                struct page *page = bvec->bv_page;
-               u64 bvec_start = page_offset(page) + bvec->bv_offset;
-               u64 bvec_end = bvec_start + bvec->bv_len - 1;
-               u64 cur_bytenr = bvec_start;
-
-               ASSERT(IS_ALIGNED(bvec->bv_len, fs_info->nodesize));
-
-               /* Iterate through all extent buffers in the range */
-               while (cur_bytenr <= bvec_end) {
-                       struct extent_buffer *eb;
-                       int done;
-
-                       /*
-                        * Here we can't use find_extent_buffer(), as it may
-                        * try to lock eb->refs_lock, which is not safe in endio
-                        * context.
-                        */
-                       eb = find_extent_buffer_nolock(fs_info, cur_bytenr);
-                       ASSERT(eb);
-
-                       cur_bytenr = eb->start + eb->len;
-
-                       ASSERT(test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags));
-                       done = atomic_dec_and_test(&eb->io_pages);
-                       ASSERT(done);
-
-                       if (bio->bi_status ||
-                           test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
-                               ClearPageUptodate(page);
-                               set_btree_ioerr(page, eb);
-                       }
+               u32 len = bvec->bv_len;
 
-                       btrfs_subpage_clear_writeback(fs_info, page, eb->start,
-                                                     eb->len);
-                       end_extent_buffer_writeback(eb);
-                       /*
-                        * free_extent_buffer() will grab spinlock which is not
-                        * safe in endio context. Thus here we manually dec
-                        * the ref.
-                        */
-                       atomic_dec(&eb->refs);
-               }
+               if (!uptodate)
+                       btrfs_page_clear_uptodate(fs_info, page, start, len);
+               btrfs_page_clear_writeback(fs_info, page, start, len);
+               bio_offset += len;
        }
-       bio_put(bio);
-}
 
-static void end_bio_extent_buffer_writepage(struct btrfs_bio *bbio)
-{
-       struct bio *bio = &bbio->bio;
-       struct bio_vec *bvec;
-       struct extent_buffer *eb;
-       int done;
-       struct bvec_iter_all iter_all;
-
-       ASSERT(!bio_flagged(bio, BIO_CLONED));
-       bio_for_each_segment_all(bvec, bio, iter_all) {
-               struct page *page = bvec->bv_page;
-
-               eb = (struct extent_buffer *)page->private;
-               BUG_ON(!eb);
-               done = atomic_dec_and_test(&eb->io_pages);
-
-               if (bio->bi_status ||
-                   test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
-                       ClearPageUptodate(page);
-                       set_btree_ioerr(page, eb);
-               }
-
-               end_page_writeback(page);
-
-               if (!done)
-                       continue;
-
-               end_extent_buffer_writeback(eb);
-       }
+       clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
+       smp_mb__after_atomic();
+       wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 
-       bio_put(bio);
+       bio_put(&bbio->bio);
 }
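
Illustration (not part of the patch): the clear_bit()/smp_mb__after_atomic()/wake_up_bit() sequence above is the waker half of the generic bit-wait pattern; the sleeper half is wait_on_extent_buffer_writeback(), shown in context near the top of this hunk. A minimal sketch of the sleeper side, assuming only the generic wait-bit API (the helper name here is hypothetical):

	/* Block until EXTENT_BUFFER_WRITEBACK is cleared by the end_io handler. */
	static void wait_for_eb_writeback(struct extent_buffer *eb)
	{
		wait_on_bit_io(&eb->bflags, EXTENT_BUFFER_WRITEBACK,
			       TASK_UNINTERRUPTIBLE);
	}
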
 
 static void prepare_eb_write(struct extent_buffer *eb)
@@ -1973,7 +1693,6 @@ static void prepare_eb_write(struct extent_buffer *eb)
        unsigned long end;
 
        clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
-       atomic_set(&eb->io_pages, num_extent_pages(eb));
 
        /* Set btree blocks beyond nritems with 0 to avoid stale content */
        nritems = btrfs_header_nritems(eb);
@@ -1995,63 +1714,49 @@ static void prepare_eb_write(struct extent_buffer *eb)
        }
 }
 
-/*
- * Unlike the work in write_one_eb(), we rely completely on extent locking.
- * Page locking is only utilized at minimum to keep the VMM code happy.
- */
-static void write_one_subpage_eb(struct extent_buffer *eb,
-                                struct btrfs_bio_ctrl *bio_ctrl)
-{
-       struct btrfs_fs_info *fs_info = eb->fs_info;
-       struct page *page = eb->pages[0];
-       bool no_dirty_ebs = false;
-
-       prepare_eb_write(eb);
-
-       /* clear_page_dirty_for_io() in subpage helper needs page locked */
-       lock_page(page);
-       btrfs_subpage_set_writeback(fs_info, page, eb->start, eb->len);
-
-       /* Check if this is the last dirty bit to update nr_written */
-       no_dirty_ebs = btrfs_subpage_clear_and_test_dirty(fs_info, page,
-                                                         eb->start, eb->len);
-       if (no_dirty_ebs)
-               clear_page_dirty_for_io(page);
-
-       bio_ctrl->end_io_func = end_bio_subpage_eb_writepage;
-
-       submit_extent_page(bio_ctrl, eb->start, page, eb->len,
-                          eb->start - page_offset(page));
-       unlock_page(page);
-       /*
-        * Submission finished without problem, if no range of the page is
-        * dirty anymore, we have submitted a page.  Update nr_written in wbc.
-        */
-       if (no_dirty_ebs)
-               bio_ctrl->wbc->nr_to_write--;
-}
-
 static noinline_for_stack void write_one_eb(struct extent_buffer *eb,
-                       struct btrfs_bio_ctrl *bio_ctrl)
+                                           struct writeback_control *wbc)
 {
-       u64 disk_bytenr = eb->start;
-       int i, num_pages;
+       struct btrfs_fs_info *fs_info = eb->fs_info;
+       struct btrfs_bio *bbio;
 
        prepare_eb_write(eb);
 
-       bio_ctrl->end_io_func = end_bio_extent_buffer_writepage;
-
-       num_pages = num_extent_pages(eb);
-       for (i = 0; i < num_pages; i++) {
-               struct page *p = eb->pages[i];
-
-               clear_page_dirty_for_io(p);
-               set_page_writeback(p);
-               submit_extent_page(bio_ctrl, disk_bytenr, p, PAGE_SIZE, 0);
-               disk_bytenr += PAGE_SIZE;
-               bio_ctrl->wbc->nr_to_write--;
+       bbio = btrfs_bio_alloc(INLINE_EXTENT_BUFFER_PAGES,
+                              REQ_OP_WRITE | REQ_META | wbc_to_write_flags(wbc),
+                              eb->fs_info, extent_buffer_write_end_io, eb);
+       bbio->bio.bi_iter.bi_sector = eb->start >> SECTOR_SHIFT;
+       bio_set_dev(&bbio->bio, fs_info->fs_devices->latest_dev->bdev);
+       wbc_init_bio(wbc, &bbio->bio);
+       bbio->inode = BTRFS_I(eb->fs_info->btree_inode);
+       bbio->file_offset = eb->start;
+       if (fs_info->nodesize < PAGE_SIZE) {
+               struct page *p = eb->pages[0];
+
+               lock_page(p);
+               btrfs_subpage_set_writeback(fs_info, p, eb->start, eb->len);
+               if (btrfs_subpage_clear_and_test_dirty(fs_info, p, eb->start,
+                                                      eb->len)) {
+                       clear_page_dirty_for_io(p);
+                       wbc->nr_to_write--;
+               }
+               __bio_add_page(&bbio->bio, p, eb->len, eb->start - page_offset(p));
+               wbc_account_cgroup_owner(wbc, p, eb->len);
                unlock_page(p);
+       } else {
+               for (int i = 0; i < num_extent_pages(eb); i++) {
+                       struct page *p = eb->pages[i];
+
+                       lock_page(p);
+                       clear_page_dirty_for_io(p);
+                       set_page_writeback(p);
+                       __bio_add_page(&bbio->bio, p, PAGE_SIZE, 0);
+                       wbc_account_cgroup_owner(wbc, p, PAGE_SIZE);
+                       wbc->nr_to_write--;
+                       unlock_page(p);
+               }
        }
+       btrfs_submit_bio(bbio, 0);
 }
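
Illustration (not part of the patch): write_one_eb() now builds a single bio for the whole buffer, associates it with the writeback control's cgroup once, then adds and accounts each page before submitting. A condensed sketch of that pattern with placeholder names (build_eb_write_bio, pages, nr_pages are assumptions); the patch itself submits through btrfs_submit_bio():

	static void build_eb_write_bio(struct writeback_control *wbc, struct bio *bio,
				       struct page **pages, int nr_pages)
	{
		wbc_init_bio(wbc, bio);		/* tie the bio to wbc's cgroup once */
		for (int i = 0; i < nr_pages; i++) {
			__bio_add_page(bio, pages[i], PAGE_SIZE, 0);
			wbc_account_cgroup_owner(wbc, pages[i], PAGE_SIZE);
		}
		submit_bio(bio);
	}
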
 
 /*
@@ -2068,14 +1773,13 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb,
  * Return >=0 for the number of submitted extent buffers.
  * Return <0 for fatal error.
  */
-static int submit_eb_subpage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl)
+static int submit_eb_subpage(struct page *page, struct writeback_control *wbc)
 {
        struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
        int submitted = 0;
        u64 page_start = page_offset(page);
        int bit_start = 0;
        int sectors_per_node = fs_info->nodesize >> fs_info->sectorsize_bits;
-       int ret;
 
        /* Lock and write each dirty extent buffers in the range */
        while (bit_start < fs_info->subpage_info->bitmap_nr_bits) {
@@ -2121,25 +1825,13 @@ static int submit_eb_subpage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl)
                if (!eb)
                        continue;
 
-               ret = lock_extent_buffer_for_io(eb, bio_ctrl);
-               if (ret == 0) {
-                       free_extent_buffer(eb);
-                       continue;
+               if (lock_extent_buffer_for_io(eb, wbc)) {
+                       write_one_eb(eb, wbc);
+                       submitted++;
                }
-               if (ret < 0) {
-                       free_extent_buffer(eb);
-                       goto cleanup;
-               }
-               write_one_subpage_eb(eb, bio_ctrl);
                free_extent_buffer(eb);
-               submitted++;
        }
        return submitted;
-
-cleanup:
-       /* We hit error, end bio for the submitted extent buffers */
-       submit_write_bio(bio_ctrl, ret);
-       return ret;
 }
 
 /*
@@ -2162,7 +1854,7 @@ cleanup:
  * previous call.
  * Return <0 for fatal error.
  */
-static int submit_eb_page(struct page *page, struct btrfs_bio_ctrl *bio_ctrl,
+static int submit_eb_page(struct page *page, struct writeback_control *wbc,
                          struct extent_buffer **eb_context)
 {
        struct address_space *mapping = page->mapping;
@@ -2174,7 +1866,7 @@ static int submit_eb_page(struct page *page, struct btrfs_bio_ctrl *bio_ctrl,
                return 0;
 
        if (btrfs_sb(page->mapping->host->i_sb)->nodesize < PAGE_SIZE)
-               return submit_eb_subpage(page, bio_ctrl);
+               return submit_eb_subpage(page, wbc);
 
        spin_lock(&mapping->private_lock);
        if (!PagePrivate(page)) {
@@ -2207,8 +1899,7 @@ static int submit_eb_page(struct page *page, struct btrfs_bio_ctrl *bio_ctrl,
                 * If for_sync, this hole will be filled with
                 * transaction commit.
                 */
-               if (bio_ctrl->wbc->sync_mode == WB_SYNC_ALL &&
-                   !bio_ctrl->wbc->for_sync)
+               if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
                        ret = -EAGAIN;
                else
                        ret = 0;
@@ -2218,13 +1909,12 @@ static int submit_eb_page(struct page *page, struct btrfs_bio_ctrl *bio_ctrl,
 
        *eb_context = eb;
 
-       ret = lock_extent_buffer_for_io(eb, bio_ctrl);
-       if (ret <= 0) {
+       if (!lock_extent_buffer_for_io(eb, wbc)) {
                btrfs_revert_meta_write_pointer(cache, eb);
                if (cache)
                        btrfs_put_block_group(cache);
                free_extent_buffer(eb);
-               return ret;
+               return 0;
        }
        if (cache) {
                /*
@@ -2233,7 +1923,7 @@ static int submit_eb_page(struct page *page, struct btrfs_bio_ctrl *bio_ctrl,
                btrfs_schedule_zone_finish_bg(cache, eb);
                btrfs_put_block_group(cache);
        }
-       write_one_eb(eb, bio_ctrl);
+       write_one_eb(eb, wbc);
        free_extent_buffer(eb);
        return 1;
 }
@@ -2242,11 +1932,6 @@ int btree_write_cache_pages(struct address_space *mapping,
                                   struct writeback_control *wbc)
 {
        struct extent_buffer *eb_context = NULL;
-       struct btrfs_bio_ctrl bio_ctrl = {
-               .wbc = wbc,
-               .opf = REQ_OP_WRITE | wbc_to_write_flags(wbc),
-               .extent_locked = 0,
-       };
        struct btrfs_fs_info *fs_info = BTRFS_I(mapping->host)->root->fs_info;
        int ret = 0;
        int done = 0;
@@ -2288,7 +1973,7 @@ retry:
                for (i = 0; i < nr_folios; i++) {
                        struct folio *folio = fbatch.folios[i];
 
-                       ret = submit_eb_page(&folio->page, &bio_ctrl, &eb_context);
+                       ret = submit_eb_page(&folio->page, wbc, &eb_context);
                        if (ret == 0)
                                continue;
                        if (ret < 0) {
@@ -2349,8 +2034,6 @@ retry:
                ret = 0;
        if (!ret && BTRFS_FS_ERROR(fs_info))
                ret = -EROFS;
-       submit_write_bio(&bio_ctrl, ret);
-
        btrfs_zoned_meta_io_unlock(fs_info);
        return ret;
 }
@@ -2520,38 +2203,31 @@ retry:
  * already been ran (aka, ordered extent inserted) and all pages are still
  * locked.
  */
-int extent_write_locked_range(struct inode *inode, u64 start, u64 end)
+int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
+                             struct writeback_control *wbc)
 {
        bool found_error = false;
        int first_error = 0;
        int ret = 0;
        struct address_space *mapping = inode->i_mapping;
-       struct page *page;
+       struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+       const u32 sectorsize = fs_info->sectorsize;
+       loff_t i_size = i_size_read(inode);
        u64 cur = start;
-       unsigned long nr_pages;
-       const u32 sectorsize = btrfs_sb(inode->i_sb)->sectorsize;
-       struct writeback_control wbc_writepages = {
-               .sync_mode      = WB_SYNC_ALL,
-               .range_start    = start,
-               .range_end      = end + 1,
-               .no_cgroup_owner = 1,
-       };
        struct btrfs_bio_ctrl bio_ctrl = {
-               .wbc = &wbc_writepages,
-               /* We're called from an async helper function */
-               .opf = REQ_OP_WRITE | REQ_BTRFS_CGROUP_PUNT |
-                       wbc_to_write_flags(&wbc_writepages),
-               .extent_locked = 1,
+               .wbc = wbc,
+               .opf = REQ_OP_WRITE | wbc_to_write_flags(wbc),
        };
 
+       if (wbc->no_cgroup_owner)
+               bio_ctrl.opf |= REQ_BTRFS_CGROUP_PUNT;
+
        ASSERT(IS_ALIGNED(start, sectorsize) && IS_ALIGNED(end + 1, sectorsize));
-       nr_pages = (round_up(end, PAGE_SIZE) - round_down(start, PAGE_SIZE)) >>
-                  PAGE_SHIFT;
-       wbc_writepages.nr_to_write = nr_pages * 2;
 
-       wbc_attach_fdatawrite_inode(&wbc_writepages, inode);
        while (cur <= end) {
                u64 cur_end = min(round_down(cur, PAGE_SIZE) + PAGE_SIZE - 1, end);
+               struct page *page;
+               int nr = 0;
 
                page = find_get_page(mapping, cur >> PAGE_SHIFT);
                /*
@@ -2562,19 +2238,31 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end)
                ASSERT(PageLocked(page));
                ASSERT(PageDirty(page));
                clear_page_dirty_for_io(page);
-               ret = __extent_writepage(page, &bio_ctrl);
-               ASSERT(ret <= 0);
+
+               ret = __extent_writepage_io(BTRFS_I(inode), page, &bio_ctrl,
+                                           i_size, &nr);
+               if (ret == 1)
+                       goto next_page;
+
+               /* Make sure the mapping tag for page dirty gets cleared. */
+               if (nr == 0) {
+                       set_page_writeback(page);
+                       end_page_writeback(page);
+               }
+               if (ret)
+                       end_extent_writepage(page, ret, cur, cur_end);
+               btrfs_page_unlock_writer(fs_info, page, cur, cur_end + 1 - cur);
                if (ret < 0) {
                        found_error = true;
                        first_error = ret;
                }
+next_page:
                put_page(page);
                cur = cur_end + 1;
        }
 
        submit_write_bio(&bio_ctrl, found_error ? ret : 0);
 
-       wbc_detach_inode(&wbc_writepages);
        if (found_error)
                return first_error;
        return ret;
@@ -2588,7 +2276,6 @@ int extent_writepages(struct address_space *mapping,
        struct btrfs_bio_ctrl bio_ctrl = {
                .wbc = wbc,
                .opf = REQ_OP_WRITE | wbc_to_write_flags(wbc),
-               .extent_locked = 0,
        };
 
        /*
@@ -2679,8 +2366,7 @@ static int try_release_extent_state(struct extent_io_tree *tree,
                 * The delalloc new bit will be cleared by ordered extent
                 * completion.
                 */
-               ret = __clear_extent_bit(tree, start, end, clear_bits, NULL,
-                                        mask, NULL);
+               ret = __clear_extent_bit(tree, start, end, clear_bits, NULL, NULL);
 
                /* if clear_extent_bit failed for enomem reasons,
                 * we can't allow the release to continue.
@@ -3421,10 +3107,9 @@ static void __free_extent_buffer(struct extent_buffer *eb)
        kmem_cache_free(extent_buffer_cache, eb);
 }
 
-int extent_buffer_under_io(const struct extent_buffer *eb)
+static int extent_buffer_under_io(const struct extent_buffer *eb)
 {
-       return (atomic_read(&eb->io_pages) ||
-               test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) ||
+       return (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) ||
                test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
 }
 
@@ -3557,11 +3242,9 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
        init_rwsem(&eb->lock);
 
        btrfs_leak_debug_add_eb(eb);
-       INIT_LIST_HEAD(&eb->release_list);
 
        spin_lock_init(&eb->refs_lock);
        atomic_set(&eb->refs, 1);
-       atomic_set(&eb->io_pages, 0);
 
        ASSERT(len <= BTRFS_MAX_METADATA_BLOCKSIZE);
 
@@ -3678,9 +3361,9 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
         * adequately protected by the refcount, but the TREE_REF bit and
         * its corresponding reference are not. To protect against this
         * class of races, we call check_buffer_tree_ref from the codepaths
-        * which trigger io after they set eb->io_pages. Note that once io is
-        * initiated, TREE_REF can no longer be cleared, so that is the
-        * moment at which any such race is best fixed.
+        * which trigger io. Note that once io is initiated, TREE_REF can no
+        * longer be cleared, so that is the moment at which any such race is
+        * best fixed.
         */
        refs = atomic_read(&eb->refs);
        if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
@@ -3939,7 +3622,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 
                WARN_ON(btrfs_page_test_dirty(fs_info, p, eb->start, eb->len));
                eb->pages[i] = p;
-               if (!PageUptodate(p))
+               if (!btrfs_page_test_uptodate(fs_info, p, eb->start, eb->len))
                        uptodate = 0;
 
                /*
@@ -4142,13 +3825,12 @@ void btrfs_clear_buffer_dirty(struct btrfs_trans_handle *trans,
                        continue;
                lock_page(page);
                btree_clear_page_dirty(page);
-               ClearPageError(page);
                unlock_page(page);
        }
        WARN_ON(atomic_read(&eb->refs) == 0);
 }
 
-bool set_extent_buffer_dirty(struct extent_buffer *eb)
+void set_extent_buffer_dirty(struct extent_buffer *eb)
 {
        int i;
        int num_pages;
@@ -4183,13 +3865,14 @@ bool set_extent_buffer_dirty(struct extent_buffer *eb)
                                             eb->start, eb->len);
                if (subpage)
                        unlock_page(eb->pages[0]);
+               percpu_counter_add_batch(&eb->fs_info->dirty_metadata_bytes,
+                                        eb->len,
+                                        eb->fs_info->dirty_metadata_batch);
        }
 #ifdef CONFIG_BTRFS_DEBUG
        for (i = 0; i < num_pages; i++)
                ASSERT(PageDirty(eb->pages[i]));
 #endif
-
-       return was_dirty;
 }
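
Illustration (not part of the patch): the counter bump added here is balanced by the decrement that lock_extent_buffer_for_io() performs when the buffer is handed to writeback (see the hunk earlier in this file). Reduced to the two calls, using the names from the patch:

	/* when the buffer becomes dirty (set_extent_buffer_dirty) */
	percpu_counter_add_batch(&fs_info->dirty_metadata_bytes, eb->len,
				 fs_info->dirty_metadata_batch);
	/* when it is locked for writeback (lock_extent_buffer_for_io) */
	percpu_counter_add_batch(&fs_info->dirty_metadata_bytes, -eb->len,
				 fs_info->dirty_metadata_batch);
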
 
 void clear_extent_buffer_uptodate(struct extent_buffer *eb)
@@ -4242,84 +3925,54 @@ void set_extent_buffer_uptodate(struct extent_buffer *eb)
        }
 }
 
-static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
-                                     int mirror_num,
-                                     struct btrfs_tree_parent_check *check)
+static void extent_buffer_read_end_io(struct btrfs_bio *bbio)
 {
+       struct extent_buffer *eb = bbio->private;
        struct btrfs_fs_info *fs_info = eb->fs_info;
-       struct extent_io_tree *io_tree;
-       struct page *page = eb->pages[0];
-       struct extent_state *cached_state = NULL;
-       struct btrfs_bio_ctrl bio_ctrl = {
-               .opf = REQ_OP_READ,
-               .mirror_num = mirror_num,
-               .parent_check = check,
-       };
-       int ret;
+       bool uptodate = !bbio->bio.bi_status;
+       struct bvec_iter_all iter_all;
+       struct bio_vec *bvec;
+       u32 bio_offset = 0;
 
-       ASSERT(!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags));
-       ASSERT(PagePrivate(page));
-       ASSERT(check);
-       io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
+       eb->read_mirror = bbio->mirror_num;
 
-       if (wait == WAIT_NONE) {
-               if (!try_lock_extent(io_tree, eb->start, eb->start + eb->len - 1,
-                                    &cached_state))
-                       return -EAGAIN;
-       } else {
-               ret = lock_extent(io_tree, eb->start, eb->start + eb->len - 1,
-                                 &cached_state);
-               if (ret < 0)
-                       return ret;
-       }
+       if (uptodate &&
+           btrfs_validate_extent_buffer(eb, &bbio->parent_check) < 0)
+               uptodate = false;
 
-       if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags) ||
-           PageUptodate(page) ||
-           btrfs_subpage_test_uptodate(fs_info, page, eb->start, eb->len)) {
-               set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
-               unlock_extent(io_tree, eb->start, eb->start + eb->len - 1,
-                             &cached_state);
-               return 0;
+       if (uptodate) {
+               set_extent_buffer_uptodate(eb);
+       } else {
+               clear_extent_buffer_uptodate(eb);
+               set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
        }
 
-       clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
-       eb->read_mirror = 0;
-       atomic_set(&eb->io_pages, 1);
-       check_buffer_tree_ref(eb);
-       bio_ctrl.end_io_func = end_bio_extent_readpage;
+       bio_for_each_segment_all(bvec, &bbio->bio, iter_all) {
+               u64 start = eb->start + bio_offset;
+               struct page *page = bvec->bv_page;
+               u32 len = bvec->bv_len;
 
-       btrfs_subpage_clear_error(fs_info, page, eb->start, eb->len);
+               if (uptodate)
+                       btrfs_page_set_uptodate(fs_info, page, start, len);
+               else
+                       btrfs_page_clear_uptodate(fs_info, page, start, len);
 
-       btrfs_subpage_start_reader(fs_info, page, eb->start, eb->len);
-       submit_extent_page(&bio_ctrl, eb->start, page, eb->len,
-                          eb->start - page_offset(page));
-       submit_one_bio(&bio_ctrl);
-       if (wait != WAIT_COMPLETE) {
-               free_extent_state(cached_state);
-               return 0;
+               bio_offset += len;
        }
 
-       wait_extent_bit(io_tree, eb->start, eb->start + eb->len - 1,
-                       EXTENT_LOCKED, &cached_state);
-       if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
-               return -EIO;
-       return 0;
+       clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
+       smp_mb__after_atomic();
+       wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
+       free_extent_buffer(eb);
+
+       bio_put(&bbio->bio);
 }
 
 int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num,
                             struct btrfs_tree_parent_check *check)
 {
-       int i;
-       struct page *page;
-       int locked_pages = 0;
-       int all_uptodate = 1;
-       int num_pages;
-       unsigned long num_reads = 0;
-       struct btrfs_bio_ctrl bio_ctrl = {
-               .opf = REQ_OP_READ,
-               .mirror_num = mirror_num,
-               .parent_check = check,
-       };
+       int num_pages = num_extent_pages(eb), i;
+       struct btrfs_bio *bbio;
 
        if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
                return 0;
@@ -4332,87 +3985,39 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num,
        if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)))
                return -EIO;
 
-       if (eb->fs_info->nodesize < PAGE_SIZE)
-               return read_extent_buffer_subpage(eb, wait, mirror_num, check);
-
-       num_pages = num_extent_pages(eb);
-       for (i = 0; i < num_pages; i++) {
-               page = eb->pages[i];
-               if (wait == WAIT_NONE) {
-                       /*
-                        * WAIT_NONE is only utilized by readahead. If we can't
-                        * acquire the lock atomically it means either the eb
-                        * is being read out or under modification.
-                        * Either way the eb will be or has been cached,
-                        * readahead can exit safely.
-                        */
-                       if (!trylock_page(page))
-                               goto unlock_exit;
-               } else {
-                       lock_page(page);
-               }
-               locked_pages++;
-       }
-       /*
-        * We need to firstly lock all pages to make sure that
-        * the uptodate bit of our pages won't be affected by
-        * clear_extent_buffer_uptodate().
-        */
-       for (i = 0; i < num_pages; i++) {
-               page = eb->pages[i];
-               if (!PageUptodate(page)) {
-                       num_reads++;
-                       all_uptodate = 0;
-               }
-       }
-
-       if (all_uptodate) {
-               set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
-               goto unlock_exit;
-       }
+       /* Someone else is already reading the buffer, just wait for it. */
+       if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
+               goto done;
 
        clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
        eb->read_mirror = 0;
-       atomic_set(&eb->io_pages, num_reads);
-       /*
-        * It is possible for release_folio to clear the TREE_REF bit before we
-        * set io_pages. See check_buffer_tree_ref for a more detailed comment.
-        */
        check_buffer_tree_ref(eb);
-       bio_ctrl.end_io_func = end_bio_extent_readpage;
-       for (i = 0; i < num_pages; i++) {
-               page = eb->pages[i];
-
-               if (!PageUptodate(page)) {
-                       ClearPageError(page);
-                       submit_extent_page(&bio_ctrl, page_offset(page), page,
-                                          PAGE_SIZE, 0);
-               } else {
-                       unlock_page(page);
-               }
+       atomic_inc(&eb->refs);
+
+       bbio = btrfs_bio_alloc(INLINE_EXTENT_BUFFER_PAGES,
+                              REQ_OP_READ | REQ_META, eb->fs_info,
+                              extent_buffer_read_end_io, eb);
+       bbio->bio.bi_iter.bi_sector = eb->start >> SECTOR_SHIFT;
+       bbio->inode = BTRFS_I(eb->fs_info->btree_inode);
+       bbio->file_offset = eb->start;
+       memcpy(&bbio->parent_check, check, sizeof(*check));
+       if (eb->fs_info->nodesize < PAGE_SIZE) {
+               __bio_add_page(&bbio->bio, eb->pages[0], eb->len,
+                              eb->start - page_offset(eb->pages[0]));
+       } else {
+               for (i = 0; i < num_pages; i++)
+                       __bio_add_page(&bbio->bio, eb->pages[i], PAGE_SIZE, 0);
        }
+       btrfs_submit_bio(bbio, mirror_num);
 
-       submit_one_bio(&bio_ctrl);
-
-       if (wait != WAIT_COMPLETE)
-               return 0;
-
-       for (i = 0; i < num_pages; i++) {
-               page = eb->pages[i];
-               wait_on_page_locked(page);
-               if (!PageUptodate(page))
+done:
+       if (wait == WAIT_COMPLETE) {
+               wait_on_bit_io(&eb->bflags, EXTENT_BUFFER_READING, TASK_UNINTERRUPTIBLE);
+               if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
                        return -EIO;
        }
 
        return 0;
-
-unlock_exit:
-       while (locked_pages > 0) {
-               locked_pages--;
-               page = eb->pages[locked_pages];
-               unlock_page(page);
-       }
-       return 0;
 }
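
Illustration (not part of the patch): the rewritten reader replaces per-page locking with the new EXTENT_BUFFER_READING bit; test_and_set_bit() elects exactly one submitter, and everyone else either returns immediately (readahead) or sleeps on the bit. A simplified sketch of that election, assuming only the generic bitops and wait-bit APIs (function and parameter names are hypothetical):

	/* Returns true for the one caller that should submit the read. */
	static bool claim_eb_read(unsigned long *flags, int reading_bit, bool wait)
	{
		if (test_and_set_bit(reading_bit, flags)) {
			if (wait)	/* WAIT_COMPLETE: sleep until the reader finishes */
				wait_on_bit_io(flags, reading_bit, TASK_UNINTERRUPTIBLE);
			return false;
		}
		return true;
	}
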
 
 static bool report_eb_range(const struct extent_buffer *eb, unsigned long start,
@@ -4561,18 +4166,17 @@ static void assert_eb_page_uptodate(const struct extent_buffer *eb,
         * looked up.  We don't want to complain in this case, as the page was
         * valid before, we just didn't write it out.  Instead we want to catch
         * the case where we didn't actually read the block properly, which
-        * would have !PageUptodate && !PageError, as we clear PageError before
-        * reading.
+        * would have !PageUptodate and !EXTENT_BUFFER_WRITE_ERR.
         */
-       if (fs_info->nodesize < PAGE_SIZE) {
-               bool uptodate, error;
+       if (test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
+               return;
 
-               uptodate = btrfs_subpage_test_uptodate(fs_info, page,
-                                                      eb->start, eb->len);
-               error = btrfs_subpage_test_error(fs_info, page, eb->start, eb->len);
-               WARN_ON(!uptodate && !error);
+       if (fs_info->nodesize < PAGE_SIZE) {
+               if (WARN_ON(!btrfs_subpage_test_uptodate(fs_info, page,
+                                                        eb->start, eb->len)))
+                       btrfs_subpage_dump_bitmap(fs_info, page, eb->start, eb->len);
        } else {
-               WARN_ON(!PageUptodate(page) && !PageError(page));
+               WARN_ON(!PageUptodate(page));
        }
 }
 
index 4341ad9..c5fae3a 100644 (file)
@@ -29,6 +29,8 @@ enum {
        /* write IO error */
        EXTENT_BUFFER_WRITE_ERR,
        EXTENT_BUFFER_NO_CHECK,
+       /* Indicate that extent buffer pages are being read */
+       EXTENT_BUFFER_READING,
 };
 
 /* these are flags for __process_pages_contig */
@@ -38,7 +40,6 @@ enum {
        ENUM_BIT(PAGE_START_WRITEBACK),
        ENUM_BIT(PAGE_END_WRITEBACK),
        ENUM_BIT(PAGE_SET_ORDERED),
-       ENUM_BIT(PAGE_SET_ERROR),
        ENUM_BIT(PAGE_LOCK),
 };
 
@@ -79,7 +80,6 @@ struct extent_buffer {
        struct btrfs_fs_info *fs_info;
        spinlock_t refs_lock;
        atomic_t refs;
-       atomic_t io_pages;
        int read_mirror;
        struct rcu_head rcu_head;
        pid_t lock_owner;
@@ -89,7 +89,6 @@ struct extent_buffer {
        struct rw_semaphore lock;
 
        struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
-       struct list_head release_list;
 #ifdef CONFIG_BTRFS_DEBUG
        struct list_head leak_list;
 #endif
@@ -179,7 +178,8 @@ int try_release_extent_mapping(struct page *page, gfp_t mask);
 int try_release_extent_buffer(struct page *page);
 
 int btrfs_read_folio(struct file *file, struct folio *folio);
-int extent_write_locked_range(struct inode *inode, u64 start, u64 end);
+int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
+                             struct writeback_control *wbc);
 int extent_writepages(struct address_space *mapping,
                      struct writeback_control *wbc);
 int btree_write_cache_pages(struct address_space *mapping,
@@ -262,10 +262,9 @@ void extent_buffer_bitmap_set(const struct extent_buffer *eb, unsigned long star
 void extent_buffer_bitmap_clear(const struct extent_buffer *eb,
                                unsigned long start, unsigned long pos,
                                unsigned long len);
-bool set_extent_buffer_dirty(struct extent_buffer *eb);
+void set_extent_buffer_dirty(struct extent_buffer *eb);
 void set_extent_buffer_uptodate(struct extent_buffer *eb);
 void clear_extent_buffer_uptodate(struct extent_buffer *eb);
-int extent_buffer_under_io(const struct extent_buffer *eb);
 void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
index 138afa9..0cdb3e8 100644 (file)
@@ -364,8 +364,9 @@ static void extent_map_device_set_bits(struct extent_map *em, unsigned bits)
                struct btrfs_io_stripe *stripe = &map->stripes[i];
                struct btrfs_device *device = stripe->dev;
 
-               set_extent_bits_nowait(&device->alloc_state, stripe->physical,
-                                stripe->physical + stripe_size - 1, bits);
+               set_extent_bit(&device->alloc_state, stripe->physical,
+                              stripe->physical + stripe_size - 1,
+                              bits | EXTENT_NOWAIT, NULL);
        }
 }
 
@@ -380,8 +381,9 @@ static void extent_map_device_clear_bits(struct extent_map *em, unsigned bits)
                struct btrfs_device *device = stripe->dev;
 
                __clear_extent_bit(&device->alloc_state, stripe->physical,
-                                  stripe->physical + stripe_size - 1, bits,
-                                  NULL, GFP_NOWAIT, NULL);
+                                  stripe->physical + stripe_size - 1,
+                                  bits | EXTENT_NOWAIT,
+                                  NULL, NULL);
        }
 }
 
@@ -502,10 +504,10 @@ void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em)
        RB_CLEAR_NODE(&em->rb_node);
 }
 
-void replace_extent_mapping(struct extent_map_tree *tree,
-                           struct extent_map *cur,
-                           struct extent_map *new,
-                           int modified)
+static void replace_extent_mapping(struct extent_map_tree *tree,
+                                  struct extent_map *cur,
+                                  struct extent_map *new,
+                                  int modified)
 {
        lockdep_assert_held_write(&tree->lock);
 
@@ -959,3 +961,95 @@ int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
 
        return ret;
 }
+
+/*
+ * Split off the first pre bytes from the extent_map at [start, start + len],
+ * and set the block_start for it to new_logical.
+ *
+ * This function is used when an ordered_extent needs to be split.
+ */
+int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
+                    u64 new_logical)
+{
+       struct extent_map_tree *em_tree = &inode->extent_tree;
+       struct extent_map *em;
+       struct extent_map *split_pre = NULL;
+       struct extent_map *split_mid = NULL;
+       int ret = 0;
+       unsigned long flags;
+
+       ASSERT(pre != 0);
+       ASSERT(pre < len);
+
+       split_pre = alloc_extent_map();
+       if (!split_pre)
+               return -ENOMEM;
+       split_mid = alloc_extent_map();
+       if (!split_mid) {
+               ret = -ENOMEM;
+               goto out_free_pre;
+       }
+
+       lock_extent(&inode->io_tree, start, start + len - 1, NULL);
+       write_lock(&em_tree->lock);
+       em = lookup_extent_mapping(em_tree, start, len);
+       if (!em) {
+               ret = -EIO;
+               goto out_unlock;
+       }
+
+       ASSERT(em->len == len);
+       ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
+       ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE);
+       ASSERT(test_bit(EXTENT_FLAG_PINNED, &em->flags));
+       ASSERT(!test_bit(EXTENT_FLAG_LOGGING, &em->flags));
+       ASSERT(!list_empty(&em->list));
+
+       flags = em->flags;
+       clear_bit(EXTENT_FLAG_PINNED, &em->flags);
+
+       /* First, replace the em with a new extent_map starting from em->start */
+       split_pre->start = em->start;
+       split_pre->len = pre;
+       split_pre->orig_start = split_pre->start;
+       split_pre->block_start = new_logical;
+       split_pre->block_len = split_pre->len;
+       split_pre->orig_block_len = split_pre->block_len;
+       split_pre->ram_bytes = split_pre->len;
+       split_pre->flags = flags;
+       split_pre->compress_type = em->compress_type;
+       split_pre->generation = em->generation;
+
+       replace_extent_mapping(em_tree, em, split_pre, 1);
+
+       /*
+        * Now we only have an extent_map at:
+        *     [em->start, em->start + pre]
+        */
+
+       /* Insert the middle extent_map. */
+       split_mid->start = em->start + pre;
+       split_mid->len = em->len - pre;
+       split_mid->orig_start = split_mid->start;
+       split_mid->block_start = em->block_start + pre;
+       split_mid->block_len = split_mid->len;
+       split_mid->orig_block_len = split_mid->block_len;
+       split_mid->ram_bytes = split_mid->len;
+       split_mid->flags = flags;
+       split_mid->compress_type = em->compress_type;
+       split_mid->generation = em->generation;
+       add_extent_mapping(em_tree, split_mid, 1);
+
+       /* Once for us */
+       free_extent_map(em);
+       /* Once for the tree */
+       free_extent_map(em);
+
+out_unlock:
+       write_unlock(&em_tree->lock);
+       unlock_extent(&inode->io_tree, start, start + len - 1, NULL);
+       free_extent_map(split_mid);
+out_free_pre:
+       free_extent_map(split_pre);
+       return ret;
+}
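
Worked example (values assumed for illustration): with start = 0, len = 128K, pre = 16K and new_logical = 1M, the single mapping [0, 128K) with block_start B is replaced by split_pre covering [0, 16K) with block_start 1M and split_mid covering [16K, 128K) with block_start B + 16K; the io_tree range stays locked and the tree write lock is held across the replace-and-insert.
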
index ad31186..35d27c7 100644 (file)
@@ -90,10 +90,8 @@ struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
 int add_extent_mapping(struct extent_map_tree *tree,
                       struct extent_map *em, int modified);
 void remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em);
-void replace_extent_mapping(struct extent_map_tree *tree,
-                           struct extent_map *cur,
-                           struct extent_map *new,
-                           int modified);
+int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
+                    u64 new_logical);
 
 struct extent_map *alloc_extent_map(void);
 void free_extent_map(struct extent_map *em);
index d1cd0a6..696bf69 100644 (file)
@@ -94,8 +94,8 @@ int btrfs_inode_set_file_extent_range(struct btrfs_inode *inode, u64 start,
 
        if (btrfs_fs_incompat(inode->root->fs_info, NO_HOLES))
                return 0;
-       return set_extent_bits(&inode->file_extent_tree, start, start + len - 1,
-                              EXTENT_DIRTY);
+       return set_extent_bit(&inode->file_extent_tree, start, start + len - 1,
+                             EXTENT_DIRTY, NULL);
 }
 
 /*
@@ -438,9 +438,9 @@ blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
                            BTRFS_DATA_RELOC_TREE_OBJECTID) {
                                u64 file_offset = bbio->file_offset + bio_offset;
 
-                               set_extent_bits(&inode->io_tree, file_offset,
-                                               file_offset + sectorsize - 1,
-                                               EXTENT_NODATASUM);
+                               set_extent_bit(&inode->io_tree, file_offset,
+                                              file_offset + sectorsize - 1,
+                                              EXTENT_NODATASUM, NULL);
                        } else {
                                btrfs_warn_rl(fs_info,
                        "csum hole found for disk bytenr range [%llu, %llu)",
@@ -560,8 +560,8 @@ int btrfs_lookup_csums_list(struct btrfs_root *root, u64 start, u64 end,
                                goto fail;
                        }
 
-                       sums->bytenr = start;
-                       sums->len = (int)size;
+                       sums->logical = start;
+                       sums->len = size;
 
                        offset = bytes_to_csum_size(fs_info, start - key.offset);
 
@@ -721,20 +721,17 @@ fail:
  */
 blk_status_t btrfs_csum_one_bio(struct btrfs_bio *bbio)
 {
+       struct btrfs_ordered_extent *ordered = bbio->ordered;
        struct btrfs_inode *inode = bbio->inode;
        struct btrfs_fs_info *fs_info = inode->root->fs_info;
        SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
        struct bio *bio = &bbio->bio;
-       u64 offset = bbio->file_offset;
        struct btrfs_ordered_sum *sums;
-       struct btrfs_ordered_extent *ordered = NULL;
        char *data;
        struct bvec_iter iter;
        struct bio_vec bvec;
        int index;
        unsigned int blockcount;
-       unsigned long total_bytes = 0;
-       unsigned long this_sum_bytes = 0;
        int i;
        unsigned nofs_flag;
 
@@ -749,61 +746,17 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_bio *bbio)
        sums->len = bio->bi_iter.bi_size;
        INIT_LIST_HEAD(&sums->list);
 
-       sums->bytenr = bio->bi_iter.bi_sector << 9;
+       sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
        index = 0;
 
        shash->tfm = fs_info->csum_shash;
 
        bio_for_each_segment(bvec, bio, iter) {
-               if (!ordered) {
-                       ordered = btrfs_lookup_ordered_extent(inode, offset);
-                       /*
-                        * The bio range is not covered by any ordered extent,
-                        * must be a code logic error.
-                        */
-                       if (unlikely(!ordered)) {
-                               WARN(1, KERN_WARNING
-                       "no ordered extent for root %llu ino %llu offset %llu\n",
-                                    inode->root->root_key.objectid,
-                                    btrfs_ino(inode), offset);
-                               kvfree(sums);
-                               return BLK_STS_IOERR;
-                       }
-               }
-
                blockcount = BTRFS_BYTES_TO_BLKS(fs_info,
                                                 bvec.bv_len + fs_info->sectorsize
                                                 - 1);
 
                for (i = 0; i < blockcount; i++) {
-                       if (!(bio->bi_opf & REQ_BTRFS_ONE_ORDERED) &&
-                           !in_range(offset, ordered->file_offset,
-                                     ordered->num_bytes)) {
-                               unsigned long bytes_left;
-
-                               sums->len = this_sum_bytes;
-                               this_sum_bytes = 0;
-                               btrfs_add_ordered_sum(ordered, sums);
-                               btrfs_put_ordered_extent(ordered);
-
-                               bytes_left = bio->bi_iter.bi_size - total_bytes;
-
-                               nofs_flag = memalloc_nofs_save();
-                               sums = kvzalloc(btrfs_ordered_sum_size(fs_info,
-                                                     bytes_left), GFP_KERNEL);
-                               memalloc_nofs_restore(nofs_flag);
-                               if (!sums)
-                                       return BLK_STS_RESOURCE;
-
-                               sums->len = bytes_left;
-                               ordered = btrfs_lookup_ordered_extent(inode,
-                                                               offset);
-                               ASSERT(ordered); /* Logic error */
-                               sums->bytenr = (bio->bi_iter.bi_sector << 9)
-                                       + total_bytes;
-                               index = 0;
-                       }
-
                        data = bvec_kmap_local(&bvec);
                        crypto_shash_digest(shash,
                                            data + (i * fs_info->sectorsize),
@@ -811,15 +764,28 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_bio *bbio)
                                            sums->sums + index);
                        kunmap_local(data);
                        index += fs_info->csum_size;
-                       offset += fs_info->sectorsize;
-                       this_sum_bytes += fs_info->sectorsize;
-                       total_bytes += fs_info->sectorsize;
                }
 
        }
-       this_sum_bytes = 0;
+
+       bbio->sums = sums;
        btrfs_add_ordered_sum(ordered, sums);
-       btrfs_put_ordered_extent(ordered);
+       return 0;
+}
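
Illustration (not part of the patch): each loop iteration above digests one sectorsize block through the kernel crypto shash API. A minimal sketch of a single-block digest, assuming a preallocated checksum tfm (tfm, data, sectorsize and csum are placeholders):

	SHASH_DESC_ON_STACK(shash, tfm);

	shash->tfm = tfm;
	/* One-shot init + update + final over a single sector. */
	crypto_shash_digest(shash, data, sectorsize, csum);
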
+
+/*
+ * Nodatasum I/O on zoned file systems still requires a btrfs_ordered_sum to
+ * record the updated logical address on Zone Append completion.
+ * Allocate just the structure with an empty sums array here for that case.
+ */
+blk_status_t btrfs_alloc_dummy_sum(struct btrfs_bio *bbio)
+{
+       bbio->sums = kmalloc(sizeof(*bbio->sums), GFP_NOFS);
+       if (!bbio->sums)
+               return BLK_STS_RESOURCE;
+       bbio->sums->len = bbio->bio.bi_iter.bi_size;
+       bbio->sums->logical = bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
+       btrfs_add_ordered_sum(bbio->ordered, bbio->sums);
        return 0;
 }
 
@@ -1086,7 +1052,7 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 again:
        next_offset = (u64)-1;
        found_next = 0;
-       bytenr = sums->bytenr + total_bytes;
+       bytenr = sums->logical + total_bytes;
        file_key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
        file_key.offset = bytenr;
        file_key.type = BTRFS_EXTENT_CSUM_KEY;
index 6be8725..4ec669b 100644 (file)
@@ -50,6 +50,7 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
                           struct btrfs_root *root,
                           struct btrfs_ordered_sum *sums);
 blk_status_t btrfs_csum_one_bio(struct btrfs_bio *bbio);
+blk_status_t btrfs_alloc_dummy_sum(struct btrfs_bio *bbio);
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
                             struct list_head *list, int search_commit,
                             bool nowait);
index f649647..fd03e68 100644 (file)
@@ -1145,7 +1145,6 @@ static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
            !(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC)))
                return -EAGAIN;
 
-       current->backing_dev_info = inode_to_bdi(inode);
        ret = file_remove_privs(file);
        if (ret)
                return ret;
@@ -1165,10 +1164,8 @@ static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
                loff_t end_pos = round_up(pos + count, fs_info->sectorsize);
 
                ret = btrfs_cont_expand(BTRFS_I(inode), oldsize, end_pos);
-               if (ret) {
-                       current->backing_dev_info = NULL;
+               if (ret)
                        return ret;
-               }
        }
 
        return 0;
@@ -1651,7 +1648,6 @@ ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
        struct file *file = iocb->ki_filp;
        struct btrfs_inode *inode = BTRFS_I(file_inode(file));
        ssize_t num_written, num_sync;
-       const bool sync = iocb_is_dsync(iocb);
 
        /*
         * If the fs flips readonly due to some impossible error, although we
@@ -1664,9 +1660,6 @@ ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
        if (encoded && (iocb->ki_flags & IOCB_NOWAIT))
                return -EOPNOTSUPP;
 
-       if (sync)
-               atomic_inc(&inode->sync_writers);
-
        if (encoded) {
                num_written = btrfs_encoded_write(iocb, from, encoded);
                num_sync = encoded->len;
@@ -1686,10 +1679,6 @@ ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
                        num_written = num_sync;
        }
 
-       if (sync)
-               atomic_dec(&inode->sync_writers);
-
-       current->backing_dev_info = NULL;
        return num_written;
 }
 
@@ -1733,9 +1722,7 @@ static int start_ordered_ops(struct inode *inode, loff_t start, loff_t end)
         * several segments of stripe length (currently 64K).
         */
        blk_start_plug(&plug);
-       atomic_inc(&BTRFS_I(inode)->sync_writers);
        ret = btrfs_fdatawrite_range(inode, start, end);
-       atomic_dec(&BTRFS_I(inode)->sync_writers);
        blk_finish_plug(&plug);
 
        return ret;
@@ -3709,7 +3696,8 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
        int ret;
 
-       filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC;
+       filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC |
+                       FMODE_CAN_ODIRECT;
 
        ret = fsverity_file_open(inode, filp);
        if (ret)
@@ -3825,7 +3813,7 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 const struct file_operations btrfs_file_operations = {
        .llseek         = btrfs_file_llseek,
        .read_iter      = btrfs_file_read_iter,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .write_iter     = btrfs_file_write_iter,
        .splice_write   = iter_file_splice_write,
        .mmap           = btrfs_file_mmap,
index cf98a3c..8808004 100644
@@ -292,25 +292,6 @@ out:
        return ret;
 }
 
-int btrfs_check_trunc_cache_free_space(struct btrfs_fs_info *fs_info,
-                                      struct btrfs_block_rsv *rsv)
-{
-       u64 needed_bytes;
-       int ret;
-
-       /* 1 for slack space, 1 for updating the inode */
-       needed_bytes = btrfs_calc_insert_metadata_size(fs_info, 1) +
-               btrfs_calc_metadata_size(fs_info, 1);
-
-       spin_lock(&rsv->lock);
-       if (rsv->reserved < needed_bytes)
-               ret = -ENOSPC;
-       else
-               ret = 0;
-       spin_unlock(&rsv->lock);
-       return ret;
-}
-
 int btrfs_truncate_free_space_cache(struct btrfs_trans_handle *trans,
                                    struct btrfs_block_group *block_group,
                                    struct inode *vfs_inode)
@@ -923,27 +904,31 @@ static int copy_free_space_cache(struct btrfs_block_group *block_group,
        while (!ret && (n = rb_first(&ctl->free_space_offset)) != NULL) {
                info = rb_entry(n, struct btrfs_free_space, offset_index);
                if (!info->bitmap) {
+                       const u64 offset = info->offset;
+                       const u64 bytes = info->bytes;
+
                        unlink_free_space(ctl, info, true);
-                       ret = btrfs_add_free_space(block_group, info->offset,
-                                                  info->bytes);
+                       spin_unlock(&ctl->tree_lock);
                        kmem_cache_free(btrfs_free_space_cachep, info);
+                       ret = btrfs_add_free_space(block_group, offset, bytes);
+                       spin_lock(&ctl->tree_lock);
                } else {
                        u64 offset = info->offset;
                        u64 bytes = ctl->unit;
 
-                       while (search_bitmap(ctl, info, &offset, &bytes,
-                                            false) == 0) {
+                       ret = search_bitmap(ctl, info, &offset, &bytes, false);
+                       if (ret == 0) {
+                               bitmap_clear_bits(ctl, info, offset, bytes, true);
+                               spin_unlock(&ctl->tree_lock);
                                ret = btrfs_add_free_space(block_group, offset,
                                                           bytes);
-                               if (ret)
-                                       break;
-                               bitmap_clear_bits(ctl, info, offset, bytes, true);
-                               offset = info->offset;
-                               bytes = ctl->unit;
+                               spin_lock(&ctl->tree_lock);
+                       } else {
+                               free_bitmap(ctl, info);
+                               ret = 0;
                        }
-                       free_bitmap(ctl, info);
                }
-               cond_resched();
+               cond_resched_lock(&ctl->tree_lock);
        }
        return ret;
 }
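
The reworked loop above unlinks each entry while ctl->tree_lock is held, drops the spinlock around btrfs_add_free_space() (which can take other locks and allocate), retakes it, and uses cond_resched_lock() to bound hold times. A generic sketch of that drop-and-reacquire pattern, with purely illustrative names:

static int example_drain_under_lock(spinlock_t *lock, struct list_head *head,
                                    int (*consume)(struct list_head *entry))
{
        int ret = 0;

        spin_lock(lock);
        while (!ret && !list_empty(head)) {
                struct list_head *entry = head->next;

                list_del(entry);                /* unlink while the lock is held */
                spin_unlock(lock);              /* consume() may sleep or lock elsewhere */
                ret = consume(entry);
                spin_lock(lock);
                cond_resched_lock(lock);        /* may drop the lock to reschedule */
        }
        spin_unlock(lock);
        return ret;
}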
@@ -1037,7 +1022,9 @@ int load_free_space_cache(struct btrfs_block_group *block_group)
                                          block_group->bytes_super));
 
        if (matched) {
+               spin_lock(&tmp_ctl.tree_lock);
                ret = copy_free_space_cache(block_group, &tmp_ctl);
+               spin_unlock(&tmp_ctl.tree_lock);
                /*
                 * ret == 1 means we successfully loaded the free space cache,
                 * so we need to re-set it here.
@@ -1596,20 +1583,34 @@ static inline u64 offset_to_bitmap(struct btrfs_free_space_ctl *ctl,
        return bitmap_start;
 }
 
-static int tree_insert_offset(struct rb_root *root, u64 offset,
-                             struct rb_node *node, int bitmap)
+static int tree_insert_offset(struct btrfs_free_space_ctl *ctl,
+                             struct btrfs_free_cluster *cluster,
+                             struct btrfs_free_space *new_entry)
 {
-       struct rb_node **p = &root->rb_node;
+       struct rb_root *root;
+       struct rb_node **p;
        struct rb_node *parent = NULL;
-       struct btrfs_free_space *info;
+
+       lockdep_assert_held(&ctl->tree_lock);
+
+       if (cluster) {
+               lockdep_assert_held(&cluster->lock);
+               root = &cluster->root;
+       } else {
+               root = &ctl->free_space_offset;
+       }
+
+       p = &root->rb_node;
 
        while (*p) {
+               struct btrfs_free_space *info;
+
                parent = *p;
                info = rb_entry(parent, struct btrfs_free_space, offset_index);
 
-               if (offset < info->offset) {
+               if (new_entry->offset < info->offset) {
                        p = &(*p)->rb_left;
-               } else if (offset > info->offset) {
+               } else if (new_entry->offset > info->offset) {
                        p = &(*p)->rb_right;
                } else {
                        /*
@@ -1625,7 +1626,7 @@ static int tree_insert_offset(struct rb_root *root, u64 offset,
                         * found a bitmap, we want to go left, or before
                         * logically.
                         */
-                       if (bitmap) {
+                       if (new_entry->bitmap) {
                                if (info->bitmap) {
                                        WARN_ON_ONCE(1);
                                        return -EEXIST;
@@ -1641,8 +1642,8 @@ static int tree_insert_offset(struct rb_root *root, u64 offset,
                }
        }
 
-       rb_link_node(node, parent, p);
-       rb_insert_color(node, root);
+       rb_link_node(&new_entry->offset_index, parent, p);
+       rb_insert_color(&new_entry->offset_index, root);
 
        return 0;
 }
@@ -1705,6 +1706,8 @@ tree_search_offset(struct btrfs_free_space_ctl *ctl,
        struct rb_node *n = ctl->free_space_offset.rb_node;
        struct btrfs_free_space *entry = NULL, *prev = NULL;
 
+       lockdep_assert_held(&ctl->tree_lock);
+
        /* find entry that is closest to the 'offset' */
        while (n) {
                entry = rb_entry(n, struct btrfs_free_space, offset_index);
@@ -1814,6 +1817,8 @@ static inline void unlink_free_space(struct btrfs_free_space_ctl *ctl,
                                     struct btrfs_free_space *info,
                                     bool update_stat)
 {
+       lockdep_assert_held(&ctl->tree_lock);
+
        rb_erase(&info->offset_index, &ctl->free_space_offset);
        rb_erase_cached(&info->bytes_index, &ctl->free_space_bytes);
        ctl->free_extents--;
@@ -1832,9 +1837,10 @@ static int link_free_space(struct btrfs_free_space_ctl *ctl,
 {
        int ret = 0;
 
+       lockdep_assert_held(&ctl->tree_lock);
+
        ASSERT(info->bytes || info->bitmap);
-       ret = tree_insert_offset(&ctl->free_space_offset, info->offset,
-                                &info->offset_index, (info->bitmap != NULL));
+       ret = tree_insert_offset(ctl, NULL, info);
        if (ret)
                return ret;
 
@@ -1862,6 +1868,8 @@ static void relink_bitmap_entry(struct btrfs_free_space_ctl *ctl,
        if (RB_EMPTY_NODE(&info->bytes_index))
                return;
 
+       lockdep_assert_held(&ctl->tree_lock);
+
        rb_erase_cached(&info->bytes_index, &ctl->free_space_bytes);
        rb_add_cached(&info->bytes_index, &ctl->free_space_bytes, entry_less);
 }
@@ -2447,6 +2455,7 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
        u64 offset = info->offset;
        u64 bytes = info->bytes;
        const bool is_trimmed = btrfs_free_space_trimmed(info);
+       struct rb_node *right_prev = NULL;
 
        /*
         * first we want to see if there is free space adjacent to the range we
@@ -2454,9 +2463,11 @@ static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
         * cover the entire range
         */
        right_info = tree_search_offset(ctl, offset + bytes, 0, 0);
-       if (right_info && rb_prev(&right_info->offset_index))
-               left_info = rb_entry(rb_prev(&right_info->offset_index),
-                                    struct btrfs_free_space, offset_index);
+       if (right_info)
+               right_prev = rb_prev(&right_info->offset_index);
+
+       if (right_prev)
+               left_info = rb_entry(right_prev, struct btrfs_free_space, offset_index);
        else if (!right_info)
                left_info = tree_search_offset(ctl, offset - 1, 0, 0);
 
@@ -2969,9 +2980,10 @@ static void __btrfs_return_cluster_to_free_space(
                             struct btrfs_free_cluster *cluster)
 {
        struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
-       struct btrfs_free_space *entry;
        struct rb_node *node;
 
+       lockdep_assert_held(&ctl->tree_lock);
+
        spin_lock(&cluster->lock);
        if (cluster->block_group != block_group) {
                spin_unlock(&cluster->lock);
@@ -2984,15 +2996,14 @@ static void __btrfs_return_cluster_to_free_space(
 
        node = rb_first(&cluster->root);
        while (node) {
-               bool bitmap;
+               struct btrfs_free_space *entry;
 
                entry = rb_entry(node, struct btrfs_free_space, offset_index);
                node = rb_next(&entry->offset_index);
                rb_erase(&entry->offset_index, &cluster->root);
                RB_CLEAR_NODE(&entry->offset_index);
 
-               bitmap = (entry->bitmap != NULL);
-               if (!bitmap) {
+               if (!entry->bitmap) {
                        /* Merging treats extents as if they were new */
                        if (!btrfs_free_space_trimmed(entry)) {
                                ctl->discardable_extents[BTRFS_STAT_CURR]--;
@@ -3010,8 +3021,7 @@ static void __btrfs_return_cluster_to_free_space(
                                        entry->bytes;
                        }
                }
-               tree_insert_offset(&ctl->free_space_offset,
-                                  entry->offset, &entry->offset_index, bitmap);
+               tree_insert_offset(ctl, NULL, entry);
                rb_add_cached(&entry->bytes_index, &ctl->free_space_bytes,
                              entry_less);
        }
@@ -3324,6 +3334,8 @@ static int btrfs_bitmap_cluster(struct btrfs_block_group *block_group,
        unsigned long total_found = 0;
        int ret;
 
+       lockdep_assert_held(&ctl->tree_lock);
+
        i = offset_to_bit(entry->offset, ctl->unit,
                          max_t(u64, offset, entry->offset));
        want_bits = bytes_to_bits(bytes, ctl->unit);
@@ -3385,8 +3397,7 @@ again:
         */
        RB_CLEAR_NODE(&entry->bytes_index);
 
-       ret = tree_insert_offset(&cluster->root, entry->offset,
-                                &entry->offset_index, 1);
+       ret = tree_insert_offset(ctl, cluster, entry);
        ASSERT(!ret); /* -EEXIST; Logic error */
 
        trace_btrfs_setup_cluster(block_group, cluster,
@@ -3414,6 +3425,8 @@ setup_cluster_no_bitmap(struct btrfs_block_group *block_group,
        u64 max_extent;
        u64 total_size = 0;
 
+       lockdep_assert_held(&ctl->tree_lock);
+
        entry = tree_search_offset(ctl, offset, 0, 1);
        if (!entry)
                return -ENOSPC;
@@ -3476,8 +3489,7 @@ setup_cluster_no_bitmap(struct btrfs_block_group *block_group,
 
                rb_erase(&entry->offset_index, &ctl->free_space_offset);
                rb_erase_cached(&entry->bytes_index, &ctl->free_space_bytes);
-               ret = tree_insert_offset(&cluster->root, entry->offset,
-                                        &entry->offset_index, 0);
+               ret = tree_insert_offset(ctl, cluster, entry);
                total_size += entry->bytes;
                ASSERT(!ret); /* -EEXIST; Logic error */
        } while (node && entry != last);
@@ -3671,7 +3683,7 @@ static int do_trimming(struct btrfs_block_group *block_group,
                __btrfs_add_free_space(block_group, reserved_start,
                                       start - reserved_start,
                                       reserved_trim_state);
-       if (start + bytes < reserved_start + reserved_bytes)
+       if (end < reserved_end)
                __btrfs_add_free_space(block_group, end, reserved_end - end,
                                       reserved_trim_state);
        __btrfs_add_free_space(block_group, start, bytes, trim_state);
index a855e04..33b4da3 100644
@@ -101,8 +101,6 @@ int btrfs_remove_free_space_inode(struct btrfs_trans_handle *trans,
                                  struct inode *inode,
                                  struct btrfs_block_group *block_group);
 
-int btrfs_check_trunc_cache_free_space(struct btrfs_fs_info *fs_info,
-                                      struct btrfs_block_rsv *rsv);
 int btrfs_truncate_free_space_cache(struct btrfs_trans_handle *trans,
                                    struct btrfs_block_group *block_group,
                                    struct inode *inode);
index b21da14..045ddce 100644
@@ -1280,7 +1280,10 @@ int btrfs_delete_free_space_tree(struct btrfs_fs_info *fs_info)
                goto abort;
 
        btrfs_global_root_delete(free_space_root);
+
+       spin_lock(&fs_info->trans_lock);
        list_del(&free_space_root->dirty_list);
+       spin_unlock(&fs_info->trans_lock);
 
        btrfs_tree_lock(free_space_root->node);
        btrfs_clear_buffer_dirty(trans, free_space_root->node);
index 0d98fc5..203d2a2 100644
@@ -543,7 +543,6 @@ struct btrfs_fs_info {
         * A third pool does submit_bio to avoid deadlocking with the other two.
         */
        struct btrfs_workqueue *workers;
-       struct btrfs_workqueue *hipri_workers;
        struct btrfs_workqueue *delalloc_workers;
        struct btrfs_workqueue *flush_workers;
        struct workqueue_struct *endio_workers;
@@ -577,6 +576,7 @@ struct btrfs_fs_info {
        s32 dirty_metadata_batch;
        s32 delalloc_batch;
 
+       /* Protected by 'trans_lock'. */
        struct list_head dirty_cowonly_roots;
 
        struct btrfs_fs_devices *fs_devices;
@@ -643,7 +643,6 @@ struct btrfs_fs_info {
         */
        refcount_t scrub_workers_refcnt;
        struct workqueue_struct *scrub_workers;
-       struct workqueue_struct *scrub_wr_completion_workers;
        struct btrfs_subpage_info *subpage_info;
 
        struct btrfs_discard_ctl discard_ctl;
@@ -854,7 +853,7 @@ static inline u64 btrfs_calc_metadata_size(const struct btrfs_fs_info *fs_info,
 
 static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
 {
-       return fs_info->zone_size > 0;
+       return IS_ENABLED(CONFIG_BLK_DEV_ZONED) && fs_info->zone_size > 0;
 }
 
 /*
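
With the IS_ENABLED(CONFIG_BLK_DEV_ZONED) term added above, btrfs_is_zoned() evaluates to a compile-time constant false on builds without zoned block device support, so zoned-only branches can be discarded by the compiler. A hedged illustration of the intended effect (the function name below is made up):

static void example_zoned_only_path(struct btrfs_fs_info *fs_info)
{
        /*
         * On !CONFIG_BLK_DEV_ZONED builds the condition is constant false, so
         * this branch and any zoned-only helpers it calls compile away.
         */
        if (btrfs_is_zoned(fs_info)) {
                /* zoned-only work would go here */
        }
}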
index b80aeb7..ede43b6 100644
@@ -60,6 +60,22 @@ struct btrfs_truncate_control {
        bool clear_extent_range;
 };
 
+/*
+ * btrfs_inode_item stores flags in a u64, btrfs_inode stores them in two
+ * separate u32s. These two functions convert between the two representations.
+ */
+static inline u64 btrfs_inode_combine_flags(u32 flags, u32 ro_flags)
+{
+       return (flags | ((u64)ro_flags << 32));
+}
+
+static inline void btrfs_inode_split_flags(u64 inode_item_flags,
+                                          u32 *flags, u32 *ro_flags)
+{
+       *flags = (u32)inode_item_flags;
+       *ro_flags = (u32)(inode_item_flags >> 32);
+}
+
 int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
                               struct btrfs_root *root,
                               struct btrfs_truncate_control *control);
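
The two inline helpers added above are exact inverses: ro_flags occupies the upper 32 bits of the on-disk value and flags the lower 32. A standalone round-trip sketch, illustrative and not part of the patch:

static void example_inode_flags_roundtrip(void)
{
        u32 flags = 0x0000beef;         /* runtime inode flags (example value) */
        u32 ro_flags = 0x00000001;      /* read-only compat flags (example value) */
        u64 on_disk = btrfs_inode_combine_flags(flags, ro_flags);
        u32 f, ro;

        btrfs_inode_split_flags(on_disk, &f, &ro);
        /* f == flags and ro == ro_flags hold for any input pair. */
}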
index 7fcafcc..dbbb672 100644
@@ -70,6 +70,7 @@
 #include "verity.h"
 #include "super.h"
 #include "orphan.h"
+#include "backref.h"
 
 struct btrfs_iget_args {
        u64 ino;
@@ -100,6 +101,18 @@ struct btrfs_rename_ctx {
        u64 index;
 };
 
+/*
+ * Used by data_reloc_print_warning_inode() to pass the info needed for
+ * filename resolution and error message output.
+ */
+struct data_reloc_warn {
+       struct btrfs_path path;
+       struct btrfs_fs_info *fs_info;
+       u64 extent_item_size;
+       u64 logical;
+       int mirror_num;
+};
+
 static const struct inode_operations btrfs_dir_inode_operations;
 static const struct inode_operations btrfs_symlink_inode_operations;
 static const struct inode_operations btrfs_special_inode_operations;
@@ -122,12 +135,198 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
                                       u64 ram_bytes, int compress_type,
                                       int type);
 
+static int data_reloc_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
+                                         u64 root, void *warn_ctx)
+{
+       struct data_reloc_warn *warn = warn_ctx;
+       struct btrfs_fs_info *fs_info = warn->fs_info;
+       struct extent_buffer *eb;
+       struct btrfs_inode_item *inode_item;
+       struct inode_fs_paths *ipath = NULL;
+       struct btrfs_root *local_root;
+       struct btrfs_key key;
+       unsigned int nofs_flag;
+       u32 nlink;
+       int ret;
+
+       local_root = btrfs_get_fs_root(fs_info, root, true);
+       if (IS_ERR(local_root)) {
+               ret = PTR_ERR(local_root);
+               goto err;
+       }
+
+       /* This makes the path point to (inum INODE_ITEM ioff). */
+       key.objectid = inum;
+       key.type = BTRFS_INODE_ITEM_KEY;
+       key.offset = 0;
+
+       ret = btrfs_search_slot(NULL, local_root, &key, &warn->path, 0, 0);
+       if (ret) {
+               btrfs_put_root(local_root);
+               btrfs_release_path(&warn->path);
+               goto err;
+       }
+
+       eb = warn->path.nodes[0];
+       inode_item = btrfs_item_ptr(eb, warn->path.slots[0], struct btrfs_inode_item);
+       nlink = btrfs_inode_nlink(eb, inode_item);
+       btrfs_release_path(&warn->path);
+
+       nofs_flag = memalloc_nofs_save();
+       ipath = init_ipath(4096, local_root, &warn->path);
+       memalloc_nofs_restore(nofs_flag);
+       if (IS_ERR(ipath)) {
+               btrfs_put_root(local_root);
+               ret = PTR_ERR(ipath);
+               ipath = NULL;
+               /*
+                * -ENOMEM, not a critical error, just output a generic error
+                * without the filename.
+                */
+               btrfs_warn(fs_info,
+"checksum error at logical %llu mirror %u root %llu, inode %llu offset %llu",
+                          warn->logical, warn->mirror_num, root, inum, offset);
+               return ret;
+       }
+       ret = paths_from_inode(inum, ipath);
+       if (ret < 0)
+               goto err;
+
+       /*
+        * We deliberately ignore the fact that ipath might have been too small
+        * to hold all of the paths here.
+        */
+       for (int i = 0; i < ipath->fspath->elem_cnt; i++) {
+               btrfs_warn(fs_info,
+"checksum error at logical %llu mirror %u root %llu inode %llu offset %llu length %u links %u (path: %s)",
+                          warn->logical, warn->mirror_num, root, inum, offset,
+                          fs_info->sectorsize, nlink,
+                          (char *)(unsigned long)ipath->fspath->val[i]);
+       }
+
+       btrfs_put_root(local_root);
+       free_ipath(ipath);
+       return 0;
+
+err:
+       btrfs_warn(fs_info,
+"checksum error at logical %llu mirror %u root %llu inode %llu offset %llu, path resolving failed with ret=%d",
+                  warn->logical, warn->mirror_num, root, inum, offset, ret);
+
+       free_ipath(ipath);
+       return ret;
+}
+
+/*
+ * Do extra user-friendly error output (e.g. lookup all the affected files).
+ *
+ * If the backref lookup fails, fall back to the old, less detailed error
+ * message.
+ */
+static void print_data_reloc_error(const struct btrfs_inode *inode, u64 file_off,
+                                  const u8 *csum, const u8 *csum_expected,
+                                  int mirror_num)
+{
+       struct btrfs_fs_info *fs_info = inode->root->fs_info;
+       struct btrfs_path path = { 0 };
+       struct btrfs_key found_key = { 0 };
+       struct extent_buffer *eb;
+       struct btrfs_extent_item *ei;
+       const u32 csum_size = fs_info->csum_size;
+       u64 logical;
+       u64 flags;
+       u32 item_size;
+       int ret;
+
+       mutex_lock(&fs_info->reloc_mutex);
+       logical = btrfs_get_reloc_bg_bytenr(fs_info);
+       mutex_unlock(&fs_info->reloc_mutex);
+
+       if (logical == U64_MAX) {
+               btrfs_warn_rl(fs_info, "has data reloc tree but no running relocation");
+               btrfs_warn_rl(fs_info,
+"csum failed root %lld ino %llu off %llu csum " CSUM_FMT " expected csum " CSUM_FMT " mirror %d",
+                       inode->root->root_key.objectid, btrfs_ino(inode), file_off,
+                       CSUM_FMT_VALUE(csum_size, csum),
+                       CSUM_FMT_VALUE(csum_size, csum_expected),
+                       mirror_num);
+               return;
+       }
+
+       logical += file_off;
+       btrfs_warn_rl(fs_info,
+"csum failed root %lld ino %llu off %llu logical %llu csum " CSUM_FMT " expected csum " CSUM_FMT " mirror %d",
+                       inode->root->root_key.objectid,
+                       btrfs_ino(inode), file_off, logical,
+                       CSUM_FMT_VALUE(csum_size, csum),
+                       CSUM_FMT_VALUE(csum_size, csum_expected),
+                       mirror_num);
+
+       ret = extent_from_logical(fs_info, logical, &path, &found_key, &flags);
+       if (ret < 0) {
+               btrfs_err_rl(fs_info, "failed to lookup extent item for logical %llu: %d",
+                            logical, ret);
+               return;
+       }
+       eb = path.nodes[0];
+       ei = btrfs_item_ptr(eb, path.slots[0], struct btrfs_extent_item);
+       item_size = btrfs_item_size(eb, path.slots[0]);
+       if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+               unsigned long ptr = 0;
+               u64 ref_root;
+               u8 ref_level;
+
+               while (true) {
+                       ret = tree_backref_for_extent(&ptr, eb, &found_key, ei,
+                                                     item_size, &ref_root,
+                                                     &ref_level);
+                       if (ret < 0) {
+                               btrfs_warn_rl(fs_info,
+                               "failed to resolve tree backref for logical %llu: %d",
+                                             logical, ret);
+                               break;
+                       }
+                       if (ret > 0)
+                               break;
+
+                       btrfs_warn_rl(fs_info,
+"csum error at logical %llu mirror %u: metadata %s (level %d) in tree %llu",
+                               logical, mirror_num,
+                               (ref_level ? "node" : "leaf"),
+                               ref_level, ref_root);
+               }
+               btrfs_release_path(&path);
+       } else {
+               struct btrfs_backref_walk_ctx ctx = { 0 };
+               struct data_reloc_warn reloc_warn = { 0 };
+
+               btrfs_release_path(&path);
+
+               ctx.bytenr = found_key.objectid;
+               ctx.extent_item_pos = logical - found_key.objectid;
+               ctx.fs_info = fs_info;
+
+               reloc_warn.logical = logical;
+               reloc_warn.extent_item_size = found_key.offset;
+               reloc_warn.mirror_num = mirror_num;
+               reloc_warn.fs_info = fs_info;
+
+               iterate_extent_inodes(&ctx, true,
+                                     data_reloc_print_warning_inode, &reloc_warn);
+       }
+}
+
 static void __cold btrfs_print_data_csum_error(struct btrfs_inode *inode,
                u64 logical_start, u8 *csum, u8 *csum_expected, int mirror_num)
 {
        struct btrfs_root *root = inode->root;
        const u32 csum_size = root->fs_info->csum_size;
 
+       /* For data reloc tree, it's better to do a backref lookup instead. */
+       if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)
+               return print_data_reloc_error(inode, logical_start, csum,
+                                             csum_expected, mirror_num);
+
        /* Output without objectid, which is more meaningful */
        if (root->root_key.objectid >= BTRFS_LAST_FREE_OBJECTID) {
                btrfs_warn_rl(root->fs_info,
@@ -636,6 +835,7 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 {
        struct btrfs_inode *inode = async_chunk->inode;
        struct btrfs_fs_info *fs_info = inode->root->fs_info;
+       struct address_space *mapping = inode->vfs_inode.i_mapping;
        u64 blocksize = fs_info->sectorsize;
        u64 start = async_chunk->start;
        u64 end = async_chunk->end;
@@ -750,7 +950,7 @@ again:
                /* Compression level is applied here and only here */
                ret = btrfs_compress_pages(
                        compress_type | (fs_info->compress_level << 4),
-                                          inode->vfs_inode.i_mapping, start,
+                                          mapping, start,
                                           pages,
                                           &nr_pages,
                                           &total_in,
@@ -793,9 +993,9 @@ cont:
                        unsigned long clear_flags = EXTENT_DELALLOC |
                                EXTENT_DELALLOC_NEW | EXTENT_DEFRAG |
                                EXTENT_DO_ACCOUNTING;
-                       unsigned long page_error_op;
 
-                       page_error_op = ret < 0 ? PAGE_SET_ERROR : 0;
+                       if (ret < 0)
+                               mapping_set_error(mapping, -EIO);
 
                        /*
                         * inline extent creation worked or returned error,
@@ -812,7 +1012,6 @@ cont:
                                                     clear_flags,
                                                     PAGE_UNLOCK |
                                                     PAGE_START_WRITEBACK |
-                                                    page_error_op |
                                                     PAGE_END_WRITEBACK);
 
                        /*
@@ -934,6 +1133,12 @@ static int submit_uncompressed_range(struct btrfs_inode *inode,
        unsigned long nr_written = 0;
        int page_started = 0;
        int ret;
+       struct writeback_control wbc = {
+               .sync_mode              = WB_SYNC_ALL,
+               .range_start            = start,
+               .range_end              = end,
+               .no_cgroup_owner        = 1,
+       };
 
        /*
         * Call cow_file_range() to run the delalloc range directly, since we
@@ -954,8 +1159,6 @@ static int submit_uncompressed_range(struct btrfs_inode *inode,
                        const u64 page_start = page_offset(locked_page);
                        const u64 page_end = page_start + PAGE_SIZE - 1;
 
-                       btrfs_page_set_error(inode->root->fs_info, locked_page,
-                                            page_start, PAGE_SIZE);
                        set_page_writeback(locked_page);
                        end_page_writeback(locked_page);
                        end_extent_writepage(locked_page, ret, page_start, page_end);
@@ -965,7 +1168,10 @@ static int submit_uncompressed_range(struct btrfs_inode *inode,
        }
 
        /* All pages will be unlocked, including @locked_page */
-       return extent_write_locked_range(&inode->vfs_inode, start, end);
+       wbc_attach_fdatawrite_inode(&wbc, &inode->vfs_inode);
+       ret = extent_write_locked_range(&inode->vfs_inode, start, end, &wbc);
+       wbc_detach_inode(&wbc);
+       return ret;
 }
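
submit_uncompressed_range() above now builds its own writeback_control and brackets the range write with wbc_attach_fdatawrite_inode()/wbc_detach_inode() so cgroup writeback ownership is tracked for the locally constructed wbc. A generic sketch of that attach/detach pattern; the function name and the use of filemap_fdatawrite_wbc() are illustrative, not taken from this patch:

static int example_write_range_with_wbc(struct inode *inode, loff_t start, loff_t end)
{
        struct writeback_control wbc = {
                .sync_mode      = WB_SYNC_ALL,
                .range_start    = start,
                .range_end      = end,
        };
        int ret;

        wbc_attach_fdatawrite_inode(&wbc, inode);       /* associate wbc with the inode's cgroup */
        ret = filemap_fdatawrite_wbc(inode->i_mapping, &wbc);
        wbc_detach_inode(&wbc);
        return ret;
}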
 
 static int submit_one_async_extent(struct btrfs_inode *inode,
@@ -976,6 +1182,7 @@ static int submit_one_async_extent(struct btrfs_inode *inode,
        struct extent_io_tree *io_tree = &inode->io_tree;
        struct btrfs_root *root = inode->root;
        struct btrfs_fs_info *fs_info = root->fs_info;
+       struct btrfs_ordered_extent *ordered;
        struct btrfs_key ins;
        struct page *locked_page = NULL;
        struct extent_map *em;
@@ -1037,7 +1244,7 @@ static int submit_one_async_extent(struct btrfs_inode *inode,
        }
        free_extent_map(em);
 
-       ret = btrfs_add_ordered_extent(inode, start,            /* file_offset */
+       ordered = btrfs_alloc_ordered_extent(inode, start,      /* file_offset */
                                       async_extent->ram_size,  /* num_bytes */
                                       async_extent->ram_size,  /* ram_bytes */
                                       ins.objectid,            /* disk_bytenr */
@@ -1045,8 +1252,9 @@ static int submit_one_async_extent(struct btrfs_inode *inode,
                                       0,                       /* offset */
                                       1 << BTRFS_ORDERED_COMPRESSED,
                                       async_extent->compress_type);
-       if (ret) {
+       if (IS_ERR(ordered)) {
                btrfs_drop_extent_map_range(inode, start, end, false);
+               ret = PTR_ERR(ordered);
                goto out_free_reserve;
        }
        btrfs_dec_block_group_reservations(fs_info, ins.objectid);
@@ -1055,11 +1263,7 @@ static int submit_one_async_extent(struct btrfs_inode *inode,
        extent_clear_unlock_delalloc(inode, start, end,
                        NULL, EXTENT_LOCKED | EXTENT_DELALLOC,
                        PAGE_UNLOCK | PAGE_START_WRITEBACK);
-
-       btrfs_submit_compressed_write(inode, start,     /* file_offset */
-                           async_extent->ram_size,     /* num_bytes */
-                           ins.objectid,               /* disk_bytenr */
-                           ins.offset,                 /* compressed_len */
+       btrfs_submit_compressed_write(ordered,
                            async_extent->pages,        /* compressed_pages */
                            async_extent->nr_pages,
                            async_chunk->write_flags, true);
@@ -1074,12 +1278,13 @@ out_free_reserve:
        btrfs_dec_block_group_reservations(fs_info, ins.objectid);
        btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
 out_free:
+       mapping_set_error(inode->vfs_inode.i_mapping, -EIO);
        extent_clear_unlock_delalloc(inode, start, end,
                                     NULL, EXTENT_LOCKED | EXTENT_DELALLOC |
                                     EXTENT_DELALLOC_NEW |
                                     EXTENT_DEFRAG | EXTENT_DO_ACCOUNTING,
                                     PAGE_UNLOCK | PAGE_START_WRITEBACK |
-                                    PAGE_END_WRITEBACK | PAGE_SET_ERROR);
+                                    PAGE_END_WRITEBACK);
        free_async_extent_pages(async_extent);
        goto done;
 }
@@ -1287,6 +1492,8 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
                min_alloc_size = fs_info->sectorsize;
 
        while (num_bytes > 0) {
+               struct btrfs_ordered_extent *ordered;
+
                cur_alloc_size = num_bytes;
                ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size,
                                           min_alloc_size, 0, alloc_hint,
@@ -1311,16 +1518,18 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
                }
                free_extent_map(em);
 
-               ret = btrfs_add_ordered_extent(inode, start, ram_size, ram_size,
-                                              ins.objectid, cur_alloc_size, 0,
-                                              1 << BTRFS_ORDERED_REGULAR,
-                                              BTRFS_COMPRESS_NONE);
-               if (ret)
+               ordered = btrfs_alloc_ordered_extent(inode, start, ram_size,
+                                       ram_size, ins.objectid, cur_alloc_size,
+                                       0, 1 << BTRFS_ORDERED_REGULAR,
+                                       BTRFS_COMPRESS_NONE);
+               if (IS_ERR(ordered)) {
+                       ret = PTR_ERR(ordered);
                        goto out_drop_extent_cache;
+               }
 
                if (btrfs_is_data_reloc_root(root)) {
-                       ret = btrfs_reloc_clone_csums(inode, start,
-                                                     cur_alloc_size);
+                       ret = btrfs_reloc_clone_csums(ordered);
+
                        /*
                         * Only drop cache here, and process as normal.
                         *
@@ -1337,6 +1546,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
                                                            start + ram_size - 1,
                                                            false);
                }
+               btrfs_put_ordered_extent(ordered);
 
                btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 
@@ -1494,7 +1704,7 @@ static noinline void async_cow_submit(struct btrfs_work *work)
         * ->inode could be NULL if async_chunk_start has failed to compress,
         * in which case we don't have anything to submit, yet we need to
         * always adjust ->async_delalloc_pages as it's paired with the init
-        * happening in cow_file_range_async
+        * happening in run_delalloc_compressed
         */
        if (async_chunk->inode)
                submit_compressed_extents(async_chunk);
@@ -1521,58 +1731,36 @@ static noinline void async_cow_free(struct btrfs_work *work)
                kvfree(async_cow);
 }
 
-static int cow_file_range_async(struct btrfs_inode *inode,
-                               struct writeback_control *wbc,
-                               struct page *locked_page,
-                               u64 start, u64 end, int *page_started,
-                               unsigned long *nr_written)
+static bool run_delalloc_compressed(struct btrfs_inode *inode,
+                                   struct writeback_control *wbc,
+                                   struct page *locked_page,
+                                   u64 start, u64 end, int *page_started,
+                                   unsigned long *nr_written)
 {
        struct btrfs_fs_info *fs_info = inode->root->fs_info;
        struct cgroup_subsys_state *blkcg_css = wbc_blkcg_css(wbc);
        struct async_cow *ctx;
        struct async_chunk *async_chunk;
        unsigned long nr_pages;
-       u64 cur_end;
        u64 num_chunks = DIV_ROUND_UP(end - start, SZ_512K);
        int i;
-       bool should_compress;
        unsigned nofs_flag;
        const blk_opf_t write_flags = wbc_to_write_flags(wbc);
 
-       unlock_extent(&inode->io_tree, start, end, NULL);
-
-       if (inode->flags & BTRFS_INODE_NOCOMPRESS &&
-           !btrfs_test_opt(fs_info, FORCE_COMPRESS)) {
-               num_chunks = 1;
-               should_compress = false;
-       } else {
-               should_compress = true;
-       }
-
        nofs_flag = memalloc_nofs_save();
        ctx = kvmalloc(struct_size(ctx, chunks, num_chunks), GFP_KERNEL);
        memalloc_nofs_restore(nofs_flag);
+       if (!ctx)
+               return false;
 
-       if (!ctx) {
-               unsigned clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC |
-                       EXTENT_DELALLOC_NEW | EXTENT_DEFRAG |
-                       EXTENT_DO_ACCOUNTING;
-               unsigned long page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK |
-                                        PAGE_END_WRITEBACK | PAGE_SET_ERROR;
-
-               extent_clear_unlock_delalloc(inode, start, end, locked_page,
-                                            clear_bits, page_ops);
-               return -ENOMEM;
-       }
+       unlock_extent(&inode->io_tree, start, end, NULL);
+       set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &inode->runtime_flags);
 
        async_chunk = ctx->chunks;
        atomic_set(&ctx->num_chunks, num_chunks);
 
        for (i = 0; i < num_chunks; i++) {
-               if (should_compress)
-                       cur_end = min(end, start + SZ_512K - 1);
-               else
-                       cur_end = end;
+               u64 cur_end = min(end, start + SZ_512K - 1);
 
                /*
                 * igrab is called higher up in the call chain, take only the
@@ -1633,13 +1821,14 @@ static int cow_file_range_async(struct btrfs_inode *inode,
                start = cur_end + 1;
        }
        *page_started = 1;
-       return 0;
+       return true;
 }
 
 static noinline int run_delalloc_zoned(struct btrfs_inode *inode,
                                       struct page *locked_page, u64 start,
                                       u64 end, int *page_started,
-                                      unsigned long *nr_written)
+                                      unsigned long *nr_written,
+                                      struct writeback_control *wbc)
 {
        u64 done_offset = end;
        int ret;
@@ -1671,8 +1860,8 @@ static noinline int run_delalloc_zoned(struct btrfs_inode *inode,
                        account_page_redirty(locked_page);
                }
                locked_page_done = true;
-               extent_write_locked_range(&inode->vfs_inode, start, done_offset);
-
+               extent_write_locked_range(&inode->vfs_inode, start, done_offset,
+                                         wbc);
                start = done_offset + 1;
        }
 
@@ -1947,6 +2136,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
        nocow_args.writeback_path = true;
 
        while (1) {
+               struct btrfs_ordered_extent *ordered;
                struct btrfs_key found_key;
                struct btrfs_file_extent_item *fi;
                struct extent_buffer *leaf;
@@ -1954,6 +2144,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
                u64 ram_bytes;
                u64 nocow_end;
                int extent_type;
+               bool is_prealloc;
 
                nocow = false;
 
@@ -2092,8 +2283,8 @@ out_check:
                }
 
                nocow_end = cur_offset + nocow_args.num_bytes - 1;
-
-               if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
+               is_prealloc = extent_type == BTRFS_FILE_EXTENT_PREALLOC;
+               if (is_prealloc) {
                        u64 orig_start = found_key.offset - nocow_args.extent_offset;
                        struct extent_map *em;
 
@@ -2109,29 +2300,22 @@ out_check:
                                goto error;
                        }
                        free_extent_map(em);
-                       ret = btrfs_add_ordered_extent(inode,
-                                       cur_offset, nocow_args.num_bytes,
-                                       nocow_args.num_bytes,
-                                       nocow_args.disk_bytenr,
-                                       nocow_args.num_bytes, 0,
-                                       1 << BTRFS_ORDERED_PREALLOC,
-                                       BTRFS_COMPRESS_NONE);
-                       if (ret) {
+               }
+
+               ordered = btrfs_alloc_ordered_extent(inode, cur_offset,
+                               nocow_args.num_bytes, nocow_args.num_bytes,
+                               nocow_args.disk_bytenr, nocow_args.num_bytes, 0,
+                               is_prealloc
+                               ? (1 << BTRFS_ORDERED_PREALLOC)
+                               : (1 << BTRFS_ORDERED_NOCOW),
+                               BTRFS_COMPRESS_NONE);
+               if (IS_ERR(ordered)) {
+                       if (is_prealloc) {
                                btrfs_drop_extent_map_range(inode, cur_offset,
                                                            nocow_end, false);
-                               goto error;
                        }
-               } else {
-                       ret = btrfs_add_ordered_extent(inode, cur_offset,
-                                                      nocow_args.num_bytes,
-                                                      nocow_args.num_bytes,
-                                                      nocow_args.disk_bytenr,
-                                                      nocow_args.num_bytes,
-                                                      0,
-                                                      1 << BTRFS_ORDERED_NOCOW,
-                                                      BTRFS_COMPRESS_NONE);
-                       if (ret)
-                               goto error;
+                       ret = PTR_ERR(ordered);
+                       goto error;
                }
 
                if (nocow) {
@@ -2145,8 +2329,8 @@ out_check:
                         * extent_clear_unlock_delalloc() in error handler
                         * from freeing metadata of created ordered extent.
                         */
-                       ret = btrfs_reloc_clone_csums(inode, cur_offset,
-                                                     nocow_args.num_bytes);
+                       ret = btrfs_reloc_clone_csums(ordered);
+               btrfs_put_ordered_extent(ordered);
 
                extent_clear_unlock_delalloc(inode, cur_offset, nocow_end,
                                             locked_page, EXTENT_LOCKED |
@@ -2214,7 +2398,7 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
                u64 start, u64 end, int *page_started, unsigned long *nr_written,
                struct writeback_control *wbc)
 {
-       int ret;
+       int ret = 0;
        const bool zoned = btrfs_is_zoned(inode->root->fs_info);
 
        /*
@@ -2235,19 +2419,23 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
                ASSERT(!zoned || btrfs_is_data_reloc_root(inode->root));
                ret = run_delalloc_nocow(inode, locked_page, start, end,
                                         page_started, nr_written);
-       } else if (!btrfs_inode_can_compress(inode) ||
-                  !inode_need_compress(inode, start, end)) {
-               if (zoned)
-                       ret = run_delalloc_zoned(inode, locked_page, start, end,
-                                                page_started, nr_written);
-               else
-                       ret = cow_file_range(inode, locked_page, start, end,
-                                            page_started, nr_written, 1, NULL);
-       } else {
-               set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &inode->runtime_flags);
-               ret = cow_file_range_async(inode, wbc, locked_page, start, end,
-                                          page_started, nr_written);
+               goto out;
        }
+
+       if (btrfs_inode_can_compress(inode) &&
+           inode_need_compress(inode, start, end) &&
+           run_delalloc_compressed(inode, wbc, locked_page, start,
+                                   end, page_started, nr_written))
+               goto out;
+
+       if (zoned)
+               ret = run_delalloc_zoned(inode, locked_page, start, end,
+                                        page_started, nr_written, wbc);
+       else
+               ret = cow_file_range(inode, locked_page, start, end,
+                                    page_started, nr_written, 1, NULL);
+
+out:
        ASSERT(ret <= 0);
        if (ret)
                btrfs_cleanup_ordered_extents(inode, locked_page, start,
@@ -2515,125 +2703,42 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
        }
 }
 
-/*
- * Split off the first pre bytes from the extent_map at [start, start + len]
- *
- * This function is intended to be used only for extract_ordered_extent().
- */
-static int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre)
-{
-       struct extent_map_tree *em_tree = &inode->extent_tree;
-       struct extent_map *em;
-       struct extent_map *split_pre = NULL;
-       struct extent_map *split_mid = NULL;
-       int ret = 0;
-       unsigned long flags;
-
-       ASSERT(pre != 0);
-       ASSERT(pre < len);
-
-       split_pre = alloc_extent_map();
-       if (!split_pre)
-               return -ENOMEM;
-       split_mid = alloc_extent_map();
-       if (!split_mid) {
-               ret = -ENOMEM;
-               goto out_free_pre;
-       }
-
-       lock_extent(&inode->io_tree, start, start + len - 1, NULL);
-       write_lock(&em_tree->lock);
-       em = lookup_extent_mapping(em_tree, start, len);
-       if (!em) {
-               ret = -EIO;
-               goto out_unlock;
-       }
-
-       ASSERT(em->len == len);
-       ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
-       ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE);
-       ASSERT(test_bit(EXTENT_FLAG_PINNED, &em->flags));
-       ASSERT(!test_bit(EXTENT_FLAG_LOGGING, &em->flags));
-       ASSERT(!list_empty(&em->list));
-
-       flags = em->flags;
-       clear_bit(EXTENT_FLAG_PINNED, &em->flags);
-
-       /* First, replace the em with a new extent_map starting from * em->start */
-       split_pre->start = em->start;
-       split_pre->len = pre;
-       split_pre->orig_start = split_pre->start;
-       split_pre->block_start = em->block_start;
-       split_pre->block_len = split_pre->len;
-       split_pre->orig_block_len = split_pre->block_len;
-       split_pre->ram_bytes = split_pre->len;
-       split_pre->flags = flags;
-       split_pre->compress_type = em->compress_type;
-       split_pre->generation = em->generation;
-
-       replace_extent_mapping(em_tree, em, split_pre, 1);
-
-       /*
-        * Now we only have an extent_map at:
-        *     [em->start, em->start + pre]
-        */
-
-       /* Insert the middle extent_map. */
-       split_mid->start = em->start + pre;
-       split_mid->len = em->len - pre;
-       split_mid->orig_start = split_mid->start;
-       split_mid->block_start = em->block_start + pre;
-       split_mid->block_len = split_mid->len;
-       split_mid->orig_block_len = split_mid->block_len;
-       split_mid->ram_bytes = split_mid->len;
-       split_mid->flags = flags;
-       split_mid->compress_type = em->compress_type;
-       split_mid->generation = em->generation;
-       add_extent_mapping(em_tree, split_mid, 1);
-
-       /* Once for us */
-       free_extent_map(em);
-       /* Once for the tree */
-       free_extent_map(em);
-
-out_unlock:
-       write_unlock(&em_tree->lock);
-       unlock_extent(&inode->io_tree, start, start + len - 1, NULL);
-       free_extent_map(split_mid);
-out_free_pre:
-       free_extent_map(split_pre);
-       return ret;
-}
-
-int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
-                                struct btrfs_ordered_extent *ordered)
+static int btrfs_extract_ordered_extent(struct btrfs_bio *bbio,
+                                       struct btrfs_ordered_extent *ordered)
 {
        u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
        u64 len = bbio->bio.bi_iter.bi_size;
-       struct btrfs_inode *inode = bbio->inode;
-       u64 ordered_len = ordered->num_bytes;
-       int ret = 0;
+       struct btrfs_ordered_extent *new;
+       int ret;
 
        /* Must always be called for the beginning of an ordered extent. */
        if (WARN_ON_ONCE(start != ordered->disk_bytenr))
                return -EINVAL;
 
        /* No need to split if the ordered extent covers the entire bio. */
-       if (ordered->disk_num_bytes == len)
+       if (ordered->disk_num_bytes == len) {
+               refcount_inc(&ordered->refs);
+               bbio->ordered = ordered;
                return 0;
-
-       ret = btrfs_split_ordered_extent(ordered, len);
-       if (ret)
-               return ret;
+       }
 
        /*
         * Don't split the extent_map for NOCOW extents, as we're writing into
         * a pre-existing one.
         */
-       if (test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
-               return 0;
+       if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) {
+               ret = split_extent_map(bbio->inode, bbio->file_offset,
+                                      ordered->num_bytes, len,
+                                      ordered->disk_bytenr);
+               if (ret)
+                       return ret;
+       }
 
-       return split_extent_map(inode, bbio->file_offset, ordered_len, len);
+       new = btrfs_split_ordered_extent(ordered, len);
+       if (IS_ERR(new))
+               return PTR_ERR(new);
+       bbio->ordered = new;
+       return 0;
 }
 
 /*
@@ -2651,7 +2756,7 @@ static int add_pending_csums(struct btrfs_trans_handle *trans,
                trans->adding_csums = true;
                if (!csum_root)
                        csum_root = btrfs_csum_root(trans->fs_info,
-                                                   sum->bytenr);
+                                                   sum->logical);
                ret = btrfs_csum_file_blocks(trans, csum_root, sum);
                trans->adding_csums = false;
                if (ret)
@@ -2689,8 +2794,7 @@ static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
 
                ret = set_extent_bit(&inode->io_tree, search_start,
                                     search_start + em_len - 1,
-                                    EXTENT_DELALLOC_NEW, cached_state,
-                                    GFP_NOFS);
+                                    EXTENT_DELALLOC_NEW, cached_state);
 next:
                search_start = extent_map_end(em);
                free_extent_map(em);
@@ -2723,8 +2827,8 @@ int btrfs_set_extent_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
                        return ret;
        }
 
-       return set_extent_delalloc(&inode->io_tree, start, end, extra_bits,
-                                  cached_state);
+       return set_extent_bit(&inode->io_tree, start, end,
+                             EXTENT_DELALLOC | extra_bits, cached_state);
 }
 
 /* see btrfs_writepage_start_hook for details on why this is required */
@@ -2847,7 +2951,6 @@ out_page:
                mapping_set_error(page->mapping, ret);
                end_extent_writepage(page, ret, page_start, page_end);
                clear_page_dirty_for_io(page);
-               SetPageError(page);
        }
        btrfs_page_clear_checked(inode->root->fs_info, page, page_start, PAGE_SIZE);
        unlock_page(page);
@@ -3068,7 +3171,7 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans,
  * an ordered extent if the range of bytes in the file it covers is
  * fully written.
  */
-int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
+int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
 {
        struct btrfs_inode *inode = BTRFS_I(ordered_extent->inode);
        struct btrfs_root *root = inode->root;
@@ -3103,15 +3206,9 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
                goto out;
        }
 
-       /* A valid ->physical implies a write on a sequential zone. */
-       if (ordered_extent->physical != (u64)-1) {
-               btrfs_rewrite_logical_zoned(ordered_extent);
+       if (btrfs_is_zoned(fs_info))
                btrfs_zone_finish_endio(fs_info, ordered_extent->disk_bytenr,
                                        ordered_extent->disk_num_bytes);
-       } else if (btrfs_is_data_reloc_root(inode->root)) {
-               btrfs_zone_finish_endio(fs_info, ordered_extent->disk_bytenr,
-                                       ordered_extent->disk_num_bytes);
-       }
 
        if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
                truncated = true;
@@ -3279,6 +3376,14 @@ out:
        return ret;
 }
 
+int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered)
+{
+       if (btrfs_is_zoned(btrfs_sb(ordered->inode->i_sb)) &&
+           !test_bit(BTRFS_ORDERED_IOERR, &ordered->flags))
+               btrfs_finish_ordered_zoned(ordered);
+       return btrfs_finish_one_ordered(ordered);
+}
+
 void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
                                          struct page *page, u64 start,
                                          u64 end, bool uptodate)
@@ -4226,7 +4331,7 @@ static int btrfs_unlink(struct inode *dir, struct dentry *dentry)
        }
 
        btrfs_record_unlink_dir(trans, BTRFS_I(dir), BTRFS_I(d_inode(dentry)),
-                       0);
+                               false);
 
        ret = btrfs_unlink_inode(trans, BTRFS_I(dir), BTRFS_I(d_inode(dentry)),
                                 &fname.disk_name);
@@ -4801,7 +4906,7 @@ again:
 
        if (only_release_metadata)
                set_extent_bit(&inode->io_tree, block_start, block_end,
-                              EXTENT_NORESERVE, NULL, GFP_NOFS);
+                              EXTENT_NORESERVE, NULL);
 
 out_unlock:
        if (ret) {
@@ -7670,8 +7775,8 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
                pos += submitted;
                length -= submitted;
                if (write)
-                       btrfs_mark_ordered_io_finished(BTRFS_I(inode), NULL,
-                                                      pos, length, false);
+                       btrfs_finish_ordered_extent(dio_data->ordered, NULL,
+                                                   pos, length, false);
                else
                        unlock_extent(&BTRFS_I(inode)->io_tree, pos,
                                      pos + length - 1, NULL);
@@ -7701,12 +7806,14 @@ static void btrfs_dio_end_io(struct btrfs_bio *bbio)
                           dip->file_offset, dip->bytes, bio->bi_status);
        }
 
-       if (btrfs_op(bio) == BTRFS_MAP_WRITE)
-               btrfs_mark_ordered_io_finished(inode, NULL, dip->file_offset,
-                                              dip->bytes, !bio->bi_status);
-       else
+       if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
+               btrfs_finish_ordered_extent(bbio->ordered, NULL,
+                                           dip->file_offset, dip->bytes,
+                                           !bio->bi_status);
+       } else {
                unlock_extent(&inode->io_tree, dip->file_offset,
                              dip->file_offset + dip->bytes - 1, NULL);
+       }
 
        bbio->bio.bi_private = bbio->private;
        iomap_dio_bio_end_io(bio);
@@ -7742,7 +7849,8 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 
                ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered);
                if (ret) {
-                       btrfs_bio_end_io(bbio, errno_to_blk_status(ret));
+                       bbio->bio.bi_status = errno_to_blk_status(ret);
+                       btrfs_dio_end_io(bbio);
                        return;
                }
        }
@@ -8236,7 +8344,7 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
        int ret;
        struct btrfs_trans_handle *trans;
        u64 mask = fs_info->sectorsize - 1;
-       u64 min_size = btrfs_calc_metadata_size(fs_info, 1);
+       const u64 min_size = btrfs_calc_metadata_size(fs_info, 1);
 
        if (!skip_writeback) {
                ret = btrfs_wait_ordered_range(&inode->vfs_inode,
@@ -8293,7 +8401,15 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
        /* Migrate the slack space for the truncate to our reserve */
        ret = btrfs_block_rsv_migrate(&fs_info->trans_block_rsv, rsv,
                                      min_size, false);
-       BUG_ON(ret);
+       /*
+        * We have reserved 2 metadata units when we started the transaction and
+        * min_size matches 1 unit, so this should never fail, but if it does,
+        * it's not critical: we just fail truncation.
+        */
+       if (WARN_ON(ret)) {
+               btrfs_end_transaction(trans);
+               goto out;
+       }
 
        trans->block_rsv = rsv;
 
@@ -8341,7 +8457,14 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
                btrfs_block_rsv_release(fs_info, rsv, -1, NULL);
                ret = btrfs_block_rsv_migrate(&fs_info->trans_block_rsv,
                                              rsv, min_size, false);
-               BUG_ON(ret);    /* shouldn't happen */
+               /*
+                * We have reserved 2 metadata units when we started the
+                * transaction and min_size matches 1 unit, so this should never
+                * fail, but if it does, it's not critical: we just fail truncation.
+                */
+               if (WARN_ON(ret))
+                       break;
+
                trans->block_rsv = rsv;
        }
 
@@ -8468,7 +8591,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
        ei->io_tree.inode = ei;
        extent_io_tree_init(fs_info, &ei->file_extent_tree,
                            IO_TREE_INODE_FILE_EXTENT);
-       atomic_set(&ei->sync_writers, 0);
        mutex_init(&ei->log_mutex);
        btrfs_ordered_inode_tree_init(&ei->ordered_tree);
        INIT_LIST_HEAD(&ei->delalloc_inodes);
@@ -8639,7 +8761,7 @@ static int btrfs_getattr(struct mnt_idmap *idmap,
        inode_bytes = inode_get_bytes(inode);
        spin_unlock(&BTRFS_I(inode)->lock);
        stat->blocks = (ALIGN(inode_bytes, blocksize) +
-                       ALIGN(delalloc_bytes, blocksize)) >> 9;
+                       ALIGN(delalloc_bytes, blocksize)) >> SECTOR_SHIFT;
        return 0;
 }
 
@@ -8795,9 +8917,9 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 
        if (old_dentry->d_parent != new_dentry->d_parent) {
                btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
-                               BTRFS_I(old_inode), 1);
+                                       BTRFS_I(old_inode), true);
                btrfs_record_unlink_dir(trans, BTRFS_I(new_dir),
-                               BTRFS_I(new_inode), 1);
+                                       BTRFS_I(new_inode), true);
        }
 
        /* src is a subvolume */
@@ -9063,7 +9185,7 @@ static int btrfs_rename(struct mnt_idmap *idmap,
 
        if (old_dentry->d_parent != new_dentry->d_parent)
                btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
-                               BTRFS_I(old_inode), 1);
+                                       BTRFS_I(old_inode), true);
 
        if (unlikely(old_ino == BTRFS_FIRST_FREE_OBJECTID)) {
                ret = btrfs_unlink_subvol(trans, BTRFS_I(old_dir), old_dentry);
@@ -10170,6 +10292,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
        struct extent_io_tree *io_tree = &inode->io_tree;
        struct extent_changeset *data_reserved = NULL;
        struct extent_state *cached_state = NULL;
+       struct btrfs_ordered_extent *ordered;
        int compression;
        size_t orig_count;
        u64 start, end;
@@ -10346,14 +10469,15 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
        }
        free_extent_map(em);
 
-       ret = btrfs_add_ordered_extent(inode, start, num_bytes, ram_bytes,
+       ordered = btrfs_alloc_ordered_extent(inode, start, num_bytes, ram_bytes,
                                       ins.objectid, ins.offset,
                                       encoded->unencoded_offset,
                                       (1 << BTRFS_ORDERED_ENCODED) |
                                       (1 << BTRFS_ORDERED_COMPRESSED),
                                       compression);
-       if (ret) {
+       if (IS_ERR(ordered)) {
                btrfs_drop_extent_map_range(inode, start, end, false);
+               ret = PTR_ERR(ordered);
                goto out_free_reserved;
        }
        btrfs_dec_block_group_reservations(fs_info, ins.objectid);
@@ -10365,8 +10489,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 
        btrfs_delalloc_release_extents(inode, num_bytes);
 
-       btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid,
-                                         ins.offset, pages, nr_pages, 0, false);
+       btrfs_submit_compressed_write(ordered, pages, nr_pages, 0, false);
        ret = orig_count;
        goto out;
 
@@ -10903,7 +11026,6 @@ static const struct address_space_operations btrfs_aops = {
        .read_folio     = btrfs_read_folio,
        .writepages     = btrfs_writepages,
        .readahead      = btrfs_readahead,
-       .direct_IO      = noop_direct_IO,
        .invalidate_folio = btrfs_invalidate_folio,
        .release_folio  = btrfs_release_folio,
        .migrate_folio  = btrfs_migrate_folio,
index 2fa36f6..a895d10 100644 (file)
@@ -649,6 +649,8 @@ static noinline int create_subvol(struct mnt_idmap *idmap,
        }
        trans->block_rsv = &block_rsv;
        trans->bytes_reserved = block_rsv.size;
+       /* Tree log can't currently deal with an inode which is a new root. */
+       btrfs_set_log_full_commit(trans);
 
        ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit);
        if (ret)
@@ -757,10 +759,7 @@ out:
        trans->bytes_reserved = 0;
        btrfs_subvolume_release_metadata(root, &block_rsv);
 
-       if (ret)
-               btrfs_end_transaction(trans);
-       else
-               ret = btrfs_commit_transaction(trans);
+       btrfs_end_transaction(trans);
 out_new_inode_args:
        btrfs_new_inode_args_destroy(&new_inode_args);
 out_inode:
@@ -2672,7 +2671,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
        struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
        struct btrfs_ioctl_vol_args_v2 *vol_args;
        struct block_device *bdev = NULL;
-       fmode_t mode;
+       void *holder;
        int ret;
        bool cancel = false;
 
@@ -2709,7 +2708,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
                goto err_drop;
 
        /* Exclusive operation is now claimed */
-       ret = btrfs_rm_device(fs_info, &args, &bdev, &mode);
+       ret = btrfs_rm_device(fs_info, &args, &bdev, &holder);
 
        btrfs_exclop_finish(fs_info);
 
@@ -2724,7 +2723,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
 err_drop:
        mnt_drop_write_file(file);
        if (bdev)
-               blkdev_put(bdev, mode);
+               blkdev_put(bdev, holder);
 out:
        btrfs_put_dev_args_from_path(&args);
        kfree(vol_args);
@@ -2738,7 +2737,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)
        struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
        struct btrfs_ioctl_vol_args *vol_args;
        struct block_device *bdev = NULL;
-       fmode_t mode;
+       void *holder;
        int ret;
        bool cancel = false;
 
@@ -2765,7 +2764,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)
        ret = exclop_start_or_cancel_reloc(fs_info, BTRFS_EXCLOP_DEV_REMOVE,
                                           cancel);
        if (ret == 0) {
-               ret = btrfs_rm_device(fs_info, &args, &bdev, &mode);
+               ret = btrfs_rm_device(fs_info, &args, &bdev, &holder);
                if (!ret)
                        btrfs_info(fs_info, "disk deleted %s", vol_args->name);
                btrfs_exclop_finish(fs_info);
@@ -2773,7 +2772,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)
 
        mnt_drop_write_file(file);
        if (bdev)
-               blkdev_put(bdev, mode);
+               blkdev_put(bdev, holder);
 out:
        btrfs_put_dev_args_from_path(&args);
        kfree(vol_args);
@@ -3113,6 +3112,13 @@ static noinline long btrfs_ioctl_start_sync(struct btrfs_root *root,
        struct btrfs_trans_handle *trans;
        u64 transid;
 
+       /*
+        * Start orphan cleanup here for the given root in case it hasn't been
+        * started already by other means. Errors are handled in the other
+        * functions during transaction commit.
+        */
+       btrfs_orphan_cleanup(root);
+
        trans = btrfs_attach_transaction_barrier(root);
        if (IS_ERR(trans)) {
                if (PTR_ERR(trans) != -ENOENT)
@@ -3134,14 +3140,13 @@ out:
 static noinline long btrfs_ioctl_wait_sync(struct btrfs_fs_info *fs_info,
                                           void __user *argp)
 {
-       u64 transid;
+       /* By default wait for the current transaction. */
+       u64 transid = 0;
 
-       if (argp) {
+       if (argp)
                if (copy_from_user(&transid, argp, sizeof(transid)))
                        return -EFAULT;
-       } else {
-               transid = 0;  /* current trans */
-       }
+
        return btrfs_wait_for_commit(fs_info, transid);
 }
 
index 3a496b0..7979449 100644 (file)
@@ -57,8 +57,8 @@
 
 static struct btrfs_lockdep_keyset {
        u64                     id;             /* root objectid */
-       /* Longest entry: btrfs-free-space-00 */
-       char                    names[BTRFS_MAX_LEVEL][20];
+       /* Longest entry: btrfs-block-group-00 */
+       char                    names[BTRFS_MAX_LEVEL][24];
        struct lock_class_key   keys[BTRFS_MAX_LEVEL];
 } btrfs_lockdep_keysets[] = {
        { .id = BTRFS_ROOT_TREE_OBJECTID,       DEFINE_NAME("root")     },
@@ -72,6 +72,7 @@ static struct btrfs_lockdep_keyset {
        { .id = BTRFS_DATA_RELOC_TREE_OBJECTID, DEFINE_NAME("dreloc")   },
        { .id = BTRFS_UUID_TREE_OBJECTID,       DEFINE_NAME("uuid")     },
        { .id = BTRFS_FREE_SPACE_TREE_OBJECTID, DEFINE_NAME("free-space") },
+       { .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
        { .id = 0,                              DEFINE_NAME("tree")     },
 };
 
index 3a095b9..d3fcfc6 100644 (file)
@@ -88,9 +88,9 @@ struct list_head *lzo_alloc_workspace(unsigned int level)
        if (!workspace)
                return ERR_PTR(-ENOMEM);
 
-       workspace->mem = kvmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL);
-       workspace->buf = kvmalloc(WORKSPACE_BUF_LENGTH, GFP_KERNEL);
-       workspace->cbuf = kvmalloc(WORKSPACE_CBUF_LENGTH, GFP_KERNEL);
+       workspace->mem = kvmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL | __GFP_NOWARN);
+       workspace->buf = kvmalloc(WORKSPACE_BUF_LENGTH, GFP_KERNEL | __GFP_NOWARN);
+       workspace->cbuf = kvmalloc(WORKSPACE_CBUF_LENGTH, GFP_KERNEL | __GFP_NOWARN);
        if (!workspace->mem || !workspace->buf || !workspace->cbuf)
                goto fail;
 
index 310a05c..23fc11a 100644 (file)
@@ -252,14 +252,6 @@ void __cold _btrfs_printk(const struct btrfs_fs_info *fs_info, const char *fmt,
 }
 #endif
 
-#ifdef CONFIG_BTRFS_ASSERT
-void __cold __noreturn btrfs_assertfail(const char *expr, const char *file, int line)
-{
-       pr_err("assertion failed: %s, in %s:%d\n", expr, file, line);
-       BUG();
-}
-#endif
-
 void __cold btrfs_print_v0_err(struct btrfs_fs_info *fs_info)
 {
        btrfs_err(fs_info,
index ac2d198..deedc1a 100644 (file)
@@ -4,14 +4,23 @@
 #define BTRFS_MESSAGES_H
 
 #include <linux/types.h>
+#include <linux/printk.h>
+#include <linux/bug.h>
 
 struct btrfs_fs_info;
 
+/*
+ * We want to be able to override this in btrfs-progs.
+ */
+#ifdef __KERNEL__
+
 static inline __printf(2, 3) __cold
 void btrfs_no_printk(const struct btrfs_fs_info *fs_info, const char *fmt, ...)
 {
 }
 
+#endif
+
 #ifdef CONFIG_PRINTK
 
 #define btrfs_printk(fs_info, fmt, args...)                            \
@@ -160,7 +169,11 @@ do {                                                               \
 } while (0)
 
 #ifdef CONFIG_BTRFS_ASSERT
-void __cold __noreturn btrfs_assertfail(const char *expr, const char *file, int line);
+
+#define btrfs_assertfail(expr, file, line)     ({                              \
+       pr_err("assertion failed: %s, in %s:%d\n", (expr), (file), (line));     \
+       BUG();                                                          \
+})
 
 #define ASSERT(expr)                                           \
        (likely(expr) ? (void)0 : btrfs_assertfail(#expr, __FILE__, __LINE__))
index 768583a..005751a 100644 (file)
@@ -143,4 +143,24 @@ static inline struct rb_node *rb_simple_insert(struct rb_root *root, u64 bytenr,
        return NULL;
 }
 
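+/* Return true if all bits in the range [start, start + nbits) are set. */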
+static inline bool bitmap_test_range_all_set(const unsigned long *addr,
+                                            unsigned long start,
+                                            unsigned long nbits)
+{
+       unsigned long found_zero;
+
+       found_zero = find_next_zero_bit(addr, start + nbits, start);
+       return (found_zero == start + nbits);
+}
+
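+/* Return true if all bits in the range [start, start + nbits) are clear. */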
+static inline bool bitmap_test_range_all_zero(const unsigned long *addr,
+                                             unsigned long start,
+                                             unsigned long nbits)
+{
+       unsigned long found_set;
+
+       found_set = find_next_bit(addr, start + nbits, start);
+       return (found_set == start + nbits);
+}
+
 #endif
index a9778a9..a629532 100644 (file)
@@ -146,35 +146,11 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
        return ret;
 }
 
-/*
- * Add an ordered extent to the per-inode tree.
- *
- * @inode:           Inode that this extent is for.
- * @file_offset:     Logical offset in file where the extent starts.
- * @num_bytes:       Logical length of extent in file.
- * @ram_bytes:       Full length of unencoded data.
- * @disk_bytenr:     Offset of extent on disk.
- * @disk_num_bytes:  Size of extent on disk.
- * @offset:          Offset into unencoded data where file data starts.
- * @flags:           Flags specifying type of extent (1 << BTRFS_ORDERED_*).
- * @compress_type:   Compression algorithm used for data.
- *
- * Most of these parameters correspond to &struct btrfs_file_extent_item. The
- * tree is given a single reference on the ordered extent that was inserted, and
- * the returned pointer is given a second reference.
- *
- * Return: the new ordered extent or error pointer.
- */
-struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
-                       struct btrfs_inode *inode, u64 file_offset,
-                       u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
-                       u64 disk_num_bytes, u64 offset, unsigned long flags,
-                       int compress_type)
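+/*
+ * Allocate and initialize an ordered extent for the given file range, without
+ * inserting it into the inode's ordered tree.
+ */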
+static struct btrfs_ordered_extent *alloc_ordered_extent(
+                       struct btrfs_inode *inode, u64 file_offset, u64 num_bytes,
+                       u64 ram_bytes, u64 disk_bytenr, u64 disk_num_bytes,
+                       u64 offset, unsigned long flags, int compress_type)
 {
-       struct btrfs_root *root = inode->root;
-       struct btrfs_fs_info *fs_info = root->fs_info;
-       struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
-       struct rb_node *node;
        struct btrfs_ordered_extent *entry;
        int ret;
 
@@ -184,7 +160,6 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
                ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
                if (ret < 0)
                        return ERR_PTR(ret);
-               ret = 0;
        } else {
                /*
                 * The ordered extent has reserved qgroup space, release now
@@ -209,15 +184,7 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
        entry->compress_type = compress_type;
        entry->truncated_len = (u64)-1;
        entry->qgroup_rsv = ret;
-       entry->physical = (u64)-1;
-
-       ASSERT((flags & ~BTRFS_ORDERED_TYPE_FLAGS) == 0);
        entry->flags = flags;
-
-       percpu_counter_add_batch(&fs_info->ordered_bytes, num_bytes,
-                                fs_info->delalloc_batch);
-
-       /* one ref for the tree */
        refcount_set(&entry->refs, 1);
        init_waitqueue_head(&entry->wait);
        INIT_LIST_HEAD(&entry->list);
@@ -226,15 +193,40 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
        INIT_LIST_HEAD(&entry->work_list);
        init_completion(&entry->completion);
 
+       /*
+        * We don't need count_max_extents() here, since we can assume that all
+        * of that work has been done at higher layers, so this is truly the
+        * smallest the extent is going to get.
+        */
+       spin_lock(&inode->lock);
+       btrfs_mod_outstanding_extents(inode, 1);
+       spin_unlock(&inode->lock);
+
+       return entry;
+}
+
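+/*
+ * Insert a previously allocated ordered extent into the inode's ordered tree
+ * and into the root's list of ordered extents.
+ */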
+static void insert_ordered_extent(struct btrfs_ordered_extent *entry)
+{
+       struct btrfs_inode *inode = BTRFS_I(entry->inode);
+       struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
+       struct btrfs_root *root = inode->root;
+       struct btrfs_fs_info *fs_info = root->fs_info;
+       struct rb_node *node;
+
        trace_btrfs_ordered_extent_add(inode, entry);
 
+       percpu_counter_add_batch(&fs_info->ordered_bytes, entry->num_bytes,
+                                fs_info->delalloc_batch);
+
+       /* One ref for the tree. */
+       refcount_inc(&entry->refs);
+
        spin_lock_irq(&tree->lock);
-       node = tree_insert(&tree->tree, file_offset,
-                          &entry->rb_node);
+       node = tree_insert(&tree->tree, entry->file_offset, &entry->rb_node);
        if (node)
                btrfs_panic(fs_info, -EEXIST,
                                "inconsistency in ordered tree at offset %llu",
-                               file_offset);
+                               entry->file_offset);
        spin_unlock_irq(&tree->lock);
 
        spin_lock(&root->ordered_extent_lock);
@@ -248,43 +240,43 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
                spin_unlock(&fs_info->ordered_root_lock);
        }
        spin_unlock(&root->ordered_extent_lock);
-
-       /*
-        * We don't need the count_max_extents here, we can assume that all of
-        * that work has been done at higher layers, so this is truly the
-        * smallest the extent is going to get.
-        */
-       spin_lock(&inode->lock);
-       btrfs_mod_outstanding_extents(inode, 1);
-       spin_unlock(&inode->lock);
-
-       /* One ref for the returned entry to match semantics of lookup. */
-       refcount_inc(&entry->refs);
-
-       return entry;
 }
 
 /*
- * Add a new btrfs_ordered_extent for the range, but drop the reference instead
- * of returning it to the caller.
+ * Add an ordered extent to the per-inode tree.
+ *
+ * @inode:           Inode that this extent is for.
+ * @file_offset:     Logical offset in file where the extent starts.
+ * @num_bytes:       Logical length of extent in file.
+ * @ram_bytes:       Full length of unencoded data.
+ * @disk_bytenr:     Offset of extent on disk.
+ * @disk_num_bytes:  Size of extent on disk.
+ * @offset:          Offset into unencoded data where file data starts.
+ * @flags:           Flags specifying type of extent (1 << BTRFS_ORDERED_*).
+ * @compress_type:   Compression algorithm used for data.
+ *
+ * Most of these parameters correspond to &struct btrfs_file_extent_item. The
+ * tree is given a single reference on the ordered extent that was inserted, and
+ * the returned pointer is given a second reference.
+ *
+ * Return: the new ordered extent or error pointer.
  */
-int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-                            u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
-                            u64 disk_num_bytes, u64 offset, unsigned long flags,
-                            int compress_type)
+struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
+                       struct btrfs_inode *inode, u64 file_offset,
+                       u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+                       u64 disk_num_bytes, u64 offset, unsigned long flags,
+                       int compress_type)
 {
-       struct btrfs_ordered_extent *ordered;
-
-       ordered = btrfs_alloc_ordered_extent(inode, file_offset, num_bytes,
-                                            ram_bytes, disk_bytenr,
-                                            disk_num_bytes, offset, flags,
-                                            compress_type);
+       struct btrfs_ordered_extent *entry;
 
-       if (IS_ERR(ordered))
-               return PTR_ERR(ordered);
-       btrfs_put_ordered_extent(ordered);
+       ASSERT((flags & ~BTRFS_ORDERED_TYPE_FLAGS) == 0);
 
-       return 0;
+       entry = alloc_ordered_extent(inode, file_offset, num_bytes, ram_bytes,
+                                    disk_bytenr, disk_num_bytes, offset, flags,
+                                    compress_type);
+       if (!IS_ERR(entry))
+               insert_ordered_extent(entry);
+       return entry;
 }
 
 /*
@@ -311,6 +303,90 @@ static void finish_ordered_fn(struct btrfs_work *work)
        btrfs_finish_ordered_io(ordered_extent);
 }
 
+static bool can_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
+                                     struct page *page, u64 file_offset,
+                                     u64 len, bool uptodate)
+{
+       struct btrfs_inode *inode = BTRFS_I(ordered->inode);
+       struct btrfs_fs_info *fs_info = inode->root->fs_info;
+
+       lockdep_assert_held(&inode->ordered_tree.lock);
+
+       if (page) {
+               ASSERT(page->mapping);
+               ASSERT(page_offset(page) <= file_offset);
+               ASSERT(file_offset + len <= page_offset(page) + PAGE_SIZE);
+
+               /*
+                * The Ordered (Private2) bit indicates whether we still have
+                * pending unfinished IO for the ordered extent.
+                *
+                * If there's no such bit, we need to skip to the next range.
+                */
+               if (!btrfs_page_test_ordered(fs_info, page, file_offset, len))
+                       return false;
+               btrfs_page_clear_ordered(fs_info, page, file_offset, len);
+       }
+
+       /* Now we're fine to update the accounting. */
+       if (WARN_ON_ONCE(len > ordered->bytes_left)) {
+               btrfs_crit(fs_info,
+"bad ordered extent accounting, root=%llu ino=%llu OE offset=%llu OE len=%llu to_dec=%llu left=%llu",
+                          inode->root->root_key.objectid, btrfs_ino(inode),
+                          ordered->file_offset, ordered->num_bytes,
+                          len, ordered->bytes_left);
+               ordered->bytes_left = 0;
+       } else {
+               ordered->bytes_left -= len;
+       }
+
+       if (!uptodate)
+               set_bit(BTRFS_ORDERED_IOERR, &ordered->flags);
+
+       if (ordered->bytes_left)
+               return false;
+
+       /*
+        * All the IO of the ordered extent is finished, so the caller can now
+        * queue its completion work.
+        */
+       set_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags);
+       cond_wake_up(&ordered->wait);
+       refcount_inc(&ordered->refs);
+       trace_btrfs_ordered_extent_mark_finished(inode, ordered);
+       return true;
+}
+
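+/*
+ * Queue the completion of an ordered extent to the appropriate endio
+ * workqueue.
+ */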
+static void btrfs_queue_ordered_fn(struct btrfs_ordered_extent *ordered)
+{
+       struct btrfs_inode *inode = BTRFS_I(ordered->inode);
+       struct btrfs_fs_info *fs_info = inode->root->fs_info;
+       struct btrfs_workqueue *wq = btrfs_is_free_space_inode(inode) ?
+               fs_info->endio_freespace_worker : fs_info->endio_write_workers;
+
+       btrfs_init_work(&ordered->work, finish_ordered_fn, NULL, NULL);
+       btrfs_queue_work(wq, &ordered->work);
+}
+
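+/*
+ * Finish a range of an ordered extent and, if that completes the whole
+ * ordered extent, queue it for completion.
+ *
+ * Return true if the ordered extent was queued for completion.
+ */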
+bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
+                                struct page *page, u64 file_offset, u64 len,
+                                bool uptodate)
+{
+       struct btrfs_inode *inode = BTRFS_I(ordered->inode);
+       unsigned long flags;
+       bool ret;
+
+       trace_btrfs_finish_ordered_extent(inode, file_offset, len, uptodate);
+
+       spin_lock_irqsave(&inode->ordered_tree.lock, flags);
+       ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate);
+       spin_unlock_irqrestore(&inode->ordered_tree.lock, flags);
+
+       if (ret)
+               btrfs_queue_ordered_fn(ordered);
+       return ret;
+}
+
 /*
  * Mark all ordered extents io inside the specified range finished.
  *
@@ -329,22 +405,11 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
                                    u64 num_bytes, bool uptodate)
 {
        struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
-       struct btrfs_fs_info *fs_info = inode->root->fs_info;
-       struct btrfs_workqueue *wq;
        struct rb_node *node;
        struct btrfs_ordered_extent *entry = NULL;
        unsigned long flags;
        u64 cur = file_offset;
 
-       if (btrfs_is_free_space_inode(inode))
-               wq = fs_info->endio_freespace_worker;
-       else
-               wq = fs_info->endio_write_workers;
-
-       if (page)
-               ASSERT(page->mapping && page_offset(page) <= file_offset &&
-                      file_offset + num_bytes <= page_offset(page) + PAGE_SIZE);
-
        spin_lock_irqsave(&tree->lock, flags);
        while (cur < file_offset + num_bytes) {
                u64 entry_end;
@@ -397,50 +462,9 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
                ASSERT(end + 1 - cur < U32_MAX);
                len = end + 1 - cur;
 
-               if (page) {
-                       /*
-                        * Ordered (Private2) bit indicates whether we still
-                        * have pending io unfinished for the ordered extent.
-                        *
-                        * If there's no such bit, we need to skip to next range.
-                        */
-                       if (!btrfs_page_test_ordered(fs_info, page, cur, len)) {
-                               cur += len;
-                               continue;
-                       }
-                       btrfs_page_clear_ordered(fs_info, page, cur, len);
-               }
-
-               /* Now we're fine to update the accounting */
-               if (unlikely(len > entry->bytes_left)) {
-                       WARN_ON(1);
-                       btrfs_crit(fs_info,
-"bad ordered extent accounting, root=%llu ino=%llu OE offset=%llu OE len=%llu to_dec=%u left=%llu",
-                                  inode->root->root_key.objectid,
-                                  btrfs_ino(inode),
-                                  entry->file_offset,
-                                  entry->num_bytes,
-                                  len, entry->bytes_left);
-                       entry->bytes_left = 0;
-               } else {
-                       entry->bytes_left -= len;
-               }
-
-               if (!uptodate)
-                       set_bit(BTRFS_ORDERED_IOERR, &entry->flags);
-
-               /*
-                * All the IO of the ordered extent is finished, we need to queue
-                * the finish_func to be executed.
-                */
-               if (entry->bytes_left == 0) {
-                       set_bit(BTRFS_ORDERED_IO_DONE, &entry->flags);
-                       cond_wake_up(&entry->wait);
-                       refcount_inc(&entry->refs);
-                       trace_btrfs_ordered_extent_mark_finished(inode, entry);
+               if (can_finish_ordered_extent(entry, page, cur, len, uptodate)) {
                        spin_unlock_irqrestore(&tree->lock, flags);
-                       btrfs_init_work(&entry->work, finish_ordered_fn, NULL, NULL);
-                       btrfs_queue_work(wq, &entry->work);
+                       btrfs_queue_ordered_fn(entry);
                        spin_lock_irqsave(&tree->lock, flags);
                }
                cur += len;
@@ -564,7 +588,7 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
        freespace_inode = btrfs_is_free_space_inode(btrfs_inode);
 
        btrfs_lockdep_acquire(fs_info, btrfs_trans_pending_ordered);
-       /* This is paired with btrfs_add_ordered_extent. */
+       /* This is paired with btrfs_alloc_ordered_extent. */
        spin_lock(&btrfs_inode->lock);
        btrfs_mod_outstanding_extents(btrfs_inode, -1);
        spin_unlock(&btrfs_inode->lock);
@@ -1117,17 +1141,22 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 }
 
 /* Split out a new ordered extent for this first @len bytes of @ordered. */
-int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
+struct btrfs_ordered_extent *btrfs_split_ordered_extent(
+                       struct btrfs_ordered_extent *ordered, u64 len)
 {
-       struct inode *inode = ordered->inode;
-       struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
-       struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+       struct btrfs_inode *inode = BTRFS_I(ordered->inode);
+       struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
+       struct btrfs_root *root = inode->root;
+       struct btrfs_fs_info *fs_info = root->fs_info;
        u64 file_offset = ordered->file_offset;
        u64 disk_bytenr = ordered->disk_bytenr;
-       unsigned long flags = ordered->flags & BTRFS_ORDERED_TYPE_FLAGS;
+       unsigned long flags = ordered->flags;
+       struct btrfs_ordered_sum *sum, *tmpsum;
+       struct btrfs_ordered_extent *new;
        struct rb_node *node;
+       u64 offset = 0;
 
-       trace_btrfs_ordered_extent_split(BTRFS_I(inode), ordered);
+       trace_btrfs_ordered_extent_split(inode, ordered);
 
        ASSERT(!(flags & (1U << BTRFS_ORDERED_COMPRESSED)));
 
@@ -1136,18 +1165,27 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
         * reduce the original extent to a zero length either.
         */
        if (WARN_ON_ONCE(len >= ordered->num_bytes))
-               return -EINVAL;
-       /* We cannot split once ordered extent is past end_bio. */
-       if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes))
-               return -EINVAL;
+               return ERR_PTR(-EINVAL);
+       /* We cannot split partially completed ordered extents. */
+       if (ordered->bytes_left) {
+               ASSERT(!(flags & ~BTRFS_ORDERED_TYPE_FLAGS));
+               if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes))
+                       return ERR_PTR(-EINVAL);
+       }
        /* We cannot split a compressed ordered extent. */
        if (WARN_ON_ONCE(ordered->disk_num_bytes != ordered->num_bytes))
-               return -EINVAL;
-       /* Checksum list should be empty. */
-       if (WARN_ON_ONCE(!list_empty(&ordered->list)))
-               return -EINVAL;
+               return ERR_PTR(-EINVAL);
 
-       spin_lock_irq(&tree->lock);
+       new = alloc_ordered_extent(inode, file_offset, len, len, disk_bytenr,
+                                  len, 0, flags, ordered->compress_type);
+       if (IS_ERR(new))
+               return new;
+
+       /* One ref for the tree. */
+       refcount_inc(&new->refs);
+
+       spin_lock_irq(&root->ordered_extent_lock);
+       spin_lock(&tree->lock);
        /* Remove from tree once */
        node = &ordered->rb_node;
        rb_erase(node, &tree->tree);
@@ -1159,26 +1197,48 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len)
        ordered->disk_bytenr += len;
        ordered->num_bytes -= len;
        ordered->disk_num_bytes -= len;
-       ordered->bytes_left -= len;
+
+       if (test_bit(BTRFS_ORDERED_IO_DONE, &ordered->flags)) {
+               ASSERT(ordered->bytes_left == 0);
+               new->bytes_left = 0;
+       } else {
+               ordered->bytes_left -= len;
+       }
+
+       if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags)) {
+               if (ordered->truncated_len > len) {
+                       ordered->truncated_len -= len;
+               } else {
+                       new->truncated_len = ordered->truncated_len;
+                       ordered->truncated_len = 0;
+               }
+       }
+
+       list_for_each_entry_safe(sum, tmpsum, &ordered->list, list) {
+               if (offset == len)
+                       break;
+               list_move_tail(&sum->list, &new->list);
+               offset += sum->len;
+       }
 
        /* Re-insert the node */
        node = tree_insert(&tree->tree, ordered->file_offset, &ordered->rb_node);
        if (node)
                btrfs_panic(fs_info, -EEXIST,
                        "zoned: inconsistency in ordered tree at offset %llu",
-                           ordered->file_offset);
+                       ordered->file_offset);
 
-       spin_unlock_irq(&tree->lock);
-
-       /*
-        * The splitting extent is already counted and will be added again in
-        * btrfs_add_ordered_extent(). Subtract len to avoid double counting.
-        */
-       percpu_counter_add_batch(&fs_info->ordered_bytes, -len, fs_info->delalloc_batch);
+       node = tree_insert(&tree->tree, new->file_offset, &new->rb_node);
+       if (node)
+               btrfs_panic(fs_info, -EEXIST,
+                       "zoned: inconsistency in ordered tree at offset %llu",
+                       new->file_offset);
+       spin_unlock(&tree->lock);
 
-       return btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, len, len,
-                                       disk_bytenr, len, 0, flags,
-                                       ordered->compress_type);
+       list_add_tail(&new->root_extent_list, &root->ordered_extents);
+       root->nr_ordered_extents++;
+       spin_unlock_irq(&root->ordered_extent_lock);
+       return new;
 }
 
 int __init ordered_data_init(void)
index f0f1138..173bd5c 100644 (file)
@@ -14,13 +14,13 @@ struct btrfs_ordered_inode_tree {
 };
 
 struct btrfs_ordered_sum {
-       /* bytenr is the start of this extent on disk */
-       u64 bytenr;
-
        /*
-        * this is the length in bytes covered by the sums array below.
+        * Logical start address and length of the blocks covered by
+        * the sums array.
         */
-       int len;
+       u64 logical;
+       u32 len;
+
        struct list_head list;
        /* last field is a variable length array of csums */
        u8 sums[];
@@ -151,12 +151,6 @@ struct btrfs_ordered_extent {
        struct completion completion;
        struct btrfs_work flush_work;
        struct list_head work_list;
-
-       /*
-        * Used to reverse-map physical address returned from ZONE_APPEND write
-        * command in a workqueue context
-        */
-       u64 physical;
 };
 
 static inline void
@@ -167,11 +161,15 @@ btrfs_ordered_inode_tree_init(struct btrfs_ordered_inode_tree *t)
        t->last = NULL;
 }
 
+int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent);
 int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent);
 
 void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry);
 void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
                                struct btrfs_ordered_extent *entry);
+bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
+                                struct page *page, u64 file_offset, u64 len,
+                                bool uptodate);
 void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
                                struct page *page, u64 file_offset,
                                u64 num_bytes, bool uptodate);
@@ -183,10 +181,6 @@ struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
                        u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
                        u64 disk_num_bytes, u64 offset, unsigned long flags,
                        int compress_type);
-int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-                            u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
-                            u64 disk_num_bytes, u64 offset, unsigned long flags,
-                            int compress_type);
 void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
                           struct btrfs_ordered_sum *sum);
 struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
@@ -212,7 +206,8 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
                                        struct extent_state **cached_state);
 bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
                                  struct extent_state **cached_state);
-int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 len);
+struct btrfs_ordered_extent *btrfs_split_ordered_extent(
+                       struct btrfs_ordered_extent *ordered, u64 len);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
 
index 497b9db..aa06d9c 100644 (file)
@@ -49,7 +49,7 @@ const char *btrfs_root_name(const struct btrfs_key *key, char *buf)
        return buf;
 }
 
-static void print_chunk(struct extent_buffer *eb, struct btrfs_chunk *chunk)
+static void print_chunk(const struct extent_buffer *eb, struct btrfs_chunk *chunk)
 {
        int num_stripes = btrfs_chunk_num_stripes(eb, chunk);
        int i;
@@ -62,7 +62,7 @@ static void print_chunk(struct extent_buffer *eb, struct btrfs_chunk *chunk)
                      btrfs_stripe_offset_nr(eb, chunk, i));
        }
 }
-static void print_dev_item(struct extent_buffer *eb,
+static void print_dev_item(const struct extent_buffer *eb,
                           struct btrfs_dev_item *dev_item)
 {
        pr_info("\t\tdev item devid %llu total_bytes %llu bytes used %llu\n",
@@ -70,7 +70,7 @@ static void print_dev_item(struct extent_buffer *eb,
               btrfs_device_total_bytes(eb, dev_item),
               btrfs_device_bytes_used(eb, dev_item));
 }
-static void print_extent_data_ref(struct extent_buffer *eb,
+static void print_extent_data_ref(const struct extent_buffer *eb,
                                  struct btrfs_extent_data_ref *ref)
 {
        pr_cont("extent data backref root %llu objectid %llu offset %llu count %u\n",
@@ -80,7 +80,7 @@ static void print_extent_data_ref(struct extent_buffer *eb,
               btrfs_extent_data_ref_count(eb, ref));
 }
 
-static void print_extent_item(struct extent_buffer *eb, int slot, int type)
+static void print_extent_item(const struct extent_buffer *eb, int slot, int type)
 {
        struct btrfs_extent_item *ei;
        struct btrfs_extent_inline_ref *iref;
@@ -169,7 +169,7 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int type)
        WARN_ON(ptr > end);
 }
 
-static void print_uuid_item(struct extent_buffer *l, unsigned long offset,
+static void print_uuid_item(const struct extent_buffer *l, unsigned long offset,
                            u32 item_size)
 {
        if (!IS_ALIGNED(item_size, sizeof(u64))) {
@@ -191,7 +191,7 @@ static void print_uuid_item(struct extent_buffer *l, unsigned long offset,
  * Helper to output refs and locking status of extent buffer.  Useful to debug
  * race condition related problems.
  */
-static void print_eb_refs_lock(struct extent_buffer *eb)
+static void print_eb_refs_lock(const struct extent_buffer *eb)
 {
 #ifdef CONFIG_BTRFS_DEBUG
        btrfs_info(eb->fs_info, "refs %u lock_owner %u current %u",
@@ -199,7 +199,7 @@ static void print_eb_refs_lock(struct extent_buffer *eb)
 #endif
 }
 
-void btrfs_print_leaf(struct extent_buffer *l)
+void btrfs_print_leaf(const struct extent_buffer *l)
 {
        struct btrfs_fs_info *fs_info;
        int i;
@@ -355,7 +355,7 @@ void btrfs_print_leaf(struct extent_buffer *l)
        }
 }
 
-void btrfs_print_tree(struct extent_buffer *c, bool follow)
+void btrfs_print_tree(const struct extent_buffer *c, bool follow)
 {
        struct btrfs_fs_info *fs_info;
        int i; u32 nr;
index 8c3e931..c42bc66 100644 (file)
@@ -9,8 +9,8 @@
 /* Buffer size to contain tree name and possibly additional data (offset) */
 #define BTRFS_ROOT_NAME_BUF_LEN                                48
 
-void btrfs_print_leaf(struct extent_buffer *l);
-void btrfs_print_tree(struct extent_buffer *c, bool follow);
+void btrfs_print_leaf(const struct extent_buffer *l);
+void btrfs_print_tree(const struct extent_buffer *c, bool follow);
 const char *btrfs_root_name(const struct btrfs_key *key, char *buf);
 
 #endif
index f41da7a..da1f84a 100644 (file)
@@ -1232,12 +1232,23 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
        int ret = 0;
 
        /*
-        * We need to have subvol_sem write locked, to prevent races between
-        * concurrent tasks trying to disable quotas, because we will unlock
-        * and relock qgroup_ioctl_lock across BTRFS_FS_QUOTA_ENABLED changes.
+        * We need to have subvol_sem write locked to prevent races with
+        * snapshot creation.
         */
        lockdep_assert_held_write(&fs_info->subvol_sem);
 
+       /*
+        * Lock the cleaner mutex to prevent races with concurrent relocation,
+        * because relocation may be building backrefs for blocks of the quota
+        * root while we are deleting the root. This is like dropping fs roots
+        * of deleted snapshots/subvolumes, we need the same protection.
+        *
+        * This also prevents races between concurrent tasks trying to disable
+        * quotas, because we will unlock and relock qgroup_ioctl_lock across
+        * BTRFS_FS_QUOTA_ENABLED changes.
+        */
+       mutex_lock(&fs_info->cleaner_mutex);
+
        mutex_lock(&fs_info->qgroup_ioctl_lock);
        if (!fs_info->quota_root)
                goto out;
@@ -1301,7 +1312,9 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
                goto out;
        }
 
+       spin_lock(&fs_info->trans_lock);
        list_del(&quota_root->dirty_list);
+       spin_unlock(&fs_info->trans_lock);
 
        btrfs_tree_lock(quota_root->node);
        btrfs_clear_buffer_dirty(trans, quota_root->node);
@@ -1317,6 +1330,7 @@ out:
                btrfs_end_transaction(trans);
        else if (trans)
                ret = btrfs_end_transaction(trans);
+       mutex_unlock(&fs_info->cleaner_mutex);
 
        return ret;
 }
index 2fab37f..f37b925 100644 (file)
@@ -1079,7 +1079,7 @@ static int rbio_add_io_sector(struct btrfs_raid_bio *rbio,
 
        /* see if we can add this page onto our existing bio */
        if (last) {
-               u64 last_end = last->bi_iter.bi_sector << 9;
+               u64 last_end = last->bi_iter.bi_sector << SECTOR_SHIFT;
                last_end += last->bi_iter.bi_size;
 
                /*
@@ -1099,7 +1099,7 @@ static int rbio_add_io_sector(struct btrfs_raid_bio *rbio,
        bio = bio_alloc(stripe->dev->bdev,
                        max(BTRFS_STRIPE_LEN >> PAGE_SHIFT, 1),
                        op, GFP_NOFS);
-       bio->bi_iter.bi_sector = disk_start >> 9;
+       bio->bi_iter.bi_sector = disk_start >> SECTOR_SHIFT;
        bio->bi_private = rbio;
 
        __bio_add_page(bio, sector->page, sectorsize, sector->pgoff);
@@ -2747,3 +2747,48 @@ void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio)
        if (!lock_stripe_add(rbio))
                start_async_work(rbio, scrub_rbio_work_locked);
 }
+
+/*
+ * This is for scrub call sites where we already have correct data contents.
+ * This allows us to avoid reading data stripes again.
+ *
+ * Unfortunately here we have to copy the pages rather than reusing them,
+ * because the rbio has its own page management for its cache.
+ */
+void raid56_parity_cache_data_pages(struct btrfs_raid_bio *rbio,
+                                   struct page **data_pages, u64 data_logical)
+{
+       const u64 offset_in_full_stripe = data_logical -
+                                         rbio->bioc->full_stripe_logical;
+       const int page_index = offset_in_full_stripe >> PAGE_SHIFT;
+       const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
+       const u32 sectors_per_page = PAGE_SIZE / sectorsize;
+       int ret;
+
+       /*
+        * If we hit ENOMEM here temporarily, but the allocation succeeds later
+        * at raid56_parity_submit_scrub_rbio() time, we just do the extra read,
+        * which is not a big deal.
+        *
+        * If we hit ENOMEM later at raid56_parity_submit_scrub_rbio() time,
+        * the bio will get a proper error number set.
+        */
+       ret = alloc_rbio_data_pages(rbio);
+       if (ret < 0)
+               return;
+
+       /* data_logical must be at stripe boundary and inside the full stripe. */
+       ASSERT(IS_ALIGNED(offset_in_full_stripe, BTRFS_STRIPE_LEN));
+       ASSERT(offset_in_full_stripe < (rbio->nr_data << BTRFS_STRIPE_LEN_SHIFT));
+
+       for (int page_nr = 0; page_nr < (BTRFS_STRIPE_LEN >> PAGE_SHIFT); page_nr++) {
+               struct page *dst = rbio->stripe_pages[page_nr + page_index];
+               struct page *src = data_pages[page_nr];
+
+               memcpy_page(dst, 0, src, 0, PAGE_SIZE);
+               for (int sector_nr = sectors_per_page * page_index;
+                    sector_nr < sectors_per_page * (page_index + 1);
+                    sector_nr++)
+                       rbio->stripe_sectors[sector_nr].uptodate = true;
+       }
+}
index 0f7f31c..0e84c9c 100644 (file)
@@ -193,6 +193,9 @@ struct btrfs_raid_bio *raid56_parity_alloc_scrub_rbio(struct bio *bio,
                                unsigned long *dbitmap, int stripe_nsectors);
 void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio);
 
+void raid56_parity_cache_data_pages(struct btrfs_raid_bio *rbio,
+                                   struct page **data_pages, u64 data_logical);
+
 int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info);
 void btrfs_free_stripe_hash_table(struct btrfs_fs_info *info);
 
index 59a0649..25a3361 100644 (file)
@@ -174,8 +174,8 @@ static void mark_block_processed(struct reloc_control *rc,
            in_range(node->bytenr, rc->block_group->start,
                     rc->block_group->length)) {
                blocksize = rc->extent_root->fs_info->nodesize;
-               set_extent_bits(&rc->processed_blocks, node->bytenr,
-                               node->bytenr + blocksize - 1, EXTENT_DIRTY);
+               set_extent_bit(&rc->processed_blocks, node->bytenr,
+                              node->bytenr + blocksize - 1, EXTENT_DIRTY, NULL);
        }
        node->processed = 1;
 }
@@ -3051,9 +3051,9 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
                        u64 boundary_end = boundary_start +
                                           fs_info->sectorsize - 1;
 
-                       set_extent_bits(&BTRFS_I(inode)->io_tree,
-                                       boundary_start, boundary_end,
-                                       EXTENT_BOUNDARY);
+                       set_extent_bit(&BTRFS_I(inode)->io_tree,
+                                      boundary_start, boundary_end,
+                                      EXTENT_BOUNDARY, NULL);
                }
                unlock_extent(&BTRFS_I(inode)->io_tree, clamped_start, clamped_end,
                              &cached_state);
@@ -4342,29 +4342,25 @@ out:
  * cloning checksum properly handles the nodatasum extents.
  * it also saves CPU time to re-calculate the checksum.
  */
-int btrfs_reloc_clone_csums(struct btrfs_inode *inode, u64 file_pos, u64 len)
+int btrfs_reloc_clone_csums(struct btrfs_ordered_extent *ordered)
 {
+       struct btrfs_inode *inode = BTRFS_I(ordered->inode);
        struct btrfs_fs_info *fs_info = inode->root->fs_info;
-       struct btrfs_root *csum_root;
-       struct btrfs_ordered_sum *sums;
-       struct btrfs_ordered_extent *ordered;
-       int ret;
-       u64 disk_bytenr;
-       u64 new_bytenr;
+       u64 disk_bytenr = ordered->file_offset + inode->index_cnt;
+       struct btrfs_root *csum_root = btrfs_csum_root(fs_info, disk_bytenr);
        LIST_HEAD(list);
+       int ret;
 
-       ordered = btrfs_lookup_ordered_extent(inode, file_pos);
-       BUG_ON(ordered->file_offset != file_pos || ordered->num_bytes != len);
-
-       disk_bytenr = file_pos + inode->index_cnt;
-       csum_root = btrfs_csum_root(fs_info, disk_bytenr);
        ret = btrfs_lookup_csums_list(csum_root, disk_bytenr,
-                                     disk_bytenr + len - 1, &list, 0, false);
+                                     disk_bytenr + ordered->num_bytes - 1,
+                                     &list, 0, false);
        if (ret)
-               goto out;
+               return ret;
 
        while (!list_empty(&list)) {
-               sums = list_entry(list.next, struct btrfs_ordered_sum, list);
+               struct btrfs_ordered_sum *sums =
+                       list_entry(list.next, struct btrfs_ordered_sum, list);
+
                list_del_init(&sums->list);
 
                /*
@@ -4379,14 +4375,11 @@ int btrfs_reloc_clone_csums(struct btrfs_inode *inode, u64 file_pos, u64 len)
                 * disk_len vs real len like with real inodes since it's all
                 * disk length.
                 */
-               new_bytenr = ordered->disk_bytenr + sums->bytenr - disk_bytenr;
-               sums->bytenr = new_bytenr;
-
+               sums->logical = ordered->disk_bytenr + sums->logical - disk_bytenr;
                btrfs_add_ordered_sum(ordered, sums);
        }
-out:
-       btrfs_put_ordered_extent(ordered);
-       return ret;
+
+       return 0;
 }
 
 int btrfs_reloc_cow_block(struct btrfs_trans_handle *trans,
@@ -4523,3 +4516,19 @@ int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
                ret = clone_backref_node(trans, rc, root, reloc_root);
        return ret;
 }
+
+/*
+ * Get the current bytenr for the block group which is being relocated.
+ *
+ * Return U64_MAX if no running relocation.
+ */
+u64 btrfs_get_reloc_bg_bytenr(struct btrfs_fs_info *fs_info)
+{
+       u64 logical = U64_MAX;
+
+       lockdep_assert_held(&fs_info->reloc_mutex);
+
+       if (fs_info->reloc_ctl && fs_info->reloc_ctl->block_group)
+               logical = fs_info->reloc_ctl->block_group->start;
+       return logical;
+}
index 2041a86..77d69f6 100644 (file)
@@ -8,7 +8,7 @@ int btrfs_init_reloc_root(struct btrfs_trans_handle *trans, struct btrfs_root *r
 int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
                            struct btrfs_root *root);
 int btrfs_recover_relocation(struct btrfs_fs_info *fs_info);
-int btrfs_reloc_clone_csums(struct btrfs_inode *inode, u64 file_pos, u64 len);
+int btrfs_reloc_clone_csums(struct btrfs_ordered_extent *ordered);
 int btrfs_reloc_cow_block(struct btrfs_trans_handle *trans,
                          struct btrfs_root *root, struct extent_buffer *buf,
                          struct extent_buffer *cow);
@@ -19,5 +19,6 @@ int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
 int btrfs_should_cancel_balance(struct btrfs_fs_info *fs_info);
 struct btrfs_root *find_reloc_root(struct btrfs_fs_info *fs_info, u64 bytenr);
 int btrfs_should_ignore_reloc_root(struct btrfs_root *root);
+u64 btrfs_get_reloc_bg_bytenr(struct btrfs_fs_info *fs_info);
 
 #endif
index bceaa8c..4cae41b 100644 (file)
@@ -177,7 +177,6 @@ struct scrub_ctx {
        struct btrfs_fs_info    *fs_info;
        int                     first_free;
        int                     cur_stripe;
-       struct list_head        csum_list;
        atomic_t                cancel_req;
        int                     readonly;
        int                     sectors_per_bio;
@@ -309,17 +308,6 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
        scrub_pause_off(fs_info);
 }
 
-static void scrub_free_csums(struct scrub_ctx *sctx)
-{
-       while (!list_empty(&sctx->csum_list)) {
-               struct btrfs_ordered_sum *sum;
-               sum = list_first_entry(&sctx->csum_list,
-                                      struct btrfs_ordered_sum, list);
-               list_del(&sum->list);
-               kfree(sum);
-       }
-}
-
 static noinline_for_stack void scrub_free_ctx(struct scrub_ctx *sctx)
 {
        int i;
@@ -330,7 +318,6 @@ static noinline_for_stack void scrub_free_ctx(struct scrub_ctx *sctx)
        for (i = 0; i < SCRUB_STRIPES_PER_SCTX; i++)
                release_scrub_stripe(&sctx->stripes[i]);
 
-       scrub_free_csums(sctx);
        kfree(sctx);
 }
 
@@ -352,7 +339,6 @@ static noinline_for_stack struct scrub_ctx *scrub_setup_ctx(
        refcount_set(&sctx->refs, 1);
        sctx->is_dev_replace = is_dev_replace;
        sctx->fs_info = fs_info;
-       INIT_LIST_HEAD(&sctx->csum_list);
        for (i = 0; i < SCRUB_STRIPES_PER_SCTX; i++) {
                int ret;
 
@@ -479,11 +465,8 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
        struct extent_buffer *eb;
        struct btrfs_extent_item *ei;
        struct scrub_warning swarn;
-       unsigned long ptr = 0;
        u64 flags = 0;
-       u64 ref_root;
        u32 item_size;
-       u8 ref_level = 0;
        int ret;
 
        /* Super block error, no need to search extent tree. */
@@ -513,19 +496,28 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
        item_size = btrfs_item_size(eb, path->slots[0]);
 
        if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
-               do {
+               unsigned long ptr = 0;
+               u8 ref_level;
+               u64 ref_root;
+
+               while (true) {
                        ret = tree_backref_for_extent(&ptr, eb, &found_key, ei,
                                                      item_size, &ref_root,
                                                      &ref_level);
+                       if (ret < 0) {
+                               btrfs_warn(fs_info,
+                               "failed to resolve tree backref for logical %llu: %d",
+                                                 swarn.logical, ret);
+                               break;
+                       }
+                       if (ret > 0)
+                               break;
                        btrfs_warn_in_rcu(fs_info,
 "%s at logical %llu on dev %s, physical %llu: metadata %s (level %d) in tree %llu",
-                               errstr, swarn.logical,
-                               btrfs_dev_name(dev),
-                               swarn.physical,
-                               ref_level ? "node" : "leaf",
-                               ret < 0 ? -1 : ref_level,
-                               ret < 0 ? -1 : ref_root);
-               } while (ret != 1);
+                               errstr, swarn.logical, btrfs_dev_name(dev),
+                               swarn.physical, (ref_level ? "node" : "leaf"),
+                               ref_level, ref_root);
+               }
                btrfs_release_path(path);
        } else {
                struct btrfs_backref_walk_ctx ctx = { 0 };
@@ -546,48 +538,6 @@ out:
        btrfs_free_path(path);
 }
 
-static inline int scrub_nr_raid_mirrors(struct btrfs_io_context *bioc)
-{
-       if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID5)
-               return 2;
-       else if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID6)
-               return 3;
-       else
-               return (int)bioc->num_stripes;
-}
-
-static inline void scrub_stripe_index_and_offset(u64 logical, u64 map_type,
-                                                u64 full_stripe_logical,
-                                                int nstripes, int mirror,
-                                                int *stripe_index,
-                                                u64 *stripe_offset)
-{
-       int i;
-
-       if (map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
-               const int nr_data_stripes = (map_type & BTRFS_BLOCK_GROUP_RAID5) ?
-                                           nstripes - 1 : nstripes - 2;
-
-               /* RAID5/6 */
-               for (i = 0; i < nr_data_stripes; i++) {
-                       const u64 data_stripe_start = full_stripe_logical +
-                                               (i * BTRFS_STRIPE_LEN);
-
-                       if (logical >= data_stripe_start &&
-                           logical < data_stripe_start + BTRFS_STRIPE_LEN)
-                               break;
-               }
-
-               *stripe_index = i;
-               *stripe_offset = (logical - full_stripe_logical) &
-                                BTRFS_STRIPE_LEN_MASK;
-       } else {
-               /* The other RAID type */
-               *stripe_index = mirror;
-               *stripe_offset = 0;
-       }
-}
-
 static int fill_writer_pointer_gap(struct scrub_ctx *sctx, u64 physical)
 {
        int ret = 0;
@@ -924,8 +874,9 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 
                /* For scrub, our mirror_num should always start at 1. */
                ASSERT(stripe->mirror_num >= 1);
-               ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
-                                      stripe->logical, &mapped_len, &bioc);
+               ret = btrfs_map_block(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
+                                     stripe->logical, &mapped_len, &bioc,
+                                     NULL, NULL, 1);
                /*
                 * If we failed, dev will be NULL, and later detailed reports
                 * will just be skipped.
@@ -1304,7 +1255,7 @@ static int get_raid56_logic_offset(u64 physical, int num,
                u32 stripe_index;
                u32 rot;
 
-               *offset = last_offset + (i << BTRFS_STRIPE_LEN_SHIFT);
+               *offset = last_offset + btrfs_stripe_nr_to_offset(i);
 
                stripe_nr = (u32)(*offset >> BTRFS_STRIPE_LEN_SHIFT) / data_stripes;
 
@@ -1319,7 +1270,7 @@ static int get_raid56_logic_offset(u64 physical, int num,
                if (stripe_index < num)
                        j++;
        }
-       *offset = last_offset + (j << BTRFS_STRIPE_LEN_SHIFT);
+       *offset = last_offset + btrfs_stripe_nr_to_offset(j);
        return 1;
 }
 
@@ -1715,7 +1666,7 @@ static int flush_scrub_stripes(struct scrub_ctx *sctx)
        ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &sctx->stripes[0].state));
 
        scrub_throttle_dev_io(sctx, sctx->stripes[0].dev,
-                             nr_stripes << BTRFS_STRIPE_LEN_SHIFT);
+                             btrfs_stripe_nr_to_offset(nr_stripes));
        for (int i = 0; i < nr_stripes; i++) {
                stripe = &sctx->stripes[i];
                scrub_submit_initial_read(sctx, stripe);
@@ -1838,7 +1789,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
        bool all_empty = true;
        const int data_stripes = nr_data_stripes(map);
        unsigned long extent_bitmap = 0;
-       u64 length = data_stripes << BTRFS_STRIPE_LEN_SHIFT;
+       u64 length = btrfs_stripe_nr_to_offset(data_stripes);
        int ret;
 
        ASSERT(sctx->raid56_data_stripes);
@@ -1853,13 +1804,13 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
                              data_stripes) >> BTRFS_STRIPE_LEN_SHIFT;
                stripe_index = (i + rot) % map->num_stripes;
                physical = map->stripes[stripe_index].physical +
-                          (rot << BTRFS_STRIPE_LEN_SHIFT);
+                          btrfs_stripe_nr_to_offset(rot);
 
                scrub_reset_stripe(stripe);
                set_bit(SCRUB_STRIPE_FLAG_NO_REPORT, &stripe->state);
                ret = scrub_find_fill_first_stripe(bg,
                                map->stripes[stripe_index].dev, physical, 1,
-                               full_stripe_start + (i << BTRFS_STRIPE_LEN_SHIFT),
+                               full_stripe_start + btrfs_stripe_nr_to_offset(i),
                                BTRFS_STRIPE_LEN, stripe);
                if (ret < 0)
                        goto out;
@@ -1869,7 +1820,7 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
                 */
                if (ret > 0) {
                        stripe->logical = full_stripe_start +
-                                         (i << BTRFS_STRIPE_LEN_SHIFT);
+                                         btrfs_stripe_nr_to_offset(i);
                        stripe->dev = map->stripes[stripe_index].dev;
                        stripe->mirror_num = 1;
                        set_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state);
@@ -1957,8 +1908,8 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
        bio->bi_end_io = raid56_scrub_wait_endio;
 
        btrfs_bio_counter_inc_blocked(fs_info);
-       ret = btrfs_map_sblock(fs_info, BTRFS_MAP_WRITE, full_stripe_start,
-                              &length, &bioc);
+       ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, full_stripe_start,
+                             &length, &bioc, NULL, NULL, 1);
        if (ret < 0) {
                btrfs_put_bioc(bioc);
                btrfs_bio_counter_dec(fs_info);
@@ -1972,6 +1923,13 @@ static int scrub_raid56_parity_stripe(struct scrub_ctx *sctx,
                btrfs_bio_counter_dec(fs_info);
                goto out;
        }
+       /* Use the recovered stripes as cache to avoid reading them from disk again. */
+       for (int i = 0; i < data_stripes; i++) {
+               stripe = &sctx->raid56_data_stripes[i];
+
+               raid56_parity_cache_data_pages(rbio, stripe->pages,
+                               full_stripe_start + (i << BTRFS_STRIPE_LEN_SHIFT));
+       }
        raid56_parity_submit_scrub_rbio(rbio);
        wait_for_completion_io(&io_done);
        ret = blk_status_to_errno(bio->bi_status);
@@ -2062,7 +2020,7 @@ static u64 simple_stripe_full_stripe_len(const struct map_lookup *map)
        ASSERT(map->type & (BTRFS_BLOCK_GROUP_RAID0 |
                            BTRFS_BLOCK_GROUP_RAID10));
 
-       return (map->num_stripes / map->sub_stripes) << BTRFS_STRIPE_LEN_SHIFT;
+       return btrfs_stripe_nr_to_offset(map->num_stripes / map->sub_stripes);
 }
 
 /* Get the logical bytenr for the stripe */
@@ -2078,7 +2036,7 @@ static u64 simple_stripe_get_logical(struct map_lookup *map,
         * (stripe_index / sub_stripes) gives how many data stripes we need to
         * skip.
         */
-       return ((stripe_index / map->sub_stripes) << BTRFS_STRIPE_LEN_SHIFT) +
+       return btrfs_stripe_nr_to_offset(stripe_index / map->sub_stripes) +
               bg->start;
 }
 
@@ -2204,7 +2162,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
        }
        if (profile & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10)) {
                ret = scrub_simple_stripe(sctx, bg, map, scrub_dev, stripe_index);
-               offset = (stripe_index / map->sub_stripes) << BTRFS_STRIPE_LEN_SHIFT;
+               offset = btrfs_stripe_nr_to_offset(stripe_index / map->sub_stripes);
                goto out;
        }
 
@@ -2219,7 +2177,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 
        /* Initialize @offset in case we need to jump to the out: label */
        get_raid56_logic_offset(physical, stripe_index, map, &offset, NULL);
-       increment = nr_data_stripes(map) << BTRFS_STRIPE_LEN_SHIFT;
+       increment = btrfs_stripe_nr_to_offset(nr_data_stripes(map));
 
        /*
         * Due to the rotation, for RAID56 it's better to iterate each stripe
@@ -2740,17 +2698,12 @@ static void scrub_workers_put(struct btrfs_fs_info *fs_info)
        if (refcount_dec_and_mutex_lock(&fs_info->scrub_workers_refcnt,
                                        &fs_info->scrub_lock)) {
                struct workqueue_struct *scrub_workers = fs_info->scrub_workers;
-               struct workqueue_struct *scrub_wr_comp =
-                                               fs_info->scrub_wr_completion_workers;
 
                fs_info->scrub_workers = NULL;
-               fs_info->scrub_wr_completion_workers = NULL;
                mutex_unlock(&fs_info->scrub_lock);
 
                if (scrub_workers)
                        destroy_workqueue(scrub_workers);
-               if (scrub_wr_comp)
-                       destroy_workqueue(scrub_wr_comp);
        }
 }
 
@@ -2761,7 +2714,6 @@ static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
                                                int is_dev_replace)
 {
        struct workqueue_struct *scrub_workers = NULL;
-       struct workqueue_struct *scrub_wr_comp = NULL;
        unsigned int flags = WQ_FREEZABLE | WQ_UNBOUND;
        int max_active = fs_info->thread_pool_size;
        int ret = -ENOMEM;
@@ -2769,21 +2721,17 @@ static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
        if (refcount_inc_not_zero(&fs_info->scrub_workers_refcnt))
                return 0;
 
-       scrub_workers = alloc_workqueue("btrfs-scrub", flags,
-                                       is_dev_replace ? 1 : max_active);
+       if (is_dev_replace)
+               scrub_workers = alloc_ordered_workqueue("btrfs-scrub", flags);
+       else
+               scrub_workers = alloc_workqueue("btrfs-scrub", flags, max_active);
        if (!scrub_workers)
-               goto fail_scrub_workers;
-
-       scrub_wr_comp = alloc_workqueue("btrfs-scrubwrc", flags, max_active);
-       if (!scrub_wr_comp)
-               goto fail_scrub_wr_completion_workers;
+               return -ENOMEM;
 
        mutex_lock(&fs_info->scrub_lock);
        if (refcount_read(&fs_info->scrub_workers_refcnt) == 0) {
-               ASSERT(fs_info->scrub_workers == NULL &&
-                      fs_info->scrub_wr_completion_workers == NULL);
+               ASSERT(fs_info->scrub_workers == NULL);
                fs_info->scrub_workers = scrub_workers;
-               fs_info->scrub_wr_completion_workers = scrub_wr_comp;
                refcount_set(&fs_info->scrub_workers_refcnt, 1);
                mutex_unlock(&fs_info->scrub_lock);
                return 0;
@@ -2794,10 +2742,7 @@ static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
 
        ret = 0;
 
-       destroy_workqueue(scrub_wr_comp);
-fail_scrub_wr_completion_workers:
        destroy_workqueue(scrub_workers);
-fail_scrub_workers:
        return ret;
 }
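
The scrub conversions above replace every open-coded "nr << BTRFS_STRIPE_LEN_SHIFT" with btrfs_stripe_nr_to_offset(). A minimal sketch of such a helper (assuming BTRFS_STRIPE_LEN_SHIFT remains the shift for the fixed 64K stripe length) shows the point of the change: the stripe number is widened to u64 before the shift, so a 32-bit count can no longer wrap before the promotion.

/* Sketch only: convert a stripe number to a byte offset without risking
 * a 32-bit overflow on large stripe numbers. */
static inline u64 btrfs_stripe_nr_to_offset(u32 nr)
{
        return (u64)nr << BTRFS_STRIPE_LEN_SHIFT;
}

The tree-checker hunk further down uses the same helper for its chunk-length limit (btrfs_stripe_nr_to_offset(U32_MAX)), so both sides stay consistent.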
 
index af2e153..8bfd447 100644 (file)
@@ -1774,9 +1774,21 @@ static int read_symlink(struct btrfs_root *root,
        ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
                        struct btrfs_file_extent_item);
        type = btrfs_file_extent_type(path->nodes[0], ei);
+       if (unlikely(type != BTRFS_FILE_EXTENT_INLINE)) {
+               ret = -EUCLEAN;
+               btrfs_crit(root->fs_info,
+"send: found symlink extent that is not inline, ino %llu root %llu extent type %d",
+                          ino, btrfs_root_id(root), type);
+               goto out;
+       }
        compression = btrfs_file_extent_compression(path->nodes[0], ei);
-       BUG_ON(type != BTRFS_FILE_EXTENT_INLINE);
-       BUG_ON(compression);
+       if (unlikely(compression != BTRFS_COMPRESS_NONE)) {
+               ret = -EUCLEAN;
+               btrfs_crit(root->fs_info,
+"send: found symlink extent with compression, ino %llu root %llu compression type %d",
+                          ino, btrfs_root_id(root), compression);
+               goto out;
+       }
 
        off = btrfs_file_extent_inline_start(ei);
        len = btrfs_file_extent_ram_bytes(path->nodes[0], ei);
index dd46b97..1b999c6 100644 (file)
@@ -100,9 +100,6 @@ void btrfs_init_subpage_info(struct btrfs_subpage_info *subpage_info, u32 sector
        subpage_info->uptodate_offset = cur;
        cur += nr_bits;
 
-       subpage_info->error_offset = cur;
-       cur += nr_bits;
-
        subpage_info->dirty_offset = cur;
        cur += nr_bits;
 
@@ -367,28 +364,6 @@ void btrfs_page_end_writer_lock(const struct btrfs_fs_info *fs_info,
                unlock_page(page);
 }
 
-static bool bitmap_test_range_all_set(unsigned long *addr, unsigned int start,
-                                     unsigned int nbits)
-{
-       unsigned int found_zero;
-
-       found_zero = find_next_zero_bit(addr, start + nbits, start);
-       if (found_zero == start + nbits)
-               return true;
-       return false;
-}
-
-static bool bitmap_test_range_all_zero(unsigned long *addr, unsigned int start,
-                                      unsigned int nbits)
-{
-       unsigned int found_set;
-
-       found_set = find_next_bit(addr, start + nbits, start);
-       if (found_set == start + nbits)
-               return true;
-       return false;
-}
-
 #define subpage_calc_start_bit(fs_info, page, name, start, len)                \
 ({                                                                     \
        unsigned int start_bit;                                         \
@@ -438,35 +413,6 @@ void btrfs_subpage_clear_uptodate(const struct btrfs_fs_info *fs_info,
        spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
-void btrfs_subpage_set_error(const struct btrfs_fs_info *fs_info,
-               struct page *page, u64 start, u32 len)
-{
-       struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
-       unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
-                                                       error, start, len);
-       unsigned long flags;
-
-       spin_lock_irqsave(&subpage->lock, flags);
-       bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
-       SetPageError(page);
-       spin_unlock_irqrestore(&subpage->lock, flags);
-}
-
-void btrfs_subpage_clear_error(const struct btrfs_fs_info *fs_info,
-               struct page *page, u64 start, u32 len)
-{
-       struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
-       unsigned int start_bit = subpage_calc_start_bit(fs_info, page,
-                                                       error, start, len);
-       unsigned long flags;
-
-       spin_lock_irqsave(&subpage->lock, flags);
-       bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
-       if (subpage_test_bitmap_all_zero(fs_info, subpage, error))
-               ClearPageError(page);
-       spin_unlock_irqrestore(&subpage->lock, flags);
-}
-
 void btrfs_subpage_set_dirty(const struct btrfs_fs_info *fs_info,
                struct page *page, u64 start, u32 len)
 {
@@ -628,7 +574,6 @@ bool btrfs_subpage_test_##name(const struct btrfs_fs_info *fs_info, \
        return ret;                                                     \
 }
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate);
-IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(error);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(dirty);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(writeback);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(ordered);
@@ -696,7 +641,6 @@ bool btrfs_page_clamp_test_##name(const struct btrfs_fs_info *fs_info,      \
 }
 IMPLEMENT_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
                         PageUptodate);
-IMPLEMENT_BTRFS_PAGE_OPS(error, SetPageError, ClearPageError, PageError);
 IMPLEMENT_BTRFS_PAGE_OPS(dirty, set_page_dirty, clear_page_dirty_for_io,
                         PageDirty);
 IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
@@ -767,3 +711,44 @@ void btrfs_page_unlock_writer(struct btrfs_fs_info *fs_info, struct page *page,
        /* Have writers, use proper subpage helper to end it */
        btrfs_page_end_writer_lock(fs_info, page, start, len);
 }
+
+#define GET_SUBPAGE_BITMAP(subpage, subpage_info, name, dst)           \
+       bitmap_cut(dst, subpage->bitmaps, 0,                            \
+                  subpage_info->name##_offset, subpage_info->bitmap_nr_bits)
+
+void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
+                                     struct page *page, u64 start, u32 len)
+{
+       struct btrfs_subpage_info *subpage_info = fs_info->subpage_info;
+       struct btrfs_subpage *subpage;
+       unsigned long uptodate_bitmap;
+       unsigned long error_bitmap;
+       unsigned long dirty_bitmap;
+       unsigned long writeback_bitmap;
+       unsigned long ordered_bitmap;
+       unsigned long checked_bitmap;
+       unsigned long flags;
+
+       ASSERT(PagePrivate(page) && page->private);
+       ASSERT(subpage_info);
+       subpage = (struct btrfs_subpage *)page->private;
+
+       spin_lock_irqsave(&subpage->lock, flags);
+       GET_SUBPAGE_BITMAP(subpage, subpage_info, uptodate, &uptodate_bitmap);
+       GET_SUBPAGE_BITMAP(subpage, subpage_info, dirty, &dirty_bitmap);
+       GET_SUBPAGE_BITMAP(subpage, subpage_info, writeback, &writeback_bitmap);
+       GET_SUBPAGE_BITMAP(subpage, subpage_info, ordered, &ordered_bitmap);
+       GET_SUBPAGE_BITMAP(subpage, subpage_info, checked, &checked_bitmap);
+       spin_unlock_irqrestore(&subpage->lock, flags);
+
+       dump_page(page, "btrfs subpage dump");
+       btrfs_warn(fs_info,
+"start=%llu len=%u page=%llu, bitmaps uptodate=%*pbl error=%*pbl dirty=%*pbl writeback=%*pbl ordered=%*pbl checked=%*pbl",
+                   start, len, page_offset(page),
+                   subpage_info->bitmap_nr_bits, &uptodate_bitmap,
+                   subpage_info->bitmap_nr_bits, &error_bitmap,
+                   subpage_info->bitmap_nr_bits, &dirty_bitmap,
+                   subpage_info->bitmap_nr_bits, &writeback_bitmap,
+                   subpage_info->bitmap_nr_bits, &ordered_bitmap,
+                   subpage_info->bitmap_nr_bits, &checked_bitmap);
+}
index 0e80ad3..5cbf67c 100644 (file)
@@ -8,17 +8,17 @@
 /*
  * Extra info for subpage bitmap.
  *
- * For subpage we pack all uptodate/error/dirty/writeback/ordered bitmaps into
+ * For subpage we pack all uptodate/dirty/writeback/ordered bitmaps into
  * one larger bitmap.
  *
  * This structure records how they are organized in the bitmap:
  *
- * /- uptodate_offset  /- error_offset /- dirty_offset
+ * /- uptodate_offset  /- dirty_offset /- ordered_offset
  * |                   |               |
  * v                   v               v
- * |u|u|u|u|........|u|u|e|e|.......|e|e| ...  |o|o|
+ * |u|u|u|u|........|u|u|d|d|.......|d|d|o|o|.......|o|o|
  * |<- bitmap_nr_bits ->|
- * |<--------------- total_nr_bits ---------------->|
+ * |<----------------- total_nr_bits ------------------>|
  */
 struct btrfs_subpage_info {
        /* Number of bits for each bitmap */
@@ -32,7 +32,6 @@ struct btrfs_subpage_info {
         * @bitmap_size, which is calculated from PAGE_SIZE / sectorsize.
         */
        unsigned int uptodate_offset;
-       unsigned int error_offset;
        unsigned int dirty_offset;
        unsigned int writeback_offset;
        unsigned int ordered_offset;
@@ -141,7 +140,6 @@ bool btrfs_page_clamp_test_##name(const struct btrfs_fs_info *fs_info,      \
                struct page *page, u64 start, u32 len);
 
 DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
-DECLARE_BTRFS_SUBPAGE_OPS(error);
 DECLARE_BTRFS_SUBPAGE_OPS(dirty);
 DECLARE_BTRFS_SUBPAGE_OPS(writeback);
 DECLARE_BTRFS_SUBPAGE_OPS(ordered);
@@ -154,5 +152,7 @@ void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
                                 struct page *page);
 void btrfs_page_unlock_writer(struct btrfs_fs_info *fs_info, struct page *page,
                              u64 start, u32 len);
+void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
+                                     struct page *page, u64 start, u32 len);
 
 #endif
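
The packed-bitmap diagram above can be made concrete with a small self-contained calculation. Assuming the common subpage configuration of 64K pages and a 4K sectorsize (so each named bitmap holds 16 bits), the per-type offsets and the absolute bit index for one sector fall out as below; the numbers are illustrative, not taken from the patch.

#include <stdio.h>

int main(void)
{
        /* Assumed geometry: 64K page, 4K sectorsize -> 16 bits per bitmap. */
        const unsigned int bitmap_nr_bits = (64 * 1024) / (4 * 1024);
        unsigned int cur = 0;

        /* Offsets accumulate one bitmap after another, mirroring the way
         * btrfs_init_subpage_info() lays them out. */
        const unsigned int uptodate_offset  = cur; cur += bitmap_nr_bits;
        const unsigned int dirty_offset     = cur; cur += bitmap_nr_bits;
        const unsigned int writeback_offset = cur; cur += bitmap_nr_bits;
        const unsigned int ordered_offset   = cur; cur += bitmap_nr_bits;
        const unsigned int checked_offset   = cur; cur += bitmap_nr_bits;
        const unsigned int total_nr_bits    = cur;

        /* The "dirty" bit for the sector at offset 8K inside the page is the
         * relative sector index plus dirty_offset, the same calculation as
         * subpage_calc_start_bit(). */
        unsigned int sector = (8 * 1024) / (4 * 1024);

        printf("per-bitmap bits=%u total=%u\n", bitmap_nr_bits, total_nr_bits);
        printf("uptodate@%u dirty@%u writeback@%u ordered@%u checked@%u\n",
               uptodate_offset, dirty_offset, writeback_offset,
               ordered_offset, checked_offset);
        printf("dirty bit for sector %u = %u\n", sector, dirty_offset + sector);
        return 0;
}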
index efeb1a9..f1dd172 100644 (file)
@@ -849,8 +849,7 @@ out:
  * All other options will be parsed on much later in the mount process and
  * only when we need to allocate a new super block.
  */
-static int btrfs_parse_device_options(const char *options, fmode_t flags,
-                                     void *holder)
+static int btrfs_parse_device_options(const char *options, blk_mode_t flags)
 {
        substring_t args[MAX_OPT_ARGS];
        char *device_name, *opts, *orig, *p;
@@ -884,8 +883,7 @@ static int btrfs_parse_device_options(const char *options, fmode_t flags,
                                error = -ENOMEM;
                                goto out;
                        }
-                       device = btrfs_scan_one_device(device_name, flags,
-                                       holder);
+                       device = btrfs_scan_one_device(device_name, flags);
                        kfree(device_name);
                        if (IS_ERR(device)) {
                                error = PTR_ERR(device);
@@ -1442,12 +1440,9 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
        struct btrfs_fs_devices *fs_devices = NULL;
        struct btrfs_fs_info *fs_info = NULL;
        void *new_sec_opts = NULL;
-       fmode_t mode = FMODE_READ;
+       blk_mode_t mode = sb_open_mode(flags);
        int error = 0;
 
-       if (!(flags & SB_RDONLY))
-               mode |= FMODE_WRITE;
-
        if (data) {
                error = security_sb_eat_lsm_opts(data, &new_sec_opts);
                if (error)
@@ -1477,13 +1472,13 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
        }
 
        mutex_lock(&uuid_mutex);
-       error = btrfs_parse_device_options(data, mode, fs_type);
+       error = btrfs_parse_device_options(data, mode);
        if (error) {
                mutex_unlock(&uuid_mutex);
                goto error_fs_info;
        }
 
-       device = btrfs_scan_one_device(device_name, mode, fs_type);
+       device = btrfs_scan_one_device(device_name, mode);
        if (IS_ERR(device)) {
                mutex_unlock(&uuid_mutex);
                error = PTR_ERR(device);
@@ -1631,7 +1626,6 @@ static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
               old_pool_size, new_pool_size);
 
        btrfs_workqueue_set_max(fs_info->workers, new_pool_size);
-       btrfs_workqueue_set_max(fs_info->hipri_workers, new_pool_size);
        btrfs_workqueue_set_max(fs_info->delalloc_workers, new_pool_size);
        btrfs_workqueue_set_max(fs_info->caching_workers, new_pool_size);
        workqueue_set_max_active(fs_info->endio_workers, new_pool_size);
@@ -2196,8 +2190,7 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
        switch (cmd) {
        case BTRFS_IOC_SCAN_DEV:
                mutex_lock(&uuid_mutex);
-               device = btrfs_scan_one_device(vol->name, FMODE_READ,
-                                              &btrfs_root_fs_type);
+               device = btrfs_scan_one_device(vol->name, BLK_OPEN_READ);
                ret = PTR_ERR_OR_ZERO(device);
                mutex_unlock(&uuid_mutex);
                break;
@@ -2211,8 +2204,7 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
                break;
        case BTRFS_IOC_DEVICES_READY:
                mutex_lock(&uuid_mutex);
-               device = btrfs_scan_one_device(vol->name, FMODE_READ,
-                                              &btrfs_root_fs_type);
+               device = btrfs_scan_one_device(vol->name, BLK_OPEN_READ);
                if (IS_ERR(device)) {
                        mutex_unlock(&uuid_mutex);
                        ret = PTR_ERR(device);
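
btrfs_mount_root() above stops computing an fmode_t by hand and passes sb_open_mode(flags) instead. Assuming the helper keeps the semantics of the removed lines, it amounts to a one-line mapping from superblock flags to blk_mode_t; this is a sketch, not the authoritative definition in fs.h.

/* Sketch: read access always, plus write access unless mounting read-only. */
#define sb_open_mode(flags) \
        (BLK_OPEN_READ | (((flags) & SB_RDONLY) ? 0 : BLK_OPEN_WRITE))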
index dfc5c7f..f6bc6d7 100644 (file)
@@ -159,7 +159,7 @@ static int test_find_delalloc(u32 sectorsize)
         * |--- delalloc ---|
         * |---  search  ---|
         */
-       set_extent_delalloc(tmp, 0, sectorsize - 1, 0, NULL);
+       set_extent_bit(tmp, 0, sectorsize - 1, EXTENT_DELALLOC, NULL);
        start = 0;
        end = start + PAGE_SIZE - 1;
        found = find_lock_delalloc_range(inode, locked_page, &start,
@@ -190,7 +190,7 @@ static int test_find_delalloc(u32 sectorsize)
                test_err("couldn't find the locked page");
                goto out_bits;
        }
-       set_extent_delalloc(tmp, sectorsize, max_bytes - 1, 0, NULL);
+       set_extent_bit(tmp, sectorsize, max_bytes - 1, EXTENT_DELALLOC, NULL);
        start = test_start;
        end = start + PAGE_SIZE - 1;
        found = find_lock_delalloc_range(inode, locked_page, &start,
@@ -245,7 +245,7 @@ static int test_find_delalloc(u32 sectorsize)
         *
         * We are re-using our test_start from above since it works out well.
         */
-       set_extent_delalloc(tmp, max_bytes, total_dirty - 1, 0, NULL);
+       set_extent_bit(tmp, max_bytes, total_dirty - 1, EXTENT_DELALLOC, NULL);
        start = test_start;
        end = start + PAGE_SIZE - 1;
        found = find_lock_delalloc_range(inode, locked_page, &start,
@@ -503,8 +503,8 @@ static int test_find_first_clear_extent_bit(void)
         * Set 1M-4M alloc/discard and 32M-64M thus leaving a hole between
         * 4M-32M
         */
-       set_extent_bits(&tree, SZ_1M, SZ_4M - 1,
-                       CHUNK_TRIMMED | CHUNK_ALLOCATED);
+       set_extent_bit(&tree, SZ_1M, SZ_4M - 1,
+                      CHUNK_TRIMMED | CHUNK_ALLOCATED, NULL);
 
        find_first_clear_extent_bit(&tree, SZ_512K, &start, &end,
                                    CHUNK_TRIMMED | CHUNK_ALLOCATED);
@@ -516,8 +516,8 @@ static int test_find_first_clear_extent_bit(void)
        }
 
        /* Now add 32M-64M so that we have a hole between 4M-32M */
-       set_extent_bits(&tree, SZ_32M, SZ_64M - 1,
-                       CHUNK_TRIMMED | CHUNK_ALLOCATED);
+       set_extent_bit(&tree, SZ_32M, SZ_64M - 1,
+                      CHUNK_TRIMMED | CHUNK_ALLOCATED, NULL);
 
        /*
         * Request first hole starting at 12M, we should get 4M-32M
@@ -548,7 +548,7 @@ static int test_find_first_clear_extent_bit(void)
         * Set 64M-72M with CHUNK_ALLOC flag, then search for CHUNK_TRIMMED flag
         * being unset in this range, we should get the entry in range 64M-72M
         */
-       set_extent_bits(&tree, SZ_64M, SZ_64M + SZ_8M - 1, CHUNK_ALLOCATED);
+       set_extent_bit(&tree, SZ_64M, SZ_64M + SZ_8M - 1, CHUNK_ALLOCATED, NULL);
        find_first_clear_extent_bit(&tree, SZ_64M + SZ_1M, &start, &end,
                                    CHUNK_TRIMMED);
 
index 8b6a99b..cf30635 100644 (file)
@@ -374,8 +374,6 @@ loop:
        spin_lock_init(&cur_trans->dirty_bgs_lock);
        INIT_LIST_HEAD(&cur_trans->deleted_bgs);
        spin_lock_init(&cur_trans->dropped_roots_lock);
-       INIT_LIST_HEAD(&cur_trans->releasing_ebs);
-       spin_lock_init(&cur_trans->releasing_ebs_lock);
        list_add_tail(&cur_trans->list, &fs_info->trans_list);
        extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
                        IO_TREE_TRANS_DIRTY_PAGES);
@@ -1056,7 +1054,6 @@ int btrfs_write_marked_extents(struct btrfs_fs_info *fs_info,
        u64 start = 0;
        u64 end;
 
-       atomic_inc(&BTRFS_I(fs_info->btree_inode)->sync_writers);
        while (!find_first_extent_bit(dirty_pages, start, &start, &end,
                                      mark, &cached_state)) {
                bool wait_writeback = false;
@@ -1092,7 +1089,6 @@ int btrfs_write_marked_extents(struct btrfs_fs_info *fs_info,
                cond_resched();
                start = end + 1;
        }
-       atomic_dec(&BTRFS_I(fs_info->btree_inode)->sync_writers);
        return werr;
 }
 
@@ -1688,7 +1684,10 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
         * insert the directory item
         */
        ret = btrfs_set_inode_index(BTRFS_I(parent_inode), &index);
-       BUG_ON(ret); /* -ENOMEM */
+       if (ret) {
+               btrfs_abort_transaction(trans, ret);
+               goto fail;
+       }
 
        /* check if there is a file/dir which has the same name. */
        dir_item = btrfs_lookup_dir_item(NULL, parent_root, path,
@@ -2484,13 +2483,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
                goto scrub_continue;
        }
 
-       /*
-        * At this point, we should have written all the tree blocks allocated
-        * in this transaction. So it's now safe to free the redirtyied extent
-        * buffers.
-        */
-       btrfs_free_redirty_list(cur_trans);
-
        ret = write_all_supers(fs_info, 0);
        /*
         * the super is written, we can safely allow the tree-loggers
index fa728ab..8e9fa23 100644 (file)
@@ -94,9 +94,6 @@ struct btrfs_transaction {
         */
        atomic_t pending_ordered;
        wait_queue_head_t pending_wait;
-
-       spinlock_t releasing_ebs_lock;
-       struct list_head releasing_ebs;
 };
 
 enum {
index e2b5479..038dfa8 100644 (file)
 #include "compression.h"
 #include "volumes.h"
 #include "misc.h"
-#include "btrfs_inode.h"
 #include "fs.h"
 #include "accessors.h"
 #include "file-item.h"
+#include "inode-item.h"
 
 /*
  * Error message should follow the following format:
@@ -857,10 +857,10 @@ int btrfs_check_chunk_valid(struct extent_buffer *leaf,
         *
         * Thus it should be a good way to catch obvious bitflips.
         */
-       if (unlikely(length >= ((u64)U32_MAX << BTRFS_STRIPE_LEN_SHIFT))) {
+       if (unlikely(length >= btrfs_stripe_nr_to_offset(U32_MAX))) {
                chunk_err(leaf, chunk, logical,
                          "chunk length too large: have %llu limit %llu",
-                         length, (u64)U32_MAX << BTRFS_STRIPE_LEN_SHIFT);
+                         length, btrfs_stripe_nr_to_offset(U32_MAX));
                return -EUCLEAN;
        }
        if (unlikely(type & ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
@@ -1620,9 +1620,10 @@ static int check_inode_ref(struct extent_buffer *leaf,
 /*
  * Common point to switch the item-specific validation.
  */
-static int check_leaf_item(struct extent_buffer *leaf,
-                          struct btrfs_key *key, int slot,
-                          struct btrfs_key *prev_key)
+static enum btrfs_tree_block_status check_leaf_item(struct extent_buffer *leaf,
+                                                   struct btrfs_key *key,
+                                                   int slot,
+                                                   struct btrfs_key *prev_key)
 {
        int ret = 0;
        struct btrfs_chunk *chunk;
@@ -1671,10 +1672,13 @@ static int check_leaf_item(struct extent_buffer *leaf,
                ret = check_extent_data_ref(leaf, key, slot);
                break;
        }
-       return ret;
+
+       if (ret)
+               return BTRFS_TREE_BLOCK_INVALID_ITEM;
+       return BTRFS_TREE_BLOCK_CLEAN;
 }
 
-static int check_leaf(struct extent_buffer *leaf, bool check_item_data)
+enum btrfs_tree_block_status __btrfs_check_leaf(struct extent_buffer *leaf)
 {
        struct btrfs_fs_info *fs_info = leaf->fs_info;
        /* No valid key type is 0, so all key should be larger than this key */
@@ -1687,7 +1691,7 @@ static int check_leaf(struct extent_buffer *leaf, bool check_item_data)
                generic_err(leaf, 0,
                        "invalid level for leaf, have %d expect 0",
                        btrfs_header_level(leaf));
-               return -EUCLEAN;
+               return BTRFS_TREE_BLOCK_INVALID_LEVEL;
        }
 
        /*
@@ -1710,32 +1714,32 @@ static int check_leaf(struct extent_buffer *leaf, bool check_item_data)
                        generic_err(leaf, 0,
                        "invalid root, root %llu must never be empty",
                                    owner);
-                       return -EUCLEAN;
+                       return BTRFS_TREE_BLOCK_INVALID_NRITEMS;
                }
 
                /* Unknown tree */
                if (unlikely(owner == 0)) {
                        generic_err(leaf, 0,
                                "invalid owner, root 0 is not defined");
-                       return -EUCLEAN;
+                       return BTRFS_TREE_BLOCK_INVALID_OWNER;
                }
 
                /* EXTENT_TREE_V2 can have empty extent trees. */
                if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2))
-                       return 0;
+                       return BTRFS_TREE_BLOCK_CLEAN;
 
                if (unlikely(owner == BTRFS_EXTENT_TREE_OBJECTID)) {
                        generic_err(leaf, 0,
                        "invalid root, root %llu must never be empty",
                                    owner);
-                       return -EUCLEAN;
+                       return BTRFS_TREE_BLOCK_INVALID_NRITEMS;
                }
 
-               return 0;
+               return BTRFS_TREE_BLOCK_CLEAN;
        }
 
        if (unlikely(nritems == 0))
-               return 0;
+               return BTRFS_TREE_BLOCK_CLEAN;
 
        /*
         * Check the following things to make sure this is a good leaf, and
@@ -1751,7 +1755,6 @@ static int check_leaf(struct extent_buffer *leaf, bool check_item_data)
        for (slot = 0; slot < nritems; slot++) {
                u32 item_end_expected;
                u64 item_data_end;
-               int ret;
 
                btrfs_item_key_to_cpu(leaf, &key, slot);
 
@@ -1762,7 +1765,7 @@ static int check_leaf(struct extent_buffer *leaf, bool check_item_data)
                                prev_key.objectid, prev_key.type,
                                prev_key.offset, key.objectid, key.type,
                                key.offset);
-                       return -EUCLEAN;
+                       return BTRFS_TREE_BLOCK_BAD_KEY_ORDER;
                }
 
                item_data_end = (u64)btrfs_item_offset(leaf, slot) +
@@ -1781,7 +1784,7 @@ static int check_leaf(struct extent_buffer *leaf, bool check_item_data)
                        generic_err(leaf, slot,
                                "unexpected item end, have %llu expect %u",
                                item_data_end, item_end_expected);
-                       return -EUCLEAN;
+                       return BTRFS_TREE_BLOCK_INVALID_OFFSETS;
                }
 
                /*
@@ -1793,7 +1796,7 @@ static int check_leaf(struct extent_buffer *leaf, bool check_item_data)
                        generic_err(leaf, slot,
                        "slot end outside of leaf, have %llu expect range [0, %u]",
                                item_data_end, BTRFS_LEAF_DATA_SIZE(fs_info));
-                       return -EUCLEAN;
+                       return BTRFS_TREE_BLOCK_INVALID_OFFSETS;
                }
 
                /* Also check if the item pointer overlaps with btrfs item. */
@@ -1804,16 +1807,22 @@ static int check_leaf(struct extent_buffer *leaf, bool check_item_data)
                                btrfs_item_nr_offset(leaf, slot) +
                                sizeof(struct btrfs_item),
                                btrfs_item_ptr_offset(leaf, slot));
-                       return -EUCLEAN;
+                       return BTRFS_TREE_BLOCK_INVALID_OFFSETS;
                }
 
-               if (check_item_data) {
+               /*
+                * We only want to do this if WRITTEN is set, otherwise the leaf
+                * may be in some intermediate state and won't appear valid.
+                */
+               if (btrfs_header_flag(leaf, BTRFS_HEADER_FLAG_WRITTEN)) {
+                       enum btrfs_tree_block_status ret;
+
                        /*
                         * Check if the item size and content meet other
                         * criteria
                         */
                        ret = check_leaf_item(leaf, &key, slot, &prev_key);
-                       if (unlikely(ret < 0))
+                       if (unlikely(ret != BTRFS_TREE_BLOCK_CLEAN))
                                return ret;
                }
 
@@ -1822,21 +1831,21 @@ static int check_leaf(struct extent_buffer *leaf, bool check_item_data)
                prev_key.offset = key.offset;
        }
 
-       return 0;
+       return BTRFS_TREE_BLOCK_CLEAN;
 }
 
-int btrfs_check_leaf_full(struct extent_buffer *leaf)
+int btrfs_check_leaf(struct extent_buffer *leaf)
 {
-       return check_leaf(leaf, true);
-}
-ALLOW_ERROR_INJECTION(btrfs_check_leaf_full, ERRNO);
+       enum btrfs_tree_block_status ret;
 
-int btrfs_check_leaf_relaxed(struct extent_buffer *leaf)
-{
-       return check_leaf(leaf, false);
+       ret = __btrfs_check_leaf(leaf);
+       if (unlikely(ret != BTRFS_TREE_BLOCK_CLEAN))
+               return -EUCLEAN;
+       return 0;
 }
+ALLOW_ERROR_INJECTION(btrfs_check_leaf, ERRNO);
 
-int btrfs_check_node(struct extent_buffer *node)
+enum btrfs_tree_block_status __btrfs_check_node(struct extent_buffer *node)
 {
        struct btrfs_fs_info *fs_info = node->fs_info;
        unsigned long nr = btrfs_header_nritems(node);
@@ -1844,13 +1853,12 @@ int btrfs_check_node(struct extent_buffer *node)
        int slot;
        int level = btrfs_header_level(node);
        u64 bytenr;
-       int ret = 0;
 
        if (unlikely(level <= 0 || level >= BTRFS_MAX_LEVEL)) {
                generic_err(node, 0,
                        "invalid level for node, have %d expect [1, %d]",
                        level, BTRFS_MAX_LEVEL - 1);
-               return -EUCLEAN;
+               return BTRFS_TREE_BLOCK_INVALID_LEVEL;
        }
        if (unlikely(nr == 0 || nr > BTRFS_NODEPTRS_PER_BLOCK(fs_info))) {
                btrfs_crit(fs_info,
@@ -1858,7 +1866,7 @@ int btrfs_check_node(struct extent_buffer *node)
                           btrfs_header_owner(node), node->start,
                           nr == 0 ? "small" : "large", nr,
                           BTRFS_NODEPTRS_PER_BLOCK(fs_info));
-               return -EUCLEAN;
+               return BTRFS_TREE_BLOCK_INVALID_NRITEMS;
        }
 
        for (slot = 0; slot < nr - 1; slot++) {
@@ -1869,15 +1877,13 @@ int btrfs_check_node(struct extent_buffer *node)
                if (unlikely(!bytenr)) {
                        generic_err(node, slot,
                                "invalid NULL node pointer");
-                       ret = -EUCLEAN;
-                       goto out;
+                       return BTRFS_TREE_BLOCK_INVALID_BLOCKPTR;
                }
                if (unlikely(!IS_ALIGNED(bytenr, fs_info->sectorsize))) {
                        generic_err(node, slot,
                        "unaligned pointer, have %llu should be aligned to %u",
                                bytenr, fs_info->sectorsize);
-                       ret = -EUCLEAN;
-                       goto out;
+                       return BTRFS_TREE_BLOCK_INVALID_BLOCKPTR;
                }
 
                if (unlikely(btrfs_comp_cpu_keys(&key, &next_key) >= 0)) {
@@ -1886,12 +1892,20 @@ int btrfs_check_node(struct extent_buffer *node)
                                key.objectid, key.type, key.offset,
                                next_key.objectid, next_key.type,
                                next_key.offset);
-                       ret = -EUCLEAN;
-                       goto out;
+                       return BTRFS_TREE_BLOCK_BAD_KEY_ORDER;
                }
        }
-out:
-       return ret;
+       return BTRFS_TREE_BLOCK_CLEAN;
+}
+
+int btrfs_check_node(struct extent_buffer *node)
+{
+       enum btrfs_tree_block_status ret;
+
+       ret = __btrfs_check_node(node);
+       if (unlikely(ret != BTRFS_TREE_BLOCK_CLEAN))
+               return -EUCLEAN;
+       return 0;
 }
 ALLOW_ERROR_INJECTION(btrfs_check_node, ERRNO);
 
@@ -1949,3 +1963,61 @@ int btrfs_check_eb_owner(const struct extent_buffer *eb, u64 root_owner)
        }
        return 0;
 }
+
+int btrfs_verify_level_key(struct extent_buffer *eb, int level,
+                          struct btrfs_key *first_key, u64 parent_transid)
+{
+       struct btrfs_fs_info *fs_info = eb->fs_info;
+       int found_level;
+       struct btrfs_key found_key;
+       int ret;
+
+       found_level = btrfs_header_level(eb);
+       if (found_level != level) {
+               WARN(IS_ENABLED(CONFIG_BTRFS_DEBUG),
+                    KERN_ERR "BTRFS: tree level check failed\n");
+               btrfs_err(fs_info,
+"tree level mismatch detected, bytenr=%llu level expected=%u has=%u",
+                         eb->start, level, found_level);
+               return -EIO;
+       }
+
+       if (!first_key)
+               return 0;
+
+       /*
+        * For live tree blocks (new tree blocks in the current transaction),
+        * we need proper lock context to avoid races, which is impossible here.
+        * So we only check tree blocks which are read from disk, whose
+        * generation <= fs_info->last_trans_committed.
+        */
+       if (btrfs_header_generation(eb) > fs_info->last_trans_committed)
+               return 0;
+
+       /* We have @first_key, so this @eb must have at least one item */
+       if (btrfs_header_nritems(eb) == 0) {
+               btrfs_err(fs_info,
+               "invalid tree nritems, bytenr=%llu nritems=0 expect >0",
+                         eb->start);
+               WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
+               return -EUCLEAN;
+       }
+
+       if (found_level)
+               btrfs_node_key_to_cpu(eb, &found_key, 0);
+       else
+               btrfs_item_key_to_cpu(eb, &found_key, 0);
+       ret = btrfs_comp_cpu_keys(first_key, &found_key);
+
+       if (ret) {
+               WARN(IS_ENABLED(CONFIG_BTRFS_DEBUG),
+                    KERN_ERR "BTRFS: tree first key check failed\n");
+               btrfs_err(fs_info,
+"tree first key mismatch detected, bytenr=%llu parent_transid=%llu key expected=(%llu,%u,%llu) has=(%llu,%u,%llu)",
+                         eb->start, parent_transid, first_key->objectid,
+                         first_key->type, first_key->offset,
+                         found_key.objectid, found_key.type,
+                         found_key.offset);
+       }
+       return ret;
+}
index bfb5efa..3c2a02a 100644 (file)
@@ -40,22 +40,33 @@ struct btrfs_tree_parent_check {
        u8 level;
 };
 
-/*
- * Comprehensive leaf checker.
- * Will check not only the item pointers, but also every possible member
- * in item data.
- */
-int btrfs_check_leaf_full(struct extent_buffer *leaf);
+enum btrfs_tree_block_status {
+       BTRFS_TREE_BLOCK_CLEAN,
+       BTRFS_TREE_BLOCK_INVALID_NRITEMS,
+       BTRFS_TREE_BLOCK_INVALID_PARENT_KEY,
+       BTRFS_TREE_BLOCK_BAD_KEY_ORDER,
+       BTRFS_TREE_BLOCK_INVALID_LEVEL,
+       BTRFS_TREE_BLOCK_INVALID_FREE_SPACE,
+       BTRFS_TREE_BLOCK_INVALID_OFFSETS,
+       BTRFS_TREE_BLOCK_INVALID_BLOCKPTR,
+       BTRFS_TREE_BLOCK_INVALID_ITEM,
+       BTRFS_TREE_BLOCK_INVALID_OWNER,
+};
 
 /*
- * Less strict leaf checker.
- * Will only check item pointers, not reading item data.
+ * Exported simply for btrfs-progs which wants to have the
+ * btrfs_tree_block_status return codes.
  */
-int btrfs_check_leaf_relaxed(struct extent_buffer *leaf);
+enum btrfs_tree_block_status __btrfs_check_leaf(struct extent_buffer *leaf);
+enum btrfs_tree_block_status __btrfs_check_node(struct extent_buffer *node);
+
+int btrfs_check_leaf(struct extent_buffer *leaf);
 int btrfs_check_node(struct extent_buffer *node);
 
 int btrfs_check_chunk_valid(struct extent_buffer *leaf,
                            struct btrfs_chunk *chunk, u64 logical);
 int btrfs_check_eb_owner(const struct extent_buffer *eb, u64 root_owner);
+int btrfs_verify_level_key(struct extent_buffer *eb, int level,
+                          struct btrfs_key *first_key, u64 parent_transid);
 
 #endif
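
Splitting the checkers into __btrfs_check_leaf()/__btrfs_check_node(), which return the detailed status, plus thin in-kernel wrappers that collapse everything to -EUCLEAN, leaves the enum available for btrfs-progs. A hypothetical progs-side consumer (the function name is invented for illustration) could map the status codes to human-readable strings:

/* Hypothetical helper; the enum values come from tree-checker.h above. */
static const char *tree_block_status_str(enum btrfs_tree_block_status st)
{
        switch (st) {
        case BTRFS_TREE_BLOCK_CLEAN:              return "clean";
        case BTRFS_TREE_BLOCK_INVALID_NRITEMS:    return "invalid nritems";
        case BTRFS_TREE_BLOCK_INVALID_PARENT_KEY: return "parent key mismatch";
        case BTRFS_TREE_BLOCK_BAD_KEY_ORDER:      return "bad key order";
        case BTRFS_TREE_BLOCK_INVALID_LEVEL:      return "invalid level";
        case BTRFS_TREE_BLOCK_INVALID_FREE_SPACE: return "invalid free space";
        case BTRFS_TREE_BLOCK_INVALID_OFFSETS:    return "invalid item offsets";
        case BTRFS_TREE_BLOCK_INVALID_BLOCKPTR:   return "invalid block pointer";
        case BTRFS_TREE_BLOCK_INVALID_ITEM:       return "invalid item data";
        case BTRFS_TREE_BLOCK_INVALID_OWNER:      return "invalid owner";
        }
        return "unknown";
}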
index d2755d5..365a1cc 100644 (file)
@@ -859,10 +859,10 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
                                                struct btrfs_ordered_sum,
                                                list);
                                csum_root = btrfs_csum_root(fs_info,
-                                                           sums->bytenr);
+                                                           sums->logical);
                                if (!ret)
                                        ret = btrfs_del_csums(trans, csum_root,
-                                                             sums->bytenr,
+                                                             sums->logical,
                                                              sums->len);
                                if (!ret)
                                        ret = btrfs_csum_file_blocks(trans,
@@ -3252,7 +3252,7 @@ int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
  * Returns 1 if the inode was logged before in the transaction, 0 if it was not,
  * and < 0 on error.
  */
-static int inode_logged(struct btrfs_trans_handle *trans,
+static int inode_logged(const struct btrfs_trans_handle *trans,
                        struct btrfs_inode *inode,
                        struct btrfs_path *path_in)
 {
@@ -4056,14 +4056,14 @@ static int drop_inode_items(struct btrfs_trans_handle *trans,
 
        while (1) {
                ret = btrfs_search_slot(trans, log, &key, path, -1, 1);
-               BUG_ON(ret == 0); /* Logic error */
-               if (ret < 0)
-                       break;
-
-               if (path->slots[0] == 0)
+               if (ret < 0) {
                        break;
+               } else if (ret > 0) {
+                       if (path->slots[0] == 0)
+                               break;
+                       path->slots[0]--;
+               }
 
-               path->slots[0]--;
                btrfs_item_key_to_cpu(path->nodes[0], &found_key,
                                      path->slots[0]);
 
@@ -4221,7 +4221,7 @@ static int log_csums(struct btrfs_trans_handle *trans,
                     struct btrfs_root *log_root,
                     struct btrfs_ordered_sum *sums)
 {
-       const u64 lock_end = sums->bytenr + sums->len - 1;
+       const u64 lock_end = sums->logical + sums->len - 1;
        struct extent_state *cached_state = NULL;
        int ret;
 
@@ -4239,7 +4239,7 @@ static int log_csums(struct btrfs_trans_handle *trans,
         * file which happens to refer to the same extent as well. Such races
         * can leave checksum items in the log with overlapping ranges.
         */
-       ret = lock_extent(&log_root->log_csum_range, sums->bytenr, lock_end,
+       ret = lock_extent(&log_root->log_csum_range, sums->logical, lock_end,
                          &cached_state);
        if (ret)
                return ret;
@@ -4252,11 +4252,11 @@ static int log_csums(struct btrfs_trans_handle *trans,
         * some checksums missing in the fs/subvolume tree. So just delete (or
         * trim and adjust) any existing csum items in the log for this range.
         */
-       ret = btrfs_del_csums(trans, log_root, sums->bytenr, sums->len);
+       ret = btrfs_del_csums(trans, log_root, sums->logical, sums->len);
        if (!ret)
                ret = btrfs_csum_file_blocks(trans, log_root, sums);
 
-       unlock_extent(&log_root->log_csum_range, sums->bytenr, lock_end,
+       unlock_extent(&log_root->log_csum_range, sums->logical, lock_end,
                      &cached_state);
 
        return ret;
@@ -5303,7 +5303,7 @@ out:
  * multiple times when multiple tasks have joined the same log transaction.
  */
 static bool need_log_inode(const struct btrfs_trans_handle *trans,
-                          const struct btrfs_inode *inode)
+                          struct btrfs_inode *inode)
 {
        /*
         * If a directory was not modified, no dentries added or removed, we can
@@ -5321,7 +5321,7 @@ static bool need_log_inode(const struct btrfs_trans_handle *trans,
         * logged_trans will be 0, in which case we have to fully log it since
         * logged_trans is a transient field, not persisted.
         */
-       if (inode->logged_trans == trans->transid &&
+       if (inode_logged(trans, inode, NULL) == 1 &&
            !test_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags))
                return false;
 
@@ -7309,7 +7309,7 @@ error:
  */
 void btrfs_record_unlink_dir(struct btrfs_trans_handle *trans,
                             struct btrfs_inode *dir, struct btrfs_inode *inode,
-                            int for_rename)
+                            bool for_rename)
 {
        /*
         * when we're logging a file, if it hasn't been renamed
@@ -7325,18 +7325,25 @@ void btrfs_record_unlink_dir(struct btrfs_trans_handle *trans,
        inode->last_unlink_trans = trans->transid;
        mutex_unlock(&inode->log_mutex);
 
+       if (!for_rename)
+               return;
+
        /*
-        * if this directory was already logged any new
-        * names for this file/dir will get recorded
+        * If this directory was already logged, any new names will be logged
+        * with btrfs_log_new_name() and old names will be deleted from the log
+        * tree with btrfs_del_dir_entries_in_log() or with
+        * btrfs_del_inode_ref_in_log().
         */
-       if (dir->logged_trans == trans->transid)
+       if (inode_logged(trans, dir, NULL) == 1)
                return;
 
        /*
-        * if the inode we're about to unlink was logged,
-        * the log will be properly updated for any new names
+        * If the inode we're about to unlink was logged before, the log will be
+        * properly updated with the new name with btrfs_log_new_name() and the
+        * old name removed with btrfs_del_dir_entries_in_log() or with
+        * btrfs_del_inode_ref_in_log().
         */
-       if (inode->logged_trans == trans->transid)
+       if (inode_logged(trans, inode, NULL) == 1)
                return;
 
        /*
@@ -7346,13 +7353,6 @@ void btrfs_record_unlink_dir(struct btrfs_trans_handle *trans,
         * properly.  So, we have to be conservative and force commits
         * so the new name gets discovered.
         */
-       if (for_rename)
-               goto record;
-
-       /* we can safely do the unlink without any special recording */
-       return;
-
-record:
        mutex_lock(&dir->log_mutex);
        dir->last_unlink_trans = trans->transid;
        mutex_unlock(&dir->log_mutex);
index bdeb521..a550a8a 100644 (file)
@@ -100,7 +100,7 @@ void btrfs_end_log_trans(struct btrfs_root *root);
 void btrfs_pin_log_trans(struct btrfs_root *root);
 void btrfs_record_unlink_dir(struct btrfs_trans_handle *trans,
                             struct btrfs_inode *dir, struct btrfs_inode *inode,
-                            int for_rename);
+                            bool for_rename);
 void btrfs_record_snapshot_destroy(struct btrfs_trans_handle *trans,
                                   struct btrfs_inode *dir);
 void btrfs_log_new_name(struct btrfs_trans_handle *trans,
index a555baa..3df6153 100644 (file)
@@ -226,21 +226,32 @@ int btrfs_tree_mod_log_insert_key(struct extent_buffer *eb, int slot,
                                  enum btrfs_mod_log_op op)
 {
        struct tree_mod_elem *tm;
-       int ret;
+       int ret = 0;
 
        if (!tree_mod_need_log(eb->fs_info, eb))
                return 0;
 
        tm = alloc_tree_mod_elem(eb, slot, op);
        if (!tm)
-               return -ENOMEM;
+               ret = -ENOMEM;
 
        if (tree_mod_dont_log(eb->fs_info, eb)) {
                kfree(tm);
+               /*
+                * Don't error if we failed to allocate memory because we don't
+                * need to log.
+                */
                return 0;
+       } else if (ret != 0) {
+               /*
+                * We previously failed to allocate memory and we need to log,
+                * so we have to fail.
+                */
+               goto out_unlock;
        }
 
        ret = tree_mod_log_insert(eb->fs_info, tm);
+out_unlock:
        write_unlock(&eb->fs_info->tree_mod_log_lock);
        if (ret)
                kfree(tm);
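
The reworked tree-mod-log error handling above, and in the hunks that follow, all share one shape: allocate the log element(s) optimistically, then decide under tree_mod_log_lock whether a failed allocation actually matters, because tree_mod_dont_log() may show that no logging was needed at all. A stripped-down sketch of the pattern, with invented names and a plain mutex standing in for the kernel locking, looks like this:

#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Toy stand-ins; all names here are invented for illustration. */
struct mod_log {
        pthread_mutex_t lock;
        int logging_enabled;            /* plays the role of !tree_mod_dont_log() */
};

struct mod_elem {
        int op;
        struct mod_elem *next;
};

static struct mod_elem *log_head;

static int mod_log_insert(struct mod_log *log, int op)
{
        /* Allocate optimistically, before we know whether logging is needed. */
        struct mod_elem *elem = malloc(sizeof(*elem));
        int ret = elem ? 0 : -ENOMEM;

        pthread_mutex_lock(&log->lock);
        if (!log->logging_enabled) {
                /* Logging not needed: a failed allocation is not an error. */
                pthread_mutex_unlock(&log->lock);
                free(elem);
                return 0;
        }
        if (ret) {
                /* We do need to log, so the earlier failure becomes fatal. */
                pthread_mutex_unlock(&log->lock);
                return ret;
        }
        elem->op = op;
        elem->next = log_head;          /* "insert into the log" */
        log_head = elem;
        pthread_mutex_unlock(&log->lock);
        return 0;
}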
@@ -248,6 +259,26 @@ int btrfs_tree_mod_log_insert_key(struct extent_buffer *eb, int slot,
        return ret;
 }
 
+static struct tree_mod_elem *tree_mod_log_alloc_move(struct extent_buffer *eb,
+                                                    int dst_slot, int src_slot,
+                                                    int nr_items)
+{
+       struct tree_mod_elem *tm;
+
+       tm = kzalloc(sizeof(*tm), GFP_NOFS);
+       if (!tm)
+               return ERR_PTR(-ENOMEM);
+
+       tm->logical = eb->start;
+       tm->slot = src_slot;
+       tm->move.dst_slot = dst_slot;
+       tm->move.nr_items = nr_items;
+       tm->op = BTRFS_MOD_LOG_MOVE_KEYS;
+       RB_CLEAR_NODE(&tm->node);
+
+       return tm;
+}
+
 int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
                                   int dst_slot, int src_slot,
                                   int nr_items)
@@ -262,35 +293,46 @@ int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
                return 0;
 
        tm_list = kcalloc(nr_items, sizeof(struct tree_mod_elem *), GFP_NOFS);
-       if (!tm_list)
-               return -ENOMEM;
-
-       tm = kzalloc(sizeof(*tm), GFP_NOFS);
-       if (!tm) {
+       if (!tm_list) {
                ret = -ENOMEM;
-               goto free_tms;
+               goto lock;
        }
 
-       tm->logical = eb->start;
-       tm->slot = src_slot;
-       tm->move.dst_slot = dst_slot;
-       tm->move.nr_items = nr_items;
-       tm->op = BTRFS_MOD_LOG_MOVE_KEYS;
+       tm = tree_mod_log_alloc_move(eb, dst_slot, src_slot, nr_items);
+       if (IS_ERR(tm)) {
+               ret = PTR_ERR(tm);
+               tm = NULL;
+               goto lock;
+       }
 
        for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
                tm_list[i] = alloc_tree_mod_elem(eb, i + dst_slot,
                                BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING);
                if (!tm_list[i]) {
                        ret = -ENOMEM;
-                       goto free_tms;
+                       goto lock;
                }
        }
 
-       if (tree_mod_dont_log(eb->fs_info, eb))
+lock:
+       if (tree_mod_dont_log(eb->fs_info, eb)) {
+               /*
+                * Don't error if we failed to allocate memory because we don't
+                * need to log.
+                */
+               ret = 0;
                goto free_tms;
+       }
        locked = true;
 
        /*
+        * We previously failed to allocate memory and we need to log, so we
+        * have to fail.
+        */
+       if (ret != 0)
+               goto free_tms;
+
+       /*
         * When we override something during the move, we log these removals.
         * This can only happen when we move towards the beginning of the
         * buffer, i.e. dst_slot < src_slot.
@@ -310,10 +352,12 @@ int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
        return 0;
 
 free_tms:
-       for (i = 0; i < nr_items; i++) {
-               if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
-                       rb_erase(&tm_list[i]->node, &eb->fs_info->tree_mod_log);
-               kfree(tm_list[i]);
+       if (tm_list) {
+               for (i = 0; i < nr_items; i++) {
+                       if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
+                               rb_erase(&tm_list[i]->node, &eb->fs_info->tree_mod_log);
+                       kfree(tm_list[i]);
+               }
        }
        if (locked)
                write_unlock(&eb->fs_info->tree_mod_log_lock);
@@ -363,14 +407,14 @@ int btrfs_tree_mod_log_insert_root(struct extent_buffer *old_root,
                                  GFP_NOFS);
                if (!tm_list) {
                        ret = -ENOMEM;
-                       goto free_tms;
+                       goto lock;
                }
                for (i = 0; i < nritems; i++) {
                        tm_list[i] = alloc_tree_mod_elem(old_root, i,
                            BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING);
                        if (!tm_list[i]) {
                                ret = -ENOMEM;
-                               goto free_tms;
+                               goto lock;
                        }
                }
        }
@@ -378,7 +422,7 @@ int btrfs_tree_mod_log_insert_root(struct extent_buffer *old_root,
        tm = kzalloc(sizeof(*tm), GFP_NOFS);
        if (!tm) {
                ret = -ENOMEM;
-               goto free_tms;
+               goto lock;
        }
 
        tm->logical = new_root->start;
@@ -387,14 +431,28 @@ int btrfs_tree_mod_log_insert_root(struct extent_buffer *old_root,
        tm->generation = btrfs_header_generation(old_root);
        tm->op = BTRFS_MOD_LOG_ROOT_REPLACE;
 
-       if (tree_mod_dont_log(fs_info, NULL))
+lock:
+       if (tree_mod_dont_log(fs_info, NULL)) {
+               /*
+                * Don't error if we failed to allocate memory because we don't
+                * need to log.
+                */
+               ret = 0;
                goto free_tms;
+       } else if (ret != 0) {
+               /*
+                * We previously failed to allocate memory and we need to log,
+                * so we have to fail.
+                */
+               goto out_unlock;
+       }
 
        if (tm_list)
                ret = tree_mod_log_free_eb(fs_info, tm_list, nritems);
        if (!ret)
                ret = tree_mod_log_insert(fs_info, tm);
 
+out_unlock:
        write_unlock(&fs_info->tree_mod_log_lock);
        if (ret)
                goto free_tms;
@@ -486,9 +544,14 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
        struct btrfs_fs_info *fs_info = dst->fs_info;
        int ret = 0;
        struct tree_mod_elem **tm_list = NULL;
-       struct tree_mod_elem **tm_list_add, **tm_list_rem;
+       struct tree_mod_elem **tm_list_add = NULL;
+       struct tree_mod_elem **tm_list_rem = NULL;
        int i;
        bool locked = false;
+       struct tree_mod_elem *dst_move_tm = NULL;
+       struct tree_mod_elem *src_move_tm = NULL;
+       u32 dst_move_nr_items = btrfs_header_nritems(dst) - dst_offset;
+       u32 src_move_nr_items = btrfs_header_nritems(src) - (src_offset + nr_items);
 
        if (!tree_mod_need_log(fs_info, NULL))
                return 0;
@@ -498,8 +561,30 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
 
        tm_list = kcalloc(nr_items * 2, sizeof(struct tree_mod_elem *),
                          GFP_NOFS);
-       if (!tm_list)
-               return -ENOMEM;
+       if (!tm_list) {
+               ret = -ENOMEM;
+               goto lock;
+       }
+
+       if (dst_move_nr_items) {
+               dst_move_tm = tree_mod_log_alloc_move(dst, dst_offset + nr_items,
+                                                     dst_offset, dst_move_nr_items);
+               if (IS_ERR(dst_move_tm)) {
+                       ret = PTR_ERR(dst_move_tm);
+                       dst_move_tm = NULL;
+                       goto lock;
+               }
+       }
+       if (src_move_nr_items) {
+               src_move_tm = tree_mod_log_alloc_move(src, src_offset,
+                                                     src_offset + nr_items,
+                                                     src_move_nr_items);
+               if (IS_ERR(src_move_tm)) {
+                       ret = PTR_ERR(src_move_tm);
+                       src_move_tm = NULL;
+                       goto lock;
+               }
+       }
 
        tm_list_add = tm_list;
        tm_list_rem = tm_list + nr_items;
@@ -508,21 +593,40 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
                                                     BTRFS_MOD_LOG_KEY_REMOVE);
                if (!tm_list_rem[i]) {
                        ret = -ENOMEM;
-                       goto free_tms;
+                       goto lock;
                }
 
                tm_list_add[i] = alloc_tree_mod_elem(dst, i + dst_offset,
                                                     BTRFS_MOD_LOG_KEY_ADD);
                if (!tm_list_add[i]) {
                        ret = -ENOMEM;
-                       goto free_tms;
+                       goto lock;
                }
        }
 
-       if (tree_mod_dont_log(fs_info, NULL))
+lock:
+       if (tree_mod_dont_log(fs_info, NULL)) {
+               /*
+                * Don't error if we failed to allocate memory because we don't
+                * need to log.
+                */
+               ret = 0;
                goto free_tms;
+       }
        locked = true;
 
+       /*
+        * We previously failed to allocate memory and we need to log, so we
+        * have to fail.
+        */
+       if (ret != 0)
+               goto free_tms;
+
+       if (dst_move_tm) {
+               ret = tree_mod_log_insert(fs_info, dst_move_tm);
+               if (ret)
+                       goto free_tms;
+       }
        for (i = 0; i < nr_items; i++) {
                ret = tree_mod_log_insert(fs_info, tm_list_rem[i]);
                if (ret)
@@ -531,6 +635,11 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
                if (ret)
                        goto free_tms;
        }
+       if (src_move_tm) {
+               ret = tree_mod_log_insert(fs_info, src_move_tm);
+               if (ret)
+                       goto free_tms;
+       }
 
        write_unlock(&fs_info->tree_mod_log_lock);
        kfree(tm_list);
@@ -538,10 +647,18 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
        return 0;
 
 free_tms:
-       for (i = 0; i < nr_items * 2; i++) {
-               if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
-                       rb_erase(&tm_list[i]->node, &fs_info->tree_mod_log);
-               kfree(tm_list[i]);
+       if (dst_move_tm && !RB_EMPTY_NODE(&dst_move_tm->node))
+               rb_erase(&dst_move_tm->node, &fs_info->tree_mod_log);
+       kfree(dst_move_tm);
+       if (src_move_tm && !RB_EMPTY_NODE(&src_move_tm->node))
+               rb_erase(&src_move_tm->node, &fs_info->tree_mod_log);
+       kfree(src_move_tm);
+       if (tm_list) {
+               for (i = 0; i < nr_items * 2; i++) {
+                       if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
+                               rb_erase(&tm_list[i]->node, &fs_info->tree_mod_log);
+                       kfree(tm_list[i]);
+               }
        }
        if (locked)
                write_unlock(&fs_info->tree_mod_log_lock);
@@ -562,22 +679,38 @@ int btrfs_tree_mod_log_free_eb(struct extent_buffer *eb)
 
        nritems = btrfs_header_nritems(eb);
        tm_list = kcalloc(nritems, sizeof(struct tree_mod_elem *), GFP_NOFS);
-       if (!tm_list)
-               return -ENOMEM;
+       if (!tm_list) {
+               ret = -ENOMEM;
+               goto lock;
+       }
 
        for (i = 0; i < nritems; i++) {
                tm_list[i] = alloc_tree_mod_elem(eb, i,
                                    BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING);
                if (!tm_list[i]) {
                        ret = -ENOMEM;
-                       goto free_tms;
+                       goto lock;
                }
        }
 
-       if (tree_mod_dont_log(eb->fs_info, eb))
+lock:
+       if (tree_mod_dont_log(eb->fs_info, eb)) {
+               /*
+                * Don't error if we failed to allocate memory because we don't
+                * need to log.
+                */
+               ret = 0;
                goto free_tms;
+       } else if (ret != 0) {
+               /*
+                * We previously failed to allocate memory and we need to log,
+                * so we have to fail.
+                */
+               goto out_unlock;
+       }
 
        ret = tree_mod_log_free_eb(eb->fs_info, tm_list, nritems);
+out_unlock:
        write_unlock(&eb->fs_info->tree_mod_log_lock);
        if (ret)
                goto free_tms;
@@ -586,9 +719,11 @@ int btrfs_tree_mod_log_free_eb(struct extent_buffer *eb)
        return 0;
 
 free_tms:
-       for (i = 0; i < nritems; i++)
-               kfree(tm_list[i]);
-       kfree(tm_list);
+       if (tm_list) {
+               for (i = 0; i < nritems; i++)
+                       kfree(tm_list[i]);
+               kfree(tm_list);
+       }
 
        return ret;
 }
@@ -664,10 +799,27 @@ static void tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
        unsigned long o_dst;
        unsigned long o_src;
        unsigned long p_size = sizeof(struct btrfs_key_ptr);
+       /*
+        * max_slot tracks the maximum valid slot of the rewind eb at every
+        * step of the rewind. This is in contrast with 'n' which eventually
+        * matches the number of items, but can be wrong during moves or if
+        * removes overlap on already valid slots (which is probably separately
+        * a bug). We do this to validate the offsets of memmoves for rewinding
+        * moves and detect invalid memmoves.
+        *
+        * Since a rewind eb can start empty, max_slot is a signed integer with
+        * a special meaning for -1, which is that no slot is valid to move out
+        * of. Any other negative value is invalid.
+        */
+       int max_slot;
+       int move_src_end_slot;
+       int move_dst_end_slot;
 
        n = btrfs_header_nritems(eb);
+       max_slot = n - 1;
        read_lock(&fs_info->tree_mod_log_lock);
        while (tm && tm->seq >= time_seq) {
+               ASSERT(max_slot >= -1);
                /*
                 * All the operations are recorded with the operator used for
                 * the modification. As we're going backwards, we do the
@@ -684,6 +836,8 @@ static void tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
                        btrfs_set_node_ptr_generation(eb, tm->slot,
                                                      tm->generation);
                        n++;
+                       if (tm->slot > max_slot)
+                               max_slot = tm->slot;
                        break;
                case BTRFS_MOD_LOG_KEY_REPLACE:
                        BUG_ON(tm->slot >= n);
@@ -693,14 +847,37 @@ static void tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
                                                      tm->generation);
                        break;
                case BTRFS_MOD_LOG_KEY_ADD:
+                       /*
+                        * It is possible we could have already removed keys
+                        * behind the known max slot, so this will be an
+                        * overestimate. In practice, the copy operation
+                        * inserts them in increasing order, and overestimating
+                        * just means we miss some warnings, so it's OK. It
+                        * isn't worth carefully tracking the full array of
+                        * valid slots to check against when moving.
+                        */
+                       if (tm->slot == max_slot)
+                               max_slot--;
                        /* if a move operation is needed it's in the log */
                        n--;
                        break;
                case BTRFS_MOD_LOG_MOVE_KEYS:
+                       ASSERT(tm->move.nr_items > 0);
+                       move_src_end_slot = tm->move.dst_slot + tm->move.nr_items - 1;
+                       move_dst_end_slot = tm->slot + tm->move.nr_items - 1;
                        o_dst = btrfs_node_key_ptr_offset(eb, tm->slot);
                        o_src = btrfs_node_key_ptr_offset(eb, tm->move.dst_slot);
+                       if (WARN_ON(move_src_end_slot > max_slot ||
+                                   tm->move.nr_items <= 0)) {
+                               btrfs_warn(fs_info,
+"move from invalid tree mod log slot eb %llu slot %d dst_slot %d nr_items %d seq %llu n %u max_slot %d",
+                                          eb->start, tm->slot,
+                                          tm->move.dst_slot, tm->move.nr_items,
+                                          tm->seq, n, max_slot);
+                       }
                        memmove_extent_buffer(eb, o_dst, o_src,
                                              tm->move.nr_items * p_size);
+                       max_slot = move_dst_end_slot;
                        break;
                case BTRFS_MOD_LOG_ROOT_REPLACE:
                        /*
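
The tree mod log hunks above all follow the same reworked flow: allocate everything first, jump to the "lock" label on failure, and only treat an earlier -ENOMEM as fatal once it is known, under the lock, that logging is actually required. Judging from the comments in the hunks, the point is that returning -ENOMEM when nothing had to be logged would fail the caller needlessly. A minimal userspace sketch of that control flow, with every name invented for illustration:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for !tree_mod_dont_log(): true when the operation must be logged. */
static bool logging_required;

static int log_operation(void)
{
	int ret = 0;
	void *elem = malloc(64);	/* may fail, but don't bail out yet */

	if (!elem)
		ret = -ENOMEM;

	/* "lock" step: only now decide whether the earlier failure matters. */
	if (!logging_required) {
		ret = 0;		/* nothing to log, the failure is harmless */
		goto out;
	}
	if (ret != 0)
		goto out;		/* we must log but have no memory: fail */

	printf("operation logged\n");
out:
	free(elem);
	return ret;
}

int main(void)
{
	logging_required = false;
	printf("logging not required -> %d\n", log_operation());
	logging_required = true;
	printf("logging required     -> %d\n", log_operation());
	return 0;
}
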
index e60beb1..73f9ea7 100644 (file)
@@ -370,6 +370,8 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid,
 {
        struct btrfs_fs_devices *fs_devs;
 
+       ASSERT(fsid || !metadata_fsid);
+
        fs_devs = kzalloc(sizeof(*fs_devs), GFP_KERNEL);
        if (!fs_devs)
                return ERR_PTR(-ENOMEM);
@@ -380,18 +382,17 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid,
        INIT_LIST_HEAD(&fs_devs->alloc_list);
        INIT_LIST_HEAD(&fs_devs->fs_list);
        INIT_LIST_HEAD(&fs_devs->seed_list);
-       if (fsid)
-               memcpy(fs_devs->fsid, fsid, BTRFS_FSID_SIZE);
 
-       if (metadata_fsid)
-               memcpy(fs_devs->metadata_uuid, metadata_fsid, BTRFS_FSID_SIZE);
-       else if (fsid)
-               memcpy(fs_devs->metadata_uuid, fsid, BTRFS_FSID_SIZE);
+       if (fsid) {
+               memcpy(fs_devs->fsid, fsid, BTRFS_FSID_SIZE);
+               memcpy(fs_devs->metadata_uuid,
+                      metadata_fsid ?: fsid, BTRFS_FSID_SIZE);
+       }
 
        return fs_devs;
 }
 
-void btrfs_free_device(struct btrfs_device *device)
+static void btrfs_free_device(struct btrfs_device *device)
 {
        WARN_ON(!list_empty(&device->post_commit_list));
        rcu_string_free(device->name);
@@ -426,6 +427,21 @@ void __exit btrfs_cleanup_fs_uuids(void)
        }
 }
 
+static bool match_fsid_fs_devices(const struct btrfs_fs_devices *fs_devices,
+                                 const u8 *fsid, const u8 *metadata_fsid)
+{
+       if (memcmp(fsid, fs_devices->fsid, BTRFS_FSID_SIZE) != 0)
+               return false;
+
+       if (!metadata_fsid)
+               return true;
+
+       if (memcmp(metadata_fsid, fs_devices->metadata_uuid, BTRFS_FSID_SIZE) != 0)
+               return false;
+
+       return true;
+}
+
 static noinline struct btrfs_fs_devices *find_fsid(
                const u8 *fsid, const u8 *metadata_fsid)
 {
@@ -435,19 +451,25 @@ static noinline struct btrfs_fs_devices *find_fsid(
 
        /* Handle non-split brain cases */
        list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
-               if (metadata_fsid) {
-                       if (memcmp(fsid, fs_devices->fsid, BTRFS_FSID_SIZE) == 0
-                           && memcmp(metadata_fsid, fs_devices->metadata_uuid,
-                                     BTRFS_FSID_SIZE) == 0)
-                               return fs_devices;
-               } else {
-                       if (memcmp(fsid, fs_devices->fsid, BTRFS_FSID_SIZE) == 0)
-                               return fs_devices;
-               }
+               if (match_fsid_fs_devices(fs_devices, fsid, metadata_fsid))
+                       return fs_devices;
        }
        return NULL;
 }
 
+/*
+ * First check if the metadata_uuid is different from the fsid in the given
+ * fs_devices. Then check if the given fsid is the same as the metadata_uuid
+ * in the fs_devices. If it is, return true; otherwise, return false.
+ */
+static inline bool check_fsid_changed(const struct btrfs_fs_devices *fs_devices,
+                                     const u8 *fsid)
+{
+       return memcmp(fs_devices->fsid, fs_devices->metadata_uuid,
+                     BTRFS_FSID_SIZE) != 0 &&
+              memcmp(fs_devices->metadata_uuid, fsid, BTRFS_FSID_SIZE) == 0;
+}
+
 static struct btrfs_fs_devices *find_fsid_with_metadata_uuid(
                                struct btrfs_super_block *disk_super)
 {
@@ -461,14 +483,14 @@ static struct btrfs_fs_devices *find_fsid_with_metadata_uuid(
         * at all and the CHANGING_FSID_V2 flag set.
         */
        list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
-               if (fs_devices->fsid_change &&
-                   memcmp(disk_super->metadata_uuid, fs_devices->fsid,
-                          BTRFS_FSID_SIZE) == 0 &&
-                   memcmp(fs_devices->fsid, fs_devices->metadata_uuid,
-                          BTRFS_FSID_SIZE) == 0) {
+               if (!fs_devices->fsid_change)
+                       continue;
+
+               if (match_fsid_fs_devices(fs_devices, disk_super->metadata_uuid,
+                                         fs_devices->fsid))
                        return fs_devices;
-               }
        }
+
        /*
         * Handle scanned device having completed its fsid change but
         * belonging to a fs_devices that was created by a device that
@@ -476,13 +498,11 @@ static struct btrfs_fs_devices *find_fsid_with_metadata_uuid(
         * CHANGING_FSID_V2 flag set.
         */
        list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
-               if (fs_devices->fsid_change &&
-                   memcmp(fs_devices->metadata_uuid,
-                          fs_devices->fsid, BTRFS_FSID_SIZE) != 0 &&
-                   memcmp(disk_super->metadata_uuid, fs_devices->metadata_uuid,
-                          BTRFS_FSID_SIZE) == 0) {
+               if (!fs_devices->fsid_change)
+                       continue;
+
+               if (check_fsid_changed(fs_devices, disk_super->metadata_uuid))
                        return fs_devices;
-               }
        }
 
        return find_fsid(disk_super->fsid, disk_super->metadata_uuid);
@@ -490,13 +510,13 @@ static struct btrfs_fs_devices *find_fsid_with_metadata_uuid(
 
 
 static int
-btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void *holder,
+btrfs_get_bdev_and_sb(const char *device_path, blk_mode_t flags, void *holder,
                      int flush, struct block_device **bdev,
                      struct btrfs_super_block **disk_super)
 {
        int ret;
 
-       *bdev = blkdev_get_by_path(device_path, flags, holder);
+       *bdev = blkdev_get_by_path(device_path, flags, holder, NULL);
 
        if (IS_ERR(*bdev)) {
                ret = PTR_ERR(*bdev);
@@ -507,14 +527,14 @@ btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void *holder,
                sync_blockdev(*bdev);
        ret = set_blocksize(*bdev, BTRFS_BDEV_BLOCKSIZE);
        if (ret) {
-               blkdev_put(*bdev, flags);
+               blkdev_put(*bdev, holder);
                goto error;
        }
        invalidate_bdev(*bdev);
        *disk_super = btrfs_read_dev_super(*bdev);
        if (IS_ERR(*disk_super)) {
                ret = PTR_ERR(*disk_super);
-               blkdev_put(*bdev, flags);
+               blkdev_put(*bdev, holder);
                goto error;
        }
 
@@ -590,7 +610,7 @@ static int btrfs_free_stale_devices(dev_t devt, struct btrfs_device *skip_device
  * fs_devices->device_list_mutex here.
  */
 static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
-                       struct btrfs_device *device, fmode_t flags,
+                       struct btrfs_device *device, blk_mode_t flags,
                        void *holder)
 {
        struct block_device *bdev;
@@ -642,7 +662,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 
        device->bdev = bdev;
        clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
-       device->mode = flags;
+       device->holder = holder;
 
        fs_devices->open_devices++;
        if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
@@ -656,7 +676,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 
 error_free_page:
        btrfs_release_disk_super(disk_super);
-       blkdev_put(bdev, flags);
+       blkdev_put(bdev, holder);
 
        return -EINVAL;
 }
@@ -673,18 +693,16 @@ static struct btrfs_fs_devices *find_fsid_inprogress(
        struct btrfs_fs_devices *fs_devices;
 
        list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
-               if (memcmp(fs_devices->metadata_uuid, fs_devices->fsid,
-                          BTRFS_FSID_SIZE) != 0 &&
-                   memcmp(fs_devices->metadata_uuid, disk_super->fsid,
-                          BTRFS_FSID_SIZE) == 0 && !fs_devices->fsid_change) {
+               if (fs_devices->fsid_change)
+                       continue;
+
+               if (check_fsid_changed(fs_devices, disk_super->fsid))
                        return fs_devices;
-               }
        }
 
        return find_fsid(disk_super->fsid, NULL);
 }
 
-
 static struct btrfs_fs_devices *find_fsid_changed(
                                        struct btrfs_super_block *disk_super)
 {
@@ -701,10 +719,7 @@ static struct btrfs_fs_devices *find_fsid_changed(
         */
        list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
                /* Changed UUIDs */
-               if (memcmp(fs_devices->metadata_uuid, fs_devices->fsid,
-                          BTRFS_FSID_SIZE) != 0 &&
-                   memcmp(fs_devices->metadata_uuid, disk_super->metadata_uuid,
-                          BTRFS_FSID_SIZE) == 0 &&
+               if (check_fsid_changed(fs_devices, disk_super->metadata_uuid) &&
                    memcmp(fs_devices->fsid, disk_super->fsid,
                           BTRFS_FSID_SIZE) != 0)
                        return fs_devices;
@@ -735,11 +750,10 @@ static struct btrfs_fs_devices *find_fsid_reverted_metadata(
         * fs_devices equal to the FSID of the disk.
         */
        list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
-               if (memcmp(fs_devices->fsid, fs_devices->metadata_uuid,
-                          BTRFS_FSID_SIZE) != 0 &&
-                   memcmp(fs_devices->metadata_uuid, disk_super->fsid,
-                          BTRFS_FSID_SIZE) == 0 &&
-                   fs_devices->fsid_change)
+               if (!fs_devices->fsid_change)
+                       continue;
+
+               if (check_fsid_changed(fs_devices, disk_super->fsid))
                        return fs_devices;
        }
 
@@ -790,12 +804,8 @@ static noinline struct btrfs_device *device_list_add(const char *path,
 
 
        if (!fs_devices) {
-               if (has_metadata_uuid)
-                       fs_devices = alloc_fs_devices(disk_super->fsid,
-                                                     disk_super->metadata_uuid);
-               else
-                       fs_devices = alloc_fs_devices(disk_super->fsid, NULL);
-
+               fs_devices = alloc_fs_devices(disk_super->fsid,
+                               has_metadata_uuid ? disk_super->metadata_uuid : NULL);
                if (IS_ERR(fs_devices))
                        return ERR_CAST(fs_devices);
 
@@ -1057,7 +1067,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
                        continue;
 
                if (device->bdev) {
-                       blkdev_put(device->bdev, device->mode);
+                       blkdev_put(device->bdev, device->holder);
                        device->bdev = NULL;
                        fs_devices->open_devices--;
                }
@@ -1103,7 +1113,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
                invalidate_bdev(device->bdev);
        }
 
-       blkdev_put(device->bdev, device->mode);
+       blkdev_put(device->bdev, device->holder);
 }
 
 static void btrfs_close_one_device(struct btrfs_device *device)
@@ -1207,14 +1217,12 @@ void btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 }
 
 static int open_fs_devices(struct btrfs_fs_devices *fs_devices,
-                               fmode_t flags, void *holder)
+                               blk_mode_t flags, void *holder)
 {
        struct btrfs_device *device;
        struct btrfs_device *latest_dev = NULL;
        struct btrfs_device *tmp_device;
 
-       flags |= FMODE_EXCL;
-
        list_for_each_entry_safe(device, tmp_device, &fs_devices->devices,
                                 dev_list) {
                int ret;
@@ -1257,7 +1265,7 @@ static int devid_cmp(void *priv, const struct list_head *a,
 }
 
 int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
-                      fmode_t flags, void *holder)
+                      blk_mode_t flags, void *holder)
 {
        int ret;
 
@@ -1348,8 +1356,7 @@ int btrfs_forget_devices(dev_t devt)
  * and we are not allowed to call set_blocksize during the scan. The superblock
  * is read via pagecache
  */
-struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
-                                          void *holder)
+struct btrfs_device *btrfs_scan_one_device(const char *path, blk_mode_t flags)
 {
        struct btrfs_super_block *disk_super;
        bool new_device_added = false;
@@ -1368,16 +1375,16 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
         */
 
        /*
-        * Avoid using flag |= FMODE_EXCL here, as the systemd-udev may
-        * initiate the device scan which may race with the user's mount
-        * or mkfs command, resulting in failure.
-        * Since the device scan is solely for reading purposes, there is
-        * no need for FMODE_EXCL. Additionally, the devices are read again
+        * Avoid an exclusive open here, as the systemd-udev may initiate the
+        * device scan which may race with the user's mount or mkfs command,
+        * resulting in failure.
+        * Since the device scan is solely for reading purposes, there is no
+        * need for an exclusive open. Additionally, the devices are read again
         * during the mount process. It is ok to get some inconsistent
         * values temporarily, as the device paths of the fsid are the only
         * required information for assembling the volume.
         */
-       bdev = blkdev_get_by_path(path, flags, holder);
+       bdev = blkdev_get_by_path(path, flags, NULL, NULL);
        if (IS_ERR(bdev))
                return ERR_CAST(bdev);
 
@@ -1401,7 +1408,7 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
        btrfs_release_disk_super(disk_super);
 
 error_bdev_put:
-       blkdev_put(bdev, flags);
+       blkdev_put(bdev, NULL);
 
        return device;
 }
@@ -1918,7 +1925,7 @@ static void update_dev_time(const char *device_path)
                return;
 
        now = current_time(d_inode(path.dentry));
-       inode_update_time(d_inode(path.dentry), &now, S_MTIME | S_CTIME);
+       inode_update_time(d_inode(path.dentry), &now, S_MTIME | S_CTIME | S_VERSION);
        path_put(&path);
 }
 
@@ -2088,7 +2095,7 @@ void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info,
 
 int btrfs_rm_device(struct btrfs_fs_info *fs_info,
                    struct btrfs_dev_lookup_args *args,
-                   struct block_device **bdev, fmode_t *mode)
+                   struct block_device **bdev, void **holder)
 {
        struct btrfs_trans_handle *trans;
        struct btrfs_device *device;
@@ -2227,7 +2234,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
        }
 
        *bdev = device->bdev;
-       *mode = device->mode;
+       *holder = device->holder;
        synchronize_rcu();
        btrfs_free_device(device);
 
@@ -2381,7 +2388,7 @@ int btrfs_get_dev_args_from_path(struct btrfs_fs_info *fs_info,
                return -ENOMEM;
        }
 
-       ret = btrfs_get_bdev_and_sb(path, FMODE_READ, fs_info->bdev_holder, 0,
+       ret = btrfs_get_bdev_and_sb(path, BLK_OPEN_READ, NULL, 0,
                                    &bdev, &disk_super);
        if (ret) {
                btrfs_put_dev_args_from_path(args);
@@ -2395,7 +2402,7 @@ int btrfs_get_dev_args_from_path(struct btrfs_fs_info *fs_info,
        else
                memcpy(args->fsid, disk_super->fsid, BTRFS_FSID_SIZE);
        btrfs_release_disk_super(disk_super);
-       blkdev_put(bdev, FMODE_READ);
+       blkdev_put(bdev, NULL);
        return 0;
 }
 
@@ -2628,8 +2635,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
        if (sb_rdonly(sb) && !fs_devices->seeding)
                return -EROFS;
 
-       bdev = blkdev_get_by_path(device_path, FMODE_WRITE | FMODE_EXCL,
-                                 fs_info->bdev_holder);
+       bdev = blkdev_get_by_path(device_path, BLK_OPEN_WRITE,
+                                 fs_info->bdev_holder, NULL);
        if (IS_ERR(bdev))
                return PTR_ERR(bdev);
 
@@ -2691,7 +2698,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
        device->commit_total_bytes = device->total_bytes;
        set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
        clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
-       device->mode = FMODE_EXCL;
+       device->holder = fs_info->bdev_holder;
        device->dev_stats_valid = 1;
        set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 
@@ -2849,7 +2856,7 @@ error_free_zone:
 error_free_device:
        btrfs_free_device(device);
 error:
-       blkdev_put(bdev, FMODE_EXCL);
+       blkdev_put(bdev, fs_info->bdev_holder);
        if (locked) {
                mutex_unlock(&uuid_mutex);
                up_write(&sb->s_umount);
@@ -5125,7 +5132,7 @@ static void init_alloc_chunk_ctl_policy_regular(
        /* We don't want a chunk larger than 10% of writable space */
        ctl->max_chunk_size = min(mult_perc(fs_devices->total_rw_bytes, 10),
                                  ctl->max_chunk_size);
-       ctl->dev_extent_min = ctl->dev_stripes << BTRFS_STRIPE_LEN_SHIFT;
+       ctl->dev_extent_min = btrfs_stripe_nr_to_offset(ctl->dev_stripes);
 }
 
 static void init_alloc_chunk_ctl_policy_zoned(
@@ -5801,7 +5808,7 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
        if (!WARN_ON(IS_ERR(em))) {
                map = em->map_lookup;
                if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
-                       len = nr_data_stripes(map) << BTRFS_STRIPE_LEN_SHIFT;
+                       len = btrfs_stripe_nr_to_offset(nr_data_stripes(map));
                free_extent_map(em);
        }
        return len;
@@ -5975,12 +5982,12 @@ struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
        stripe_nr = offset >> BTRFS_STRIPE_LEN_SHIFT;
 
        /* stripe_offset is the offset of this block in its stripe */
-       stripe_offset = offset - ((u64)stripe_nr << BTRFS_STRIPE_LEN_SHIFT);
+       stripe_offset = offset - btrfs_stripe_nr_to_offset(stripe_nr);
 
        stripe_nr_end = round_up(offset + length, BTRFS_STRIPE_LEN) >>
                        BTRFS_STRIPE_LEN_SHIFT;
        stripe_cnt = stripe_nr_end - stripe_nr;
-       stripe_end_offset = ((u64)stripe_nr_end << BTRFS_STRIPE_LEN_SHIFT) -
+       stripe_end_offset = btrfs_stripe_nr_to_offset(stripe_nr_end) -
                            (offset + length);
        /*
         * after this, stripe_nr is the number of stripes on this
@@ -6023,12 +6030,12 @@ struct btrfs_discard_stripe *btrfs_map_discard(struct btrfs_fs_info *fs_info,
        for (i = 0; i < *num_stripes; i++) {
                stripes[i].physical =
                        map->stripes[stripe_index].physical +
-                       stripe_offset + ((u64)stripe_nr << BTRFS_STRIPE_LEN_SHIFT);
+                       stripe_offset + btrfs_stripe_nr_to_offset(stripe_nr);
                stripes[i].dev = map->stripes[stripe_index].dev;
 
                if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
                                 BTRFS_BLOCK_GROUP_RAID10)) {
-                       stripes[i].length = stripes_per_dev << BTRFS_STRIPE_LEN_SHIFT;
+                       stripes[i].length = btrfs_stripe_nr_to_offset(stripes_per_dev);
 
                        if (i / sub_stripes < remaining_stripes)
                                stripes[i].length += BTRFS_STRIPE_LEN;
@@ -6163,17 +6170,10 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
        bioc->replace_nr_stripes = nr_extra_stripes;
 }
 
-static bool need_full_stripe(enum btrfs_map_op op)
-{
-       return (op == BTRFS_MAP_WRITE || op == BTRFS_MAP_GET_READ_MIRRORS);
-}
-
 static u64 btrfs_max_io_len(struct map_lookup *map, enum btrfs_map_op op,
                            u64 offset, u32 *stripe_nr, u64 *stripe_offset,
                            u64 *full_stripe_start)
 {
-       ASSERT(op != BTRFS_MAP_DISCARD);
-
        /*
         * Stripe_nr is the stripe where this block falls.  stripe_offset is
         * the offset of this block in its stripe.
@@ -6183,8 +6183,8 @@ static u64 btrfs_max_io_len(struct map_lookup *map, enum btrfs_map_op op,
        ASSERT(*stripe_offset < U32_MAX);
 
        if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
-               unsigned long full_stripe_len = nr_data_stripes(map) <<
-                                               BTRFS_STRIPE_LEN_SHIFT;
+               unsigned long full_stripe_len =
+                       btrfs_stripe_nr_to_offset(nr_data_stripes(map));
 
                /*
                 * For full stripe start, we use previously calculated
@@ -6196,8 +6196,8 @@ static u64 btrfs_max_io_len(struct map_lookup *map, enum btrfs_map_op op,
                 * not ensured to be power of 2.
                 */
                *full_stripe_start =
-                       (u64)rounddown(*stripe_nr, nr_data_stripes(map)) <<
-                       BTRFS_STRIPE_LEN_SHIFT;
+                       btrfs_stripe_nr_to_offset(
+                               rounddown(*stripe_nr, nr_data_stripes(map)));
 
                ASSERT(*full_stripe_start + full_stripe_len > offset);
                ASSERT(*full_stripe_start <= offset);
@@ -6223,14 +6223,14 @@ static void set_io_stripe(struct btrfs_io_stripe *dst, const struct map_lookup *
 {
        dst->dev = map->stripes[stripe_index].dev;
        dst->physical = map->stripes[stripe_index].physical +
-                       stripe_offset + ((u64)stripe_nr << BTRFS_STRIPE_LEN_SHIFT);
+                       stripe_offset + btrfs_stripe_nr_to_offset(stripe_nr);
 }
 
-int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
-                     u64 logical, u64 *length,
-                     struct btrfs_io_context **bioc_ret,
-                     struct btrfs_io_stripe *smap, int *mirror_num_ret,
-                     int need_raid_map)
+int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
+                   u64 logical, u64 *length,
+                   struct btrfs_io_context **bioc_ret,
+                   struct btrfs_io_stripe *smap, int *mirror_num_ret,
+                   int need_raid_map)
 {
        struct extent_map *em;
        struct map_lookup *map;
@@ -6253,7 +6253,6 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
        u64 max_len;
 
        ASSERT(bioc_ret);
-       ASSERT(op != BTRFS_MAP_DISCARD);
 
        num_copies = btrfs_num_copies(fs_info, logical, fs_info->sectorsize);
        if (mirror_num > num_copies)
@@ -6285,21 +6284,21 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
        if (map->type & BTRFS_BLOCK_GROUP_RAID0) {
                stripe_index = stripe_nr % map->num_stripes;
                stripe_nr /= map->num_stripes;
-               if (!need_full_stripe(op))
+               if (op == BTRFS_MAP_READ)
                        mirror_num = 1;
        } else if (map->type & BTRFS_BLOCK_GROUP_RAID1_MASK) {
-               if (need_full_stripe(op))
+               if (op != BTRFS_MAP_READ) {
                        num_stripes = map->num_stripes;
-               else if (mirror_num)
+               } else if (mirror_num) {
                        stripe_index = mirror_num - 1;
-               else {
+               } else {
                        stripe_index = find_live_mirror(fs_info, map, 0,
                                            dev_replace_is_ongoing);
                        mirror_num = stripe_index + 1;
                }
 
        } else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
-               if (need_full_stripe(op)) {
+               if (op != BTRFS_MAP_READ) {
                        num_stripes = map->num_stripes;
                } else if (mirror_num) {
                        stripe_index = mirror_num - 1;
@@ -6313,7 +6312,7 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
                stripe_index = (stripe_nr % factor) * map->sub_stripes;
                stripe_nr /= factor;
 
-               if (need_full_stripe(op))
+               if (op != BTRFS_MAP_READ)
                        num_stripes = map->sub_stripes;
                else if (mirror_num)
                        stripe_index += mirror_num - 1;
@@ -6326,7 +6325,7 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
                }
 
        } else if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
-               if (need_raid_map && (need_full_stripe(op) || mirror_num > 1)) {
+               if (need_raid_map && (op != BTRFS_MAP_READ || mirror_num > 1)) {
                        /*
                         * Push stripe_nr back to the start of the full stripe
                         * For those cases needing a full stripe, @stripe_nr
@@ -6345,7 +6344,8 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
                        /* Return the length to the full stripe end */
                        *length = min(logical + *length,
                                      raid56_full_stripe_start + em->start +
-                                     (data_stripes << BTRFS_STRIPE_LEN_SHIFT)) - logical;
+                                     btrfs_stripe_nr_to_offset(data_stripes)) -
+                                 logical;
                        stripe_index = 0;
                        stripe_offset = 0;
                } else {
@@ -6361,7 +6361,7 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 
                        /* We distribute the parity blocks across stripes */
                        stripe_index = (stripe_nr + stripe_index) % map->num_stripes;
-                       if (!need_full_stripe(op) && mirror_num <= 1)
+                       if (op == BTRFS_MAP_READ && mirror_num <= 1)
                                mirror_num = 1;
                }
        } else {
@@ -6401,7 +6401,7 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
         */
        if (smap && num_alloc_stripes == 1 &&
            !((map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) && mirror_num > 1) &&
-           (!need_full_stripe(op) || !dev_replace_is_ongoing ||
+           (op == BTRFS_MAP_READ || !dev_replace_is_ongoing ||
             !dev_replace->tgtdev)) {
                set_io_stripe(smap, map, stripe_index, stripe_offset, stripe_nr);
                *mirror_num_ret = mirror_num;
@@ -6425,7 +6425,7 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
         * It's still mostly the same as other profiles, just with extra rotation.
         */
        if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK && need_raid_map &&
-           (need_full_stripe(op) || mirror_num > 1)) {
+           (op != BTRFS_MAP_READ || mirror_num > 1)) {
                /*
                 * For RAID56 @stripe_nr is already the number of full stripes
                 * before us, which is also the rotation value (needs to modulo
@@ -6435,7 +6435,7 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
                 * modulo, to reduce one modulo call.
                 */
                bioc->full_stripe_logical = em->start +
-                       ((stripe_nr * data_stripes) << BTRFS_STRIPE_LEN_SHIFT);
+                       btrfs_stripe_nr_to_offset(stripe_nr * data_stripes);
                for (i = 0; i < num_stripes; i++)
                        set_io_stripe(&bioc->stripes[i], map,
                                      (i + stripe_nr) % num_stripes,
@@ -6452,11 +6452,11 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
                }
        }
 
-       if (need_full_stripe(op))
+       if (op != BTRFS_MAP_READ)
                max_errors = btrfs_chunk_max_errors(map);
 
        if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
-           need_full_stripe(op)) {
+           op != BTRFS_MAP_READ) {
                handle_ops_on_dev_replace(op, bioc, dev_replace, logical,
                                          &num_stripes, &max_errors);
        }
@@ -6476,23 +6476,6 @@ out:
        return ret;
 }
 
-int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
-                     u64 logical, u64 *length,
-                     struct btrfs_io_context **bioc_ret, int mirror_num)
-{
-       return __btrfs_map_block(fs_info, op, logical, length, bioc_ret,
-                                NULL, &mirror_num, 0);
-}
-
-/* For Scrub/replace */
-int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
-                    u64 logical, u64 *length,
-                    struct btrfs_io_context **bioc_ret)
-{
-       return __btrfs_map_block(fs_info, op, logical, length, bioc_ret,
-                                NULL, NULL, 1);
-}
-
 static bool dev_args_match_fs_devices(const struct btrfs_dev_lookup_args *args,
                                      const struct btrfs_fs_devices *fs_devices)
 {
@@ -6912,7 +6895,7 @@ static struct btrfs_fs_devices *open_seed_devices(struct btrfs_fs_info *fs_info,
        if (IS_ERR(fs_devices))
                return fs_devices;
 
-       ret = open_fs_devices(fs_devices, FMODE_READ, fs_info->bdev_holder);
+       ret = open_fs_devices(fs_devices, BLK_OPEN_READ, fs_info->bdev_holder);
        if (ret) {
                free_fs_devices(fs_devices);
                return ERR_PTR(ret);
@@ -8032,7 +8015,7 @@ static void map_raid56_repair_block(struct btrfs_io_context *bioc,
 
        for (i = 0; i < data_stripes; i++) {
                u64 stripe_start = bioc->full_stripe_logical +
-                                  (i << BTRFS_STRIPE_LEN_SHIFT);
+                                  btrfs_stripe_nr_to_offset(i);
 
                if (logical >= stripe_start &&
                    logical < stripe_start + BTRFS_STRIPE_LEN)
@@ -8069,8 +8052,8 @@ int btrfs_map_repair_block(struct btrfs_fs_info *fs_info,
 
        ASSERT(mirror_num > 0);
 
-       ret = __btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical, &map_length,
-                               &bioc, smap, &mirror_ret, true);
+       ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical, &map_length,
+                             &bioc, smap, &mirror_ret, true);
        if (ret < 0)
                return ret;
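
btrfs_map_block() above keeps the long-standing stripe arithmetic and only swaps the open-coded shifts for btrfs_stripe_nr_to_offset(). As a reading aid, here is a standalone sketch of how a chunk-relative offset is split into a stripe number, an offset inside the stripe, and a RAID0 stripe index; the 64K stripe length mirrors BTRFS_STRIPE_LEN, everything else is illustrative:

#include <stdint.h>
#include <stdio.h>

#define STRIPE_LEN_SHIFT 16			/* 64K stripes, as in BTRFS_STRIPE_LEN */

int main(void)
{
	uint64_t offset = 5ULL * 1024 * 1024 + 4096;	/* offset inside the chunk */
	unsigned int num_stripes = 3;			/* RAID0 over three devices */

	uint32_t stripe_nr = offset >> STRIPE_LEN_SHIFT;
	uint64_t stripe_offset = offset - ((uint64_t)stripe_nr << STRIPE_LEN_SHIFT);
	unsigned int stripe_index = stripe_nr % num_stripes;

	stripe_nr /= num_stripes;	/* stripe number on the chosen device */

	printf("device %u, device stripe %u, offset in stripe %llu\n",
	       stripe_index, stripe_nr, (unsigned long long)stripe_offset);
	return 0;
}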
 
index bf47a1a..b8c51f1 100644 (file)
@@ -94,8 +94,8 @@ struct btrfs_device {
 
        struct btrfs_zoned_device_info *zone_info;
 
-       /* the mode sent to blkdev_get */
-       fmode_t mode;
+       /* block device holder for blkdev_get/put */
+       void *holder;
 
        /*
         * Device's major-minor number. Must be set even if the device is not
@@ -280,8 +280,19 @@ enum btrfs_read_policy {
 
 struct btrfs_fs_devices {
        u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
+
+       /*
+        * UUID written into the btree blocks:
+        *
+        * - If metadata_uuid != fsid then super block must have
+        *   BTRFS_FEATURE_INCOMPAT_METADATA_UUID flag set.
+        *
+        * - Following shall be true at all times:
+        *   - metadata_uuid == btrfs_header::fsid
+        *   - metadata_uuid == btrfs_dev_item::fsid
+        */
        u8 metadata_uuid[BTRFS_FSID_SIZE];
-       bool fsid_change;
+
        struct list_head fs_list;
 
        /*
@@ -319,34 +330,32 @@ struct btrfs_fs_devices {
         */
        struct btrfs_device *latest_dev;
 
-       /* all of the devices in the FS, protected by a mutex
-        * so we can safely walk it to write out the supers without
-        * worrying about add/remove by the multi-device code.
-        * Scrubbing super can kick off supers writing by holding
-        * this mutex lock.
+       /*
+        * All of the devices in the filesystem, protected by a mutex so we can
+        * safely walk it to write out the super blocks without worrying about
+        * adding/removing by the multi-device code. Scrubbing the super block
+        * can kick off super block writes while holding this mutex lock.
         */
        struct mutex device_list_mutex;
 
        /* List of all devices, protected by device_list_mutex */
        struct list_head devices;
 
-       /*
-        * Devices which can satisfy space allocation. Protected by
-        * chunk_mutex
-        */
+       /* Devices which can satisfy space allocation. Protected by chunk_mutex. */
        struct list_head alloc_list;
 
        struct list_head seed_list;
-       bool seeding;
 
+       /* Count fs-devices opened. */
        int opened;
 
-       /* set when we find or add a device that doesn't have the
-        * nonrot flag set
-        */
+       /* Set when we find or add a device that doesn't have the nonrot flag set. */
        bool rotating;
-       /* Devices support TRIM/discard commands */
+       /* Devices support TRIM/discard commands. */
        bool discardable;
+       bool fsid_change;
+       /* The filesystem is a seed filesystem. */
+       bool seeding;
 
        struct btrfs_fs_info *fs_info;
        /* sysfs kobjects */
@@ -357,7 +366,7 @@ struct btrfs_fs_devices {
 
        enum btrfs_chunk_allocation_policy chunk_alloc_policy;
 
-       /* Policy used to read the mirrored stripes */
+       /* Policy used to read the mirrored stripes. */
        enum btrfs_read_policy read_policy;
 };
 
@@ -547,15 +556,12 @@ struct btrfs_dev_lookup_args {
 enum btrfs_map_op {
        BTRFS_MAP_READ,
        BTRFS_MAP_WRITE,
-       BTRFS_MAP_DISCARD,
        BTRFS_MAP_GET_READ_MIRRORS,
 };
 
 static inline enum btrfs_map_op btrfs_op(struct bio *bio)
 {
        switch (bio_op(bio)) {
-       case REQ_OP_DISCARD:
-               return BTRFS_MAP_DISCARD;
        case REQ_OP_WRITE:
        case REQ_OP_ZONE_APPEND:
                return BTRFS_MAP_WRITE;
@@ -574,19 +580,24 @@ static inline unsigned long btrfs_chunk_item_size(int num_stripes)
                sizeof(struct btrfs_stripe) * (num_stripes - 1);
 }
 
+/*
+ * Do the type-safe conversion from stripe_nr to offset inside the chunk.
+ *
+ * @stripe_nr is u32, with left shift it can overflow u32 for chunks larger
+ * than 4G.  This does the proper type cast to avoid overflow.
+ */
+static inline u64 btrfs_stripe_nr_to_offset(u32 stripe_nr)
+{
+       return (u64)stripe_nr << BTRFS_STRIPE_LEN_SHIFT;
+}
+
 void btrfs_get_bioc(struct btrfs_io_context *bioc);
 void btrfs_put_bioc(struct btrfs_io_context *bioc);
 int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
                    u64 logical, u64 *length,
-                   struct btrfs_io_context **bioc_ret, int mirror_num);
-int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
-                    u64 logical, u64 *length,
-                    struct btrfs_io_context **bioc_ret);
-int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
-                     u64 logical, u64 *length,
-                     struct btrfs_io_context **bioc_ret,
-                     struct btrfs_io_stripe *smap, int *mirror_num_ret,
-                     int need_raid_map);
+                   struct btrfs_io_context **bioc_ret,
+                   struct btrfs_io_stripe *smap, int *mirror_num_ret,
+                   int need_raid_map);
 int btrfs_map_repair_block(struct btrfs_fs_info *fs_info,
                           struct btrfs_io_stripe *smap, u64 logical,
                           u32 length, int mirror_num);
@@ -599,9 +610,8 @@ struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
                                            u64 type);
 void btrfs_mapping_tree_free(struct extent_map_tree *tree);
 int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
-                      fmode_t flags, void *holder);
-struct btrfs_device *btrfs_scan_one_device(const char *path,
-                                          fmode_t flags, void *holder);
+                      blk_mode_t flags, void *holder);
+struct btrfs_device *btrfs_scan_one_device(const char *path, blk_mode_t flags);
 int btrfs_forget_devices(dev_t devt);
 void btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
 void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices);
@@ -617,10 +627,9 @@ struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
                                        const u64 *devid, const u8 *uuid,
                                        const char *path);
 void btrfs_put_dev_args_from_path(struct btrfs_dev_lookup_args *args);
-void btrfs_free_device(struct btrfs_device *device);
 int btrfs_rm_device(struct btrfs_fs_info *fs_info,
                    struct btrfs_dev_lookup_args *args,
-                   struct block_device **bdev, fmode_t *mode);
+                   struct block_device **bdev, void **holder);
 void __exit btrfs_cleanup_fs_uuids(void);
 int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len);
 int btrfs_grow_device(struct btrfs_trans_handle *trans,
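
btrfs_stripe_nr_to_offset() exists because stripe_nr is a u32 and shifting it left by BTRFS_STRIPE_LEN_SHIFT can silently wrap for chunks larger than 4G, exactly as the new comment says. A small standalone program showing the wraparound and the widening cast that avoids it (the shift of 16 assumes the usual 64K stripe length; the values are made up for the demo):

#include <stdint.h>
#include <stdio.h>

#define STRIPE_LEN_SHIFT 16	/* 64K stripe length, as in BTRFS_STRIPE_LEN_SHIFT */

/* Widen before shifting so the result cannot wrap at 4G. */
static inline uint64_t stripe_nr_to_offset(uint32_t stripe_nr)
{
	return (uint64_t)stripe_nr << STRIPE_LEN_SHIFT;
}

int main(void)
{
	uint32_t stripe_nr = 70000;	/* beyond 65535 -> offset past 4G */
	uint32_t wrapped = stripe_nr << STRIPE_LEN_SHIFT;	/* wraps modulo 2^32 */
	uint64_t correct = stripe_nr_to_offset(stripe_nr);

	printf("wrapped 32-bit result: %u\n", wrapped);
	printf("correct 64-bit result: %llu\n", (unsigned long long)correct);
	return 0;
}
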
index 8acb05e..6c231a1 100644 (file)
@@ -63,7 +63,7 @@ struct list_head *zlib_alloc_workspace(unsigned int level)
 
        workspacesize = max(zlib_deflate_workspacesize(MAX_WBITS, MAX_MEM_LEVEL),
                        zlib_inflate_workspacesize());
-       workspace->strm.workspace = kvzalloc(workspacesize, GFP_KERNEL);
+       workspace->strm.workspace = kvzalloc(workspacesize, GFP_KERNEL | __GFP_NOWARN);
        workspace->level = level;
        workspace->buf = NULL;
        /*
index 39828af..85b8b33 100644 (file)
@@ -15,6 +15,7 @@
 #include "transaction.h"
 #include "dev-replace.h"
 #include "space-info.h"
+#include "super.h"
 #include "fs.h"
 #include "accessors.h"
 #include "bio.h"
@@ -1057,7 +1058,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
 
                /* Check if zones in the region are all empty */
                if (btrfs_dev_is_sequential(device, pos) &&
-                   find_next_zero_bit(zinfo->empty_zones, end, begin) != end) {
+                   !bitmap_test_range_all_set(zinfo->empty_zones, begin, nzones)) {
                        pos += zinfo->zone_size;
                        continue;
                }
@@ -1156,23 +1157,23 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
        struct btrfs_zoned_device_info *zinfo = device->zone_info;
        const u8 shift = zinfo->zone_size_shift;
        unsigned long begin = start >> shift;
-       unsigned long end = (start + size) >> shift;
+       unsigned long nbits = size >> shift;
        u64 pos;
        int ret;
 
        ASSERT(IS_ALIGNED(start, zinfo->zone_size));
        ASSERT(IS_ALIGNED(size, zinfo->zone_size));
 
-       if (end > zinfo->nr_zones)
+       if (begin + nbits > zinfo->nr_zones)
                return -ERANGE;
 
        /* All the zones are conventional */
-       if (find_next_bit(zinfo->seq_zones, end, begin) == end)
+       if (bitmap_test_range_all_zero(zinfo->seq_zones, begin, nbits))
                return 0;
 
        /* All the zones are sequential and empty */
-       if (find_next_zero_bit(zinfo->seq_zones, end, begin) == end &&
-           find_next_zero_bit(zinfo->empty_zones, end, begin) == end)
+       if (bitmap_test_range_all_set(zinfo->seq_zones, begin, nbits) &&
+           bitmap_test_range_all_set(zinfo->empty_zones, begin, nbits))
                return 0;
 
        for (pos = start; pos < start + size; pos += zinfo->zone_size) {
@@ -1602,37 +1603,17 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache)
 void btrfs_redirty_list_add(struct btrfs_transaction *trans,
                            struct extent_buffer *eb)
 {
-       struct btrfs_fs_info *fs_info = eb->fs_info;
-
-       if (!btrfs_is_zoned(fs_info) ||
-           btrfs_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN) ||
-           !list_empty(&eb->release_list))
+       if (!btrfs_is_zoned(eb->fs_info) ||
+           btrfs_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN))
                return;
 
+       ASSERT(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
+
        memzero_extent_buffer(eb, 0, eb->len);
        set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags);
        set_extent_buffer_dirty(eb);
-       set_extent_bits_nowait(&trans->dirty_pages, eb->start,
-                              eb->start + eb->len - 1, EXTENT_DIRTY);
-
-       spin_lock(&trans->releasing_ebs_lock);
-       list_add_tail(&eb->release_list, &trans->releasing_ebs);
-       spin_unlock(&trans->releasing_ebs_lock);
-       atomic_inc(&eb->refs);
-}
-
-void btrfs_free_redirty_list(struct btrfs_transaction *trans)
-{
-       spin_lock(&trans->releasing_ebs_lock);
-       while (!list_empty(&trans->releasing_ebs)) {
-               struct extent_buffer *eb;
-
-               eb = list_first_entry(&trans->releasing_ebs,
-                                     struct extent_buffer, release_list);
-               list_del_init(&eb->release_list);
-               free_extent_buffer(eb);
-       }
-       spin_unlock(&trans->releasing_ebs_lock);
+       set_extent_bit(&trans->dirty_pages, eb->start, eb->start + eb->len - 1,
+                       EXTENT_DIRTY | EXTENT_NOWAIT, NULL);
 }
 
 bool btrfs_use_zone_append(struct btrfs_bio *bbio)
@@ -1677,63 +1658,89 @@ bool btrfs_use_zone_append(struct btrfs_bio *bbio)
 void btrfs_record_physical_zoned(struct btrfs_bio *bbio)
 {
        const u64 physical = bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
-       struct btrfs_ordered_extent *ordered;
+       struct btrfs_ordered_sum *sum = bbio->sums;
 
-       ordered = btrfs_lookup_ordered_extent(bbio->inode, bbio->file_offset);
-       if (WARN_ON(!ordered))
-               return;
-
-       ordered->physical = physical;
-       btrfs_put_ordered_extent(ordered);
+       if (physical < bbio->orig_physical)
+               sum->logical -= bbio->orig_physical - physical;
+       else
+               sum->logical += physical - bbio->orig_physical;
 }
 
-void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered)
+static void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered,
+                                       u64 logical)
 {
-       struct btrfs_inode *inode = BTRFS_I(ordered->inode);
-       struct btrfs_fs_info *fs_info = inode->root->fs_info;
-       struct extent_map_tree *em_tree;
+       struct extent_map_tree *em_tree = &BTRFS_I(ordered->inode)->extent_tree;
        struct extent_map *em;
-       struct btrfs_ordered_sum *sum;
-       u64 orig_logical = ordered->disk_bytenr;
-       struct map_lookup *map;
-       u64 physical = ordered->physical;
-       u64 chunk_start_phys;
-       u64 logical;
-
-       em = btrfs_get_chunk_map(fs_info, orig_logical, 1);
-       if (IS_ERR(em))
-               return;
-       map = em->map_lookup;
-       chunk_start_phys = map->stripes[0].physical;
-
-       if (WARN_ON_ONCE(map->num_stripes > 1) ||
-           WARN_ON_ONCE((map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) != 0) ||
-           WARN_ON_ONCE(physical < chunk_start_phys) ||
-           WARN_ON_ONCE(physical > chunk_start_phys + em->orig_block_len)) {
-               free_extent_map(em);
-               return;
-       }
-       logical = em->start + (physical - map->stripes[0].physical);
-       free_extent_map(em);
-
-       if (orig_logical == logical)
-               return;
 
        ordered->disk_bytenr = logical;
 
-       em_tree = &inode->extent_tree;
        write_lock(&em_tree->lock);
        em = search_extent_mapping(em_tree, ordered->file_offset,
                                   ordered->num_bytes);
        em->block_start = logical;
        free_extent_map(em);
        write_unlock(&em_tree->lock);
+}
 
-       list_for_each_entry(sum, &ordered->list, list) {
-               if (logical < orig_logical)
-                       sum->bytenr -= orig_logical - logical;
-               else
-                       sum->bytenr += logical - orig_logical;
+static bool btrfs_zoned_split_ordered(struct btrfs_ordered_extent *ordered,
+                                     u64 logical, u64 len)
+{
+       struct btrfs_ordered_extent *new;
+
+       if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags) &&
+           split_extent_map(BTRFS_I(ordered->inode), ordered->file_offset,
+                            ordered->num_bytes, len, logical))
+               return false;
+
+       new = btrfs_split_ordered_extent(ordered, len);
+       if (IS_ERR(new))
+               return false;
+       new->disk_bytenr = logical;
+       btrfs_finish_one_ordered(new);
+       return true;
+}
+
+void btrfs_finish_ordered_zoned(struct btrfs_ordered_extent *ordered)
+{
+       struct btrfs_inode *inode = BTRFS_I(ordered->inode);
+       struct btrfs_fs_info *fs_info = inode->root->fs_info;
+       struct btrfs_ordered_sum *sum =
+               list_first_entry(&ordered->list, typeof(*sum), list);
+       u64 logical = sum->logical;
+       u64 len = sum->len;
+
+       while (len < ordered->disk_num_bytes) {
+               sum = list_next_entry(sum, list);
+               if (sum->logical == logical + len) {
+                       len += sum->len;
+                       continue;
+               }
+               if (!btrfs_zoned_split_ordered(ordered, logical, len)) {
+                       set_bit(BTRFS_ORDERED_IOERR, &ordered->flags);
+                       btrfs_err(fs_info, "failed to split ordered extent");
+                       goto out;
+               }
+               logical = sum->logical;
+               len = sum->len;
+       }
+
+       if (ordered->disk_bytenr != logical)
+               btrfs_rewrite_logical_zoned(ordered, logical);
+
+out:
+       /*
+        * If we end up here for nodatasum I/O, the btrfs_ordered_sum structures
+        * were allocated by btrfs_alloc_dummy_sum only to record the logical
+        * addresses and don't contain actual checksums.  We thus must free them
+        * here so that we don't attempt to log the csums later.
+        */
+       if ((inode->flags & BTRFS_INODE_NODATASUM) ||
+           test_bit(BTRFS_FS_STATE_NO_CSUMS, &fs_info->fs_state)) {
+               while ((sum = list_first_entry_or_null(&ordered->list,
+                                                      typeof(*sum), list))) {
+                       list_del(&sum->list);
+                       kfree(sum);
+               }
        }
 }
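
Note: the loop above walks the ordered extent's btrfs_ordered_sum list, extending the current run while each entry's logical address is contiguous with the previous one, and splitting the ordered extent whenever a discontinuity shows up. A rough userspace sketch of just that merge/split decision (values and names are illustrative, not taken from the patch):

#include <stdio.h>
#include <stdint.h>

struct sum { uint64_t logical; uint64_t len; };

int main(void)
{
	/* the third entry is discontiguous with the first run */
	struct sum sums[] = { {1000, 16}, {1016, 16}, {4096, 32}, {4128, 16} };
	size_t n = sizeof(sums) / sizeof(sums[0]);
	uint64_t logical = sums[0].logical, len = sums[0].len;

	for (size_t i = 1; i < n; i++) {
		if (sums[i].logical == logical + len) {
			len += sums[i].len;	/* still contiguous: grow the run */
			continue;
		}
		/* discontinuity: this is where the ordered extent would be split */
		printf("split: run at %llu, len %llu\n",
		       (unsigned long long)logical, (unsigned long long)len);
		logical = sums[i].logical;
		len = sums[i].len;
	}
	printf("final run at %llu, len %llu\n",
	       (unsigned long long)logical, (unsigned long long)len);
	return 0;
}

With the sample data this prints one split (run at 1000, len 32) followed by a final run at 4096 of len 48, mirroring how only the discontiguous tail gets carved into a new ordered extent.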
 
@@ -1792,8 +1799,8 @@ static int read_zone_info(struct btrfs_fs_info *fs_info, u64 logical,
        int nmirrors;
        int i, ret;
 
-       ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
-                              &mapped_length, &bioc);
+       ret = btrfs_map_block(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
+                             &mapped_length, &bioc, NULL, NULL, 1);
        if (ret || !bioc || mapped_length < PAGE_SIZE) {
                ret = -EIO;
                goto out_put_bioc;
index c0570d3..27322b9 100644 (file)
@@ -30,6 +30,8 @@ struct btrfs_zoned_device_info {
        struct blk_zone sb_zones[2 * BTRFS_SUPER_MIRROR_MAX];
 };
 
+void btrfs_finish_ordered_zoned(struct btrfs_ordered_extent *ordered);
+
 #ifdef CONFIG_BLK_DEV_ZONED
 int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
                       struct blk_zone *zone);
@@ -54,10 +56,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new);
 void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
 void btrfs_redirty_list_add(struct btrfs_transaction *trans,
                            struct extent_buffer *eb);
-void btrfs_free_redirty_list(struct btrfs_transaction *trans);
 bool btrfs_use_zone_append(struct btrfs_bio *bbio);
 void btrfs_record_physical_zoned(struct btrfs_bio *bbio);
-void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered);
 bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
                                    struct extent_buffer *eb,
                                    struct btrfs_block_group **cache_ret);
@@ -179,7 +179,6 @@ static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { }
 
 static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans,
                                          struct extent_buffer *eb) { }
-static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { }
 
 static inline bool btrfs_use_zone_append(struct btrfs_bio *bbio)
 {
@@ -190,9 +189,6 @@ static inline void btrfs_record_physical_zoned(struct btrfs_bio *bbio)
 {
 }
 
-static inline void btrfs_rewrite_logical_zoned(
-                               struct btrfs_ordered_extent *ordered) { }
-
 static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
                               struct extent_buffer *eb,
                               struct btrfs_block_group **cache_ret)
index f798da2..e7ac4ec 100644 (file)
@@ -356,7 +356,7 @@ struct list_head *zstd_alloc_workspace(unsigned int level)
        workspace->level = level;
        workspace->req_level = level;
        workspace->last_used = jiffies;
-       workspace->mem = kvmalloc(workspace->size, GFP_KERNEL);
+       workspace->mem = kvmalloc(workspace->size, GFP_KERNEL | __GFP_NOWARN);
        workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
        if (!workspace->mem || !workspace->buf)
                goto fail;
index a7fc561..cdd1002 100644 (file)
@@ -111,7 +111,6 @@ void buffer_check_dirty_writeback(struct folio *folio,
                bh = bh->b_this_page;
        } while (bh != head);
 }
-EXPORT_SYMBOL(buffer_check_dirty_writeback);
 
 /*
  * Block until a buffer comes unlocked.  This doesn't stop it
@@ -195,19 +194,19 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
        pgoff_t index;
        struct buffer_head *bh;
        struct buffer_head *head;
-       struct page *page;
+       struct folio *folio;
        int all_mapped = 1;
        static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1);
 
        index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
-       page = find_get_page_flags(bd_mapping, index, FGP_ACCESSED);
-       if (!page)
+       folio = __filemap_get_folio(bd_mapping, index, FGP_ACCESSED, 0);
+       if (IS_ERR(folio))
                goto out;
 
        spin_lock(&bd_mapping->private_lock);
-       if (!page_has_buffers(page))
+       head = folio_buffers(folio);
+       if (!head)
                goto out_unlock;
-       head = page_buffers(page);
        bh = head;
        do {
                if (!buffer_mapped(bh))
@@ -237,7 +236,7 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
        }
 out_unlock:
        spin_unlock(&bd_mapping->private_lock);
-       put_page(page);
+       folio_put(folio);
 out:
        return ret;
 }
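
Note the calling-convention change in this hunk: find_get_page_flags() returned NULL on a cache miss, while __filemap_get_folio() reports a miss through an error pointer, so the check becomes IS_ERR(folio). A minimal userspace illustration of the error-pointer idea (the helpers below are simplified stand-ins, not the kernel macros):

#include <stdio.h>
#include <errno.h>
#include <stdint.h>

/* simplified stand-ins for the kernel's ERR_PTR()/IS_ERR() helpers */
static inline void *err_ptr(long err) { return (void *)err; }
static inline int is_err(const void *p) { return (uintptr_t)p >= (uintptr_t)-4095; }
static inline long ptr_err(const void *p) { return (long)p; }

static void *lookup(int hit)
{
	static int entry;			/* pretend cache entry */
	return hit ? &entry : err_ptr(-ENOENT);	/* miss: error pointer, not NULL */
}

int main(void)
{
	void *p = lookup(0);

	if (is_err(p))
		printf("miss, error %ld\n", ptr_err(p));
	else
		printf("hit at %p\n", p);
	return 0;
}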
@@ -907,8 +906,8 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
 }
 EXPORT_SYMBOL_GPL(alloc_page_buffers);
 
-static inline void
-link_dev_buffers(struct page *page, struct buffer_head *head)
+static inline void link_dev_buffers(struct folio *folio,
+               struct buffer_head *head)
 {
        struct buffer_head *bh, *tail;
 
@@ -918,7 +917,7 @@ link_dev_buffers(struct page *page, struct buffer_head *head)
                bh = bh->b_this_page;
        } while (bh);
        tail->b_this_page = head;
-       attach_page_private(page, head);
+       folio_attach_private(folio, head);
 }
 
 static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
@@ -934,15 +933,14 @@ static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
 }
 
 /*
- * Initialise the state of a blockdev page's buffers.
+ * Initialise the state of a blockdev folio's buffers.
  */ 
-static sector_t
-init_page_buffers(struct page *page, struct block_device *bdev,
-                       sector_t block, int size)
+static sector_t folio_init_buffers(struct folio *folio,
+               struct block_device *bdev, sector_t block, int size)
 {
-       struct buffer_head *head = page_buffers(page);
+       struct buffer_head *head = folio_buffers(folio);
        struct buffer_head *bh = head;
-       int uptodate = PageUptodate(page);
+       bool uptodate = folio_test_uptodate(folio);
        sector_t end_block = blkdev_max_block(bdev, size);
 
        do {
@@ -976,7 +974,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
              pgoff_t index, int size, int sizebits, gfp_t gfp)
 {
        struct inode *inode = bdev->bd_inode;
-       struct page *page;
+       struct folio *folio;
        struct buffer_head *bh;
        sector_t end_block;
        int ret = 0;
@@ -992,42 +990,37 @@ grow_dev_page(struct block_device *bdev, sector_t block,
         */
        gfp_mask |= __GFP_NOFAIL;
 
-       page = find_or_create_page(inode->i_mapping, index, gfp_mask);
-
-       BUG_ON(!PageLocked(page));
+       folio = __filemap_get_folio(inode->i_mapping, index,
+                       FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp_mask);
 
-       if (page_has_buffers(page)) {
-               bh = page_buffers(page);
+       bh = folio_buffers(folio);
+       if (bh) {
                if (bh->b_size == size) {
-                       end_block = init_page_buffers(page, bdev,
-                                               (sector_t)index << sizebits,
-                                               size);
+                       end_block = folio_init_buffers(folio, bdev,
+                                       (sector_t)index << sizebits, size);
                        goto done;
                }
-               if (!try_to_free_buffers(page_folio(page)))
+               if (!try_to_free_buffers(folio))
                        goto failed;
        }
 
-       /*
-        * Allocate some buffers for this page
-        */
-       bh = alloc_page_buffers(page, size, true);
+       bh = folio_alloc_buffers(folio, size, true);
 
        /*
-        * Link the page to the buffers and initialise them.  Take the
+        * Link the folio to the buffers and initialise them.  Take the
         * lock to be atomic wrt __find_get_block(), which does not
-        * run under the page lock.
+        * run under the folio lock.
         */
        spin_lock(&inode->i_mapping->private_lock);
-       link_dev_buffers(page, bh);
-       end_block = init_page_buffers(page, bdev, (sector_t)index << sizebits,
-                       size);
+       link_dev_buffers(folio, bh);
+       end_block = folio_init_buffers(folio, bdev,
+                       (sector_t)index << sizebits, size);
        spin_unlock(&inode->i_mapping->private_lock);
 done:
        ret = (block < end_block) ? 1 : -ENXIO;
 failed:
-       unlock_page(page);
-       put_page(page);
+       folio_unlock(folio);
+       folio_put(folio);
        return ret;
 }
 
@@ -1764,7 +1757,7 @@ static struct buffer_head *folio_create_buffers(struct folio *folio,
  * WB_SYNC_ALL, the writes are posted using REQ_SYNC; this
  * causes the writes to be flagged as synchronous writes.
  */
-int __block_write_full_page(struct inode *inode, struct page *page,
+int __block_write_full_folio(struct inode *inode, struct folio *folio,
                        get_block_t *get_block, struct writeback_control *wbc,
                        bh_end_io_t *handler)
 {
@@ -1776,14 +1769,14 @@ int __block_write_full_page(struct inode *inode, struct page *page,
        int nr_underway = 0;
        blk_opf_t write_flags = wbc_to_write_flags(wbc);
 
-       head = folio_create_buffers(page_folio(page), inode,
+       head = folio_create_buffers(folio, inode,
                                    (1 << BH_Dirty) | (1 << BH_Uptodate));
 
        /*
         * Be very careful.  We have no exclusion from block_dirty_folio
         * here, and the (potentially unmapped) buffers may become dirty at
         * any time.  If a buffer becomes dirty here after we've inspected it
-        * then we just miss that fact, and the page stays dirty.
+        * then we just miss that fact, and the folio stays dirty.
         *
         * Buffers outside i_size may be dirtied by block_dirty_folio;
         * handle that here by just cleaning them.
@@ -1793,7 +1786,7 @@ int __block_write_full_page(struct inode *inode, struct page *page,
        blocksize = bh->b_size;
        bbits = block_size_bits(blocksize);
 
-       block = (sector_t)page->index << (PAGE_SHIFT - bbits);
+       block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
        last_block = (i_size_read(inode) - 1) >> bbits;
 
        /*
@@ -1804,7 +1797,7 @@ int __block_write_full_page(struct inode *inode, struct page *page,
                if (block > last_block) {
                        /*
                         * mapped buffers outside i_size will occur, because
-                        * this page can be outside i_size when there is a
+                        * this folio can be outside i_size when there is a
                         * truncate in progress.
                         */
                        /*
@@ -1834,7 +1827,7 @@ int __block_write_full_page(struct inode *inode, struct page *page,
                        continue;
                /*
                 * If it's a fully non-blocking write attempt and we cannot
-                * lock the buffer then redirty the page.  Note that this can
+                * lock the buffer then redirty the folio.  Note that this can
                 * potentially cause a busy-wait loop from writeback threads
                 * and kswapd activity, but those code paths have their own
                 * higher-level throttling.
@@ -1842,7 +1835,7 @@ int __block_write_full_page(struct inode *inode, struct page *page,
                if (wbc->sync_mode != WB_SYNC_NONE) {
                        lock_buffer(bh);
                } else if (!trylock_buffer(bh)) {
-                       redirty_page_for_writepage(wbc, page);
+                       folio_redirty_for_writepage(wbc, folio);
                        continue;
                }
                if (test_clear_buffer_dirty(bh)) {
@@ -1853,11 +1846,11 @@ int __block_write_full_page(struct inode *inode, struct page *page,
        } while ((bh = bh->b_this_page) != head);
 
        /*
-        * The page and its buffers are protected by PageWriteback(), so we can
-        * drop the bh refcounts early.
+        * The folio and its buffers are protected by the writeback flag,
+        * so we can drop the bh refcounts early.
         */
-       BUG_ON(PageWriteback(page));
-       set_page_writeback(page);
+       BUG_ON(folio_test_writeback(folio));
+       folio_start_writeback(folio);
 
        do {
                struct buffer_head *next = bh->b_this_page;
@@ -1867,20 +1860,20 @@ int __block_write_full_page(struct inode *inode, struct page *page,
                }
                bh = next;
        } while (bh != head);
-       unlock_page(page);
+       folio_unlock(folio);
 
        err = 0;
 done:
        if (nr_underway == 0) {
                /*
-                * The page was marked dirty, but the buffers were
+                * The folio was marked dirty, but the buffers were
                 * clean.  Someone wrote them back by hand with
                 * write_dirty_buffer/submit_bh.  A rare case.
                 */
-               end_page_writeback(page);
+               folio_end_writeback(folio);
 
                /*
-                * The page and buffer_heads can be released at any time from
+                * The folio and buffer_heads can be released at any time from
                 * here on.
                 */
        }
@@ -1891,7 +1884,7 @@ recover:
         * ENOSPC, or some other error.  We may already have added some
         * blocks to the file, so we need to write these out to avoid
         * exposing stale data.
-        * The page is currently locked and not marked for writeback
+        * The folio is currently locked and not marked for writeback
         */
        bh = head;
        /* Recovery: lock and submit the mapped buffers */
@@ -1903,15 +1896,15 @@ recover:
                } else {
                        /*
                         * The buffer may have been set dirty during
-                        * attachment to a dirty page.
+                        * attachment to a dirty folio.
                         */
                        clear_buffer_dirty(bh);
                }
        } while ((bh = bh->b_this_page) != head);
-       SetPageError(page);
-       BUG_ON(PageWriteback(page));
-       mapping_set_error(page->mapping, err);
-       set_page_writeback(page);
+       folio_set_error(folio);
+       BUG_ON(folio_test_writeback(folio));
+       mapping_set_error(folio->mapping, err);
+       folio_start_writeback(folio);
        do {
                struct buffer_head *next = bh->b_this_page;
                if (buffer_async_write(bh)) {
@@ -1921,39 +1914,40 @@ recover:
                }
                bh = next;
        } while (bh != head);
-       unlock_page(page);
+       folio_unlock(folio);
        goto done;
 }
-EXPORT_SYMBOL(__block_write_full_page);
+EXPORT_SYMBOL(__block_write_full_folio);
 
 /*
- * If a page has any new buffers, zero them out here, and mark them uptodate
+ * If a folio has any new buffers, zero them out here, and mark them uptodate
  * and dirty so they'll be written out (in order to prevent uninitialised
  * block data from leaking). And clear the new bit.
  */
-void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
+void folio_zero_new_buffers(struct folio *folio, size_t from, size_t to)
 {
-       unsigned int block_start, block_end;
+       size_t block_start, block_end;
        struct buffer_head *head, *bh;
 
-       BUG_ON(!PageLocked(page));
-       if (!page_has_buffers(page))
+       BUG_ON(!folio_test_locked(folio));
+       head = folio_buffers(folio);
+       if (!head)
                return;
 
-       bh = head = page_buffers(page);
+       bh = head;
        block_start = 0;
        do {
                block_end = block_start + bh->b_size;
 
                if (buffer_new(bh)) {
                        if (block_end > from && block_start < to) {
-                               if (!PageUptodate(page)) {
-                                       unsigned start, size;
+                               if (!folio_test_uptodate(folio)) {
+                                       size_t start, xend;
 
                                        start = max(from, block_start);
-                                       size = min(to, block_end) - start;
+                                       xend = min(to, block_end);
 
-                                       zero_user(page, start, size);
+                                       folio_zero_segment(folio, start, xend);
                                        set_buffer_uptodate(bh);
                                }
 
@@ -1966,7 +1960,7 @@ void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
                bh = bh->b_this_page;
        } while (bh != head);
 }
-EXPORT_SYMBOL(page_zero_new_buffers);
+EXPORT_SYMBOL(folio_zero_new_buffers);
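
The helper also swaps zero_user(page, start, size), which takes an offset and a length, for folio_zero_segment(folio, start, xend), which takes a start and an exclusive end offset. The bytes cleared are the same; only the meaning of the second value changes. A quick arithmetic check with made-up numbers:

#include <stdio.h>

int main(void)
{
	unsigned from = 100, to = 300;			/* bytes touched by the write */
	unsigned block_start = 0, block_end = 512;	/* a freshly allocated block */

	unsigned start = from > block_start ? from : block_start;
	unsigned size  = (to < block_end ? to : block_end) - start;	/* old: length */
	unsigned xend  = to < block_end ? to : block_end;		/* new: end offset */

	printf("old zeroes [%u, %u), new zeroes [%u, %u)\n",
	       start, start + size, start, xend);
	return 0;
}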
 
 static void
 iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
@@ -2104,7 +2098,7 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
                        err = -EIO;
        }
        if (unlikely(err))
-               page_zero_new_buffers(&folio->page, from, to);
+               folio_zero_new_buffers(folio, from, to);
        return err;
 }
 
@@ -2116,15 +2110,15 @@ int __block_write_begin(struct page *page, loff_t pos, unsigned len,
 }
 EXPORT_SYMBOL(__block_write_begin);
 
-static int __block_commit_write(struct inode *inode, struct page *page,
-               unsigned from, unsigned to)
+static int __block_commit_write(struct inode *inode, struct folio *folio,
+               size_t from, size_t to)
 {
-       unsigned block_start, block_end;
-       int partial = 0;
+       size_t block_start, block_end;
+       bool partial = false;
        unsigned blocksize;
        struct buffer_head *bh, *head;
 
-       bh = head = page_buffers(page);
+       bh = head = folio_buffers(folio);
        blocksize = bh->b_size;
 
        block_start = 0;
@@ -2132,7 +2126,7 @@ static int __block_commit_write(struct inode *inode, struct page *page,
                block_end = block_start + blocksize;
                if (block_end <= from || block_start >= to) {
                        if (!buffer_uptodate(bh))
-                               partial = 1;
+                               partial = true;
                } else {
                        set_buffer_uptodate(bh);
                        mark_buffer_dirty(bh);
@@ -2147,11 +2141,11 @@ static int __block_commit_write(struct inode *inode, struct page *page,
        /*
         * If this is a partial write which happened to make all buffers
         * uptodate then we can optimize away a bogus read_folio() for
-        * the next read(). Here we 'discover' whether the page went
+        * the next read(). Here we 'discover' whether the folio went
         * uptodate as a result of this (potentially partial) write.
         */
        if (!partial)
-               SetPageUptodate(page);
+               folio_mark_uptodate(folio);
        return 0;
 }
 
@@ -2188,10 +2182,9 @@ int block_write_end(struct file *file, struct address_space *mapping,
                        loff_t pos, unsigned len, unsigned copied,
                        struct page *page, void *fsdata)
 {
+       struct folio *folio = page_folio(page);
        struct inode *inode = mapping->host;
-       unsigned start;
-
-       start = pos & (PAGE_SIZE - 1);
+       size_t start = pos - folio_pos(folio);
 
        if (unlikely(copied < len)) {
                /*
@@ -2203,18 +2196,18 @@ int block_write_end(struct file *file, struct address_space *mapping,
                 * read_folio might come in and destroy our partial write.
                 *
                 * Do the simplest thing, and just treat any short write to a
-                * non uptodate page as a zero-length write, and force the
+                * non uptodate folio as a zero-length write, and force the
                 * caller to redo the whole thing.
                 */
-               if (!PageUptodate(page))
+               if (!folio_test_uptodate(folio))
                        copied = 0;
 
-               page_zero_new_buffers(page, start+copied, start+len);
+               folio_zero_new_buffers(folio, start+copied, start+len);
        }
-       flush_dcache_page(page);
+       flush_dcache_folio(folio);
 
        /* This could be a short (even 0-length) commit */
-       __block_commit_write(inode, page, start, start+copied);
+       __block_commit_write(inode, folio, start, start + copied);
 
        return copied;
 }
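
Computing the in-folio offset as pos - folio_pos(folio) rather than pos & (PAGE_SIZE - 1) keeps block_write_end() correct for large folios, where the write position can sit more than one page past the folio's start. Rough numbers (illustrative only):

#include <stdio.h>

#define PAGE_SIZE 4096ULL

int main(void)
{
	unsigned long long folio_pos = 65536;	/* a large folio starting at 64 KiB */
	unsigned long long pos = 65536 + 9000;	/* write lands in the folio's third page */

	printf("pos & (PAGE_SIZE - 1)  = %llu\n", pos & (PAGE_SIZE - 1));
	printf("pos - folio_pos(folio) = %llu\n", pos - folio_pos);
	return 0;
}

The masked value (808) would point at the wrong byte inside a multi-page folio, while the subtraction gives the intended 9000.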
@@ -2537,8 +2530,9 @@ EXPORT_SYMBOL(cont_write_begin);
 
 int block_commit_write(struct page *page, unsigned from, unsigned to)
 {
-       struct inode *inode = page->mapping->host;
-       __block_commit_write(inode,page,from,to);
+       struct folio *folio = page_folio(page);
+       struct inode *inode = folio->mapping->host;
+       __block_commit_write(inode, folio, from, to);
        return 0;
 }
 EXPORT_SYMBOL(block_commit_write);
@@ -2564,38 +2558,37 @@ EXPORT_SYMBOL(block_commit_write);
 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
                         get_block_t get_block)
 {
-       struct page *page = vmf->page;
+       struct folio *folio = page_folio(vmf->page);
        struct inode *inode = file_inode(vma->vm_file);
        unsigned long end;
        loff_t size;
        int ret;
 
-       lock_page(page);
+       folio_lock(folio);
        size = i_size_read(inode);
-       if ((page->mapping != inode->i_mapping) ||
-           (page_offset(page) > size)) {
+       if ((folio->mapping != inode->i_mapping) ||
+           (folio_pos(folio) >= size)) {
                /* We overload EFAULT to mean page got truncated */
                ret = -EFAULT;
                goto out_unlock;
        }
 
-       /* page is wholly or partially inside EOF */
-       if (((page->index + 1) << PAGE_SHIFT) > size)
-               end = size & ~PAGE_MASK;
-       else
-               end = PAGE_SIZE;
+       end = folio_size(folio);
+       /* folio is wholly or partially inside EOF */
+       if (folio_pos(folio) + end > size)
+               end = size - folio_pos(folio);
 
-       ret = __block_write_begin(page, 0, end, get_block);
+       ret = __block_write_begin_int(folio, 0, end, get_block, NULL);
        if (!ret)
-               ret = block_commit_write(page, 0, end);
+               ret = __block_commit_write(inode, folio, 0, end);
 
        if (unlikely(ret < 0))
                goto out_unlock;
-       set_page_dirty(page);
-       wait_for_stable_page(page);
+       folio_mark_dirty(folio);
+       folio_wait_stable(folio);
        return 0;
 out_unlock:
-       unlock_page(page);
+       folio_unlock(folio);
        return ret;
 }
 EXPORT_SYMBOL(block_page_mkwrite);
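
The rewritten EOF handling clamps the write-protect range to whichever is smaller, the folio size or the bytes remaining before i_size, which generalizes the old single-page logic to arbitrary folio sizes. For example (numbers are illustrative):

#include <stdio.h>

int main(void)
{
	unsigned long long size = 70000;	/* i_size */
	unsigned long long folio_pos = 65536;	/* folio starts inside EOF */
	unsigned long long folio_size = 16384;	/* 4-page folio */
	unsigned long long end = folio_size;

	if (folio_pos + end > size)
		end = size - folio_pos;		/* folio straddles EOF: clamp */

	printf("prepare and commit bytes [0, %llu) of the folio\n", end);
	return 0;
}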
@@ -2604,17 +2597,16 @@ int block_truncate_page(struct address_space *mapping,
                        loff_t from, get_block_t *get_block)
 {
        pgoff_t index = from >> PAGE_SHIFT;
-       unsigned offset = from & (PAGE_SIZE-1);
        unsigned blocksize;
        sector_t iblock;
-       unsigned length, pos;
+       size_t offset, length, pos;
        struct inode *inode = mapping->host;
-       struct page *page;
+       struct folio *folio;
        struct buffer_head *bh;
        int err = 0;
 
        blocksize = i_blocksize(inode);
-       length = offset & (blocksize - 1);
+       length = from & (blocksize - 1);
 
        /* Block boundary? Nothing to do */
        if (!length)
@@ -2623,15 +2615,18 @@ int block_truncate_page(struct address_space *mapping,
        length = blocksize - length;
        iblock = (sector_t)index << (PAGE_SHIFT - inode->i_blkbits);
        
-       page = grab_cache_page(mapping, index);
-       if (!page)
-               return -ENOMEM;
+       folio = filemap_grab_folio(mapping, index);
+       if (IS_ERR(folio))
+               return PTR_ERR(folio);
 
-       if (!page_has_buffers(page))
-               create_empty_buffers(page, blocksize, 0);
+       bh = folio_buffers(folio);
+       if (!bh) {
+               folio_create_empty_buffers(folio, blocksize, 0);
+               bh = folio_buffers(folio);
+       }
 
        /* Find the buffer that contains "offset" */
-       bh = page_buffers(page);
+       offset = offset_in_folio(folio, from);
        pos = blocksize;
        while (offset >= pos) {
                bh = bh->b_this_page;
@@ -2650,7 +2645,7 @@ int block_truncate_page(struct address_space *mapping,
        }
 
        /* Ok, it's mapped. Make sure it's up-to-date */
-       if (PageUptodate(page))
+       if (folio_test_uptodate(folio))
                set_buffer_uptodate(bh);
 
        if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) {
@@ -2660,12 +2655,12 @@ int block_truncate_page(struct address_space *mapping,
                        goto unlock;
        }
 
-       zero_user(page, offset, length);
+       folio_zero_range(folio, offset, length);
        mark_buffer_dirty(bh);
 
 unlock:
-       unlock_page(page);
-       put_page(page);
+       folio_unlock(folio);
+       folio_put(folio);
 
        return err;
 }
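
block_truncate_page() still only zeroes the tail of the block containing "from": from & (blocksize - 1) is the offset inside that block, blocksize minus that is how much remains to the block boundary, and offset_in_folio() now locates the byte inside a possibly multi-page folio so the right buffer_head can be found. A small worked example with assumed numbers:

#include <stdio.h>

int main(void)
{
	unsigned long long from = 10000;	/* new i_size */
	unsigned blocksize = 1024;

	unsigned length = from & (blocksize - 1);	/* 784: offset inside the block */
	if (!length) {
		printf("block aligned, nothing to zero\n");
		return 0;
	}
	length = blocksize - length;			/* 240 bytes to zero */

	unsigned long long folio_start = 8192;		/* folio containing 'from' */
	unsigned long long offset = from - folio_start;	/* offset_in_folio() */

	printf("zero %u bytes starting at folio offset %llu\n", length, offset);
	return 0;
}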
@@ -2677,33 +2672,32 @@ EXPORT_SYMBOL(block_truncate_page);
 int block_write_full_page(struct page *page, get_block_t *get_block,
                        struct writeback_control *wbc)
 {
-       struct inode * const inode = page->mapping->host;
+       struct folio *folio = page_folio(page);
+       struct inode * const inode = folio->mapping->host;
        loff_t i_size = i_size_read(inode);
-       const pgoff_t end_index = i_size >> PAGE_SHIFT;
-       unsigned offset;
 
-       /* Is the page fully inside i_size? */
-       if (page->index < end_index)
-               return __block_write_full_page(inode, page, get_block, wbc,
+       /* Is the folio fully inside i_size? */
+       if (folio_pos(folio) + folio_size(folio) <= i_size)
+               return __block_write_full_folio(inode, folio, get_block, wbc,
                                               end_buffer_async_write);
 
-       /* Is the page fully outside i_size? (truncate in progress) */
-       offset = i_size & (PAGE_SIZE-1);
-       if (page->index >= end_index+1 || !offset) {
-               unlock_page(page);
+       /* Is the folio fully outside i_size? (truncate in progress) */
+       if (folio_pos(folio) >= i_size) {
+               folio_unlock(folio);
                return 0; /* don't care */
        }
 
        /*
-        * The page straddles i_size.  It must be zeroed out on each and every
+        * The folio straddles i_size.  It must be zeroed out on each and every
         * writepage invocation because it may be mmapped.  "A file is mapped
         * in multiples of the page size.  For a file that is not a multiple of
-        * the  page size, the remaining memory is zeroed when mapped, and
+        * the page size, the remaining memory is zeroed when mapped, and
         * writes to that region are not written out to the file."
         */
-       zero_user_segment(page, offset, PAGE_SIZE);
-       return __block_write_full_page(inode, page, get_block, wbc,
-                                                       end_buffer_async_write);
+       folio_zero_segment(folio, offset_in_folio(folio, i_size),
+                       folio_size(folio));
+       return __block_write_full_folio(inode, folio, get_block, wbc,
+                       end_buffer_async_write);
 }
 EXPORT_SYMBOL(block_write_full_page);
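
With folios the three cases are expressed purely in byte terms: the folio is written out as-is when it ends at or before i_size, skipped when it starts at or beyond i_size, and zero-padded from offset_in_folio(folio, i_size) to its end when it straddles i_size. A tiny classifier over illustrative values:

#include <stdio.h>

static const char *classify(unsigned long long pos, unsigned long long size,
			    unsigned long long i_size)
{
	if (pos + size <= i_size)
		return "fully inside i_size: write in full";
	if (pos >= i_size)
		return "fully outside i_size: skip (truncate in progress)";
	return "straddles i_size: zero the tail, then write";
}

int main(void)
{
	unsigned long long i_size = 70000;

	printf("%s\n", classify(0,     16384, i_size));
	printf("%s\n", classify(65536, 16384, i_size));
	printf("%s\n", classify(81920, 16384, i_size));
	return 0;
}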
 
@@ -2760,8 +2754,7 @@ static void submit_bh_wbc(blk_opf_t opf, struct buffer_head *bh,
 
        bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 
-       bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
-       BUG_ON(bio->bi_iter.bi_size != bh->b_size);
+       __bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
 
        bio->bi_end_io = end_bio_bh_io_sync;
        bio->bi_private = bh;
index 82219a8..d9d22d0 100644 (file)
@@ -451,9 +451,10 @@ struct file *cachefiles_create_tmpfile(struct cachefiles_object *object)
 
        ret = cachefiles_inject_write_error();
        if (ret == 0) {
-               file = vfs_tmpfile_open(&nop_mnt_idmap, &parentpath, S_IFREG,
-                                       O_RDWR | O_LARGEFILE | O_DIRECT,
-                                       cache->cache_cred);
+               file = kernel_tmpfile_open(&nop_mnt_idmap, &parentpath,
+                                          S_IFREG | 0600,
+                                          O_RDWR | O_LARGEFILE | O_DIRECT,
+                                          cache->cache_cred);
                ret = PTR_ERR_OR_ZERO(file);
        }
        if (ret) {
@@ -560,8 +561,8 @@ static bool cachefiles_open_file(struct cachefiles_object *object,
         */
        path.mnt = cache->mnt;
        path.dentry = dentry;
-       file = open_with_fake_path(&path, O_RDWR | O_LARGEFILE | O_DIRECT,
-                                  d_backing_inode(dentry), cache->cache_cred);
+       file = kernel_file_open(&path, O_RDWR | O_LARGEFILE | O_DIRECT,
+                               d_backing_inode(dentry), cache->cache_cred);
        if (IS_ERR(file)) {
                trace_cachefiles_vfs_error(object, d_backing_inode(dentry),
                                           PTR_ERR(file),
index f4d8bf7..b192523 100644 (file)
@@ -1746,6 +1746,69 @@ again:
 }
 
 /*
+ * Wrap filemap_splice_read with checks for cap bits on the inode.
+ * Atomically grab references, so that those bits are not released
+ * back to the MDS mid-read.
+ */
+static ssize_t ceph_splice_read(struct file *in, loff_t *ppos,
+                               struct pipe_inode_info *pipe,
+                               size_t len, unsigned int flags)
+{
+       struct ceph_file_info *fi = in->private_data;
+       struct inode *inode = file_inode(in);
+       struct ceph_inode_info *ci = ceph_inode(inode);
+       ssize_t ret;
+       int want = 0, got = 0;
+       CEPH_DEFINE_RW_CONTEXT(rw_ctx, 0);
+
+       dout("splice_read %p %llx.%llx %llu~%zu trying to get caps on %p\n",
+            inode, ceph_vinop(inode), *ppos, len, inode);
+
+       if (ceph_inode_is_shutdown(inode))
+               return -ESTALE;
+
+       if (ceph_has_inline_data(ci) ||
+           (fi->flags & CEPH_F_SYNC))
+               return copy_splice_read(in, ppos, pipe, len, flags);
+
+       ceph_start_io_read(inode);
+
+       want = CEPH_CAP_FILE_CACHE;
+       if (fi->fmode & CEPH_FILE_MODE_LAZY)
+               want |= CEPH_CAP_FILE_LAZYIO;
+
+       ret = ceph_get_caps(in, CEPH_CAP_FILE_RD, want, -1, &got);
+       if (ret < 0)
+               goto out_end;
+
+       if ((got & (CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO)) == 0) {
+               dout("splice_read/sync %p %llx.%llx %llu~%zu got cap refs on %s\n",
+                    inode, ceph_vinop(inode), *ppos, len,
+                    ceph_cap_string(got));
+
+               ceph_put_cap_refs(ci, got);
+               ceph_end_io_read(inode);
+               return copy_splice_read(in, ppos, pipe, len, flags);
+       }
+
+       dout("splice_read %p %llx.%llx %llu~%zu got cap refs on %s\n",
+            inode, ceph_vinop(inode), *ppos, len, ceph_cap_string(got));
+
+       rw_ctx.caps = got;
+       ceph_add_rw_context(fi, &rw_ctx);
+       ret = filemap_splice_read(in, ppos, pipe, len, flags);
+       ceph_del_rw_context(fi, &rw_ctx);
+
+       dout("splice_read %p %llx.%llx dropping cap refs on %s = %zd\n",
+            inode, ceph_vinop(inode), ceph_cap_string(got), ret);
+
+       ceph_put_cap_refs(ci, got);
+out_end:
+       ceph_end_io_read(inode);
+       return ret;
+}
+
+/*
  * Take cap references to avoid releasing caps to MDS mid-write.
  *
  * If we are synchronous, and write with an old snap context, the OSD
@@ -1791,9 +1854,6 @@ retry_snap:
        else
                ceph_start_io_write(inode);
 
-       /* We can write back this queue in page reclaim */
-       current->backing_dev_info = inode_to_bdi(inode);
-
        if (iocb->ki_flags & IOCB_APPEND) {
                err = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false);
                if (err < 0)
@@ -1894,8 +1954,6 @@ retry_snap:
                 * can not run at the same time
                 */
                written = generic_perform_write(iocb, from);
-               if (likely(written >= 0))
-                       iocb->ki_pos = pos + written;
                ceph_end_io_write(inode);
        }
 
@@ -1940,7 +1998,6 @@ out:
                ceph_end_io_write(inode);
 out_unlocked:
        ceph_free_cap_flush(prealloc_cf);
-       current->backing_dev_info = NULL;
        return written ? written : err;
 }
 
@@ -2593,7 +2650,7 @@ const struct file_operations ceph_file_fops = {
        .lock = ceph_lock,
        .setlease = simple_nosetlease,
        .flock = ceph_flock,
-       .splice_read = generic_file_splice_read,
+       .splice_read = ceph_splice_read,
        .splice_write = iter_file_splice_write,
        .unlocked_ioctl = ceph_ioctl,
        .compat_ioctl = compat_ptr_ioctl,
index 13deb45..950b691 100644 (file)
@@ -150,7 +150,7 @@ __register_chrdev_region(unsigned int major, unsigned int baseminor,
        cd->major = major;
        cd->baseminor = baseminor;
        cd->minorct = minorct;
-       strlcpy(cd->name, name, sizeof(cd->name));
+       strscpy(cd->name, name, sizeof(cd->name));
 
        if (!prev) {
                cd->next = curr;
index 3f3c81e..12b26bd 100644 (file)
@@ -23,6 +23,7 @@
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 #include <linux/uio.h>
+#include <linux/splice.h>
 
 #include <linux/coda.h>
 #include "coda_psdev.h"
@@ -94,6 +95,32 @@ finish_write:
        return ret;
 }
 
+static ssize_t
+coda_file_splice_read(struct file *coda_file, loff_t *ppos,
+                     struct pipe_inode_info *pipe,
+                     size_t len, unsigned int flags)
+{
+       struct inode *coda_inode = file_inode(coda_file);
+       struct coda_file_info *cfi = coda_ftoc(coda_file);
+       struct file *in = cfi->cfi_container;
+       loff_t ki_pos = *ppos;
+       ssize_t ret;
+
+       ret = venus_access_intent(coda_inode->i_sb, coda_i2f(coda_inode),
+                                 &cfi->cfi_access_intent,
+                                 len, ki_pos, CODA_ACCESS_TYPE_READ);
+       if (ret)
+               goto finish_read;
+
+       ret = vfs_splice_read(in, ppos, pipe, len, flags);
+
+finish_read:
+       venus_access_intent(coda_inode->i_sb, coda_i2f(coda_inode),
+                           &cfi->cfi_access_intent,
+                           len, ki_pos, CODA_ACCESS_TYPE_READ_FINISH);
+       return ret;
+}
+
 static void
 coda_vm_open(struct vm_area_struct *vma)
 {
@@ -302,5 +329,5 @@ const struct file_operations coda_file_operations = {
        .open           = coda_open,
        .release        = coda_release,
        .fsync          = coda_fsync,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = coda_file_splice_read,
 };
index 88740c5..9d235fa 100644 (file)
@@ -648,7 +648,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
        } else {
                struct mnt_idmap *idmap;
                struct inode *inode;
-               int open_flags = O_CREAT | O_RDWR | O_NOFOLLOW |
+               int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
                                 O_LARGEFILE | O_EXCL;
 
                if (cprm.limit < binfmt->min_coredump)
index 006ef68..27c6597 100644 (file)
@@ -473,7 +473,7 @@ static unsigned int cramfs_physmem_mmap_capabilities(struct file *file)
 static const struct file_operations cramfs_physmem_fops = {
        .llseek                 = generic_file_llseek,
        .read_iter              = generic_file_read_iter,
-       .splice_read            = generic_file_splice_read,
+       .splice_read            = filemap_splice_read,
        .mmap                   = cramfs_physmem_mmap,
 #ifndef CONFIG_MMU
        .get_unmapped_area      = cramfs_physmem_get_unmapped_area,
index 7ab5a7b..2d63da4 100644 (file)
@@ -171,7 +171,7 @@ fscrypt_policy_flags(const union fscrypt_policy *policy)
  */
 struct fscrypt_symlink_data {
        __le16 len;
-       char encrypted_path[1];
+       char encrypted_path[];
 } __packed;
 
 /**
index 9e786ae..6238dbc 100644 (file)
@@ -255,10 +255,10 @@ int fscrypt_prepare_symlink(struct inode *dir, const char *target,
         * for now since filesystems will assume it is there and subtract it.
         */
        if (!__fscrypt_fname_encrypted_size(policy, len,
-                                           max_len - sizeof(struct fscrypt_symlink_data),
+                                           max_len - sizeof(struct fscrypt_symlink_data) - 1,
                                            &disk_link->len))
                return -ENAMETOOLONG;
-       disk_link->len += sizeof(struct fscrypt_symlink_data);
+       disk_link->len += sizeof(struct fscrypt_symlink_data) + 1;
 
        disk_link->name = NULL;
        return 0;
@@ -289,7 +289,7 @@ int __fscrypt_encrypt_symlink(struct inode *inode, const char *target,
                if (!sd)
                        return -ENOMEM;
        }
-       ciphertext_len = disk_link->len - sizeof(*sd);
+       ciphertext_len = disk_link->len - sizeof(*sd) - 1;
        sd->len = cpu_to_le16(ciphertext_len);
 
        err = fscrypt_fname_encrypt(inode, &iname, sd->encrypted_path,
@@ -367,7 +367,7 @@ const char *fscrypt_get_symlink(struct inode *inode, const void *caddr,
         * the ciphertext length, even though this is redundant with i_size.
         */
 
-       if (max_size < sizeof(*sd))
+       if (max_size < sizeof(*sd) + 1)
                return ERR_PTR(-EUCLEAN);
        sd = caddr;
        cstr.name = (unsigned char *)sd->encrypted_path;
@@ -376,7 +376,7 @@ const char *fscrypt_get_symlink(struct inode *inode, const void *caddr,
        if (cstr.len == 0)
                return ERR_PTR(-EUCLEAN);
 
-       if (cstr.len + sizeof(*sd) - 1 > max_size)
+       if (cstr.len + sizeof(*sd) > max_size)
                return ERR_PTR(-EUCLEAN);
 
        err = fscrypt_fname_alloc_buffer(cstr.len, &pstr);
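
The switch from a one-byte encrypted_path[1] array to a flexible array member shrinks sizeof(struct fscrypt_symlink_data) from 3 to 2, which is why every size calculation in this file gains or loses a byte: the reserved space (2-byte length, ciphertext, trailing NUL) stays exactly the same. A standalone check of the two layouts (mirroring the shape, not the real header):

#include <stdio.h>
#include <stdint.h>

struct symlink_data_old {
	uint16_t len;
	char encrypted_path[1];
} __attribute__((packed));

struct symlink_data_new {
	uint16_t len;
	char encrypted_path[];		/* flexible array member */
} __attribute__((packed));

int main(void)
{
	printf("old sizeof = %zu, new sizeof = %zu\n",
	       sizeof(struct symlink_data_old), sizeof(struct symlink_data_new));
	/* e.g. a 100-byte ciphertext needs sizeof(new) + 100 + 1 bytes overall,
	 * exactly what sizeof(old) + 100 used to reserve. */
	return 0;
}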
index 56a6ee4..5f4da5c 100644 (file)
@@ -7,6 +7,7 @@
 #include <linux/slab.h>
 #include <linux/prefetch.h>
 #include "mount.h"
+#include "internal.h"
 
 struct prepend_buffer {
        char *buf;
index 0b380bb..7bc494e 100644 (file)
@@ -42,8 +42,8 @@
 #include "internal.h"
 
 /*
- * How many user pages to map in one call to get_user_pages().  This determines
- * the size of a structure in the slab cache
+ * How many user pages to map in one call to iov_iter_extract_pages().  This
+ * determines the size of a structure in the slab cache
  */
 #define DIO_PAGES      64
 
@@ -121,12 +121,13 @@ struct dio {
        struct inode *inode;
        loff_t i_size;                  /* i_size when submitted */
        dio_iodone_t *end_io;           /* IO completion function */
+       bool is_pinned;                 /* T if we have pins on the pages */
 
        void *private;                  /* copy from map_bh.b_private */
 
        /* BIO completion state */
        spinlock_t bio_lock;            /* protects BIO fields below */
-       int page_errors;                /* errno from get_user_pages() */
+       int page_errors;                /* err from iov_iter_extract_pages() */
        int is_async;                   /* is IO async ? */
        bool defer_completion;          /* defer AIO completion to workqueue? */
        bool should_dirty;              /* if pages should be dirtied */
@@ -165,14 +166,14 @@ static inline unsigned dio_pages_present(struct dio_submit *sdio)
  */
 static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 {
+       struct page **pages = dio->pages;
        const enum req_op dio_op = dio->opf & REQ_OP_MASK;
        ssize_t ret;
 
-       ret = iov_iter_get_pages2(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
-                               &sdio->from);
+       ret = iov_iter_extract_pages(sdio->iter, &pages, LONG_MAX,
+                                    DIO_PAGES, 0, &sdio->from);
 
        if (ret < 0 && sdio->blocks_available && dio_op == REQ_OP_WRITE) {
-               struct page *page = ZERO_PAGE(0);
                /*
                 * A memory fault, but the filesystem has some outstanding
                 * mapped blocks.  We need to use those blocks up to avoid
@@ -180,8 +181,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
                 */
                if (dio->page_errors == 0)
                        dio->page_errors = ret;
-               get_page(page);
-               dio->pages[0] = page;
+               dio->pages[0] = ZERO_PAGE(0);
                sdio->head = 0;
                sdio->tail = 1;
                sdio->from = 0;
@@ -201,9 +201,9 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 
 /*
  * Get another userspace page.  Returns an ERR_PTR on error.  Pages are
- * buffered inside the dio so that we can call get_user_pages() against a
- * decent number of pages, less frequently.  To provide nicer use of the
- * L1 cache.
+ * buffered inside the dio so that we can call iov_iter_extract_pages()
+ * against a decent number of pages, less frequently.  To provide nicer use of
+ * the L1 cache.
  */
 static inline struct page *dio_get_page(struct dio *dio,
                                        struct dio_submit *sdio)
@@ -219,6 +219,18 @@ static inline struct page *dio_get_page(struct dio *dio,
        return dio->pages[sdio->head];
 }
 
+static void dio_pin_page(struct dio *dio, struct page *page)
+{
+       if (dio->is_pinned)
+               folio_add_pin(page_folio(page));
+}
+
+static void dio_unpin_page(struct dio *dio, struct page *page)
+{
+       if (dio->is_pinned)
+               unpin_user_page(page);
+}
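
Whether these helpers do anything at all is decided once per dio from iov_iter_extract_will_pin() further down: roughly, pages extracted from user-backed iterators carry a pin that must be dropped, while kernel-backed iterators hand pages over without one and the helpers become no-ops. A loose userspace sketch of that conditional-release pattern (stand-in types and functions, not the kernel API):

#include <stdbool.h>
#include <stdio.h>

struct page_ref { int pins; };

/* stand-ins for dio_pin_page()/dio_unpin_page(): only touch the pin
 * count when the iterator type required a pin in the first place */
static void maybe_pin(bool is_pinned, struct page_ref *p)
{
	if (is_pinned)
		p->pins++;
}

static void maybe_unpin(bool is_pinned, struct page_ref *p)
{
	if (is_pinned)
		p->pins--;
}

int main(void)
{
	struct page_ref user_page = { .pins = 1 };	/* pinned at extraction */
	struct page_ref kernel_page = { .pins = 0 };	/* no pin taken */

	maybe_pin(true, &user_page);	/* extra pin while queued in the dio */
	maybe_unpin(true, &user_page);
	maybe_unpin(true, &user_page);	/* drop the extraction pin at completion */

	maybe_unpin(false, &kernel_page);	/* no-op for kernel memory */

	printf("user %d, kernel %d\n", user_page.pins, kernel_page.pins);
	return 0;
}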
+
 /*
  * dio_complete() - called when all DIO BIO I/O has been completed
  *
@@ -285,14 +297,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
         * zeros from unwritten extents.
         */
        if (flags & DIO_COMPLETE_INVALIDATE &&
-           ret > 0 && dio_op == REQ_OP_WRITE &&
-           dio->inode->i_mapping->nrpages) {
-               err = invalidate_inode_pages2_range(dio->inode->i_mapping,
-                                       offset >> PAGE_SHIFT,
-                                       (offset + ret - 1) >> PAGE_SHIFT);
-               if (err)
-                       dio_warn_stale_pagecache(dio->iocb->ki_filp);
-       }
+           ret > 0 && dio_op == REQ_OP_WRITE)
+               kiocb_invalidate_post_direct_write(dio->iocb, ret);
 
        inode_dio_end(dio->inode);
 
@@ -402,6 +408,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
                bio->bi_end_io = dio_bio_end_aio;
        else
                bio->bi_end_io = dio_bio_end_io;
+       if (dio->is_pinned)
+               bio_set_flag(bio, BIO_PAGE_PINNED);
        sdio->bio = bio;
        sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
@@ -442,8 +450,10 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
  */
 static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
 {
-       while (sdio->head < sdio->tail)
-               put_page(dio->pages[sdio->head++]);
+       if (dio->is_pinned)
+               unpin_user_pages(dio->pages + sdio->head,
+                                sdio->tail - sdio->head);
+       sdio->head = sdio->tail;
 }
 
 /*
@@ -674,7 +684,7 @@ out:
  *
  * Return zero on success.  Non-zero means the caller needs to start a new BIO.
  */
-static inline int dio_bio_add_page(struct dio_submit *sdio)
+static inline int dio_bio_add_page(struct dio *dio, struct dio_submit *sdio)
 {
        int ret;
 
@@ -686,7 +696,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
                 */
                if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE)
                        sdio->pages_in_io--;
-               get_page(sdio->cur_page);
+               dio_pin_page(dio, sdio->cur_page);
                sdio->final_block_in_bio = sdio->cur_page_block +
                        (sdio->cur_page_len >> sdio->blkbits);
                ret = 0;
@@ -741,11 +751,11 @@ static inline int dio_send_cur_page(struct dio *dio, struct dio_submit *sdio,
                        goto out;
        }
 
-       if (dio_bio_add_page(sdio) != 0) {
+       if (dio_bio_add_page(dio, sdio) != 0) {
                dio_bio_submit(dio, sdio);
                ret = dio_new_bio(dio, sdio, sdio->cur_page_block, map_bh);
                if (ret == 0) {
-                       ret = dio_bio_add_page(sdio);
+                       ret = dio_bio_add_page(dio, sdio);
                        BUG_ON(ret != 0);
                }
        }
@@ -802,13 +812,13 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
         */
        if (sdio->cur_page) {
                ret = dio_send_cur_page(dio, sdio, map_bh);
-               put_page(sdio->cur_page);
+               dio_unpin_page(dio, sdio->cur_page);
                sdio->cur_page = NULL;
                if (ret)
                        return ret;
        }
 
-       get_page(page);         /* It is in dio */
+       dio_pin_page(dio, page);                /* It is in dio */
        sdio->cur_page = page;
        sdio->cur_page_offset = offset;
        sdio->cur_page_len = len;
@@ -823,7 +833,7 @@ out:
                ret = dio_send_cur_page(dio, sdio, map_bh);
                if (sdio->bio)
                        dio_bio_submit(dio, sdio);
-               put_page(sdio->cur_page);
+               dio_unpin_page(dio, sdio->cur_page);
                sdio->cur_page = NULL;
        }
        return ret;
@@ -924,7 +934,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
                                ret = get_more_blocks(dio, sdio, map_bh);
                                if (ret) {
-                                       put_page(page);
+                                       dio_unpin_page(dio, page);
                                        goto out;
                                }
                                if (!buffer_mapped(map_bh))
@@ -969,7 +979,7 @@ do_holes:
 
                                /* AKPM: eargh, -ENOTBLK is a hack */
                                if (dio_op == REQ_OP_WRITE) {
-                                       put_page(page);
+                                       dio_unpin_page(dio, page);
                                        return -ENOTBLK;
                                }
 
@@ -982,7 +992,7 @@ do_holes:
                                if (sdio->block_in_file >=
                                                i_size_aligned >> blkbits) {
                                        /* We hit eof */
-                                       put_page(page);
+                                       dio_unpin_page(dio, page);
                                        goto out;
                                }
                                zero_user(page, from, 1 << blkbits);
@@ -1022,7 +1032,7 @@ do_holes:
                                                  sdio->next_block_for_io,
                                                  map_bh);
                        if (ret) {
-                               put_page(page);
+                               dio_unpin_page(dio, page);
                                goto out;
                        }
                        sdio->next_block_for_io += this_chunk_blocks;
@@ -1037,8 +1047,8 @@ next_block:
                                break;
                }
 
-               /* Drop the ref which was taken in get_user_pages() */
-               put_page(page);
+               /* Drop the pin which was taken in get_user_pages() */
+               dio_unpin_page(dio, page);
        }
 out:
        return ret;
@@ -1133,6 +1143,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
                /* will be released by direct_io_worker */
                inode_lock(inode);
        }
+       dio->is_pinned = iov_iter_extract_will_pin(iter);
 
        /* Once we sampled i_size check for reads beyond EOF */
        dio->i_size = i_size_read(inode);
@@ -1257,7 +1268,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
                ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
                if (retval == 0)
                        retval = ret2;
-               put_page(sdio.cur_page);
+               dio_unpin_page(dio, sdio.cur_page);
                sdio.cur_page = NULL;
        }
        if (sdio.bio)
index d31319d..2beceff 100644 (file)
@@ -116,9 +116,9 @@ static ssize_t cluster_cluster_name_store(struct config_item *item,
 {
        struct dlm_cluster *cl = config_item_to_cluster(item);
 
-       strlcpy(dlm_config.ci_cluster_name, buf,
+       strscpy(dlm_config.ci_cluster_name, buf,
                                sizeof(dlm_config.ci_cluster_name));
-       strlcpy(cl->cl_cluster_name, buf, sizeof(cl->cl_cluster_name));
+       strscpy(cl->cl_cluster_name, buf, sizeof(cl->cl_cluster_name));
        return len;
 }
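
Both conversions here, like the chrdev one above, are mechanical: strscpy() always NUL-terminates like strlcpy(), but it returns -E2BIG on truncation instead of the full source length, and it never has to walk the entire source string just to compute a return value. A small userspace stand-in showing the return-value difference (my_strscpy is a simplified re-implementation for illustration only):

#include <stdio.h>
#include <string.h>
#include <errno.h>

/* simplified model of the kernel's strscpy() return convention */
static long my_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -E2BIG;
	for (i = 0; i < size - 1 && src[i]; i++)
		dst[i] = src[i];
	dst[i] = '\0';
	return src[i] ? -E2BIG : (long)i;	/* truncated? */
}

int main(void)
{
	char buf[8];

	printf("strscpy-style: %ld\n", my_strscpy(buf, "short", sizeof(buf)));
	printf("strscpy-style: %ld (truncated to \"%s\")\n",
	       my_strscpy(buf, "much-too-long-name", sizeof(buf)), buf);
	/* strlcpy() would have returned strlen(src) = 18 here, forcing the
	 * caller to compare against the buffer size to detect truncation. */
	return 0;
}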
 
index 268b744..ce0a3c5 100644 (file)
@@ -44,6 +44,31 @@ static ssize_t ecryptfs_read_update_atime(struct kiocb *iocb,
        return rc;
 }
 
+/*
+ * ecryptfs_splice_read_update_atime
+ *
+ * filemap_splice_read updates the atime of upper layer inode.  But, it
+ * doesn't give us a chance to update the atime of the lower layer inode.  This
+ * function is a wrapper to generic_file_read.  It updates the atime of the
+ * lower level inode if generic_file_read returns without any errors. This is
+ * to be used only for file reads.  The function to be used for directory reads
+ * is ecryptfs_read.
+ */
+static ssize_t ecryptfs_splice_read_update_atime(struct file *in, loff_t *ppos,
+                                                struct pipe_inode_info *pipe,
+                                                size_t len, unsigned int flags)
+{
+       ssize_t rc;
+       const struct path *path;
+
+       rc = filemap_splice_read(in, ppos, pipe, len, flags);
+       if (rc >= 0) {
+               path = ecryptfs_dentry_to_lower_path(in->f_path.dentry);
+               touch_atime(path);
+       }
+       return rc;
+}
+
 struct ecryptfs_getdents_callback {
        struct dir_context ctx;
        struct dir_context *caller;
@@ -414,5 +439,5 @@ const struct file_operations ecryptfs_main_fops = {
        .release = ecryptfs_release,
        .fsync = ecryptfs_fsync,
        .fasync = ecryptfs_fasync,
-       .splice_read = generic_file_splice_read,
+       .splice_read = ecryptfs_splice_read_update_atime,
 };
index 26fa170..b1b8465 100644 (file)
@@ -89,8 +89,7 @@ static inline bool erofs_page_is_managed(const struct erofs_sb_info *sbi,
 
 int z_erofs_fixup_insize(struct z_erofs_decompress_req *rq, const char *padbuf,
                         unsigned int padbufsize);
-int z_erofs_decompress(struct z_erofs_decompress_req *rq,
-                      struct page **pagepool);
+extern const struct z_erofs_decompressor erofs_decompressors[];
 
 /* prototypes for specific algorithms */
 int z_erofs_lzma_decompress(struct z_erofs_decompress_req *rq,
index 6fe9a77..db5e4b7 100644 (file)
@@ -448,5 +448,5 @@ const struct file_operations erofs_file_fops = {
        .llseek         = generic_file_llseek,
        .read_iter      = erofs_file_read_iter,
        .mmap           = erofs_file_mmap,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
 };
index 7021e2c..2a29943 100644 (file)
@@ -363,7 +363,7 @@ static int z_erofs_transform_plain(struct z_erofs_decompress_req *rq,
        return 0;
 }
 
-static struct z_erofs_decompressor decompressors[] = {
+const struct z_erofs_decompressor erofs_decompressors[] = {
        [Z_EROFS_COMPRESSION_SHIFTED] = {
                .decompress = z_erofs_transform_plain,
                .name = "shifted"
@@ -383,9 +383,3 @@ static struct z_erofs_decompressor decompressors[] = {
        },
 #endif
 };
-
-int z_erofs_decompress(struct z_erofs_decompress_req *rq,
-                      struct page **pagepool)
-{
-       return decompressors[rq->alg].decompress(rq, pagepool);
-}
index 1e39c03..36e32fa 100644 (file)
@@ -208,46 +208,12 @@ enum {
        EROFS_ZIP_CACHE_READAROUND
 };
 
-#define EROFS_LOCKED_MAGIC     (INT_MIN | 0xE0F510CCL)
-
 /* basic unit of the workstation of a super_block */
 struct erofs_workgroup {
-       /* the workgroup index in the workstation */
        pgoff_t index;
-
-       /* overall workgroup reference count */
-       atomic_t refcount;
+       struct lockref lockref;
 };
 
-static inline bool erofs_workgroup_try_to_freeze(struct erofs_workgroup *grp,
-                                                int val)
-{
-       preempt_disable();
-       if (val != atomic_cmpxchg(&grp->refcount, val, EROFS_LOCKED_MAGIC)) {
-               preempt_enable();
-               return false;
-       }
-       return true;
-}
-
-static inline void erofs_workgroup_unfreeze(struct erofs_workgroup *grp,
-                                           int orig_val)
-{
-       /*
-        * other observers should notice all modifications
-        * in the freezing period.
-        */
-       smp_mb();
-       atomic_set(&grp->refcount, orig_val);
-       preempt_enable();
-}
-
-static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
-{
-       return atomic_cond_read_relaxed(&grp->refcount,
-                                       VAL != EROFS_LOCKED_MAGIC);
-}
-
 enum erofs_kmap_type {
        EROFS_NO_KMAP,          /* don't map the buffer */
        EROFS_KMAP,             /* use kmap_local_page() to map the buffer */
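
Replacing the open-coded atomic refcount and its EROFS_LOCKED_MAGIC freeze trick with a struct lockref means, roughly, that the zero transition is serialized by the lockref's spinlock: a lookup can take the lock and see whether the workgroup is already on its way out instead of spinning on a magic value. A loose userspace analogue of the get-unless-zero pattern (types and names are stand-ins, not the kernel's lockref API):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct lockref_like {
	pthread_mutex_t lock;
	int count;
};

/* take a reference only if the object is not already heading to zero */
static bool get_not_zero(struct lockref_like *l)
{
	bool ok;

	pthread_mutex_lock(&l->lock);
	ok = l->count > 0;
	if (ok)
		l->count++;
	pthread_mutex_unlock(&l->lock);
	return ok;
}

int main(void)
{
	struct lockref_like wg = { PTHREAD_MUTEX_INITIALIZER, 1 };

	printf("lookup while live: %s\n", get_not_zero(&wg) ? "got ref" : "dying");
	wg.count = 0;			/* pretend the last reference was dropped */
	printf("lookup after teardown started: %s\n",
	       get_not_zero(&wg) ? "got ref" : "dying");
	return 0;
}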
@@ -486,7 +452,7 @@ static inline void erofs_pagepool_add(struct page **pagepool, struct page *page)
 void erofs_release_pages(struct page **pagepool);
 
 #ifdef CONFIG_EROFS_FS_ZIP
-int erofs_workgroup_put(struct erofs_workgroup *grp);
+void erofs_workgroup_put(struct erofs_workgroup *grp);
 struct erofs_workgroup *erofs_find_workgroup(struct super_block *sb,
                                             pgoff_t index);
 struct erofs_workgroup *erofs_insert_workgroup(struct super_block *sb,
@@ -500,7 +466,6 @@ int __init z_erofs_init_zip_subsystem(void);
 void z_erofs_exit_zip_subsystem(void);
 int erofs_try_to_free_all_cached_pages(struct erofs_sb_info *sbi,
                                       struct erofs_workgroup *egrp);
-int erofs_try_to_free_cached_page(struct page *page);
 int z_erofs_load_lz4_config(struct super_block *sb,
                            struct erofs_super_block *dsb,
                            struct z_erofs_lz4_cfgs *lz4, int len);
@@ -511,6 +476,7 @@ void erofs_put_pcpubuf(void *ptr);
 int erofs_pcpubuf_growsize(unsigned int nrpages);
 void __init erofs_pcpubuf_init(void);
 void erofs_pcpubuf_exit(void);
+int erofs_init_managed_cache(struct super_block *sb);
 #else
 static inline void erofs_shrinker_register(struct super_block *sb) {}
 static inline void erofs_shrinker_unregister(struct super_block *sb) {}
@@ -530,6 +496,7 @@ static inline int z_erofs_load_lz4_config(struct super_block *sb,
 }
 static inline void erofs_pcpubuf_init(void) {}
 static inline void erofs_pcpubuf_exit(void) {}
+static inline int erofs_init_managed_cache(struct super_block *sb) { return 0; }
 #endif /* !CONFIG_EROFS_FS_ZIP */
 
 #ifdef CONFIG_EROFS_FS_ZIP_LZMA
index 811ab66..9d6a3c6 100644 (file)
@@ -19,6 +19,7 @@
 #include <trace/events/erofs.h>
 
 static struct kmem_cache *erofs_inode_cachep __read_mostly;
+struct file_system_type erofs_fs_type;
 
 void _erofs_err(struct super_block *sb, const char *function,
                const char *fmt, ...)
@@ -253,8 +254,8 @@ static int erofs_init_device(struct erofs_buf *buf, struct super_block *sb,
                        return PTR_ERR(fscache);
                dif->fscache = fscache;
        } else if (!sbi->devs->flatdev) {
-               bdev = blkdev_get_by_path(dif->path, FMODE_READ | FMODE_EXCL,
-                                         sb->s_type);
+               bdev = blkdev_get_by_path(dif->path, BLK_OPEN_READ, sb->s_type,
+                                         NULL);
                if (IS_ERR(bdev))
                        return PTR_ERR(bdev);
                dif->bdev = bdev;
@@ -599,68 +600,6 @@ static int erofs_fc_parse_param(struct fs_context *fc,
        return 0;
 }
 
-#ifdef CONFIG_EROFS_FS_ZIP
-static const struct address_space_operations managed_cache_aops;
-
-static bool erofs_managed_cache_release_folio(struct folio *folio, gfp_t gfp)
-{
-       bool ret = true;
-       struct address_space *const mapping = folio->mapping;
-
-       DBG_BUGON(!folio_test_locked(folio));
-       DBG_BUGON(mapping->a_ops != &managed_cache_aops);
-
-       if (folio_test_private(folio))
-               ret = erofs_try_to_free_cached_page(&folio->page);
-
-       return ret;
-}
-
-/*
- * It will be called only on inode eviction. In case that there are still some
- * decompression requests in progress, wait with rescheduling for a bit here.
- * We could introduce an extra locking instead but it seems unnecessary.
- */
-static void erofs_managed_cache_invalidate_folio(struct folio *folio,
-                                              size_t offset, size_t length)
-{
-       const size_t stop = length + offset;
-
-       DBG_BUGON(!folio_test_locked(folio));
-
-       /* Check for potential overflow in debug mode */
-       DBG_BUGON(stop > folio_size(folio) || stop < length);
-
-       if (offset == 0 && stop == folio_size(folio))
-               while (!erofs_managed_cache_release_folio(folio, GFP_NOFS))
-                       cond_resched();
-}
-
-static const struct address_space_operations managed_cache_aops = {
-       .release_folio = erofs_managed_cache_release_folio,
-       .invalidate_folio = erofs_managed_cache_invalidate_folio,
-};
-
-static int erofs_init_managed_cache(struct super_block *sb)
-{
-       struct erofs_sb_info *const sbi = EROFS_SB(sb);
-       struct inode *const inode = new_inode(sb);
-
-       if (!inode)
-               return -ENOMEM;
-
-       set_nlink(inode, 1);
-       inode->i_size = OFFSET_MAX;
-
-       inode->i_mapping->a_ops = &managed_cache_aops;
-       mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
-       sbi->managed_cache = inode;
-       return 0;
-}
-#else
-static int erofs_init_managed_cache(struct super_block *sb) { return 0; }
-#endif
-
 static struct inode *erofs_nfs_get_inode(struct super_block *sb,
                                         u64 ino, u32 generation)
 {
@@ -877,7 +816,7 @@ static int erofs_release_device_info(int id, void *ptr, void *data)
 
        fs_put_dax(dif->dax_dev, NULL);
        if (dif->bdev)
-               blkdev_put(dif->bdev, FMODE_READ | FMODE_EXCL);
+               blkdev_put(dif->bdev, &erofs_fs_type);
        erofs_fscache_unregister_cookie(dif->fscache);
        dif->fscache = NULL;
        kfree(dif->path);
@@ -1016,10 +955,8 @@ static int __init erofs_module_init(void)
                                               sizeof(struct erofs_inode), 0,
                                               SLAB_RECLAIM_ACCOUNT,
                                               erofs_inode_init_once);
-       if (!erofs_inode_cachep) {
-               err = -ENOMEM;
-               goto icache_err;
-       }
+       if (!erofs_inode_cachep)
+               return -ENOMEM;
 
        err = erofs_init_shrinker();
        if (err)
@@ -1054,7 +991,6 @@ lzma_err:
        erofs_exit_shrinker();
 shrinker_err:
        kmem_cache_destroy(erofs_inode_cachep);
-icache_err:
        return err;
 }
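
The device-table hunks above also adopt the 6.5 block layer interface: blkdev_get_by_path() now takes BLK_OPEN_* flags plus an explicit holder pointer (and optional holder ops), and blkdev_put() is handed that same holder instead of FMODE_* flags, which is why erofs_fs_type gains an external declaration here. A minimal usage sketch under those assumptions; the demo_* names and the device path are illustrative, and callers still check the result with IS_ERR().

	/* Sketch: open and later release one extra device. The holder passed to
	 * blkdev_put() must match the one given to blkdev_get_by_path(). */
	static struct block_device *demo_open_dev(const char *path)
	{
		/* read-only open, with the filesystem type acting as the holder */
		return blkdev_get_by_path(path, BLK_OPEN_READ, &erofs_fs_type, NULL);
	}

	static void demo_close_dev(struct block_device *bdev)
	{
		blkdev_put(bdev, &erofs_fs_type);
	}
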
 
index 46627cb..cc6fb9e 100644 (file)
@@ -4,7 +4,6 @@
  *             https://www.huawei.com/
  */
 #include "internal.h"
-#include <linux/pagevec.h>
 
 struct page *erofs_allocpage(struct page **pagepool, gfp_t gfp)
 {
@@ -33,22 +32,21 @@ void erofs_release_pages(struct page **pagepool)
 /* global shrink count (for all mounted EROFS instances) */
 static atomic_long_t erofs_global_shrink_cnt;
 
-static int erofs_workgroup_get(struct erofs_workgroup *grp)
+static bool erofs_workgroup_get(struct erofs_workgroup *grp)
 {
-       int o;
+       if (lockref_get_not_zero(&grp->lockref))
+               return true;
 
-repeat:
-       o = erofs_wait_on_workgroup_freezed(grp);
-       if (o <= 0)
-               return -1;
-
-       if (atomic_cmpxchg(&grp->refcount, o, o + 1) != o)
-               goto repeat;
+       spin_lock(&grp->lockref.lock);
+       if (__lockref_is_dead(&grp->lockref)) {
+               spin_unlock(&grp->lockref.lock);
+               return false;
+       }
 
-       /* decrease refcount paired by erofs_workgroup_put */
-       if (o == 1)
+       if (!grp->lockref.count++)
                atomic_long_dec(&erofs_global_shrink_cnt);
-       return 0;
+       spin_unlock(&grp->lockref.lock);
+       return true;
 }
 
 struct erofs_workgroup *erofs_find_workgroup(struct super_block *sb,
@@ -61,7 +59,7 @@ repeat:
        rcu_read_lock();
        grp = xa_load(&sbi->managed_pslots, index);
        if (grp) {
-               if (erofs_workgroup_get(grp)) {
+               if (!erofs_workgroup_get(grp)) {
                        /* prefer to relax rcu read side */
                        rcu_read_unlock();
                        goto repeat;
@@ -80,11 +78,10 @@ struct erofs_workgroup *erofs_insert_workgroup(struct super_block *sb,
        struct erofs_workgroup *pre;
 
        /*
-        * Bump up a reference count before making this visible
-        * to others for the XArray in order to avoid potential
-        * UAF without serialized by xa_lock.
+        * Bump up the reference count before making this visible to others in
+        * the XArray to avoid a potential UAF without being serialized by xa_lock.
         */
-       atomic_inc(&grp->refcount);
+       lockref_get(&grp->lockref);
 
 repeat:
        xa_lock(&sbi->managed_pslots);
@@ -93,13 +90,13 @@ repeat:
        if (pre) {
                if (xa_is_err(pre)) {
                        pre = ERR_PTR(xa_err(pre));
-               } else if (erofs_workgroup_get(pre)) {
+               } else if (!erofs_workgroup_get(pre)) {
                        /* try to legitimize the current in-tree one */
                        xa_unlock(&sbi->managed_pslots);
                        cond_resched();
                        goto repeat;
                }
-               atomic_dec(&grp->refcount);
+               lockref_put_return(&grp->lockref);
                grp = pre;
        }
        xa_unlock(&sbi->managed_pslots);
@@ -112,38 +109,34 @@ static void  __erofs_workgroup_free(struct erofs_workgroup *grp)
        erofs_workgroup_free_rcu(grp);
 }
 
-int erofs_workgroup_put(struct erofs_workgroup *grp)
+void erofs_workgroup_put(struct erofs_workgroup *grp)
 {
-       int count = atomic_dec_return(&grp->refcount);
+       if (lockref_put_or_lock(&grp->lockref))
+               return;
 
-       if (count == 1)
+       DBG_BUGON(__lockref_is_dead(&grp->lockref));
+       if (grp->lockref.count == 1)
                atomic_long_inc(&erofs_global_shrink_cnt);
-       else if (!count)
-               __erofs_workgroup_free(grp);
-       return count;
+       --grp->lockref.count;
+       spin_unlock(&grp->lockref.lock);
 }
 
 static bool erofs_try_to_release_workgroup(struct erofs_sb_info *sbi,
                                           struct erofs_workgroup *grp)
 {
-       /*
-        * If managed cache is on, refcount of workgroups
-        * themselves could be < 0 (freezed). In other words,
-        * there is no guarantee that all refcounts > 0.
-        */
-       if (!erofs_workgroup_try_to_freeze(grp, 1))
-               return false;
+       int free = false;
+
+       spin_lock(&grp->lockref.lock);
+       if (grp->lockref.count)
+               goto out;
 
        /*
-        * Note that all cached pages should be unattached
-        * before deleted from the XArray. Otherwise some
-        * cached pages could be still attached to the orphan
-        * old workgroup when the new one is available in the tree.
+        * Note that all cached pages should be detached before being deleted
+        * from the XArray. Otherwise some cached pages could still be attached
+        * to the orphan old workgroup when the new one is available in the tree.
         */
-       if (erofs_try_to_free_all_cached_pages(sbi, grp)) {
-               erofs_workgroup_unfreeze(grp, 1);
-               return false;
-       }
+       if (erofs_try_to_free_all_cached_pages(sbi, grp))
+               goto out;
 
        /*
          * It's impossible to fail after the workgroup is frozen,
@@ -152,10 +145,13 @@ static bool erofs_try_to_release_workgroup(struct erofs_sb_info *sbi,
         */
        DBG_BUGON(__xa_erase(&sbi->managed_pslots, grp->index) != grp);
 
-       /* last refcount should be connected with its managed pslot.  */
-       erofs_workgroup_unfreeze(grp, 0);
-       __erofs_workgroup_free(grp);
-       return true;
+       lockref_mark_dead(&grp->lockref);
+       free = true;
+out:
+       spin_unlock(&grp->lockref.lock);
+       if (free)
+               __erofs_workgroup_free(grp);
+       return free;
 }
 
 static unsigned long erofs_shrink_workstation(struct erofs_sb_info *sbi,
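
On the release side, lockref_put_or_lock() drops a reference while the count is above one; once the last reference is in question it returns with the spinlock held so the caller handles the final transition itself, which is how erofs_workgroup_put() above keeps the shrinker accounting. Actually freeing a group is left to the shrinker path, which marks it dead under the lock before the RCU free. A condensed sketch of that release step, continuing the lockref sketch above (my_obj_try_release is an illustrative name):

	/* Sketch: shrinker-side release; the object is freed by the caller
	 * (e.g. via RCU) only when this returns true. */
	static bool my_obj_try_release(struct my_obj *obj)
	{
		bool free = false;

		spin_lock(&obj->lockref.lock);
		if (!obj->lockref.count) {		/* no remaining users */
			lockref_mark_dead(&obj->lockref);
			free = true;
		}
		spin_unlock(&obj->lockref.lock);
		return free;
	}
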
index bbfe7ce..40178b6 100644 (file)
@@ -7,32 +7,27 @@
 #include <linux/security.h>
 #include "xattr.h"
 
-static inline erofs_blk_t erofs_xattr_blkaddr(struct super_block *sb,
-                                             unsigned int xattr_id)
-{
-       return EROFS_SB(sb)->xattr_blkaddr +
-              erofs_blknr(sb, xattr_id * sizeof(__u32));
-}
-
-static inline unsigned int erofs_xattr_blkoff(struct super_block *sb,
-                                             unsigned int xattr_id)
-{
-       return erofs_blkoff(sb, xattr_id * sizeof(__u32));
-}
-
-struct xattr_iter {
+struct erofs_xattr_iter {
        struct super_block *sb;
        struct erofs_buf buf;
+       erofs_off_t pos;
        void *kaddr;
 
-       erofs_blk_t blkaddr;
-       unsigned int ofs;
+       char *buffer;
+       int buffer_size, buffer_ofs;
+
+       /* getxattr */
+       int index, infix_len;
+       struct qstr name;
+
+       /* listxattr */
+       struct dentry *dentry;
 };
 
 static int erofs_init_inode_xattrs(struct inode *inode)
 {
        struct erofs_inode *const vi = EROFS_I(inode);
-       struct xattr_iter it;
+       struct erofs_xattr_iter it;
        unsigned int i;
        struct erofs_xattr_ibody_header *ih;
        struct super_block *sb = inode->i_sb;
@@ -81,17 +76,17 @@ static int erofs_init_inode_xattrs(struct inode *inode)
        }
 
        it.buf = __EROFS_BUF_INITIALIZER;
-       it.blkaddr = erofs_blknr(sb, erofs_iloc(inode) + vi->inode_isize);
-       it.ofs = erofs_blkoff(sb, erofs_iloc(inode) + vi->inode_isize);
+       erofs_init_metabuf(&it.buf, sb);
+       it.pos = erofs_iloc(inode) + vi->inode_isize;
 
        /* read in shared xattr array (non-atomic, see kmalloc below) */
-       it.kaddr = erofs_read_metabuf(&it.buf, sb, it.blkaddr, EROFS_KMAP);
+       it.kaddr = erofs_bread(&it.buf, erofs_blknr(sb, it.pos), EROFS_KMAP);
        if (IS_ERR(it.kaddr)) {
                ret = PTR_ERR(it.kaddr);
                goto out_unlock;
        }
 
-       ih = (struct erofs_xattr_ibody_header *)(it.kaddr + it.ofs);
+       ih = it.kaddr + erofs_blkoff(sb, it.pos);
        vi->xattr_shared_count = ih->h_shared_count;
        vi->xattr_shared_xattrs = kmalloc_array(vi->xattr_shared_count,
                                                sizeof(uint), GFP_KERNEL);
@@ -102,26 +97,20 @@ static int erofs_init_inode_xattrs(struct inode *inode)
        }
 
        /* let's skip ibody header */
-       it.ofs += sizeof(struct erofs_xattr_ibody_header);
+       it.pos += sizeof(struct erofs_xattr_ibody_header);
 
        for (i = 0; i < vi->xattr_shared_count; ++i) {
-               if (it.ofs >= sb->s_blocksize) {
-                       /* cannot be unaligned */
-                       DBG_BUGON(it.ofs != sb->s_blocksize);
-
-                       it.kaddr = erofs_read_metabuf(&it.buf, sb, ++it.blkaddr,
-                                                     EROFS_KMAP);
-                       if (IS_ERR(it.kaddr)) {
-                               kfree(vi->xattr_shared_xattrs);
-                               vi->xattr_shared_xattrs = NULL;
-                               ret = PTR_ERR(it.kaddr);
-                               goto out_unlock;
-                       }
-                       it.ofs = 0;
+               it.kaddr = erofs_bread(&it.buf, erofs_blknr(sb, it.pos),
+                                      EROFS_KMAP);
+               if (IS_ERR(it.kaddr)) {
+                       kfree(vi->xattr_shared_xattrs);
+                       vi->xattr_shared_xattrs = NULL;
+                       ret = PTR_ERR(it.kaddr);
+                       goto out_unlock;
                }
-               vi->xattr_shared_xattrs[i] =
-                       le32_to_cpu(*(__le32 *)(it.kaddr + it.ofs));
-               it.ofs += sizeof(__le32);
+               vi->xattr_shared_xattrs[i] = le32_to_cpu(*(__le32 *)
+                               (it.kaddr + erofs_blkoff(sb, it.pos)));
+               it.pos += sizeof(__le32);
        }
        erofs_put_metabuf(&it.buf);
 
@@ -134,287 +123,6 @@ out_unlock:
        return ret;
 }
 
-/*
- * the general idea for these return values is
- * if    0 is returned, go on processing the current xattr;
- *       1 (> 0) is returned, skip this round to process the next xattr;
- *    -err (< 0) is returned, an error (maybe ENOXATTR) occurred
- *                            and need to be handled
- */
-struct xattr_iter_handlers {
-       int (*entry)(struct xattr_iter *_it, struct erofs_xattr_entry *entry);
-       int (*name)(struct xattr_iter *_it, unsigned int processed, char *buf,
-                   unsigned int len);
-       int (*alloc_buffer)(struct xattr_iter *_it, unsigned int value_sz);
-       void (*value)(struct xattr_iter *_it, unsigned int processed, char *buf,
-                     unsigned int len);
-};
-
-static inline int xattr_iter_fixup(struct xattr_iter *it)
-{
-       if (it->ofs < it->sb->s_blocksize)
-               return 0;
-
-       it->blkaddr += erofs_blknr(it->sb, it->ofs);
-       it->kaddr = erofs_read_metabuf(&it->buf, it->sb, it->blkaddr,
-                                      EROFS_KMAP);
-       if (IS_ERR(it->kaddr))
-               return PTR_ERR(it->kaddr);
-       it->ofs = erofs_blkoff(it->sb, it->ofs);
-       return 0;
-}
-
-static int inline_xattr_iter_begin(struct xattr_iter *it,
-                                  struct inode *inode)
-{
-       struct erofs_inode *const vi = EROFS_I(inode);
-       unsigned int xattr_header_sz, inline_xattr_ofs;
-
-       xattr_header_sz = sizeof(struct erofs_xattr_ibody_header) +
-                         sizeof(u32) * vi->xattr_shared_count;
-       if (xattr_header_sz >= vi->xattr_isize) {
-               DBG_BUGON(xattr_header_sz > vi->xattr_isize);
-               return -ENOATTR;
-       }
-
-       inline_xattr_ofs = vi->inode_isize + xattr_header_sz;
-
-       it->blkaddr = erofs_blknr(it->sb, erofs_iloc(inode) + inline_xattr_ofs);
-       it->ofs = erofs_blkoff(it->sb, erofs_iloc(inode) + inline_xattr_ofs);
-       it->kaddr = erofs_read_metabuf(&it->buf, inode->i_sb, it->blkaddr,
-                                      EROFS_KMAP);
-       if (IS_ERR(it->kaddr))
-               return PTR_ERR(it->kaddr);
-       return vi->xattr_isize - xattr_header_sz;
-}
-
-/*
- * Regardless of success or failure, `xattr_foreach' will end up with
- * `ofs' pointing to the next xattr item rather than an arbitrary position.
- */
-static int xattr_foreach(struct xattr_iter *it,
-                        const struct xattr_iter_handlers *op,
-                        unsigned int *tlimit)
-{
-       struct erofs_xattr_entry entry;
-       unsigned int value_sz, processed, slice;
-       int err;
-
-       /* 0. fixup blkaddr, ofs, ipage */
-       err = xattr_iter_fixup(it);
-       if (err)
-               return err;
-
-       /*
-        * 1. read xattr entry to the memory,
-        *    since we do EROFS_XATTR_ALIGN
-        *    therefore entry should be in the page
-        */
-       entry = *(struct erofs_xattr_entry *)(it->kaddr + it->ofs);
-       if (tlimit) {
-               unsigned int entry_sz = erofs_xattr_entry_size(&entry);
-
-               /* xattr on-disk corruption: xattr entry beyond xattr_isize */
-               if (*tlimit < entry_sz) {
-                       DBG_BUGON(1);
-                       return -EFSCORRUPTED;
-               }
-               *tlimit -= entry_sz;
-       }
-
-       it->ofs += sizeof(struct erofs_xattr_entry);
-       value_sz = le16_to_cpu(entry.e_value_size);
-
-       /* handle entry */
-       err = op->entry(it, &entry);
-       if (err) {
-               it->ofs += entry.e_name_len + value_sz;
-               goto out;
-       }
-
-       /* 2. handle xattr name (ofs will finally be at the end of name) */
-       processed = 0;
-
-       while (processed < entry.e_name_len) {
-               if (it->ofs >= it->sb->s_blocksize) {
-                       DBG_BUGON(it->ofs > it->sb->s_blocksize);
-
-                       err = xattr_iter_fixup(it);
-                       if (err)
-                               goto out;
-                       it->ofs = 0;
-               }
-
-               slice = min_t(unsigned int, it->sb->s_blocksize - it->ofs,
-                             entry.e_name_len - processed);
-
-               /* handle name */
-               err = op->name(it, processed, it->kaddr + it->ofs, slice);
-               if (err) {
-                       it->ofs += entry.e_name_len - processed + value_sz;
-                       goto out;
-               }
-
-               it->ofs += slice;
-               processed += slice;
-       }
-
-       /* 3. handle xattr value */
-       processed = 0;
-
-       if (op->alloc_buffer) {
-               err = op->alloc_buffer(it, value_sz);
-               if (err) {
-                       it->ofs += value_sz;
-                       goto out;
-               }
-       }
-
-       while (processed < value_sz) {
-               if (it->ofs >= it->sb->s_blocksize) {
-                       DBG_BUGON(it->ofs > it->sb->s_blocksize);
-
-                       err = xattr_iter_fixup(it);
-                       if (err)
-                               goto out;
-                       it->ofs = 0;
-               }
-
-               slice = min_t(unsigned int, it->sb->s_blocksize - it->ofs,
-                             value_sz - processed);
-               op->value(it, processed, it->kaddr + it->ofs, slice);
-               it->ofs += slice;
-               processed += slice;
-       }
-
-out:
-       /* xattrs should be 4-byte aligned (on-disk constraint) */
-       it->ofs = EROFS_XATTR_ALIGN(it->ofs);
-       return err < 0 ? err : 0;
-}
-
-struct getxattr_iter {
-       struct xattr_iter it;
-
-       char *buffer;
-       int buffer_size, index, infix_len;
-       struct qstr name;
-};
-
-static int erofs_xattr_long_entrymatch(struct getxattr_iter *it,
-                                      struct erofs_xattr_entry *entry)
-{
-       struct erofs_sb_info *sbi = EROFS_SB(it->it.sb);
-       struct erofs_xattr_prefix_item *pf = sbi->xattr_prefixes +
-               (entry->e_name_index & EROFS_XATTR_LONG_PREFIX_MASK);
-
-       if (pf >= sbi->xattr_prefixes + sbi->xattr_prefix_count)
-               return -ENOATTR;
-
-       if (it->index != pf->prefix->base_index ||
-           it->name.len != entry->e_name_len + pf->infix_len)
-               return -ENOATTR;
-
-       if (memcmp(it->name.name, pf->prefix->infix, pf->infix_len))
-               return -ENOATTR;
-
-       it->infix_len = pf->infix_len;
-       return 0;
-}
-
-static int xattr_entrymatch(struct xattr_iter *_it,
-                           struct erofs_xattr_entry *entry)
-{
-       struct getxattr_iter *it = container_of(_it, struct getxattr_iter, it);
-
-       /* should also match the infix for long name prefixes */
-       if (entry->e_name_index & EROFS_XATTR_LONG_PREFIX)
-               return erofs_xattr_long_entrymatch(it, entry);
-
-       if (it->index != entry->e_name_index ||
-           it->name.len != entry->e_name_len)
-               return -ENOATTR;
-       it->infix_len = 0;
-       return 0;
-}
-
-static int xattr_namematch(struct xattr_iter *_it,
-                          unsigned int processed, char *buf, unsigned int len)
-{
-       struct getxattr_iter *it = container_of(_it, struct getxattr_iter, it);
-
-       if (memcmp(buf, it->name.name + it->infix_len + processed, len))
-               return -ENOATTR;
-       return 0;
-}
-
-static int xattr_checkbuffer(struct xattr_iter *_it,
-                            unsigned int value_sz)
-{
-       struct getxattr_iter *it = container_of(_it, struct getxattr_iter, it);
-       int err = it->buffer_size < value_sz ? -ERANGE : 0;
-
-       it->buffer_size = value_sz;
-       return !it->buffer ? 1 : err;
-}
-
-static void xattr_copyvalue(struct xattr_iter *_it,
-                           unsigned int processed,
-                           char *buf, unsigned int len)
-{
-       struct getxattr_iter *it = container_of(_it, struct getxattr_iter, it);
-
-       memcpy(it->buffer + processed, buf, len);
-}
-
-static const struct xattr_iter_handlers find_xattr_handlers = {
-       .entry = xattr_entrymatch,
-       .name = xattr_namematch,
-       .alloc_buffer = xattr_checkbuffer,
-       .value = xattr_copyvalue
-};
-
-static int inline_getxattr(struct inode *inode, struct getxattr_iter *it)
-{
-       int ret;
-       unsigned int remaining;
-
-       ret = inline_xattr_iter_begin(&it->it, inode);
-       if (ret < 0)
-               return ret;
-
-       remaining = ret;
-       while (remaining) {
-               ret = xattr_foreach(&it->it, &find_xattr_handlers, &remaining);
-               if (ret != -ENOATTR)
-                       break;
-       }
-       return ret ? ret : it->buffer_size;
-}
-
-static int shared_getxattr(struct inode *inode, struct getxattr_iter *it)
-{
-       struct erofs_inode *const vi = EROFS_I(inode);
-       struct super_block *const sb = it->it.sb;
-       unsigned int i, xsid;
-       int ret = -ENOATTR;
-
-       for (i = 0; i < vi->xattr_shared_count; ++i) {
-               xsid = vi->xattr_shared_xattrs[i];
-               it->it.blkaddr = erofs_xattr_blkaddr(sb, xsid);
-               it->it.ofs = erofs_xattr_blkoff(sb, xsid);
-               it->it.kaddr = erofs_read_metabuf(&it->it.buf, sb,
-                                                 it->it.blkaddr, EROFS_KMAP);
-               if (IS_ERR(it->it.kaddr))
-                       return PTR_ERR(it->it.kaddr);
-
-               ret = xattr_foreach(&it->it, &find_xattr_handlers, NULL);
-               if (ret != -ENOATTR)
-                       break;
-       }
-       return ret ? ret : it->buffer_size;
-}
-
 static bool erofs_xattr_user_list(struct dentry *dentry)
 {
        return test_opt(&EROFS_SB(dentry->d_sb)->opt, XATTR_USER);
@@ -425,39 +133,6 @@ static bool erofs_xattr_trusted_list(struct dentry *dentry)
        return capable(CAP_SYS_ADMIN);
 }
 
-int erofs_getxattr(struct inode *inode, int index,
-                  const char *name,
-                  void *buffer, size_t buffer_size)
-{
-       int ret;
-       struct getxattr_iter it;
-
-       if (!name)
-               return -EINVAL;
-
-       ret = erofs_init_inode_xattrs(inode);
-       if (ret)
-               return ret;
-
-       it.index = index;
-       it.name.len = strlen(name);
-       if (it.name.len > EROFS_NAME_LEN)
-               return -ERANGE;
-
-       it.it.buf = __EROFS_BUF_INITIALIZER;
-       it.name.name = name;
-
-       it.buffer = buffer;
-       it.buffer_size = buffer_size;
-
-       it.it.sb = inode->i_sb;
-       ret = inline_getxattr(inode, &it);
-       if (ret == -ENOATTR)
-               ret = shared_getxattr(inode, &it);
-       erofs_put_metabuf(&it.it.buf);
-       return ret;
-}
-
 static int erofs_xattr_generic_get(const struct xattr_handler *handler,
                                   struct dentry *unused, struct inode *inode,
                                   const char *name, void *buffer, size_t size)
@@ -500,30 +175,49 @@ const struct xattr_handler *erofs_xattr_handlers[] = {
        NULL,
 };
 
-struct listxattr_iter {
-       struct xattr_iter it;
-
-       struct dentry *dentry;
-       char *buffer;
-       int buffer_size, buffer_ofs;
-};
+static int erofs_xattr_copy_to_buffer(struct erofs_xattr_iter *it,
+                                     unsigned int len)
+{
+       unsigned int slice, processed;
+       struct super_block *sb = it->sb;
+       void *src;
+
+       for (processed = 0; processed < len; processed += slice) {
+               it->kaddr = erofs_bread(&it->buf, erofs_blknr(sb, it->pos),
+                                       EROFS_KMAP);
+               if (IS_ERR(it->kaddr))
+                       return PTR_ERR(it->kaddr);
+
+               src = it->kaddr + erofs_blkoff(sb, it->pos);
+               slice = min_t(unsigned int, sb->s_blocksize -
+                               erofs_blkoff(sb, it->pos), len - processed);
+               memcpy(it->buffer + it->buffer_ofs, src, slice);
+               it->buffer_ofs += slice;
+               it->pos += slice;
+       }
+       return 0;
+}
 
-static int xattr_entrylist(struct xattr_iter *_it,
-                          struct erofs_xattr_entry *entry)
+static int erofs_listxattr_foreach(struct erofs_xattr_iter *it)
 {
-       struct listxattr_iter *it =
-               container_of(_it, struct listxattr_iter, it);
-       unsigned int base_index = entry->e_name_index;
-       unsigned int prefix_len, infix_len = 0;
+       struct erofs_xattr_entry entry;
+       unsigned int base_index, name_total, prefix_len, infix_len = 0;
        const char *prefix, *infix = NULL;
+       int err;
+
+       /* 1. handle xattr entry */
+       entry = *(struct erofs_xattr_entry *)
+                       (it->kaddr + erofs_blkoff(it->sb, it->pos));
+       it->pos += sizeof(struct erofs_xattr_entry);
 
-       if (entry->e_name_index & EROFS_XATTR_LONG_PREFIX) {
-               struct erofs_sb_info *sbi = EROFS_SB(_it->sb);
+       base_index = entry.e_name_index;
+       if (entry.e_name_index & EROFS_XATTR_LONG_PREFIX) {
+               struct erofs_sb_info *sbi = EROFS_SB(it->sb);
                struct erofs_xattr_prefix_item *pf = sbi->xattr_prefixes +
-                       (entry->e_name_index & EROFS_XATTR_LONG_PREFIX_MASK);
+                       (entry.e_name_index & EROFS_XATTR_LONG_PREFIX_MASK);
 
                if (pf >= sbi->xattr_prefixes + sbi->xattr_prefix_count)
-                       return 1;
+                       return 0;
                infix = pf->prefix->infix;
                infix_len = pf->infix_len;
                base_index = pf->prefix->base_index;
@@ -531,120 +225,228 @@ static int xattr_entrylist(struct xattr_iter *_it,
 
        prefix = erofs_xattr_prefix(base_index, it->dentry);
        if (!prefix)
-               return 1;
+               return 0;
        prefix_len = strlen(prefix);
+       name_total = prefix_len + infix_len + entry.e_name_len + 1;
 
        if (!it->buffer) {
-               it->buffer_ofs += prefix_len + infix_len +
-                                       entry->e_name_len + 1;
-               return 1;
+               it->buffer_ofs += name_total;
+               return 0;
        }
 
-       if (it->buffer_ofs + prefix_len + infix_len +
-               + entry->e_name_len + 1 > it->buffer_size)
+       if (it->buffer_ofs + name_total > it->buffer_size)
                return -ERANGE;
 
        memcpy(it->buffer + it->buffer_ofs, prefix, prefix_len);
        memcpy(it->buffer + it->buffer_ofs + prefix_len, infix, infix_len);
        it->buffer_ofs += prefix_len + infix_len;
-       return 0;
-}
 
-static int xattr_namelist(struct xattr_iter *_it,
-                         unsigned int processed, char *buf, unsigned int len)
-{
-       struct listxattr_iter *it =
-               container_of(_it, struct listxattr_iter, it);
+       /* 2. handle xattr name */
+       err = erofs_xattr_copy_to_buffer(it, entry.e_name_len);
+       if (err)
+               return err;
 
-       memcpy(it->buffer + it->buffer_ofs, buf, len);
-       it->buffer_ofs += len;
+       it->buffer[it->buffer_ofs++] = '\0';
        return 0;
 }
 
-static int xattr_skipvalue(struct xattr_iter *_it,
-                          unsigned int value_sz)
+static int erofs_getxattr_foreach(struct erofs_xattr_iter *it)
 {
-       struct listxattr_iter *it =
-               container_of(_it, struct listxattr_iter, it);
+       struct super_block *sb = it->sb;
+       struct erofs_xattr_entry entry;
+       unsigned int slice, processed, value_sz;
 
-       it->buffer[it->buffer_ofs++] = '\0';
-       return 1;
-}
+       /* 1. handle xattr entry */
+       entry = *(struct erofs_xattr_entry *)
+                       (it->kaddr + erofs_blkoff(sb, it->pos));
+       it->pos += sizeof(struct erofs_xattr_entry);
+       value_sz = le16_to_cpu(entry.e_value_size);
 
-static const struct xattr_iter_handlers list_xattr_handlers = {
-       .entry = xattr_entrylist,
-       .name = xattr_namelist,
-       .alloc_buffer = xattr_skipvalue,
-       .value = NULL
-};
+       /* should also match the infix for long name prefixes */
+       if (entry.e_name_index & EROFS_XATTR_LONG_PREFIX) {
+               struct erofs_sb_info *sbi = EROFS_SB(sb);
+               struct erofs_xattr_prefix_item *pf = sbi->xattr_prefixes +
+                       (entry.e_name_index & EROFS_XATTR_LONG_PREFIX_MASK);
+
+               if (pf >= sbi->xattr_prefixes + sbi->xattr_prefix_count)
+                       return -ENOATTR;
+
+               if (it->index != pf->prefix->base_index ||
+                   it->name.len != entry.e_name_len + pf->infix_len)
+                       return -ENOATTR;
+
+               if (memcmp(it->name.name, pf->prefix->infix, pf->infix_len))
+                       return -ENOATTR;
+
+               it->infix_len = pf->infix_len;
+       } else {
+               if (it->index != entry.e_name_index ||
+                   it->name.len != entry.e_name_len)
+                       return -ENOATTR;
 
-static int inline_listxattr(struct listxattr_iter *it)
+               it->infix_len = 0;
+       }
+
+       /* 2. handle xattr name */
+       for (processed = 0; processed < entry.e_name_len; processed += slice) {
+               it->kaddr = erofs_bread(&it->buf, erofs_blknr(sb, it->pos),
+                                       EROFS_KMAP);
+               if (IS_ERR(it->kaddr))
+                       return PTR_ERR(it->kaddr);
+
+               slice = min_t(unsigned int,
+                               sb->s_blocksize - erofs_blkoff(sb, it->pos),
+                               entry.e_name_len - processed);
+               if (memcmp(it->name.name + it->infix_len + processed,
+                          it->kaddr + erofs_blkoff(sb, it->pos), slice))
+                       return -ENOATTR;
+               it->pos += slice;
+       }
+
+       /* 3. handle xattr value */
+       if (!it->buffer) {
+               it->buffer_ofs = value_sz;
+               return 0;
+       }
+
+       if (it->buffer_size < value_sz)
+               return -ERANGE;
+
+       return erofs_xattr_copy_to_buffer(it, value_sz);
+}
+
+static int erofs_xattr_iter_inline(struct erofs_xattr_iter *it,
+                                  struct inode *inode, bool getxattr)
 {
+       struct erofs_inode *const vi = EROFS_I(inode);
+       unsigned int xattr_header_sz, remaining, entry_sz;
+       erofs_off_t next_pos;
        int ret;
-       unsigned int remaining;
 
-       ret = inline_xattr_iter_begin(&it->it, d_inode(it->dentry));
-       if (ret < 0)
-               return ret;
+       xattr_header_sz = sizeof(struct erofs_xattr_ibody_header) +
+                         sizeof(u32) * vi->xattr_shared_count;
+       if (xattr_header_sz >= vi->xattr_isize) {
+               DBG_BUGON(xattr_header_sz > vi->xattr_isize);
+               return -ENOATTR;
+       }
+
+       remaining = vi->xattr_isize - xattr_header_sz;
+       it->pos = erofs_iloc(inode) + vi->inode_isize + xattr_header_sz;
 
-       remaining = ret;
        while (remaining) {
-               ret = xattr_foreach(&it->it, &list_xattr_handlers, &remaining);
-               if (ret)
+               it->kaddr = erofs_bread(&it->buf, erofs_blknr(it->sb, it->pos),
+                                       EROFS_KMAP);
+               if (IS_ERR(it->kaddr))
+                       return PTR_ERR(it->kaddr);
+
+               entry_sz = erofs_xattr_entry_size(it->kaddr +
+                               erofs_blkoff(it->sb, it->pos));
+               /* xattr on-disk corruption: xattr entry beyond xattr_isize */
+               if (remaining < entry_sz) {
+                       DBG_BUGON(1);
+                       return -EFSCORRUPTED;
+               }
+               remaining -= entry_sz;
+               next_pos = it->pos + entry_sz;
+
+               if (getxattr)
+                       ret = erofs_getxattr_foreach(it);
+               else
+                       ret = erofs_listxattr_foreach(it);
+               if ((getxattr && ret != -ENOATTR) || (!getxattr && ret))
                        break;
+
+               it->pos = next_pos;
        }
-       return ret ? ret : it->buffer_ofs;
+       return ret;
 }
 
-static int shared_listxattr(struct listxattr_iter *it)
+static int erofs_xattr_iter_shared(struct erofs_xattr_iter *it,
+                                  struct inode *inode, bool getxattr)
 {
-       struct inode *const inode = d_inode(it->dentry);
        struct erofs_inode *const vi = EROFS_I(inode);
-       struct super_block *const sb = it->it.sb;
-       unsigned int i, xsid;
-       int ret = 0;
+       struct super_block *const sb = it->sb;
+       struct erofs_sb_info *sbi = EROFS_SB(sb);
+       unsigned int i;
+       int ret = -ENOATTR;
 
        for (i = 0; i < vi->xattr_shared_count; ++i) {
-               xsid = vi->xattr_shared_xattrs[i];
-               it->it.blkaddr = erofs_xattr_blkaddr(sb, xsid);
-               it->it.ofs = erofs_xattr_blkoff(sb, xsid);
-               it->it.kaddr = erofs_read_metabuf(&it->it.buf, sb,
-                                                 it->it.blkaddr, EROFS_KMAP);
-               if (IS_ERR(it->it.kaddr))
-                       return PTR_ERR(it->it.kaddr);
-
-               ret = xattr_foreach(&it->it, &list_xattr_handlers, NULL);
-               if (ret)
+               it->pos = erofs_pos(sb, sbi->xattr_blkaddr) +
+                               vi->xattr_shared_xattrs[i] * sizeof(__le32);
+               it->kaddr = erofs_bread(&it->buf, erofs_blknr(sb, it->pos),
+                                       EROFS_KMAP);
+               if (IS_ERR(it->kaddr))
+                       return PTR_ERR(it->kaddr);
+
+               if (getxattr)
+                       ret = erofs_getxattr_foreach(it);
+               else
+                       ret = erofs_listxattr_foreach(it);
+               if ((getxattr && ret != -ENOATTR) || (!getxattr && ret))
                        break;
        }
-       return ret ? ret : it->buffer_ofs;
+       return ret;
+}
+
+int erofs_getxattr(struct inode *inode, int index, const char *name,
+                  void *buffer, size_t buffer_size)
+{
+       int ret;
+       struct erofs_xattr_iter it;
+
+       if (!name)
+               return -EINVAL;
+
+       ret = erofs_init_inode_xattrs(inode);
+       if (ret)
+               return ret;
+
+       it.index = index;
+       it.name = (struct qstr)QSTR_INIT(name, strlen(name));
+       if (it.name.len > EROFS_NAME_LEN)
+               return -ERANGE;
+
+       it.sb = inode->i_sb;
+       it.buf = __EROFS_BUF_INITIALIZER;
+       erofs_init_metabuf(&it.buf, it.sb);
+       it.buffer = buffer;
+       it.buffer_size = buffer_size;
+       it.buffer_ofs = 0;
+
+       ret = erofs_xattr_iter_inline(&it, inode, true);
+       if (ret == -ENOATTR)
+               ret = erofs_xattr_iter_shared(&it, inode, true);
+       erofs_put_metabuf(&it.buf);
+       return ret ? ret : it.buffer_ofs;
 }
 
-ssize_t erofs_listxattr(struct dentry *dentry,
-                       char *buffer, size_t buffer_size)
+ssize_t erofs_listxattr(struct dentry *dentry, char *buffer, size_t buffer_size)
 {
        int ret;
-       struct listxattr_iter it;
+       struct erofs_xattr_iter it;
+       struct inode *inode = d_inode(dentry);
 
-       ret = erofs_init_inode_xattrs(d_inode(dentry));
+       ret = erofs_init_inode_xattrs(inode);
        if (ret == -ENOATTR)
                return 0;
        if (ret)
                return ret;
 
-       it.it.buf = __EROFS_BUF_INITIALIZER;
+       it.sb = dentry->d_sb;
+       it.buf = __EROFS_BUF_INITIALIZER;
+       erofs_init_metabuf(&it.buf, it.sb);
        it.dentry = dentry;
        it.buffer = buffer;
        it.buffer_size = buffer_size;
        it.buffer_ofs = 0;
 
-       it.it.sb = dentry->d_sb;
-
-       ret = inline_listxattr(&it);
-       if (ret >= 0 || ret == -ENOATTR)
-               ret = shared_listxattr(&it);
-       erofs_put_metabuf(&it.it.buf);
-       return ret;
+       ret = erofs_xattr_iter_inline(&it, inode, false);
+       if (!ret || ret == -ENOATTR)
+               ret = erofs_xattr_iter_shared(&it, inode, false);
+       if (ret == -ENOATTR)
+               ret = 0;
+       erofs_put_metabuf(&it.buf);
+       return ret ? ret : it.buffer_ofs;
 }
 
 void erofs_xattr_prefixes_cleanup(struct super_block *sb)
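
The xattr rework above folds the separate getxattr/listxattr iterators and the xattr_iter_handlers callback table into a single erofs_xattr_iter that carries one byte position: each step maps the containing block with erofs_bread() and splits the position into a block number and an intra-block offset via erofs_blknr()/erofs_blkoff(), so there is no per-block "fixup" state to maintain any more. A sketch of that copy-loop shape, essentially a renamed erofs_xattr_copy_to_buffer(), follows; copy_span is an illustrative name.

	/* Sketch: copy 'len' bytes starting at it->pos, never crossing a block
	 * boundary within one memcpy. */
	static int copy_span(struct erofs_xattr_iter *it, char *dst, unsigned int len)
	{
		struct super_block *sb = it->sb;
		unsigned int slice, done;

		for (done = 0; done < len; done += slice) {
			it->kaddr = erofs_bread(&it->buf, erofs_blknr(sb, it->pos),
						EROFS_KMAP);
			if (IS_ERR(it->kaddr))
				return PTR_ERR(it->kaddr);

			slice = min_t(unsigned int,
				      sb->s_blocksize - erofs_blkoff(sb, it->pos),
				      len - done);
			memcpy(dst + done, it->kaddr + erofs_blkoff(sb, it->pos), slice);
			it->pos += slice;
		}
		return 0;
	}
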
index 160b3da..5f1890e 100644 (file)
@@ -5,7 +5,6 @@
  * Copyright (C) 2022 Alibaba Cloud
  */
 #include "compress.h"
-#include <linux/prefetch.h>
 #include <linux/psi.h>
 #include <linux/cpuhotplug.h>
 #include <trace/events/erofs.h>
@@ -92,13 +91,8 @@ struct z_erofs_pcluster {
        struct z_erofs_bvec compressed_bvecs[];
 };
 
-/* let's avoid the valid 32-bit kernel addresses */
-
-/* the chained workgroup has't submitted io (still open) */
-#define Z_EROFS_PCLUSTER_TAIL           ((void *)0x5F0ECAFE)
-/* the chained workgroup has already submitted io */
-#define Z_EROFS_PCLUSTER_TAIL_CLOSED    ((void *)0x5F0EDEAD)
-
+/* the end of a chain of pclusters */
+#define Z_EROFS_PCLUSTER_TAIL           ((void *) 0x700 + POISON_POINTER_DELTA)
 #define Z_EROFS_PCLUSTER_NIL            (NULL)
 
 struct z_erofs_decompressqueue {
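
With the chain rework, the two sentinels (an "open" TAIL and a TAIL_CLOSED used after submission) collapse into a single Z_EROFS_PCLUSTER_TAIL terminator. It is now built from POISON_POINTER_DELTA so it can never alias a valid pointer, and chain walkers simply advance with READ_ONCE() until they hit it, as the later z_erofs_decompress_queue() hunk shows. A condensed sketch of that walk; walk_chain is an illustrative name.

	static void walk_chain(struct z_erofs_decompressqueue *io)
	{
		z_erofs_next_pcluster_t owned = io->head;

		while (owned != Z_EROFS_PCLUSTER_TAIL) {
			struct z_erofs_pcluster *pcl =
				container_of(owned, struct z_erofs_pcluster, next);

			/* advance first: pcl->next may be rewritten while processing */
			owned = READ_ONCE(pcl->next);
			/* ... decompress or submit pcl ... */
		}
	}
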
@@ -241,14 +235,20 @@ static void z_erofs_bvec_iter_begin(struct z_erofs_bvec_iter *iter,
 
 static int z_erofs_bvec_enqueue(struct z_erofs_bvec_iter *iter,
                                struct z_erofs_bvec *bvec,
-                               struct page **candidate_bvpage)
+                               struct page **candidate_bvpage,
+                               struct page **pagepool)
 {
-       if (iter->cur == iter->nr) {
-               if (!*candidate_bvpage)
-                       return -EAGAIN;
-
+       if (iter->cur >= iter->nr) {
+               struct page *nextpage = *candidate_bvpage;
+
+               if (!nextpage) {
+                       nextpage = erofs_allocpage(pagepool, GFP_NOFS);
+                       if (!nextpage)
+                               return -ENOMEM;
+                       set_page_private(nextpage, Z_EROFS_SHORTLIVED_PAGE);
+               }
                DBG_BUGON(iter->bvset->nextpage);
-               iter->bvset->nextpage = *candidate_bvpage;
+               iter->bvset->nextpage = nextpage;
                z_erofs_bvset_flip(iter);
 
                iter->bvset->nextpage = NULL;
@@ -500,20 +500,6 @@ out_error_pcluster_pool:
 enum z_erofs_pclustermode {
        Z_EROFS_PCLUSTER_INFLIGHT,
        /*
-        * The current pclusters was the tail of an exist chain, in addition
-        * that the previous processed chained pclusters are all decided to
-        * be hooked up to it.
-        * A new chain will be created for the remaining pclusters which are
-        * not processed yet, so different from Z_EROFS_PCLUSTER_FOLLOWED,
-        * the next pcluster cannot reuse the whole page safely for inplace I/O
-        * in the following scenario:
-        *  ________________________________________________________________
-        * |      tail (partial) page     |       head (partial) page       |
-        * |   (belongs to the next pcl)  |   (belongs to the current pcl)  |
-        * |_______PCLUSTER_FOLLOWED______|________PCLUSTER_HOOKED__________|
-        */
-       Z_EROFS_PCLUSTER_HOOKED,
-       /*
         * a weak form of Z_EROFS_PCLUSTER_FOLLOWED, the difference is that it
         * could be dispatched into bypass queue later due to uptodated managed
         * pages. All related online pages cannot be reused for inplace I/O (or
@@ -530,8 +516,8 @@ enum z_erofs_pclustermode {
         *  ________________________________________________________________
         * |  tail (partial) page |          head (partial) page           |
         * |  (of the current cl) |      (of the previous collection)      |
-        * | PCLUSTER_FOLLOWED or |                                        |
-        * |_____PCLUSTER_HOOKED__|___________PCLUSTER_FOLLOWED____________|
+        * |                      |                                        |
+        * |__PCLUSTER_FOLLOWED___|___________PCLUSTER_FOLLOWED____________|
         *
         * [  (*) the above page can be used as inplace I/O.               ]
         */
@@ -543,12 +529,12 @@ struct z_erofs_decompress_frontend {
        struct erofs_map_blocks map;
        struct z_erofs_bvec_iter biter;
 
+       struct page *pagepool;
        struct page *candidate_bvpage;
-       struct z_erofs_pcluster *pcl, *tailpcl;
+       struct z_erofs_pcluster *pcl;
        z_erofs_next_pcluster_t owned_head;
        enum z_erofs_pclustermode mode;
 
-       bool readahead;
        /* used for applying cache strategy on the fly */
        bool backmost;
        erofs_off_t headoffset;
@@ -578,8 +564,7 @@ static bool z_erofs_should_alloc_cache(struct z_erofs_decompress_frontend *fe)
        return false;
 }
 
-static void z_erofs_bind_cache(struct z_erofs_decompress_frontend *fe,
-                              struct page **pagepool)
+static void z_erofs_bind_cache(struct z_erofs_decompress_frontend *fe)
 {
        struct address_space *mc = MNGD_MAPPING(EROFS_I_SB(fe->inode));
        struct z_erofs_pcluster *pcl = fe->pcl;
@@ -620,7 +605,7 @@ static void z_erofs_bind_cache(struct z_erofs_decompress_frontend *fe,
                         * succeeds or fallback to in-place I/O instead
                         * to avoid any direct reclaim.
                         */
-                       newpage = erofs_allocpage(pagepool, gfp);
+                       newpage = erofs_allocpage(&fe->pagepool, gfp);
                        if (!newpage)
                                continue;
                        set_page_private(newpage, Z_EROFS_PREALLOCATED_PAGE);
@@ -633,7 +618,7 @@ static void z_erofs_bind_cache(struct z_erofs_decompress_frontend *fe,
                if (page)
                        put_page(page);
                else if (newpage)
-                       erofs_pagepool_add(pagepool, newpage);
+                       erofs_pagepool_add(&fe->pagepool, newpage);
        }
 
        /*
@@ -654,7 +639,7 @@ int erofs_try_to_free_all_cached_pages(struct erofs_sb_info *sbi,
 
        DBG_BUGON(z_erofs_is_inline_pcluster(pcl));
        /*
-        * refcount of workgroup is now freezed as 1,
+        * refcount of workgroup is now frozen at 0,
         * therefore no need to worry about available decompression users.
         */
        for (i = 0; i < pcl->pclusterpages; ++i) {
@@ -678,29 +663,73 @@ int erofs_try_to_free_all_cached_pages(struct erofs_sb_info *sbi,
        return 0;
 }
 
-int erofs_try_to_free_cached_page(struct page *page)
+static bool z_erofs_cache_release_folio(struct folio *folio, gfp_t gfp)
 {
-       struct z_erofs_pcluster *const pcl = (void *)page_private(page);
-       int ret, i;
+       struct z_erofs_pcluster *pcl = folio_get_private(folio);
+       bool ret;
+       int i;
 
-       if (!erofs_workgroup_try_to_freeze(&pcl->obj, 1))
-               return 0;
+       if (!folio_test_private(folio))
+               return true;
+
+       ret = false;
+       spin_lock(&pcl->obj.lockref.lock);
+       if (pcl->obj.lockref.count > 0)
+               goto out;
 
-       ret = 0;
        DBG_BUGON(z_erofs_is_inline_pcluster(pcl));
        for (i = 0; i < pcl->pclusterpages; ++i) {
-               if (pcl->compressed_bvecs[i].page == page) {
+               if (pcl->compressed_bvecs[i].page == &folio->page) {
                        WRITE_ONCE(pcl->compressed_bvecs[i].page, NULL);
-                       ret = 1;
+                       ret = true;
                        break;
                }
        }
-       erofs_workgroup_unfreeze(&pcl->obj, 1);
        if (ret)
-               detach_page_private(page);
+               folio_detach_private(folio);
+out:
+       spin_unlock(&pcl->obj.lockref.lock);
        return ret;
 }
 
+/*
+ * It will be called only on inode eviction. In case that there are still some
+ * It will be called only on inode eviction. If there are still some
+ * decompression requests in progress, wait with rescheduling for a bit here.
+ */
+static void z_erofs_cache_invalidate_folio(struct folio *folio,
+                                          size_t offset, size_t length)
+{
+       const size_t stop = length + offset;
+
+       /* Check for potential overflow in debug mode */
+       DBG_BUGON(stop > folio_size(folio) || stop < length);
+
+       if (offset == 0 && stop == folio_size(folio))
+               while (!z_erofs_cache_release_folio(folio, GFP_NOFS))
+                       cond_resched();
+}
+
+static const struct address_space_operations z_erofs_cache_aops = {
+       .release_folio = z_erofs_cache_release_folio,
+       .invalidate_folio = z_erofs_cache_invalidate_folio,
+};
+
+int erofs_init_managed_cache(struct super_block *sb)
+{
+       struct inode *const inode = new_inode(sb);
+
+       if (!inode)
+               return -ENOMEM;
+
+       set_nlink(inode, 1);
+       inode->i_size = OFFSET_MAX;
+       inode->i_mapping->a_ops = &z_erofs_cache_aops;
+       mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
+       EROFS_SB(sb)->managed_cache = inode;
+       return 0;
+}
+
 static bool z_erofs_try_inplace_io(struct z_erofs_decompress_frontend *fe,
                                   struct z_erofs_bvec *bvec)
 {
@@ -731,7 +760,8 @@ static int z_erofs_attach_page(struct z_erofs_decompress_frontend *fe,
                    !fe->candidate_bvpage)
                        fe->candidate_bvpage = bvec->page;
        }
-       ret = z_erofs_bvec_enqueue(&fe->biter, bvec, &fe->candidate_bvpage);
+       ret = z_erofs_bvec_enqueue(&fe->biter, bvec, &fe->candidate_bvpage,
+                                  &fe->pagepool);
        fe->pcl->vcnt += (ret >= 0);
        return ret;
 }
@@ -750,19 +780,7 @@ static void z_erofs_try_to_claim_pcluster(struct z_erofs_decompress_frontend *f)
                return;
        }
 
-       /*
-        * type 2, link to the end of an existing open chain, be careful
-        * that its submission is controlled by the original attached chain.
-        */
-       if (*owned_head != &pcl->next && pcl != f->tailpcl &&
-           cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_TAIL,
-                   *owned_head) == Z_EROFS_PCLUSTER_TAIL) {
-               *owned_head = Z_EROFS_PCLUSTER_TAIL;
-               f->mode = Z_EROFS_PCLUSTER_HOOKED;
-               f->tailpcl = NULL;
-               return;
-       }
-       /* type 3, it belongs to a chain, but it isn't the end of the chain */
+       /* type 2, it belongs to an ongoing chain */
        f->mode = Z_EROFS_PCLUSTER_INFLIGHT;
 }
 
@@ -786,7 +804,7 @@ static int z_erofs_register_pcluster(struct z_erofs_decompress_frontend *fe)
        if (IS_ERR(pcl))
                return PTR_ERR(pcl);
 
-       atomic_set(&pcl->obj.refcount, 1);
+       spin_lock_init(&pcl->obj.lockref.lock);
        pcl->algorithmformat = map->m_algorithmformat;
        pcl->length = 0;
        pcl->partial = true;
@@ -823,9 +841,6 @@ static int z_erofs_register_pcluster(struct z_erofs_decompress_frontend *fe)
                        goto err_out;
                }
        }
-       /* used to check tail merging loop due to corrupted images */
-       if (fe->owned_head == Z_EROFS_PCLUSTER_TAIL)
-               fe->tailpcl = pcl;
        fe->owned_head = &pcl->next;
        fe->pcl = pcl;
        return 0;
@@ -846,7 +861,6 @@ static int z_erofs_collector_begin(struct z_erofs_decompress_frontend *fe)
 
         /* must be Z_EROFS_PCLUSTER_TAIL or point to a previous pcluster */
        DBG_BUGON(fe->owned_head == Z_EROFS_PCLUSTER_NIL);
-       DBG_BUGON(fe->owned_head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
 
        if (!(map->m_flags & EROFS_MAP_META)) {
                grp = erofs_find_workgroup(fe->inode->i_sb,
@@ -865,10 +879,6 @@ static int z_erofs_collector_begin(struct z_erofs_decompress_frontend *fe)
 
        if (ret == -EEXIST) {
                mutex_lock(&fe->pcl->lock);
-               /* used to check tail merging loop due to corrupted images */
-               if (fe->owned_head == Z_EROFS_PCLUSTER_TAIL)
-                       fe->tailpcl = fe->pcl;
-
                z_erofs_try_to_claim_pcluster(fe);
        } else if (ret) {
                return ret;
@@ -908,10 +918,8 @@ static bool z_erofs_collector_end(struct z_erofs_decompress_frontend *fe)
        z_erofs_bvec_iter_end(&fe->biter);
        mutex_unlock(&pcl->lock);
 
-       if (fe->candidate_bvpage) {
-               DBG_BUGON(z_erofs_is_shortlived_page(fe->candidate_bvpage));
+       if (fe->candidate_bvpage)
                fe->candidate_bvpage = NULL;
-       }
 
        /*
         * if all pending pages are added, don't hold its reference
@@ -958,7 +966,7 @@ static int z_erofs_read_fragment(struct inode *inode, erofs_off_t pos,
 }
 
 static int z_erofs_do_read_page(struct z_erofs_decompress_frontend *fe,
-                               struct page *page, struct page **pagepool)
+                               struct page *page)
 {
        struct inode *const inode = fe->inode;
        struct erofs_map_blocks *const map = &fe->map;
@@ -1016,7 +1024,7 @@ repeat:
                fe->mode = Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE;
        } else {
                /* bind cache first when cached decompression is preferred */
-               z_erofs_bind_cache(fe, pagepool);
+               z_erofs_bind_cache(fe);
        }
 hitted:
        /*
@@ -1025,8 +1033,7 @@ hitted:
         * those chains are handled asynchronously thus the page cannot be used
         * for inplace I/O or bvpage (should be processed in a strict order.)
         */
-       tight &= (fe->mode >= Z_EROFS_PCLUSTER_HOOKED &&
-                 fe->mode != Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE);
+       tight &= (fe->mode > Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE);
 
        cur = end - min_t(unsigned int, offset + end - map->m_la, end);
        if (!(map->m_flags & EROFS_MAP_MAPPED)) {
@@ -1056,24 +1063,13 @@ hitted:
        if (cur)
                tight &= (fe->mode >= Z_EROFS_PCLUSTER_FOLLOWED);
 
-retry:
        err = z_erofs_attach_page(fe, &((struct z_erofs_bvec) {
                                        .page = page,
                                        .offset = offset - map->m_la,
                                        .end = end,
                                  }), exclusive);
-       /* should allocate an additional short-lived page for bvset */
-       if (err == -EAGAIN && !fe->candidate_bvpage) {
-               fe->candidate_bvpage = alloc_page(GFP_NOFS | __GFP_NOFAIL);
-               set_page_private(fe->candidate_bvpage,
-                                Z_EROFS_SHORTLIVED_PAGE);
-               goto retry;
-       }
-
-       if (err) {
-               DBG_BUGON(err == -EAGAIN && fe->candidate_bvpage);
+       if (err)
                goto out;
-       }
 
        z_erofs_onlinepage_split(page);
         /* bump up the number of split parts of a page */
@@ -1104,7 +1100,7 @@ out:
        return err;
 }
 
-static bool z_erofs_get_sync_decompress_policy(struct erofs_sb_info *sbi,
+static bool z_erofs_is_sync_decompress(struct erofs_sb_info *sbi,
                                       unsigned int readahead_pages)
 {
        /* auto: enable for read_folio, disable for readahead */
@@ -1283,6 +1279,8 @@ static int z_erofs_decompress_pcluster(struct z_erofs_decompress_backend *be,
        struct erofs_sb_info *const sbi = EROFS_SB(be->sb);
        struct z_erofs_pcluster *pcl = be->pcl;
        unsigned int pclusterpages = z_erofs_pclusterpages(pcl);
+       const struct z_erofs_decompressor *decompressor =
+                               &erofs_decompressors[pcl->algorithmformat];
        unsigned int i, inputsize;
        int err2;
        struct page *page;
@@ -1326,7 +1324,7 @@ static int z_erofs_decompress_pcluster(struct z_erofs_decompress_backend *be,
        else
                inputsize = pclusterpages * PAGE_SIZE;
 
-       err = z_erofs_decompress(&(struct z_erofs_decompress_req) {
+       err = decompressor->decompress(&(struct z_erofs_decompress_req) {
                                        .sb = be->sb,
                                        .in = be->compressed_pages,
                                        .out = be->decompressed_pages,
@@ -1404,10 +1402,7 @@ static void z_erofs_decompress_queue(const struct z_erofs_decompressqueue *io,
        };
        z_erofs_next_pcluster_t owned = io->head;
 
-       while (owned != Z_EROFS_PCLUSTER_TAIL_CLOSED) {
-               /* impossible that 'owned' equals Z_EROFS_WORK_TPTR_TAIL */
-               DBG_BUGON(owned == Z_EROFS_PCLUSTER_TAIL);
-               /* impossible that 'owned' equals Z_EROFS_PCLUSTER_NIL */
+       while (owned != Z_EROFS_PCLUSTER_TAIL) {
                DBG_BUGON(owned == Z_EROFS_PCLUSTER_NIL);
 
                be.pcl = container_of(owned, struct z_erofs_pcluster, next);
@@ -1424,7 +1419,7 @@ static void z_erofs_decompressqueue_work(struct work_struct *work)
                container_of(work, struct z_erofs_decompressqueue, u.work);
        struct page *pagepool = NULL;
 
-       DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
+       DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL);
        z_erofs_decompress_queue(bgq, &pagepool);
        erofs_release_pages(&pagepool);
        kvfree(bgq);
@@ -1452,7 +1447,7 @@ static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
        if (atomic_add_return(bios, &io->pending_bios))
                return;
        /* Use (kthread_)work and sync decompression for atomic contexts only */
-       if (in_atomic() || irqs_disabled()) {
+       if (!in_task() || irqs_disabled() || rcu_read_lock_any_held()) {
 #ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
                struct kthread_worker *worker;
 
@@ -1612,7 +1607,7 @@ fg_out:
                q->sync = true;
        }
        q->sb = sb;
-       q->head = Z_EROFS_PCLUSTER_TAIL_CLOSED;
+       q->head = Z_EROFS_PCLUSTER_TAIL;
        return q;
 }
 
@@ -1630,11 +1625,7 @@ static void move_to_bypass_jobqueue(struct z_erofs_pcluster *pcl,
        z_erofs_next_pcluster_t *const submit_qtail = qtail[JQ_SUBMIT];
        z_erofs_next_pcluster_t *const bypass_qtail = qtail[JQ_BYPASS];
 
-       DBG_BUGON(owned_head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
-       if (owned_head == Z_EROFS_PCLUSTER_TAIL)
-               owned_head = Z_EROFS_PCLUSTER_TAIL_CLOSED;
-
-       WRITE_ONCE(pcl->next, Z_EROFS_PCLUSTER_TAIL_CLOSED);
+       WRITE_ONCE(pcl->next, Z_EROFS_PCLUSTER_TAIL);
 
        WRITE_ONCE(*submit_qtail, owned_head);
        WRITE_ONCE(*bypass_qtail, &pcl->next);
@@ -1668,9 +1659,8 @@ static void z_erofs_decompressqueue_endio(struct bio *bio)
 }
 
 static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
-                                struct page **pagepool,
                                 struct z_erofs_decompressqueue *fgq,
-                                bool *force_fg)
+                                bool *force_fg, bool readahead)
 {
        struct super_block *sb = f->inode->i_sb;
        struct address_space *mc = MNGD_MAPPING(EROFS_SB(sb));
@@ -1705,15 +1695,10 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
                unsigned int i = 0;
                bool bypass = true;
 
-               /* no possible 'owned_head' equals the following */
-               DBG_BUGON(owned_head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
                DBG_BUGON(owned_head == Z_EROFS_PCLUSTER_NIL);
-
                pcl = container_of(owned_head, struct z_erofs_pcluster, next);
+               owned_head = READ_ONCE(pcl->next);
 
-               /* close the main owned chain at first */
-               owned_head = cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_TAIL,
-                                    Z_EROFS_PCLUSTER_TAIL_CLOSED);
                if (z_erofs_is_inline_pcluster(pcl)) {
                        move_to_bypass_jobqueue(pcl, qtail, owned_head);
                        continue;
@@ -1731,8 +1716,8 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
                do {
                        struct page *page;
 
-                       page = pickup_page_for_submission(pcl, i++, pagepool,
-                                                         mc);
+                       page = pickup_page_for_submission(pcl, i++,
+                                       &f->pagepool, mc);
                        if (!page)
                                continue;
 
@@ -1761,7 +1746,7 @@ submit_bio_retry:
                                bio->bi_iter.bi_sector = (sector_t)cur <<
                                        (sb->s_blocksize_bits - 9);
                                bio->bi_private = q[JQ_SUBMIT];
-                               if (f->readahead)
+                               if (readahead)
                                        bio->bi_opf |= REQ_RAHEAD;
                                ++nr_bios;
                        }
@@ -1797,16 +1782,16 @@ submit_bio_retry:
 }
 
 static void z_erofs_runqueue(struct z_erofs_decompress_frontend *f,
-                            struct page **pagepool, bool force_fg)
+                            bool force_fg, bool ra)
 {
        struct z_erofs_decompressqueue io[NR_JOBQUEUES];
 
        if (f->owned_head == Z_EROFS_PCLUSTER_TAIL)
                return;
-       z_erofs_submit_queue(f, pagepool, io, &force_fg);
+       z_erofs_submit_queue(f, io, &force_fg, ra);
 
        /* handle bypass queue (no i/o pclusters) immediately */
-       z_erofs_decompress_queue(&io[JQ_BYPASS], pagepool);
+       z_erofs_decompress_queue(&io[JQ_BYPASS], &f->pagepool);
 
        if (!force_fg)
                return;
@@ -1815,7 +1800,7 @@ static void z_erofs_runqueue(struct z_erofs_decompress_frontend *f,
        wait_for_completion_io(&io[JQ_SUBMIT].u.done);
 
        /* handle synchronous decompress queue in the caller context */
-       z_erofs_decompress_queue(&io[JQ_SUBMIT], pagepool);
+       z_erofs_decompress_queue(&io[JQ_SUBMIT], &f->pagepool);
 }
 
 /*
@@ -1823,29 +1808,28 @@ static void z_erofs_runqueue(struct z_erofs_decompress_frontend *f,
  * approximate readmore strategies as a start.
  */
 static void z_erofs_pcluster_readmore(struct z_erofs_decompress_frontend *f,
-                                     struct readahead_control *rac,
-                                     erofs_off_t end,
-                                     struct page **pagepool,
-                                     bool backmost)
+               struct readahead_control *rac, bool backmost)
 {
        struct inode *inode = f->inode;
        struct erofs_map_blocks *map = &f->map;
-       erofs_off_t cur;
+       erofs_off_t cur, end, headoffset = f->headoffset;
        int err;
 
        if (backmost) {
+               if (rac)
+                       end = headoffset + readahead_length(rac) - 1;
+               else
+                       end = headoffset + PAGE_SIZE - 1;
                map->m_la = end;
                err = z_erofs_map_blocks_iter(inode, map,
                                              EROFS_GET_BLOCKS_READMORE);
                if (err)
                        return;
 
-               /* expend ra for the trailing edge if readahead */
+               /* expand ra for the trailing edge if readahead */
                if (rac) {
-                       loff_t newstart = readahead_pos(rac);
-
                        cur = round_up(map->m_la + map->m_llen, PAGE_SIZE);
-                       readahead_expand(rac, newstart, cur - newstart);
+                       readahead_expand(rac, headoffset, cur - headoffset);
                        return;
                }
                end = round_up(end, PAGE_SIZE);
@@ -1866,7 +1850,7 @@ static void z_erofs_pcluster_readmore(struct z_erofs_decompress_frontend *f,
                        if (PageUptodate(page)) {
                                unlock_page(page);
                        } else {
-                               err = z_erofs_do_read_page(f, page, pagepool);
+                               err = z_erofs_do_read_page(f, page);
                                if (err)
                                        erofs_err(inode->i_sb,
                                                  "readmore error at page %lu @ nid %llu",
@@ -1887,28 +1871,24 @@ static int z_erofs_read_folio(struct file *file, struct folio *folio)
        struct inode *const inode = page->mapping->host;
        struct erofs_sb_info *const sbi = EROFS_I_SB(inode);
        struct z_erofs_decompress_frontend f = DECOMPRESS_FRONTEND_INIT(inode);
-       struct page *pagepool = NULL;
        int err;
 
        trace_erofs_readpage(page, false);
        f.headoffset = (erofs_off_t)page->index << PAGE_SHIFT;
 
-       z_erofs_pcluster_readmore(&f, NULL, f.headoffset + PAGE_SIZE - 1,
-                                 &pagepool, true);
-       err = z_erofs_do_read_page(&f, page, &pagepool);
-       z_erofs_pcluster_readmore(&f, NULL, 0, &pagepool, false);
-
+       z_erofs_pcluster_readmore(&f, NULL, true);
+       err = z_erofs_do_read_page(&f, page);
+       z_erofs_pcluster_readmore(&f, NULL, false);
        (void)z_erofs_collector_end(&f);
 
        /* if some compressed cluster ready, need submit them anyway */
-       z_erofs_runqueue(&f, &pagepool,
-                        z_erofs_get_sync_decompress_policy(sbi, 0));
+       z_erofs_runqueue(&f, z_erofs_is_sync_decompress(sbi, 0), false);
 
        if (err)
                erofs_err(inode->i_sb, "failed to read, err [%d]", err);
 
        erofs_put_metabuf(&f.map.buf);
-       erofs_release_pages(&pagepool);
+       erofs_release_pages(&f.pagepool);
        return err;
 }
 
@@ -1917,14 +1897,12 @@ static void z_erofs_readahead(struct readahead_control *rac)
        struct inode *const inode = rac->mapping->host;
        struct erofs_sb_info *const sbi = EROFS_I_SB(inode);
        struct z_erofs_decompress_frontend f = DECOMPRESS_FRONTEND_INIT(inode);
-       struct page *pagepool = NULL, *head = NULL, *page;
+       struct page *head = NULL, *page;
        unsigned int nr_pages;
 
-       f.readahead = true;
        f.headoffset = readahead_pos(rac);
 
-       z_erofs_pcluster_readmore(&f, rac, f.headoffset +
-                                 readahead_length(rac) - 1, &pagepool, true);
+       z_erofs_pcluster_readmore(&f, rac, true);
        nr_pages = readahead_count(rac);
        trace_erofs_readpages(inode, readahead_index(rac), nr_pages, false);
 
@@ -1940,20 +1918,19 @@ static void z_erofs_readahead(struct readahead_control *rac)
                /* traversal in reverse order */
                head = (void *)page_private(page);
 
-               err = z_erofs_do_read_page(&f, page, &pagepool);
+               err = z_erofs_do_read_page(&f, page);
                if (err)
                        erofs_err(inode->i_sb,
                                  "readahead error at page %lu @ nid %llu",
                                  page->index, EROFS_I(inode)->nid);
                put_page(page);
        }
-       z_erofs_pcluster_readmore(&f, rac, 0, &pagepool, false);
+       z_erofs_pcluster_readmore(&f, rac, false);
        (void)z_erofs_collector_end(&f);
 
-       z_erofs_runqueue(&f, &pagepool,
-                        z_erofs_get_sync_decompress_policy(sbi, nr_pages));
+       z_erofs_runqueue(&f, z_erofs_is_sync_decompress(sbi, nr_pages), true);
        erofs_put_metabuf(&f.map.buf);
-       erofs_release_pages(&pagepool);
+       erofs_release_pages(&f.pagepool);
 }
 
 const struct address_space_operations z_erofs_aops = {
index d37c5c8..1909dda 100644 (file)
@@ -22,8 +22,8 @@ struct z_erofs_maprecorder {
        bool partialref;
 };
 
-static int legacy_load_cluster_from_disk(struct z_erofs_maprecorder *m,
-                                        unsigned long lcn)
+static int z_erofs_load_full_lcluster(struct z_erofs_maprecorder *m,
+                                     unsigned long lcn)
 {
        struct inode *const inode = m->inode;
        struct erofs_inode *const vi = EROFS_I(inode);
@@ -129,7 +129,7 @@ static int unpack_compacted_index(struct z_erofs_maprecorder *m,
        u8 *in, type;
        bool big_pcluster;
 
-       if (1 << amortizedshift == 4)
+       if (1 << amortizedshift == 4 && lclusterbits <= 14)
                vcnt = 2;
        else if (1 << amortizedshift == 2 && lclusterbits == 12)
                vcnt = 16;
@@ -226,12 +226,11 @@ static int unpack_compacted_index(struct z_erofs_maprecorder *m,
        return 0;
 }
 
-static int compacted_load_cluster_from_disk(struct z_erofs_maprecorder *m,
-                                           unsigned long lcn, bool lookahead)
+static int z_erofs_load_compact_lcluster(struct z_erofs_maprecorder *m,
+                                        unsigned long lcn, bool lookahead)
 {
        struct inode *const inode = m->inode;
        struct erofs_inode *const vi = EROFS_I(inode);
-       const unsigned int lclusterbits = vi->z_logical_clusterbits;
        const erofs_off_t ebase = sizeof(struct z_erofs_map_header) +
                ALIGN(erofs_iloc(inode) + vi->inode_isize + vi->xattr_isize, 8);
        unsigned int totalidx = erofs_iblks(inode);
@@ -239,9 +238,6 @@ static int compacted_load_cluster_from_disk(struct z_erofs_maprecorder *m,
        unsigned int amortizedshift;
        erofs_off_t pos;
 
-       if (lclusterbits != 12)
-               return -EOPNOTSUPP;
-
        if (lcn >= totalidx)
                return -EINVAL;
 
@@ -281,23 +277,23 @@ out:
        return unpack_compacted_index(m, amortizedshift, pos, lookahead);
 }
 
-static int z_erofs_load_cluster_from_disk(struct z_erofs_maprecorder *m,
-                                         unsigned int lcn, bool lookahead)
+static int z_erofs_load_lcluster_from_disk(struct z_erofs_maprecorder *m,
+                                          unsigned int lcn, bool lookahead)
 {
-       const unsigned int datamode = EROFS_I(m->inode)->datalayout;
-
-       if (datamode == EROFS_INODE_COMPRESSED_FULL)
-               return legacy_load_cluster_from_disk(m, lcn);
-
-       if (datamode == EROFS_INODE_COMPRESSED_COMPACT)
-               return compacted_load_cluster_from_disk(m, lcn, lookahead);
-
-       return -EINVAL;
+       switch (EROFS_I(m->inode)->datalayout) {
+       case EROFS_INODE_COMPRESSED_FULL:
+               return z_erofs_load_full_lcluster(m, lcn);
+       case EROFS_INODE_COMPRESSED_COMPACT:
+               return z_erofs_load_compact_lcluster(m, lcn, lookahead);
+       default:
+               return -EINVAL;
+       }
 }
 
 static int z_erofs_extent_lookback(struct z_erofs_maprecorder *m,
                                   unsigned int lookback_distance)
 {
+       struct super_block *sb = m->inode->i_sb;
        struct erofs_inode *const vi = EROFS_I(m->inode);
        const unsigned int lclusterbits = vi->z_logical_clusterbits;
 
@@ -305,21 +301,15 @@ static int z_erofs_extent_lookback(struct z_erofs_maprecorder *m,
                unsigned long lcn = m->lcn - lookback_distance;
                int err;
 
-               /* load extent head logical cluster if needed */
-               err = z_erofs_load_cluster_from_disk(m, lcn, false);
+               err = z_erofs_load_lcluster_from_disk(m, lcn, false);
                if (err)
                        return err;
 
                switch (m->type) {
                case Z_EROFS_LCLUSTER_TYPE_NONHEAD:
-                       if (!m->delta[0]) {
-                               erofs_err(m->inode->i_sb,
-                                         "invalid lookback distance 0 @ nid %llu",
-                                         vi->nid);
-                               DBG_BUGON(1);
-                               return -EFSCORRUPTED;
-                       }
                        lookback_distance = m->delta[0];
+                       if (!lookback_distance)
+                               goto err_bogus;
                        continue;
                case Z_EROFS_LCLUSTER_TYPE_PLAIN:
                case Z_EROFS_LCLUSTER_TYPE_HEAD1:
@@ -328,16 +318,15 @@ static int z_erofs_extent_lookback(struct z_erofs_maprecorder *m,
                        m->map->m_la = (lcn << lclusterbits) | m->clusterofs;
                        return 0;
                default:
-                       erofs_err(m->inode->i_sb,
-                                 "unknown type %u @ lcn %lu of nid %llu",
+                       erofs_err(sb, "unknown type %u @ lcn %lu of nid %llu",
                                  m->type, lcn, vi->nid);
                        DBG_BUGON(1);
                        return -EOPNOTSUPP;
                }
        }
-
-       erofs_err(m->inode->i_sb, "bogus lookback distance @ nid %llu",
-                 vi->nid);
+err_bogus:
+       erofs_err(sb, "bogus lookback distance %u @ lcn %lu of nid %llu",
+                 lookback_distance, m->lcn, vi->nid);
        DBG_BUGON(1);
        return -EFSCORRUPTED;
 }
@@ -369,7 +358,7 @@ static int z_erofs_get_extent_compressedlen(struct z_erofs_maprecorder *m,
        if (m->compressedblks)
                goto out;
 
-       err = z_erofs_load_cluster_from_disk(m, lcn, false);
+       err = z_erofs_load_lcluster_from_disk(m, lcn, false);
        if (err)
                return err;
 
@@ -401,9 +390,8 @@ static int z_erofs_get_extent_compressedlen(struct z_erofs_maprecorder *m,
                        break;
                fallthrough;
        default:
-               erofs_err(m->inode->i_sb,
-                         "cannot found CBLKCNT @ lcn %lu of nid %llu",
-                         lcn, vi->nid);
+               erofs_err(sb, "cannot found CBLKCNT @ lcn %lu of nid %llu", lcn,
+                         vi->nid);
                DBG_BUGON(1);
                return -EFSCORRUPTED;
        }
@@ -411,9 +399,7 @@ out:
        map->m_plen = erofs_pos(sb, m->compressedblks);
        return 0;
 err_bonus_cblkcnt:
-       erofs_err(m->inode->i_sb,
-                 "bogus CBLKCNT @ lcn %lu of nid %llu",
-                 lcn, vi->nid);
+       erofs_err(sb, "bogus CBLKCNT @ lcn %lu of nid %llu", lcn, vi->nid);
        DBG_BUGON(1);
        return -EFSCORRUPTED;
 }
@@ -434,7 +420,7 @@ static int z_erofs_get_extent_decompressedlen(struct z_erofs_maprecorder *m)
                        return 0;
                }
 
-               err = z_erofs_load_cluster_from_disk(m, lcn, true);
+               err = z_erofs_load_lcluster_from_disk(m, lcn, true);
                if (err)
                        return err;
 
@@ -481,7 +467,7 @@ static int z_erofs_do_map_blocks(struct inode *inode,
        initial_lcn = ofs >> lclusterbits;
        endoff = ofs & ((1 << lclusterbits) - 1);
 
-       err = z_erofs_load_cluster_from_disk(&m, initial_lcn, false);
+       err = z_erofs_load_lcluster_from_disk(&m, initial_lcn, false);
        if (err)
                goto unmap_out;
 
@@ -539,8 +525,7 @@ static int z_erofs_do_map_blocks(struct inode *inode,
        if (flags & EROFS_GET_BLOCKS_FINDTAIL) {
                vi->z_tailextent_headlcn = m.lcn;
                /* for non-compact indexes, fragmentoff is 64 bits */
-               if (fragment &&
-                   vi->datalayout == EROFS_INODE_COMPRESSED_FULL)
+               if (fragment && vi->datalayout == EROFS_INODE_COMPRESSED_FULL)
                        vi->z_fragmentoff |= (u64)m.pblk << 32;
        }
        if (ztailpacking && m.lcn == vi->z_tailextent_headlcn) {
index 95850a1..8aa36cd 100644 (file)
@@ -33,17 +33,17 @@ struct eventfd_ctx {
        /*
         * Every time that a write(2) is performed on an eventfd, the
         * value of the __u64 being written is added to "count" and a
-        * wakeup is performed on "wqh". A read(2) will return the "count"
-        * value to userspace, and will reset "count" to zero. The kernel
-        * side eventfd_signal() also, adds to the "count" counter and
-        * issue a wakeup.
+        * wakeup is performed on "wqh". If the EFD_SEMAPHORE flag was not
+        * specified, a read(2) will return the "count" value to userspace,
+        * and will reset "count" to zero. The kernel-side eventfd_signal()
+        * also adds to the "count" counter and issues a wakeup.
         */
        __u64 count;
        unsigned int flags;
        int id;
 };
 
-__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, unsigned mask)
+__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask)
 {
        unsigned long flags;
 
@@ -301,6 +301,8 @@ static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
                   (unsigned long long)ctx->count);
        spin_unlock_irq(&ctx->wqh.lock);
        seq_printf(m, "eventfd-id: %d\n", ctx->id);
+       seq_printf(m, "eventfd-semaphore: %d\n",
+                  !!(ctx->flags & EFD_SEMAPHORE));
 }
 #endif
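
As an aside on the EFD_SEMAPHORE semantics documented in the comment above, here is a minimal userspace sketch (not part of the diff, plain eventfd(2) usage) showing how the flag changes what read(2) returns:

/* Minimal sketch: eventfd read semantics with and without EFD_SEMAPHORE. */
#include <sys/eventfd.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        uint64_t v;
        int fd = eventfd(3, 0);                 /* counter starts at 3 */

        if (read(fd, &v, sizeof(v)) == sizeof(v))
                printf("plain:     %llu\n", (unsigned long long)v); /* 3, counter reset to 0 */
        close(fd);

        fd = eventfd(3, EFD_SEMAPHORE);         /* semaphore mode */
        if (read(fd, &v, sizeof(v)) == sizeof(v))
                printf("semaphore: %llu\n", (unsigned long long)v); /* 1, counter drops to 2 */
        close(fd);
        return 0;
}

With the fdinfo hunk above applied, /proc/<pid>/fdinfo/<fd> for an eventfd additionally reports an "eventfd-semaphore: 0/1" line, so the mode is visible without tracing.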
 
index 266d45c..4b1b336 100644 (file)
@@ -536,7 +536,7 @@ static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi,
 #else
 
 static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi,
-                            unsigned pollflags)
+                            __poll_t pollflags)
 {
        wake_up_poll(&ep->poll_wait, EPOLLIN | pollflags);
 }
index a466e79..25c65b6 100644 (file)
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -220,7 +220,7 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
         */
        mmap_read_lock(bprm->mm);
        ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
-                       &page, NULL, NULL);
+                       &page, NULL);
        mmap_read_unlock(bprm->mm);
        if (ret <= 0)
                return NULL;
index e99183a..3cbd270 100644 (file)
@@ -389,7 +389,7 @@ const struct file_operations exfat_file_operations = {
 #endif
        .mmap           = generic_file_mmap,
        .fsync          = exfat_file_fsync,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .splice_write   = iter_file_splice_write,
 };
 
index 6b4bebe..d1ae0f0 100644 (file)
@@ -192,7 +192,7 @@ const struct file_operations ext2_file_operations = {
        .release        = ext2_release_file,
        .fsync          = ext2_fsync,
        .get_unmapped_area = thp_get_unmapped_area,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .splice_write   = iter_file_splice_write,
 };
 
index 8104a21..02fa8a6 100644 (file)
@@ -2968,6 +2968,7 @@ int ext4_fileattr_set(struct mnt_idmap *idmap,
 int ext4_fileattr_get(struct dentry *dentry, struct fileattr *fa);
 extern void ext4_reset_inode_seed(struct inode *inode);
 int ext4_update_overhead(struct super_block *sb, bool force);
+int ext4_force_shutdown(struct super_block *sb, u32 flags);
 
 /* migrate.c */
 extern int ext4_ext_migrate(struct inode *);
index d101b3b..6a16d07 100644 (file)
@@ -147,6 +147,17 @@ static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
        return generic_file_read_iter(iocb, to);
 }
 
+static ssize_t ext4_file_splice_read(struct file *in, loff_t *ppos,
+                                    struct pipe_inode_info *pipe,
+                                    size_t len, unsigned int flags)
+{
+       struct inode *inode = file_inode(in);
+
+       if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+               return -EIO;
+       return filemap_splice_read(in, ppos, pipe, len, flags);
+}
+
 /*
  * Called when an inode is released. Note that this is different
  * from ext4_file_open: open gets called at every open, but release
@@ -285,18 +296,13 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
        if (ret <= 0)
                goto out;
 
-       current->backing_dev_info = inode_to_bdi(inode);
        ret = generic_perform_write(iocb, from);
-       current->backing_dev_info = NULL;
 
 out:
        inode_unlock(inode);
-       if (likely(ret > 0)) {
-               iocb->ki_pos += ret;
-               ret = generic_write_sync(iocb, ret);
-       }
-
-       return ret;
+       if (unlikely(ret <= 0))
+               return ret;
+       return generic_write_sync(iocb, ret);
 }
 
 static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
@@ -957,7 +963,7 @@ const struct file_operations ext4_file_operations = {
        .release        = ext4_release_file,
        .fsync          = ext4_sync_file,
        .get_unmapped_area = thp_get_unmapped_area,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = ext4_file_splice_read,
        .splice_write   = iter_file_splice_write,
        .fallocate      = ext4_fallocate,
 };
index 02de439..9ca5833 100644 (file)
@@ -1093,7 +1093,7 @@ static int ext4_block_write_begin(struct folio *folio, loff_t pos, unsigned len,
                        err = -EIO;
        }
        if (unlikely(err)) {
-               page_zero_new_buffers(&folio->page, from, to);
+               folio_zero_new_buffers(folio, from, to);
        } else if (fscrypt_inode_uses_fs_layer_crypto(inode)) {
                for (i = 0; i < nr_wait; i++) {
                        int err2;
@@ -1339,7 +1339,7 @@ static int ext4_write_end(struct file *file,
 }
 
 /*
- * This is a private version of page_zero_new_buffers() which doesn't
+ * This is a private version of folio_zero_new_buffers() which doesn't
  * set the buffer to be dirty, since in data=journalled mode we need
  * to call ext4_dirty_journalled_data() instead.
  */
index f9a4301..961284c 100644 (file)
@@ -793,16 +793,9 @@ static int ext4_ioctl_setproject(struct inode *inode, __u32 projid)
 }
 #endif
 
-static int ext4_shutdown(struct super_block *sb, unsigned long arg)
+int ext4_force_shutdown(struct super_block *sb, u32 flags)
 {
        struct ext4_sb_info *sbi = EXT4_SB(sb);
-       __u32 flags;
-
-       if (!capable(CAP_SYS_ADMIN))
-               return -EPERM;
-
-       if (get_user(flags, (__u32 __user *)arg))
-               return -EFAULT;
 
        if (flags > EXT4_GOING_FLAGS_NOLOGFLUSH)
                return -EINVAL;
@@ -838,6 +831,19 @@ static int ext4_shutdown(struct super_block *sb, unsigned long arg)
        return 0;
 }
 
+static int ext4_ioctl_shutdown(struct super_block *sb, unsigned long arg)
+{
+       u32 flags;
+
+       if (!capable(CAP_SYS_ADMIN))
+               return -EPERM;
+
+       if (get_user(flags, (__u32 __user *)arg))
+               return -EFAULT;
+
+       return ext4_force_shutdown(sb, flags);
+}
+
 struct getfsmap_info {
        struct super_block      *gi_sb;
        struct fsmap_head __user *gi_data;
@@ -1566,7 +1572,7 @@ resizefs_out:
                return ext4_ioctl_get_es_cache(filp, arg);
 
        case EXT4_IOC_SHUTDOWN:
-               return ext4_shutdown(sb, arg);
+               return ext4_ioctl_shutdown(sb, arg);
 
        case FS_IOC_ENABLE_VERITY:
                if (!ext4_has_feature_verity(sb))
index 45b5798..0caf6c7 100644 (file)
@@ -3834,19 +3834,10 @@ static int ext4_rename(struct mnt_idmap *idmap, struct inode *old_dir,
                        return retval;
        }
 
-       /*
-        * We need to protect against old.inode directory getting converted
-        * from inline directory format into a normal one.
-        */
-       if (S_ISDIR(old.inode->i_mode))
-               inode_lock_nested(old.inode, I_MUTEX_NONDIR2);
-
        old.bh = ext4_find_entry(old.dir, &old.dentry->d_name, &old.de,
                                 &old.inlined);
-       if (IS_ERR(old.bh)) {
-               retval = PTR_ERR(old.bh);
-               goto unlock_moved_dir;
-       }
+       if (IS_ERR(old.bh))
+               return PTR_ERR(old.bh);
 
        /*
         *  Check for inode number is _not_ due to possible IO errors.
@@ -4043,10 +4034,6 @@ release_bh:
        brelse(old.bh);
        brelse(new.bh);
 
-unlock_moved_dir:
-       if (S_ISDIR(old.inode->i_mode))
-               inode_unlock(old.inode);
-
        return retval;
 }
 
index 05fcecc..eaa5858 100644 (file)
@@ -1096,6 +1096,15 @@ void ext4_update_dynamic_rev(struct super_block *sb)
         */
 }
 
+static void ext4_bdev_mark_dead(struct block_device *bdev)
+{
+       ext4_force_shutdown(bdev->bd_holder, EXT4_GOING_FLAGS_NOLOGFLUSH);
+}
+
+static const struct blk_holder_ops ext4_holder_ops = {
+       .mark_dead              = ext4_bdev_mark_dead,
+};
+
 /*
  * Open the external journal device
  */
@@ -1103,7 +1112,8 @@ static struct block_device *ext4_blkdev_get(dev_t dev, struct super_block *sb)
 {
        struct block_device *bdev;
 
-       bdev = blkdev_get_by_dev(dev, FMODE_READ|FMODE_WRITE|FMODE_EXCL, sb);
+       bdev = blkdev_get_by_dev(dev, BLK_OPEN_READ | BLK_OPEN_WRITE, sb,
+                                &ext4_holder_ops);
        if (IS_ERR(bdev))
                goto fail;
        return bdev;
@@ -1118,17 +1128,12 @@ fail:
 /*
  * Release the journal device
  */
-static void ext4_blkdev_put(struct block_device *bdev)
-{
-       blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
-}
-
 static void ext4_blkdev_remove(struct ext4_sb_info *sbi)
 {
        struct block_device *bdev;
        bdev = sbi->s_journal_bdev;
        if (bdev) {
-               ext4_blkdev_put(bdev);
+               blkdev_put(bdev, sbi->s_sb);
                sbi->s_journal_bdev = NULL;
        }
 }
@@ -1449,6 +1454,11 @@ static void ext4_destroy_inode(struct inode *inode)
                         EXT4_I(inode)->i_reserved_data_blocks);
 }
 
+static void ext4_shutdown(struct super_block *sb)
+{
+       ext4_force_shutdown(sb, EXT4_GOING_FLAGS_NOLOGFLUSH);
+}
+
 static void init_once(void *foo)
 {
        struct ext4_inode_info *ei = foo;
@@ -1609,6 +1619,7 @@ static const struct super_operations ext4_sops = {
        .unfreeze_fs    = ext4_unfreeze,
        .statfs         = ext4_statfs,
        .show_options   = ext4_show_options,
+       .shutdown       = ext4_shutdown,
 #ifdef CONFIG_QUOTA
        .quota_read     = ext4_quota_read,
        .quota_write    = ext4_quota_write,
@@ -5899,7 +5910,7 @@ static journal_t *ext4_get_dev_journal(struct super_block *sb,
 out_journal:
        jbd2_journal_destroy(journal);
 out_bdev:
-       ext4_blkdev_put(bdev);
+       blkdev_put(bdev, sb);
        return NULL;
 }
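
The refactor above lets ext4_force_shutdown() be reached from three places: the EXT4_IOC_SHUTDOWN ioctl, the new ->shutdown super operation, and the journal block device holder's ->mark_dead callback. A hedged userspace sketch of the ioctl path only; the EXT4_IOC_SHUTDOWN encoding and the flag value are hand-copied from fs/ext4/ext4.h (they are not exported via uapi headers), so treat them as assumptions:

/* Sketch: request an ext4 shutdown from userspace (needs CAP_SYS_ADMIN).
 * Assumption: ioctl number and flag value match fs/ext4/ext4.h. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

#define EXT4_IOC_SHUTDOWN               _IOR('X', 125, __u32)
#define EXT4_GOING_FLAGS_NOLOGFLUSH     0x2

int main(int argc, char **argv)
{
        __u32 flags = EXT4_GOING_FLAGS_NOLOGFLUSH;
        int fd = open(argc > 1 ? argv[1] : "/mnt", O_RDONLY | O_DIRECTORY);

        if (fd < 0 || ioctl(fd, EXT4_IOC_SHUTDOWN, &flags) < 0) {
                perror("ext4 shutdown");
                return 1;
        }
        return 0;
}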
 
index 5ac53d2..2435111 100644 (file)
@@ -4367,22 +4367,23 @@ out:
        return ret;
 }
 
-static void f2fs_trace_rw_file_path(struct kiocb *iocb, size_t count, int rw)
+static void f2fs_trace_rw_file_path(struct file *file, loff_t pos, size_t count,
+                                   int rw)
 {
-       struct inode *inode = file_inode(iocb->ki_filp);
+       struct inode *inode = file_inode(file);
        char *buf, *path;
 
        buf = f2fs_getname(F2FS_I_SB(inode));
        if (!buf)
                return;
-       path = dentry_path_raw(file_dentry(iocb->ki_filp), buf, PATH_MAX);
+       path = dentry_path_raw(file_dentry(file), buf, PATH_MAX);
        if (IS_ERR(path))
                goto free_buf;
        if (rw == WRITE)
-               trace_f2fs_datawrite_start(inode, iocb->ki_pos, count,
+               trace_f2fs_datawrite_start(inode, pos, count,
                                current->pid, path, current->comm);
        else
-               trace_f2fs_dataread_start(inode, iocb->ki_pos, count,
+               trace_f2fs_dataread_start(inode, pos, count,
                                current->pid, path, current->comm);
 free_buf:
        f2fs_putname(buf);
@@ -4398,7 +4399,8 @@ static ssize_t f2fs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
                return -EOPNOTSUPP;
 
        if (trace_f2fs_dataread_start_enabled())
-               f2fs_trace_rw_file_path(iocb, iov_iter_count(to), READ);
+               f2fs_trace_rw_file_path(iocb->ki_filp, iocb->ki_pos,
+                                       iov_iter_count(to), READ);
 
        if (f2fs_should_use_dio(inode, iocb, to)) {
                ret = f2fs_dio_read_iter(iocb, to);
@@ -4413,6 +4415,30 @@ static ssize_t f2fs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
        return ret;
 }
 
+static ssize_t f2fs_file_splice_read(struct file *in, loff_t *ppos,
+                                    struct pipe_inode_info *pipe,
+                                    size_t len, unsigned int flags)
+{
+       struct inode *inode = file_inode(in);
+       const loff_t pos = *ppos;
+       ssize_t ret;
+
+       if (!f2fs_is_compress_backend_ready(inode))
+               return -EOPNOTSUPP;
+
+       if (trace_f2fs_dataread_start_enabled())
+               f2fs_trace_rw_file_path(in, pos, len, READ);
+
+       ret = filemap_splice_read(in, ppos, pipe, len, flags);
+       if (ret > 0)
+               f2fs_update_iostat(F2FS_I_SB(inode), inode,
+                                  APP_BUFFERED_READ_IO, ret);
+
+       if (trace_f2fs_dataread_end_enabled())
+               trace_f2fs_dataread_end(inode, pos, ret);
+       return ret;
+}
+
 static ssize_t f2fs_write_checks(struct kiocb *iocb, struct iov_iter *from)
 {
        struct file *file = iocb->ki_filp;
@@ -4517,12 +4543,9 @@ static ssize_t f2fs_buffered_write_iter(struct kiocb *iocb,
        if (iocb->ki_flags & IOCB_NOWAIT)
                return -EOPNOTSUPP;
 
-       current->backing_dev_info = inode_to_bdi(inode);
        ret = generic_perform_write(iocb, from);
-       current->backing_dev_info = NULL;
 
        if (ret > 0) {
-               iocb->ki_pos += ret;
                f2fs_update_iostat(F2FS_I_SB(inode), inode,
                                                APP_BUFFERED_IO, ret);
        }
@@ -4714,7 +4737,8 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
                ret = preallocated;
        } else {
                if (trace_f2fs_datawrite_start_enabled())
-                       f2fs_trace_rw_file_path(iocb, orig_count, WRITE);
+                       f2fs_trace_rw_file_path(iocb->ki_filp, iocb->ki_pos,
+                                               orig_count, WRITE);
 
                /* Do the actual write. */
                ret = dio ?
@@ -4919,7 +4943,7 @@ const struct file_operations f2fs_file_operations = {
 #ifdef CONFIG_COMPAT
        .compat_ioctl   = f2fs_compat_ioctl,
 #endif
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = f2fs_file_splice_read,
        .splice_write   = iter_file_splice_write,
        .fadvise        = f2fs_file_fadvise,
 };
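
For context on the .splice_read conversions in this series (generic_file_splice_read replaced by filemap_splice_read, or by thin per-filesystem wrappers such as ext4_file_splice_read and f2fs_file_splice_read above), a minimal userspace sketch of the syscall path that ends up in ->splice_read:

/* Sketch: splice file data into a pipe; the kernel services this through
 * the file's ->splice_read method. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        int pfd[2];
        int fd = open(argc > 1 ? argv[1] : "/etc/hostname", O_RDONLY);
        ssize_t n;

        if (fd < 0 || pipe(pfd) < 0)
                return 1;
        n = splice(fd, NULL, pfd[1], NULL, 65536, 0);   /* page-cache data into the pipe */
        printf("spliced %zd bytes\n", n);
        return n < 0;
}

As the hunk above shows, the f2fs wrapper also accounts the spliced bytes to APP_BUFFERED_READ_IO and emits the dataread tracepoints, which the generic helper did not do for splice.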
index 77a7127..ad597b4 100644 (file)
@@ -995,20 +995,12 @@ static int f2fs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
                        goto out;
        }
 
-       /*
-        * Copied from ext4_rename: we need to protect against old.inode
-        * directory getting converted from inline directory format into
-        * a normal one.
-        */
-       if (S_ISDIR(old_inode->i_mode))
-               inode_lock_nested(old_inode, I_MUTEX_NONDIR2);
-
        err = -ENOENT;
        old_entry = f2fs_find_entry(old_dir, &old_dentry->d_name, &old_page);
        if (!old_entry) {
                if (IS_ERR(old_page))
                        err = PTR_ERR(old_page);
-               goto out_unlock_old;
+               goto out;
        }
 
        if (S_ISDIR(old_inode->i_mode)) {
@@ -1116,9 +1108,6 @@ static int f2fs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
 
        f2fs_unlock_op(sbi);
 
-       if (S_ISDIR(old_inode->i_mode))
-               inode_unlock(old_inode);
-
        if (IS_DIRSYNC(old_dir) || IS_DIRSYNC(new_dir))
                f2fs_sync_fs(sbi->sb, 1);
 
@@ -1133,9 +1122,6 @@ out_dir:
                f2fs_put_page(old_dir_page, 0);
 out_old:
        f2fs_put_page(old_page, 0);
-out_unlock_old:
-       if (S_ISDIR(old_inode->i_mode))
-               inode_unlock(old_inode);
 out:
        iput(whiteout);
        return err;
index 9f15b03..e34197a 100644 (file)
@@ -1538,7 +1538,7 @@ static void destroy_device_list(struct f2fs_sb_info *sbi)
        int i;
 
        for (i = 0; i < sbi->s_ndevs; i++) {
-               blkdev_put(FDEV(i).bdev, FMODE_EXCL);
+               blkdev_put(FDEV(i).bdev, sbi->sb->s_type);
 #ifdef CONFIG_BLK_DEV_ZONED
                kvfree(FDEV(i).blkz_seq);
 #endif
@@ -3993,6 +3993,7 @@ static int f2fs_scan_devices(struct f2fs_sb_info *sbi)
        struct f2fs_super_block *raw_super = F2FS_RAW_SUPER(sbi);
        unsigned int max_devices = MAX_DEVICES;
        unsigned int logical_blksize;
+       blk_mode_t mode = sb_open_mode(sbi->sb->s_flags);
        int i;
 
        /* Initialize single device information */
@@ -4024,8 +4025,8 @@ static int f2fs_scan_devices(struct f2fs_sb_info *sbi)
                if (max_devices == 1) {
                        /* Single zoned block device mount */
                        FDEV(0).bdev =
-                               blkdev_get_by_dev(sbi->sb->s_bdev->bd_dev,
-                                       sbi->sb->s_mode, sbi->sb->s_type);
+                               blkdev_get_by_dev(sbi->sb->s_bdev->bd_dev, mode,
+                                                 sbi->sb->s_type, NULL);
                } else {
                        /* Multi-device mount */
                        memcpy(FDEV(i).path, RDEV(i).path, MAX_PATH_LEN);
@@ -4043,8 +4044,9 @@ static int f2fs_scan_devices(struct f2fs_sb_info *sbi)
                                        (FDEV(i).total_segments <<
                                        sbi->log_blocks_per_seg) - 1;
                        }
-                       FDEV(i).bdev = blkdev_get_by_path(FDEV(i).path,
-                                       sbi->sb->s_mode, sbi->sb->s_type);
+                       FDEV(i).bdev = blkdev_get_by_path(FDEV(i).path, mode,
+                                                         sbi->sb->s_type,
+                                                         NULL);
                }
                if (IS_ERR(FDEV(i).bdev))
                        return PTR_ERR(FDEV(i).bdev);
index 795a4fa..4564779 100644 (file)
@@ -209,7 +209,7 @@ const struct file_operations fat_file_operations = {
        .unlocked_ioctl = fat_generic_ioctl,
        .compat_ioctl   = compat_ptr_ioctl,
        .fsync          = fat_file_fsync,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .splice_write   = iter_file_splice_write,
        .fallocate      = fat_fallocate,
 };
index 372653b..e06c68e 100644 (file)
@@ -44,18 +44,40 @@ static struct kmem_cache *filp_cachep __read_mostly;
 
 static struct percpu_counter nr_files __cacheline_aligned_in_smp;
 
+/* Container for backing file with optional real path */
+struct backing_file {
+       struct file file;
+       struct path real_path;
+};
+
+static inline struct backing_file *backing_file(struct file *f)
+{
+       return container_of(f, struct backing_file, file);
+}
+
+struct path *backing_file_real_path(struct file *f)
+{
+       return &backing_file(f)->real_path;
+}
+EXPORT_SYMBOL_GPL(backing_file_real_path);
+
 static void file_free_rcu(struct rcu_head *head)
 {
        struct file *f = container_of(head, struct file, f_rcuhead);
 
        put_cred(f->f_cred);
-       kmem_cache_free(filp_cachep, f);
+       if (unlikely(f->f_mode & FMODE_BACKING))
+               kfree(backing_file(f));
+       else
+               kmem_cache_free(filp_cachep, f);
 }
 
 static inline void file_free(struct file *f)
 {
        security_file_free(f);
-       if (!(f->f_mode & FMODE_NOACCOUNT))
+       if (unlikely(f->f_mode & FMODE_BACKING))
+               path_put(backing_file_real_path(f));
+       if (likely(!(f->f_mode & FMODE_NOACCOUNT)))
                percpu_counter_dec(&nr_files);
        call_rcu(&f->f_rcuhead, file_free_rcu);
 }
@@ -131,20 +153,15 @@ static int __init init_fs_stat_sysctls(void)
 fs_initcall(init_fs_stat_sysctls);
 #endif
 
-static struct file *__alloc_file(int flags, const struct cred *cred)
+static int init_file(struct file *f, int flags, const struct cred *cred)
 {
-       struct file *f;
        int error;
 
-       f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
-       if (unlikely(!f))
-               return ERR_PTR(-ENOMEM);
-
        f->f_cred = get_cred(cred);
        error = security_file_alloc(f);
        if (unlikely(error)) {
                file_free_rcu(&f->f_rcuhead);
-               return ERR_PTR(error);
+               return error;
        }
 
        atomic_long_set(&f->f_count, 1);
@@ -155,7 +172,7 @@ static struct file *__alloc_file(int flags, const struct cred *cred)
        f->f_mode = OPEN_FMODE(flags);
        /* f->f_version: 0 */
 
-       return f;
+       return 0;
 }
 
 /* Find an unused file structure and return a pointer to it.
@@ -172,6 +189,7 @@ struct file *alloc_empty_file(int flags, const struct cred *cred)
 {
        static long old_max;
        struct file *f;
+       int error;
 
        /*
         * Privileged users can go above max_files
@@ -185,9 +203,15 @@ struct file *alloc_empty_file(int flags, const struct cred *cred)
                        goto over;
        }
 
-       f = __alloc_file(flags, cred);
-       if (!IS_ERR(f))
-               percpu_counter_inc(&nr_files);
+       f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
+       if (unlikely(!f))
+               return ERR_PTR(-ENOMEM);
+
+       error = init_file(f, flags, cred);
+       if (unlikely(error))
+               return ERR_PTR(error);
+
+       percpu_counter_inc(&nr_files);
 
        return f;
 
@@ -203,18 +227,51 @@ over:
 /*
  * Variant of alloc_empty_file() that doesn't check and modify nr_files.
  *
- * Should not be used unless there's a very good reason to do so.
+ * This is only for kernel internal use, and the allocated file must not be
+ * installed into file tables or such.
  */
 struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred)
 {
-       struct file *f = __alloc_file(flags, cred);
+       struct file *f;
+       int error;
 
-       if (!IS_ERR(f))
-               f->f_mode |= FMODE_NOACCOUNT;
+       f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
+       if (unlikely(!f))
+               return ERR_PTR(-ENOMEM);
+
+       error = init_file(f, flags, cred);
+       if (unlikely(error))
+               return ERR_PTR(error);
+
+       f->f_mode |= FMODE_NOACCOUNT;
 
        return f;
 }
 
+/*
+ * Variant of alloc_empty_file() that allocates a backing_file container
+ * and doesn't check and modify nr_files.
+ *
+ * This is only for kernel internal use, and the allocated file must not be
+ * installed into file tables or such.
+ */
+struct file *alloc_empty_backing_file(int flags, const struct cred *cred)
+{
+       struct backing_file *ff;
+       int error;
+
+       ff = kzalloc(sizeof(struct backing_file), GFP_KERNEL);
+       if (unlikely(!ff))
+               return ERR_PTR(-ENOMEM);
+
+       error = init_file(&ff->file, flags, cred);
+       if (unlikely(error))
+               return ERR_PTR(error);
+
+       ff->file.f_mode |= FMODE_BACKING | FMODE_NOACCOUNT;
+       return &ff->file;
+}
+
 /**
  * alloc_file - allocate and initialize a 'struct file'
  *
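
The struct backing_file hunk above relies on the usual embed-and-container_of pattern: alloc_empty_backing_file() hands out a plain struct file *, and backing_file_real_path() recovers the surrounding container. A self-contained userspace sketch of that pattern, with simplified stand-ins rather than the kernel definitions:

/* Sketch of the embedding pattern used by struct backing_file above:
 * the base object is embedded in a larger container and the container is
 * recovered with container_of(). */
#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

struct file { int f_mode; };
struct path { const char *name; };

struct backing_file {
        struct file file;               /* the generic handle everyone else sees */
        struct path real_path;          /* extra state only backing files need */
};

static struct path *backing_file_real_path(struct file *f)
{
        return &container_of(f, struct backing_file, file)->real_path;
}

int main(void)
{
        struct backing_file bf = { .real_path = { "/lower/layer/foo" } };
        struct file *f = &bf.file;      /* callers only ever hold a struct file * */

        printf("%s\n", backing_file_real_path(f)->name);
        return 0;
}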
index ae4e51e..aca4b48 100644 (file)
@@ -2024,7 +2024,6 @@ static long wb_writeback(struct bdi_writeback *wb,
        struct blk_plug plug;
 
        blk_start_plug(&plug);
-       spin_lock(&wb->list_lock);
        for (;;) {
                /*
                 * Stop writeback when nr_pages has been consumed
@@ -2049,6 +2048,9 @@ static long wb_writeback(struct bdi_writeback *wb,
                if (work->for_background && !wb_over_bg_thresh(wb))
                        break;
 
+
+               spin_lock(&wb->list_lock);
+
                /*
                 * Kupdate and background works are special and we want to
                 * include all inodes that need writing. Livelock avoidance is
@@ -2078,13 +2080,19 @@ static long wb_writeback(struct bdi_writeback *wb,
                 * mean the overall work is done. So we keep looping as long
                 * as made some progress on cleaning pages or inodes.
                 */
-               if (progress)
+               if (progress) {
+                       spin_unlock(&wb->list_lock);
                        continue;
+               }
+
                /*
                 * No more inodes for IO, bail
                 */
-               if (list_empty(&wb->b_more_io))
+               if (list_empty(&wb->b_more_io)) {
+                       spin_unlock(&wb->list_lock);
                        break;
+               }
+
                /*
                 * Nothing written. Wait for some inode to
                 * become available for writeback. Otherwise
@@ -2096,9 +2104,7 @@ static long wb_writeback(struct bdi_writeback *wb,
                spin_unlock(&wb->list_lock);
                /* This function drops i_lock... */
                inode_sleep_on_writeback(inode);
-               spin_lock(&wb->list_lock);
        }
-       spin_unlock(&wb->list_lock);
        blk_finish_plug(&plug);
 
        return nr_pages - work->nr_pages;
index 24ce12f..851214d 100644 (file)
@@ -561,7 +561,8 @@ static int legacy_parse_param(struct fs_context *fc, struct fs_parameter *param)
                        return -ENOMEM;
        }
 
-       ctx->legacy_data[size++] = ',';
+       if (size)
+               ctx->legacy_data[size++] = ',';
        len = strlen(param->key);
        memcpy(ctx->legacy_data + size, param->key, len);
        size += len;
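
The legacy_parse_param hunk above fixes option concatenation so a ',' separator is only written between options, never before the first one (previously the legacy mount data could start with a stray comma). A small sketch of the same joining logic in isolation, with bounds checks omitted:

/* Sketch: build a comma-separated option string, emitting the separator
 * only between entries. */
#include <stdio.h>
#include <string.h>

static size_t append_opt(char *buf, size_t size, const char *key,
                         const char *val)
{
        if (size)                       /* no leading ',' for the first option */
                buf[size++] = ',';
        strcpy(buf + size, key);
        size += strlen(key);
        if (val) {
                buf[size++] = '=';
                strcpy(buf + size, val);
                size += strlen(val);
        }
        return size;
}

int main(void)
{
        char data[128] = "";
        size_t size = 0;

        size = append_opt(data, size, "errors", "remount-ro");
        size = append_opt(data, size, "acl", NULL);
        printf("%s\n", data);           /* errors=remount-ro,acl */
        return 0;
}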
index 89d97f6..bc41152 100644 (file)
@@ -1280,13 +1280,13 @@ static inline unsigned int fuse_wr_pages(loff_t pos, size_t len,
                     max_pages);
 }
 
-static ssize_t fuse_perform_write(struct kiocb *iocb,
-                                 struct address_space *mapping,
-                                 struct iov_iter *ii, loff_t pos)
+static ssize_t fuse_perform_write(struct kiocb *iocb, struct iov_iter *ii)
 {
+       struct address_space *mapping = iocb->ki_filp->f_mapping;
        struct inode *inode = mapping->host;
        struct fuse_conn *fc = get_fuse_conn(inode);
        struct fuse_inode *fi = get_fuse_inode(inode);
+       loff_t pos = iocb->ki_pos;
        int err = 0;
        ssize_t res = 0;
 
@@ -1329,7 +1329,10 @@ static ssize_t fuse_perform_write(struct kiocb *iocb,
        fuse_write_update_attr(inode, pos, res);
        clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
 
-       return res > 0 ? res : err;
+       if (!res)
+               return err;
+       iocb->ki_pos += res;
+       return res;
 }
 
 static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
@@ -1337,11 +1340,9 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
        struct file *file = iocb->ki_filp;
        struct address_space *mapping = file->f_mapping;
        ssize_t written = 0;
-       ssize_t written_buffered = 0;
        struct inode *inode = mapping->host;
        ssize_t err;
        struct fuse_conn *fc = get_fuse_conn(inode);
-       loff_t endbyte = 0;
 
        if (fc->writeback_cache) {
                /* Update size (EOF optimization) and mode (SUID clearing) */
@@ -1362,9 +1363,6 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
 writethrough:
        inode_lock(inode);
 
-       /* We can write back this queue in page reclaim */
-       current->backing_dev_info = inode_to_bdi(inode);
-
        err = generic_write_checks(iocb, from);
        if (err <= 0)
                goto out;
@@ -1378,38 +1376,15 @@ writethrough:
                goto out;
 
        if (iocb->ki_flags & IOCB_DIRECT) {
-               loff_t pos = iocb->ki_pos;
                written = generic_file_direct_write(iocb, from);
                if (written < 0 || !iov_iter_count(from))
                        goto out;
-
-               pos += written;
-
-               written_buffered = fuse_perform_write(iocb, mapping, from, pos);
-               if (written_buffered < 0) {
-                       err = written_buffered;
-                       goto out;
-               }
-               endbyte = pos + written_buffered - 1;
-
-               err = filemap_write_and_wait_range(file->f_mapping, pos,
-                                                  endbyte);
-               if (err)
-                       goto out;
-
-               invalidate_mapping_pages(file->f_mapping,
-                                        pos >> PAGE_SHIFT,
-                                        endbyte >> PAGE_SHIFT);
-
-               written += written_buffered;
-               iocb->ki_pos = pos + written_buffered;
+               written = direct_write_fallback(iocb, from, written,
+                               fuse_perform_write(iocb, from));
        } else {
-               written = fuse_perform_write(iocb, mapping, from, iocb->ki_pos);
-               if (written >= 0)
-                       iocb->ki_pos += written;
+               written = fuse_perform_write(iocb, from);
        }
 out:
-       current->backing_dev_info = NULL;
        inode_unlock(inode);
        if (written > 0)
                written = generic_write_sync(iocb, written);
@@ -3252,7 +3227,7 @@ static const struct file_operations fuse_file_operations = {
        .lock           = fuse_file_lock,
        .get_unmapped_area = thp_get_unmapped_area,
        .flock          = fuse_file_flock,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .splice_write   = iter_file_splice_write,
        .unlocked_ioctl = fuse_file_ioctl,
        .compat_ioctl   = fuse_file_compat_ioctl,
index a5f4be6..1c407eb 100644 (file)
 
 
 void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
-                            unsigned int from, unsigned int len)
+                            size_t from, size_t len)
 {
        struct buffer_head *head = folio_buffers(folio);
        unsigned int bsize = head->b_size;
        struct buffer_head *bh;
-       unsigned int to = from + len;
-       unsigned int start, end;
+       size_t to = from + len;
+       size_t start, end;
 
        for (bh = head, start = 0; bh != head || !start;
             bh = bh->b_this_page, start = end) {
@@ -82,61 +82,61 @@ static int gfs2_get_block_noalloc(struct inode *inode, sector_t lblock,
 }
 
 /**
- * gfs2_write_jdata_page - gfs2 jdata-specific version of block_write_full_page
- * @page: The page to write
+ * gfs2_write_jdata_folio - gfs2 jdata-specific version of block_write_full_page
+ * @folio: The folio to write
  * @wbc: The writeback control
  *
  * This is the same as calling block_write_full_page, but it also
  * writes pages outside of i_size
  */
-static int gfs2_write_jdata_page(struct page *page,
+static int gfs2_write_jdata_folio(struct folio *folio,
                                 struct writeback_control *wbc)
 {
-       struct inode * const inode = page->mapping->host;
+       struct inode * const inode = folio->mapping->host;
        loff_t i_size = i_size_read(inode);
-       const pgoff_t end_index = i_size >> PAGE_SHIFT;
-       unsigned offset;
 
        /*
-        * The page straddles i_size.  It must be zeroed out on each and every
+        * The folio straddles i_size.  It must be zeroed out on each and every
         * writepage invocation because it may be mmapped.  "A file is mapped
         * in multiples of the page size.  For a file that is not a multiple of
-        * the  page size, the remaining memory is zeroed when mapped, and
+        * the page size, the remaining memory is zeroed when mapped, and
         * writes to that region are not written out to the file."
         */
-       offset = i_size & (PAGE_SIZE - 1);
-       if (page->index == end_index && offset)
-               zero_user_segment(page, offset, PAGE_SIZE);
+       if (folio_pos(folio) < i_size &&
+           i_size < folio_pos(folio) + folio_size(folio))
+               folio_zero_segment(folio, offset_in_folio(folio, i_size),
+                               folio_size(folio));
 
-       return __block_write_full_page(inode, page, gfs2_get_block_noalloc, wbc,
-                                      end_buffer_async_write);
+       return __block_write_full_folio(inode, folio, gfs2_get_block_noalloc,
+                       wbc, end_buffer_async_write);
 }
 
 /**
- * __gfs2_jdata_writepage - The core of jdata writepage
- * @page: The page to write
+ * __gfs2_jdata_write_folio - The core of jdata writepage
+ * @folio: The folio to write
  * @wbc: The writeback control
  *
  * This is shared between writepage and writepages and implements the
  * core of the writepage operation. If a transaction is required then
- * PageChecked will have been set and the transaction will have
+ * the checked flag will have been set and the transaction will have
  * already been started before this is called.
  */
-
-static int __gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc)
+static int __gfs2_jdata_write_folio(struct folio *folio,
+               struct writeback_control *wbc)
 {
-       struct inode *inode = page->mapping->host;
+       struct inode *inode = folio->mapping->host;
        struct gfs2_inode *ip = GFS2_I(inode);
 
-       if (PageChecked(page)) {
-               ClearPageChecked(page);
-               if (!page_has_buffers(page)) {
-                       create_empty_buffers(page, inode->i_sb->s_blocksize,
-                                            BIT(BH_Dirty)|BIT(BH_Uptodate));
+       if (folio_test_checked(folio)) {
+               folio_clear_checked(folio);
+               if (!folio_buffers(folio)) {
+                       folio_create_empty_buffers(folio,
+                                       inode->i_sb->s_blocksize,
+                                       BIT(BH_Dirty)|BIT(BH_Uptodate));
                }
-               gfs2_trans_add_databufs(ip, page_folio(page), 0, PAGE_SIZE);
+               gfs2_trans_add_databufs(ip, folio, 0, folio_size(folio));
        }
-       return gfs2_write_jdata_page(page, wbc);
+       return gfs2_write_jdata_folio(folio, wbc);
 }
 
 /**
@@ -150,20 +150,21 @@ static int __gfs2_jdata_writepage(struct page *page, struct writeback_control *w
 
 static int gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc)
 {
+       struct folio *folio = page_folio(page);
        struct inode *inode = page->mapping->host;
        struct gfs2_inode *ip = GFS2_I(inode);
        struct gfs2_sbd *sdp = GFS2_SB(inode);
 
        if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
                goto out;
-       if (PageChecked(page) || current->journal_info)
+       if (folio_test_checked(folio) || current->journal_info)
                goto out_ignore;
-       return __gfs2_jdata_writepage(page, wbc);
+       return __gfs2_jdata_write_folio(folio, wbc);
 
 out_ignore:
-       redirty_page_for_writepage(wbc, page);
+       folio_redirty_for_writepage(wbc, folio);
 out:
-       unlock_page(page);
+       folio_unlock(folio);
        return 0;
 }
 
@@ -255,7 +256,7 @@ continue_unlock:
 
                trace_wbc_writepage(wbc, inode_to_bdi(inode));
 
-               ret = __gfs2_jdata_writepage(&folio->page, wbc);
+               ret = __gfs2_jdata_write_folio(folio, wbc);
                if (unlikely(ret)) {
                        if (ret == AOP_WRITEPAGE_ACTIVATE) {
                                folio_unlock(folio);
index 09db191..f08322e 100644 (file)
@@ -10,6 +10,6 @@
 
 extern void adjust_fs_space(struct inode *inode);
 extern void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
-                                   unsigned int from, unsigned int len);
+                                   size_t from, size_t len);
 
 #endif /* __AOPS_DOT_H__ */
index cb62c8f..f146447 100644 (file)
@@ -1052,15 +1052,11 @@ retry:
                        goto out_unlock;
        }
 
-       current->backing_dev_info = inode_to_bdi(inode);
        pagefault_disable();
        ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
        pagefault_enable();
-       current->backing_dev_info = NULL;
-       if (ret > 0) {
-               iocb->ki_pos += ret;
+       if (ret > 0)
                written += ret;
-       }
 
        if (inode == sdp->sd_rindex)
                gfs2_glock_dq_uninit(statfs_gh);
@@ -1579,7 +1575,7 @@ const struct file_operations gfs2_file_fops = {
        .fsync          = gfs2_fsync,
        .lock           = gfs2_lock,
        .flock          = gfs2_flock,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .splice_write   = gfs2_file_splice_write,
        .setlease       = simple_nosetlease,
        .fallocate      = gfs2_fallocate,
@@ -1610,7 +1606,7 @@ const struct file_operations gfs2_file_fops_nolock = {
        .open           = gfs2_open,
        .release        = gfs2_release,
        .fsync          = gfs2_fsync,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .splice_write   = gfs2_file_splice_write,
        .setlease       = generic_setlease,
        .fallocate      = gfs2_fallocate,
index 9af9ddb..cd96298 100644 (file)
@@ -254,7 +254,7 @@ static int gfs2_read_super(struct gfs2_sbd *sdp, sector_t sector, int silent)
 
        bio = bio_alloc(sb->s_bdev, 1, REQ_OP_READ | REQ_META, GFP_NOFS);
        bio->bi_iter.bi_sector = sector * (sb->s_blocksize >> 9);
-       bio_add_page(bio, page, PAGE_SIZE, 0);
+       __bio_add_page(bio, page, PAGE_SIZE, 0);
 
        bio->bi_end_io = end_bio_io_page;
        bio->bi_private = page;
index 1f7bd06..441d7fc 100644 (file)
@@ -694,7 +694,7 @@ static const struct file_operations hfs_file_operations = {
        .read_iter      = generic_file_read_iter,
        .write_iter     = generic_file_write_iter,
        .mmap           = generic_file_mmap,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .fsync          = hfs_file_fsync,
        .open           = hfs_file_open,
        .release        = hfs_file_release,
index b216604..7d1a675 100644 (file)
@@ -372,7 +372,7 @@ static const struct file_operations hfsplus_file_operations = {
        .read_iter      = generic_file_read_iter,
        .write_iter     = generic_file_write_iter,
        .mmap           = generic_file_mmap,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .fsync          = hfsplus_file_fsync,
        .open           = hfsplus_file_open,
        .release        = hfsplus_file_release,
index 69cb796..0239e3a 100644 (file)
@@ -65,6 +65,7 @@ struct hostfs_stat {
        unsigned long long blocks;
        unsigned int maj;
        unsigned int min;
+       dev_t dev;
 };
 
 extern int stat_file(const char *path, struct hostfs_stat *p, int fd);
index 28b4f15..4638709 100644 (file)
@@ -26,6 +26,7 @@ struct hostfs_inode_info {
        fmode_t mode;
        struct inode vfs_inode;
        struct mutex open_mutex;
+       dev_t dev;
 };
 
 static inline struct hostfs_inode_info *HOSTFS_I(struct inode *inode)
@@ -182,14 +183,6 @@ static char *follow_link(char *link)
        return ERR_PTR(n);
 }
 
-static struct inode *hostfs_iget(struct super_block *sb)
-{
-       struct inode *inode = new_inode(sb);
-       if (!inode)
-               return ERR_PTR(-ENOMEM);
-       return inode;
-}
-
 static int hostfs_statfs(struct dentry *dentry, struct kstatfs *sf)
 {
        /*
@@ -228,6 +221,7 @@ static struct inode *hostfs_alloc_inode(struct super_block *sb)
                return NULL;
        hi->fd = -1;
        hi->mode = 0;
+       hi->dev = 0;
        inode_init_once(&hi->vfs_inode);
        mutex_init(&hi->open_mutex);
        return &hi->vfs_inode;
@@ -240,6 +234,7 @@ static void hostfs_evict_inode(struct inode *inode)
        if (HOSTFS_I(inode)->fd != -1) {
                close_file(&HOSTFS_I(inode)->fd);
                HOSTFS_I(inode)->fd = -1;
+               HOSTFS_I(inode)->dev = 0;
        }
 }
 
@@ -265,6 +260,7 @@ static int hostfs_show_options(struct seq_file *seq, struct dentry *root)
 static const struct super_operations hostfs_sbops = {
        .alloc_inode    = hostfs_alloc_inode,
        .free_inode     = hostfs_free_inode,
+       .drop_inode     = generic_delete_inode,
        .evict_inode    = hostfs_evict_inode,
        .statfs         = hostfs_statfs,
        .show_options   = hostfs_show_options,
@@ -381,7 +377,7 @@ static int hostfs_fsync(struct file *file, loff_t start, loff_t end,
 
 static const struct file_operations hostfs_file_fops = {
        .llseek         = generic_file_llseek,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .splice_write   = iter_file_splice_write,
        .read_iter      = generic_file_read_iter,
        .write_iter     = generic_file_write_iter,
@@ -512,18 +508,31 @@ static const struct address_space_operations hostfs_aops = {
        .write_end      = hostfs_write_end,
 };
 
-static int read_name(struct inode *ino, char *name)
+static int hostfs_inode_update(struct inode *ino, const struct hostfs_stat *st)
+{
+       set_nlink(ino, st->nlink);
+       i_uid_write(ino, st->uid);
+       i_gid_write(ino, st->gid);
+       ino->i_atime =
+               (struct timespec64){ st->atime.tv_sec, st->atime.tv_nsec };
+       ino->i_mtime =
+               (struct timespec64){ st->mtime.tv_sec, st->mtime.tv_nsec };
+       ino->i_ctime =
+               (struct timespec64){ st->ctime.tv_sec, st->ctime.tv_nsec };
+       ino->i_size = st->size;
+       ino->i_blocks = st->blocks;
+       return 0;
+}
+
+static int hostfs_inode_set(struct inode *ino, void *data)
 {
+       struct hostfs_stat *st = data;
        dev_t rdev;
-       struct hostfs_stat st;
-       int err = stat_file(name, &st, -1);
-       if (err)
-               return err;
 
        /* Re-encode maj and min with the kernel encoding. */
-       rdev = MKDEV(st.maj, st.min);
+       rdev = MKDEV(st->maj, st->min);
 
-       switch (st.mode & S_IFMT) {
+       switch (st->mode & S_IFMT) {
        case S_IFLNK:
                ino->i_op = &hostfs_link_iops;
                break;
@@ -535,7 +544,7 @@ static int read_name(struct inode *ino, char *name)
        case S_IFBLK:
        case S_IFIFO:
        case S_IFSOCK:
-               init_special_inode(ino, st.mode & S_IFMT, rdev);
+               init_special_inode(ino, st->mode & S_IFMT, rdev);
                ino->i_op = &hostfs_iops;
                break;
        case S_IFREG:
@@ -547,17 +556,42 @@ static int read_name(struct inode *ino, char *name)
                return -EIO;
        }
 
-       ino->i_ino = st.ino;
-       ino->i_mode = st.mode;
-       set_nlink(ino, st.nlink);
-       i_uid_write(ino, st.uid);
-       i_gid_write(ino, st.gid);
-       ino->i_atime = (struct timespec64){ st.atime.tv_sec, st.atime.tv_nsec };
-       ino->i_mtime = (struct timespec64){ st.mtime.tv_sec, st.mtime.tv_nsec };
-       ino->i_ctime = (struct timespec64){ st.ctime.tv_sec, st.ctime.tv_nsec };
-       ino->i_size = st.size;
-       ino->i_blocks = st.blocks;
-       return 0;
+       HOSTFS_I(ino)->dev = st->dev;
+       ino->i_ino = st->ino;
+       ino->i_mode = st->mode;
+       return hostfs_inode_update(ino, st);
+}
+
+static int hostfs_inode_test(struct inode *inode, void *data)
+{
+       const struct hostfs_stat *st = data;
+
+       return inode->i_ino == st->ino && HOSTFS_I(inode)->dev == st->dev;
+}
+
+static struct inode *hostfs_iget(struct super_block *sb, char *name)
+{
+       struct inode *inode;
+       struct hostfs_stat st;
+       int err = stat_file(name, &st, -1);
+
+       if (err)
+               return ERR_PTR(err);
+
+       inode = iget5_locked(sb, st.ino, hostfs_inode_test, hostfs_inode_set,
+                            &st);
+       if (!inode)
+               return ERR_PTR(-ENOMEM);
+
+       if (inode->i_state & I_NEW) {
+               unlock_new_inode(inode);
+       } else {
+               spin_lock(&inode->i_lock);
+               hostfs_inode_update(inode, &st);
+               spin_unlock(&inode->i_lock);
+       }
+
+       return inode;
 }
 
 static int hostfs_create(struct mnt_idmap *idmap, struct inode *dir,
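
The new hostfs_iget() above follows the standard iget5_locked() pattern: a test callback matches an existing inode by (i_ino, host device) and a set callback initialises a freshly allocated one. A hedged sketch of the same pattern for a hypothetical filesystem keyed the same way (the myfs_* names and the MYFS_I() accessor are illustrative only):

struct myfs_key {
        unsigned long ino;
        dev_t dev;
};

static int myfs_inode_test(struct inode *inode, void *data)
{
        const struct myfs_key *key = data;

        /* Match on both inode number and backing device, as hostfs now does. */
        return inode->i_ino == key->ino && MYFS_I(inode)->dev == key->dev;
}

static int myfs_inode_set(struct inode *inode, void *data)
{
        const struct myfs_key *key = data;

        inode->i_ino = key->ino;
        MYFS_I(inode)->dev = key->dev;
        return 0;
}

static struct inode *myfs_iget(struct super_block *sb, struct myfs_key *key)
{
        struct inode *inode;

        inode = iget5_locked(sb, key->ino, myfs_inode_test, myfs_inode_set, key);
        if (!inode)
                return ERR_PTR(-ENOMEM);
        if (inode->i_state & I_NEW) {
                /* Freshly allocated: finish filling it in, then publish it. */
                unlock_new_inode(inode);
        }
        return inode;
}
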
@@ -565,62 +599,48 @@ static int hostfs_create(struct mnt_idmap *idmap, struct inode *dir,
 {
        struct inode *inode;
        char *name;
-       int error, fd;
-
-       inode = hostfs_iget(dir->i_sb);
-       if (IS_ERR(inode)) {
-               error = PTR_ERR(inode);
-               goto out;
-       }
+       int fd;
 
-       error = -ENOMEM;
        name = dentry_name(dentry);
        if (name == NULL)
-               goto out_put;
+               return -ENOMEM;
 
        fd = file_create(name, mode & 0777);
-       if (fd < 0)
-               error = fd;
-       else
-               error = read_name(inode, name);
+       if (fd < 0) {
+               __putname(name);
+               return fd;
+       }
 
+       inode = hostfs_iget(dir->i_sb, name);
        __putname(name);
-       if (error)
-               goto out_put;
+       if (IS_ERR(inode))
+               return PTR_ERR(inode);
 
        HOSTFS_I(inode)->fd = fd;
        HOSTFS_I(inode)->mode = FMODE_READ | FMODE_WRITE;
        d_instantiate(dentry, inode);
        return 0;
-
- out_put:
-       iput(inode);
- out:
-       return error;
 }
 
 static struct dentry *hostfs_lookup(struct inode *ino, struct dentry *dentry,
                                    unsigned int flags)
 {
-       struct inode *inode;
+       struct inode *inode = NULL;
        char *name;
-       int err;
-
-       inode = hostfs_iget(ino->i_sb);
-       if (IS_ERR(inode))
-               goto out;
 
-       err = -ENOMEM;
        name = dentry_name(dentry);
-       if (name) {
-               err = read_name(inode, name);
-               __putname(name);
-       }
-       if (err) {
-               iput(inode);
-               inode = (err == -ENOENT) ? NULL : ERR_PTR(err);
+       if (name == NULL)
+               return ERR_PTR(-ENOMEM);
+
+       inode = hostfs_iget(ino->i_sb, name);
+       __putname(name);
+       if (IS_ERR(inode)) {
+               if (PTR_ERR(inode) == -ENOENT)
+                       inode = NULL;
+               else
+                       return ERR_CAST(inode);
        }
- out:
+
        return d_splice_alias(inode, dentry);
 }
 
@@ -704,35 +724,23 @@ static int hostfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
        char *name;
        int err;
 
-       inode = hostfs_iget(dir->i_sb);
-       if (IS_ERR(inode)) {
-               err = PTR_ERR(inode);
-               goto out;
-       }
-
-       err = -ENOMEM;
        name = dentry_name(dentry);
        if (name == NULL)
-               goto out_put;
+               return -ENOMEM;
 
        err = do_mknod(name, mode, MAJOR(dev), MINOR(dev));
-       if (err)
-               goto out_free;
+       if (err) {
+               __putname(name);
+               return err;
+       }
 
-       err = read_name(inode, name);
+       inode = hostfs_iget(dir->i_sb, name);
        __putname(name);
-       if (err)
-               goto out_put;
+       if (IS_ERR(inode))
+               return PTR_ERR(inode);
 
        d_instantiate(dentry, inode);
        return 0;
-
- out_free:
-       __putname(name);
- out_put:
-       iput(inode);
- out:
-       return err;
 }
 
 static int hostfs_rename2(struct mnt_idmap *idmap,
@@ -929,49 +937,40 @@ static int hostfs_fill_sb_common(struct super_block *sb, void *d, int silent)
        sb->s_maxbytes = MAX_LFS_FILESIZE;
        err = super_setup_bdi(sb);
        if (err)
-               goto out;
+               return err;
 
        /* NULL is printed as '(null)' by printf(): avoid that. */
        if (req_root == NULL)
                req_root = "";
 
-       err = -ENOMEM;
        sb->s_fs_info = host_root_path =
                kasprintf(GFP_KERNEL, "%s/%s", root_ino, req_root);
        if (host_root_path == NULL)
-               goto out;
-
-       root_inode = new_inode(sb);
-       if (!root_inode)
-               goto out;
+               return -ENOMEM;
 
-       err = read_name(root_inode, host_root_path);
-       if (err)
-               goto out_put;
+       root_inode = hostfs_iget(sb, host_root_path);
+       if (IS_ERR(root_inode))
+               return PTR_ERR(root_inode);
 
        if (S_ISLNK(root_inode->i_mode)) {
-               char *name = follow_link(host_root_path);
-               if (IS_ERR(name)) {
-                       err = PTR_ERR(name);
-                       goto out_put;
-               }
-               err = read_name(root_inode, name);
+               char *name;
+
+               iput(root_inode);
+               name = follow_link(host_root_path);
+               if (IS_ERR(name))
+                       return PTR_ERR(name);
+
+               root_inode = hostfs_iget(sb, name);
                kfree(name);
-               if (err)
-                       goto out_put;
+               if (IS_ERR(root_inode))
+                       return PTR_ERR(root_inode);
        }
 
-       err = -ENOMEM;
        sb->s_root = d_make_root(root_inode);
        if (sb->s_root == NULL)
-               goto out;
+               return -ENOMEM;
 
        return 0;
-
-out_put:
-       iput(root_inode);
-out:
-       return err;
 }
 
 static struct dentry *hostfs_read_sb(struct file_system_type *type,
index 5ecc470..840619e 100644 (file)
@@ -36,6 +36,7 @@ static void stat64_to_hostfs(const struct stat64 *buf, struct hostfs_stat *p)
        p->blocks = buf->st_blocks;
        p->maj = os_major(buf->st_rdev);
        p->min = os_minor(buf->st_rdev);
+       p->dev = buf->st_dev;
 }
 
 int stat_file(const char *path, struct hostfs_stat *p, int fd)
index 88952d4..1bb8d97 100644 (file)
@@ -259,7 +259,7 @@ const struct file_operations hpfs_file_ops =
        .mmap           = generic_file_mmap,
        .release        = hpfs_file_release,
        .fsync          = hpfs_file_fsync,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .unlocked_ioctl = hpfs_ioctl,
        .compat_ioctl   = compat_ptr_ioctl,
 };
index ecfdfb2..7b17ccf 100644 (file)
@@ -821,7 +821,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
                 */
                struct folio *folio;
                unsigned long addr;
-               bool present;
 
                cond_resched();
 
@@ -834,9 +833,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
                        break;
                }
 
-               /* Set numa allocation policy based on index */
-               hugetlb_set_vma_policy(&pseudo_vma, inode, index);
-
                /* addr is the offset within the file (zero based) */
                addr = index * hpage_size;
 
@@ -845,12 +841,10 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
                mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
                /* See if already present in mapping to avoid alloc/free */
-               rcu_read_lock();
-               present = page_cache_next_miss(mapping, index, 1) != index;
-               rcu_read_unlock();
-               if (present) {
+               folio = filemap_get_folio(mapping, index);
+               if (!IS_ERR(folio)) {
+                       folio_put(folio);
                        mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-                       hugetlb_drop_vma_policy(&pseudo_vma);
                        continue;
                }
 
@@ -862,6 +856,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
                 * folios in these areas, we need to consume the reserves
                 * to keep reservation accounting consistent.
                 */
+               hugetlb_set_vma_policy(&pseudo_vma, inode, index);
                folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0);
                hugetlb_drop_vma_policy(&pseudo_vma);
                if (IS_ERR(folio)) {
index 577799b..d37fad9 100644 (file)
@@ -1104,9 +1104,51 @@ void discard_new_inode(struct inode *inode)
 EXPORT_SYMBOL(discard_new_inode);
 
 /**
+ * lock_two_inodes - lock two inodes (may be regular files but also dirs)
+ *
+ * Lock any non-NULL argument. The caller must make sure that if two
+ * directories are passed in, one is not an ancestor of the other.  Zero, one
+ * or two objects may be locked by this function.
+ *
+ * @inode1: first inode to lock
+ * @inode2: second inode to lock
+ * @subclass1: inode lock subclass for the first lock obtained
+ * @subclass2: inode lock subclass for the second lock obtained
+ */
+void lock_two_inodes(struct inode *inode1, struct inode *inode2,
+                    unsigned subclass1, unsigned subclass2)
+{
+       if (!inode1 || !inode2) {
+               /*
+                * Make sure @subclass1 will be used for the acquired lock.
+                * This is not strictly necessary (no current caller cares) but
+                * let's keep things consistent.
+                */
+               if (!inode1)
+                       swap(inode1, inode2);
+               goto lock;
+       }
+
+       /*
+        * If one object is a directory and the other is not, we must make sure
+        * to lock the directory first as the other object may be its child.
+        */
+       if (S_ISDIR(inode2->i_mode) == S_ISDIR(inode1->i_mode)) {
+               if (inode1 > inode2)
+                       swap(inode1, inode2);
+       } else if (!S_ISDIR(inode1->i_mode))
+               swap(inode1, inode2);
+lock:
+       if (inode1)
+               inode_lock_nested(inode1, subclass1);
+       if (inode2 && inode2 != inode1)
+               inode_lock_nested(inode2, subclass2);
+}
+
+/**
  * lock_two_nondirectories - take two i_mutexes on non-directory objects
  *
- * Lock any non-NULL argument that is not a directory.
+ * Lock any non-NULL argument. Passed objects must not be directories.
  * Zero, one or two objects may be locked by this function.
  *
  * @inode1: first inode to lock
@@ -1114,13 +1156,9 @@ EXPORT_SYMBOL(discard_new_inode);
  */
 void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
 {
-       if (inode1 > inode2)
-               swap(inode1, inode2);
-
-       if (inode1 && !S_ISDIR(inode1->i_mode))
-               inode_lock(inode1);
-       if (inode2 && !S_ISDIR(inode2->i_mode) && inode2 != inode1)
-               inode_lock_nested(inode2, I_MUTEX_NONDIR2);
+       WARN_ON_ONCE(S_ISDIR(inode1->i_mode));
+       WARN_ON_ONCE(S_ISDIR(inode2->i_mode));
+       lock_two_inodes(inode1, inode2, I_MUTEX_NORMAL, I_MUTEX_NONDIR2);
 }
 EXPORT_SYMBOL(lock_two_nondirectories);
 
@@ -1131,10 +1169,14 @@ EXPORT_SYMBOL(lock_two_nondirectories);
  */
 void unlock_two_nondirectories(struct inode *inode1, struct inode *inode2)
 {
-       if (inode1 && !S_ISDIR(inode1->i_mode))
+       if (inode1) {
+               WARN_ON_ONCE(S_ISDIR(inode1->i_mode));
                inode_unlock(inode1);
-       if (inode2 && !S_ISDIR(inode2->i_mode) && inode2 != inode1)
+       }
+       if (inode2 && inode2 != inode1) {
+               WARN_ON_ONCE(S_ISDIR(inode2->i_mode));
                inode_unlock(inode2);
+       }
 }
 EXPORT_SYMBOL(unlock_two_nondirectories);
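
A hedged caller-side sketch of the new helper: lock a source/target pair that may include directories using the ordering rules lock_two_inodes() enforces, then drop both locks. This mirrors how vfs_rename() uses it later in this series; the surrounding myfs_update_pair() function is hypothetical.

static int myfs_update_pair(struct inode *source, struct inode *target)
{
        /* Directories are locked before non-directories, same-kind pairs are
         * ordered by address, and inode2 == inode1 is only locked once. */
        lock_two_inodes(source, target, I_MUTEX_NORMAL, I_MUTEX_NONDIR2);

        /* ... modify both inodes under their locks ... */

        if (source)
                inode_unlock(source);
        if (target && target != source)
                inode_unlock(target);
        return 0;
}
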
 
@@ -2264,7 +2306,8 @@ void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
                inode->i_fop = &def_chr_fops;
                inode->i_rdev = rdev;
        } else if (S_ISBLK(mode)) {
-               inode->i_fop = &def_blk_fops;
+               if (IS_ENABLED(CONFIG_BLOCK))
+                       inode->i_fop = &def_blk_fops;
                inode->i_rdev = rdev;
        } else if (S_ISFIFO(mode))
                inode->i_fop = &pipefifo_fops;
index bd3b281..f7a3dc1 100644 (file)
@@ -97,8 +97,9 @@ extern void chroot_fs_refs(const struct path *, const struct path *);
 /*
  * file_table.c
  */
-extern struct file *alloc_empty_file(int, const struct cred *);
-extern struct file *alloc_empty_file_noaccount(int, const struct cred *);
+struct file *alloc_empty_file(int flags, const struct cred *cred);
+struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred);
+struct file *alloc_empty_backing_file(int flags, const struct cred *cred);
 
 static inline void put_file_access(struct file *file)
 {
@@ -121,6 +122,47 @@ extern bool mount_capable(struct fs_context *);
 int sb_init_dio_done_wq(struct super_block *sb);
 
 /*
+ * Prepare the superblock for changing its read-only state (i.e., either remount
+ * a read-write superblock read-only or vice versa). After this function returns,
+ * mnt_is_readonly() will return true for any mount of the superblock if its
+ * caller is able to observe any changes done by the remount. This holds until
+ * sb_end_ro_state_change() is called.
+ */
+static inline void sb_start_ro_state_change(struct super_block *sb)
+{
+       WRITE_ONCE(sb->s_readonly_remount, 1);
+       /*
+        * For RO->RW transition, the barrier pairs with the barrier in
+        * mnt_is_readonly(), making sure that if mnt_is_readonly() sees
+        * SB_RDONLY cleared, it will see s_readonly_remount set.
+        * For RW->RO transition, the barrier pairs with the barrier in
+        * __mnt_want_write() before the mnt_is_readonly() check. The barrier
+        * makes sure that if __mnt_want_write() sees MNT_WRITE_HOLD already
+        * cleared, it will see s_readonly_remount set.
+        */
+       smp_wmb();
+}
+
+/*
+ * Ends the section changing the read-only state of the superblock. After this
+ * function returns, if mnt_is_readonly() returns false, the caller will be
+ * able to observe all the changes the remount did to the superblock.
+ */
+static inline void sb_end_ro_state_change(struct super_block *sb)
+{
+       /*
+        * This barrier provides release semantics that pairs with
+        * the smp_rmb() acquire semantics in mnt_is_readonly().
+        * This barrier pair ensures that when mnt_is_readonly() sees
+        * 0 for sb->s_readonly_remount, it will also see all the
+        * preceding flag changes that were made during the RO state
+        * change.
+        */
+       smp_wmb();
+       WRITE_ONCE(sb->s_readonly_remount, 0);
+}
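
A hedged sketch of the writer side these helpers are designed for: the remount path brackets its flag flip so that mnt_is_readonly() either sees s_readonly_remount set (remount in progress) or, once it reads the field as clear, also sees the final flag values through the paired barriers. The function below is hypothetical; sb_prepare_remount_readonly() in fs/namespace.c is the in-tree user converted later in this diff.

static void myfs_remount_flip_ro(struct super_block *sb, bool ro)
{
        sb_start_ro_state_change(sb);

        /* Flag updates made in this window are ordered against readers by
         * the smp_wmb() calls in the helpers above. */
        if (ro)
                sb->s_flags |= SB_RDONLY;
        else
                sb->s_flags &= ~SB_RDONLY;

        sb_end_ro_state_change(sb);
}
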
+
+/*
  * open.c
  */
 struct open_flags {
@@ -152,6 +194,8 @@ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
 int dentry_needs_remove_privs(struct mnt_idmap *, struct dentry *dentry);
 bool in_group_or_capable(struct mnt_idmap *idmap,
                         const struct inode *inode, vfsgid_t vfsgid);
+void lock_two_inodes(struct inode *inode1, struct inode *inode2,
+                    unsigned subclass1, unsigned subclass2);
 
 /*
  * fs-writeback.c
index 063133e..a4fa81a 100644 (file)
@@ -312,7 +312,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
                        ctx->bio->bi_opf |= REQ_RAHEAD;
                ctx->bio->bi_iter.bi_sector = sector;
                ctx->bio->bi_end_io = iomap_read_end_io;
-               bio_add_folio(ctx->bio, folio, plen, poff);
+               bio_add_folio_nofail(ctx->bio, folio, plen, poff);
        }
 
 done:
@@ -539,7 +539,7 @@ static int iomap_read_folio_sync(loff_t block_start, struct folio *folio,
 
        bio_init(&bio, iomap->bdev, &bvec, 1, REQ_OP_READ);
        bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
-       bio_add_folio(&bio, folio, plen, poff);
+       bio_add_folio_nofail(&bio, folio, plen, poff);
        return submit_bio_wait(&bio);
 }
 
@@ -864,16 +864,19 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
                .len            = iov_iter_count(i),
                .flags          = IOMAP_WRITE,
        };
-       int ret;
+       ssize_t ret;
 
        if (iocb->ki_flags & IOCB_NOWAIT)
                iter.flags |= IOMAP_NOWAIT;
 
        while ((ret = iomap_iter(&iter, ops)) > 0)
                iter.processed = iomap_write_iter(&iter, i);
-       if (iter.pos == iocb->ki_pos)
+
+       if (unlikely(ret < 0))
                return ret;
-       return iter.pos - iocb->ki_pos;
+       ret = iter.pos - iocb->ki_pos;
+       iocb->ki_pos += ret;
+       return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
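
With the hunk above, iomap_file_buffered_write() advances iocb->ki_pos itself and returns an ssize_t, so filesystem ->write_iter implementations no longer bump the position on their own. A hedged caller sketch; myfs_iomap_ops and the surrounding function are hypothetical, and real callers also hold the appropriate inode locks:

static ssize_t myfs_buffered_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
        ssize_t ret;

        ret = iomap_file_buffered_write(iocb, from, &myfs_iomap_ops);
        if (ret > 0)
                ret = generic_write_sync(iocb, ret); /* ki_pos already updated */
        return ret;
}
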
 
@@ -1582,7 +1585,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
 
        if (!bio_add_folio(wpc->ioend->io_bio, folio, len, poff)) {
                wpc->ioend->io_bio = iomap_chain_bio(wpc->ioend->io_bio);
-               bio_add_folio(wpc->ioend->io_bio, folio, len, poff);
+               bio_add_folio_nofail(wpc->ioend->io_bio, folio, len, poff);
        }
 
        if (iop)
index 019cc87..ea3b868 100644 (file)
@@ -81,7 +81,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
        const struct iomap_dio_ops *dops = dio->dops;
        struct kiocb *iocb = dio->iocb;
-       struct inode *inode = file_inode(iocb->ki_filp);
        loff_t offset = iocb->ki_pos;
        ssize_t ret = dio->error;
 
@@ -94,7 +93,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
                if (offset + ret > dio->i_size &&
                    !(dio->flags & IOMAP_DIO_WRITE))
                        ret = dio->i_size - offset;
-               iocb->ki_pos += ret;
        }
 
        /*
@@ -109,30 +107,25 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
         * ->end_io() when necessary, otherwise a racing buffer read would cache
         * zeros from unwritten extents.
         */
-       if (!dio->error && dio->size &&
-           (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
-               int err;
-               err = invalidate_inode_pages2_range(inode->i_mapping,
-                               offset >> PAGE_SHIFT,
-                               (offset + dio->size - 1) >> PAGE_SHIFT);
-               if (err)
-                       dio_warn_stale_pagecache(iocb->ki_filp);
-       }
+       if (!dio->error && dio->size && (dio->flags & IOMAP_DIO_WRITE))
+               kiocb_invalidate_post_direct_write(iocb, dio->size);
 
        inode_dio_end(file_inode(iocb->ki_filp));
-       /*
-        * If this is a DSYNC write, make sure we push it to stable storage now
-        * that we've written data.
-        */
-       if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
-               ret = generic_write_sync(iocb, ret);
 
-       if (ret > 0)
-               ret += dio->done_before;
+       if (ret > 0) {
+               iocb->ki_pos += ret;
 
+               /*
+                * If this is a DSYNC write, make sure we push it to stable
+                * storage now that we've written data.
+                */
+               if (dio->flags & IOMAP_DIO_NEED_SYNC)
+                       ret = generic_write_sync(iocb, ret);
+               if (ret > 0)
+                       ret += dio->done_before;
+       }
        trace_iomap_dio_complete(iocb, dio->error, ret);
        kfree(dio);
-
        return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_dio_complete);
@@ -203,7 +196,6 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
        bio->bi_private = dio;
        bio->bi_end_io = iomap_dio_bio_end_io;
 
-       get_page(page);
        __bio_add_page(bio, page, len, 0);
        iomap_dio_submit_bio(iter, dio, bio, pos);
 }
@@ -479,7 +471,6 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
                const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
                unsigned int dio_flags, void *private, size_t done_before)
 {
-       struct address_space *mapping = iocb->ki_filp->f_mapping;
        struct inode *inode = file_inode(iocb->ki_filp);
        struct iomap_iter iomi = {
                .inode          = inode,
@@ -488,11 +479,11 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
                .flags          = IOMAP_DIRECT,
                .private        = private,
        };
-       loff_t end = iomi.pos + iomi.len - 1, ret = 0;
        bool wait_for_completion =
                is_sync_kiocb(iocb) || (dio_flags & IOMAP_DIO_FORCE_WAIT);
        struct blk_plug plug;
        struct iomap_dio *dio;
+       loff_t ret = 0;
 
        trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
 
@@ -516,31 +507,29 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
        dio->submit.waiter = current;
        dio->submit.poll_bio = NULL;
 
+       if (iocb->ki_flags & IOCB_NOWAIT)
+               iomi.flags |= IOMAP_NOWAIT;
+
        if (iov_iter_rw(iter) == READ) {
                if (iomi.pos >= dio->i_size)
                        goto out_free_dio;
 
-               if (iocb->ki_flags & IOCB_NOWAIT) {
-                       if (filemap_range_needs_writeback(mapping, iomi.pos,
-                                       end)) {
-                               ret = -EAGAIN;
-                               goto out_free_dio;
-                       }
-                       iomi.flags |= IOMAP_NOWAIT;
-               }
-
                if (user_backed_iter(iter))
                        dio->flags |= IOMAP_DIO_DIRTY;
+
+               ret = kiocb_write_and_wait(iocb, iomi.len);
+               if (ret)
+                       goto out_free_dio;
        } else {
                iomi.flags |= IOMAP_WRITE;
                dio->flags |= IOMAP_DIO_WRITE;
 
-               if (iocb->ki_flags & IOCB_NOWAIT) {
-                       if (filemap_range_has_page(mapping, iomi.pos, end)) {
-                               ret = -EAGAIN;
+               if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
+                       ret = -EAGAIN;
+                       if (iomi.pos >= dio->i_size ||
+                           iomi.pos + iomi.len > dio->i_size)
                                goto out_free_dio;
-                       }
-                       iomi.flags |= IOMAP_NOWAIT;
+                       iomi.flags |= IOMAP_OVERWRITE_ONLY;
                }
 
                /* for data sync or sync, we need sync completion processing */
@@ -556,31 +545,19 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
                        if (!(iocb->ki_flags & IOCB_SYNC))
                                dio->flags |= IOMAP_DIO_WRITE_FUA;
                }
-       }
 
-       if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
-               ret = -EAGAIN;
-               if (iomi.pos >= dio->i_size ||
-                   iomi.pos + iomi.len > dio->i_size)
-                       goto out_free_dio;
-               iomi.flags |= IOMAP_OVERWRITE_ONLY;
-       }
-
-       ret = filemap_write_and_wait_range(mapping, iomi.pos, end);
-       if (ret)
-               goto out_free_dio;
-
-       if (iov_iter_rw(iter) == WRITE) {
                /*
                 * Try to invalidate cache pages for the range we are writing.
                 * If this invalidation fails, let the caller fall back to
                 * buffered I/O.
                 */
-               if (invalidate_inode_pages2_range(mapping,
-                               iomi.pos >> PAGE_SHIFT, end >> PAGE_SHIFT)) {
-                       trace_iomap_dio_invalidate_fail(inode, iomi.pos,
-                                                       iomi.len);
-                       ret = -ENOTBLK;
+               ret = kiocb_invalidate_pages(iocb, iomi.len);
+               if (ret) {
+                       if (ret != -EAGAIN) {
+                               trace_iomap_dio_invalidate_fail(inode, iomi.pos,
+                                                               iomi.len);
+                               ret = -ENOTBLK;
+                       }
                        goto out_free_dio;
                }
 
index 8ae4191..6e17f8f 100644 (file)
@@ -1491,7 +1491,6 @@ journal_t *jbd2_journal_init_inode(struct inode *inode)
 {
        journal_t *journal;
        sector_t blocknr;
-       char *p;
        int err = 0;
 
        blocknr = 0;
@@ -1515,9 +1514,8 @@ journal_t *jbd2_journal_init_inode(struct inode *inode)
 
        journal->j_inode = inode;
        snprintf(journal->j_devname, sizeof(journal->j_devname),
-                "%pg", journal->j_dev);
-       p = strreplace(journal->j_devname, '/', '!');
-       sprintf(p, "-%lu", journal->j_inode->i_ino);
+                "%pg-%lu", journal->j_dev, journal->j_inode->i_ino);
+       strreplace(journal->j_devname, '/', '!');
        jbd2_stats_proc_init(journal);
 
        return journal;
index 837cd55..6ae9d6f 100644 (file)
@@ -211,7 +211,10 @@ static int jffs2_build_filesystem(struct jffs2_sb_info *c)
                ic->scan_dents = NULL;
                cond_resched();
        }
-       jffs2_build_xattr_subsystem(c);
+       ret = jffs2_build_xattr_subsystem(c);
+       if (ret)
+               goto exit;
+
        c->flags &= ~JFFS2_SB_FLAG_BUILDING;
 
        dbg_fsbuild("FS build complete\n");
index 96b0275..2345ca3 100644 (file)
@@ -56,7 +56,7 @@ const struct file_operations jffs2_file_operations =
        .unlocked_ioctl=jffs2_ioctl,
        .mmap =         generic_file_readonly_mmap,
        .fsync =        jffs2_fsync,
-       .splice_read =  generic_file_splice_read,
+       .splice_read =  filemap_splice_read,
        .splice_write = iter_file_splice_write,
 };
 
index aa4048a..3b6bdc9 100644 (file)
@@ -772,10 +772,10 @@ void jffs2_clear_xattr_subsystem(struct jffs2_sb_info *c)
 }
 
 #define XREF_TMPHASH_SIZE      (128)
-void jffs2_build_xattr_subsystem(struct jffs2_sb_info *c)
+int jffs2_build_xattr_subsystem(struct jffs2_sb_info *c)
 {
        struct jffs2_xattr_ref *ref, *_ref;
-       struct jffs2_xattr_ref *xref_tmphash[XREF_TMPHASH_SIZE];
+       struct jffs2_xattr_ref **xref_tmphash;
        struct jffs2_xattr_datum *xd, *_xd;
        struct jffs2_inode_cache *ic;
        struct jffs2_raw_node_ref *raw;
@@ -784,9 +784,12 @@ void jffs2_build_xattr_subsystem(struct jffs2_sb_info *c)
 
        BUG_ON(!(c->flags & JFFS2_SB_FLAG_BUILDING));
 
+       xref_tmphash = kcalloc(XREF_TMPHASH_SIZE,
+                              sizeof(struct jffs2_xattr_ref *), GFP_KERNEL);
+       if (!xref_tmphash)
+               return -ENOMEM;
+
        /* Phase.1 : Merge same xref */
-       for (i=0; i < XREF_TMPHASH_SIZE; i++)
-               xref_tmphash[i] = NULL;
        for (ref=c->xref_temp; ref; ref=_ref) {
                struct jffs2_xattr_ref *tmp;
 
@@ -884,6 +887,8 @@ void jffs2_build_xattr_subsystem(struct jffs2_sb_info *c)
                     "%u of xref (%u dead, %u orphan) found.\n",
                     xdatum_count, xdatum_unchecked_count, xdatum_orphan_count,
                     xref_count, xref_dead_count, xref_orphan_count);
+       kfree(xref_tmphash);
+       return 0;
 }
 
 struct jffs2_xattr_datum *jffs2_setup_xattr_datum(struct jffs2_sb_info *c,
index 720007b..1b5030a 100644 (file)
@@ -71,7 +71,7 @@ static inline int is_xattr_ref_dead(struct jffs2_xattr_ref *ref)
 #ifdef CONFIG_JFFS2_FS_XATTR
 
 extern void jffs2_init_xattr_subsystem(struct jffs2_sb_info *c);
-extern void jffs2_build_xattr_subsystem(struct jffs2_sb_info *c);
+extern int jffs2_build_xattr_subsystem(struct jffs2_sb_info *c);
 extern void jffs2_clear_xattr_subsystem(struct jffs2_sb_info *c);
 
 extern struct jffs2_xattr_datum *jffs2_setup_xattr_datum(struct jffs2_sb_info *c,
@@ -103,7 +103,7 @@ extern ssize_t jffs2_listxattr(struct dentry *, char *, size_t);
 #else
 
 #define jffs2_init_xattr_subsystem(c)
-#define jffs2_build_xattr_subsystem(c)
+#define jffs2_build_xattr_subsystem(c)         (0)
 #define jffs2_clear_xattr_subsystem(c)
 
 #define jffs2_xattr_do_crccheck_inode(c, ic)
index 2ee35be..01b6912 100644 (file)
@@ -144,7 +144,7 @@ const struct file_operations jfs_file_operations = {
        .read_iter      = generic_file_read_iter,
        .write_iter     = generic_file_write_iter,
        .mmap           = generic_file_mmap,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .splice_write   = iter_file_splice_write,
        .fsync          = jfs_fsync,
        .release        = jfs_release,
index 695415c..e855b8f 100644 (file)
@@ -1100,8 +1100,8 @@ int lmLogOpen(struct super_block *sb)
         * file systems to log may have n-to-1 relationship;
         */
 
-       bdev = blkdev_get_by_dev(sbi->logdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL,
-                                log);
+       bdev = blkdev_get_by_dev(sbi->logdev, BLK_OPEN_READ | BLK_OPEN_WRITE,
+                                log, NULL);
        if (IS_ERR(bdev)) {
                rc = PTR_ERR(bdev);
                goto free;
@@ -1141,7 +1141,7 @@ journal_found:
        lbmLogShutdown(log);
 
       close:           /* close external log device */
-       blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
+       blkdev_put(bdev, log);
 
       free:            /* free log descriptor */
        mutex_unlock(&jfs_log_mutex);
@@ -1485,7 +1485,7 @@ int lmLogClose(struct super_block *sb)
        bdev = log->bdev;
        rc = lmLogShutdown(log);
 
-       blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
+       blkdev_put(bdev, log);
 
        kfree(log);
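
The jfs log changes above follow the block layer's reworked open API: blkdev_get_by_dev() now takes BLK_OPEN_* flags plus a holder cookie (and optional holder ops), and blkdev_put() releases by that same holder rather than a mode mask. A hedged sketch of the pairing with hypothetical myfs_* wrappers:

static struct block_device *myfs_open_log_dev(dev_t devt, void *holder)
{
        /* As in the lmLogOpen() conversion above, the explicit FMODE_EXCL is
         * gone; the non-NULL holder is what requests the exclusive claim
         * (an assumption based on that conversion), and NULL holder ops keep
         * the default behaviour. */
        return blkdev_get_by_dev(devt, BLK_OPEN_READ | BLK_OPEN_WRITE,
                                 holder, NULL);
}

static void myfs_close_log_dev(struct block_device *bdev, void *holder)
{
        blkdev_put(bdev, holder);
}
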
 
@@ -1974,7 +1974,7 @@ static int lbmRead(struct jfs_log * log, int pn, struct lbuf ** bpp)
 
        bio = bio_alloc(log->bdev, 1, REQ_OP_READ, GFP_NOFS);
        bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
-       bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
+       __bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
        BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);
 
        bio->bi_end_io = lbmIODone;
@@ -2115,7 +2115,7 @@ static void lbmStartIO(struct lbuf * bp)
 
        bio = bio_alloc(log->bdev, 1, REQ_OP_WRITE | REQ_SYNC, GFP_NOFS);
        bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
-       bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
+       __bio_add_page(bio, bp->l_page, LOGPSIZE, bp->l_offset);
        BUG_ON(bio->bi_iter.bi_size != LOGPSIZE);
 
        bio->bi_end_io = lbmIODone;
index b29d68b..494b9f4 100644 (file)
@@ -876,7 +876,7 @@ static int jfs_symlink(struct mnt_idmap *idmap, struct inode *dip,
        tid_t tid;
        ino_t ino = 0;
        struct component_name dname;
-       int ssize;              /* source pathname size */
+       u32 ssize;              /* source pathname size */
        struct btstack btstack;
        struct inode *ip = d_inode(dentry);
        s64 xlen = 0;
@@ -957,7 +957,7 @@ static int jfs_symlink(struct mnt_idmap *idmap, struct inode *dip,
                if (ssize > sizeof (JFS_IP(ip)->i_inline))
                        JFS_IP(ip)->mode2 &= ~INLINEEA;
 
-               jfs_info("jfs_symlink: fast symlink added  ssize:%d name:%s ",
+               jfs_info("jfs_symlink: fast symlink added  ssize:%u name:%s ",
                         ssize, name);
        }
        /*
@@ -987,7 +987,7 @@ static int jfs_symlink(struct mnt_idmap *idmap, struct inode *dip,
                ip->i_size = ssize - 1;
                while (ssize) {
                        /* This is kind of silly since PATH_MAX == 4K */
-                       int copy_size = min(ssize, PSIZE);
+                       u32 copy_size = min_t(u32, ssize, PSIZE);
 
                        mp = get_metapage(ip, xaddr, PSIZE, 1);
 
index 40c4661..180906c 100644 (file)
@@ -1011,7 +1011,7 @@ const struct file_operations kernfs_file_fops = {
        .release        = kernfs_fop_release,
        .poll           = kernfs_fop_poll,
        .fsync          = noop_fsync,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = copy_splice_read,
        .splice_write   = iter_file_splice_write,
 };
 
index 89cf614..5b85131 100644 (file)
@@ -1613,3 +1613,44 @@ u64 inode_query_iversion(struct inode *inode)
        return cur >> I_VERSION_QUERIED_SHIFT;
 }
 EXPORT_SYMBOL(inode_query_iversion);
+
+ssize_t direct_write_fallback(struct kiocb *iocb, struct iov_iter *iter,
+               ssize_t direct_written, ssize_t buffered_written)
+{
+       struct address_space *mapping = iocb->ki_filp->f_mapping;
+       loff_t pos = iocb->ki_pos - buffered_written;
+       loff_t end = iocb->ki_pos - 1;
+       int err;
+
+       /*
+        * If the buffered write fallback returned an error, we want to return
+        * the number of bytes which were written by direct I/O, or the error
+        * code if that was zero.
+        *
+        * Note that this differs from normal direct-io semantics, which will
+        * return -EFOO even if some bytes were written.
+        */
+       if (unlikely(buffered_written < 0)) {
+               if (direct_written)
+                       return direct_written;
+               return buffered_written;
+       }
+
+       /*
+        * We need to ensure that the page cache pages are written to disk and
+        * invalidated to preserve the expected O_DIRECT semantics.
+        */
+       err = filemap_write_and_wait_range(mapping, pos, end);
+       if (err < 0) {
+               /*
+                * We don't know how much we wrote, so just return the number of
+                * bytes which were direct-written
+                */
+               if (direct_written)
+                       return direct_written;
+               return err;
+       }
+       invalidate_mapping_pages(mapping, pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
+       return direct_written + buffered_written;
+}
+EXPORT_SYMBOL_GPL(direct_write_fallback);
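
direct_write_fallback() above centralizes the "short or refused O_DIRECT write, finish through the page cache" handling. A hedged ->write_iter sketch of its intended use; myfs_direct_write() and myfs_buffered_write() are hypothetical stand-ins for the filesystem's own write paths:

static ssize_t myfs_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
        ssize_t direct_written, buffered_written;

        if (!(iocb->ki_flags & IOCB_DIRECT))
                return myfs_buffered_write(iocb, from);

        direct_written = myfs_direct_write(iocb, from);
        if (direct_written < 0 || !iov_iter_count(from))
                return direct_written;

        /* Direct I/O stopped short: write the remainder through the page
         * cache, then let the helper flush, invalidate and merge the counts. */
        buffered_written = myfs_buffered_write(iocb, from);
        return direct_write_fallback(iocb, from, direct_written, buffered_written);
}
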
index 04ba95b..22d3ff3 100644 (file)
@@ -355,7 +355,6 @@ static int lockd_get(void)
        int error;
 
        if (nlmsvc_serv) {
-               svc_get(nlmsvc_serv);
                nlmsvc_users++;
                return 0;
        }
index 0dd05d4..906d192 100644 (file)
@@ -19,7 +19,7 @@ const struct file_operations minix_file_operations = {
        .write_iter     = generic_file_write_iter,
        .mmap           = generic_file_mmap,
        .fsync          = generic_file_fsync,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
 };
 
 static int minix_setattr(struct mnt_idmap *idmap,
index e4fe087..91171da 100644 (file)
@@ -3028,8 +3028,8 @@ static struct dentry *lock_two_directories(struct dentry *p1, struct dentry *p2)
                return p;
        }
 
-       inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
-       inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2);
+       lock_two_inodes(p1->d_inode, p2->d_inode,
+                       I_MUTEX_PARENT, I_MUTEX_PARENT2);
        return NULL;
 }
 
@@ -3703,7 +3703,7 @@ static int vfs_tmpfile(struct mnt_idmap *idmap,
 }
 
 /**
- * vfs_tmpfile_open - open a tmpfile for kernel internal use
+ * kernel_tmpfile_open - open a tmpfile for kernel internal use
  * @idmap:     idmap of the mount the inode was found from
  * @parentpath:        path of the base directory
  * @mode:      mode of the new tmpfile
@@ -3714,24 +3714,26 @@ static int vfs_tmpfile(struct mnt_idmap *idmap,
  * hence this is only for kernel internal use, and must not be installed into
  * file tables or such.
  */
-struct file *vfs_tmpfile_open(struct mnt_idmap *idmap,
-                         const struct path *parentpath,
-                         umode_t mode, int open_flag, const struct cred *cred)
+struct file *kernel_tmpfile_open(struct mnt_idmap *idmap,
+                                const struct path *parentpath,
+                                umode_t mode, int open_flag,
+                                const struct cred *cred)
 {
        struct file *file;
        int error;
 
        file = alloc_empty_file_noaccount(open_flag, cred);
-       if (!IS_ERR(file)) {
-               error = vfs_tmpfile(idmap, parentpath, file, mode);
-               if (error) {
-                       fput(file);
-                       file = ERR_PTR(error);
-               }
+       if (IS_ERR(file))
+               return file;
+
+       error = vfs_tmpfile(idmap, parentpath, file, mode);
+       if (error) {
+               fput(file);
+               file = ERR_PTR(error);
        }
        return file;
 }
-EXPORT_SYMBOL(vfs_tmpfile_open);
+EXPORT_SYMBOL(kernel_tmpfile_open);
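
The rename from vfs_tmpfile_open() to kernel_tmpfile_open() above underlines that the returned file is for kernel-internal use only and must never be installed into a file descriptor table. A hedged sketch of a caller, with a hypothetical myfs scratch-file helper:

static struct file *myfs_open_scratch(const struct path *parentpath)
{
        /* Unlinked O_TMPFILE-style file, kept private to the kernel. */
        return kernel_tmpfile_open(mnt_idmap(parentpath->mnt), parentpath,
                                   S_IFREG | 0600, O_RDWR, current_cred());
}
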
 
 static int do_tmpfile(struct nameidata *nd, unsigned flags,
                const struct open_flags *op,
@@ -4731,7 +4733,7 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
  *        sb->s_vfs_rename_mutex. We might be more accurate, but that's another
  *        story.
  *     c) we have to lock _four_ objects - parents and victim (if it exists),
- *        and source (if it is not a directory).
+ *        and source.
  *        And that - after we got ->i_mutex on parents (until then we don't know
  *        whether the target exists).  Solution: try to be smart with locking
  *        order for inodes.  We rely on the fact that tree topology may change
@@ -4815,10 +4817,16 @@ int vfs_rename(struct renamedata *rd)
 
        take_dentry_name_snapshot(&old_name, old_dentry);
        dget(new_dentry);
-       if (!is_dir || (flags & RENAME_EXCHANGE))
-               lock_two_nondirectories(source, target);
-       else if (target)
-               inode_lock(target);
+       /*
+        * Lock all moved children. Moved directories may need to change their
+        * parent pointer, so they need the lock to protect against concurrent
+        * directory changes moving the parent pointer. For regular files we've
+        * historically always done this. The lockdep locking subclasses are
+        * somewhat arbitrary but RENAME_EXCHANGE in particular can swap
+        * regular files and directories so it's difficult to tell which
+        * subclasses to use.
+        */
+       lock_two_inodes(source, target, I_MUTEX_NORMAL, I_MUTEX_NONDIR2);
 
        error = -EPERM;
        if (IS_SWAPFILE(source) || (target && IS_SWAPFILE(target)))
@@ -4866,9 +4874,9 @@ int vfs_rename(struct renamedata *rd)
                        d_exchange(old_dentry, new_dentry);
        }
 out:
-       if (!is_dir || (flags & RENAME_EXCHANGE))
-               unlock_two_nondirectories(source, target);
-       else if (target)
+       if (source)
+               inode_unlock(source);
+       if (target)
                inode_unlock(target);
        dput(new_dentry);
        if (!error) {
index 54847db..e157efc 100644 (file)
@@ -309,9 +309,16 @@ static unsigned int mnt_get_writers(struct mount *mnt)
 
 static int mnt_is_readonly(struct vfsmount *mnt)
 {
-       if (mnt->mnt_sb->s_readonly_remount)
+       if (READ_ONCE(mnt->mnt_sb->s_readonly_remount))
                return 1;
-       /* Order wrt setting s_flags/s_readonly_remount in do_remount() */
+       /*
+        * The barrier pairs with the barrier in sb_start_ro_state_change(),
+        * making sure that if we don't see s_readonly_remount set yet, we will
+        * also not see any superblock / mount flag changes done by the remount.
+        * It also pairs with the barrier in sb_end_ro_state_change(),
+        * ensuring that if we see s_readonly_remount already cleared, we will
+        * see the values of the superblock / mount flags updated by the remount.
+        */
        smp_rmb();
        return __mnt_is_readonly(mnt);
 }
@@ -364,9 +371,11 @@ int __mnt_want_write(struct vfsmount *m)
                }
        }
        /*
-        * After the slowpath clears MNT_WRITE_HOLD, mnt_is_readonly will
-        * be set to match its requirements. So we must not load that until
-        * MNT_WRITE_HOLD is cleared.
+        * The barrier pairs with the barrier in sb_start_ro_state_change(),
+        * making sure that if we see MNT_WRITE_HOLD cleared, we will also see
+        * s_readonly_remount set (or even SB_RDONLY / MNT_READONLY flags) in
+        * mnt_is_readonly() and bail in case we are racing with remount
+        * read-only.
         */
        smp_rmb();
        if (mnt_is_readonly(m)) {
@@ -588,10 +597,8 @@ int sb_prepare_remount_readonly(struct super_block *sb)
        if (!err && atomic_long_read(&sb->s_remove_count))
                err = -EBUSY;
 
-       if (!err) {
-               sb->s_readonly_remount = 1;
-               smp_wmb();
-       }
+       if (!err)
+               sb_start_ro_state_change(sb);
        list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
                if (mnt->mnt.mnt_flags & MNT_WRITE_HOLD)
                        mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
@@ -658,9 +665,25 @@ static bool legitimize_mnt(struct vfsmount *bastard, unsigned seq)
        return false;
 }
 
-/*
- * find the first mount at @dentry on vfsmount @mnt.
- * call under rcu_read_lock()
+/**
+ * __lookup_mnt - find first child mount
+ * @mnt:       parent mount
+ * @dentry:    mountpoint
+ *
+ * If @mnt has a child mount @c mounted at @dentry, find and return it.
+ *
+ * Note that the child mount @c need not be unique. There are cases
+ * where shadow mounts are created. For example, during mount
+ * propagation when a source mount @mnt whose root got overmounted by a
+ * mount @o after path lookup but before @namespace_sem could be
+ * acquired gets copied and propagated. So @mnt gets copied including
+ * @o. When @mnt is propagated to a destination mount @d that already
+ * has another mount @n mounted at the same mountpoint, then the source
+ * mount @mnt will be tucked beneath @n, i.e., @n will be mounted on
+ * @mnt and @mnt mounted on @d. Now both @n and @o are mounted on @mnt
+ * at @dentry.
+ *
+ * Return: The first child of @mnt mounted at @dentry, or NULL.
  */
 struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
 {
@@ -910,6 +933,33 @@ void mnt_set_mountpoint(struct mount *mnt,
        hlist_add_head(&child_mnt->mnt_mp_list, &mp->m_list);
 }
 
+/**
+ * mnt_set_mountpoint_beneath - mount a mount beneath another one
+ *
+ * @new_parent: the source mount
+ * @top_mnt:    the mount beneath which @new_parent is mounted
+ * @new_mp:     the new mountpoint of @top_mnt on @new_parent
+ *
+ * Remove @top_mnt from its current mountpoint @top_mnt->mnt_mp and
+ * parent @top_mnt->mnt_parent and mount it on top of @new_parent at
+ * @new_mp. Then mount @new_parent on the old parent and old
+ * mountpoint of @top_mnt.
+ *
+ * Context: This function expects namespace_lock() and lock_mount_hash()
+ *          to have been acquired in that order.
+ */
+static void mnt_set_mountpoint_beneath(struct mount *new_parent,
+                                      struct mount *top_mnt,
+                                      struct mountpoint *new_mp)
+{
+       struct mount *old_top_parent = top_mnt->mnt_parent;
+       struct mountpoint *old_top_mp = top_mnt->mnt_mp;
+
+       mnt_set_mountpoint(old_top_parent, old_top_mp, new_parent);
+       mnt_change_mountpoint(new_parent, new_mp, top_mnt);
+}
+
 static void __attach_mnt(struct mount *mnt, struct mount *parent)
 {
        hlist_add_head_rcu(&mnt->mnt_hash,
@@ -917,15 +967,42 @@ static void __attach_mnt(struct mount *mnt, struct mount *parent)
        list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
 }
 
-/*
- * vfsmount lock must be held for write
+/**
+ * attach_mnt - mount a mount, attach to @mount_hashtable and parent's
+ *              list of child mounts
+ * @parent:  the parent
+ * @mnt:     the new mount
+ * @mp:      the new mountpoint
+ * @beneath: whether to mount @mnt beneath or on top of @parent
+ *
+ * If @beneath is false, mount @mnt at @mp on @parent. Then attach @mnt
+ * to @parent's child mount list and to @mount_hashtable.
+ *
+ * If @beneath is true, remove @mnt from its current parent and
+ * mountpoint and mount it on @mp on @parent, and mount @parent on the
+ * old parent and old mountpoint of @mnt. Finally, attach @parent to
+ * @mnt_hashtable and @parent->mnt_parent->mnt_mounts.
+ *
+ * Note, when __attach_mnt() is called @mnt->mnt_parent already points
+ * to the correct parent.
+ *
+ * Context: This function expects namespace_lock() and lock_mount_hash()
+ *          to have been acquired in that order.
  */
-static void attach_mnt(struct mount *mnt,
-                       struct mount *parent,
-                       struct mountpoint *mp)
+static void attach_mnt(struct mount *mnt, struct mount *parent,
+                      struct mountpoint *mp, bool beneath)
 {
-       mnt_set_mountpoint(parent, mp, mnt);
-       __attach_mnt(mnt, parent);
+       if (beneath)
+               mnt_set_mountpoint_beneath(mnt, parent, mp);
+       else
+               mnt_set_mountpoint(parent, mp, mnt);
+       /*
+        * Note, @mnt->mnt_parent has to be used. If @mnt was mounted
+        * beneath @parent then @mnt will need to be attached to
+        * @parent's old parent, not @parent. IOW, @mnt->mnt_parent
+        * isn't the same mount as @parent.
+        */
+       __attach_mnt(mnt, mnt->mnt_parent);
 }
 
 void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct mount *mnt)
@@ -937,7 +1014,7 @@ void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct m
        hlist_del_init(&mnt->mnt_mp_list);
        hlist_del_init_rcu(&mnt->mnt_hash);
 
-       attach_mnt(mnt, parent, mp);
+       attach_mnt(mnt, parent, mp, false);
 
        put_mountpoint(old_mp);
        mnt_add_count(old_parent, -1);
@@ -1767,6 +1844,19 @@ bool may_mount(void)
        return ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN);
 }
 
+/**
+ * path_mounted - check whether path is mounted
+ * @path: path to check
+ *
+ * Determine whether @path refers to the root of a mount.
+ *
+ * Return: true if @path is the root of a mount, false if not.
+ */
+static inline bool path_mounted(const struct path *path)
+{
+       return path->mnt->mnt_root == path->dentry;
+}
+
 static void warn_mandlock(void)
 {
        pr_warn_once("=======================================================\n"
@@ -1782,7 +1872,7 @@ static int can_umount(const struct path *path, int flags)
 
        if (!may_mount())
                return -EPERM;
-       if (path->dentry != path->mnt->mnt_root)
+       if (!path_mounted(path))
                return -EINVAL;
        if (!check_mnt(mnt))
                return -EINVAL;
@@ -1925,7 +2015,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
                                goto out;
                        lock_mount_hash();
                        list_add_tail(&q->mnt_list, &res->mnt_list);
-                       attach_mnt(q, parent, p->mnt_mp);
+                       attach_mnt(q, parent, p->mnt_mp, false);
                        unlock_mount_hash();
                }
        }
@@ -2134,12 +2224,17 @@ int count_mounts(struct mnt_namespace *ns, struct mount *mnt)
        return 0;
 }
 
-/*
- *  @source_mnt : mount tree to be attached
- *  @nd         : place the mount tree @source_mnt is attached
- *  @parent_nd  : if non-null, detach the source_mnt from its parent and
- *                store the parent mount and mountpoint dentry.
- *                (done when source_mnt is moved)
+enum mnt_tree_flags_t {
+       MNT_TREE_MOVE = BIT(0),
+       MNT_TREE_BENEATH = BIT(1),
+};
+
+/**
+ * attach_recursive_mnt - attach a source mount tree
+ * @source_mnt: mount tree to be attached
+ * @top_mnt:    mount that @source_mnt will be mounted on or mounted beneath
+ * @dest_mp:    the mountpoint @source_mnt will be mounted at
+ * @flags:      modify how @source_mnt is supposed to be attached
  *
  *  NOTE: the table below explains the semantics when a source mount
  *  of a given type is attached to a destination mount of a given type.
@@ -2196,22 +2291,28 @@ int count_mounts(struct mnt_namespace *ns, struct mount *mnt)
  * applied to each mount in the tree.
  * Must be called without spinlocks held, since this function can sleep
  * in allocations.
+ *
+ * Context: The function expects namespace_lock() to be held.
+ * Return: If @source_mnt was successfully attached 0 is returned.
+ *         Otherwise a negative error code is returned.
  */
 static int attach_recursive_mnt(struct mount *source_mnt,
-                       struct mount *dest_mnt,
-                       struct mountpoint *dest_mp,
-                       bool moving)
+                               struct mount *top_mnt,
+                               struct mountpoint *dest_mp,
+                               enum mnt_tree_flags_t flags)
 {
        struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns;
        HLIST_HEAD(tree_list);
-       struct mnt_namespace *ns = dest_mnt->mnt_ns;
+       struct mnt_namespace *ns = top_mnt->mnt_ns;
        struct mountpoint *smp;
-       struct mount *child, *p;
+       struct mount *child, *dest_mnt, *p;
        struct hlist_node *n;
-       int err;
+       int err = 0;
+       bool moving = flags & MNT_TREE_MOVE, beneath = flags & MNT_TREE_BENEATH;
 
-       /* Preallocate a mountpoint in case the new mounts need
-        * to be tucked under other mounts.
+       /*
+        * Preallocate a mountpoint in case the new mounts need to be
+        * mounted beneath mounts on the same mountpoint.
         */
        smp = get_mountpoint(source_mnt->mnt.mnt_root);
        if (IS_ERR(smp))
@@ -2224,29 +2325,41 @@ static int attach_recursive_mnt(struct mount *source_mnt,
                        goto out;
        }
 
+       if (beneath)
+               dest_mnt = top_mnt->mnt_parent;
+       else
+               dest_mnt = top_mnt;
+
        if (IS_MNT_SHARED(dest_mnt)) {
                err = invent_group_ids(source_mnt, true);
                if (err)
                        goto out;
                err = propagate_mnt(dest_mnt, dest_mp, source_mnt, &tree_list);
-               lock_mount_hash();
-               if (err)
-                       goto out_cleanup_ids;
+       }
+       lock_mount_hash();
+       if (err)
+               goto out_cleanup_ids;
+
+       if (IS_MNT_SHARED(dest_mnt)) {
                for (p = source_mnt; p; p = next_mnt(p, source_mnt))
                        set_mnt_shared(p);
-       } else {
-               lock_mount_hash();
        }
+
        if (moving) {
+               if (beneath)
+                       dest_mp = smp;
                unhash_mnt(source_mnt);
-               attach_mnt(source_mnt, dest_mnt, dest_mp);
+               attach_mnt(source_mnt, top_mnt, dest_mp, beneath);
                touch_mnt_namespace(source_mnt->mnt_ns);
        } else {
                if (source_mnt->mnt_ns) {
                        /* move from anon - the caller will destroy */
                        list_del_init(&source_mnt->mnt_ns->list);
                }
-               mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
+               if (beneath)
+                       mnt_set_mountpoint_beneath(source_mnt, top_mnt, smp);
+               else
+                       mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
                commit_tree(source_mnt);
        }
 
@@ -2286,33 +2399,101 @@ static int attach_recursive_mnt(struct mount *source_mnt,
        return err;
 }
 
-static struct mountpoint *lock_mount(struct path *path)
+/**
+ * do_lock_mount - lock mount and mountpoint
+ * @path:    target path
+ * @beneath: whether the intention is to mount beneath @path
+ *
+ * Follow the mount stack on @path until the top mount @mnt is found. If
+ * the initial @path->{mnt,dentry} is a mountpoint, look up the first
+ * mount stacked on top of it. Then simply follow @{mnt,mnt->mnt_root}
+ * until nothing is stacked on top of it anymore.
+ *
+ * Acquire the inode_lock() on the top mount's ->mnt_root to protect
+ * against concurrent removal of the new mountpoint from another mount
+ * namespace.
+ *
+ * If @beneath is requested, the inode_lock() on @mnt's mountpoint
+ * @mp on @mnt->mnt_parent must be acquired. This protects against a
+ * concurrent unlink of @mp->mnt_dentry from another mount namespace
+ * where @mnt doesn't have a child mount mounted at @mp. A concurrent
+ * removal of @mnt->mnt_root doesn't matter as nothing will be mounted
+ * on top of it for @beneath.
+ *
+ * In addition, @beneath needs to make sure that @mnt hasn't been
+ * unmounted or moved from its current mountpoint in between dropping
+ * @mount_lock and acquiring @namespace_sem. For the !@beneath case @mnt
+ * being unmounted would be detected later by e.g., calling
+ * check_mnt(mnt) in the function it's called from. For the @beneath
+ * case however, it's useful to detect it directly in do_lock_mount().
+ * If @mnt hasn't been unmounted then @mnt->mnt_mountpoint still points
+ * to @mnt->mnt_mp->m_dentry. But if @mnt has been unmounted it will
+ * point to @mnt->mnt_root and @mnt->mnt_mp will be NULL.
+ *
+ * Return: Either the target mountpoint on the top mount or the top
+ *         mount's mountpoint.
+ */
+static struct mountpoint *do_lock_mount(struct path *path, bool beneath)
 {
-       struct vfsmount *mnt;
-       struct dentry *dentry = path->dentry;
-retry:
-       inode_lock(dentry->d_inode);
-       if (unlikely(cant_mount(dentry))) {
-               inode_unlock(dentry->d_inode);
-               return ERR_PTR(-ENOENT);
-       }
-       namespace_lock();
-       mnt = lookup_mnt(path);
-       if (likely(!mnt)) {
-               struct mountpoint *mp = get_mountpoint(dentry);
-               if (IS_ERR(mp)) {
+       struct vfsmount *mnt = path->mnt;
+       struct dentry *dentry;
+       struct mountpoint *mp = ERR_PTR(-ENOENT);
+
+       for (;;) {
+               struct mount *m;
+
+               if (beneath) {
+                       m = real_mount(mnt);
+                       read_seqlock_excl(&mount_lock);
+                       dentry = dget(m->mnt_mountpoint);
+                       read_sequnlock_excl(&mount_lock);
+               } else {
+                       dentry = path->dentry;
+               }
+
+               inode_lock(dentry->d_inode);
+               if (unlikely(cant_mount(dentry))) {
+                       inode_unlock(dentry->d_inode);
+                       goto out;
+               }
+
+               namespace_lock();
+
+               if (beneath && (!is_mounted(mnt) || m->mnt_mountpoint != dentry)) {
                        namespace_unlock();
                        inode_unlock(dentry->d_inode);
-                       return mp;
+                       goto out;
                }
-               return mp;
+
+               mnt = lookup_mnt(path);
+               if (likely(!mnt))
+                       break;
+
+               namespace_unlock();
+               inode_unlock(dentry->d_inode);
+               if (beneath)
+                       dput(dentry);
+               path_put(path);
+               path->mnt = mnt;
+               path->dentry = dget(mnt->mnt_root);
        }
-       namespace_unlock();
-       inode_unlock(path->dentry->d_inode);
-       path_put(path);
-       path->mnt = mnt;
-       dentry = path->dentry = dget(mnt->mnt_root);
-       goto retry;
+
+       mp = get_mountpoint(dentry);
+       if (IS_ERR(mp)) {
+               namespace_unlock();
+               inode_unlock(dentry->d_inode);
+       }
+
+out:
+       if (beneath)
+               dput(dentry);
+
+       return mp;
+}
+
+static inline struct mountpoint *lock_mount(struct path *path)
+{
+       return do_lock_mount(path, false);
 }
 
 static void unlock_mount(struct mountpoint *where)
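
To make the mount stack that do_lock_mount() walks concrete, here is a minimal, hypothetical userspace demonstration (not part of this change): mounting twice on the same directory stacks the second mount on top of the first, so a later mount or lock on that path targets the topmost mount's root. The scratch directory is an assumption and the program needs CAP_SYS_ADMIN.

#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>

int main(void)
{
	const char *dir = "/tmp/stackdemo";	/* assumed scratch directory */

	mkdir(dir, 0755);
	/* bottom of the stack */
	if (mount("tmpfs", dir, "tmpfs", 0, NULL)) {
		perror("mount bottom");
		return 1;
	}
	/* stacks on top of the previous mount at the same mountpoint */
	if (mount("tmpfs", dir, "tmpfs", 0, NULL)) {
		perror("mount top");
		return 1;
	}
	puts("two mounts stacked on /tmp/stackdemo; see /proc/self/mountinfo");
	umount(dir);	/* removes the topmost mount first */
	umount(dir);	/* then the one beneath it */
	return 0;
}
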
@@ -2336,7 +2517,7 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
              d_is_dir(mnt->mnt.mnt_root))
                return -ENOTDIR;
 
-       return attach_recursive_mnt(mnt, p, mp, false);
+       return attach_recursive_mnt(mnt, p, mp, 0);
 }
 
 /*
@@ -2367,7 +2548,7 @@ static int do_change_type(struct path *path, int ms_flags)
        int type;
        int err = 0;
 
-       if (path->dentry != path->mnt->mnt_root)
+       if (!path_mounted(path))
                return -EINVAL;
 
        type = flags_to_propagation_type(ms_flags);
@@ -2643,7 +2824,7 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
        if (!check_mnt(mnt))
                return -EINVAL;
 
-       if (path->dentry != mnt->mnt.mnt_root)
+       if (!path_mounted(path))
                return -EINVAL;
 
        if (!can_change_locked_flags(mnt, mnt_flags))
@@ -2682,7 +2863,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
        if (!check_mnt(mnt))
                return -EINVAL;
 
-       if (path->dentry != path->mnt->mnt_root)
+       if (!path_mounted(path))
                return -EINVAL;
 
        if (!can_change_locked_flags(mnt, mnt_flags))
@@ -2772,9 +2953,9 @@ static int do_set_group(struct path *from_path, struct path *to_path)
 
        err = -EINVAL;
        /* To and From paths should be mount roots */
-       if (from_path->dentry != from_path->mnt->mnt_root)
+       if (!path_mounted(from_path))
                goto out;
-       if (to_path->dentry != to_path->mnt->mnt_root)
+       if (!path_mounted(to_path))
                goto out;
 
        /* Setting sharing groups is only allowed across same superblock */
@@ -2818,7 +2999,110 @@ out:
        return err;
 }
 
-static int do_move_mount(struct path *old_path, struct path *new_path)
+/**
+ * path_overmounted - check if path is overmounted
+ * @path: path to check
+ *
+ * Check if path is overmounted, i.e., if there's a mount on top of
+ * @path->mnt with @path->dentry as mountpoint.
+ *
+ * Context: This function expects namespace_lock() to be held.
+ * Return: true if @path is overmounted, false if not.
+ */
+static inline bool path_overmounted(const struct path *path)
+{
+       rcu_read_lock();
+       if (unlikely(__lookup_mnt(path->mnt, path->dentry))) {
+               rcu_read_unlock();
+               return true;
+       }
+       rcu_read_unlock();
+       return false;
+}
+
+/**
+ * can_move_mount_beneath - check that we can mount beneath the top mount
+ * @from: mount to mount beneath
+ * @to:   mount under which to mount
+ *
+ * - Make sure that @to->dentry is actually the root of a mount under
+ *   which we can mount another mount.
+ * - Make sure that nothing can be mounted beneath the caller's current
+ *   root or the rootfs of the namespace.
+ * - Make sure that the caller can unmount the topmost mount ensuring
+ *   that the caller could reveal the underlying mountpoint.
+ * - Ensure that nothing has been mounted on top of @from before we
+ *   grabbed @namespace_sem to avoid creating pointless shadow mounts.
+ * - Prevent mounting beneath a mount if the propagation relationship
+ *   between the source mount, parent mount, and top mount would lead to
+ *   nonsensical mount trees.
+ *
+ * Context: This function expects namespace_lock() to be held.
+ * Return: 0 on success, a negative error code on failure.
+ */
+static int can_move_mount_beneath(const struct path *from,
+                                 const struct path *to,
+                                 const struct mountpoint *mp)
+{
+       struct mount *mnt_from = real_mount(from->mnt),
+                    *mnt_to = real_mount(to->mnt),
+                    *parent_mnt_to = mnt_to->mnt_parent;
+
+       if (!mnt_has_parent(mnt_to))
+               return -EINVAL;
+
+       if (!path_mounted(to))
+               return -EINVAL;
+
+       if (IS_MNT_LOCKED(mnt_to))
+               return -EINVAL;
+
+       /* Avoid creating shadow mounts during mount propagation. */
+       if (path_overmounted(from))
+               return -EINVAL;
+
+       /*
+        * Mounting beneath the rootfs only makes sense when the
+        * semantics of pivot_root(".", ".") are used.
+        */
+       if (&mnt_to->mnt == current->fs->root.mnt)
+               return -EINVAL;
+       if (parent_mnt_to == current->nsproxy->mnt_ns->root)
+               return -EINVAL;
+
+       for (struct mount *p = mnt_from; mnt_has_parent(p); p = p->mnt_parent)
+               if (p == mnt_to)
+                       return -EINVAL;
+
+       /*
+        * If the parent mount propagates to the child mount this would
+        * mean mounting @mnt_from on @mnt_to->mnt_parent and then
+        * propagating a copy @c of @mnt_from on top of @mnt_to. This
+        * defeats the whole purpose of mounting beneath another mount.
+        */
+       if (propagation_would_overmount(parent_mnt_to, mnt_to, mp))
+               return -EINVAL;
+
+       /*
+        * If @mnt_to->mnt_parent propagates to @mnt_from this would
+        * mean propagating a copy @c of @mnt_from on top of @mnt_from.
+        * Afterwards @mnt_from would be mounted on top of
+        * @mnt_to->mnt_parent and @mnt_to would be unmounted from
+        * @mnt_to->mnt_parent and remounted on @mnt_from. But since @c is
+        * already mounted on @mnt_from, @mnt_to would ultimately be
+        * remounted on top of @c. Afterwards, @mnt_from would be
+        * covered by a copy @c of @mnt_from and @c would be covered by
+        * @mnt_from itself. This defeats the whole purpose of mounting
+        * @mnt_from beneath @mnt_to.
+        */
+       if (propagation_would_overmount(parent_mnt_to, mnt_from, mp))
+               return -EINVAL;
+
+       return 0;
+}
+
+static int do_move_mount(struct path *old_path, struct path *new_path,
+                        bool beneath)
 {
        struct mnt_namespace *ns;
        struct mount *p;
@@ -2827,8 +3111,9 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
        struct mountpoint *mp, *old_mp;
        int err;
        bool attached;
+       enum mnt_tree_flags_t flags = 0;
 
-       mp = lock_mount(new_path);
+       mp = do_lock_mount(new_path, beneath);
        if (IS_ERR(mp))
                return PTR_ERR(mp);
 
@@ -2836,6 +3121,8 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
        p = real_mount(new_path->mnt);
        parent = old->mnt_parent;
        attached = mnt_has_parent(old);
+       if (attached)
+               flags |= MNT_TREE_MOVE;
        old_mp = old->mnt_mp;
        ns = old->mnt_ns;
 
@@ -2855,7 +3142,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
        if (old->mnt.mnt_flags & MNT_LOCKED)
                goto out;
 
-       if (old_path->dentry != old_path->mnt->mnt_root)
+       if (!path_mounted(old_path))
                goto out;
 
        if (d_is_dir(new_path->dentry) !=
@@ -2866,6 +3153,17 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
         */
        if (attached && IS_MNT_SHARED(parent))
                goto out;
+
+       if (beneath) {
+               err = can_move_mount_beneath(old_path, new_path, mp);
+               if (err)
+                       goto out;
+
+               err = -EINVAL;
+               p = p->mnt_parent;
+               flags |= MNT_TREE_BENEATH;
+       }
+
        /*
         * Don't move a mount tree containing unbindable mounts to a destination
         * mount which is shared.
@@ -2879,8 +3177,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
                if (p == old)
                        goto out;
 
-       err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
-                                  attached);
+       err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp, flags);
        if (err)
                goto out;
 
@@ -2912,7 +3209,7 @@ static int do_move_mount_old(struct path *path, const char *old_name)
        if (err)
                return err;
 
-       err = do_move_mount(&old_path, path);
+       err = do_move_mount(&old_path, path, false);
        path_put(&old_path);
        return err;
 }
@@ -2937,8 +3234,7 @@ static int do_add_mount(struct mount *newmnt, struct mountpoint *mp,
        }
 
        /* Refuse the same filesystem on the same mount point */
-       if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
-           path->mnt->mnt_root == path->dentry)
+       if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb && path_mounted(path))
                return -EBUSY;
 
        if (d_is_symlink(newmnt->mnt.mnt_root))
@@ -3079,13 +3375,10 @@ int finish_automount(struct vfsmount *m, const struct path *path)
                err = -ENOENT;
                goto discard_locked;
        }
-       rcu_read_lock();
-       if (unlikely(__lookup_mnt(path->mnt, dentry))) {
-               rcu_read_unlock();
+       if (path_overmounted(path)) {
                err = 0;
                goto discard_locked;
        }
-       rcu_read_unlock();
        mp = get_mountpoint(dentry);
        if (IS_ERR(mp)) {
                err = PTR_ERR(mp);
@@ -3777,6 +4070,10 @@ SYSCALL_DEFINE5(move_mount,
        if (flags & ~MOVE_MOUNT__MASK)
                return -EINVAL;
 
+       if ((flags & (MOVE_MOUNT_BENEATH | MOVE_MOUNT_SET_GROUP)) ==
+           (MOVE_MOUNT_BENEATH | MOVE_MOUNT_SET_GROUP))
+               return -EINVAL;
+
        /* If someone gives a pathname, they aren't permitted to move
         * from an fd that requires unmount as we can't get at the flag
         * to clear it afterwards.
@@ -3806,7 +4103,8 @@ SYSCALL_DEFINE5(move_mount,
        if (flags & MOVE_MOUNT_SET_GROUP)
                ret = do_set_group(&from_path, &to_path);
        else
-               ret = do_move_mount(&from_path, &to_path);
+               ret = do_move_mount(&from_path, &to_path,
+                                   (flags & MOVE_MOUNT_BENEATH));
 
 out_to:
        path_put(&to_path);
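
A minimal userspace sketch of the MOVE_MOUNT_BENEATH flag wired up above (the mount paths, error handling, and the fallback flag definition are illustrative assumptions, not part of this change; as checked earlier in the syscall, the flag is rejected in combination with MOVE_MOUNT_SET_GROUP):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200	/* fallback for older uapi headers */
#endif

int main(void)
{
	int from = open("/mnt/new", O_PATH);	/* mount to move (assumed path) */
	int to   = open("/mnt/top", O_PATH);	/* stack to mount beneath (assumed path) */

	if (from < 0 || to < 0) {
		perror("open");
		return 1;
	}
	if (syscall(SYS_move_mount, from, "", to, "",
		    MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH |
		    MOVE_MOUNT_BENEATH) < 0) {
		perror("move_mount");
		return 1;
	}
	puts("mount moved beneath the top mount at /mnt/top");
	return 0;
}

After the call, the mount that was attached at /mnt/new sits underneath the mount stacked on /mnt/top, so unmounting the topmost mount reveals it.
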
@@ -3917,11 +4215,11 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
        if (new_mnt == root_mnt || old_mnt == root_mnt)
                goto out4; /* loop, on the same file system  */
        error = -EINVAL;
-       if (root.mnt->mnt_root != root.dentry)
+       if (!path_mounted(&root))
                goto out4; /* not a mountpoint */
        if (!mnt_has_parent(root_mnt))
                goto out4; /* not attached */
-       if (new.mnt->mnt_root != new.dentry)
+       if (!path_mounted(&new))
                goto out4; /* not a mountpoint */
        if (!mnt_has_parent(new_mnt))
                goto out4; /* not attached */
@@ -3939,9 +4237,9 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
                root_mnt->mnt.mnt_flags &= ~MNT_LOCKED;
        }
        /* mount old root on put_old */
-       attach_mnt(root_mnt, old_mnt, old_mp);
+       attach_mnt(root_mnt, old_mnt, old_mp, false);
        /* mount new_root on / */
-       attach_mnt(new_mnt, root_parent, root_mp);
+       attach_mnt(new_mnt, root_parent, root_mp, false);
        mnt_add_count(root_parent, -1);
        touch_mnt_namespace(current->nsproxy->mnt_ns);
        /* A moved mount should not expire automatically */
@@ -4124,7 +4422,7 @@ static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
        struct mount *mnt = real_mount(path->mnt);
        int err = 0;
 
-       if (path->dentry != mnt->mnt.mnt_root)
+       if (!path_mounted(path))
                return -EINVAL;
 
        if (kattr->mnt_userns) {
index fea5f88..70f5563 100644 (file)
@@ -35,7 +35,7 @@ bl_free_device(struct pnfs_block_dev *dev)
                }
 
                if (dev->bdev)
-                       blkdev_put(dev->bdev, FMODE_READ | FMODE_WRITE);
+                       blkdev_put(dev->bdev, NULL);
        }
 }
 
@@ -243,7 +243,8 @@ bl_parse_simple(struct nfs_server *server, struct pnfs_block_dev *d,
        if (!dev)
                return -EIO;
 
-       bdev = blkdev_get_by_dev(dev, FMODE_READ | FMODE_WRITE, NULL);
+       bdev = blkdev_get_by_dev(dev, BLK_OPEN_READ | BLK_OPEN_WRITE, NULL,
+                                NULL);
        if (IS_ERR(bdev)) {
                printk(KERN_WARNING "pNFS: failed to open device %d:%d (%ld)\n",
                        MAJOR(dev), MINOR(dev), PTR_ERR(bdev));
@@ -312,7 +313,8 @@ bl_open_path(struct pnfs_block_volume *v, const char *prefix)
        if (!devname)
                return ERR_PTR(-ENOMEM);
 
-       bdev = blkdev_get_by_path(devname, FMODE_READ | FMODE_WRITE, NULL);
+       bdev = blkdev_get_by_path(devname, BLK_OPEN_READ | BLK_OPEN_WRITE, NULL,
+                                 NULL);
        if (IS_ERR(bdev)) {
                pr_warn("pNFS: failed to open device %s (%ld)\n",
                        devname, PTR_ERR(bdev));
@@ -373,7 +375,7 @@ bl_parse_scsi(struct nfs_server *server, struct pnfs_block_dev *d,
        return 0;
 
 out_blkdev_put:
-       blkdev_put(d->bdev, FMODE_READ | FMODE_WRITE);
+       blkdev_put(d->bdev, NULL);
        return error;
 }
 
index f0edf5a..79b1b3f 100644 (file)
@@ -178,6 +178,27 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
 }
 EXPORT_SYMBOL_GPL(nfs_file_read);
 
+ssize_t
+nfs_file_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe,
+                    size_t len, unsigned int flags)
+{
+       struct inode *inode = file_inode(in);
+       ssize_t result;
+
+       dprintk("NFS: splice_read(%pD2, %zu@%llu)\n", in, len, *ppos);
+
+       nfs_start_io_read(inode);
+       result = nfs_revalidate_mapping(inode, in->f_mapping);
+       if (!result) {
+               result = filemap_splice_read(in, ppos, pipe, len, flags);
+               if (result > 0)
+                       nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, result);
+       }
+       nfs_end_io_read(inode);
+       return result;
+}
+EXPORT_SYMBOL_GPL(nfs_file_splice_read);
+
 int
 nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
 {
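
As a hypothetical userspace exercise of the new .splice_read path (the NFS file path below is an assumption), splice(2) moves data from the file into a pipe without a userspace copy; on an NFS mount this now goes through nfs_file_splice_read() and filemap_splice_read():

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int pfd[2];
	int fd = open("/mnt/nfs/data.bin", O_RDONLY);	/* assumed NFS-backed file */
	loff_t off = 0;
	ssize_t n;

	if (fd < 0 || pipe(pfd) < 0) {
		perror("setup");
		return 1;
	}
	/* On an NFS mount this exercises nfs_file_splice_read() in the kernel */
	n = splice(fd, &off, pfd[1], NULL, 65536, SPLICE_F_MOVE);
	if (n < 0)
		perror("splice");
	else
		printf("spliced %zd bytes into the pipe\n", n);
	close(fd);
	close(pfd[0]);
	close(pfd[1]);
	return 0;
}
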
@@ -648,17 +669,13 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
        since = filemap_sample_wb_err(file->f_mapping);
        nfs_start_io_write(inode);
        result = generic_write_checks(iocb, from);
-       if (result > 0) {
-               current->backing_dev_info = inode_to_bdi(inode);
+       if (result > 0)
                result = generic_perform_write(iocb, from);
-               current->backing_dev_info = NULL;
-       }
        nfs_end_io_write(inode);
        if (result <= 0)
                goto out;
 
        written = result;
-       iocb->ki_pos += written;
        nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, written);
 
        if (mntflags & NFS_MOUNT_WRITE_EAGER) {
@@ -879,7 +896,7 @@ const struct file_operations nfs_file_operations = {
        .fsync          = nfs_file_fsync,
        .lock           = nfs_lock,
        .flock          = nfs_flock,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = nfs_file_splice_read,
        .splice_write   = iter_file_splice_write,
        .check_flags    = nfs_check_flags,
        .setlease       = simple_nosetlease,
index 3cc027d..b5f21d3 100644 (file)
@@ -416,6 +416,8 @@ static inline __u32 nfs_access_xattr_mask(const struct nfs_server *server)
 int nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync);
 loff_t nfs_file_llseek(struct file *, loff_t, int);
 ssize_t nfs_file_read(struct kiocb *, struct iov_iter *);
+ssize_t nfs_file_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe,
+                            size_t len, unsigned int flags);
 int nfs_file_mmap(struct file *, struct vm_area_struct *);
 ssize_t nfs_file_write(struct kiocb *, struct iov_iter *);
 int nfs_file_release(struct inode *, struct file *);
index 2563ed8..4aeadd6 100644 (file)
@@ -454,7 +454,7 @@ const struct file_operations nfs4_file_operations = {
        .fsync          = nfs_file_fsync,
        .lock           = nfs_lock,
        .flock          = nfs_flock,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = nfs_file_splice_read,
        .splice_write   = iter_file_splice_write,
        .check_flags    = nfs_check_flags,
        .setlease       = nfs4_setlease,
index 620329b..7600100 100644 (file)
@@ -164,7 +164,7 @@ __setup("nfsroot=", nfs_root_setup);
 static int __init root_nfs_copy(char *dest, const char *src,
                                     const size_t destlen)
 {
-       if (strlcpy(dest, src, destlen) > destlen)
+       if (strscpy(dest, src, destlen) == -E2BIG)
                return -1;
        return 0;
 }
index f21259e..4c9b878 100644 (file)
@@ -80,6 +80,8 @@ enum {
 
 int    nfsd_drc_slab_create(void);
 void   nfsd_drc_slab_free(void);
+int    nfsd_net_reply_cache_init(struct nfsd_net *nn);
+void   nfsd_net_reply_cache_destroy(struct nfsd_net *nn);
 int    nfsd_reply_cache_init(struct nfsd_net *);
 void   nfsd_reply_cache_shutdown(struct nfsd_net *);
 int    nfsd_cache_lookup(struct svc_rqst *);
index ae85257..11a0eaa 100644 (file)
@@ -97,7 +97,7 @@ static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen)
                goto out;
 
        err = -EINVAL;
-       if ((len=qword_get(&mesg, buf, PAGE_SIZE)) <= 0)
+       if (qword_get(&mesg, buf, PAGE_SIZE) <= 0)
                goto out;
 
        err = -ENOENT;
@@ -107,7 +107,7 @@ static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen)
        dprintk("found domain %s\n", buf);
 
        err = -EINVAL;
-       if ((len=qword_get(&mesg, buf, PAGE_SIZE)) <= 0)
+       if (qword_get(&mesg, buf, PAGE_SIZE) <= 0)
                goto out;
        fsidtype = simple_strtoul(buf, &ep, 10);
        if (*ep)
@@ -593,7 +593,6 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 {
        /* client path expiry [flags anonuid anongid fsid] */
        char *buf;
-       int len;
        int err;
        struct auth_domain *dom = NULL;
        struct svc_export exp = {}, *expp;
@@ -609,8 +608,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 
        /* client */
        err = -EINVAL;
-       len = qword_get(&mesg, buf, PAGE_SIZE);
-       if (len <= 0)
+       if (qword_get(&mesg, buf, PAGE_SIZE) <= 0)
                goto out;
 
        err = -ENOENT;
@@ -620,7 +618,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 
        /* path */
        err = -EINVAL;
-       if ((len = qword_get(&mesg, buf, PAGE_SIZE)) <= 0)
+       if (qword_get(&mesg, buf, PAGE_SIZE) <= 0)
                goto out1;
 
        err = kern_path(buf, 0, &exp.ex_path);
@@ -665,7 +663,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
                        goto out3;
                exp.ex_fsid = an_int;
 
-               while ((len = qword_get(&mesg, buf, PAGE_SIZE)) > 0) {
+               while (qword_get(&mesg, buf, PAGE_SIZE) > 0) {
                        if (strcmp(buf, "fsloc") == 0)
                                err = fsloc_parse(&mesg, buf, &exp.ex_fslocs);
                        else if (strcmp(buf, "uuid") == 0)
index e6bb8ee..fc8d5b7 100644 (file)
@@ -151,8 +151,6 @@ nfsd3_proc_read(struct svc_rqst *rqstp)
 {
        struct nfsd3_readargs *argp = rqstp->rq_argp;
        struct nfsd3_readres *resp = rqstp->rq_resp;
-       unsigned int len;
-       int v;
 
        dprintk("nfsd: READ(3) %s %lu bytes at %Lu\n",
                                SVCFH_fmt(&argp->fh),
@@ -166,17 +164,7 @@ nfsd3_proc_read(struct svc_rqst *rqstp)
        if (argp->offset + argp->count > (u64)OFFSET_MAX)
                argp->count = (u64)OFFSET_MAX - argp->offset;
 
-       v = 0;
-       len = argp->count;
        resp->pages = rqstp->rq_next_page;
-       while (len > 0) {
-               struct page *page = *(rqstp->rq_next_page++);
-
-               rqstp->rq_vec[v].iov_base = page_address(page);
-               rqstp->rq_vec[v].iov_len = min_t(unsigned int, len, PAGE_SIZE);
-               len -= rqstp->rq_vec[v].iov_len;
-               v++;
-       }
 
        /* Obtain buffer pointer for payload.
         * 1 (status) + 22 (post_op_attr) + 1 (count) + 1 (eof)
@@ -187,7 +175,7 @@ nfsd3_proc_read(struct svc_rqst *rqstp)
 
        fh_copy(&resp->fh, &argp->fh);
        resp->status = nfsd_read(rqstp, &resp->fh, argp->offset,
-                                rqstp->rq_vec, v, &resp->count, &resp->eof);
+                                &resp->count, &resp->eof);
        return rpc_success;
 }
 
index 3308dd6..f321289 100644 (file)
@@ -828,7 +828,8 @@ nfs3svc_encode_readlinkres(struct svc_rqst *rqstp, struct xdr_stream *xdr)
                        return false;
                if (xdr_stream_encode_u32(xdr, resp->len) < 0)
                        return false;
-               xdr_write_pages(xdr, resp->pages, 0, resp->len);
+               svcxdr_encode_opaque_pages(rqstp, xdr, resp->pages, 0,
+                                          resp->len);
                if (svc_encode_result_payload(rqstp, head->iov_len, resp->len) < 0)
                        return false;
                break;
@@ -859,8 +860,9 @@ nfs3svc_encode_readres(struct svc_rqst *rqstp, struct xdr_stream *xdr)
                        return false;
                if (xdr_stream_encode_u32(xdr, resp->count) < 0)
                        return false;
-               xdr_write_pages(xdr, resp->pages, rqstp->rq_res.page_base,
-                               resp->count);
+               svcxdr_encode_opaque_pages(rqstp, xdr, resp->pages,
+                                          rqstp->rq_res.page_base,
+                                          resp->count);
                if (svc_encode_result_payload(rqstp, head->iov_len, resp->count) < 0)
                        return false;
                break;
@@ -961,7 +963,8 @@ nfs3svc_encode_readdirres(struct svc_rqst *rqstp, struct xdr_stream *xdr)
                        return false;
                if (!svcxdr_encode_cookieverf3(xdr, resp->verf))
                        return false;
-               xdr_write_pages(xdr, dirlist->pages, 0, dirlist->len);
+               svcxdr_encode_opaque_pages(rqstp, xdr, dirlist->pages, 0,
+                                          dirlist->len);
                /* no more entries */
                if (xdr_stream_encode_item_absent(xdr) < 0)
                        return false;
index 76db2fe..26b1343 100644 (file)
@@ -2541,6 +2541,20 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
        return p;
 }
 
+static __be32 nfsd4_encode_nfstime4(struct xdr_stream *xdr,
+                                   struct timespec64 *tv)
+{
+       __be32 *p;
+
+       p = xdr_reserve_space(xdr, XDR_UNIT * 3);
+       if (!p)
+               return nfserr_resource;
+
+       p = xdr_encode_hyper(p, (s64)tv->tv_sec);
+       *p = cpu_to_be32(tv->tv_nsec);
+       return nfs_ok;
+}
+
 /*
  * ctime (in NFSv4, time_metadata) is not writeable, and the client
  * doesn't really care what resolution could theoretically be stored by
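
The helper above emits the standard nfstime4 wire format: three XDR words, a 64-bit big-endian seconds field followed by a 32-bit big-endian nanoseconds field (12 bytes total). A small, self-contained userspace sketch of the same layout (function and buffer names are illustrative, not kernel APIs):

#define _DEFAULT_SOURCE
#include <endian.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Pack a timespec as nfstime4: int64 seconds, uint32 nseconds, big-endian. */
static size_t encode_nfstime4(unsigned char *buf, const struct timespec *ts)
{
	uint64_t sec = htobe64((uint64_t)ts->tv_sec);
	uint32_t nsec = htobe32((uint32_t)ts->tv_nsec);

	memcpy(buf, &sec, 8);
	memcpy(buf + 8, &nsec, 4);
	return 12;
}

int main(void)
{
	struct timespec ts;
	unsigned char buf[12];

	clock_gettime(CLOCK_REALTIME, &ts);
	encode_nfstime4(buf, &ts);
	for (size_t i = 0; i < sizeof(buf); i++)
		printf("%02x", buf[i]);
	putchar('\n');
	return 0;
}
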
@@ -2566,12 +2580,16 @@ static __be32 *encode_time_delta(__be32 *p, struct inode *inode)
        return p;
 }
 
-static __be32 *encode_cinfo(__be32 *p, struct nfsd4_change_info *c)
+static __be32
+nfsd4_encode_change_info4(struct xdr_stream *xdr, struct nfsd4_change_info *c)
 {
-       *p++ = cpu_to_be32(c->atomic);
-       p = xdr_encode_hyper(p, c->before_change);
-       p = xdr_encode_hyper(p, c->after_change);
-       return p;
+       if (xdr_stream_encode_bool(xdr, c->atomic) < 0)
+               return nfserr_resource;
+       if (xdr_stream_encode_u64(xdr, c->before_change) < 0)
+               return nfserr_resource;
+       if (xdr_stream_encode_u64(xdr, c->after_change) < 0)
+               return nfserr_resource;
+       return nfs_ok;
 }
 
 /* Encode as an array of strings the string given with components
@@ -3348,11 +3366,9 @@ out_acl:
                p = xdr_encode_hyper(p, dummy64);
        }
        if (bmval1 & FATTR4_WORD1_TIME_ACCESS) {
-               p = xdr_reserve_space(xdr, 12);
-               if (!p)
-                       goto out_resource;
-               p = xdr_encode_hyper(p, (s64)stat.atime.tv_sec);
-               *p++ = cpu_to_be32(stat.atime.tv_nsec);
+               status = nfsd4_encode_nfstime4(xdr, &stat.atime);
+               if (status)
+                       goto out;
        }
        if (bmval1 & FATTR4_WORD1_TIME_DELTA) {
                p = xdr_reserve_space(xdr, 12);
@@ -3361,25 +3377,19 @@ out_acl:
                p = encode_time_delta(p, d_inode(dentry));
        }
        if (bmval1 & FATTR4_WORD1_TIME_METADATA) {
-               p = xdr_reserve_space(xdr, 12);
-               if (!p)
-                       goto out_resource;
-               p = xdr_encode_hyper(p, (s64)stat.ctime.tv_sec);
-               *p++ = cpu_to_be32(stat.ctime.tv_nsec);
+               status = nfsd4_encode_nfstime4(xdr, &stat.ctime);
+               if (status)
+                       goto out;
        }
        if (bmval1 & FATTR4_WORD1_TIME_MODIFY) {
-               p = xdr_reserve_space(xdr, 12);
-               if (!p)
-                       goto out_resource;
-               p = xdr_encode_hyper(p, (s64)stat.mtime.tv_sec);
-               *p++ = cpu_to_be32(stat.mtime.tv_nsec);
+               status = nfsd4_encode_nfstime4(xdr, &stat.mtime);
+               if (status)
+                       goto out;
        }
        if (bmval1 & FATTR4_WORD1_TIME_CREATE) {
-               p = xdr_reserve_space(xdr, 12);
-               if (!p)
-                       goto out_resource;
-               p = xdr_encode_hyper(p, (s64)stat.btime.tv_sec);
-               *p++ = cpu_to_be32(stat.btime.tv_nsec);
+               status = nfsd4_encode_nfstime4(xdr, &stat.btime);
+               if (status)
+                       goto out;
        }
        if (bmval1 & FATTR4_WORD1_MOUNTED_ON_FILEID) {
                u64 ino = stat.ino;
@@ -3689,6 +3699,30 @@ fail:
 }
 
 static __be32
+nfsd4_encode_verifier4(struct xdr_stream *xdr, const nfs4_verifier *verf)
+{
+       __be32 *p;
+
+       p = xdr_reserve_space(xdr, NFS4_VERIFIER_SIZE);
+       if (!p)
+               return nfserr_resource;
+       memcpy(p, verf->data, sizeof(verf->data));
+       return nfs_ok;
+}
+
+static __be32
+nfsd4_encode_clientid4(struct xdr_stream *xdr, const clientid_t *clientid)
+{
+       __be32 *p;
+
+       p = xdr_reserve_space(xdr, sizeof(__be64));
+       if (!p)
+               return nfserr_resource;
+       memcpy(p, clientid, sizeof(*clientid));
+       return nfs_ok;
+}
+
+static __be32
 nfsd4_encode_stateid(struct xdr_stream *xdr, stateid_t *sid)
 {
        __be32 *p;
@@ -3752,15 +3786,8 @@ nfsd4_encode_commit(struct nfsd4_compoundres *resp, __be32 nfserr,
                    union nfsd4_op_u *u)
 {
        struct nfsd4_commit *commit = &u->commit;
-       struct xdr_stream *xdr = resp->xdr;
-       __be32 *p;
 
-       p = xdr_reserve_space(xdr, NFS4_VERIFIER_SIZE);
-       if (!p)
-               return nfserr_resource;
-       p = xdr_encode_opaque_fixed(p, commit->co_verf.data,
-                                               NFS4_VERIFIER_SIZE);
-       return 0;
+       return nfsd4_encode_verifier4(resp->xdr, &commit->co_verf);
 }
 
 static __be32
@@ -3769,12 +3796,10 @@ nfsd4_encode_create(struct nfsd4_compoundres *resp, __be32 nfserr,
 {
        struct nfsd4_create *create = &u->create;
        struct xdr_stream *xdr = resp->xdr;
-       __be32 *p;
 
-       p = xdr_reserve_space(xdr, 20);
-       if (!p)
-               return nfserr_resource;
-       encode_cinfo(p, &create->cr_cinfo);
+       nfserr = nfsd4_encode_change_info4(xdr, &create->cr_cinfo);
+       if (nfserr)
+               return nfserr;
        return nfsd4_encode_bitmap(xdr, create->cr_bmval[0],
                        create->cr_bmval[1], create->cr_bmval[2]);
 }
@@ -3892,13 +3917,8 @@ nfsd4_encode_link(struct nfsd4_compoundres *resp, __be32 nfserr,
 {
        struct nfsd4_link *link = &u->link;
        struct xdr_stream *xdr = resp->xdr;
-       __be32 *p;
 
-       p = xdr_reserve_space(xdr, 20);
-       if (!p)
-               return nfserr_resource;
-       p = encode_cinfo(p, &link->li_cinfo);
-       return 0;
+       return nfsd4_encode_change_info4(xdr, &link->li_cinfo);
 }
 
 
@@ -3913,11 +3933,11 @@ nfsd4_encode_open(struct nfsd4_compoundres *resp, __be32 nfserr,
        nfserr = nfsd4_encode_stateid(xdr, &open->op_stateid);
        if (nfserr)
                return nfserr;
-       p = xdr_reserve_space(xdr, 24);
-       if (!p)
+       nfserr = nfsd4_encode_change_info4(xdr, &open->op_cinfo);
+       if (nfserr)
+               return nfserr;
+       if (xdr_stream_encode_u32(xdr, open->op_rflags) < 0)
                return nfserr_resource;
-       p = encode_cinfo(p, &open->op_cinfo);
-       *p++ = cpu_to_be32(open->op_rflags);
 
        nfserr = nfsd4_encode_bitmap(xdr, open->op_bmval[0], open->op_bmval[1],
                                        open->op_bmval[2]);
@@ -3956,7 +3976,7 @@ nfsd4_encode_open(struct nfsd4_compoundres *resp, __be32 nfserr,
                p = xdr_reserve_space(xdr, 32);
                if (!p)
                        return nfserr_resource;
-               *p++ = cpu_to_be32(0);
+               *p++ = cpu_to_be32(open->op_recall);
 
                /*
                 * TODO: space_limit's in delegations
@@ -4018,6 +4038,11 @@ nfsd4_encode_open_downgrade(struct nfsd4_compoundres *resp, __be32 nfserr,
        return nfsd4_encode_stateid(xdr, &od->od_stateid);
 }
 
+/*
+ * The operation of this function assumes that this is the only
+ * READ operation in the COMPOUND. If there are multiple READs,
+ * we use nfsd4_encode_readv().
+ */
 static __be32 nfsd4_encode_splice_read(
                                struct nfsd4_compoundres *resp,
                                struct nfsd4_read *read,
@@ -4028,8 +4053,12 @@ static __be32 nfsd4_encode_splice_read(
        int status, space_left;
        __be32 nfserr;
 
-       /* Make sure there will be room for padding if needed */
-       if (xdr->end - xdr->p < 1)
+       /*
+        * Make sure there is room at the end of buf->head for
+        * svcxdr_encode_opaque_pages() to create a tail buffer
+        * to XDR-pad the payload.
+        */
+       if (xdr->iov != xdr->buf->head || xdr->end - xdr->p < 1)
                return nfserr_resource;
 
        nfserr = nfsd_splice_read(read->rd_rqstp, read->rd_fhp,
@@ -4038,6 +4067,8 @@ static __be32 nfsd4_encode_splice_read(
        read->rd_length = maxcount;
        if (nfserr)
                goto out_err;
+       svcxdr_encode_opaque_pages(read->rd_rqstp, xdr, buf->pages,
+                                  buf->page_base, maxcount);
        status = svc_encode_result_payload(read->rd_rqstp,
                                           buf->head[0].iov_len, maxcount);
        if (status) {
@@ -4045,31 +4076,19 @@ static __be32 nfsd4_encode_splice_read(
                goto out_err;
        }
 
-       buf->page_len = maxcount;
-       buf->len += maxcount;
-       xdr->page_ptr += (buf->page_base + maxcount + PAGE_SIZE - 1)
-                                                       / PAGE_SIZE;
-
-       /* Use rest of head for padding and remaining ops: */
-       buf->tail[0].iov_base = xdr->p;
-       buf->tail[0].iov_len = 0;
-       xdr->iov = buf->tail;
-       if (maxcount&3) {
-               int pad = 4 - (maxcount&3);
-
-               *(xdr->p++) = 0;
-
-               buf->tail[0].iov_base += maxcount&3;
-               buf->tail[0].iov_len = pad;
-               buf->len += pad;
-       }
-
+       /*
+        * Prepare to encode subsequent operations.
+        *
+        * xdr_truncate_encode() is not safe to use after a successful
+        * splice read has been done, so the following stream
+        * manipulations are open-coded.
+        */
        space_left = min_t(int, (void *)xdr->end - (void *)xdr->p,
                                buf->buflen - buf->len);
        buf->buflen = buf->len + space_left;
        xdr->end = (__be32 *)((void *)xdr->end + space_left);
 
-       return 0;
+       return nfs_ok;
 
 out_err:
        /*
@@ -4090,13 +4109,13 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
        __be32 zero = xdr_zero;
        __be32 nfserr;
 
-       read->rd_vlen = xdr_reserve_space_vec(xdr, resp->rqstp->rq_vec, maxcount);
-       if (read->rd_vlen < 0)
+       if (xdr_reserve_space_vec(xdr, maxcount) < 0)
                return nfserr_resource;
 
-       nfserr = nfsd_readv(resp->rqstp, read->rd_fhp, file, read->rd_offset,
-                           resp->rqstp->rq_vec, read->rd_vlen, &maxcount,
-                           &read->rd_eof);
+       nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
+                               read->rd_offset, &maxcount,
+                               xdr->buf->page_len & ~PAGE_MASK,
+                               &read->rd_eof);
        read->rd_length = maxcount;
        if (nfserr)
                return nfserr;
@@ -4213,15 +4232,9 @@ nfsd4_encode_readdir(struct nfsd4_compoundres *resp, __be32 nfserr,
        int starting_len = xdr->buf->len;
        __be32 *p;
 
-       p = xdr_reserve_space(xdr, NFS4_VERIFIER_SIZE);
-       if (!p)
-               return nfserr_resource;
-
-       /* XXX: Following NFSv3, we ignore the READDIR verifier for now. */
-       *p++ = cpu_to_be32(0);
-       *p++ = cpu_to_be32(0);
-       xdr->buf->head[0].iov_len = (char *)xdr->p -
-                                   (char *)xdr->buf->head[0].iov_base;
+       nfserr = nfsd4_encode_verifier4(xdr, &readdir->rd_verf);
+       if (nfserr != nfs_ok)
+               return nfserr;
 
        /*
         * Number of bytes left for directory entries allowing for the
@@ -4299,13 +4312,8 @@ nfsd4_encode_remove(struct nfsd4_compoundres *resp, __be32 nfserr,
 {
        struct nfsd4_remove *remove = &u->remove;
        struct xdr_stream *xdr = resp->xdr;
-       __be32 *p;
 
-       p = xdr_reserve_space(xdr, 20);
-       if (!p)
-               return nfserr_resource;
-       p = encode_cinfo(p, &remove->rm_cinfo);
-       return 0;
+       return nfsd4_encode_change_info4(xdr, &remove->rm_cinfo);
 }
 
 static __be32
@@ -4314,14 +4322,11 @@ nfsd4_encode_rename(struct nfsd4_compoundres *resp, __be32 nfserr,
 {
        struct nfsd4_rename *rename = &u->rename;
        struct xdr_stream *xdr = resp->xdr;
-       __be32 *p;
 
-       p = xdr_reserve_space(xdr, 40);
-       if (!p)
-               return nfserr_resource;
-       p = encode_cinfo(p, &rename->rn_sinfo);
-       p = encode_cinfo(p, &rename->rn_tinfo);
-       return 0;
+       nfserr = nfsd4_encode_change_info4(xdr, &rename->rn_sinfo);
+       if (nfserr)
+               return nfserr;
+       return nfsd4_encode_change_info4(xdr, &rename->rn_tinfo);
 }
 
 static __be32
@@ -4448,23 +4453,25 @@ nfsd4_encode_setclientid(struct nfsd4_compoundres *resp, __be32 nfserr,
 {
        struct nfsd4_setclientid *scd = &u->setclientid;
        struct xdr_stream *xdr = resp->xdr;
-       __be32 *p;
 
        if (!nfserr) {
-               p = xdr_reserve_space(xdr, 8 + NFS4_VERIFIER_SIZE);
-               if (!p)
-                       return nfserr_resource;
-               p = xdr_encode_opaque_fixed(p, &scd->se_clientid, 8);
-               p = xdr_encode_opaque_fixed(p, &scd->se_confirm,
-                                               NFS4_VERIFIER_SIZE);
-       }
-       else if (nfserr == nfserr_clid_inuse) {
-               p = xdr_reserve_space(xdr, 8);
-               if (!p)
-                       return nfserr_resource;
-               *p++ = cpu_to_be32(0);
-               *p++ = cpu_to_be32(0);
+               nfserr = nfsd4_encode_clientid4(xdr, &scd->se_clientid);
+               if (nfserr != nfs_ok)
+                       goto out;
+               nfserr = nfsd4_encode_verifier4(xdr, &scd->se_confirm);
+       } else if (nfserr == nfserr_clid_inuse) {
+               /* empty network id */
+               if (xdr_stream_encode_u32(xdr, 0) < 0) {
+                       nfserr = nfserr_resource;
+                       goto out;
+               }
+               /* empty universal address */
+               if (xdr_stream_encode_u32(xdr, 0) < 0) {
+                       nfserr = nfserr_resource;
+                       goto out;
+               }
        }
+out:
        return nfserr;
 }
 
@@ -4473,17 +4480,12 @@ nfsd4_encode_write(struct nfsd4_compoundres *resp, __be32 nfserr,
                   union nfsd4_op_u *u)
 {
        struct nfsd4_write *write = &u->write;
-       struct xdr_stream *xdr = resp->xdr;
-       __be32 *p;
 
-       p = xdr_reserve_space(xdr, 16);
-       if (!p)
+       if (xdr_stream_encode_u32(resp->xdr, write->wr_bytes_written) < 0)
                return nfserr_resource;
-       *p++ = cpu_to_be32(write->wr_bytes_written);
-       *p++ = cpu_to_be32(write->wr_how_written);
-       p = xdr_encode_opaque_fixed(p, write->wr_verifier.data,
-                                               NFS4_VERIFIER_SIZE);
-       return 0;
+       if (xdr_stream_encode_u32(resp->xdr, write->wr_how_written) < 0)
+               return nfserr_resource;
+       return nfsd4_encode_verifier4(resp->xdr, &write->wr_verifier);
 }
 
 static __be32
@@ -4505,20 +4507,15 @@ nfsd4_encode_exchange_id(struct nfsd4_compoundres *resp, __be32 nfserr,
        server_scope = nn->nfsd_name;
        server_scope_sz = strlen(nn->nfsd_name);
 
-       p = xdr_reserve_space(xdr,
-               8 /* eir_clientid */ +
-               4 /* eir_sequenceid */ +
-               4 /* eir_flags */ +
-               4 /* spr_how */);
-       if (!p)
+       if (nfsd4_encode_clientid4(xdr, &exid->clientid) != nfs_ok)
+               return nfserr_resource;
+       if (xdr_stream_encode_u32(xdr, exid->seqid) < 0)
+               return nfserr_resource;
+       if (xdr_stream_encode_u32(xdr, exid->flags) < 0)
                return nfserr_resource;
 
-       p = xdr_encode_opaque_fixed(p, &exid->clientid, 8);
-       *p++ = cpu_to_be32(exid->seqid);
-       *p++ = cpu_to_be32(exid->flags);
-
-       *p++ = cpu_to_be32(exid->spa_how);
-
+       if (xdr_stream_encode_u32(xdr, exid->spa_how) < 0)
+               return nfserr_resource;
        switch (exid->spa_how) {
        case SP4_NONE:
                break;
@@ -5099,15 +5096,8 @@ nfsd4_encode_setxattr(struct nfsd4_compoundres *resp, __be32 nfserr,
 {
        struct nfsd4_setxattr *setxattr = &u->setxattr;
        struct xdr_stream *xdr = resp->xdr;
-       __be32 *p;
 
-       p = xdr_reserve_space(xdr, 20);
-       if (!p)
-               return nfserr_resource;
-
-       encode_cinfo(p, &setxattr->setxa_cinfo);
-
-       return 0;
+       return nfsd4_encode_change_info4(xdr, &setxattr->setxa_cinfo);
 }
 
 /*
@@ -5253,14 +5243,8 @@ nfsd4_encode_removexattr(struct nfsd4_compoundres *resp, __be32 nfserr,
 {
        struct nfsd4_removexattr *removexattr = &u->removexattr;
        struct xdr_stream *xdr = resp->xdr;
-       __be32 *p;
 
-       p = xdr_reserve_space(xdr, 20);
-       if (!p)
-               return nfserr_resource;
-
-       p = encode_cinfo(p, &removexattr->rmxa_cinfo);
-       return 0;
+       return nfsd4_encode_change_info4(xdr, &removexattr->rmxa_cinfo);
 }
 
 typedef __be32(*nfsd4_enc)(struct nfsd4_compoundres *, __be32, union nfsd4_op_u *u);
@@ -5460,6 +5444,12 @@ status:
 release:
        if (opdesc && opdesc->op_release)
                opdesc->op_release(&op->u);
+
+       /*
+        * Account for pages consumed while encoding this operation.
+        * The xdr_stream primitives don't manage rq_next_page.
+        */
+       rqstp->rq_next_page = xdr->page_ptr + 1;
 }
 
 /* 
@@ -5528,9 +5518,6 @@ nfs4svc_encode_compoundres(struct svc_rqst *rqstp, struct xdr_stream *xdr)
        p = resp->statusp;
 
        *p++ = resp->cstate.status;
-
-       rqstp->rq_next_page = xdr->page_ptr + 1;
-
        *p++ = htonl(resp->taglen);
        memcpy(p, resp->tag, resp->taglen);
        p += XDR_QUADLEN(resp->taglen);
index 041faa1..a8eda1c 100644 (file)
@@ -148,12 +148,23 @@ void nfsd_drc_slab_free(void)
        kmem_cache_destroy(drc_slab);
 }
 
-static int nfsd_reply_cache_stats_init(struct nfsd_net *nn)
+/**
+ * nfsd_net_reply_cache_init - per net namespace reply cache set-up
+ * @nn: nfsd_net being initialized
+ *
+ * Returns zero on success; otherwise a negative errno is returned.
+ */
+int nfsd_net_reply_cache_init(struct nfsd_net *nn)
 {
        return nfsd_percpu_counters_init(nn->counter, NFSD_NET_COUNTERS_NUM);
 }
 
-static void nfsd_reply_cache_stats_destroy(struct nfsd_net *nn)
+/**
+ * nfsd_net_reply_cache_destroy - per net namespace reply cache tear-down
+ * @nn: nfsd_net being freed
+ *
+ */
+void nfsd_net_reply_cache_destroy(struct nfsd_net *nn)
 {
        nfsd_percpu_counters_destroy(nn->counter, NFSD_NET_COUNTERS_NUM);
 }
@@ -169,17 +180,13 @@ int nfsd_reply_cache_init(struct nfsd_net *nn)
        hashsize = nfsd_hashsize(nn->max_drc_entries);
        nn->maskbits = ilog2(hashsize);
 
-       status = nfsd_reply_cache_stats_init(nn);
-       if (status)
-               goto out_nomem;
-
        nn->nfsd_reply_cache_shrinker.scan_objects = nfsd_reply_cache_scan;
        nn->nfsd_reply_cache_shrinker.count_objects = nfsd_reply_cache_count;
        nn->nfsd_reply_cache_shrinker.seeks = 1;
        status = register_shrinker(&nn->nfsd_reply_cache_shrinker,
                                   "nfsd-reply:%s", nn->nfsd_name);
        if (status)
-               goto out_stats_destroy;
+               return status;
 
        nn->drc_hashtbl = kvzalloc(array_size(hashsize,
                                sizeof(*nn->drc_hashtbl)), GFP_KERNEL);
@@ -195,9 +202,6 @@ int nfsd_reply_cache_init(struct nfsd_net *nn)
        return 0;
 out_shrinker:
        unregister_shrinker(&nn->nfsd_reply_cache_shrinker);
-out_stats_destroy:
-       nfsd_reply_cache_stats_destroy(nn);
-out_nomem:
        printk(KERN_ERR "nfsd: failed to allocate reply cache\n");
        return -ENOMEM;
 }
@@ -217,7 +221,6 @@ void nfsd_reply_cache_shutdown(struct nfsd_net *nn)
                                                                        rp, nn);
                }
        }
-       nfsd_reply_cache_stats_destroy(nn);
 
        kvfree(nn->drc_hashtbl);
        nn->drc_hashtbl = NULL;
index b4fd7a7..1b8b1aa 100644 (file)
@@ -25,6 +25,7 @@
 #include "netns.h"
 #include "pnfs.h"
 #include "filecache.h"
+#include "trace.h"
 
 /*
  *     We have a single directory with several nodes in it.
@@ -109,12 +110,12 @@ static ssize_t nfsctl_transaction_write(struct file *file, const char __user *bu
        if (IS_ERR(data))
                return PTR_ERR(data);
 
-       rv =  write_op[ino](file, data, size);
-       if (rv >= 0) {
-               simple_transaction_set(file, rv);
-               rv = size;
-       }
-       return rv;
+       rv = write_op[ino](file, data, size);
+       if (rv < 0)
+               return rv;
+
+       simple_transaction_set(file, rv);
+       return size;
 }
 
 static ssize_t nfsctl_transaction_read(struct file *file, char __user *buf, size_t size, loff_t *pos)
@@ -230,6 +231,7 @@ static ssize_t write_unlock_ip(struct file *file, char *buf, size_t size)
        if (rpc_pton(net, fo_path, size, sap, salen) == 0)
                return -EINVAL;
 
+       trace_nfsd_ctl_unlock_ip(net, buf);
        return nlmsvc_unlock_all_by_ip(sap);
 }
 
@@ -263,7 +265,7 @@ static ssize_t write_unlock_fs(struct file *file, char *buf, size_t size)
        fo_path = buf;
        if (qword_get(&buf, fo_path, size) < 0)
                return -EINVAL;
-
+       trace_nfsd_ctl_unlock_fs(netns(file), fo_path);
        error = kern_path(fo_path, 0, &path);
        if (error)
                return error;
@@ -324,7 +326,7 @@ static ssize_t write_filehandle(struct file *file, char *buf, size_t size)
        len = qword_get(&mesg, dname, size);
        if (len <= 0)
                return -EINVAL;
-       
+
        path = dname+len+1;
        len = qword_get(&mesg, path, size);
        if (len <= 0)
@@ -338,15 +340,17 @@ static ssize_t write_filehandle(struct file *file, char *buf, size_t size)
                return -EINVAL;
        maxsize = min(maxsize, NFS3_FHSIZE);
 
-       if (qword_get(&mesg, mesg, size)>0)
+       if (qword_get(&mesg, mesg, size) > 0)
                return -EINVAL;
 
+       trace_nfsd_ctl_filehandle(netns(file), dname, path, maxsize);
+
        /* we have all the words, they are in buf.. */
        dom = unix_domain_find(dname);
        if (!dom)
                return -ENOMEM;
 
-       len = exp_rootfh(netns(file), dom, path, &fh,  maxsize);
+       len = exp_rootfh(netns(file), dom, path, &fh, maxsize);
        auth_domain_put(dom);
        if (len)
                return len;
@@ -399,6 +403,7 @@ static ssize_t write_threads(struct file *file, char *buf, size_t size)
                        return rv;
                if (newthreads < 0)
                        return -EINVAL;
+               trace_nfsd_ctl_threads(net, newthreads);
                rv = nfsd_svc(newthreads, net, file->f_cred);
                if (rv < 0)
                        return rv;
@@ -418,8 +423,8 @@ static ssize_t write_threads(struct file *file, char *buf, size_t size)
  * OR
  *
  * Input:
- *                     buf:            C string containing whitespace-
- *                                     separated unsigned integer values
+ *                     buf:            C string containing whitespace-
+ *                                     separated unsigned integer values
  *                                     representing the number of NFSD
  *                                     threads to start in each pool
  *                     size:           non-zero length of C string in @buf
@@ -471,6 +476,7 @@ static ssize_t write_pool_threads(struct file *file, char *buf, size_t size)
                        rv = -EINVAL;
                        if (nthreads[i] < 0)
                                goto out_free;
+                       trace_nfsd_ctl_pool_threads(net, i, nthreads[i]);
                }
                rv = nfsd_set_nrthreads(i, nthreads, net);
                if (rv)
@@ -526,7 +532,7 @@ static ssize_t __write_versions(struct file *file, char *buf, size_t size)
        char *sep;
        struct nfsd_net *nn = net_generic(netns(file), nfsd_net_id);
 
-       if (size>0) {
+       if (size > 0) {
                if (nn->nfsd_serv)
                        /* Cannot change versions without updating
                         * nn->nfsd_serv->sv_xdrsize, and reallocing
@@ -536,6 +542,7 @@ static ssize_t __write_versions(struct file *file, char *buf, size_t size)
                if (buf[size-1] != '\n')
                        return -EINVAL;
                buf[size-1] = 0;
+               trace_nfsd_ctl_version(netns(file), buf);
 
                vers = mesg;
                len = qword_get(&mesg, vers, size);
@@ -637,11 +644,11 @@ out:
  * OR
  *
  * Input:
- *                     buf:            C string containing whitespace-
- *                                     separated positive or negative
- *                                     integer values representing NFS
- *                                     protocol versions to enable ("+n")
- *                                     or disable ("-n")
+ *                     buf:            C string containing whitespace-
+ *                                     separated positive or negative
+ *                                     integer values representing NFS
+ *                                     protocol versions to enable ("+n")
+ *                                     or disable ("-n")
  *                     size:           non-zero length of C string in @buf
  * Output:
  *     On success:     status of zero or more protocol versions has
@@ -689,6 +696,7 @@ static ssize_t __write_ports_addfd(char *buf, struct net *net, const struct cred
        err = get_int(&mesg, &fd);
        if (err != 0 || fd < 0)
                return -EINVAL;
+       trace_nfsd_ctl_ports_addfd(net, fd);
 
        err = nfsd_create_serv(net);
        if (err != 0)
@@ -705,7 +713,7 @@ static ssize_t __write_ports_addfd(char *buf, struct net *net, const struct cred
 }
 
 /*
- * A transport listener is added by writing it's transport name and
+ * A transport listener is added by writing its transport name and
  * a port number.
  */
 static ssize_t __write_ports_addxprt(char *buf, struct net *net, const struct cred *cred)
@@ -720,6 +728,7 @@ static ssize_t __write_ports_addxprt(char *buf, struct net *net, const struct cr
 
        if (port < 1 || port > USHRT_MAX)
                return -EINVAL;
+       trace_nfsd_ctl_ports_addxprt(net, transport, port);
 
        err = nfsd_create_serv(net);
        if (err != 0)
@@ -832,9 +841,9 @@ int nfsd_max_blksize;
  * OR
  *
  * Input:
- *                     buf:            C string containing an unsigned
- *                                     integer value representing the new
- *                                     NFS blksize
+ *                     buf:            C string containing an unsigned
+ *                                     integer value representing the new
+ *                                     NFS blksize
  *                     size:           non-zero length of C string in @buf
  * Output:
  *     On success:     passed-in buffer filled with '\n'-terminated C string
@@ -853,6 +862,8 @@ static ssize_t write_maxblksize(struct file *file, char *buf, size_t size)
                int rv = get_int(&mesg, &bsize);
                if (rv)
                        return rv;
+               trace_nfsd_ctl_maxblksize(netns(file), bsize);
+
                /* force bsize into allowed range and
                 * required alignment.
                 */
@@ -881,9 +892,9 @@ static ssize_t write_maxblksize(struct file *file, char *buf, size_t size)
  * OR
  *
  * Input:
- *                     buf:            C string containing an unsigned
- *                                     integer value representing the new
- *                                     number of max connections
+ *                     buf:            C string containing an unsigned
+ *                                     integer value representing the new
+ *                                     number of max connections
  *                     size:           non-zero length of C string in @buf
  * Output:
  *     On success:     passed-in buffer filled with '\n'-terminated C string
@@ -903,6 +914,7 @@ static ssize_t write_maxconn(struct file *file, char *buf, size_t size)
 
                if (rv)
                        return rv;
+               trace_nfsd_ctl_maxconn(netns(file), maxconn);
                nn->max_connections = maxconn;
        }
 
@@ -913,6 +925,7 @@ static ssize_t write_maxconn(struct file *file, char *buf, size_t size)
 static ssize_t __nfsd4_write_time(struct file *file, char *buf, size_t size,
                                  time64_t *time, struct nfsd_net *nn)
 {
+       struct dentry *dentry = file_dentry(file);
        char *mesg = buf;
        int rv, i;
 
@@ -922,6 +935,9 @@ static ssize_t __nfsd4_write_time(struct file *file, char *buf, size_t size,
                rv = get_int(&mesg, &i);
                if (rv)
                        return rv;
+               trace_nfsd_ctl_time(netns(file), dentry->d_name.name,
+                                   dentry->d_name.len, i);
+
                /*
                 * Some sanity checking.  We don't have a reason for
                 * these particular numbers, but problems with the
@@ -1014,6 +1030,7 @@ static ssize_t __write_recoverydir(struct file *file, char *buf, size_t size,
                len = qword_get(&mesg, recdir, size);
                if (len <= 0)
                        return -EINVAL;
+               trace_nfsd_ctl_recoverydir(netns(file), recdir);
 
                status = nfs4_reset_recoverydir(recdir);
                if (status)
@@ -1065,7 +1082,7 @@ static ssize_t write_recoverydir(struct file *file, char *buf, size_t size)
  * OR
  *
  * Input:
- *                     buf:            any value
+ *                     buf:            any value
  *                     size:           non-zero length of C string in @buf
  * Output:
  *                     passed-in buffer filled with "Y" or "N" with a newline
@@ -1087,7 +1104,7 @@ static ssize_t write_v4_end_grace(struct file *file, char *buf, size_t size)
                case '1':
                        if (!nn->nfsd_serv)
                                return -EBUSY;
-                       nfsd4_end_grace(nn);
+                       trace_nfsd_end_grace(netns(file));
                        break;
                default:
                        return -EINVAL;
@@ -1192,8 +1209,8 @@ static int __nfsd_symlink(struct inode *dir, struct dentry *dentry,
  * @content is assumed to be a NUL-terminated string that lives
  * longer than the symlink itself.
  */
-static void nfsd_symlink(struct dentry *parent, const char *name,
-                        const char *content)
+static void _nfsd_symlink(struct dentry *parent, const char *name,
+                         const char *content)
 {
        struct inode *dir = parent->d_inode;
        struct dentry *dentry;
@@ -1210,8 +1227,8 @@ out:
        inode_unlock(dir);
 }
 #else
-static inline void nfsd_symlink(struct dentry *parent, const char *name,
-                               const char *content)
+static inline void _nfsd_symlink(struct dentry *parent, const char *name,
+                                const char *content)
 {
 }
 
@@ -1389,8 +1406,8 @@ static int nfsd_fill_super(struct super_block *sb, struct fs_context *fc)
        ret = simple_fill_super(sb, 0x6e667364, nfsd_files);
        if (ret)
                return ret;
-       nfsd_symlink(sb->s_root, "supported_krb5_enctypes",
-                    "/proc/net/rpc/gss_krb5_enctypes");
+       _nfsd_symlink(sb->s_root, "supported_krb5_enctypes",
+                     "/proc/net/rpc/gss_krb5_enctypes");
        dentry = nfsd_mkdir(sb->s_root, NULL, "clients");
        if (IS_ERR(dentry))
                return PTR_ERR(dentry);
@@ -1477,7 +1494,17 @@ static int create_proc_exports_entry(void)
 
 unsigned int nfsd_net_id;
 
-static __net_init int nfsd_init_net(struct net *net)
+/**
+ * nfsd_net_init - Prepare the nfsd_net portion of a new net namespace
+ * @net: a freshly-created network namespace
+ *
+ * This information stays around as long as the network namespace is
+ * alive whether or not there is an NFSD instance running in the
+ * namespace.
+ *
+ * Returns zero on success, or a negative errno otherwise.
+ */
+static __net_init int nfsd_net_init(struct net *net)
 {
        int retval;
        struct nfsd_net *nn = net_generic(net, nfsd_net_id);
@@ -1488,6 +1515,9 @@ static __net_init int nfsd_init_net(struct net *net)
        retval = nfsd_idmap_init(net);
        if (retval)
                goto out_idmap_error;
+       retval = nfsd_net_reply_cache_init(nn);
+       if (retval)
+               goto out_repcache_error;
        nn->nfsd_versions = NULL;
        nn->nfsd4_minorversions = NULL;
        nfsd4_init_leases_net(nn);
@@ -1496,22 +1526,32 @@ static __net_init int nfsd_init_net(struct net *net)
 
        return 0;
 
+out_repcache_error:
+       nfsd_idmap_shutdown(net);
 out_idmap_error:
        nfsd_export_shutdown(net);
 out_export_error:
        return retval;
 }
 
-static __net_exit void nfsd_exit_net(struct net *net)
+/**
+ * nfsd_net_exit - Release the nfsd_net portion of a net namespace
+ * @net: a network namespace that is about to be destroyed
+ *
+ */
+static __net_exit void nfsd_net_exit(struct net *net)
 {
+       struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+
+       nfsd_net_reply_cache_destroy(nn);
        nfsd_idmap_shutdown(net);
        nfsd_export_shutdown(net);
-       nfsd_netns_free_versions(net_generic(net, nfsd_net_id));
+       nfsd_netns_free_versions(nn);
 }
 
 static struct pernet_operations nfsd_net_ops = {
-       .init = nfsd_init_net,
-       .exit = nfsd_exit_net,
+       .init = nfsd_net_init,
+       .exit = nfsd_net_exit,
        .id   = &nfsd_net_id,
        .size = sizeof(struct nfsd_net),
 };
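
For context, the renamed nfsd_net_init()/nfsd_net_exit() pair is how NFSD plugs into the pernet machinery: because .size is set in nfsd_net_ops, the core allocates one struct nfsd_net per namespace, and any nfsd code can reach it through net_generic(). A minimal sketch, assuming the usual pernet conventions (the caller name is hypothetical, for illustration only):

    /* Sketch only -- not part of the patch. Requires <net/netns/generic.h>. */
    static void example_peek_nfsd_net(struct net *net)
    {
            struct nfsd_net *nn = net_generic(net, nfsd_net_id);

            /* nn lives for the whole lifetime of @net: it is allocated
             * before nfsd_net_init() runs and freed after nfsd_net_exit(). */
            pr_debug("nfsd_net for netns %u at %p\n", net->ns.inum, nn);
    }
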
index ccd8485..e8e13ae 100644 (file)
@@ -623,16 +623,9 @@ void fh_fill_pre_attrs(struct svc_fh *fhp)
 
        inode = d_inode(fhp->fh_dentry);
        err = fh_getattr(fhp, &stat);
-       if (err) {
-               /* Grab the times from inode anyway */
-               stat.mtime = inode->i_mtime;
-               stat.ctime = inode->i_ctime;
-               stat.size  = inode->i_size;
-               if (v4 && IS_I_VERSION(inode)) {
-                       stat.change_cookie = inode_query_iversion(inode);
-                       stat.result_mask |= STATX_CHANGE_COOKIE;
-               }
-       }
+       if (err)
+               return;
+
        if (v4)
                fhp->fh_pre_change = nfsd4_change_attribute(&stat, inode);
 
@@ -660,15 +653,10 @@ void fh_fill_post_attrs(struct svc_fh *fhp)
                printk("nfsd: inode locked twice during operation.\n");
 
        err = fh_getattr(fhp, &fhp->fh_post_attr);
-       if (err) {
-               fhp->fh_post_saved = false;
-               fhp->fh_post_attr.ctime = inode->i_ctime;
-               if (v4 && IS_I_VERSION(inode)) {
-                       fhp->fh_post_attr.change_cookie = inode_query_iversion(inode);
-                       fhp->fh_post_attr.result_mask |= STATX_CHANGE_COOKIE;
-               }
-       } else
-               fhp->fh_post_saved = true;
+       if (err)
+               return;
+
+       fhp->fh_post_saved = true;
        if (v4)
                fhp->fh_post_change =
                        nfsd4_change_attribute(&fhp->fh_post_attr, inode);
index c371955..a731592 100644 (file)
@@ -176,9 +176,7 @@ nfsd_proc_read(struct svc_rqst *rqstp)
 {
        struct nfsd_readargs *argp = rqstp->rq_argp;
        struct nfsd_readres *resp = rqstp->rq_resp;
-       unsigned int len;
        u32 eof;
-       int v;
 
        dprintk("nfsd: READ    %s %d bytes at %d\n",
                SVCFH_fmt(&argp->fh),
@@ -187,17 +185,7 @@ nfsd_proc_read(struct svc_rqst *rqstp)
        argp->count = min_t(u32, argp->count, NFSSVC_MAXBLKSIZE_V2);
        argp->count = min_t(u32, argp->count, rqstp->rq_res.buflen);
 
-       v = 0;
-       len = argp->count;
        resp->pages = rqstp->rq_next_page;
-       while (len > 0) {
-               struct page *page = *(rqstp->rq_next_page++);
-
-               rqstp->rq_vec[v].iov_base = page_address(page);
-               rqstp->rq_vec[v].iov_len = min_t(unsigned int, len, PAGE_SIZE);
-               len -= rqstp->rq_vec[v].iov_len;
-               v++;
-       }
 
        /* Obtain buffer pointer for payload. 19 is 1 word for
         * status, 17 words for fattr, and 1 word for the byte count.
@@ -207,7 +195,7 @@ nfsd_proc_read(struct svc_rqst *rqstp)
        resp->count = argp->count;
        fh_copy(&resp->fh, &argp->fh);
        resp->status = nfsd_read(rqstp, &resp->fh, argp->offset,
-                                rqstp->rq_vec, v, &resp->count, &eof);
+                                &resp->count, &eof);
        if (resp->status == nfs_ok)
                resp->status = fh_getattr(&resp->fh, &resp->stat);
        else if (resp->status == nfserr_jukebox)
index 9c7b1ef..2154fa6 100644 (file)
@@ -402,6 +402,11 @@ void nfsd_reset_write_verifier(struct nfsd_net *nn)
        write_sequnlock(&nn->writeverf_lock);
 }
 
+/*
+ * Crank up a set of per-namespace resources for a new NFSD instance,
+ * including lockd, a duplicate reply cache, an open file cache
+ * instance, and a cache of NFSv4 state objects.
+ */
 static int nfsd_startup_net(struct net *net, const struct cred *cred)
 {
        struct nfsd_net *nn = net_generic(net, nfsd_net_id);
index caf6355..5777f40 100644 (file)
@@ -468,7 +468,8 @@ nfssvc_encode_readlinkres(struct svc_rqst *rqstp, struct xdr_stream *xdr)
        case nfs_ok:
                if (xdr_stream_encode_u32(xdr, resp->len) < 0)
                        return false;
-               xdr_write_pages(xdr, &resp->page, 0, resp->len);
+               svcxdr_encode_opaque_pages(rqstp, xdr, &resp->page, 0,
+                                          resp->len);
                if (svc_encode_result_payload(rqstp, head->iov_len, resp->len) < 0)
                        return false;
                break;
@@ -491,8 +492,9 @@ nfssvc_encode_readres(struct svc_rqst *rqstp, struct xdr_stream *xdr)
                        return false;
                if (xdr_stream_encode_u32(xdr, resp->count) < 0)
                        return false;
-               xdr_write_pages(xdr, resp->pages, rqstp->rq_res.page_base,
-                               resp->count);
+               svcxdr_encode_opaque_pages(rqstp, xdr, resp->pages,
+                                          rqstp->rq_res.page_base,
+                                          resp->count);
                if (svc_encode_result_payload(rqstp, head->iov_len, resp->count) < 0)
                        return false;
                break;
@@ -511,7 +513,8 @@ nfssvc_encode_readdirres(struct svc_rqst *rqstp, struct xdr_stream *xdr)
                return false;
        switch (resp->status) {
        case nfs_ok:
-               xdr_write_pages(xdr, dirlist->pages, 0, dirlist->len);
+               svcxdr_encode_opaque_pages(rqstp, xdr, dirlist->pages, 0,
+                                          dirlist->len);
                /* no more entries */
                if (xdr_stream_encode_item_absent(xdr) < 0)
                        return false;
index 72a906a..2af7498 100644 (file)
@@ -1581,6 +1581,265 @@ TRACE_EVENT(nfsd_cb_recall_any_done,
        )
 );
 
+TRACE_EVENT(nfsd_ctl_unlock_ip,
+       TP_PROTO(
+               const struct net *net,
+               const char *address
+       ),
+       TP_ARGS(net, address),
+       TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+               __string(address, address)
+       ),
+       TP_fast_assign(
+               __entry->netns_ino = net->ns.inum;
+               __assign_str(address, address);
+       ),
+       TP_printk("address=%s",
+               __get_str(address)
+       )
+);
+
+TRACE_EVENT(nfsd_ctl_unlock_fs,
+       TP_PROTO(
+               const struct net *net,
+               const char *path
+       ),
+       TP_ARGS(net, path),
+       TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+               __string(path, path)
+       ),
+       TP_fast_assign(
+               __entry->netns_ino = net->ns.inum;
+               __assign_str(path, path);
+       ),
+       TP_printk("path=%s",
+               __get_str(path)
+       )
+);
+
+TRACE_EVENT(nfsd_ctl_filehandle,
+       TP_PROTO(
+               const struct net *net,
+               const char *domain,
+               const char *path,
+               int maxsize
+       ),
+       TP_ARGS(net, domain, path, maxsize),
+       TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+               __field(int, maxsize)
+               __string(domain, domain)
+               __string(path, path)
+       ),
+       TP_fast_assign(
+               __entry->netns_ino = net->ns.inum;
+               __entry->maxsize = maxsize;
+               __assign_str(domain, domain);
+               __assign_str(path, path);
+       ),
+       TP_printk("domain=%s path=%s maxsize=%d",
+               __get_str(domain), __get_str(path), __entry->maxsize
+       )
+);
+
+TRACE_EVENT(nfsd_ctl_threads,
+       TP_PROTO(
+               const struct net *net,
+               int newthreads
+       ),
+       TP_ARGS(net, newthreads),
+       TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+               __field(int, newthreads)
+       ),
+       TP_fast_assign(
+               __entry->netns_ino = net->ns.inum;
+               __entry->newthreads = newthreads;
+       ),
+       TP_printk("newthreads=%d",
+               __entry->newthreads
+       )
+);
+
+TRACE_EVENT(nfsd_ctl_pool_threads,
+       TP_PROTO(
+               const struct net *net,
+               int pool,
+               int nrthreads
+       ),
+       TP_ARGS(net, pool, nrthreads),
+       TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+               __field(int, pool)
+               __field(int, nrthreads)
+       ),
+       TP_fast_assign(
+               __entry->netns_ino = net->ns.inum;
+               __entry->pool = pool;
+               __entry->nrthreads = nrthreads;
+       ),
+       TP_printk("pool=%d nrthreads=%d",
+               __entry->pool, __entry->nrthreads
+       )
+);
+
+TRACE_EVENT(nfsd_ctl_version,
+       TP_PROTO(
+               const struct net *net,
+               const char *mesg
+       ),
+       TP_ARGS(net, mesg),
+       TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+               __string(mesg, mesg)
+       ),
+       TP_fast_assign(
+               __entry->netns_ino = net->ns.inum;
+               __assign_str(mesg, mesg);
+       ),
+       TP_printk("%s",
+               __get_str(mesg)
+       )
+);
+
+TRACE_EVENT(nfsd_ctl_ports_addfd,
+       TP_PROTO(
+               const struct net *net,
+               int fd
+       ),
+       TP_ARGS(net, fd),
+       TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+               __field(int, fd)
+       ),
+       TP_fast_assign(
+               __entry->netns_ino = net->ns.inum;
+               __entry->fd = fd;
+       ),
+       TP_printk("fd=%d",
+               __entry->fd
+       )
+);
+
+TRACE_EVENT(nfsd_ctl_ports_addxprt,
+       TP_PROTO(
+               const struct net *net,
+               const char *transport,
+               int port
+       ),
+       TP_ARGS(net, transport, port),
+       TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+               __field(int, port)
+               __string(transport, transport)
+       ),
+       TP_fast_assign(
+               __entry->netns_ino = net->ns.inum;
+               __entry->port = port;
+               __assign_str(transport, transport);
+       ),
+       TP_printk("transport=%s port=%d",
+               __get_str(transport), __entry->port
+       )
+);
+
+TRACE_EVENT(nfsd_ctl_maxblksize,
+       TP_PROTO(
+               const struct net *net,
+               int bsize
+       ),
+       TP_ARGS(net, bsize),
+       TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+               __field(int, bsize)
+       ),
+       TP_fast_assign(
+               __entry->netns_ino = net->ns.inum;
+               __entry->bsize = bsize;
+       ),
+       TP_printk("bsize=%d",
+               __entry->bsize
+       )
+);
+
+TRACE_EVENT(nfsd_ctl_maxconn,
+       TP_PROTO(
+               const struct net *net,
+               int maxconn
+       ),
+       TP_ARGS(net, maxconn),
+       TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+               __field(int, maxconn)
+       ),
+       TP_fast_assign(
+               __entry->netns_ino = net->ns.inum;
+               __entry->maxconn = maxconn;
+       ),
+       TP_printk("maxconn=%d",
+               __entry->maxconn
+       )
+);
+
+TRACE_EVENT(nfsd_ctl_time,
+       TP_PROTO(
+               const struct net *net,
+               const char *name,
+               size_t namelen,
+               int time
+       ),
+       TP_ARGS(net, name, namelen, time),
+       TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+               __field(int, time)
+               __string_len(name, name, namelen)
+       ),
+       TP_fast_assign(
+               __entry->netns_ino = net->ns.inum;
+               __entry->time = time;
+               __assign_str_len(name, name, namelen);
+       ),
+       TP_printk("file=%s time=%d",
+               __get_str(name), __entry->time
+       )
+);
+
+TRACE_EVENT(nfsd_ctl_recoverydir,
+       TP_PROTO(
+               const struct net *net,
+               const char *recdir
+       ),
+       TP_ARGS(net, recdir),
+       TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+               __string(recdir, recdir)
+       ),
+       TP_fast_assign(
+               __entry->netns_ino = net->ns.inum;
+               __assign_str(recdir, recdir);
+       ),
+       TP_printk("recdir=%s",
+               __get_str(recdir)
+       )
+);
+
+TRACE_EVENT(nfsd_end_grace,
+       TP_PROTO(
+               const struct net *net
+       ),
+       TP_ARGS(net),
+       TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+       ),
+       TP_fast_assign(
+               __entry->netns_ino = net->ns.inum;
+       ),
+       TP_printk("nn=%d", __entry->netns_ino
+       )
+);
+
 #endif /* _NFSD_TRACE_H */
 
 #undef TRACE_INCLUDE_PATH
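
The nfsd_ctl_* events above all capture the namespace inode plus the arguments written to the corresponding /proc/fs/nfsd control file. A hedged sketch of the emission pattern used by the nfsctl write handlers (the handler below is hypothetical; netns() is the existing nfsctl helper that maps the procfs file to its network namespace):

    /* Sketch only -- illustrative handler, not part of the patch. */
    static ssize_t example_write_maxblksize(struct file *file, int bsize)
    {
            trace_nfsd_ctl_maxblksize(netns(file), bsize);
            /* ... existing validation and nfsd_max_blksize update ... */
            return 0;
    }

Because the file defines these under the nfsd TRACE_SYSTEM, they show up in tracefs under events/nfsd/ and can be enabled individually once the kernel is built with tracing support.
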
index 8879e20..8a2321d 100644 (file)
@@ -388,7 +388,9 @@ nfsd_sanitize_attrs(struct inode *inode, struct iattr *iap)
                                iap->ia_mode &= ~S_ISGID;
                } else {
                        /* set ATTR_KILL_* bits and let VFS handle it */
-                       iap->ia_valid |= (ATTR_KILL_SUID | ATTR_KILL_SGID);
+                       iap->ia_valid |= ATTR_KILL_SUID;
+                       iap->ia_valid |=
+                               setattr_should_drop_sgid(&nop_mnt_idmap, inode);
                }
        }
 }
@@ -1001,6 +1003,18 @@ static __be32 nfsd_finish_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
        }
 }
 
+/**
+ * nfsd_splice_read - Perform a VFS read using a splice pipe
+ * @rqstp: RPC transaction context
+ * @fhp: file handle of file to be read
+ * @file: opened struct file of file to be read
+ * @offset: starting byte offset
+ * @count: IN: requested number of bytes; OUT: number of bytes read
+ * @eof: OUT: set non-zero if operation reached the end of the file
+ *
+ * Returns nfs_ok on success, otherwise an nfserr stat value is
+ * returned.
+ */
 __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
                        struct file *file, loff_t offset, unsigned long *count,
                        u32 *eof)
@@ -1014,22 +1028,50 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
        ssize_t host_err;
 
        trace_nfsd_read_splice(rqstp, fhp, offset, *count);
-       rqstp->rq_next_page = rqstp->rq_respages + 1;
        host_err = splice_direct_to_actor(file, &sd, nfsd_direct_splice_actor);
        return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
 }
 
-__be32 nfsd_readv(struct svc_rqst *rqstp, struct svc_fh *fhp,
-                 struct file *file, loff_t offset,
-                 struct kvec *vec, int vlen, unsigned long *count,
-                 u32 *eof)
+/**
+ * nfsd_iter_read - Perform a VFS read using an iterator
+ * @rqstp: RPC transaction context
+ * @fhp: file handle of file to be read
+ * @file: opened struct file of file to be read
+ * @offset: starting byte offset
+ * @count: IN: requested number of bytes; OUT: number of bytes read
+ * @base: offset in first page of read buffer
+ * @eof: OUT: set non-zero if operation reached the end of the file
+ *
+ * Some filesystems or situations cannot use nfsd_splice_read. This
+ * function is the slightly less-performant fallback for those cases.
+ *
+ * Returns nfs_ok on success, otherwise an nfserr stat value is
+ * returned.
+ */
+__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
+                     struct file *file, loff_t offset, unsigned long *count,
+                     unsigned int base, u32 *eof)
 {
+       unsigned long v, total;
        struct iov_iter iter;
        loff_t ppos = offset;
+       struct page *page;
        ssize_t host_err;
 
+       v = 0;
+       total = *count;
+       while (total) {
+               page = *(rqstp->rq_next_page++);
+               rqstp->rq_vec[v].iov_base = page_address(page) + base;
+               rqstp->rq_vec[v].iov_len = min_t(size_t, total, PAGE_SIZE - base);
+               total -= rqstp->rq_vec[v].iov_len;
+               ++v;
+               base = 0;
+       }
+       WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec));
+
        trace_nfsd_read_vector(rqstp, fhp, offset, *count);
-       iov_iter_kvec(&iter, ITER_DEST, vec, vlen, *count);
+       iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count);
        host_err = vfs_iter_read(file, &iter, &ppos, 0);
        return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
 }
@@ -1159,14 +1201,24 @@ out_nfserr:
        return nfserr;
 }
 
-/*
- * Read data from a file. count must contain the requested read count
- * on entry. On return, *count contains the number of bytes actually read.
+/**
+ * nfsd_read - Read data from a file
+ * @rqstp: RPC transaction context
+ * @fhp: file handle of file to be read
+ * @offset: starting byte offset
+ * @count: IN: requested number of bytes; OUT: number of bytes read
+ * @eof: OUT: set non-zero if operation reached the end of the file
+ *
+ * The caller must verify that there is enough space in @rqstp.rq_res
+ * to perform this operation.
+ *
  * N.B. After this call fhp needs an fh_put
+ *
+ * Returns nfs_ok on success, otherwise an nfserr stat value is
+ * returned.
  */
 __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
-       loff_t offset, struct kvec *vec, int vlen, unsigned long *count,
-       u32 *eof)
+                loff_t offset, unsigned long *count, u32 *eof)
 {
        struct nfsd_file        *nf;
        struct file *file;
@@ -1181,12 +1233,10 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
        if (file->f_op->splice_read && test_bit(RQ_SPLICE_OK, &rqstp->rq_flags))
                err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
        else
-               err = nfsd_readv(rqstp, fhp, file, offset, vec, vlen, count, eof);
+               err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
 
        nfsd_file_put(nf);
-
        trace_nfsd_read_done(rqstp, fhp, offset, *count);
-
        return err;
 }
 
index 43fb57a..a6890ea 100644 (file)
@@ -110,13 +110,12 @@ __be32            nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
                                struct file *file, loff_t offset,
                                unsigned long *count,
                                u32 *eof);
-__be32         nfsd_readv(struct svc_rqst *rqstp, struct svc_fh *fhp,
+__be32         nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
                                struct file *file, loff_t offset,
-                               struct kvec *vec, int vlen,
-                               unsigned long *count,
+                               unsigned long *count, unsigned int base,
                                u32 *eof);
-__be32                 nfsd_read(struct svc_rqst *, struct svc_fh *,
-                               loff_t, struct kvec *, int, unsigned long *,
+__be32         nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
+                               loff_t offset, unsigned long *count,
                                u32 *eof);
 __be32                 nfsd_write(struct svc_rqst *, struct svc_fh *, loff_t,
                                struct kvec *, int, unsigned long *,
index a265d39..a9eb348 100644 (file)
@@ -140,7 +140,7 @@ const struct file_operations nilfs_file_operations = {
        .open           = generic_file_open,
        /* .release     = nilfs_release_file, */
        .fsync          = nilfs_sync_file,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .splice_write   = iter_file_splice_write,
 };
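
This generic_file_splice_read -> filemap_splice_read substitution repeats across many filesystems below. It is a drop-in change because both helpers implement the ->splice_read() method signature; a sketch of that shared prototype (the declaration is assumed to live alongside the other splice helpers in linux/fs.h):

    /* Shared ->splice_read() shape (sketch; see linux/fs.h for the real one). */
    ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
                                struct pipe_inode_info *pipe,
                                size_t len, unsigned int flags);

copy_splice_read(), used later for procfs and sysctl files whose contents are not backed by the page cache, has the same shape.
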
 
index 9ba4933..0ef8c71 100644 (file)
@@ -1299,14 +1299,11 @@ nilfs_mount(struct file_system_type *fs_type, int flags,
 {
        struct nilfs_super_data sd;
        struct super_block *s;
-       fmode_t mode = FMODE_READ | FMODE_EXCL;
        struct dentry *root_dentry;
        int err, s_new = false;
 
-       if (!(flags & SB_RDONLY))
-               mode |= FMODE_WRITE;
-
-       sd.bdev = blkdev_get_by_path(dev_name, mode, fs_type);
+       sd.bdev = blkdev_get_by_path(dev_name, sb_open_mode(flags), fs_type,
+                                    NULL);
        if (IS_ERR(sd.bdev))
                return ERR_CAST(sd.bdev);
 
@@ -1340,7 +1337,6 @@ nilfs_mount(struct file_system_type *fs_type, int flags,
                s_new = true;
 
                /* New superblock instance created */
-               s->s_mode = mode;
                snprintf(s->s_id, sizeof(s->s_id), "%pg", sd.bdev);
                sb_set_blocksize(s, block_size(sd.bdev));
 
@@ -1378,7 +1374,7 @@ nilfs_mount(struct file_system_type *fs_type, int flags,
        }
 
        if (!s_new)
-               blkdev_put(sd.bdev, mode);
+               blkdev_put(sd.bdev, fs_type);
 
        return root_dentry;
 
@@ -1387,7 +1383,7 @@ nilfs_mount(struct file_system_type *fs_type, int flags,
 
  failed:
        if (!s_new)
-               blkdev_put(sd.bdev, mode);
+               blkdev_put(sd.bdev, fs_type);
        return ERR_PTR(err);
 }
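
The nilfs changes above reflect the reworked block-open API in this release: FMODE_* open modes become BLK_OPEN_* flags, a holder pointer takes over the exclusivity role of FMODE_EXCL, and blkdev_put() is passed the same holder instead of a mode. A minimal sketch of the pairing, with a hypothetical device path and holder:

    /* Sketch only. Requires <linux/blkdev.h>. */
    static int example_open_release(void *holder)
    {
            struct block_device *bdev;

            bdev = blkdev_get_by_path("/dev/example",
                                      BLK_OPEN_READ | BLK_OPEN_WRITE,
                                      holder, NULL);
            if (IS_ERR(bdev))
                    return PTR_ERR(bdev);
            blkdev_put(bdev, holder);       /* same holder as the open */
            return 0;
    }
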
 
diff --git a/fs/no-block.c b/fs/no-block.c
deleted file mode 100644 (file)
index 481c0f0..0000000
+++ /dev/null
@@ -1,19 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/* no-block.c: implementation of routines required for non-BLOCK configuration
- *
- * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (dhowells@redhat.com)
- */
-
-#include <linux/kernel.h>
-#include <linux/fs.h>
-
-static int no_blkdev_open(struct inode * inode, struct file * filp)
-{
-       return -ENODEV;
-}
-
-const struct file_operations def_blk_fops = {
-       .open           = no_blkdev_open,
-       .llseek         = noop_llseek,
-};
index e8aeba1..4e158bc 100644 (file)
@@ -526,7 +526,7 @@ err_out:
  *
  * Return 0 on success and -errno on error.
  *
- * Based on ntfs_read_block() and __block_write_full_page().
+ * Based on ntfs_read_block() and __block_write_full_folio().
  */
 static int ntfs_write_block(struct page *page, struct writeback_control *wbc)
 {
index a3865bc..f79408f 100644 (file)
@@ -2491,7 +2491,7 @@ conv_err_out:
  * byte offset @ofs inside the attribute with the constant byte @val.
  *
  * This function is effectively like memset() applied to an ntfs attribute.
- * Note thie function actually only operates on the page cache pages belonging
+ * Note this function actually only operates on the page cache pages belonging
  * to the ntfs attribute and it marks them dirty after doing the memset().
  * Thus it relies on the vm dirty page write code paths to cause the modified
  * pages to be written to the mft record/disk.
index f9cb180..761aaa0 100644 (file)
@@ -161,7 +161,7 @@ static int ntfs_decompress(struct page *dest_pages[], int completed_pages[],
         */
        u8 *cb_end = cb_start + cb_size; /* End of cb. */
        u8 *cb = cb_start;      /* Current position in cb. */
-       u8 *cb_sb_start = cb;   /* Beginning of the current sb in the cb. */
+       u8 *cb_sb_start;        /* Beginning of the current sb in the cb. */
        u8 *cb_sb_end;          /* End of current sb / beginning of next sb. */
 
        /* Variables for uncompressed data / destination. */
index c481b14..cbc5459 100644 (file)
@@ -1911,11 +1911,9 @@ static ssize_t ntfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
        inode_lock(vi);
        /* We can write back this queue in page reclaim. */
-       current->backing_dev_info = inode_to_bdi(vi);
        err = ntfs_prepare_file_for_write(iocb, from);
        if (iov_iter_count(from) && !err)
                written = ntfs_perform_write(file, from, iocb->ki_pos);
-       current->backing_dev_info = NULL;
        inode_unlock(vi);
        iocb->ki_pos += written;
        if (likely(written > 0))
@@ -1992,7 +1990,7 @@ const struct file_operations ntfs_file_ops = {
 #endif /* NTFS_RW */
        .mmap           = generic_file_mmap,
        .open           = ntfs_file_open,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
 };
 
 const struct inode_operations ntfs_file_inode_ops = {
index 4803089..0155f10 100644 (file)
@@ -1955,36 +1955,38 @@ undo_alloc:
                                "attribute.%s", es);
                NVolSetErrors(vol);
        }
-       a = ctx->attr;
+
        if (ntfs_rl_truncate_nolock(vol, &mft_ni->runlist, old_last_vcn)) {
                ntfs_error(vol->sb, "Failed to truncate mft data attribute "
                                "runlist.%s", es);
                NVolSetErrors(vol);
        }
-       if (mp_rebuilt && !IS_ERR(ctx->mrec)) {
-               if (ntfs_mapping_pairs_build(vol, (u8*)a + le16_to_cpu(
+       if (ctx) {
+               a = ctx->attr;
+               if (mp_rebuilt && !IS_ERR(ctx->mrec)) {
+                       if (ntfs_mapping_pairs_build(vol, (u8 *)a + le16_to_cpu(
                                a->data.non_resident.mapping_pairs_offset),
                                old_alen - le16_to_cpu(
-                               a->data.non_resident.mapping_pairs_offset),
+                                       a->data.non_resident.mapping_pairs_offset),
                                rl2, ll, -1, NULL)) {
-                       ntfs_error(vol->sb, "Failed to restore mapping pairs "
+                               ntfs_error(vol->sb, "Failed to restore mapping pairs "
                                        "array.%s", es);
-                       NVolSetErrors(vol);
-               }
-               if (ntfs_attr_record_resize(ctx->mrec, a, old_alen)) {
-                       ntfs_error(vol->sb, "Failed to restore attribute "
+                               NVolSetErrors(vol);
+                       }
+                       if (ntfs_attr_record_resize(ctx->mrec, a, old_alen)) {
+                               ntfs_error(vol->sb, "Failed to restore attribute "
                                        "record.%s", es);
+                               NVolSetErrors(vol);
+                       }
+                       flush_dcache_mft_record_page(ctx->ntfs_ino);
+                       mark_mft_record_dirty(ctx->ntfs_ino);
+               } else if (IS_ERR(ctx->mrec)) {
+                       ntfs_error(vol->sb, "Failed to restore attribute search "
+                               "context.%s", es);
                        NVolSetErrors(vol);
                }
-               flush_dcache_mft_record_page(ctx->ntfs_ino);
-               mark_mft_record_dirty(ctx->ntfs_ino);
-       } else if (IS_ERR(ctx->mrec)) {
-               ntfs_error(vol->sb, "Failed to restore attribute search "
-                               "context.%s", es);
-               NVolSetErrors(vol);
-       }
-       if (ctx)
                ntfs_attr_put_search_ctx(ctx);
+       }
        if (!IS_ERR(mrec))
                unmap_mft_record(mft_ni);
        up_write(&mft_ni->runlist.lock);
index 2643a08..56a7d5b 100644 (file)
@@ -1620,7 +1620,7 @@ read_partial_attrdef_page:
                memcpy((u8*)vol->attrdef + (index++ << PAGE_SHIFT),
                                page_address(page), size);
                ntfs_unmap_page(page);
-       };
+       }
        if (size == PAGE_SIZE) {
                size = i_size & ~PAGE_MASK;
                if (size)
@@ -1689,7 +1689,7 @@ read_partial_upcase_page:
                memcpy((char*)vol->upcase + (index++ << PAGE_SHIFT),
                                page_address(page), size);
                ntfs_unmap_page(page);
-       };
+       }
        if (size == PAGE_SIZE) {
                size = i_size & ~PAGE_MASK;
                if (size)
index 9a3d55c..9be3e8e 100644 (file)
@@ -744,6 +744,35 @@ static ssize_t ntfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
        return generic_file_read_iter(iocb, iter);
 }
 
+static ssize_t ntfs_file_splice_read(struct file *in, loff_t *ppos,
+                                    struct pipe_inode_info *pipe,
+                                    size_t len, unsigned int flags)
+{
+       struct inode *inode = in->f_mapping->host;
+       struct ntfs_inode *ni = ntfs_i(inode);
+
+       if (is_encrypted(ni)) {
+               ntfs_inode_warn(inode, "encrypted i/o not supported");
+               return -EOPNOTSUPP;
+       }
+
+#ifndef CONFIG_NTFS3_LZX_XPRESS
+       if (ni->ni_flags & NI_FLAG_COMPRESSED_MASK) {
+               ntfs_inode_warn(
+                       inode,
+                       "activate CONFIG_NTFS3_LZX_XPRESS to read external compressed files");
+               return -EOPNOTSUPP;
+       }
+#endif
+
+       if (is_dedup(ni)) {
+               ntfs_inode_warn(inode, "read deduplicated not supported");
+               return -EOPNOTSUPP;
+       }
+
+       return filemap_splice_read(in, ppos, pipe, len, flags);
+}
+
 /*
  * ntfs_get_frame_pages
  *
@@ -820,7 +849,6 @@ static ssize_t ntfs_compress_write(struct kiocb *iocb, struct iov_iter *from)
        if (!pages)
                return -ENOMEM;
 
-       current->backing_dev_info = inode_to_bdi(inode);
        err = file_remove_privs(file);
        if (err)
                goto out;
@@ -993,8 +1021,6 @@ static ssize_t ntfs_compress_write(struct kiocb *iocb, struct iov_iter *from)
 out:
        kfree(pages);
 
-       current->backing_dev_info = NULL;
-
        if (err < 0)
                return err;
 
@@ -1159,7 +1185,7 @@ const struct file_operations ntfs_file_operations = {
 #ifdef CONFIG_COMPAT
        .compat_ioctl   = ntfs_compat_ioctl,
 #endif
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = ntfs_file_splice_read,
        .mmap           = ntfs_file_mmap,
        .open           = ntfs_file_open,
        .fsync          = generic_file_fsync,
index 60b97c9..21472e3 100644 (file)
@@ -1503,7 +1503,7 @@ static void o2hb_region_release(struct config_item *item)
        }
 
        if (reg->hr_bdev)
-               blkdev_put(reg->hr_bdev, FMODE_READ|FMODE_WRITE);
+               blkdev_put(reg->hr_bdev, NULL);
 
        kfree(reg->hr_slots);
 
@@ -1786,7 +1786,8 @@ static ssize_t o2hb_region_dev_store(struct config_item *item,
                goto out2;
 
        reg->hr_bdev = blkdev_get_by_dev(f.file->f_mapping->host->i_rdev,
-                                        FMODE_WRITE | FMODE_READ, NULL);
+                                        BLK_OPEN_WRITE | BLK_OPEN_READ, NULL,
+                                        NULL);
        if (IS_ERR(reg->hr_bdev)) {
                ret = PTR_ERR(reg->hr_bdev);
                reg->hr_bdev = NULL;
@@ -1893,7 +1894,7 @@ static ssize_t o2hb_region_dev_store(struct config_item *item,
 
 out3:
        if (ret < 0) {
-               blkdev_put(reg->hr_bdev, FMODE_READ | FMODE_WRITE);
+               blkdev_put(reg->hr_bdev, NULL);
                reg->hr_bdev = NULL;
        }
 out2:
index b173c36..91a1945 100644 (file)
@@ -2558,7 +2558,7 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
         *
         * Take and drop the meta data lock to update inode fields
         * like i_size. This allows the checks down below
-        * generic_file_read_iter() a chance of actually working.
+        * copy_splice_read() a chance of actually working.
         */
        ret = ocfs2_inode_lock_atime(inode, filp->f_path.mnt, &lock_level,
                                     !nowait);
@@ -2587,6 +2587,43 @@ bail:
        return ret;
 }
 
+static ssize_t ocfs2_file_splice_read(struct file *in, loff_t *ppos,
+                                     struct pipe_inode_info *pipe,
+                                     size_t len, unsigned int flags)
+{
+       struct inode *inode = file_inode(in);
+       ssize_t ret = 0;
+       int lock_level = 0;
+
+       trace_ocfs2_file_splice_read(inode, in, in->f_path.dentry,
+                                    (unsigned long long)OCFS2_I(inode)->ip_blkno,
+                                    in->f_path.dentry->d_name.len,
+                                    in->f_path.dentry->d_name.name,
+                                    flags);
+
+       /*
+        * We're fine letting folks race truncates and extending writes with
+        * read across the cluster, just like they can locally.  Hence no
+        * rw_lock during read.
+        *
+        * Take and drop the meta data lock to update inode fields like i_size.
+        * This allows the checks down below filemap_splice_read() a chance of
+        * actually working.
+        */
+       ret = ocfs2_inode_lock_atime(inode, in->f_path.mnt, &lock_level, 1);
+       if (ret < 0) {
+               if (ret != -EAGAIN)
+                       mlog_errno(ret);
+               goto bail;
+       }
+       ocfs2_inode_unlock(inode, lock_level);
+
+       ret = filemap_splice_read(in, ppos, pipe, len, flags);
+       trace_filemap_splice_read_ret(ret);
+bail:
+       return ret;
+}
+
 /* Refer generic_file_llseek_unlocked() */
 static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
 {
@@ -2750,7 +2787,7 @@ const struct file_operations ocfs2_fops = {
 #endif
        .lock           = ocfs2_lock,
        .flock          = ocfs2_flock,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = ocfs2_file_splice_read,
        .splice_write   = iter_file_splice_write,
        .fallocate      = ocfs2_fallocate,
        .remap_file_range = ocfs2_remap_file_range,
@@ -2796,7 +2833,7 @@ const struct file_operations ocfs2_fops_no_plocks = {
        .compat_ioctl   = ocfs2_compat_ioctl,
 #endif
        .flock          = ocfs2_flock,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .splice_write   = iter_file_splice_write,
        .fallocate      = ocfs2_fallocate,
        .remap_file_range = ocfs2_remap_file_range,
index c4426d1..c803c10 100644 (file)
@@ -973,7 +973,7 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
        la_start_blk = ocfs2_clusters_to_blocks(osb->sb,
                                                le32_to_cpu(la->la_bm_off));
        bitmap = la->la_bitmap;
-       start = count = bit_off = 0;
+       start = count = 0;
        left = le32_to_cpu(alloc->id1.bitmap1.i_total);
 
        while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start))
index dc4bce1..ac4fd1d 100644 (file)
@@ -1315,10 +1315,10 @@ DEFINE_OCFS2_FILE_OPS(ocfs2_sync_file);
 
 DEFINE_OCFS2_FILE_OPS(ocfs2_file_write_iter);
 
-DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_write);
-
 DEFINE_OCFS2_FILE_OPS(ocfs2_file_read_iter);
 
+DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_read);
+
 DEFINE_OCFS2_ULL_ULL_ULL_EVENT(ocfs2_truncate_file);
 
 DEFINE_OCFS2_ULL_ULL_EVENT(ocfs2_truncate_file_error);
@@ -1470,6 +1470,7 @@ TRACE_EVENT(ocfs2_prepare_inode_for_write,
 );
 
 DEFINE_OCFS2_INT_EVENT(generic_file_read_iter_ret);
+DEFINE_OCFS2_INT_EVENT(filemap_splice_read_ret);
 
 /* End of trace events for fs/ocfs2/file.c. */
 
index 5022b3e..dfaae1e 100644 (file)
@@ -811,7 +811,7 @@ static int ocfs2_local_free_info(struct super_block *sb, int type)
        struct ocfs2_quota_chunk *chunk;
        struct ocfs2_local_disk_chunk *dchunk;
        int mark_clean = 1, len;
-       int status;
+       int status = 0;
 
        iput(oinfo->dqi_gqinode);
        ocfs2_simple_drop_lockres(OCFS2_SB(sb), &oinfo->dqi_gqlock);
@@ -853,17 +853,14 @@ static int ocfs2_local_free_info(struct super_block *sb, int type)
                                 oinfo->dqi_libh,
                                 olq_update_info,
                                 info);
-       if (status < 0) {
+       if (status < 0)
                mlog_errno(status);
-               goto out;
-       }
-
 out:
        ocfs2_inode_unlock(sb_dqopt(sb)->files[type], 1);
        brelse(oinfo->dqi_libh);
        brelse(oinfo->dqi_lqi_bh);
        kfree(oinfo);
-       return 0;
+       return status;
 }
 
 static void olq_set_dquot(struct buffer_head *bh, void *private)
index 0101f1f..de8f57e 100644 (file)
@@ -334,7 +334,7 @@ const struct file_operations omfs_file_operations = {
        .write_iter = generic_file_write_iter,
        .mmap = generic_file_mmap,
        .fsync = generic_file_fsync,
-       .splice_read = generic_file_splice_read,
+       .splice_read = filemap_splice_read,
 };
 
 static int omfs_setattr(struct mnt_idmap *idmap,
index 4478adc..fb07b28 100644 (file)
--- a/fs/open.c
+++ b/fs/open.c
@@ -700,10 +700,7 @@ SYSCALL_DEFINE2(chmod, const char __user *, filename, umode_t, mode)
        return do_fchmodat(AT_FDCWD, filename, mode);
 }
 
-/**
- * setattr_vfsuid - check and set ia_fsuid attribute
- * @kuid: new inode owner
- *
+/*
  * Check whether @kuid is valid and if so generate and set vfsuid_t in
  * ia_vfsuid.
  *
@@ -718,10 +715,7 @@ static inline bool setattr_vfsuid(struct iattr *attr, kuid_t kuid)
        return true;
 }
 
-/**
- * setattr_vfsgid - check and set ia_fsgid attribute
- * @kgid: new inode owner
- *
+/*
  * Check whether @kgid is valid and if so generate and set vfsgid_t in
  * ia_vfsgid.
  *
@@ -989,7 +983,6 @@ cleanup_file:
  * @file: file pointer
  * @dentry: pointer to dentry
  * @open: open callback
- * @opened: state of open
  *
  * This can be used to finish opening a file passed to i_op->atomic_open().
  *
@@ -1043,7 +1036,6 @@ EXPORT_SYMBOL(file_path);
  * vfs_open - open the file at the given path
  * @path: path to open
  * @file: newly allocated file with f_flag initialized
- * @cred: credentials to use
  */
 int vfs_open(const struct path *path, struct file *file)
 {
@@ -1116,23 +1108,77 @@ struct file *dentry_create(const struct path *path, int flags, umode_t mode,
 }
 EXPORT_SYMBOL(dentry_create);
 
-struct file *open_with_fake_path(const struct path *path, int flags,
+/**
+ * kernel_file_open - open a file for kernel internal use
+ * @path:      path of the file to open
+ * @flags:     open flags
+ * @inode:     the inode
+ * @cred:      credentials for open
+ *
+ * Open a file for use by in-kernel consumers. The file is not accounted
+ * against nr_files and must not be installed into the file descriptor
+ * table.
+ *
+ * Return: Opened file on success, an error pointer on failure.
+ */
+struct file *kernel_file_open(const struct path *path, int flags,
                                struct inode *inode, const struct cred *cred)
 {
-       struct file *f = alloc_empty_file_noaccount(flags, cred);
-       if (!IS_ERR(f)) {
-               int error;
+       struct file *f;
+       int error;
 
-               f->f_path = *path;
-               error = do_dentry_open(f, inode, NULL);
-               if (error) {
-                       fput(f);
-                       f = ERR_PTR(error);
-               }
+       f = alloc_empty_file_noaccount(flags, cred);
+       if (IS_ERR(f))
+               return f;
+
+       f->f_path = *path;
+       error = do_dentry_open(f, inode, NULL);
+       if (error) {
+               fput(f);
+               f = ERR_PTR(error);
        }
        return f;
 }
-EXPORT_SYMBOL(open_with_fake_path);
+EXPORT_SYMBOL_GPL(kernel_file_open);
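
A hedged usage sketch for the renamed helper, following the kerneldoc above (the caller is hypothetical; the returned file is not accounted, so it must stay internal to the kernel and be released with fput()):

    /* Sketch only -- in-kernel consumer that already holds a struct path. */
    static struct file *example_kernel_open(const struct path *path)
    {
            struct file *f;

            f = kernel_file_open(path, O_RDONLY, d_inode(path->dentry),
                                 current_cred());
            if (IS_ERR(f))
                    return f;
            /* ... read from f ..., then fput(f) when finished */
            return f;
    }
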
+
+/**
+ * backing_file_open - open a backing file for kernel internal use
+ * @path:      path of the file to open
+ * @flags:     open flags
+ * @real_path: path of the backing file
+ * @cred:      credentials for open
+ *
+ * Open a backing file for a stackable filesystem (e.g., overlayfs).
+ * @path may be on the stackable filesystem and the backing inode on the
+ * underlying filesystem. In this case, we want to be able to return
+ * the @real_path of the backing inode. This is done by embedding the
+ * returned file into a container structure that also stores the path of
+ * the backing inode on the underlying filesystem, which can be
+ * retrieved using backing_file_real_path().
+ */
+struct file *backing_file_open(const struct path *path, int flags,
+                              const struct path *real_path,
+                              const struct cred *cred)
+{
+       struct file *f;
+       int error;
+
+       f = alloc_empty_backing_file(flags, cred);
+       if (IS_ERR(f))
+               return f;
+
+       f->f_path = *path;
+       path_get(real_path);
+       *backing_file_real_path(f) = *real_path;
+       error = do_dentry_open(f, d_inode(real_path->dentry), NULL);
+       if (error) {
+               fput(f);
+               f = ERR_PTR(error);
+       }
+
+       return f;
+}
+EXPORT_SYMBOL_GPL(backing_file_open);
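
A sketch of the intended calling pattern for stacking filesystems, per the kerneldoc above (names are illustrative; backing_file_real_path() is the accessor the function body itself uses to store the underlying path):

    /* Sketch only -- not part of the patch. */
    static struct file *example_stacked_open(const struct path *stacked,
                                             const struct path *lower,
                                             int flags)
    {
            struct file *f = backing_file_open(stacked, flags, lower,
                                               current_cred());

            if (!IS_ERR(f))
                    pr_debug("underlying dentry: %pd\n",
                             backing_file_real_path(f)->dentry);
            return f;
    }
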
 
 #define WILL_CREATE(flags)     (flags & (O_CREAT | __O_TMPFILE))
 #define O_PATH_FLAGS           (O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC)
@@ -1156,7 +1202,7 @@ inline struct open_how build_open_how(int flags, umode_t mode)
 inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 {
        u64 flags = how->flags;
-       u64 strip = FMODE_NONOTIFY | O_CLOEXEC;
+       u64 strip = __FMODE_NONOTIFY | O_CLOEXEC;
        int lookup_flags = 0;
        int acc_mode = ACC_MODE(flags);
 
index 1a4301a..d683722 100644 (file)
@@ -337,6 +337,26 @@ out:
        return ret;
 }
 
+static ssize_t orangefs_file_splice_read(struct file *in, loff_t *ppos,
+                                        struct pipe_inode_info *pipe,
+                                        size_t len, unsigned int flags)
+{
+       struct inode *inode = file_inode(in);
+       ssize_t ret;
+
+       orangefs_stats.reads++;
+
+       down_read(&inode->i_rwsem);
+       ret = orangefs_revalidate_mapping(inode);
+       if (ret)
+               goto out;
+
+       ret = filemap_splice_read(in, ppos, pipe, len, flags);
+out:
+       up_read(&inode->i_rwsem);
+       return ret;
+}
+
 static ssize_t orangefs_file_write_iter(struct kiocb *iocb,
     struct iov_iter *iter)
 {
@@ -556,7 +576,7 @@ const struct file_operations orangefs_file_operations = {
        .lock           = orangefs_lock,
        .mmap           = orangefs_file_mmap,
        .open           = generic_file_open,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = orangefs_file_splice_read,
        .splice_write   = iter_file_splice_write,
        .flush          = orangefs_flush,
        .release        = orangefs_file_release,
index 7c04f03..1f93a3a 100644 (file)
@@ -34,8 +34,8 @@ static char ovl_whatisit(struct inode *inode, struct inode *realinode)
                return 'm';
 }
 
-/* No atime modification nor notify on underlying */
-#define OVL_OPEN_FLAGS (O_NOATIME | FMODE_NONOTIFY)
+/* No atime modification on underlying */
+#define OVL_OPEN_FLAGS (O_NOATIME)
 
 static struct file *ovl_open_realfile(const struct file *file,
                                      const struct path *realpath)
@@ -61,8 +61,8 @@ static struct file *ovl_open_realfile(const struct file *file,
                if (!inode_owner_or_capable(real_idmap, realinode))
                        flags &= ~O_NOATIME;
 
-               realfile = open_with_fake_path(&file->f_path, flags, realinode,
-                                              current_cred());
+               realfile = backing_file_open(&file->f_path, flags, realpath,
+                                            current_cred());
        }
        revert_creds(old_cred);
 
@@ -419,6 +419,27 @@ out_unlock:
        return ret;
 }
 
+static ssize_t ovl_splice_read(struct file *in, loff_t *ppos,
+                              struct pipe_inode_info *pipe, size_t len,
+                              unsigned int flags)
+{
+       const struct cred *old_cred;
+       struct fd real;
+       ssize_t ret;
+
+       ret = ovl_real_fdget(in, &real);
+       if (ret)
+               return ret;
+
+       old_cred = ovl_override_creds(file_inode(in)->i_sb);
+       ret = vfs_splice_read(real.file, ppos, pipe, len, flags);
+       revert_creds(old_cred);
+       ovl_file_accessed(in);
+
+       fdput(real);
+       return ret;
+}
+
 /*
  * Calling iter_file_splice_write() directly from overlay's f_op may deadlock
  * due to lock order inversion between pipe->mutex in iter_file_splice_write()
@@ -695,7 +716,7 @@ const struct file_operations ovl_file_operations = {
        .fallocate      = ovl_fallocate,
        .fadvise        = ovl_fadvise,
        .flush          = ovl_flush,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = ovl_splice_read,
        .splice_write   = ovl_splice_write,
 
        .copy_file_range        = ovl_copy_file_range,
index 4d0b278..23686e8 100644 (file)
@@ -329,8 +329,9 @@ static inline struct file *ovl_do_tmpfile(struct ovl_fs *ofs,
                                          struct dentry *dentry, umode_t mode)
 {
        struct path path = { .mnt = ovl_upper_mnt(ofs), .dentry = dentry };
-       struct file *file = vfs_tmpfile_open(ovl_upper_mnt_idmap(ofs), &path, mode,
-                                       O_LARGEFILE | O_WRONLY, current_cred());
+       struct file *file = kernel_tmpfile_open(ovl_upper_mnt_idmap(ofs), &path,
+                                               mode, O_LARGEFILE | O_WRONLY,
+                                               current_cred());
        int err = PTR_ERR_OR_ZERO(file);
 
        pr_debug("tmpfile(%pd2, 0%o) = %i\n", dentry, mode, err);
index 3cede8b..e4d0340 100644 (file)
@@ -216,7 +216,7 @@ static struct mount *next_group(struct mount *m, struct mount *origin)
 static struct mount *last_dest, *first_source, *last_source, *dest_master;
 static struct hlist_head *list;
 
-static inline bool peers(struct mount *m1, struct mount *m2)
+static inline bool peers(const struct mount *m1, const struct mount *m2)
 {
        return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
 }
@@ -354,6 +354,46 @@ static inline int do_refcount_check(struct mount *mnt, int count)
        return mnt_get_count(mnt) > count;
 }
 
+/**
+ * propagation_would_overmount - check whether propagation from @from
+ *                               would overmount @to
+ * @from: shared mount
+ * @to:   mount to check
+ * @mp:   future mountpoint of @to on @from
+ *
+ * If @from propagates mounts to @to, @from and @to must either be peers
+ * or one of the masters in the hierarchy of masters of @to must be a
+ * peer of @from.
+ *
+ * If the root of the @to mount is equal to the future mountpoint @mp of
+ * the @to mount on @from then @to will be overmounted by whatever is
+ * propagated to it.
+ *
+ * Context: This function expects namespace_lock() to be held and that
+ *          @mp is stable.
+ * Return: If @from overmounts @to, true is returned, false if not.
+ */
+bool propagation_would_overmount(const struct mount *from,
+                                const struct mount *to,
+                                const struct mountpoint *mp)
+{
+       if (!IS_MNT_SHARED(from))
+               return false;
+
+       if (IS_MNT_NEW(to))
+               return false;
+
+       if (to->mnt.mnt_root != mp->m_dentry)
+               return false;
+
+       for (const struct mount *m = to; m; m = m->mnt_master) {
+               if (peers(from, m))
+                       return true;
+       }
+
+       return false;
+}
+
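
A hedged sketch of how an attach or move path might consult the new predicate (hypothetical wrapper; real callers hold namespace_lock() and have a pinned @mp, as the kerneldoc requires):

    /* Sketch only -- not part of the patch. */
    static int example_check_attach(const struct mount *from,
                                    const struct mount *to,
                                    const struct mountpoint *mp)
    {
            if (propagation_would_overmount(from, to, mp))
                    return -EINVAL;     /* would be hidden by propagation */
            return 0;
    }
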
 /*
  * check if the mount 'mnt' can be unmounted successfully.
  * @mnt: the mount to be checked for unmount
index 988f1aa..0b02a63 100644 (file)
@@ -53,4 +53,7 @@ struct mount *copy_tree(struct mount *, struct dentry *, int);
 bool is_path_reachable(struct mount *, struct dentry *,
                         const struct path *root);
 int count_mounts(struct mnt_namespace *ns, struct mount *mnt);
+bool propagation_would_overmount(const struct mount *from,
+                                const struct mount *to,
+                                const struct mountpoint *mp);
 #endif /* _LINUX_PNODE_H */
index f495fdb..67b09a1 100644 (file)
@@ -591,7 +591,7 @@ static const struct file_operations proc_iter_file_ops = {
        .llseek         = proc_reg_llseek,
        .read_iter      = proc_reg_read_iter,
        .write          = proc_reg_write,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = copy_splice_read,
        .poll           = proc_reg_poll,
        .unlocked_ioctl = proc_reg_unlocked_ioctl,
        .mmap           = proc_reg_mmap,
@@ -617,7 +617,7 @@ static const struct file_operations proc_reg_file_ops_compat = {
 static const struct file_operations proc_iter_file_ops_compat = {
        .llseek         = proc_reg_llseek,
        .read_iter      = proc_reg_read_iter,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = copy_splice_read,
        .write          = proc_reg_write,
        .poll           = proc_reg_poll,
        .unlocked_ioctl = proc_reg_unlocked_ioctl,
index 25b44b3..5d0cf59 100644 (file)
@@ -419,7 +419,7 @@ static ssize_t read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
                char *notes;
                size_t i = 0;
 
-               strlcpy(prpsinfo.pr_psargs, saved_command_line,
+               strscpy(prpsinfo.pr_psargs, saved_command_line,
                        sizeof(prpsinfo.pr_psargs));
 
                notes = kzalloc(notes_len, GFP_KERNEL);
index b43d0bd..8dca4d6 100644 (file)
@@ -168,6 +168,11 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
                    global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+       show_val_kb(m, "Unaccepted:     ",
+                   global_zone_page_state(NR_UNACCEPTED));
+#endif
+
        hugetlb_report_meminfo(m);
 
        arch_report_meminfo(m);
index 8038833..4e54889 100644 (file)
@@ -29,9 +29,8 @@ static const struct file_operations proc_sys_dir_file_operations;
 static const struct inode_operations proc_sys_dir_operations;
 
 /* Support for permanently empty directories */
-
 struct ctl_table sysctl_mount_point[] = {
-       { }
+       {.type = SYSCTL_TABLE_TYPE_PERMANENTLY_EMPTY }
 };
 
 /**
@@ -48,21 +47,14 @@ struct ctl_table_header *register_sysctl_mount_point(const char *path)
 }
 EXPORT_SYMBOL(register_sysctl_mount_point);
 
-static bool is_empty_dir(struct ctl_table_header *head)
-{
-       return head->ctl_table[0].child == sysctl_mount_point;
-}
-
-static void set_empty_dir(struct ctl_dir *dir)
-{
-       dir->header.ctl_table[0].child = sysctl_mount_point;
-}
-
-static void clear_empty_dir(struct ctl_dir *dir)
-
-{
-       dir->header.ctl_table[0].child = NULL;
-}
+#define sysctl_is_perm_empty_ctl_table(tptr)           \
+       (tptr[0].type == SYSCTL_TABLE_TYPE_PERMANENTLY_EMPTY)
+#define sysctl_is_perm_empty_ctl_header(hptr)          \
+       (sysctl_is_perm_empty_ctl_table(hptr->ctl_table))
+#define sysctl_set_perm_empty_ctl_header(hptr)         \
+       (hptr->ctl_table[0].type = SYSCTL_TABLE_TYPE_PERMANENTLY_EMPTY)
+#define sysctl_clear_perm_empty_ctl_header(hptr)       \
+       (hptr->ctl_table[0].type = SYSCTL_TABLE_TYPE_DEFAULT)
 
 void proc_sys_poll_notify(struct ctl_table_poll *poll)
 {
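
With the new ->type field, a permanently empty sysctl directory is recognized by value rather than by comparing against the address of the sysctl_mount_point array. A sketch of declaring such a table directly (hypothetical name, mirroring sysctl_mount_point above):

    /* Sketch only -- not part of the patch. */
    static struct ctl_table example_empty_dir[] = {
            { .type = SYSCTL_TABLE_TYPE_PERMANENTLY_EMPTY },
    };
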
@@ -230,20 +222,22 @@ static void erase_header(struct ctl_table_header *head)
 static int insert_header(struct ctl_dir *dir, struct ctl_table_header *header)
 {
        struct ctl_table *entry;
+       struct ctl_table_header *dir_h = &dir->header;
        int err;
 
+
        /* Is this a permanently empty directory? */
-       if (is_empty_dir(&dir->header))
+       if (sysctl_is_perm_empty_ctl_header(dir_h))
                return -EROFS;
 
        /* Am I creating a permanently empty directory? */
-       if (header->ctl_table == sysctl_mount_point) {
+       if (sysctl_is_perm_empty_ctl_table(header->ctl_table)) {
                if (!RB_EMPTY_ROOT(&dir->root))
                        return -EINVAL;
-               set_empty_dir(dir);
+               sysctl_set_perm_empty_ctl_header(dir_h);
        }
 
-       dir->header.nreg++;
+       dir_h->nreg++;
        header->parent = dir;
        err = insert_links(header);
        if (err)
@@ -259,9 +253,9 @@ fail:
        put_links(header);
 fail_links:
        if (header->ctl_table == sysctl_mount_point)
-               clear_empty_dir(dir);
+               sysctl_clear_perm_empty_ctl_header(dir_h);
        header->parent = NULL;
-       drop_sysctl_table(&dir->header);
+       drop_sysctl_table(dir_h);
        return err;
 }
 
@@ -479,7 +473,7 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
                inode->i_mode |= S_IFDIR;
                inode->i_op = &proc_sys_dir_operations;
                inode->i_fop = &proc_sys_dir_file_operations;
-               if (is_empty_dir(head))
+               if (sysctl_is_perm_empty_ctl_header(head))
                        make_empty_dir_inode(inode);
        }
 
@@ -868,7 +862,7 @@ static const struct file_operations proc_sys_file_operations = {
        .poll           = proc_sys_poll,
        .read_iter      = proc_sys_read,
        .write_iter     = proc_sys_write,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = copy_splice_read,
        .splice_write   = iter_file_splice_write,
        .llseek         = default_llseek,
 };
@@ -1136,9 +1130,6 @@ static int sysctl_check_table(const char *path, struct ctl_table *table)
        struct ctl_table *entry;
        int err = 0;
        list_for_each_table_entry(entry, table) {
-               if (entry->child)
-                       err |= sysctl_err(path, entry, "Not a file");
-
                if ((entry->proc_handler == proc_dostring) ||
                    (entry->proc_handler == proc_dobool) ||
                    (entry->proc_handler == proc_dointvec) ||
@@ -1406,7 +1397,6 @@ fail_put_dir_locked:
        spin_unlock(&sysctl_lock);
 fail:
        kfree(header);
-       dump_stack();
        return NULL;
 }
 
@@ -1466,185 +1456,6 @@ void __init __register_sysctl_init(const char *path, struct ctl_table *table,
        kmemleak_not_leak(hdr);
 }
 
-static char *append_path(const char *path, char *pos, const char *name)
-{
-       int namelen;
-       namelen = strlen(name);
-       if (((pos - path) + namelen + 2) >= PATH_MAX)
-               return NULL;
-       memcpy(pos, name, namelen);
-       pos[namelen] = '/';
-       pos[namelen + 1] = '\0';
-       pos += namelen + 1;
-       return pos;
-}
-
-static int count_subheaders(struct ctl_table *table)
-{
-       int has_files = 0;
-       int nr_subheaders = 0;
-       struct ctl_table *entry;
-
-       /* special case: no directory and empty directory */
-       if (!table || !table->procname)
-               return 1;
-
-       list_for_each_table_entry(entry, table) {
-               if (entry->child)
-                       nr_subheaders += count_subheaders(entry->child);
-               else
-                       has_files = 1;
-       }
-       return nr_subheaders + has_files;
-}
-
-static int register_leaf_sysctl_tables(const char *path, char *pos,
-       struct ctl_table_header ***subheader, struct ctl_table_set *set,
-       struct ctl_table *table)
-{
-       struct ctl_table *ctl_table_arg = NULL;
-       struct ctl_table *entry, *files;
-       int nr_files = 0;
-       int nr_dirs = 0;
-       int err = -ENOMEM;
-
-       list_for_each_table_entry(entry, table) {
-               if (entry->child)
-                       nr_dirs++;
-               else
-                       nr_files++;
-       }
-
-       files = table;
-       /* If there are mixed files and directories we need a new table */
-       if (nr_dirs && nr_files) {
-               struct ctl_table *new;
-               files = kcalloc(nr_files + 1, sizeof(struct ctl_table),
-                               GFP_KERNEL);
-               if (!files)
-                       goto out;
-
-               ctl_table_arg = files;
-               new = files;
-
-               list_for_each_table_entry(entry, table) {
-                       if (entry->child)
-                               continue;
-                       *new = *entry;
-                       new++;
-               }
-       }
-
-       /* Register everything except a directory full of subdirectories */
-       if (nr_files || !nr_dirs) {
-               struct ctl_table_header *header;
-               header = __register_sysctl_table(set, path, files);
-               if (!header) {
-                       kfree(ctl_table_arg);
-                       goto out;
-               }
-
-               /* Remember if we need to free the file table */
-               header->ctl_table_arg = ctl_table_arg;
-               **subheader = header;
-               (*subheader)++;
-       }
-
-       /* Recurse into the subdirectories. */
-       list_for_each_table_entry(entry, table) {
-               char *child_pos;
-
-               if (!entry->child)
-                       continue;
-
-               err = -ENAMETOOLONG;
-               child_pos = append_path(path, pos, entry->procname);
-               if (!child_pos)
-                       goto out;
-
-               err = register_leaf_sysctl_tables(path, child_pos, subheader,
-                                                 set, entry->child);
-               pos[0] = '\0';
-               if (err)
-                       goto out;
-       }
-       err = 0;
-out:
-       /* On failure our caller will unregister all registered subheaders */
-       return err;
-}
-
-/**
- * register_sysctl_table - register a sysctl table hierarchy
- * @table: the top-level table structure
- *
- * Register a sysctl table hierarchy. @table should be a filled in ctl_table
- * array. A completely 0 filled entry terminates the table.
- * We are slowly deprecating this call so avoid its use.
- */
-struct ctl_table_header *register_sysctl_table(struct ctl_table *table)
-{
-       struct ctl_table *ctl_table_arg = table;
-       int nr_subheaders = count_subheaders(table);
-       struct ctl_table_header *header = NULL, **subheaders, **subheader;
-       char *new_path, *pos;
-
-       pos = new_path = kmalloc(PATH_MAX, GFP_KERNEL);
-       if (!new_path)
-               return NULL;
-
-       pos[0] = '\0';
-       while (table->procname && table->child && !table[1].procname) {
-               pos = append_path(new_path, pos, table->procname);
-               if (!pos)
-                       goto out;
-               table = table->child;
-       }
-       if (nr_subheaders == 1) {
-               header = __register_sysctl_table(&sysctl_table_root.default_set, new_path, table);
-               if (header)
-                       header->ctl_table_arg = ctl_table_arg;
-       } else {
-               header = kzalloc(sizeof(*header) +
-                                sizeof(*subheaders)*nr_subheaders, GFP_KERNEL);
-               if (!header)
-                       goto out;
-
-               subheaders = (struct ctl_table_header **) (header + 1);
-               subheader = subheaders;
-               header->ctl_table_arg = ctl_table_arg;
-
-               if (register_leaf_sysctl_tables(new_path, pos, &subheader,
-                                               &sysctl_table_root.default_set, table))
-                       goto err_register_leaves;
-       }
-
-out:
-       kfree(new_path);
-       return header;
-
-err_register_leaves:
-       while (subheader > subheaders) {
-               struct ctl_table_header *subh = *(--subheader);
-               struct ctl_table *table = subh->ctl_table_arg;
-               unregister_sysctl_table(subh);
-               kfree(table);
-       }
-       kfree(header);
-       header = NULL;
-       goto out;
-}
-EXPORT_SYMBOL(register_sysctl_table);
-
-int __register_sysctl_base(struct ctl_table *base_table)
-{
-       struct ctl_table_header *hdr;
-
-       hdr = register_sysctl_table(base_table);
-       kmemleak_not_leak(hdr);
-       return 0;
-}
-
 static void put_links(struct ctl_table_header *header)
 {
        struct ctl_table_set *root_set = &sysctl_table_root.default_set;
@@ -1700,35 +1511,18 @@ static void drop_sysctl_table(struct ctl_table_header *header)
 
 /**
  * unregister_sysctl_table - unregister a sysctl table hierarchy
- * @header: the header returned from register_sysctl_table
+ * @header: the header returned from register_sysctl or __register_sysctl_table
  *
  * Unregisters the sysctl table and all children. proc entries may not
  * actually be removed until they are no longer used by anyone.
  */
 void unregister_sysctl_table(struct ctl_table_header * header)
 {
-       int nr_subheaders;
        might_sleep();
 
        if (header == NULL)
                return;
 
-       nr_subheaders = count_subheaders(header->ctl_table_arg);
-       if (unlikely(nr_subheaders > 1)) {
-               struct ctl_table_header **subheaders;
-               int i;
-
-               subheaders = (struct ctl_table_header **)(header + 1);
-               for (i = nr_subheaders -1; i >= 0; i--) {
-                       struct ctl_table_header *subh = subheaders[i];
-                       struct ctl_table *table = subh->ctl_table_arg;
-                       unregister_sysctl_table(subh);
-                       kfree(table);
-               }
-               kfree(header);
-               return;
-       }
-
        spin_lock(&sysctl_lock);
        drop_sysctl_table(header);
        spin_unlock(&sysctl_lock);
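
With register_sysctl_table() and its recursive leaf-registration helpers removed, sysctl users register a flat ctl_table directly at an explicit path. A minimal sketch of the replacement pattern, mirroring the register_sysctl_init() call that fs/sysctls.c switches to later in this merge (the table, path and value below are invented for illustration):

static int example_value;

static struct ctl_table example_sysctls[] = {
	{
		.procname	= "example_value",
		.data		= &example_value,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	{ }	/* zero-filled terminator */
};

static int __init example_sysctl_init(void)
{
	/* Appears as /proc/sys/fs/example/example_value; never unregistered. */
	register_sysctl_init("fs/example", example_sysctls);
	return 0;
}
early_initcall(example_sysctl_init);
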
index 420510f..507cd4e 100644 (file)
@@ -538,13 +538,14 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
        bool locked = !!(vma->vm_flags & VM_LOCKED);
        struct page *page = NULL;
        bool migration = false, young = false, dirty = false;
+       pte_t ptent = ptep_get(pte);
 
-       if (pte_present(*pte)) {
-               page = vm_normal_page(vma, addr, *pte);
-               young = pte_young(*pte);
-               dirty = pte_dirty(*pte);
-       } else if (is_swap_pte(*pte)) {
-               swp_entry_t swpent = pte_to_swp_entry(*pte);
+       if (pte_present(ptent)) {
+               page = vm_normal_page(vma, addr, ptent);
+               young = pte_young(ptent);
+               dirty = pte_dirty(ptent);
+       } else if (is_swap_pte(ptent)) {
+               swp_entry_t swpent = pte_to_swp_entry(ptent);
 
                if (!non_swap_entry(swpent)) {
                        int mapcount;
@@ -631,14 +632,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
                goto out;
        }
 
-       if (pmd_trans_unstable(pmd))
-               goto out;
-       /*
-        * The mmap_lock held all the way back in m_start() is what
-        * keeps khugepaged out of here and from collapsing things
-        * in here.
-        */
        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+       if (!pte) {
+               walk->action = ACTION_AGAIN;
+               return 0;
+       }
        for (; addr != end; pte++, addr += PAGE_SIZE)
                smaps_pte_entry(pte, addr, walk);
        pte_unmap_unlock(pte - 1, ptl);
@@ -735,11 +733,12 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
        struct mem_size_stats *mss = walk->private;
        struct vm_area_struct *vma = walk->vma;
        struct page *page = NULL;
+       pte_t ptent = ptep_get(pte);
 
-       if (pte_present(*pte)) {
-               page = vm_normal_page(vma, addr, *pte);
-       } else if (is_swap_pte(*pte)) {
-               swp_entry_t swpent = pte_to_swp_entry(*pte);
+       if (pte_present(ptent)) {
+               page = vm_normal_page(vma, addr, ptent);
+       } else if (is_swap_pte(ptent)) {
+               swp_entry_t swpent = pte_to_swp_entry(ptent);
 
                if (is_pfn_swap_entry(swpent))
                        page = pfn_swap_entry_to_page(swpent);
@@ -1108,7 +1107,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
         * Documentation/admin-guide/mm/soft-dirty.rst for full description
         * of how soft-dirty works.
         */
-       pte_t ptent = *pte;
+       pte_t ptent = ptep_get(pte);
 
        if (pte_present(ptent)) {
                pte_t old_pte;
@@ -1191,12 +1190,13 @@ out:
                return 0;
        }
 
-       if (pmd_trans_unstable(pmd))
-               return 0;
-
        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+       if (!pte) {
+               walk->action = ACTION_AGAIN;
+               return 0;
+       }
        for (; addr != end; pte++, addr += PAGE_SIZE) {
-               ptent = *pte;
+               ptent = ptep_get(pte);
 
                if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
                        clear_soft_dirty(vma, addr, pte);
@@ -1538,9 +1538,6 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
                spin_unlock(ptl);
                return err;
        }
-
-       if (pmd_trans_unstable(pmdp))
-               return 0;
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
        /*
@@ -1548,10 +1545,14 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
         * goes beyond vma->vm_end.
         */
        orig_pte = pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl);
+       if (!pte) {
+               walk->action = ACTION_AGAIN;
+               return err;
+       }
        for (; addr < end; pte++, addr += PAGE_SIZE) {
                pagemap_entry_t pme;
 
-               pme = pte_to_pagemap_entry(pm, vma, addr, *pte);
+               pme = pte_to_pagemap_entry(pm, vma, addr, ptep_get(pte));
                err = add_to_pagemap(addr, &pme, pm);
                if (err)
                        break;
@@ -1689,23 +1690,23 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
        /* watch out for wraparound */
        start_vaddr = end_vaddr;
        if (svpfn <= (ULONG_MAX >> PAGE_SHIFT)) {
+               unsigned long end;
+
                ret = mmap_read_lock_killable(mm);
                if (ret)
                        goto out_free;
                start_vaddr = untagged_addr_remote(mm, svpfn << PAGE_SHIFT);
                mmap_read_unlock(mm);
+
+               end = start_vaddr + ((count / PM_ENTRY_BYTES) << PAGE_SHIFT);
+               if (end >= start_vaddr && end < mm->task_size)
+                       end_vaddr = end;
        }
 
        /* Ensure the address is inside the task */
        if (start_vaddr > mm->task_size)
                start_vaddr = end_vaddr;
 
-       /*
-        * The odds are that this will stop walking way
-        * before end_vaddr, because the length of the
-        * user buffer is tracked in "pm", and the walk
-        * will stop when we hit the end of the buffer.
-        */
        ret = 0;
        while (count && (start_vaddr < end_vaddr)) {
                int len;
@@ -1887,16 +1888,18 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
                spin_unlock(ptl);
                return 0;
        }
-
-       if (pmd_trans_unstable(pmd))
-               return 0;
 #endif
        orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+       if (!pte) {
+               walk->action = ACTION_AGAIN;
+               return 0;
+       }
        do {
-               struct page *page = can_gather_numa_stats(*pte, vma, addr);
+               pte_t ptent = ptep_get(pte);
+               struct page *page = can_gather_numa_stats(ptent, vma, addr);
                if (!page)
                        continue;
-               gather_stats(page, md, pte_dirty(*pte), 1);
+               gather_stats(page, md, pte_dirty(ptent), 1);
 
        } while (pte++, addr += PAGE_SIZE, addr != end);
        pte_unmap_unlock(orig_pte, ptl);
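
These fs/proc/task_mmu.c hunks all follow the same new rule: pte_offset_map_lock() may now fail (for instance when the page table is being withdrawn or replaced by a huge entry underneath us), so the old pmd_trans_unstable() checks go away, the walker retries via ACTION_AGAIN, and individual entries are read once through ptep_get() instead of dereferencing the pte pointer. Condensed into one hedged sketch of a pagewalk callback (names illustrative):

static int example_pte_range(pmd_t *pmd, unsigned long addr,
			     unsigned long end, struct mm_walk *walk)
{
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
	if (!pte) {
		/* Lost a race with THP collapse/zap: redo this range. */
		walk->action = ACTION_AGAIN;
		return 0;
	}
	for (; addr != end; pte++, addr += PAGE_SIZE) {
		pte_t ptent = ptep_get(pte);	/* snapshot the entry once */

		/* ... inspect ptent via pte_present(), pte_dirty(), ... */
	}
	pte_unmap_unlock(pte - 1, ptl);
	return 0;
}
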
index 0ec3507..2c8b622 100644 (file)
@@ -51,7 +51,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
                sbytes += kobjsize(mm);
        else
                bytes += kobjsize(mm);
-       
+
        if (current->fs && current->fs->users > 1)
                sbytes += kobjsize(current->fs);
        else
@@ -69,13 +69,13 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 
        bytes += kobjsize(current); /* includes kernel stack */
 
+       mmap_read_unlock(mm);
+
        seq_printf(m,
                "Mem:\t%8lu bytes\n"
                "Slack:\t%8lu bytes\n"
                "Shared:\t%8lu bytes\n",
                bytes, slack, sbytes);
-
-       mmap_read_unlock(mm);
 }
 
 unsigned long task_vsize(struct mm_struct *mm)
index 03f5963..cb80a77 100644 (file)
@@ -877,7 +877,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
        phdr.p_offset  = roundup(note_off, PAGE_SIZE);
        phdr.p_vaddr   = phdr.p_paddr = 0;
        phdr.p_filesz  = phdr.p_memsz = phdr_sz;
-       phdr.p_align   = 0;
+       phdr.p_align   = 4;
 
        /* Add merged PT_NOTE program header*/
        tmp = elfptr + sizeof(Elf64_Ehdr);
@@ -1068,7 +1068,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
        phdr.p_offset  = roundup(note_off, PAGE_SIZE);
        phdr.p_vaddr   = phdr.p_paddr = 0;
        phdr.p_filesz  = phdr.p_memsz = phdr_sz;
-       phdr.p_align   = 0;
+       phdr.p_align   = 4;
 
        /* Add merged PT_NOTE program header*/
        tmp = elfptr + sizeof(Elf32_Ehdr);
index 846f945..250eb5b 100644 (file)
@@ -324,7 +324,7 @@ static int mountstats_open(struct inode *inode, struct file *file)
 const struct file_operations proc_mounts_operations = {
        .open           = mounts_open,
        .read_iter      = seq_read_iter,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = copy_splice_read,
        .llseek         = seq_lseek,
        .release        = mounts_release,
        .poll           = mounts_poll,
@@ -333,7 +333,7 @@ const struct file_operations proc_mounts_operations = {
 const struct file_operations proc_mountinfo_operations = {
        .open           = mountinfo_open,
        .read_iter      = seq_read_iter,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = copy_splice_read,
        .llseek         = seq_lseek,
        .release        = mounts_release,
        .poll           = mounts_poll,
@@ -342,7 +342,7 @@ const struct file_operations proc_mountinfo_operations = {
 const struct file_operations proc_mountstats_operations = {
        .open           = mountstats_open,
        .read_iter      = seq_read_iter,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = copy_splice_read,
        .llseek         = seq_lseek,
        .release        = mounts_release,
 };
index 4ae0cfc..de8cf5d 100644 (file)
@@ -263,9 +263,9 @@ static __init const char *early_boot_devpath(const char *initial_devname)
         * same scheme to find the device that we use for mounting
         * the root file system.
         */
-       dev_t dev = name_to_dev_t(initial_devname);
+       dev_t dev;
 
-       if (!dev) {
+       if (early_lookup_bdev(initial_devname, &dev)) {
                pr_err("failed to resolve '%s'!\n", initial_devname);
                return initial_devname;
        }
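
name_to_dev_t() signalled failure by returning 0, which is also a plausible device number; early_lookup_bdev() instead returns a negative errno and hands the dev_t back through a pointer. The call shape in isolation (a sketch; the helper name and error handling are illustrative):

static dev_t __init example_resolve_root(const char *name)
{
	dev_t dev;

	if (early_lookup_bdev(name, &dev)) {
		pr_err("failed to resolve '%s'\n", name);
		return 0;
	}
	return dev;
}
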
index ade66db..2f625e1 100644 (file)
@@ -875,7 +875,7 @@ fail_out:
        return err;
 }
 
-static int ramoops_remove(struct platform_device *pdev)
+static void ramoops_remove(struct platform_device *pdev)
 {
        struct ramoops_context *cxt = &oops_cxt;
 
@@ -885,8 +885,6 @@ static int ramoops_remove(struct platform_device *pdev)
        cxt->pstore.bufsize = 0;
 
        ramoops_free_przs(cxt);
-
-       return 0;
 }
 
 static const struct of_device_id dt_match[] = {
@@ -896,7 +894,7 @@ static const struct of_device_id dt_match[] = {
 
 static struct platform_driver ramoops_driver = {
        .probe          = ramoops_probe,
-       .remove         = ramoops_remove,
+       .remove_new     = ramoops_remove,
        .driver         = {
                .name           = "ramoops",
                .of_match_table = dt_match,
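
ramoops switches here to .remove_new, the void-returning variant of the platform-driver remove callback: the platform core never acted on the old int return value, so converted drivers simply stop reporting one. For a hypothetical driver the shape is:

static void exdrv_remove(struct platform_device *pdev)
{
	/* undo whatever exdrv_probe() set up; nothing useful to return */
}

static struct platform_driver exdrv_driver = {
	.probe		= exdrv_probe,
	.remove_new	= exdrv_remove,
	.driver		= {
		.name	= "exdrv",
	},
};
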
index 966191d..85aaf0f 100644 (file)
@@ -599,6 +599,8 @@ struct persistent_ram_zone *persistent_ram_new(phys_addr_t start, size_t size,
        raw_spin_lock_init(&prz->buffer_lock);
        prz->flags = flags;
        prz->label = kstrdup(label, GFP_KERNEL);
+       if (!prz->label)
+               goto err;
 
        ret = persistent_ram_buffer_map(start, size, prz, memtype);
        if (ret)
index 12af049..c7a1aa3 100644 (file)
@@ -43,7 +43,7 @@ const struct file_operations ramfs_file_operations = {
        .write_iter     = generic_file_write_iter,
        .mmap           = generic_file_mmap,
        .fsync          = noop_fsync,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .splice_write   = iter_file_splice_write,
        .llseek         = generic_file_llseek,
        .get_unmapped_area      = ramfs_mmu_get_unmapped_area,
index 9fbb9b5..efb1b4c 100644 (file)
@@ -43,7 +43,7 @@ const struct file_operations ramfs_file_operations = {
        .read_iter              = generic_file_read_iter,
        .write_iter             = generic_file_write_iter,
        .fsync                  = noop_fsync,
-       .splice_read            = generic_file_splice_read,
+       .splice_read            = filemap_splice_read,
        .splice_write           = iter_file_splice_write,
        .llseek                 = generic_file_llseek,
 };
index 5ba580c..fef477c 100644 (file)
@@ -278,7 +278,7 @@ int ramfs_init_fs_context(struct fs_context *fc)
        return 0;
 }
 
-static void ramfs_kill_sb(struct super_block *sb)
+void ramfs_kill_sb(struct super_block *sb)
 {
        kfree(sb->s_fs_info);
        kill_litter_super(sb);
index a21ba3b..b07de77 100644 (file)
@@ -29,7 +29,7 @@ const struct file_operations generic_ro_fops = {
        .llseek         = generic_file_llseek,
        .read_iter      = generic_file_read_iter,
        .mmap           = generic_file_readonly_mmap,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
 };
 
 EXPORT_SYMBOL(generic_ro_fops);
index 9c53edb..b264ce6 100644 (file)
@@ -131,7 +131,7 @@ struct old_linux_dirent {
        unsigned long   d_ino;
        unsigned long   d_offset;
        unsigned short  d_namlen;
-       char            d_name[1];
+       char            d_name[];
 };
 
 struct readdir_callback {
@@ -208,7 +208,7 @@ struct linux_dirent {
        unsigned long   d_ino;
        unsigned long   d_off;
        unsigned short  d_reclen;
-       char            d_name[1];
+       char            d_name[];
 };
 
 struct getdents_callback {
@@ -388,7 +388,7 @@ struct compat_old_linux_dirent {
        compat_ulong_t  d_ino;
        compat_ulong_t  d_offset;
        unsigned short  d_namlen;
-       char            d_name[1];
+       char            d_name[];
 };
 
 struct compat_readdir_callback {
@@ -460,7 +460,7 @@ struct compat_linux_dirent {
        compat_ulong_t  d_ino;
        compat_ulong_t  d_off;
        unsigned short  d_reclen;
-       char            d_name[1];
+       char            d_name[];
 };
 
 struct compat_getdents_callback {
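
The dirent structures drop the old "char d_name[1]" trick in favour of a C99 flexible array member, so the compiler (and FORTIFY/UBSAN bounds checking) can see that d_name runs to the end of the allocation rather than being a one-byte array that is deliberately overrun. Record sizes keep being computed from the offset of the flexible member; roughly, with an illustrative struct (the +2 leaves room for the NUL and the d_type byte, as in readdir.c):

struct example_dirent {
	unsigned long	d_ino;
	unsigned long	d_off;
	unsigned short	d_reclen;
	char		d_name[];	/* flexible array member, was d_name[1] */
};

/* bytes needed for one record carrying a namlen-byte name */
size_t reclen = ALIGN(offsetof(struct example_dirent, d_name) + namlen + 2,
		      sizeof(long));
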
index b54cc70..8eb3ad3 100644 (file)
@@ -247,7 +247,7 @@ const struct file_operations reiserfs_file_operations = {
        .fsync = reiserfs_sync_file,
        .read_iter = generic_file_read_iter,
        .write_iter = generic_file_write_iter,
-       .splice_read = generic_file_splice_read,
+       .splice_read = filemap_splice_read,
        .splice_write = iter_file_splice_write,
        .llseek = generic_file_llseek,
 };
index d8debbb..77bd3b2 100644 (file)
@@ -2506,7 +2506,7 @@ out:
 
 /*
  * mason@suse.com: updated in 2.5.54 to follow the same general io
- * start/recovery path as __block_write_full_page, along with special
+ * start/recovery path as __block_write_full_folio, along with special
  * code to handle reiserfs tails.
  */
 static int reiserfs_write_full_page(struct page *page,
@@ -2872,6 +2872,7 @@ static int reiserfs_write_end(struct file *file, struct address_space *mapping,
                              loff_t pos, unsigned len, unsigned copied,
                              struct page *page, void *fsdata)
 {
+       struct folio *folio = page_folio(page);
        struct inode *inode = page->mapping->host;
        int ret = 0;
        int update_sd = 0;
@@ -2887,12 +2888,12 @@ static int reiserfs_write_end(struct file *file, struct address_space *mapping,
 
        start = pos & (PAGE_SIZE - 1);
        if (unlikely(copied < len)) {
-               if (!PageUptodate(page))
+               if (!folio_test_uptodate(folio))
                        copied = 0;
 
-               page_zero_new_buffers(page, start + copied, start + len);
+               folio_zero_new_buffers(folio, start + copied, start + len);
        }
-       flush_dcache_page(page);
+       flush_dcache_folio(folio);
 
        reiserfs_commit_page(inode, page, start, start + copied);
 
index 4d11d60..479aa4a 100644 (file)
@@ -2589,7 +2589,12 @@ static void release_journal_dev(struct super_block *super,
                               struct reiserfs_journal *journal)
 {
        if (journal->j_dev_bd != NULL) {
-               blkdev_put(journal->j_dev_bd, journal->j_dev_mode);
+               void *holder = NULL;
+
+               if (journal->j_dev_bd->bd_dev != super->s_dev)
+                       holder = journal;
+
+               blkdev_put(journal->j_dev_bd, holder);
                journal->j_dev_bd = NULL;
        }
 }
@@ -2598,9 +2603,10 @@ static int journal_init_dev(struct super_block *super,
                            struct reiserfs_journal *journal,
                            const char *jdev_name)
 {
+       blk_mode_t blkdev_mode = BLK_OPEN_READ;
+       void *holder = journal;
        int result;
        dev_t jdev;
-       fmode_t blkdev_mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
 
        result = 0;
 
@@ -2608,16 +2614,15 @@ static int journal_init_dev(struct super_block *super,
        jdev = SB_ONDISK_JOURNAL_DEVICE(super) ?
            new_decode_dev(SB_ONDISK_JOURNAL_DEVICE(super)) : super->s_dev;
 
-       if (bdev_read_only(super->s_bdev))
-               blkdev_mode = FMODE_READ;
+       if (!bdev_read_only(super->s_bdev))
+               blkdev_mode |= BLK_OPEN_WRITE;
 
        /* there is no "jdev" option and journal is on separate device */
        if ((!jdev_name || !jdev_name[0])) {
                if (jdev == super->s_dev)
-                       blkdev_mode &= ~FMODE_EXCL;
-               journal->j_dev_bd = blkdev_get_by_dev(jdev, blkdev_mode,
-                                                     journal);
-               journal->j_dev_mode = blkdev_mode;
+                       holder = NULL;
+               journal->j_dev_bd = blkdev_get_by_dev(jdev, blkdev_mode, holder,
+                                                     NULL);
                if (IS_ERR(journal->j_dev_bd)) {
                        result = PTR_ERR(journal->j_dev_bd);
                        journal->j_dev_bd = NULL;
@@ -2631,8 +2636,8 @@ static int journal_init_dev(struct super_block *super,
                return 0;
        }
 
-       journal->j_dev_mode = blkdev_mode;
-       journal->j_dev_bd = blkdev_get_by_path(jdev_name, blkdev_mode, journal);
+       journal->j_dev_bd = blkdev_get_by_path(jdev_name, blkdev_mode, holder,
+                                              NULL);
        if (IS_ERR(journal->j_dev_bd)) {
                result = PTR_ERR(journal->j_dev_bd);
                journal->j_dev_bd = NULL;
index 1bccf6a..55e8525 100644 (file)
@@ -300,7 +300,6 @@ struct reiserfs_journal {
        struct reiserfs_journal_cnode *j_first;
 
        struct block_device *j_dev_bd;
-       fmode_t j_dev_mode;
 
        /* first block on s_dev of reserved area journal */
        int j_1st_reserved_block;
index 6e0a099..078dd8c 100644 (file)
@@ -67,6 +67,7 @@ int reiserfs_security_init(struct inode *dir, struct inode *inode,
 
        sec->name = NULL;
        sec->value = NULL;
+       sec->length = 0;
 
        /* Don't add selinux attributes on xattrs - they'll never get used */
        if (IS_PRIVATE(dir))
index 1331a89..87ae4f0 100644 (file)
@@ -15,6 +15,7 @@
 #include <linux/mount.h>
 #include <linux/fs.h>
 #include <linux/dax.h>
+#include <linux/overflow.h>
 #include "internal.h"
 
 #include <linux/uaccess.h>
@@ -101,10 +102,12 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in,
 static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
                             bool write)
 {
+       loff_t tmp;
+
        if (unlikely(pos < 0 || len < 0))
                return -EINVAL;
 
-       if (unlikely((loff_t) (pos + len) < 0))
+       if (unlikely(check_add_overflow(pos, len, &tmp)))
                return -EINVAL;
 
        return security_file_permission(file, write ? MAY_WRITE : MAY_READ);
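
The old "(loff_t)(pos + len) < 0" test only works if the signed addition wraps in a well-behaved way, which the C standard does not promise; check_add_overflow() from <linux/overflow.h> performs the addition with defined semantics and reports whether it wrapped, discarding the dummy result. The same check in isolation (hypothetical helper name):

#include <linux/overflow.h>

static int example_range_ok(loff_t pos, loff_t len)
{
	loff_t end;

	if (pos < 0 || len < 0)
		return -EINVAL;
	if (check_add_overflow(pos, len, &end))	/* true means pos + len wrapped */
		return -EINVAL;
	return 0;
}
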
index 4578dc4..4520ca4 100644 (file)
@@ -78,7 +78,7 @@ static unsigned romfs_mmap_capabilities(struct file *file)
 const struct file_operations romfs_ro_fops = {
        .llseek                 = generic_file_llseek,
        .read_iter              = generic_file_read_iter,
-       .splice_read            = generic_file_splice_read,
+       .splice_read            = filemap_splice_read,
        .mmap                   = romfs_mmap,
        .get_unmapped_area      = romfs_get_unmapped_area,
        .mmap_capabilities      = romfs_mmap_capabilities,
index 43a4d86..4f4492e 100644 (file)
@@ -1376,7 +1376,7 @@ const struct file_operations cifs_file_ops = {
        .fsync = cifs_fsync,
        .flush = cifs_flush,
        .mmap  = cifs_file_mmap,
-       .splice_read = cifs_splice_read,
+       .splice_read = filemap_splice_read,
        .splice_write = iter_file_splice_write,
        .llseek = cifs_llseek,
        .unlocked_ioctl = cifs_ioctl,
@@ -1396,7 +1396,7 @@ const struct file_operations cifs_file_strict_ops = {
        .fsync = cifs_strict_fsync,
        .flush = cifs_flush,
        .mmap = cifs_file_strict_mmap,
-       .splice_read = cifs_splice_read,
+       .splice_read = filemap_splice_read,
        .splice_write = iter_file_splice_write,
        .llseek = cifs_llseek,
        .unlocked_ioctl = cifs_ioctl,
@@ -1416,7 +1416,7 @@ const struct file_operations cifs_file_direct_ops = {
        .fsync = cifs_fsync,
        .flush = cifs_flush,
        .mmap = cifs_file_mmap,
-       .splice_read = direct_splice_read,
+       .splice_read = copy_splice_read,
        .splice_write = iter_file_splice_write,
        .unlocked_ioctl  = cifs_ioctl,
        .copy_file_range = cifs_copy_file_range,
@@ -1434,7 +1434,7 @@ const struct file_operations cifs_file_nobrl_ops = {
        .fsync = cifs_fsync,
        .flush = cifs_flush,
        .mmap  = cifs_file_mmap,
-       .splice_read = cifs_splice_read,
+       .splice_read = filemap_splice_read,
        .splice_write = iter_file_splice_write,
        .llseek = cifs_llseek,
        .unlocked_ioctl = cifs_ioctl,
@@ -1452,7 +1452,7 @@ const struct file_operations cifs_file_strict_nobrl_ops = {
        .fsync = cifs_strict_fsync,
        .flush = cifs_flush,
        .mmap = cifs_file_strict_mmap,
-       .splice_read = cifs_splice_read,
+       .splice_read = filemap_splice_read,
        .splice_write = iter_file_splice_write,
        .llseek = cifs_llseek,
        .unlocked_ioctl = cifs_ioctl,
@@ -1470,7 +1470,7 @@ const struct file_operations cifs_file_direct_nobrl_ops = {
        .fsync = cifs_fsync,
        .flush = cifs_flush,
        .mmap = cifs_file_mmap,
-       .splice_read = direct_splice_read,
+       .splice_read = copy_splice_read,
        .splice_write = iter_file_splice_write,
        .unlocked_ioctl  = cifs_ioctl,
        .copy_file_range = cifs_copy_file_range,
index 74cd6fa..d7274ee 100644 (file)
@@ -100,9 +100,6 @@ extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to);
 extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from);
 extern ssize_t cifs_direct_writev(struct kiocb *iocb, struct iov_iter *from);
 extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from);
-extern ssize_t cifs_splice_read(struct file *in, loff_t *ppos,
-                               struct pipe_inode_info *pipe, size_t len,
-                               unsigned int flags);
 extern int cifs_flock(struct file *pfile, int cmd, struct file_lock *plock);
 extern int cifs_lock(struct file *, int, struct file_lock *);
 extern int cifs_fsync(struct file *, loff_t, loff_t, int);
index 0512833..f30f6dd 100644 (file)
@@ -5083,19 +5083,3 @@ const struct address_space_operations cifs_addr_ops_smallbuf = {
        .launder_folio = cifs_launder_folio,
        .migrate_folio = filemap_migrate_folio,
 };
-
-/*
- * Splice data from a file into a pipe.
- */
-ssize_t cifs_splice_read(struct file *in, loff_t *ppos,
-                        struct pipe_inode_info *pipe, size_t len,
-                        unsigned int flags)
-{
-       if (unlikely(*ppos >= file_inode(in)->i_sb->s_maxbytes))
-               return 0;
-       if (unlikely(!len))
-               return 0;
-       if (in->f_flags & O_DIRECT)
-               return direct_splice_read(in, ppos, pipe, len, flags);
-       return filemap_splice_read(in, ppos, pipe, len, flags);
-}
index 17d6924..004eb1c 100644 (file)
@@ -300,20 +300,36 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
        kfree(spd->partial);
 }
 
-/*
- * Splice data from an O_DIRECT file into pages and then add them to the output
- * pipe.
+/**
+ * copy_splice_read -  Copy data from a file and splice the copy into a pipe
+ * @in: The file to read from
+ * @ppos: Pointer to the file position to read from
+ * @pipe: The pipe to splice into
+ * @len: The amount to splice
+ * @flags: The SPLICE_F_* flags
+ *
+ * This function allocates a bunch of pages sufficient to hold the requested
+ * amount of data (but limited by the remaining pipe capacity), passes it to
+ * the file's ->read_iter() to read into and then splices the used pages into
+ * the pipe.
+ *
+ * Return: On success, the number of bytes read will be returned and *@ppos
+ * will be updated if appropriate; 0 will be returned if there is no more data
+ * to be read; -EAGAIN will be returned if the pipe had no space, and some
+ * other negative error code will be returned on error.  A short read may occur
+ * if the pipe has insufficient space, we reach the end of the data or we hit a
+ * hole.
  */
-ssize_t direct_splice_read(struct file *in, loff_t *ppos,
-                          struct pipe_inode_info *pipe,
-                          size_t len, unsigned int flags)
+ssize_t copy_splice_read(struct file *in, loff_t *ppos,
+                        struct pipe_inode_info *pipe,
+                        size_t len, unsigned int flags)
 {
        struct iov_iter to;
        struct bio_vec *bv;
        struct kiocb kiocb;
        struct page **pages;
        ssize_t ret;
-       size_t used, npages, chunk, remain, reclaim;
+       size_t used, npages, chunk, remain, keep = 0;
        int i;
 
        /* Work out how much data we can actually add into the pipe */
@@ -327,7 +343,7 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
        if (!bv)
                return -ENOMEM;
 
-       pages = (void *)(bv + npages);
+       pages = (struct page **)(bv + npages);
        npages = alloc_pages_bulk_array(GFP_USER, npages, pages);
        if (!npages) {
                kfree(bv);
@@ -350,31 +366,25 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
        kiocb.ki_pos = *ppos;
        ret = call_read_iter(in, &kiocb, &to);
 
-       reclaim = npages * PAGE_SIZE;
-       remain = 0;
        if (ret > 0) {
-               reclaim -= ret;
-               remain = ret;
+               keep = DIV_ROUND_UP(ret, PAGE_SIZE);
                *ppos = kiocb.ki_pos;
-               file_accessed(in);
-       } else if (ret < 0) {
-               /*
-                * callers of ->splice_read() expect -EAGAIN on
-                * "can't put anything in there", rather than -EFAULT.
-                */
-               if (ret == -EFAULT)
-                       ret = -EAGAIN;
        }
 
+       /*
+        * Callers of ->splice_read() expect -EAGAIN on "can't put anything in
+        * there", rather than -EFAULT.
+        */
+       if (ret == -EFAULT)
+               ret = -EAGAIN;
+
        /* Free any pages that didn't get touched at all. */
-       reclaim /= PAGE_SIZE;
-       if (reclaim) {
-               npages -= reclaim;
-               release_pages(pages + npages, reclaim);
-       }
+       if (keep < npages)
+               release_pages(pages + keep, npages - keep);
 
        /* Push the remaining pages into the pipe. */
-       for (i = 0; i < npages; i++) {
+       remain = ret;
+       for (i = 0; i < keep; i++) {
                struct pipe_buffer *buf = pipe_head_buf(pipe);
 
                chunk = min_t(size_t, remain, PAGE_SIZE);
@@ -391,50 +401,7 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
        kfree(bv);
        return ret;
 }
-EXPORT_SYMBOL(direct_splice_read);
-
-/**
- * generic_file_splice_read - splice data from file to a pipe
- * @in:                file to splice from
- * @ppos:      position in @in
- * @pipe:      pipe to splice to
- * @len:       number of bytes to splice
- * @flags:     splice modifier flags
- *
- * Description:
- *    Will read pages from given file and fill them into a pipe. Can be
- *    used as long as it has more or less sane ->read_iter().
- *
- */
-ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
-                                struct pipe_inode_info *pipe, size_t len,
-                                unsigned int flags)
-{
-       struct iov_iter to;
-       struct kiocb kiocb;
-       int ret;
-
-       iov_iter_pipe(&to, ITER_DEST, pipe, len);
-       init_sync_kiocb(&kiocb, in);
-       kiocb.ki_pos = *ppos;
-       ret = call_read_iter(in, &kiocb, &to);
-       if (ret > 0) {
-               *ppos = kiocb.ki_pos;
-               file_accessed(in);
-       } else if (ret < 0) {
-               /* free what was emitted */
-               pipe_discard_from(pipe, to.start_head);
-               /*
-                * callers of ->splice_read() expect -EAGAIN on
-                * "can't put anything in there", rather than -EFAULT.
-                */
-               if (ret == -EFAULT)
-                       ret = -EAGAIN;
-       }
-
-       return ret;
-}
-EXPORT_SYMBOL(generic_file_splice_read);
+EXPORT_SYMBOL(copy_splice_read);
 
 const struct pipe_buf_operations default_pipe_buf_ops = {
        .release        = generic_pipe_buf_release,
@@ -978,18 +945,32 @@ static void do_splice_eof(struct splice_desc *sd)
                sd->splice_eof(sd);
 }
 
-/*
- * Attempt to initiate a splice from a file to a pipe.
+/**
+ * vfs_splice_read - Read data from a file and splice it into a pipe
+ * @in:                File to splice from
+ * @ppos:      Input file offset
+ * @pipe:      Pipe to splice to
+ * @len:       Number of bytes to splice
+ * @flags:     Splice modifier flags (SPLICE_F_*)
+ *
+ * Splice the requested amount of data from the input file to the pipe.  This
+ * is synchronous as the caller must hold the pipe lock across the entire
+ * operation.
+ *
+ * If successful, it returns the amount of data spliced, 0 if it hit the EOF or
+ * a hole and a negative error code otherwise.
  */
-static long do_splice_to(struct file *in, loff_t *ppos,
-                        struct pipe_inode_info *pipe, size_t len,
-                        unsigned int flags)
+long vfs_splice_read(struct file *in, loff_t *ppos,
+                    struct pipe_inode_info *pipe, size_t len,
+                    unsigned int flags)
 {
        unsigned int p_space;
        int ret;
 
        if (unlikely(!(in->f_mode & FMODE_READ)))
                return -EBADF;
+       if (!len)
+               return 0;
 
        /* Don't try to read more the pipe has space for. */
        p_space = pipe->max_usage - pipe_occupancy(pipe->head, pipe->tail);
@@ -1004,8 +985,15 @@ static long do_splice_to(struct file *in, loff_t *ppos,
 
        if (unlikely(!in->f_op->splice_read))
                return warn_unsupported(in, "read");
+       /*
+        * O_DIRECT and DAX don't deal with the pagecache, so we allocate a
+        * buffer, copy into it and splice that into the pipe.
+        */
+       if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
+               return copy_splice_read(in, ppos, pipe, len, flags);
        return in->f_op->splice_read(in, ppos, pipe, len, flags);
 }
+EXPORT_SYMBOL_GPL(vfs_splice_read);
 
 /**
  * splice_direct_to_actor - splices data directly between two non-pipes
@@ -1079,7 +1067,7 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
                size_t read_len;
                loff_t pos = sd->pos, prev_pos = pos;
 
-               ret = do_splice_to(in, &pos, pipe, len, flags);
+               ret = vfs_splice_read(in, &pos, pipe, len, flags);
                if (unlikely(ret <= 0))
                        goto read_failure;
 
@@ -1243,7 +1231,7 @@ long splice_file_to_pipe(struct file *in,
        pipe_lock(opipe);
        ret = wait_for_space(opipe, flags);
        if (!ret)
-               ret = do_splice_to(in, offset, opipe, len, flags);
+               ret = vfs_splice_read(in, offset, opipe, len, flags);
        pipe_unlock(opipe);
        if (ret > 0)
                wakeup_pipe_readers(opipe);
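
The net effect for filesystems shows up throughout this merge: generic_file_splice_read() is gone, pagecache-backed files point .splice_read at filemap_splice_read(), and vfs_splice_read() reroutes O_DIRECT and DAX files through copy_splice_read(), so per-filesystem wrappers such as the removed cifs_splice_read() are no longer needed. For a hypothetical pagecache-based filesystem the conversion amounts to:

const struct file_operations examplefs_file_operations = {
	.llseek		= generic_file_llseek,
	.read_iter	= generic_file_read_iter,
	.write_iter	= generic_file_write_iter,
	.mmap		= generic_file_mmap,
	.splice_read	= filemap_splice_read,	/* was generic_file_splice_read */
	.splice_write	= iter_file_splice_write,
};
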
index bed3bb8..6aa9c2e 100644 (file)
@@ -17,8 +17,8 @@
 #include <linux/fs.h>
 #include <linux/vfs.h>
 #include <linux/slab.h>
+#include <linux/pagemap.h>
 #include <linux/string.h>
-#include <linux/buffer_head.h>
 #include <linux/bio.h>
 
 #include "squashfs_fs.h"
@@ -76,10 +76,101 @@ static int copy_bio_to_actor(struct bio *bio,
        return copied_bytes;
 }
 
+static int squashfs_bio_read_cached(struct bio *fullbio,
+               struct address_space *cache_mapping, u64 index, int length,
+               u64 read_start, u64 read_end, int page_count)
+{
+       struct page *head_to_cache = NULL, *tail_to_cache = NULL;
+       struct block_device *bdev = fullbio->bi_bdev;
+       int start_idx = 0, end_idx = 0;
+       struct bvec_iter_all iter_all;
+       struct bio *bio = NULL;
+       struct bio_vec *bv;
+       int idx = 0;
+       int err = 0;
+
+       bio_for_each_segment_all(bv, fullbio, iter_all) {
+               struct page *page = bv->bv_page;
+
+               if (page->mapping == cache_mapping) {
+                       idx++;
+                       continue;
+               }
+
+               /*
+                * We only use this when the device block size is the same as
+                * the page size, so read_start and read_end cover full pages.
+                *
+                * Compare these to the original required index and length to
+                * only cache pages which were requested partially, since these
+                * are the ones which are likely to be needed when reading
+                * adjacent blocks.
+                */
+               if (idx == 0 && index != read_start)
+                       head_to_cache = page;
+               else if (idx == page_count - 1 && index + length != read_end)
+                       tail_to_cache = page;
+
+               if (!bio || idx != end_idx) {
+                       struct bio *new = bio_alloc_clone(bdev, fullbio,
+                                                         GFP_NOIO, &fs_bio_set);
+
+                       if (bio) {
+                               bio_trim(bio, start_idx * PAGE_SECTORS,
+                                        (end_idx - start_idx) * PAGE_SECTORS);
+                               bio_chain(bio, new);
+                               submit_bio(bio);
+                       }
+
+                       bio = new;
+                       start_idx = idx;
+               }
+
+               idx++;
+               end_idx = idx;
+       }
+
+       if (bio) {
+               bio_trim(bio, start_idx * PAGE_SECTORS,
+                        (end_idx - start_idx) * PAGE_SECTORS);
+               err = submit_bio_wait(bio);
+               bio_put(bio);
+       }
+
+       if (err)
+               return err;
+
+       if (head_to_cache) {
+               int ret = add_to_page_cache_lru(head_to_cache, cache_mapping,
+                                               read_start >> PAGE_SHIFT,
+                                               GFP_NOIO);
+
+               if (!ret) {
+                       SetPageUptodate(head_to_cache);
+                       unlock_page(head_to_cache);
+               }
+
+       }
+
+       if (tail_to_cache) {
+               int ret = add_to_page_cache_lru(tail_to_cache, cache_mapping,
+                                               (read_end >> PAGE_SHIFT) - 1,
+                                               GFP_NOIO);
+
+               if (!ret) {
+                       SetPageUptodate(tail_to_cache);
+                       unlock_page(tail_to_cache);
+               }
+       }
+
+       return 0;
+}
+
 static int squashfs_bio_read(struct super_block *sb, u64 index, int length,
                             struct bio **biop, int *block_offset)
 {
        struct squashfs_sb_info *msblk = sb->s_fs_info;
+       struct address_space *cache_mapping = msblk->cache_mapping;
        const u64 read_start = round_down(index, msblk->devblksize);
        const sector_t block = read_start >> msblk->devblksize_log2;
        const u64 read_end = round_up(index + length, msblk->devblksize);
@@ -99,21 +190,34 @@ static int squashfs_bio_read(struct super_block *sb, u64 index, int length,
        for (i = 0; i < page_count; ++i) {
                unsigned int len =
                        min_t(unsigned int, PAGE_SIZE - offset, total_len);
-               struct page *page = alloc_page(GFP_NOIO);
+               struct page *page = NULL;
+
+               if (cache_mapping)
+                       page = find_get_page(cache_mapping,
+                                            (read_start >> PAGE_SHIFT) + i);
+               if (!page)
+                       page = alloc_page(GFP_NOIO);
 
                if (!page) {
                        error = -ENOMEM;
                        goto out_free_bio;
                }
-               if (!bio_add_page(bio, page, len, offset)) {
-                       error = -EIO;
-                       goto out_free_bio;
-               }
+
+               /*
+                * Use the __ version to avoid merging since we need each page
+                * to be separate when we check for and avoid cached pages.
+                */
+               __bio_add_page(bio, page, len, offset);
                offset = 0;
                total_len -= len;
        }
 
-       error = submit_bio_wait(bio);
+       if (cache_mapping)
+               error = squashfs_bio_read_cached(bio, cache_mapping, index,
+                                                length, read_start, read_end,
+                                                page_count);
+       else
+               error = submit_bio_wait(bio);
        if (error)
                goto out_free_bio;
 
index 8893cb9..a676084 100644 (file)
@@ -11,7 +11,6 @@
 #include <linux/types.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
-#include <linux/buffer_head.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
index 1dfadf7..8a218e7 100644 (file)
@@ -7,7 +7,6 @@
 #include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/percpu.h>
-#include <linux/buffer_head.h>
 #include <linux/local_lock.h>
 
 #include "squashfs_fs.h"
index 72f6f4b..c01998e 100644 (file)
@@ -47,6 +47,7 @@ struct squashfs_sb_info {
        struct squashfs_cache                   *block_cache;
        struct squashfs_cache                   *fragment_cache;
        struct squashfs_cache                   *read_page;
+       struct address_space                    *cache_mapping;
        int                                     next_meta_index;
        __le64                                  *id_table;
        __le64                                  *fragment_index;
index e090fae..22e8128 100644 (file)
@@ -329,6 +329,19 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
                goto failed_mount;
        }
 
+       if (msblk->devblksize == PAGE_SIZE) {
+               struct inode *cache = new_inode(sb);
+
+               if (cache == NULL)
+                       goto failed_mount;
+
+               set_nlink(cache, 1);
+               cache->i_size = OFFSET_MAX;
+               mapping_set_gfp_mask(cache->i_mapping, GFP_NOFS);
+
+               msblk->cache_mapping = cache->i_mapping;
+       }
+
        msblk->stream = squashfs_decompressor_setup(sb, flags);
        if (IS_ERR(msblk->stream)) {
                err = PTR_ERR(msblk->stream);
@@ -454,6 +467,8 @@ failed_mount:
        squashfs_cache_delete(msblk->block_cache);
        squashfs_cache_delete(msblk->fragment_cache);
        squashfs_cache_delete(msblk->read_page);
+       if (msblk->cache_mapping)
+               iput(msblk->cache_mapping->host);
        msblk->thread_ops->destroy(msblk);
        kfree(msblk->inode_lookup_table);
        kfree(msblk->fragment_index);
@@ -572,6 +587,8 @@ static void squashfs_put_super(struct super_block *sb)
                squashfs_cache_delete(sbi->block_cache);
                squashfs_cache_delete(sbi->fragment_cache);
                squashfs_cache_delete(sbi->read_page);
+               if (sbi->cache_mapping)
+                       iput(sbi->cache_mapping->host);
                sbi->thread_ops->destroy(sbi);
                kfree(sbi->id_table);
                kfree(sbi->fragment_index);
index 04bc62a..05ff6ab 100644 (file)
@@ -595,7 +595,7 @@ retry:
        fc->s_fs_info = NULL;
        s->s_type = fc->fs_type;
        s->s_iflags |= fc->s_iflags;
-       strlcpy(s->s_id, s->s_type->name, sizeof(s->s_id));
+       strscpy(s->s_id, s->s_type->name, sizeof(s->s_id));
        list_add_tail(&s->s_list, &super_blocks);
        hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
        spin_unlock(&sb_lock);
@@ -674,7 +674,7 @@ retry:
                return ERR_PTR(err);
        }
        s->s_type = type;
-       strlcpy(s->s_id, type->name, sizeof(s->s_id));
+       strscpy(s->s_id, type->name, sizeof(s->s_id));
        list_add_tail(&s->s_list, &super_blocks);
        hlist_add_head(&s->s_instances, &type->fs_supers);
        spin_unlock(&sb_lock);
@@ -903,6 +903,7 @@ int reconfigure_super(struct fs_context *fc)
        struct super_block *sb = fc->root->d_sb;
        int retval;
        bool remount_ro = false;
+       bool remount_rw = false;
        bool force = fc->sb_flags & SB_FORCE;
 
        if (fc->sb_flags_mask & ~MS_RMT_MASK)
@@ -920,7 +921,7 @@ int reconfigure_super(struct fs_context *fc)
                    bdev_read_only(sb->s_bdev))
                        return -EACCES;
 #endif
-
+               remount_rw = !(fc->sb_flags & SB_RDONLY) && sb_rdonly(sb);
                remount_ro = (fc->sb_flags & SB_RDONLY) && !sb_rdonly(sb);
        }
 
@@ -943,13 +944,18 @@ int reconfigure_super(struct fs_context *fc)
         */
        if (remount_ro) {
                if (force) {
-                       sb->s_readonly_remount = 1;
-                       smp_wmb();
+                       sb_start_ro_state_change(sb);
                } else {
                        retval = sb_prepare_remount_readonly(sb);
                        if (retval)
                                return retval;
                }
+       } else if (remount_rw) {
+               /*
+                * Protect filesystem's reconfigure code from writes from
+                * userspace until reconfigure finishes.
+                */
+               sb_start_ro_state_change(sb);
        }
 
        if (fc->ops->reconfigure) {
@@ -965,9 +971,7 @@ int reconfigure_super(struct fs_context *fc)
 
        WRITE_ONCE(sb->s_flags, ((sb->s_flags & ~fc->sb_flags_mask) |
                                 (fc->sb_flags & fc->sb_flags_mask)));
-       /* Needs to be ordered wrt mnt_is_readonly() */
-       smp_wmb();
-       sb->s_readonly_remount = 0;
+       sb_end_ro_state_change(sb);
 
        /*
         * Some filesystems modify their metadata via some other path than the
@@ -982,7 +986,7 @@ int reconfigure_super(struct fs_context *fc)
        return 0;
 
 cancel_readonly:
-       sb->s_readonly_remount = 0;
+       sb_end_ro_state_change(sb);
        return retval;
 }
 
@@ -1206,6 +1210,22 @@ int get_tree_keyed(struct fs_context *fc,
 EXPORT_SYMBOL(get_tree_keyed);
 
 #ifdef CONFIG_BLOCK
+static void fs_mark_dead(struct block_device *bdev)
+{
+       struct super_block *sb;
+
+       sb = get_super(bdev);
+       if (!sb)
+               return;
+
+       if (sb->s_op->shutdown)
+               sb->s_op->shutdown(sb);
+       drop_super(sb);
+}
+
+static const struct blk_holder_ops fs_holder_ops = {
+       .mark_dead              = fs_mark_dead,
+};
 
 static int set_bdev_super(struct super_block *s, void *data)
 {
@@ -1239,16 +1259,13 @@ int get_tree_bdev(struct fs_context *fc,
 {
        struct block_device *bdev;
        struct super_block *s;
-       fmode_t mode = FMODE_READ | FMODE_EXCL;
        int error = 0;
 
-       if (!(fc->sb_flags & SB_RDONLY))
-               mode |= FMODE_WRITE;
-
        if (!fc->source)
                return invalf(fc, "No source specified");
 
-       bdev = blkdev_get_by_path(fc->source, mode, fc->fs_type);
+       bdev = blkdev_get_by_path(fc->source, sb_open_mode(fc->sb_flags),
+                                 fc->fs_type, &fs_holder_ops);
        if (IS_ERR(bdev)) {
                errorf(fc, "%s: Can't open blockdev", fc->source);
                return PTR_ERR(bdev);
@@ -1262,7 +1279,7 @@ int get_tree_bdev(struct fs_context *fc,
        if (bdev->bd_fsfreeze_count > 0) {
                mutex_unlock(&bdev->bd_fsfreeze_mutex);
                warnf(fc, "%pg: Can't mount, blockdev is frozen", bdev);
-               blkdev_put(bdev, mode);
+               blkdev_put(bdev, fc->fs_type);
                return -EBUSY;
        }
 
@@ -1271,7 +1288,7 @@ int get_tree_bdev(struct fs_context *fc,
        s = sget_fc(fc, test_bdev_super_fc, set_bdev_super_fc);
        mutex_unlock(&bdev->bd_fsfreeze_mutex);
        if (IS_ERR(s)) {
-               blkdev_put(bdev, mode);
+               blkdev_put(bdev, fc->fs_type);
                return PTR_ERR(s);
        }
 
@@ -1280,7 +1297,7 @@ int get_tree_bdev(struct fs_context *fc,
                if ((fc->sb_flags ^ s->s_flags) & SB_RDONLY) {
                        warnf(fc, "%pg: Can't mount, would change RO state", bdev);
                        deactivate_locked_super(s);
-                       blkdev_put(bdev, mode);
+                       blkdev_put(bdev, fc->fs_type);
                        return -EBUSY;
                }
 
@@ -1292,10 +1309,9 @@ int get_tree_bdev(struct fs_context *fc,
                 * holding an active reference.
                 */
                up_write(&s->s_umount);
-               blkdev_put(bdev, mode);
+               blkdev_put(bdev, fc->fs_type);
                down_write(&s->s_umount);
        } else {
-               s->s_mode = mode;
                snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
                shrinker_debugfs_rename(&s->s_shrink, "sb-%s:%s",
                                        fc->fs_type->name, s->s_id);
@@ -1327,13 +1343,10 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
 {
        struct block_device *bdev;
        struct super_block *s;
-       fmode_t mode = FMODE_READ | FMODE_EXCL;
        int error = 0;
 
-       if (!(flags & SB_RDONLY))
-               mode |= FMODE_WRITE;
-
-       bdev = blkdev_get_by_path(dev_name, mode, fs_type);
+       bdev = blkdev_get_by_path(dev_name, sb_open_mode(flags), fs_type,
+                                 &fs_holder_ops);
        if (IS_ERR(bdev))
                return ERR_CAST(bdev);
 
@@ -1369,10 +1382,9 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
                 * holding an active reference.
                 */
                up_write(&s->s_umount);
-               blkdev_put(bdev, mode);
+               blkdev_put(bdev, fs_type);
                down_write(&s->s_umount);
        } else {
-               s->s_mode = mode;
                snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
                shrinker_debugfs_rename(&s->s_shrink, "sb-%s:%s",
                                        fs_type->name, s->s_id);
@@ -1392,7 +1404,7 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
 error_s:
        error = PTR_ERR(s);
 error_bdev:
-       blkdev_put(bdev, mode);
+       blkdev_put(bdev, fs_type);
 error:
        return ERR_PTR(error);
 }
@@ -1401,13 +1413,11 @@ EXPORT_SYMBOL(mount_bdev);
 void kill_block_super(struct super_block *sb)
 {
        struct block_device *bdev = sb->s_bdev;
-       fmode_t mode = sb->s_mode;
 
        bdev->bd_super = NULL;
        generic_shutdown_super(sb);
        sync_blockdev(bdev);
-       WARN_ON_ONCE(!(mode & FMODE_EXCL));
-       blkdev_put(bdev, mode | FMODE_EXCL);
+       blkdev_put(bdev, sb->s_type);
 }
 
 EXPORT_SYMBOL(kill_block_super);
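
The fmode_t/FMODE_EXCL open of the superblock's device is replaced by blk_mode_t flags plus an explicit holder and an optional struct blk_holder_ops, whose ->mark_dead() hook is how the block layer now tells a filesystem that its device has gone away (fs_mark_dead() above forwards that into the ->shutdown super operation). A caller outside super.c would follow roughly the same shape (names hypothetical):

static void exfs_mark_dead(struct block_device *bdev)
{
	/* the device vanished; begin shutting down anything built on it */
}

static const struct blk_holder_ops exfs_holder_ops = {
	.mark_dead	= exfs_mark_dead,
};

static struct block_device *exfs_open_bdev(const char *path, void *holder)
{
	/* exclusive open; the same holder must later be passed to blkdev_put() */
	return blkdev_get_by_path(path, BLK_OPEN_READ | BLK_OPEN_WRITE,
				  holder, &exfs_holder_ops);
}
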
index c701273..76a0aee 100644 (file)
@@ -29,11 +29,10 @@ static struct ctl_table fs_shared_sysctls[] = {
        { }
 };
 
-DECLARE_SYSCTL_BASE(fs, fs_shared_sysctls);
-
 static int __init init_fs_sysctls(void)
 {
-       return register_sysctl_base(fs);
+       register_sysctl_init("fs", fs_shared_sysctls);
+       return 0;
 }
 
 early_initcall(init_fs_sysctls);
index cdb3d63..0140010 100644 (file)
@@ -52,7 +52,7 @@ static int sysv_handle_dirsync(struct inode *dir)
 }
 
 /*
- * Calls to dir_get_page()/put_and_unmap_page() must be nested according to the
+ * Calls to dir_get_page()/unmap_and_put_page() must be nested according to the
  * rules documented in mm/highmem.rst.
  *
  * NOTE: sysv_find_entry() and sysv_dotdot() act as calls to dir_get_page()
@@ -103,11 +103,11 @@ static int sysv_readdir(struct file *file, struct dir_context *ctx)
                        if (!dir_emit(ctx, name, strnlen(name,SYSV_NAMELEN),
                                        fs16_to_cpu(SYSV_SB(sb), de->inode),
                                        DT_UNKNOWN)) {
-                               put_and_unmap_page(page, kaddr);
+                               unmap_and_put_page(page, kaddr);
                                return 0;
                        }
                }
-               put_and_unmap_page(page, kaddr);
+               unmap_and_put_page(page, kaddr);
        }
        return 0;
 }
@@ -131,7 +131,7 @@ static inline int namecompare(int len, int maxlen,
  * itself (as a parameter - res_dir). It does NOT read the inode of the
  * entry - you'll have to do that yourself if you want to.
  *
- * On Success put_and_unmap_page() should be called on *res_page.
+ * On Success unmap_and_put_page() should be called on *res_page.
  *
  * sysv_find_entry() acts as a call to dir_get_page() and must be treated
  * accordingly for nesting purposes.
@@ -166,7 +166,7 @@ struct sysv_dir_entry *sysv_find_entry(struct dentry *dentry, struct page **res_
                                                        name, de->name))
                                        goto found;
                        }
-                       put_and_unmap_page(page, kaddr);
+                       unmap_and_put_page(page, kaddr);
                }
 
                if (++n >= npages)
@@ -209,7 +209,7 @@ int sysv_add_link(struct dentry *dentry, struct inode *inode)
                                goto out_page;
                        de++;
                }
-               put_and_unmap_page(page, kaddr);
+               unmap_and_put_page(page, kaddr);
        }
        BUG();
        return -EINVAL;
@@ -228,7 +228,7 @@ got_it:
        mark_inode_dirty(dir);
        err = sysv_handle_dirsync(dir);
 out_page:
-       put_and_unmap_page(page, kaddr);
+       unmap_and_put_page(page, kaddr);
        return err;
 out_unlock:
        unlock_page(page);
@@ -321,12 +321,12 @@ int sysv_empty_dir(struct inode * inode)
                        if (de->name[1] != '.' || de->name[2])
                                goto not_empty;
                }
-               put_and_unmap_page(page, kaddr);
+               unmap_and_put_page(page, kaddr);
        }
        return 1;
 
 not_empty:
-       put_and_unmap_page(page, kaddr);
+       unmap_and_put_page(page, kaddr);
        return 0;
 }
 
@@ -352,7 +352,7 @@ int sysv_set_link(struct sysv_dir_entry *de, struct page *page,
 }
 
 /*
- * Calls to dir_get_page()/put_and_unmap_page() must be nested according to the
+ * Calls to dir_get_page()/unmap_and_put_page() must be nested according to the
  * rules documented in mm/highmem.rst.
  *
  * sysv_dotdot() acts as a call to dir_get_page() and must be treated
@@ -376,7 +376,7 @@ ino_t sysv_inode_by_name(struct dentry *dentry)
        
        if (de) {
                res = fs16_to_cpu(SYSV_SB(dentry->d_sb), de->inode);
-               put_and_unmap_page(page, de);
+               unmap_and_put_page(page, de);
        }
        return res;
 }
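The sysv dir.c hunks above are a mechanical rename from put_and_unmap_page() to the generic unmap_and_put_page() helper, which drops a kmap_local mapping and the page reference in one call. A hedged sketch of the pattern outside sysv (read_mapping_page() usage and the foofs_* name are illustrative):

#include <linux/highmem.h>
#include <linux/pagemap.h>

static int foofs_peek_dir_page(struct address_space *mapping, pgoff_t index)
{
	struct page *page = read_mapping_page(mapping, index, NULL);
	void *kaddr;

	if (IS_ERR(page))
		return PTR_ERR(page);

	kaddr = kmap_local_page(page);
	/* ... inspect directory entries at kaddr; mappings must be released
	 * in LIFO order per mm/highmem.rst ... */
	unmap_and_put_page(page, kaddr);	/* kunmap_local() + put_page() */
	return 0;
}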
index 50eb925..c645f60 100644 (file)
@@ -26,7 +26,7 @@ const struct file_operations sysv_file_operations = {
        .write_iter     = generic_file_write_iter,
        .mmap           = generic_file_mmap,
        .fsync          = generic_file_fsync,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
 };
 
 static int sysv_setattr(struct mnt_idmap *idmap,
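This ->splice_read switch from generic_file_splice_read() to filemap_splice_read() recurs throughout the diff (ubifs, udf, ufs and vboxsf below; xfs wraps it in its own helper): splice now copies pagecache folios into the pipe rather than handing out page references. A sketch of the resulting wiring for a hypothetical pagecache-backed filesystem:

#include <linux/fs.h>

/* Illustrative only; foofs_file_operations is not a real symbol. */
const struct file_operations foofs_file_operations = {
	.llseek		= generic_file_llseek,
	.read_iter	= generic_file_read_iter,
	.write_iter	= generic_file_write_iter,
	.mmap		= generic_file_mmap,
	.fsync		= generic_file_fsync,
	.splice_read	= filemap_splice_read,	/* was generic_file_splice_read */
	.splice_write	= iter_file_splice_write,
};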
index b22764f..58d7f43 100644 (file)
@@ -145,6 +145,10 @@ static int alloc_branch(struct inode *inode,
                 */
                parent = block_to_cpu(SYSV_SB(inode->i_sb), branch[n-1].key);
                bh = sb_getblk(inode->i_sb, parent);
+               if (!bh) {
+                       sysv_free_block(inode->i_sb, branch[n].key);
+                       break;
+               }
                lock_buffer(bh);
                memset(bh->b_data, 0, blocksize);
                branch[n].bh = bh;
index 2b2dba4..fcf163f 100644 (file)
@@ -164,7 +164,7 @@ static int sysv_unlink(struct inode * dir, struct dentry * dentry)
                inode->i_ctime = dir->i_ctime;
                inode_dec_link_count(inode);
        }
-       put_and_unmap_page(page, de);
+       unmap_and_put_page(page, de);
        return err;
 }
 
@@ -227,7 +227,7 @@ static int sysv_rename(struct mnt_idmap *idmap, struct inode *old_dir,
                if (!new_de)
                        goto out_dir;
                err = sysv_set_link(new_de, new_page, old_inode);
-               put_and_unmap_page(new_page, new_de);
+               unmap_and_put_page(new_page, new_de);
                if (err)
                        goto out_dir;
                new_inode->i_ctime = current_time(new_inode);
@@ -256,9 +256,9 @@ static int sysv_rename(struct mnt_idmap *idmap, struct inode *old_dir,
 
 out_dir:
        if (dir_de)
-               put_and_unmap_page(dir_page, dir_de);
+               unmap_and_put_page(dir_page, dir_de);
 out_old:
-       put_and_unmap_page(old_page, old_de);
+       unmap_and_put_page(old_page, old_de);
 out:
        return err;
 }
index 979ab1d..6738fe4 100644 (file)
@@ -1669,7 +1669,7 @@ const struct file_operations ubifs_file_operations = {
        .mmap           = ubifs_file_mmap,
        .fsync          = ubifs_fsync,
        .unlocked_ioctl = ubifs_ioctl,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
        .splice_write   = iter_file_splice_write,
        .open           = fscrypt_file_open,
 #ifdef CONFIG_COMPAT
index 8238f74..29daf5d 100644 (file)
@@ -209,7 +209,7 @@ const struct file_operations udf_file_operations = {
        .write_iter             = udf_file_write_iter,
        .release                = udf_release_file,
        .fsync                  = generic_file_fsync,
-       .splice_read            = generic_file_splice_read,
+       .splice_read            = filemap_splice_read,
        .splice_write           = iter_file_splice_write,
        .llseek                 = generic_file_llseek,
 };
index fd20423..fd29a66 100644 (file)
@@ -793,11 +793,6 @@ static int udf_rename(struct mnt_idmap *idmap, struct inode *old_dir,
                        if (!empty_dir(new_inode))
                                goto out_oiter;
                }
-               /*
-                * We need to protect against old_inode getting converted from
-                * ICB to normal directory.
-                */
-               inode_lock_nested(old_inode, I_MUTEX_NONDIR2);
                retval = udf_fiiter_find_entry(old_inode, &dotdot_name,
                                               &diriter);
                if (retval == -ENOENT) {
@@ -806,10 +801,8 @@ static int udf_rename(struct mnt_idmap *idmap, struct inode *old_dir,
                                old_inode->i_ino);
                        retval = -EFSCORRUPTED;
                }
-               if (retval) {
-                       inode_unlock(old_inode);
+               if (retval)
                        goto out_oiter;
-               }
                has_diriter = true;
                tloc = lelb_to_cpu(diriter.fi.icb.extLocation);
                if (udf_get_lb_pblock(old_inode->i_sb, &tloc, 0) !=
@@ -889,7 +882,6 @@ static int udf_rename(struct mnt_idmap *idmap, struct inode *old_dir,
                               udf_dir_entry_len(&diriter.fi));
                udf_fiiter_write_fi(&diriter, NULL);
                udf_fiiter_release(&diriter);
-               inode_unlock(old_inode);
 
                inode_dec_link_count(old_dir);
                if (new_inode)
@@ -901,10 +893,8 @@ static int udf_rename(struct mnt_idmap *idmap, struct inode *old_dir,
        }
        return 0;
 out_oiter:
-       if (has_diriter) {
+       if (has_diriter)
                udf_fiiter_release(&diriter);
-               inode_unlock(old_inode);
-       }
        udf_fiiter_release(&oiter);
 
        return retval;
index 7e08758..6558882 100644 (file)
@@ -41,5 +41,5 @@ const struct file_operations ufs_file_operations = {
        .mmap           = generic_file_mmap,
        .open           = generic_file_open,
        .fsync          = generic_file_fsync,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = filemap_splice_read,
 };
index 4e800bb..7cecd49 100644 (file)
@@ -335,6 +335,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
        pud_t *pud;
        pmd_t *pmd, _pmd;
        pte_t *pte;
+       pte_t ptent;
        bool ret = true;
 
        mmap_assert_locked(mm);
@@ -349,20 +350,13 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
        if (!pud_present(*pud))
                goto out;
        pmd = pmd_offset(pud, address);
-       /*
-        * READ_ONCE must function as a barrier with narrower scope
-        * and it must be equivalent to:
-        *      _pmd = *pmd; barrier();
-        *
-        * This is to deal with the instability (as in
-        * pmd_trans_unstable) of the pmd.
-        */
-       _pmd = READ_ONCE(*pmd);
+again:
+       _pmd = pmdp_get_lockless(pmd);
        if (pmd_none(_pmd))
                goto out;
 
        ret = false;
-       if (!pmd_present(_pmd))
+       if (!pmd_present(_pmd) || pmd_devmap(_pmd))
                goto out;
 
        if (pmd_trans_huge(_pmd)) {
@@ -371,19 +365,20 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
                goto out;
        }
 
-       /*
-        * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
-        * and use the standard pte_offset_map() instead of parsing _pmd.
-        */
        pte = pte_offset_map(pmd, address);
+       if (!pte) {
+               ret = true;
+               goto again;
+       }
        /*
         * Lockless access: we're in a wait_event so it's ok if it
         * changes under us.  PTE markers should be handled the same as none
         * ptes here.
         */
-       if (pte_none_mostly(*pte))
+       ptent = ptep_get(pte);
+       if (pte_none_mostly(ptent))
                ret = true;
-       if (!pte_write(*pte) && (reason & VM_UFFD_WP))
+       if (!pte_write(ptent) && (reason & VM_UFFD_WP))
                ret = true;
        pte_unmap(pte);
 
@@ -857,31 +852,26 @@ static bool has_unmap_ctx(struct userfaultfd_ctx *ctx, struct list_head *unmaps,
        return false;
 }
 
-int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
+int userfaultfd_unmap_prep(struct vm_area_struct *vma, unsigned long start,
                           unsigned long end, struct list_head *unmaps)
 {
-       VMA_ITERATOR(vmi, mm, start);
-       struct vm_area_struct *vma;
-
-       for_each_vma_range(vmi, vma, end) {
-               struct userfaultfd_unmap_ctx *unmap_ctx;
-               struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
+       struct userfaultfd_unmap_ctx *unmap_ctx;
+       struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
 
-               if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_UNMAP) ||
-                   has_unmap_ctx(ctx, unmaps, start, end))
-                       continue;
+       if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_UNMAP) ||
+           has_unmap_ctx(ctx, unmaps, start, end))
+               return 0;
 
-               unmap_ctx = kzalloc(sizeof(*unmap_ctx), GFP_KERNEL);
-               if (!unmap_ctx)
-                       return -ENOMEM;
+       unmap_ctx = kzalloc(sizeof(*unmap_ctx), GFP_KERNEL);
+       if (!unmap_ctx)
+               return -ENOMEM;
 
-               userfaultfd_ctx_get(ctx);
-               atomic_inc(&ctx->mmap_changing);
-               unmap_ctx->ctx = ctx;
-               unmap_ctx->start = start;
-               unmap_ctx->end = end;
-               list_add_tail(&unmap_ctx->list, unmaps);
-       }
+       userfaultfd_ctx_get(ctx);
+       atomic_inc(&ctx->mmap_changing);
+       unmap_ctx->ctx = ctx;
+       unmap_ctx->start = start;
+       unmap_ctx->end = end;
+       list_add_tail(&unmap_ctx->list, unmaps);
 
        return 0;
 }
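userfaultfd_unmap_prep() now operates on a single VMA; the iteration moved to the munmap path, which already holds a VMA iterator. A hedged sketch of the caller's side after this refactor (foo_prepare_unmap() is illustrative, not the actual mm/mmap.c code):

#include <linux/mm.h>
#include <linux/userfaultfd_k.h>

static int foo_prepare_unmap(struct mm_struct *mm, unsigned long start,
			     unsigned long end, struct list_head *uf)
{
	VMA_ITERATOR(vmi, mm, start);
	struct vm_area_struct *vma;
	int error;

	for_each_vma_range(vmi, vma, end) {
		/* One call per VMA; the helper ignores VMAs without the
		 * UFFD_FEATURE_EVENT_UNMAP feature enabled. */
		error = userfaultfd_unmap_prep(vma, start, end, uf);
		if (error)
			return error;
	}
	return 0;
}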
index 572aa1c..2307f80 100644 (file)
@@ -217,7 +217,7 @@ const struct file_operations vboxsf_reg_fops = {
        .open = vboxsf_file_open,
        .release = vboxsf_file_release,
        .fsync = noop_fsync,
-       .splice_read = generic_file_splice_read,
+       .splice_read = filemap_splice_read,
 };
 
 const struct inode_operations vboxsf_reg_iops = {
index d2f6df6..1fb8f4d 100644 (file)
@@ -176,7 +176,7 @@ static int vboxsf_fill_super(struct super_block *sb, struct fs_context *fc)
        }
        folder_name->size = size;
        folder_name->length = size - 1;
-       strlcpy(folder_name->string.utf8, fc->source, size);
+       strscpy(folder_name->string.utf8, fc->source, size);
        err = vboxsf_map_folder(folder_name, &sbi->root);
        kfree(folder_name);
        if (err) {
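The strlcpy() to strscpy() conversion is part of the tree-wide removal of strlcpy(): strscpy() does not scan the whole source to compute its length and returns -E2BIG on truncation instead of the source length. A small hedged sketch of the differing return-value handling (foo_set_name() and its truncation policy are illustrative):

#include <linux/errno.h>
#include <linux/string.h>

static int foo_set_name(char *dst, size_t size, const char *src)
{
	ssize_t len = strscpy(dst, src, size);	/* bytes copied, excluding NUL */

	if (len < 0)
		return -ENAMETOOLONG;		/* -E2BIG: src did not fit */
	return 0;
}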
index a7ffd71..e1036e5 100644 (file)
@@ -39,14 +39,14 @@ config FS_VERITY_BUILTIN_SIGNATURES
        depends on FS_VERITY
        select SYSTEM_DATA_VERIFICATION
        help
-         Support verifying signatures of verity files against the X.509
-         certificates that have been loaded into the ".fs-verity"
-         kernel keyring.
+         This option adds support for in-kernel verification of
+         fs-verity builtin signatures.
 
-         This is meant as a relatively simple mechanism that can be
-         used to provide an authenticity guarantee for verity files, as
-         an alternative to IMA appraisal.  Userspace programs still
-         need to check that the verity bit is set in order to get an
-         authenticity guarantee.
+         Please take great care before using this feature.  It is not
+         the only way to do signatures with fs-verity, and the
+         alternatives (such as userspace signature verification, and
+         IMA appraisal) can be much better.  For details about the
+         limitations of this feature, see
+         Documentation/filesystems/fsverity.rst.
 
          If unsure, say N.
index fc4c50e..c284f46 100644 (file)
@@ -7,6 +7,7 @@
 
 #include "fsverity_private.h"
 
+#include <crypto/hash.h>
 #include <linux/mount.h>
 #include <linux/sched/signal.h>
 #include <linux/uaccess.h>
@@ -20,7 +21,7 @@ struct block_buffer {
 /* Hash a block, writing the result to the next level's pending block buffer. */
 static int hash_one_block(struct inode *inode,
                          const struct merkle_tree_params *params,
-                         struct ahash_request *req, struct block_buffer *cur)
+                         struct block_buffer *cur)
 {
        struct block_buffer *next = cur + 1;
        int err;
@@ -36,8 +37,7 @@ static int hash_one_block(struct inode *inode,
        /* Zero-pad the block if it's shorter than the block size. */
        memset(&cur->data[cur->filled], 0, params->block_size - cur->filled);
 
-       err = fsverity_hash_block(params, inode, req, virt_to_page(cur->data),
-                                 offset_in_page(cur->data),
+       err = fsverity_hash_block(params, inode, cur->data,
                                  &next->data[next->filled]);
        if (err)
                return err;
@@ -76,7 +76,6 @@ static int build_merkle_tree(struct file *filp,
        struct inode *inode = file_inode(filp);
        const u64 data_size = inode->i_size;
        const int num_levels = params->num_levels;
-       struct ahash_request *req;
        struct block_buffer _buffers[1 + FS_VERITY_MAX_LEVELS + 1] = {};
        struct block_buffer *buffers = &_buffers[1];
        unsigned long level_offset[FS_VERITY_MAX_LEVELS];
@@ -90,9 +89,6 @@ static int build_merkle_tree(struct file *filp,
                return 0;
        }
 
-       /* This allocation never fails, since it's mempool-backed. */
-       req = fsverity_alloc_hash_request(params->hash_alg, GFP_KERNEL);
-
        /*
         * Allocate the block buffers.  Buffer "-1" is for data blocks.
         * Buffers 0 <= level < num_levels are for the actual tree levels.
@@ -130,7 +126,7 @@ static int build_merkle_tree(struct file *filp,
                        fsverity_err(inode, "Short read of file data");
                        goto out;
                }
-               err = hash_one_block(inode, params, req, &buffers[-1]);
+               err = hash_one_block(inode, params, &buffers[-1]);
                if (err)
                        goto out;
                for (level = 0; level < num_levels; level++) {
@@ -141,8 +137,7 @@ static int build_merkle_tree(struct file *filp,
                        }
                        /* Next block at @level is full */
 
-                       err = hash_one_block(inode, params, req,
-                                            &buffers[level]);
+                       err = hash_one_block(inode, params, &buffers[level]);
                        if (err)
                                goto out;
                        err = write_merkle_tree_block(inode,
@@ -162,8 +157,7 @@ static int build_merkle_tree(struct file *filp,
        /* Finish all nonempty pending tree blocks. */
        for (level = 0; level < num_levels; level++) {
                if (buffers[level].filled != 0) {
-                       err = hash_one_block(inode, params, req,
-                                            &buffers[level]);
+                       err = hash_one_block(inode, params, &buffers[level]);
                        if (err)
                                goto out;
                        err = write_merkle_tree_block(inode,
@@ -183,7 +177,6 @@ static int build_merkle_tree(struct file *filp,
 out:
        for (level = -1; level < num_levels; level++)
                kfree(buffers[level].data);
-       fsverity_free_hash_request(params->hash_alg, req);
        return err;
 }
 
@@ -215,7 +208,7 @@ static int enable_verity(struct file *filp,
        }
        desc->salt_size = arg->salt_size;
 
-       /* Get the signature if the user provided one */
+       /* Get the builtin signature if the user provided one */
        if (arg->sig_size &&
            copy_from_user(desc->signature, u64_to_user_ptr(arg->sig_ptr),
                           arg->sig_size)) {
index d34dcc0..49bf3a1 100644 (file)
@@ -11,9 +11,6 @@
 #define pr_fmt(fmt) "fs-verity: " fmt
 
 #include <linux/fsverity.h>
-#include <linux/mempool.h>
-
-struct ahash_request;
 
 /*
  * Implementation limit: maximum depth of the Merkle tree.  For now 8 is plenty;
@@ -23,11 +20,10 @@ struct ahash_request;
 
 /* A hash algorithm supported by fs-verity */
 struct fsverity_hash_alg {
-       struct crypto_ahash *tfm; /* hash tfm, allocated on demand */
+       struct crypto_shash *tfm; /* hash tfm, allocated on demand */
        const char *name;         /* crypto API name, e.g. sha256 */
        unsigned int digest_size; /* digest size in bytes, e.g. 32 for SHA-256 */
        unsigned int block_size;  /* block size in bytes, e.g. 64 for SHA-256 */
-       mempool_t req_pool;       /* mempool with a preallocated hash request */
        /*
         * The HASH_ALGO_* constant for this algorithm.  This is different from
         * FS_VERITY_HASH_ALG_*, which uses a different numbering scheme.
@@ -37,7 +33,7 @@ struct fsverity_hash_alg {
 
 /* Merkle tree parameters: hash algorithm, initial hash state, and topology */
 struct merkle_tree_params {
-       struct fsverity_hash_alg *hash_alg; /* the hash algorithm */
+       const struct fsverity_hash_alg *hash_alg; /* the hash algorithm */
        const u8 *hashstate;            /* initial hash state or NULL */
        unsigned int digest_size;       /* same as hash_alg->digest_size */
        unsigned int block_size;        /* size of data and tree blocks */
@@ -83,18 +79,13 @@ struct fsverity_info {
 
 extern struct fsverity_hash_alg fsverity_hash_algs[];
 
-struct fsverity_hash_alg *fsverity_get_hash_alg(const struct inode *inode,
-                                               unsigned int num);
-struct ahash_request *fsverity_alloc_hash_request(struct fsverity_hash_alg *alg,
-                                                 gfp_t gfp_flags);
-void fsverity_free_hash_request(struct fsverity_hash_alg *alg,
-                               struct ahash_request *req);
-const u8 *fsverity_prepare_hash_state(struct fsverity_hash_alg *alg,
+const struct fsverity_hash_alg *fsverity_get_hash_alg(const struct inode *inode,
+                                                     unsigned int num);
+const u8 *fsverity_prepare_hash_state(const struct fsverity_hash_alg *alg,
                                      const u8 *salt, size_t salt_size);
 int fsverity_hash_block(const struct merkle_tree_params *params,
-                       const struct inode *inode, struct ahash_request *req,
-                       struct page *page, unsigned int offset, u8 *out);
-int fsverity_hash_buffer(struct fsverity_hash_alg *alg,
+                       const struct inode *inode, const void *data, u8 *out);
+int fsverity_hash_buffer(const struct fsverity_hash_alg *alg,
                         const void *data, size_t size, u8 *out);
 void __init fsverity_check_hash_algs(void);
 
index ea00dbe..c598d20 100644 (file)
@@ -8,7 +8,6 @@
 #include "fsverity_private.h"
 
 #include <crypto/hash.h>
-#include <linux/scatterlist.h>
 
 /* The hash algorithms supported by fs-verity */
 struct fsverity_hash_alg fsverity_hash_algs[] = {
@@ -40,11 +39,11 @@ static DEFINE_MUTEX(fsverity_hash_alg_init_mutex);
  *
  * Return: pointer to the hash alg on success, else an ERR_PTR()
  */
-struct fsverity_hash_alg *fsverity_get_hash_alg(const struct inode *inode,
-                                               unsigned int num)
+const struct fsverity_hash_alg *fsverity_get_hash_alg(const struct inode *inode,
+                                                     unsigned int num)
 {
        struct fsverity_hash_alg *alg;
-       struct crypto_ahash *tfm;
+       struct crypto_shash *tfm;
        int err;
 
        if (num >= ARRAY_SIZE(fsverity_hash_algs) ||
@@ -63,11 +62,7 @@ struct fsverity_hash_alg *fsverity_get_hash_alg(const struct inode *inode,
        if (alg->tfm != NULL)
                goto out_unlock;
 
-       /*
-        * Using the shash API would make things a bit simpler, but the ahash
-        * API is preferable as it allows the use of crypto accelerators.
-        */
-       tfm = crypto_alloc_ahash(alg->name, 0, 0);
+       tfm = crypto_alloc_shash(alg->name, 0, 0);
        if (IS_ERR(tfm)) {
                if (PTR_ERR(tfm) == -ENOENT) {
                        fsverity_warn(inode,
@@ -84,26 +79,20 @@ struct fsverity_hash_alg *fsverity_get_hash_alg(const struct inode *inode,
        }
 
        err = -EINVAL;
-       if (WARN_ON_ONCE(alg->digest_size != crypto_ahash_digestsize(tfm)))
+       if (WARN_ON_ONCE(alg->digest_size != crypto_shash_digestsize(tfm)))
                goto err_free_tfm;
-       if (WARN_ON_ONCE(alg->block_size != crypto_ahash_blocksize(tfm)))
-               goto err_free_tfm;
-
-       err = mempool_init_kmalloc_pool(&alg->req_pool, 1,
-                                       sizeof(struct ahash_request) +
-                                       crypto_ahash_reqsize(tfm));
-       if (err)
+       if (WARN_ON_ONCE(alg->block_size != crypto_shash_blocksize(tfm)))
                goto err_free_tfm;
 
        pr_info("%s using implementation \"%s\"\n",
-               alg->name, crypto_ahash_driver_name(tfm));
+               alg->name, crypto_shash_driver_name(tfm));
 
        /* pairs with smp_load_acquire() above */
        smp_store_release(&alg->tfm, tfm);
        goto out_unlock;
 
 err_free_tfm:
-       crypto_free_ahash(tfm);
+       crypto_free_shash(tfm);
        alg = ERR_PTR(err);
 out_unlock:
        mutex_unlock(&fsverity_hash_alg_init_mutex);
@@ -111,42 +100,6 @@ out_unlock:
 }
 
 /**
- * fsverity_alloc_hash_request() - allocate a hash request object
- * @alg: the hash algorithm for which to allocate the request
- * @gfp_flags: memory allocation flags
- *
- * This is mempool-backed, so this never fails if __GFP_DIRECT_RECLAIM is set in
- * @gfp_flags.  However, in that case this might need to wait for all
- * previously-allocated requests to be freed.  So to avoid deadlocks, callers
- * must never need multiple requests at a time to make forward progress.
- *
- * Return: the request object on success; NULL on failure (but see above)
- */
-struct ahash_request *fsverity_alloc_hash_request(struct fsverity_hash_alg *alg,
-                                                 gfp_t gfp_flags)
-{
-       struct ahash_request *req = mempool_alloc(&alg->req_pool, gfp_flags);
-
-       if (req)
-               ahash_request_set_tfm(req, alg->tfm);
-       return req;
-}
-
-/**
- * fsverity_free_hash_request() - free a hash request object
- * @alg: the hash algorithm
- * @req: the hash request object to free
- */
-void fsverity_free_hash_request(struct fsverity_hash_alg *alg,
-                               struct ahash_request *req)
-{
-       if (req) {
-               ahash_request_zero(req);
-               mempool_free(req, &alg->req_pool);
-       }
-}
-
-/**
  * fsverity_prepare_hash_state() - precompute the initial hash state
  * @alg: hash algorithm
  * @salt: a salt which is to be prepended to all data to be hashed
@@ -155,27 +108,24 @@ void fsverity_free_hash_request(struct fsverity_hash_alg *alg,
  * Return: NULL if the salt is empty, otherwise the kmalloc()'ed precomputed
  *        initial hash state on success or an ERR_PTR() on failure.
  */
-const u8 *fsverity_prepare_hash_state(struct fsverity_hash_alg *alg,
+const u8 *fsverity_prepare_hash_state(const struct fsverity_hash_alg *alg,
                                      const u8 *salt, size_t salt_size)
 {
        u8 *hashstate = NULL;
-       struct ahash_request *req = NULL;
+       SHASH_DESC_ON_STACK(desc, alg->tfm);
        u8 *padded_salt = NULL;
        size_t padded_salt_size;
-       struct scatterlist sg;
-       DECLARE_CRYPTO_WAIT(wait);
        int err;
 
+       desc->tfm = alg->tfm;
+
        if (salt_size == 0)
                return NULL;
 
-       hashstate = kmalloc(crypto_ahash_statesize(alg->tfm), GFP_KERNEL);
+       hashstate = kmalloc(crypto_shash_statesize(alg->tfm), GFP_KERNEL);
        if (!hashstate)
                return ERR_PTR(-ENOMEM);
 
-       /* This allocation never fails, since it's mempool-backed. */
-       req = fsverity_alloc_hash_request(alg, GFP_KERNEL);
-
        /*
         * Zero-pad the salt to the next multiple of the input size of the hash
         * algorithm's compression function, e.g. 64 bytes for SHA-256 or 128
@@ -190,26 +140,18 @@ const u8 *fsverity_prepare_hash_state(struct fsverity_hash_alg *alg,
                goto err_free;
        }
        memcpy(padded_salt, salt, salt_size);
-
-       sg_init_one(&sg, padded_salt, padded_salt_size);
-       ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP |
-                                       CRYPTO_TFM_REQ_MAY_BACKLOG,
-                                  crypto_req_done, &wait);
-       ahash_request_set_crypt(req, &sg, NULL, padded_salt_size);
-
-       err = crypto_wait_req(crypto_ahash_init(req), &wait);
+       err = crypto_shash_init(desc);
        if (err)
                goto err_free;
 
-       err = crypto_wait_req(crypto_ahash_update(req), &wait);
+       err = crypto_shash_update(desc, padded_salt, padded_salt_size);
        if (err)
                goto err_free;
 
-       err = crypto_ahash_export(req, hashstate);
+       err = crypto_shash_export(desc, hashstate);
        if (err)
                goto err_free;
 out:
-       fsverity_free_hash_request(alg, req);
        kfree(padded_salt);
        return hashstate;
 
@@ -223,9 +165,7 @@ err_free:
  * fsverity_hash_block() - hash a single data or hash block
  * @params: the Merkle tree's parameters
  * @inode: inode for which the hashing is being done
- * @req: preallocated hash request
- * @page: the page containing the block to hash
- * @offset: the offset of the block within @page
+ * @data: virtual address of a buffer containing the block to hash
  * @out: output digest, size 'params->digest_size' bytes
  *
  * Hash a single data or hash block.  The hash is salted if a salt is specified
@@ -234,33 +174,24 @@ err_free:
  * Return: 0 on success, -errno on failure
  */
 int fsverity_hash_block(const struct merkle_tree_params *params,
-                       const struct inode *inode, struct ahash_request *req,
-                       struct page *page, unsigned int offset, u8 *out)
+                       const struct inode *inode, const void *data, u8 *out)
 {
-       struct scatterlist sg;
-       DECLARE_CRYPTO_WAIT(wait);
+       SHASH_DESC_ON_STACK(desc, params->hash_alg->tfm);
        int err;
 
-       sg_init_table(&sg, 1);
-       sg_set_page(&sg, page, params->block_size, offset);
-       ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP |
-                                       CRYPTO_TFM_REQ_MAY_BACKLOG,
-                                  crypto_req_done, &wait);
-       ahash_request_set_crypt(req, &sg, out, params->block_size);
+       desc->tfm = params->hash_alg->tfm;
 
        if (params->hashstate) {
-               err = crypto_ahash_import(req, params->hashstate);
+               err = crypto_shash_import(desc, params->hashstate);
                if (err) {
                        fsverity_err(inode,
                                     "Error %d importing hash state", err);
                        return err;
                }
-               err = crypto_ahash_finup(req);
+               err = crypto_shash_finup(desc, data, params->block_size, out);
        } else {
-               err = crypto_ahash_digest(req);
+               err = crypto_shash_digest(desc, data, params->block_size, out);
        }
-
-       err = crypto_wait_req(err, &wait);
        if (err)
                fsverity_err(inode, "Error %d computing block hash", err);
        return err;
@@ -273,32 +204,12 @@ int fsverity_hash_block(const struct merkle_tree_params *params,
  * @size: size of data to hash, in bytes
  * @out: output digest, size 'alg->digest_size' bytes
  *
- * Hash some data which is located in physically contiguous memory (i.e. memory
- * allocated by kmalloc(), not by vmalloc()).  No salt is used.
- *
  * Return: 0 on success, -errno on failure
  */
-int fsverity_hash_buffer(struct fsverity_hash_alg *alg,
+int fsverity_hash_buffer(const struct fsverity_hash_alg *alg,
                         const void *data, size_t size, u8 *out)
 {
-       struct ahash_request *req;
-       struct scatterlist sg;
-       DECLARE_CRYPTO_WAIT(wait);
-       int err;
-
-       /* This allocation never fails, since it's mempool-backed. */
-       req = fsverity_alloc_hash_request(alg, GFP_KERNEL);
-
-       sg_init_one(&sg, data, size);
-       ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP |
-                                       CRYPTO_TFM_REQ_MAY_BACKLOG,
-                                  crypto_req_done, &wait);
-       ahash_request_set_crypt(req, &sg, out, size);
-
-       err = crypto_wait_req(crypto_ahash_digest(req), &wait);
-
-       fsverity_free_hash_request(alg, req);
-       return err;
+       return crypto_shash_tfm_digest(alg->tfm, data, size, out);
 }
 
 void __init fsverity_check_hash_algs(void)
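fs-verity drops the ahash interface, its scatterlists, and the mempool of preallocated requests in favor of the synchronous shash API operating on already-mapped buffers. A hedged sketch of the two shash usage patterns the new code relies on (the foo_* wrappers are illustrative):

#include <crypto/hash.h>

/* One-shot digest over a linear buffer, as fsverity_hash_buffer() now does. */
static int foo_digest(struct crypto_shash *tfm, const void *data,
		      unsigned int len, u8 *out)
{
	return crypto_shash_tfm_digest(tfm, data, len, out);
}

/* Import a precomputed (salted) hash state, then finish over one block --
 * the shape of the new fsverity_hash_block(). */
static int foo_digest_salted(struct crypto_shash *tfm, const u8 *hashstate,
			     const void *block, unsigned int len, u8 *out)
{
	SHASH_DESC_ON_STACK(desc, tfm);
	int err;

	desc->tfm = tfm;
	err = crypto_shash_import(desc, hashstate);
	if (err)
		return err;
	return crypto_shash_finup(desc, block, len, out);
}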
index 5c79ea1..eec5956 100644 (file)
@@ -61,27 +61,42 @@ EXPORT_SYMBOL_GPL(fsverity_ioctl_measure);
 /**
  * fsverity_get_digest() - get a verity file's digest
  * @inode: inode to get digest of
- * @digest: (out) pointer to the digest
- * @alg: (out) pointer to the hash algorithm enumeration
+ * @raw_digest: (out) the raw file digest
+ * @alg: (out) the digest's algorithm, as a FS_VERITY_HASH_ALG_* value
+ * @halg: (out) the digest's algorithm, as a HASH_ALGO_* value
  *
- * Return the file hash algorithm and digest of an fsverity protected file.
- * Assumption: before calling this, the file must have been opened.
+ * Retrieves the fsverity digest of the given file.  The file must have been
+ * opened at least once since the inode was last loaded into the inode cache;
+ * otherwise this function will not recognize when fsverity is enabled.
  *
- * Return: 0 on success, -errno on failure
+ * The file's fsverity digest consists of @raw_digest in combination with either
+ * @alg or @halg.  (The caller can choose which one of @alg or @halg to use.)
+ *
+ * IMPORTANT: Callers *must* make use of one of the two algorithm IDs, since
+ * @raw_digest is meaningless without knowing which algorithm it uses!  fsverity
+ * provides no security guarantee for users who ignore the algorithm ID, even if
+ * they use the digest size (since algorithms can share the same digest size).
+ *
+ * Return: The size of the raw digest in bytes, or 0 if the file doesn't have
+ *        fsverity enabled.
  */
 int fsverity_get_digest(struct inode *inode,
-                       u8 digest[FS_VERITY_MAX_DIGEST_SIZE],
-                       enum hash_algo *alg)
+                       u8 raw_digest[FS_VERITY_MAX_DIGEST_SIZE],
+                       u8 *alg, enum hash_algo *halg)
 {
        const struct fsverity_info *vi;
        const struct fsverity_hash_alg *hash_alg;
 
        vi = fsverity_get_info(inode);
        if (!vi)
-               return -ENODATA; /* not a verity file */
+               return 0; /* not a verity file */
 
        hash_alg = vi->tree_params.hash_alg;
-       memcpy(digest, vi->file_digest, hash_alg->digest_size);
-       *alg = hash_alg->algo_id;
-       return 0;
+       memcpy(raw_digest, vi->file_digest, hash_alg->digest_size);
+       if (alg)
+               *alg = hash_alg - fsverity_hash_algs;
+       if (halg)
+               *halg = hash_alg->algo_id;
+       return hash_alg->digest_size;
 }
+EXPORT_SYMBOL_GPL(fsverity_get_digest);
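fsverity_get_digest() changes its contract here: it returns the digest size (0 if verity is not enabled on the file) and reports the algorithm either in fs-verity numbering (@alg) or as a HASH_ALGO_* value (@halg), whichever pointer the caller supplies. A hedged sketch of a caller under the new API (foo_log_verity_digest() is illustrative):

#include <linux/fsverity.h>
#include <crypto/hash_info.h>

static int foo_log_verity_digest(struct inode *inode)
{
	u8 raw_digest[FS_VERITY_MAX_DIGEST_SIZE];
	enum hash_algo halg;
	int size;

	size = fsverity_get_digest(inode, raw_digest, NULL, &halg);
	if (size == 0)
		return -ENODATA;	/* not a verity file */

	/* The raw digest is only meaningful together with its algorithm ID. */
	pr_info("verity digest: %s:%*phN\n",
		hash_algo_name[halg], size, raw_digest);
	return 0;
}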
index 52048b7..1db5106 100644 (file)
@@ -32,7 +32,7 @@ int fsverity_init_merkle_tree_params(struct merkle_tree_params *params,
                                     unsigned int log_blocksize,
                                     const u8 *salt, size_t salt_size)
 {
-       struct fsverity_hash_alg *hash_alg;
+       const struct fsverity_hash_alg *hash_alg;
        int err;
        u64 blocks;
        u64 blocks_in_level[FS_VERITY_MAX_LEVELS];
@@ -156,9 +156,9 @@ out_err:
 
 /*
  * Compute the file digest by hashing the fsverity_descriptor excluding the
- * signature and with the sig_size field set to 0.
+ * builtin signature and with the sig_size field set to 0.
  */
-static int compute_file_digest(struct fsverity_hash_alg *hash_alg,
+static int compute_file_digest(const struct fsverity_hash_alg *hash_alg,
                               struct fsverity_descriptor *desc,
                               u8 *file_digest)
 {
@@ -174,7 +174,7 @@ static int compute_file_digest(struct fsverity_hash_alg *hash_alg,
 
 /*
  * Create a new fsverity_info from the given fsverity_descriptor (with optional
- * appended signature), and check the signature if present.  The
+ * appended builtin signature), and check the signature if present.  The
  * fsverity_descriptor must have already undergone basic validation.
  */
 struct fsverity_info *fsverity_create_info(const struct inode *inode,
@@ -319,8 +319,8 @@ static bool validate_fsverity_descriptor(struct inode *inode,
 }
 
 /*
- * Read the inode's fsverity_descriptor (with optional appended signature) from
- * the filesystem, and do basic validation of it.
+ * Read the inode's fsverity_descriptor (with optional appended builtin
+ * signature) from the filesystem, and do basic validation of it.
  */
 int fsverity_get_descriptor(struct inode *inode,
                            struct fsverity_descriptor **desc_ret)
index 2aefc55..f584327 100644 (file)
@@ -105,7 +105,7 @@ static int fsverity_read_descriptor(struct inode *inode,
        if (res)
                return res;
 
-       /* don't include the signature */
+       /* don't include the builtin signature */
        desc_size = offsetof(struct fsverity_descriptor, signature);
        desc->sig_size = 0;
 
@@ -131,7 +131,7 @@ static int fsverity_read_signature(struct inode *inode,
        }
 
        /*
-        * Include only the signature.  Note that fsverity_get_descriptor()
+        * Include only the builtin signature.  fsverity_get_descriptor()
         * already verified that sig_size is in-bounds.
         */
        res = fsverity_read_buffer(buf, offset, length, desc->signature,
index b8c51ad..72034bc 100644 (file)
@@ -5,6 +5,14 @@
  * Copyright 2019 Google LLC
  */
 
+/*
+ * This file implements verification of fs-verity builtin signatures.  Please
+ * take great care before using this feature.  It is not the only way to do
+ * signatures with fs-verity, and the alternatives (such as userspace signature
+ * verification, and IMA appraisal) can be much better.  For details about the
+ * limitations of this feature, see Documentation/filesystems/fsverity.rst.
+ */
+
 #include "fsverity_private.h"
 
 #include <linux/cred.h>
index e250822..433cef5 100644 (file)
 
 static struct workqueue_struct *fsverity_read_workqueue;
 
-static inline int cmp_hashes(const struct fsverity_info *vi,
-                            const u8 *want_hash, const u8 *real_hash,
-                            u64 data_pos, int level)
-{
-       const unsigned int hsize = vi->tree_params.digest_size;
-
-       if (memcmp(want_hash, real_hash, hsize) == 0)
-               return 0;
-
-       fsverity_err(vi->inode,
-                    "FILE CORRUPTED! pos=%llu, level=%d, want_hash=%s:%*phN, real_hash=%s:%*phN",
-                    data_pos, level,
-                    vi->tree_params.hash_alg->name, hsize, want_hash,
-                    vi->tree_params.hash_alg->name, hsize, real_hash);
-       return -EBADMSG;
-}
-
-static bool data_is_zeroed(struct inode *inode, struct page *page,
-                          unsigned int len, unsigned int offset)
-{
-       void *virt = kmap_local_page(page);
-
-       if (memchr_inv(virt + offset, 0, len)) {
-               kunmap_local(virt);
-               fsverity_err(inode,
-                            "FILE CORRUPTED!  Data past EOF is not zeroed");
-               return false;
-       }
-       kunmap_local(virt);
-       return true;
-}
-
 /*
  * Returns true if the hash block with index @hblock_idx in the tree, located in
  * @hpage, has already been verified.
@@ -122,9 +90,7 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
  */
 static bool
 verify_data_block(struct inode *inode, struct fsverity_info *vi,
-                 struct ahash_request *req, struct page *data_page,
-                 u64 data_pos, unsigned int dblock_offset_in_page,
-                 unsigned long max_ra_pages)
+                 const void *data, u64 data_pos, unsigned long max_ra_pages)
 {
        const struct merkle_tree_params *params = &vi->tree_params;
        const unsigned int hsize = params->digest_size;
@@ -136,11 +102,11 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
        struct {
                /* Page containing the hash block */
                struct page *page;
+               /* Mapped address of the hash block (will be within @page) */
+               const void *addr;
                /* Index of the hash block in the tree overall */
                unsigned long index;
-               /* Byte offset of the hash block within @page */
-               unsigned int offset_in_page;
-               /* Byte offset of the wanted hash within @page */
+               /* Byte offset of the wanted hash relative to @addr */
                unsigned int hoffset;
        } hblocks[FS_VERITY_MAX_LEVELS];
        /*
@@ -148,7 +114,9 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
         * index of that block's hash within the current level.
         */
        u64 hidx = data_pos >> params->log_blocksize;
-       int err;
+
+       /* Up to 1 + FS_VERITY_MAX_LEVELS pages may be mapped at once */
+       BUILD_BUG_ON(1 + FS_VERITY_MAX_LEVELS > KM_MAX_IDX);
 
        if (unlikely(data_pos >= inode->i_size)) {
                /*
@@ -159,8 +127,12 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
                 * any part past EOF should be all zeroes.  Therefore, we need
                 * to verify that any data blocks fully past EOF are all zeroes.
                 */
-               return data_is_zeroed(inode, data_page, params->block_size,
-                                     dblock_offset_in_page);
+               if (memchr_inv(data, 0, params->block_size)) {
+                       fsverity_err(inode,
+                                    "FILE CORRUPTED!  Data past EOF is not zeroed");
+                       return false;
+               }
+               return true;
        }
 
        /*
@@ -175,6 +147,7 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
                unsigned int hblock_offset_in_page;
                unsigned int hoffset;
                struct page *hpage;
+               const void *haddr;
 
                /*
                 * The index of the block in the current level; also the index
@@ -192,30 +165,30 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
                hblock_offset_in_page =
                        (hblock_idx << params->log_blocksize) & ~PAGE_MASK;
 
-               /* Byte offset of the hash within the page */
-               hoffset = hblock_offset_in_page +
-                         ((hidx << params->log_digestsize) &
-                          (params->block_size - 1));
+               /* Byte offset of the hash within the block */
+               hoffset = (hidx << params->log_digestsize) &
+                         (params->block_size - 1);
 
                hpage = inode->i_sb->s_vop->read_merkle_tree_page(inode,
                                hpage_idx, level == 0 ? min(max_ra_pages,
                                        params->tree_pages - hpage_idx) : 0);
                if (IS_ERR(hpage)) {
-                       err = PTR_ERR(hpage);
                        fsverity_err(inode,
-                                    "Error %d reading Merkle tree page %lu",
-                                    err, hpage_idx);
-                       goto out;
+                                    "Error %ld reading Merkle tree page %lu",
+                                    PTR_ERR(hpage), hpage_idx);
+                       goto error;
                }
+               haddr = kmap_local_page(hpage) + hblock_offset_in_page;
                if (is_hash_block_verified(vi, hpage, hblock_idx)) {
-                       memcpy_from_page(_want_hash, hpage, hoffset, hsize);
+                       memcpy(_want_hash, haddr + hoffset, hsize);
                        want_hash = _want_hash;
+                       kunmap_local(haddr);
                        put_page(hpage);
                        goto descend;
                }
                hblocks[level].page = hpage;
+               hblocks[level].addr = haddr;
                hblocks[level].index = hblock_idx;
-               hblocks[level].offset_in_page = hblock_offset_in_page;
                hblocks[level].hoffset = hoffset;
                hidx = next_hidx;
        }
@@ -225,18 +198,14 @@ descend:
        /* Descend the tree verifying hash blocks. */
        for (; level > 0; level--) {
                struct page *hpage = hblocks[level - 1].page;
+               const void *haddr = hblocks[level - 1].addr;
                unsigned long hblock_idx = hblocks[level - 1].index;
-               unsigned int hblock_offset_in_page =
-                       hblocks[level - 1].offset_in_page;
                unsigned int hoffset = hblocks[level - 1].hoffset;
 
-               err = fsverity_hash_block(params, inode, req, hpage,
-                                         hblock_offset_in_page, real_hash);
-               if (err)
-                       goto out;
-               err = cmp_hashes(vi, want_hash, real_hash, data_pos, level - 1);
-               if (err)
-                       goto out;
+               if (fsverity_hash_block(params, inode, haddr, real_hash) != 0)
+                       goto error;
+               if (memcmp(want_hash, real_hash, hsize) != 0)
+                       goto corrupted;
                /*
                 * Mark the hash block as verified.  This must be atomic and
                 * idempotent, as the same hash block might be verified by
@@ -246,29 +215,39 @@ descend:
                        set_bit(hblock_idx, vi->hash_block_verified);
                else
                        SetPageChecked(hpage);
-               memcpy_from_page(_want_hash, hpage, hoffset, hsize);
+               memcpy(_want_hash, haddr + hoffset, hsize);
                want_hash = _want_hash;
+               kunmap_local(haddr);
                put_page(hpage);
        }
 
        /* Finally, verify the data block. */
-       err = fsverity_hash_block(params, inode, req, data_page,
-                                 dblock_offset_in_page, real_hash);
-       if (err)
-               goto out;
-       err = cmp_hashes(vi, want_hash, real_hash, data_pos, -1);
-out:
-       for (; level > 0; level--)
-               put_page(hblocks[level - 1].page);
+       if (fsverity_hash_block(params, inode, data, real_hash) != 0)
+               goto error;
+       if (memcmp(want_hash, real_hash, hsize) != 0)
+               goto corrupted;
+       return true;
 
-       return err == 0;
+corrupted:
+       fsverity_err(inode,
+                    "FILE CORRUPTED! pos=%llu, level=%d, want_hash=%s:%*phN, real_hash=%s:%*phN",
+                    data_pos, level - 1,
+                    params->hash_alg->name, hsize, want_hash,
+                    params->hash_alg->name, hsize, real_hash);
+error:
+       for (; level > 0; level--) {
+               kunmap_local(hblocks[level - 1].addr);
+               put_page(hblocks[level - 1].page);
+       }
+       return false;
 }
 
 static bool
-verify_data_blocks(struct inode *inode, struct fsverity_info *vi,
-                  struct ahash_request *req, struct folio *data_folio,
-                  size_t len, size_t offset, unsigned long max_ra_pages)
+verify_data_blocks(struct folio *data_folio, size_t len, size_t offset,
+                  unsigned long max_ra_pages)
 {
+       struct inode *inode = data_folio->mapping->host;
+       struct fsverity_info *vi = inode->i_verity_info;
        const unsigned int block_size = vi->tree_params.block_size;
        u64 pos = (u64)data_folio->index << PAGE_SHIFT;
 
@@ -278,11 +257,14 @@ verify_data_blocks(struct inode *inode, struct fsverity_info *vi,
                         folio_test_uptodate(data_folio)))
                return false;
        do {
-               struct page *data_page =
-                       folio_page(data_folio, offset >> PAGE_SHIFT);
-
-               if (!verify_data_block(inode, vi, req, data_page, pos + offset,
-                                      offset & ~PAGE_MASK, max_ra_pages))
+               void *data;
+               bool valid;
+
+               data = kmap_local_folio(data_folio, offset);
+               valid = verify_data_block(inode, vi, data, pos + offset,
+                                         max_ra_pages);
+               kunmap_local(data);
+               if (!valid)
                        return false;
                offset += block_size;
                len -= block_size;
@@ -304,19 +286,7 @@ verify_data_blocks(struct inode *inode, struct fsverity_info *vi,
  */
 bool fsverity_verify_blocks(struct folio *folio, size_t len, size_t offset)
 {
-       struct inode *inode = folio->mapping->host;
-       struct fsverity_info *vi = inode->i_verity_info;
-       struct ahash_request *req;
-       bool valid;
-
-       /* This allocation never fails, since it's mempool-backed. */
-       req = fsverity_alloc_hash_request(vi->tree_params.hash_alg, GFP_NOFS);
-
-       valid = verify_data_blocks(inode, vi, req, folio, len, offset, 0);
-
-       fsverity_free_hash_request(vi->tree_params.hash_alg, req);
-
-       return valid;
+       return verify_data_blocks(folio, len, offset, 0);
 }
 EXPORT_SYMBOL_GPL(fsverity_verify_blocks);
 
@@ -337,15 +307,9 @@ EXPORT_SYMBOL_GPL(fsverity_verify_blocks);
  */
 void fsverity_verify_bio(struct bio *bio)
 {
-       struct inode *inode = bio_first_page_all(bio)->mapping->host;
-       struct fsverity_info *vi = inode->i_verity_info;
-       struct ahash_request *req;
        struct folio_iter fi;
        unsigned long max_ra_pages = 0;
 
-       /* This allocation never fails, since it's mempool-backed. */
-       req = fsverity_alloc_hash_request(vi->tree_params.hash_alg, GFP_NOFS);
-
        if (bio->bi_opf & REQ_RAHEAD) {
                /*
                 * If this bio is for data readahead, then we also do readahead
@@ -360,14 +324,12 @@ void fsverity_verify_bio(struct bio *bio)
        }
 
        bio_for_each_folio_all(fi, bio) {
-               if (!verify_data_blocks(inode, vi, req, fi.folio, fi.length,
-                                       fi.offset, max_ra_pages)) {
+               if (!verify_data_blocks(fi.folio, fi.length, fi.offset,
+                                       max_ra_pages)) {
                        bio->bi_status = BLK_STS_IOERR;
                        break;
                }
        }
-
-       fsverity_free_hash_request(vi->tree_params.hash_alg, req);
 }
 EXPORT_SYMBOL_GPL(fsverity_verify_bio);
 #endif /* CONFIG_BLOCK */
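With the verification path converted to kmap_local() and plain virtual addresses, the entry points a filesystem calls keep the same shape: a folio plus length and offset. A hedged sketch of the caller's view (foofs_verify_read_folio() is illustrative):

#include <linux/fsverity.h>
#include <linux/mm.h>

/* Verify a freshly read, uptodate folio against the Merkle tree before
 * exposing it to readers; false means hash mismatch or read error. */
static bool foofs_verify_read_folio(struct folio *folio)
{
	struct inode *inode = folio->mapping->host;

	if (!fsverity_active(inode))
		return true;
	return fsverity_verify_blocks(folio, folio_size(folio), 0);
}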
index a2aa36b..4d68a58 100644 (file)
@@ -301,7 +301,7 @@ struct xfs_btree_cur
 static inline size_t
 xfs_btree_cur_sizeof(unsigned int nlevels)
 {
-       return struct_size((struct xfs_btree_cur *)NULL, bc_levels, nlevels);
+       return struct_size_t(struct xfs_btree_cur, bc_levels, nlevels);
 }
 
 /* cursor flags */
index 9d7b9ee..c32b5fa 100644 (file)
@@ -60,7 +60,7 @@ struct xchk_btree {
 static inline size_t
 xchk_btree_sizeof(unsigned int nlevels)
 {
-       return struct_size((struct xchk_btree *)NULL, lastkey, nlevels - 1);
+       return struct_size_t(struct xchk_btree, lastkey, nlevels - 1);
 }
 
 int xchk_btree(struct xfs_scrub *sc, struct xfs_btree_cur *cur,
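Both xfs hunks above replace the struct_size() idiom that needed a casted NULL pointer with struct_size_t(), which takes the type name directly. A small hedged sketch with a hypothetical flexible-array struct:

#include <linux/overflow.h>
#include <linux/slab.h>
#include <linux/types.h>

struct foo {
	unsigned int	nr;
	u64		items[];	/* flexible array member */
};

static struct foo *foo_alloc(unsigned int nr)
{
	/* struct_size_t(TYPE, member, count): overflow-checked size of the
	 * struct plus nr trailing elements, no (struct foo *)NULL cast. */
	struct foo *f = kzalloc(struct_size_t(struct foo, items, nr),
				GFP_KERNEL);

	if (f)
		f->nr = nr;
	return f;
}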
index aede746..0c6671e 100644 (file)
@@ -306,6 +306,34 @@ xfs_file_read_iter(
        return ret;
 }
 
+STATIC ssize_t
+xfs_file_splice_read(
+       struct file             *in,
+       loff_t                  *ppos,
+       struct pipe_inode_info  *pipe,
+       size_t                  len,
+       unsigned int            flags)
+{
+       struct inode            *inode = file_inode(in);
+       struct xfs_inode        *ip = XFS_I(inode);
+       struct xfs_mount        *mp = ip->i_mount;
+       ssize_t                 ret = 0;
+
+       XFS_STATS_INC(mp, xs_read_calls);
+
+       if (xfs_is_shutdown(mp))
+               return -EIO;
+
+       trace_xfs_file_splice_read(ip, *ppos, len);
+
+       xfs_ilock(ip, XFS_IOLOCK_SHARED);
+       ret = filemap_splice_read(in, ppos, pipe, len, flags);
+       xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+       if (ret > 0)
+               XFS_STATS_ADD(mp, xs_read_bytes, ret);
+       return ret;
+}
+
 /*
  * Common pre-write limit and setup checks.
  *
@@ -717,14 +745,9 @@ write_retry:
        if (ret)
                goto out;
 
-       /* We can write back this queue in page reclaim */
-       current->backing_dev_info = inode_to_bdi(inode);
-
        trace_xfs_file_buffered_write(iocb, from);
        ret = iomap_file_buffered_write(iocb, from,
                        &xfs_buffered_write_iomap_ops);
-       if (likely(ret >= 0))
-               iocb->ki_pos += ret;
 
        /*
         * If we hit a space limit, try to free up some lingering preallocated
@@ -753,7 +776,6 @@ write_retry:
                goto write_retry;
        }
 
-       current->backing_dev_info = NULL;
 out:
        if (iolock)
                xfs_iunlock(ip, iolock);
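Note the two deletions in this hunk: iomap_file_buffered_write() now advances iocb->ki_pos itself, and filesystems no longer set current->backing_dev_info around buffered writes. A hedged sketch of a minimal iomap-based buffered ->write_iter after both changes (the foofs_* names and locking policy are assumptions):

#include <linux/fs.h>
#include <linux/iomap.h>
#include <linux/uio.h>

extern const struct iomap_ops foofs_iomap_ops;	/* hypothetical fs-provided ops */

static ssize_t foofs_buffered_write_iter(struct kiocb *iocb,
					 struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	inode_lock(inode);
	ret = generic_write_checks(iocb, from);
	if (ret > 0)
		/* ki_pos is advanced inside iomap; no manual += afterwards */
		ret = iomap_file_buffered_write(iocb, from, &foofs_iomap_ops);
	inode_unlock(inode);

	if (ret > 0)
		ret = generic_write_sync(iocb, ret);
	return ret;
}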
@@ -1423,7 +1445,7 @@ const struct file_operations xfs_file_operations = {
        .llseek         = xfs_file_llseek,
        .read_iter      = xfs_file_read_iter,
        .write_iter     = xfs_file_write_iter,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = xfs_file_splice_read,
        .splice_write   = iter_file_splice_write,
        .iopoll         = iocb_bio_iopoll,
        .unlocked_ioctl = xfs_file_ioctl,
index 13851c0..9ebb833 100644 (file)
@@ -534,6 +534,9 @@ xfs_do_force_shutdown(
        } else if (flags & SHUTDOWN_CORRUPT_ONDISK) {
                tag = XFS_PTAG_SHUTDOWN_CORRUPT;
                why = "Corruption of on-disk metadata";
+       } else if (flags & SHUTDOWN_DEVICE_REMOVED) {
+               tag = XFS_PTAG_SHUTDOWN_IOERROR;
+               why = "Block device removal";
        } else {
                tag = XFS_PTAG_SHUTDOWN_IOERROR;
                why = "Metadata I/O Error";
index 6c09f89..e2866e7 100644 (file)
@@ -458,12 +458,14 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, uint32_t flags, char *fname,
 #define SHUTDOWN_FORCE_UMOUNT  (1u << 2) /* shutdown from a forced unmount */
 #define SHUTDOWN_CORRUPT_INCORE        (1u << 3) /* corrupt in-memory structures */
 #define SHUTDOWN_CORRUPT_ONDISK        (1u << 4)  /* corrupt metadata on device */
+#define SHUTDOWN_DEVICE_REMOVED        (1u << 5) /* device removed underneath us */
 
 #define XFS_SHUTDOWN_STRINGS \
        { SHUTDOWN_META_IO_ERROR,       "metadata_io" }, \
        { SHUTDOWN_LOG_IO_ERROR,        "log_io" }, \
        { SHUTDOWN_FORCE_UMOUNT,        "force_umount" }, \
-       { SHUTDOWN_CORRUPT_INCORE,      "corruption" }
+       { SHUTDOWN_CORRUPT_INCORE,      "corruption" }, \
+       { SHUTDOWN_DEVICE_REMOVED,      "device_removed" }
 
 /*
  * Flags for xfs_mountfs
index 4120bd1..d910b14 100644 (file)
@@ -377,6 +377,17 @@ disable_dax:
        return 0;
 }
 
+static void
+xfs_bdev_mark_dead(
+       struct block_device     *bdev)
+{
+       xfs_force_shutdown(bdev->bd_holder, SHUTDOWN_DEVICE_REMOVED);
+}
+
+static const struct blk_holder_ops xfs_holder_ops = {
+       .mark_dead              = xfs_bdev_mark_dead,
+};
+
 STATIC int
 xfs_blkdev_get(
        xfs_mount_t             *mp,
@@ -385,8 +396,8 @@ xfs_blkdev_get(
 {
        int                     error = 0;
 
-       *bdevp = blkdev_get_by_path(name, FMODE_READ|FMODE_WRITE|FMODE_EXCL,
-                                   mp);
+       *bdevp = blkdev_get_by_path(name, BLK_OPEN_READ | BLK_OPEN_WRITE, mp,
+                                   &xfs_holder_ops);
        if (IS_ERR(*bdevp)) {
                error = PTR_ERR(*bdevp);
                xfs_warn(mp, "Invalid device [%s], error=%d", name, error);
@@ -397,10 +408,11 @@ xfs_blkdev_get(
 
 STATIC void
 xfs_blkdev_put(
+       struct xfs_mount        *mp,
        struct block_device     *bdev)
 {
        if (bdev)
-               blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
+               blkdev_put(bdev, mp);
 }
 
 STATIC void
@@ -411,13 +423,13 @@ xfs_close_devices(
                struct block_device *logdev = mp->m_logdev_targp->bt_bdev;
 
                xfs_free_buftarg(mp->m_logdev_targp);
-               xfs_blkdev_put(logdev);
+               xfs_blkdev_put(mp, logdev);
        }
        if (mp->m_rtdev_targp) {
                struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
 
                xfs_free_buftarg(mp->m_rtdev_targp);
-               xfs_blkdev_put(rtdev);
+               xfs_blkdev_put(mp, rtdev);
        }
        xfs_free_buftarg(mp->m_ddev_targp);
 }
@@ -492,10 +504,10 @@ xfs_open_devices(
  out_free_ddev_targ:
        xfs_free_buftarg(mp->m_ddev_targp);
  out_close_rtdev:
-       xfs_blkdev_put(rtdev);
+       xfs_blkdev_put(mp, rtdev);
  out_close_logdev:
        if (logdev && logdev != ddev)
-               xfs_blkdev_put(logdev);
+               xfs_blkdev_put(mp, logdev);
        return error;
 }
 
@@ -1160,6 +1172,13 @@ xfs_fs_free_cached_objects(
        return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
 }
 
+static void
+xfs_fs_shutdown(
+       struct super_block      *sb)
+{
+       xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REMOVED);
+}
+
 static const struct super_operations xfs_super_operations = {
        .alloc_inode            = xfs_fs_alloc_inode,
        .destroy_inode          = xfs_fs_destroy_inode,
@@ -1173,6 +1192,7 @@ static const struct super_operations xfs_super_operations = {
        .show_options           = xfs_fs_show_options,
        .nr_cached_objects      = xfs_fs_nr_cached_objects,
        .free_cached_objects    = xfs_fs_free_cached_objects,
+       .shutdown               = xfs_fs_shutdown,
 };
 
 static int
index cd4ca5b..4db6692 100644 (file)
@@ -1445,7 +1445,6 @@ DEFINE_RW_EVENT(xfs_file_direct_write);
 DEFINE_RW_EVENT(xfs_file_dax_write);
 DEFINE_RW_EVENT(xfs_reflink_bounce_dio_write);
 
-
 DECLARE_EVENT_CLASS(xfs_imap_class,
        TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count,
                 int whichfork, struct xfs_bmbt_irec *irec),
@@ -1535,6 +1534,7 @@ DEFINE_SIMPLE_IO_EVENT(xfs_zero_eof);
 DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write);
 DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write_unwritten);
 DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write_append);
+DEFINE_SIMPLE_IO_EVENT(xfs_file_splice_read);
 
 DECLARE_EVENT_CLASS(xfs_itrunc_class,
        TP_PROTO(struct xfs_inode *ip, xfs_fsize_t new_size),
index 132f01d..92c9aaa 100644 (file)
@@ -181,7 +181,6 @@ const struct address_space_operations zonefs_file_aops = {
        .migrate_folio          = filemap_migrate_folio,
        .is_partially_uptodate  = iomap_is_partially_uptodate,
        .error_remove_page      = generic_error_remove_page,
-       .direct_IO              = noop_direct_IO,
        .swap_activate          = zonefs_swap_activate,
 };
 
@@ -342,6 +341,77 @@ static loff_t zonefs_file_llseek(struct file *file, loff_t offset, int whence)
        return generic_file_llseek_size(file, offset, whence, isize, isize);
 }
 
+struct zonefs_zone_append_bio {
+       /* The target inode of the BIO */
+       struct inode *inode;
+
+       /* For sync writes, the target append write offset */
+       u64 append_offset;
+
+       /*
+        * This member must come last, bio_alloc_bioset will allocate enough
+        * bytes for entire zonefs_bio but relies on bio being last.
+        */
+       struct bio bio;
+};
+
+static inline struct zonefs_zone_append_bio *
+zonefs_zone_append_bio(struct bio *bio)
+{
+       return container_of(bio, struct zonefs_zone_append_bio, bio);
+}
+
+static void zonefs_file_zone_append_dio_bio_end_io(struct bio *bio)
+{
+       struct zonefs_zone_append_bio *za_bio = zonefs_zone_append_bio(bio);
+       struct zonefs_zone *z = zonefs_inode_zone(za_bio->inode);
+       sector_t za_sector;
+
+       if (bio->bi_status != BLK_STS_OK)
+               goto bio_end;
+
+       /*
+        * If the file zone was written underneath the file system, the zone
+        * append operation can still succeed (if the zone is not full) but
+        * the write append location will not be where we expect it to be.
+        * Check that we wrote where we intended to, that is, at z->z_wpoffset.
+        */
+       za_sector = z->z_sector + (za_bio->append_offset >> SECTOR_SHIFT);
+       if (bio->bi_iter.bi_sector != za_sector) {
+               zonefs_warn(za_bio->inode->i_sb,
+                           "Invalid write sector %llu for zone at %llu\n",
+                           bio->bi_iter.bi_sector, z->z_sector);
+               bio->bi_status = BLK_STS_IOERR;
+       }
+
+bio_end:
+       iomap_dio_bio_end_io(bio);
+}
+
+static void zonefs_file_zone_append_dio_submit_io(const struct iomap_iter *iter,
+                                                 struct bio *bio,
+                                                 loff_t file_offset)
+{
+       struct zonefs_zone_append_bio *za_bio = zonefs_zone_append_bio(bio);
+       struct inode *inode = iter->inode;
+       struct zonefs_zone *z = zonefs_inode_zone(inode);
+
+       /*
+        * Issue a zone append BIO to process sync dio writes. The append
+        * file offset is saved to check the zone append write location
+        * on completion of the BIO.
+        */
+       za_bio->inode = inode;
+       za_bio->append_offset = file_offset;
+
+       bio->bi_opf &= ~REQ_OP_WRITE;
+       bio->bi_opf |= REQ_OP_ZONE_APPEND;
+       bio->bi_iter.bi_sector = z->z_sector;
+       bio->bi_end_io = zonefs_file_zone_append_dio_bio_end_io;
+
+       submit_bio(bio);
+}
+
 static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
                                        int error, unsigned int flags)
 {
@@ -372,93 +442,17 @@ static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
        return 0;
 }
 
-static const struct iomap_dio_ops zonefs_write_dio_ops = {
-       .end_io                 = zonefs_file_write_dio_end_io,
-};
-
-static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
-{
-       struct inode *inode = file_inode(iocb->ki_filp);
-       struct zonefs_zone *z = zonefs_inode_zone(inode);
-       struct block_device *bdev = inode->i_sb->s_bdev;
-       unsigned int max = bdev_max_zone_append_sectors(bdev);
-       pgoff_t start, end;
-       struct bio *bio;
-       ssize_t size = 0;
-       int nr_pages;
-       ssize_t ret;
-
-       max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
-       iov_iter_truncate(from, max);
-
-       /*
-        * If the inode block size (zone write granularity) is smaller than the
-        * page size, we may be appending data belonging to the last page of the
-        * inode straddling inode->i_size, with that page already cached due to
-        * a buffered read or readahead. So make sure to invalidate that page.
-        * This will always be a no-op for the case where the block size is
-        * equal to the page size.
-        */
-       start = iocb->ki_pos >> PAGE_SHIFT;
-       end = (iocb->ki_pos + iov_iter_count(from) - 1) >> PAGE_SHIFT;
-       if (invalidate_inode_pages2_range(inode->i_mapping, start, end))
-               return -EBUSY;
-
-       nr_pages = iov_iter_npages(from, BIO_MAX_VECS);
-       if (!nr_pages)
-               return 0;
-
-       bio = bio_alloc(bdev, nr_pages,
-                       REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE, GFP_NOFS);
-       bio->bi_iter.bi_sector = z->z_sector;
-       bio->bi_ioprio = iocb->ki_ioprio;
-       if (iocb_is_dsync(iocb))
-               bio->bi_opf |= REQ_FUA;
-
-       ret = bio_iov_iter_get_pages(bio, from);
-       if (unlikely(ret))
-               goto out_release;
-
-       size = bio->bi_iter.bi_size;
-       task_io_account_write(size);
-
-       if (iocb->ki_flags & IOCB_HIPRI)
-               bio_set_polled(bio, iocb);
-
-       ret = submit_bio_wait(bio);
-
-       /*
-        * If the file zone was written underneath the file system, the zone
-        * write pointer may not be where we expect it to be, but the zone
-        * append write can still succeed. So check manually that we wrote where
-        * we intended to, that is, at zi->i_wpoffset.
-        */
-       if (!ret) {
-               sector_t wpsector =
-                       z->z_sector + (z->z_wpoffset >> SECTOR_SHIFT);
-
-               if (bio->bi_iter.bi_sector != wpsector) {
-                       zonefs_warn(inode->i_sb,
-                               "Corrupted write pointer %llu for zone at %llu\n",
-                               bio->bi_iter.bi_sector, z->z_sector);
-                       ret = -EIO;
-               }
-       }
-
-       zonefs_file_write_dio_end_io(iocb, size, ret, 0);
-       trace_zonefs_file_dio_append(inode, size, ret);
+static struct bio_set zonefs_zone_append_bio_set;
 
-out_release:
-       bio_release_pages(bio, false);
-       bio_put(bio);
-
-       if (ret >= 0) {
-               iocb->ki_pos += size;
-               return size;
-       }
+static const struct iomap_dio_ops zonefs_zone_append_dio_ops = {
+       .submit_io      = zonefs_file_zone_append_dio_submit_io,
+       .end_io         = zonefs_file_write_dio_end_io,
+       .bio_set        = &zonefs_zone_append_bio_set,
+};
 
-       return ret;
-}
+static const struct iomap_dio_ops zonefs_write_dio_ops = {
+       .end_io         = zonefs_file_write_dio_end_io,
+};
 
 /*
  * Do not exceed the LFS limits nor the file zone size. If pos is under the
@@ -539,6 +533,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
        struct zonefs_inode_info *zi = ZONEFS_I(inode);
        struct zonefs_zone *z = zonefs_inode_zone(inode);
        struct super_block *sb = inode->i_sb;
+       const struct iomap_dio_ops *dio_ops;
        bool sync = is_sync_kiocb(iocb);
        bool append = false;
        ssize_t ret, count;
@@ -582,20 +577,26 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
        }
 
        if (append) {
-               ret = zonefs_file_dio_append(iocb, from);
+               unsigned int max = bdev_max_zone_append_sectors(sb->s_bdev);
+
+               max = ALIGN_DOWN(max << SECTOR_SHIFT, sb->s_blocksize);
+               iov_iter_truncate(from, max);
+
+               dio_ops = &zonefs_zone_append_dio_ops;
        } else {
-               /*
-                * iomap_dio_rw() may return ENOTBLK if there was an issue with
-                * page invalidation. Overwrite that error code with EBUSY to
-                * be consistent with zonefs_file_dio_append() return value for
-                * similar issues.
-                */
-               ret = iomap_dio_rw(iocb, from, &zonefs_write_iomap_ops,
-                                  &zonefs_write_dio_ops, 0, NULL, 0);
-               if (ret == -ENOTBLK)
-                       ret = -EBUSY;
+               dio_ops = &zonefs_write_dio_ops;
        }
 
+       /*
+        * iomap_dio_rw() may return ENOTBLK if there was an issue with
+        * page invalidation. Overwrite that error code with EBUSY so that
+        * the user can make sense of the error.
+        */
+       ret = iomap_dio_rw(iocb, from, &zonefs_write_iomap_ops,
+                          dio_ops, 0, NULL, 0);
+       if (ret == -ENOTBLK)
+               ret = -EBUSY;
+
        if (zonefs_zone_is_seq(z) &&
            (ret > 0 || ret == -EIOCBQUEUED)) {
                if (ret > 0)
@@ -643,9 +644,7 @@ static ssize_t zonefs_file_buffered_write(struct kiocb *iocb,
                goto inode_unlock;
 
        ret = iomap_file_buffered_write(iocb, from, &zonefs_write_iomap_ops);
-       if (ret > 0)
-               iocb->ki_pos += ret;
-       else if (ret == -EIO)
+       if (ret == -EIO)
                zonefs_io_error(inode, true);
 
 inode_unlock:
@@ -752,6 +751,44 @@ inode_unlock:
        return ret;
 }
 
+static ssize_t zonefs_file_splice_read(struct file *in, loff_t *ppos,
+                                      struct pipe_inode_info *pipe,
+                                      size_t len, unsigned int flags)
+{
+       struct inode *inode = file_inode(in);
+       struct zonefs_inode_info *zi = ZONEFS_I(inode);
+       struct zonefs_zone *z = zonefs_inode_zone(inode);
+       loff_t isize;
+       ssize_t ret = 0;
+
+       /* Offline zones cannot be read */
+       if (unlikely(IS_IMMUTABLE(inode) && !(inode->i_mode & 0777)))
+               return -EPERM;
+
+       if (*ppos >= z->z_capacity)
+               return 0;
+
+       inode_lock_shared(inode);
+
+       /* Limit read operations to written data */
+       mutex_lock(&zi->i_truncate_mutex);
+       isize = i_size_read(inode);
+       if (*ppos >= isize)
+               len = 0;
+       else
+               len = min_t(loff_t, len, isize - *ppos);
+       mutex_unlock(&zi->i_truncate_mutex);
+
+       if (len > 0) {
+               ret = filemap_splice_read(in, ppos, pipe, len, flags);
+               if (ret == -EIO)
+                       zonefs_io_error(inode, false);
+       }
+
+       inode_unlock_shared(inode);
+       return ret;
+}
+
 /*
  * Write open accounting is done only for sequential files.
  */
@@ -813,6 +850,7 @@ static int zonefs_file_open(struct inode *inode, struct file *file)
 {
        int ret;
 
+       file->f_mode |= FMODE_CAN_ODIRECT;
        ret = generic_file_open(inode, file);
        if (ret)
                return ret;
@@ -896,7 +934,19 @@ const struct file_operations zonefs_file_operations = {
        .llseek         = zonefs_file_llseek,
        .read_iter      = zonefs_file_read_iter,
        .write_iter     = zonefs_file_write_iter,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = zonefs_file_splice_read,
        .splice_write   = iter_file_splice_write,
        .iopoll         = iocb_bio_iopoll,
 };
+
+int zonefs_file_bioset_init(void)
+{
+       return bioset_init(&zonefs_zone_append_bio_set, BIO_POOL_SIZE,
+                          offsetof(struct zonefs_zone_append_bio, bio),
+                          BIOSET_NEED_BVECS);
+}
+
+void zonefs_file_bioset_exit(void)
+{
+       bioset_exit(&zonefs_zone_append_bio_set);
+}
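
For illustration, a minimal sketch (not part of this patch; bdev and nr_vecs are assumed locals) of the embedded-BIO pattern this bioset enables. Because the bioset front pad is offsetof(struct zonefs_zone_append_bio, bio), the containing structure can be recovered from the bio via the zonefs_zone_append_bio() helper above; in the patch itself the allocation is done by iomap through dio_ops->bio_set rather than directly:

        struct bio *bio = bio_alloc_bioset(bdev, nr_vecs, REQ_OP_WRITE,
                                           GFP_NOFS, &zonefs_zone_append_bio_set);
        struct zonefs_zone_append_bio *za_bio = zonefs_zone_append_bio(bio);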
index 23b8b29..bbe44a2 100644 (file)
@@ -1128,7 +1128,7 @@ static int zonefs_read_super(struct super_block *sb)
 
        bio_init(&bio, sb->s_bdev, &bio_vec, 1, REQ_OP_READ);
        bio.bi_iter.bi_sector = 0;
-       bio_add_page(&bio, page, PAGE_SIZE, 0);
+       __bio_add_page(&bio, page, PAGE_SIZE, 0);
 
        ret = submit_bio_wait(&bio);
        if (ret)
@@ -1412,10 +1412,14 @@ static int __init zonefs_init(void)
 
        BUILD_BUG_ON(sizeof(struct zonefs_super) != ZONEFS_SUPER_SIZE);
 
-       ret = zonefs_init_inodecache();
+       ret = zonefs_file_bioset_init();
        if (ret)
                return ret;
 
+       ret = zonefs_init_inodecache();
+       if (ret)
+               goto destroy_bioset;
+
        ret = zonefs_sysfs_init();
        if (ret)
                goto destroy_inodecache;
@@ -1430,6 +1434,8 @@ sysfs_exit:
        zonefs_sysfs_exit();
 destroy_inodecache:
        zonefs_destroy_inodecache();
+destroy_bioset:
+       zonefs_file_bioset_exit();
 
        return ret;
 }
@@ -1439,6 +1445,7 @@ static void __exit zonefs_exit(void)
        unregister_filesystem(&zonefs_type);
        zonefs_sysfs_exit();
        zonefs_destroy_inodecache();
+       zonefs_file_bioset_exit();
 }
 
 MODULE_AUTHOR("Damien Le Moal");
index 8175652..f663b8e 100644 (file)
@@ -279,6 +279,8 @@ extern const struct file_operations zonefs_dir_operations;
 extern const struct address_space_operations zonefs_file_aops;
 extern const struct file_operations zonefs_file_operations;
 int zonefs_file_truncate(struct inode *inode, loff_t isize);
+int zonefs_file_bioset_init(void);
+void zonefs_file_bioset_exit(void);
 
 /* In sysfs.c */
 int zonefs_sysfs_register(struct super_block *sb);
index a6affc0..c941d99 100644 (file)
@@ -289,6 +289,8 @@ struct acpi_dep_data {
        acpi_handle supplier;
        acpi_handle consumer;
        bool honor_dep;
+       bool met;
+       bool free_when_met;
 };
 
 /* Performance Management */
index e5dfb6f..451f627 100644 (file)
@@ -307,7 +307,8 @@ enum acpi_preferred_pm_profiles {
        PM_SOHO_SERVER = 5,
        PM_APPLIANCE_PC = 6,
        PM_PERFORMANCE_SERVER = 7,
-       PM_TABLET = 8
+       PM_TABLET = 8,
+       NR_PM_PROFILES = 9
 };
 
 /* Values for sleep_status and sleep_control registers (V5+ FADT) */
index f51c46f..000764a 100644 (file)
@@ -86,7 +86,7 @@ struct acpi_table_slic {
 struct acpi_table_slit {
        struct acpi_table_header header;        /* Common ACPI table header */
        u64 locality_count;
-       u8 entry[1];            /* Real size = localities^2 */
+       u8 entry[];                             /* Real size = localities^2 */
 };
 
 /*******************************************************************************
index e271d67..22142c7 100644 (file)
@@ -130,7 +130,4 @@ ATOMIC_OP(xor, ^)
 #define arch_atomic_read(v)                    READ_ONCE((v)->counter)
 #define arch_atomic_set(v, i)                  WRITE_ONCE(((v)->counter), (i))
 
-#define arch_atomic_xchg(ptr, v)               (arch_xchg(&(ptr)->counter, (u32)(v)))
-#define arch_atomic_cmpxchg(v, old, new)       (arch_cmpxchg(&((v)->counter), (u32)(old), (u32)(new)))
-
 #endif /* __ASM_GENERIC_ATOMIC_H */
index 71ab4ba..e076e07 100644 (file)
@@ -15,21 +15,21 @@ static __always_inline void
 arch_set_bit(unsigned int nr, volatile unsigned long *p)
 {
        p += BIT_WORD(nr);
-       arch_atomic_long_or(BIT_MASK(nr), (atomic_long_t *)p);
+       raw_atomic_long_or(BIT_MASK(nr), (atomic_long_t *)p);
 }
 
 static __always_inline void
 arch_clear_bit(unsigned int nr, volatile unsigned long *p)
 {
        p += BIT_WORD(nr);
-       arch_atomic_long_andnot(BIT_MASK(nr), (atomic_long_t *)p);
+       raw_atomic_long_andnot(BIT_MASK(nr), (atomic_long_t *)p);
 }
 
 static __always_inline void
 arch_change_bit(unsigned int nr, volatile unsigned long *p)
 {
        p += BIT_WORD(nr);
-       arch_atomic_long_xor(BIT_MASK(nr), (atomic_long_t *)p);
+       raw_atomic_long_xor(BIT_MASK(nr), (atomic_long_t *)p);
 }
 
 static __always_inline int
@@ -39,7 +39,7 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
        unsigned long mask = BIT_MASK(nr);
 
        p += BIT_WORD(nr);
-       old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
+       old = raw_atomic_long_fetch_or(mask, (atomic_long_t *)p);
        return !!(old & mask);
 }
 
@@ -50,7 +50,7 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p)
        unsigned long mask = BIT_MASK(nr);
 
        p += BIT_WORD(nr);
-       old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
+       old = raw_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
        return !!(old & mask);
 }
 
@@ -61,7 +61,7 @@ arch_test_and_change_bit(unsigned int nr, volatile unsigned long *p)
        unsigned long mask = BIT_MASK(nr);
 
        p += BIT_WORD(nr);
-       old = arch_atomic_long_fetch_xor(mask, (atomic_long_t *)p);
+       old = raw_atomic_long_fetch_xor(mask, (atomic_long_t *)p);
        return !!(old & mask);
 }
 
index 630f2f6..4091351 100644 (file)
@@ -25,7 +25,7 @@ arch_test_and_set_bit_lock(unsigned int nr, volatile unsigned long *p)
        if (READ_ONCE(*p) & mask)
                return 1;
 
-       old = arch_atomic_long_fetch_or_acquire(mask, (atomic_long_t *)p);
+       old = raw_atomic_long_fetch_or_acquire(mask, (atomic_long_t *)p);
        return !!(old & mask);
 }
 
@@ -41,7 +41,7 @@ static __always_inline void
 arch_clear_bit_unlock(unsigned int nr, volatile unsigned long *p)
 {
        p += BIT_WORD(nr);
-       arch_atomic_long_fetch_andnot_release(BIT_MASK(nr), (atomic_long_t *)p);
+       raw_atomic_long_fetch_andnot_release(BIT_MASK(nr), (atomic_long_t *)p);
 }
 
 /**
@@ -63,7 +63,7 @@ arch___clear_bit_unlock(unsigned int nr, volatile unsigned long *p)
        p += BIT_WORD(nr);
        old = READ_ONCE(*p);
        old &= ~BIT_MASK(nr);
-       arch_atomic_long_set_release((atomic_long_t *)p, old);
+       raw_atomic_long_set_release((atomic_long_t *)p, old);
 }
 
 /**
@@ -83,7 +83,7 @@ static inline bool arch_clear_bit_unlock_is_negative_byte(unsigned int nr,
        unsigned long mask = BIT_MASK(nr);
 
        p += BIT_WORD(nr);
-       old = arch_atomic_long_fetch_andnot_release(mask, (atomic_long_t *)p);
+       old = raw_atomic_long_fetch_andnot_release(mask, (atomic_long_t *)p);
        return !!(old & BIT(7));
 }
 #define arch_clear_bit_unlock_is_negative_byte arch_clear_bit_unlock_is_negative_byte
index 4050b19..6e79442 100644 (file)
@@ -87,10 +87,12 @@ struct bug_entry {
  *
  * Use the versions with printk format strings to provide better diagnostics.
  */
-#ifndef __WARN_FLAGS
 extern __printf(4, 5)
 void warn_slowpath_fmt(const char *file, const int line, unsigned taint,
                       const char *fmt, ...);
+extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
+
+#ifndef __WARN_FLAGS
 #define __WARN()               __WARN_printf(TAINT_WARN, NULL)
 #define __WARN_printf(taint, arg...) do {                              \
                instrumentation_begin();                                \
@@ -98,7 +100,6 @@ void warn_slowpath_fmt(const char *file, const int line, unsigned taint,
                instrumentation_end();                                  \
        } while (0)
 #else
-extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 #define __WARN()               __WARN_FLAGS(BUGFLAG_TAINT(TAINT_WARN))
 #define __WARN_printf(taint, arg...) do {                              \
                instrumentation_begin();                                \
diff --git a/include/asm-generic/bugs.h b/include/asm-generic/bugs.h
deleted file mode 100644 (file)
index 6902183..0000000
+++ /dev/null
@@ -1,11 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __ASM_GENERIC_BUGS_H
-#define __ASM_GENERIC_BUGS_H
-/*
- * This file is included by 'init/main.c' to check for
- * architecture-dependent bugs.
- */
-
-static inline void check_bugs(void) { }
-
-#endif /* __ASM_GENERIC_BUGS_H */
index 6432a7f..94cbd50 100644 (file)
@@ -89,27 +89,35 @@ do {                                                                        \
        __ret;                                                          \
 })
 
-#define raw_cpu_generic_cmpxchg(pcp, oval, nval)                       \
+#define __cpu_fallback_try_cmpxchg(pcp, ovalp, nval, _cmpxchg)         \
+({                                                                     \
+       typeof(pcp) __val, __old = *(ovalp);                            \
+       __val = _cmpxchg(pcp, __old, nval);                             \
+       if (__val != __old)                                             \
+               *(ovalp) = __val;                                       \
+       __val == __old;                                                 \
+})
+
+#define raw_cpu_generic_try_cmpxchg(pcp, ovalp, nval)                  \
 ({                                                                     \
        typeof(pcp) *__p = raw_cpu_ptr(&(pcp));                         \
-       typeof(pcp) __ret;                                              \
-       __ret = *__p;                                                   \
-       if (__ret == (oval))                                            \
+       typeof(pcp) __val = *__p, ___old = *(ovalp);                    \
+       bool __ret;                                                     \
+       if (__val == ___old) {                                          \
                *__p = nval;                                            \
+               __ret = true;                                           \
+       } else {                                                        \
+               *(ovalp) = __val;                                       \
+               __ret = false;                                          \
+       }                                                               \
        __ret;                                                          \
 })
 
-#define raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
+#define raw_cpu_generic_cmpxchg(pcp, oval, nval)                       \
 ({                                                                     \
-       typeof(pcp1) *__p1 = raw_cpu_ptr(&(pcp1));                      \
-       typeof(pcp2) *__p2 = raw_cpu_ptr(&(pcp2));                      \
-       int __ret = 0;                                                  \
-       if (*__p1 == (oval1) && *__p2  == (oval2)) {                    \
-               *__p1 = nval1;                                          \
-               *__p2 = nval2;                                          \
-               __ret = 1;                                              \
-       }                                                               \
-       (__ret);                                                        \
+       typeof(pcp) __old = (oval);                                     \
+       raw_cpu_generic_try_cmpxchg(pcp, &__old, nval);                 \
+       __old;                                                          \
 })
 
 #define __this_cpu_generic_read_nopreempt(pcp)                         \
@@ -170,23 +178,22 @@ do {                                                                      \
        __ret;                                                          \
 })
 
-#define this_cpu_generic_cmpxchg(pcp, oval, nval)                      \
+#define this_cpu_generic_try_cmpxchg(pcp, ovalp, nval)                 \
 ({                                                                     \
-       typeof(pcp) __ret;                                              \
+       bool __ret;                                                     \
        unsigned long __flags;                                          \
        raw_local_irq_save(__flags);                                    \
-       __ret = raw_cpu_generic_cmpxchg(pcp, oval, nval);               \
+       __ret = raw_cpu_generic_try_cmpxchg(pcp, ovalp, nval);          \
        raw_local_irq_restore(__flags);                                 \
        __ret;                                                          \
 })
 
-#define this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)        \
+#define this_cpu_generic_cmpxchg(pcp, oval, nval)                      \
 ({                                                                     \
-       int __ret;                                                      \
+       typeof(pcp) __ret;                                              \
        unsigned long __flags;                                          \
        raw_local_irq_save(__flags);                                    \
-       __ret = raw_cpu_generic_cmpxchg_double(pcp1, pcp2,              \
-                       oval1, oval2, nval1, nval2);                    \
+       __ret = raw_cpu_generic_cmpxchg(pcp, oval, nval);               \
        raw_local_irq_restore(__flags);                                 \
        __ret;                                                          \
 })
@@ -282,6 +289,62 @@ do {                                                                       \
 #define raw_cpu_xchg_8(pcp, nval)      raw_cpu_generic_xchg(pcp, nval)
 #endif
 
+#ifndef raw_cpu_try_cmpxchg_1
+#ifdef raw_cpu_cmpxchg_1
+#define raw_cpu_try_cmpxchg_1(pcp, ovalp, nval) \
+       __cpu_fallback_try_cmpxchg(pcp, ovalp, nval, raw_cpu_cmpxchg_1)
+#else
+#define raw_cpu_try_cmpxchg_1(pcp, ovalp, nval) \
+       raw_cpu_generic_try_cmpxchg(pcp, ovalp, nval)
+#endif
+#endif
+#ifndef raw_cpu_try_cmpxchg_2
+#ifdef raw_cpu_cmpxchg_2
+#define raw_cpu_try_cmpxchg_2(pcp, ovalp, nval) \
+       __cpu_fallback_try_cmpxchg(pcp, ovalp, nval, raw_cpu_cmpxchg_2)
+#else
+#define raw_cpu_try_cmpxchg_2(pcp, ovalp, nval) \
+       raw_cpu_generic_try_cmpxchg(pcp, ovalp, nval)
+#endif
+#endif
+#ifndef raw_cpu_try_cmpxchg_4
+#ifdef raw_cpu_cmpxchg_4
+#define raw_cpu_try_cmpxchg_4(pcp, ovalp, nval) \
+       __cpu_fallback_try_cmpxchg(pcp, ovalp, nval, raw_cpu_cmpxchg_4)
+#else
+#define raw_cpu_try_cmpxchg_4(pcp, ovalp, nval) \
+       raw_cpu_generic_try_cmpxchg(pcp, ovalp, nval)
+#endif
+#endif
+#ifndef raw_cpu_try_cmpxchg_8
+#ifdef raw_cpu_cmpxchg_8
+#define raw_cpu_try_cmpxchg_8(pcp, ovalp, nval) \
+       __cpu_fallback_try_cmpxchg(pcp, ovalp, nval, raw_cpu_cmpxchg_8)
+#else
+#define raw_cpu_try_cmpxchg_8(pcp, ovalp, nval) \
+       raw_cpu_generic_try_cmpxchg(pcp, ovalp, nval)
+#endif
+#endif
+
+#ifndef raw_cpu_try_cmpxchg64
+#ifdef raw_cpu_cmpxchg64
+#define raw_cpu_try_cmpxchg64(pcp, ovalp, nval) \
+       __cpu_fallback_try_cmpxchg(pcp, ovalp, nval, raw_cpu_cmpxchg64)
+#else
+#define raw_cpu_try_cmpxchg64(pcp, ovalp, nval) \
+       raw_cpu_generic_try_cmpxchg(pcp, ovalp, nval)
+#endif
+#endif
+#ifndef raw_cpu_try_cmpxchg128
+#ifdef raw_cpu_cmpxchg128
+#define raw_cpu_try_cmpxchg128(pcp, ovalp, nval) \
+       __cpu_fallback_try_cmpxchg(pcp, ovalp, nval, raw_cpu_cmpxchg128)
+#else
+#define raw_cpu_try_cmpxchg128(pcp, ovalp, nval) \
+       raw_cpu_generic_try_cmpxchg(pcp, ovalp, nval)
+#endif
+#endif
+
 #ifndef raw_cpu_cmpxchg_1
 #define raw_cpu_cmpxchg_1(pcp, oval, nval) \
        raw_cpu_generic_cmpxchg(pcp, oval, nval)
@@ -299,21 +362,13 @@ do {                                                                      \
        raw_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
 
-#ifndef raw_cpu_cmpxchg_double_1
-#define raw_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-       raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-#endif
-#ifndef raw_cpu_cmpxchg_double_2
-#define raw_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-       raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-#endif
-#ifndef raw_cpu_cmpxchg_double_4
-#define raw_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-       raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+#ifndef raw_cpu_cmpxchg64
+#define raw_cpu_cmpxchg64(pcp, oval, nval) \
+       raw_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
-#ifndef raw_cpu_cmpxchg_double_8
-#define raw_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-       raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+#ifndef raw_cpu_cmpxchg128
+#define raw_cpu_cmpxchg128(pcp, oval, nval) \
+       raw_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
 
 #ifndef this_cpu_read_1
@@ -407,6 +462,62 @@ do {                                                                       \
 #define this_cpu_xchg_8(pcp, nval)     this_cpu_generic_xchg(pcp, nval)
 #endif
 
+#ifndef this_cpu_try_cmpxchg_1
+#ifdef this_cpu_cmpxchg_1
+#define this_cpu_try_cmpxchg_1(pcp, ovalp, nval) \
+       __cpu_fallback_try_cmpxchg(pcp, ovalp, nval, this_cpu_cmpxchg_1)
+#else
+#define this_cpu_try_cmpxchg_1(pcp, ovalp, nval) \
+       this_cpu_generic_try_cmpxchg(pcp, ovalp, nval)
+#endif
+#endif
+#ifndef this_cpu_try_cmpxchg_2
+#ifdef this_cpu_cmpxchg_2
+#define this_cpu_try_cmpxchg_2(pcp, ovalp, nval) \
+       __cpu_fallback_try_cmpxchg(pcp, ovalp, nval, this_cpu_cmpxchg_2)
+#else
+#define this_cpu_try_cmpxchg_2(pcp, ovalp, nval) \
+       this_cpu_generic_try_cmpxchg(pcp, ovalp, nval)
+#endif
+#endif
+#ifndef this_cpu_try_cmpxchg_4
+#ifdef this_cpu_cmpxchg_4
+#define this_cpu_try_cmpxchg_4(pcp, ovalp, nval) \
+       __cpu_fallback_try_cmpxchg(pcp, ovalp, nval, this_cpu_cmpxchg_4)
+#else
+#define this_cpu_try_cmpxchg_4(pcp, ovalp, nval) \
+       this_cpu_generic_try_cmpxchg(pcp, ovalp, nval)
+#endif
+#endif
+#ifndef this_cpu_try_cmpxchg_8
+#ifdef this_cpu_cmpxchg_8
+#define this_cpu_try_cmpxchg_8(pcp, ovalp, nval) \
+       __cpu_fallback_try_cmpxchg(pcp, ovalp, nval, this_cpu_cmpxchg_8)
+#else
+#define this_cpu_try_cmpxchg_8(pcp, ovalp, nval) \
+       this_cpu_generic_try_cmpxchg(pcp, ovalp, nval)
+#endif
+#endif
+
+#ifndef this_cpu_try_cmpxchg64
+#ifdef this_cpu_cmpxchg64
+#define this_cpu_try_cmpxchg64(pcp, ovalp, nval) \
+       __cpu_fallback_try_cmpxchg(pcp, ovalp, nval, this_cpu_cmpxchg64)
+#else
+#define this_cpu_try_cmpxchg64(pcp, ovalp, nval) \
+       this_cpu_generic_try_cmpxchg(pcp, ovalp, nval)
+#endif
+#endif
+#ifndef this_cpu_try_cmpxchg128
+#ifdef this_cpu_cmpxchg128
+#define this_cpu_try_cmpxchg128(pcp, ovalp, nval) \
+       __cpu_fallback_try_cmpxchg(pcp, ovalp, nval, this_cpu_cmpxchg128)
+#else
+#define this_cpu_try_cmpxchg128(pcp, ovalp, nval) \
+       this_cpu_generic_try_cmpxchg(pcp, ovalp, nval)
+#endif
+#endif
+
 #ifndef this_cpu_cmpxchg_1
 #define this_cpu_cmpxchg_1(pcp, oval, nval) \
        this_cpu_generic_cmpxchg(pcp, oval, nval)
@@ -424,21 +535,13 @@ do {                                                                      \
        this_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
 
-#ifndef this_cpu_cmpxchg_double_1
-#define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-       this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-#endif
-#ifndef this_cpu_cmpxchg_double_2
-#define this_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-       this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-#endif
-#ifndef this_cpu_cmpxchg_double_4
-#define this_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-       this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+#ifndef this_cpu_cmpxchg64
+#define this_cpu_cmpxchg64(pcp, oval, nval) \
+       this_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
-#ifndef this_cpu_cmpxchg_double_8
-#define this_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-       this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
+#ifndef this_cpu_cmpxchg128
+#define this_cpu_cmpxchg128(pcp, oval, nval) \
+       this_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
 
 #endif /* _ASM_GENERIC_PERCPU_H_ */
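
As a rough usage sketch of the new per-CPU try_cmpxchg operations (assuming the unsuffixed this_cpu_try_cmpxchg() wrapper from the same series; the variable and limit below are illustrative, not from the patch), the old value no longer has to be re-read by hand when the exchange fails:

static DEFINE_PER_CPU(unsigned long, example_count);

static void example_bump_below(unsigned long limit)
{
        unsigned long old = this_cpu_read(example_count);

        do {
                if (old >= limit)
                        return;
                /* On failure, old is updated to the current per-CPU value. */
        } while (!this_cpu_try_cmpxchg(example_count, &old, old + 1));
}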
index cebdf1c..da9e562 100644 (file)
 
 #ifdef CONFIG_UNWINDER_ORC
 #define ORC_UNWIND_TABLE                                               \
+       .orc_header : AT(ADDR(.orc_header) - LOAD_OFFSET) {             \
+               BOUNDED_SECTION_BY(.orc_header, _orc_header)            \
+       }                                                               \
        . = ALIGN(4);                                                   \
        .orc_unwind_ip : AT(ADDR(.orc_unwind_ip) - LOAD_OFFSET) {       \
                BOUNDED_SECTION_BY(.orc_unwind_ip, _orc_unwind_ip)      \
index 536f897..6cdc873 100644 (file)
@@ -38,8 +38,9 @@ extern void hv_remap_tsc_clocksource(void);
 extern unsigned long hv_get_tsc_pfn(void);
 extern struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
 
-static inline notrace u64
-hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg, u64 *cur_tsc)
+static __always_inline bool
+hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
+                    u64 *cur_tsc, u64 *time)
 {
        u64 scale, offset;
        u32 sequence;
@@ -63,7 +64,7 @@ hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg, u64 *cur_tsc)
        do {
                sequence = READ_ONCE(tsc_pg->tsc_sequence);
                if (!sequence)
-                       return U64_MAX;
+                       return false;
                /*
                 * Make sure we read sequence before we read other values from
                 * TSC page.
@@ -82,15 +83,8 @@ hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg, u64 *cur_tsc)
 
        } while (READ_ONCE(tsc_pg->tsc_sequence) != sequence);
 
-       return mul_u64_u64_shr(*cur_tsc, scale, 64) + offset;
-}
-
-static inline notrace u64
-hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
-{
-       u64 cur_tsc;
-
-       return hv_read_tsc_page_tsc(tsc_pg, &cur_tsc);
+       *time = mul_u64_u64_shr(*cur_tsc, scale, 64) + offset;
+       return true;
 }
 
 #else /* CONFIG_HYPERV_TIMER */
@@ -104,10 +98,10 @@ static inline struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
        return NULL;
 }
 
-static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
-                                      u64 *cur_tsc)
+static __always_inline bool
+hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg, u64 *cur_tsc, u64 *time)
 {
-       return U64_MAX;
+       return false;
 }
 
 static inline int hv_stimer_cleanup(unsigned int cpu) { return 0; }
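
A brief sketch (an assumption for illustration, not code from this patch) of how a caller adapts to the boolean-returning hv_read_tsc_page_tsc(), which now reports the computed time through *time instead of signalling failure with U64_MAX:

        u64 cur_tsc, time;

        if (hv_read_tsc_page_tsc(hv_get_tsc_page(), &cur_tsc, &time))
                return time;
        /* TSC page invalid: fall back to another clocksource (hypothetical). */
        return read_fallback_clock();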
index 0b8e6bc..f3b37cb 100644 (file)
 #include <linux/types.h>
 
 typedef struct {
-       u64 a, b;
-} u128;
-
-typedef struct {
        __be64 a, b;
 } be128;
 
@@ -61,20 +57,16 @@ typedef struct {
        __le64 b, a;
 } le128;
 
-static inline void u128_xor(u128 *r, const u128 *p, const u128 *q)
+static inline void be128_xor(be128 *r, const be128 *p, const be128 *q)
 {
        r->a = p->a ^ q->a;
        r->b = p->b ^ q->b;
 }
 
-static inline void be128_xor(be128 *r, const be128 *p, const be128 *q)
-{
-       u128_xor((u128 *)r, (u128 *)p, (u128 *)q);
-}
-
 static inline void le128_xor(le128 *r, const le128 *p, const le128 *q)
 {
-       u128_xor((u128 *)r, (u128 *)p, (u128 *)q);
+       r->a = p->a ^ q->a;
+       r->b = p->b ^ q->b;
 }
 
 #endif /* _CRYPTO_B128OPS_H */
index c0d88b3..c7383e9 100644 (file)
@@ -387,4 +387,96 @@ static inline int kunit_destroy_named_resource(struct kunit *test,
  */
 void kunit_remove_resource(struct kunit *test, struct kunit_resource *res);
 
+/* A 'deferred action' function to be used with kunit_add_action. */
+typedef void (kunit_action_t)(void *);
+
+/**
+ * kunit_add_action() - Call a function when the test ends.
+ * @test: Test case to associate the action with.
+ * @action: The function to run on test exit
+ * @ctx: Data passed into @action
+ *
+ * Defer the execution of a function until the test exits, either normally or
+ * due to a failure.  @ctx is passed as additional context. All functions
+ * registered with kunit_add_action() will execute in the opposite order to
+ * that in which they were registered.
+ *
+ * This is useful for cleaning up allocated memory and resources, as these
+ * functions are called even if the test aborts early due to, e.g., a failed
+ * assertion.
+ *
+ * See also: devm_add_action() for the devres equivalent.
+ *
+ * Returns:
+ *   0 on success, an error if the action could not be deferred.
+ */
+int kunit_add_action(struct kunit *test, kunit_action_t *action, void *ctx);
+
+/**
+ * kunit_add_action_or_reset() - Call a function when the test ends.
+ * @test: Test case to associate the action with.
+ * @action: The function to run on test exit
+ * @ctx: Data passed into @action
+ *
+ * Defer the execution of a function until the test exits, either normally or
+ * due to a failure.  @ctx is passed as additional context. All functions
+ * registered with kunit_add_action() will execute in the opposite order to
+ * that in which they were registered.
+ *
+ * This is useful for cleaning up allocated memory and resources, as these
+ * functions are called even if the test aborts early due to, e.g., a failed
+ * assertion.
+ *
+ * If the action cannot be created (e.g., due to the system being out of memory),
+ * then action(ctx) will be called immediately, and an error will be returned.
+ *
+ * See also: devm_add_action_or_reset() for the devres equivalent.
+ *
+ * Returns:
+ *   0 on success, an error if the action could not be deferred.
+ */
+int kunit_add_action_or_reset(struct kunit *test, kunit_action_t *action,
+                             void *ctx);
+
+/**
+ * kunit_remove_action() - Cancel a matching deferred action.
+ * @test: Test case the action is associated with.
+ * @action: The deferred function to cancel.
+ * @ctx: The context passed to the deferred function to cancel.
+ *
+ * Prevent an action deferred via kunit_add_action() from executing when the
+ * test terminates.
+ *
+ * If the function/context pair was deferred multiple times, only the most
+ * recent one will be cancelled.
+ *
+ * See also: devm_remove_action() for the devres equivalent.
+ */
+void kunit_remove_action(struct kunit *test,
+                        kunit_action_t *action,
+                        void *ctx);
+
+/**
+ * kunit_release_action() - Run a matching action call immediately.
+ * @test: Test case the action is associated with.
+ * @action: The deferred function to trigger.
+ * @ctx: The context passed to the deferred function to trigger.
+ *
+ * Execute a function deferred via kunit_add_action() immediately, rather than
+ * when the test ends.
+ *
+ * If the function/context pair was deferred multiple times, it will only be
+ * executed once here. The most recent deferral will no longer execute when
+ * the test ends.
+ *
+ * kunit_release_action(test, func, ctx);
+ * is equivalent to
+ * func(ctx);
+ * kunit_remove_action(test, func, ctx);
+ *
+ * See also: devm_release_action() for the devres equivalent.
+ */
+void kunit_release_action(struct kunit *test,
+                         kunit_action_t *action,
+                         void *ctx);
 #endif /* _KUNIT_RESOURCE_H */
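
A minimal usage sketch of the deferred-action API (assumed example, not part of this header; kfree_action() and the buffer size are illustrative):

static void kfree_action(void *ctx)
{
        kfree(ctx);
}

static void example_test(struct kunit *test)
{
        char *buf = kmalloc(32, GFP_KERNEL);

        KUNIT_ASSERT_NOT_NULL(test, buf);
        /* buf is freed on test exit, even if a later assertion fails. */
        KUNIT_ASSERT_EQ(test, 0,
                        kunit_add_action_or_reset(test, kfree_action, buf));
}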
index 57b309c..23120d5 100644 (file)
@@ -47,6 +47,7 @@ struct kunit;
  * sub-subtest.  See the "Subtests" section in
  * https://node-tap.org/tap-protocol/
  */
+#define KUNIT_INDENT_LEN               4
 #define KUNIT_SUBTEST_INDENT           "    "
 #define KUNIT_SUBSUBTEST_INDENT                "        "
 
@@ -168,6 +169,9 @@ static inline char *kunit_status_to_ok_not_ok(enum kunit_status status)
  * test case, similar to the notion of a *test fixture* or a *test class*
  * in other unit testing frameworks like JUnit or Googletest.
  *
+ * Note that @exit and @suite_exit will run even if @init or @suite_init
+ * fail: make sure they can handle any inconsistent state which may result.
+ *
  * Every &struct kunit_case must be associated with a kunit_suite for KUnit
  * to run it.
  */
@@ -321,8 +325,11 @@ enum kunit_status kunit_suite_has_succeeded(struct kunit_suite *suite);
  * @gfp: flags passed to underlying kmalloc().
  *
  * Just like `kmalloc_array(...)`, except the allocation is managed by the test case
- * and is automatically cleaned up after the test case concludes. See &struct
- * kunit_resource for more information.
+ * and is automatically cleaned up after the test case concludes. See kunit_add_action()
+ * for more information.
+ *
+ * Note that some internal context data is also allocated with GFP_KERNEL,
+ * regardless of the gfp passed in.
  */
 void *kunit_kmalloc_array(struct kunit *test, size_t n, size_t size, gfp_t gfp);
 
@@ -333,6 +340,9 @@ void *kunit_kmalloc_array(struct kunit *test, size_t n, size_t size, gfp_t gfp);
  * @gfp: flags passed to underlying kmalloc().
  *
  * See kmalloc() and kunit_kmalloc_array() for more information.
+ *
+ * Note that some internal context data is also allocated with GFP_KERNEL,
+ * regardless of the gfp passed in.
  */
 static inline void *kunit_kmalloc(struct kunit *test, size_t size, gfp_t gfp)
 {
@@ -472,7 +482,9 @@ void __printf(2, 3) kunit_log_append(char *log, const char *fmt, ...);
  */
 #define KUNIT_SUCCEED(test) do {} while (0)
 
-void kunit_do_failed_assertion(struct kunit *test,
+void __noreturn __kunit_abort(struct kunit *test);
+
+void __kunit_do_failed_assertion(struct kunit *test,
                               const struct kunit_loc *loc,
                               enum kunit_assert_type type,
                               const struct kunit_assert *assert,
@@ -482,13 +494,15 @@ void kunit_do_failed_assertion(struct kunit *test,
 #define _KUNIT_FAILED(test, assert_type, assert_class, assert_format, INITIALIZER, fmt, ...) do { \
        static const struct kunit_loc __loc = KUNIT_CURRENT_LOC;               \
        const struct assert_class __assertion = INITIALIZER;                   \
-       kunit_do_failed_assertion(test,                                        \
-                                 &__loc,                                      \
-                                 assert_type,                                 \
-                                 &__assertion.assert,                         \
-                                 assert_format,                               \
-                                 fmt,                                         \
-                                 ##__VA_ARGS__);                              \
+       __kunit_do_failed_assertion(test,                                      \
+                                   &__loc,                                    \
+                                   assert_type,                               \
+                                   &__assertion.assert,                       \
+                                   assert_format,                             \
+                                   fmt,                                       \
+                                   ##__VA_ARGS__);                            \
+       if (assert_type == KUNIT_ASSERTION)                                    \
+               __kunit_abort(test);                                           \
 } while (0)
 
 
index 7b71dd7..fd8849a 100644 (file)
@@ -712,7 +712,6 @@ int acpi_match_platform_list(const struct acpi_platform_list *plat);
 
 extern void acpi_early_init(void);
 extern void acpi_subsystem_init(void);
-extern void arch_post_acpi_subsys_init(void);
 
 extern int acpi_nvs_register(__u64 start, __u64 size);
 
@@ -1084,6 +1083,8 @@ static inline bool acpi_sleep_state_supported(u8 sleep_state)
 
 #endif /* !CONFIG_ACPI */
 
+extern void arch_post_acpi_subsys_init(void);
+
 #ifdef CONFIG_ACPI_HOTPLUG_IOAPIC
 int acpi_ioapic_add(acpi_handle root);
 #else
@@ -1507,6 +1508,12 @@ static inline int find_acpi_cpu_topology_hetero_id(unsigned int cpu)
 }
 #endif
 
+#ifdef CONFIG_ARM64
+void acpi_arm_init(void);
+#else
+static inline void acpi_arm_init(void) { }
+#endif
+
 #ifdef CONFIG_ACPI_PCC
 void acpi_init_pcc(void);
 #else
diff --git a/include/linux/acpi_agdi.h b/include/linux/acpi_agdi.h
deleted file mode 100644 (file)
index f477f0b..0000000
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-
-#ifndef __ACPI_AGDI_H__
-#define __ACPI_AGDI_H__
-
-#include <linux/acpi.h>
-
-#ifdef CONFIG_ACPI_AGDI
-void __init acpi_agdi_init(void);
-#else
-static inline void acpi_agdi_init(void) {}
-#endif
-#endif /* __ACPI_AGDI_H__ */
diff --git a/include/linux/acpi_apmt.h b/include/linux/acpi_apmt.h
deleted file mode 100644 (file)
index 40bd634..0000000
+++ /dev/null
@@ -1,19 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0
- *
- * ARM CoreSight PMU driver.
- * Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.
- *
- */
-
-#ifndef __ACPI_APMT_H__
-#define __ACPI_APMT_H__
-
-#include <linux/acpi.h>
-
-#ifdef CONFIG_ACPI_APMT
-void acpi_apmt_init(void);
-#else
-static inline void acpi_apmt_init(void) { }
-#endif /* CONFIG_ACPI_APMT */
-
-#endif /* __ACPI_APMT_H__ */
index b43be09..ee7cb6a 100644 (file)
@@ -26,13 +26,13 @@ int iort_register_domain_token(int trans_id, phys_addr_t base,
                               struct fwnode_handle *fw_node);
 void iort_deregister_domain_token(int trans_id);
 struct fwnode_handle *iort_find_domain_token(int trans_id);
+int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id);
+
 #ifdef CONFIG_ACPI_IORT
-void acpi_iort_init(void);
 u32 iort_msi_map_id(struct device *dev, u32 id);
 struct irq_domain *iort_get_device_domain(struct device *dev, u32 id,
                                          enum irq_domain_bus_token bus_token);
 void acpi_configure_pmsi_domain(struct device *dev);
-int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id);
 void iort_get_rmr_sids(struct fwnode_handle *iommu_fwnode,
                       struct list_head *head);
 void iort_put_rmr_sids(struct fwnode_handle *iommu_fwnode,
@@ -43,7 +43,6 @@ int iort_iommu_configure_id(struct device *dev, const u32 *id_in);
 void iort_iommu_get_resv_regions(struct device *dev, struct list_head *head);
 phys_addr_t acpi_iort_dma_get_max_cpu_address(void);
 #else
-static inline void acpi_iort_init(void) { }
 static inline u32 iort_msi_map_id(struct device *dev, u32 id)
 { return id; }
 static inline struct irq_domain *iort_get_device_domain(
index c10ebf8..446394f 100644 (file)
@@ -94,7 +94,8 @@ struct amd_cpudata {
  * enum amd_pstate_mode - driver working mode of amd pstate
  */
 enum amd_pstate_mode {
-       AMD_PSTATE_DISABLE = 0,
+       AMD_PSTATE_UNDEFINED = 0,
+       AMD_PSTATE_DISABLE,
        AMD_PSTATE_PASSIVE,
        AMD_PSTATE_ACTIVE,
        AMD_PSTATE_GUIDED,
@@ -102,6 +103,7 @@ enum amd_pstate_mode {
 };
 
 static const char * const amd_pstate_mode_string[] = {
+       [AMD_PSTATE_UNDEFINED]   = "undefined",
        [AMD_PSTATE_DISABLE]     = "disable",
        [AMD_PSTATE_PASSIVE]     = "passive",
        [AMD_PSTATE_ACTIVE]      = "active",
index a6e4437..18f5744 100644 (file)
 
 #include <linux/compiler.h>
 
-#ifndef arch_xchg_relaxed
-#define arch_xchg_acquire arch_xchg
-#define arch_xchg_release arch_xchg
-#define arch_xchg_relaxed arch_xchg
-#else /* arch_xchg_relaxed */
-
-#ifndef arch_xchg_acquire
-#define arch_xchg_acquire(...) \
-       __atomic_op_acquire(arch_xchg, __VA_ARGS__)
+#if defined(arch_xchg)
+#define raw_xchg arch_xchg
+#elif defined(arch_xchg_relaxed)
+#define raw_xchg(...) \
+       __atomic_op_fence(arch_xchg, __VA_ARGS__)
+#else
+extern void raw_xchg_not_implemented(void);
+#define raw_xchg(...) raw_xchg_not_implemented()
 #endif
 
-#ifndef arch_xchg_release
-#define arch_xchg_release(...) \
-       __atomic_op_release(arch_xchg, __VA_ARGS__)
+#if defined(arch_xchg_acquire)
+#define raw_xchg_acquire arch_xchg_acquire
+#elif defined(arch_xchg_relaxed)
+#define raw_xchg_acquire(...) \
+       __atomic_op_acquire(arch_xchg, __VA_ARGS__)
+#elif defined(arch_xchg)
+#define raw_xchg_acquire arch_xchg
+#else
+extern void raw_xchg_acquire_not_implemented(void);
+#define raw_xchg_acquire(...) raw_xchg_acquire_not_implemented()
 #endif
 
-#ifndef arch_xchg
-#define arch_xchg(...) \
-       __atomic_op_fence(arch_xchg, __VA_ARGS__)
+#if defined(arch_xchg_release)
+#define raw_xchg_release arch_xchg_release
+#elif defined(arch_xchg_relaxed)
+#define raw_xchg_release(...) \
+       __atomic_op_release(arch_xchg, __VA_ARGS__)
+#elif defined(arch_xchg)
+#define raw_xchg_release arch_xchg
+#else
+extern void raw_xchg_release_not_implemented(void);
+#define raw_xchg_release(...) raw_xchg_release_not_implemented()
+#endif
+
+#if defined(arch_xchg_relaxed)
+#define raw_xchg_relaxed arch_xchg_relaxed
+#elif defined(arch_xchg)
+#define raw_xchg_relaxed arch_xchg
+#else
+extern void raw_xchg_relaxed_not_implemented(void);
+#define raw_xchg_relaxed(...) raw_xchg_relaxed_not_implemented()
+#endif
+
+#if defined(arch_cmpxchg)
+#define raw_cmpxchg arch_cmpxchg
+#elif defined(arch_cmpxchg_relaxed)
+#define raw_cmpxchg(...) \
+       __atomic_op_fence(arch_cmpxchg, __VA_ARGS__)
+#else
+extern void raw_cmpxchg_not_implemented(void);
+#define raw_cmpxchg(...) raw_cmpxchg_not_implemented()
 #endif
 
-#endif /* arch_xchg_relaxed */
-
-#ifndef arch_cmpxchg_relaxed
-#define arch_cmpxchg_acquire arch_cmpxchg
-#define arch_cmpxchg_release arch_cmpxchg
-#define arch_cmpxchg_relaxed arch_cmpxchg
-#else /* arch_cmpxchg_relaxed */
-
-#ifndef arch_cmpxchg_acquire
-#define arch_cmpxchg_acquire(...) \
+#if defined(arch_cmpxchg_acquire)
+#define raw_cmpxchg_acquire arch_cmpxchg_acquire
+#elif defined(arch_cmpxchg_relaxed)
+#define raw_cmpxchg_acquire(...) \
        __atomic_op_acquire(arch_cmpxchg, __VA_ARGS__)
+#elif defined(arch_cmpxchg)
+#define raw_cmpxchg_acquire arch_cmpxchg
+#else
+extern void raw_cmpxchg_acquire_not_implemented(void);
+#define raw_cmpxchg_acquire(...) raw_cmpxchg_acquire_not_implemented()
 #endif
 
-#ifndef arch_cmpxchg_release
-#define arch_cmpxchg_release(...) \
+#if defined(arch_cmpxchg_release)
+#define raw_cmpxchg_release arch_cmpxchg_release
+#elif defined(arch_cmpxchg_relaxed)
+#define raw_cmpxchg_release(...) \
        __atomic_op_release(arch_cmpxchg, __VA_ARGS__)
+#elif defined(arch_cmpxchg)
+#define raw_cmpxchg_release arch_cmpxchg
+#else
+extern void raw_cmpxchg_release_not_implemented(void);
+#define raw_cmpxchg_release(...) raw_cmpxchg_release_not_implemented()
+#endif
+
+#if defined(arch_cmpxchg_relaxed)
+#define raw_cmpxchg_relaxed arch_cmpxchg_relaxed
+#elif defined(arch_cmpxchg)
+#define raw_cmpxchg_relaxed arch_cmpxchg
+#else
+extern void raw_cmpxchg_relaxed_not_implemented(void);
+#define raw_cmpxchg_relaxed(...) raw_cmpxchg_relaxed_not_implemented()
+#endif
+
+#if defined(arch_cmpxchg64)
+#define raw_cmpxchg64 arch_cmpxchg64
+#elif defined(arch_cmpxchg64_relaxed)
+#define raw_cmpxchg64(...) \
+       __atomic_op_fence(arch_cmpxchg64, __VA_ARGS__)
+#else
+extern void raw_cmpxchg64_not_implemented(void);
+#define raw_cmpxchg64(...) raw_cmpxchg64_not_implemented()
 #endif
 
-#ifndef arch_cmpxchg
-#define arch_cmpxchg(...) \
-       __atomic_op_fence(arch_cmpxchg, __VA_ARGS__)
-#endif
-
-#endif /* arch_cmpxchg_relaxed */
-
-#ifndef arch_cmpxchg64_relaxed
-#define arch_cmpxchg64_acquire arch_cmpxchg64
-#define arch_cmpxchg64_release arch_cmpxchg64
-#define arch_cmpxchg64_relaxed arch_cmpxchg64
-#else /* arch_cmpxchg64_relaxed */
-
-#ifndef arch_cmpxchg64_acquire
-#define arch_cmpxchg64_acquire(...) \
+#if defined(arch_cmpxchg64_acquire)
+#define raw_cmpxchg64_acquire arch_cmpxchg64_acquire
+#elif defined(arch_cmpxchg64_relaxed)
+#define raw_cmpxchg64_acquire(...) \
        __atomic_op_acquire(arch_cmpxchg64, __VA_ARGS__)
+#elif defined(arch_cmpxchg64)
+#define raw_cmpxchg64_acquire arch_cmpxchg64
+#else
+extern void raw_cmpxchg64_acquire_not_implemented(void);
+#define raw_cmpxchg64_acquire(...) raw_cmpxchg64_acquire_not_implemented()
 #endif
 
-#ifndef arch_cmpxchg64_release
-#define arch_cmpxchg64_release(...) \
+#if defined(arch_cmpxchg64_release)
+#define raw_cmpxchg64_release arch_cmpxchg64_release
+#elif defined(arch_cmpxchg64_relaxed)
+#define raw_cmpxchg64_release(...) \
        __atomic_op_release(arch_cmpxchg64, __VA_ARGS__)
+#elif defined(arch_cmpxchg64)
+#define raw_cmpxchg64_release arch_cmpxchg64
+#else
+extern void raw_cmpxchg64_release_not_implemented(void);
+#define raw_cmpxchg64_release(...) raw_cmpxchg64_release_not_implemented()
+#endif
+
+#if defined(arch_cmpxchg64_relaxed)
+#define raw_cmpxchg64_relaxed arch_cmpxchg64_relaxed
+#elif defined(arch_cmpxchg64)
+#define raw_cmpxchg64_relaxed arch_cmpxchg64
+#else
+extern void raw_cmpxchg64_relaxed_not_implemented(void);
+#define raw_cmpxchg64_relaxed(...) raw_cmpxchg64_relaxed_not_implemented()
+#endif
+
+#if defined(arch_cmpxchg128)
+#define raw_cmpxchg128 arch_cmpxchg128
+#elif defined(arch_cmpxchg128_relaxed)
+#define raw_cmpxchg128(...) \
+       __atomic_op_fence(arch_cmpxchg128, __VA_ARGS__)
+#else
+extern void raw_cmpxchg128_not_implemented(void);
+#define raw_cmpxchg128(...) raw_cmpxchg128_not_implemented()
+#endif
+
+#if defined(arch_cmpxchg128_acquire)
+#define raw_cmpxchg128_acquire arch_cmpxchg128_acquire
+#elif defined(arch_cmpxchg128_relaxed)
+#define raw_cmpxchg128_acquire(...) \
+       __atomic_op_acquire(arch_cmpxchg128, __VA_ARGS__)
+#elif defined(arch_cmpxchg128)
+#define raw_cmpxchg128_acquire arch_cmpxchg128
+#else
+extern void raw_cmpxchg128_acquire_not_implemented(void);
+#define raw_cmpxchg128_acquire(...) raw_cmpxchg128_acquire_not_implemented()
+#endif
+
+#if defined(arch_cmpxchg128_release)
+#define raw_cmpxchg128_release arch_cmpxchg128_release
+#elif defined(arch_cmpxchg128_relaxed)
+#define raw_cmpxchg128_release(...) \
+       __atomic_op_release(arch_cmpxchg128, __VA_ARGS__)
+#elif defined(arch_cmpxchg128)
+#define raw_cmpxchg128_release arch_cmpxchg128
+#else
+extern void raw_cmpxchg128_release_not_implemented(void);
+#define raw_cmpxchg128_release(...) raw_cmpxchg128_release_not_implemented()
+#endif
+
+#if defined(arch_cmpxchg128_relaxed)
+#define raw_cmpxchg128_relaxed arch_cmpxchg128_relaxed
+#elif defined(arch_cmpxchg128)
+#define raw_cmpxchg128_relaxed arch_cmpxchg128
+#else
+extern void raw_cmpxchg128_relaxed_not_implemented(void);
+#define raw_cmpxchg128_relaxed(...) raw_cmpxchg128_relaxed_not_implemented()
+#endif
+
+#if defined(arch_try_cmpxchg)
+#define raw_try_cmpxchg arch_try_cmpxchg
+#elif defined(arch_try_cmpxchg_relaxed)
+#define raw_try_cmpxchg(...) \
+       __atomic_op_fence(arch_try_cmpxchg, __VA_ARGS__)
+#else
+#define raw_try_cmpxchg(_ptr, _oldp, _new) \
+({ \
+       typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+       ___r = raw_cmpxchg((_ptr), ___o, (_new)); \
+       if (unlikely(___r != ___o)) \
+               *___op = ___r; \
+       likely(___r == ___o); \
+})
 #endif
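
For reference, a small sketch (not from this patch) of the retry-loop idiom these try_cmpxchg fallbacks support, written against the generic atomic_try_cmpxchg() wrapper:

static inline bool example_inc_unless_negative(atomic_t *v)
{
        int old = atomic_read(v);

        do {
                if (old < 0)
                        return false;
                /* old is refreshed automatically when the cmpxchg fails. */
        } while (!atomic_try_cmpxchg(v, &old, old + 1));

        return true;
}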
 
-#ifndef arch_cmpxchg64
-#define arch_cmpxchg64(...) \
-       __atomic_op_fence(arch_cmpxchg64, __VA_ARGS__)
+#if defined(arch_try_cmpxchg_acquire)
+#define raw_try_cmpxchg_acquire arch_try_cmpxchg_acquire
+#elif defined(arch_try_cmpxchg_relaxed)
+#define raw_try_cmpxchg_acquire(...) \
+       __atomic_op_acquire(arch_try_cmpxchg, __VA_ARGS__)
+#elif defined(arch_try_cmpxchg)
+#define raw_try_cmpxchg_acquire arch_try_cmpxchg
+#else
+#define raw_try_cmpxchg_acquire(_ptr, _oldp, _new) \
+({ \
+       typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+       ___r = raw_cmpxchg_acquire((_ptr), ___o, (_new)); \
+       if (unlikely(___r != ___o)) \
+               *___op = ___r; \
+       likely(___r == ___o); \
+})
 #endif
 
-#endif /* arch_cmpxchg64_relaxed */
-
-#ifndef arch_try_cmpxchg_relaxed
-#ifdef arch_try_cmpxchg
-#define arch_try_cmpxchg_acquire arch_try_cmpxchg
-#define arch_try_cmpxchg_release arch_try_cmpxchg
-#define arch_try_cmpxchg_relaxed arch_try_cmpxchg
-#endif /* arch_try_cmpxchg */
-
-#ifndef arch_try_cmpxchg
-#define arch_try_cmpxchg(_ptr, _oldp, _new) \
+#if defined(arch_try_cmpxchg_release)
+#define raw_try_cmpxchg_release arch_try_cmpxchg_release
+#elif defined(arch_try_cmpxchg_relaxed)
+#define raw_try_cmpxchg_release(...) \
+       __atomic_op_release(arch_try_cmpxchg, __VA_ARGS__)
+#elif defined(arch_try_cmpxchg)
+#define raw_try_cmpxchg_release arch_try_cmpxchg
+#else
+#define raw_try_cmpxchg_release(_ptr, _oldp, _new) \
 ({ \
        typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
-       ___r = arch_cmpxchg((_ptr), ___o, (_new)); \
+       ___r = raw_cmpxchg_release((_ptr), ___o, (_new)); \
        if (unlikely(___r != ___o)) \
                *___op = ___r; \
        likely(___r == ___o); \
 })
-#endif /* arch_try_cmpxchg */
+#endif
 
-#ifndef arch_try_cmpxchg_acquire
-#define arch_try_cmpxchg_acquire(_ptr, _oldp, _new) \
+#if defined(arch_try_cmpxchg_relaxed)
+#define raw_try_cmpxchg_relaxed arch_try_cmpxchg_relaxed
+#elif defined(arch_try_cmpxchg)
+#define raw_try_cmpxchg_relaxed arch_try_cmpxchg
+#else
+#define raw_try_cmpxchg_relaxed(_ptr, _oldp, _new) \
 ({ \
        typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
-       ___r = arch_cmpxchg_acquire((_ptr), ___o, (_new)); \
+       ___r = raw_cmpxchg_relaxed((_ptr), ___o, (_new)); \
        if (unlikely(___r != ___o)) \
                *___op = ___r; \
        likely(___r == ___o); \
 })
-#endif /* arch_try_cmpxchg_acquire */
+#endif
 
-#ifndef arch_try_cmpxchg_release
-#define arch_try_cmpxchg_release(_ptr, _oldp, _new) \
+#if defined(arch_try_cmpxchg64)
+#define raw_try_cmpxchg64 arch_try_cmpxchg64
+#elif defined(arch_try_cmpxchg64_relaxed)
+#define raw_try_cmpxchg64(...) \
+       __atomic_op_fence(arch_try_cmpxchg64, __VA_ARGS__)
+#else
+#define raw_try_cmpxchg64(_ptr, _oldp, _new) \
 ({ \
        typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
-       ___r = arch_cmpxchg_release((_ptr), ___o, (_new)); \
+       ___r = raw_cmpxchg64((_ptr), ___o, (_new)); \
        if (unlikely(___r != ___o)) \
                *___op = ___r; \
        likely(___r == ___o); \
 })
-#endif /* arch_try_cmpxchg_release */
+#endif
 
-#ifndef arch_try_cmpxchg_relaxed
-#define arch_try_cmpxchg_relaxed(_ptr, _oldp, _new) \
+#if defined(arch_try_cmpxchg64_acquire)
+#define raw_try_cmpxchg64_acquire arch_try_cmpxchg64_acquire
+#elif defined(arch_try_cmpxchg64_relaxed)
+#define raw_try_cmpxchg64_acquire(...) \
+       __atomic_op_acquire(arch_try_cmpxchg64, __VA_ARGS__)
+#elif defined(arch_try_cmpxchg64)
+#define raw_try_cmpxchg64_acquire arch_try_cmpxchg64
+#else
+#define raw_try_cmpxchg64_acquire(_ptr, _oldp, _new) \
 ({ \
        typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
-       ___r = arch_cmpxchg_relaxed((_ptr), ___o, (_new)); \
+       ___r = raw_cmpxchg64_acquire((_ptr), ___o, (_new)); \
        if (unlikely(___r != ___o)) \
                *___op = ___r; \
        likely(___r == ___o); \
 })
-#endif /* arch_try_cmpxchg_relaxed */
-
-#else /* arch_try_cmpxchg_relaxed */
-
-#ifndef arch_try_cmpxchg_acquire
-#define arch_try_cmpxchg_acquire(...) \
-       __atomic_op_acquire(arch_try_cmpxchg, __VA_ARGS__)
 #endif
 
-#ifndef arch_try_cmpxchg_release
-#define arch_try_cmpxchg_release(...) \
-       __atomic_op_release(arch_try_cmpxchg, __VA_ARGS__)
+#if defined(arch_try_cmpxchg64_release)
+#define raw_try_cmpxchg64_release arch_try_cmpxchg64_release
+#elif defined(arch_try_cmpxchg64_relaxed)
+#define raw_try_cmpxchg64_release(...) \
+       __atomic_op_release(arch_try_cmpxchg64, __VA_ARGS__)
+#elif defined(arch_try_cmpxchg64)
+#define raw_try_cmpxchg64_release arch_try_cmpxchg64
+#else
+#define raw_try_cmpxchg64_release(_ptr, _oldp, _new) \
+({ \
+       typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+       ___r = raw_cmpxchg64_release((_ptr), ___o, (_new)); \
+       if (unlikely(___r != ___o)) \
+               *___op = ___r; \
+       likely(___r == ___o); \
+})
 #endif
 
-#ifndef arch_try_cmpxchg
-#define arch_try_cmpxchg(...) \
-       __atomic_op_fence(arch_try_cmpxchg, __VA_ARGS__)
+#if defined(arch_try_cmpxchg64_relaxed)
+#define raw_try_cmpxchg64_relaxed arch_try_cmpxchg64_relaxed
+#elif defined(arch_try_cmpxchg64)
+#define raw_try_cmpxchg64_relaxed arch_try_cmpxchg64
+#else
+#define raw_try_cmpxchg64_relaxed(_ptr, _oldp, _new) \
+({ \
+       typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+       ___r = raw_cmpxchg64_relaxed((_ptr), ___o, (_new)); \
+       if (unlikely(___r != ___o)) \
+               *___op = ___r; \
+       likely(___r == ___o); \
+})
 #endif
 
-#endif /* arch_try_cmpxchg_relaxed */
-
-#ifndef arch_try_cmpxchg64_relaxed
-#ifdef arch_try_cmpxchg64
-#define arch_try_cmpxchg64_acquire arch_try_cmpxchg64
-#define arch_try_cmpxchg64_release arch_try_cmpxchg64
-#define arch_try_cmpxchg64_relaxed arch_try_cmpxchg64
-#endif /* arch_try_cmpxchg64 */
-
-#ifndef arch_try_cmpxchg64
-#define arch_try_cmpxchg64(_ptr, _oldp, _new) \
+#if defined(arch_try_cmpxchg128)
+#define raw_try_cmpxchg128 arch_try_cmpxchg128
+#elif defined(arch_try_cmpxchg128_relaxed)
+#define raw_try_cmpxchg128(...) \
+       __atomic_op_fence(arch_try_cmpxchg128, __VA_ARGS__)
+#else
+#define raw_try_cmpxchg128(_ptr, _oldp, _new) \
 ({ \
        typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
-       ___r = arch_cmpxchg64((_ptr), ___o, (_new)); \
+       ___r = raw_cmpxchg128((_ptr), ___o, (_new)); \
        if (unlikely(___r != ___o)) \
                *___op = ___r; \
        likely(___r == ___o); \
 })
-#endif /* arch_try_cmpxchg64 */
+#endif
 
-#ifndef arch_try_cmpxchg64_acquire
-#define arch_try_cmpxchg64_acquire(_ptr, _oldp, _new) \
+#if defined(arch_try_cmpxchg128_acquire)
+#define raw_try_cmpxchg128_acquire arch_try_cmpxchg128_acquire
+#elif defined(arch_try_cmpxchg128_relaxed)
+#define raw_try_cmpxchg128_acquire(...) \
+       __atomic_op_acquire(arch_try_cmpxchg128, __VA_ARGS__)
+#elif defined(arch_try_cmpxchg128)
+#define raw_try_cmpxchg128_acquire arch_try_cmpxchg128
+#else
+#define raw_try_cmpxchg128_acquire(_ptr, _oldp, _new) \
 ({ \
        typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
-       ___r = arch_cmpxchg64_acquire((_ptr), ___o, (_new)); \
+       ___r = raw_cmpxchg128_acquire((_ptr), ___o, (_new)); \
        if (unlikely(___r != ___o)) \
                *___op = ___r; \
        likely(___r == ___o); \
 })
-#endif /* arch_try_cmpxchg64_acquire */
+#endif
 
-#ifndef arch_try_cmpxchg64_release
-#define arch_try_cmpxchg64_release(_ptr, _oldp, _new) \
+#if defined(arch_try_cmpxchg128_release)
+#define raw_try_cmpxchg128_release arch_try_cmpxchg128_release
+#elif defined(arch_try_cmpxchg128_relaxed)
+#define raw_try_cmpxchg128_release(...) \
+       __atomic_op_release(arch_try_cmpxchg128, __VA_ARGS__)
+#elif defined(arch_try_cmpxchg128)
+#define raw_try_cmpxchg128_release arch_try_cmpxchg128
+#else
+#define raw_try_cmpxchg128_release(_ptr, _oldp, _new) \
 ({ \
        typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
-       ___r = arch_cmpxchg64_release((_ptr), ___o, (_new)); \
+       ___r = raw_cmpxchg128_release((_ptr), ___o, (_new)); \
        if (unlikely(___r != ___o)) \
                *___op = ___r; \
        likely(___r == ___o); \
 })
-#endif /* arch_try_cmpxchg64_release */
+#endif
 
-#ifndef arch_try_cmpxchg64_relaxed
-#define arch_try_cmpxchg64_relaxed(_ptr, _oldp, _new) \
+#if defined(arch_try_cmpxchg128_relaxed)
+#define raw_try_cmpxchg128_relaxed arch_try_cmpxchg128_relaxed
+#elif defined(arch_try_cmpxchg128)
+#define raw_try_cmpxchg128_relaxed arch_try_cmpxchg128
+#else
+#define raw_try_cmpxchg128_relaxed(_ptr, _oldp, _new) \
 ({ \
        typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
-       ___r = arch_cmpxchg64_relaxed((_ptr), ___o, (_new)); \
+       ___r = raw_cmpxchg128_relaxed((_ptr), ___o, (_new)); \
        if (unlikely(___r != ___o)) \
                *___op = ___r; \
        likely(___r == ___o); \
 })
-#endif /* arch_try_cmpxchg64_relaxed */
-
-#else /* arch_try_cmpxchg64_relaxed */
-
-#ifndef arch_try_cmpxchg64_acquire
-#define arch_try_cmpxchg64_acquire(...) \
-       __atomic_op_acquire(arch_try_cmpxchg64, __VA_ARGS__)
 #endif
 
-#ifndef arch_try_cmpxchg64_release
-#define arch_try_cmpxchg64_release(...) \
-       __atomic_op_release(arch_try_cmpxchg64, __VA_ARGS__)
-#endif
+#define raw_cmpxchg_local arch_cmpxchg_local
 
-#ifndef arch_try_cmpxchg64
-#define arch_try_cmpxchg64(...) \
-       __atomic_op_fence(arch_try_cmpxchg64, __VA_ARGS__)
+#ifdef arch_try_cmpxchg_local
+#define raw_try_cmpxchg_local arch_try_cmpxchg_local
+#else
+#define raw_try_cmpxchg_local(_ptr, _oldp, _new) \
+({ \
+       typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
+       ___r = raw_cmpxchg_local((_ptr), ___o, (_new)); \
+       if (unlikely(___r != ___o)) \
+               *___op = ___r; \
+       likely(___r == ___o); \
+})
 #endif
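/*
 * Illustrative only, not part of the patch: the _local variants are
 * atomic with respect to the current CPU only (e.g. against interrupts),
 * not against other CPUs, so they suit per-CPU state.  The per-CPU
 * counter and helper below are hypothetical.
 */
static DEFINE_PER_CPU(unsigned long, example_events);

static inline void example_count_event_local(void)
{
	unsigned long *p = get_cpu_ptr(&example_events);	/* disables preemption */
	unsigned long old = *p;

	/* Only an interrupt on this CPU can race with this update. */
	while (!raw_try_cmpxchg_local(p, &old, old + 1))
		;

	put_cpu_ptr(&example_events);
}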
 
-#endif /* arch_try_cmpxchg64_relaxed */
+#define raw_cmpxchg64_local arch_cmpxchg64_local
 
-#ifndef arch_try_cmpxchg_local
-#define arch_try_cmpxchg_local(_ptr, _oldp, _new) \
+#ifdef arch_try_cmpxchg64_local
+#define raw_try_cmpxchg64_local arch_try_cmpxchg64_local
+#else
+#define raw_try_cmpxchg64_local(_ptr, _oldp, _new) \
 ({ \
        typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
-       ___r = arch_cmpxchg_local((_ptr), ___o, (_new)); \
+       ___r = raw_cmpxchg64_local((_ptr), ___o, (_new)); \
        if (unlikely(___r != ___o)) \
                *___op = ___r; \
        likely(___r == ___o); \
 })
-#endif /* arch_try_cmpxchg_local */
+#endif
 
-#ifndef arch_try_cmpxchg64_local
-#define arch_try_cmpxchg64_local(_ptr, _oldp, _new) \
+#define raw_cmpxchg128_local arch_cmpxchg128_local
+
+#ifdef arch_try_cmpxchg128_local
+#define raw_try_cmpxchg128_local arch_try_cmpxchg128_local
+#else
+#define raw_try_cmpxchg128_local(_ptr, _oldp, _new) \
 ({ \
        typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
-       ___r = arch_cmpxchg64_local((_ptr), ___o, (_new)); \
+       ___r = raw_cmpxchg128_local((_ptr), ___o, (_new)); \
        if (unlikely(___r != ___o)) \
                *___op = ___r; \
        likely(___r == ___o); \
 })
-#endif /* arch_try_cmpxchg64_local */
+#endif
+
+#define raw_sync_cmpxchg arch_sync_cmpxchg
+
+/**
+ * raw_atomic_read() - atomic load with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically loads the value of @v with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_read() elsewhere.
+ *
+ * Return: The value loaded from @v.
+ */
+static __always_inline int
+raw_atomic_read(const atomic_t *v)
+{
+       return arch_atomic_read(v);
+}
 
-#ifndef arch_atomic_read_acquire
+/**
+ * raw_atomic_read_acquire() - atomic load with acquire ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically loads the value of @v with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_read_acquire() elsewhere.
+ *
+ * Return: The value loaded from @v.
+ */
 static __always_inline int
-arch_atomic_read_acquire(const atomic_t *v)
+raw_atomic_read_acquire(const atomic_t *v)
 {
+#if defined(arch_atomic_read_acquire)
+       return arch_atomic_read_acquire(v);
+#elif defined(arch_atomic_read)
+       return arch_atomic_read(v);
+#else
        int ret;
 
        if (__native_word(atomic_t)) {
                ret = smp_load_acquire(&(v)->counter);
        } else {
-               ret = arch_atomic_read(v);
+               ret = raw_atomic_read(v);
                __atomic_acquire_fence();
        }
 
        return ret;
-}
-#define arch_atomic_read_acquire arch_atomic_read_acquire
 #endif
+}
+
+/**
+ * raw_atomic_set() - atomic set with relaxed ordering
+ * @v: pointer to atomic_t
+ * @i: int value to assign
+ *
+ * Atomically sets @v to @i with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_set() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic_set(atomic_t *v, int i)
+{
+       arch_atomic_set(v, i);
+}
 
-#ifndef arch_atomic_set_release
+/**
+ * raw_atomic_set_release() - atomic set with release ordering
+ * @v: pointer to atomic_t
+ * @i: int value to assign
+ *
+ * Atomically sets @v to @i with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_set_release() elsewhere.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
-arch_atomic_set_release(atomic_t *v, int i)
+raw_atomic_set_release(atomic_t *v, int i)
 {
+#if defined(arch_atomic_set_release)
+       arch_atomic_set_release(v, i);
+#elif defined(arch_atomic_set)
+       arch_atomic_set(v, i);
+#else
        if (__native_word(atomic_t)) {
                smp_store_release(&(v)->counter, i);
        } else {
                __atomic_release_fence();
-               arch_atomic_set(v, i);
+               raw_atomic_set(v, i);
        }
-}
-#define arch_atomic_set_release arch_atomic_set_release
 #endif
+}
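/*
 * Illustrative sketch, not part of the patch: raw_atomic_set_release()
 * and raw_atomic_read_acquire() pair up for a simple publish/consume
 * pattern.  'example_data' and 'example_ready' are hypothetical globals.
 */
static int example_data;
static atomic_t example_ready;

static void example_publish(int val)
{
	example_data = val;				/* plain store */
	raw_atomic_set_release(&example_ready, 1);	/* ordered after the store */
}

static int example_consume(void)
{
	/* If the flag is observed, the acquire load makes example_data visible. */
	if (raw_atomic_read_acquire(&example_ready))
		return example_data;
	return -1;
}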
+
+/**
+ * raw_atomic_add() - atomic add with relaxed ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_add() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic_add(int i, atomic_t *v)
+{
+       arch_atomic_add(i, v);
+}
 
-#ifndef arch_atomic_add_return_relaxed
-#define arch_atomic_add_return_acquire arch_atomic_add_return
-#define arch_atomic_add_return_release arch_atomic_add_return
-#define arch_atomic_add_return_relaxed arch_atomic_add_return
-#else /* arch_atomic_add_return_relaxed */
+/**
+ * raw_atomic_add_return() - atomic add with full ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_add_return() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
+static __always_inline int
+raw_atomic_add_return(int i, atomic_t *v)
+{
+#if defined(arch_atomic_add_return)
+       return arch_atomic_add_return(i, v);
+#elif defined(arch_atomic_add_return_relaxed)
+       int ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic_add_return_relaxed(i, v);
+       __atomic_post_full_fence();
+       return ret;
+#else
+#error "Unable to define raw_atomic_add_return"
+#endif
+}
 
-#ifndef arch_atomic_add_return_acquire
+/**
+ * raw_atomic_add_return_acquire() - atomic add with acquire ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_add_return_acquire() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
-arch_atomic_add_return_acquire(int i, atomic_t *v)
+raw_atomic_add_return_acquire(int i, atomic_t *v)
 {
+#if defined(arch_atomic_add_return_acquire)
+       return arch_atomic_add_return_acquire(i, v);
+#elif defined(arch_atomic_add_return_relaxed)
        int ret = arch_atomic_add_return_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic_add_return_acquire arch_atomic_add_return_acquire
+#elif defined(arch_atomic_add_return)
+       return arch_atomic_add_return(i, v);
+#else
+#error "Unable to define raw_atomic_add_return_acquire"
 #endif
+}
 
-#ifndef arch_atomic_add_return_release
+/**
+ * raw_atomic_add_return_release() - atomic add with release ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_add_return_release() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
-arch_atomic_add_return_release(int i, atomic_t *v)
+raw_atomic_add_return_release(int i, atomic_t *v)
 {
+#if defined(arch_atomic_add_return_release)
+       return arch_atomic_add_return_release(i, v);
+#elif defined(arch_atomic_add_return_relaxed)
        __atomic_release_fence();
        return arch_atomic_add_return_relaxed(i, v);
+#elif defined(arch_atomic_add_return)
+       return arch_atomic_add_return(i, v);
+#else
+#error "Unable to define raw_atomic_add_return_release"
+#endif
 }
-#define arch_atomic_add_return_release arch_atomic_add_return_release
+
+/**
+ * raw_atomic_add_return_relaxed() - atomic add with relaxed ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_add_return_relaxed() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
+static __always_inline int
+raw_atomic_add_return_relaxed(int i, atomic_t *v)
+{
+#if defined(arch_atomic_add_return_relaxed)
+       return arch_atomic_add_return_relaxed(i, v);
+#elif defined(arch_atomic_add_return)
+       return arch_atomic_add_return(i, v);
+#else
+#error "Unable to define raw_atomic_add_return_relaxed"
 #endif
+}
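/*
 * Illustrative sketch, not part of the patch: the _return variants hand
 * back the value *after* the update, so a caller can act on the result of
 * its own addition.  Hypothetical waiter accounting.
 */
static inline bool example_first_waiter(atomic_t *nr_waiters)
{
	/* Fully ordered: the caller relies on this for wake-up ordering. */
	return raw_atomic_add_return(1, nr_waiters) == 1;
}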
 
-#ifndef arch_atomic_add_return
+/**
+ * raw_atomic_fetch_add() - atomic add with full ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_add() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_add_return(int i, atomic_t *v)
+raw_atomic_fetch_add(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_add)
+       return arch_atomic_fetch_add(i, v);
+#elif defined(arch_atomic_fetch_add_relaxed)
        int ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic_add_return_relaxed(i, v);
+       ret = arch_atomic_fetch_add_relaxed(i, v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic_add_return arch_atomic_add_return
+#else
+#error "Unable to define raw_atomic_fetch_add"
 #endif
+}
 
-#endif /* arch_atomic_add_return_relaxed */
-
-#ifndef arch_atomic_fetch_add_relaxed
-#define arch_atomic_fetch_add_acquire arch_atomic_fetch_add
-#define arch_atomic_fetch_add_release arch_atomic_fetch_add
-#define arch_atomic_fetch_add_relaxed arch_atomic_fetch_add
-#else /* arch_atomic_fetch_add_relaxed */
-
-#ifndef arch_atomic_fetch_add_acquire
+/**
+ * raw_atomic_fetch_add_acquire() - atomic add with acquire ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_add_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_add_acquire(int i, atomic_t *v)
+raw_atomic_fetch_add_acquire(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_add_acquire)
+       return arch_atomic_fetch_add_acquire(i, v);
+#elif defined(arch_atomic_fetch_add_relaxed)
        int ret = arch_atomic_fetch_add_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic_fetch_add_acquire arch_atomic_fetch_add_acquire
+#elif defined(arch_atomic_fetch_add)
+       return arch_atomic_fetch_add(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_add_acquire"
 #endif
+}
 
-#ifndef arch_atomic_fetch_add_release
+/**
+ * raw_atomic_fetch_add_release() - atomic add with release ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_add_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_add_release(int i, atomic_t *v)
+raw_atomic_fetch_add_release(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_add_release)
+       return arch_atomic_fetch_add_release(i, v);
+#elif defined(arch_atomic_fetch_add_relaxed)
        __atomic_release_fence();
        return arch_atomic_fetch_add_relaxed(i, v);
+#elif defined(arch_atomic_fetch_add)
+       return arch_atomic_fetch_add(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_add_release"
+#endif
 }
-#define arch_atomic_fetch_add_release arch_atomic_fetch_add_release
+
+/**
+ * raw_atomic_fetch_add_relaxed() - atomic add with relaxed ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_add_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
+static __always_inline int
+raw_atomic_fetch_add_relaxed(int i, atomic_t *v)
+{
+#if defined(arch_atomic_fetch_add_relaxed)
+       return arch_atomic_fetch_add_relaxed(i, v);
+#elif defined(arch_atomic_fetch_add)
+       return arch_atomic_fetch_add(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_add_relaxed"
 #endif
+}
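/*
 * Illustrative sketch, not part of the patch: unlike the _return family,
 * fetch_add() returns the value *before* the addition, which fits
 * ticket-style allocation.  Hypothetical helper.
 */
static inline int example_take_ticket(atomic_t *next_ticket)
{
	/* Relaxed is fine here; ordering comes from the later hand-off. */
	return raw_atomic_fetch_add_relaxed(1, next_ticket);
}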
+
+/**
+ * raw_atomic_sub() - atomic subtract with relaxed ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_sub() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic_sub(int i, atomic_t *v)
+{
+       arch_atomic_sub(i, v);
+}
 
-#ifndef arch_atomic_fetch_add
+/**
+ * raw_atomic_sub_return() - atomic subtract with full ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_sub_return() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_add(int i, atomic_t *v)
+raw_atomic_sub_return(int i, atomic_t *v)
 {
+#if defined(arch_atomic_sub_return)
+       return arch_atomic_sub_return(i, v);
+#elif defined(arch_atomic_sub_return_relaxed)
        int ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic_fetch_add_relaxed(i, v);
+       ret = arch_atomic_sub_return_relaxed(i, v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic_fetch_add arch_atomic_fetch_add
+#else
+#error "Unable to define raw_atomic_sub_return"
 #endif
+}
 
-#endif /* arch_atomic_fetch_add_relaxed */
-
-#ifndef arch_atomic_sub_return_relaxed
-#define arch_atomic_sub_return_acquire arch_atomic_sub_return
-#define arch_atomic_sub_return_release arch_atomic_sub_return
-#define arch_atomic_sub_return_relaxed arch_atomic_sub_return
-#else /* arch_atomic_sub_return_relaxed */
-
-#ifndef arch_atomic_sub_return_acquire
+/**
+ * raw_atomic_sub_return_acquire() - atomic subtract with acquire ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_sub_return_acquire() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
-arch_atomic_sub_return_acquire(int i, atomic_t *v)
+raw_atomic_sub_return_acquire(int i, atomic_t *v)
 {
+#if defined(arch_atomic_sub_return_acquire)
+       return arch_atomic_sub_return_acquire(i, v);
+#elif defined(arch_atomic_sub_return_relaxed)
        int ret = arch_atomic_sub_return_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic_sub_return_acquire arch_atomic_sub_return_acquire
+#elif defined(arch_atomic_sub_return)
+       return arch_atomic_sub_return(i, v);
+#else
+#error "Unable to define raw_atomic_sub_return_acquire"
 #endif
+}
 
-#ifndef arch_atomic_sub_return_release
+/**
+ * raw_atomic_sub_return_release() - atomic subtract with release ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_sub_return_release() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
-arch_atomic_sub_return_release(int i, atomic_t *v)
+raw_atomic_sub_return_release(int i, atomic_t *v)
 {
+#if defined(arch_atomic_sub_return_release)
+       return arch_atomic_sub_return_release(i, v);
+#elif defined(arch_atomic_sub_return_relaxed)
        __atomic_release_fence();
        return arch_atomic_sub_return_relaxed(i, v);
+#elif defined(arch_atomic_sub_return)
+       return arch_atomic_sub_return(i, v);
+#else
+#error "Unable to define raw_atomic_sub_return_release"
+#endif
 }
-#define arch_atomic_sub_return_release arch_atomic_sub_return_release
+
+/**
+ * raw_atomic_sub_return_relaxed() - atomic subtract with relaxed ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_sub_return_relaxed() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
+static __always_inline int
+raw_atomic_sub_return_relaxed(int i, atomic_t *v)
+{
+#if defined(arch_atomic_sub_return_relaxed)
+       return arch_atomic_sub_return_relaxed(i, v);
+#elif defined(arch_atomic_sub_return)
+       return arch_atomic_sub_return(i, v);
+#else
+#error "Unable to define raw_atomic_sub_return_relaxed"
 #endif
+}
 
-#ifndef arch_atomic_sub_return
+/**
+ * raw_atomic_fetch_sub() - atomic subtract with full ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_sub() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_sub_return(int i, atomic_t *v)
+raw_atomic_fetch_sub(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_sub)
+       return arch_atomic_fetch_sub(i, v);
+#elif defined(arch_atomic_fetch_sub_relaxed)
        int ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic_sub_return_relaxed(i, v);
+       ret = arch_atomic_fetch_sub_relaxed(i, v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic_sub_return arch_atomic_sub_return
+#else
+#error "Unable to define raw_atomic_fetch_sub"
 #endif
+}
 
-#endif /* arch_atomic_sub_return_relaxed */
-
-#ifndef arch_atomic_fetch_sub_relaxed
-#define arch_atomic_fetch_sub_acquire arch_atomic_fetch_sub
-#define arch_atomic_fetch_sub_release arch_atomic_fetch_sub
-#define arch_atomic_fetch_sub_relaxed arch_atomic_fetch_sub
-#else /* arch_atomic_fetch_sub_relaxed */
-
-#ifndef arch_atomic_fetch_sub_acquire
+/**
+ * raw_atomic_fetch_sub_acquire() - atomic subtract with acquire ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_sub_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_sub_acquire(int i, atomic_t *v)
+raw_atomic_fetch_sub_acquire(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_sub_acquire)
+       return arch_atomic_fetch_sub_acquire(i, v);
+#elif defined(arch_atomic_fetch_sub_relaxed)
        int ret = arch_atomic_fetch_sub_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic_fetch_sub_acquire arch_atomic_fetch_sub_acquire
+#elif defined(arch_atomic_fetch_sub)
+       return arch_atomic_fetch_sub(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_sub_acquire"
 #endif
+}
 
-#ifndef arch_atomic_fetch_sub_release
+/**
+ * raw_atomic_fetch_sub_release() - atomic subtract with release ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_sub_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_sub_release(int i, atomic_t *v)
+raw_atomic_fetch_sub_release(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_sub_release)
+       return arch_atomic_fetch_sub_release(i, v);
+#elif defined(arch_atomic_fetch_sub_relaxed)
        __atomic_release_fence();
        return arch_atomic_fetch_sub_relaxed(i, v);
-}
-#define arch_atomic_fetch_sub_release arch_atomic_fetch_sub_release
+#elif defined(arch_atomic_fetch_sub)
+       return arch_atomic_fetch_sub(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_sub_release"
 #endif
+}
 
-#ifndef arch_atomic_fetch_sub
+/**
+ * raw_atomic_fetch_sub_relaxed() - atomic subtract with relaxed ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_sub_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_sub(int i, atomic_t *v)
+raw_atomic_fetch_sub_relaxed(int i, atomic_t *v)
 {
-       int ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic_fetch_sub_relaxed(i, v);
-       __atomic_post_full_fence();
-       return ret;
-}
-#define arch_atomic_fetch_sub arch_atomic_fetch_sub
+#if defined(arch_atomic_fetch_sub_relaxed)
+       return arch_atomic_fetch_sub_relaxed(i, v);
+#elif defined(arch_atomic_fetch_sub)
+       return arch_atomic_fetch_sub(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_sub_relaxed"
 #endif
-
-#endif /* arch_atomic_fetch_sub_relaxed */
-
-#ifndef arch_atomic_inc
-static __always_inline void
-arch_atomic_inc(atomic_t *v)
-{
-       arch_atomic_add(1, v);
 }
-#define arch_atomic_inc arch_atomic_inc
-#endif
-
-#ifndef arch_atomic_inc_return_relaxed
-#ifdef arch_atomic_inc_return
-#define arch_atomic_inc_return_acquire arch_atomic_inc_return
-#define arch_atomic_inc_return_release arch_atomic_inc_return
-#define arch_atomic_inc_return_relaxed arch_atomic_inc_return
-#endif /* arch_atomic_inc_return */
 
-#ifndef arch_atomic_inc_return
-static __always_inline int
-arch_atomic_inc_return(atomic_t *v)
+/**
+ * raw_atomic_inc() - atomic increment with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_inc() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic_inc(atomic_t *v)
 {
-       return arch_atomic_add_return(1, v);
-}
-#define arch_atomic_inc_return arch_atomic_inc_return
+#if defined(arch_atomic_inc)
+       arch_atomic_inc(v);
+#else
+       raw_atomic_add(1, v);
 #endif
-
-#ifndef arch_atomic_inc_return_acquire
-static __always_inline int
-arch_atomic_inc_return_acquire(atomic_t *v)
-{
-       return arch_atomic_add_return_acquire(1, v);
 }
-#define arch_atomic_inc_return_acquire arch_atomic_inc_return_acquire
-#endif
 
-#ifndef arch_atomic_inc_return_release
+/**
+ * raw_atomic_inc_return() - atomic increment with full ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_inc_return() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
-arch_atomic_inc_return_release(atomic_t *v)
+raw_atomic_inc_return(atomic_t *v)
 {
-       return arch_atomic_add_return_release(1, v);
-}
-#define arch_atomic_inc_return_release arch_atomic_inc_return_release
+#if defined(arch_atomic_inc_return)
+       return arch_atomic_inc_return(v);
+#elif defined(arch_atomic_inc_return_relaxed)
+       int ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic_inc_return_relaxed(v);
+       __atomic_post_full_fence();
+       return ret;
+#else
+       return raw_atomic_add_return(1, v);
 #endif
-
-#ifndef arch_atomic_inc_return_relaxed
-static __always_inline int
-arch_atomic_inc_return_relaxed(atomic_t *v)
-{
-       return arch_atomic_add_return_relaxed(1, v);
 }
-#define arch_atomic_inc_return_relaxed arch_atomic_inc_return_relaxed
-#endif
 
-#else /* arch_atomic_inc_return_relaxed */
-
-#ifndef arch_atomic_inc_return_acquire
+/**
+ * raw_atomic_inc_return_acquire() - atomic increment with acquire ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_inc_return_acquire() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
-arch_atomic_inc_return_acquire(atomic_t *v)
+raw_atomic_inc_return_acquire(atomic_t *v)
 {
+#if defined(arch_atomic_inc_return_acquire)
+       return arch_atomic_inc_return_acquire(v);
+#elif defined(arch_atomic_inc_return_relaxed)
        int ret = arch_atomic_inc_return_relaxed(v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic_inc_return_acquire arch_atomic_inc_return_acquire
+#elif defined(arch_atomic_inc_return)
+       return arch_atomic_inc_return(v);
+#else
+       return raw_atomic_add_return_acquire(1, v);
 #endif
+}
 
-#ifndef arch_atomic_inc_return_release
+/**
+ * raw_atomic_inc_return_release() - atomic increment with release ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_inc_return_release() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
-arch_atomic_inc_return_release(atomic_t *v)
+raw_atomic_inc_return_release(atomic_t *v)
 {
+#if defined(arch_atomic_inc_return_release)
+       return arch_atomic_inc_return_release(v);
+#elif defined(arch_atomic_inc_return_relaxed)
        __atomic_release_fence();
        return arch_atomic_inc_return_relaxed(v);
-}
-#define arch_atomic_inc_return_release arch_atomic_inc_return_release
+#elif defined(arch_atomic_inc_return)
+       return arch_atomic_inc_return(v);
+#else
+       return raw_atomic_add_return_release(1, v);
 #endif
-
-#ifndef arch_atomic_inc_return
-static __always_inline int
-arch_atomic_inc_return(atomic_t *v)
-{
-       int ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic_inc_return_relaxed(v);
-       __atomic_post_full_fence();
-       return ret;
 }
-#define arch_atomic_inc_return arch_atomic_inc_return
-#endif
-
-#endif /* arch_atomic_inc_return_relaxed */
 
-#ifndef arch_atomic_fetch_inc_relaxed
-#ifdef arch_atomic_fetch_inc
-#define arch_atomic_fetch_inc_acquire arch_atomic_fetch_inc
-#define arch_atomic_fetch_inc_release arch_atomic_fetch_inc
-#define arch_atomic_fetch_inc_relaxed arch_atomic_fetch_inc
-#endif /* arch_atomic_fetch_inc */
-
-#ifndef arch_atomic_fetch_inc
+/**
+ * raw_atomic_inc_return_relaxed() - atomic increment with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_inc_return_relaxed() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_inc(atomic_t *v)
+raw_atomic_inc_return_relaxed(atomic_t *v)
 {
-       return arch_atomic_fetch_add(1, v);
-}
-#define arch_atomic_fetch_inc arch_atomic_fetch_inc
+#if defined(arch_atomic_inc_return_relaxed)
+       return arch_atomic_inc_return_relaxed(v);
+#elif defined(arch_atomic_inc_return)
+       return arch_atomic_inc_return(v);
+#else
+       return raw_atomic_add_return_relaxed(1, v);
 #endif
-
-#ifndef arch_atomic_fetch_inc_acquire
-static __always_inline int
-arch_atomic_fetch_inc_acquire(atomic_t *v)
-{
-       return arch_atomic_fetch_add_acquire(1, v);
 }
-#define arch_atomic_fetch_inc_acquire arch_atomic_fetch_inc_acquire
-#endif
 
-#ifndef arch_atomic_fetch_inc_release
+/**
+ * raw_atomic_fetch_inc() - atomic increment with full ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_inc() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_inc_release(atomic_t *v)
+raw_atomic_fetch_inc(atomic_t *v)
 {
-       return arch_atomic_fetch_add_release(1, v);
-}
-#define arch_atomic_fetch_inc_release arch_atomic_fetch_inc_release
+#if defined(arch_atomic_fetch_inc)
+       return arch_atomic_fetch_inc(v);
+#elif defined(arch_atomic_fetch_inc_relaxed)
+       int ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic_fetch_inc_relaxed(v);
+       __atomic_post_full_fence();
+       return ret;
+#else
+       return raw_atomic_fetch_add(1, v);
 #endif
-
-#ifndef arch_atomic_fetch_inc_relaxed
-static __always_inline int
-arch_atomic_fetch_inc_relaxed(atomic_t *v)
-{
-       return arch_atomic_fetch_add_relaxed(1, v);
 }
-#define arch_atomic_fetch_inc_relaxed arch_atomic_fetch_inc_relaxed
-#endif
 
-#else /* arch_atomic_fetch_inc_relaxed */
-
-#ifndef arch_atomic_fetch_inc_acquire
+/**
+ * raw_atomic_fetch_inc_acquire() - atomic increment with acquire ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_inc_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_inc_acquire(atomic_t *v)
+raw_atomic_fetch_inc_acquire(atomic_t *v)
 {
+#if defined(arch_atomic_fetch_inc_acquire)
+       return arch_atomic_fetch_inc_acquire(v);
+#elif defined(arch_atomic_fetch_inc_relaxed)
        int ret = arch_atomic_fetch_inc_relaxed(v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic_fetch_inc_acquire arch_atomic_fetch_inc_acquire
+#elif defined(arch_atomic_fetch_inc)
+       return arch_atomic_fetch_inc(v);
+#else
+       return raw_atomic_fetch_add_acquire(1, v);
 #endif
+}
 
-#ifndef arch_atomic_fetch_inc_release
+/**
+ * raw_atomic_fetch_inc_release() - atomic increment with release ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_inc_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_inc_release(atomic_t *v)
+raw_atomic_fetch_inc_release(atomic_t *v)
 {
+#if defined(arch_atomic_fetch_inc_release)
+       return arch_atomic_fetch_inc_release(v);
+#elif defined(arch_atomic_fetch_inc_relaxed)
        __atomic_release_fence();
        return arch_atomic_fetch_inc_relaxed(v);
-}
-#define arch_atomic_fetch_inc_release arch_atomic_fetch_inc_release
+#elif defined(arch_atomic_fetch_inc)
+       return arch_atomic_fetch_inc(v);
+#else
+       return raw_atomic_fetch_add_release(1, v);
 #endif
+}
 
-#ifndef arch_atomic_fetch_inc
+/**
+ * raw_atomic_fetch_inc_relaxed() - atomic increment with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_inc_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_inc(atomic_t *v)
+raw_atomic_fetch_inc_relaxed(atomic_t *v)
 {
-       int ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic_fetch_inc_relaxed(v);
-       __atomic_post_full_fence();
-       return ret;
-}
-#define arch_atomic_fetch_inc arch_atomic_fetch_inc
+#if defined(arch_atomic_fetch_inc_relaxed)
+       return arch_atomic_fetch_inc_relaxed(v);
+#elif defined(arch_atomic_fetch_inc)
+       return arch_atomic_fetch_inc(v);
+#else
+       return raw_atomic_fetch_add_relaxed(1, v);
 #endif
-
-#endif /* arch_atomic_fetch_inc_relaxed */
-
-#ifndef arch_atomic_dec
-static __always_inline void
-arch_atomic_dec(atomic_t *v)
-{
-       arch_atomic_sub(1, v);
 }
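/*
 * Illustrative sketch, not part of the patch: on architectures without a
 * dedicated inc primitive the wrappers above fall back to the add variants
 * with a constant 1.  fetch_inc() still returns the old value, so zero
 * means this caller was first.  Hypothetical helper.
 */
static inline bool example_first_user(atomic_t *users)
{
	return raw_atomic_fetch_inc(users) == 0;
}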
-#define arch_atomic_dec arch_atomic_dec
-#endif
-
-#ifndef arch_atomic_dec_return_relaxed
-#ifdef arch_atomic_dec_return
-#define arch_atomic_dec_return_acquire arch_atomic_dec_return
-#define arch_atomic_dec_return_release arch_atomic_dec_return
-#define arch_atomic_dec_return_relaxed arch_atomic_dec_return
-#endif /* arch_atomic_dec_return */
 
-#ifndef arch_atomic_dec_return
-static __always_inline int
-arch_atomic_dec_return(atomic_t *v)
+/**
+ * raw_atomic_dec() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_dec() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic_dec(atomic_t *v)
 {
-       return arch_atomic_sub_return(1, v);
-}
-#define arch_atomic_dec_return arch_atomic_dec_return
+#if defined(arch_atomic_dec)
+       arch_atomic_dec(v);
+#else
+       raw_atomic_sub(1, v);
 #endif
-
-#ifndef arch_atomic_dec_return_acquire
-static __always_inline int
-arch_atomic_dec_return_acquire(atomic_t *v)
-{
-       return arch_atomic_sub_return_acquire(1, v);
 }
-#define arch_atomic_dec_return_acquire arch_atomic_dec_return_acquire
-#endif
 
-#ifndef arch_atomic_dec_return_release
+/**
+ * raw_atomic_dec_return() - atomic decrement with full ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_dec_return() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
-arch_atomic_dec_return_release(atomic_t *v)
+raw_atomic_dec_return(atomic_t *v)
 {
-       return arch_atomic_sub_return_release(1, v);
-}
-#define arch_atomic_dec_return_release arch_atomic_dec_return_release
+#if defined(arch_atomic_dec_return)
+       return arch_atomic_dec_return(v);
+#elif defined(arch_atomic_dec_return_relaxed)
+       int ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic_dec_return_relaxed(v);
+       __atomic_post_full_fence();
+       return ret;
+#else
+       return raw_atomic_sub_return(1, v);
 #endif
-
-#ifndef arch_atomic_dec_return_relaxed
-static __always_inline int
-arch_atomic_dec_return_relaxed(atomic_t *v)
-{
-       return arch_atomic_sub_return_relaxed(1, v);
 }
-#define arch_atomic_dec_return_relaxed arch_atomic_dec_return_relaxed
-#endif
 
-#else /* arch_atomic_dec_return_relaxed */
-
-#ifndef arch_atomic_dec_return_acquire
+/**
+ * raw_atomic_dec_return_acquire() - atomic decrement with acquire ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_dec_return_acquire() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
-arch_atomic_dec_return_acquire(atomic_t *v)
+raw_atomic_dec_return_acquire(atomic_t *v)
 {
+#if defined(arch_atomic_dec_return_acquire)
+       return arch_atomic_dec_return_acquire(v);
+#elif defined(arch_atomic_dec_return_relaxed)
        int ret = arch_atomic_dec_return_relaxed(v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic_dec_return_acquire arch_atomic_dec_return_acquire
+#elif defined(arch_atomic_dec_return)
+       return arch_atomic_dec_return(v);
+#else
+       return raw_atomic_sub_return_acquire(1, v);
 #endif
+}
 
-#ifndef arch_atomic_dec_return_release
+/**
+ * raw_atomic_dec_return_release() - atomic decrement with release ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_dec_return_release() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
-arch_atomic_dec_return_release(atomic_t *v)
+raw_atomic_dec_return_release(atomic_t *v)
 {
+#if defined(arch_atomic_dec_return_release)
+       return arch_atomic_dec_return_release(v);
+#elif defined(arch_atomic_dec_return_relaxed)
        __atomic_release_fence();
        return arch_atomic_dec_return_relaxed(v);
+#elif defined(arch_atomic_dec_return)
+       return arch_atomic_dec_return(v);
+#else
+       return raw_atomic_sub_return_release(1, v);
+#endif
 }
-#define arch_atomic_dec_return_release arch_atomic_dec_return_release
+
+/**
+ * raw_atomic_dec_return_relaxed() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_dec_return_relaxed() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
+static __always_inline int
+raw_atomic_dec_return_relaxed(atomic_t *v)
+{
+#if defined(arch_atomic_dec_return_relaxed)
+       return arch_atomic_dec_return_relaxed(v);
+#elif defined(arch_atomic_dec_return)
+       return arch_atomic_dec_return(v);
+#else
+       return raw_atomic_sub_return_relaxed(1, v);
 #endif
+}
 
-#ifndef arch_atomic_dec_return
+/**
+ * raw_atomic_fetch_dec() - atomic decrement with full ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_dec() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_dec_return(atomic_t *v)
+raw_atomic_fetch_dec(atomic_t *v)
 {
+#if defined(arch_atomic_fetch_dec)
+       return arch_atomic_fetch_dec(v);
+#elif defined(arch_atomic_fetch_dec_relaxed)
        int ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic_dec_return_relaxed(v);
+       ret = arch_atomic_fetch_dec_relaxed(v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic_dec_return arch_atomic_dec_return
+#else
+       return raw_atomic_fetch_sub(1, v);
 #endif
-
-#endif /* arch_atomic_dec_return_relaxed */
-
-#ifndef arch_atomic_fetch_dec_relaxed
-#ifdef arch_atomic_fetch_dec
-#define arch_atomic_fetch_dec_acquire arch_atomic_fetch_dec
-#define arch_atomic_fetch_dec_release arch_atomic_fetch_dec
-#define arch_atomic_fetch_dec_relaxed arch_atomic_fetch_dec
-#endif /* arch_atomic_fetch_dec */
-
-#ifndef arch_atomic_fetch_dec
-static __always_inline int
-arch_atomic_fetch_dec(atomic_t *v)
-{
-       return arch_atomic_fetch_sub(1, v);
 }
-#define arch_atomic_fetch_dec arch_atomic_fetch_dec
-#endif
 
-#ifndef arch_atomic_fetch_dec_acquire
+/**
+ * raw_atomic_fetch_dec_acquire() - atomic decrement with acquire ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_dec_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_dec_acquire(atomic_t *v)
+raw_atomic_fetch_dec_acquire(atomic_t *v)
 {
-       return arch_atomic_fetch_sub_acquire(1, v);
-}
-#define arch_atomic_fetch_dec_acquire arch_atomic_fetch_dec_acquire
+#if defined(arch_atomic_fetch_dec_acquire)
+       return arch_atomic_fetch_dec_acquire(v);
+#elif defined(arch_atomic_fetch_dec_relaxed)
+       int ret = arch_atomic_fetch_dec_relaxed(v);
+       __atomic_acquire_fence();
+       return ret;
+#elif defined(arch_atomic_fetch_dec)
+       return arch_atomic_fetch_dec(v);
+#else
+       return raw_atomic_fetch_sub_acquire(1, v);
 #endif
-
-#ifndef arch_atomic_fetch_dec_release
-static __always_inline int
-arch_atomic_fetch_dec_release(atomic_t *v)
-{
-       return arch_atomic_fetch_sub_release(1, v);
 }
-#define arch_atomic_fetch_dec_release arch_atomic_fetch_dec_release
-#endif
 
-#ifndef arch_atomic_fetch_dec_relaxed
+/**
+ * raw_atomic_fetch_dec_release() - atomic decrement with release ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_dec_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_dec_relaxed(atomic_t *v)
+raw_atomic_fetch_dec_release(atomic_t *v)
 {
-       return arch_atomic_fetch_sub_relaxed(1, v);
-}
-#define arch_atomic_fetch_dec_relaxed arch_atomic_fetch_dec_relaxed
+#if defined(arch_atomic_fetch_dec_release)
+       return arch_atomic_fetch_dec_release(v);
+#elif defined(arch_atomic_fetch_dec_relaxed)
+       __atomic_release_fence();
+       return arch_atomic_fetch_dec_relaxed(v);
+#elif defined(arch_atomic_fetch_dec)
+       return arch_atomic_fetch_dec(v);
+#else
+       return raw_atomic_fetch_sub_release(1, v);
 #endif
+}
 
-#else /* arch_atomic_fetch_dec_relaxed */
-
-#ifndef arch_atomic_fetch_dec_acquire
+/**
+ * raw_atomic_fetch_dec_relaxed() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_dec_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_dec_acquire(atomic_t *v)
+raw_atomic_fetch_dec_relaxed(atomic_t *v)
 {
-       int ret = arch_atomic_fetch_dec_relaxed(v);
-       __atomic_acquire_fence();
-       return ret;
-}
-#define arch_atomic_fetch_dec_acquire arch_atomic_fetch_dec_acquire
+#if defined(arch_atomic_fetch_dec_relaxed)
+       return arch_atomic_fetch_dec_relaxed(v);
+#elif defined(arch_atomic_fetch_dec)
+       return arch_atomic_fetch_dec(v);
+#else
+       return raw_atomic_fetch_sub_relaxed(1, v);
 #endif
+}
 
-#ifndef arch_atomic_fetch_dec_release
-static __always_inline int
-arch_atomic_fetch_dec_release(atomic_t *v)
+/**
+ * raw_atomic_and() - atomic bitwise AND with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_and() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic_and(int i, atomic_t *v)
 {
-       __atomic_release_fence();
-       return arch_atomic_fetch_dec_relaxed(v);
+       arch_atomic_and(i, v);
 }
-#define arch_atomic_fetch_dec_release arch_atomic_fetch_dec_release
-#endif
 
-#ifndef arch_atomic_fetch_dec
+/**
+ * raw_atomic_fetch_and() - atomic bitwise AND with full ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_and() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_dec(atomic_t *v)
+raw_atomic_fetch_and(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_and)
+       return arch_atomic_fetch_and(i, v);
+#elif defined(arch_atomic_fetch_and_relaxed)
        int ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic_fetch_dec_relaxed(v);
+       ret = arch_atomic_fetch_and_relaxed(i, v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic_fetch_dec arch_atomic_fetch_dec
+#else
+#error "Unable to define raw_atomic_fetch_and"
 #endif
+}
 
-#endif /* arch_atomic_fetch_dec_relaxed */
-
-#ifndef arch_atomic_fetch_and_relaxed
-#define arch_atomic_fetch_and_acquire arch_atomic_fetch_and
-#define arch_atomic_fetch_and_release arch_atomic_fetch_and
-#define arch_atomic_fetch_and_relaxed arch_atomic_fetch_and
-#else /* arch_atomic_fetch_and_relaxed */
-
-#ifndef arch_atomic_fetch_and_acquire
+/**
+ * raw_atomic_fetch_and_acquire() - atomic bitwise AND with acquire ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_and_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_and_acquire(int i, atomic_t *v)
+raw_atomic_fetch_and_acquire(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_and_acquire)
+       return arch_atomic_fetch_and_acquire(i, v);
+#elif defined(arch_atomic_fetch_and_relaxed)
        int ret = arch_atomic_fetch_and_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic_fetch_and_acquire arch_atomic_fetch_and_acquire
+#elif defined(arch_atomic_fetch_and)
+       return arch_atomic_fetch_and(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_and_acquire"
 #endif
+}
 
-#ifndef arch_atomic_fetch_and_release
+/**
+ * raw_atomic_fetch_and_release() - atomic bitwise AND with release ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_and_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_and_release(int i, atomic_t *v)
+raw_atomic_fetch_and_release(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_and_release)
+       return arch_atomic_fetch_and_release(i, v);
+#elif defined(arch_atomic_fetch_and_relaxed)
        __atomic_release_fence();
        return arch_atomic_fetch_and_relaxed(i, v);
-}
-#define arch_atomic_fetch_and_release arch_atomic_fetch_and_release
+#elif defined(arch_atomic_fetch_and)
+       return arch_atomic_fetch_and(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_and_release"
 #endif
+}
 
-#ifndef arch_atomic_fetch_and
+/**
+ * raw_atomic_fetch_and_relaxed() - atomic bitwise AND with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_and_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_and(int i, atomic_t *v)
+raw_atomic_fetch_and_relaxed(int i, atomic_t *v)
 {
-       int ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic_fetch_and_relaxed(i, v);
-       __atomic_post_full_fence();
-       return ret;
-}
-#define arch_atomic_fetch_and arch_atomic_fetch_and
+#if defined(arch_atomic_fetch_and_relaxed)
+       return arch_atomic_fetch_and_relaxed(i, v);
+#elif defined(arch_atomic_fetch_and)
+       return arch_atomic_fetch_and(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_and_relaxed"
 #endif
+}
 
-#endif /* arch_atomic_fetch_and_relaxed */
-
-#ifndef arch_atomic_andnot
+/**
+ * raw_atomic_andnot() - atomic bitwise AND NOT with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & ~@i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_andnot() elsewhere.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
-arch_atomic_andnot(int i, atomic_t *v)
+raw_atomic_andnot(int i, atomic_t *v)
 {
-       arch_atomic_and(~i, v);
-}
-#define arch_atomic_andnot arch_atomic_andnot
+#if defined(arch_atomic_andnot)
+       arch_atomic_andnot(i, v);
+#else
+       raw_atomic_and(~i, v);
 #endif
-
-#ifndef arch_atomic_fetch_andnot_relaxed
-#ifdef arch_atomic_fetch_andnot
-#define arch_atomic_fetch_andnot_acquire arch_atomic_fetch_andnot
-#define arch_atomic_fetch_andnot_release arch_atomic_fetch_andnot
-#define arch_atomic_fetch_andnot_relaxed arch_atomic_fetch_andnot
-#endif /* arch_atomic_fetch_andnot */
-
-#ifndef arch_atomic_fetch_andnot
-static __always_inline int
-arch_atomic_fetch_andnot(int i, atomic_t *v)
-{
-       return arch_atomic_fetch_and(~i, v);
 }
-#define arch_atomic_fetch_andnot arch_atomic_fetch_andnot
-#endif
 
-#ifndef arch_atomic_fetch_andnot_acquire
-static __always_inline int
-arch_atomic_fetch_andnot_acquire(int i, atomic_t *v)
-{
-       return arch_atomic_fetch_and_acquire(~i, v);
-}
-#define arch_atomic_fetch_andnot_acquire arch_atomic_fetch_andnot_acquire
-#endif
-
-#ifndef arch_atomic_fetch_andnot_release
+/**
+ * raw_atomic_fetch_andnot() - atomic bitwise AND NOT with full ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & ~@i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_andnot() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_andnot_release(int i, atomic_t *v)
+raw_atomic_fetch_andnot(int i, atomic_t *v)
 {
-       return arch_atomic_fetch_and_release(~i, v);
-}
-#define arch_atomic_fetch_andnot_release arch_atomic_fetch_andnot_release
+#if defined(arch_atomic_fetch_andnot)
+       return arch_atomic_fetch_andnot(i, v);
+#elif defined(arch_atomic_fetch_andnot_relaxed)
+       int ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic_fetch_andnot_relaxed(i, v);
+       __atomic_post_full_fence();
+       return ret;
+#else
+       return raw_atomic_fetch_and(~i, v);
 #endif
-
-#ifndef arch_atomic_fetch_andnot_relaxed
-static __always_inline int
-arch_atomic_fetch_andnot_relaxed(int i, atomic_t *v)
-{
-       return arch_atomic_fetch_and_relaxed(~i, v);
 }
-#define arch_atomic_fetch_andnot_relaxed arch_atomic_fetch_andnot_relaxed
-#endif
-
-#else /* arch_atomic_fetch_andnot_relaxed */
 
-#ifndef arch_atomic_fetch_andnot_acquire
+/**
+ * raw_atomic_fetch_andnot_acquire() - atomic bitwise AND NOT with acquire ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & ~@i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_andnot_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_andnot_acquire(int i, atomic_t *v)
+raw_atomic_fetch_andnot_acquire(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_andnot_acquire)
+       return arch_atomic_fetch_andnot_acquire(i, v);
+#elif defined(arch_atomic_fetch_andnot_relaxed)
        int ret = arch_atomic_fetch_andnot_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic_fetch_andnot_acquire arch_atomic_fetch_andnot_acquire
+#elif defined(arch_atomic_fetch_andnot)
+       return arch_atomic_fetch_andnot(i, v);
+#else
+       return raw_atomic_fetch_and_acquire(~i, v);
 #endif
+}
 
-#ifndef arch_atomic_fetch_andnot_release
+/**
+ * raw_atomic_fetch_andnot_release() - atomic bitwise AND NOT with release ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & ~@i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_andnot_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_andnot_release(int i, atomic_t *v)
+raw_atomic_fetch_andnot_release(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_andnot_release)
+       return arch_atomic_fetch_andnot_release(i, v);
+#elif defined(arch_atomic_fetch_andnot_relaxed)
        __atomic_release_fence();
        return arch_atomic_fetch_andnot_relaxed(i, v);
+#elif defined(arch_atomic_fetch_andnot)
+       return arch_atomic_fetch_andnot(i, v);
+#else
+       return raw_atomic_fetch_and_release(~i, v);
+#endif
 }
-#define arch_atomic_fetch_andnot_release arch_atomic_fetch_andnot_release
+
+/**
+ * raw_atomic_fetch_andnot_relaxed() - atomic bitwise AND NOT with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & ~@i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_andnot_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
+static __always_inline int
+raw_atomic_fetch_andnot_relaxed(int i, atomic_t *v)
+{
+#if defined(arch_atomic_fetch_andnot_relaxed)
+       return arch_atomic_fetch_andnot_relaxed(i, v);
+#elif defined(arch_atomic_fetch_andnot)
+       return arch_atomic_fetch_andnot(i, v);
+#else
+       return raw_atomic_fetch_and_relaxed(~i, v);
 #endif
+}
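/*
 * Illustrative sketch, not part of the patch: andnot() clears the bits in
 * @i, and the fetch_ form reports which of them were previously set.
 * EXAMPLE_FLAG_BUSY is a hypothetical flag bit.
 */
#define EXAMPLE_FLAG_BUSY	0x1

static inline bool example_clear_busy(atomic_t *flags)
{
	int old = raw_atomic_fetch_andnot(EXAMPLE_FLAG_BUSY, flags);

	return old & EXAMPLE_FLAG_BUSY;		/* true if we did the clearing */
}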
+
+/**
+ * raw_atomic_or() - atomic bitwise OR with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v | @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_or() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic_or(int i, atomic_t *v)
+{
+       arch_atomic_or(i, v);
+}
 
-#ifndef arch_atomic_fetch_andnot
+/**
+ * raw_atomic_fetch_or() - atomic bitwise OR with full ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v | @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_or() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_andnot(int i, atomic_t *v)
+raw_atomic_fetch_or(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_or)
+       return arch_atomic_fetch_or(i, v);
+#elif defined(arch_atomic_fetch_or_relaxed)
        int ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic_fetch_andnot_relaxed(i, v);
+       ret = arch_atomic_fetch_or_relaxed(i, v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic_fetch_andnot arch_atomic_fetch_andnot
+#else
+#error "Unable to define raw_atomic_fetch_or"
 #endif
+}
 
-#endif /* arch_atomic_fetch_andnot_relaxed */
-
-#ifndef arch_atomic_fetch_or_relaxed
-#define arch_atomic_fetch_or_acquire arch_atomic_fetch_or
-#define arch_atomic_fetch_or_release arch_atomic_fetch_or
-#define arch_atomic_fetch_or_relaxed arch_atomic_fetch_or
-#else /* arch_atomic_fetch_or_relaxed */
-
-#ifndef arch_atomic_fetch_or_acquire
+/**
+ * raw_atomic_fetch_or_acquire() - atomic bitwise OR with acquire ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v | @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_or_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_or_acquire(int i, atomic_t *v)
+raw_atomic_fetch_or_acquire(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_or_acquire)
+       return arch_atomic_fetch_or_acquire(i, v);
+#elif defined(arch_atomic_fetch_or_relaxed)
        int ret = arch_atomic_fetch_or_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic_fetch_or_acquire arch_atomic_fetch_or_acquire
+#elif defined(arch_atomic_fetch_or)
+       return arch_atomic_fetch_or(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_or_acquire"
 #endif
+}
 
-#ifndef arch_atomic_fetch_or_release
+/**
+ * raw_atomic_fetch_or_release() - atomic bitwise OR with release ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v | @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_or_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_or_release(int i, atomic_t *v)
+raw_atomic_fetch_or_release(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_or_release)
+       return arch_atomic_fetch_or_release(i, v);
+#elif defined(arch_atomic_fetch_or_relaxed)
        __atomic_release_fence();
        return arch_atomic_fetch_or_relaxed(i, v);
+#elif defined(arch_atomic_fetch_or)
+       return arch_atomic_fetch_or(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_or_release"
+#endif
 }
-#define arch_atomic_fetch_or_release arch_atomic_fetch_or_release
+
+/**
+ * raw_atomic_fetch_or_relaxed() - atomic bitwise OR with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v | @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_or_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
+static __always_inline int
+raw_atomic_fetch_or_relaxed(int i, atomic_t *v)
+{
+#if defined(arch_atomic_fetch_or_relaxed)
+       return arch_atomic_fetch_or_relaxed(i, v);
+#elif defined(arch_atomic_fetch_or)
+       return arch_atomic_fetch_or(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_or_relaxed"
 #endif
+}
 
-#ifndef arch_atomic_fetch_or
+/**
+ * raw_atomic_xor() - atomic bitwise XOR with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v ^ @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_xor() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic_xor(int i, atomic_t *v)
+{
+       arch_atomic_xor(i, v);
+}
+
+/**
+ * raw_atomic_fetch_xor() - atomic bitwise XOR with full ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v ^ @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_xor() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_or(int i, atomic_t *v)
+raw_atomic_fetch_xor(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_xor)
+       return arch_atomic_fetch_xor(i, v);
+#elif defined(arch_atomic_fetch_xor_relaxed)
        int ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic_fetch_or_relaxed(i, v);
+       ret = arch_atomic_fetch_xor_relaxed(i, v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic_fetch_or arch_atomic_fetch_or
+#else
+#error "Unable to define raw_atomic_fetch_xor"
 #endif
+}
 
-#endif /* arch_atomic_fetch_or_relaxed */
-
-#ifndef arch_atomic_fetch_xor_relaxed
-#define arch_atomic_fetch_xor_acquire arch_atomic_fetch_xor
-#define arch_atomic_fetch_xor_release arch_atomic_fetch_xor
-#define arch_atomic_fetch_xor_relaxed arch_atomic_fetch_xor
-#else /* arch_atomic_fetch_xor_relaxed */
-
-#ifndef arch_atomic_fetch_xor_acquire
+/**
+ * raw_atomic_fetch_xor_acquire() - atomic bitwise XOR with acquire ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v ^ @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_xor_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_xor_acquire(int i, atomic_t *v)
+raw_atomic_fetch_xor_acquire(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_xor_acquire)
+       return arch_atomic_fetch_xor_acquire(i, v);
+#elif defined(arch_atomic_fetch_xor_relaxed)
        int ret = arch_atomic_fetch_xor_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic_fetch_xor_acquire arch_atomic_fetch_xor_acquire
+#elif defined(arch_atomic_fetch_xor)
+       return arch_atomic_fetch_xor(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_xor_acquire"
 #endif
+}
 
-#ifndef arch_atomic_fetch_xor_release
+/**
+ * raw_atomic_fetch_xor_release() - atomic bitwise XOR with release ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v ^ @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_xor_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_xor_release(int i, atomic_t *v)
+raw_atomic_fetch_xor_release(int i, atomic_t *v)
 {
+#if defined(arch_atomic_fetch_xor_release)
+       return arch_atomic_fetch_xor_release(i, v);
+#elif defined(arch_atomic_fetch_xor_relaxed)
        __atomic_release_fence();
        return arch_atomic_fetch_xor_relaxed(i, v);
+#elif defined(arch_atomic_fetch_xor)
+       return arch_atomic_fetch_xor(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_xor_release"
+#endif
 }
-#define arch_atomic_fetch_xor_release arch_atomic_fetch_xor_release
+
+/**
+ * raw_atomic_fetch_xor_relaxed() - atomic bitwise XOR with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v ^ @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_xor_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
+static __always_inline int
+raw_atomic_fetch_xor_relaxed(int i, atomic_t *v)
+{
+#if defined(arch_atomic_fetch_xor_relaxed)
+       return arch_atomic_fetch_xor_relaxed(i, v);
+#elif defined(arch_atomic_fetch_xor)
+       return arch_atomic_fetch_xor(i, v);
+#else
+#error "Unable to define raw_atomic_fetch_xor_relaxed"
 #endif
+}
 
-#ifndef arch_atomic_fetch_xor
+/**
+ * raw_atomic_xchg() - atomic exchange with full ordering
+ * @v: pointer to atomic_t
+ * @new: int value to assign
+ *
+ * Atomically updates @v to @new with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_xchg() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_fetch_xor(int i, atomic_t *v)
+raw_atomic_xchg(atomic_t *v, int new)
 {
+#if defined(arch_atomic_xchg)
+       return arch_atomic_xchg(v, new);
+#elif defined(arch_atomic_xchg_relaxed)
        int ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic_fetch_xor_relaxed(i, v);
+       ret = arch_atomic_xchg_relaxed(v, new);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic_fetch_xor arch_atomic_fetch_xor
+#else
+       return raw_xchg(&v->counter, new);
 #endif
+}
 
-#endif /* arch_atomic_fetch_xor_relaxed */
-
-#ifndef arch_atomic_xchg_relaxed
-#define arch_atomic_xchg_acquire arch_atomic_xchg
-#define arch_atomic_xchg_release arch_atomic_xchg
-#define arch_atomic_xchg_relaxed arch_atomic_xchg
-#else /* arch_atomic_xchg_relaxed */
-
-#ifndef arch_atomic_xchg_acquire
+/**
+ * raw_atomic_xchg_acquire() - atomic exchange with acquire ordering
+ * @v: pointer to atomic_t
+ * @new: int value to assign
+ *
+ * Atomically updates @v to @new with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_xchg_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_xchg_acquire(atomic_t *v, int i)
+raw_atomic_xchg_acquire(atomic_t *v, int new)
 {
-       int ret = arch_atomic_xchg_relaxed(v, i);
+#if defined(arch_atomic_xchg_acquire)
+       return arch_atomic_xchg_acquire(v, new);
+#elif defined(arch_atomic_xchg_relaxed)
+       int ret = arch_atomic_xchg_relaxed(v, new);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic_xchg_acquire arch_atomic_xchg_acquire
+#elif defined(arch_atomic_xchg)
+       return arch_atomic_xchg(v, new);
+#else
+       return raw_xchg_acquire(&v->counter, new);
 #endif
+}
 
-#ifndef arch_atomic_xchg_release
+/**
+ * raw_atomic_xchg_release() - atomic exchange with release ordering
+ * @v: pointer to atomic_t
+ * @new: int value to assign
+ *
+ * Atomically updates @v to @new with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_xchg_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_xchg_release(atomic_t *v, int i)
+raw_atomic_xchg_release(atomic_t *v, int new)
 {
+#if defined(arch_atomic_xchg_release)
+       return arch_atomic_xchg_release(v, new);
+#elif defined(arch_atomic_xchg_relaxed)
        __atomic_release_fence();
-       return arch_atomic_xchg_relaxed(v, i);
+       return arch_atomic_xchg_relaxed(v, new);
+#elif defined(arch_atomic_xchg)
+       return arch_atomic_xchg(v, new);
+#else
+       return raw_xchg_release(&v->counter, new);
+#endif
 }
-#define arch_atomic_xchg_release arch_atomic_xchg_release
+
+/**
+ * raw_atomic_xchg_relaxed() - atomic exchange with relaxed ordering
+ * @v: pointer to atomic_t
+ * @new: int value to assign
+ *
+ * Atomically updates @v to @new with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_xchg_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
+static __always_inline int
+raw_atomic_xchg_relaxed(atomic_t *v, int new)
+{
+#if defined(arch_atomic_xchg_relaxed)
+       return arch_atomic_xchg_relaxed(v, new);
+#elif defined(arch_atomic_xchg)
+       return arch_atomic_xchg(v, new);
+#else
+       return raw_xchg_relaxed(&v->counter, new);
 #endif
+}
 
-#ifndef arch_atomic_xchg
+/**
+ * raw_atomic_cmpxchg() - atomic compare and exchange with full ordering
+ * @v: pointer to atomic_t
+ * @old: int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_cmpxchg() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_xchg(atomic_t *v, int i)
+raw_atomic_cmpxchg(atomic_t *v, int old, int new)
 {
+#if defined(arch_atomic_cmpxchg)
+       return arch_atomic_cmpxchg(v, old, new);
+#elif defined(arch_atomic_cmpxchg_relaxed)
        int ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic_xchg_relaxed(v, i);
+       ret = arch_atomic_cmpxchg_relaxed(v, old, new);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic_xchg arch_atomic_xchg
-#endif
-
-#endif /* arch_atomic_xchg_relaxed */
-
-#ifndef arch_atomic_cmpxchg_relaxed
-#define arch_atomic_cmpxchg_acquire arch_atomic_cmpxchg
-#define arch_atomic_cmpxchg_release arch_atomic_cmpxchg
-#define arch_atomic_cmpxchg_relaxed arch_atomic_cmpxchg
-#else /* arch_atomic_cmpxchg_relaxed */
+#else
+       return raw_cmpxchg(&v->counter, old, new);
+#endif
+}
 
-#ifndef arch_atomic_cmpxchg_acquire
+/**
+ * raw_atomic_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
+ * @v: pointer to atomic_t
+ * @old: int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_cmpxchg_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_cmpxchg_acquire(atomic_t *v, int old, int new)
+raw_atomic_cmpxchg_acquire(atomic_t *v, int old, int new)
 {
+#if defined(arch_atomic_cmpxchg_acquire)
+       return arch_atomic_cmpxchg_acquire(v, old, new);
+#elif defined(arch_atomic_cmpxchg_relaxed)
        int ret = arch_atomic_cmpxchg_relaxed(v, old, new);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic_cmpxchg_acquire arch_atomic_cmpxchg_acquire
+#elif defined(arch_atomic_cmpxchg)
+       return arch_atomic_cmpxchg(v, old, new);
+#else
+       return raw_cmpxchg_acquire(&v->counter, old, new);
 #endif
+}
 
-#ifndef arch_atomic_cmpxchg_release
+/**
+ * raw_atomic_cmpxchg_release() - atomic compare and exchange with release ordering
+ * @v: pointer to atomic_t
+ * @old: int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_cmpxchg_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_cmpxchg_release(atomic_t *v, int old, int new)
+raw_atomic_cmpxchg_release(atomic_t *v, int old, int new)
 {
+#if defined(arch_atomic_cmpxchg_release)
+       return arch_atomic_cmpxchg_release(v, old, new);
+#elif defined(arch_atomic_cmpxchg_relaxed)
        __atomic_release_fence();
        return arch_atomic_cmpxchg_relaxed(v, old, new);
-}
-#define arch_atomic_cmpxchg_release arch_atomic_cmpxchg_release
+#elif defined(arch_atomic_cmpxchg)
+       return arch_atomic_cmpxchg(v, old, new);
+#else
+       return raw_cmpxchg_release(&v->counter, old, new);
 #endif
+}
 
-#ifndef arch_atomic_cmpxchg
+/**
+ * raw_atomic_cmpxchg_relaxed() - atomic compare and exchange with relaxed ordering
+ * @v: pointer to atomic_t
+ * @old: int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_cmpxchg_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-arch_atomic_cmpxchg(atomic_t *v, int old, int new)
+raw_atomic_cmpxchg_relaxed(atomic_t *v, int old, int new)
 {
-       int ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic_cmpxchg_relaxed(v, old, new);
-       __atomic_post_full_fence();
-       return ret;
-}
-#define arch_atomic_cmpxchg arch_atomic_cmpxchg
+#if defined(arch_atomic_cmpxchg_relaxed)
+       return arch_atomic_cmpxchg_relaxed(v, old, new);
+#elif defined(arch_atomic_cmpxchg)
+       return arch_atomic_cmpxchg(v, old, new);
+#else
+       return raw_cmpxchg_relaxed(&v->counter, old, new);
 #endif
+}
 
-#endif /* arch_atomic_cmpxchg_relaxed */
-
-#ifndef arch_atomic_try_cmpxchg_relaxed
-#ifdef arch_atomic_try_cmpxchg
-#define arch_atomic_try_cmpxchg_acquire arch_atomic_try_cmpxchg
-#define arch_atomic_try_cmpxchg_release arch_atomic_try_cmpxchg
-#define arch_atomic_try_cmpxchg_relaxed arch_atomic_try_cmpxchg
-#endif /* arch_atomic_try_cmpxchg */
-
-#ifndef arch_atomic_try_cmpxchg
+/**
+ * raw_atomic_try_cmpxchg() - atomic compare and exchange with full ordering
+ * @v: pointer to atomic_t
+ * @old: pointer to int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with full ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Safe to use in noinstr code; prefer atomic_try_cmpxchg() elsewhere.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
+raw_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
 {
+#if defined(arch_atomic_try_cmpxchg)
+       return arch_atomic_try_cmpxchg(v, old, new);
+#elif defined(arch_atomic_try_cmpxchg_relaxed)
+       bool ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic_try_cmpxchg_relaxed(v, old, new);
+       __atomic_post_full_fence();
+       return ret;
+#else
        int r, o = *old;
-       r = arch_atomic_cmpxchg(v, o, new);
+       r = raw_atomic_cmpxchg(v, o, new);
        if (unlikely(r != o))
                *old = r;
        return likely(r == o);
-}
-#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
 #endif
+}
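On failure the freshly observed value is stored back through @old, so the usual retry loop needs no separate re-read. A minimal usage sketch follows; the add_below() helper and its limit argument are hypothetical, shown only to illustrate the shape that the loops later in this file (e.g. raw_atomic_fetch_add_unless()) also follow.

	/* Hypothetical helper: add @a to @v unless that would reach @limit. */
	static __always_inline bool add_below(atomic_t *v, int a, int limit)
	{
		int c = raw_atomic_read(v);

		do {
			if (unlikely(c + a >= limit))
				return false;			/* would reach the limit */
		} while (!raw_atomic_try_cmpxchg(v, &c, c + a));	/* c refreshed on failure */

		return true;
	}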
 
-#ifndef arch_atomic_try_cmpxchg_acquire
+/**
+ * raw_atomic_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
+ * @v: pointer to atomic_t
+ * @old: pointer to int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with acquire ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Safe to use in noinstr code; prefer atomic_try_cmpxchg_acquire() elsewhere.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
+raw_atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
 {
+#if defined(arch_atomic_try_cmpxchg_acquire)
+       return arch_atomic_try_cmpxchg_acquire(v, old, new);
+#elif defined(arch_atomic_try_cmpxchg_relaxed)
+       bool ret = arch_atomic_try_cmpxchg_relaxed(v, old, new);
+       __atomic_acquire_fence();
+       return ret;
+#elif defined(arch_atomic_try_cmpxchg)
+       return arch_atomic_try_cmpxchg(v, old, new);
+#else
        int r, o = *old;
-       r = arch_atomic_cmpxchg_acquire(v, o, new);
+       r = raw_atomic_cmpxchg_acquire(v, o, new);
        if (unlikely(r != o))
                *old = r;
        return likely(r == o);
-}
-#define arch_atomic_try_cmpxchg_acquire arch_atomic_try_cmpxchg_acquire
 #endif
+}
 
-#ifndef arch_atomic_try_cmpxchg_release
+/**
+ * raw_atomic_try_cmpxchg_release() - atomic compare and exchange with release ordering
+ * @v: pointer to atomic_t
+ * @old: pointer to int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with release ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Safe to use in noinstr code; prefer atomic_try_cmpxchg_release() elsewhere.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_try_cmpxchg_release(atomic_t *v, int *old, int new)
+raw_atomic_try_cmpxchg_release(atomic_t *v, int *old, int new)
 {
+#if defined(arch_atomic_try_cmpxchg_release)
+       return arch_atomic_try_cmpxchg_release(v, old, new);
+#elif defined(arch_atomic_try_cmpxchg_relaxed)
+       __atomic_release_fence();
+       return arch_atomic_try_cmpxchg_relaxed(v, old, new);
+#elif defined(arch_atomic_try_cmpxchg)
+       return arch_atomic_try_cmpxchg(v, old, new);
+#else
        int r, o = *old;
-       r = arch_atomic_cmpxchg_release(v, o, new);
+       r = raw_atomic_cmpxchg_release(v, o, new);
        if (unlikely(r != o))
                *old = r;
        return likely(r == o);
-}
-#define arch_atomic_try_cmpxchg_release arch_atomic_try_cmpxchg_release
 #endif
+}
 
-#ifndef arch_atomic_try_cmpxchg_relaxed
+/**
+ * raw_atomic_try_cmpxchg_relaxed() - atomic compare and exchange with relaxed ordering
+ * @v: pointer to atomic_t
+ * @old: pointer to int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with relaxed ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Safe to use in noinstr code; prefer atomic_try_cmpxchg_relaxed() elsewhere.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_try_cmpxchg_relaxed(atomic_t *v, int *old, int new)
+raw_atomic_try_cmpxchg_relaxed(atomic_t *v, int *old, int new)
 {
+#if defined(arch_atomic_try_cmpxchg_relaxed)
+       return arch_atomic_try_cmpxchg_relaxed(v, old, new);
+#elif defined(arch_atomic_try_cmpxchg)
+       return arch_atomic_try_cmpxchg(v, old, new);
+#else
        int r, o = *old;
-       r = arch_atomic_cmpxchg_relaxed(v, o, new);
+       r = raw_atomic_cmpxchg_relaxed(v, o, new);
        if (unlikely(r != o))
                *old = r;
        return likely(r == o);
-}
-#define arch_atomic_try_cmpxchg_relaxed arch_atomic_try_cmpxchg_relaxed
-#endif
-
-#else /* arch_atomic_try_cmpxchg_relaxed */
-
-#ifndef arch_atomic_try_cmpxchg_acquire
-static __always_inline bool
-arch_atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
-{
-       bool ret = arch_atomic_try_cmpxchg_relaxed(v, old, new);
-       __atomic_acquire_fence();
-       return ret;
-}
-#define arch_atomic_try_cmpxchg_acquire arch_atomic_try_cmpxchg_acquire
-#endif
-
-#ifndef arch_atomic_try_cmpxchg_release
-static __always_inline bool
-arch_atomic_try_cmpxchg_release(atomic_t *v, int *old, int new)
-{
-       __atomic_release_fence();
-       return arch_atomic_try_cmpxchg_relaxed(v, old, new);
-}
-#define arch_atomic_try_cmpxchg_release arch_atomic_try_cmpxchg_release
 #endif
-
-#ifndef arch_atomic_try_cmpxchg
-static __always_inline bool
-arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
-{
-       bool ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic_try_cmpxchg_relaxed(v, old, new);
-       __atomic_post_full_fence();
-       return ret;
 }
-#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
-#endif
 
-#endif /* arch_atomic_try_cmpxchg_relaxed */
-
-#ifndef arch_atomic_sub_and_test
 /**
- * arch_atomic_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
+ * raw_atomic_sub_and_test() - atomic subtract and test if zero with full ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
  *
- * Atomically subtracts @i from @v and returns
- * true if the result is zero, or false for all
- * other cases.
+ * Safe to use in noinstr code; prefer atomic_sub_and_test() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
  */
 static __always_inline bool
-arch_atomic_sub_and_test(int i, atomic_t *v)
+raw_atomic_sub_and_test(int i, atomic_t *v)
 {
-       return arch_atomic_sub_return(i, v) == 0;
-}
-#define arch_atomic_sub_and_test arch_atomic_sub_and_test
+#if defined(arch_atomic_sub_and_test)
+       return arch_atomic_sub_and_test(i, v);
+#else
+       return raw_atomic_sub_return(i, v) == 0;
 #endif
+}
 
-#ifndef arch_atomic_dec_and_test
 /**
- * arch_atomic_dec_and_test - decrement and test
- * @v: pointer of type atomic_t
+ * raw_atomic_dec_and_test() - atomic decrement and test if zero with full ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
  *
- * Atomically decrements @v by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
+ * Safe to use in noinstr code; prefer atomic_dec_and_test() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
  */
 static __always_inline bool
-arch_atomic_dec_and_test(atomic_t *v)
+raw_atomic_dec_and_test(atomic_t *v)
 {
-       return arch_atomic_dec_return(v) == 0;
-}
-#define arch_atomic_dec_and_test arch_atomic_dec_and_test
+#if defined(arch_atomic_dec_and_test)
+       return arch_atomic_dec_and_test(v);
+#else
+       return raw_atomic_dec_return(v) == 0;
 #endif
+}
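This is the classic reference-count release primitive. A usage sketch, with struct foo and foo_destroy() as hypothetical names; outside noinstr code the instrumented atomic_dec_and_test() (or refcount_t) would normally be preferred, per the comment above.

	struct foo {
		atomic_t refs;
		/* ... */
	};

	static void foo_destroy(struct foo *f);

	static void foo_put(struct foo *f)
	{
		/* Only the caller dropping the final reference sees zero. */
		if (raw_atomic_dec_and_test(&f->refs))
			foo_destroy(f);
	}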
 
-#ifndef arch_atomic_inc_and_test
 /**
- * arch_atomic_inc_and_test - increment and test
- * @v: pointer of type atomic_t
+ * raw_atomic_inc_and_test() - atomic increment and test if zero with full ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
  *
- * Atomically increments @v by 1
- * and returns true if the result is zero, or false for all
- * other cases.
+ * Safe to use in noinstr code; prefer atomic_inc_and_test() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
  */
 static __always_inline bool
-arch_atomic_inc_and_test(atomic_t *v)
+raw_atomic_inc_and_test(atomic_t *v)
 {
-       return arch_atomic_inc_return(v) == 0;
-}
-#define arch_atomic_inc_and_test arch_atomic_inc_and_test
+#if defined(arch_atomic_inc_and_test)
+       return arch_atomic_inc_and_test(v);
+#else
+       return raw_atomic_inc_return(v) == 0;
 #endif
+}
 
-#ifndef arch_atomic_add_negative_relaxed
-#ifdef arch_atomic_add_negative
-#define arch_atomic_add_negative_acquire arch_atomic_add_negative
-#define arch_atomic_add_negative_release arch_atomic_add_negative
-#define arch_atomic_add_negative_relaxed arch_atomic_add_negative
-#endif /* arch_atomic_add_negative */
-
-#ifndef arch_atomic_add_negative
 /**
- * arch_atomic_add_negative - Add and test if negative
- * @i: integer value to add
- * @v: pointer of type atomic_t
+ * raw_atomic_add_negative() - atomic add and test if negative with full ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
  *
- * Atomically adds @i to @v and returns true if the result is negative,
- * or false when the result is greater than or equal to zero.
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_add_negative() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
  */
 static __always_inline bool
-arch_atomic_add_negative(int i, atomic_t *v)
+raw_atomic_add_negative(int i, atomic_t *v)
 {
-       return arch_atomic_add_return(i, v) < 0;
-}
-#define arch_atomic_add_negative arch_atomic_add_negative
+#if defined(arch_atomic_add_negative)
+       return arch_atomic_add_negative(i, v);
+#elif defined(arch_atomic_add_negative_relaxed)
+       bool ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic_add_negative_relaxed(i, v);
+       __atomic_post_full_fence();
+       return ret;
+#else
+       return raw_atomic_add_return(i, v) < 0;
 #endif
+}
 
-#ifndef arch_atomic_add_negative_acquire
 /**
- * arch_atomic_add_negative_acquire - Add and test if negative
- * @i: integer value to add
- * @v: pointer of type atomic_t
+ * raw_atomic_add_negative_acquire() - atomic add and test if negative with acquire ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
  *
- * Atomically adds @i to @v and returns true if the result is negative,
- * or false when the result is greater than or equal to zero.
+ * Safe to use in noinstr code; prefer atomic_add_negative_acquire() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
  */
 static __always_inline bool
-arch_atomic_add_negative_acquire(int i, atomic_t *v)
+raw_atomic_add_negative_acquire(int i, atomic_t *v)
 {
-       return arch_atomic_add_return_acquire(i, v) < 0;
-}
-#define arch_atomic_add_negative_acquire arch_atomic_add_negative_acquire
+#if defined(arch_atomic_add_negative_acquire)
+       return arch_atomic_add_negative_acquire(i, v);
+#elif defined(arch_atomic_add_negative_relaxed)
+       bool ret = arch_atomic_add_negative_relaxed(i, v);
+       __atomic_acquire_fence();
+       return ret;
+#elif defined(arch_atomic_add_negative)
+       return arch_atomic_add_negative(i, v);
+#else
+       return raw_atomic_add_return_acquire(i, v) < 0;
 #endif
+}
 
-#ifndef arch_atomic_add_negative_release
 /**
- * arch_atomic_add_negative_release - Add and test if negative
- * @i: integer value to add
- * @v: pointer of type atomic_t
+ * raw_atomic_add_negative_release() - atomic add and test if negative with release ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
  *
- * Atomically adds @i to @v and returns true if the result is negative,
- * or false when the result is greater than or equal to zero.
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_add_negative_release() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
  */
 static __always_inline bool
-arch_atomic_add_negative_release(int i, atomic_t *v)
+raw_atomic_add_negative_release(int i, atomic_t *v)
 {
-       return arch_atomic_add_return_release(i, v) < 0;
-}
-#define arch_atomic_add_negative_release arch_atomic_add_negative_release
+#if defined(arch_atomic_add_negative_release)
+       return arch_atomic_add_negative_release(i, v);
+#elif defined(arch_atomic_add_negative_relaxed)
+       __atomic_release_fence();
+       return arch_atomic_add_negative_relaxed(i, v);
+#elif defined(arch_atomic_add_negative)
+       return arch_atomic_add_negative(i, v);
+#else
+       return raw_atomic_add_return_release(i, v) < 0;
 #endif
+}
 
-#ifndef arch_atomic_add_negative_relaxed
 /**
- * arch_atomic_add_negative_relaxed - Add and test if negative
- * @i: integer value to add
- * @v: pointer of type atomic_t
+ * raw_atomic_add_negative_relaxed() - atomic add and test if negative with relaxed ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_add_negative_relaxed() elsewhere.
  *
- * Atomically adds @i to @v and returns true if the result is negative,
- * or false when the result is greater than or equal to zero.
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
  */
 static __always_inline bool
-arch_atomic_add_negative_relaxed(int i, atomic_t *v)
-{
-       return arch_atomic_add_return_relaxed(i, v) < 0;
-}
-#define arch_atomic_add_negative_relaxed arch_atomic_add_negative_relaxed
-#endif
-
-#else /* arch_atomic_add_negative_relaxed */
-
-#ifndef arch_atomic_add_negative_acquire
-static __always_inline bool
-arch_atomic_add_negative_acquire(int i, atomic_t *v)
-{
-       bool ret = arch_atomic_add_negative_relaxed(i, v);
-       __atomic_acquire_fence();
-       return ret;
-}
-#define arch_atomic_add_negative_acquire arch_atomic_add_negative_acquire
-#endif
-
-#ifndef arch_atomic_add_negative_release
-static __always_inline bool
-arch_atomic_add_negative_release(int i, atomic_t *v)
+raw_atomic_add_negative_relaxed(int i, atomic_t *v)
 {
-       __atomic_release_fence();
+#if defined(arch_atomic_add_negative_relaxed)
        return arch_atomic_add_negative_relaxed(i, v);
-}
-#define arch_atomic_add_negative_release arch_atomic_add_negative_release
+#elif defined(arch_atomic_add_negative)
+       return arch_atomic_add_negative(i, v);
+#else
+       return raw_atomic_add_return_relaxed(i, v) < 0;
 #endif
-
-#ifndef arch_atomic_add_negative
-static __always_inline bool
-arch_atomic_add_negative(int i, atomic_t *v)
-{
-       bool ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic_add_negative_relaxed(i, v);
-       __atomic_post_full_fence();
-       return ret;
 }
-#define arch_atomic_add_negative arch_atomic_add_negative
-#endif
-
-#endif /* arch_atomic_add_negative_relaxed */
 
-#ifndef arch_atomic_fetch_add_unless
 /**
- * arch_atomic_fetch_add_unless - add unless the number is already a given value
- * @v: pointer of type atomic_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
+ * raw_atomic_fetch_add_unless() - atomic add unless value with full ordering
+ * @v: pointer to atomic_t
+ * @a: int value to add
+ * @u: int value to compare with
+ *
+ * If (@v != @u), atomically updates @v to (@v + @a) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_fetch_add_unless() elsewhere.
  *
- * Atomically adds @a to @v, so long as @v was not already @u.
- * Returns original value of @v
+ * Return: The original value of @v.
  */
 static __always_inline int
-arch_atomic_fetch_add_unless(atomic_t *v, int a, int u)
+raw_atomic_fetch_add_unless(atomic_t *v, int a, int u)
 {
-       int c = arch_atomic_read(v);
+#if defined(arch_atomic_fetch_add_unless)
+       return arch_atomic_fetch_add_unless(v, a, u);
+#else
+       int c = raw_atomic_read(v);
 
        do {
                if (unlikely(c == u))
                        break;
-       } while (!arch_atomic_try_cmpxchg(v, &c, c + a));
+       } while (!raw_atomic_try_cmpxchg(v, &c, c + a));
 
        return c;
-}
-#define arch_atomic_fetch_add_unless arch_atomic_fetch_add_unless
 #endif
+}
 
-#ifndef arch_atomic_add_unless
 /**
- * arch_atomic_add_unless - add unless the number is already a given value
- * @v: pointer of type atomic_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
+ * raw_atomic_add_unless() - atomic add unless value with full ordering
+ * @v: pointer to atomic_t
+ * @a: int value to add
+ * @u: int value to compare with
+ *
+ * If (@v != @u), atomically updates @v to (@v + @a) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_add_unless() elsewhere.
  *
- * Atomically adds @a to @v, if @v was not already @u.
- * Returns true if the addition was done.
+ * Return: @true if @v was updated, @false otherwise.
  */
 static __always_inline bool
-arch_atomic_add_unless(atomic_t *v, int a, int u)
+raw_atomic_add_unless(atomic_t *v, int a, int u)
 {
-       return arch_atomic_fetch_add_unless(v, a, u) != u;
-}
-#define arch_atomic_add_unless arch_atomic_add_unless
+#if defined(arch_atomic_add_unless)
+       return arch_atomic_add_unless(v, a, u);
+#else
+       return raw_atomic_fetch_add_unless(v, a, u) != u;
 #endif
+}
 
-#ifndef arch_atomic_inc_not_zero
 /**
- * arch_atomic_inc_not_zero - increment unless the number is zero
- * @v: pointer of type atomic_t
+ * raw_atomic_inc_not_zero() - atomic increment unless zero with full ordering
+ * @v: pointer to atomic_t
+ *
+ * If (@v != 0), atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_inc_not_zero() elsewhere.
  *
- * Atomically increments @v by 1, if @v is non-zero.
- * Returns true if the increment was done.
+ * Return: @true if @v was updated, @false otherwise.
  */
 static __always_inline bool
-arch_atomic_inc_not_zero(atomic_t *v)
+raw_atomic_inc_not_zero(atomic_t *v)
 {
-       return arch_atomic_add_unless(v, 1, 0);
-}
-#define arch_atomic_inc_not_zero arch_atomic_inc_not_zero
+#if defined(arch_atomic_inc_not_zero)
+       return arch_atomic_inc_not_zero(v);
+#else
+       return raw_atomic_add_unless(v, 1, 0);
 #endif
+}
 
-#ifndef arch_atomic_inc_unless_negative
+/**
+ * raw_atomic_inc_unless_negative() - atomic increment unless negative with full ordering
+ * @v: pointer to atomic_t
+ *
+ * If (@v >= 0), atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_inc_unless_negative() elsewhere.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_inc_unless_negative(atomic_t *v)
+raw_atomic_inc_unless_negative(atomic_t *v)
 {
-       int c = arch_atomic_read(v);
+#if defined(arch_atomic_inc_unless_negative)
+       return arch_atomic_inc_unless_negative(v);
+#else
+       int c = raw_atomic_read(v);
 
        do {
                if (unlikely(c < 0))
                        return false;
-       } while (!arch_atomic_try_cmpxchg(v, &c, c + 1));
+       } while (!raw_atomic_try_cmpxchg(v, &c, c + 1));
 
        return true;
-}
-#define arch_atomic_inc_unless_negative arch_atomic_inc_unless_negative
 #endif
+}
 
-#ifndef arch_atomic_dec_unless_positive
+/**
+ * raw_atomic_dec_unless_positive() - atomic decrement unless positive with full ordering
+ * @v: pointer to atomic_t
+ *
+ * If (@v <= 0), atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_dec_unless_positive() elsewhere.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_dec_unless_positive(atomic_t *v)
+raw_atomic_dec_unless_positive(atomic_t *v)
 {
-       int c = arch_atomic_read(v);
+#if defined(arch_atomic_dec_unless_positive)
+       return arch_atomic_dec_unless_positive(v);
+#else
+       int c = raw_atomic_read(v);
 
        do {
                if (unlikely(c > 0))
                        return false;
-       } while (!arch_atomic_try_cmpxchg(v, &c, c - 1));
+       } while (!raw_atomic_try_cmpxchg(v, &c, c - 1));
 
        return true;
-}
-#define arch_atomic_dec_unless_positive arch_atomic_dec_unless_positive
 #endif
+}
 
-#ifndef arch_atomic_dec_if_positive
+/**
+ * raw_atomic_dec_if_positive() - atomic decrement if positive with full ordering
+ * @v: pointer to atomic_t
+ *
+ * If (@v > 0), atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_dec_if_positive() elsewhere.
+ *
+ * Return: The old value of @v minus 1, regardless of whether @v was updated.
+ */
 static __always_inline int
-arch_atomic_dec_if_positive(atomic_t *v)
+raw_atomic_dec_if_positive(atomic_t *v)
 {
-       int dec, c = arch_atomic_read(v);
+#if defined(arch_atomic_dec_if_positive)
+       return arch_atomic_dec_if_positive(v);
+#else
+       int dec, c = raw_atomic_read(v);
 
        do {
                dec = c - 1;
                if (unlikely(dec < 0))
                        break;
-       } while (!arch_atomic_try_cmpxchg(v, &c, dec));
+       } while (!raw_atomic_try_cmpxchg(v, &c, dec));
 
        return dec;
-}
-#define arch_atomic_dec_if_positive arch_atomic_dec_if_positive
 #endif
+}
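Because the decremented value is returned whether or not the store happened, a caller can tell the two cases apart from the sign alone. A usage sketch with hypothetical names, modelling a "grab one unit if any are left" operation:

	static bool take_one(atomic_t *available)
	{
		/*
		 * A result >= 0 means the old value was positive and one unit
		 * was taken; a negative result means nothing was available and
		 * @available is unchanged.
		 */
		return raw_atomic_dec_if_positive(available) >= 0;
	}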
 
 #ifdef CONFIG_GENERIC_ATOMIC64
 #include <asm-generic/atomic64.h>
 #endif
 
-#ifndef arch_atomic64_read_acquire
+/**
+ * raw_atomic64_read() - atomic load with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically loads the value of @v with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_read() elsewhere.
+ *
+ * Return: The value loaded from @v.
+ */
+static __always_inline s64
+raw_atomic64_read(const atomic64_t *v)
+{
+       return arch_atomic64_read(v);
+}
+
+/**
+ * raw_atomic64_read_acquire() - atomic load with acquire ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically loads the value of @v with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_read_acquire() elsewhere.
+ *
+ * Return: The value loaded from @v.
+ */
 static __always_inline s64
-arch_atomic64_read_acquire(const atomic64_t *v)
+raw_atomic64_read_acquire(const atomic64_t *v)
 {
+#if defined(arch_atomic64_read_acquire)
+       return arch_atomic64_read_acquire(v);
+#elif defined(arch_atomic64_read)
+       return arch_atomic64_read(v);
+#else
        s64 ret;
 
        if (__native_word(atomic64_t)) {
                ret = smp_load_acquire(&(v)->counter);
        } else {
-               ret = arch_atomic64_read(v);
+               ret = raw_atomic64_read(v);
                __atomic_acquire_fence();
        }
 
        return ret;
-}
-#define arch_atomic64_read_acquire arch_atomic64_read_acquire
 #endif
+}
+
+/**
+ * raw_atomic64_set() - atomic set with relaxed ordering
+ * @v: pointer to atomic64_t
+ * @i: s64 value to assign
+ *
+ * Atomically sets @v to @i with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_set() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic64_set(atomic64_t *v, s64 i)
+{
+       arch_atomic64_set(v, i);
+}
 
-#ifndef arch_atomic64_set_release
+/**
+ * raw_atomic64_set_release() - atomic set with release ordering
+ * @v: pointer to atomic64_t
+ * @i: s64 value to assign
+ *
+ * Atomically sets @v to @i with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_set_release() elsewhere.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
-arch_atomic64_set_release(atomic64_t *v, s64 i)
+raw_atomic64_set_release(atomic64_t *v, s64 i)
 {
+#if defined(arch_atomic64_set_release)
+       arch_atomic64_set_release(v, i);
+#elif defined(arch_atomic64_set)
+       arch_atomic64_set(v, i);
+#else
        if (__native_word(atomic64_t)) {
                smp_store_release(&(v)->counter, i);
        } else {
                __atomic_release_fence();
-               arch_atomic64_set(v, i);
+               raw_atomic64_set(v, i);
        }
-}
-#define arch_atomic64_set_release arch_atomic64_set_release
 #endif
+}
+
+/**
+ * raw_atomic64_add() - atomic add with relaxed ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_add() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic64_add(s64 i, atomic64_t *v)
+{
+       arch_atomic64_add(i, v);
+}
 
-#ifndef arch_atomic64_add_return_relaxed
-#define arch_atomic64_add_return_acquire arch_atomic64_add_return
-#define arch_atomic64_add_return_release arch_atomic64_add_return
-#define arch_atomic64_add_return_relaxed arch_atomic64_add_return
-#else /* arch_atomic64_add_return_relaxed */
+/**
+ * raw_atomic64_add_return() - atomic add with full ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_add_return() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
+static __always_inline s64
+raw_atomic64_add_return(s64 i, atomic64_t *v)
+{
+#if defined(arch_atomic64_add_return)
+       return arch_atomic64_add_return(i, v);
+#elif defined(arch_atomic64_add_return_relaxed)
+       s64 ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic64_add_return_relaxed(i, v);
+       __atomic_post_full_fence();
+       return ret;
+#else
+#error "Unable to define raw_atomic64_add_return"
+#endif
+}
 
-#ifndef arch_atomic64_add_return_acquire
+/**
+ * raw_atomic64_add_return_acquire() - atomic add with acquire ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_add_return_acquire() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
-arch_atomic64_add_return_acquire(s64 i, atomic64_t *v)
+raw_atomic64_add_return_acquire(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_add_return_acquire)
+       return arch_atomic64_add_return_acquire(i, v);
+#elif defined(arch_atomic64_add_return_relaxed)
        s64 ret = arch_atomic64_add_return_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic64_add_return_acquire arch_atomic64_add_return_acquire
+#elif defined(arch_atomic64_add_return)
+       return arch_atomic64_add_return(i, v);
+#else
+#error "Unable to define raw_atomic64_add_return_acquire"
 #endif
+}
 
-#ifndef arch_atomic64_add_return_release
+/**
+ * raw_atomic64_add_return_release() - atomic add with release ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_add_return_release() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
-arch_atomic64_add_return_release(s64 i, atomic64_t *v)
+raw_atomic64_add_return_release(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_add_return_release)
+       return arch_atomic64_add_return_release(i, v);
+#elif defined(arch_atomic64_add_return_relaxed)
        __atomic_release_fence();
        return arch_atomic64_add_return_relaxed(i, v);
+#elif defined(arch_atomic64_add_return)
+       return arch_atomic64_add_return(i, v);
+#else
+#error "Unable to define raw_atomic64_add_return_release"
+#endif
 }
-#define arch_atomic64_add_return_release arch_atomic64_add_return_release
+
+/**
+ * raw_atomic64_add_return_relaxed() - atomic add with relaxed ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_add_return_relaxed() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
+static __always_inline s64
+raw_atomic64_add_return_relaxed(s64 i, atomic64_t *v)
+{
+#if defined(arch_atomic64_add_return_relaxed)
+       return arch_atomic64_add_return_relaxed(i, v);
+#elif defined(arch_atomic64_add_return)
+       return arch_atomic64_add_return(i, v);
+#else
+#error "Unable to define raw_atomic64_add_return_relaxed"
 #endif
+}
 
-#ifndef arch_atomic64_add_return
+/**
+ * raw_atomic64_fetch_add() - atomic add with full ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_add() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_add_return(s64 i, atomic64_t *v)
+raw_atomic64_fetch_add(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_add)
+       return arch_atomic64_fetch_add(i, v);
+#elif defined(arch_atomic64_fetch_add_relaxed)
        s64 ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic64_add_return_relaxed(i, v);
+       ret = arch_atomic64_fetch_add_relaxed(i, v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic64_add_return arch_atomic64_add_return
+#else
+#error "Unable to define raw_atomic64_fetch_add"
 #endif
+}
 
-#endif /* arch_atomic64_add_return_relaxed */
-
-#ifndef arch_atomic64_fetch_add_relaxed
-#define arch_atomic64_fetch_add_acquire arch_atomic64_fetch_add
-#define arch_atomic64_fetch_add_release arch_atomic64_fetch_add
-#define arch_atomic64_fetch_add_relaxed arch_atomic64_fetch_add
-#else /* arch_atomic64_fetch_add_relaxed */
-
-#ifndef arch_atomic64_fetch_add_acquire
+/**
+ * raw_atomic64_fetch_add_acquire() - atomic add with acquire ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_add_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_add_acquire(s64 i, atomic64_t *v)
+raw_atomic64_fetch_add_acquire(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_add_acquire)
+       return arch_atomic64_fetch_add_acquire(i, v);
+#elif defined(arch_atomic64_fetch_add_relaxed)
        s64 ret = arch_atomic64_fetch_add_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic64_fetch_add_acquire arch_atomic64_fetch_add_acquire
+#elif defined(arch_atomic64_fetch_add)
+       return arch_atomic64_fetch_add(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_add_acquire"
 #endif
+}
 
-#ifndef arch_atomic64_fetch_add_release
+/**
+ * raw_atomic64_fetch_add_release() - atomic add with release ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_add_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_add_release(s64 i, atomic64_t *v)
+raw_atomic64_fetch_add_release(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_add_release)
+       return arch_atomic64_fetch_add_release(i, v);
+#elif defined(arch_atomic64_fetch_add_relaxed)
        __atomic_release_fence();
        return arch_atomic64_fetch_add_relaxed(i, v);
+#elif defined(arch_atomic64_fetch_add)
+       return arch_atomic64_fetch_add(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_add_release"
+#endif
 }
-#define arch_atomic64_fetch_add_release arch_atomic64_fetch_add_release
+
+/**
+ * raw_atomic64_fetch_add_relaxed() - atomic add with relaxed ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_add_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
+static __always_inline s64
+raw_atomic64_fetch_add_relaxed(s64 i, atomic64_t *v)
+{
+#if defined(arch_atomic64_fetch_add_relaxed)
+       return arch_atomic64_fetch_add_relaxed(i, v);
+#elif defined(arch_atomic64_fetch_add)
+       return arch_atomic64_fetch_add(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_add_relaxed"
 #endif
+}
+
+/**
+ * raw_atomic64_sub() - atomic subtract with relaxed ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_sub() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic64_sub(s64 i, atomic64_t *v)
+{
+       arch_atomic64_sub(i, v);
+}
 
-#ifndef arch_atomic64_fetch_add
+/**
+ * raw_atomic64_sub_return() - atomic subtract with full ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_sub_return() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_add(s64 i, atomic64_t *v)
+raw_atomic64_sub_return(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_sub_return)
+       return arch_atomic64_sub_return(i, v);
+#elif defined(arch_atomic64_sub_return_relaxed)
        s64 ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic64_fetch_add_relaxed(i, v);
+       ret = arch_atomic64_sub_return_relaxed(i, v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic64_fetch_add arch_atomic64_fetch_add
+#else
+#error "Unable to define raw_atomic64_sub_return"
 #endif
+}
 
-#endif /* arch_atomic64_fetch_add_relaxed */
-
-#ifndef arch_atomic64_sub_return_relaxed
-#define arch_atomic64_sub_return_acquire arch_atomic64_sub_return
-#define arch_atomic64_sub_return_release arch_atomic64_sub_return
-#define arch_atomic64_sub_return_relaxed arch_atomic64_sub_return
-#else /* arch_atomic64_sub_return_relaxed */
-
-#ifndef arch_atomic64_sub_return_acquire
+/**
+ * raw_atomic64_sub_return_acquire() - atomic subtract with acquire ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_sub_return_acquire() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
-arch_atomic64_sub_return_acquire(s64 i, atomic64_t *v)
+raw_atomic64_sub_return_acquire(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_sub_return_acquire)
+       return arch_atomic64_sub_return_acquire(i, v);
+#elif defined(arch_atomic64_sub_return_relaxed)
        s64 ret = arch_atomic64_sub_return_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic64_sub_return_acquire arch_atomic64_sub_return_acquire
+#elif defined(arch_atomic64_sub_return)
+       return arch_atomic64_sub_return(i, v);
+#else
+#error "Unable to define raw_atomic64_sub_return_acquire"
 #endif
+}
 
-#ifndef arch_atomic64_sub_return_release
+/**
+ * raw_atomic64_sub_return_release() - atomic subtract with release ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_sub_return_release() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
-arch_atomic64_sub_return_release(s64 i, atomic64_t *v)
+raw_atomic64_sub_return_release(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_sub_return_release)
+       return arch_atomic64_sub_return_release(i, v);
+#elif defined(arch_atomic64_sub_return_relaxed)
        __atomic_release_fence();
        return arch_atomic64_sub_return_relaxed(i, v);
+#elif defined(arch_atomic64_sub_return)
+       return arch_atomic64_sub_return(i, v);
+#else
+#error "Unable to define raw_atomic64_sub_return_release"
+#endif
 }
-#define arch_atomic64_sub_return_release arch_atomic64_sub_return_release
+
+/**
+ * raw_atomic64_sub_return_relaxed() - atomic subtract with relaxed ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_sub_return_relaxed() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
+static __always_inline s64
+raw_atomic64_sub_return_relaxed(s64 i, atomic64_t *v)
+{
+#if defined(arch_atomic64_sub_return_relaxed)
+       return arch_atomic64_sub_return_relaxed(i, v);
+#elif defined(arch_atomic64_sub_return)
+       return arch_atomic64_sub_return(i, v);
+#else
+#error "Unable to define raw_atomic64_sub_return_relaxed"
 #endif
+}
 
-#ifndef arch_atomic64_sub_return
+/**
+ * raw_atomic64_fetch_sub() - atomic subtract with full ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_sub() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_sub_return(s64 i, atomic64_t *v)
+raw_atomic64_fetch_sub(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_sub)
+       return arch_atomic64_fetch_sub(i, v);
+#elif defined(arch_atomic64_fetch_sub_relaxed)
        s64 ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic64_sub_return_relaxed(i, v);
+       ret = arch_atomic64_fetch_sub_relaxed(i, v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic64_sub_return arch_atomic64_sub_return
+#else
+#error "Unable to define raw_atomic64_fetch_sub"
 #endif
+}
 
-#endif /* arch_atomic64_sub_return_relaxed */
-
-#ifndef arch_atomic64_fetch_sub_relaxed
-#define arch_atomic64_fetch_sub_acquire arch_atomic64_fetch_sub
-#define arch_atomic64_fetch_sub_release arch_atomic64_fetch_sub
-#define arch_atomic64_fetch_sub_relaxed arch_atomic64_fetch_sub
-#else /* arch_atomic64_fetch_sub_relaxed */
-
-#ifndef arch_atomic64_fetch_sub_acquire
+/**
+ * raw_atomic64_fetch_sub_acquire() - atomic subtract with acquire ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_sub_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_sub_acquire(s64 i, atomic64_t *v)
+raw_atomic64_fetch_sub_acquire(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_sub_acquire)
+       return arch_atomic64_fetch_sub_acquire(i, v);
+#elif defined(arch_atomic64_fetch_sub_relaxed)
        s64 ret = arch_atomic64_fetch_sub_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic64_fetch_sub_acquire arch_atomic64_fetch_sub_acquire
+#elif defined(arch_atomic64_fetch_sub)
+       return arch_atomic64_fetch_sub(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_sub_acquire"
 #endif
+}
 
-#ifndef arch_atomic64_fetch_sub_release
+/**
+ * raw_atomic64_fetch_sub_release() - atomic subtract with release ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_sub_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_sub_release(s64 i, atomic64_t *v)
+raw_atomic64_fetch_sub_release(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_sub_release)
+       return arch_atomic64_fetch_sub_release(i, v);
+#elif defined(arch_atomic64_fetch_sub_relaxed)
        __atomic_release_fence();
        return arch_atomic64_fetch_sub_relaxed(i, v);
-}
-#define arch_atomic64_fetch_sub_release arch_atomic64_fetch_sub_release
+#elif defined(arch_atomic64_fetch_sub)
+       return arch_atomic64_fetch_sub(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_sub_release"
 #endif
+}
 
-#ifndef arch_atomic64_fetch_sub
+/**
+ * raw_atomic64_fetch_sub_relaxed() - atomic subtract with relaxed ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_sub_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_sub(s64 i, atomic64_t *v)
+raw_atomic64_fetch_sub_relaxed(s64 i, atomic64_t *v)
 {
-       s64 ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic64_fetch_sub_relaxed(i, v);
-       __atomic_post_full_fence();
-       return ret;
-}
-#define arch_atomic64_fetch_sub arch_atomic64_fetch_sub
+#if defined(arch_atomic64_fetch_sub_relaxed)
+       return arch_atomic64_fetch_sub_relaxed(i, v);
+#elif defined(arch_atomic64_fetch_sub)
+       return arch_atomic64_fetch_sub(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_sub_relaxed"
 #endif
-
-#endif /* arch_atomic64_fetch_sub_relaxed */
-
-#ifndef arch_atomic64_inc
-static __always_inline void
-arch_atomic64_inc(atomic64_t *v)
-{
-       arch_atomic64_add(1, v);
 }
-#define arch_atomic64_inc arch_atomic64_inc
-#endif
 
-#ifndef arch_atomic64_inc_return_relaxed
-#ifdef arch_atomic64_inc_return
-#define arch_atomic64_inc_return_acquire arch_atomic64_inc_return
-#define arch_atomic64_inc_return_release arch_atomic64_inc_return
-#define arch_atomic64_inc_return_relaxed arch_atomic64_inc_return
-#endif /* arch_atomic64_inc_return */
-
-#ifndef arch_atomic64_inc_return
-static __always_inline s64
-arch_atomic64_inc_return(atomic64_t *v)
+/**
+ * raw_atomic64_inc() - atomic increment with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_inc() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic64_inc(atomic64_t *v)
 {
-       return arch_atomic64_add_return(1, v);
-}
-#define arch_atomic64_inc_return arch_atomic64_inc_return
+#if defined(arch_atomic64_inc)
+       arch_atomic64_inc(v);
+#else
+       raw_atomic64_add(1, v);
 #endif
-
-#ifndef arch_atomic64_inc_return_acquire
-static __always_inline s64
-arch_atomic64_inc_return_acquire(atomic64_t *v)
-{
-       return arch_atomic64_add_return_acquire(1, v);
 }
-#define arch_atomic64_inc_return_acquire arch_atomic64_inc_return_acquire
-#endif
 
-#ifndef arch_atomic64_inc_return_release
+/**
+ * raw_atomic64_inc_return() - atomic increment with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_inc_return() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
-arch_atomic64_inc_return_release(atomic64_t *v)
+raw_atomic64_inc_return(atomic64_t *v)
 {
-       return arch_atomic64_add_return_release(1, v);
-}
-#define arch_atomic64_inc_return_release arch_atomic64_inc_return_release
+#if defined(arch_atomic64_inc_return)
+       return arch_atomic64_inc_return(v);
+#elif defined(arch_atomic64_inc_return_relaxed)
+       s64 ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic64_inc_return_relaxed(v);
+       __atomic_post_full_fence();
+       return ret;
+#else
+       return raw_atomic64_add_return(1, v);
 #endif
-
-#ifndef arch_atomic64_inc_return_relaxed
-static __always_inline s64
-arch_atomic64_inc_return_relaxed(atomic64_t *v)
-{
-       return arch_atomic64_add_return_relaxed(1, v);
 }
-#define arch_atomic64_inc_return_relaxed arch_atomic64_inc_return_relaxed
-#endif
 
-#else /* arch_atomic64_inc_return_relaxed */
-
-#ifndef arch_atomic64_inc_return_acquire
+/**
+ * raw_atomic64_inc_return_acquire() - atomic increment with acquire ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_inc_return_acquire() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
-arch_atomic64_inc_return_acquire(atomic64_t *v)
+raw_atomic64_inc_return_acquire(atomic64_t *v)
 {
+#if defined(arch_atomic64_inc_return_acquire)
+       return arch_atomic64_inc_return_acquire(v);
+#elif defined(arch_atomic64_inc_return_relaxed)
        s64 ret = arch_atomic64_inc_return_relaxed(v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic64_inc_return_acquire arch_atomic64_inc_return_acquire
+#elif defined(arch_atomic64_inc_return)
+       return arch_atomic64_inc_return(v);
+#else
+       return raw_atomic64_add_return_acquire(1, v);
 #endif
+}
 
-#ifndef arch_atomic64_inc_return_release
+/**
+ * raw_atomic64_inc_return_release() - atomic increment with release ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_inc_return_release() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
-arch_atomic64_inc_return_release(atomic64_t *v)
+raw_atomic64_inc_return_release(atomic64_t *v)
 {
+#if defined(arch_atomic64_inc_return_release)
+       return arch_atomic64_inc_return_release(v);
+#elif defined(arch_atomic64_inc_return_relaxed)
        __atomic_release_fence();
        return arch_atomic64_inc_return_relaxed(v);
-}
-#define arch_atomic64_inc_return_release arch_atomic64_inc_return_release
+#elif defined(arch_atomic64_inc_return)
+       return arch_atomic64_inc_return(v);
+#else
+       return raw_atomic64_add_return_release(1, v);
 #endif
-
-#ifndef arch_atomic64_inc_return
-static __always_inline s64
-arch_atomic64_inc_return(atomic64_t *v)
-{
-       s64 ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic64_inc_return_relaxed(v);
-       __atomic_post_full_fence();
-       return ret;
 }
-#define arch_atomic64_inc_return arch_atomic64_inc_return
-#endif
-
-#endif /* arch_atomic64_inc_return_relaxed */
 
-#ifndef arch_atomic64_fetch_inc_relaxed
-#ifdef arch_atomic64_fetch_inc
-#define arch_atomic64_fetch_inc_acquire arch_atomic64_fetch_inc
-#define arch_atomic64_fetch_inc_release arch_atomic64_fetch_inc
-#define arch_atomic64_fetch_inc_relaxed arch_atomic64_fetch_inc
-#endif /* arch_atomic64_fetch_inc */
-
-#ifndef arch_atomic64_fetch_inc
+/**
+ * raw_atomic64_inc_return_relaxed() - atomic increment with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_inc_return_relaxed() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_inc(atomic64_t *v)
+raw_atomic64_inc_return_relaxed(atomic64_t *v)
 {
-       return arch_atomic64_fetch_add(1, v);
-}
-#define arch_atomic64_fetch_inc arch_atomic64_fetch_inc
+#if defined(arch_atomic64_inc_return_relaxed)
+       return arch_atomic64_inc_return_relaxed(v);
+#elif defined(arch_atomic64_inc_return)
+       return arch_atomic64_inc_return(v);
+#else
+       return raw_atomic64_add_return_relaxed(1, v);
 #endif
-
-#ifndef arch_atomic64_fetch_inc_acquire
-static __always_inline s64
-arch_atomic64_fetch_inc_acquire(atomic64_t *v)
-{
-       return arch_atomic64_fetch_add_acquire(1, v);
 }
-#define arch_atomic64_fetch_inc_acquire arch_atomic64_fetch_inc_acquire
-#endif
 
-#ifndef arch_atomic64_fetch_inc_release
+/**
+ * raw_atomic64_fetch_inc() - atomic increment with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_inc() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_inc_release(atomic64_t *v)
+raw_atomic64_fetch_inc(atomic64_t *v)
 {
-       return arch_atomic64_fetch_add_release(1, v);
-}
-#define arch_atomic64_fetch_inc_release arch_atomic64_fetch_inc_release
+#if defined(arch_atomic64_fetch_inc)
+       return arch_atomic64_fetch_inc(v);
+#elif defined(arch_atomic64_fetch_inc_relaxed)
+       s64 ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic64_fetch_inc_relaxed(v);
+       __atomic_post_full_fence();
+       return ret;
+#else
+       return raw_atomic64_fetch_add(1, v);
 #endif
-
-#ifndef arch_atomic64_fetch_inc_relaxed
-static __always_inline s64
-arch_atomic64_fetch_inc_relaxed(atomic64_t *v)
-{
-       return arch_atomic64_fetch_add_relaxed(1, v);
 }
-#define arch_atomic64_fetch_inc_relaxed arch_atomic64_fetch_inc_relaxed
-#endif
-
-#else /* arch_atomic64_fetch_inc_relaxed */
 
-#ifndef arch_atomic64_fetch_inc_acquire
+/**
+ * raw_atomic64_fetch_inc_acquire() - atomic increment with acquire ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_inc_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_inc_acquire(atomic64_t *v)
+raw_atomic64_fetch_inc_acquire(atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_inc_acquire)
+       return arch_atomic64_fetch_inc_acquire(v);
+#elif defined(arch_atomic64_fetch_inc_relaxed)
        s64 ret = arch_atomic64_fetch_inc_relaxed(v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic64_fetch_inc_acquire arch_atomic64_fetch_inc_acquire
+#elif defined(arch_atomic64_fetch_inc)
+       return arch_atomic64_fetch_inc(v);
+#else
+       return raw_atomic64_fetch_add_acquire(1, v);
 #endif
+}
 
-#ifndef arch_atomic64_fetch_inc_release
+/**
+ * raw_atomic64_fetch_inc_release() - atomic increment with release ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_inc_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_inc_release(atomic64_t *v)
+raw_atomic64_fetch_inc_release(atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_inc_release)
+       return arch_atomic64_fetch_inc_release(v);
+#elif defined(arch_atomic64_fetch_inc_relaxed)
        __atomic_release_fence();
        return arch_atomic64_fetch_inc_relaxed(v);
-}
-#define arch_atomic64_fetch_inc_release arch_atomic64_fetch_inc_release
-#endif
-
-#ifndef arch_atomic64_fetch_inc
-static __always_inline s64
-arch_atomic64_fetch_inc(atomic64_t *v)
-{
-       s64 ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic64_fetch_inc_relaxed(v);
-       __atomic_post_full_fence();
-       return ret;
-}
-#define arch_atomic64_fetch_inc arch_atomic64_fetch_inc
+#elif defined(arch_atomic64_fetch_inc)
+       return arch_atomic64_fetch_inc(v);
+#else
+       return raw_atomic64_fetch_add_release(1, v);
 #endif
-
-#endif /* arch_atomic64_fetch_inc_relaxed */
-
-#ifndef arch_atomic64_dec
-static __always_inline void
-arch_atomic64_dec(atomic64_t *v)
-{
-       arch_atomic64_sub(1, v);
 }
-#define arch_atomic64_dec arch_atomic64_dec
-#endif
 
-#ifndef arch_atomic64_dec_return_relaxed
-#ifdef arch_atomic64_dec_return
-#define arch_atomic64_dec_return_acquire arch_atomic64_dec_return
-#define arch_atomic64_dec_return_release arch_atomic64_dec_return
-#define arch_atomic64_dec_return_relaxed arch_atomic64_dec_return
-#endif /* arch_atomic64_dec_return */
-
-#ifndef arch_atomic64_dec_return
+/**
+ * raw_atomic64_fetch_inc_relaxed() - atomic increment with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_inc_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_dec_return(atomic64_t *v)
+raw_atomic64_fetch_inc_relaxed(atomic64_t *v)
 {
-       return arch_atomic64_sub_return(1, v);
-}
-#define arch_atomic64_dec_return arch_atomic64_dec_return
+#if defined(arch_atomic64_fetch_inc_relaxed)
+       return arch_atomic64_fetch_inc_relaxed(v);
+#elif defined(arch_atomic64_fetch_inc)
+       return arch_atomic64_fetch_inc(v);
+#else
+       return raw_atomic64_fetch_add_relaxed(1, v);
 #endif
+}
 
-#ifndef arch_atomic64_dec_return_acquire
-static __always_inline s64
-arch_atomic64_dec_return_acquire(atomic64_t *v)
+/**
+ * raw_atomic64_dec() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_dec() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic64_dec(atomic64_t *v)
 {
-       return arch_atomic64_sub_return_acquire(1, v);
-}
-#define arch_atomic64_dec_return_acquire arch_atomic64_dec_return_acquire
+#if defined(arch_atomic64_dec)
+       arch_atomic64_dec(v);
+#else
+       raw_atomic64_sub(1, v);
 #endif
-
-#ifndef arch_atomic64_dec_return_release
-static __always_inline s64
-arch_atomic64_dec_return_release(atomic64_t *v)
-{
-       return arch_atomic64_sub_return_release(1, v);
 }
-#define arch_atomic64_dec_return_release arch_atomic64_dec_return_release
-#endif
 
-#ifndef arch_atomic64_dec_return_relaxed
+/**
+ * raw_atomic64_dec_return() - atomic decrement with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_dec_return() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
-arch_atomic64_dec_return_relaxed(atomic64_t *v)
+raw_atomic64_dec_return(atomic64_t *v)
 {
-       return arch_atomic64_sub_return_relaxed(1, v);
-}
-#define arch_atomic64_dec_return_relaxed arch_atomic64_dec_return_relaxed
+#if defined(arch_atomic64_dec_return)
+       return arch_atomic64_dec_return(v);
+#elif defined(arch_atomic64_dec_return_relaxed)
+       s64 ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic64_dec_return_relaxed(v);
+       __atomic_post_full_fence();
+       return ret;
+#else
+       return raw_atomic64_sub_return(1, v);
 #endif
+}
 
-#else /* arch_atomic64_dec_return_relaxed */
-
-#ifndef arch_atomic64_dec_return_acquire
+/**
+ * raw_atomic64_dec_return_acquire() - atomic decrement with acquire ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_dec_return_acquire() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
-arch_atomic64_dec_return_acquire(atomic64_t *v)
+raw_atomic64_dec_return_acquire(atomic64_t *v)
 {
+#if defined(arch_atomic64_dec_return_acquire)
+       return arch_atomic64_dec_return_acquire(v);
+#elif defined(arch_atomic64_dec_return_relaxed)
        s64 ret = arch_atomic64_dec_return_relaxed(v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic64_dec_return_acquire arch_atomic64_dec_return_acquire
+#elif defined(arch_atomic64_dec_return)
+       return arch_atomic64_dec_return(v);
+#else
+       return raw_atomic64_sub_return_acquire(1, v);
 #endif
+}
 
-#ifndef arch_atomic64_dec_return_release
+/**
+ * raw_atomic64_dec_return_release() - atomic decrement with release ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_dec_return_release() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
-arch_atomic64_dec_return_release(atomic64_t *v)
+raw_atomic64_dec_return_release(atomic64_t *v)
 {
+#if defined(arch_atomic64_dec_return_release)
+       return arch_atomic64_dec_return_release(v);
+#elif defined(arch_atomic64_dec_return_relaxed)
        __atomic_release_fence();
        return arch_atomic64_dec_return_relaxed(v);
+#elif defined(arch_atomic64_dec_return)
+       return arch_atomic64_dec_return(v);
+#else
+       return raw_atomic64_sub_return_release(1, v);
+#endif
 }
-#define arch_atomic64_dec_return_release arch_atomic64_dec_return_release
+
+/**
+ * raw_atomic64_dec_return_relaxed() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_dec_return_relaxed() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
+static __always_inline s64
+raw_atomic64_dec_return_relaxed(atomic64_t *v)
+{
+#if defined(arch_atomic64_dec_return_relaxed)
+       return arch_atomic64_dec_return_relaxed(v);
+#elif defined(arch_atomic64_dec_return)
+       return arch_atomic64_dec_return(v);
+#else
+       return raw_atomic64_sub_return_relaxed(1, v);
 #endif
+}
 
-#ifndef arch_atomic64_dec_return
+/**
+ * raw_atomic64_fetch_dec() - atomic decrement with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_dec() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_dec_return(atomic64_t *v)
+raw_atomic64_fetch_dec(atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_dec)
+       return arch_atomic64_fetch_dec(v);
+#elif defined(arch_atomic64_fetch_dec_relaxed)
        s64 ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic64_dec_return_relaxed(v);
+       ret = arch_atomic64_fetch_dec_relaxed(v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic64_dec_return arch_atomic64_dec_return
+#else
+       return raw_atomic64_fetch_sub(1, v);
 #endif
-
-#endif /* arch_atomic64_dec_return_relaxed */
-
-#ifndef arch_atomic64_fetch_dec_relaxed
-#ifdef arch_atomic64_fetch_dec
-#define arch_atomic64_fetch_dec_acquire arch_atomic64_fetch_dec
-#define arch_atomic64_fetch_dec_release arch_atomic64_fetch_dec
-#define arch_atomic64_fetch_dec_relaxed arch_atomic64_fetch_dec
-#endif /* arch_atomic64_fetch_dec */
-
-#ifndef arch_atomic64_fetch_dec
-static __always_inline s64
-arch_atomic64_fetch_dec(atomic64_t *v)
-{
-       return arch_atomic64_fetch_sub(1, v);
 }
-#define arch_atomic64_fetch_dec arch_atomic64_fetch_dec
-#endif
 
-#ifndef arch_atomic64_fetch_dec_acquire
+/**
+ * raw_atomic64_fetch_dec_acquire() - atomic decrement with acquire ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_dec_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_dec_acquire(atomic64_t *v)
+raw_atomic64_fetch_dec_acquire(atomic64_t *v)
 {
-       return arch_atomic64_fetch_sub_acquire(1, v);
-}
-#define arch_atomic64_fetch_dec_acquire arch_atomic64_fetch_dec_acquire
+#if defined(arch_atomic64_fetch_dec_acquire)
+       return arch_atomic64_fetch_dec_acquire(v);
+#elif defined(arch_atomic64_fetch_dec_relaxed)
+       s64 ret = arch_atomic64_fetch_dec_relaxed(v);
+       __atomic_acquire_fence();
+       return ret;
+#elif defined(arch_atomic64_fetch_dec)
+       return arch_atomic64_fetch_dec(v);
+#else
+       return raw_atomic64_fetch_sub_acquire(1, v);
 #endif
-
-#ifndef arch_atomic64_fetch_dec_release
-static __always_inline s64
-arch_atomic64_fetch_dec_release(atomic64_t *v)
-{
-       return arch_atomic64_fetch_sub_release(1, v);
 }
-#define arch_atomic64_fetch_dec_release arch_atomic64_fetch_dec_release
-#endif
 
-#ifndef arch_atomic64_fetch_dec_relaxed
+/**
+ * raw_atomic64_fetch_dec_release() - atomic decrement with release ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_dec_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_dec_relaxed(atomic64_t *v)
+raw_atomic64_fetch_dec_release(atomic64_t *v)
 {
-       return arch_atomic64_fetch_sub_relaxed(1, v);
-}
-#define arch_atomic64_fetch_dec_relaxed arch_atomic64_fetch_dec_relaxed
+#if defined(arch_atomic64_fetch_dec_release)
+       return arch_atomic64_fetch_dec_release(v);
+#elif defined(arch_atomic64_fetch_dec_relaxed)
+       __atomic_release_fence();
+       return arch_atomic64_fetch_dec_relaxed(v);
+#elif defined(arch_atomic64_fetch_dec)
+       return arch_atomic64_fetch_dec(v);
+#else
+       return raw_atomic64_fetch_sub_release(1, v);
 #endif
+}
 
-#else /* arch_atomic64_fetch_dec_relaxed */
-
-#ifndef arch_atomic64_fetch_dec_acquire
+/**
+ * raw_atomic64_fetch_dec_relaxed() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_dec_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_dec_acquire(atomic64_t *v)
+raw_atomic64_fetch_dec_relaxed(atomic64_t *v)
 {
-       s64 ret = arch_atomic64_fetch_dec_relaxed(v);
-       __atomic_acquire_fence();
-       return ret;
-}
-#define arch_atomic64_fetch_dec_acquire arch_atomic64_fetch_dec_acquire
+#if defined(arch_atomic64_fetch_dec_relaxed)
+       return arch_atomic64_fetch_dec_relaxed(v);
+#elif defined(arch_atomic64_fetch_dec)
+       return arch_atomic64_fetch_dec(v);
+#else
+       return raw_atomic64_fetch_sub_relaxed(1, v);
 #endif
+}
 
-#ifndef arch_atomic64_fetch_dec_release
-static __always_inline s64
-arch_atomic64_fetch_dec_release(atomic64_t *v)
+/**
+ * raw_atomic64_and() - atomic bitwise AND with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_and() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic64_and(s64 i, atomic64_t *v)
 {
-       __atomic_release_fence();
-       return arch_atomic64_fetch_dec_relaxed(v);
+       arch_atomic64_and(i, v);
 }
-#define arch_atomic64_fetch_dec_release arch_atomic64_fetch_dec_release
-#endif
 
-#ifndef arch_atomic64_fetch_dec
+/**
+ * raw_atomic64_fetch_and() - atomic bitwise AND with full ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_and() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_dec(atomic64_t *v)
+raw_atomic64_fetch_and(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_and)
+       return arch_atomic64_fetch_and(i, v);
+#elif defined(arch_atomic64_fetch_and_relaxed)
        s64 ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic64_fetch_dec_relaxed(v);
+       ret = arch_atomic64_fetch_and_relaxed(i, v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic64_fetch_dec arch_atomic64_fetch_dec
+#else
+#error "Unable to define raw_atomic64_fetch_and"
 #endif
+}
 
-#endif /* arch_atomic64_fetch_dec_relaxed */
-
-#ifndef arch_atomic64_fetch_and_relaxed
-#define arch_atomic64_fetch_and_acquire arch_atomic64_fetch_and
-#define arch_atomic64_fetch_and_release arch_atomic64_fetch_and
-#define arch_atomic64_fetch_and_relaxed arch_atomic64_fetch_and
-#else /* arch_atomic64_fetch_and_relaxed */
-
-#ifndef arch_atomic64_fetch_and_acquire
+/**
+ * raw_atomic64_fetch_and_acquire() - atomic bitwise AND with acquire ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_and_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_and_acquire(s64 i, atomic64_t *v)
+raw_atomic64_fetch_and_acquire(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_and_acquire)
+       return arch_atomic64_fetch_and_acquire(i, v);
+#elif defined(arch_atomic64_fetch_and_relaxed)
        s64 ret = arch_atomic64_fetch_and_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic64_fetch_and_acquire arch_atomic64_fetch_and_acquire
+#elif defined(arch_atomic64_fetch_and)
+       return arch_atomic64_fetch_and(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_and_acquire"
 #endif
+}
 
-#ifndef arch_atomic64_fetch_and_release
+/**
+ * raw_atomic64_fetch_and_release() - atomic bitwise AND with release ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_and_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_and_release(s64 i, atomic64_t *v)
+raw_atomic64_fetch_and_release(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_and_release)
+       return arch_atomic64_fetch_and_release(i, v);
+#elif defined(arch_atomic64_fetch_and_relaxed)
        __atomic_release_fence();
        return arch_atomic64_fetch_and_relaxed(i, v);
-}
-#define arch_atomic64_fetch_and_release arch_atomic64_fetch_and_release
+#elif defined(arch_atomic64_fetch_and)
+       return arch_atomic64_fetch_and(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_and_release"
 #endif
+}
 
-#ifndef arch_atomic64_fetch_and
+/**
+ * raw_atomic64_fetch_and_relaxed() - atomic bitwise AND with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_and_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_and(s64 i, atomic64_t *v)
+raw_atomic64_fetch_and_relaxed(s64 i, atomic64_t *v)
 {
-       s64 ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic64_fetch_and_relaxed(i, v);
-       __atomic_post_full_fence();
-       return ret;
-}
-#define arch_atomic64_fetch_and arch_atomic64_fetch_and
+#if defined(arch_atomic64_fetch_and_relaxed)
+       return arch_atomic64_fetch_and_relaxed(i, v);
+#elif defined(arch_atomic64_fetch_and)
+       return arch_atomic64_fetch_and(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_and_relaxed"
 #endif
-
-#endif /* arch_atomic64_fetch_and_relaxed */
-
-#ifndef arch_atomic64_andnot
-static __always_inline void
-arch_atomic64_andnot(s64 i, atomic64_t *v)
-{
-       arch_atomic64_and(~i, v);
 }
-#define arch_atomic64_andnot arch_atomic64_andnot
-#endif
-
-#ifndef arch_atomic64_fetch_andnot_relaxed
-#ifdef arch_atomic64_fetch_andnot
-#define arch_atomic64_fetch_andnot_acquire arch_atomic64_fetch_andnot
-#define arch_atomic64_fetch_andnot_release arch_atomic64_fetch_andnot
-#define arch_atomic64_fetch_andnot_relaxed arch_atomic64_fetch_andnot
-#endif /* arch_atomic64_fetch_andnot */
 
-#ifndef arch_atomic64_fetch_andnot
-static __always_inline s64
-arch_atomic64_fetch_andnot(s64 i, atomic64_t *v)
+/**
+ * raw_atomic64_andnot() - atomic bitwise AND NOT with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & ~@i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_andnot() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic64_andnot(s64 i, atomic64_t *v)
 {
-       return arch_atomic64_fetch_and(~i, v);
-}
-#define arch_atomic64_fetch_andnot arch_atomic64_fetch_andnot
+#if defined(arch_atomic64_andnot)
+       arch_atomic64_andnot(i, v);
+#else
+       raw_atomic64_and(~i, v);
 #endif
-
-#ifndef arch_atomic64_fetch_andnot_acquire
-static __always_inline s64
-arch_atomic64_fetch_andnot_acquire(s64 i, atomic64_t *v)
-{
-       return arch_atomic64_fetch_and_acquire(~i, v);
 }
-#define arch_atomic64_fetch_andnot_acquire arch_atomic64_fetch_andnot_acquire
-#endif
 
-#ifndef arch_atomic64_fetch_andnot_release
+/**
+ * raw_atomic64_fetch_andnot() - atomic bitwise AND NOT with full ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & ~@i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_andnot() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_andnot_release(s64 i, atomic64_t *v)
+raw_atomic64_fetch_andnot(s64 i, atomic64_t *v)
 {
-       return arch_atomic64_fetch_and_release(~i, v);
-}
-#define arch_atomic64_fetch_andnot_release arch_atomic64_fetch_andnot_release
+#if defined(arch_atomic64_fetch_andnot)
+       return arch_atomic64_fetch_andnot(i, v);
+#elif defined(arch_atomic64_fetch_andnot_relaxed)
+       s64 ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic64_fetch_andnot_relaxed(i, v);
+       __atomic_post_full_fence();
+       return ret;
+#else
+       return raw_atomic64_fetch_and(~i, v);
 #endif
-
-#ifndef arch_atomic64_fetch_andnot_relaxed
-static __always_inline s64
-arch_atomic64_fetch_andnot_relaxed(s64 i, atomic64_t *v)
-{
-       return arch_atomic64_fetch_and_relaxed(~i, v);
 }
-#define arch_atomic64_fetch_andnot_relaxed arch_atomic64_fetch_andnot_relaxed
-#endif
-
-#else /* arch_atomic64_fetch_andnot_relaxed */
 
-#ifndef arch_atomic64_fetch_andnot_acquire
+/**
+ * raw_atomic64_fetch_andnot_acquire() - atomic bitwise AND NOT with acquire ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & ~@i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_andnot_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_andnot_acquire(s64 i, atomic64_t *v)
+raw_atomic64_fetch_andnot_acquire(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_andnot_acquire)
+       return arch_atomic64_fetch_andnot_acquire(i, v);
+#elif defined(arch_atomic64_fetch_andnot_relaxed)
        s64 ret = arch_atomic64_fetch_andnot_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic64_fetch_andnot_acquire arch_atomic64_fetch_andnot_acquire
+#elif defined(arch_atomic64_fetch_andnot)
+       return arch_atomic64_fetch_andnot(i, v);
+#else
+       return raw_atomic64_fetch_and_acquire(~i, v);
 #endif
+}
 
-#ifndef arch_atomic64_fetch_andnot_release
+/**
+ * raw_atomic64_fetch_andnot_release() - atomic bitwise AND NOT with release ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & ~@i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_andnot_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_andnot_release(s64 i, atomic64_t *v)
+raw_atomic64_fetch_andnot_release(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_andnot_release)
+       return arch_atomic64_fetch_andnot_release(i, v);
+#elif defined(arch_atomic64_fetch_andnot_relaxed)
        __atomic_release_fence();
        return arch_atomic64_fetch_andnot_relaxed(i, v);
+#elif defined(arch_atomic64_fetch_andnot)
+       return arch_atomic64_fetch_andnot(i, v);
+#else
+       return raw_atomic64_fetch_and_release(~i, v);
+#endif
 }
-#define arch_atomic64_fetch_andnot_release arch_atomic64_fetch_andnot_release
+
+/**
+ * raw_atomic64_fetch_andnot_relaxed() - atomic bitwise AND NOT with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & ~@i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_andnot_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
+static __always_inline s64
+raw_atomic64_fetch_andnot_relaxed(s64 i, atomic64_t *v)
+{
+#if defined(arch_atomic64_fetch_andnot_relaxed)
+       return arch_atomic64_fetch_andnot_relaxed(i, v);
+#elif defined(arch_atomic64_fetch_andnot)
+       return arch_atomic64_fetch_andnot(i, v);
+#else
+       return raw_atomic64_fetch_and_relaxed(~i, v);
 #endif
+}
+
+/**
+ * raw_atomic64_or() - atomic bitwise OR with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v | @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_or() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic64_or(s64 i, atomic64_t *v)
+{
+       arch_atomic64_or(i, v);
+}
 
-#ifndef arch_atomic64_fetch_andnot
+/**
+ * raw_atomic64_fetch_or() - atomic bitwise OR with full ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v | @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_or() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_andnot(s64 i, atomic64_t *v)
+raw_atomic64_fetch_or(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_or)
+       return arch_atomic64_fetch_or(i, v);
+#elif defined(arch_atomic64_fetch_or_relaxed)
        s64 ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic64_fetch_andnot_relaxed(i, v);
+       ret = arch_atomic64_fetch_or_relaxed(i, v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic64_fetch_andnot arch_atomic64_fetch_andnot
+#else
+#error "Unable to define raw_atomic64_fetch_or"
 #endif
+}
 
-#endif /* arch_atomic64_fetch_andnot_relaxed */
-
-#ifndef arch_atomic64_fetch_or_relaxed
-#define arch_atomic64_fetch_or_acquire arch_atomic64_fetch_or
-#define arch_atomic64_fetch_or_release arch_atomic64_fetch_or
-#define arch_atomic64_fetch_or_relaxed arch_atomic64_fetch_or
-#else /* arch_atomic64_fetch_or_relaxed */
-
-#ifndef arch_atomic64_fetch_or_acquire
+/**
+ * raw_atomic64_fetch_or_acquire() - atomic bitwise OR with acquire ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v | @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_or_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_or_acquire(s64 i, atomic64_t *v)
+raw_atomic64_fetch_or_acquire(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_or_acquire)
+       return arch_atomic64_fetch_or_acquire(i, v);
+#elif defined(arch_atomic64_fetch_or_relaxed)
        s64 ret = arch_atomic64_fetch_or_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic64_fetch_or_acquire arch_atomic64_fetch_or_acquire
+#elif defined(arch_atomic64_fetch_or)
+       return arch_atomic64_fetch_or(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_or_acquire"
 #endif
+}
 
-#ifndef arch_atomic64_fetch_or_release
+/**
+ * raw_atomic64_fetch_or_release() - atomic bitwise OR with release ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v | @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_or_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_or_release(s64 i, atomic64_t *v)
+raw_atomic64_fetch_or_release(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_or_release)
+       return arch_atomic64_fetch_or_release(i, v);
+#elif defined(arch_atomic64_fetch_or_relaxed)
        __atomic_release_fence();
        return arch_atomic64_fetch_or_relaxed(i, v);
+#elif defined(arch_atomic64_fetch_or)
+       return arch_atomic64_fetch_or(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_or_release"
+#endif
 }
-#define arch_atomic64_fetch_or_release arch_atomic64_fetch_or_release
+
+/**
+ * raw_atomic64_fetch_or_relaxed() - atomic bitwise OR with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v | @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_or_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
+static __always_inline s64
+raw_atomic64_fetch_or_relaxed(s64 i, atomic64_t *v)
+{
+#if defined(arch_atomic64_fetch_or_relaxed)
+       return arch_atomic64_fetch_or_relaxed(i, v);
+#elif defined(arch_atomic64_fetch_or)
+       return arch_atomic64_fetch_or(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_or_relaxed"
 #endif
+}
+
+/**
+ * raw_atomic64_xor() - atomic bitwise XOR with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v ^ @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_xor() elsewhere.
+ *
+ * Return: Nothing.
+ */
+static __always_inline void
+raw_atomic64_xor(s64 i, atomic64_t *v)
+{
+       arch_atomic64_xor(i, v);
+}
 
-#ifndef arch_atomic64_fetch_or
+/**
+ * raw_atomic64_fetch_xor() - atomic bitwise XOR with full ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v ^ @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_xor() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_or(s64 i, atomic64_t *v)
+raw_atomic64_fetch_xor(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_xor)
+       return arch_atomic64_fetch_xor(i, v);
+#elif defined(arch_atomic64_fetch_xor_relaxed)
        s64 ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic64_fetch_or_relaxed(i, v);
+       ret = arch_atomic64_fetch_xor_relaxed(i, v);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic64_fetch_or arch_atomic64_fetch_or
-#endif
-
-#endif /* arch_atomic64_fetch_or_relaxed */
-
-#ifndef arch_atomic64_fetch_xor_relaxed
-#define arch_atomic64_fetch_xor_acquire arch_atomic64_fetch_xor
-#define arch_atomic64_fetch_xor_release arch_atomic64_fetch_xor
-#define arch_atomic64_fetch_xor_relaxed arch_atomic64_fetch_xor
-#else /* arch_atomic64_fetch_xor_relaxed */
+#else
+#error "Unable to define raw_atomic64_fetch_xor"
+#endif
+}
 
-#ifndef arch_atomic64_fetch_xor_acquire
+/**
+ * raw_atomic64_fetch_xor_acquire() - atomic bitwise XOR with acquire ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v ^ @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_xor_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_xor_acquire(s64 i, atomic64_t *v)
+raw_atomic64_fetch_xor_acquire(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_xor_acquire)
+       return arch_atomic64_fetch_xor_acquire(i, v);
+#elif defined(arch_atomic64_fetch_xor_relaxed)
        s64 ret = arch_atomic64_fetch_xor_relaxed(i, v);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic64_fetch_xor_acquire arch_atomic64_fetch_xor_acquire
+#elif defined(arch_atomic64_fetch_xor)
+       return arch_atomic64_fetch_xor(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_xor_acquire"
 #endif
+}
 
-#ifndef arch_atomic64_fetch_xor_release
+/**
+ * raw_atomic64_fetch_xor_release() - atomic bitwise XOR with release ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v ^ @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_xor_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_xor_release(s64 i, atomic64_t *v)
+raw_atomic64_fetch_xor_release(s64 i, atomic64_t *v)
 {
+#if defined(arch_atomic64_fetch_xor_release)
+       return arch_atomic64_fetch_xor_release(i, v);
+#elif defined(arch_atomic64_fetch_xor_relaxed)
        __atomic_release_fence();
        return arch_atomic64_fetch_xor_relaxed(i, v);
+#elif defined(arch_atomic64_fetch_xor)
+       return arch_atomic64_fetch_xor(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_xor_release"
+#endif
 }
-#define arch_atomic64_fetch_xor_release arch_atomic64_fetch_xor_release
+
+/**
+ * raw_atomic64_fetch_xor_relaxed() - atomic bitwise XOR with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v ^ @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_fetch_xor_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
+static __always_inline s64
+raw_atomic64_fetch_xor_relaxed(s64 i, atomic64_t *v)
+{
+#if defined(arch_atomic64_fetch_xor_relaxed)
+       return arch_atomic64_fetch_xor_relaxed(i, v);
+#elif defined(arch_atomic64_fetch_xor)
+       return arch_atomic64_fetch_xor(i, v);
+#else
+#error "Unable to define raw_atomic64_fetch_xor_relaxed"
 #endif
+}
 
-#ifndef arch_atomic64_fetch_xor
+/**
+ * raw_atomic64_xchg() - atomic exchange with full ordering
+ * @v: pointer to atomic64_t
+ * @new: s64 value to assign
+ *
+ * Atomically updates @v to @new with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_xchg() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_fetch_xor(s64 i, atomic64_t *v)
+raw_atomic64_xchg(atomic64_t *v, s64 new)
 {
+#if defined(arch_atomic64_xchg)
+       return arch_atomic64_xchg(v, new);
+#elif defined(arch_atomic64_xchg_relaxed)
        s64 ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic64_fetch_xor_relaxed(i, v);
+       ret = arch_atomic64_xchg_relaxed(v, new);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic64_fetch_xor arch_atomic64_fetch_xor
+#else
+       return raw_xchg(&v->counter, new);
 #endif
+}
 
-#endif /* arch_atomic64_fetch_xor_relaxed */
-
-#ifndef arch_atomic64_xchg_relaxed
-#define arch_atomic64_xchg_acquire arch_atomic64_xchg
-#define arch_atomic64_xchg_release arch_atomic64_xchg
-#define arch_atomic64_xchg_relaxed arch_atomic64_xchg
-#else /* arch_atomic64_xchg_relaxed */
-
-#ifndef arch_atomic64_xchg_acquire
+/**
+ * raw_atomic64_xchg_acquire() - atomic exchange with acquire ordering
+ * @v: pointer to atomic64_t
+ * @new: s64 value to assign
+ *
+ * Atomically updates @v to @new with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_xchg_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_xchg_acquire(atomic64_t *v, s64 i)
+raw_atomic64_xchg_acquire(atomic64_t *v, s64 new)
 {
-       s64 ret = arch_atomic64_xchg_relaxed(v, i);
+#if defined(arch_atomic64_xchg_acquire)
+       return arch_atomic64_xchg_acquire(v, new);
+#elif defined(arch_atomic64_xchg_relaxed)
+       s64 ret = arch_atomic64_xchg_relaxed(v, new);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic64_xchg_acquire arch_atomic64_xchg_acquire
+#elif defined(arch_atomic64_xchg)
+       return arch_atomic64_xchg(v, new);
+#else
+       return raw_xchg_acquire(&v->counter, new);
 #endif
+}
 
-#ifndef arch_atomic64_xchg_release
+/**
+ * raw_atomic64_xchg_release() - atomic exchange with release ordering
+ * @v: pointer to atomic64_t
+ * @new: s64 value to assign
+ *
+ * Atomically updates @v to @new with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_xchg_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_xchg_release(atomic64_t *v, s64 i)
+raw_atomic64_xchg_release(atomic64_t *v, s64 new)
 {
+#if defined(arch_atomic64_xchg_release)
+       return arch_atomic64_xchg_release(v, new);
+#elif defined(arch_atomic64_xchg_relaxed)
        __atomic_release_fence();
-       return arch_atomic64_xchg_relaxed(v, i);
+       return arch_atomic64_xchg_relaxed(v, new);
+#elif defined(arch_atomic64_xchg)
+       return arch_atomic64_xchg(v, new);
+#else
+       return raw_xchg_release(&v->counter, new);
+#endif
 }
-#define arch_atomic64_xchg_release arch_atomic64_xchg_release
+
+/**
+ * raw_atomic64_xchg_relaxed() - atomic exchange with relaxed ordering
+ * @v: pointer to atomic64_t
+ * @new: s64 value to assign
+ *
+ * Atomically updates @v to @new with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_xchg_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
+static __always_inline s64
+raw_atomic64_xchg_relaxed(atomic64_t *v, s64 new)
+{
+#if defined(arch_atomic64_xchg_relaxed)
+       return arch_atomic64_xchg_relaxed(v, new);
+#elif defined(arch_atomic64_xchg)
+       return arch_atomic64_xchg(v, new);
+#else
+       return raw_xchg_relaxed(&v->counter, new);
 #endif
+}
 
-#ifndef arch_atomic64_xchg
+/**
+ * raw_atomic64_cmpxchg() - atomic compare and exchange with full ordering
+ * @v: pointer to atomic64_t
+ * @old: s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_cmpxchg() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_xchg(atomic64_t *v, s64 i)
+raw_atomic64_cmpxchg(atomic64_t *v, s64 old, s64 new)
 {
+#if defined(arch_atomic64_cmpxchg)
+       return arch_atomic64_cmpxchg(v, old, new);
+#elif defined(arch_atomic64_cmpxchg_relaxed)
        s64 ret;
        __atomic_pre_full_fence();
-       ret = arch_atomic64_xchg_relaxed(v, i);
+       ret = arch_atomic64_cmpxchg_relaxed(v, old, new);
        __atomic_post_full_fence();
        return ret;
-}
-#define arch_atomic64_xchg arch_atomic64_xchg
+#else
+       return raw_cmpxchg(&v->counter, old, new);
 #endif
+}
 
-#endif /* arch_atomic64_xchg_relaxed */
-
-#ifndef arch_atomic64_cmpxchg_relaxed
-#define arch_atomic64_cmpxchg_acquire arch_atomic64_cmpxchg
-#define arch_atomic64_cmpxchg_release arch_atomic64_cmpxchg
-#define arch_atomic64_cmpxchg_relaxed arch_atomic64_cmpxchg
-#else /* arch_atomic64_cmpxchg_relaxed */
-
-#ifndef arch_atomic64_cmpxchg_acquire
+/**
+ * raw_atomic64_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
+ * @v: pointer to atomic64_t
+ * @old: s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_cmpxchg_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_cmpxchg_acquire(atomic64_t *v, s64 old, s64 new)
+raw_atomic64_cmpxchg_acquire(atomic64_t *v, s64 old, s64 new)
 {
+#if defined(arch_atomic64_cmpxchg_acquire)
+       return arch_atomic64_cmpxchg_acquire(v, old, new);
+#elif defined(arch_atomic64_cmpxchg_relaxed)
        s64 ret = arch_atomic64_cmpxchg_relaxed(v, old, new);
        __atomic_acquire_fence();
        return ret;
-}
-#define arch_atomic64_cmpxchg_acquire arch_atomic64_cmpxchg_acquire
+#elif defined(arch_atomic64_cmpxchg)
+       return arch_atomic64_cmpxchg(v, old, new);
+#else
+       return raw_cmpxchg_acquire(&v->counter, old, new);
 #endif
+}
 
-#ifndef arch_atomic64_cmpxchg_release
+/**
+ * raw_atomic64_cmpxchg_release() - atomic compare and exchange with release ordering
+ * @v: pointer to atomic64_t
+ * @old: s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_cmpxchg_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_cmpxchg_release(atomic64_t *v, s64 old, s64 new)
+raw_atomic64_cmpxchg_release(atomic64_t *v, s64 old, s64 new)
 {
+#if defined(arch_atomic64_cmpxchg_release)
+       return arch_atomic64_cmpxchg_release(v, old, new);
+#elif defined(arch_atomic64_cmpxchg_relaxed)
        __atomic_release_fence();
        return arch_atomic64_cmpxchg_relaxed(v, old, new);
-}
-#define arch_atomic64_cmpxchg_release arch_atomic64_cmpxchg_release
+#elif defined(arch_atomic64_cmpxchg)
+       return arch_atomic64_cmpxchg(v, old, new);
+#else
+       return raw_cmpxchg_release(&v->counter, old, new);
 #endif
+}
 
-#ifndef arch_atomic64_cmpxchg
+/**
+ * raw_atomic64_cmpxchg_relaxed() - atomic compare and exchange with relaxed ordering
+ * @v: pointer to atomic64_t
+ * @old: s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_cmpxchg_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-arch_atomic64_cmpxchg(atomic64_t *v, s64 old, s64 new)
+raw_atomic64_cmpxchg_relaxed(atomic64_t *v, s64 old, s64 new)
 {
-       s64 ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic64_cmpxchg_relaxed(v, old, new);
-       __atomic_post_full_fence();
-       return ret;
-}
-#define arch_atomic64_cmpxchg arch_atomic64_cmpxchg
+#if defined(arch_atomic64_cmpxchg_relaxed)
+       return arch_atomic64_cmpxchg_relaxed(v, old, new);
+#elif defined(arch_atomic64_cmpxchg)
+       return arch_atomic64_cmpxchg(v, old, new);
+#else
+       return raw_cmpxchg_relaxed(&v->counter, old, new);
 #endif
+}
 
-#endif /* arch_atomic64_cmpxchg_relaxed */
-
-#ifndef arch_atomic64_try_cmpxchg_relaxed
-#ifdef arch_atomic64_try_cmpxchg
-#define arch_atomic64_try_cmpxchg_acquire arch_atomic64_try_cmpxchg
-#define arch_atomic64_try_cmpxchg_release arch_atomic64_try_cmpxchg
-#define arch_atomic64_try_cmpxchg_relaxed arch_atomic64_try_cmpxchg
-#endif /* arch_atomic64_try_cmpxchg */
-
-#ifndef arch_atomic64_try_cmpxchg
+/**
+ * raw_atomic64_try_cmpxchg() - atomic compare and exchange with full ordering
+ * @v: pointer to atomic64_t
+ * @old: pointer to s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with full ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Safe to use in noinstr code; prefer atomic64_try_cmpxchg() elsewhere.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new)
+raw_atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new)
 {
+#if defined(arch_atomic64_try_cmpxchg)
+       return arch_atomic64_try_cmpxchg(v, old, new);
+#elif defined(arch_atomic64_try_cmpxchg_relaxed)
+       bool ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic64_try_cmpxchg_relaxed(v, old, new);
+       __atomic_post_full_fence();
+       return ret;
+#else
        s64 r, o = *old;
-       r = arch_atomic64_cmpxchg(v, o, new);
+       r = raw_atomic64_cmpxchg(v, o, new);
        if (unlikely(r != o))
                *old = r;
        return likely(r == o);
-}
-#define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg
 #endif
+}
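
   [Not part of this patch: a minimal sketch of the retry loop the try_cmpxchg()
    form is meant for, using the raw_ API added above; add_clamped() and its
    parameters are hypothetical names for illustration only.]

	static inline bool add_clamped(atomic64_t *counter, s64 delta, s64 limit)
	{
		s64 old = raw_atomic64_read(counter);

		do {
			if (old + delta > limit)
				return false;	/* would exceed the limit */
			/* On failure, @old is refreshed to the current value. */
		} while (!raw_atomic64_try_cmpxchg(counter, &old, old + delta));

		return true;
	}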
 
-#ifndef arch_atomic64_try_cmpxchg_acquire
+/**
+ * raw_atomic64_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
+ * @v: pointer to atomic64_t
+ * @old: pointer to s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with acquire ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Safe to use in noinstr code; prefer atomic64_try_cmpxchg_acquire() elsewhere.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic64_try_cmpxchg_acquire(atomic64_t *v, s64 *old, s64 new)
+raw_atomic64_try_cmpxchg_acquire(atomic64_t *v, s64 *old, s64 new)
 {
+#if defined(arch_atomic64_try_cmpxchg_acquire)
+       return arch_atomic64_try_cmpxchg_acquire(v, old, new);
+#elif defined(arch_atomic64_try_cmpxchg_relaxed)
+       bool ret = arch_atomic64_try_cmpxchg_relaxed(v, old, new);
+       __atomic_acquire_fence();
+       return ret;
+#elif defined(arch_atomic64_try_cmpxchg)
+       return arch_atomic64_try_cmpxchg(v, old, new);
+#else
        s64 r, o = *old;
-       r = arch_atomic64_cmpxchg_acquire(v, o, new);
+       r = raw_atomic64_cmpxchg_acquire(v, o, new);
        if (unlikely(r != o))
                *old = r;
        return likely(r == o);
-}
-#define arch_atomic64_try_cmpxchg_acquire arch_atomic64_try_cmpxchg_acquire
 #endif
+}
 
-#ifndef arch_atomic64_try_cmpxchg_release
+/**
+ * raw_atomic64_try_cmpxchg_release() - atomic compare and exchange with release ordering
+ * @v: pointer to atomic64_t
+ * @old: pointer to s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with release ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Safe to use in noinstr code; prefer atomic64_try_cmpxchg_release() elsewhere.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic64_try_cmpxchg_release(atomic64_t *v, s64 *old, s64 new)
+raw_atomic64_try_cmpxchg_release(atomic64_t *v, s64 *old, s64 new)
 {
+#if defined(arch_atomic64_try_cmpxchg_release)
+       return arch_atomic64_try_cmpxchg_release(v, old, new);
+#elif defined(arch_atomic64_try_cmpxchg_relaxed)
+       __atomic_release_fence();
+       return arch_atomic64_try_cmpxchg_relaxed(v, old, new);
+#elif defined(arch_atomic64_try_cmpxchg)
+       return arch_atomic64_try_cmpxchg(v, old, new);
+#else
        s64 r, o = *old;
-       r = arch_atomic64_cmpxchg_release(v, o, new);
+       r = raw_atomic64_cmpxchg_release(v, o, new);
        if (unlikely(r != o))
                *old = r;
        return likely(r == o);
-}
-#define arch_atomic64_try_cmpxchg_release arch_atomic64_try_cmpxchg_release
 #endif
+}
 
-#ifndef arch_atomic64_try_cmpxchg_relaxed
+/**
+ * raw_atomic64_try_cmpxchg_relaxed() - atomic compare and exchange with relaxed ordering
+ * @v: pointer to atomic64_t
+ * @old: pointer to s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with relaxed ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Safe to use in noinstr code; prefer atomic64_try_cmpxchg_relaxed() elsewhere.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic64_try_cmpxchg_relaxed(atomic64_t *v, s64 *old, s64 new)
+raw_atomic64_try_cmpxchg_relaxed(atomic64_t *v, s64 *old, s64 new)
 {
+#if defined(arch_atomic64_try_cmpxchg_relaxed)
+       return arch_atomic64_try_cmpxchg_relaxed(v, old, new);
+#elif defined(arch_atomic64_try_cmpxchg)
+       return arch_atomic64_try_cmpxchg(v, old, new);
+#else
        s64 r, o = *old;
-       r = arch_atomic64_cmpxchg_relaxed(v, o, new);
+       r = raw_atomic64_cmpxchg_relaxed(v, o, new);
        if (unlikely(r != o))
                *old = r;
        return likely(r == o);
-}
-#define arch_atomic64_try_cmpxchg_relaxed arch_atomic64_try_cmpxchg_relaxed
-#endif
-
-#else /* arch_atomic64_try_cmpxchg_relaxed */
-
-#ifndef arch_atomic64_try_cmpxchg_acquire
-static __always_inline bool
-arch_atomic64_try_cmpxchg_acquire(atomic64_t *v, s64 *old, s64 new)
-{
-       bool ret = arch_atomic64_try_cmpxchg_relaxed(v, old, new);
-       __atomic_acquire_fence();
-       return ret;
-}
-#define arch_atomic64_try_cmpxchg_acquire arch_atomic64_try_cmpxchg_acquire
-#endif
-
-#ifndef arch_atomic64_try_cmpxchg_release
-static __always_inline bool
-arch_atomic64_try_cmpxchg_release(atomic64_t *v, s64 *old, s64 new)
-{
-       __atomic_release_fence();
-       return arch_atomic64_try_cmpxchg_relaxed(v, old, new);
-}
-#define arch_atomic64_try_cmpxchg_release arch_atomic64_try_cmpxchg_release
 #endif
-
-#ifndef arch_atomic64_try_cmpxchg
-static __always_inline bool
-arch_atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new)
-{
-       bool ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic64_try_cmpxchg_relaxed(v, old, new);
-       __atomic_post_full_fence();
-       return ret;
 }
-#define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg
-#endif
-
-#endif /* arch_atomic64_try_cmpxchg_relaxed */
 
-#ifndef arch_atomic64_sub_and_test
 /**
- * arch_atomic64_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @v: pointer of type atomic64_t
+ * raw_atomic64_sub_and_test() - atomic subtract and test if zero with full ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
  *
- * Atomically subtracts @i from @v and returns
- * true if the result is zero, or false for all
- * other cases.
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_sub_and_test() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
  */
 static __always_inline bool
-arch_atomic64_sub_and_test(s64 i, atomic64_t *v)
+raw_atomic64_sub_and_test(s64 i, atomic64_t *v)
 {
-       return arch_atomic64_sub_return(i, v) == 0;
-}
-#define arch_atomic64_sub_and_test arch_atomic64_sub_and_test
+#if defined(arch_atomic64_sub_and_test)
+       return arch_atomic64_sub_and_test(i, v);
+#else
+       return raw_atomic64_sub_return(i, v) == 0;
 #endif
+}
 
-#ifndef arch_atomic64_dec_and_test
 /**
- * arch_atomic64_dec_and_test - decrement and test
- * @v: pointer of type atomic64_t
+ * raw_atomic64_dec_and_test() - atomic decrement and test if zero with full ordering
+ * @v: pointer to atomic64_t
  *
- * Atomically decrements @v by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_dec_and_test() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
  */
 static __always_inline bool
-arch_atomic64_dec_and_test(atomic64_t *v)
+raw_atomic64_dec_and_test(atomic64_t *v)
 {
-       return arch_atomic64_dec_return(v) == 0;
-}
-#define arch_atomic64_dec_and_test arch_atomic64_dec_and_test
+#if defined(arch_atomic64_dec_and_test)
+       return arch_atomic64_dec_and_test(v);
+#else
+       return raw_atomic64_dec_return(v) == 0;
 #endif
+}
 
-#ifndef arch_atomic64_inc_and_test
 /**
- * arch_atomic64_inc_and_test - increment and test
- * @v: pointer of type atomic64_t
+ * raw_atomic64_inc_and_test() - atomic increment and test if zero with full ordering
+ * @v: pointer to atomic64_t
  *
- * Atomically increments @v by 1
- * and returns true if the result is zero, or false for all
- * other cases.
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_inc_and_test() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
  */
 static __always_inline bool
-arch_atomic64_inc_and_test(atomic64_t *v)
+raw_atomic64_inc_and_test(atomic64_t *v)
 {
-       return arch_atomic64_inc_return(v) == 0;
-}
-#define arch_atomic64_inc_and_test arch_atomic64_inc_and_test
+#if defined(arch_atomic64_inc_and_test)
+       return arch_atomic64_inc_and_test(v);
+#else
+       return raw_atomic64_inc_return(v) == 0;
 #endif
+}
 
-#ifndef arch_atomic64_add_negative_relaxed
-#ifdef arch_atomic64_add_negative
-#define arch_atomic64_add_negative_acquire arch_atomic64_add_negative
-#define arch_atomic64_add_negative_release arch_atomic64_add_negative
-#define arch_atomic64_add_negative_relaxed arch_atomic64_add_negative
-#endif /* arch_atomic64_add_negative */
-
-#ifndef arch_atomic64_add_negative
 /**
- * arch_atomic64_add_negative - Add and test if negative
- * @i: integer value to add
- * @v: pointer of type atomic64_t
+ * raw_atomic64_add_negative() - atomic add and test if negative with full ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
  *
- * Atomically adds @i to @v and returns true if the result is negative,
- * or false when the result is greater than or equal to zero.
+ * Safe to use in noinstr code; prefer atomic64_add_negative() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
  */
 static __always_inline bool
-arch_atomic64_add_negative(s64 i, atomic64_t *v)
+raw_atomic64_add_negative(s64 i, atomic64_t *v)
 {
-       return arch_atomic64_add_return(i, v) < 0;
-}
-#define arch_atomic64_add_negative arch_atomic64_add_negative
+#if defined(arch_atomic64_add_negative)
+       return arch_atomic64_add_negative(i, v);
+#elif defined(arch_atomic64_add_negative_relaxed)
+       bool ret;
+       __atomic_pre_full_fence();
+       ret = arch_atomic64_add_negative_relaxed(i, v);
+       __atomic_post_full_fence();
+       return ret;
+#else
+       return raw_atomic64_add_return(i, v) < 0;
 #endif
+}
 
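A short usage sketch for the add_negative() family, illustrative only (the underflow helper and its names are hypothetical): the boolean result reports the sign of the updated counter in the same fully ordered operation as the add.

#include <linux/atomic.h>
#include <linux/bug.h>

/* Illustrative only: flag counter underflow via add_negative(). */
static inline void example_put_many(atomic64_t *count, s64 n)
{
	/*
	 * Adding -n and testing the sign detects the counter dropping
	 * below zero without a separate read of the result.
	 */
	WARN_ON_ONCE(raw_atomic64_add_negative(-n, count));
}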
-#ifndef arch_atomic64_add_negative_acquire
 /**
- * arch_atomic64_add_negative_acquire - Add and test if negative
- * @i: integer value to add
- * @v: pointer of type atomic64_t
+ * raw_atomic64_add_negative_acquire() - atomic add and test if negative with acquire ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_add_negative_acquire() elsewhere.
  *
- * Atomically adds @i to @v and returns true if the result is negative,
- * or false when the result is greater than or equal to zero.
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
  */
 static __always_inline bool
-arch_atomic64_add_negative_acquire(s64 i, atomic64_t *v)
+raw_atomic64_add_negative_acquire(s64 i, atomic64_t *v)
 {
-       return arch_atomic64_add_return_acquire(i, v) < 0;
-}
-#define arch_atomic64_add_negative_acquire arch_atomic64_add_negative_acquire
+#if defined(arch_atomic64_add_negative_acquire)
+       return arch_atomic64_add_negative_acquire(i, v);
+#elif defined(arch_atomic64_add_negative_relaxed)
+       bool ret = arch_atomic64_add_negative_relaxed(i, v);
+       __atomic_acquire_fence();
+       return ret;
+#elif defined(arch_atomic64_add_negative)
+       return arch_atomic64_add_negative(i, v);
+#else
+       return raw_atomic64_add_return_acquire(i, v) < 0;
 #endif
+}
 
-#ifndef arch_atomic64_add_negative_release
 /**
- * arch_atomic64_add_negative_release - Add and test if negative
- * @i: integer value to add
- * @v: pointer of type atomic64_t
+ * raw_atomic64_add_negative_release() - atomic add and test if negative with release ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
  *
- * Atomically adds @i to @v and returns true if the result is negative,
- * or false when the result is greater than or equal to zero.
+ * Safe to use in noinstr code; prefer atomic64_add_negative_release() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
  */
 static __always_inline bool
-arch_atomic64_add_negative_release(s64 i, atomic64_t *v)
+raw_atomic64_add_negative_release(s64 i, atomic64_t *v)
 {
-       return arch_atomic64_add_return_release(i, v) < 0;
-}
-#define arch_atomic64_add_negative_release arch_atomic64_add_negative_release
+#if defined(arch_atomic64_add_negative_release)
+       return arch_atomic64_add_negative_release(i, v);
+#elif defined(arch_atomic64_add_negative_relaxed)
+       __atomic_release_fence();
+       return arch_atomic64_add_negative_relaxed(i, v);
+#elif defined(arch_atomic64_add_negative)
+       return arch_atomic64_add_negative(i, v);
+#else
+       return raw_atomic64_add_return_release(i, v) < 0;
 #endif
+}
 
-#ifndef arch_atomic64_add_negative_relaxed
 /**
- * arch_atomic64_add_negative_relaxed - Add and test if negative
- * @i: integer value to add
- * @v: pointer of type atomic64_t
+ * raw_atomic64_add_negative_relaxed() - atomic add and test if negative with relaxed ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
  *
- * Atomically adds @i to @v and returns true if the result is negative,
- * or false when the result is greater than or equal to zero.
+ * Safe to use in noinstr code; prefer atomic64_add_negative_relaxed() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
  */
 static __always_inline bool
-arch_atomic64_add_negative_relaxed(s64 i, atomic64_t *v)
-{
-       return arch_atomic64_add_return_relaxed(i, v) < 0;
-}
-#define arch_atomic64_add_negative_relaxed arch_atomic64_add_negative_relaxed
-#endif
-
-#else /* arch_atomic64_add_negative_relaxed */
-
-#ifndef arch_atomic64_add_negative_acquire
-static __always_inline bool
-arch_atomic64_add_negative_acquire(s64 i, atomic64_t *v)
-{
-       bool ret = arch_atomic64_add_negative_relaxed(i, v);
-       __atomic_acquire_fence();
-       return ret;
-}
-#define arch_atomic64_add_negative_acquire arch_atomic64_add_negative_acquire
-#endif
-
-#ifndef arch_atomic64_add_negative_release
-static __always_inline bool
-arch_atomic64_add_negative_release(s64 i, atomic64_t *v)
+raw_atomic64_add_negative_relaxed(s64 i, atomic64_t *v)
 {
-       __atomic_release_fence();
+#if defined(arch_atomic64_add_negative_relaxed)
        return arch_atomic64_add_negative_relaxed(i, v);
-}
-#define arch_atomic64_add_negative_release arch_atomic64_add_negative_release
+#elif defined(arch_atomic64_add_negative)
+       return arch_atomic64_add_negative(i, v);
+#else
+       return raw_atomic64_add_return_relaxed(i, v) < 0;
 #endif
-
-#ifndef arch_atomic64_add_negative
-static __always_inline bool
-arch_atomic64_add_negative(s64 i, atomic64_t *v)
-{
-       bool ret;
-       __atomic_pre_full_fence();
-       ret = arch_atomic64_add_negative_relaxed(i, v);
-       __atomic_post_full_fence();
-       return ret;
 }
-#define arch_atomic64_add_negative arch_atomic64_add_negative
-#endif
 
-#endif /* arch_atomic64_add_negative_relaxed */
-
-#ifndef arch_atomic64_fetch_add_unless
 /**
- * arch_atomic64_fetch_add_unless - add unless the number is already a given value
- * @v: pointer of type atomic64_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
+ * raw_atomic64_fetch_add_unless() - atomic add unless value with full ordering
+ * @v: pointer to atomic64_t
+ * @a: s64 value to add
+ * @u: s64 value to compare with
+ *
+ * If (@v != @u), atomically updates @v to (@v + @a) with full ordering.
  *
- * Atomically adds @a to @v, so long as @v was not already @u.
- * Returns original value of @v
+ * Safe to use in noinstr code; prefer atomic64_fetch_add_unless() elsewhere.
+ *
+ * Return: The original value of @v.
  */
 static __always_inline s64
-arch_atomic64_fetch_add_unless(atomic64_t *v, s64 a, s64 u)
+raw_atomic64_fetch_add_unless(atomic64_t *v, s64 a, s64 u)
 {
-       s64 c = arch_atomic64_read(v);
+#if defined(arch_atomic64_fetch_add_unless)
+       return arch_atomic64_fetch_add_unless(v, a, u);
+#else
+       s64 c = raw_atomic64_read(v);
 
        do {
                if (unlikely(c == u))
                        break;
-       } while (!arch_atomic64_try_cmpxchg(v, &c, c + a));
+       } while (!raw_atomic64_try_cmpxchg(v, &c, c + a));
 
        return c;
-}
-#define arch_atomic64_fetch_add_unless arch_atomic64_fetch_add_unless
 #endif
+}
 
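Because fetch_add_unless() returns the original value, callers can tell "added" apart from "hit the sentinel" without a second read. A hypothetical sketch, not part of this change (EXAMPLE_FROZEN and the helper name are invented for illustration):

#include <linux/atomic.h>
#include <linux/types.h>

/* Hypothetical sentinel marking a counter that must no longer change. */
#define EXAMPLE_FROZEN	(-1LL)

/* Illustrative only: charge @n units unless the counter is frozen. */
static inline bool example_charge(atomic64_t *usage, s64 n)
{
	/* The returned original value says whether the add happened. */
	return raw_atomic64_fetch_add_unless(usage, n, EXAMPLE_FROZEN) != EXAMPLE_FROZEN;
}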
-#ifndef arch_atomic64_add_unless
 /**
- * arch_atomic64_add_unless - add unless the number is already a given value
- * @v: pointer of type atomic64_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
+ * raw_atomic64_add_unless() - atomic add unless value with full ordering
+ * @v: pointer to atomic64_t
+ * @a: s64 value to add
+ * @u: s64 value to compare with
+ *
+ * If (@v != @u), atomically updates @v to (@v + @a) with full ordering.
  *
- * Atomically adds @a to @v, if @v was not already @u.
- * Returns true if the addition was done.
+ * Safe to use in noinstr code; prefer atomic64_add_unless() elsewhere.
+ *
+ * Return: @true if @v was updated, @false otherwise.
  */
 static __always_inline bool
-arch_atomic64_add_unless(atomic64_t *v, s64 a, s64 u)
+raw_atomic64_add_unless(atomic64_t *v, s64 a, s64 u)
 {
-       return arch_atomic64_fetch_add_unless(v, a, u) != u;
-}
-#define arch_atomic64_add_unless arch_atomic64_add_unless
+#if defined(arch_atomic64_add_unless)
+       return arch_atomic64_add_unless(v, a, u);
+#else
+       return raw_atomic64_fetch_add_unless(v, a, u) != u;
 #endif
+}
 
-#ifndef arch_atomic64_inc_not_zero
 /**
- * arch_atomic64_inc_not_zero - increment unless the number is zero
- * @v: pointer of type atomic64_t
+ * raw_atomic64_inc_not_zero() - atomic increment unless zero with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * If (@v != 0), atomically updates @v to (@v + 1) with full ordering.
  *
- * Atomically increments @v by 1, if @v is non-zero.
- * Returns true if the increment was done.
+ * Safe to use in noinstr code; prefer atomic64_inc_not_zero() elsewhere.
+ *
+ * Return: @true if @v was updated, @false otherwise.
  */
 static __always_inline bool
-arch_atomic64_inc_not_zero(atomic64_t *v)
+raw_atomic64_inc_not_zero(atomic64_t *v)
 {
-       return arch_atomic64_add_unless(v, 1, 0);
-}
-#define arch_atomic64_inc_not_zero arch_atomic64_inc_not_zero
+#if defined(arch_atomic64_inc_not_zero)
+       return arch_atomic64_inc_not_zero(v);
+#else
+       return raw_atomic64_add_unless(v, 1, 0);
 #endif
+}
 
-#ifndef arch_atomic64_inc_unless_negative
+/**
+ * raw_atomic64_inc_unless_negative() - atomic increment unless negative with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * If (@v >= 0), atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_inc_unless_negative() elsewhere.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic64_inc_unless_negative(atomic64_t *v)
+raw_atomic64_inc_unless_negative(atomic64_t *v)
 {
-       s64 c = arch_atomic64_read(v);
+#if defined(arch_atomic64_inc_unless_negative)
+       return arch_atomic64_inc_unless_negative(v);
+#else
+       s64 c = raw_atomic64_read(v);
 
        do {
                if (unlikely(c < 0))
                        return false;
-       } while (!arch_atomic64_try_cmpxchg(v, &c, c + 1));
+       } while (!raw_atomic64_try_cmpxchg(v, &c, c + 1));
 
        return true;
-}
-#define arch_atomic64_inc_unless_negative arch_atomic64_inc_unless_negative
 #endif
+}
 
-#ifndef arch_atomic64_dec_unless_positive
+/**
+ * raw_atomic64_dec_unless_positive() - atomic decrement unless positive with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * If (@v <= 0), atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_dec_unless_positive() elsewhere.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic64_dec_unless_positive(atomic64_t *v)
+raw_atomic64_dec_unless_positive(atomic64_t *v)
 {
-       s64 c = arch_atomic64_read(v);
+#if defined(arch_atomic64_dec_unless_positive)
+       return arch_atomic64_dec_unless_positive(v);
+#else
+       s64 c = raw_atomic64_read(v);
 
        do {
                if (unlikely(c > 0))
                        return false;
-       } while (!arch_atomic64_try_cmpxchg(v, &c, c - 1));
+       } while (!raw_atomic64_try_cmpxchg(v, &c, c - 1));
 
        return true;
-}
-#define arch_atomic64_dec_unless_positive arch_atomic64_dec_unless_positive
 #endif
+}
 
-#ifndef arch_atomic64_dec_if_positive
+/**
+ * raw_atomic64_dec_if_positive() - atomic decrement if positive with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * If (@v > 0), atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic64_dec_if_positive() elsewhere.
+ *
+ * Return: The old value of (@v - 1), regardless of whether @v was updated.
+ */
 static __always_inline s64
-arch_atomic64_dec_if_positive(atomic64_t *v)
+raw_atomic64_dec_if_positive(atomic64_t *v)
 {
-       s64 dec, c = arch_atomic64_read(v);
+#if defined(arch_atomic64_dec_if_positive)
+       return arch_atomic64_dec_if_positive(v);
+#else
+       s64 dec, c = raw_atomic64_read(v);
 
        do {
                dec = c - 1;
                if (unlikely(dec < 0))
                        break;
-       } while (!arch_atomic64_try_cmpxchg(v, &c, dec));
+       } while (!raw_atomic64_try_cmpxchg(v, &c, dec));
 
        return dec;
-}
-#define arch_atomic64_dec_if_positive arch_atomic64_dec_if_positive
 #endif
+}
 
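A usage sketch for dec_if_positive(), illustrative only (the slot-counter helper is hypothetical): a negative return means the counter was already at or below zero and was left untouched.

#include <linux/atomic.h>
#include <linux/types.h>

/* Illustrative only: consume one slot if any are left. */
static inline bool example_try_take_slot(atomic64_t *slots)
{
	/*
	 * dec_if_positive() returns the decremented value when the
	 * decrement was applied, and a negative number when @slots was
	 * already <= 0.
	 */
	return raw_atomic64_dec_if_positive(slots) >= 0;
}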
 #endif /* _LINUX_ATOMIC_FALLBACK_H */
-// ad2e2b4d168dbc60a73922616047a9bfa446af36
+// 202b45c7db600ce36198eb1f1fc2c2d5268ace2d
index 03a232a..d401b40 100644 (file)
@@ -4,15 +4,10 @@
 // DO NOT MODIFY THIS FILE DIRECTLY
 
 /*
- * This file provides wrappers with KASAN instrumentation for atomic operations.
- * To use this functionality an arch's atomic.h file needs to define all
- * atomic operations with arch_ prefix (e.g. arch_atomic_read()) and include
- * this file at the end. This file provides atomic_read() that forwards to
- * arch_atomic_read() for actual atomic operation.
- * Note: if an arch atomic operation is implemented by means of other atomic
- * operations (e.g. atomic_read()/atomic_cmpxchg() loop), then it needs to use
- * arch_ variants (i.e. arch_atomic_read()/arch_atomic_cmpxchg()) to avoid
- * double instrumentation.
+ * This file provides atomic operations with explicit instrumentation (e.g.
+ * KASAN, KCSAN), which should be used unless it is necessary to avoid
+ * instrumentation. Where it is necessary to avoid instrumentation, the
+ * raw_atomic*() operations should be used.
  */
 #ifndef _LINUX_ATOMIC_INSTRUMENTED_H
 #define _LINUX_ATOMIC_INSTRUMENTED_H
 #include <linux/compiler.h>
 #include <linux/instrumented.h>
 
+/**
+ * atomic_read() - atomic load with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically loads the value of @v with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_read() there.
+ *
+ * Return: The value loaded from @v.
+ */
 static __always_inline int
 atomic_read(const atomic_t *v)
 {
        instrument_atomic_read(v, sizeof(*v));
-       return arch_atomic_read(v);
-}
-
+       return raw_atomic_read(v);
+}
+
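The wrappers in this file follow one pattern: run the KASAN/KCSAN instrumentation hook(s), then forward to the corresponding raw_*() operation. Code marked noinstr must therefore call the raw_*() forms directly. A hedged sketch of the split (the counter and function names are hypothetical):

#include <linux/atomic.h>
#include <linux/compiler_types.h>	/* noinstr */

static atomic_t example_event_count = ATOMIC_INIT(0);

/* Ordinary kernel code uses the instrumented API so KASAN/KCSAN observe it. */
static inline int example_read_events(void)
{
	return atomic_read(&example_event_count);
}

/* noinstr code (e.g. early entry paths) must stay free of instrumentation. */
static noinstr void example_note_event(void)
{
	raw_atomic_inc(&example_event_count);
}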
+/**
+ * atomic_read_acquire() - atomic load with acquire ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically loads the value of @v with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_read_acquire() there.
+ *
+ * Return: The value loaded from @v.
+ */
 static __always_inline int
 atomic_read_acquire(const atomic_t *v)
 {
        instrument_atomic_read(v, sizeof(*v));
-       return arch_atomic_read_acquire(v);
-}
-
+       return raw_atomic_read_acquire(v);
+}
+
+/**
+ * atomic_set() - atomic set with relaxed ordering
+ * @v: pointer to atomic_t
+ * @i: int value to assign
+ *
+ * Atomically sets @v to @i with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_set() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_set(atomic_t *v, int i)
 {
        instrument_atomic_write(v, sizeof(*v));
-       arch_atomic_set(v, i);
-}
-
+       raw_atomic_set(v, i);
+}
+
+/**
+ * atomic_set_release() - atomic set with release ordering
+ * @v: pointer to atomic_t
+ * @i: int value to assign
+ *
+ * Atomically sets @v to @i with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_set_release() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_set_release(atomic_t *v, int i)
 {
        kcsan_release();
        instrument_atomic_write(v, sizeof(*v));
-       arch_atomic_set_release(v, i);
-}
-
+       raw_atomic_set_release(v, i);
+}
+
+/**
+ * atomic_add() - atomic add with relaxed ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_add() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_add(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_add(i, v);
+       raw_atomic_add(i, v);
 }
 
+/**
+ * atomic_add_return() - atomic add with full ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_add_return() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_add_return(int i, atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_add_return(i, v);
+       return raw_atomic_add_return(i, v);
 }
 
+/**
+ * atomic_add_return_acquire() - atomic add with acquire ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_add_return_acquire() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_add_return_acquire(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_add_return_acquire(i, v);
+       return raw_atomic_add_return_acquire(i, v);
 }
 
+/**
+ * atomic_add_return_release() - atomic add with release ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_add_return_release() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_add_return_release(int i, atomic_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_add_return_release(i, v);
+       return raw_atomic_add_return_release(i, v);
 }
 
+/**
+ * atomic_add_return_relaxed() - atomic add with relaxed ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_add_return_relaxed() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_add_return_relaxed(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_add_return_relaxed(i, v);
+       return raw_atomic_add_return_relaxed(i, v);
 }
 
+/**
+ * atomic_fetch_add() - atomic add with full ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_add() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_add(int i, atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_add(i, v);
+       return raw_atomic_fetch_add(i, v);
 }
 
+/**
+ * atomic_fetch_add_acquire() - atomic add with acquire ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_add_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_add_acquire(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_add_acquire(i, v);
+       return raw_atomic_fetch_add_acquire(i, v);
 }
 
+/**
+ * atomic_fetch_add_release() - atomic add with release ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_add_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_add_release(int i, atomic_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_add_release(i, v);
+       return raw_atomic_fetch_add_release(i, v);
 }
 
+/**
+ * atomic_fetch_add_relaxed() - atomic add with relaxed ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_add_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_add_relaxed(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_add_relaxed(i, v);
+       return raw_atomic_fetch_add_relaxed(i, v);
 }
 
+/**
+ * atomic_sub() - atomic subtract with relaxed ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_sub() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_sub(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_sub(i, v);
+       raw_atomic_sub(i, v);
 }
 
+/**
+ * atomic_sub_return() - atomic subtract with full ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_sub_return() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_sub_return(int i, atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_sub_return(i, v);
+       return raw_atomic_sub_return(i, v);
 }
 
+/**
+ * atomic_sub_return_acquire() - atomic subtract with acquire ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_sub_return_acquire() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_sub_return_acquire(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_sub_return_acquire(i, v);
+       return raw_atomic_sub_return_acquire(i, v);
 }
 
+/**
+ * atomic_sub_return_release() - atomic subtract with release ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_sub_return_release() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_sub_return_release(int i, atomic_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_sub_return_release(i, v);
+       return raw_atomic_sub_return_release(i, v);
 }
 
+/**
+ * atomic_sub_return_relaxed() - atomic subtract with relaxed ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_sub_return_relaxed() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_sub_return_relaxed(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_sub_return_relaxed(i, v);
+       return raw_atomic_sub_return_relaxed(i, v);
 }
 
+/**
+ * atomic_fetch_sub() - atomic subtract with full ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_sub() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_sub(int i, atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_sub(i, v);
+       return raw_atomic_fetch_sub(i, v);
 }
 
+/**
+ * atomic_fetch_sub_acquire() - atomic subtract with acquire ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_sub_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_sub_acquire(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_sub_acquire(i, v);
+       return raw_atomic_fetch_sub_acquire(i, v);
 }
 
+/**
+ * atomic_fetch_sub_release() - atomic subtract with release ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_sub_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_sub_release(int i, atomic_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_sub_release(i, v);
+       return raw_atomic_fetch_sub_release(i, v);
 }
 
+/**
+ * atomic_fetch_sub_relaxed() - atomic subtract with relaxed ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_sub_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_sub_relaxed(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_sub_relaxed(i, v);
+       return raw_atomic_fetch_sub_relaxed(i, v);
 }
 
+/**
+ * atomic_inc() - atomic increment with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_inc() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_inc(atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_inc(v);
+       raw_atomic_inc(v);
 }
 
+/**
+ * atomic_inc_return() - atomic increment with full ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_inc_return() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_inc_return(atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_inc_return(v);
+       return raw_atomic_inc_return(v);
 }
 
+/**
+ * atomic_inc_return_acquire() - atomic increment with acquire ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_inc_return_acquire() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_inc_return_acquire(atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_inc_return_acquire(v);
+       return raw_atomic_inc_return_acquire(v);
 }
 
+/**
+ * atomic_inc_return_release() - atomic increment with release ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_inc_return_release() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_inc_return_release(atomic_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_inc_return_release(v);
+       return raw_atomic_inc_return_release(v);
 }
 
+/**
+ * atomic_inc_return_relaxed() - atomic increment with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_inc_return_relaxed() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_inc_return_relaxed(atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_inc_return_relaxed(v);
+       return raw_atomic_inc_return_relaxed(v);
 }
 
+/**
+ * atomic_fetch_inc() - atomic increment with full ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_inc() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_inc(atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_inc(v);
+       return raw_atomic_fetch_inc(v);
 }
 
+/**
+ * atomic_fetch_inc_acquire() - atomic increment with acquire ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_inc_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_inc_acquire(atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_inc_acquire(v);
+       return raw_atomic_fetch_inc_acquire(v);
 }
 
+/**
+ * atomic_fetch_inc_release() - atomic increment with release ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_inc_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_inc_release(atomic_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_inc_release(v);
+       return raw_atomic_fetch_inc_release(v);
 }
 
+/**
+ * atomic_fetch_inc_relaxed() - atomic increment with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_inc_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_inc_relaxed(atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_inc_relaxed(v);
+       return raw_atomic_fetch_inc_relaxed(v);
 }
 
+/**
+ * atomic_dec() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_dec() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_dec(atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_dec(v);
+       raw_atomic_dec(v);
 }
 
+/**
+ * atomic_dec_return() - atomic decrement with full ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_dec_return() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_dec_return(atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_dec_return(v);
+       return raw_atomic_dec_return(v);
 }
 
+/**
+ * atomic_dec_return_acquire() - atomic decrement with acquire ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_dec_return_acquire() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_dec_return_acquire(atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_dec_return_acquire(v);
+       return raw_atomic_dec_return_acquire(v);
 }
 
+/**
+ * atomic_dec_return_release() - atomic decrement with release ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_dec_return_release() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_dec_return_release(atomic_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_dec_return_release(v);
+       return raw_atomic_dec_return_release(v);
 }
 
+/**
+ * atomic_dec_return_relaxed() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_dec_return_relaxed() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline int
 atomic_dec_return_relaxed(atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_dec_return_relaxed(v);
+       return raw_atomic_dec_return_relaxed(v);
 }
 
+/**
+ * atomic_fetch_dec() - atomic decrement with full ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_dec() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_dec(atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_dec(v);
+       return raw_atomic_fetch_dec(v);
 }
 
+/**
+ * atomic_fetch_dec_acquire() - atomic decrement with acquire ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_dec_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_dec_acquire(atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_dec_acquire(v);
+       return raw_atomic_fetch_dec_acquire(v);
 }
 
+/**
+ * atomic_fetch_dec_release() - atomic decrement with release ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_dec_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_dec_release(atomic_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_dec_release(v);
+       return raw_atomic_fetch_dec_release(v);
 }
 
+/**
+ * atomic_fetch_dec_relaxed() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_dec_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_dec_relaxed(atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_dec_relaxed(v);
+       return raw_atomic_fetch_dec_relaxed(v);
 }
 
+/**
+ * atomic_and() - atomic bitwise AND with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_and() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_and(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_and(i, v);
+       raw_atomic_and(i, v);
 }
 
+/**
+ * atomic_fetch_and() - atomic bitwise AND with full ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_and() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_and(int i, atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_and(i, v);
+       return raw_atomic_fetch_and(i, v);
 }
 
+/**
+ * atomic_fetch_and_acquire() - atomic bitwise AND with acquire ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_and_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_and_acquire(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_and_acquire(i, v);
+       return raw_atomic_fetch_and_acquire(i, v);
 }
 
+/**
+ * atomic_fetch_and_release() - atomic bitwise AND with release ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_and_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_and_release(int i, atomic_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_and_release(i, v);
+       return raw_atomic_fetch_and_release(i, v);
 }
 
+/**
+ * atomic_fetch_and_relaxed() - atomic bitwise AND with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_and_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_and_relaxed(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_and_relaxed(i, v);
+       return raw_atomic_fetch_and_relaxed(i, v);
 }
 
+/**
+ * atomic_andnot() - atomic bitwise AND NOT with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & ~@i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_andnot() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_andnot(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_andnot(i, v);
+       raw_atomic_andnot(i, v);
 }
 
+/**
+ * atomic_fetch_andnot() - atomic bitwise AND NOT with full ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & ~@i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_andnot() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_andnot(int i, atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_andnot(i, v);
+       return raw_atomic_fetch_andnot(i, v);
 }
 
+/**
+ * atomic_fetch_andnot_acquire() - atomic bitwise AND NOT with acquire ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & ~@i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_andnot_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_andnot_acquire(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_andnot_acquire(i, v);
+       return raw_atomic_fetch_andnot_acquire(i, v);
 }
 
+/**
+ * atomic_fetch_andnot_release() - atomic bitwise AND NOT with release ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & ~@i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_andnot_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_andnot_release(int i, atomic_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_andnot_release(i, v);
+       return raw_atomic_fetch_andnot_release(i, v);
 }
 
+/**
+ * atomic_fetch_andnot_relaxed() - atomic bitwise AND NOT with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v & ~@i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_andnot_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_andnot_relaxed(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_andnot_relaxed(i, v);
+       return raw_atomic_fetch_andnot_relaxed(i, v);
 }
 
+/**
+ * atomic_or() - atomic bitwise OR with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v | @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_or() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_or(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_or(i, v);
+       raw_atomic_or(i, v);
 }
 
+/**
+ * atomic_fetch_or() - atomic bitwise OR with full ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v | @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_or() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_or(int i, atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_or(i, v);
+       return raw_atomic_fetch_or(i, v);
 }
 
+/**
+ * atomic_fetch_or_acquire() - atomic bitwise OR with acquire ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v | @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_or_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_or_acquire(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_or_acquire(i, v);
+       return raw_atomic_fetch_or_acquire(i, v);
 }
 
+/**
+ * atomic_fetch_or_release() - atomic bitwise OR with release ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v | @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_or_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_or_release(int i, atomic_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_or_release(i, v);
+       return raw_atomic_fetch_or_release(i, v);
 }
 
+/**
+ * atomic_fetch_or_relaxed() - atomic bitwise OR with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v | @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_or_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_or_relaxed(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_or_relaxed(i, v);
+       return raw_atomic_fetch_or_relaxed(i, v);
 }
 
+/**
+ * atomic_xor() - atomic bitwise XOR with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v ^ @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_xor() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_xor(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_xor(i, v);
+       raw_atomic_xor(i, v);
 }
 
+/**
+ * atomic_fetch_xor() - atomic bitwise XOR with full ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v ^ @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_xor() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_xor(int i, atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_xor(i, v);
+       return raw_atomic_fetch_xor(i, v);
 }
 
+/**
+ * atomic_fetch_xor_acquire() - atomic bitwise XOR with acquire ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v ^ @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_xor_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_xor_acquire(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_xor_acquire(i, v);
+       return raw_atomic_fetch_xor_acquire(i, v);
 }
 
+/**
+ * atomic_fetch_xor_release() - atomic bitwise XOR with release ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v ^ @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_xor_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_xor_release(int i, atomic_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_xor_release(i, v);
+       return raw_atomic_fetch_xor_release(i, v);
 }
 
+/**
+ * atomic_fetch_xor_relaxed() - atomic bitwise XOR with relaxed ordering
+ * @i: int value
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v ^ @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_xor_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_xor_relaxed(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_xor_relaxed(i, v);
+       return raw_atomic_fetch_xor_relaxed(i, v);
 }
 
+/**
+ * atomic_xchg() - atomic exchange with full ordering
+ * @v: pointer to atomic_t
+ * @new: int value to assign
+ *
+ * Atomically updates @v to @new with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_xchg() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-atomic_xchg(atomic_t *v, int i)
+atomic_xchg(atomic_t *v, int new)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_xchg(v, i);
+       return raw_atomic_xchg(v, new);
 }
 
+/**
+ * atomic_xchg_acquire() - atomic exchange with acquire ordering
+ * @v: pointer to atomic_t
+ * @new: int value to assign
+ *
+ * Atomically updates @v to @new with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_xchg_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-atomic_xchg_acquire(atomic_t *v, int i)
+atomic_xchg_acquire(atomic_t *v, int new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_xchg_acquire(v, i);
+       return raw_atomic_xchg_acquire(v, new);
 }
 
+/**
+ * atomic_xchg_release() - atomic exchange with release ordering
+ * @v: pointer to atomic_t
+ * @new: int value to assign
+ *
+ * Atomically updates @v to @new with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_xchg_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-atomic_xchg_release(atomic_t *v, int i)
+atomic_xchg_release(atomic_t *v, int new)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_xchg_release(v, i);
+       return raw_atomic_xchg_release(v, new);
 }
 
+/**
+ * atomic_xchg_relaxed() - atomic exchange with relaxed ordering
+ * @v: pointer to atomic_t
+ * @new: int value to assign
+ *
+ * Atomically updates @v to @new with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_xchg_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
-atomic_xchg_relaxed(atomic_t *v, int i)
+atomic_xchg_relaxed(atomic_t *v, int new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_xchg_relaxed(v, i);
+       return raw_atomic_xchg_relaxed(v, new);
 }
 
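The _acquire/_release variants above matter when an atomic operation publishes or consumes other data. A deliberately toy sketch, illustrative only; real kernel code should use spinlock_t or other locking primitives rather than a hand-rolled flag:

#include <linux/atomic.h>
#include <asm/processor.h>	/* cpu_relax() */

static atomic_t example_flag = ATOMIC_INIT(0);

/* Illustrative only: acquire ordering keeps the critical section after the xchg. */
static inline void example_enter(void)
{
	while (atomic_xchg_acquire(&example_flag, 1))
		cpu_relax();
}

/* Illustrative only: release ordering keeps the critical section before the store. */
static inline void example_exit(void)
{
	atomic_set_release(&example_flag, 0);
}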
+/**
+ * atomic_cmpxchg() - atomic compare and exchange with full ordering
+ * @v: pointer to atomic_t
+ * @old: int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_cmpxchg() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_cmpxchg(atomic_t *v, int old, int new)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_cmpxchg(v, old, new);
+       return raw_atomic_cmpxchg(v, old, new);
 }
 
+/**
+ * atomic_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
+ * @v: pointer to atomic_t
+ * @old: int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_cmpxchg_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_cmpxchg_acquire(atomic_t *v, int old, int new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_cmpxchg_acquire(v, old, new);
+       return raw_atomic_cmpxchg_acquire(v, old, new);
 }
 
+/**
+ * atomic_cmpxchg_release() - atomic compare and exchange with release ordering
+ * @v: pointer to atomic_t
+ * @old: int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_cmpxchg_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_cmpxchg_release(atomic_t *v, int old, int new)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_cmpxchg_release(v, old, new);
+       return raw_atomic_cmpxchg_release(v, old, new);
 }
 
+/**
+ * atomic_cmpxchg_relaxed() - atomic compare and exchange with relaxed ordering
+ * @v: pointer to atomic_t
+ * @old: int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_cmpxchg_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_cmpxchg_relaxed(atomic_t *v, int old, int new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_cmpxchg_relaxed(v, old, new);
+       return raw_atomic_cmpxchg_relaxed(v, old, new);
 }
 
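cmpxchg() returns the value it found, while the try_cmpxchg() wrappers below return a bool and refresh *@old on failure; the latter usually yields tighter loops. A side-by-side sketch, illustrative only (the bounded-increment helpers are hypothetical):

#include <linux/atomic.h>
#include <linux/types.h>

/* Illustrative only: the same bounded increment written both ways. */
static inline bool example_inc_below_cmpxchg(atomic_t *v, int cap)
{
	int old = atomic_read(v);

	for (;;) {
		int prev;

		if (old >= cap)
			return false;
		/* cmpxchg() hands back the value it found; compare by hand. */
		prev = atomic_cmpxchg(v, old, old + 1);
		if (prev == old)
			return true;
		old = prev;
	}
}

static inline bool example_inc_below_try_cmpxchg(atomic_t *v, int cap)
{
	int old = atomic_read(v);

	do {
		if (old >= cap)
			return false;
		/* On failure, 'old' is refreshed with the current value. */
	} while (!atomic_try_cmpxchg(v, &old, old + 1));

	return true;
}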
+/**
+ * atomic_try_cmpxchg() - atomic compare and exchange with full ordering
+ * @v: pointer to atomic_t
+ * @old: pointer to int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with full ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_try_cmpxchg() there.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
 atomic_try_cmpxchg(atomic_t *v, int *old, int new)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
        instrument_atomic_read_write(old, sizeof(*old));
-       return arch_atomic_try_cmpxchg(v, old, new);
-}
-
+       return raw_atomic_try_cmpxchg(v, old, new);
+}
+
+/**
+ * atomic_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
+ * @v: pointer to atomic_t
+ * @old: pointer to int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with acquire ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_try_cmpxchg_acquire() there.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
 atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
        instrument_atomic_read_write(old, sizeof(*old));
-       return arch_atomic_try_cmpxchg_acquire(v, old, new);
-}
-
+       return raw_atomic_try_cmpxchg_acquire(v, old, new);
+}
+
+/**
+ * atomic_try_cmpxchg_release() - atomic compare and exchange with release ordering
+ * @v: pointer to atomic_t
+ * @old: pointer to int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with release ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_try_cmpxchg_release() there.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
 atomic_try_cmpxchg_release(atomic_t *v, int *old, int new)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
        instrument_atomic_read_write(old, sizeof(*old));
-       return arch_atomic_try_cmpxchg_release(v, old, new);
-}
-
+       return raw_atomic_try_cmpxchg_release(v, old, new);
+}
+
+/**
+ * atomic_try_cmpxchg_relaxed() - atomic compare and exchange with relaxed ordering
+ * @v: pointer to atomic_t
+ * @old: pointer to int value to compare with
+ * @new: int value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with relaxed ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_try_cmpxchg_relaxed() there.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
 atomic_try_cmpxchg_relaxed(atomic_t *v, int *old, int new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
        instrument_atomic_read_write(old, sizeof(*old));
-       return arch_atomic_try_cmpxchg_relaxed(v, old, new);
-}
-
+       return raw_atomic_try_cmpxchg_relaxed(v, old, new);
+}
+
+/**
+ * atomic_sub_and_test() - atomic subtract and test if zero with full ordering
+ * @i: int value to subtract
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_sub_and_test() there.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
 static __always_inline bool
 atomic_sub_and_test(int i, atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_sub_and_test(i, v);
+       return raw_atomic_sub_and_test(i, v);
 }
 
+/**
+ * atomic_dec_and_test() - atomic decrement and test if zero with full ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_dec_and_test() there.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
 static __always_inline bool
 atomic_dec_and_test(atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_dec_and_test(v);
+       return raw_atomic_dec_and_test(v);
 }
 
+/**
+ * atomic_inc_and_test() - atomic increment and test if zero with full ordering
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_inc_and_test() there.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
 static __always_inline bool
 atomic_inc_and_test(atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_inc_and_test(v);
+       return raw_atomic_inc_and_test(v);
 }
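
A minimal usage sketch, assuming <linux/atomic.h> and <linux/slab.h>; struct example_obj and example_obj_put() are invented names. The *_and_test() wrappers return @true exactly when the new value is zero, which is the classic shape of a reference-count release.

#include <linux/atomic.h>
#include <linux/slab.h>

struct example_obj {
	atomic_t refs;
};

/* Drop one reference; free the object when the last one goes away. */
static void example_obj_put(struct example_obj *obj)
{
	if (atomic_dec_and_test(&obj->refs))
		kfree(obj);
}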
 
+/**
+ * atomic_add_negative() - atomic add and test if negative with full ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_add_negative() there.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
 atomic_add_negative(int i, atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_add_negative(i, v);
+       return raw_atomic_add_negative(i, v);
 }
 
+/**
+ * atomic_add_negative_acquire() - atomic add and test if negative with acquire ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_add_negative_acquire() there.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
 atomic_add_negative_acquire(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_add_negative_acquire(i, v);
+       return raw_atomic_add_negative_acquire(i, v);
 }
 
+/**
+ * atomic_add_negative_release() - atomic add and test if negative with release ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_add_negative_release() there.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
 atomic_add_negative_release(int i, atomic_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_add_negative_release(i, v);
+       return raw_atomic_add_negative_release(i, v);
 }
 
+/**
+ * atomic_add_negative_relaxed() - atomic add and test if negative with relaxed ordering
+ * @i: int value to add
+ * @v: pointer to atomic_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_add_negative_relaxed() there.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
 atomic_add_negative_relaxed(int i, atomic_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_add_negative_relaxed(i, v);
+       return raw_atomic_add_negative_relaxed(i, v);
 }
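
A minimal usage sketch, assuming <linux/atomic.h>; example_charge() is a hypothetical helper. add_negative() reports the sign of the post-add value, so charging against a signed headroom counter can detect overrun in one atomic step.

#include <linux/atomic.h>
#include <linux/types.h>

/* Charge @pages against a signed budget; @true means the budget is overrun. */
static inline bool example_charge(atomic_t *budget, int pages)
{
	return atomic_add_negative(-pages, budget);
}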
 
+/**
+ * atomic_fetch_add_unless() - atomic add unless value with full ordering
+ * @v: pointer to atomic_t
+ * @a: int value to add
+ * @u: int value to compare with
+ *
+ * If (@v != @u), atomically updates @v to (@v + @a) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_fetch_add_unless() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline int
 atomic_fetch_add_unless(atomic_t *v, int a, int u)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_fetch_add_unless(v, a, u);
+       return raw_atomic_fetch_add_unless(v, a, u);
 }
 
+/**
+ * atomic_add_unless() - atomic add unless value with full ordering
+ * @v: pointer to atomic_t
+ * @a: int value to add
+ * @u: int value to compare with
+ *
+ * If (@v != @u), atomically updates @v to (@v + @a) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_add_unless() there.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
 atomic_add_unless(atomic_t *v, int a, int u)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_add_unless(v, a, u);
+       return raw_atomic_add_unless(v, a, u);
 }
 
+/**
+ * atomic_inc_not_zero() - atomic increment unless zero with full ordering
+ * @v: pointer to atomic_t
+ *
+ * If (@v != 0), atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_inc_not_zero() there.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
 atomic_inc_not_zero(atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_inc_not_zero(v);
+       return raw_atomic_inc_not_zero(v);
 }
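
A minimal usage sketch, assuming <linux/atomic.h>; example_obj_tryget() is hypothetical. add_unless() and inc_not_zero() only perform the update when the comparison allows it and report whether they did, which suits "take a reference only while one is still held" lookups.

#include <linux/atomic.h>
#include <linux/types.h>

/* Take a reference only if the object still has at least one holder. */
static inline bool example_obj_tryget(atomic_t *refs)
{
	return atomic_inc_not_zero(refs);
}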
 
+/**
+ * atomic_inc_unless_negative() - atomic increment unless negative with full ordering
+ * @v: pointer to atomic_t
+ *
+ * If (@v >= 0), atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_inc_unless_negative() there.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
 atomic_inc_unless_negative(atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_inc_unless_negative(v);
+       return raw_atomic_inc_unless_negative(v);
 }
 
+/**
+ * atomic_dec_unless_positive() - atomic decrement unless positive with full ordering
+ * @v: pointer to atomic_t
+ *
+ * If (@v <= 0), atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_dec_unless_positive() there.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
 atomic_dec_unless_positive(atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_dec_unless_positive(v);
+       return raw_atomic_dec_unless_positive(v);
 }
 
+/**
+ * atomic_dec_if_positive() - atomic decrement if positive with full ordering
+ * @v: pointer to atomic_t
+ *
+ * If (@v > 0), atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_dec_if_positive() there.
+ *
+ * Return: The old value of @v minus one, regardless of whether @v was updated.
+ */
 static __always_inline int
 atomic_dec_if_positive(atomic_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_dec_if_positive(v);
+       return raw_atomic_dec_if_positive(v);
 }
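
A minimal usage sketch, assuming <linux/atomic.h>; example_take_token() is hypothetical. Because dec_if_positive() returns the decremented value whether or not the store happened, a non-negative result means a token was actually consumed.

#include <linux/atomic.h>
#include <linux/types.h>

/* Consume one token if any are available. */
static inline bool example_take_token(atomic_t *tokens)
{
	return atomic_dec_if_positive(tokens) >= 0;
}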
 
+/**
+ * atomic64_read() - atomic load with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically loads the value of @v with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_read() there.
+ *
+ * Return: The value loaded from @v.
+ */
 static __always_inline s64
 atomic64_read(const atomic64_t *v)
 {
        instrument_atomic_read(v, sizeof(*v));
-       return arch_atomic64_read(v);
-}
-
+       return raw_atomic64_read(v);
+}
+
+/**
+ * atomic64_read_acquire() - atomic load with acquire ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically loads the value of @v with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_read_acquire() there.
+ *
+ * Return: The value loaded from @v.
+ */
 static __always_inline s64
 atomic64_read_acquire(const atomic64_t *v)
 {
        instrument_atomic_read(v, sizeof(*v));
-       return arch_atomic64_read_acquire(v);
-}
-
+       return raw_atomic64_read_acquire(v);
+}
+
+/**
+ * atomic64_set() - atomic set with relaxed ordering
+ * @v: pointer to atomic64_t
+ * @i: s64 value to assign
+ *
+ * Atomically sets @v to @i with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_set() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic64_set(atomic64_t *v, s64 i)
 {
        instrument_atomic_write(v, sizeof(*v));
-       arch_atomic64_set(v, i);
-}
-
+       raw_atomic64_set(v, i);
+}
+
+/**
+ * atomic64_set_release() - atomic set with release ordering
+ * @v: pointer to atomic64_t
+ * @i: s64 value to assign
+ *
+ * Atomically sets @v to @i with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_set_release() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic64_set_release(atomic64_t *v, s64 i)
 {
        kcsan_release();
        instrument_atomic_write(v, sizeof(*v));
-       arch_atomic64_set_release(v, i);
-}
-
+       raw_atomic64_set_release(v, i);
+}
+
+/**
+ * atomic64_add() - atomic add with relaxed ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_add() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic64_add(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic64_add(i, v);
+       raw_atomic64_add(i, v);
 }
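
A minimal usage sketch, assuming <linux/atomic.h>; the example_* names are invented. The relaxed atomic64 read/set/add wrappers are all a plain 64-bit statistics counter needs: atomicity without any ordering guarantees.

#include <linux/atomic.h>
#include <linux/types.h>

static atomic64_t example_bytes_rx = ATOMIC64_INIT(0);

/* Accumulate received bytes; readers use atomic64_read() elsewhere. */
static inline void example_account_rx(s64 len)
{
	atomic64_add(len, &example_bytes_rx);
}

static inline s64 example_read_rx(void)
{
	return atomic64_read(&example_bytes_rx);
}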
 
+/**
+ * atomic64_add_return() - atomic add with full ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_add_return() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_add_return(s64 i, atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_add_return(i, v);
+       return raw_atomic64_add_return(i, v);
 }
 
+/**
+ * atomic64_add_return_acquire() - atomic add with acquire ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_add_return_acquire() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_add_return_acquire(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_add_return_acquire(i, v);
+       return raw_atomic64_add_return_acquire(i, v);
 }
 
+/**
+ * atomic64_add_return_release() - atomic add with release ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_add_return_release() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_add_return_release(s64 i, atomic64_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_add_return_release(i, v);
+       return raw_atomic64_add_return_release(i, v);
 }
 
+/**
+ * atomic64_add_return_relaxed() - atomic add with relaxed ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_add_return_relaxed() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_add_return_relaxed(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_add_return_relaxed(i, v);
+       return raw_atomic64_add_return_relaxed(i, v);
 }
 
+/**
+ * atomic64_fetch_add() - atomic add with full ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_add() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_add(s64 i, atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_add(i, v);
+       return raw_atomic64_fetch_add(i, v);
 }
 
+/**
+ * atomic64_fetch_add_acquire() - atomic add with acquire ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_add_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_add_acquire(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_add_acquire(i, v);
+       return raw_atomic64_fetch_add_acquire(i, v);
 }
 
+/**
+ * atomic64_fetch_add_release() - atomic add with release ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_add_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_add_release(s64 i, atomic64_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_add_release(i, v);
+       return raw_atomic64_fetch_add_release(i, v);
 }
 
+/**
+ * atomic64_fetch_add_relaxed() - atomic add with relaxed ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_add_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_add_relaxed(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_add_relaxed(i, v);
+       return raw_atomic64_fetch_add_relaxed(i, v);
 }
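
A minimal usage sketch, assuming <linux/atomic.h>; example_alloc_id() is hypothetical. add_return() hands back the updated value while fetch_add() hands back the original one, so fetch_add(1, ...) gives each concurrent caller a unique pre-increment id.

#include <linux/atomic.h>
#include <linux/types.h>

static atomic64_t example_next_id = ATOMIC64_INIT(0);

/* Allocate a unique 64-bit id, starting from 0. */
static inline u64 example_alloc_id(void)
{
	return (u64)atomic64_fetch_add(1, &example_next_id);
}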
 
+/**
+ * atomic64_sub() - atomic subtract with relaxed ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_sub() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic64_sub(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic64_sub(i, v);
+       raw_atomic64_sub(i, v);
 }
 
+/**
+ * atomic64_sub_return() - atomic subtract with full ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_sub_return() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_sub_return(s64 i, atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_sub_return(i, v);
+       return raw_atomic64_sub_return(i, v);
 }
 
+/**
+ * atomic64_sub_return_acquire() - atomic subtract with acquire ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_sub_return_acquire() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_sub_return_acquire(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_sub_return_acquire(i, v);
+       return raw_atomic64_sub_return_acquire(i, v);
 }
 
+/**
+ * atomic64_sub_return_release() - atomic subtract with release ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_sub_return_release() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_sub_return_release(s64 i, atomic64_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_sub_return_release(i, v);
+       return raw_atomic64_sub_return_release(i, v);
 }
 
+/**
+ * atomic64_sub_return_relaxed() - atomic subtract with relaxed ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_sub_return_relaxed() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_sub_return_relaxed(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_sub_return_relaxed(i, v);
+       return raw_atomic64_sub_return_relaxed(i, v);
 }
 
+/**
+ * atomic64_fetch_sub() - atomic subtract with full ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_sub() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_sub(s64 i, atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_sub(i, v);
+       return raw_atomic64_fetch_sub(i, v);
 }
 
+/**
+ * atomic64_fetch_sub_acquire() - atomic subtract with acquire ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_sub_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_sub_acquire(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_sub_acquire(i, v);
+       return raw_atomic64_fetch_sub_acquire(i, v);
 }
 
+/**
+ * atomic64_fetch_sub_release() - atomic subtract with release ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_sub_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_sub_release(s64 i, atomic64_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_sub_release(i, v);
+       return raw_atomic64_fetch_sub_release(i, v);
 }
 
+/**
+ * atomic64_fetch_sub_relaxed() - atomic subtract with relaxed ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_sub_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_sub_relaxed(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_sub_relaxed(i, v);
+       return raw_atomic64_fetch_sub_relaxed(i, v);
 }
 
+/**
+ * atomic64_inc() - atomic increment with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_inc() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic64_inc(atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic64_inc(v);
+       raw_atomic64_inc(v);
 }
 
+/**
+ * atomic64_inc_return() - atomic increment with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_inc_return() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_inc_return(atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_inc_return(v);
+       return raw_atomic64_inc_return(v);
 }
 
+/**
+ * atomic64_inc_return_acquire() - atomic increment with acquire ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_inc_return_acquire() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_inc_return_acquire(atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_inc_return_acquire(v);
+       return raw_atomic64_inc_return_acquire(v);
 }
 
+/**
+ * atomic64_inc_return_release() - atomic increment with release ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_inc_return_release() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_inc_return_release(atomic64_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_inc_return_release(v);
+       return raw_atomic64_inc_return_release(v);
 }
 
+/**
+ * atomic64_inc_return_relaxed() - atomic increment with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_inc_return_relaxed() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_inc_return_relaxed(atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_inc_return_relaxed(v);
+       return raw_atomic64_inc_return_relaxed(v);
 }
 
+/**
+ * atomic64_fetch_inc() - atomic increment with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_inc() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_inc(atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_inc(v);
+       return raw_atomic64_fetch_inc(v);
 }
 
+/**
+ * atomic64_fetch_inc_acquire() - atomic increment with acquire ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_inc_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_inc_acquire(atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_inc_acquire(v);
+       return raw_atomic64_fetch_inc_acquire(v);
 }
 
+/**
+ * atomic64_fetch_inc_release() - atomic increment with release ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_inc_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_inc_release(atomic64_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_inc_release(v);
+       return raw_atomic64_fetch_inc_release(v);
 }
 
+/**
+ * atomic64_fetch_inc_relaxed() - atomic increment with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_inc_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_inc_relaxed(atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_inc_relaxed(v);
+       return raw_atomic64_fetch_inc_relaxed(v);
 }
 
+/**
+ * atomic64_dec() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_dec() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic64_dec(atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic64_dec(v);
+       raw_atomic64_dec(v);
 }
 
+/**
+ * atomic64_dec_return() - atomic decrement with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_dec_return() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_dec_return(atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_dec_return(v);
+       return raw_atomic64_dec_return(v);
 }
 
+/**
+ * atomic64_dec_return_acquire() - atomic decrement with acquire ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_dec_return_acquire() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_dec_return_acquire(atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_dec_return_acquire(v);
+       return raw_atomic64_dec_return_acquire(v);
 }
 
+/**
+ * atomic64_dec_return_release() - atomic decrement with release ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_dec_return_release() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_dec_return_release(atomic64_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_dec_return_release(v);
+       return raw_atomic64_dec_return_release(v);
 }
 
+/**
+ * atomic64_dec_return_relaxed() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_dec_return_relaxed() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline s64
 atomic64_dec_return_relaxed(atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_dec_return_relaxed(v);
+       return raw_atomic64_dec_return_relaxed(v);
 }
 
+/**
+ * atomic64_fetch_dec() - atomic decrement with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_dec() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_dec(atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_dec(v);
+       return raw_atomic64_fetch_dec(v);
 }
 
+/**
+ * atomic64_fetch_dec_acquire() - atomic decrement with acquire ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_dec_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_dec_acquire(atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_dec_acquire(v);
+       return raw_atomic64_fetch_dec_acquire(v);
 }
 
+/**
+ * atomic64_fetch_dec_release() - atomic decrement with release ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_dec_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_dec_release(atomic64_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_dec_release(v);
+       return raw_atomic64_fetch_dec_release(v);
 }
 
+/**
+ * atomic64_fetch_dec_relaxed() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_dec_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_dec_relaxed(atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_dec_relaxed(v);
+       return raw_atomic64_fetch_dec_relaxed(v);
 }
 
+/**
+ * atomic64_and() - atomic bitwise AND with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_and() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic64_and(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic64_and(i, v);
+       raw_atomic64_and(i, v);
 }
 
+/**
+ * atomic64_fetch_and() - atomic bitwise AND with full ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_and() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_and(s64 i, atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_and(i, v);
+       return raw_atomic64_fetch_and(i, v);
 }
 
+/**
+ * atomic64_fetch_and_acquire() - atomic bitwise AND with acquire ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_and_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_and_acquire(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_and_acquire(i, v);
+       return raw_atomic64_fetch_and_acquire(i, v);
 }
 
+/**
+ * atomic64_fetch_and_release() - atomic bitwise AND with release ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_and_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_and_release(s64 i, atomic64_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_and_release(i, v);
+       return raw_atomic64_fetch_and_release(i, v);
 }
 
+/**
+ * atomic64_fetch_and_relaxed() - atomic bitwise AND with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_and_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_and_relaxed(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_and_relaxed(i, v);
+       return raw_atomic64_fetch_and_relaxed(i, v);
 }
 
+/**
+ * atomic64_andnot() - atomic bitwise AND NOT with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & ~@i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_andnot() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic64_andnot(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic64_andnot(i, v);
+       raw_atomic64_andnot(i, v);
 }
 
+/**
+ * atomic64_fetch_andnot() - atomic bitwise AND NOT with full ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & ~@i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_andnot() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_andnot(s64 i, atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_andnot(i, v);
+       return raw_atomic64_fetch_andnot(i, v);
 }
 
+/**
+ * atomic64_fetch_andnot_acquire() - atomic bitwise AND NOT with acquire ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & ~@i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_andnot_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_andnot_acquire(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_andnot_acquire(i, v);
+       return raw_atomic64_fetch_andnot_acquire(i, v);
 }
 
+/**
+ * atomic64_fetch_andnot_release() - atomic bitwise AND NOT with release ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & ~@i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_andnot_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_andnot_release(s64 i, atomic64_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_andnot_release(i, v);
+       return raw_atomic64_fetch_andnot_release(i, v);
 }
 
+/**
+ * atomic64_fetch_andnot_relaxed() - atomic bitwise AND NOT with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v & ~@i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_andnot_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_andnot_relaxed(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_andnot_relaxed(i, v);
+       return raw_atomic64_fetch_andnot_relaxed(i, v);
 }
 
+/**
+ * atomic64_or() - atomic bitwise OR with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v | @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_or() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic64_or(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic64_or(i, v);
+       raw_atomic64_or(i, v);
 }
 
+/**
+ * atomic64_fetch_or() - atomic bitwise OR with full ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v | @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_or() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_or(s64 i, atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_or(i, v);
+       return raw_atomic64_fetch_or(i, v);
 }
 
+/**
+ * atomic64_fetch_or_acquire() - atomic bitwise OR with acquire ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v | @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_or_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_or_acquire(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_or_acquire(i, v);
+       return raw_atomic64_fetch_or_acquire(i, v);
 }
 
+/**
+ * atomic64_fetch_or_release() - atomic bitwise OR with release ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v | @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_or_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_or_release(s64 i, atomic64_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_or_release(i, v);
+       return raw_atomic64_fetch_or_release(i, v);
 }
 
+/**
+ * atomic64_fetch_or_relaxed() - atomic bitwise OR with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v | @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_or_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_or_relaxed(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_or_relaxed(i, v);
+       return raw_atomic64_fetch_or_relaxed(i, v);
 }
 
+/**
+ * atomic64_xor() - atomic bitwise XOR with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v ^ @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_xor() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic64_xor(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic64_xor(i, v);
+       raw_atomic64_xor(i, v);
 }
 
+/**
+ * atomic64_fetch_xor() - atomic bitwise XOR with full ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v ^ @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_xor() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_xor(s64 i, atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_xor(i, v);
+       return raw_atomic64_fetch_xor(i, v);
 }
 
+/**
+ * atomic64_fetch_xor_acquire() - atomic bitwise XOR with acquire ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v ^ @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_xor_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_xor_acquire(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_xor_acquire(i, v);
+       return raw_atomic64_fetch_xor_acquire(i, v);
 }
 
+/**
+ * atomic64_fetch_xor_release() - atomic bitwise XOR with release ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v ^ @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_xor_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_xor_release(s64 i, atomic64_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_xor_release(i, v);
+       return raw_atomic64_fetch_xor_release(i, v);
 }
 
+/**
+ * atomic64_fetch_xor_relaxed() - atomic bitwise XOR with relaxed ordering
+ * @i: s64 value
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v ^ @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_xor_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_xor_relaxed(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_xor_relaxed(i, v);
+       return raw_atomic64_fetch_xor_relaxed(i, v);
 }
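
A minimal usage sketch, assuming <linux/atomic.h>; the EXAMPLE_FLAG_* bits and helpers are invented. The fetch variants of the bitwise ops return the previous mask, so test-and-set or test-and-clear of a flag bit is a single call.

#include <linux/atomic.h>
#include <linux/types.h>

#define EXAMPLE_FLAG_DIRTY	(1ULL << 0)
#define EXAMPLE_FLAG_ACTIVE	(1ULL << 1)

/* Set DIRTY; @true if it was already set. */
static inline bool example_mark_dirty(atomic64_t *flags)
{
	return atomic64_fetch_or(EXAMPLE_FLAG_DIRTY, flags) & EXAMPLE_FLAG_DIRTY;
}

/* Clear ACTIVE; @true if it was set beforehand. */
static inline bool example_clear_active(atomic64_t *flags)
{
	return atomic64_fetch_andnot(EXAMPLE_FLAG_ACTIVE, flags) & EXAMPLE_FLAG_ACTIVE;
}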
 
+/**
+ * atomic64_xchg() - atomic exchange with full ordering
+ * @v: pointer to atomic64_t
+ * @new: s64 value to assign
+ *
+ * Atomically updates @v to @new with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_xchg() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-atomic64_xchg(atomic64_t *v, s64 i)
+atomic64_xchg(atomic64_t *v, s64 new)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_xchg(v, i);
+       return raw_atomic64_xchg(v, new);
 }
 
+/**
+ * atomic64_xchg_acquire() - atomic exchange with acquire ordering
+ * @v: pointer to atomic64_t
+ * @new: s64 value to assign
+ *
+ * Atomically updates @v to @new with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_xchg_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-atomic64_xchg_acquire(atomic64_t *v, s64 i)
+atomic64_xchg_acquire(atomic64_t *v, s64 new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_xchg_acquire(v, i);
+       return raw_atomic64_xchg_acquire(v, new);
 }
 
+/**
+ * atomic64_xchg_release() - atomic exchange with release ordering
+ * @v: pointer to atomic64_t
+ * @new: s64 value to assign
+ *
+ * Atomically updates @v to @new with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_xchg_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-atomic64_xchg_release(atomic64_t *v, s64 i)
+atomic64_xchg_release(atomic64_t *v, s64 new)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_xchg_release(v, i);
+       return raw_atomic64_xchg_release(v, new);
 }
 
+/**
+ * atomic64_xchg_relaxed() - atomic exchange with relaxed ordering
+ * @v: pointer to atomic64_t
+ * @new: s64 value to assign
+ *
+ * Atomically updates @v to @new with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_xchg_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
-atomic64_xchg_relaxed(atomic64_t *v, s64 i)
+atomic64_xchg_relaxed(atomic64_t *v, s64 new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_xchg_relaxed(v, i);
+       return raw_atomic64_xchg_relaxed(v, new);
 }
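
A minimal usage sketch, assuming <linux/atomic.h>; example_drain_errors() is hypothetical. xchg() atomically swaps in a new value and returns the old one, for example to read-and-reset an accumulated count.

#include <linux/atomic.h>
#include <linux/types.h>

/* Atomically fetch the accumulated error count and reset it to zero. */
static inline s64 example_drain_errors(atomic64_t *errs)
{
	return atomic64_xchg(errs, 0);
}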
 
+/**
+ * atomic64_cmpxchg() - atomic compare and exchange with full ordering
+ * @v: pointer to atomic64_t
+ * @old: s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_cmpxchg() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_cmpxchg(atomic64_t *v, s64 old, s64 new)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_cmpxchg(v, old, new);
+       return raw_atomic64_cmpxchg(v, old, new);
 }
 
+/**
+ * atomic64_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
+ * @v: pointer to atomic64_t
+ * @old: s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_cmpxchg_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_cmpxchg_acquire(atomic64_t *v, s64 old, s64 new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_cmpxchg_acquire(v, old, new);
+       return raw_atomic64_cmpxchg_acquire(v, old, new);
 }
 
+/**
+ * atomic64_cmpxchg_release() - atomic compare and exchange with release ordering
+ * @v: pointer to atomic64_t
+ * @old: s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_cmpxchg_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_cmpxchg_release(atomic64_t *v, s64 old, s64 new)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_cmpxchg_release(v, old, new);
+       return raw_atomic64_cmpxchg_release(v, old, new);
 }
 
+/**
+ * atomic64_cmpxchg_relaxed() - atomic compare and exchange with relaxed ordering
+ * @v: pointer to atomic64_t
+ * @old: s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_cmpxchg_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_cmpxchg_relaxed(atomic64_t *v, s64 old, s64 new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_cmpxchg_relaxed(v, old, new);
+       return raw_atomic64_cmpxchg_relaxed(v, old, new);
 }
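
A minimal usage sketch, assuming <linux/atomic.h>; EXAMPLE_OWNER_NONE and example_claim() are invented. The value-returning cmpxchg() succeeded if and only if it returns the expected old value, which is enough for a one-shot claim.

#include <linux/atomic.h>
#include <linux/types.h>

#define EXAMPLE_OWNER_NONE	0

/* Record @me as the owner only if nobody claimed ownership yet. */
static inline bool example_claim(atomic64_t *owner, s64 me)
{
	return atomic64_cmpxchg(owner, EXAMPLE_OWNER_NONE, me) == EXAMPLE_OWNER_NONE;
}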
 
+/**
+ * atomic64_try_cmpxchg() - atomic compare and exchange with full ordering
+ * @v: pointer to atomic64_t
+ * @old: pointer to s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with full ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_try_cmpxchg() there.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
 atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
        instrument_atomic_read_write(old, sizeof(*old));
-       return arch_atomic64_try_cmpxchg(v, old, new);
-}
-
+       return raw_atomic64_try_cmpxchg(v, old, new);
+}
+
+/**
+ * atomic64_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
+ * @v: pointer to atomic64_t
+ * @old: pointer to s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with acquire ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_try_cmpxchg_acquire() there.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
 atomic64_try_cmpxchg_acquire(atomic64_t *v, s64 *old, s64 new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
        instrument_atomic_read_write(old, sizeof(*old));
-       return arch_atomic64_try_cmpxchg_acquire(v, old, new);
-}
-
+       return raw_atomic64_try_cmpxchg_acquire(v, old, new);
+}
+
+/**
+ * atomic64_try_cmpxchg_release() - atomic compare and exchange with release ordering
+ * @v: pointer to atomic64_t
+ * @old: pointer to s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with release ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_try_cmpxchg_release() there.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
 atomic64_try_cmpxchg_release(atomic64_t *v, s64 *old, s64 new)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
        instrument_atomic_read_write(old, sizeof(*old));
-       return arch_atomic64_try_cmpxchg_release(v, old, new);
-}
-
+       return raw_atomic64_try_cmpxchg_release(v, old, new);
+}
+
+/**
+ * atomic64_try_cmpxchg_relaxed() - atomic compare and exchange with relaxed ordering
+ * @v: pointer to atomic64_t
+ * @old: pointer to s64 value to compare with
+ * @new: s64 value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with relaxed ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_try_cmpxchg_relaxed() there.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
 atomic64_try_cmpxchg_relaxed(atomic64_t *v, s64 *old, s64 new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
        instrument_atomic_read_write(old, sizeof(*old));
-       return arch_atomic64_try_cmpxchg_relaxed(v, old, new);
-}
-
+       return raw_atomic64_try_cmpxchg_relaxed(v, old, new);
+}
+
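A minimal usage sketch, assuming <linux/atomic.h>; example_update_max() is hypothetical. As with the 32-bit variant, the atomic64 try_cmpxchg() loop re-reads @old on failure, so recording a running maximum needs no locking.

#include <linux/atomic.h>
#include <linux/types.h>

/* Record @val as the new maximum if it exceeds the current one. */
static inline void example_update_max(atomic64_t *max, s64 val)
{
	s64 old = atomic64_read(max);

	while (old < val && !atomic64_try_cmpxchg(max, &old, val))
		;	/* @old was refreshed; re-check against @val and retry */
}
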
+/**
+ * atomic64_sub_and_test() - atomic subtract and test if zero with full ordering
+ * @i: s64 value to subtract
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_sub_and_test() there.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
 static __always_inline bool
 atomic64_sub_and_test(s64 i, atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_sub_and_test(i, v);
+       return raw_atomic64_sub_and_test(i, v);
 }
 
+/**
+ * atomic64_dec_and_test() - atomic decrement and test if zero with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_dec_and_test() there.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
 static __always_inline bool
 atomic64_dec_and_test(atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_dec_and_test(v);
+       return raw_atomic64_dec_and_test(v);
 }
 
+/**
+ * atomic64_inc_and_test() - atomic increment and test if zero with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_inc_and_test() there.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
 static __always_inline bool
 atomic64_inc_and_test(atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_inc_and_test(v);
+       return raw_atomic64_inc_and_test(v);
 }
 
+/**
+ * atomic64_add_negative() - atomic add and test if negative with full ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_add_negative() there.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
 atomic64_add_negative(s64 i, atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_add_negative(i, v);
+       return raw_atomic64_add_negative(i, v);
 }
 
+/**
+ * atomic64_add_negative_acquire() - atomic add and test if negative with acquire ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_add_negative_acquire() there.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
 atomic64_add_negative_acquire(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_add_negative_acquire(i, v);
+       return raw_atomic64_add_negative_acquire(i, v);
 }
 
+/**
+ * atomic64_add_negative_release() - atomic add and test if negative with release ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_add_negative_release() there.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
 atomic64_add_negative_release(s64 i, atomic64_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_add_negative_release(i, v);
+       return raw_atomic64_add_negative_release(i, v);
 }
 
+/**
+ * atomic64_add_negative_relaxed() - atomic add and test if negative with relaxed ordering
+ * @i: s64 value to add
+ * @v: pointer to atomic64_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_add_negative_relaxed() there.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
 atomic64_add_negative_relaxed(s64 i, atomic64_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_add_negative_relaxed(i, v);
+       return raw_atomic64_add_negative_relaxed(i, v);
 }
 
+/**
+ * atomic64_fetch_add_unless() - atomic add unless value with full ordering
+ * @v: pointer to atomic64_t
+ * @a: s64 value to add
+ * @u: s64 value to compare with
+ *
+ * If (@v != @u), atomically updates @v to (@v + @a) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_fetch_add_unless() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline s64
 atomic64_fetch_add_unless(atomic64_t *v, s64 a, s64 u)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_fetch_add_unless(v, a, u);
+       return raw_atomic64_fetch_add_unless(v, a, u);
 }
 
+/**
+ * atomic64_add_unless() - atomic add unless value with full ordering
+ * @v: pointer to atomic64_t
+ * @a: s64 value to add
+ * @u: s64 value to compare with
+ *
+ * If (@v != @u), atomically updates @v to (@v + @a) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_add_unless() there.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
 atomic64_add_unless(atomic64_t *v, s64 a, s64 u)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_add_unless(v, a, u);
+       return raw_atomic64_add_unless(v, a, u);
 }
 
+/**
+ * atomic64_inc_not_zero() - atomic increment unless zero with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * If (@v != 0), atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_inc_not_zero() there.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
 atomic64_inc_not_zero(atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_inc_not_zero(v);
+       return raw_atomic64_inc_not_zero(v);
 }
 
+/**
+ * atomic64_inc_unless_negative() - atomic increment unless negative with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * If (@v >= 0), atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_inc_unless_negative() there.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
 atomic64_inc_unless_negative(atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_inc_unless_negative(v);
+       return raw_atomic64_inc_unless_negative(v);
 }
 
+/**
+ * atomic64_dec_unless_positive() - atomic decrement unless positive with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * If (@v <= 0), atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_dec_unless_positive() there.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
 atomic64_dec_unless_positive(atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_dec_unless_positive(v);
+       return raw_atomic64_dec_unless_positive(v);
 }
 
+/**
+ * atomic64_dec_if_positive() - atomic decrement if positive with full ordering
+ * @v: pointer to atomic64_t
+ *
+ * If (@v > 0), atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic64_dec_if_positive() there.
+ *
+ * Return: The old value of @v minus one, regardless of whether @v was updated.
+ */
 static __always_inline s64
 atomic64_dec_if_positive(atomic64_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic64_dec_if_positive(v);
+       return raw_atomic64_dec_if_positive(v);
 }
 
+/**
+ * atomic_long_read() - atomic load with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically loads the value of @v with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_read() there.
+ *
+ * Return: The value loaded from @v.
+ */
 static __always_inline long
 atomic_long_read(const atomic_long_t *v)
 {
        instrument_atomic_read(v, sizeof(*v));
-       return arch_atomic_long_read(v);
-}
-
+       return raw_atomic_long_read(v);
+}
+
+/**
+ * atomic_long_read_acquire() - atomic load with acquire ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically loads the value of @v with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_read_acquire() there.
+ *
+ * Return: The value loaded from @v.
+ */
 static __always_inline long
 atomic_long_read_acquire(const atomic_long_t *v)
 {
        instrument_atomic_read(v, sizeof(*v));
-       return arch_atomic_long_read_acquire(v);
-}
-
+       return raw_atomic_long_read_acquire(v);
+}
+
+/**
+ * atomic_long_set() - atomic set with relaxed ordering
+ * @v: pointer to atomic_long_t
+ * @i: long value to assign
+ *
+ * Atomically sets @v to @i with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_set() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_long_set(atomic_long_t *v, long i)
 {
        instrument_atomic_write(v, sizeof(*v));
-       arch_atomic_long_set(v, i);
-}
-
+       raw_atomic_long_set(v, i);
+}
+
+/**
+ * atomic_long_set_release() - atomic set with release ordering
+ * @v: pointer to atomic_long_t
+ * @i: long value to assign
+ *
+ * Atomically sets @v to @i with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_set_release() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_long_set_release(atomic_long_t *v, long i)
 {
        kcsan_release();
        instrument_atomic_write(v, sizeof(*v));
-       arch_atomic_long_set_release(v, i);
-}
-
+       raw_atomic_long_set_release(v, i);
+}
+
+/**
+ * atomic_long_add() - atomic add with relaxed ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_add() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_long_add(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_long_add(i, v);
+       raw_atomic_long_add(i, v);
 }
 
+/**
+ * atomic_long_add_return() - atomic add with full ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_add_return() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_add_return(long i, atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_add_return(i, v);
+       return raw_atomic_long_add_return(i, v);
 }
 
+/**
+ * atomic_long_add_return_acquire() - atomic add with acquire ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_add_return_acquire() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_add_return_acquire(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_add_return_acquire(i, v);
+       return raw_atomic_long_add_return_acquire(i, v);
 }
 
+/**
+ * atomic_long_add_return_release() - atomic add with release ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_add_return_release() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_add_return_release(long i, atomic_long_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_add_return_release(i, v);
+       return raw_atomic_long_add_return_release(i, v);
 }
 
+/**
+ * atomic_long_add_return_relaxed() - atomic add with relaxed ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_add_return_relaxed() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_add_return_relaxed(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_add_return_relaxed(i, v);
+       return raw_atomic_long_add_return_relaxed(i, v);
 }
 
+/**
+ * atomic_long_fetch_add() - atomic add with full ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_add() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_add(long i, atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_add(i, v);
+       return raw_atomic_long_fetch_add(i, v);
 }
 
+/**
+ * atomic_long_fetch_add_acquire() - atomic add with acquire ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_add_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_add_acquire(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_add_acquire(i, v);
+       return raw_atomic_long_fetch_add_acquire(i, v);
 }
 
+/**
+ * atomic_long_fetch_add_release() - atomic add with release ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_add_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_add_release(long i, atomic_long_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_add_release(i, v);
+       return raw_atomic_long_fetch_add_release(i, v);
 }
 
+/**
+ * atomic_long_fetch_add_relaxed() - atomic add with relaxed ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_add_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_add_relaxed(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_add_relaxed(i, v);
+       return raw_atomic_long_fetch_add_relaxed(i, v);
 }
 
+/**
+ * atomic_long_sub() - atomic subtract with relaxed ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_sub() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_long_sub(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_long_sub(i, v);
+       raw_atomic_long_sub(i, v);
 }
 
+/**
+ * atomic_long_sub_return() - atomic subtract with full ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_sub_return() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_sub_return(long i, atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_sub_return(i, v);
+       return raw_atomic_long_sub_return(i, v);
 }
 
+/**
+ * atomic_long_sub_return_acquire() - atomic subtract with acquire ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_sub_return_acquire() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_sub_return_acquire(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_sub_return_acquire(i, v);
+       return raw_atomic_long_sub_return_acquire(i, v);
 }
 
+/**
+ * atomic_long_sub_return_release() - atomic subtract with release ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_sub_return_release() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_sub_return_release(long i, atomic_long_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_sub_return_release(i, v);
+       return raw_atomic_long_sub_return_release(i, v);
 }
 
+/**
+ * atomic_long_sub_return_relaxed() - atomic subtract with relaxed ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_sub_return_relaxed() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_sub_return_relaxed(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_sub_return_relaxed(i, v);
+       return raw_atomic_long_sub_return_relaxed(i, v);
 }
 
+/**
+ * atomic_long_fetch_sub() - atomic subtract with full ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_sub() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_sub(long i, atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_sub(i, v);
+       return raw_atomic_long_fetch_sub(i, v);
 }
 
+/**
+ * atomic_long_fetch_sub_acquire() - atomic subtract with acquire ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_sub_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_sub_acquire(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_sub_acquire(i, v);
+       return raw_atomic_long_fetch_sub_acquire(i, v);
 }
 
+/**
+ * atomic_long_fetch_sub_release() - atomic subtract with release ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_sub_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_sub_release(long i, atomic_long_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_sub_release(i, v);
+       return raw_atomic_long_fetch_sub_release(i, v);
 }
 
+/**
+ * atomic_long_fetch_sub_relaxed() - atomic subtract with relaxed ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_sub_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_sub_relaxed(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_sub_relaxed(i, v);
+       return raw_atomic_long_fetch_sub_relaxed(i, v);
 }
 
+/**
+ * atomic_long_inc() - atomic increment with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_inc() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_long_inc(atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_long_inc(v);
+       raw_atomic_long_inc(v);
 }
 
+/**
+ * atomic_long_inc_return() - atomic increment with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_inc_return() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_inc_return(atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_inc_return(v);
+       return raw_atomic_long_inc_return(v);
 }
 
+/**
+ * atomic_long_inc_return_acquire() - atomic increment with acquire ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_inc_return_acquire() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_inc_return_acquire(atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_inc_return_acquire(v);
+       return raw_atomic_long_inc_return_acquire(v);
 }
 
+/**
+ * atomic_long_inc_return_release() - atomic increment with release ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_inc_return_release() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_inc_return_release(atomic_long_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_inc_return_release(v);
+       return raw_atomic_long_inc_return_release(v);
 }
 
+/**
+ * atomic_long_inc_return_relaxed() - atomic increment with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_inc_return_relaxed() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_inc_return_relaxed(atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_inc_return_relaxed(v);
+       return raw_atomic_long_inc_return_relaxed(v);
 }
 
+/**
+ * atomic_long_fetch_inc() - atomic increment with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_inc() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_inc(atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_inc(v);
+       return raw_atomic_long_fetch_inc(v);
 }
 
+/**
+ * atomic_long_fetch_inc_acquire() - atomic increment with acquire ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_inc_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_inc_acquire(atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_inc_acquire(v);
+       return raw_atomic_long_fetch_inc_acquire(v);
 }
 
+/**
+ * atomic_long_fetch_inc_release() - atomic increment with release ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_inc_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_inc_release(atomic_long_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_inc_release(v);
+       return raw_atomic_long_fetch_inc_release(v);
 }
 
+/**
+ * atomic_long_fetch_inc_relaxed() - atomic increment with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_inc_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_inc_relaxed(atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_inc_relaxed(v);
+       return raw_atomic_long_fetch_inc_relaxed(v);
 }
 
+/**
+ * atomic_long_dec() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_dec() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_long_dec(atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_long_dec(v);
+       raw_atomic_long_dec(v);
 }
 
+/**
+ * atomic_long_dec_return() - atomic decrement with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_dec_return() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_dec_return(atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_dec_return(v);
+       return raw_atomic_long_dec_return(v);
 }
 
+/**
+ * atomic_long_dec_return_acquire() - atomic decrement with acquire ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_dec_return_acquire() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_dec_return_acquire(atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_dec_return_acquire(v);
+       return raw_atomic_long_dec_return_acquire(v);
 }
 
+/**
+ * atomic_long_dec_return_release() - atomic decrement with release ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_dec_return_release() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_dec_return_release(atomic_long_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_dec_return_release(v);
+       return raw_atomic_long_dec_return_release(v);
 }
 
+/**
+ * atomic_long_dec_return_relaxed() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_dec_return_relaxed() there.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
 atomic_long_dec_return_relaxed(atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_dec_return_relaxed(v);
+       return raw_atomic_long_dec_return_relaxed(v);
 }
 
+/**
+ * atomic_long_fetch_dec() - atomic decrement with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_dec() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_dec(atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_dec(v);
+       return raw_atomic_long_fetch_dec(v);
 }
 
+/**
+ * atomic_long_fetch_dec_acquire() - atomic decrement with acquire ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_dec_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_dec_acquire(atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_dec_acquire(v);
+       return raw_atomic_long_fetch_dec_acquire(v);
 }
 
+/**
+ * atomic_long_fetch_dec_release() - atomic decrement with release ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_dec_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_dec_release(atomic_long_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_dec_release(v);
+       return raw_atomic_long_fetch_dec_release(v);
 }
 
+/**
+ * atomic_long_fetch_dec_relaxed() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_dec_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_dec_relaxed(atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_dec_relaxed(v);
+       return raw_atomic_long_fetch_dec_relaxed(v);
 }
 
+/**
+ * atomic_long_and() - atomic bitwise AND with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_and() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_long_and(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_long_and(i, v);
+       raw_atomic_long_and(i, v);
 }
 
+/**
+ * atomic_long_fetch_and() - atomic bitwise AND with full ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_and() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_and(long i, atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_and(i, v);
+       return raw_atomic_long_fetch_and(i, v);
 }
 
+/**
+ * atomic_long_fetch_and_acquire() - atomic bitwise AND with acquire ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_and_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_and_acquire(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_and_acquire(i, v);
+       return raw_atomic_long_fetch_and_acquire(i, v);
 }
 
+/**
+ * atomic_long_fetch_and_release() - atomic bitwise AND with release ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_and_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_and_release(long i, atomic_long_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_and_release(i, v);
+       return raw_atomic_long_fetch_and_release(i, v);
 }
 
+/**
+ * atomic_long_fetch_and_relaxed() - atomic bitwise AND with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_and_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_and_relaxed(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_and_relaxed(i, v);
+       return raw_atomic_long_fetch_and_relaxed(i, v);
 }
 
+/**
+ * atomic_long_andnot() - atomic bitwise AND NOT with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & ~@i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_andnot() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_long_andnot(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_long_andnot(i, v);
+       raw_atomic_long_andnot(i, v);
 }
 
+/**
+ * atomic_long_fetch_andnot() - atomic bitwise AND NOT with full ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & ~@i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_andnot() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_andnot(long i, atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_andnot(i, v);
+       return raw_atomic_long_fetch_andnot(i, v);
 }
 
+/**
+ * atomic_long_fetch_andnot_acquire() - atomic bitwise AND NOT with acquire ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & ~@i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_andnot_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_andnot_acquire(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_andnot_acquire(i, v);
+       return raw_atomic_long_fetch_andnot_acquire(i, v);
 }
 
+/**
+ * atomic_long_fetch_andnot_release() - atomic bitwise AND NOT with release ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & ~@i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_andnot_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_andnot_release(long i, atomic_long_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_andnot_release(i, v);
+       return raw_atomic_long_fetch_andnot_release(i, v);
 }
 
+/**
+ * atomic_long_fetch_andnot_relaxed() - atomic bitwise AND NOT with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & ~@i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_andnot_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_andnot_relaxed(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_andnot_relaxed(i, v);
+       return raw_atomic_long_fetch_andnot_relaxed(i, v);
 }
 
+/**
+ * atomic_long_or() - atomic bitwise OR with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v | @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_or() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_long_or(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_long_or(i, v);
+       raw_atomic_long_or(i, v);
 }
 
+/**
+ * atomic_long_fetch_or() - atomic bitwise OR with full ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v | @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_or() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_or(long i, atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_or(i, v);
+       return raw_atomic_long_fetch_or(i, v);
 }
 
+/**
+ * atomic_long_fetch_or_acquire() - atomic bitwise OR with acquire ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v | @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_or_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_or_acquire(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_or_acquire(i, v);
+       return raw_atomic_long_fetch_or_acquire(i, v);
 }
 
+/**
+ * atomic_long_fetch_or_release() - atomic bitwise OR with release ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v | @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_or_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_or_release(long i, atomic_long_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_or_release(i, v);
+       return raw_atomic_long_fetch_or_release(i, v);
 }
 
+/**
+ * atomic_long_fetch_or_relaxed() - atomic bitwise OR with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v | @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_or_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_or_relaxed(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_or_relaxed(i, v);
+       return raw_atomic_long_fetch_or_relaxed(i, v);
 }
 
+/**
+ * atomic_long_xor() - atomic bitwise XOR with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v ^ @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_xor() there.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
 atomic_long_xor(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       arch_atomic_long_xor(i, v);
+       raw_atomic_long_xor(i, v);
 }
 
+/**
+ * atomic_long_fetch_xor() - atomic bitwise XOR with full ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v ^ @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_xor() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_xor(long i, atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_xor(i, v);
+       return raw_atomic_long_fetch_xor(i, v);
 }
 
+/**
+ * atomic_long_fetch_xor_acquire() - atomic bitwise XOR with acquire ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v ^ @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_xor_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_xor_acquire(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_xor_acquire(i, v);
+       return raw_atomic_long_fetch_xor_acquire(i, v);
 }
 
+/**
+ * atomic_long_fetch_xor_release() - atomic bitwise XOR with release ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v ^ @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_xor_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_xor_release(long i, atomic_long_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_xor_release(i, v);
+       return raw_atomic_long_fetch_xor_release(i, v);
 }
 
+/**
+ * atomic_long_fetch_xor_relaxed() - atomic bitwise XOR with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v ^ @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_xor_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_xor_relaxed(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_xor_relaxed(i, v);
+       return raw_atomic_long_fetch_xor_relaxed(i, v);
 }
 
+/**
+ * atomic_long_xchg() - atomic exchange with full ordering
+ * @v: pointer to atomic_long_t
+ * @new: long value to assign
+ *
+ * Atomically updates @v to @new with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_xchg() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-atomic_long_xchg(atomic_long_t *v, long i)
+atomic_long_xchg(atomic_long_t *v, long new)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_xchg(v, i);
+       return raw_atomic_long_xchg(v, new);
 }
 
+/**
+ * atomic_long_xchg_acquire() - atomic exchange with acquire ordering
+ * @v: pointer to atomic_long_t
+ * @new: long value to assign
+ *
+ * Atomically updates @v to @new with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_xchg_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-atomic_long_xchg_acquire(atomic_long_t *v, long i)
+atomic_long_xchg_acquire(atomic_long_t *v, long new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_xchg_acquire(v, i);
+       return raw_atomic_long_xchg_acquire(v, new);
 }
 
+/**
+ * atomic_long_xchg_release() - atomic exchange with release ordering
+ * @v: pointer to atomic_long_t
+ * @new: long value to assign
+ *
+ * Atomically updates @v to @new with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_xchg_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-atomic_long_xchg_release(atomic_long_t *v, long i)
+atomic_long_xchg_release(atomic_long_t *v, long new)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_xchg_release(v, i);
+       return raw_atomic_long_xchg_release(v, new);
 }
 
+/**
+ * atomic_long_xchg_relaxed() - atomic exchange with relaxed ordering
+ * @v: pointer to atomic_long_t
+ * @new: long value to assign
+ *
+ * Atomically updates @v to @new with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_xchg_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-atomic_long_xchg_relaxed(atomic_long_t *v, long i)
+atomic_long_xchg_relaxed(atomic_long_t *v, long new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_xchg_relaxed(v, i);
+       return raw_atomic_long_xchg_relaxed(v, new);
 }
 
+/**
+ * atomic_long_cmpxchg() - atomic compare and exchange with full ordering
+ * @v: pointer to atomic_long_t
+ * @old: long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_cmpxchg() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_cmpxchg(atomic_long_t *v, long old, long new)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_cmpxchg(v, old, new);
+       return raw_atomic_long_cmpxchg(v, old, new);
 }
 
+/**
+ * atomic_long_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
+ * @v: pointer to atomic_long_t
+ * @old: long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_cmpxchg_acquire() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_cmpxchg_acquire(atomic_long_t *v, long old, long new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_cmpxchg_acquire(v, old, new);
+       return raw_atomic_long_cmpxchg_acquire(v, old, new);
 }
 
+/**
+ * atomic_long_cmpxchg_release() - atomic compare and exchange with release ordering
+ * @v: pointer to atomic_long_t
+ * @old: long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_cmpxchg_release() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_cmpxchg_release(atomic_long_t *v, long old, long new)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_cmpxchg_release(v, old, new);
+       return raw_atomic_long_cmpxchg_release(v, old, new);
 }
 
+/**
+ * atomic_long_cmpxchg_relaxed() - atomic compare and exchange with relaxed ordering
+ * @v: pointer to atomic_long_t
+ * @old: long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_cmpxchg_relaxed() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_cmpxchg_relaxed(atomic_long_t *v, long old, long new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_cmpxchg_relaxed(v, old, new);
+       return raw_atomic_long_cmpxchg_relaxed(v, old, new);
 }
 
+/**
+ * atomic_long_try_cmpxchg() - atomic compare and exchange with full ordering
+ * @v: pointer to atomic_long_t
+ * @old: pointer to long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with full ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_try_cmpxchg() there.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_try_cmpxchg(atomic_long_t *v, long *old, long new)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
        instrument_atomic_read_write(old, sizeof(*old));
-       return arch_atomic_long_try_cmpxchg(v, old, new);
-}
-
+       return raw_atomic_long_try_cmpxchg(v, old, new);
+}
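
/*
 * Illustrative sketch, not part of the patch above: the canonical retry loop
 * built on atomic_long_try_cmpxchg(). On failure the helper rewrites 'old'
 * with the value it found, so the loop never has to re-read @v by hand. The
 * clamped_add() helper and its limit are hypothetical.
 */
#include <linux/atomic.h>

static inline void clamped_add(atomic_long_t *v, long delta, long limit)
{
	long old = atomic_long_read(v);

	do {
		if (old + delta > limit)
			return;		/* would overshoot the cap, give up */
	} while (!atomic_long_try_cmpxchg(v, &old, old + delta));
}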
+
+/**
+ * atomic_long_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
+ * @v: pointer to atomic_long_t
+ * @old: pointer to long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with acquire ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_try_cmpxchg_acquire() there.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_try_cmpxchg_acquire(atomic_long_t *v, long *old, long new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
        instrument_atomic_read_write(old, sizeof(*old));
-       return arch_atomic_long_try_cmpxchg_acquire(v, old, new);
-}
-
+       return raw_atomic_long_try_cmpxchg_acquire(v, old, new);
+}
+
+/**
+ * atomic_long_try_cmpxchg_release() - atomic compare and exchange with release ordering
+ * @v: pointer to atomic_long_t
+ * @old: pointer to long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with release ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_try_cmpxchg_release() there.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_try_cmpxchg_release(atomic_long_t *v, long *old, long new)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
        instrument_atomic_read_write(old, sizeof(*old));
-       return arch_atomic_long_try_cmpxchg_release(v, old, new);
-}
-
+       return raw_atomic_long_try_cmpxchg_release(v, old, new);
+}
+
+/**
+ * atomic_long_try_cmpxchg_relaxed() - atomic compare and exchange with relaxed ordering
+ * @v: pointer to atomic_long_t
+ * @old: pointer to long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with relaxed ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_try_cmpxchg_relaxed() there.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_try_cmpxchg_relaxed(atomic_long_t *v, long *old, long new)
 {
        instrument_atomic_read_write(v, sizeof(*v));
        instrument_atomic_read_write(old, sizeof(*old));
-       return arch_atomic_long_try_cmpxchg_relaxed(v, old, new);
-}
-
+       return raw_atomic_long_try_cmpxchg_relaxed(v, old, new);
+}
+
+/**
+ * atomic_long_sub_and_test() - atomic subtract and test if zero with full ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_sub_and_test() there.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_sub_and_test(long i, atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_sub_and_test(i, v);
+       return raw_atomic_long_sub_and_test(i, v);
 }
 
+/**
+ * atomic_long_dec_and_test() - atomic decrement and test if zero with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_dec_and_test() there.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_dec_and_test(atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_dec_and_test(v);
+       return raw_atomic_long_dec_and_test(v);
 }
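
/*
 * Illustrative sketch, not part of the patch above: the classic "free on the
 * last put" pattern, relying on the full ordering of
 * atomic_long_dec_and_test() to order earlier accesses before the kfree().
 * struct blob and blob_put() are hypothetical.
 */
#include <linux/atomic.h>
#include <linux/slab.h>

struct blob {
	atomic_long_t refs;
	/* payload ... */
};

static inline void blob_put(struct blob *b)
{
	if (atomic_long_dec_and_test(&b->refs))
		kfree(b);
}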
 
+/**
+ * atomic_long_inc_and_test() - atomic increment and test if zero with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_inc_and_test() there.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_inc_and_test(atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_inc_and_test(v);
+       return raw_atomic_long_inc_and_test(v);
 }
 
+/**
+ * atomic_long_add_negative() - atomic add and test if negative with full ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_add_negative() there.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_add_negative(long i, atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_add_negative(i, v);
+       return raw_atomic_long_add_negative(i, v);
 }
 
+/**
+ * atomic_long_add_negative_acquire() - atomic add and test if negative with acquire ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_add_negative_acquire() there.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_add_negative_acquire(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_add_negative_acquire(i, v);
+       return raw_atomic_long_add_negative_acquire(i, v);
 }
 
+/**
+ * atomic_long_add_negative_release() - atomic add and test if negative with release ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_add_negative_release() there.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_add_negative_release(long i, atomic_long_t *v)
 {
        kcsan_release();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_add_negative_release(i, v);
+       return raw_atomic_long_add_negative_release(i, v);
 }
 
+/**
+ * atomic_long_add_negative_relaxed() - atomic add and test if negative with relaxed ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_add_negative_relaxed() there.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_add_negative_relaxed(long i, atomic_long_t *v)
 {
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_add_negative_relaxed(i, v);
+       return raw_atomic_long_add_negative_relaxed(i, v);
 }
 
+/**
+ * atomic_long_fetch_add_unless() - atomic add unless value with full ordering
+ * @v: pointer to atomic_long_t
+ * @a: long value to add
+ * @u: long value to compare with
+ *
+ * If (@v != @u), atomically updates @v to (@v + @a) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_fetch_add_unless() there.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
 atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_fetch_add_unless(v, a, u);
+       return raw_atomic_long_fetch_add_unless(v, a, u);
 }
 
+/**
+ * atomic_long_add_unless() - atomic add unless value with full ordering
+ * @v: pointer to atomic_long_t
+ * @a: long value to add
+ * @u: long value to compare with
+ *
+ * If (@v != @u), atomically updates @v to (@v + @a) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_add_unless() there.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_add_unless(atomic_long_t *v, long a, long u)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_add_unless(v, a, u);
+       return raw_atomic_long_add_unless(v, a, u);
 }
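
/*
 * Illustrative sketch, not part of the patch above: atomic_long_add_unless()
 * bumps the counter only while it is not parked at a sentinel value. The
 * FROZEN sentinel and usage_get() helper are hypothetical.
 */
#include <linux/atomic.h>

#define FROZEN	(-1L)	/* hypothetical "no new users accepted" marker */

static inline bool usage_get(atomic_long_t *usage)
{
	/* True if the count was incremented, false if it was frozen. */
	return atomic_long_add_unless(usage, 1, FROZEN);
}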
 
+/**
+ * atomic_long_inc_not_zero() - atomic increment unless zero with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * If (@v != 0), atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_inc_not_zero() there.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_inc_not_zero(atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_inc_not_zero(v);
+       return raw_atomic_long_inc_not_zero(v);
 }
 
+/**
+ * atomic_long_inc_unless_negative() - atomic increment unless negative with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * If (@v >= 0), atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_inc_unless_negative() there.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_inc_unless_negative(atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_inc_unless_negative(v);
+       return raw_atomic_long_inc_unless_negative(v);
 }
 
+/**
+ * atomic_long_dec_unless_positive() - atomic decrement unless positive with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * If (@v <= 0), atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_dec_unless_positive() there.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
 atomic_long_dec_unless_positive(atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_dec_unless_positive(v);
+       return raw_atomic_long_dec_unless_positive(v);
 }
 
+/**
+ * atomic_long_dec_if_positive() - atomic decrement if positive with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * If (@v > 0), atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Unsafe to use in noinstr code; use raw_atomic_long_dec_if_positive() there.
+ *
+ * Return: The old value of (@v - 1), regardless of whether @v was updated.
+ */
 static __always_inline long
 atomic_long_dec_if_positive(atomic_long_t *v)
 {
        kcsan_mb();
        instrument_atomic_read_write(v, sizeof(*v));
-       return arch_atomic_long_dec_if_positive(v);
+       return raw_atomic_long_dec_if_positive(v);
 }
 
 #define xchg(ptr, ...) \
@@ -1949,14 +4713,14 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        typeof(ptr) __ai_ptr = (ptr); \
        kcsan_mb(); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_xchg(__ai_ptr, __VA_ARGS__); \
+       raw_xchg(__ai_ptr, __VA_ARGS__); \
 })
 
 #define xchg_acquire(ptr, ...) \
 ({ \
        typeof(ptr) __ai_ptr = (ptr); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_xchg_acquire(__ai_ptr, __VA_ARGS__); \
+       raw_xchg_acquire(__ai_ptr, __VA_ARGS__); \
 })
 
 #define xchg_release(ptr, ...) \
@@ -1964,14 +4728,14 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        typeof(ptr) __ai_ptr = (ptr); \
        kcsan_release(); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_xchg_release(__ai_ptr, __VA_ARGS__); \
+       raw_xchg_release(__ai_ptr, __VA_ARGS__); \
 })
 
 #define xchg_relaxed(ptr, ...) \
 ({ \
        typeof(ptr) __ai_ptr = (ptr); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_xchg_relaxed(__ai_ptr, __VA_ARGS__); \
+       raw_xchg_relaxed(__ai_ptr, __VA_ARGS__); \
 })
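
/*
 * Illustrative sketch, not part of the patch above: the instrumented xchg()
 * wrapper also applies to plain pointer-sized variables; a common use is
 * detaching a producer-filled chain in a single step. struct work_item and
 * steal_pending() are hypothetical.
 */
#include <linux/atomic.h>

struct work_item {
	struct work_item *next;
	/* payload ... */
};

static inline struct work_item *steal_pending(struct work_item **head)
{
	/* Grab the whole chain; concurrent producers start a fresh one. */
	return xchg(head, NULL);
}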
 
 #define cmpxchg(ptr, ...) \
@@ -1979,14 +4743,14 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        typeof(ptr) __ai_ptr = (ptr); \
        kcsan_mb(); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_cmpxchg(__ai_ptr, __VA_ARGS__); \
+       raw_cmpxchg(__ai_ptr, __VA_ARGS__); \
 })
 
 #define cmpxchg_acquire(ptr, ...) \
 ({ \
        typeof(ptr) __ai_ptr = (ptr); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_cmpxchg_acquire(__ai_ptr, __VA_ARGS__); \
+       raw_cmpxchg_acquire(__ai_ptr, __VA_ARGS__); \
 })
 
 #define cmpxchg_release(ptr, ...) \
@@ -1994,14 +4758,14 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        typeof(ptr) __ai_ptr = (ptr); \
        kcsan_release(); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_cmpxchg_release(__ai_ptr, __VA_ARGS__); \
+       raw_cmpxchg_release(__ai_ptr, __VA_ARGS__); \
 })
 
 #define cmpxchg_relaxed(ptr, ...) \
 ({ \
        typeof(ptr) __ai_ptr = (ptr); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_cmpxchg_relaxed(__ai_ptr, __VA_ARGS__); \
+       raw_cmpxchg_relaxed(__ai_ptr, __VA_ARGS__); \
 })
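
/*
 * Illustrative sketch, not part of the patch above: cmpxchg() works on an
 * ordinary scalar as well, here claiming a slot only if nobody else got
 * there first. The state values and slot_claim() helper are hypothetical.
 */
#include <linux/atomic.h>

enum slot_state { SLOT_IDLE = 0, SLOT_BUSY = 1 };

static inline bool slot_claim(unsigned int *state)
{
	/* cmpxchg() returns the old value; success means it was SLOT_IDLE. */
	return cmpxchg(state, SLOT_IDLE, SLOT_BUSY) == SLOT_IDLE;
}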
 
 #define cmpxchg64(ptr, ...) \
@@ -2009,14 +4773,14 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        typeof(ptr) __ai_ptr = (ptr); \
        kcsan_mb(); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_cmpxchg64(__ai_ptr, __VA_ARGS__); \
+       raw_cmpxchg64(__ai_ptr, __VA_ARGS__); \
 })
 
 #define cmpxchg64_acquire(ptr, ...) \
 ({ \
        typeof(ptr) __ai_ptr = (ptr); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_cmpxchg64_acquire(__ai_ptr, __VA_ARGS__); \
+       raw_cmpxchg64_acquire(__ai_ptr, __VA_ARGS__); \
 })
 
 #define cmpxchg64_release(ptr, ...) \
@@ -2024,14 +4788,44 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        typeof(ptr) __ai_ptr = (ptr); \
        kcsan_release(); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_cmpxchg64_release(__ai_ptr, __VA_ARGS__); \
+       raw_cmpxchg64_release(__ai_ptr, __VA_ARGS__); \
 })
 
 #define cmpxchg64_relaxed(ptr, ...) \
 ({ \
        typeof(ptr) __ai_ptr = (ptr); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_cmpxchg64_relaxed(__ai_ptr, __VA_ARGS__); \
+       raw_cmpxchg64_relaxed(__ai_ptr, __VA_ARGS__); \
+})
+
+#define cmpxchg128(ptr, ...) \
+({ \
+       typeof(ptr) __ai_ptr = (ptr); \
+       kcsan_mb(); \
+       instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
+       raw_cmpxchg128(__ai_ptr, __VA_ARGS__); \
+})
+
+#define cmpxchg128_acquire(ptr, ...) \
+({ \
+       typeof(ptr) __ai_ptr = (ptr); \
+       instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
+       raw_cmpxchg128_acquire(__ai_ptr, __VA_ARGS__); \
+})
+
+#define cmpxchg128_release(ptr, ...) \
+({ \
+       typeof(ptr) __ai_ptr = (ptr); \
+       kcsan_release(); \
+       instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
+       raw_cmpxchg128_release(__ai_ptr, __VA_ARGS__); \
+})
+
+#define cmpxchg128_relaxed(ptr, ...) \
+({ \
+       typeof(ptr) __ai_ptr = (ptr); \
+       instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
+       raw_cmpxchg128_relaxed(__ai_ptr, __VA_ARGS__); \
 })
 
 #define try_cmpxchg(ptr, oldp, ...) \
@@ -2041,7 +4835,7 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        kcsan_mb(); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
        instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
-       arch_try_cmpxchg(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+       raw_try_cmpxchg(__ai_ptr, __ai_oldp, __VA_ARGS__); \
 })
 
 #define try_cmpxchg_acquire(ptr, oldp, ...) \
@@ -2050,7 +4844,7 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        typeof(oldp) __ai_oldp = (oldp); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
        instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
-       arch_try_cmpxchg_acquire(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+       raw_try_cmpxchg_acquire(__ai_ptr, __ai_oldp, __VA_ARGS__); \
 })
 
 #define try_cmpxchg_release(ptr, oldp, ...) \
@@ -2060,7 +4854,7 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        kcsan_release(); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
        instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
-       arch_try_cmpxchg_release(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+       raw_try_cmpxchg_release(__ai_ptr, __ai_oldp, __VA_ARGS__); \
 })
 
 #define try_cmpxchg_relaxed(ptr, oldp, ...) \
@@ -2069,7 +4863,7 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        typeof(oldp) __ai_oldp = (oldp); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
        instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
-       arch_try_cmpxchg_relaxed(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+       raw_try_cmpxchg_relaxed(__ai_ptr, __ai_oldp, __VA_ARGS__); \
 })
 
 #define try_cmpxchg64(ptr, oldp, ...) \
@@ -2079,7 +4873,7 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        kcsan_mb(); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
        instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
-       arch_try_cmpxchg64(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+       raw_try_cmpxchg64(__ai_ptr, __ai_oldp, __VA_ARGS__); \
 })
 
 #define try_cmpxchg64_acquire(ptr, oldp, ...) \
@@ -2088,7 +4882,7 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        typeof(oldp) __ai_oldp = (oldp); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
        instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
-       arch_try_cmpxchg64_acquire(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+       raw_try_cmpxchg64_acquire(__ai_ptr, __ai_oldp, __VA_ARGS__); \
 })
 
 #define try_cmpxchg64_release(ptr, oldp, ...) \
@@ -2098,7 +4892,7 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        kcsan_release(); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
        instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
-       arch_try_cmpxchg64_release(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+       raw_try_cmpxchg64_release(__ai_ptr, __ai_oldp, __VA_ARGS__); \
 })
 
 #define try_cmpxchg64_relaxed(ptr, oldp, ...) \
@@ -2107,21 +4901,66 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        typeof(oldp) __ai_oldp = (oldp); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
        instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
-       arch_try_cmpxchg64_relaxed(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+       raw_try_cmpxchg64_relaxed(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+
+#define try_cmpxchg128(ptr, oldp, ...) \
+({ \
+       typeof(ptr) __ai_ptr = (ptr); \
+       typeof(oldp) __ai_oldp = (oldp); \
+       kcsan_mb(); \
+       instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
+       instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
+       raw_try_cmpxchg128(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+
+#define try_cmpxchg128_acquire(ptr, oldp, ...) \
+({ \
+       typeof(ptr) __ai_ptr = (ptr); \
+       typeof(oldp) __ai_oldp = (oldp); \
+       instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
+       instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
+       raw_try_cmpxchg128_acquire(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+
+#define try_cmpxchg128_release(ptr, oldp, ...) \
+({ \
+       typeof(ptr) __ai_ptr = (ptr); \
+       typeof(oldp) __ai_oldp = (oldp); \
+       kcsan_release(); \
+       instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
+       instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
+       raw_try_cmpxchg128_release(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+
+#define try_cmpxchg128_relaxed(ptr, oldp, ...) \
+({ \
+       typeof(ptr) __ai_ptr = (ptr); \
+       typeof(oldp) __ai_oldp = (oldp); \
+       instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
+       instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
+       raw_try_cmpxchg128_relaxed(__ai_ptr, __ai_oldp, __VA_ARGS__); \
 })
 
 #define cmpxchg_local(ptr, ...) \
 ({ \
        typeof(ptr) __ai_ptr = (ptr); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_cmpxchg_local(__ai_ptr, __VA_ARGS__); \
+       raw_cmpxchg_local(__ai_ptr, __VA_ARGS__); \
 })
 
 #define cmpxchg64_local(ptr, ...) \
 ({ \
        typeof(ptr) __ai_ptr = (ptr); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_cmpxchg64_local(__ai_ptr, __VA_ARGS__); \
+       raw_cmpxchg64_local(__ai_ptr, __VA_ARGS__); \
+})
+
+#define cmpxchg128_local(ptr, ...) \
+({ \
+       typeof(ptr) __ai_ptr = (ptr); \
+       instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
+       raw_cmpxchg128_local(__ai_ptr, __VA_ARGS__); \
 })
 
 #define sync_cmpxchg(ptr, ...) \
@@ -2129,7 +4968,7 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        typeof(ptr) __ai_ptr = (ptr); \
        kcsan_mb(); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
-       arch_sync_cmpxchg(__ai_ptr, __VA_ARGS__); \
+       raw_sync_cmpxchg(__ai_ptr, __VA_ARGS__); \
 })
 
 #define try_cmpxchg_local(ptr, oldp, ...) \
@@ -2138,7 +4977,7 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        typeof(oldp) __ai_oldp = (oldp); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
        instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
-       arch_try_cmpxchg_local(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+       raw_try_cmpxchg_local(__ai_ptr, __ai_oldp, __VA_ARGS__); \
 })
 
 #define try_cmpxchg64_local(ptr, oldp, ...) \
@@ -2147,24 +4986,18 @@ atomic_long_dec_if_positive(atomic_long_t *v)
        typeof(oldp) __ai_oldp = (oldp); \
        instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
        instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
-       arch_try_cmpxchg64_local(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+       raw_try_cmpxchg64_local(__ai_ptr, __ai_oldp, __VA_ARGS__); \
 })
 
-#define cmpxchg_double(ptr, ...) \
+#define try_cmpxchg128_local(ptr, oldp, ...) \
 ({ \
        typeof(ptr) __ai_ptr = (ptr); \
-       kcsan_mb(); \
-       instrument_atomic_read_write(__ai_ptr, 2 * sizeof(*__ai_ptr)); \
-       arch_cmpxchg_double(__ai_ptr, __VA_ARGS__); \
+       typeof(oldp) __ai_oldp = (oldp); \
+       instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \
+       instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \
+       raw_try_cmpxchg128_local(__ai_ptr, __ai_oldp, __VA_ARGS__); \
 })
 
 
-#define cmpxchg_double_local(ptr, ...) \
-({ \
-       typeof(ptr) __ai_ptr = (ptr); \
-       instrument_atomic_read_write(__ai_ptr, 2 * sizeof(*__ai_ptr)); \
-       arch_cmpxchg_double_local(__ai_ptr, __VA_ARGS__); \
-})
-
 #endif /* _LINUX_ATOMIC_INSTRUMENTED_H */
-// 6b513a42e1a1b5962532a019b7fc91eaa044ad5e
+// 1568f875fef72097413caab8339120c065a39aa4
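
The hunks above move the instrumented wrappers from the old arch_ fallbacks onto the new raw_ layer, add 128-bit cmpxchg variants alongside the existing 32/64-bit ones, and drop cmpxchg_double()/cmpxchg_double_local() in favour of them. As a rough usage sketch only (not part of this diff), a caller on an architecture that implements the 128-bit primitives could update a 16-byte slot with the new try_cmpxchg128() wrapper along these lines, assuming the u128 type introduced for these operations:

	static void update_slot(u128 *slot, u128 new)
	{
		u128 old = *slot;

		/* On failure try_cmpxchg128() refreshes 'old', so just retry. */
		while (!try_cmpxchg128(slot, &old, new))
			;
	}

As with the 32/64-bit wrappers, the macro instruments the access for KCSAN/KASAN and then forwards to the corresponding raw_ implementation.
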
index 2fc51ba..c829471 100644
@@ -21,1030 +21,1778 @@ typedef atomic_t atomic_long_t;
 #define atomic_long_cond_read_relaxed  atomic_cond_read_relaxed
 #endif
 
-#ifdef CONFIG_64BIT
-
-static __always_inline long
-arch_atomic_long_read(const atomic_long_t *v)
-{
-       return arch_atomic64_read(v);
-}
-
-static __always_inline long
-arch_atomic_long_read_acquire(const atomic_long_t *v)
-{
-       return arch_atomic64_read_acquire(v);
-}
-
-static __always_inline void
-arch_atomic_long_set(atomic_long_t *v, long i)
-{
-       arch_atomic64_set(v, i);
-}
-
-static __always_inline void
-arch_atomic_long_set_release(atomic_long_t *v, long i)
-{
-       arch_atomic64_set_release(v, i);
-}
-
-static __always_inline void
-arch_atomic_long_add(long i, atomic_long_t *v)
-{
-       arch_atomic64_add(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_add_return(long i, atomic_long_t *v)
-{
-       return arch_atomic64_add_return(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_add_return_acquire(long i, atomic_long_t *v)
-{
-       return arch_atomic64_add_return_acquire(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_add_return_release(long i, atomic_long_t *v)
-{
-       return arch_atomic64_add_return_release(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_add_return_relaxed(long i, atomic_long_t *v)
-{
-       return arch_atomic64_add_return_relaxed(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_add(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_add(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_add_acquire(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_add_acquire(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_add_release(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_add_release(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_add_relaxed(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_add_relaxed(i, v);
-}
-
-static __always_inline void
-arch_atomic_long_sub(long i, atomic_long_t *v)
-{
-       arch_atomic64_sub(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_sub_return(long i, atomic_long_t *v)
-{
-       return arch_atomic64_sub_return(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_sub_return_acquire(long i, atomic_long_t *v)
-{
-       return arch_atomic64_sub_return_acquire(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_sub_return_release(long i, atomic_long_t *v)
-{
-       return arch_atomic64_sub_return_release(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_sub_return_relaxed(long i, atomic_long_t *v)
-{
-       return arch_atomic64_sub_return_relaxed(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_sub(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_sub(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_sub_acquire(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_sub_acquire(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_sub_release(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_sub_release(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_sub_relaxed(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_sub_relaxed(i, v);
-}
-
-static __always_inline void
-arch_atomic_long_inc(atomic_long_t *v)
-{
-       arch_atomic64_inc(v);
-}
-
-static __always_inline long
-arch_atomic_long_inc_return(atomic_long_t *v)
-{
-       return arch_atomic64_inc_return(v);
-}
-
-static __always_inline long
-arch_atomic_long_inc_return_acquire(atomic_long_t *v)
-{
-       return arch_atomic64_inc_return_acquire(v);
-}
-
-static __always_inline long
-arch_atomic_long_inc_return_release(atomic_long_t *v)
-{
-       return arch_atomic64_inc_return_release(v);
-}
-
-static __always_inline long
-arch_atomic_long_inc_return_relaxed(atomic_long_t *v)
-{
-       return arch_atomic64_inc_return_relaxed(v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_inc(atomic_long_t *v)
-{
-       return arch_atomic64_fetch_inc(v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_inc_acquire(atomic_long_t *v)
-{
-       return arch_atomic64_fetch_inc_acquire(v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_inc_release(atomic_long_t *v)
-{
-       return arch_atomic64_fetch_inc_release(v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_inc_relaxed(atomic_long_t *v)
-{
-       return arch_atomic64_fetch_inc_relaxed(v);
-}
-
-static __always_inline void
-arch_atomic_long_dec(atomic_long_t *v)
-{
-       arch_atomic64_dec(v);
-}
-
-static __always_inline long
-arch_atomic_long_dec_return(atomic_long_t *v)
-{
-       return arch_atomic64_dec_return(v);
-}
-
-static __always_inline long
-arch_atomic_long_dec_return_acquire(atomic_long_t *v)
-{
-       return arch_atomic64_dec_return_acquire(v);
-}
-
-static __always_inline long
-arch_atomic_long_dec_return_release(atomic_long_t *v)
-{
-       return arch_atomic64_dec_return_release(v);
-}
-
-static __always_inline long
-arch_atomic_long_dec_return_relaxed(atomic_long_t *v)
-{
-       return arch_atomic64_dec_return_relaxed(v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_dec(atomic_long_t *v)
-{
-       return arch_atomic64_fetch_dec(v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_dec_acquire(atomic_long_t *v)
-{
-       return arch_atomic64_fetch_dec_acquire(v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_dec_release(atomic_long_t *v)
-{
-       return arch_atomic64_fetch_dec_release(v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_dec_relaxed(atomic_long_t *v)
-{
-       return arch_atomic64_fetch_dec_relaxed(v);
-}
-
-static __always_inline void
-arch_atomic_long_and(long i, atomic_long_t *v)
-{
-       arch_atomic64_and(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_and(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_and(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_and_acquire(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_and_acquire(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_and_release(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_and_release(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_and_relaxed(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_and_relaxed(i, v);
-}
-
-static __always_inline void
-arch_atomic_long_andnot(long i, atomic_long_t *v)
-{
-       arch_atomic64_andnot(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_andnot(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_andnot(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_andnot_acquire(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_andnot_acquire(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_andnot_release(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_andnot_release(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_andnot_relaxed(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_andnot_relaxed(i, v);
-}
-
-static __always_inline void
-arch_atomic_long_or(long i, atomic_long_t *v)
-{
-       arch_atomic64_or(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_or(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_or(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_or_acquire(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_or_acquire(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_or_release(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_or_release(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_or_relaxed(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_or_relaxed(i, v);
-}
-
-static __always_inline void
-arch_atomic_long_xor(long i, atomic_long_t *v)
-{
-       arch_atomic64_xor(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_xor(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_xor(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_xor_acquire(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_xor_acquire(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_xor_release(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_xor_release(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_xor_relaxed(long i, atomic_long_t *v)
-{
-       return arch_atomic64_fetch_xor_relaxed(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_xchg(atomic_long_t *v, long i)
-{
-       return arch_atomic64_xchg(v, i);
-}
-
-static __always_inline long
-arch_atomic_long_xchg_acquire(atomic_long_t *v, long i)
-{
-       return arch_atomic64_xchg_acquire(v, i);
-}
-
-static __always_inline long
-arch_atomic_long_xchg_release(atomic_long_t *v, long i)
-{
-       return arch_atomic64_xchg_release(v, i);
-}
-
-static __always_inline long
-arch_atomic_long_xchg_relaxed(atomic_long_t *v, long i)
-{
-       return arch_atomic64_xchg_relaxed(v, i);
-}
-
-static __always_inline long
-arch_atomic_long_cmpxchg(atomic_long_t *v, long old, long new)
-{
-       return arch_atomic64_cmpxchg(v, old, new);
-}
-
-static __always_inline long
-arch_atomic_long_cmpxchg_acquire(atomic_long_t *v, long old, long new)
-{
-       return arch_atomic64_cmpxchg_acquire(v, old, new);
-}
-
-static __always_inline long
-arch_atomic_long_cmpxchg_release(atomic_long_t *v, long old, long new)
-{
-       return arch_atomic64_cmpxchg_release(v, old, new);
-}
-
-static __always_inline long
-arch_atomic_long_cmpxchg_relaxed(atomic_long_t *v, long old, long new)
-{
-       return arch_atomic64_cmpxchg_relaxed(v, old, new);
-}
-
-static __always_inline bool
-arch_atomic_long_try_cmpxchg(atomic_long_t *v, long *old, long new)
-{
-       return arch_atomic64_try_cmpxchg(v, (s64 *)old, new);
-}
-
-static __always_inline bool
-arch_atomic_long_try_cmpxchg_acquire(atomic_long_t *v, long *old, long new)
-{
-       return arch_atomic64_try_cmpxchg_acquire(v, (s64 *)old, new);
-}
-
-static __always_inline bool
-arch_atomic_long_try_cmpxchg_release(atomic_long_t *v, long *old, long new)
-{
-       return arch_atomic64_try_cmpxchg_release(v, (s64 *)old, new);
-}
-
-static __always_inline bool
-arch_atomic_long_try_cmpxchg_relaxed(atomic_long_t *v, long *old, long new)
-{
-       return arch_atomic64_try_cmpxchg_relaxed(v, (s64 *)old, new);
-}
-
-static __always_inline bool
-arch_atomic_long_sub_and_test(long i, atomic_long_t *v)
-{
-       return arch_atomic64_sub_and_test(i, v);
-}
-
-static __always_inline bool
-arch_atomic_long_dec_and_test(atomic_long_t *v)
-{
-       return arch_atomic64_dec_and_test(v);
-}
-
-static __always_inline bool
-arch_atomic_long_inc_and_test(atomic_long_t *v)
-{
-       return arch_atomic64_inc_and_test(v);
-}
-
-static __always_inline bool
-arch_atomic_long_add_negative(long i, atomic_long_t *v)
-{
-       return arch_atomic64_add_negative(i, v);
-}
-
-static __always_inline bool
-arch_atomic_long_add_negative_acquire(long i, atomic_long_t *v)
-{
-       return arch_atomic64_add_negative_acquire(i, v);
-}
-
-static __always_inline bool
-arch_atomic_long_add_negative_release(long i, atomic_long_t *v)
-{
-       return arch_atomic64_add_negative_release(i, v);
-}
-
-static __always_inline bool
-arch_atomic_long_add_negative_relaxed(long i, atomic_long_t *v)
-{
-       return arch_atomic64_add_negative_relaxed(i, v);
-}
-
-static __always_inline long
-arch_atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u)
-{
-       return arch_atomic64_fetch_add_unless(v, a, u);
-}
-
-static __always_inline bool
-arch_atomic_long_add_unless(atomic_long_t *v, long a, long u)
-{
-       return arch_atomic64_add_unless(v, a, u);
-}
-
-static __always_inline bool
-arch_atomic_long_inc_not_zero(atomic_long_t *v)
-{
-       return arch_atomic64_inc_not_zero(v);
-}
-
-static __always_inline bool
-arch_atomic_long_inc_unless_negative(atomic_long_t *v)
-{
-       return arch_atomic64_inc_unless_negative(v);
-}
-
-static __always_inline bool
-arch_atomic_long_dec_unless_positive(atomic_long_t *v)
-{
-       return arch_atomic64_dec_unless_positive(v);
-}
-
-static __always_inline long
-arch_atomic_long_dec_if_positive(atomic_long_t *v)
-{
-       return arch_atomic64_dec_if_positive(v);
-}
-
-#else /* CONFIG_64BIT */
-
-static __always_inline long
-arch_atomic_long_read(const atomic_long_t *v)
+/**
+ * raw_atomic_long_read() - atomic load with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically loads the value of @v with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_read() elsewhere.
+ *
+ * Return: The value loaded from @v.
+ */
+static __always_inline long
+raw_atomic_long_read(const atomic_long_t *v)
 {
-       return arch_atomic_read(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_read(v);
+#else
+       return raw_atomic_read(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_read_acquire() - atomic load with acquire ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically loads the value of @v with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_read_acquire() elsewhere.
+ *
+ * Return: The value loaded from @v.
+ */
 static __always_inline long
-arch_atomic_long_read_acquire(const atomic_long_t *v)
+raw_atomic_long_read_acquire(const atomic_long_t *v)
 {
-       return arch_atomic_read_acquire(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_read_acquire(v);
+#else
+       return raw_atomic_read_acquire(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_set() - atomic set with relaxed ordering
+ * @v: pointer to atomic_long_t
+ * @i: long value to assign
+ *
+ * Atomically sets @v to @i with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_set() elsewhere.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
-arch_atomic_long_set(atomic_long_t *v, long i)
+raw_atomic_long_set(atomic_long_t *v, long i)
 {
-       arch_atomic_set(v, i);
+#ifdef CONFIG_64BIT
+       raw_atomic64_set(v, i);
+#else
+       raw_atomic_set(v, i);
+#endif
 }
 
+/**
+ * raw_atomic_long_set_release() - atomic set with release ordering
+ * @v: pointer to atomic_long_t
+ * @i: long value to assign
+ *
+ * Atomically sets @v to @i with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_set_release() elsewhere.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
-arch_atomic_long_set_release(atomic_long_t *v, long i)
+raw_atomic_long_set_release(atomic_long_t *v, long i)
 {
-       arch_atomic_set_release(v, i);
+#ifdef CONFIG_64BIT
+       raw_atomic64_set_release(v, i);
+#else
+       raw_atomic_set_release(v, i);
+#endif
 }
 
+/**
+ * raw_atomic_long_add() - atomic add with relaxed ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_add() elsewhere.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
-arch_atomic_long_add(long i, atomic_long_t *v)
+raw_atomic_long_add(long i, atomic_long_t *v)
 {
-       arch_atomic_add(i, v);
+#ifdef CONFIG_64BIT
+       raw_atomic64_add(i, v);
+#else
+       raw_atomic_add(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_add_return() - atomic add with full ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_add_return() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_add_return(long i, atomic_long_t *v)
+raw_atomic_long_add_return(long i, atomic_long_t *v)
 {
-       return arch_atomic_add_return(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_add_return(i, v);
+#else
+       return raw_atomic_add_return(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_add_return_acquire() - atomic add with acquire ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_add_return_acquire() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_add_return_acquire(long i, atomic_long_t *v)
+raw_atomic_long_add_return_acquire(long i, atomic_long_t *v)
 {
-       return arch_atomic_add_return_acquire(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_add_return_acquire(i, v);
+#else
+       return raw_atomic_add_return_acquire(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_add_return_release() - atomic add with release ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_add_return_release() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_add_return_release(long i, atomic_long_t *v)
+raw_atomic_long_add_return_release(long i, atomic_long_t *v)
 {
-       return arch_atomic_add_return_release(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_add_return_release(i, v);
+#else
+       return raw_atomic_add_return_release(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_add_return_relaxed() - atomic add with relaxed ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_add_return_relaxed() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_add_return_relaxed(long i, atomic_long_t *v)
+raw_atomic_long_add_return_relaxed(long i, atomic_long_t *v)
 {
-       return arch_atomic_add_return_relaxed(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_add_return_relaxed(i, v);
+#else
+       return raw_atomic_add_return_relaxed(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_add() - atomic add with full ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_add() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_add(long i, atomic_long_t *v)
+raw_atomic_long_fetch_add(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_add(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_add(i, v);
+#else
+       return raw_atomic_fetch_add(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_add_acquire() - atomic add with acquire ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_add_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_add_acquire(long i, atomic_long_t *v)
+raw_atomic_long_fetch_add_acquire(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_add_acquire(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_add_acquire(i, v);
+#else
+       return raw_atomic_fetch_add_acquire(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_add_release() - atomic add with release ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_add_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_add_release(long i, atomic_long_t *v)
+raw_atomic_long_fetch_add_release(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_add_release(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_add_release(i, v);
+#else
+       return raw_atomic_fetch_add_release(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_add_relaxed() - atomic add with relaxed ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_add_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_add_relaxed(long i, atomic_long_t *v)
+raw_atomic_long_fetch_add_relaxed(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_add_relaxed(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_add_relaxed(i, v);
+#else
+       return raw_atomic_fetch_add_relaxed(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_sub() - atomic subtract with relaxed ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_sub() elsewhere.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
-arch_atomic_long_sub(long i, atomic_long_t *v)
+raw_atomic_long_sub(long i, atomic_long_t *v)
 {
-       arch_atomic_sub(i, v);
+#ifdef CONFIG_64BIT
+       raw_atomic64_sub(i, v);
+#else
+       raw_atomic_sub(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_sub_return() - atomic subtract with full ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_sub_return() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_sub_return(long i, atomic_long_t *v)
+raw_atomic_long_sub_return(long i, atomic_long_t *v)
 {
-       return arch_atomic_sub_return(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_sub_return(i, v);
+#else
+       return raw_atomic_sub_return(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_sub_return_acquire() - atomic subtract with acquire ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_sub_return_acquire() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_sub_return_acquire(long i, atomic_long_t *v)
+raw_atomic_long_sub_return_acquire(long i, atomic_long_t *v)
 {
-       return arch_atomic_sub_return_acquire(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_sub_return_acquire(i, v);
+#else
+       return raw_atomic_sub_return_acquire(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_sub_return_release() - atomic subtract with release ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_sub_return_release() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_sub_return_release(long i, atomic_long_t *v)
+raw_atomic_long_sub_return_release(long i, atomic_long_t *v)
 {
-       return arch_atomic_sub_return_release(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_sub_return_release(i, v);
+#else
+       return raw_atomic_sub_return_release(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_sub_return_relaxed() - atomic subtract with relaxed ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_sub_return_relaxed() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_sub_return_relaxed(long i, atomic_long_t *v)
+raw_atomic_long_sub_return_relaxed(long i, atomic_long_t *v)
 {
-       return arch_atomic_sub_return_relaxed(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_sub_return_relaxed(i, v);
+#else
+       return raw_atomic_sub_return_relaxed(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_sub() - atomic subtract with full ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_sub() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_sub(long i, atomic_long_t *v)
+raw_atomic_long_fetch_sub(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_sub(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_sub(i, v);
+#else
+       return raw_atomic_fetch_sub(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_sub_acquire() - atomic subtract with acquire ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_sub_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_sub_acquire(long i, atomic_long_t *v)
+raw_atomic_long_fetch_sub_acquire(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_sub_acquire(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_sub_acquire(i, v);
+#else
+       return raw_atomic_fetch_sub_acquire(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_sub_release() - atomic subtract with release ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_sub_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_sub_release(long i, atomic_long_t *v)
+raw_atomic_long_fetch_sub_release(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_sub_release(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_sub_release(i, v);
+#else
+       return raw_atomic_fetch_sub_release(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_sub_relaxed() - atomic subtract with relaxed ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_sub_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_sub_relaxed(long i, atomic_long_t *v)
+raw_atomic_long_fetch_sub_relaxed(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_sub_relaxed(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_sub_relaxed(i, v);
+#else
+       return raw_atomic_fetch_sub_relaxed(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_inc() - atomic increment with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_inc() elsewhere.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
-arch_atomic_long_inc(atomic_long_t *v)
+raw_atomic_long_inc(atomic_long_t *v)
 {
-       arch_atomic_inc(v);
+#ifdef CONFIG_64BIT
+       raw_atomic64_inc(v);
+#else
+       raw_atomic_inc(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_inc_return() - atomic increment with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_inc_return() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_inc_return(atomic_long_t *v)
+raw_atomic_long_inc_return(atomic_long_t *v)
 {
-       return arch_atomic_inc_return(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_inc_return(v);
+#else
+       return raw_atomic_inc_return(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_inc_return_acquire() - atomic increment with acquire ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_inc_return_acquire() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_inc_return_acquire(atomic_long_t *v)
+raw_atomic_long_inc_return_acquire(atomic_long_t *v)
 {
-       return arch_atomic_inc_return_acquire(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_inc_return_acquire(v);
+#else
+       return raw_atomic_inc_return_acquire(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_inc_return_release() - atomic increment with release ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_inc_return_release() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_inc_return_release(atomic_long_t *v)
+raw_atomic_long_inc_return_release(atomic_long_t *v)
 {
-       return arch_atomic_inc_return_release(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_inc_return_release(v);
+#else
+       return raw_atomic_inc_return_release(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_inc_return_relaxed() - atomic increment with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_inc_return_relaxed() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_inc_return_relaxed(atomic_long_t *v)
+raw_atomic_long_inc_return_relaxed(atomic_long_t *v)
 {
-       return arch_atomic_inc_return_relaxed(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_inc_return_relaxed(v);
+#else
+       return raw_atomic_inc_return_relaxed(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_inc() - atomic increment with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_inc() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_inc(atomic_long_t *v)
+raw_atomic_long_fetch_inc(atomic_long_t *v)
 {
-       return arch_atomic_fetch_inc(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_inc(v);
+#else
+       return raw_atomic_fetch_inc(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_inc_acquire() - atomic increment with acquire ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_inc_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_inc_acquire(atomic_long_t *v)
+raw_atomic_long_fetch_inc_acquire(atomic_long_t *v)
 {
-       return arch_atomic_fetch_inc_acquire(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_inc_acquire(v);
+#else
+       return raw_atomic_fetch_inc_acquire(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_inc_release() - atomic increment with release ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_inc_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_inc_release(atomic_long_t *v)
+raw_atomic_long_fetch_inc_release(atomic_long_t *v)
 {
-       return arch_atomic_fetch_inc_release(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_inc_release(v);
+#else
+       return raw_atomic_fetch_inc_release(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_inc_relaxed() - atomic increment with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_inc_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_inc_relaxed(atomic_long_t *v)
+raw_atomic_long_fetch_inc_relaxed(atomic_long_t *v)
 {
-       return arch_atomic_fetch_inc_relaxed(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_inc_relaxed(v);
+#else
+       return raw_atomic_fetch_inc_relaxed(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_dec() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_dec() elsewhere.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
-arch_atomic_long_dec(atomic_long_t *v)
+raw_atomic_long_dec(atomic_long_t *v)
 {
-       arch_atomic_dec(v);
+#ifdef CONFIG_64BIT
+       raw_atomic64_dec(v);
+#else
+       raw_atomic_dec(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_dec_return() - atomic decrement with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_dec_return() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_dec_return(atomic_long_t *v)
+raw_atomic_long_dec_return(atomic_long_t *v)
 {
-       return arch_atomic_dec_return(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_dec_return(v);
+#else
+       return raw_atomic_dec_return(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_dec_return_acquire() - atomic decrement with acquire ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_dec_return_acquire() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_dec_return_acquire(atomic_long_t *v)
+raw_atomic_long_dec_return_acquire(atomic_long_t *v)
 {
-       return arch_atomic_dec_return_acquire(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_dec_return_acquire(v);
+#else
+       return raw_atomic_dec_return_acquire(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_dec_return_release() - atomic decrement with release ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_dec_return_release() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_dec_return_release(atomic_long_t *v)
+raw_atomic_long_dec_return_release(atomic_long_t *v)
 {
-       return arch_atomic_dec_return_release(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_dec_return_release(v);
+#else
+       return raw_atomic_dec_return_release(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_dec_return_relaxed() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_dec_return_relaxed() elsewhere.
+ *
+ * Return: The updated value of @v.
+ */
 static __always_inline long
-arch_atomic_long_dec_return_relaxed(atomic_long_t *v)
+raw_atomic_long_dec_return_relaxed(atomic_long_t *v)
 {
-       return arch_atomic_dec_return_relaxed(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_dec_return_relaxed(v);
+#else
+       return raw_atomic_dec_return_relaxed(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_dec() - atomic decrement with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_dec() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_dec(atomic_long_t *v)
+raw_atomic_long_fetch_dec(atomic_long_t *v)
 {
-       return arch_atomic_fetch_dec(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_dec(v);
+#else
+       return raw_atomic_fetch_dec(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_dec_acquire() - atomic decrement with acquire ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_dec_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_dec_acquire(atomic_long_t *v)
+raw_atomic_long_fetch_dec_acquire(atomic_long_t *v)
 {
-       return arch_atomic_fetch_dec_acquire(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_dec_acquire(v);
+#else
+       return raw_atomic_fetch_dec_acquire(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_dec_release() - atomic decrement with release ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_dec_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_dec_release(atomic_long_t *v)
+raw_atomic_long_fetch_dec_release(atomic_long_t *v)
 {
-       return arch_atomic_fetch_dec_release(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_dec_release(v);
+#else
+       return raw_atomic_fetch_dec_release(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_dec_relaxed() - atomic decrement with relaxed ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_dec_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_dec_relaxed(atomic_long_t *v)
+raw_atomic_long_fetch_dec_relaxed(atomic_long_t *v)
 {
-       return arch_atomic_fetch_dec_relaxed(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_dec_relaxed(v);
+#else
+       return raw_atomic_fetch_dec_relaxed(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_and() - atomic bitwise AND with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_and() elsewhere.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
-arch_atomic_long_and(long i, atomic_long_t *v)
+raw_atomic_long_and(long i, atomic_long_t *v)
 {
-       arch_atomic_and(i, v);
+#ifdef CONFIG_64BIT
+       raw_atomic64_and(i, v);
+#else
+       raw_atomic_and(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_and() - atomic bitwise AND with full ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_and() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_and(long i, atomic_long_t *v)
+raw_atomic_long_fetch_and(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_and(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_and(i, v);
+#else
+       return raw_atomic_fetch_and(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_and_acquire() - atomic bitwise AND with acquire ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_and_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_and_acquire(long i, atomic_long_t *v)
+raw_atomic_long_fetch_and_acquire(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_and_acquire(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_and_acquire(i, v);
+#else
+       return raw_atomic_fetch_and_acquire(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_and_release() - atomic bitwise AND with release ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_and_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_and_release(long i, atomic_long_t *v)
+raw_atomic_long_fetch_and_release(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_and_release(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_and_release(i, v);
+#else
+       return raw_atomic_fetch_and_release(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_and_relaxed() - atomic bitwise AND with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_and_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_and_relaxed(long i, atomic_long_t *v)
+raw_atomic_long_fetch_and_relaxed(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_and_relaxed(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_and_relaxed(i, v);
+#else
+       return raw_atomic_fetch_and_relaxed(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_andnot() - atomic bitwise AND NOT with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & ~@i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_andnot() elsewhere.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
-arch_atomic_long_andnot(long i, atomic_long_t *v)
+raw_atomic_long_andnot(long i, atomic_long_t *v)
 {
-       arch_atomic_andnot(i, v);
+#ifdef CONFIG_64BIT
+       raw_atomic64_andnot(i, v);
+#else
+       raw_atomic_andnot(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_andnot() - atomic bitwise AND NOT with full ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & ~@i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_andnot() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_andnot(long i, atomic_long_t *v)
+raw_atomic_long_fetch_andnot(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_andnot(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_andnot(i, v);
+#else
+       return raw_atomic_fetch_andnot(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_andnot_acquire() - atomic bitwise AND NOT with acquire ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & ~@i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_andnot_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_andnot_acquire(long i, atomic_long_t *v)
+raw_atomic_long_fetch_andnot_acquire(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_andnot_acquire(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_andnot_acquire(i, v);
+#else
+       return raw_atomic_fetch_andnot_acquire(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_andnot_release() - atomic bitwise AND NOT with release ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & ~@i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_andnot_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_andnot_release(long i, atomic_long_t *v)
+raw_atomic_long_fetch_andnot_release(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_andnot_release(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_andnot_release(i, v);
+#else
+       return raw_atomic_fetch_andnot_release(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_andnot_relaxed() - atomic bitwise AND NOT with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v & ~@i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_andnot_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_andnot_relaxed(long i, atomic_long_t *v)
+raw_atomic_long_fetch_andnot_relaxed(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_andnot_relaxed(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_andnot_relaxed(i, v);
+#else
+       return raw_atomic_fetch_andnot_relaxed(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_or() - atomic bitwise OR with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v | @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_or() elsewhere.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
-arch_atomic_long_or(long i, atomic_long_t *v)
+raw_atomic_long_or(long i, atomic_long_t *v)
 {
-       arch_atomic_or(i, v);
+#ifdef CONFIG_64BIT
+       raw_atomic64_or(i, v);
+#else
+       raw_atomic_or(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_or() - atomic bitwise OR with full ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v | @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_or() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_or(long i, atomic_long_t *v)
+raw_atomic_long_fetch_or(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_or(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_or(i, v);
+#else
+       return raw_atomic_fetch_or(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_or_acquire() - atomic bitwise OR with acquire ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v | @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_or_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_or_acquire(long i, atomic_long_t *v)
+raw_atomic_long_fetch_or_acquire(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_or_acquire(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_or_acquire(i, v);
+#else
+       return raw_atomic_fetch_or_acquire(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_or_release() - atomic bitwise OR with release ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v | @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_or_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_or_release(long i, atomic_long_t *v)
+raw_atomic_long_fetch_or_release(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_or_release(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_or_release(i, v);
+#else
+       return raw_atomic_fetch_or_release(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_or_relaxed() - atomic bitwise OR with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v | @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_or_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_or_relaxed(long i, atomic_long_t *v)
+raw_atomic_long_fetch_or_relaxed(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_or_relaxed(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_or_relaxed(i, v);
+#else
+       return raw_atomic_fetch_or_relaxed(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_xor() - atomic bitwise XOR with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v ^ @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_xor() elsewhere.
+ *
+ * Return: Nothing.
+ */
 static __always_inline void
-arch_atomic_long_xor(long i, atomic_long_t *v)
+raw_atomic_long_xor(long i, atomic_long_t *v)
 {
-       arch_atomic_xor(i, v);
+#ifdef CONFIG_64BIT
+       raw_atomic64_xor(i, v);
+#else
+       raw_atomic_xor(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_xor() - atomic bitwise XOR with full ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v ^ @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_xor() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_xor(long i, atomic_long_t *v)
+raw_atomic_long_fetch_xor(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_xor(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_xor(i, v);
+#else
+       return raw_atomic_fetch_xor(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_xor_acquire() - atomic bitwise XOR with acquire ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v ^ @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_xor_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_xor_acquire(long i, atomic_long_t *v)
+raw_atomic_long_fetch_xor_acquire(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_xor_acquire(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_xor_acquire(i, v);
+#else
+       return raw_atomic_fetch_xor_acquire(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_xor_release() - atomic bitwise XOR with release ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v ^ @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_xor_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_xor_release(long i, atomic_long_t *v)
+raw_atomic_long_fetch_xor_release(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_xor_release(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_xor_release(i, v);
+#else
+       return raw_atomic_fetch_xor_release(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_xor_relaxed() - atomic bitwise XOR with relaxed ordering
+ * @i: long value
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v ^ @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_xor_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_xor_relaxed(long i, atomic_long_t *v)
+raw_atomic_long_fetch_xor_relaxed(long i, atomic_long_t *v)
 {
-       return arch_atomic_fetch_xor_relaxed(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_xor_relaxed(i, v);
+#else
+       return raw_atomic_fetch_xor_relaxed(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_xchg() - atomic exchange with full ordering
+ * @v: pointer to atomic_long_t
+ * @new: long value to assign
+ *
+ * Atomically updates @v to @new with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_xchg() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_xchg(atomic_long_t *v, long i)
+raw_atomic_long_xchg(atomic_long_t *v, long new)
 {
-       return arch_atomic_xchg(v, i);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_xchg(v, new);
+#else
+       return raw_atomic_xchg(v, new);
+#endif
 }
 
+/**
+ * raw_atomic_long_xchg_acquire() - atomic exchange with acquire ordering
+ * @v: pointer to atomic_long_t
+ * @new: long value to assign
+ *
+ * Atomically updates @v to @new with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_xchg_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_xchg_acquire(atomic_long_t *v, long i)
+raw_atomic_long_xchg_acquire(atomic_long_t *v, long new)
 {
-       return arch_atomic_xchg_acquire(v, i);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_xchg_acquire(v, new);
+#else
+       return raw_atomic_xchg_acquire(v, new);
+#endif
 }
 
+/**
+ * raw_atomic_long_xchg_release() - atomic exchange with release ordering
+ * @v: pointer to atomic_long_t
+ * @new: long value to assign
+ *
+ * Atomically updates @v to @new with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_xchg_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_xchg_release(atomic_long_t *v, long i)
+raw_atomic_long_xchg_release(atomic_long_t *v, long new)
 {
-       return arch_atomic_xchg_release(v, i);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_xchg_release(v, new);
+#else
+       return raw_atomic_xchg_release(v, new);
+#endif
 }
 
+/**
+ * raw_atomic_long_xchg_relaxed() - atomic exchange with relaxed ordering
+ * @v: pointer to atomic_long_t
+ * @new: long value to assign
+ *
+ * Atomically updates @v to @new with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_xchg_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_xchg_relaxed(atomic_long_t *v, long i)
+raw_atomic_long_xchg_relaxed(atomic_long_t *v, long new)
 {
-       return arch_atomic_xchg_relaxed(v, i);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_xchg_relaxed(v, new);
+#else
+       return raw_atomic_xchg_relaxed(v, new);
+#endif
 }
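
As a usage sketch (hypothetical caller and field names, not part of this patch), the exchange variants are typically used to publish a new value while acting on whatever was stored before:

        static void flush_pending(atomic_long_t *pending)
        {
                /* Atomically take the accumulated count, leaving zero behind. */
                long taken = raw_atomic_long_xchg(pending, 0);

                if (taken)
                        consume(taken);         /* hypothetical consumer */
        }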
 
+/**
+ * raw_atomic_long_cmpxchg() - atomic compare and exchange with full ordering
+ * @v: pointer to atomic_long_t
+ * @old: long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_cmpxchg() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_cmpxchg(atomic_long_t *v, long old, long new)
+raw_atomic_long_cmpxchg(atomic_long_t *v, long old, long new)
 {
-       return arch_atomic_cmpxchg(v, old, new);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_cmpxchg(v, old, new);
+#else
+       return raw_atomic_cmpxchg(v, old, new);
+#endif
 }
 
+/**
+ * raw_atomic_long_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
+ * @v: pointer to atomic_long_t
+ * @old: long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_cmpxchg_acquire() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_cmpxchg_acquire(atomic_long_t *v, long old, long new)
+raw_atomic_long_cmpxchg_acquire(atomic_long_t *v, long old, long new)
 {
-       return arch_atomic_cmpxchg_acquire(v, old, new);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_cmpxchg_acquire(v, old, new);
+#else
+       return raw_atomic_cmpxchg_acquire(v, old, new);
+#endif
 }
 
+/**
+ * raw_atomic_long_cmpxchg_release() - atomic compare and exchange with release ordering
+ * @v: pointer to atomic_long_t
+ * @old: long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_cmpxchg_release() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_cmpxchg_release(atomic_long_t *v, long old, long new)
+raw_atomic_long_cmpxchg_release(atomic_long_t *v, long old, long new)
 {
-       return arch_atomic_cmpxchg_release(v, old, new);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_cmpxchg_release(v, old, new);
+#else
+       return raw_atomic_cmpxchg_release(v, old, new);
+#endif
 }
 
+/**
+ * raw_atomic_long_cmpxchg_relaxed() - atomic compare and exchange with relaxed ordering
+ * @v: pointer to atomic_long_t
+ * @old: long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_cmpxchg_relaxed() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_cmpxchg_relaxed(atomic_long_t *v, long old, long new)
+raw_atomic_long_cmpxchg_relaxed(atomic_long_t *v, long old, long new)
 {
-       return arch_atomic_cmpxchg_relaxed(v, old, new);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_cmpxchg_relaxed(v, old, new);
+#else
+       return raw_atomic_cmpxchg_relaxed(v, old, new);
+#endif
 }
 
+/**
+ * raw_atomic_long_try_cmpxchg() - atomic compare and exchange with full ordering
+ * @v: pointer to atomic_long_t
+ * @old: pointer to long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with full ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_try_cmpxchg() elsewhere.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_try_cmpxchg(atomic_long_t *v, long *old, long new)
+raw_atomic_long_try_cmpxchg(atomic_long_t *v, long *old, long new)
 {
-       return arch_atomic_try_cmpxchg(v, (int *)old, new);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_try_cmpxchg(v, (s64 *)old, new);
+#else
+       return raw_atomic_try_cmpxchg(v, (int *)old, new);
+#endif
 }
 
+/**
+ * raw_atomic_long_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
+ * @v: pointer to atomic_long_t
+ * @old: pointer to long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with acquire ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_try_cmpxchg_acquire() elsewhere.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_try_cmpxchg_acquire(atomic_long_t *v, long *old, long new)
+raw_atomic_long_try_cmpxchg_acquire(atomic_long_t *v, long *old, long new)
 {
-       return arch_atomic_try_cmpxchg_acquire(v, (int *)old, new);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_try_cmpxchg_acquire(v, (s64 *)old, new);
+#else
+       return raw_atomic_try_cmpxchg_acquire(v, (int *)old, new);
+#endif
 }
 
+/**
+ * raw_atomic_long_try_cmpxchg_release() - atomic compare and exchange with release ordering
+ * @v: pointer to atomic_long_t
+ * @old: pointer to long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with release ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_try_cmpxchg_release() elsewhere.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_try_cmpxchg_release(atomic_long_t *v, long *old, long new)
+raw_atomic_long_try_cmpxchg_release(atomic_long_t *v, long *old, long new)
 {
-       return arch_atomic_try_cmpxchg_release(v, (int *)old, new);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_try_cmpxchg_release(v, (s64 *)old, new);
+#else
+       return raw_atomic_try_cmpxchg_release(v, (int *)old, new);
+#endif
 }
 
+/**
+ * raw_atomic_long_try_cmpxchg_relaxed() - atomic compare and exchange with relaxed ordering
+ * @v: pointer to atomic_long_t
+ * @old: pointer to long value to compare with
+ * @new: long value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with relaxed ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_try_cmpxchg_relaxed() elsewhere.
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_try_cmpxchg_relaxed(atomic_long_t *v, long *old, long new)
+raw_atomic_long_try_cmpxchg_relaxed(atomic_long_t *v, long *old, long new)
 {
-       return arch_atomic_try_cmpxchg_relaxed(v, (int *)old, new);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_try_cmpxchg_relaxed(v, (s64 *)old, new);
+#else
+       return raw_atomic_try_cmpxchg_relaxed(v, (int *)old, new);
+#endif
 }
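
A common pattern built on the try_cmpxchg() variants is the bounded-increment loop below (a sketch with a hypothetical limit); on failure @old is refreshed automatically, so no explicit re-read is needed:

        static bool bounded_inc(atomic_long_t *v, long limit)
        {
                long old = raw_atomic_long_read(v);

                do {
                        if (old >= limit)
                                return false;   /* saturated, leave @v untouched */
                } while (!raw_atomic_long_try_cmpxchg(v, &old, old + 1));

                return true;
        }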
 
+/**
+ * raw_atomic_long_sub_and_test() - atomic subtract and test if zero with full ordering
+ * @i: long value to subtract
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_sub_and_test() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_sub_and_test(long i, atomic_long_t *v)
+raw_atomic_long_sub_and_test(long i, atomic_long_t *v)
 {
-       return arch_atomic_sub_and_test(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_sub_and_test(i, v);
+#else
+       return raw_atomic_sub_and_test(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_dec_and_test() - atomic decrement and test if zero with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_dec_and_test() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_dec_and_test(atomic_long_t *v)
+raw_atomic_long_dec_and_test(atomic_long_t *v)
 {
-       return arch_atomic_dec_and_test(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_dec_and_test(v);
+#else
+       return raw_atomic_dec_and_test(v);
+#endif
 }
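
The and_test helpers make the usual reference-drop pattern explicit; a sketch with a hypothetical object type:

        static void my_obj_put(struct my_obj *obj)     /* hypothetical type */
        {
                /* Only the caller that drops the count to zero frees the object. */
                if (raw_atomic_long_dec_and_test(&obj->refs))
                        kfree(obj);
        }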
 
+/**
+ * raw_atomic_long_inc_and_test() - atomic increment and test if zero with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_inc_and_test() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_inc_and_test(atomic_long_t *v)
+raw_atomic_long_inc_and_test(atomic_long_t *v)
 {
-       return arch_atomic_inc_and_test(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_inc_and_test(v);
+#else
+       return raw_atomic_inc_and_test(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_add_negative() - atomic add and test if negative with full ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_add_negative() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_add_negative(long i, atomic_long_t *v)
+raw_atomic_long_add_negative(long i, atomic_long_t *v)
 {
-       return arch_atomic_add_negative(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_add_negative(i, v);
+#else
+       return raw_atomic_add_negative(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_add_negative_acquire() - atomic add and test if negative with acquire ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with acquire ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_add_negative_acquire() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_add_negative_acquire(long i, atomic_long_t *v)
+raw_atomic_long_add_negative_acquire(long i, atomic_long_t *v)
 {
-       return arch_atomic_add_negative_acquire(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_add_negative_acquire(i, v);
+#else
+       return raw_atomic_add_negative_acquire(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_add_negative_release() - atomic add and test if negative with release ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with release ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_add_negative_release() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_add_negative_release(long i, atomic_long_t *v)
+raw_atomic_long_add_negative_release(long i, atomic_long_t *v)
 {
-       return arch_atomic_add_negative_release(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_add_negative_release(i, v);
+#else
+       return raw_atomic_add_negative_release(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_add_negative_relaxed() - atomic add and test if negative with relaxed ordering
+ * @i: long value to add
+ * @v: pointer to atomic_long_t
+ *
+ * Atomically updates @v to (@v + @i) with relaxed ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_add_negative_relaxed() elsewhere.
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_add_negative_relaxed(long i, atomic_long_t *v)
+raw_atomic_long_add_negative_relaxed(long i, atomic_long_t *v)
 {
-       return arch_atomic_add_negative_relaxed(i, v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_add_negative_relaxed(i, v);
+#else
+       return raw_atomic_add_negative_relaxed(i, v);
+#endif
 }
 
+/**
+ * raw_atomic_long_fetch_add_unless() - atomic add unless value with full ordering
+ * @v: pointer to atomic_long_t
+ * @a: long value to add
+ * @u: long value to compare with
+ *
+ * If (@v != @u), atomically updates @v to (@v + @a) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_fetch_add_unless() elsewhere.
+ *
+ * Return: The original value of @v.
+ */
 static __always_inline long
-arch_atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u)
+raw_atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u)
 {
-       return arch_atomic_fetch_add_unless(v, a, u);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_fetch_add_unless(v, a, u);
+#else
+       return raw_atomic_fetch_add_unless(v, a, u);
+#endif
 }
 
+/**
+ * raw_atomic_long_add_unless() - atomic add unless value with full ordering
+ * @v: pointer to atomic_long_t
+ * @a: long value to add
+ * @u: long value to compare with
+ *
+ * If (@v != @u), atomically updates @v to (@v + @a) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_add_unless() elsewhere.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_add_unless(atomic_long_t *v, long a, long u)
+raw_atomic_long_add_unless(atomic_long_t *v, long a, long u)
 {
-       return arch_atomic_add_unless(v, a, u);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_add_unless(v, a, u);
+#else
+       return raw_atomic_add_unless(v, a, u);
+#endif
 }
 
+/**
+ * raw_atomic_long_inc_not_zero() - atomic increment unless zero with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * If (@v != 0), atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_inc_not_zero() elsewhere.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_inc_not_zero(atomic_long_t *v)
+raw_atomic_long_inc_not_zero(atomic_long_t *v)
 {
-       return arch_atomic_inc_not_zero(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_inc_not_zero(v);
+#else
+       return raw_atomic_inc_not_zero(v);
+#endif
 }
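
inc_not_zero() is the lookup-side counterpart of the release pattern above: a reference is taken only if the count has not already reached zero (sketch, hypothetical type):

        static struct my_obj *my_obj_tryget(struct my_obj *obj)
        {
                /* Refuse to resurrect an object that is already being torn down. */
                if (!raw_atomic_long_inc_not_zero(&obj->refs))
                        return NULL;
                return obj;
        }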
 
+/**
+ * raw_atomic_long_inc_unless_negative() - atomic increment unless negative with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * If (@v >= 0), atomically updates @v to (@v + 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_inc_unless_negative() elsewhere.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_inc_unless_negative(atomic_long_t *v)
+raw_atomic_long_inc_unless_negative(atomic_long_t *v)
 {
-       return arch_atomic_inc_unless_negative(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_inc_unless_negative(v);
+#else
+       return raw_atomic_inc_unless_negative(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_dec_unless_positive() - atomic decrement unless positive with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * If (@v <= 0), atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_dec_unless_positive() elsewhere.
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
 static __always_inline bool
-arch_atomic_long_dec_unless_positive(atomic_long_t *v)
+raw_atomic_long_dec_unless_positive(atomic_long_t *v)
 {
-       return arch_atomic_dec_unless_positive(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_dec_unless_positive(v);
+#else
+       return raw_atomic_dec_unless_positive(v);
+#endif
 }
 
+/**
+ * raw_atomic_long_dec_if_positive() - atomic decrement if positive with full ordering
+ * @v: pointer to atomic_long_t
+ *
+ * If (@v > 0), atomically updates @v to (@v - 1) with full ordering.
+ *
+ * Safe to use in noinstr code; prefer atomic_long_dec_if_positive() elsewhere.
+ *
+ * Return: The old value of (@v - 1), regardless of whether @v was updated.
+ */
 static __always_inline long
-arch_atomic_long_dec_if_positive(atomic_long_t *v)
+raw_atomic_long_dec_if_positive(atomic_long_t *v)
 {
-       return arch_atomic_dec_if_positive(v);
+#ifdef CONFIG_64BIT
+       return raw_atomic64_dec_if_positive(v);
+#else
+       return raw_atomic_dec_if_positive(v);
+#endif
 }
 
-#endif /* CONFIG_64BIT */
 #endif /* _LINUX_ATOMIC_LONG_H */
-// a194c07d7d2f4b0e178d3c118c919775d5d65f50
+// 4ef23f98c73cff96d239896175fd26b10b88899e
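
To illustrate the conditional helpers at the end of this header, a sketch of a token pool built on dec_if_positive(); the names are hypothetical:

        static int take_token(atomic_long_t *tokens)
        {
                /* A negative result means no token was available; @v is untouched. */
                if (raw_atomic_long_dec_if_positive(tokens) < 0)
                        return -EBUSY;
                return 0;
        }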
index 31086a7..6a3a9e1 100644 (file)
@@ -130,8 +130,6 @@ extern unsigned compat_dir_class[];
 extern unsigned compat_chattr_class[];
 extern unsigned compat_signal_class[];
 
-extern int audit_classify_compat_syscall(int abi, unsigned syscall);
-
 /* audit_names->type values */
 #define        AUDIT_TYPE_UNKNOWN      0       /* we don't know yet */
 #define        AUDIT_TYPE_NORMAL       1       /* a "normal" audit record */
index 8fdb1af..0e34d67 100644 (file)
@@ -21,4 +21,6 @@ enum auditsc_class_t {
        AUDITSC_NVALS /* count */
 };
 
+extern int audit_classify_compat_syscall(int abi, unsigned syscall);
+
 #endif
index b3e7529..c4f5b52 100644 (file)
@@ -229,7 +229,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
 
 static inline bool bio_flagged(struct bio *bio, unsigned int bit)
 {
-       return (bio->bi_flags & (1U << bit)) != 0;
+       return bio->bi_flags & (1U << bit);
 }
 
 static inline void bio_set_flag(struct bio *bio, unsigned int bit)
@@ -465,14 +465,18 @@ extern void bio_uninit(struct bio *);
 void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf);
 void bio_chain(struct bio *, struct bio *);
 
-int bio_add_page(struct bio *, struct page *, unsigned len, unsigned off);
-bool bio_add_folio(struct bio *, struct folio *, size_t len, size_t off);
+int __must_check bio_add_page(struct bio *bio, struct page *page, unsigned len,
+                             unsigned off);
+bool __must_check bio_add_folio(struct bio *bio, struct folio *folio,
+                               size_t len, size_t off);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
                           unsigned int, unsigned int);
 int bio_add_zone_append_page(struct bio *bio, struct page *page,
                             unsigned int len, unsigned int offset);
 void __bio_add_page(struct bio *bio, struct page *page,
                unsigned int len, unsigned int off);
+void bio_add_folio_nofail(struct bio *bio, struct folio *folio, size_t len,
+                         size_t off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
 void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter);
 void __bio_release_pages(struct bio *bio, bool mark_dirty);
@@ -488,7 +492,7 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-       if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+       if (bio_flagged(bio, BIO_PAGE_PINNED))
                __bio_release_pages(bio, mark_dirty);
 }
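
With bio_add_page() and bio_add_folio() now annotated __must_check, callers are expected to handle the page not fitting; a hedged sketch of the usual fallback (bdev, opf and the surrounding loop are hypothetical context):

        if (bio_add_page(bio, page, len, offset) != len) {
                /* Full bio: submit it and continue in a fresh one. */
                submit_bio(bio);
                bio = bio_alloc(bdev, 1, opf, GFP_NOIO);
                __bio_add_page(bio, page, len, offset); /* always fits in an empty bio */
        }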
 
index 06caacd..f401067 100644 (file)
@@ -28,8 +28,6 @@ typedef __u32 __bitwise req_flags_t;
 
 /* drive already may have started this one */
 #define RQF_STARTED            ((__force req_flags_t)(1 << 1))
-/* may not be passed by ioscheduler */
-#define RQF_SOFTBARRIER                ((__force req_flags_t)(1 << 3))
 /* request for flush sequence */
 #define RQF_FLUSH_SEQ          ((__force req_flags_t)(1 << 4))
 /* merge of different types, fail separately */
@@ -38,12 +36,14 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_MQ_INFLIGHT                ((__force req_flags_t)(1 << 6))
 /* don't call prep for this one */
 #define RQF_DONTPREP           ((__force req_flags_t)(1 << 7))
+/* use hctx->sched_tags */
+#define RQF_SCHED_TAGS         ((__force req_flags_t)(1 << 8))
+/* use an I/O scheduler for this request */
+#define RQF_USE_SCHED          ((__force req_flags_t)(1 << 9))
 /* vaguely specified driver internal error.  Ignored by the block layer */
 #define RQF_FAILED             ((__force req_flags_t)(1 << 10))
 /* don't warn about errors */
 #define RQF_QUIET              ((__force req_flags_t)(1 << 11))
-/* elevator private data attached */
-#define RQF_ELVPRIV            ((__force req_flags_t)(1 << 12))
 /* account into disk and partition IO statistics */
 #define RQF_IO_STAT            ((__force req_flags_t)(1 << 13))
 /* runtime pm request */
@@ -59,13 +59,11 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_ZONE_WRITE_LOCKED  ((__force req_flags_t)(1 << 19))
 /* ->timeout has been called, don't expire again */
 #define RQF_TIMED_OUT          ((__force req_flags_t)(1 << 21))
-/* queue has elevator attached */
-#define RQF_ELV                        ((__force req_flags_t)(1 << 22))
-#define RQF_RESV                       ((__force req_flags_t)(1 << 23))
+#define RQF_RESV               ((__force req_flags_t)(1 << 23))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
-       (RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
+       (RQF_STARTED | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
 
 enum mq_rq_state {
        MQ_RQ_IDLE              = 0,
@@ -169,25 +167,20 @@ struct request {
                void *completion_data;
        };
 
-
        /*
         * Three pointers are available for the IO schedulers, if they need
-        * more they have to dynamically allocate it.  Flush requests are
-        * never put on the IO scheduler. So let the flush fields share
-        * space with the elevator data.
+        * more they have to dynamically allocate it.
         */
-       union {
-               struct {
-                       struct io_cq            *icq;
-                       void                    *priv[2];
-               } elv;
-
-               struct {
-                       unsigned int            seq;
-                       struct list_head        list;
-                       rq_end_io_fn            *saved_end_io;
-               } flush;
-       };
+       struct {
+               struct io_cq            *icq;
+               void                    *priv[2];
+       } elv;
+
+       struct {
+               unsigned int            seq;
+               struct list_head        list;
+               rq_end_io_fn            *saved_end_io;
+       } flush;
 
        union {
                struct __call_single_data csd;
@@ -208,7 +201,7 @@ static inline enum req_op req_op(const struct request *req)
 
 static inline bool blk_rq_is_passthrough(struct request *rq)
 {
-       return blk_op_is_passthrough(req_op(rq));
+       return blk_op_is_passthrough(rq->cmd_flags);
 }
 
 static inline unsigned short req_get_ioprio(struct request *req)
@@ -746,8 +739,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 struct blk_mq_tags {
        unsigned int nr_tags;
        unsigned int nr_reserved_tags;
-
-       atomic_t active_queues;
+       unsigned int active_queues;
 
        struct sbitmap_queue bitmap_tags;
        struct sbitmap_queue breserved_tags;
@@ -844,7 +836,7 @@ void blk_mq_end_request_batch(struct io_comp_batch *ib);
  */
 static inline bool blk_mq_need_time_stamp(struct request *rq)
 {
-       return (rq->rq_flags & (RQF_IO_STAT | RQF_STATS | RQF_ELV));
+       return (rq->rq_flags & (RQF_IO_STAT | RQF_STATS | RQF_USE_SCHED));
 }
 
 static inline bool blk_mq_is_reserved_rq(struct request *rq)
@@ -860,7 +852,7 @@ static inline bool blk_mq_add_to_batch(struct request *req,
                                       struct io_comp_batch *iob, int ioerror,
                                       void (*complete)(struct io_comp_batch *))
 {
-       if (!iob || (req->rq_flags & RQF_ELV) || ioerror ||
+       if (!iob || (req->rq_flags & RQF_USE_SCHED) || ioerror ||
                        (req->end_io && !blk_rq_is_passthrough(req)))
                return false;
 
@@ -1164,6 +1156,18 @@ static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
        return disk_zone_is_seq(rq->q->disk, blk_rq_pos(rq));
 }
 
+/**
+ * blk_rq_is_seq_zoned_write() - Check if @rq requires write serialization.
+ * @rq: Request to examine.
+ *
+ * Note: REQ_OP_ZONE_APPEND requests do not require serialization.
+ */
+static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+       return op_needs_zoned_write_locking(req_op(rq)) &&
+               blk_rq_zone_is_seq(rq);
+}
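
A hedged sketch of how a driver's dispatch path might consult the new helper (the surrounding logic is hypothetical, not a claim about existing callers):

        if (blk_rq_is_seq_zoned_write(rq) && !blk_req_zone_write_trylock(rq))
                return BLK_STS_RESOURCE;        /* another write owns the zone, requeue */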
+
 bool blk_req_needs_zone_write_lock(struct request *rq);
 bool blk_req_zone_write_trylock(struct request *rq);
 void __blk_req_zone_write_lock(struct request *rq);
@@ -1194,6 +1198,11 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
        return !blk_req_zone_is_write_locked(rq);
 }
 #else /* CONFIG_BLK_DEV_ZONED */
+static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+       return false;
+}
+
 static inline bool blk_req_needs_zone_write_lock(struct request *rq)
 {
        return false;
index 740afe8..752a54e 100644 (file)
@@ -55,6 +55,8 @@ struct block_device {
        struct super_block *    bd_super;
        void *                  bd_claiming;
        void *                  bd_holder;
+       const struct blk_holder_ops *bd_holder_ops;
+       struct mutex            bd_holder_lock;
        /* The counter of freeze processes */
        int                     bd_fsfreeze_count;
        int                     bd_holders;
@@ -323,7 +325,7 @@ struct bio {
  * bio flags
  */
 enum {
-       BIO_NO_PAGE_REF,        /* don't put release vec pages */
+       BIO_PAGE_PINNED,        /* Unpin pages in bio_release_pages() */
        BIO_CLONED,             /* doesn't own data */
        BIO_BOUNCED,            /* bio is a bounce bio */
        BIO_QUIET,              /* Make BIO Quiet */
index c0ffe20..ed44a99 100644 (file)
@@ -41,7 +41,7 @@ struct blk_stat_callback;
 struct blk_crypto_profile;
 
 extern const struct device_type disk_type;
-extern struct device_type part_type;
+extern const struct device_type part_type;
 extern struct class block_class;
 
 /*
@@ -112,6 +112,19 @@ struct blk_integrity {
        unsigned char                           tag_size;
 };
 
+typedef unsigned int __bitwise blk_mode_t;
+
+/* open for reading */
+#define BLK_OPEN_READ          ((__force blk_mode_t)(1 << 0))
+/* open for writing */
+#define BLK_OPEN_WRITE         ((__force blk_mode_t)(1 << 1))
+/* open exclusively (vs other exclusive openers) */
+#define BLK_OPEN_EXCL          ((__force blk_mode_t)(1 << 2))
+/* opened with O_NDELAY */
+#define BLK_OPEN_NDELAY                ((__force blk_mode_t)(1 << 3))
+/* open for "writes" only for ioctls (special hack for floppy.c) */
+#define BLK_OPEN_WRITE_IOCTL   ((__force blk_mode_t)(1 << 4))
+
 struct gendisk {
        /*
         * major/first_minor/minors should not be set by any new driver, the
@@ -187,6 +200,7 @@ struct gendisk {
        struct badblocks *bb;
        struct lockdep_map lockdep_map;
        u64 diskseq;
+       blk_mode_t open_mode;
 
        /*
         * Independent sector access ranges. This is always NULL for
@@ -318,7 +332,6 @@ typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
 void disk_set_zoned(struct gendisk *disk, enum blk_zoned_model model);
 
 #ifdef CONFIG_BLK_DEV_ZONED
-
 #define BLK_ALL_ZONES  ((unsigned int)-1)
 int blkdev_report_zones(struct block_device *bdev, sector_t sector,
                        unsigned int nr_zones, report_zones_cb cb, void *data);
@@ -328,33 +341,11 @@ extern int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
                            gfp_t gfp_mask);
 int blk_revalidate_disk_zones(struct gendisk *disk,
                              void (*update_driver_data)(struct gendisk *disk));
-
-extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
-                                    unsigned int cmd, unsigned long arg);
-extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
-                                 unsigned int cmd, unsigned long arg);
-
 #else /* CONFIG_BLK_DEV_ZONED */
-
 static inline unsigned int bdev_nr_zones(struct block_device *bdev)
 {
        return 0;
 }
-
-static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
-                                           fmode_t mode, unsigned int cmd,
-                                           unsigned long arg)
-{
-       return -ENOTTY;
-}
-
-static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
-                                        fmode_t mode, unsigned int cmd,
-                                        unsigned long arg)
-{
-       return -ENOTTY;
-}
-
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 /*
@@ -392,6 +383,7 @@ struct request_queue {
 
        struct blk_queue_stats  *stats;
        struct rq_qos           *rq_qos;
+       struct mutex            rq_qos_mutex;
 
        const struct blk_mq_ops *mq_ops;
 
@@ -487,6 +479,7 @@ struct request_queue {
         * for flush operations
         */
        struct blk_flush_queue  *fq;
+       struct list_head        flush_list;
 
        struct list_head        requeue_list;
        spinlock_t              requeue_lock;
@@ -815,7 +808,7 @@ int __register_blkdev(unsigned int major, const char *name,
        __register_blkdev(major, name, NULL)
 void unregister_blkdev(unsigned int major, const char *name);
 
-bool bdev_check_media_change(struct block_device *bdev);
+bool disk_check_media_change(struct gendisk *disk);
 int __invalidate_device(struct block_device *bdev, bool kill_dirty);
 void set_capacity(struct gendisk *disk, sector_t size);
 
@@ -836,7 +829,6 @@ static inline void bd_unlink_disk_holder(struct block_device *bdev,
 
 dev_t part_devt(struct gendisk *disk, u8 partno);
 void inc_diskseq(struct gendisk *disk);
-dev_t blk_lookup_devt(const char *name, int partno);
 void blk_request_module(dev_t devt);
 
 extern int blk_register_queue(struct gendisk *disk);
@@ -1281,15 +1273,18 @@ static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
        return disk_zone_no(bdev->bd_disk, sec);
 }
 
-static inline bool bdev_op_is_zoned_write(struct block_device *bdev,
-                                         blk_opf_t op)
+/* Whether write serialization is required for @op on zoned devices. */
+static inline bool op_needs_zoned_write_locking(enum req_op op)
 {
-       if (!bdev_is_zoned(bdev))
-               return false;
-
        return op == REQ_OP_WRITE || op == REQ_OP_WRITE_ZEROES;
 }
 
+static inline bool bdev_op_is_zoned_write(struct block_device *bdev,
+                                         enum req_op op)
+{
+       return bdev_is_zoned(bdev) && op_needs_zoned_write_locking(op);
+}
+
 static inline sector_t bdev_zone_sectors(struct block_device *bdev)
 {
        struct request_queue *q = bdev_get_queue(bdev);
@@ -1380,10 +1375,12 @@ struct block_device_operations {
        void (*submit_bio)(struct bio *bio);
        int (*poll_bio)(struct bio *bio, struct io_comp_batch *iob,
                        unsigned int flags);
-       int (*open) (struct block_device *, fmode_t);
-       void (*release) (struct gendisk *, fmode_t);
-       int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
-       int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
+       int (*open)(struct gendisk *disk, blk_mode_t mode);
+       void (*release)(struct gendisk *disk);
+       int (*ioctl)(struct block_device *bdev, blk_mode_t mode,
+                       unsigned cmd, unsigned long arg);
+       int (*compat_ioctl)(struct block_device *bdev, blk_mode_t mode,
+                       unsigned cmd, unsigned long arg);
        unsigned int (*check_events) (struct gendisk *disk,
                                      unsigned int clearing);
        void (*unlock_native_capacity) (struct gendisk *);
@@ -1410,7 +1407,7 @@ struct block_device_operations {
 };
 
 #ifdef CONFIG_COMPAT
-extern int blkdev_compat_ptr_ioctl(struct block_device *, fmode_t,
+extern int blkdev_compat_ptr_ioctl(struct block_device *, blk_mode_t,
                                      unsigned int, unsigned long);
 #else
 #define blkdev_compat_ptr_ioctl NULL
@@ -1463,22 +1460,31 @@ void blkdev_show(struct seq_file *seqf, off_t offset);
 #define BLKDEV_MAJOR_MAX       0
 #endif
 
-struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
-               void *holder);
-struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder);
-int bd_prepare_to_claim(struct block_device *bdev, void *holder);
+struct blk_holder_ops {
+       void (*mark_dead)(struct block_device *bdev);
+};
+
+/*
+ * Return the correct open flags for blkdev_get_by_* for super block flags
+ * as stored in sb->s_flags.
+ */
+#define sb_open_mode(flags) \
+       (BLK_OPEN_READ | (((flags) & SB_RDONLY) ? 0 : BLK_OPEN_WRITE))
+
+struct block_device *blkdev_get_by_dev(dev_t dev, blk_mode_t mode, void *holder,
+               const struct blk_holder_ops *hops);
+struct block_device *blkdev_get_by_path(const char *path, blk_mode_t mode,
+               void *holder, const struct blk_holder_ops *hops);
+int bd_prepare_to_claim(struct block_device *bdev, void *holder,
+               const struct blk_holder_ops *hops);
 void bd_abort_claiming(struct block_device *bdev, void *holder);
-void blkdev_put(struct block_device *bdev, fmode_t mode);
+void blkdev_put(struct block_device *bdev, void *holder);
 
 /* just for blk-cgroup, don't use elsewhere */
 struct block_device *blkdev_get_no_open(dev_t dev);
 void blkdev_put_no_open(struct block_device *bdev);
 
-struct block_device *bdev_alloc(struct gendisk *disk, u8 partno);
-void bdev_add(struct block_device *bdev, dev_t dev);
 struct block_device *I_BDEV(struct inode *inode);
-int truncate_bdev_range(struct block_device *bdev, fmode_t mode, loff_t lstart,
-               loff_t lend);
 
 #ifdef CONFIG_BLOCK
 void invalidate_bdev(struct block_device *bdev);
@@ -1488,6 +1494,7 @@ int sync_blockdev_nowait(struct block_device *bdev);
 void sync_bdevs(bool wait);
 void bdev_statx_dioalign(struct inode *inode, struct kstat *stat);
 void printk_all_partitions(void);
+int __init early_lookup_bdev(const char *pathname, dev_t *dev);
 #else
 static inline void invalidate_bdev(struct block_device *bdev)
 {
@@ -1509,6 +1516,10 @@ static inline void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
 static inline void printk_all_partitions(void)
 {
 }
+static inline int early_lookup_bdev(const char *pathname, dev_t *dev)
+{
+       return -EINVAL;
+}
 #endif /* CONFIG_BLOCK */
 
 int fsync_bdev(struct block_device *bdev);
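
To illustrate the reworked open path (blk_mode_t flags, holder ops, holder-based put), a hedged sketch of a filesystem-style caller; my_fs_mark_dead is a hypothetical callback:

        static const struct blk_holder_ops my_fs_holder_ops = {
                .mark_dead      = my_fs_mark_dead,      /* hypothetical surprise-removal hook */
        };

        bdev = blkdev_get_by_path(path, sb_open_mode(sb->s_flags) | BLK_OPEN_EXCL,
                                  sb, &my_fs_holder_ops);
        if (IS_ERR(bdev))
                return PTR_ERR(bdev);
        /* ... */
        blkdev_put(bdev, sb);           /* released by holder, no mode argument */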
index cfbda11..122c62e 100644 (file)
@@ -85,10 +85,14 @@ extern int blk_trace_remove(struct request_queue *q);
 # define blk_add_driver_data(rq, data, len)            do {} while (0)
 # define blk_trace_setup(q, name, dev, bdev, arg)      (-ENOTTY)
 # define blk_trace_startstop(q, start)                 (-ENOTTY)
-# define blk_trace_remove(q)                           (-ENOTTY)
 # define blk_add_trace_msg(q, fmt, ...)                        do { } while (0)
 # define blk_add_cgroup_trace_msg(q, cg, fmt, ...)     do { } while (0)
 # define blk_trace_note_message_enabled(q)             (false)
+
+static inline int blk_trace_remove(struct request_queue *q)
+{
+       return -ENOTTY;
+}
 #endif /* CONFIG_BLK_DEV_IO_TRACE */
 
 #ifdef CONFIG_COMPAT
index 1ac81c8..ee2df73 100644 (file)
@@ -9,7 +9,7 @@ struct device;
 struct request_queue;
 
 typedef int (bsg_sg_io_fn)(struct request_queue *, struct sg_io_v4 *hdr,
-               fmode_t mode, unsigned int timeout);
+               bool open_for_write, unsigned int timeout);
 
 struct bsg_device *bsg_register_queue(struct request_queue *q,
                struct device *parent, const char *name,
index 1520793..c794ea7 100644 (file)
@@ -263,7 +263,7 @@ extern int buffer_heads_over_limit;
 void block_invalidate_folio(struct folio *folio, size_t offset, size_t length);
 int block_write_full_page(struct page *page, get_block_t *get_block,
                                struct writeback_control *wbc);
-int __block_write_full_page(struct inode *inode, struct page *page,
+int __block_write_full_folio(struct inode *inode, struct folio *folio,
                        get_block_t *get_block, struct writeback_control *wbc,
                        bh_end_io_t *handler);
 int block_read_full_folio(struct folio *, get_block_t *);
@@ -278,7 +278,7 @@ int block_write_end(struct file *, struct address_space *,
 int generic_write_end(struct file *, struct address_space *,
                                loff_t, unsigned, unsigned,
                                struct page *, void *);
-void page_zero_new_buffers(struct page *page, unsigned from, unsigned to);
+void folio_zero_new_buffers(struct folio *folio, size_t from, size_t to);
 void clean_page_buffers(struct page *page);
 int cont_write_begin(struct file *, struct address_space *, loff_t,
                        unsigned, struct page **, void **,
index 5da1bbd..9900d20 100644 (file)
@@ -98,4 +98,10 @@ struct cacheline_padding {
 #define CACHELINE_PADDING(name)
 #endif
 
+#ifdef ARCH_DMA_MINALIGN
+#define ARCH_HAS_DMA_MINALIGN
+#else
+#define ARCH_DMA_MINALIGN __alignof__(unsigned long long)
+#endif
+
 #endif /* __LINUX_CACHE_H */
index 67caa90..98c6fd0 100644 (file)
@@ -13,6 +13,7 @@
 
 #include <linux/fs.h>          /* not really needed, later.. */
 #include <linux/list.h>
+#include <linux/blkdev.h>
 #include <scsi/scsi_common.h>
 #include <uapi/linux/cdrom.h>
 
@@ -61,9 +62,9 @@ struct cdrom_device_info {
        __u8 last_sense;
        __u8 media_written;             /* dirty flag, DVD+RW bookkeeping */
        unsigned short mmc3_profile;    /* current MMC3 profile */
-       int for_data;
        int (*exit)(struct cdrom_device_info *);
        int mrw_mode_page;
+       bool opened_for_data;
        __s64 last_media_change_ms;
 };
 
@@ -101,11 +102,10 @@ int cdrom_read_tocentry(struct cdrom_device_info *cdi,
                struct cdrom_tocentry *entry);
 
 /* the general block_device operations structure: */
-extern int cdrom_open(struct cdrom_device_info *cdi, struct block_device *bdev,
-                       fmode_t mode);
-extern void cdrom_release(struct cdrom_device_info *cdi, fmode_t mode);
-extern int cdrom_ioctl(struct cdrom_device_info *cdi, struct block_device *bdev,
-                      fmode_t mode, unsigned int cmd, unsigned long arg);
+int cdrom_open(struct cdrom_device_info *cdi, blk_mode_t mode);
+void cdrom_release(struct cdrom_device_info *cdi);
+int cdrom_ioctl(struct cdrom_device_info *cdi, struct block_device *bdev,
+               unsigned int cmd, unsigned long arg);
 extern unsigned int cdrom_check_events(struct cdrom_device_info *cdi,
                                       unsigned int clearing);
 
index 885f539..b307013 100644 (file)
@@ -118,7 +118,6 @@ int cgroup_rm_cftypes(struct cftype *cfts);
 void cgroup_file_notify(struct cgroup_file *cfile);
 void cgroup_file_show(struct cgroup_file *cfile, bool show);
 
-int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
 int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry);
 int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
                     struct pid *pid, struct task_struct *tsk);
@@ -692,7 +691,6 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
  */
 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
 void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_atomic(struct cgroup *cgrp);
 void cgroup_rstat_flush_hold(struct cgroup *cgrp);
 void cgroup_rstat_flush_release(void);
 
index a6e512c..e947764 100644 (file)
@@ -89,89 +89,17 @@ extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
                const struct alloc_context *ac, enum compact_priority prio,
                struct page **page);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
-extern enum compact_result compaction_suitable(struct zone *zone, int order,
-               unsigned int alloc_flags, int highest_zoneidx);
+extern bool compaction_suitable(struct zone *zone, int order,
+                                              int highest_zoneidx);
 
 extern void compaction_defer_reset(struct zone *zone, int order,
                                bool alloc_success);
 
-/* Compaction has made some progress and retrying makes sense */
-static inline bool compaction_made_progress(enum compact_result result)
-{
-       /*
-        * Even though this might sound confusing this in fact tells us
-        * that the compaction successfully isolated and migrated some
-        * pageblocks.
-        */
-       if (result == COMPACT_SUCCESS)
-               return true;
-
-       return false;
-}
-
-/* Compaction has failed and it doesn't make much sense to keep retrying. */
-static inline bool compaction_failed(enum compact_result result)
-{
-       /* All zones were scanned completely and still not result. */
-       if (result == COMPACT_COMPLETE)
-               return true;
-
-       return false;
-}
-
-/* Compaction needs reclaim to be performed first, so it can continue. */
-static inline bool compaction_needs_reclaim(enum compact_result result)
-{
-       /*
-        * Compaction backed off due to watermark checks for order-0
-        * so the regular reclaim has to try harder and reclaim something.
-        */
-       if (result == COMPACT_SKIPPED)
-               return true;
-
-       return false;
-}
-
-/*
- * Compaction has backed off for some reason after doing some work or none
- * at all. It might be throttling or lock contention. Retrying might be still
- * worthwhile, but with a higher priority if allowed.
- */
-static inline bool compaction_withdrawn(enum compact_result result)
-{
-       /*
-        * If compaction is deferred for high-order allocations, it is
-        * because sync compaction recently failed. If this is the case
-        * and the caller requested a THP allocation, we do not want
-        * to heavily disrupt the system, so we fail the allocation
-        * instead of entering direct reclaim.
-        */
-       if (result == COMPACT_DEFERRED)
-               return true;
-
-       /*
-        * If compaction in async mode encounters contention or blocks higher
-        * priority task we back off early rather than cause stalls.
-        */
-       if (result == COMPACT_CONTENDED)
-               return true;
-
-       /*
-        * Page scanners have met but we haven't scanned full zones so this
-        * is a back off in fact.
-        */
-       if (result == COMPACT_PARTIAL_SKIPPED)
-               return true;
-
-       return false;
-}
-
-
 bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
                                        int alloc_flags);
 
-extern void kcompactd_run(int nid);
-extern void kcompactd_stop(int nid);
+extern void __meminit kcompactd_run(int nid);
+extern void __meminit kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int highest_zoneidx);
 
 #else
@@ -179,32 +107,12 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
 {
 }
 
-static inline enum compact_result compaction_suitable(struct zone *zone, int order,
-                                       int alloc_flags, int highest_zoneidx)
-{
-       return COMPACT_SKIPPED;
-}
-
-static inline bool compaction_made_progress(enum compact_result result)
-{
-       return false;
-}
-
-static inline bool compaction_failed(enum compact_result result)
-{
-       return false;
-}
-
-static inline bool compaction_needs_reclaim(enum compact_result result)
+static inline bool compaction_suitable(struct zone *zone, int order,
+                                                     int highest_zoneidx)
 {
        return false;
 }
 
-static inline bool compaction_withdrawn(enum compact_result result)
-{
-       return true;
-}
-
 static inline void kcompactd_run(int nid)
 {
 }
index e659cb6..571fa79 100644 (file)
 #endif
 
 /*
+ * Optional: only supported since gcc >= 14
+ * Optional: only supported since clang >= 17
+ *
+ *   gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108896
+ * clang: https://reviews.llvm.org/D148381
+ */
+#if __has_attribute(__element_count__)
+# define __counted_by(member)          __attribute__((__element_count__(#member)))
+#else
+# define __counted_by(member)
+#endif
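
A sketch of the intended use of __counted_by(): annotating a flexible array member so that bounds-checking instrumentation (e.g. FORTIFY_SOURCE, UBSAN bounds checks) knows its runtime length; the struct name is hypothetical:

        struct my_report {                      /* hypothetical example */
                u32 nr_entries;
                u64 entries[] __counted_by(nr_entries);
        };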
+
+/*
  * Optional: only supported since clang >= 14.0
  *
  *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-error-function-attribute
 #define __noreturn                      __attribute__((__noreturn__))
 
 /*
+ * Optional: only supported since GCC >= 11.1, clang >= 7.0.
+ *
+ *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-no_005fstack_005fprotector-function-attribute
+ *   clang: https://clang.llvm.org/docs/AttributeReference.html#no-stack-protector-safebuffers
+ */
+#if __has_attribute(__no_stack_protector__)
+# define __no_stack_protector          __attribute__((__no_stack_protector__))
+#else
+# define __no_stack_protector
+#endif
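
A sketch of where __no_stack_protector is meant to apply: code that runs before the stack canary has been initialized (the function name is hypothetical):

        /* Must not emit canary checks: the canary is only set up inside this call. */
        static void __no_stack_protector early_cpu_bringup(void)
        {
                boot_init_stack_canary();
        }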
+
+/*
  * Optional: not supported by gcc.
  *
  * clang: https://clang.llvm.org/docs/AttributeReference.html#overloadable
index d3cbb6c..6e76b9d 100644 (file)
@@ -119,7 +119,7 @@ extern void ct_idle_exit(void);
  */
 static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
 {
-       return !(arch_atomic_read(this_cpu_ptr(&context_tracking.state)) & RCU_DYNTICKS_IDX);
+       return !(raw_atomic_read(this_cpu_ptr(&context_tracking.state)) & RCU_DYNTICKS_IDX);
 }
 
 /*
@@ -128,7 +128,7 @@ static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
  */
 static __always_inline unsigned long ct_state_inc(int incby)
 {
-       return arch_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state));
+       return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state));
 }
 
 static __always_inline bool warn_rcu_enter(void)
index fdd537e..bbff5f7 100644 (file)
@@ -51,7 +51,7 @@ DECLARE_PER_CPU(struct context_tracking, context_tracking);
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 static __always_inline int __ct_state(void)
 {
-       return arch_atomic_read(this_cpu_ptr(&context_tracking.state)) & CT_STATE_MASK;
+       return raw_atomic_read(this_cpu_ptr(&context_tracking.state)) & CT_STATE_MASK;
 }
 #endif
 
index 8582a71..6e6e57e 100644 (file)
@@ -184,8 +184,12 @@ void arch_cpu_idle_enter(void);
 void arch_cpu_idle_exit(void);
 void __noreturn arch_cpu_idle_dead(void);
 
-int cpu_report_state(int cpu);
-int cpu_check_up_prepare(int cpu);
+#ifdef CONFIG_ARCH_HAS_CPU_FINALIZE_INIT
+void arch_cpu_finalize_init(void);
+#else
+static inline void arch_cpu_finalize_init(void) { }
+#endif
+
 void cpu_set_state_online(int cpu);
 void play_idle_precise(u64 duration_ns, u64 latency_ns);
 
@@ -195,8 +199,6 @@ static inline void play_idle(unsigned long duration_us)
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
-bool cpu_wait_death(unsigned int cpu, int seconds);
-bool cpu_report_death(void);
 void cpuhp_report_idle_dead(void);
 #else
 static inline void cpuhp_report_idle_dead(void) { }
index 26e2eb3..172ff51 100644 (file)
@@ -340,7 +340,10 @@ struct cpufreq_driver {
        /*
         * ->fast_switch() replacement for drivers that use an internal
         * representation of performance levels and can pass hints other than
-        * the target performance level to the hardware.
+        * the target performance level to the hardware. This can only be set
+        * if ->fast_switch is set too, because in those cases (under specific
+        * conditions) scale invariance can be disabled, which causes the
+        * schedutil governor to fall back to the latter.
         */
        void            (*adjust_perf)(unsigned int cpu,
                                       unsigned long min_perf,
index 3ceb9df..25b6e6e 100644 (file)
@@ -133,6 +133,7 @@ enum cpuhp_state {
        CPUHP_MIPS_SOC_PREPARE,
        CPUHP_BP_PREPARE_DYN,
        CPUHP_BP_PREPARE_DYN_END                = CPUHP_BP_PREPARE_DYN + 20,
+       CPUHP_BP_KICK_AP,
        CPUHP_BRINGUP_CPU,
 
        /*
@@ -518,4 +519,20 @@ void cpuhp_online_idle(enum cpuhp_state state);
 static inline void cpuhp_online_idle(enum cpuhp_state state) { }
 #endif
 
+struct task_struct;
+
+void cpuhp_ap_sync_alive(void);
+void arch_cpuhp_sync_state_poll(void);
+void arch_cpuhp_cleanup_kick_cpu(unsigned int cpu);
+int arch_cpuhp_kick_ap_alive(unsigned int cpu, struct task_struct *tidle);
+bool arch_cpuhp_init_parallel_bringup(void);
+
+#ifdef CONFIG_HOTPLUG_CORE_SYNC_DEAD
+void cpuhp_ap_report_dead(void);
+void arch_cpuhp_cleanup_dead_cpu(unsigned int cpu);
+#else
+static inline void cpuhp_ap_report_dead(void) { }
+static inline void arch_cpuhp_cleanup_dead_cpu(unsigned int cpu) { }
+#endif
+
 #endif
index ca736b0..0d2e2a3 100644 (file)
@@ -1071,7 +1071,7 @@ static inline const struct cpumask *get_cpu_mask(unsigned int cpu)
  */
 static __always_inline unsigned int num_online_cpus(void)
 {
-       return arch_atomic_read(&__num_online_cpus);
+       return raw_atomic_read(&__num_online_cpus);
 }
 #define num_possible_cpus()    cpumask_weight(cpu_possible_mask)
 #define num_present_cpus()     cpumask_weight(cpu_present_mask)
index 980b76a..d629094 100644 (file)
@@ -71,8 +71,10 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
-extern void cpuset_read_lock(void);
-extern void cpuset_read_unlock(void);
+extern void inc_dl_tasks_cs(struct task_struct *task);
+extern void dec_dl_tasks_cs(struct task_struct *task);
+extern void cpuset_lock(void);
+extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -189,8 +191,10 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
-static inline void cpuset_read_lock(void) { }
-static inline void cpuset_read_unlock(void) { }
+static inline void inc_dl_tasks_cs(struct task_struct *task) { }
+static inline void dec_dl_tasks_cs(struct task_struct *task) { }
+static inline void cpuset_lock(void) { }
+static inline void cpuset_unlock(void) { }
 
 static inline void cpuset_cpus_allowed(struct task_struct *p,
                                       struct cpumask *mask)
index 039e7e0..ff9cda9 100644 (file)
@@ -56,6 +56,7 @@ static inline void ndelay(unsigned long x)
 
 extern unsigned long lpj_fine;
 void calibrate_delay(void);
+unsigned long calibrate_delay_is_known(void);
 void __attribute__((weak)) calibration_delay_done(void);
 void msleep(unsigned int msecs);
 unsigned long msleep_interruptible(unsigned int msecs);
index 7fd704b..d312ffb 100644 (file)
@@ -108,7 +108,6 @@ struct devfreq_dev_profile {
        unsigned long initial_freq;
        unsigned int polling_ms;
        enum devfreq_timer timer;
-       bool is_cooling_device;
 
        int (*target)(struct device *dev, unsigned long *freq, u32 flags);
        int (*get_dev_status)(struct device *dev,
@@ -118,6 +117,8 @@ struct devfreq_dev_profile {
 
        unsigned long *freq_table;
        unsigned int max_state;
+
+       bool is_cooling_device;
 };
 
 /**
index a52d2b9..69d0435 100644 (file)
@@ -166,17 +166,15 @@ void dm_error(const char *message);
 struct dm_dev {
        struct block_device *bdev;
        struct dax_device *dax_dev;
-       fmode_t mode;
+       blk_mode_t mode;
        char name[16];
 };
 
-dev_t dm_get_dev_t(const char *path);
-
 /*
  * Constructors should call these functions to ensure destination devices
  * are opened/closed correctly.
  */
-int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
+int dm_get_device(struct dm_target *ti, const char *path, blk_mode_t mode,
                  struct dm_dev **result);
 void dm_put_device(struct dm_target *ti, struct dm_dev *d);
 
@@ -545,7 +543,7 @@ int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo);
 /*
  * First create an empty table.
  */
-int dm_table_create(struct dm_table **result, fmode_t mode,
+int dm_table_create(struct dm_table **result, blk_mode_t mode,
                    unsigned int num_targets, struct mapped_device *md);
 
 /*
@@ -588,7 +586,7 @@ void dm_sync_table(struct mapped_device *md);
  * Queries
  */
 sector_t dm_table_get_size(struct dm_table *t);
-fmode_t dm_table_get_mode(struct dm_table *t);
+blk_mode_t dm_table_get_mode(struct dm_table *t);
 struct mapped_device *dm_table_get_md(struct dm_table *t);
 const char *dm_table_device_name(struct dm_table *t);
 
index c244267..7738f45 100644 (file)
@@ -126,7 +126,7 @@ int __must_check driver_register(struct device_driver *drv);
 void driver_unregister(struct device_driver *drv);
 
 struct device_driver *driver_find(const char *name, const struct bus_type *bus);
-int driver_probe_done(void);
+bool __init driver_probe_done(void);
 void wait_for_device_probe(void);
 void __init wait_for_init_devices_probe(void);
 
index 31f114f..9bf19b5 100644 (file)
@@ -8,6 +8,7 @@
 
 #include <linux/dma-mapping.h>
 #include <linux/pgtable.h>
+#include <linux/slab.h>
 
 struct cma;
 
@@ -277,6 +278,66 @@ static inline bool dev_is_dma_coherent(struct device *dev)
 }
 #endif /* CONFIG_ARCH_HAS_DMA_COHERENCE_H */
 
+/*
+ * Check whether potential kmalloc() buffers are safe for non-coherent DMA.
+ */
+static inline bool dma_kmalloc_safe(struct device *dev,
+                                   enum dma_data_direction dir)
+{
+       /*
+        * If DMA bouncing of kmalloc() buffers is disabled, the kmalloc()
+        * caches have already been aligned to a DMA-safe size.
+        */
+       if (!IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC))
+               return true;
+
+       /*
+        * kmalloc() buffers are DMA-safe irrespective of size if the device
+        * is coherent or the direction is DMA_TO_DEVICE (non-destructive
+        * cache maintenance and benign cache line evictions).
+        */
+       if (dev_is_dma_coherent(dev) || dir == DMA_TO_DEVICE)
+               return true;
+
+       return false;
+}
+
+/*
+ * Check whether the given size, assuming it is for a kmalloc()'ed buffer, is
+ * sufficiently aligned for non-coherent DMA.
+ */
+static inline bool dma_kmalloc_size_aligned(size_t size)
+{
+       /*
+        * Larger kmalloc() sizes are guaranteed to be aligned to
+        * ARCH_DMA_MINALIGN.
+        */
+       if (size >= 2 * ARCH_DMA_MINALIGN ||
+           IS_ALIGNED(kmalloc_size_roundup(size), dma_get_cache_alignment()))
+               return true;
+
+       return false;
+}
+
+/*
+ * Check whether the given object size may have originated from a kmalloc()
+ * buffer with a slab alignment below the DMA-safe alignment and needs
+ * bouncing for non-coherent DMA. The pointer alignment is not considered and
+ * in-structure DMA-safe offsets are the responsibility of the caller. Such
+ * code should use the static ARCH_DMA_MINALIGN for compiler annotations.
+ *
+ * The heuristics can have false positives, bouncing unnecessarily, though the
+ * buffers would be small. False negatives are theoretically possible if, for
+ * example, multiple small kmalloc() buffers are coalesced into a larger
+ * buffer that passes the alignment check. There are no such known constructs
+ * in the kernel.
+ */
+static inline bool dma_kmalloc_needs_bounce(struct device *dev, size_t size,
+                                           enum dma_data_direction dir)
+{
+       return !dma_kmalloc_safe(dev, dir) && !dma_kmalloc_size_aligned(size);
+}
+
 void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
                gfp_t gfp, unsigned long attrs);
 void arch_dma_free(struct device *dev, size_t size, void *cpu_addr,
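For a concrete feel of how the three helpers above interact, here is a small sketch with assumed numbers (ARCH_DMA_MINALIGN of 128, dma_get_cache_alignment() reporting 64, default kmalloc cache sizes, CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC enabled); the function and device names are hypothetical and only the calls to dma_kmalloc_needs_bounce() come from this header:

    /* Sketch only: nc_dev is assumed to be a non-coherent device. */
    static void example_bounce_decisions(struct device *nc_dev)
    {
            /* 24 bytes -> kmalloc-32; 32 is not a multiple of 64, so bounce. */
            bool small = dma_kmalloc_needs_bounce(nc_dev, 24, DMA_FROM_DEVICE);

            /* 200 bytes -> kmalloc-256; 256 is a multiple of 64, so no bounce. */
            bool large = dma_kmalloc_needs_bounce(nc_dev, 200, DMA_FROM_DEVICE);

            /* DMA_TO_DEVICE never needs bouncing, whatever the size. */
            bool to_dev = dma_kmalloc_needs_bounce(nc_dev, 24, DMA_TO_DEVICE);

            WARN_ON(!small || large || to_dev);
    }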
index 0ee20b7..e13050e 100644 (file)
@@ -2,6 +2,7 @@
 #ifndef _LINUX_DMA_MAPPING_H
 #define _LINUX_DMA_MAPPING_H
 
+#include <linux/cache.h>
 #include <linux/sizes.h>
 #include <linux/string.h>
 #include <linux/device.h>
@@ -543,13 +544,15 @@ static inline int dma_set_min_align_mask(struct device *dev,
        return 0;
 }
 
+#ifndef dma_get_cache_alignment
 static inline int dma_get_cache_alignment(void)
 {
-#ifdef ARCH_DMA_MINALIGN
+#ifdef ARCH_HAS_DMA_MINALIGN
        return ARCH_DMA_MINALIGN;
 #endif
        return 1;
 }
+#endif
 
 static inline void *dmam_alloc_coherent(struct device *dev, size_t size,
                dma_addr_t *dma_handle, gfp_t gfp)
index 725d5e6..27dbd4c 100644 (file)
@@ -202,67 +202,74 @@ static inline void detect_intel_iommu(void)
 
 struct irte {
        union {
-               /* Shared between remapped and posted mode*/
                struct {
-                       __u64   present         : 1,  /*  0      */
-                               fpd             : 1,  /*  1      */
-                               __res0          : 6,  /*  2 -  6 */
-                               avail           : 4,  /*  8 - 11 */
-                               __res1          : 3,  /* 12 - 14 */
-                               pst             : 1,  /* 15      */
-                               vector          : 8,  /* 16 - 23 */
-                               __res2          : 40; /* 24 - 63 */
+                       union {
+                               /* Shared between remapped and posted mode*/
+                               struct {
+                                       __u64   present         : 1,  /*  0      */
+                                               fpd             : 1,  /*  1      */
+                                               __res0          : 6,  /*  2 -  6 */
+                                               avail           : 4,  /*  8 - 11 */
+                                               __res1          : 3,  /* 12 - 14 */
+                                               pst             : 1,  /* 15      */
+                                               vector          : 8,  /* 16 - 23 */
+                                               __res2          : 40; /* 24 - 63 */
+                               };
+
+                               /* Remapped mode */
+                               struct {
+                                       __u64   r_present       : 1,  /*  0      */
+                                               r_fpd           : 1,  /*  1      */
+                                               dst_mode        : 1,  /*  2      */
+                                               redir_hint      : 1,  /*  3      */
+                                               trigger_mode    : 1,  /*  4      */
+                                               dlvry_mode      : 3,  /*  5 -  7 */
+                                               r_avail         : 4,  /*  8 - 11 */
+                                               r_res0          : 4,  /* 12 - 15 */
+                                               r_vector        : 8,  /* 16 - 23 */
+                                               r_res1          : 8,  /* 24 - 31 */
+                                               dest_id         : 32; /* 32 - 63 */
+                               };
+
+                               /* Posted mode */
+                               struct {
+                                       __u64   p_present       : 1,  /*  0      */
+                                               p_fpd           : 1,  /*  1      */
+                                               p_res0          : 6,  /*  2 -  7 */
+                                               p_avail         : 4,  /*  8 - 11 */
+                                               p_res1          : 2,  /* 12 - 13 */
+                                               p_urgent        : 1,  /* 14      */
+                                               p_pst           : 1,  /* 15      */
+                                               p_vector        : 8,  /* 16 - 23 */
+                                               p_res2          : 14, /* 24 - 37 */
+                                               pda_l           : 26; /* 38 - 63 */
+                               };
+                               __u64 low;
+                       };
+
+                       union {
+                               /* Shared between remapped and posted mode*/
+                               struct {
+                                       __u64   sid             : 16,  /* 64 - 79  */
+                                               sq              : 2,   /* 80 - 81  */
+                                               svt             : 2,   /* 82 - 83  */
+                                               __res3          : 44;  /* 84 - 127 */
+                               };
+
+                               /* Posted mode*/
+                               struct {
+                                       __u64   p_sid           : 16,  /* 64 - 79  */
+                                               p_sq            : 2,   /* 80 - 81  */
+                                               p_svt           : 2,   /* 82 - 83  */
+                                               p_res3          : 12,  /* 84 - 95  */
+                                               pda_h           : 32;  /* 96 - 127 */
+                               };
+                               __u64 high;
+                       };
                };
-
-               /* Remapped mode */
-               struct {
-                       __u64   r_present       : 1,  /*  0      */
-                               r_fpd           : 1,  /*  1      */
-                               dst_mode        : 1,  /*  2      */
-                               redir_hint      : 1,  /*  3      */
-                               trigger_mode    : 1,  /*  4      */
-                               dlvry_mode      : 3,  /*  5 -  7 */
-                               r_avail         : 4,  /*  8 - 11 */
-                               r_res0          : 4,  /* 12 - 15 */
-                               r_vector        : 8,  /* 16 - 23 */
-                               r_res1          : 8,  /* 24 - 31 */
-                               dest_id         : 32; /* 32 - 63 */
-               };
-
-               /* Posted mode */
-               struct {
-                       __u64   p_present       : 1,  /*  0      */
-                               p_fpd           : 1,  /*  1      */
-                               p_res0          : 6,  /*  2 -  7 */
-                               p_avail         : 4,  /*  8 - 11 */
-                               p_res1          : 2,  /* 12 - 13 */
-                               p_urgent        : 1,  /* 14      */
-                               p_pst           : 1,  /* 15      */
-                               p_vector        : 8,  /* 16 - 23 */
-                               p_res2          : 14, /* 24 - 37 */
-                               pda_l           : 26; /* 38 - 63 */
-               };
-               __u64 low;
-       };
-
-       union {
-               /* Shared between remapped and posted mode*/
-               struct {
-                       __u64   sid             : 16,  /* 64 - 79  */
-                               sq              : 2,   /* 80 - 81  */
-                               svt             : 2,   /* 82 - 83  */
-                               __res3          : 44;  /* 84 - 127 */
-               };
-
-               /* Posted mode*/
-               struct {
-                       __u64   p_sid           : 16,  /* 64 - 79  */
-                               p_sq            : 2,   /* 80 - 81  */
-                               p_svt           : 2,   /* 82 - 83  */
-                               p_res3          : 12,  /* 84 - 95  */
-                               pda_h           : 32;  /* 96 - 127 */
-               };
-               __u64 high;
+#ifdef CONFIG_IRQ_REMAP
+               __u128 irte;
+#endif
        };
 };
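The restructuring above keeps every bitfield at the same position; it only nests the existing low/high unions one level deeper so that, under CONFIG_IRQ_REMAP, the whole entry is also addressable as a single __u128. A hedged sketch of what that overlay allows (hypothetical helper, not part of the patch):

    #ifdef CONFIG_IRQ_REMAP
    /* Compare two remapping entries in one 128-bit operation. */
    static inline bool irte_equal(const struct irte *a, const struct irte *b)
    {
            return a->irte == b->irte;
    }
    #endif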
 
index 571d1a6..18d83a6 100644 (file)
@@ -108,7 +108,8 @@ typedef     struct {
 #define EFI_MEMORY_MAPPED_IO_PORT_SPACE        12
 #define EFI_PAL_CODE                   13
 #define EFI_PERSISTENT_MEMORY          14
-#define EFI_MAX_MEMORY_TYPE            15
+#define EFI_UNACCEPTED_MEMORY          15
+#define EFI_MAX_MEMORY_TYPE            16
 
 /* Attribute values: */
 #define EFI_MEMORY_UC          ((u64)0x0000000000000001ULL)    /* uncached */
@@ -417,6 +418,7 @@ void efi_native_runtime_setup(void);
 #define LINUX_EFI_MOK_VARIABLE_TABLE_GUID      EFI_GUID(0xc451ed2b, 0x9694, 0x45d3,  0xba, 0xba, 0xed, 0x9f, 0x89, 0x88, 0xa3, 0x89)
 #define LINUX_EFI_COCO_SECRET_AREA_GUID                EFI_GUID(0xadf956ad, 0xe98c, 0x484c,  0xae, 0x11, 0xb5, 0x1c, 0x7d, 0x33, 0x64, 0x47)
 #define LINUX_EFI_BOOT_MEMMAP_GUID             EFI_GUID(0x800f683f, 0xd08b, 0x423a,  0xa2, 0x93, 0x96, 0x5c, 0x3c, 0x6f, 0xe2, 0xb4)
+#define LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID    EFI_GUID(0xd5d1de3c, 0x105c, 0x44f9,  0x9e, 0xa9, 0xbc, 0xef, 0x98, 0x12, 0x00, 0x31)
 
 #define RISCV_EFI_BOOT_PROTOCOL_GUID           EFI_GUID(0xccd15fec, 0x6f73, 0x4eec,  0x83, 0x95, 0x3e, 0x69, 0xe4, 0xb9, 0x40, 0xbf)
 
@@ -435,6 +437,9 @@ void efi_native_runtime_setup(void);
 #define DELLEMC_EFI_RCI2_TABLE_GUID            EFI_GUID(0x2d9f28a2, 0xa886, 0x456a,  0x97, 0xa8, 0xf1, 0x1e, 0xf2, 0x4f, 0xf4, 0x55)
 #define AMD_SEV_MEM_ENCRYPT_GUID               EFI_GUID(0x0cf29b71, 0x9e51, 0x433a,  0xa3, 0xb7, 0x81, 0xf3, 0xab, 0x16, 0xb8, 0x75)
 
+/* OVMF protocol GUIDs */
+#define OVMF_SEV_MEMORY_ACCEPTANCE_PROTOCOL_GUID       EFI_GUID(0xc5a010fe, 0x38a7, 0x4531,  0x8a, 0x4a, 0x05, 0x00, 0xd2, 0xfd, 0x16, 0x49)
+
 typedef struct {
        efi_guid_t guid;
        u64 table;
@@ -534,6 +539,14 @@ struct efi_boot_memmap {
        efi_memory_desc_t       map[];
 };
 
+struct efi_unaccepted_memory {
+       u32 version;
+       u32 unit_size;
+       u64 phys_base;
+       u64 size;
+       unsigned long bitmap[];
+};
+
 /*
  * Architecture independent structure for describing a memory map for the
  * benefit of efi_memmap_init_early(), and for passing context between
@@ -636,6 +649,7 @@ extern struct efi {
        unsigned long                   tpm_final_log;          /* TPM2 Final Events Log table */
        unsigned long                   mokvar_table;           /* MOK variable config table */
        unsigned long                   coco_secret;            /* Confidential computing secret table */
+       unsigned long                   unaccepted;             /* Unaccepted memory table */
 
        efi_get_time_t                  *get_time;
        efi_set_time_t                  *set_time;
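The new table describes unaccepted memory as a bitmap in which bit N covers the unit_size bytes starting at phys_base + N * unit_size. A minimal sketch of that arithmetic (illustrative helper; the in-kernel consumer may differ, and start/end are assumed to lie inside the table's range):

    static bool range_has_unaccepted(struct efi_unaccepted_memory *t,
                                     phys_addr_t start, phys_addr_t end)
    {
            unsigned long first = (start - t->phys_base) / t->unit_size;
            unsigned long last  = DIV_ROUND_UP(end - t->phys_base, t->unit_size);

            /* Any set bit in [first, last) means memory still needs accepting. */
            return find_next_bit(t->bitmap, last, first) < last;
    }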
index a139c64..b5d9bb2 100644 (file)
 
 #ifndef __ASSEMBLY__
 
+/**
+ * IS_ERR_VALUE - Detect an error pointer.
+ * @x: The pointer to check.
+ *
+ * Like IS_ERR(), but does not generate a compiler warning if result is unused.
+ */
 #define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)
 
+/**
+ * ERR_PTR - Create an error pointer.
+ * @error: A negative error code.
+ *
+ * Encodes @error into a pointer value. Users should consider the result
+ * opaque and not assume anything about how the error is encoded.
+ *
+ * Return: A pointer with @error encoded within its value.
+ */
 static inline void * __must_check ERR_PTR(long error)
 {
        return (void *) error;
 }
 
+/**
+ * PTR_ERR - Extract the error code from an error pointer.
+ * @ptr: An error pointer.
+ * Return: The error code within @ptr.
+ */
 static inline long __must_check PTR_ERR(__force const void *ptr)
 {
        return (long) ptr;
 }
 
+/**
+ * IS_ERR - Detect an error pointer.
+ * @ptr: The pointer to check.
+ * Return: true if @ptr is an error pointer, false otherwise.
+ */
 static inline bool __must_check IS_ERR(__force const void *ptr)
 {
        return IS_ERR_VALUE((unsigned long)ptr);
 }
 
+/**
+ * IS_ERR_OR_NULL - Detect an error pointer or a null pointer.
+ * @ptr: The pointer to check.
+ *
+ * Like IS_ERR(), but also returns true for a null pointer.
+ */
 static inline bool __must_check IS_ERR_OR_NULL(__force const void *ptr)
 {
        return unlikely(!ptr) || IS_ERR_VALUE((unsigned long)ptr);
@@ -54,6 +85,23 @@ static inline void * __must_check ERR_CAST(__force const void *ptr)
        return (void *) ptr;
 }
 
+/**
+ * PTR_ERR_OR_ZERO - Extract the error code from a pointer if it has one.
+ * @ptr: A potential error pointer.
+ *
+ * Convenience function that can be used inside a function that returns
+ * an error code to propagate errors received as error pointers.
+ * For example, ``return PTR_ERR_OR_ZERO(ptr);`` replaces:
+ *
+ * .. code-block:: c
+ *
+ *     if (IS_ERR(ptr))
+ *             return PTR_ERR(ptr);
+ *     else
+ *             return 0;
+ *
+ * Return: The error code within @ptr if it is an error pointer; 0 otherwise.
+ */
 static inline int __must_check PTR_ERR_OR_ZERO(__force const void *ptr)
 {
        if (IS_ERR(ptr))
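Taken together, the helpers documented above are typically composed like this (generic sketch; struct foo and both functions are hypothetical):

    struct foo { int id; };

    static struct foo *foo_create(void)
    {
            struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);

            if (!f)
                    return ERR_PTR(-ENOMEM);        /* encode errno in the pointer */
            return f;
    }

    static int foo_setup(void)
    {
            struct foo *f = foo_create();

            /* 0 on success, -ENOMEM if foo_create() returned an error pointer. */
            return PTR_ERR_OR_ZERO(f);
    }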
index 36a4865..b9d8365 100644 (file)
@@ -9,12 +9,12 @@
 #ifndef _LINUX_EVENTFD_H
 #define _LINUX_EVENTFD_H
 
-#include <linux/fcntl.h>
 #include <linux/wait.h>
 #include <linux/err.h>
 #include <linux/percpu-defs.h>
 #include <linux/percpu.h>
 #include <linux/sched.h>
+#include <uapi/linux/eventfd.h>
 
 /*
  * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
  * from eventfd, in order to leave a free define-space for
  * shared O_* flags.
  */
-#define EFD_SEMAPHORE (1 << 0)
-#define EFD_CLOEXEC O_CLOEXEC
-#define EFD_NONBLOCK O_NONBLOCK
-
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 
@@ -40,7 +36,7 @@ struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n);
-__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, unsigned mask);
+__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask);
 int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *wait,
                                  __u64 *cnt);
 void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
index 481abf5..6d5edef 100644 (file)
@@ -93,6 +93,15 @@ struct kmem_cache;
 
 bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order);
 
+#ifdef CONFIG_FAIL_PAGE_ALLOC
+bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order);
+#else
+static inline bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
+{
+       return false;
+}
+#endif /* CONFIG_FAIL_PAGE_ALLOC */
+
 int should_failslab(struct kmem_cache *s, gfp_t gfpflags);
 #ifdef CONFIG_FAILSLAB
 extern bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags);
index c9de1f5..da51a83 100644 (file)
@@ -20,7 +20,7 @@ void __write_overflow_field(size_t avail, size_t wanted) __compiletime_warning("
 ({                                                             \
        char *__p = (char *)(p);                                \
        size_t __ret = SIZE_MAX;                                \
-       size_t __p_size = __member_size(p);                     \
+       const size_t __p_size = __member_size(p);               \
        if (__p_size != SIZE_MAX &&                             \
            __builtin_constant_p(*__p)) {                       \
                size_t __p_len = __p_size - 1;                  \
@@ -142,7 +142,7 @@ extern char *__underlying_strncpy(char *p, const char *q, __kernel_size_t size)
 __FORTIFY_INLINE __diagnose_as(__builtin_strncpy, 1, 2, 3)
 char *strncpy(char * const POS p, const char *q, __kernel_size_t size)
 {
-       size_t p_size = __member_size(p);
+       const size_t p_size = __member_size(p);
 
        if (__compiletime_lessthan(p_size, size))
                __write_overflow();
@@ -151,33 +151,6 @@ char *strncpy(char * const POS p, const char *q, __kernel_size_t size)
        return __underlying_strncpy(p, q, size);
 }
 
-/**
- * strcat - Append a string to an existing string
- *
- * @p: pointer to NUL-terminated string to append to
- * @q: pointer to NUL-terminated source string to append from
- *
- * Do not use this function. While FORTIFY_SOURCE tries to avoid
- * read and write overflows, this is only possible when the
- * destination buffer size is known to the compiler. Prefer
- * building the string with formatting, via scnprintf() or similar.
- * At the very least, use strncat().
- *
- * Returns @p.
- *
- */
-__FORTIFY_INLINE __diagnose_as(__builtin_strcat, 1, 2)
-char *strcat(char * const POS p, const char *q)
-{
-       size_t p_size = __member_size(p);
-
-       if (p_size == SIZE_MAX)
-               return __underlying_strcat(p, q);
-       if (strlcat(p, q, p_size) >= p_size)
-               fortify_panic(__func__);
-       return p;
-}
-
 extern __kernel_size_t __real_strnlen(const char *, __kernel_size_t) __RENAME(strnlen);
 /**
  * strnlen - Return bounded count of characters in a NUL-terminated string
@@ -191,8 +164,8 @@ extern __kernel_size_t __real_strnlen(const char *, __kernel_size_t) __RENAME(st
  */
 __FORTIFY_INLINE __kernel_size_t strnlen(const char * const POS p, __kernel_size_t maxlen)
 {
-       size_t p_size = __member_size(p);
-       size_t p_len = __compiletime_strlen(p);
+       const size_t p_size = __member_size(p);
+       const size_t p_len = __compiletime_strlen(p);
        size_t ret;
 
        /* We can take compile-time actions when maxlen is const. */
@@ -233,8 +206,8 @@ __FORTIFY_INLINE __kernel_size_t strnlen(const char * const POS p, __kernel_size
 __FORTIFY_INLINE __diagnose_as(__builtin_strlen, 1)
 __kernel_size_t __fortify_strlen(const char * const POS p)
 {
+       const size_t p_size = __member_size(p);
        __kernel_size_t ret;
-       size_t p_size = __member_size(p);
 
        /* Give up if we don't know how large p is. */
        if (p_size == SIZE_MAX)
@@ -267,8 +240,8 @@ extern size_t __real_strlcpy(char *, const char *, size_t) __RENAME(strlcpy);
  */
 __FORTIFY_INLINE size_t strlcpy(char * const POS p, const char * const POS q, size_t size)
 {
-       size_t p_size = __member_size(p);
-       size_t q_size = __member_size(q);
+       const size_t p_size = __member_size(p);
+       const size_t q_size = __member_size(q);
        size_t q_len;   /* Full count of source string length. */
        size_t len;     /* Count of characters going into destination. */
 
@@ -299,8 +272,8 @@ extern ssize_t __real_strscpy(char *, const char *, size_t) __RENAME(strscpy);
  * @q: Where to copy the string from
  * @size: Size of destination buffer
  *
- * Copy the source string @p, or as much of it as fits, into the destination
- * @q buffer. The behavior is undefined if the string buffers overlap. The
+ * Copy the source string @q, or as much of it as fits, into the destination
+ * @p buffer. The behavior is undefined if the string buffers overlap. The
  * destination @p buffer is always NUL terminated, unless it's zero-sized.
  *
  * Preferred to strlcpy() since the API doesn't require reading memory
@@ -318,10 +291,10 @@ extern ssize_t __real_strscpy(char *, const char *, size_t) __RENAME(strscpy);
  */
 __FORTIFY_INLINE ssize_t strscpy(char * const POS p, const char * const POS q, size_t size)
 {
-       size_t len;
        /* Use string size rather than possible enclosing struct size. */
-       size_t p_size = __member_size(p);
-       size_t q_size = __member_size(q);
+       const size_t p_size = __member_size(p);
+       const size_t q_size = __member_size(q);
+       size_t len;
 
        /* If we cannot get size of p and q default to call strscpy. */
        if (p_size == SIZE_MAX && q_size == SIZE_MAX)
@@ -371,6 +344,96 @@ __FORTIFY_INLINE ssize_t strscpy(char * const POS p, const char * const POS q, s
        return __real_strscpy(p, q, len);
 }
 
+/* Defined after fortified strlen() to reuse it. */
+extern size_t __real_strlcat(char *p, const char *q, size_t avail) __RENAME(strlcat);
+/**
+ * strlcat - Append a string to an existing string
+ *
+ * @p: pointer to %NUL-terminated string to append to
+ * @q: pointer to %NUL-terminated string to append from
+ * @avail: Maximum bytes available in @p
+ *
+ * Appends %NUL-terminated string @q after the %NUL-terminated
+ * string at @p, but will not write beyond @avail bytes total,
+ * potentially truncating the copy from @q. @p will stay
+ * %NUL-terminated only if a %NUL already existed within
+ * the @avail bytes of @p. If so, the resulting number of
+ * bytes copied from @q will be at most "@avail - strlen(@p) - 1".
+ *
+ * Do not use this function. While FORTIFY_SOURCE tries to avoid
+ * read and write overflows, this is only possible when the sizes
+ * of @p and @q are known to the compiler. Prefer building the
+ * string with formatting, via scnprintf(), seq_buf, or similar.
+ *
+ * Returns total bytes that _would_ have been contained by @p
+ * regardless of truncation, similar to snprintf(). If return
+ * value is >= @avail, the string has been truncated.
+ *
+ */
+__FORTIFY_INLINE
+size_t strlcat(char * const POS p, const char * const POS q, size_t avail)
+{
+       const size_t p_size = __member_size(p);
+       const size_t q_size = __member_size(q);
+       size_t p_len, copy_len;
+       size_t actual, wanted;
+
+       /* Give up immediately if both buffer sizes are unknown. */
+       if (p_size == SIZE_MAX && q_size == SIZE_MAX)
+               return __real_strlcat(p, q, avail);
+
+       p_len = strnlen(p, avail);
+       copy_len = strlen(q);
+       wanted = actual = p_len + copy_len;
+
+       /* Cannot append any more: report truncation. */
+       if (avail <= p_len)
+               return wanted;
+
+       /* Give up if string is already overflowed. */
+       if (p_size <= p_len)
+               fortify_panic(__func__);
+
+       if (actual >= avail) {
+               copy_len = avail - p_len - 1;
+               actual = p_len + copy_len;
+       }
+
+       /* Give up if copy will overflow. */
+       if (p_size <= actual)
+               fortify_panic(__func__);
+       __underlying_memcpy(p + p_len, q, copy_len);
+       p[actual] = '\0';
+
+       return wanted;
+}
+
+/* Defined after fortified strlcat() to reuse it. */
+/**
+ * strcat - Append a string to an existing string
+ *
+ * @p: pointer to NUL-terminated string to append to
+ * @q: pointer to NUL-terminated source string to append from
+ *
+ * Do not use this function. While FORTIFY_SOURCE tries to avoid
+ * read and write overflows, this is only possible when the
+ * destination buffer size is known to the compiler. Prefer
+ * building the string with formatting, via scnprintf() or similar.
+ * At the very least, use strncat().
+ *
+ * Returns @p.
+ *
+ */
+__FORTIFY_INLINE __diagnose_as(__builtin_strcat, 1, 2)
+char *strcat(char * const POS p, const char *q)
+{
+       const size_t p_size = __member_size(p);
+
+       if (strlcat(p, q, p_size) >= p_size)
+               fortify_panic(__func__);
+       return p;
+}
+
 /**
  * strncat - Append a string to an existing string
  *
@@ -394,9 +457,9 @@ __FORTIFY_INLINE ssize_t strscpy(char * const POS p, const char * const POS q, s
 __FORTIFY_INLINE __diagnose_as(__builtin_strncat, 1, 2, 3)
 char *strncat(char * const POS p, const char * const POS q, __kernel_size_t count)
 {
+       const size_t p_size = __member_size(p);
+       const size_t q_size = __member_size(q);
        size_t p_len, copy_len;
-       size_t p_size = __member_size(p);
-       size_t q_size = __member_size(q);
 
        if (p_size == SIZE_MAX && q_size == SIZE_MAX)
                return __underlying_strncat(p, q, count);
@@ -639,7 +702,7 @@ __FORTIFY_INLINE bool fortify_memcpy_chk(__kernel_size_t size,
 extern void *__real_memscan(void *, int, __kernel_size_t) __RENAME(memscan);
 __FORTIFY_INLINE void *memscan(void * const POS0 p, int c, __kernel_size_t size)
 {
-       size_t p_size = __struct_size(p);
+       const size_t p_size = __struct_size(p);
 
        if (__compiletime_lessthan(p_size, size))
                __read_overflow();
@@ -651,8 +714,8 @@ __FORTIFY_INLINE void *memscan(void * const POS0 p, int c, __kernel_size_t size)
 __FORTIFY_INLINE __diagnose_as(__builtin_memcmp, 1, 2, 3)
 int memcmp(const void * const POS0 p, const void * const POS0 q, __kernel_size_t size)
 {
-       size_t p_size = __struct_size(p);
-       size_t q_size = __struct_size(q);
+       const size_t p_size = __struct_size(p);
+       const size_t q_size = __struct_size(q);
 
        if (__builtin_constant_p(size)) {
                if (__compiletime_lessthan(p_size, size))
@@ -668,7 +731,7 @@ int memcmp(const void * const POS0 p, const void * const POS0 q, __kernel_size_t
 __FORTIFY_INLINE __diagnose_as(__builtin_memchr, 1, 2, 3)
 void *memchr(const void * const POS0 p, int c, __kernel_size_t size)
 {
-       size_t p_size = __struct_size(p);
+       const size_t p_size = __struct_size(p);
 
        if (__compiletime_lessthan(p_size, size))
                __read_overflow();
@@ -680,7 +743,7 @@ void *memchr(const void * const POS0 p, int c, __kernel_size_t size)
 void *__real_memchr_inv(const void *s, int c, size_t n) __RENAME(memchr_inv);
 __FORTIFY_INLINE void *memchr_inv(const void * const POS0 p, int c, size_t size)
 {
-       size_t p_size = __struct_size(p);
+       const size_t p_size = __struct_size(p);
 
        if (__compiletime_lessthan(p_size, size))
                __read_overflow();
@@ -693,7 +756,7 @@ extern void *__real_kmemdup(const void *src, size_t len, gfp_t gfp) __RENAME(kme
                                                                    __realloc_size(2);
 __FORTIFY_INLINE void *kmemdup(const void * const POS0 p, size_t size, gfp_t gfp)
 {
-       size_t p_size = __struct_size(p);
+       const size_t p_size = __struct_size(p);
 
        if (__compiletime_lessthan(p_size, size))
                __read_overflow();
@@ -720,8 +783,8 @@ __FORTIFY_INLINE void *kmemdup(const void * const POS0 p, size_t size, gfp_t gfp
 __FORTIFY_INLINE __diagnose_as(__builtin_strcpy, 1, 2)
 char *strcpy(char * const POS p, const char * const POS q)
 {
-       size_t p_size = __member_size(p);
-       size_t q_size = __member_size(q);
+       const size_t p_size = __member_size(p);
+       const size_t q_size = __member_size(q);
        size_t size;
 
        /* If neither buffer size is known, immediately give up. */
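As a reference for the truncation semantics of the fortified strlcat() added above, a small sketch (buffer sizes are visible to the compiler here, so the inline variant is the one being exercised):

    static size_t strlcat_demo(void)
    {
            char buf[8] = "abc";
            size_t want = strlcat(buf, "defghij", sizeof(buf));

            /*
             * buf now holds "abcdefg" (7 characters plus NUL); want is 10,
             * and want >= sizeof(buf) tells the caller truncation occurred.
             */
            return want;
    }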
index a631bac..eaa0ac5 100644 (file)
@@ -10,7 +10,7 @@
 struct frontswap_ops {
        void (*init)(unsigned); /* this swap type was just swapon'ed */
        int (*store)(unsigned, pgoff_t, struct page *); /* store a page */
-       int (*load)(unsigned, pgoff_t, struct page *); /* load a page */
+       int (*load)(unsigned, pgoff_t, struct page *, bool *); /* load a page */
        void (*invalidate_page)(unsigned, pgoff_t); /* page no longer needed */
        void (*invalidate_area)(unsigned); /* swap type just swapoff'ed */
 };
index 67998c6..d4b67bd 100644 (file)
@@ -119,13 +119,6 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define FMODE_PWRITE           ((__force fmode_t)0x10)
 /* File is opened for execution with sys_execve / sys_uselib */
 #define FMODE_EXEC             ((__force fmode_t)0x20)
-/* File is opened with O_NDELAY (only set for block devices) */
-#define FMODE_NDELAY           ((__force fmode_t)0x40)
-/* File is opened with O_EXCL (only set for block devices) */
-#define FMODE_EXCL             ((__force fmode_t)0x80)
-/* File is opened using open(.., 3, ..) and is writeable only for ioctls
-   (specialy hack for floppy.c) */
-#define FMODE_WRITE_IOCTL      ((__force fmode_t)0x100)
 /* 32bit hashes as llseek() offset (for directories) */
 #define FMODE_32BITHASH         ((__force fmode_t)0x200)
 /* 64bit hashes as llseek() offset (for directories) */
@@ -171,6 +164,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File supports non-exclusive O_DIRECT writes from multiple threads */
 #define FMODE_DIO_PARALLEL_WRITE       ((__force fmode_t)0x1000000)
 
+/* File is embedded in backing_file object */
+#define FMODE_BACKING          ((__force fmode_t)0x2000000)
+
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY         ((__force fmode_t)0x4000000)
 
@@ -956,29 +952,35 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
                index <  ra->start + ra->size);
 }
 
+/*
+ * f_{lock,count,pos_lock} members can be highly contended and share
+ * the same cacheline. f_{lock,mode} are very frequently used together
+ * and so share the same cacheline as well. The read-mostly
+ * f_{path,inode,op} are kept on a separate cacheline.
+ */
 struct file {
        union {
                struct llist_node       f_llist;
                struct rcu_head         f_rcuhead;
                unsigned int            f_iocb_flags;
        };
-       struct path             f_path;
-       struct inode            *f_inode;       /* cached value */
-       const struct file_operations    *f_op;
 
        /*
         * Protects f_ep, f_flags.
         * Must not be taken from IRQ context.
         */
        spinlock_t              f_lock;
-       atomic_long_t           f_count;
-       unsigned int            f_flags;
        fmode_t                 f_mode;
+       atomic_long_t           f_count;
        struct mutex            f_pos_lock;
        loff_t                  f_pos;
+       unsigned int            f_flags;
        struct fown_struct      f_owner;
        const struct cred       *f_cred;
        struct file_ra_state    f_ra;
+       struct path             f_path;
+       struct inode            *f_inode;       /* cached value */
+       const struct file_operations    *f_op;
 
        u64                     f_version;
 #ifdef CONFIG_SECURITY
@@ -1215,7 +1217,6 @@ struct super_block {
        uuid_t                  s_uuid;         /* UUID */
 
        unsigned int            s_max_links;
-       fmode_t                 s_mode;
 
        /*
         * The next field is for VFS *only*. No filesystems have any business
@@ -1242,7 +1243,7 @@ struct super_block {
         */
        atomic_long_t s_fsnotify_connectors;
 
-       /* Being remounted read-only */
+       /* Read-only state of the superblock is being changed */
        int s_readonly_remount;
 
        /* per-sb errseq_t for reporting writeback errors via syncfs */
@@ -1672,9 +1673,12 @@ static inline int vfs_whiteout(struct mnt_idmap *idmap,
                         WHITEOUT_DEV);
 }
 
-struct file *vfs_tmpfile_open(struct mnt_idmap *idmap,
-                       const struct path *parentpath,
-                       umode_t mode, int open_flag, const struct cred *cred);
+struct file *kernel_tmpfile_open(struct mnt_idmap *idmap,
+                                const struct path *parentpath,
+                                umode_t mode, int open_flag,
+                                const struct cred *cred);
+struct file *kernel_file_open(const struct path *path, int flags,
+                             struct inode *inode, const struct cred *cred);
 
 int vfs_mkobj(struct dentry *, umode_t,
                int (*f)(struct dentry *, umode_t, void *),
@@ -1932,6 +1936,7 @@ struct super_operations {
                                  struct shrink_control *);
        long (*free_cached_objects)(struct super_block *,
                                    struct shrink_control *);
+       void (*shutdown)(struct super_block *sb);
 };
 
 /*
@@ -2349,11 +2354,31 @@ static inline struct file *file_open_root_mnt(struct vfsmount *mnt,
        return file_open_root(&(struct path){.mnt = mnt, .dentry = mnt->mnt_root},
                              name, flags, mode);
 }
-extern struct file * dentry_open(const struct path *, int, const struct cred *);
-extern struct file *dentry_create(const struct path *path, int flags,
-                                 umode_t mode, const struct cred *cred);
-extern struct file * open_with_fake_path(const struct path *, int,
-                                        struct inode*, const struct cred *);
+struct file *dentry_open(const struct path *path, int flags,
+                        const struct cred *creds);
+struct file *dentry_create(const struct path *path, int flags, umode_t mode,
+                          const struct cred *cred);
+struct file *backing_file_open(const struct path *path, int flags,
+                              const struct path *real_path,
+                              const struct cred *cred);
+struct path *backing_file_real_path(struct file *f);
+
+/*
+ * file_real_path - get the path corresponding to f_inode
+ *
+ * When opening a backing file for a stackable filesystem (e.g.,
+ * overlayfs) f_path may be on the stackable filesystem and f_inode on
+ * the underlying filesystem.  When the path associated with f_inode is
+ * needed, this helper should be used instead of accessing f_path
+ * directly.
+ */
+static inline const struct path *file_real_path(struct file *f)
+{
+       if (unlikely(f->f_mode & FMODE_BACKING))
+               return backing_file_real_path(f);
+       return &f->f_path;
+}
+
 static inline struct file *file_clone_open(struct file *file)
 {
        return dentry_open(&file->f_path, file->f_flags, file->f_cred);
@@ -2669,7 +2694,7 @@ extern void evict_inodes(struct super_block *sb);
 void dump_mapping(const struct address_space *);
 
 /*
- * Userspace may rely on the the inode number being non-zero. For example, glibc
+ * Userspace may rely on the inode number being non-zero. For example, glibc
  * simply ignores files with zero i_ino in unlink() and other places.
  *
  * As an additional complication, if userspace was compiled with
@@ -2738,6 +2763,8 @@ extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_direct_write(struct kiocb *, struct iov_iter *);
 ssize_t generic_perform_write(struct kiocb *, struct iov_iter *);
+ssize_t direct_write_fallback(struct kiocb *iocb, struct iov_iter *iter,
+               ssize_t direct_written, ssize_t buffered_written);
 
 ssize_t vfs_iter_read(struct file *file, struct iov_iter *iter, loff_t *ppos,
                rwf_t flags);
@@ -2752,11 +2779,9 @@ ssize_t vfs_iocb_iter_write(struct file *file, struct kiocb *iocb,
 ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
                            struct pipe_inode_info *pipe,
                            size_t len, unsigned int flags);
-ssize_t direct_splice_read(struct file *in, loff_t *ppos,
-                          struct pipe_inode_info *pipe,
-                          size_t len, unsigned int flags);
-extern ssize_t generic_file_splice_read(struct file *, loff_t *,
-               struct pipe_inode_info *, size_t, unsigned int);
+ssize_t copy_splice_read(struct file *in, loff_t *ppos,
+                        struct pipe_inode_info *pipe,
+                        size_t len, unsigned int flags);
 extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
                struct file *, loff_t *, size_t, unsigned int);
 extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
@@ -2835,11 +2860,6 @@ static inline void inode_dio_end(struct inode *inode)
                wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
 }
 
-/*
- * Warn about a page cache invalidation failure diring a direct I/O write.
- */
-void dio_warn_stale_pagecache(struct file *filp);
-
 extern void inode_set_flags(struct inode *inode, unsigned int flags,
                            unsigned int mask);
 
index bb8467c..ed48e4f 100644 (file)
@@ -91,11 +91,13 @@ static inline void fsnotify_dentry(struct dentry *dentry, __u32 mask)
 
 static inline int fsnotify_file(struct file *file, __u32 mask)
 {
-       const struct path *path = &file->f_path;
+       const struct path *path;
 
        if (file->f_mode & FMODE_NONOTIFY)
                return 0;
 
+       /* Overlayfs internal files have fake f_path */
+       path = file_real_path(file);
        return fsnotify_parent(path->dentry, mask, path, FSNOTIFY_EVENT_PATH);
 }
 
index e76605d..1eb7eae 100644 (file)
@@ -143,8 +143,8 @@ int fsverity_ioctl_enable(struct file *filp, const void __user *arg);
 
 int fsverity_ioctl_measure(struct file *filp, void __user *arg);
 int fsverity_get_digest(struct inode *inode,
-                       u8 digest[FS_VERITY_MAX_DIGEST_SIZE],
-                       enum hash_algo *alg);
+                       u8 raw_digest[FS_VERITY_MAX_DIGEST_SIZE],
+                       u8 *alg, enum hash_algo *halg);
 
 /* open.c */
 
@@ -197,10 +197,14 @@ static inline int fsverity_ioctl_measure(struct file *filp, void __user *arg)
 }
 
 static inline int fsverity_get_digest(struct inode *inode,
-                                     u8 digest[FS_VERITY_MAX_DIGEST_SIZE],
-                                     enum hash_algo *alg)
+                                     u8 raw_digest[FS_VERITY_MAX_DIGEST_SIZE],
+                                     u8 *alg, enum hash_algo *halg)
 {
-       return -EOPNOTSUPP;
+       /*
+        * fsverity is not enabled in the kernel configuration, so always report
+        * that the file doesn't have fsverity enabled (digest size 0).
+        */
+       return 0;
 }
 
 /* open.c */
index ed8cb53..665f066 100644 (file)
@@ -338,19 +338,12 @@ extern gfp_t gfp_allowed_mask;
 /* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
 
-extern void pm_restrict_gfp_mask(void);
-extern void pm_restore_gfp_mask(void);
-
-extern gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma);
-
-#ifdef CONFIG_PM_SLEEP
-extern bool pm_suspended_storage(void);
-#else
-static inline bool pm_suspended_storage(void)
+static inline bool gfp_has_io_fs(gfp_t gfp)
 {
-       return false;
+       return (gfp & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS);
 }
-#endif /* CONFIG_PM_SLEEP */
+
+extern gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma);
 
 #ifdef CONFIG_CONTIG_ALLOC
 /* The below functions must be run on a range from a single zone. */
index 5c6db55..67b8774 100644 (file)
@@ -252,6 +252,14 @@ struct gpio_irq_chip {
        bool initialized;
 
        /**
+        * @domain_is_allocated_externally:
+        *
+        * True it the irq_domain was allocated outside of gpiolib, in which
+        * True if the irq_domain was allocated outside of gpiolib, in which
+        */
+       bool domain_is_allocated_externally;
+
+       /**
         * @init_hw: optional routine to initialize hardware before
         * an IRQ chip will be added. This is quite useful when
         * a particular driver wants to clear IRQ related registers
index 4de1dbc..68da306 100644 (file)
@@ -507,7 +507,7 @@ static inline void folio_zero_range(struct folio *folio,
        zero_user_segments(&folio->page, start, start + length, 0, 0);
 }
 
-static inline void put_and_unmap_page(struct page *page, void *addr)
+static inline void unmap_and_put_page(struct page *page, void *addr)
 {
        kunmap_local(addr);
        put_page(page);
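The rename gives the helper the usual verb order (unmap, then put). A typical pairing looks like this (generic sketch; read_mapping_page() is just one common way to obtain the page):

    static int peek_first_byte(struct address_space *mapping, u8 *out)
    {
            struct page *page = read_mapping_page(mapping, 0, NULL);
            u8 *addr;

            if (IS_ERR(page))
                    return PTR_ERR(page);

            addr = kmap_local_page(page);
            *out = addr[0];
            unmap_and_put_page(page, addr);     /* kunmap_local() + put_page() */
            return 0;
    }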
index 6d041aa..ca3c8e1 100644 (file)
@@ -133,9 +133,8 @@ int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
 struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
                                unsigned long address, unsigned int flags);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
-                        struct page **, struct vm_area_struct **,
-                        unsigned long *, unsigned long *, long, unsigned int,
-                        int *);
+                        struct page **, unsigned long *, unsigned long *,
+                        long, unsigned int, int *);
 void unmap_hugepage_range(struct vm_area_struct *,
                          unsigned long, unsigned long, struct page *,
                          zap_flags_t);
@@ -306,9 +305,8 @@ static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 
 static inline long follow_hugetlb_page(struct mm_struct *mm,
                        struct vm_area_struct *vma, struct page **pages,
-                       struct vm_area_struct **vmas, unsigned long *position,
-                       unsigned long *nr_pages, long i, unsigned int flags,
-                       int *nonblocking)
+                       unsigned long *position, unsigned long *nr_pages,
+                       long i, unsigned int flags, int *nonblocking)
 {
        BUG();
        return 0;
@@ -757,26 +755,12 @@ static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio
        return folio->_hugetlb_subpool;
 }
 
-/*
- * hugetlb page subpool pointer located in hpage[2].hugetlb_subpool
- */
-static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
-{
-       return hugetlb_folio_subpool(page_folio(hpage));
-}
-
 static inline void hugetlb_set_folio_subpool(struct folio *folio,
                                        struct hugepage_subpool *subpool)
 {
        folio->_hugetlb_subpool = subpool;
 }
 
-static inline void hugetlb_set_page_subpool(struct page *hpage,
-                                       struct hugepage_subpool *subpool)
-{
-       hugetlb_set_folio_subpool(page_folio(hpage), subpool);
-}
-
 static inline struct hstate *hstate_file(struct file *f)
 {
        return hstate_inode(file_inode(f));
@@ -1031,11 +1015,6 @@ static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio
        return NULL;
 }
 
-static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
-{
-       return NULL;
-}
-
 static inline int isolate_or_dissolve_huge_page(struct page *page,
                                                struct list_head *list)
 {
@@ -1200,7 +1179,11 @@ static inline void hugetlb_count_sub(long l, struct mm_struct *mm)
 static inline pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
                                          unsigned long addr, pte_t *ptep)
 {
+#ifdef CONFIG_MMU
+       return ptep_get(ptep);
+#else
        return *ptep;
+#endif
 }
 
 static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
index 81413cd..d28a5e8 100644 (file)
@@ -722,7 +722,7 @@ static inline void *iio_device_get_drvdata(const struct iio_dev *indio_dev)
  * must not share  cachelines with the rest of the structure, thus making
  * them safe for use with non-coherent DMA.
  */
-#define IIO_DMA_MINALIGN ARCH_KMALLOC_MINALIGN
+#define IIO_DMA_MINALIGN ARCH_DMA_MINALIGN
 struct iio_dev *iio_device_alloc(struct device *parent, int sizeof_priv);
 
 /* The information at the returned address is guaranteed to be cacheline aligned */
index c5fe6d2..266c3e1 100644 (file)
@@ -152,6 +152,23 @@ extern unsigned int reset_devices;
 void setup_arch(char **);
 void prepare_namespace(void);
 void __init init_rootfs(void);
+
+void init_IRQ(void);
+void time_init(void);
+void poking_init(void);
+void pgtable_cache_init(void);
+
+extern initcall_entry_t __initcall_start[];
+extern initcall_entry_t __initcall0_start[];
+extern initcall_entry_t __initcall1_start[];
+extern initcall_entry_t __initcall2_start[];
+extern initcall_entry_t __initcall3_start[];
+extern initcall_entry_t __initcall4_start[];
+extern initcall_entry_t __initcall5_start[];
+extern initcall_entry_t __initcall6_start[];
+extern initcall_entry_t __initcall7_start[];
+extern initcall_entry_t __initcall_end[];
+
 extern struct file_system_type rootfs_fs_type;
 
 #if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_STRICT_MODULE_RWX)
@@ -309,6 +326,8 @@ struct obs_kernel_param {
        int early;
 };
 
+extern const struct obs_kernel_param __setup_start[], __setup_end[];
+
 /*
  * Only for really core code.  See moduleparam.h for the normal way.
  *
index 9f4b6f5..e6936cb 100644 (file)
 #include <linux/powercap.h>
 #include <linux/cpuhotplug.h>
 
+enum rapl_if_type {
+       RAPL_IF_MSR,    /* RAPL I/F using MSR registers */
+       RAPL_IF_MMIO,   /* RAPL I/F using MMIO registers */
+       RAPL_IF_TPMI,   /* RAPL I/F using TPMI registers */
+};
+
 enum rapl_domain_type {
        RAPL_DOMAIN_PACKAGE,    /* entire package/socket */
        RAPL_DOMAIN_PP0,        /* core power plane */
@@ -30,17 +36,23 @@ enum rapl_domain_reg_id {
        RAPL_DOMAIN_REG_POLICY,
        RAPL_DOMAIN_REG_INFO,
        RAPL_DOMAIN_REG_PL4,
+       RAPL_DOMAIN_REG_UNIT,
+       RAPL_DOMAIN_REG_PL2,
        RAPL_DOMAIN_REG_MAX,
 };
 
 struct rapl_domain;
 
 enum rapl_primitives {
-       ENERGY_COUNTER,
        POWER_LIMIT1,
        POWER_LIMIT2,
        POWER_LIMIT4,
+       ENERGY_COUNTER,
        FW_LOCK,
+       FW_HIGH_LOCK,
+       PL1_LOCK,
+       PL2_LOCK,
+       PL4_LOCK,
 
        PL1_ENABLE,             /* power limit 1, aka long term */
        PL1_CLAMP,              /* allow frequency to go below OS request */
@@ -74,12 +86,13 @@ struct rapl_domain_data {
        unsigned long timestamp;
 };
 
-#define NR_POWER_LIMITS (3)
+#define NR_POWER_LIMITS        (POWER_LIMIT4 + 1)
+
 struct rapl_power_limit {
        struct powercap_zone_constraint *constraint;
-       int prim_id;            /* primitive ID used to enable */
        struct rapl_domain *domain;
        const char *name;
+       bool locked;
        u64 last_power_limit;
 };
 
@@ -96,7 +109,9 @@ struct rapl_domain {
        struct rapl_power_limit rpl[NR_POWER_LIMITS];
        u64 attr_map;           /* track capabilities */
        unsigned int state;
-       unsigned int domain_energy_unit;
+       unsigned int power_unit;
+       unsigned int energy_unit;
+       unsigned int time_unit;
        struct rapl_package *rp;
 };
 
@@ -121,16 +136,20 @@ struct reg_action {
  *                             registers.
  * @write_raw:                 Callback for writing RAPL interface specific
  *                             registers.
+ * @defaults:                  internal pointer to interface default settings
+ * @rpi:                       internal pointer to interface primitive info
  */
 struct rapl_if_priv {
+       enum rapl_if_type type;
        struct powercap_control_type *control_type;
-       struct rapl_domain *platform_rapl_domain;
        enum cpuhp_state pcap_rapl_online;
        u64 reg_unit;
        u64 regs[RAPL_DOMAIN_MAX][RAPL_DOMAIN_REG_MAX];
        int limits[RAPL_DOMAIN_MAX];
-       int (*read_raw)(int cpu, struct reg_action *ra);
-       int (*write_raw)(int cpu, struct reg_action *ra);
+       int (*read_raw)(int id, struct reg_action *ra);
+       int (*write_raw)(int id, struct reg_action *ra);
+       void *defaults;
+       void *rpi;
 };
 
 /* maximum rapl package domain name: package-%d-die-%d */
@@ -140,9 +159,6 @@ struct rapl_package {
        unsigned int id;        /* logical die id, equals physical 1-die systems */
        unsigned int nr_domains;
        unsigned long domain_map;       /* bit map of active domains */
-       unsigned int power_unit;
-       unsigned int energy_unit;
-       unsigned int time_unit;
        struct rapl_domain *domains;    /* array of domains, sized at runtime */
        struct powercap_zone *power_zone;       /* keep track of parent zone */
        unsigned long power_limit_irq;  /* keep track of package power limit
@@ -156,8 +172,8 @@ struct rapl_package {
        struct rapl_if_priv *priv;
 };
 
-struct rapl_package *rapl_find_package_domain(int cpu, struct rapl_if_priv *priv);
-struct rapl_package *rapl_add_package(int cpu, struct rapl_if_priv *priv);
+struct rapl_package *rapl_find_package_domain(int id, struct rapl_if_priv *priv, bool id_is_cpu);
+struct rapl_package *rapl_add_package(int id, struct rapl_if_priv *priv, bool id_is_cpu);
 void rapl_remove_package(struct rapl_package *rp);
 
 #endif /* __INTEL_RAPL_H__ */
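With the reworked interface an implementation registers by id rather than by CPU number and states whether that id is a CPU. A hedged sketch of a TPMI-style registration (all example_* names are hypothetical and the error handling is illustrative):

    static int example_read_raw(int id, struct reg_action *ra)  { return 0; }
    static int example_write_raw(int id, struct reg_action *ra) { return 0; }

    static struct rapl_if_priv example_priv = {
            .type      = RAPL_IF_TPMI,
            .read_raw  = example_read_raw,
            .write_raw = example_write_raw,
    };

    static int example_add_package(int pkg_id)
    {
            /* pkg_id is a logical package id, not a CPU, hence id_is_cpu == false. */
            struct rapl_package *rp = rapl_find_package_domain(pkg_id, &example_priv, false);

            if (!rp)
                    rp = rapl_add_package(pkg_id, &example_priv, false);
            return IS_ERR(rp) ? PTR_ERR(rp) : 0;
    }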
index 308f4f0..7304f2a 100644 (file)
@@ -68,6 +68,11 @@ void *devm_memremap(struct device *dev, resource_size_t offset,
                size_t size, unsigned long flags);
 void devm_memunmap(struct device *dev, void *addr);
 
+/* architectures can override this */
+pgprot_t __init early_memremap_pgprot_adjust(resource_size_t phys_addr,
+                                       unsigned long size, pgprot_t prot);
+
+
 #ifdef CONFIG_PCI
 /*
  * The PCI specifications (Rev 3.0, 3.2.5 "Transaction Ordering and
index 7fe31b2..bb9c666 100644 (file)
@@ -46,13 +46,23 @@ int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
                              struct iov_iter *iter, void *ioucmd);
 void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2,
                        unsigned issue_flags);
-void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
-                       void (*task_work_cb)(struct io_uring_cmd *, unsigned));
 struct sock *io_uring_get_socket(struct file *file);
 void __io_uring_cancel(bool cancel_all);
 void __io_uring_free(struct task_struct *tsk);
 void io_uring_unreg_ringfd(void);
 const char *io_uring_get_opcode(u8 opcode);
+void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
+                           void (*task_work_cb)(struct io_uring_cmd *, unsigned),
+                           unsigned flags);
+/* users should follow semantics of IOU_F_TWQ_LAZY_WAKE */
+void io_uring_cmd_do_in_task_lazy(struct io_uring_cmd *ioucmd,
+                       void (*task_work_cb)(struct io_uring_cmd *, unsigned));
+
+static inline void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+                       void (*task_work_cb)(struct io_uring_cmd *, unsigned))
+{
+       __io_uring_cmd_do_in_task(ioucmd, task_work_cb, 0);
+}
 
 static inline void io_uring_files_cancel(void)
 {
@@ -85,6 +95,10 @@ static inline void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
                        void (*task_work_cb)(struct io_uring_cmd *, unsigned))
 {
 }
+static inline void io_uring_cmd_do_in_task_lazy(struct io_uring_cmd *ioucmd,
+                       void (*task_work_cb)(struct io_uring_cmd *, unsigned))
+{
+}
 static inline struct sock *io_uring_get_socket(struct file *file)
 {
        return NULL;
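A short sketch of how a command provider is expected to use the new task-work helpers (the example_* names are hypothetical; io_uring_cmd_done() is the completion declared further up):

    static void example_cmd_done(struct io_uring_cmd *ioucmd, unsigned issue_flags)
    {
            io_uring_cmd_done(ioucmd, 0, 0, issue_flags);
    }

    /* Called from e.g. an interrupt path once the device work finishes. */
    static void example_complete(struct io_uring_cmd *ioucmd)
    {
            /* Lazy variant: the wakeup may be batched per IOU_F_TWQ_LAZY_WAKE. */
            io_uring_cmd_do_in_task_lazy(ioucmd, example_cmd_done);
    }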
index 1b2a20a..f04ce51 100644 (file)
@@ -211,6 +211,16 @@ struct io_ring_ctx {
                unsigned int            compat: 1;
 
                enum task_work_notify_mode      notify_method;
+
+               /*
+                * If IORING_SETUP_NO_MMAP is used, then the below holds
+                * the gup'ed pages for the two rings, and the sqes.
+                */
+               unsigned short          n_ring_pages;
+               unsigned short          n_sqe_pages;
+               struct page             **ring_pages;
+               struct page             **sqe_pages;
+
                struct io_rings                 *rings;
                struct task_struct              *submitter_task;
                struct percpu_ref               refs;
index b1b28af..d8a6fdc 100644 (file)
@@ -223,32 +223,35 @@ struct irq_data {
  *                               irq_chip::irq_set_affinity() when deactivated.
  * IRQD_IRQ_ENABLED_ON_SUSPEND - Interrupt is enabled on suspend by irq pm if
  *                               irqchip have flag IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND set.
+ * IRQD_RESEND_WHEN_IN_PROGRESS        - Interrupt may fire when already in progress, in which
+ *                               case it must be resent at the next available opportunity.
  */
 enum {
        IRQD_TRIGGER_MASK               = 0xf,
-       IRQD_SETAFFINITY_PENDING        = (1 <<  8),
-       IRQD_ACTIVATED                  = (1 <<  9),
-       IRQD_NO_BALANCING               = (1 << 10),
-       IRQD_PER_CPU                    = (1 << 11),
-       IRQD_AFFINITY_SET               = (1 << 12),
-       IRQD_LEVEL                      = (1 << 13),
-       IRQD_WAKEUP_STATE               = (1 << 14),
-       IRQD_MOVE_PCNTXT                = (1 << 15),
-       IRQD_IRQ_DISABLED               = (1 << 16),
-       IRQD_IRQ_MASKED                 = (1 << 17),
-       IRQD_IRQ_INPROGRESS             = (1 << 18),
-       IRQD_WAKEUP_ARMED               = (1 << 19),
-       IRQD_FORWARDED_TO_VCPU          = (1 << 20),
-       IRQD_AFFINITY_MANAGED           = (1 << 21),
-       IRQD_IRQ_STARTED                = (1 << 22),
-       IRQD_MANAGED_SHUTDOWN           = (1 << 23),
-       IRQD_SINGLE_TARGET              = (1 << 24),
-       IRQD_DEFAULT_TRIGGER_SET        = (1 << 25),
-       IRQD_CAN_RESERVE                = (1 << 26),
-       IRQD_MSI_NOMASK_QUIRK           = (1 << 27),
-       IRQD_HANDLE_ENFORCE_IRQCTX      = (1 << 28),
-       IRQD_AFFINITY_ON_ACTIVATE       = (1 << 29),
-       IRQD_IRQ_ENABLED_ON_SUSPEND     = (1 << 30),
+       IRQD_SETAFFINITY_PENDING        = BIT(8),
+       IRQD_ACTIVATED                  = BIT(9),
+       IRQD_NO_BALANCING               = BIT(10),
+       IRQD_PER_CPU                    = BIT(11),
+       IRQD_AFFINITY_SET               = BIT(12),
+       IRQD_LEVEL                      = BIT(13),
+       IRQD_WAKEUP_STATE               = BIT(14),
+       IRQD_MOVE_PCNTXT                = BIT(15),
+       IRQD_IRQ_DISABLED               = BIT(16),
+       IRQD_IRQ_MASKED                 = BIT(17),
+       IRQD_IRQ_INPROGRESS             = BIT(18),
+       IRQD_WAKEUP_ARMED               = BIT(19),
+       IRQD_FORWARDED_TO_VCPU          = BIT(20),
+       IRQD_AFFINITY_MANAGED           = BIT(21),
+       IRQD_IRQ_STARTED                = BIT(22),
+       IRQD_MANAGED_SHUTDOWN           = BIT(23),
+       IRQD_SINGLE_TARGET              = BIT(24),
+       IRQD_DEFAULT_TRIGGER_SET        = BIT(25),
+       IRQD_CAN_RESERVE                = BIT(26),
+       IRQD_MSI_NOMASK_QUIRK           = BIT(27),
+       IRQD_HANDLE_ENFORCE_IRQCTX      = BIT(28),
+       IRQD_AFFINITY_ON_ACTIVATE       = BIT(29),
+       IRQD_IRQ_ENABLED_ON_SUSPEND     = BIT(30),
+       IRQD_RESEND_WHEN_IN_PROGRESS    = BIT(31),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -448,6 +451,16 @@ static inline bool irqd_affinity_on_activate(struct irq_data *d)
        return __irqd_to_state(d) & IRQD_AFFINITY_ON_ACTIVATE;
 }
 
+static inline void irqd_set_resend_when_in_progress(struct irq_data *d)
+{
+       __irqd_to_state(d) |= IRQD_RESEND_WHEN_IN_PROGRESS;
+}
+
+static inline bool irqd_needs_resend_when_in_progress(struct irq_data *d)
+{
+       return __irqd_to_state(d) & IRQD_RESEND_WHEN_IN_PROGRESS;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
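A sketch of how an irqchip driver would opt an interrupt into the new resend behaviour, typically while mapping it (example_* names are hypothetical):

    static struct irq_chip example_chip;

    static int example_domain_map(struct irq_domain *d, unsigned int virq,
                                  irq_hw_number_t hw)
    {
            irq_set_chip_and_handler(virq, &example_chip, handle_fasteoi_irq);
            irq_set_chip_data(virq, d->host_data);

            /* Have the core replay this IRQ if it fires while still in progress. */
            irqd_set_resend_when_in_progress(irq_get_irq_data(virq));
            return 0;
    }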
diff --git a/include/linux/irqchip/mmp.h b/include/linux/irqchip/mmp.h
deleted file mode 100644 (file)
index aa18137..0000000
+++ /dev/null
@@ -1,10 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef        __IRQCHIP_MMP_H
-#define        __IRQCHIP_MMP_H
-
-extern struct irq_chip icu_irq_chip;
-
-extern void icu_init_irq(void);
-extern void mmp2_init_icu(void);
-
-#endif /* __IRQCHIP_MMP_H */
diff --git a/include/linux/irqchip/mxs.h b/include/linux/irqchip/mxs.h
deleted file mode 100644 (file)
index 4f447e3..0000000
+++ /dev/null
@@ -1,11 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Freescale Semiconductor, Inc.
- */
-
-#ifndef __LINUX_IRQCHIP_MXS_H
-#define __LINUX_IRQCHIP_MXS_H
-
-extern void icoll_handle_irq(struct pt_regs *);
-
-#endif
index 844a8e3..d9451d4 100644 (file)
@@ -102,6 +102,9 @@ struct irq_desc {
        int                     parent_irq;
        struct module           *owner;
        const char              *name;
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+       struct hlist_node       resend_node;
+#endif
 } ____cacheline_internodealigned_in_smp;
 
 #ifdef CONFIG_SPARSE_IRQ
index 790e7fc..e274274 100644 (file)
  */
 extern phys_addr_t ibft_phys_addr;
 
+#ifdef CONFIG_ISCSI_IBFT_FIND
+
 /*
  * Routine used to find and reserve the iSCSI Boot Format Table. The
  * physical address is set in the ibft_phys_addr variable.
  */
-#ifdef CONFIG_ISCSI_IBFT_FIND
 void reserve_ibft_region(void);
+
+/*
+ * Physical bounds to search for the iSCSI Boot Format Table.
+ */
+#define IBFT_START 0x80000 /* 512kB */
+#define IBFT_END 0x100000 /* 1MB */
+
 #else
 static inline void reserve_ibft_region(void) {}
 #endif
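
Illustrative only, assuming early_memremap() is usable at that point in boot: the new IBFT_START/IBFT_END bounds describe the physical window a signature scan would cover, in 16-byte steps:

    static phys_addr_t __init find_ibft(void)
    {
            phys_addr_t pos;

            for (pos = IBFT_START; pos < IBFT_END; pos += 16) {
                    void *virt = early_memremap(pos, 4);

                    if (!virt)
                            continue;
                    if (!memcmp(virt, "iBFT", 4)) {
                            early_memunmap(virt, 4);
                            return pos;     /* table signature found */
                    }
                    early_memunmap(virt, 4);
            }
            return 0;
    }
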
index 4e968eb..f0a949b 100644 (file)
@@ -257,7 +257,7 @@ extern enum jump_label_type jump_label_init_type(struct jump_entry *entry);
 
 static __always_inline int static_key_count(struct static_key *key)
 {
-       return arch_atomic_read(&key->enabled);
+       return raw_atomic_read(&key->enabled);
 }
 
 static __always_inline void jump_label_init(void)
index fe3c999..c3f075e 100644 (file)
@@ -65,6 +65,9 @@ static inline void *dereference_symbol_descriptor(void *ptr)
        return ptr;
 }
 
+/* How and when do we show kallsyms values? */
+extern bool kallsyms_show_value(const struct cred *cred);
+
 #ifdef CONFIG_KALLSYMS
 unsigned long kallsyms_sym_address(int idx);
 int kallsyms_on_each_symbol(int (*fn)(void *, const char *, unsigned long),
@@ -93,10 +96,6 @@ extern int sprint_backtrace(char *buffer, unsigned long address);
 extern int sprint_backtrace_build_id(char *buffer, unsigned long address);
 
 int lookup_symbol_name(unsigned long addr, char *symname);
-int lookup_symbol_attrs(unsigned long addr, unsigned long *size, unsigned long *offset, char *modname, char *name);
-
-/* How and when do we show kallsyms values? */
-extern bool kallsyms_show_value(const struct cred *cred);
 
 #else /* !CONFIG_KALLSYMS */
 
@@ -155,16 +154,6 @@ static inline int lookup_symbol_name(unsigned long addr, char *symname)
        return -ERANGE;
 }
 
-static inline int lookup_symbol_attrs(unsigned long addr, unsigned long *size, unsigned long *offset, char *modname, char *name)
-{
-       return -ERANGE;
-}
-
-static inline bool kallsyms_show_value(const struct cred *cred)
-{
-       return false;
-}
-
 static inline int kallsyms_on_each_symbol(int (*fn)(void *, const char *, unsigned long),
                                          void *data)
 {
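
Since kallsyms_show_value() is now declared unconditionally, callers outside CONFIG_KALLSYMS can gate address exposure on the opener's credentials; a hedged example (the helper name is made up):

    static void print_addr(struct seq_file *m, const struct cred *cred,
                           unsigned long addr)
    {
            /* Only reveal the raw address to sufficiently privileged openers. */
            if (kallsyms_show_value(cred))
                    seq_printf(m, "%px\n", (void *)addr);
            else
                    seq_puts(m, "0000000000000000\n");
    }
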
index f7ef706..819b6bc 100644 (file)
@@ -343,7 +343,7 @@ static inline void *kasan_reset_tag(const void *addr)
  * @is_write: whether the bad access is a write or a read
  * @ip: instruction pointer for the accessibility check or the bad access itself
  */
-bool kasan_report(unsigned long addr, size_t size,
+bool kasan_report(const void *addr, size_t size,
                bool is_write, unsigned long ip);
 
 #else /* CONFIG_KASAN_SW_TAGS || CONFIG_KASAN_HW_TAGS */
index ee04256..b851ba4 100644 (file)
@@ -72,6 +72,23 @@ static inline void kcov_remote_stop_softirq(void)
                kcov_remote_stop();
 }
 
+#ifdef CONFIG_64BIT
+typedef unsigned long kcov_u64;
+#else
+typedef unsigned long long kcov_u64;
+#endif
+
+void __sanitizer_cov_trace_pc(void);
+void __sanitizer_cov_trace_cmp1(u8 arg1, u8 arg2);
+void __sanitizer_cov_trace_cmp2(u16 arg1, u16 arg2);
+void __sanitizer_cov_trace_cmp4(u32 arg1, u32 arg2);
+void __sanitizer_cov_trace_cmp8(kcov_u64 arg1, kcov_u64 arg2);
+void __sanitizer_cov_trace_const_cmp1(u8 arg1, u8 arg2);
+void __sanitizer_cov_trace_const_cmp2(u16 arg1, u16 arg2);
+void __sanitizer_cov_trace_const_cmp4(u32 arg1, u32 arg2);
+void __sanitizer_cov_trace_const_cmp8(kcov_u64 arg1, kcov_u64 arg2);
+void __sanitizer_cov_trace_switch(kcov_u64 val, void *cases);
+
 #else
 
 static inline void kcov_task_init(struct task_struct *t) {}
index 8dc7f7c..938d7ec 100644 (file)
@@ -490,9 +490,6 @@ do {                                                                        \
        rcu_assign_pointer((KEY)->payload.rcu_data0, (PAYLOAD));        \
 } while (0)
 
-#ifdef CONFIG_SYSCTL
-extern struct ctl_table key_sysctls[];
-#endif
 /*
  * the userspace interface
  */
index 30e5bec..f1f95a7 100644 (file)
@@ -89,6 +89,7 @@ int kthread_stop(struct task_struct *k);
 bool kthread_should_stop(void);
 bool kthread_should_park(void);
 bool __kthread_should_park(struct task_struct *k);
+bool kthread_should_stop_or_park(void);
 bool kthread_freezable_should_stop(bool *was_frozen);
 void *kthread_func(struct task_struct *k);
 void *kthread_data(struct task_struct *k);
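
A sketch of the kind of wait loop the new kthread_should_stop_or_park() helper is meant for (my_waitqueue, have_work() and do_one_item() are hypothetical):

    static int my_kthread_fn(void *data)
    {
            for (;;) {
                    /* Wake early if the thread must stop or park. */
                    wait_event_interruptible(my_waitqueue,
                                             have_work() || kthread_should_stop_or_park());
                    if (kthread_should_stop())
                            break;
                    if (kthread_should_park()) {
                            kthread_parkme();
                            continue;
                    }
                    do_one_item();
            }
            return 0;
    }
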
index 74bd269..310f859 100644 (file)
@@ -447,6 +447,14 @@ extern int lockdep_is_held(const void *);
 
 #endif /* !LOCKDEP */
 
+#ifdef CONFIG_PROVE_LOCKING
+void lockdep_set_lock_cmp_fn(struct lockdep_map *, lock_cmp_fn, lock_print_fn);
+
+#define lock_set_cmp_fn(lock, ...)     lockdep_set_lock_cmp_fn(&(lock)->dep_map, __VA_ARGS__)
+#else
+#define lock_set_cmp_fn(lock, ...)     do { } while (0)
+#endif
+
 enum xhlock_context_t {
        XHLOCK_HARD,
        XHLOCK_SOFT,
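
A hedged sketch of the new lock_set_cmp_fn() hook, assuming lockdep is enabled and the object embeds a mutex; the comparison enforces an address-ordered nesting rule (struct and function names are illustrative):

    static int addr_order_cmp(const struct lockdep_map *a,
                              const struct lockdep_map *b)
    {
            /* Same-class locks may only nest in increasing address order. */
            return a < b ? -1 : (a > b ? 1 : 0);
    }

    static void addr_order_print(const struct lockdep_map *map)
    {
            pr_info("lock %p\n", map);
    }

    static void my_obj_init(struct my_obj *obj)
    {
            mutex_init(&obj->lock);
            lock_set_cmp_fn(&obj->lock, addr_order_cmp, addr_order_print);
    }
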
index 59f4fb1..2ebc323 100644 (file)
@@ -85,6 +85,11 @@ struct lock_trace;
 
 #define LOCKSTAT_POINTS                4
 
+struct lockdep_map;
+typedef int (*lock_cmp_fn)(const struct lockdep_map *a,
+                          const struct lockdep_map *b);
+typedef void (*lock_print_fn)(const struct lockdep_map *map);
+
 /*
  * The lock-class itself. The order of the structure members matters.
  * reinit_class() zeroes the key member and all subsequent members.
@@ -110,6 +115,9 @@ struct lock_class {
        struct list_head                locks_after, locks_before;
 
        const struct lockdep_subclass_key *key;
+       lock_cmp_fn                     cmp_fn;
+       lock_print_fn                   print_fn;
+
        unsigned int                    subclass;
        unsigned int                    dep_gen_id;
 
index 6bb55e6..7308a1a 100644 (file)
@@ -343,6 +343,7 @@ LSM_HOOK(void, LSM_RET_VOID, sctp_sk_clone, struct sctp_association *asoc,
         struct sock *sk, struct sock *newsk)
 LSM_HOOK(int, 0, sctp_assoc_established, struct sctp_association *asoc,
         struct sk_buff *skb)
+LSM_HOOK(int, 0, mptcp_add_subflow, struct sock *sk, struct sock *ssk)
 #endif /* CONFIG_SECURITY_NETWORK */
 
 #ifdef CONFIG_SECURITY_INFINIBAND
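
A minimal sketch of an LSM wiring up the new mptcp_add_subflow hook (the module and its label-propagation logic are hypothetical):

    static int my_lsm_mptcp_add_subflow(struct sock *sk, struct sock *ssk)
    {
            /* e.g. propagate the parent MPTCP socket's label to the subflow */
            return 0;
    }

    static struct security_hook_list my_lsm_hooks[] = {
            LSM_HOOK_INIT(mptcp_add_subflow, my_lsm_mptcp_add_subflow),
    };
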
index 1fadb5f..295548c 100644 (file)
@@ -455,7 +455,9 @@ void *mas_erase(struct ma_state *mas);
 int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp);
 void mas_store_prealloc(struct ma_state *mas, void *entry);
 void *mas_find(struct ma_state *mas, unsigned long max);
+void *mas_find_range(struct ma_state *mas, unsigned long max);
 void *mas_find_rev(struct ma_state *mas, unsigned long min);
+void *mas_find_range_rev(struct ma_state *mas, unsigned long max);
 int mas_preallocate(struct ma_state *mas, gfp_t gfp);
 bool mas_is_err(struct ma_state *mas);
 
@@ -466,10 +468,18 @@ void mas_destroy(struct ma_state *mas);
 int mas_expected_entries(struct ma_state *mas, unsigned long nr_entries);
 
 void *mas_prev(struct ma_state *mas, unsigned long min);
+void *mas_prev_range(struct ma_state *mas, unsigned long max);
 void *mas_next(struct ma_state *mas, unsigned long max);
+void *mas_next_range(struct ma_state *mas, unsigned long max);
 
 int mas_empty_area(struct ma_state *mas, unsigned long min, unsigned long max,
                   unsigned long size);
+/*
+ * This finds an empty area from the highest address to the lowest.
+ * AKA "Topdown" version,
+ */
+int mas_empty_area_rev(struct ma_state *mas, unsigned long min,
+                      unsigned long max, unsigned long size);
 
 static inline void mas_init(struct ma_state *mas, struct maple_tree *tree,
                            unsigned long addr)
@@ -482,23 +492,17 @@ static inline void mas_init(struct ma_state *mas, struct maple_tree *tree,
 }
 
 /* Checks if a mas has not found anything */
-static inline bool mas_is_none(struct ma_state *mas)
+static inline bool mas_is_none(const struct ma_state *mas)
 {
        return mas->node == MAS_NONE;
 }
 
 /* Checks if a mas has been paused */
-static inline bool mas_is_paused(struct ma_state *mas)
+static inline bool mas_is_paused(const struct ma_state *mas)
 {
        return mas->node == MAS_PAUSE;
 }
 
-/*
- * This finds an empty area from the highest address to the lowest.
- * AKA "Topdown" version,
- */
-int mas_empty_area_rev(struct ma_state *mas, unsigned long min,
-                      unsigned long max, unsigned long size);
 /**
  * mas_reset() - Reset a Maple Tree operation state.
  * @mas: Maple Tree operation state.
@@ -528,7 +532,6 @@ static inline void mas_reset(struct ma_state *mas)
 #define mas_for_each(__mas, __entry, __max) \
        while (((__entry) = mas_find((__mas), (__max))) != NULL)
 
-
 /**
  * mas_set_range() - Set up Maple Tree operation state for a different index.
  * @mas: Maple Tree operation state.
@@ -616,7 +619,7 @@ static inline void mt_clear_in_rcu(struct maple_tree *mt)
                return;
 
        if (mt_external_lock(mt)) {
-               BUG_ON(!mt_lock_is_held(mt));
+               WARN_ON(!mt_lock_is_held(mt));
                mt->ma_flags &= ~MT_FLAGS_USE_RCU;
        } else {
                mtree_lock(mt);
@@ -635,7 +638,7 @@ static inline void mt_set_in_rcu(struct maple_tree *mt)
                return;
 
        if (mt_external_lock(mt)) {
-               BUG_ON(!mt_lock_is_held(mt));
+               WARN_ON(!mt_lock_is_held(mt));
                mt->ma_flags |= MT_FLAGS_USE_RCU;
        } else {
                mtree_lock(mt);
@@ -670,10 +673,17 @@ void *mt_next(struct maple_tree *mt, unsigned long index, unsigned long max);
 
 
 #ifdef CONFIG_DEBUG_MAPLE_TREE
+enum mt_dump_format {
+       mt_dump_dec,
+       mt_dump_hex,
+};
+
 extern atomic_t maple_tree_tests_run;
 extern atomic_t maple_tree_tests_passed;
 
-void mt_dump(const struct maple_tree *mt);
+void mt_dump(const struct maple_tree *mt, enum mt_dump_format format);
+void mas_dump(const struct ma_state *mas);
+void mas_wr_dump(const struct ma_wr_state *wr_mas);
 void mt_validate(struct maple_tree *mt);
 void mt_cache_shrink(void);
 #define MT_BUG_ON(__tree, __x) do {                                    \
@@ -681,7 +691,23 @@ void mt_cache_shrink(void);
        if (__x) {                                                      \
                pr_info("BUG at %s:%d (%u)\n",                          \
                __func__, __LINE__, __x);                               \
-               mt_dump(__tree);                                        \
+               mt_dump(__tree, mt_dump_hex);                           \
+               pr_info("Pass: %u Run:%u\n",                            \
+                       atomic_read(&maple_tree_tests_passed),          \
+                       atomic_read(&maple_tree_tests_run));            \
+               dump_stack();                                           \
+       } else {                                                        \
+               atomic_inc(&maple_tree_tests_passed);                   \
+       }                                                               \
+} while (0)
+
+#define MAS_BUG_ON(__mas, __x) do {                                    \
+       atomic_inc(&maple_tree_tests_run);                              \
+       if (__x) {                                                      \
+               pr_info("BUG at %s:%d (%u)\n",                          \
+               __func__, __LINE__, __x);                               \
+               mas_dump(__mas);                                        \
+               mt_dump((__mas)->tree, mt_dump_hex);                    \
                pr_info("Pass: %u Run:%u\n",                            \
                        atomic_read(&maple_tree_tests_passed),          \
                        atomic_read(&maple_tree_tests_run));            \
@@ -690,8 +716,84 @@ void mt_cache_shrink(void);
                atomic_inc(&maple_tree_tests_passed);                   \
        }                                                               \
 } while (0)
+
+#define MAS_WR_BUG_ON(__wrmas, __x) do {                               \
+       atomic_inc(&maple_tree_tests_run);                              \
+       if (__x) {                                                      \
+               pr_info("BUG at %s:%d (%u)\n",                          \
+               __func__, __LINE__, __x);                               \
+               mas_wr_dump(__wrmas);                                   \
+               mas_dump((__wrmas)->mas);                               \
+               mt_dump((__wrmas)->mas->tree, mt_dump_hex);             \
+               pr_info("Pass: %u Run:%u\n",                            \
+                       atomic_read(&maple_tree_tests_passed),          \
+                       atomic_read(&maple_tree_tests_run));            \
+               dump_stack();                                           \
+       } else {                                                        \
+               atomic_inc(&maple_tree_tests_passed);                   \
+       }                                                               \
+} while (0)
+
+#define MT_WARN_ON(__tree, __x)  ({                                    \
+       int ret = !!(__x);                                              \
+       atomic_inc(&maple_tree_tests_run);                              \
+       if (ret) {                                                      \
+               pr_info("WARN at %s:%d (%u)\n",                         \
+               __func__, __LINE__, __x);                               \
+               mt_dump(__tree, mt_dump_hex);                           \
+               pr_info("Pass: %u Run:%u\n",                            \
+                       atomic_read(&maple_tree_tests_passed),          \
+                       atomic_read(&maple_tree_tests_run));            \
+               dump_stack();                                           \
+       } else {                                                        \
+               atomic_inc(&maple_tree_tests_passed);                   \
+       }                                                               \
+       unlikely(ret);                                                  \
+})
+
+#define MAS_WARN_ON(__mas, __x) ({                                     \
+       int ret = !!(__x);                                              \
+       atomic_inc(&maple_tree_tests_run);                              \
+       if (ret) {                                                      \
+               pr_info("WARN at %s:%d (%u)\n",                         \
+               __func__, __LINE__, __x);                               \
+               mas_dump(__mas);                                        \
+               mt_dump((__mas)->tree, mt_dump_hex);                    \
+               pr_info("Pass: %u Run:%u\n",                            \
+                       atomic_read(&maple_tree_tests_passed),          \
+                       atomic_read(&maple_tree_tests_run));            \
+               dump_stack();                                           \
+       } else {                                                        \
+               atomic_inc(&maple_tree_tests_passed);                   \
+       }                                                               \
+       unlikely(ret);                                                  \
+})
+
+#define MAS_WR_WARN_ON(__wrmas, __x) ({                                        \
+       int ret = !!(__x);                                              \
+       atomic_inc(&maple_tree_tests_run);                              \
+       if (ret) {                                                      \
+               pr_info("WARN at %s:%d (%u)\n",                         \
+               __func__, __LINE__, __x);                               \
+               mas_wr_dump(__wrmas);                                   \
+               mas_dump((__wrmas)->mas);                               \
+               mt_dump((__wrmas)->mas->tree, mt_dump_hex);             \
+               pr_info("Pass: %u Run:%u\n",                            \
+                       atomic_read(&maple_tree_tests_passed),          \
+                       atomic_read(&maple_tree_tests_run));            \
+               dump_stack();                                           \
+       } else {                                                        \
+               atomic_inc(&maple_tree_tests_passed);                   \
+       }                                                               \
+       unlikely(ret);                                                  \
+})
 #else
-#define MT_BUG_ON(__tree, __x) BUG_ON(__x)
+#define MT_BUG_ON(__tree, __x)         BUG_ON(__x)
+#define MAS_BUG_ON(__mas, __x)         BUG_ON(__x)
+#define MAS_WR_BUG_ON(__mas, __x)      BUG_ON(__x)
+#define MT_WARN_ON(__tree, __x)                WARN_ON(__x)
+#define MAS_WARN_ON(__mas, __x)                WARN_ON(__x)
+#define MAS_WR_WARN_ON(__mas, __x)     WARN_ON(__x)
 #endif /* CONFIG_DEBUG_MAPLE_TREE */
 
 #endif /*_LINUX_MAPLE_TREE_H */
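
A short, illustrative walk of a maple tree using the existing iterator plus the new debug dump format (tree name and printout are made up; mt_dump() with a format argument requires CONFIG_DEBUG_MAPLE_TREE):

    static DEFINE_MTREE(my_tree);

    static void walk_my_tree(void)
    {
            MA_STATE(mas, &my_tree, 0, 0);
            void *entry;

            rcu_read_lock();
            mas_for_each(&mas, entry, ULONG_MAX)
                    pr_info("range [%lx, %lx] -> %p\n", mas.index, mas.last, entry);
            rcu_read_unlock();

            /* New: dumps now take an explicit output format. */
            mt_dump(&my_tree, mt_dump_hex);
    }
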
index 439b8f0..2d38865 100644 (file)
@@ -118,17 +118,17 @@ __STRUCT_FRACT(s32)
 __STRUCT_FRACT(u32)
 #undef __STRUCT_FRACT
 
-/*
- * Multiplies an integer by a fraction, while avoiding unnecessary
- * overflow or loss of precision.
- */
-#define mult_frac(x, numer, denom)(                    \
-{                                                      \
-       typeof(x) quot = (x) / (denom);                 \
-       typeof(x) rem  = (x) % (denom);                 \
-       (quot * (numer)) + ((rem * (numer)) / (denom)); \
-}                                                      \
-)
+/* Calculate "x * n / d" without unnecessary overflow or loss of precision. */
+#define mult_frac(x, n, d)     \
+({                             \
+       typeof(x) x_ = (x);     \
+       typeof(n) n_ = (n);     \
+       typeof(d) d_ = (d);     \
+                               \
+       typeof(x_) q = x_ / d_; \
+       typeof(x_) r = x_ % d_; \
+       q * n_ + r * n_ / d_;   \
+})
 
 #define sector_div(a, b) do_div(a, b)
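
A worked fragment showing why the divide-first form matters: 4000000000 * 3 would overflow 32 bits, while mult_frac() multiplies only after dividing (values chosen purely for illustration):

            u32 x = 4000000000U;
            u32 y = mult_frac(x, 3, 4);     /* q = 1000000000, r = 0 -> y = 3000000000 */
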
 
index 8b9191a..bf74478 100644 (file)
@@ -168,7 +168,7 @@ static __always_inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
 #endif /* mul_u64_u32_shr */
 
 #ifndef mul_u64_u64_shr
-static inline u64 mul_u64_u64_shr(u64 a, u64 mul, unsigned int shift)
+static __always_inline u64 mul_u64_u64_shr(u64 a, u64 mul, unsigned int shift)
 {
        return (u64)(((unsigned __int128)a * mul) >> shift);
 }
index f82ee3f..f71ff9f 100644 (file)
@@ -128,7 +128,6 @@ int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
 
 void memblock_free_all(void);
 void memblock_free(void *ptr, size_t size);
-void reset_node_managed_pages(pg_data_t *pgdat);
 void reset_all_zones_managed_pages(void);
 
 /* Low level functions */
index 222d737..5818af8 100644 (file)
@@ -419,7 +419,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
  *
  * - the folio lock
  * - LRU isolation
- * - lock_page_memcg()
+ * - folio_memcg_lock()
  * - exclusive reference
  * - mem_cgroup_trylock_pages()
  *
@@ -820,8 +820,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
                                   struct mem_cgroup *,
                                   struct mem_cgroup_reclaim_cookie *);
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
-int mem_cgroup_scan_tasks(struct mem_cgroup *,
-                         int (*)(struct task_struct *, void *), void *);
+void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+                          int (*)(struct task_struct *, void *), void *arg);
 
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
@@ -949,8 +949,6 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
 
 void folio_memcg_lock(struct folio *folio);
 void folio_memcg_unlock(struct folio *folio);
-void lock_page_memcg(struct page *page);
-void unlock_page_memcg(struct page *page);
 
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
 
@@ -1038,7 +1036,6 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 }
 
 void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_atomic(void);
 void mem_cgroup_flush_stats_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
@@ -1367,10 +1364,9 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
 {
 }
 
-static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+static inline void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
                int (*fn)(struct task_struct *, void *), void *arg)
 {
-       return 0;
 }
 
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
@@ -1439,14 +1435,6 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
 {
 }
 
-static inline void lock_page_memcg(struct page *page)
-{
-}
-
-static inline void unlock_page_memcg(struct page *page)
-{
-}
-
 static inline void folio_memcg_lock(struct folio *folio)
 {
 }
@@ -1537,10 +1525,6 @@ static inline void mem_cgroup_flush_stats(void)
 {
 }
 
-static inline void mem_cgroup_flush_stats_atomic(void)
-{
-}
-
 static inline void mem_cgroup_flush_stats_ratelimited(void)
 {
 }
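
With lock_page_memcg()/unlock_page_memcg() gone, page-based callers convert to the folio API; a sketch of the equivalent pattern (the surrounding state update is elided):

            struct folio *folio = page_folio(page);

            folio_memcg_lock(folio);
            /* ... update memcg-accounted page state ... */
            folio_memcg_unlock(folio);
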
index 9fcbf57..013c697 100644 (file)
@@ -326,9 +326,6 @@ static inline int remove_memory(u64 start, u64 size)
 static inline void __remove_memory(u64 start, u64 size) {}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
-extern void set_zone_contiguous(struct zone *zone);
-extern void clear_zone_contiguous(struct zone *zone);
-
 #ifdef CONFIG_MEMORY_HOTPLUG
 extern void __ref free_area_init_core_hotplug(struct pglist_data *pgdat);
 extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
@@ -347,9 +344,8 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
 extern int sparse_add_section(int nid, unsigned long pfn,
                unsigned long nr_pages, struct vmem_altmap *altmap,
                struct dev_pagemap *pgmap);
-extern void sparse_remove_section(struct mem_section *ms,
-               unsigned long pfn, unsigned long nr_pages,
-               unsigned long map_offset, struct vmem_altmap *altmap);
+extern void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
+                                 struct vmem_altmap *altmap);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
                                          unsigned long pnum);
 extern struct zone *zone_for_pfn_range(int online_type, int nid,
index beb3f44..fff7fa6 100644 (file)
@@ -17,6 +17,7 @@ enum axp20x_variants {
        AXP221_ID,
        AXP223_ID,
        AXP288_ID,
+       AXP313A_ID,
        AXP803_ID,
        AXP806_ID,
        AXP809_ID,
@@ -92,6 +93,17 @@ enum axp20x_variants {
 #define AXP22X_ALDO3_V_OUT             0x2a
 #define AXP22X_CHRG_CTRL3              0x35
 
+#define AXP313A_ON_INDICATE            0x00
+#define AXP313A_OUTPUT_CONTROL         0x10
+#define AXP313A_DCDC1_CONRTOL          0x13
+#define AXP313A_DCDC2_CONRTOL          0x14
+#define AXP313A_DCDC3_CONRTOL          0x15
+#define AXP313A_ALDO1_CONRTOL          0x16
+#define AXP313A_DLDO1_CONRTOL          0x17
+#define AXP313A_SHUTDOWN_CTRL          0x1a
+#define AXP313A_IRQ_EN                 0x20
+#define AXP313A_IRQ_STATE              0x21
+
 #define AXP806_STARTUP_SRC             0x00
 #define AXP806_CHIP_ID                 0x03
 #define AXP806_PWR_OUT_CTRL1           0x10
@@ -364,6 +376,16 @@ enum {
 };
 
 enum {
+       AXP313A_DCDC1 = 0,
+       AXP313A_DCDC2,
+       AXP313A_DCDC3,
+       AXP313A_ALDO1,
+       AXP313A_DLDO1,
+       AXP313A_RTC_LDO,
+       AXP313A_REG_ID_MAX,
+};
+
+enum {
        AXP806_DCDCA = 0,
        AXP806_DCDCB,
        AXP806_DCDCC,
@@ -616,6 +638,16 @@ enum axp288_irqs {
        AXP288_IRQ_BC_USB_CHNG,
 };
 
+enum axp313a_irqs {
+       AXP313A_IRQ_DIE_TEMP_HIGH,
+       AXP313A_IRQ_DCDC2_V_LOW = 2,
+       AXP313A_IRQ_DCDC3_V_LOW,
+       AXP313A_IRQ_PEK_LONG,
+       AXP313A_IRQ_PEK_SHORT,
+       AXP313A_IRQ_PEK_FAL_EDGE,
+       AXP313A_IRQ_PEK_RIS_EDGE,
+};
+
 enum axp803_irqs {
        AXP803_IRQ_ACIN_OVER_V = 1,
        AXP803_IRQ_ACIN_PLUGIN,
index 9af1f31..78e167a 100644 (file)
@@ -289,6 +289,414 @@ enum rk805_reg {
 #define RK805_INT_ALARM_EN             (1 << 3)
 #define RK805_INT_TIMER_EN             (1 << 2)
 
+/* RK806 */
+#define RK806_POWER_EN0                        0x0
+#define RK806_POWER_EN1                        0x1
+#define RK806_POWER_EN2                        0x2
+#define RK806_POWER_EN3                        0x3
+#define RK806_POWER_EN4                        0x4
+#define RK806_POWER_EN5                        0x5
+#define RK806_POWER_SLP_EN0            0x6
+#define RK806_POWER_SLP_EN1            0x7
+#define RK806_POWER_SLP_EN2            0x8
+#define RK806_POWER_DISCHRG_EN0                0x9
+#define RK806_POWER_DISCHRG_EN1                0xA
+#define RK806_POWER_DISCHRG_EN2                0xB
+#define RK806_BUCK_FB_CONFIG           0xC
+#define RK806_SLP_LP_CONFIG            0xD
+#define RK806_POWER_FPWM_EN0           0xE
+#define RK806_POWER_FPWM_EN1           0xF
+#define RK806_BUCK1_CONFIG             0x10
+#define RK806_BUCK2_CONFIG             0x11
+#define RK806_BUCK3_CONFIG             0x12
+#define RK806_BUCK4_CONFIG             0x13
+#define RK806_BUCK5_CONFIG             0x14
+#define RK806_BUCK6_CONFIG             0x15
+#define RK806_BUCK7_CONFIG             0x16
+#define RK806_BUCK8_CONFIG             0x17
+#define RK806_BUCK9_CONFIG             0x18
+#define RK806_BUCK10_CONFIG            0x19
+#define RK806_BUCK1_ON_VSEL            0x1A
+#define RK806_BUCK2_ON_VSEL            0x1B
+#define RK806_BUCK3_ON_VSEL            0x1C
+#define RK806_BUCK4_ON_VSEL            0x1D
+#define RK806_BUCK5_ON_VSEL            0x1E
+#define RK806_BUCK6_ON_VSEL            0x1F
+#define RK806_BUCK7_ON_VSEL            0x20
+#define RK806_BUCK8_ON_VSEL            0x21
+#define RK806_BUCK9_ON_VSEL            0x22
+#define RK806_BUCK10_ON_VSEL           0x23
+#define RK806_BUCK1_SLP_VSEL           0x24
+#define RK806_BUCK2_SLP_VSEL           0x25
+#define RK806_BUCK3_SLP_VSEL           0x26
+#define RK806_BUCK4_SLP_VSEL           0x27
+#define RK806_BUCK5_SLP_VSEL           0x28
+#define RK806_BUCK6_SLP_VSEL           0x29
+#define RK806_BUCK7_SLP_VSEL           0x2A
+#define RK806_BUCK8_SLP_VSEL           0x2B
+#define RK806_BUCK9_SLP_VSEL           0x2D
+#define RK806_BUCK10_SLP_VSEL          0x2E
+#define RK806_BUCK_DEBUG1              0x30
+#define RK806_BUCK_DEBUG2              0x31
+#define RK806_BUCK_DEBUG3              0x32
+#define RK806_BUCK_DEBUG4              0x33
+#define RK806_BUCK_DEBUG5              0x34
+#define RK806_BUCK_DEBUG6              0x35
+#define RK806_BUCK_DEBUG7              0x36
+#define RK806_BUCK_DEBUG8              0x37
+#define RK806_BUCK_DEBUG9              0x38
+#define RK806_BUCK_DEBUG10             0x39
+#define RK806_BUCK_DEBUG11             0x3A
+#define RK806_BUCK_DEBUG12             0x3B
+#define RK806_BUCK_DEBUG13             0x3C
+#define RK806_BUCK_DEBUG14             0x3D
+#define RK806_BUCK_DEBUG15             0x3E
+#define RK806_BUCK_DEBUG16             0x3F
+#define RK806_BUCK_DEBUG17             0x40
+#define RK806_BUCK_DEBUG18             0x41
+#define RK806_NLDO_IMAX                        0x42
+#define RK806_NLDO1_ON_VSEL            0x43
+#define RK806_NLDO2_ON_VSEL            0x44
+#define RK806_NLDO3_ON_VSEL            0x45
+#define RK806_NLDO4_ON_VSEL            0x46
+#define RK806_NLDO5_ON_VSEL            0x47
+#define RK806_NLDO1_SLP_VSEL           0x48
+#define RK806_NLDO2_SLP_VSEL           0x49
+#define RK806_NLDO3_SLP_VSEL           0x4A
+#define RK806_NLDO4_SLP_VSEL           0x4B
+#define RK806_NLDO5_SLP_VSEL           0x4C
+#define RK806_PLDO_IMAX                        0x4D
+#define RK806_PLDO1_ON_VSEL            0x4E
+#define RK806_PLDO2_ON_VSEL            0x4F
+#define RK806_PLDO3_ON_VSEL            0x50
+#define RK806_PLDO4_ON_VSEL            0x51
+#define RK806_PLDO5_ON_VSEL            0x52
+#define RK806_PLDO6_ON_VSEL            0x53
+#define RK806_PLDO1_SLP_VSEL           0x54
+#define RK806_PLDO2_SLP_VSEL           0x55
+#define RK806_PLDO3_SLP_VSEL           0x56
+#define RK806_PLDO4_SLP_VSEL           0x57
+#define RK806_PLDO5_SLP_VSEL           0x58
+#define RK806_PLDO6_SLP_VSEL           0x59
+#define RK806_CHIP_NAME                        0x5A
+#define RK806_CHIP_VER                 0x5B
+#define RK806_OTP_VER                  0x5C
+#define RK806_SYS_STS                  0x5D
+#define RK806_SYS_CFG0                 0x5E
+#define RK806_SYS_CFG1                 0x5F
+#define RK806_SYS_OPTION               0x61
+#define RK806_SLEEP_CONFIG0            0x62
+#define RK806_SLEEP_CONFIG1            0x63
+#define RK806_SLEEP_CTR_SEL0           0x64
+#define RK806_SLEEP_CTR_SEL1           0x65
+#define RK806_SLEEP_CTR_SEL2           0x66
+#define RK806_SLEEP_CTR_SEL3           0x67
+#define RK806_SLEEP_CTR_SEL4           0x68
+#define RK806_SLEEP_CTR_SEL5           0x69
+#define RK806_DVS_CTRL_SEL0            0x6A
+#define RK806_DVS_CTRL_SEL1            0x6B
+#define RK806_DVS_CTRL_SEL2            0x6C
+#define RK806_DVS_CTRL_SEL3            0x6D
+#define RK806_DVS_CTRL_SEL4            0x6E
+#define RK806_DVS_CTRL_SEL5            0x6F
+#define RK806_DVS_START_CTRL           0x70
+#define RK806_SLEEP_GPIO               0x71
+#define RK806_SYS_CFG3                 0x72
+#define RK806_ON_SOURCE                        0x74
+#define RK806_OFF_SOURCE               0x75
+#define RK806_PWRON_KEY                        0x76
+#define RK806_INT_STS0                 0x77
+#define RK806_INT_MSK0                 0x78
+#define RK806_INT_STS1                 0x79
+#define RK806_INT_MSK1                 0x7A
+#define RK806_GPIO_INT_CONFIG          0x7B
+#define RK806_DATA_REG0                        0x7C
+#define RK806_DATA_REG1                        0x7D
+#define RK806_DATA_REG2                        0x7E
+#define RK806_DATA_REG3                        0x7F
+#define RK806_DATA_REG4                        0x80
+#define RK806_DATA_REG5                        0x81
+#define RK806_DATA_REG6                        0x82
+#define RK806_DATA_REG7                        0x83
+#define RK806_DATA_REG8                        0x84
+#define RK806_DATA_REG9                        0x85
+#define RK806_DATA_REG10               0x86
+#define RK806_DATA_REG11               0x87
+#define RK806_DATA_REG12               0x88
+#define RK806_DATA_REG13               0x89
+#define RK806_DATA_REG14               0x8A
+#define RK806_DATA_REG15               0x8B
+#define RK806_TM_REG                   0x8C
+#define RK806_OTP_EN_REG               0x8D
+#define RK806_FUNC_OTP_EN_REG          0x8E
+#define RK806_TEST_REG1                        0x8F
+#define RK806_TEST_REG2                        0x90
+#define RK806_TEST_REG3                        0x91
+#define RK806_TEST_REG4                        0x92
+#define RK806_TEST_REG5                        0x93
+#define RK806_BUCK_VSEL_OTP_REG0       0x94
+#define RK806_BUCK_VSEL_OTP_REG1       0x95
+#define RK806_BUCK_VSEL_OTP_REG2       0x96
+#define RK806_BUCK_VSEL_OTP_REG3       0x97
+#define RK806_BUCK_VSEL_OTP_REG4       0x98
+#define RK806_BUCK_VSEL_OTP_REG5       0x99
+#define RK806_BUCK_VSEL_OTP_REG6       0x9A
+#define RK806_BUCK_VSEL_OTP_REG7       0x9B
+#define RK806_BUCK_VSEL_OTP_REG8       0x9C
+#define RK806_BUCK_VSEL_OTP_REG9       0x9D
+#define RK806_NLDO1_VSEL_OTP_REG0      0x9E
+#define RK806_NLDO1_VSEL_OTP_REG1      0x9F
+#define RK806_NLDO1_VSEL_OTP_REG2      0xA0
+#define RK806_NLDO1_VSEL_OTP_REG3      0xA1
+#define RK806_NLDO1_VSEL_OTP_REG4      0xA2
+#define RK806_PLDO_VSEL_OTP_REG0       0xA3
+#define RK806_PLDO_VSEL_OTP_REG1       0xA4
+#define RK806_PLDO_VSEL_OTP_REG2       0xA5
+#define RK806_PLDO_VSEL_OTP_REG3       0xA6
+#define RK806_PLDO_VSEL_OTP_REG4       0xA7
+#define RK806_PLDO_VSEL_OTP_REG5       0xA8
+#define RK806_BUCK_EN_OTP_REG1         0xA9
+#define RK806_NLDO_EN_OTP_REG1         0xAA
+#define RK806_PLDO_EN_OTP_REG1         0xAB
+#define RK806_BUCK_FB_RES_OTP_REG1     0xAC
+#define RK806_OTP_RESEV_REG0           0xAD
+#define RK806_OTP_RESEV_REG1           0xAE
+#define RK806_OTP_RESEV_REG2           0xAF
+#define RK806_OTP_RESEV_REG3           0xB0
+#define RK806_OTP_RESEV_REG4           0xB1
+#define RK806_BUCK_SEQ_REG0            0xB2
+#define RK806_BUCK_SEQ_REG1            0xB3
+#define RK806_BUCK_SEQ_REG2            0xB4
+#define RK806_BUCK_SEQ_REG3            0xB5
+#define RK806_BUCK_SEQ_REG4            0xB6
+#define RK806_BUCK_SEQ_REG5            0xB7
+#define RK806_BUCK_SEQ_REG6            0xB8
+#define RK806_BUCK_SEQ_REG7            0xB9
+#define RK806_BUCK_SEQ_REG8            0xBA
+#define RK806_BUCK_SEQ_REG9            0xBB
+#define RK806_BUCK_SEQ_REG10           0xBC
+#define RK806_BUCK_SEQ_REG11           0xBD
+#define RK806_BUCK_SEQ_REG12           0xBE
+#define RK806_BUCK_SEQ_REG13           0xBF
+#define RK806_BUCK_SEQ_REG14           0xC0
+#define RK806_BUCK_SEQ_REG15           0xC1
+#define RK806_BUCK_SEQ_REG16           0xC2
+#define RK806_BUCK_SEQ_REG17           0xC3
+#define RK806_HK_TRIM_REG1             0xC4
+#define RK806_HK_TRIM_REG2             0xC5
+#define RK806_BUCK_REF_TRIM_REG1       0xC6
+#define RK806_BUCK_REF_TRIM_REG2       0xC7
+#define RK806_BUCK_REF_TRIM_REG3       0xC8
+#define RK806_BUCK_REF_TRIM_REG4       0xC9
+#define RK806_BUCK_REF_TRIM_REG5       0xCA
+#define RK806_BUCK_OSC_TRIM_REG1       0xCB
+#define RK806_BUCK_OSC_TRIM_REG2       0xCC
+#define RK806_BUCK_OSC_TRIM_REG3       0xCD
+#define RK806_BUCK_OSC_TRIM_REG4       0xCE
+#define RK806_BUCK_OSC_TRIM_REG5       0xCF
+#define RK806_BUCK_TRIM_ZCDIOS_REG1    0xD0
+#define RK806_BUCK_TRIM_ZCDIOS_REG2    0xD1
+#define RK806_NLDO_TRIM_REG1           0xD2
+#define RK806_NLDO_TRIM_REG2           0xD3
+#define RK806_NLDO_TRIM_REG3           0xD4
+#define RK806_PLDO_TRIM_REG1           0xD5
+#define RK806_PLDO_TRIM_REG2           0xD6
+#define RK806_PLDO_TRIM_REG3           0xD7
+#define RK806_TRIM_ICOMP_REG1          0xD8
+#define RK806_TRIM_ICOMP_REG2          0xD9
+#define RK806_EFUSE_CONTROL_REGH       0xDA
+#define RK806_FUSE_PROG_REG            0xDB
+#define RK806_MAIN_FSM_STS_REG         0xDD
+#define RK806_FSM_REG                  0xDE
+#define RK806_TOP_RESEV_OFFR           0xEC
+#define RK806_TOP_RESEV_POR            0xED
+#define RK806_BUCK_VRSN_REG1           0xEE
+#define RK806_BUCK_VRSN_REG2           0xEF
+#define RK806_NLDO_RLOAD_SEL_REG1      0xF0
+#define RK806_PLDO_RLOAD_SEL_REG1      0xF1
+#define RK806_PLDO_RLOAD_SEL_REG2      0xF2
+#define RK806_BUCK_CMIN_MX_REG1                0xF3
+#define RK806_BUCK_CMIN_MX_REG2                0xF4
+#define RK806_BUCK_FREQ_SET_REG1       0xF5
+#define RK806_BUCK_FREQ_SET_REG2       0xF6
+#define RK806_BUCK_RS_MEABS_REG1       0xF7
+#define RK806_BUCK_RS_MEABS_REG2       0xF8
+#define RK806_BUCK_RS_ZDLEB_REG1       0xF9
+#define RK806_BUCK_RS_ZDLEB_REG2       0xFA
+#define RK806_BUCK_RSERVE_REG1         0xFB
+#define RK806_BUCK_RSERVE_REG2         0xFC
+#define RK806_BUCK_RSERVE_REG3         0xFD
+#define RK806_BUCK_RSERVE_REG4         0xFE
+#define RK806_BUCK_RSERVE_REG5         0xFF
+
+/* INT_STS Register field definitions */
+#define RK806_INT_STS_PWRON_FALL       BIT(0)
+#define RK806_INT_STS_PWRON_RISE       BIT(1)
+#define RK806_INT_STS_PWRON            BIT(2)
+#define RK806_INT_STS_PWRON_LP         BIT(3)
+#define RK806_INT_STS_HOTDIE           BIT(4)
+#define RK806_INT_STS_VDC_RISE         BIT(5)
+#define RK806_INT_STS_VDC_FALL         BIT(6)
+#define RK806_INT_STS_VB_LO            BIT(7)
+#define RK806_INT_STS_REV0             BIT(0)
+#define RK806_INT_STS_REV1             BIT(1)
+#define RK806_INT_STS_REV2             BIT(2)
+#define RK806_INT_STS_CRC_ERROR                BIT(3)
+#define RK806_INT_STS_SLP3_GPIO                BIT(4)
+#define RK806_INT_STS_SLP2_GPIO                BIT(5)
+#define RK806_INT_STS_SLP1_GPIO                BIT(6)
+#define RK806_INT_STS_WDT              BIT(7)
+
+/* SPI command */
+#define RK806_CMD_READ                 0
+#define RK806_CMD_WRITE                        BIT(7)
+#define RK806_CMD_CRC_EN               BIT(6)
+#define RK806_CMD_CRC_DIS              0
+#define RK806_CMD_LEN_MSK              0x0f
+#define RK806_REG_H                    0x00
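
Illustrative fragment showing how the new SPI command macros compose the three-byte command header for a write of @len registers starting at @reg (variable names are hypothetical):

            u8 hdr[3] = {
                    RK806_CMD_WRITE | RK806_CMD_CRC_DIS | ((len - 1) & RK806_CMD_LEN_MSK),
                    reg,            /* low address byte  */
                    RK806_REG_H,    /* high address byte */
            };
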
+
+#define VERSION_AB             0x01
+
+enum rk806_reg_id {
+       RK806_ID_DCDC1 = 0,
+       RK806_ID_DCDC2,
+       RK806_ID_DCDC3,
+       RK806_ID_DCDC4,
+       RK806_ID_DCDC5,
+       RK806_ID_DCDC6,
+       RK806_ID_DCDC7,
+       RK806_ID_DCDC8,
+       RK806_ID_DCDC9,
+       RK806_ID_DCDC10,
+
+       RK806_ID_NLDO1,
+       RK806_ID_NLDO2,
+       RK806_ID_NLDO3,
+       RK806_ID_NLDO4,
+       RK806_ID_NLDO5,
+
+       RK806_ID_PLDO1,
+       RK806_ID_PLDO2,
+       RK806_ID_PLDO3,
+       RK806_ID_PLDO4,
+       RK806_ID_PLDO5,
+       RK806_ID_PLDO6,
+       RK806_ID_END,
+};
+
+/* Define the RK806 IRQ numbers */
+enum rk806_irqs {
+       /* INT_STS0 registers */
+       RK806_IRQ_PWRON_FALL,
+       RK806_IRQ_PWRON_RISE,
+       RK806_IRQ_PWRON,
+       RK806_IRQ_PWRON_LP,
+       RK806_IRQ_HOTDIE,
+       RK806_IRQ_VDC_RISE,
+       RK806_IRQ_VDC_FALL,
+       RK806_IRQ_VB_LO,
+
+       /* INT_STS1 registers */
+       RK806_IRQ_REV0,
+       RK806_IRQ_REV1,
+       RK806_IRQ_REV2,
+       RK806_IRQ_CRC_ERROR,
+       RK806_IRQ_SLP3_GPIO,
+       RK806_IRQ_SLP2_GPIO,
+       RK806_IRQ_SLP1_GPIO,
+       RK806_IRQ_WDT,
+};
+
+/* VCC1 Low Voltage Threshold */
+enum rk806_lv_sel {
+       VB_LO_SEL_2800,
+       VB_LO_SEL_2900,
+       VB_LO_SEL_3000,
+       VB_LO_SEL_3100,
+       VB_LO_SEL_3200,
+       VB_LO_SEL_3300,
+       VB_LO_SEL_3400,
+       VB_LO_SEL_3500,
+};
+
+/* System Shutdown Voltage Select */
+enum rk806_uv_sel {
+       VB_UV_SEL_2700,
+       VB_UV_SEL_2800,
+       VB_UV_SEL_2900,
+       VB_UV_SEL_3000,
+       VB_UV_SEL_3100,
+       VB_UV_SEL_3200,
+       VB_UV_SEL_3300,
+       VB_UV_SEL_3400,
+};
+
+/* Pin Function */
+enum rk806_pwrctrl_fun {
+       PWRCTRL_NULL_FUN,
+       PWRCTRL_SLP_FUN,
+       PWRCTRL_POWOFF_FUN,
+       PWRCTRL_RST_FUN,
+       PWRCTRL_DVS_FUN,
+       PWRCTRL_GPIO_FUN,
+};
+
+/* Pin Polarity */
+enum rk806_pin_level {
+       POL_LOW,
+       POL_HIGH,
+};
+
+enum rk806_vsel_ctr_sel {
+       CTR_BY_NO_EFFECT,
+       CTR_BY_PWRCTRL1,
+       CTR_BY_PWRCTRL2,
+       CTR_BY_PWRCTRL3,
+};
+
+enum rk806_dvs_ctr_sel {
+       CTR_SEL_NO_EFFECT,
+       CTR_SEL_DVS_START1,
+       CTR_SEL_DVS_START2,
+       CTR_SEL_DVS_START3,
+};
+
+enum rk806_pin_dr_sel {
+       RK806_PIN_INPUT,
+       RK806_PIN_OUTPUT,
+};
+
+#define RK806_INT_POL_MSK              BIT(1)
+#define RK806_INT_POL_H                        BIT(1)
+#define RK806_INT_POL_L                        0
+
+#define RK806_SLAVE_RESTART_FUN_MSK    BIT(1)
+#define RK806_SLAVE_RESTART_FUN_EN     BIT(1)
+#define RK806_SLAVE_RESTART_FUN_OFF    0
+
+#define RK806_SYS_ENB2_2M_MSK          BIT(1)
+#define RK806_SYS_ENB2_2M_EN           BIT(1)
+#define RK806_SYS_ENB2_2M_OFF          0
+
+enum rk806_int_fun {
+       RK806_INT_ONLY,
+       RK806_INT_ADN_WKUP,
+};
+
+enum rk806_dvs_mode {
+       RK806_DVS_NOT_SUPPORT,
+       RK806_DVS_START1,
+       RK806_DVS_START2,
+       RK806_DVS_START3,
+       RK806_DVS_PWRCTRL1,
+       RK806_DVS_PWRCTRL2,
+       RK806_DVS_PWRCTRL3,
+       RK806_DVS_START_PWRCTR1,
+       RK806_DVS_START_PWRCTR2,
+       RK806_DVS_START_PWRCTR3,
+       RK806_DVS_END,
+};
+
 /* RK808 IRQ Definitions */
 #define RK808_IRQ_VOUT_LO      0
 #define RK808_IRQ_VB_LO                1
@@ -780,6 +1188,7 @@ enum {
 
 enum {
        RK805_ID = 0x8050,
+       RK806_ID = 0x8060,
        RK808_ID = 0x0000,
        RK809_ID = 0x8090,
        RK817_ID = 0x8170,
@@ -787,11 +1196,17 @@ enum {
 };
 
 struct rk808 {
-       struct i2c_client               *i2c;
+       struct device                   *dev;
        struct regmap_irq_chip_data     *irq_data;
        struct regmap                   *regmap;
        long                            variant;
        const struct regmap_config      *regmap_cfg;
        const struct regmap_irq_chip    *regmap_irq_chip;
 };
+
+void rk8xx_shutdown(struct device *dev);
+int rk8xx_probe(struct device *dev, int variant, unsigned int irq, struct regmap *regmap);
+int rk8xx_suspend(struct device *dev);
+int rk8xx_resume(struct device *dev);
+
 #endif /* __LINUX_REGULATOR_RK808_H */
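
With the core keyed on struct device and rk8xx_probe()/rk8xx_shutdown()/rk8xx_suspend()/rk8xx_resume() exported, a bus shim only needs to build a regmap and delegate; a hedged I2C example (the regmap config name and hard-coded variant are hypothetical):

    static int my_rk8xx_i2c_probe(struct i2c_client *client)
    {
            struct regmap *regmap;

            regmap = devm_regmap_init_i2c(client, &my_rk8xx_regmap_config);
            if (IS_ERR(regmap))
                    return PTR_ERR(regmap);

            return rk8xx_probe(&client->dev, RK808_ID, client->irq, regmap);
    }
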
diff --git a/include/linux/mfd/tps6594.h b/include/linux/mfd/tps6594.h
new file mode 100644 (file)
index 0000000..3f7c5e2
--- /dev/null
@@ -0,0 +1,1020 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Functions to access TPS6594 Power Management IC
+ *
+ * Copyright (C) 2023 BayLibre Incorporated - https://www.baylibre.com/
+ */
+
+#ifndef __LINUX_MFD_TPS6594_H
+#define __LINUX_MFD_TPS6594_H
+
+#include <linux/device.h>
+#include <linux/regmap.h>
+
+struct regmap_irq_chip_data;
+
+/* Chip id list */
+enum pmic_id {
+       TPS6594,
+       TPS6593,
+       LP8764,
+};
+
+/* Macro to get page index from register address */
+#define TPS6594_REG_TO_PAGE(reg)       ((reg) >> 8)
+
+/* Registers for page 0 of TPS6594 */
+#define TPS6594_REG_DEV_REV                            0x01
+
+#define TPS6594_REG_NVM_CODE_1                         0x02
+#define TPS6594_REG_NVM_CODE_2                         0x03
+
+#define TPS6594_REG_BUCKX_CTRL(buck_inst)              (0x04 + ((buck_inst) << 1))
+#define TPS6594_REG_BUCKX_CONF(buck_inst)              (0x05 + ((buck_inst) << 1))
+#define TPS6594_REG_BUCKX_VOUT_1(buck_inst)            (0x0e + ((buck_inst) << 1))
+#define TPS6594_REG_BUCKX_VOUT_2(buck_inst)            (0x0f + ((buck_inst) << 1))
+#define TPS6594_REG_BUCKX_PG_WINDOW(buck_inst)         (0x18 + (buck_inst))
+
+#define TPS6594_REG_LDOX_CTRL(ldo_inst)                        (0x1d + (ldo_inst))
+#define TPS6594_REG_LDORTC_CTRL                                0x22
+#define TPS6594_REG_LDOX_VOUT(ldo_inst)                        (0x23 + (ldo_inst))
+#define TPS6594_REG_LDOX_PG_WINDOW(ldo_inst)           (0x27 + (ldo_inst))
+
+#define TPS6594_REG_VCCA_VMON_CTRL                     0x2b
+#define TPS6594_REG_VCCA_PG_WINDOW                     0x2c
+#define TPS6594_REG_VMON1_PG_WINDOW                    0x2d
+#define TPS6594_REG_VMON1_PG_LEVEL                     0x2e
+#define TPS6594_REG_VMON2_PG_WINDOW                    0x2f
+#define TPS6594_REG_VMON2_PG_LEVEL                     0x30
+
+#define TPS6594_REG_GPIOX_CONF(gpio_inst)              (0x31 + (gpio_inst))
+#define TPS6594_REG_NPWRON_CONF                                0x3c
+#define TPS6594_REG_GPIO_OUT_1                         0x3d
+#define TPS6594_REG_GPIO_OUT_2                         0x3e
+#define TPS6594_REG_GPIO_IN_1                          0x3f
+#define TPS6594_REG_GPIO_IN_2                          0x40
+#define TPS6594_REG_GPIOX_OUT(gpio_inst)               (TPS6594_REG_GPIO_OUT_1 + (gpio_inst) / 8)
+#define TPS6594_REG_GPIOX_IN(gpio_inst)                        (TPS6594_REG_GPIO_IN_1 + (gpio_inst) / 8)
+
+#define TPS6594_REG_RAIL_SEL_1                         0x41
+#define TPS6594_REG_RAIL_SEL_2                         0x42
+#define TPS6594_REG_RAIL_SEL_3                         0x43
+
+#define TPS6594_REG_FSM_TRIG_SEL_1                     0x44
+#define TPS6594_REG_FSM_TRIG_SEL_2                     0x45
+#define TPS6594_REG_FSM_TRIG_MASK_1                    0x46
+#define TPS6594_REG_FSM_TRIG_MASK_2                    0x47
+#define TPS6594_REG_FSM_TRIG_MASK_3                    0x48
+
+#define TPS6594_REG_MASK_BUCK1_2                       0x49
+#define TPS6594_REG_MASK_BUCK3_4                       0x4a
+#define TPS6594_REG_MASK_BUCK5                         0x4b
+#define TPS6594_REG_MASK_LDO1_2                                0x4c
+#define TPS6594_REG_MASK_LDO3_4                                0x4d
+#define TPS6594_REG_MASK_VMON                          0x4e
+#define TPS6594_REG_MASK_GPIO1_8_FALL                  0x4f
+#define TPS6594_REG_MASK_GPIO1_8_RISE                  0x50
+#define TPS6594_REG_MASK_GPIO9_11                      0x51
+#define TPS6594_REG_MASK_STARTUP                       0x52
+#define TPS6594_REG_MASK_MISC                          0x53
+#define TPS6594_REG_MASK_MODERATE_ERR                  0x54
+#define TPS6594_REG_MASK_FSM_ERR                       0x56
+#define TPS6594_REG_MASK_COMM_ERR                      0x57
+#define TPS6594_REG_MASK_READBACK_ERR                  0x58
+#define TPS6594_REG_MASK_ESM                           0x59
+
+#define TPS6594_REG_INT_TOP                            0x5a
+#define TPS6594_REG_INT_BUCK                           0x5b
+#define TPS6594_REG_INT_BUCK1_2                                0x5c
+#define TPS6594_REG_INT_BUCK3_4                                0x5d
+#define TPS6594_REG_INT_BUCK5                          0x5e
+#define TPS6594_REG_INT_LDO_VMON                       0x5f
+#define TPS6594_REG_INT_LDO1_2                         0x60
+#define TPS6594_REG_INT_LDO3_4                         0x61
+#define TPS6594_REG_INT_VMON                           0x62
+#define TPS6594_REG_INT_GPIO                           0x63
+#define TPS6594_REG_INT_GPIO1_8                                0x64
+#define TPS6594_REG_INT_STARTUP                                0x65
+#define TPS6594_REG_INT_MISC                           0x66
+#define TPS6594_REG_INT_MODERATE_ERR                   0x67
+#define TPS6594_REG_INT_SEVERE_ERR                     0x68
+#define TPS6594_REG_INT_FSM_ERR                                0x69
+#define TPS6594_REG_INT_COMM_ERR                       0x6a
+#define TPS6594_REG_INT_READBACK_ERR                   0x6b
+#define TPS6594_REG_INT_ESM                            0x6c
+
+#define TPS6594_REG_STAT_BUCK1_2                       0x6d
+#define TPS6594_REG_STAT_BUCK3_4                       0x6e
+#define TPS6594_REG_STAT_BUCK5                         0x6f
+#define TPS6594_REG_STAT_LDO1_2                                0x70
+#define TPS6594_REG_STAT_LDO3_4                                0x71
+#define TPS6594_REG_STAT_VMON                          0x72
+#define TPS6594_REG_STAT_STARTUP                       0x73
+#define TPS6594_REG_STAT_MISC                          0x74
+#define TPS6594_REG_STAT_MODERATE_ERR                  0x75
+#define TPS6594_REG_STAT_SEVERE_ERR                    0x76
+#define TPS6594_REG_STAT_READBACK_ERR                  0x77
+
+#define TPS6594_REG_PGOOD_SEL_1                                0x78
+#define TPS6594_REG_PGOOD_SEL_2                                0x79
+#define TPS6594_REG_PGOOD_SEL_3                                0x7a
+#define TPS6594_REG_PGOOD_SEL_4                                0x7b
+
+#define TPS6594_REG_PLL_CTRL                           0x7c
+
+#define TPS6594_REG_CONFIG_1                           0x7d
+#define TPS6594_REG_CONFIG_2                           0x7e
+
+#define TPS6594_REG_ENABLE_DRV_REG                     0x80
+
+#define TPS6594_REG_MISC_CTRL                          0x81
+
+#define TPS6594_REG_ENABLE_DRV_STAT                    0x82
+
+#define TPS6594_REG_RECOV_CNT_REG_1                    0x83
+#define TPS6594_REG_RECOV_CNT_REG_2                    0x84
+
+#define TPS6594_REG_FSM_I2C_TRIGGERS                   0x85
+#define TPS6594_REG_FSM_NSLEEP_TRIGGERS                        0x86
+
+#define TPS6594_REG_BUCK_RESET_REG                     0x87
+
+#define TPS6594_REG_SPREAD_SPECTRUM_1                  0x88
+
+#define TPS6594_REG_FREQ_SEL                           0x8a
+
+#define TPS6594_REG_FSM_STEP_SIZE                      0x8b
+
+#define TPS6594_REG_LDO_RV_TIMEOUT_REG_1               0x8c
+#define TPS6594_REG_LDO_RV_TIMEOUT_REG_2               0x8d
+
+#define TPS6594_REG_USER_SPARE_REGS                    0x8e
+
+#define TPS6594_REG_ESM_MCU_START_REG                  0x8f
+#define TPS6594_REG_ESM_MCU_DELAY1_REG                 0x90
+#define TPS6594_REG_ESM_MCU_DELAY2_REG                 0x91
+#define TPS6594_REG_ESM_MCU_MODE_CFG                   0x92
+#define TPS6594_REG_ESM_MCU_HMAX_REG                   0x93
+#define TPS6594_REG_ESM_MCU_HMIN_REG                   0x94
+#define TPS6594_REG_ESM_MCU_LMAX_REG                   0x95
+#define TPS6594_REG_ESM_MCU_LMIN_REG                   0x96
+#define TPS6594_REG_ESM_MCU_ERR_CNT_REG                        0x97
+#define TPS6594_REG_ESM_SOC_START_REG                  0x98
+#define TPS6594_REG_ESM_SOC_DELAY1_REG                 0x99
+#define TPS6594_REG_ESM_SOC_DELAY2_REG                 0x9a
+#define TPS6594_REG_ESM_SOC_MODE_CFG                   0x9b
+#define TPS6594_REG_ESM_SOC_HMAX_REG                   0x9c
+#define TPS6594_REG_ESM_SOC_HMIN_REG                   0x9d
+#define TPS6594_REG_ESM_SOC_LMAX_REG                   0x9e
+#define TPS6594_REG_ESM_SOC_LMIN_REG                   0x9f
+#define TPS6594_REG_ESM_SOC_ERR_CNT_REG                        0xa0
+
+#define TPS6594_REG_REGISTER_LOCK                      0xa1
+
+#define TPS6594_REG_MANUFACTURING_VER                  0xa6
+
+#define TPS6594_REG_CUSTOMER_NVM_ID_REG                        0xa7
+
+#define TPS6594_REG_VMON_CONF_REG                      0xa8
+
+#define TPS6594_REG_SOFT_REBOOT_REG                    0xab
+
+#define TPS6594_REG_RTC_SECONDS                                0xb5
+#define TPS6594_REG_RTC_MINUTES                                0xb6
+#define TPS6594_REG_RTC_HOURS                          0xb7
+#define TPS6594_REG_RTC_DAYS                           0xb8
+#define TPS6594_REG_RTC_MONTHS                         0xb9
+#define TPS6594_REG_RTC_YEARS                          0xba
+#define TPS6594_REG_RTC_WEEKS                          0xbb
+
+#define TPS6594_REG_ALARM_SECONDS                      0xbc
+#define TPS6594_REG_ALARM_MINUTES                      0xbd
+#define TPS6594_REG_ALARM_HOURS                                0xbe
+#define TPS6594_REG_ALARM_DAYS                         0xbf
+#define TPS6594_REG_ALARM_MONTHS                       0xc0
+#define TPS6594_REG_ALARM_YEARS                                0xc1
+
+#define TPS6594_REG_RTC_CTRL_1                         0xc2
+#define TPS6594_REG_RTC_CTRL_2                         0xc3
+#define TPS6594_REG_RTC_STATUS                         0xc4
+#define TPS6594_REG_RTC_INTERRUPTS                     0xc5
+#define TPS6594_REG_RTC_COMP_LSB                       0xc6
+#define TPS6594_REG_RTC_COMP_MSB                       0xc7
+#define TPS6594_REG_RTC_RESET_STATUS                   0xc8
+
+#define TPS6594_REG_SCRATCH_PAD_REG_1                  0xc9
+#define TPS6594_REG_SCRATCH_PAD_REG_2                  0xca
+#define TPS6594_REG_SCRATCH_PAD_REG_3                  0xcb
+#define TPS6594_REG_SCRATCH_PAD_REG_4                  0xcc
+
+#define TPS6594_REG_PFSM_DELAY_REG_1                   0xcd
+#define TPS6594_REG_PFSM_DELAY_REG_2                   0xce
+#define TPS6594_REG_PFSM_DELAY_REG_3                   0xcf
+#define TPS6594_REG_PFSM_DELAY_REG_4                   0xd0
+
+/* Registers for page 1 of TPS6594 */
+#define TPS6594_REG_SERIAL_IF_CONFIG                   0x11a
+#define TPS6594_REG_I2C1_ID                            0x122
+#define TPS6594_REG_I2C2_ID                            0x123
+
+/* Registers for page 4 of TPS6594 */
+#define TPS6594_REG_WD_ANSWER_REG                      0x401
+#define TPS6594_REG_WD_QUESTION_ANSW_CNT               0x402
+#define TPS6594_REG_WD_WIN1_CFG                                0x403
+#define TPS6594_REG_WD_WIN2_CFG                                0x404
+#define TPS6594_REG_WD_LONGWIN_CFG                     0x405
+#define TPS6594_REG_WD_MODE_REG                                0x406
+#define TPS6594_REG_WD_QA_CFG                          0x407
+#define TPS6594_REG_WD_ERR_STATUS                      0x408
+#define TPS6594_REG_WD_THR_CFG                         0x409
+#define TPS6594_REG_DWD_FAIL_CNT_REG                   0x40a
+
+/* BUCKX_CTRL register field definition */
+#define TPS6594_BIT_BUCK_EN                            BIT(0)
+#define TPS6594_BIT_BUCK_FPWM                          BIT(1)
+#define TPS6594_BIT_BUCK_FPWM_MP                       BIT(2)
+#define TPS6594_BIT_BUCK_VSEL                          BIT(3)
+#define TPS6594_BIT_BUCK_VMON_EN                       BIT(4)
+#define TPS6594_BIT_BUCK_PLDN                          BIT(5)
+#define TPS6594_BIT_BUCK_RV_SEL                                BIT(7)
+
+/* BUCKX_CONF register field definition */
+#define TPS6594_MASK_BUCK_SLEW_RATE                    GENMASK(2, 0)
+#define TPS6594_MASK_BUCK_ILIM                         GENMASK(5, 3)
+
+/* BUCKX_PG_WINDOW register field definition */
+#define TPS6594_MASK_BUCK_OV_THR                       GENMASK(2, 0)
+#define TPS6594_MASK_BUCK_UV_THR                       GENMASK(5, 3)
+
+/* BUCKX VSET */
+#define TPS6594_MASK_BUCKS_VSET GENMASK(7, 0)
+
+/* LDOX_CTRL register field definition */
+#define TPS6594_BIT_LDO_EN                             BIT(0)
+#define TPS6594_BIT_LDO_SLOW_RAMP                      BIT(1)
+#define TPS6594_BIT_LDO_VMON_EN                                BIT(4)
+#define TPS6594_MASK_LDO_PLDN                          GENMASK(6, 5)
+#define TPS6594_BIT_LDO_RV_SEL                         BIT(7)
+
+/* LDORTC_CTRL register field definition */
+#define TPS6594_BIT_LDORTC_DIS                         BIT(0)
+
+/* LDOX_VOUT register field definition */
+#define TPS6594_MASK_LDO123_VSET                       GENMASK(6, 1)
+#define TPS6594_MASK_LDO4_VSET                         GENMASK(6, 0)
+#define TPS6594_BIT_LDO_BYPASS                         BIT(7)
+
+/* LDOX_PG_WINDOW register field definition */
+#define TPS6594_MASK_LDO_OV_THR                                GENMASK(2, 0)
+#define TPS6594_MASK_LDO_UV_THR                                GENMASK(5, 3)
+
+/* VCCA_VMON_CTRL register field definition */
+#define TPS6594_BIT_VMON_EN                            BIT(0)
+#define TPS6594_BIT_VMON1_EN                           BIT(1)
+#define TPS6594_BIT_VMON1_RV_SEL                       BIT(2)
+#define TPS6594_BIT_VMON2_EN                           BIT(3)
+#define TPS6594_BIT_VMON2_RV_SEL                       BIT(4)
+#define TPS6594_BIT_VMON_DEGLITCH_SEL                  BIT(5)
+
+/* VCCA_PG_WINDOW register field definition */
+#define TPS6594_MASK_VCCA_OV_THR                       GENMASK(2, 0)
+#define TPS6594_MASK_VCCA_UV_THR                       GENMASK(5, 3)
+#define TPS6594_BIT_VCCA_PG_SET                                BIT(6)
+
+/* VMONX_PG_WINDOW register field definition */
+#define TPS6594_MASK_VMONX_OV_THR                      GENMASK(2, 0)
+#define TPS6594_MASK_VMONX_UV_THR                      GENMASK(5, 3)
+#define TPS6594_BIT_VMONX_RANGE                                BIT(6)
+
+/* GPIOX_CONF register field definition */
+#define TPS6594_BIT_GPIO_DIR                           BIT(0)
+#define TPS6594_BIT_GPIO_OD                            BIT(1)
+#define TPS6594_BIT_GPIO_PU_SEL                                BIT(2)
+#define TPS6594_BIT_GPIO_PU_PD_EN                      BIT(3)
+#define TPS6594_BIT_GPIO_DEGLITCH_EN                   BIT(4)
+#define TPS6594_MASK_GPIO_SEL                          GENMASK(7, 5)
+
+/* NPWRON_CONF register field definition */
+#define TPS6594_BIT_NRSTOUT_OD                         BIT(0)
+#define TPS6594_BIT_ENABLE_PU_SEL                      BIT(2)
+#define TPS6594_BIT_ENABLE_PU_PD_EN                    BIT(3)
+#define TPS6594_BIT_ENABLE_DEGLITCH_EN                 BIT(4)
+#define TPS6594_BIT_ENABLE_POL                         BIT(5)
+#define TPS6594_MASK_NPWRON_SEL                                GENMASK(7, 6)
+
+/* GPIO_OUT_X register field definition */
+#define TPS6594_BIT_GPIOX_OUT(gpio_inst)               BIT((gpio_inst) % 8)
+
+/* GPIO_IN_X register field definition */
+#define TPS6594_BIT_GPIOX_IN(gpio_inst)                        BIT((gpio_inst) % 8)
+#define TPS6594_BIT_NPWRON_IN                          BIT(3)
+
+/* RAIL_SEL_1 register field definition */
+#define TPS6594_MASK_BUCK1_GRP_SEL                     GENMASK(1, 0)
+#define TPS6594_MASK_BUCK2_GRP_SEL                     GENMASK(3, 2)
+#define TPS6594_MASK_BUCK3_GRP_SEL                     GENMASK(5, 4)
+#define TPS6594_MASK_BUCK4_GRP_SEL                     GENMASK(7, 6)
+
+/* RAIL_SEL_2 register field definition */
+#define TPS6594_MASK_BUCK5_GRP_SEL                     GENMASK(1, 0)
+#define TPS6594_MASK_LDO1_GRP_SEL                      GENMASK(3, 2)
+#define TPS6594_MASK_LDO2_GRP_SEL                      GENMASK(5, 4)
+#define TPS6594_MASK_LDO3_GRP_SEL                      GENMASK(7, 6)
+
+/* RAIL_SEL_3 register field definition */
+#define TPS6594_MASK_LDO4_GRP_SEL                      GENMASK(1, 0)
+#define TPS6594_MASK_VCCA_GRP_SEL                      GENMASK(3, 2)
+#define TPS6594_MASK_VMON1_GRP_SEL                     GENMASK(5, 4)
+#define TPS6594_MASK_VMON2_GRP_SEL                     GENMASK(7, 6)
+
+/* FSM_TRIG_SEL_1 register field definition */
+#define TPS6594_MASK_MCU_RAIL_TRIG                     GENMASK(1, 0)
+#define TPS6594_MASK_SOC_RAIL_TRIG                     GENMASK(3, 2)
+#define TPS6594_MASK_OTHER_RAIL_TRIG                   GENMASK(5, 4)
+#define TPS6594_MASK_SEVERE_ERR_TRIG                   GENMASK(7, 6)
+
+/* FSM_TRIG_SEL_2 register field definition */
+#define TPS6594_MASK_MODERATE_ERR_TRIG                 GENMASK(1, 0)
+
+/* FSM_TRIG_MASK_X register field definition */
+#define TPS6594_BIT_GPIOX_FSM_MASK(gpio_inst)          BIT(((gpio_inst) << 1) % 8)
+#define TPS6594_BIT_GPIOX_FSM_MASK_POL(gpio_inst)      BIT(((gpio_inst) << 1) % 8 + 1)
+
+/* MASK_BUCKX register field definition */
+#define TPS6594_BIT_BUCKX_OV_MASK(buck_inst)           BIT(((buck_inst) << 2) % 8)
+#define TPS6594_BIT_BUCKX_UV_MASK(buck_inst)           BIT(((buck_inst) << 2) % 8 + 1)
+#define TPS6594_BIT_BUCKX_ILIM_MASK(buck_inst)         BIT(((buck_inst) << 2) % 8 + 3)
+
+/* MASK_LDOX register field definition */
+#define TPS6594_BIT_LDOX_OV_MASK(ldo_inst)             BIT(((ldo_inst) << 2) % 8)
+#define TPS6594_BIT_LDOX_UV_MASK(ldo_inst)             BIT(((ldo_inst) << 2) % 8 + 1)
+#define TPS6594_BIT_LDOX_ILIM_MASK(ldo_inst)           BIT(((ldo_inst) << 2) % 8 + 3)
+
+/* MASK_VMON register field definition */
+#define TPS6594_BIT_VCCA_OV_MASK                       BIT(0)
+#define TPS6594_BIT_VCCA_UV_MASK                       BIT(1)
+#define TPS6594_BIT_VMON1_OV_MASK                      BIT(2)
+#define TPS6594_BIT_VMON1_UV_MASK                      BIT(3)
+#define TPS6594_BIT_VMON2_OV_MASK                      BIT(5)
+#define TPS6594_BIT_VMON2_UV_MASK                      BIT(6)
+
+/* MASK_GPIOX register field definition */
+#define TPS6594_BIT_GPIOX_FALL_MASK(gpio_inst)         BIT((gpio_inst) < 8 ? \
+                                                           (gpio_inst) : (gpio_inst) % 8)
+#define TPS6594_BIT_GPIOX_RISE_MASK(gpio_inst)         BIT((gpio_inst) < 8 ? \
+                                                           (gpio_inst) : (gpio_inst) % 8 + 3)
+
+/* MASK_STARTUP register field definition */
+#define TPS6594_BIT_NPWRON_START_MASK                  BIT(0)
+#define TPS6594_BIT_ENABLE_MASK                                BIT(1)
+#define TPS6594_BIT_FSD_MASK                           BIT(4)
+#define TPS6594_BIT_SOFT_REBOOT_MASK                   BIT(5)
+
+/* MASK_MISC register field definition */
+#define TPS6594_BIT_BIST_PASS_MASK                     BIT(0)
+#define TPS6594_BIT_EXT_CLK_MASK                       BIT(1)
+#define TPS6594_BIT_TWARN_MASK                         BIT(3)
+
+/* MASK_MODERATE_ERR register field definition */
+#define TPS6594_BIT_BIST_FAIL_MASK                     BIT(1)
+#define TPS6594_BIT_REG_CRC_ERR_MASK                   BIT(2)
+#define TPS6594_BIT_SPMI_ERR_MASK                      BIT(4)
+#define TPS6594_BIT_NPWRON_LONG_MASK                   BIT(5)
+#define TPS6594_BIT_NINT_READBACK_MASK                 BIT(6)
+#define TPS6594_BIT_NRSTOUT_READBACK_MASK              BIT(7)
+
+/* MASK_FSM_ERR register field definition */
+#define TPS6594_BIT_IMM_SHUTDOWN_MASK                  BIT(0)
+#define TPS6594_BIT_ORD_SHUTDOWN_MASK                  BIT(1)
+#define TPS6594_BIT_MCU_PWR_ERR_MASK                   BIT(2)
+#define TPS6594_BIT_SOC_PWR_ERR_MASK                   BIT(3)
+
+/* MASK_COMM_ERR register field definition */
+#define TPS6594_BIT_COMM_FRM_ERR_MASK                  BIT(0)
+#define TPS6594_BIT_COMM_CRC_ERR_MASK                  BIT(1)
+#define TPS6594_BIT_COMM_ADR_ERR_MASK                  BIT(3)
+#define TPS6594_BIT_I2C2_CRC_ERR_MASK                  BIT(5)
+#define TPS6594_BIT_I2C2_ADR_ERR_MASK                  BIT(7)
+
+/* MASK_READBACK_ERR register field definition */
+#define TPS6594_BIT_EN_DRV_READBACK_MASK               BIT(0)
+#define TPS6594_BIT_NRSTOUT_SOC_READBACK_MASK          BIT(3)
+
+/* MASK_ESM register field definition */
+#define TPS6594_BIT_ESM_SOC_PIN_MASK                   BIT(0)
+#define TPS6594_BIT_ESM_SOC_FAIL_MASK                  BIT(1)
+#define TPS6594_BIT_ESM_SOC_RST_MASK                   BIT(2)
+#define TPS6594_BIT_ESM_MCU_PIN_MASK                   BIT(3)
+#define TPS6594_BIT_ESM_MCU_FAIL_MASK                  BIT(4)
+#define TPS6594_BIT_ESM_MCU_RST_MASK                   BIT(5)
+
+/* INT_TOP register field definition */
+#define TPS6594_BIT_BUCK_INT                           BIT(0)
+#define TPS6594_BIT_LDO_VMON_INT                       BIT(1)
+#define TPS6594_BIT_GPIO_INT                           BIT(2)
+#define TPS6594_BIT_STARTUP_INT                                BIT(3)
+#define TPS6594_BIT_MISC_INT                           BIT(4)
+#define TPS6594_BIT_MODERATE_ERR_INT                   BIT(5)
+#define TPS6594_BIT_SEVERE_ERR_INT                     BIT(6)
+#define TPS6594_BIT_FSM_ERR_INT                                BIT(7)
+
+/* INT_BUCK register field definition */
+#define TPS6594_BIT_BUCK1_2_INT                                BIT(0)
+#define TPS6594_BIT_BUCK3_4_INT                                BIT(1)
+#define TPS6594_BIT_BUCK5_INT                          BIT(2)
+
+/* INT_BUCKX register field definition */
+#define TPS6594_BIT_BUCKX_OV_INT(buck_inst)            BIT(((buck_inst) << 2) % 8)
+#define TPS6594_BIT_BUCKX_UV_INT(buck_inst)            BIT(((buck_inst) << 2) % 8 + 1)
+#define TPS6594_BIT_BUCKX_SC_INT(buck_inst)            BIT(((buck_inst) << 2) % 8 + 2)
+#define TPS6594_BIT_BUCKX_ILIM_INT(buck_inst)          BIT(((buck_inst) << 2) % 8 + 3)
+
+/* INT_LDO_VMON register field definition */
+#define TPS6594_BIT_LDO1_2_INT                         BIT(0)
+#define TPS6594_BIT_LDO3_4_INT                         BIT(1)
+#define TPS6594_BIT_VCCA_INT                           BIT(4)
+
+/* INT_LDOX register field definition */
+#define TPS6594_BIT_LDOX_OV_INT(ldo_inst)              BIT(((ldo_inst) << 2) % 8)
+#define TPS6594_BIT_LDOX_UV_INT(ldo_inst)              BIT(((ldo_inst) << 2) % 8 + 1)
+#define TPS6594_BIT_LDOX_SC_INT(ldo_inst)              BIT(((ldo_inst) << 2) % 8 + 2)
+#define TPS6594_BIT_LDOX_ILIM_INT(ldo_inst)            BIT(((ldo_inst) << 2) % 8 + 3)
+
+/* INT_VMON register field definition */
+#define TPS6594_BIT_VCCA_OV_INT                                BIT(0)
+#define TPS6594_BIT_VCCA_UV_INT                                BIT(1)
+#define TPS6594_BIT_VMON1_OV_INT                       BIT(2)
+#define TPS6594_BIT_VMON1_UV_INT                       BIT(3)
+#define TPS6594_BIT_VMON1_RV_INT                       BIT(4)
+#define TPS6594_BIT_VMON2_OV_INT                       BIT(5)
+#define TPS6594_BIT_VMON2_UV_INT                       BIT(6)
+#define TPS6594_BIT_VMON2_RV_INT                       BIT(7)
+
+/* INT_GPIO register field definition */
+#define TPS6594_BIT_GPIO9_INT                          BIT(0)
+#define TPS6594_BIT_GPIO10_INT                         BIT(1)
+#define TPS6594_BIT_GPIO11_INT                         BIT(2)
+#define TPS6594_BIT_GPIO1_8_INT                                BIT(3)
+
+/* INT_GPIOX register field definition */
+#define TPS6594_BIT_GPIOX_INT(gpio_inst)               BIT(gpio_inst)
+
+/* INT_STARTUP register field definition */
+#define TPS6594_BIT_NPWRON_START_INT                   BIT(0)
+#define TPS6594_BIT_ENABLE_INT                         BIT(1)
+#define TPS6594_BIT_RTC_INT                            BIT(2)
+#define TPS6594_BIT_FSD_INT                            BIT(4)
+#define TPS6594_BIT_SOFT_REBOOT_INT                    BIT(5)
+
+/* INT_MISC register field definition */
+#define TPS6594_BIT_BIST_PASS_INT                      BIT(0)
+#define TPS6594_BIT_EXT_CLK_INT                                BIT(1)
+#define TPS6594_BIT_TWARN_INT                          BIT(3)
+
+/* INT_MODERATE_ERR register field definition */
+#define TPS6594_BIT_TSD_ORD_INT                                BIT(0)
+#define TPS6594_BIT_BIST_FAIL_INT                      BIT(1)
+#define TPS6594_BIT_REG_CRC_ERR_INT                    BIT(2)
+#define TPS6594_BIT_RECOV_CNT_INT                      BIT(3)
+#define TPS6594_BIT_SPMI_ERR_INT                       BIT(4)
+#define TPS6594_BIT_NPWRON_LONG_INT                    BIT(5)
+#define TPS6594_BIT_NINT_READBACK_INT                  BIT(6)
+#define TPS6594_BIT_NRSTOUT_READBACK_INT               BIT(7)
+
+/* INT_SEVERE_ERR register field definition */
+#define TPS6594_BIT_TSD_IMM_INT                                BIT(0)
+#define TPS6594_BIT_VCCA_OVP_INT                       BIT(1)
+#define TPS6594_BIT_PFSM_ERR_INT                       BIT(2)
+
+/* INT_FSM_ERR register field definition */
+#define TPS6594_BIT_IMM_SHUTDOWN_INT                   BIT(0)
+#define TPS6594_BIT_ORD_SHUTDOWN_INT                   BIT(1)
+#define TPS6594_BIT_MCU_PWR_ERR_INT                    BIT(2)
+#define TPS6594_BIT_SOC_PWR_ERR_INT                    BIT(3)
+#define TPS6594_BIT_COMM_ERR_INT                       BIT(4)
+#define TPS6594_BIT_READBACK_ERR_INT                   BIT(5)
+#define TPS6594_BIT_ESM_INT                            BIT(6)
+#define TPS6594_BIT_WD_INT                             BIT(7)
+
+/* INT_COMM_ERR register field definition */
+#define TPS6594_BIT_COMM_FRM_ERR_INT                   BIT(0)
+#define TPS6594_BIT_COMM_CRC_ERR_INT                   BIT(1)
+#define TPS6594_BIT_COMM_ADR_ERR_INT                   BIT(3)
+#define TPS6594_BIT_I2C2_CRC_ERR_INT                   BIT(5)
+#define TPS6594_BIT_I2C2_ADR_ERR_INT                   BIT(7)
+
+/* INT_READBACK_ERR register field definition */
+#define TPS6594_BIT_EN_DRV_READBACK_INT                        BIT(0)
+#define TPS6594_BIT_NRSTOUT_SOC_READBACK_INT           BIT(3)
+
+/* INT_ESM register field definition */
+#define TPS6594_BIT_ESM_SOC_PIN_INT                    BIT(0)
+#define TPS6594_BIT_ESM_SOC_FAIL_INT                   BIT(1)
+#define TPS6594_BIT_ESM_SOC_RST_INT                    BIT(2)
+#define TPS6594_BIT_ESM_MCU_PIN_INT                    BIT(3)
+#define TPS6594_BIT_ESM_MCU_FAIL_INT                   BIT(4)
+#define TPS6594_BIT_ESM_MCU_RST_INT                    BIT(5)
+
+/* STAT_BUCKX register field definition */
+#define TPS6594_BIT_BUCKX_OV_STAT(buck_inst)           BIT(((buck_inst) << 2) % 8)
+#define TPS6594_BIT_BUCKX_UV_STAT(buck_inst)           BIT(((buck_inst) << 2) % 8 + 1)
+#define TPS6594_BIT_BUCKX_ILIM_STAT(buck_inst)         BIT(((buck_inst) << 2) % 8 + 3)
+
+/* STAT_LDOX register field definition */
+#define TPS6594_BIT_LDOX_OV_STAT(ldo_inst)             BIT(((ldo_inst) << 2) % 8)
+#define TPS6594_BIT_LDOX_UV_STAT(ldo_inst)             BIT(((ldo_inst) << 2) % 8 + 1)
+#define TPS6594_BIT_LDOX_ILIM_STAT(ldo_inst)           BIT(((ldo_inst) << 2) % 8 + 3)
+
+/* STAT_VMON register field definition */
+#define TPS6594_BIT_VCCA_OV_STAT                       BIT(0)
+#define TPS6594_BIT_VCCA_UV_STAT                       BIT(1)
+#define TPS6594_BIT_VMON1_OV_STAT                      BIT(2)
+#define TPS6594_BIT_VMON1_UV_STAT                      BIT(3)
+#define TPS6594_BIT_VMON2_OV_STAT                      BIT(5)
+#define TPS6594_BIT_VMON2_UV_STAT                      BIT(6)
+
+/* STAT_STARTUP register field definition */
+#define TPS6594_BIT_ENABLE_STAT                                BIT(1)
+
+/* STAT_MISC register field definition */
+#define TPS6594_BIT_EXT_CLK_STAT                       BIT(1)
+#define TPS6594_BIT_TWARN_STAT                         BIT(3)
+
+/* STAT_MODERATE_ERR register field definition */
+#define TPS6594_BIT_TSD_ORD_STAT                       BIT(0)
+
+/* STAT_SEVERE_ERR register field definition */
+#define TPS6594_BIT_TSD_IMM_STAT                       BIT(0)
+#define TPS6594_BIT_VCCA_OVP_STAT                      BIT(1)
+
+/* STAT_READBACK_ERR register field definition */
+#define TPS6594_BIT_EN_DRV_READBACK_STAT               BIT(0)
+#define TPS6594_BIT_NINT_READBACK_STAT                 BIT(1)
+#define TPS6594_BIT_NRSTOUT_READBACK_STAT              BIT(2)
+#define TPS6594_BIT_NRSTOUT_SOC_READBACK_STAT          BIT(3)
+
+/* PGOOD_SEL_1 register field definition */
+#define TPS6594_MASK_PGOOD_SEL_BUCK1                   GENMASK(1, 0)
+#define TPS6594_MASK_PGOOD_SEL_BUCK2                   GENMASK(3, 2)
+#define TPS6594_MASK_PGOOD_SEL_BUCK3                   GENMASK(5, 4)
+#define TPS6594_MASK_PGOOD_SEL_BUCK4                   GENMASK(7, 6)
+
+/* PGOOD_SEL_2 register field definition */
+#define TPS6594_MASK_PGOOD_SEL_BUCK5                   GENMASK(1, 0)
+
+/* PGOOD_SEL_3 register field definition */
+#define TPS6594_MASK_PGOOD_SEL_LDO1                    GENMASK(1, 0)
+#define TPS6594_MASK_PGOOD_SEL_LDO2                    GENMASK(3, 2)
+#define TPS6594_MASK_PGOOD_SEL_LDO3                    GENMASK(5, 4)
+#define TPS6594_MASK_PGOOD_SEL_LDO4                    GENMASK(7, 6)
+
+/* PGOOD_SEL_4 register field definition */
+#define TPS6594_BIT_PGOOD_SEL_VCCA                     BIT(0)
+#define TPS6594_BIT_PGOOD_SEL_VMON1                    BIT(1)
+#define TPS6594_BIT_PGOOD_SEL_VMON2                    BIT(2)
+#define TPS6594_BIT_PGOOD_SEL_TDIE_WARN                        BIT(3)
+#define TPS6594_BIT_PGOOD_SEL_NRSTOUT                  BIT(4)
+#define TPS6594_BIT_PGOOD_SEL_NRSTOUT_SOC              BIT(5)
+#define TPS6594_BIT_PGOOD_POL                          BIT(6)
+#define TPS6594_BIT_PGOOD_WINDOW                       BIT(7)
+
+/* PLL_CTRL register field definition */
+#define TPS6594_MASK_EXT_CLK_FREQ                      GENMASK(1, 0)
+
+/* CONFIG_1 register field definition */
+#define TPS6594_BIT_TWARN_LEVEL                                BIT(0)
+#define TPS6594_BIT_TSD_ORD_LEVEL                      BIT(1)
+#define TPS6594_BIT_I2C1_HS                            BIT(3)
+#define TPS6594_BIT_I2C2_HS                            BIT(4)
+#define TPS6594_BIT_EN_ILIM_FSM_CTRL                   BIT(5)
+#define TPS6594_BIT_NSLEEP1_MASK                       BIT(6)
+#define TPS6594_BIT_NSLEEP2_MASK                       BIT(7)
+
+/* CONFIG_2 register field definition */
+#define TPS6594_BIT_BB_CHARGER_EN                      BIT(0)
+#define TPS6594_BIT_BB_ICHR                            BIT(1)
+#define TPS6594_MASK_BB_VEOC                           GENMASK(3, 2)
+#define TPS6594_BB_EOC_RDY                             BIT(7)
+
+/* ENABLE_DRV_REG register field definition */
+#define TPS6594_BIT_ENABLE_DRV                         BIT(0)
+
+/* MISC_CTRL register field definition */
+#define TPS6594_BIT_NRSTOUT                            BIT(0)
+#define TPS6594_BIT_NRSTOUT_SOC                                BIT(1)
+#define TPS6594_BIT_LPM_EN                             BIT(2)
+#define TPS6594_BIT_CLKMON_EN                          BIT(3)
+#define TPS6594_BIT_AMUXOUT_EN                         BIT(4)
+#define TPS6594_BIT_SEL_EXT_CLK                                BIT(5)
+#define TPS6594_MASK_SYNCCLKOUT_FREQ_SEL               GENMASK(7, 6)
+
+/* ENABLE_DRV_STAT register field definition */
+#define TPS6594_BIT_EN_DRV_IN                          BIT(0)
+#define TPS6594_BIT_NRSTOUT_IN                         BIT(1)
+#define TPS6594_BIT_NRSTOUT_SOC_IN                     BIT(2)
+#define TPS6594_BIT_FORCE_EN_DRV_LOW                   BIT(3)
+#define TPS6594_BIT_SPMI_LPM_EN                                BIT(4)
+
+/* RECOV_CNT_REG_1 register field definition */
+#define TPS6594_MASK_RECOV_CNT                         GENMASK(3, 0)
+
+/* RECOV_CNT_REG_2 register field definition */
+#define TPS6594_MASK_RECOV_CNT_THR                     GENMASK(3, 0)
+#define TPS6594_BIT_RECOV_CNT_CLR                      BIT(4)
+
+/* FSM_I2C_TRIGGERS register field definition */
+#define TPS6594_BIT_TRIGGER_I2C(bit)                   BIT(bit)
+
+/* FSM_NSLEEP_TRIGGERS register field definition */
+#define TPS6594_BIT_NSLEEP1B                           BIT(0)
+#define TPS6594_BIT_NSLEEP2B                           BIT(1)
+
+/* BUCK_RESET_REG register field definition */
+#define TPS6594_BIT_BUCKX_RESET(buck_inst)             BIT(buck_inst)
+
+/* SPREAD_SPECTRUM_1 register field definition */
+#define TPS6594_MASK_SS_DEPTH                          GENMASK(1, 0)
+#define TPS6594_BIT_SS_EN                              BIT(2)
+
+/* FREQ_SEL register field definition */
+#define TPS6594_BIT_BUCKX_FREQ_SEL(buck_inst)          BIT(buck_inst)
+
+/* FSM_STEP_SIZE register field definition */
+#define TPS6594_MASK_PFSM_DELAY_STEP                   GENMASK(4, 0)
+
+/* LDO_RV_TIMEOUT_REG_1 register field definition */
+#define TPS6594_MASK_LDO1_RV_TIMEOUT                   GENMASK(3, 0)
+#define TPS6594_MASK_LDO2_RV_TIMEOUT                   GENMASK(7, 4)
+
+/* LDO_RV_TIMEOUT_REG_2 register field definition */
+#define TPS6594_MASK_LDO3_RV_TIMEOUT                   GENMASK(3, 0)
+#define TPS6594_MASK_LDO4_RV_TIMEOUT                   GENMASK(7, 4)
+
+/* USER_SPARE_REGS register field definition */
+#define TPS6594_BIT_USER_SPARE(bit)                    BIT(bit)
+
+/* ESM_MCU_START_REG register field definition */
+#define TPS6594_BIT_ESM_MCU_START                      BIT(0)
+
+/* ESM_MCU_MODE_CFG register field definition */
+#define TPS6594_MASK_ESM_MCU_ERR_CNT_TH                        GENMASK(3, 0)
+#define TPS6594_BIT_ESM_MCU_ENDRV                      BIT(5)
+#define TPS6594_BIT_ESM_MCU_EN                         BIT(6)
+#define TPS6594_BIT_ESM_MCU_MODE                       BIT(7)
+
+/* ESM_MCU_ERR_CNT_REG register field definition */
+#define TPS6594_MASK_ESM_MCU_ERR_CNT                   GENMASK(4, 0)
+
+/* ESM_SOC_START_REG register field definition */
+#define TPS6594_BIT_ESM_SOC_START                      BIT(0)
+
+/* ESM_SOC_MODE_CFG register field definition */
+#define TPS6594_MASK_ESM_SOC_ERR_CNT_TH                        GENMASK(3, 0)
+#define TPS6594_BIT_ESM_SOC_ENDRV                      BIT(5)
+#define TPS6594_BIT_ESM_SOC_EN                         BIT(6)
+#define TPS6594_BIT_ESM_SOC_MODE                       BIT(7)
+
+/* ESM_SOC_ERR_CNT_REG register field definition */
+#define TPS6594_MASK_ESM_SOC_ERR_CNT                   GENMASK(4, 0)
+
+/* REGISTER_LOCK register field definition */
+#define TPS6594_BIT_REGISTER_LOCK_STATUS               BIT(0)
+
+/* VMON_CONF register field definition */
+#define TPS6594_MASK_VMON1_SLEW_RATE                   GENMASK(2, 0)
+#define TPS6594_MASK_VMON2_SLEW_RATE                   GENMASK(5, 3)
+
+/* SOFT_REBOOT_REG register field definition */
+#define TPS6594_BIT_SOFT_REBOOT                                BIT(0)
+
+/* RTC_SECONDS & ALARM_SECONDS register field definition */
+#define TPS6594_MASK_SECOND_0                          GENMASK(3, 0)
+#define TPS6594_MASK_SECOND_1                          GENMASK(6, 4)
+
+/* RTC_MINUTES & ALARM_MINUTES register field definition */
+#define TPS6594_MASK_MINUTE_0                          GENMASK(3, 0)
+#define TPS6594_MASK_MINUTE_1                          GENMASK(6, 4)
+
+/* RTC_HOURS & ALARM_HOURS register field definition */
+#define TPS6594_MASK_HOUR_0                            GENMASK(3, 0)
+#define TPS6594_MASK_HOUR_1                            GENMASK(5, 4)
+#define TPS6594_BIT_PM_NAM                             BIT(7)
+
+/* RTC_DAYS & ALARM_DAYS register field definition */
+#define TPS6594_MASK_DAY_0                             GENMASK(3, 0)
+#define TPS6594_MASK_DAY_1                             GENMASK(5, 4)
+
+/* RTC_MONTHS & ALARM_MONTHS register field definition */
+#define TPS6594_MASK_MONTH_0                           GENMASK(3, 0)
+#define TPS6594_BIT_MONTH_1                            BIT(4)
+
+/* RTC_YEARS & ALARM_YEARS register field definition */
+#define TPS6594_MASK_YEAR_0                            GENMASK(3, 0)
+#define TPS6594_MASK_YEAR_1                            GENMASK(7, 4)
+
+/* RTC_WEEKS register field definition */
+#define TPS6594_MASK_WEEK                              GENMASK(2, 0)
+
+/* RTC_CTRL_1 register field definition */
+#define TPS6594_BIT_STOP_RTC                           BIT(0)
+#define TPS6594_BIT_ROUND_30S                          BIT(1)
+#define TPS6594_BIT_AUTO_COMP                          BIT(2)
+#define TPS6594_BIT_MODE_12_24                         BIT(3)
+#define TPS6594_BIT_SET_32_COUNTER                     BIT(5)
+#define TPS6594_BIT_GET_TIME                           BIT(6)
+#define TPS6594_BIT_RTC_V_OPT                          BIT(7)
+
+/* RTC_CTRL_2 register field definition */
+#define TPS6594_BIT_XTAL_EN                            BIT(0)
+#define TPS6594_MASK_XTAL_SEL                          GENMASK(2, 1)
+#define TPS6594_BIT_LP_STANDBY_SEL                     BIT(3)
+#define TPS6594_BIT_FAST_BIST                          BIT(4)
+#define TPS6594_MASK_STARTUP_DEST                      GENMASK(6, 5)
+#define TPS6594_BIT_FIRST_STARTUP_DONE                 BIT(7)
+
+/* RTC_STATUS register field definition */
+#define TPS6594_BIT_RUN                                        BIT(1)
+#define TPS6594_BIT_TIMER                              BIT(5)
+#define TPS6594_BIT_ALARM                              BIT(6)
+#define TPS6594_BIT_POWER_UP                           BIT(7)
+
+/* RTC_INTERRUPTS register field definition */
+#define TPS6594_MASK_EVERY                             GENMASK(1, 0)
+#define TPS6594_BIT_IT_TIMER                           BIT(2)
+#define TPS6594_BIT_IT_ALARM                           BIT(3)
+
+/* RTC_RESET_STATUS register field definition */
+#define TPS6594_BIT_RESET_STATUS_RTC                   BIT(0)
+
+/* SERIAL_IF_CONFIG register field definition */
+#define TPS6594_BIT_I2C_SPI_SEL                                BIT(0)
+#define TPS6594_BIT_I2C1_SPI_CRC_EN                    BIT(1)
+#define TPS6594_BIT_I2C2_CRC_EN                                BIT(2)
+#define TPS6594_MASK_T_CRC                             GENMASK(7, 3)
+
+/* WD_QUESTION_ANSW_CNT register field definition */
+#define TPS6594_MASK_WD_QUESTION                       GENMASK(3, 0)
+#define TPS6594_MASK_WD_ANSW_CNT                       GENMASK(5, 4)
+
+/* WD_MODE_REG register field definition */
+#define TPS6594_BIT_WD_RETURN_LONGWIN                  BIT(0)
+#define TPS6594_BIT_WD_MODE_SELECT                     BIT(1)
+#define TPS6594_BIT_WD_PWRHOLD                         BIT(2)
+
+/* WD_QA_CFG register field definition */
+#define TPS6594_MASK_WD_QUESTION_SEED                  GENMASK(3, 0)
+#define TPS6594_MASK_WD_QA_LFSR                                GENMASK(5, 4)
+#define TPS6594_MASK_WD_QA_FDBK                                GENMASK(7, 6)
+
+/* WD_ERR_STATUS register field definition */
+#define TPS6594_BIT_WD_LONGWIN_TIMEOUT_INT             BIT(0)
+#define TPS6594_BIT_WD_TIMEOUT                         BIT(1)
+#define TPS6594_BIT_WD_TRIG_EARLY                      BIT(2)
+#define TPS6594_BIT_WD_ANSW_EARLY                      BIT(3)
+#define TPS6594_BIT_WD_SEQ_ERR                         BIT(4)
+#define TPS6594_BIT_WD_ANSW_ERR                                BIT(5)
+#define TPS6594_BIT_WD_FAIL_INT                                BIT(6)
+#define TPS6594_BIT_WD_RST_INT                         BIT(7)
+
+/* WD_THR_CFG register field definition */
+#define TPS6594_MASK_WD_RST_TH                         GENMASK(2, 0)
+#define TPS6594_MASK_WD_FAIL_TH                                GENMASK(5, 3)
+#define TPS6594_BIT_WD_EN                              BIT(6)
+#define TPS6594_BIT_WD_RST_EN                          BIT(7)
+
+/* WD_FAIL_CNT_REG register field definition */
+#define TPS6594_MASK_WD_FAIL_CNT                       GENMASK(3, 0)
+#define TPS6594_BIT_WD_FIRST_OK                                BIT(5)
+#define TPS6594_BIT_WD_BAD_EVENT                       BIT(6)
+
+/* CRC8 polynomial for I2C & SPI protocols */
+#define TPS6594_CRC8_POLYNOMIAL        0x07
+
+/* IRQs */
+enum tps6594_irqs {
+       /* INT_BUCK1_2 register */
+       TPS6594_IRQ_BUCK1_OV,
+       TPS6594_IRQ_BUCK1_UV,
+       TPS6594_IRQ_BUCK1_SC,
+       TPS6594_IRQ_BUCK1_ILIM,
+       TPS6594_IRQ_BUCK2_OV,
+       TPS6594_IRQ_BUCK2_UV,
+       TPS6594_IRQ_BUCK2_SC,
+       TPS6594_IRQ_BUCK2_ILIM,
+       /* INT_BUCK3_4 register */
+       TPS6594_IRQ_BUCK3_OV,
+       TPS6594_IRQ_BUCK3_UV,
+       TPS6594_IRQ_BUCK3_SC,
+       TPS6594_IRQ_BUCK3_ILIM,
+       TPS6594_IRQ_BUCK4_OV,
+       TPS6594_IRQ_BUCK4_UV,
+       TPS6594_IRQ_BUCK4_SC,
+       TPS6594_IRQ_BUCK4_ILIM,
+       /* INT_BUCK5 register */
+       TPS6594_IRQ_BUCK5_OV,
+       TPS6594_IRQ_BUCK5_UV,
+       TPS6594_IRQ_BUCK5_SC,
+       TPS6594_IRQ_BUCK5_ILIM,
+       /* INT_LDO1_2 register */
+       TPS6594_IRQ_LDO1_OV,
+       TPS6594_IRQ_LDO1_UV,
+       TPS6594_IRQ_LDO1_SC,
+       TPS6594_IRQ_LDO1_ILIM,
+       TPS6594_IRQ_LDO2_OV,
+       TPS6594_IRQ_LDO2_UV,
+       TPS6594_IRQ_LDO2_SC,
+       TPS6594_IRQ_LDO2_ILIM,
+       /* INT_LDO3_4 register */
+       TPS6594_IRQ_LDO3_OV,
+       TPS6594_IRQ_LDO3_UV,
+       TPS6594_IRQ_LDO3_SC,
+       TPS6594_IRQ_LDO3_ILIM,
+       TPS6594_IRQ_LDO4_OV,
+       TPS6594_IRQ_LDO4_UV,
+       TPS6594_IRQ_LDO4_SC,
+       TPS6594_IRQ_LDO4_ILIM,
+       /* INT_VMON register */
+       TPS6594_IRQ_VCCA_OV,
+       TPS6594_IRQ_VCCA_UV,
+       TPS6594_IRQ_VMON1_OV,
+       TPS6594_IRQ_VMON1_UV,
+       TPS6594_IRQ_VMON1_RV,
+       TPS6594_IRQ_VMON2_OV,
+       TPS6594_IRQ_VMON2_UV,
+       TPS6594_IRQ_VMON2_RV,
+       /* INT_GPIO register */
+       TPS6594_IRQ_GPIO9,
+       TPS6594_IRQ_GPIO10,
+       TPS6594_IRQ_GPIO11,
+       /* INT_GPIO1_8 register */
+       TPS6594_IRQ_GPIO1,
+       TPS6594_IRQ_GPIO2,
+       TPS6594_IRQ_GPIO3,
+       TPS6594_IRQ_GPIO4,
+       TPS6594_IRQ_GPIO5,
+       TPS6594_IRQ_GPIO6,
+       TPS6594_IRQ_GPIO7,
+       TPS6594_IRQ_GPIO8,
+       /* INT_STARTUP register */
+       TPS6594_IRQ_NPWRON_START,
+       TPS6594_IRQ_ENABLE,
+       TPS6594_IRQ_FSD,
+       TPS6594_IRQ_SOFT_REBOOT,
+       /* INT_MISC register */
+       TPS6594_IRQ_BIST_PASS,
+       TPS6594_IRQ_EXT_CLK,
+       TPS6594_IRQ_TWARN,
+       /* INT_MODERATE_ERR register */
+       TPS6594_IRQ_TSD_ORD,
+       TPS6594_IRQ_BIST_FAIL,
+       TPS6594_IRQ_REG_CRC_ERR,
+       TPS6594_IRQ_RECOV_CNT,
+       TPS6594_IRQ_SPMI_ERR,
+       TPS6594_IRQ_NPWRON_LONG,
+       TPS6594_IRQ_NINT_READBACK,
+       TPS6594_IRQ_NRSTOUT_READBACK,
+       /* INT_SEVERE_ERR register */
+       TPS6594_IRQ_TSD_IMM,
+       TPS6594_IRQ_VCCA_OVP,
+       TPS6594_IRQ_PFSM_ERR,
+       /* INT_FSM_ERR register */
+       TPS6594_IRQ_IMM_SHUTDOWN,
+       TPS6594_IRQ_ORD_SHUTDOWN,
+       TPS6594_IRQ_MCU_PWR_ERR,
+       TPS6594_IRQ_SOC_PWR_ERR,
+       /* INT_COMM_ERR register */
+       TPS6594_IRQ_COMM_FRM_ERR,
+       TPS6594_IRQ_COMM_CRC_ERR,
+       TPS6594_IRQ_COMM_ADR_ERR,
+       TPS6594_IRQ_I2C2_CRC_ERR,
+       TPS6594_IRQ_I2C2_ADR_ERR,
+       /* INT_READBACK_ERR register */
+       TPS6594_IRQ_EN_DRV_READBACK,
+       TPS6594_IRQ_NRSTOUT_SOC_READBACK,
+       /* INT_ESM register */
+       TPS6594_IRQ_ESM_SOC_PIN,
+       TPS6594_IRQ_ESM_SOC_FAIL,
+       TPS6594_IRQ_ESM_SOC_RST,
+       /* RTC_STATUS register */
+       TPS6594_IRQ_TIMER,
+       TPS6594_IRQ_ALARM,
+       TPS6594_IRQ_POWER_UP,
+};
+
+#define TPS6594_IRQ_NAME_BUCK1_OV              "buck1_ov"
+#define TPS6594_IRQ_NAME_BUCK1_UV              "buck1_uv"
+#define TPS6594_IRQ_NAME_BUCK1_SC              "buck1_sc"
+#define TPS6594_IRQ_NAME_BUCK1_ILIM            "buck1_ilim"
+#define TPS6594_IRQ_NAME_BUCK2_OV              "buck2_ov"
+#define TPS6594_IRQ_NAME_BUCK2_UV              "buck2_uv"
+#define TPS6594_IRQ_NAME_BUCK2_SC              "buck2_sc"
+#define TPS6594_IRQ_NAME_BUCK2_ILIM            "buck2_ilim"
+#define TPS6594_IRQ_NAME_BUCK3_OV              "buck3_ov"
+#define TPS6594_IRQ_NAME_BUCK3_UV              "buck3_uv"
+#define TPS6594_IRQ_NAME_BUCK3_SC              "buck3_sc"
+#define TPS6594_IRQ_NAME_BUCK3_ILIM            "buck3_ilim"
+#define TPS6594_IRQ_NAME_BUCK4_OV              "buck4_ov"
+#define TPS6594_IRQ_NAME_BUCK4_UV              "buck4_uv"
+#define TPS6594_IRQ_NAME_BUCK4_SC              "buck4_sc"
+#define TPS6594_IRQ_NAME_BUCK4_ILIM            "buck4_ilim"
+#define TPS6594_IRQ_NAME_BUCK5_OV              "buck5_ov"
+#define TPS6594_IRQ_NAME_BUCK5_UV              "buck5_uv"
+#define TPS6594_IRQ_NAME_BUCK5_SC              "buck5_sc"
+#define TPS6594_IRQ_NAME_BUCK5_ILIM            "buck5_ilim"
+#define TPS6594_IRQ_NAME_LDO1_OV               "ldo1_ov"
+#define TPS6594_IRQ_NAME_LDO1_UV               "ldo1_uv"
+#define TPS6594_IRQ_NAME_LDO1_SC               "ldo1_sc"
+#define TPS6594_IRQ_NAME_LDO1_ILIM             "ldo1_ilim"
+#define TPS6594_IRQ_NAME_LDO2_OV               "ldo2_ov"
+#define TPS6594_IRQ_NAME_LDO2_UV               "ldo2_uv"
+#define TPS6594_IRQ_NAME_LDO2_SC               "ldo2_sc"
+#define TPS6594_IRQ_NAME_LDO2_ILIM             "ldo2_ilim"
+#define TPS6594_IRQ_NAME_LDO3_OV               "ldo3_ov"
+#define TPS6594_IRQ_NAME_LDO3_UV               "ldo3_uv"
+#define TPS6594_IRQ_NAME_LDO3_SC               "ldo3_sc"
+#define TPS6594_IRQ_NAME_LDO3_ILIM             "ldo3_ilim"
+#define TPS6594_IRQ_NAME_LDO4_OV               "ldo4_ov"
+#define TPS6594_IRQ_NAME_LDO4_UV               "ldo4_uv"
+#define TPS6594_IRQ_NAME_LDO4_SC               "ldo4_sc"
+#define TPS6594_IRQ_NAME_LDO4_ILIM             "ldo4_ilim"
+#define TPS6594_IRQ_NAME_VCCA_OV               "vcca_ov"
+#define TPS6594_IRQ_NAME_VCCA_UV               "vcca_uv"
+#define TPS6594_IRQ_NAME_VMON1_OV              "vmon1_ov"
+#define TPS6594_IRQ_NAME_VMON1_UV              "vmon1_uv"
+#define TPS6594_IRQ_NAME_VMON1_RV              "vmon1_rv"
+#define TPS6594_IRQ_NAME_VMON2_OV              "vmon2_ov"
+#define TPS6594_IRQ_NAME_VMON2_UV              "vmon2_uv"
+#define TPS6594_IRQ_NAME_VMON2_RV              "vmon2_rv"
+#define TPS6594_IRQ_NAME_GPIO9                 "gpio9"
+#define TPS6594_IRQ_NAME_GPIO10                        "gpio10"
+#define TPS6594_IRQ_NAME_GPIO11                        "gpio11"
+#define TPS6594_IRQ_NAME_GPIO1                 "gpio1"
+#define TPS6594_IRQ_NAME_GPIO2                 "gpio2"
+#define TPS6594_IRQ_NAME_GPIO3                 "gpio3"
+#define TPS6594_IRQ_NAME_GPIO4                 "gpio4"
+#define TPS6594_IRQ_NAME_GPIO5                 "gpio5"
+#define TPS6594_IRQ_NAME_GPIO6                 "gpio6"
+#define TPS6594_IRQ_NAME_GPIO7                 "gpio7"
+#define TPS6594_IRQ_NAME_GPIO8                 "gpio8"
+#define TPS6594_IRQ_NAME_NPWRON_START          "npwron_start"
+#define TPS6594_IRQ_NAME_ENABLE                        "enable"
+#define TPS6594_IRQ_NAME_FSD                   "fsd"
+#define TPS6594_IRQ_NAME_SOFT_REBOOT           "soft_reboot"
+#define TPS6594_IRQ_NAME_BIST_PASS             "bist_pass"
+#define TPS6594_IRQ_NAME_EXT_CLK               "ext_clk"
+#define TPS6594_IRQ_NAME_TWARN                 "twarn"
+#define TPS6594_IRQ_NAME_TSD_ORD               "tsd_ord"
+#define TPS6594_IRQ_NAME_BIST_FAIL             "bist_fail"
+#define TPS6594_IRQ_NAME_REG_CRC_ERR           "reg_crc_err"
+#define TPS6594_IRQ_NAME_RECOV_CNT             "recov_cnt"
+#define TPS6594_IRQ_NAME_SPMI_ERR              "spmi_err"
+#define TPS6594_IRQ_NAME_NPWRON_LONG           "npwron_long"
+#define TPS6594_IRQ_NAME_NINT_READBACK         "nint_readback"
+#define TPS6594_IRQ_NAME_NRSTOUT_READBACK      "nrstout_readback"
+#define TPS6594_IRQ_NAME_TSD_IMM               "tsd_imm"
+#define TPS6594_IRQ_NAME_VCCA_OVP              "vcca_ovp"
+#define TPS6594_IRQ_NAME_PFSM_ERR              "pfsm_err"
+#define TPS6594_IRQ_NAME_IMM_SHUTDOWN          "imm_shutdown"
+#define TPS6594_IRQ_NAME_ORD_SHUTDOWN          "ord_shutdown"
+#define TPS6594_IRQ_NAME_MCU_PWR_ERR           "mcu_pwr_err"
+#define TPS6594_IRQ_NAME_SOC_PWR_ERR           "soc_pwr_err"
+#define TPS6594_IRQ_NAME_COMM_FRM_ERR          "comm_frm_err"
+#define TPS6594_IRQ_NAME_COMM_CRC_ERR          "comm_crc_err"
+#define TPS6594_IRQ_NAME_COMM_ADR_ERR          "comm_adr_err"
+#define TPS6594_IRQ_NAME_EN_DRV_READBACK       "en_drv_readback"
+#define TPS6594_IRQ_NAME_NRSTOUT_SOC_READBACK  "nrstout_soc_readback"
+#define TPS6594_IRQ_NAME_ESM_SOC_PIN           "esm_soc_pin"
+#define TPS6594_IRQ_NAME_ESM_SOC_FAIL          "esm_soc_fail"
+#define TPS6594_IRQ_NAME_ESM_SOC_RST           "esm_soc_rst"
+#define TPS6594_IRQ_NAME_TIMER                 "timer"
+#define TPS6594_IRQ_NAME_ALARM                 "alarm"
+#define TPS6594_IRQ_NAME_POWERUP               "powerup"
+
+/**
+ * struct tps6594 - device private data structure
+ *
+ * @dev:      MFD parent device
+ * @chip_id:  chip ID
+ * @reg:      I2C slave address or SPI chip select number
+ * @use_crc:  if true, use CRC for I2C and SPI interface protocols
+ * @regmap:   regmap for accessing the device registers
+ * @irq:      irq generated by the device
+ * @irq_data: regmap irq data used for the irq chip
+ */
+struct tps6594 {
+       struct device *dev;
+       unsigned long chip_id;
+       unsigned short reg;
+       bool use_crc;
+       struct regmap *regmap;
+       int irq;
+       struct regmap_irq_chip_data *irq_data;
+};
+
+bool tps6594_is_volatile_reg(struct device *dev, unsigned int reg);
+int tps6594_device_init(struct tps6594 *tps, bool enable_crc);
+
+#endif /*  __LINUX_MFD_TPS6594_H */
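The field macros added above are plain BIT()/GENMASK() definitions intended to be combined with FIELD_PREP()/FIELD_GET() and the shared regmap held in struct tps6594. A minimal usage sketch, not part of the patch, of how a child driver could program one of the LDO1-3 output-voltage fields; the demo_* name is invented here, and the LDOX_VOUT register offset is taken as a parameter because the offset macros live in the earlier, non-quoted part of this header.

#include <linux/bitfield.h>
#include <linux/mfd/tps6594.h>
#include <linux/regmap.h>

/* Sketch only: set an LDO1-3 VSET field and clear the bypass bit via the MFD regmap. */
static int demo_ldo123_set_vset(struct tps6594 *tps, unsigned int vout_reg,
                                unsigned int vset)
{
        return regmap_update_bits(tps->regmap, vout_reg,
                                  TPS6594_MASK_LDO123_VSET | TPS6594_BIT_LDO_BYPASS,
                                  FIELD_PREP(TPS6594_MASK_LDO123_VSET, vset));
}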
index 6241a15..711dd94 100644 (file)
@@ -7,8 +7,8 @@
 #include <linux/migrate_mode.h>
 #include <linux/hugetlb.h>
 
-typedef struct page *new_page_t(struct page *page, unsigned long private);
-typedef void free_page_t(struct page *page, unsigned long private);
+typedef struct folio *new_folio_t(struct folio *folio, unsigned long private);
+typedef void free_folio_t(struct folio *folio, unsigned long private);
 
 struct migration_target_control;
 
@@ -67,16 +67,16 @@ int migrate_folio_extra(struct address_space *mapping, struct folio *dst,
                struct folio *src, enum migrate_mode mode, int extra_count);
 int migrate_folio(struct address_space *mapping, struct folio *dst,
                struct folio *src, enum migrate_mode mode);
-int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
+int migrate_pages(struct list_head *l, new_folio_t new, free_folio_t free,
                  unsigned long private, enum migrate_mode mode, int reason,
                  unsigned int *ret_succeeded);
-struct page *alloc_migration_target(struct page *page, unsigned long private);
+struct folio *alloc_migration_target(struct folio *src, unsigned long private);
 bool isolate_movable_page(struct page *page, isolate_mode_t mode);
 
 int migrate_huge_page_move_mapping(struct address_space *mapping,
                struct folio *dst, struct folio *src);
-void migration_entry_wait_on_locked(swp_entry_t entry, pte_t *ptep,
-                               spinlock_t *ptl);
+void migration_entry_wait_on_locked(swp_entry_t entry, spinlock_t *ptl)
+               __releases(ptl);
 void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
 void folio_migrate_copy(struct folio *newfolio, struct folio *folio);
 int folio_migrate_mapping(struct address_space *mapping,
@@ -85,11 +85,11 @@ int folio_migrate_mapping(struct address_space *mapping,
 #else
 
 static inline void putback_movable_pages(struct list_head *l) {}
-static inline int migrate_pages(struct list_head *l, new_page_t new,
-               free_page_t free, unsigned long private, enum migrate_mode mode,
-               int reason, unsigned int *ret_succeeded)
+static inline int migrate_pages(struct list_head *l, new_folio_t new,
+               free_folio_t free, unsigned long private,
+               enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
        { return -ENOSYS; }
-static inline struct page *alloc_migration_target(struct page *page,
+static inline struct folio *alloc_migration_target(struct folio *src,
                unsigned long private)
        { return NULL; }
 static inline bool isolate_movable_page(struct page *page, isolate_mode_t mode)
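With the typedefs now folio-based, the allocation and free callbacks handed to migrate_pages() receive and return folios directly instead of pages. A minimal sketch, not part of the patch, of a new_folio_t callback; the demo_* name is invented, GFP_KERNEL is assumed acceptable for the caller, and NUMA placement (e.g. allocating on folio_nid(src)) is deliberately ignored.

#include <linux/gfp.h>
#include <linux/migrate.h>

/* Sketch only: allocate a destination folio of the same order as the source. */
static struct folio *demo_new_folio(struct folio *src, unsigned long private)
{
        return folio_alloc(GFP_KERNEL | __GFP_NOWARN, folio_order(src));
}

/* Typical call site shape: migrate_pages(&list, demo_new_folio, NULL, 0, mode, reason, NULL); */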
index 27ce770..eef34f6 100644 (file)
@@ -725,7 +725,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 
 #else /* CONFIG_PER_VMA_LOCK */
 
-static inline void vma_init_lock(struct vm_area_struct *vma) {}
 static inline bool vma_start_read(struct vm_area_struct *vma)
                { return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
@@ -866,11 +865,24 @@ static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi)
        return mas_find(&vmi->mas, ULONG_MAX);
 }
 
+static inline
+struct vm_area_struct *vma_iter_next_range(struct vma_iterator *vmi)
+{
+       return mas_next_range(&vmi->mas, ULONG_MAX);
+}
+
+
 static inline struct vm_area_struct *vma_prev(struct vma_iterator *vmi)
 {
        return mas_prev(&vmi->mas, 0);
 }
 
+static inline
+struct vm_area_struct *vma_iter_prev_range(struct vma_iterator *vmi)
+{
+       return mas_prev_range(&vmi->mas, 0);
+}
+
 static inline unsigned long vma_iter_addr(struct vma_iterator *vmi)
 {
        return vmi->mas.index;
@@ -1208,17 +1220,6 @@ enum compound_dtor_id {
 #endif
        NR_COMPOUND_DTORS,
 };
-extern compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS];
-
-static inline void set_compound_page_dtor(struct page *page,
-               enum compound_dtor_id compound_dtor)
-{
-       struct folio *folio = (struct folio *)page;
-
-       VM_BUG_ON_PAGE(compound_dtor >= NR_COMPOUND_DTORS, page);
-       VM_BUG_ON_PAGE(!PageHead(page), page);
-       folio->_folio_dtor = compound_dtor;
-}
 
 static inline void folio_set_compound_dtor(struct folio *folio,
                enum compound_dtor_id compound_dtor)
@@ -1229,16 +1230,6 @@ static inline void folio_set_compound_dtor(struct folio *folio,
 
 void destroy_large_folio(struct folio *folio);
 
-static inline void set_compound_order(struct page *page, unsigned int order)
-{
-       struct folio *folio = (struct folio *)page;
-
-       folio->_folio_order = order;
-#ifdef CONFIG_64BIT
-       folio->_folio_nr_pages = 1U << order;
-#endif
-}
-
 /* Returns the number of bytes in this potentially compound page. */
 static inline unsigned long page_size(struct page *page)
 {
@@ -1910,39 +1901,57 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
        return page_maybe_dma_pinned(page);
 }
 
-/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
+/**
+ * is_zero_page - Query if a page is a zero page
+ * @page: The page to query
+ *
+ * This returns true if @page is one of the permanent zero pages.
+ */
+static inline bool is_zero_page(const struct page *page)
+{
+       return is_zero_pfn(page_to_pfn(page));
+}
+
+/**
+ * is_zero_folio - Query if a folio is a zero page
+ * @folio: The folio to query
+ *
+ * This returns true if @folio is one of the permanent zero pages.
+ */
+static inline bool is_zero_folio(const struct folio *folio)
+{
+       return is_zero_page(&folio->page);
+}
+
+/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin folios */
 #ifdef CONFIG_MIGRATION
-static inline bool is_longterm_pinnable_page(struct page *page)
+static inline bool folio_is_longterm_pinnable(struct folio *folio)
 {
 #ifdef CONFIG_CMA
-       int mt = get_pageblock_migratetype(page);
+       int mt = folio_migratetype(folio);
 
        if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
                return false;
 #endif
-       /* The zero page may always be pinned */
-       if (is_zero_pfn(page_to_pfn(page)))
+       /* The zero page can be "pinned" but gets special handling. */
+       if (is_zero_folio(folio))
                return true;
 
        /* Coherent device memory must always allow eviction. */
-       if (is_device_coherent_page(page))
+       if (folio_is_device_coherent(folio))
                return false;
 
-       /* Otherwise, non-movable zone pages can be pinned. */
-       return !is_zone_movable_page(page);
+       /* Otherwise, non-movable zone folios can be pinned. */
+       return !folio_is_zone_movable(folio);
+
 }
 #else
-static inline bool is_longterm_pinnable_page(struct page *page)
+static inline bool folio_is_longterm_pinnable(struct folio *folio)
 {
        return true;
 }
 #endif
 
-static inline bool folio_is_longterm_pinnable(struct folio *folio)
-{
-       return is_longterm_pinnable_page(&folio->page);
-}
-
 static inline void set_page_zone(struct page *page, enum zone_type zone)
 {
        page->flags &= ~(ZONES_MASK << ZONES_PGSHIFT);
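The conversion above moves the longterm-pin policy onto folios: folio_is_longterm_pinnable() refuses CMA, MIGRATE_ISOLATE, device-coherent and ZONE_MOVABLE folios, while the permanent zero page is reported pinnable because it gets special handling rather than a real pin. A short sketch, not part of the patch, of the intended call pattern; the demo_* name is invented and the caller is assumed to already hold a folio reference.

/* Sketch only: decide whether a folio may remain pinned for the long term. */
static bool demo_may_pin_longterm(struct folio *folio)
{
        /* is_zero_folio() is also checked internally; shown here for clarity. */
        if (is_zero_folio(folio))
                return true;

        return folio_is_longterm_pinnable(folio);
}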
@@ -2353,6 +2362,9 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
        unmap_mapping_range(mapping, holebegin, holelen, 0);
 }
 
+static inline struct vm_area_struct *vma_lookup(struct mm_struct *mm,
+                                               unsigned long addr);
+
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
                void *buf, int len, unsigned int gup_flags);
 extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
@@ -2361,19 +2373,42 @@ extern int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
                              void *buf, int len, unsigned int gup_flags);
 
 long get_user_pages_remote(struct mm_struct *mm,
-                           unsigned long start, unsigned long nr_pages,
-                           unsigned int gup_flags, struct page **pages,
-                           struct vm_area_struct **vmas, int *locked);
+                          unsigned long start, unsigned long nr_pages,
+                          unsigned int gup_flags, struct page **pages,
+                          int *locked);
 long pin_user_pages_remote(struct mm_struct *mm,
                           unsigned long start, unsigned long nr_pages,
                           unsigned int gup_flags, struct page **pages,
-                          struct vm_area_struct **vmas, int *locked);
+                          int *locked);
+
+static inline struct page *get_user_page_vma_remote(struct mm_struct *mm,
+                                                   unsigned long addr,
+                                                   int gup_flags,
+                                                   struct vm_area_struct **vmap)
+{
+       struct page *page;
+       struct vm_area_struct *vma;
+       int got = get_user_pages_remote(mm, addr, 1, gup_flags, &page, NULL);
+
+       if (got < 0)
+               return ERR_PTR(got);
+       if (got == 0)
+               return NULL;
+
+       vma = vma_lookup(mm, addr);
+       if (WARN_ON_ONCE(!vma)) {
+               put_page(page);
+               return ERR_PTR(-EINVAL);
+       }
+
+       *vmap = vma;
+       return page;
+}
+
 long get_user_pages(unsigned long start, unsigned long nr_pages,
-                           unsigned int gup_flags, struct page **pages,
-                           struct vm_area_struct **vmas);
+                   unsigned int gup_flags, struct page **pages);
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
-                   unsigned int gup_flags, struct page **pages,
-                   struct vm_area_struct **vmas);
+                   unsigned int gup_flags, struct page **pages);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
                    struct page **pages, unsigned int gup_flags);
 long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
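Callers that used to receive a vmas array now either drop that argument or, when they really need the VMA for a single page, switch to the get_user_page_vma_remote() helper added above. A sketch, not part of the patch, of the updated calling convention; the demo_* name is invented, FOLL_GET is chosen arbitrarily, and the mmap read lock is taken by the caller as before.

/* Sketch only: fetch one remote page plus its VMA under the mmap read lock. */
static int demo_peek_remote_page(struct mm_struct *mm, unsigned long addr)
{
        struct vm_area_struct *vma;
        struct page *page;
        int ret = 0;

        if (mmap_read_lock_killable(mm))
                return -EINTR;

        page = get_user_page_vma_remote(mm, addr, FOLL_GET, &vma);
        if (IS_ERR(page))
                ret = PTR_ERR(page);
        else if (page)
                put_page(page);         /* drop the reference FOLL_GET took */

        mmap_read_unlock(mm);
        return ret;
}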
@@ -2383,6 +2418,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
                        unsigned int gup_flags, struct page **pages);
 int pin_user_pages_fast(unsigned long start, int nr_pages,
                        unsigned int gup_flags, struct page **pages);
+void folio_add_pin(struct folio *folio);
 
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
@@ -2422,6 +2458,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
                                            MM_CP_UFFD_WP_RESOLVE)
 
+bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
 int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
 static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
 {
@@ -2787,14 +2824,25 @@ static inline void pgtable_pte_page_dtor(struct page *page)
        dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-#define pte_offset_map_lock(mm, pmd, address, ptlp)    \
-({                                                     \
-       spinlock_t *__ptl = pte_lockptr(mm, pmd);       \
-       pte_t *__pte = pte_offset_map(pmd, address);    \
-       *(ptlp) = __ptl;                                \
-       spin_lock(__ptl);                               \
-       __pte;                                          \
-})
+pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
+static inline pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
+{
+       return __pte_offset_map(pmd, addr, NULL);
+}
+
+pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+                       unsigned long addr, spinlock_t **ptlp);
+static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+                       unsigned long addr, spinlock_t **ptlp)
+{
+       pte_t *pte;
+
+       __cond_lock(*ptlp, pte = __pte_offset_map_lock(mm, pmd, addr, ptlp));
+       return pte;
+}
+
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+                       unsigned long addr, spinlock_t **ptlp);
 
 #define pte_unmap_unlock(pte, ptl)     do {            \
        spin_unlock(ptl);                               \
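The macro has become a function that can fail: the __pte_offset_map_lock() path may return NULL when the page table has been freed or the pmd changed underneath, so every caller now needs a NULL check and usually a retry. A sketch, not part of the patch, of the post-rework calling pattern; the demo_* name is invented and -EAGAIN is assumed to be an acceptable retry signal for the caller.

/* Sketch only: map, lock, read and unlock a single PTE with the new API. */
static int demo_touch_one_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return -EAGAIN;         /* pmd changed or table freed: caller retries */

        /* Read the entry through the accessor rather than dereferencing it. */
        if (pte_none(ptep_get(pte))) {
                /* ... handle an empty slot ... */
        }

        pte_unmap_unlock(pte, ptl);
        return 0;
}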
@@ -2915,7 +2963,8 @@ extern unsigned long free_reserved_area(void *start, void *end,
 
 extern void adjust_managed_page_count(struct page *page, long count);
 
-extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
+extern void reserve_bootmem_region(phys_addr_t start,
+                                  phys_addr_t end, int nid);
 
 /* Free the reserved page into the buddy system, so it gets managed. */
 static inline void free_reserved_page(struct page *page)
@@ -2994,12 +3043,6 @@ extern int __meminit early_pfn_to_nid(unsigned long pfn);
 #endif
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_range(unsigned long, int, unsigned long,
-               unsigned long, unsigned long, enum meminit_context,
-               struct vmem_altmap *, int migratetype);
-extern void setup_per_zone_wmarks(void);
-extern void calculate_min_free_kbytes(void);
-extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
 extern void __init mmap_init(void);
 
@@ -3020,11 +3063,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...);
 
 extern void setup_per_cpu_pageset(void);
 
-/* page_alloc.c */
-extern int min_free_kbytes;
-extern int watermark_boost_factor;
-extern int watermark_scale_factor;
-
 /* nommu.c */
 extern atomic_long_t mmap_pages_allocated;
 extern int nommu_shrink_inode_mappings(struct inode *, size_t, size_t);
@@ -3453,13 +3491,12 @@ static inline bool debug_pagealloc_enabled_static(void)
        return static_branch_unlikely(&_debug_pagealloc_enabled);
 }
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
 /*
  * To support DEBUG_PAGEALLOC architecture must ensure that
  * __kernel_map_pages() never fails
  */
 extern void __kernel_map_pages(struct page *page, int numpages, int enable);
-
+#ifdef CONFIG_DEBUG_PAGEALLOC
 static inline void debug_pagealloc_map_pages(struct page *page, int numpages)
 {
        if (debug_pagealloc_enabled_static())
@@ -3471,9 +3508,58 @@ static inline void debug_pagealloc_unmap_pages(struct page *page, int numpages)
        if (debug_pagealloc_enabled_static())
                __kernel_map_pages(page, numpages, 0);
 }
+
+extern unsigned int _debug_guardpage_minorder;
+DECLARE_STATIC_KEY_FALSE(_debug_guardpage_enabled);
+
+static inline unsigned int debug_guardpage_minorder(void)
+{
+       return _debug_guardpage_minorder;
+}
+
+static inline bool debug_guardpage_enabled(void)
+{
+       return static_branch_unlikely(&_debug_guardpage_enabled);
+}
+
+static inline bool page_is_guard(struct page *page)
+{
+       if (!debug_guardpage_enabled())
+               return false;
+
+       return PageGuard(page);
+}
+
+bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order,
+                     int migratetype);
+static inline bool set_page_guard(struct zone *zone, struct page *page,
+                                 unsigned int order, int migratetype)
+{
+       if (!debug_guardpage_enabled())
+               return false;
+       return __set_page_guard(zone, page, order, migratetype);
+}
+
+void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order,
+                       int migratetype);
+static inline void clear_page_guard(struct zone *zone, struct page *page,
+                                   unsigned int order, int migratetype)
+{
+       if (!debug_guardpage_enabled())
+               return;
+       __clear_page_guard(zone, page, order, migratetype);
+}
+
 #else  /* CONFIG_DEBUG_PAGEALLOC */
 static inline void debug_pagealloc_map_pages(struct page *page, int numpages) {}
 static inline void debug_pagealloc_unmap_pages(struct page *page, int numpages) {}
+static inline unsigned int debug_guardpage_minorder(void) { return 0; }
+static inline bool debug_guardpage_enabled(void) { return false; }
+static inline bool page_is_guard(struct page *page) { return false; }
+static inline bool set_page_guard(struct zone *zone, struct page *page,
+                       unsigned int order, int migratetype) { return false; }
+static inline void clear_page_guard(struct zone *zone, struct page *page,
+                               unsigned int order, int migratetype) {}
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
 #ifdef __HAVE_ARCH_GATE_AREA
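The guard-page helpers now sit next to the rest of the DEBUG_PAGEALLOC code and gain out-of-line __set_page_guard()/__clear_page_guard() backends behind static-key checks. A sketch, not part of the patch, of the allocator-side pattern these wrappers are written for; the demo_* name is invented and the free-list bookkeeping is elided.

/* Sketch only: guard a split-off buddy instead of putting it on a free list. */
static void demo_expand_one(struct zone *zone, struct page *page,
                            unsigned int order, int migratetype)
{
        /* Returns false (and does nothing) unless the guardpage static key is enabled. */
        if (set_page_guard(zone, page, order, migratetype))
                return;

        /* ... otherwise link the page into free_area[order] as usual ... */
}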
@@ -3586,6 +3672,10 @@ extern void shake_page(struct page *p);
 extern atomic_long_t num_poisoned_pages __read_mostly;
 extern int soft_offline_page(unsigned long pfn, int flags);
 #ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Sysfs entries for memory failure handling statistics.
+ */
+extern const struct attribute_group memory_failure_attr_group;
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
                                        bool *migratable_cleared);
@@ -3678,11 +3768,6 @@ enum mf_action_page_type {
        MF_MSG_UNKNOWN,
 };
 
-/*
- * Sysfs entries for memory failure handling statistics.
- */
-extern const struct attribute_group memory_failure_attr_group;
-
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
                            unsigned long addr_hint,
@@ -3712,33 +3797,6 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
-extern unsigned int _debug_guardpage_minorder;
-DECLARE_STATIC_KEY_FALSE(_debug_guardpage_enabled);
-
-static inline unsigned int debug_guardpage_minorder(void)
-{
-       return _debug_guardpage_minorder;
-}
-
-static inline bool debug_guardpage_enabled(void)
-{
-       return static_branch_unlikely(&_debug_guardpage_enabled);
-}
-
-static inline bool page_is_guard(struct page *page)
-{
-       if (!debug_guardpage_enabled())
-               return false;
-
-       return PageGuard(page);
-}
-#else
-static inline unsigned int debug_guardpage_minorder(void) { return 0; }
-static inline bool debug_guardpage_enabled(void) { return false; }
-static inline bool page_is_guard(struct page *page) { return false; }
-#endif /* CONFIG_DEBUG_PAGEALLOC */
-
 #if MAX_NUMNODES > 1
 void __init setup_nr_node_ids(void);
 #else
@@ -3816,4 +3874,23 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 }
 #endif
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);
+void accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool range_contains_unaccepted_memory(phys_addr_t start,
+                                                   phys_addr_t end)
+{
+       return false;
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+
+#endif
+
 #endif /* _LINUX_MM_H */
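The two hooks above are the whole interface the core exposes for unaccepted memory (used by confidential-computing guests such as TDX/SNP); with CONFIG_UNACCEPTED_MEMORY off they compile to no-ops. A tiny sketch, not part of the patch, of the expected call pattern; the demo_* name is invented and the caller is assumed to track the physical range it is about to hand out.

/* Sketch only: accept a physical range lazily before first use. */
static void demo_make_range_usable(phys_addr_t start, phys_addr_t end)
{
        if (range_contains_unaccepted_memory(start, end))
                accept_memory(start, end);
}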
index 0e1d239..21d6c72 100644 (file)
@@ -323,12 +323,6 @@ void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
                list_add(&folio->lru, &lruvec->lists[lru]);
 }
 
-static __always_inline void add_page_to_lru_list(struct page *page,
-                               struct lruvec *lruvec)
-{
-       lruvec_add_folio(lruvec, page_folio(page));
-}
-
 static __always_inline
 void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
 {
@@ -357,12 +351,6 @@ void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
                        -folio_nr_pages(folio));
 }
 
-static __always_inline void del_page_from_lru_list(struct page *page,
-                               struct lruvec *lruvec)
-{
-       lruvec_del_folio(lruvec, page_folio(page));
-}
-
 #ifdef CONFIG_ANON_VMA_NAME
 /*
  * mmap_lock should be read-locked when calling anon_vma_name(). Caller should
@@ -555,7 +543,7 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
        bool arm_uffd_pte = false;
 
        /* The current status of the pte should be "cleared" before calling */
-       WARN_ON_ONCE(!pte_none(*pte));
+       WARN_ON_ONCE(!pte_none(ptep_get(pte)));
 
        /*
         * NOTE: userfaultfd_wp_unpopulated() doesn't need this whole
index 306a3d1..de10fc7 100644 (file)
@@ -583,6 +583,21 @@ struct mm_cid {
 struct kioctx_table;
 struct mm_struct {
        struct {
+               /*
+                * Fields which are often written to are placed in a separate
+                * cache line.
+                */
+               struct {
+                       /**
+                        * @mm_count: The number of references to &struct
+                        * mm_struct (@mm_users count as 1).
+                        *
+                        * Use mmgrab()/mmdrop() to modify. When this drops to
+                        * 0, the &struct mm_struct is freed.
+                        */
+                       atomic_t mm_count;
+               } ____cacheline_aligned_in_smp;
+
                struct maple_tree mm_mt;
 #ifdef CONFIG_MMU
                unsigned long (*get_unmapped_area) (struct file *filp,
@@ -620,14 +635,6 @@ struct mm_struct {
                 */
                atomic_t mm_users;
 
-               /**
-                * @mm_count: The number of references to &struct mm_struct
-                * (@mm_users count as 1).
-                *
-                * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
-                * &struct mm_struct is freed.
-                */
-               atomic_t mm_count;
 #ifdef CONFIG_SCHED_MM_CID
                /**
                 * @pcpu_cid: Per-cpu current cid.
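Only the field placement changes here: mm_count keeps its mmgrab()/mmdrop() contract and simply moves into a dedicated cache line so the frequently-written reference count stops bouncing the line shared with mostly-read fields. A sketch, not part of the patch, of the unchanged usage; the demo_* name is invented.

#include <linux/sched/mm.h>

/* Sketch only: pin the mm_struct itself (not its address space) across a callback. */
static void demo_borrow_mm(struct mm_struct *mm)
{
        mmgrab(mm);             /* increments mm->mm_count */
        /* ... safe to dereference mm, but not to touch its user mappings ... */
        mmdrop(mm);             /* may free the mm_struct when the count reaches zero */
}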
index c726ea7..daa2f40 100644 (file)
@@ -294,6 +294,7 @@ struct mmc_card {
 #define MMC_QUIRK_TRIM_BROKEN  (1<<12)         /* Skip trim */
 #define MMC_QUIRK_BROKEN_HPI   (1<<13)         /* Disable broken HPI support */
 #define MMC_QUIRK_BROKEN_SD_DISCARD    (1<<14) /* Disable broken SD discard support */
+#define MMC_QUIRK_BROKEN_SD_CACHE      (1<<15) /* Disable broken SD cache support */
 
        bool                    reenable_cmdq;  /* Re-enable Command Queue */
 
index b8728d1..7c3e7b0 100644 (file)
@@ -8,10 +8,12 @@
 struct page;
 struct vm_area_struct;
 struct mm_struct;
+struct vma_iterator;
 
 void dump_page(struct page *page, const char *reason);
 void dump_vma(const struct vm_area_struct *vma);
 void dump_mm(const struct mm_struct *mm);
+void vma_iter_dump_tree(const struct vma_iterator *vmi);
 
 #ifdef CONFIG_DEBUG_VM
 #define VM_BUG_ON(cond) BUG_ON(cond)
@@ -74,6 +76,17 @@ void dump_mm(const struct mm_struct *mm);
        }                                                               \
        unlikely(__ret_warn_once);                                      \
 })
+#define VM_WARN_ON_ONCE_MM(cond, mm)           ({                      \
+       static bool __section(".data.once") __warned;                   \
+       int __ret_warn_once = !!(cond);                                 \
+                                                                       \
+       if (unlikely(__ret_warn_once && !__warned)) {                   \
+               dump_mm(mm);                                            \
+               __warned = true;                                        \
+               WARN_ON(1);                                             \
+       }                                                               \
+       unlikely(__ret_warn_once);                                      \
+})
 
 #define VM_WARN_ON(cond) (void)WARN_ON(cond)
 #define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond)
@@ -90,6 +103,7 @@ void dump_mm(const struct mm_struct *mm);
 #define VM_WARN_ON_ONCE_PAGE(cond, page)  BUILD_BUG_ON_INVALID(cond)
 #define VM_WARN_ON_FOLIO(cond, folio)  BUILD_BUG_ON_INVALID(cond)
 #define VM_WARN_ON_ONCE_FOLIO(cond, folio)  BUILD_BUG_ON_INVALID(cond)
+#define VM_WARN_ON_ONCE_MM(cond, mm)  BUILD_BUG_ON_INVALID(cond)
 #define VM_WARN_ONCE(cond, format...) BUILD_BUG_ON_INVALID(cond)
 #define VM_WARN(cond, format...) BUILD_BUG_ON_INVALID(cond)
 #endif
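A hypothetical usage sketch (not taken from the patch): the new VM_WARN_ON_ONCE_MM() pairs a one-shot warning with a dump_mm() of the offending mm, for example on a suspicious counter value:

    /* Warn once and dump the mm if the page-table byte counter underflows. */
    VM_WARN_ON_ONCE_MM(atomic_long_read(&mm->pgtables_bytes) < 0, mm);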
index a4889c9..5e50b78 100644 (file)
@@ -105,6 +105,9 @@ extern int page_group_by_mobility_disabled;
 #define get_pageblock_migratetype(page)                                        \
        get_pfnblock_flags_mask(page, page_to_pfn(page), MIGRATETYPE_MASK)
 
+#define folio_migratetype(folio)                               \
+       get_pfnblock_flags_mask(&folio->page, folio_pfn(folio),         \
+                       MIGRATETYPE_MASK)
 struct free_area {
        struct list_head        free_list[MIGRATE_TYPES];
        unsigned long           nr_free;
@@ -143,6 +146,9 @@ enum zone_stat_item {
        NR_ZSPAGES,             /* allocated in zsmalloc */
 #endif
        NR_FREE_CMA_PAGES,
+#ifdef CONFIG_UNACCEPTED_MEMORY
+       NR_UNACCEPTED,
+#endif
        NR_VM_ZONE_STAT_ITEMS };
 
 enum node_stat_item {
@@ -290,9 +296,21 @@ static inline bool is_active_lru(enum lru_list lru)
 #define ANON_AND_FILE 2
 
 enum lruvec_flags {
-       LRUVEC_CONGESTED,               /* lruvec has many dirty pages
-                                        * backed by a congested BDI
-                                        */
+       /*
+        * An lruvec has many dirty pages backed by a congested BDI:
+        * 1. LRUVEC_CGROUP_CONGESTED is set by cgroup-level reclaim.
+        *    It can be cleared by cgroup reclaim or kswapd.
+        * 2. LRUVEC_NODE_CONGESTED is set by kswapd node-level reclaim.
+        *    It can only be cleared by kswapd.
+        *
+        * Essentially, kswapd can unthrottle an lruvec throttled by cgroup
+        * reclaim, but not vice versa. This only applies to the root cgroup.
+        * The goal is to prevent cgroup reclaim on the root cgroup (e.g.
+        * memory.reclaim) from unthrottling an unbalanced node (that was
+        * throttled by kswapd).

+        */
+       LRUVEC_CGROUP_CONGESTED,
+       LRUVEC_NODE_CONGESTED,
 };
 
 #endif /* !__GENERATING_BOUNDS_H */
@@ -534,7 +552,7 @@ void lru_gen_exit_memcg(struct mem_cgroup *memcg);
 void lru_gen_online_memcg(struct mem_cgroup *memcg);
 void lru_gen_offline_memcg(struct mem_cgroup *memcg);
 void lru_gen_release_memcg(struct mem_cgroup *memcg);
-void lru_gen_soft_reclaim(struct lruvec *lruvec);
+void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid);
 
 #else /* !CONFIG_MEMCG */
 
@@ -585,7 +603,7 @@ static inline void lru_gen_release_memcg(struct mem_cgroup *memcg)
 {
 }
 
-static inline void lru_gen_soft_reclaim(struct lruvec *lruvec)
+static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
 {
 }
 
@@ -910,6 +928,11 @@ struct zone {
        /* free areas of different sizes */
        struct free_area        free_area[MAX_ORDER + 1];
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+       /* Pages to be accepted. All pages on the list are MAX_ORDER */
+       struct list_head        unaccepted_pages;
+#endif
+
        /* zone flags, see below */
        unsigned long           flags;
 
@@ -1116,6 +1139,11 @@ static inline bool is_zone_movable_page(const struct page *page)
 {
        return page_zonenum(page) == ZONE_MOVABLE;
 }
+
+static inline bool folio_is_zone_movable(const struct folio *folio)
+{
+       return folio_zonenum(folio) == ZONE_MOVABLE;
+}
 #endif
 
 /*
@@ -1512,27 +1540,6 @@ static inline bool has_managed_dma(void)
 }
 #endif
 
-/* These two functions are used to setup the per zone pages min values */
-struct ctl_table;
-
-int min_free_kbytes_sysctl_handler(struct ctl_table *, int, void *, size_t *,
-               loff_t *);
-int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void *,
-               size_t *, loff_t *);
-extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
-int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void *,
-               size_t *, loff_t *);
-int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *, int,
-               void *, size_t *, loff_t *);
-int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
-               void *, size_t *, loff_t *);
-int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
-               void *, size_t *, loff_t *);
-int numa_zonelist_order_handler(struct ctl_table *, int,
-               void *, size_t *, loff_t *);
-extern int percpu_pagelist_high_fraction;
-extern char numa_zonelist_order[];
-#define NUMA_ZONELIST_ORDER_LEN        16
 
 #ifndef CONFIG_NUMA
 
index 9e56763..a98e188 100644 (file)
@@ -968,15 +968,6 @@ static inline int lookup_module_symbol_name(unsigned long addr, char *symname)
        return -ERANGE;
 }
 
-static inline int lookup_module_symbol_attrs(unsigned long addr,
-                                            unsigned long *size,
-                                            unsigned long *offset,
-                                            char *modname,
-                                            char *name)
-{
-       return -ERANGE;
-}
-
 static inline int module_get_kallsym(unsigned int symnum, unsigned long *value,
                                     char *type, char *name,
                                     char *module_name, int *exported)
index 1ea326c..4f40b40 100644 (file)
@@ -107,7 +107,6 @@ extern struct vfsmount *vfs_submount(const struct dentry *mountpoint,
 extern void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list);
 extern void mark_mounts_for_expiry(struct list_head *mounts);
 
-extern dev_t name_to_dev_t(const char *name);
 extern bool path_is_mountpoint(const struct path *path);
 
 extern bool our_mnt(struct vfsmount *mnt);
@@ -124,4 +123,6 @@ extern int iterate_mounts(int (*)(struct vfsmount *, void *), void *,
                          struct vfsmount *);
 extern void kern_unmount_array(struct vfsmount *mnt[], unsigned int num);
 
+extern int cifs_root_data(char **dev, char **opts);
+
 #endif /* _LINUX_MOUNT_H */
index 15cc9b9..6e47143 100644 (file)
@@ -34,7 +34,7 @@ struct mtd_blktrans_dev {
        struct blk_mq_tag_set *tag_set;
        spinlock_t queue_lock;
        void *priv;
-       fmode_t file_mode;
+       bool writable;
 };
 
 struct mtd_blktrans_ops {
index 048c0b9..e3e6a64 100644 (file)
@@ -7,19 +7,19 @@
 
 #include <linux/sched.h>
 #include <asm/irq.h>
-#if defined(CONFIG_HAVE_NMI_WATCHDOG)
+
+/* Arch specific watchdogs might need to share extra watchdog-related APIs. */
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_ARCH) || defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
 #include <asm/nmi.h>
 #endif
 
 #ifdef CONFIG_LOCKUP_DETECTOR
 void lockup_detector_init(void);
+void lockup_detector_retry_init(void);
 void lockup_detector_soft_poweroff(void);
 void lockup_detector_cleanup(void);
-bool is_hardlockup(void);
 
 extern int watchdog_user_enabled;
-extern int nmi_watchdog_user_enabled;
-extern int soft_watchdog_user_enabled;
 extern int watchdog_thresh;
 extern unsigned long watchdog_enabled;
 
@@ -35,6 +35,7 @@ extern int sysctl_hardlockup_all_cpu_backtrace;
 
 #else /* CONFIG_LOCKUP_DETECTOR */
 static inline void lockup_detector_init(void) { }
+static inline void lockup_detector_retry_init(void) { }
 static inline void lockup_detector_soft_poweroff(void) { }
 static inline void lockup_detector_cleanup(void) { }
 #endif /* !CONFIG_LOCKUP_DETECTOR */
@@ -69,17 +70,17 @@ static inline void reset_hung_task_detector(void) { }
  * 'watchdog_enabled' variable. Each lockup detector has its dedicated bit -
  * bit 0 for the hard lockup detector and bit 1 for the soft lockup detector.
  *
- * 'watchdog_user_enabled', 'nmi_watchdog_user_enabled' and
- * 'soft_watchdog_user_enabled' are variables that are only used as an
+ * 'watchdog_user_enabled', 'watchdog_hardlockup_user_enabled' and
+ * 'watchdog_softlockup_user_enabled' are variables that are only used as an
  * 'interface' between the parameters in /proc/sys/kernel and the internal
  * state bits in 'watchdog_enabled'. The 'watchdog_thresh' variable is
  * handled differently because its value is not boolean, and the lockup
  * detectors are 'suspended' while 'watchdog_thresh' is equal zero.
  */
-#define NMI_WATCHDOG_ENABLED_BIT   0
-#define SOFT_WATCHDOG_ENABLED_BIT  1
-#define NMI_WATCHDOG_ENABLED      (1 << NMI_WATCHDOG_ENABLED_BIT)
-#define SOFT_WATCHDOG_ENABLED     (1 << SOFT_WATCHDOG_ENABLED_BIT)
+#define WATCHDOG_HARDLOCKUP_ENABLED_BIT  0
+#define WATCHDOG_SOFTOCKUP_ENABLED_BIT   1
+#define WATCHDOG_HARDLOCKUP_ENABLED     (1 << WATCHDOG_HARDLOCKUP_ENABLED_BIT)
+#define WATCHDOG_SOFTOCKUP_ENABLED      (1 << WATCHDOG_SOFTOCKUP_ENABLED_BIT)
 
 #if defined(CONFIG_HARDLOCKUP_DETECTOR)
 extern void hardlockup_detector_disable(void);
@@ -88,52 +89,63 @@ extern unsigned int hardlockup_panic;
 static inline void hardlockup_detector_disable(void) {}
 #endif
 
-#if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
-# define NMI_WATCHDOG_SYSCTL_PERM      0644
+/* Sparc64 has a special implementation that is always enabled. */
+#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
+void arch_touch_nmi_watchdog(void);
 #else
-# define NMI_WATCHDOG_SYSCTL_PERM      0444
+static inline void arch_touch_nmi_watchdog(void) { }
+#endif
+
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER)
+void watchdog_hardlockup_touch_cpu(unsigned int cpu);
+void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs);
 #endif
 
 #if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
-extern void arch_touch_nmi_watchdog(void);
 extern void hardlockup_detector_perf_stop(void);
 extern void hardlockup_detector_perf_restart(void);
-extern void hardlockup_detector_perf_disable(void);
-extern void hardlockup_detector_perf_enable(void);
 extern void hardlockup_detector_perf_cleanup(void);
-extern int hardlockup_detector_perf_init(void);
 #else
 static inline void hardlockup_detector_perf_stop(void) { }
 static inline void hardlockup_detector_perf_restart(void) { }
-static inline void hardlockup_detector_perf_disable(void) { }
-static inline void hardlockup_detector_perf_enable(void) { }
 static inline void hardlockup_detector_perf_cleanup(void) { }
-# if !defined(CONFIG_HAVE_NMI_WATCHDOG)
-static inline int hardlockup_detector_perf_init(void) { return -ENODEV; }
-static inline void arch_touch_nmi_watchdog(void) {}
-# else
-static inline int hardlockup_detector_perf_init(void) { return 0; }
-# endif
 #endif
 
-void watchdog_nmi_stop(void);
-void watchdog_nmi_start(void);
-int watchdog_nmi_probe(void);
-int watchdog_nmi_enable(unsigned int cpu);
-void watchdog_nmi_disable(unsigned int cpu);
+void watchdog_hardlockup_stop(void);
+void watchdog_hardlockup_start(void);
+int watchdog_hardlockup_probe(void);
+void watchdog_hardlockup_enable(unsigned int cpu);
+void watchdog_hardlockup_disable(unsigned int cpu);
 
 void lockup_detector_reconfigure(void);
 
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_BUDDY
+void watchdog_buddy_check_hardlockup(int hrtimer_interrupts);
+#else
+static inline void watchdog_buddy_check_hardlockup(int hrtimer_interrupts) {}
+#endif
+
 /**
- * touch_nmi_watchdog - restart NMI watchdog timeout.
+ * touch_nmi_watchdog - manually reset the hardlockup watchdog timeout.
  *
- * If the architecture supports the NMI watchdog, touch_nmi_watchdog()
- * may be used to reset the timeout - for code which intentionally
- * disables interrupts for a long time. This call is stateless.
+ * If we support detecting hardlockups, touch_nmi_watchdog() may be
+ * used to pet the watchdog (reset the timeout) - for code which
+ * intentionally disables interrupts for a long time. This call is stateless.
+ *
+ * Though this function has "nmi" in the name, the hardlockup watchdog might
+ * not be backed by NMIs. This function will likely be renamed to
+ * touch_hardlockup_watchdog() in the future.
  */
 static inline void touch_nmi_watchdog(void)
 {
+       /*
+        * Pass on to the hardlockup detector selected via CONFIG_. Note that
+        * the hardlockup detector may not be arch-specific nor using NMIs
+        * and the arch_touch_nmi_watchdog() function will likely be renamed
+        * in the future.
+        */
        arch_touch_nmi_watchdog();
+
        touch_softlockup_watchdog();
 }
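As an illustrative sketch (not from the patch), touch_nmi_watchdog() is meant for code that legitimately keeps a CPU busy with interrupts off; device_ready() below is a made-up helper:

    /* Long polling loop that must not trip the hardlockup/softlockup detectors. */
    while (!device_ready(dev)) {
            touch_nmi_watchdog();   /* pets both the hardlockup and softlockup watchdogs */
            udelay(100);
    }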
 
@@ -194,10 +206,11 @@ static inline bool trigger_single_cpu_backtrace(int cpu)
 
 #ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
 u64 hw_nmi_get_sample_period(int watchdog_thresh);
+bool arch_perf_nmi_is_available(void);
 #endif
 
 #if defined(CONFIG_HARDLOCKUP_CHECK_TIMESTAMP) && \
-    defined(CONFIG_HARDLOCKUP_DETECTOR)
+    defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
 void watchdog_update_hrtimer_threshold(u64 period);
 #else
 static inline void watchdog_update_hrtimer_threshold(u64 period) { }
index 392fc6c..bdcd85e 100644 (file)
@@ -93,6 +93,7 @@ extern struct bus_type nubus_bus_type;
 
 /* Generic NuBus interface functions, modelled after the PCI interface */
 #ifdef CONFIG_PROC_FS
+extern bool nubus_populate_procfs;
 void nubus_proc_init(void);
 struct proc_dir_entry *nubus_proc_add_board(struct nubus_board *board);
 struct proc_dir_entry *nubus_proc_add_rsrc_dir(struct proc_dir_entry *procdir,
index fa092b9..4109f1b 100644 (file)
@@ -185,7 +185,6 @@ enum nvmefc_fcp_datadir {
  * @first_sgl: memory for 1st scatter/gather list segment for payload data
  * @sg_cnt:    number of elements in the scatter/gather list
  * @io_dir:    direction of the FCP request (see NVMEFC_FCP_xxx)
- * @sqid:      The nvme SQID the command is being issued on
  * @done:      The callback routine the LLDD is to invoke upon completion of
  *             the FCP operation. req argument is the pointer to the original
  *             FCP IO operation.
@@ -194,12 +193,13 @@ enum nvmefc_fcp_datadir {
  *             while processing the operation. The length of the buffer
  *             corresponds to the fcprqst_priv_sz value specified in the
  *             nvme_fc_port_template supplied by the LLDD.
+ * @sqid:      The nvme SQID the command is being issued on
  *
  * Values set by the LLDD indicating completion status of the FCP operation.
  * Must be set prior to calling the done() callback.
+ * @rcv_rsplen: length, in bytes, of the FCP RSP IU received.
  * @transferred_length: amount of payload data, in bytes, that were
  *             transferred. Should equal payload_length on success.
- * @rcv_rsplen: length, in bytes, of the FCP RSP IU received.
  * @status:    Completion status of the FCP operation. must be 0 upon success,
  *             negative errno value upon failure (ex: -EIO). Note: this is
  *             NOT a reflection of the NVME CQE completion status. Only the
@@ -219,14 +219,14 @@ struct nvmefc_fcp_req {
        int                     sg_cnt;
        enum nvmefc_fcp_datadir io_dir;
 
-       __le16                  sqid;
-
        void (*done)(struct nvmefc_fcp_req *req);
 
        void                    *private;
 
-       u32                     transferred_length;
+       __le16                  sqid;
+
        u16                     rcv_rsplen;
+       u32                     transferred_length;
        u32                     status;
 } __aligned(sizeof(u64));      /* alignment for other things alloc'd with */
 
index c460236..3c2891d 100644 (file)
@@ -56,6 +56,8 @@ extern int olpc_ec_sci_query(u16 *sci_value);
 
 extern bool olpc_ec_wakeup_available(void);
 
+asmlinkage int xo1_do_sleep(u8 sleep_state);
+
 #else
 
 static inline int olpc_ec_cmd(u8 cmd, u8 *inbuf, size_t inlen, u8 *outbuf,
index 0e33b5c..f9b6031 100644 (file)
@@ -283,7 +283,7 @@ static inline size_t __must_check size_sub(size_t minuend, size_t subtrahend)
  * @member: Name of the array member.
  * @count: Number of elements in the array.
  *
- * Calculates size of memory needed for structure @p followed by an
+ * Calculates size of memory needed for structure of @p followed by an
  * array of @count number of @member elements.
  *
  * Return: number of bytes needed or SIZE_MAX on overflow.
@@ -293,4 +293,20 @@ static inline size_t __must_check size_sub(size_t minuend, size_t subtrahend)
                sizeof(*(p)) + flex_array_size(p, member, count),       \
                size_add(sizeof(*(p)), flex_array_size(p, member, count)))
 
+/**
+ * struct_size_t() - Calculate size of structure with trailing flexible array
+ * @type: structure type name.
+ * @member: Name of the array member.
+ * @count: Number of elements in the array.
+ *
+ * Calculates size of memory needed for structure @type followed by an
+ * array of @count number of @member elements. Prefer using struct_size()
+ * when possible instead, to keep calculations associated with a specific
+ * instance variable of type @type.
+ *
+ * Return: number of bytes needed or SIZE_MAX on overflow.
+ */
+#define struct_size_t(type, member, count)                                     \
+       struct_size((type *)NULL, member, count)
+
 #endif /* __LINUX_OVERFLOW_H */
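A minimal sketch of the new helper (struct report is an invented example type): struct_size_t() computes the allocation size from the type alone, whereas struct_size() takes an instance pointer:

    struct report {
            u32 nr;
            u32 entries[];          /* trailing flexible array */
    };

    /* Size of a struct report followed by 16 entries, with overflow checking. */
    size_t bytes = struct_size_t(struct report, entries, 16);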
index 5456b7b..4ac3439 100644 (file)
@@ -37,27 +37,12 @@ void set_pageblock_migratetype(struct page *page, int migratetype);
 int move_freepages_block(struct zone *zone, struct page *page,
                                int migratetype, int *num_movable);
 
-/*
- * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
- */
-int
-start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-                        int migratetype, int flags, gfp_t gfp_flags);
+int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
+                            int migratetype, int flags, gfp_t gfp_flags);
 
-/*
- * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
- * target range is [start_pfn, end_pfn)
- */
-void
-undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-                       int migratetype);
+void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
+                            int migratetype);
 
-/*
- * Test all pages in [start_pfn, end_pfn) are isolated or not.
- */
 int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
                        int isol_flags);
-
-struct page *alloc_migrate_target(struct page *page, unsigned long private);
-
 #endif
index a56308a..716953e 100644 (file)
@@ -30,6 +30,9 @@ static inline void invalidate_remote_inode(struct inode *inode)
 int invalidate_inode_pages2(struct address_space *mapping);
 int invalidate_inode_pages2_range(struct address_space *mapping,
                pgoff_t start, pgoff_t end);
+int kiocb_invalidate_pages(struct kiocb *iocb, size_t count);
+void kiocb_invalidate_post_direct_write(struct kiocb *iocb, size_t count);
+
 int write_inode_now(struct inode *, int sync);
 int filemap_fdatawrite(struct address_space *);
 int filemap_flush(struct address_space *);
@@ -54,6 +57,7 @@ int filemap_check_errors(struct address_space *mapping);
 void __filemap_set_wb_err(struct address_space *mapping, int err);
 int filemap_fdatawrite_wbc(struct address_space *mapping,
                           struct writeback_control *wbc);
+int kiocb_write_and_wait(struct kiocb *iocb, size_t count);
 
 static inline int filemap_write_and_wait(struct address_space *mapping)
 {
@@ -1078,8 +1082,6 @@ int filemap_migrate_folio(struct address_space *mapping, struct folio *dst,
 #else
 #define filemap_migrate_folio NULL
 #endif
-void page_endio(struct page *page, bool is_write, int err);
-
 void folio_end_private_2(struct folio *folio);
 void folio_wait_private_2(struct folio *folio);
 int folio_wait_private_2_killable(struct folio *folio);
index f582f72..87cc678 100644 (file)
@@ -3,65 +3,18 @@
  * include/linux/pagevec.h
  *
  * In many places it is efficient to batch an operation up against multiple
- * pages.  A pagevec is a multipage container which is used for that.
+ * folios.  A folio_batch is a container which is used for that.
  */
 
 #ifndef _LINUX_PAGEVEC_H
 #define _LINUX_PAGEVEC_H
 
-#include <linux/xarray.h>
+#include <linux/types.h>
 
-/* 15 pointers + header align the pagevec structure to a power of two */
+/* 15 pointers + header align the folio_batch structure to a power of two */
 #define PAGEVEC_SIZE   15
 
-struct page;
 struct folio;
-struct address_space;
-
-/* Layout must match folio_batch */
-struct pagevec {
-       unsigned char nr;
-       bool percpu_pvec_drained;
-       struct page *pages[PAGEVEC_SIZE];
-};
-
-void __pagevec_release(struct pagevec *pvec);
-
-static inline void pagevec_init(struct pagevec *pvec)
-{
-       pvec->nr = 0;
-       pvec->percpu_pvec_drained = false;
-}
-
-static inline void pagevec_reinit(struct pagevec *pvec)
-{
-       pvec->nr = 0;
-}
-
-static inline unsigned pagevec_count(struct pagevec *pvec)
-{
-       return pvec->nr;
-}
-
-static inline unsigned pagevec_space(struct pagevec *pvec)
-{
-       return PAGEVEC_SIZE - pvec->nr;
-}
-
-/*
- * Add a page to a pagevec.  Returns the number of slots still available.
- */
-static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
-{
-       pvec->pages[pvec->nr++] = page;
-       return pagevec_space(pvec);
-}
-
-static inline void pagevec_release(struct pagevec *pvec)
-{
-       if (pagevec_count(pvec))
-               __pagevec_release(pvec);
-}
 
 /**
  * struct folio_batch - A collection of folios.
@@ -78,11 +31,6 @@ struct folio_batch {
        struct folio *folios[PAGEVEC_SIZE];
 };
 
-/* Layout must match pagevec */
-static_assert(sizeof(struct pagevec) == sizeof(struct folio_batch));
-static_assert(offsetof(struct pagevec, pages) ==
-               offsetof(struct folio_batch, folios));
-
 /**
  * folio_batch_init() - Initialise a batch of folios
  * @fbatch: The folio batch.
@@ -105,7 +53,7 @@ static inline unsigned int folio_batch_count(struct folio_batch *fbatch)
        return fbatch->nr;
 }
 
-static inline unsigned int fbatch_space(struct folio_batch *fbatch)
+static inline unsigned int folio_batch_space(struct folio_batch *fbatch)
 {
        return PAGEVEC_SIZE - fbatch->nr;
 }
@@ -124,12 +72,15 @@ static inline unsigned folio_batch_add(struct folio_batch *fbatch,
                struct folio *folio)
 {
        fbatch->folios[fbatch->nr++] = folio;
-       return fbatch_space(fbatch);
+       return folio_batch_space(fbatch);
 }
 
+void __folio_batch_release(struct folio_batch *pvec);
+
 static inline void folio_batch_release(struct folio_batch *fbatch)
 {
-       pagevec_release((struct pagevec *)fbatch);
+       if (folio_batch_count(fbatch))
+               __folio_batch_release(fbatch);
 }
 
 void folio_batch_remove_exceptionals(struct folio_batch *fbatch);
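A rough usage sketch of the folio_batch API that replaces the pagevec here; next_folio() stands in for whatever produces folios in a real caller:

    struct folio_batch fbatch;
    struct folio *folio;

    folio_batch_init(&fbatch);
    while ((folio = next_folio()) != NULL) {
            /* folio_batch_add() returns the slots still free; 0 means the batch is full. */
            if (!folio_batch_add(&fbatch, folio))
                    folio_batch_release(&fbatch);   /* drop references and reset the batch */
    }
    folio_batch_release(&fbatch);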
index 979b776..6717b15 100644 (file)
@@ -32,6 +32,9 @@ extern int sysctl_panic_on_stackoverflow;
 
 extern bool crash_kexec_post_notifiers;
 
+extern void __stack_chk_fail(void);
+void abort(void);
+
 /*
  * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
  * holds a CPU number which is executing panic() currently. A value of
index a0bc9e0..243c82d 100644 (file)
@@ -180,6 +180,8 @@ struct ieee1284_info {
        struct semaphore irq;
 };
 
+#define PARPORT_NAME_MAX_LEN 15
+
 /* A parallel port */
 struct parport {
        unsigned long base;     /* base address */
index 95f33da..a99b1fc 100644 (file)
 #define PCI_DEVICE_ID_AMD_19H_M60H_DF_F3 0x14e3
 #define PCI_DEVICE_ID_AMD_19H_M70H_DF_F3 0x14f3
 #define PCI_DEVICE_ID_AMD_19H_M78H_DF_F3 0x12fb
+#define PCI_DEVICE_ID_AMD_MI200_DF_F3  0x14d3
 #define PCI_DEVICE_ID_AMD_CNB17H_F3    0x1703
 #define PCI_DEVICE_ID_AMD_LANCE                0x2000
 #define PCI_DEVICE_ID_AMD_LANCE_HOME   0x2001
index e60727b..ec35731 100644 (file)
@@ -343,31 +343,19 @@ static __always_inline void __this_cpu_preempt_check(const char *op) { }
        pscr2_ret__;                                                    \
 })
 
-/*
- * Special handling for cmpxchg_double.  cmpxchg_double is passed two
- * percpu variables.  The first has to be aligned to a double word
- * boundary and the second has to follow directly thereafter.
- * We enforce this on all architectures even if they don't support
- * a double cmpxchg instruction, since it's a cheap requirement, and it
- * avoids breaking the requirement for architectures with the instruction.
- */
-#define __pcpu_double_call_return_bool(stem, pcp1, pcp2, ...)          \
+#define __pcpu_size_call_return2bool(stem, variable, ...)              \
 ({                                                                     \
-       bool pdcrb_ret__;                                               \
-       __verify_pcpu_ptr(&(pcp1));                                     \
-       BUILD_BUG_ON(sizeof(pcp1) != sizeof(pcp2));                     \
-       VM_BUG_ON((unsigned long)(&(pcp1)) % (2 * sizeof(pcp1)));       \
-       VM_BUG_ON((unsigned long)(&(pcp2)) !=                           \
-                 (unsigned long)(&(pcp1)) + sizeof(pcp1));             \
-       switch(sizeof(pcp1)) {                                          \
-       case 1: pdcrb_ret__ = stem##1(pcp1, pcp2, __VA_ARGS__); break;  \
-       case 2: pdcrb_ret__ = stem##2(pcp1, pcp2, __VA_ARGS__); break;  \
-       case 4: pdcrb_ret__ = stem##4(pcp1, pcp2, __VA_ARGS__); break;  \
-       case 8: pdcrb_ret__ = stem##8(pcp1, pcp2, __VA_ARGS__); break;  \
+       bool pscr2_ret__;                                               \
+       __verify_pcpu_ptr(&(variable));                                 \
+       switch(sizeof(variable)) {                                      \
+       case 1: pscr2_ret__ = stem##1(variable, __VA_ARGS__); break;    \
+       case 2: pscr2_ret__ = stem##2(variable, __VA_ARGS__); break;    \
+       case 4: pscr2_ret__ = stem##4(variable, __VA_ARGS__); break;    \
+       case 8: pscr2_ret__ = stem##8(variable, __VA_ARGS__); break;    \
        default:                                                        \
                __bad_size_call_parameter(); break;                     \
        }                                                               \
-       pdcrb_ret__;                                                    \
+       pscr2_ret__;                                                    \
 })
 
 #define __pcpu_size_call(stem, variable, ...)                          \
@@ -426,9 +414,8 @@ do {                                                                        \
 #define raw_cpu_xchg(pcp, nval)                __pcpu_size_call_return2(raw_cpu_xchg_, pcp, nval)
 #define raw_cpu_cmpxchg(pcp, oval, nval) \
        __pcpu_size_call_return2(raw_cpu_cmpxchg_, pcp, oval, nval)
-#define raw_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-       __pcpu_double_call_return_bool(raw_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)
-
+#define raw_cpu_try_cmpxchg(pcp, ovalp, nval) \
+       __pcpu_size_call_return2bool(raw_cpu_try_cmpxchg_, pcp, ovalp, nval)
 #define raw_cpu_sub(pcp, val)          raw_cpu_add(pcp, -(val))
 #define raw_cpu_inc(pcp)               raw_cpu_add(pcp, 1)
 #define raw_cpu_dec(pcp)               raw_cpu_sub(pcp, 1)
@@ -488,11 +475,6 @@ do {                                                                       \
        raw_cpu_cmpxchg(pcp, oval, nval);                               \
 })
 
-#define __this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-({     __this_cpu_preempt_check("cmpxchg_double");                     \
-       raw_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2); \
-})
-
 #define __this_cpu_sub(pcp, val)       __this_cpu_add(pcp, -(typeof(pcp))(val))
 #define __this_cpu_inc(pcp)            __this_cpu_add(pcp, 1)
 #define __this_cpu_dec(pcp)            __this_cpu_sub(pcp, 1)
@@ -513,9 +495,8 @@ do {                                                                        \
 #define this_cpu_xchg(pcp, nval)       __pcpu_size_call_return2(this_cpu_xchg_, pcp, nval)
 #define this_cpu_cmpxchg(pcp, oval, nval) \
        __pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
-#define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
-       __pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)
-
+#define this_cpu_try_cmpxchg(pcp, ovalp, nval) \
+       __pcpu_size_call_return2bool(this_cpu_try_cmpxchg_, pcp, ovalp, nval)
 #define this_cpu_sub(pcp, val)         this_cpu_add(pcp, -(typeof(pcp))(val))
 #define this_cpu_inc(pcp)              this_cpu_add(pcp, 1)
 #define this_cpu_dec(pcp)              this_cpu_sub(pcp, 1)
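As a sketch (not part of the patch) of the try_cmpxchg form that replaces the removed cmpxchg_double helpers, example_counter is an invented per-cpu variable:

    DEFINE_PER_CPU(unsigned long, example_counter);

    static void bump_counter(void)
    {
            unsigned long old = this_cpu_read(example_counter);

            /* On failure, 'old' is refreshed with the current value, so just retry. */
            while (!this_cpu_try_cmpxchg(example_counter, &old, old + 1))
                    ;
    }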
index 1338ea2..42125cf 100644 (file)
@@ -103,12 +103,10 @@ extern void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai);
 extern void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
                                         void *base_addr);
 
-#ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
 extern int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
                                size_t atom_size,
                                pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
                                pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn);
-#endif
 
 #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
 void __init pcpu_populate_pte(unsigned long addr);
index 525b5d6..a0801f6 100644 (file)
  */
 #define ARMPMU_EVT_64BIT               0x00001 /* Event uses a 64bit counter */
 #define ARMPMU_EVT_47BIT               0x00002 /* Event uses a 47bit counter */
+#define ARMPMU_EVT_63BIT               0x00004 /* Event uses a 63bit counter */
 
 static_assert((PERF_EVENT_FLAG_ARCH & ARMPMU_EVT_64BIT) == ARMPMU_EVT_64BIT);
 static_assert((PERF_EVENT_FLAG_ARCH & ARMPMU_EVT_47BIT) == ARMPMU_EVT_47BIT);
+static_assert((PERF_EVENT_FLAG_ARCH & ARMPMU_EVT_63BIT) == ARMPMU_EVT_63BIT);
 
 #define HW_OP_UNSUPPORTED              0xFFFF
 #define C(_x)                          PERF_COUNT_HW_CACHE_##_x
@@ -171,6 +173,8 @@ void kvm_host_pmu_init(struct arm_pmu *pmu);
 #define kvm_host_pmu_init(x)   do { } while(0)
 #endif
 
+bool arm_pmu_irq_is_nmi(void);
+
 /* Internal functions only for core arm_pmu code */
 struct arm_pmu *armpmu_alloc(void);
 void armpmu_free(struct arm_pmu *pmu);
index d5628a7..b528be0 100644 (file)
@@ -295,6 +295,8 @@ struct perf_event_pmu_context;
 
 struct perf_output_handle;
 
+#define PMU_NULL_DEV   ((void *)(~0UL))
+
 /**
  * struct pmu - generic performance monitoring unit
  */
@@ -827,6 +829,14 @@ struct perf_event {
        void *security;
 #endif
        struct list_head                sb_list;
+
+       /*
+        * Certain events gets forwarded to another pmu internally by over-
+        * Certain events get forwarded to another pmu internally by over-
+        * writing the kernel copy of event->attr.type without the user being
+        * aware of it. event->orig_type contains the original 'type' requested
+        * by the user.
+       __u32                           orig_type;
 #endif /* CONFIG_PERF_EVENTS */
 };
 
@@ -1845,9 +1855,9 @@ int perf_event_exit_cpu(unsigned int cpu);
 #define perf_event_exit_cpu    NULL
 #endif
 
-extern void __weak arch_perf_update_userpage(struct perf_event *event,
-                                            struct perf_event_mmap_page *userpg,
-                                            u64 now);
+extern void arch_perf_update_userpage(struct perf_event *event,
+                                     struct perf_event_mmap_page *userpg,
+                                     u64 now);
 
 #ifdef CONFIG_MMU
 extern __weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr);
index c5a5148..5063b48 100644 (file)
@@ -94,14 +94,22 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 #define pte_offset_kernel pte_offset_kernel
 #endif
 
-#if defined(CONFIG_HIGHPTE)
-#define pte_offset_map(dir, address)                           \
-       ((pte_t *)kmap_atomic(pmd_page(*(dir))) +               \
-        pte_index((address)))
-#define pte_unmap(pte) kunmap_atomic((pte))
+#ifdef CONFIG_HIGHPTE
+#define __pte_map(pmd, address) \
+       ((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
+#define pte_unmap(pte) do {    \
+       kunmap_local((pte));    \
+       /* rcu_read_unlock() to be added later */       \
+} while (0)
 #else
-#define pte_offset_map(dir, address)   pte_offset_kernel((dir), (address))
-#define pte_unmap(pte) ((void)(pte))   /* NOP */
+static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
+{
+       return pte_offset_kernel(pmd, address);
+}
+static inline void pte_unmap(pte_t *pte)
+{
+       /* rcu_read_unlock() to be added later */
+}
 #endif
 
 /* Find an entry in the second-level page table.. */
@@ -204,12 +212,26 @@ static inline int pudp_set_access_flags(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef ptep_get
+static inline pte_t ptep_get(pte_t *ptep)
+{
+       return READ_ONCE(*ptep);
+}
+#endif
+
+#ifndef pmdp_get
+static inline pmd_t pmdp_get(pmd_t *pmdp)
+{
+       return READ_ONCE(*pmdp);
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
                                            unsigned long address,
                                            pte_t *ptep)
 {
-       pte_t pte = *ptep;
+       pte_t pte = ptep_get(ptep);
        int r = 1;
        if (!pte_young(pte))
                r = 0;
@@ -296,7 +318,7 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
                                       unsigned long address,
                                       pte_t *ptep)
 {
-       pte_t pte = *ptep;
+       pte_t pte = ptep_get(ptep);
        pte_clear(mm, address, ptep);
        page_table_check_pte_clear(mm, address, pte);
        return pte;
@@ -309,20 +331,6 @@ static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
        ptep_get_and_clear(mm, addr, ptep);
 }
 
-#ifndef ptep_get
-static inline pte_t ptep_get(pte_t *ptep)
-{
-       return READ_ONCE(*ptep);
-}
-#endif
-
-#ifndef pmdp_get
-static inline pmd_t pmdp_get(pmd_t *pmdp)
-{
-       return READ_ONCE(*pmdp);
-}
-#endif
-
 #ifdef CONFIG_GUP_GET_PXX_LOW_HIGH
 /*
  * For walking the pagetables without holding any locks.  Some architectures
@@ -511,7 +519,7 @@ extern pud_t pudp_huge_clear_flush(struct vm_area_struct *vma,
 struct mm_struct;
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
 {
-       pte_t old_pte = *ptep;
+       pte_t old_pte = ptep_get(ptep);
        set_pte_at(mm, address, ptep, pte_wrprotect(old_pte));
 }
 #endif
@@ -591,6 +599,10 @@ extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 #endif
 
+#ifndef arch_needs_pgtable_deposit
+#define arch_needs_pgtable_deposit() (false)
+#endif
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /*
  * This is an implementation of pmdp_establish() that is only suitable for an
@@ -1292,9 +1304,10 @@ static inline int pud_trans_huge(pud_t pud)
 }
 #endif
 
-/* See pmd_none_or_trans_huge_or_clear_bad for discussion. */
-static inline int pud_none_or_trans_huge_or_dev_or_clear_bad(pud_t *pud)
+static inline int pud_trans_unstable(pud_t *pud)
 {
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+       defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
        pud_t pudval = READ_ONCE(*pud);
 
        if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
@@ -1303,121 +1316,10 @@ static inline int pud_none_or_trans_huge_or_dev_or_clear_bad(pud_t *pud)
                pud_clear_bad(pud);
                return 1;
        }
-       return 0;
-}
-
-/* See pmd_trans_unstable for discussion. */
-static inline int pud_trans_unstable(pud_t *pud)
-{
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&                    \
-       defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
-       return pud_none_or_trans_huge_or_dev_or_clear_bad(pud);
-#else
-       return 0;
-#endif
-}
-
-#ifndef arch_needs_pgtable_deposit
-#define arch_needs_pgtable_deposit() (false)
-#endif
-/*
- * This function is meant to be used by sites walking pagetables with
- * the mmap_lock held in read mode to protect against MADV_DONTNEED and
- * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
- * into a null pmd and the transhuge page fault can convert a null pmd
- * into an hugepmd or into a regular pmd (if the hugepage allocation
- * fails). While holding the mmap_lock in read mode the pmd becomes
- * stable and stops changing under us only if it's not null and not a
- * transhuge pmd. When those races occurs and this function makes a
- * difference vs the standard pmd_none_or_clear_bad, the result is
- * undefined so behaving like if the pmd was none is safe (because it
- * can return none anyway). The compiler level barrier() is critically
- * important to compute the two checks atomically on the same pmdval.
- *
- * For 32bit kernels with a 64bit large pmd_t this automatically takes
- * care of reading the pmd atomically to avoid SMP race conditions
- * against pmd_populate() when the mmap_lock is hold for reading by the
- * caller (a special atomic read not done by "gcc" as in the generic
- * version above, is also needed when THP is disabled because the page
- * fault can populate the pmd from under us).
- */
-static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
-{
-       pmd_t pmdval = pmdp_get_lockless(pmd);
-       /*
-        * The barrier will stabilize the pmdval in a register or on
-        * the stack so that it will stop changing under the code.
-        *
-        * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
-        * pmdp_get_lockless is allowed to return a not atomic pmdval
-        * (for example pointing to an hugepage that has never been
-        * mapped in the pmd). The below checks will only care about
-        * the low part of the pmd with 32bit PAE x86 anyway, with the
-        * exception of pmd_none(). So the important thing is that if
-        * the low part of the pmd is found null, the high part will
-        * be also null or the pmd_none() check below would be
-        * confused.
-        */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-       barrier();
 #endif
-       /*
-        * !pmd_present() checks for pmd migration entries
-        *
-        * The complete check uses is_pmd_migration_entry() in linux/swapops.h
-        * But using that requires moving current function and pmd_trans_unstable()
-        * to linux/swapops.h to resolve dependency, which is too much code move.
-        *
-        * !pmd_present() is equivalent to is_pmd_migration_entry() currently,
-        * because !pmd_present() pages can only be under migration not swapped
-        * out.
-        *
-        * pmd_none() is preserved for future condition checks on pmd migration
-        * entries and not confusing with this function name, although it is
-        * redundant with !pmd_present().
-        */
-       if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
-               (IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval)))
-               return 1;
-       if (unlikely(pmd_bad(pmdval))) {
-               pmd_clear_bad(pmd);
-               return 1;
-       }
        return 0;
 }
 
-/*
- * This is a noop if Transparent Hugepage Support is not built into
- * the kernel. Otherwise it is equivalent to
- * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
- * places that already verified the pmd is not none and they want to
- * walk ptes while holding the mmap sem in read mode (write mode don't
- * need this). If THP is not enabled, the pmd can't go away under the
- * code even if MADV_DONTNEED runs, but if THP is enabled we need to
- * run a pmd_trans_unstable before walking the ptes after
- * split_huge_pmd returns (because it may have run when the pmd become
- * null, but then a page fault can map in a THP and not a regular page).
- */
-static inline int pmd_trans_unstable(pmd_t *pmd)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-       return pmd_none_or_trans_huge_or_clear_bad(pmd);
-#else
-       return 0;
-#endif
-}
-
-/*
- * the ordering of these checks is important for pmds with _page_devmap set.
- * if we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
-       return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
 #ifndef CONFIG_NUMA_BALANCING
 /*
  * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
index d2c3f16..02e0086 100644 (file)
@@ -261,18 +261,14 @@ void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *);
 
 extern const struct pipe_buf_operations nosteal_pipe_buf_ops;
 
-#ifdef CONFIG_WATCH_QUEUE
 unsigned long account_pipe_buffers(struct user_struct *user,
                                   unsigned long old, unsigned long new);
 bool too_many_pipe_buffers_soft(unsigned long user_bufs);
 bool too_many_pipe_buffers_hard(unsigned long user_bufs);
 bool pipe_is_unprivileged_user(void);
-#endif
 
 /* for F_SETPIPE_SZ and F_GETPIPE_SZ */
-#ifdef CONFIG_WATCH_QUEUE
 int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots);
-#endif
 long pipe_fcntl(struct file *, unsigned int, unsigned long arg);
 struct pipe_inode_info *get_pipe_info(struct file *file, bool for_splice);
 
index f9c5ac8..80cb00d 100644 (file)
@@ -156,7 +156,6 @@ struct pktcdvd_device
 {
        struct block_device     *bdev;          /* dev attached */
        dev_t                   pkt_dev;        /* our dev */
-       char                    name[20];
        struct packet_settings  settings;
        struct packet_stats     stats;
        int                     refcnt;         /* Open count */
index 3101152..1d6e6c4 100644 (file)
@@ -36,6 +36,7 @@ struct s3c64xx_spi_info {
        int src_clk_nr;
        int num_cs;
        bool no_cs;
+       bool polling;
        int (*cfg_gpio)(void);
 };
 
index 0260f5e..253f267 100644 (file)
@@ -158,6 +158,8 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
                        struct pid *pid, struct task_struct *task);
 #endif /* CONFIG_PROC_PID_ARCH_STATUS */
 
+void arch_report_meminfo(struct seq_file *m);
+
 #else /* CONFIG_PROC_FS */
 
 static inline void proc_root_init(void)
index 917528d..d506dc6 100644 (file)
@@ -7,6 +7,7 @@
 struct inode *ramfs_get_inode(struct super_block *sb, const struct inode *dir,
         umode_t mode, dev_t dev);
 extern int ramfs_init_fs_context(struct fs_context *fc);
+extern void ramfs_kill_sb(struct super_block *sb);
 
 #ifdef CONFIG_MMU
 static inline int
index 3d1a9e7..6a0999c 100644 (file)
@@ -206,7 +206,7 @@ latch_tree_find(void *key, struct latch_tree_root *root,
        do {
                seq = raw_read_seqcount_latch(&root->seq);
                node = __lt_find(key, root, seq & 1, ops->comp);
-       } while (read_seqcount_latch_retry(&root->seq, seq));
+       } while (raw_read_seqcount_latch_retry(&root->seq, seq));
 
        return node;
 }
index dcd2cf1..7d9c2a6 100644 (file)
@@ -156,31 +156,6 @@ static inline int rcu_nocb_cpu_deoffload(int cpu) { return 0; }
 static inline void rcu_nocb_flush_deferred_wakeup(void) { }
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
-/**
- * RCU_NONIDLE - Indicate idle-loop code that needs RCU readers
- * @a: Code that RCU needs to pay attention to.
- *
- * RCU read-side critical sections are forbidden in the inner idle loop,
- * that is, between the ct_idle_enter() and the ct_idle_exit() -- RCU
- * will happily ignore any such read-side critical sections.  However,
- * things like powertop need tracepoints in the inner idle loop.
- *
- * This macro provides the way out:  RCU_NONIDLE(do_something_with_RCU())
- * will tell RCU that it needs to pay attention, invoke its argument
- * (in this example, calling the do_something_with_RCU() function),
- * and then tell RCU to go back to ignoring this CPU.  It is permissible
- * to nest RCU_NONIDLE() wrappers, but not indefinitely (but the limit is
- * on the order of a million or so, even on 32-bit systems).  It is
- * not legal to block within RCU_NONIDLE(), nor is it permissible to
- * transfer control either into or out of RCU_NONIDLE()'s statement.
- */
-#define RCU_NONIDLE(a) \
-       do { \
-               ct_irq_enter_irqson(); \
-               do { a; } while (0); \
-               ct_irq_exit_irqson(); \
-       } while (0)
-
 /*
  * Note a quasi-voluntary context switch for RCU-tasks's benefit.
  * This is a macro rather than an inline function to avoid #include hell.
@@ -957,9 +932,8 @@ static inline notrace void rcu_read_unlock_sched_notrace(void)
 
 /**
  * kfree_rcu() - kfree an object after a grace period.
- * @ptr: pointer to kfree for both single- and double-argument invocations.
- * @rhf: the name of the struct rcu_head within the type of @ptr,
- *       but only for double-argument invocations.
+ * @ptr: pointer to kfree for double-argument invocations.
+ * @rhf: the name of the struct rcu_head within the type of @ptr.
  *
  * Many rcu callbacks functions just call kfree() on the base structure.
  * These functions are trivial, but their size adds up, and furthermore
@@ -984,26 +958,18 @@ static inline notrace void rcu_read_unlock_sched_notrace(void)
  * The BUILD_BUG_ON check must not involve any function calls, hence the
  * checks are done in macros here.
  */
-#define kfree_rcu(ptr, rhf...) kvfree_rcu(ptr, ## rhf)
+#define kfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf)
+#define kvfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf)
 
 /**
- * kvfree_rcu() - kvfree an object after a grace period.
- *
- * This macro consists of one or two arguments and it is
- * based on whether an object is head-less or not. If it
- * has a head then a semantic stays the same as it used
- * to be before:
- *
- *     kvfree_rcu(ptr, rhf);
- *
- * where @ptr is a pointer to kvfree(), @rhf is the name
- * of the rcu_head structure within the type of @ptr.
+ * kfree_rcu_mightsleep() - kfree an object after a grace period.
+ * @ptr: pointer to kfree for single-argument invocations.
  *
  * When it comes to head-less variant, only one argument
  * is passed and that is just a pointer which has to be
  * freed after a grace period. Therefore the semantic is
  *
- *     kvfree_rcu(ptr);
+ *     kfree_rcu_mightsleep(ptr);
  *
  * where @ptr is the pointer to be freed by kvfree().
  *
@@ -1012,13 +978,9 @@ static inline notrace void rcu_read_unlock_sched_notrace(void)
  * annotation. Otherwise, please switch and embed the
  * rcu_head structure within the type of @ptr.
  */
-#define kvfree_rcu(...) KVFREE_GET_MACRO(__VA_ARGS__,          \
-       kvfree_rcu_arg_2, kvfree_rcu_arg_1)(__VA_ARGS__)
-
+#define kfree_rcu_mightsleep(ptr) kvfree_rcu_arg_1(ptr)
 #define kvfree_rcu_mightsleep(ptr) kvfree_rcu_arg_1(ptr)
-#define kfree_rcu_mightsleep(ptr) kvfree_rcu_mightsleep(ptr)
 
-#define KVFREE_GET_MACRO(_1, _2, NAME, ...) NAME
 #define kvfree_rcu_arg_2(ptr, rhf)                                     \
 do {                                                                   \
        typeof (ptr) ___p = (ptr);                                      \
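A short sketch (invented names) of the two call forms after this split: the two-argument kfree_rcu() needs an embedded rcu_head, while the head-less single-argument form becomes kfree_rcu_mightsleep():

    struct foo {
            struct rcu_head rcu;
            int data;
    };

    kfree_rcu(p, rcu);              /* p is a struct foo *; safe from atomic context */
    kfree_rcu_mightsleep(q);        /* q has no rcu_head; may block for a grace period */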
index c2b9cc5..8fc0b3e 100644 (file)
@@ -1528,9 +1528,6 @@ struct regmap_irq_chip_data;
  *                  status_base. Should contain num_regs arrays.
  *                  Can be provided for chips with more complex mapping than
  *                  1.st bit to 1.st sub-reg, 2.nd bit to 2.nd sub-reg, ...
- *                  When used with not_fixed_stride, each one-element array
- *                  member contains offset calculated as address from each
- *                  peripheral to first peripheral.
  * @num_main_regs: Number of 'main status' irq registers for chips which have
  *                main_status set.
  *
@@ -1542,10 +1539,6 @@ struct regmap_irq_chip_data;
  * @ack_base:    Base ack address. If zero then the chip is clear on read.
  *               Using zero value is possible with @use_ack bit.
  * @wake_base:   Base address for wake enables.  If zero unsupported.
- * @type_base:   Base address for irq type.  If zero unsupported.  Deprecated,
- *              use @config_base instead.
- * @virt_reg_base:   Base addresses for extra config regs. Deprecated, use
- *                  @config_base instead.
  * @config_base: Base address for IRQ type config regs. If null unsupported.
  * @irq_reg_stride:  Stride to use for chips where registers are not contiguous.
  * @init_ack_masked: Ack all masked interrupts once during initalization.
@@ -1571,11 +1564,6 @@ struct regmap_irq_chip_data;
  *                   registers before unmasking interrupts to clear any bits
  *                   set when they were masked.
  * @runtime_pm:  Hold a runtime PM lock on the device when accessing it.
- * @not_fixed_stride: Used when chip peripherals are not laid out with fixed
- *                   stride. Must be used with sub_reg_offsets containing the
- *                   offsets to each peripheral. Deprecated; the same thing
- *                   can be accomplished with a @get_irq_reg callback, without
- *                   the need for a @sub_reg_offsets table.
  * @no_status: No status register: all interrupts assumed generated by device.
  *
  * @num_regs:    Number of registers in each control bank.
@@ -1583,12 +1571,6 @@ struct regmap_irq_chip_data;
  * @irqs:        Descriptors for individual IRQs.  Interrupt numbers are
  *               assigned based on the index in the array of the interrupt.
  * @num_irqs:    Number of descriptors.
- *
- * @num_type_reg:    Number of type registers. Deprecated, use config registers
- *                  instead.
- * @num_virt_regs:   Number of non-standard irq configuration registers.
- *                  If zero unsupported. Deprecated, use config registers
- *                  instead.
  * @num_config_bases:  Number of config base registers.
  * @num_config_regs:   Number of config registers for each config base register.
  *
@@ -1598,15 +1580,12 @@ struct regmap_irq_chip_data;
  *                  after handling the interrupts in regmap_irq_handler().
  * @handle_mask_sync: Callback used to handle IRQ mask syncs. The index will be
  *                   in the range [0, num_regs)
- * @set_type_virt:   Driver specific callback to extend regmap_irq_set_type()
- *                  and configure virt regs. Deprecated, use @set_type_config
- *                  callback and config registers instead.
  * @set_type_config: Callback used for configuring irq types.
  * @get_irq_reg: Callback for mapping (base register, index) pairs to register
  *              addresses. The base register will be one of @status_base,
  *              @mask_base, etc., @main_status, or any of @config_base.
  *              The index will be in the range [0, num_main_regs[ for the
- *              main status base, [0, num_type_settings[ for any config
+ *              main status base, [0, num_config_regs[ for any config
  *              register base, and [0, num_regs[ for any other base.
  *              If unspecified then regmap_irq_get_irq_reg_linear() is used.
  * @irq_drv_data:    Driver specific IRQ data which is passed as parameter when
@@ -1629,8 +1608,6 @@ struct regmap_irq_chip {
        unsigned int unmask_base;
        unsigned int ack_base;
        unsigned int wake_base;
-       unsigned int type_base;
-       unsigned int *virt_reg_base;
        const unsigned int *config_base;
        unsigned int irq_reg_stride;
        unsigned int init_ack_masked:1;
@@ -1643,7 +1620,6 @@ struct regmap_irq_chip {
        unsigned int type_in_mask:1;
        unsigned int clear_on_unmask:1;
        unsigned int runtime_pm:1;
-       unsigned int not_fixed_stride:1;
        unsigned int no_status:1;
 
        int num_regs;
@@ -1651,18 +1627,13 @@ struct regmap_irq_chip {
        const struct regmap_irq *irqs;
        int num_irqs;
 
-       int num_type_reg;
-       int num_virt_regs;
        int num_config_bases;
        int num_config_regs;
 
        int (*handle_pre_irq)(void *irq_drv_data);
        int (*handle_post_irq)(void *irq_drv_data);
-       int (*handle_mask_sync)(struct regmap *map, int index,
-                               unsigned int mask_buf_def,
+       int (*handle_mask_sync)(int index, unsigned int mask_buf_def,
                                unsigned int mask_buf, void *irq_drv_data);
-       int (*set_type_virt)(unsigned int **buf, unsigned int type,
-                            unsigned long hwirq, int reg);
        int (*set_type_config)(unsigned int **buf, unsigned int type,
                               const struct regmap_irq *irq_data, int idx,
                               void *irq_drv_data);
index d3b4a3d..c6ef7d6 100644 (file)
@@ -758,6 +758,8 @@ int regulator_set_current_limit_regmap(struct regulator_dev *rdev,
                                       int min_uA, int max_uA);
 int regulator_get_current_limit_regmap(struct regulator_dev *rdev);
 void *regulator_get_init_drvdata(struct regulator_init_data *reg_init_data);
+int regulator_find_closest_bigger(unsigned int target, const unsigned int *table,
+                                 unsigned int num_sel, unsigned int *sel);
 int regulator_set_ramp_delay_regmap(struct regulator_dev *rdev, int ramp_delay);
 int regulator_sync_voltage_rdev(struct regulator_dev *rdev);
 
index bdcf83c..c71a6a9 100644 (file)
@@ -41,15 +41,12 @@ enum {
        MT6358_ID_VIO28,
        MT6358_ID_VA12,
        MT6358_ID_VRF18,
-       MT6358_ID_VCN33_BT,
-       MT6358_ID_VCN33_WIFI,
+       MT6358_ID_VCN33,
        MT6358_ID_VCAMA2,
        MT6358_ID_VMC,
        MT6358_ID_VLDO28,
        MT6358_ID_VAUD28,
        MT6358_ID_VSIM2,
-       MT6358_ID_VCORE_SSHUB,
-       MT6358_ID_VSRAM_OTHERS_SSHUB,
        MT6358_ID_RG_MAX,
 };
 
@@ -85,13 +82,10 @@ enum {
        MT6366_ID_VIO28,
        MT6366_ID_VA12,
        MT6366_ID_VRF18,
-       MT6366_ID_VCN33_BT,
-       MT6366_ID_VCN33_WIFI,
+       MT6366_ID_VCN33,
        MT6366_ID_VMC,
        MT6366_ID_VAUD28,
        MT6366_ID_VSIM2,
-       MT6366_ID_VCORE_SSHUB,
-       MT6366_ID_VSRAM_OTHERS_SSHUB,
        MT6366_ID_RG_MAX,
 };
 
index 4e78651..847c9a0 100644 (file)
@@ -9,15 +9,8 @@
 enum {
        Root_NFS = MKDEV(UNNAMED_MAJOR, 255),
        Root_CIFS = MKDEV(UNNAMED_MAJOR, 254),
+       Root_Generic = MKDEV(UNNAMED_MAJOR, 253),
        Root_RAM0 = MKDEV(RAMDISK_MAJOR, 0),
-       Root_RAM1 = MKDEV(RAMDISK_MAJOR, 1),
-       Root_FD0 = MKDEV(FLOPPY_MAJOR, 0),
-       Root_HDA1 = MKDEV(IDE0_MAJOR, 1),
-       Root_HDA2 = MKDEV(IDE0_MAJOR, 2),
-       Root_SDA1 = MKDEV(SCSI_DISK0_MAJOR, 1),
-       Root_SDA2 = MKDEV(SCSI_DISK0_MAJOR, 2),
-       Root_HDC1 = MKDEV(IDE1_MAJOR, 1),
-       Root_SR0 = MKDEV(SCSI_CDROM_MAJOR, 0),
 };
 
 extern dev_t ROOT_DEV;
index 375a5e9..77df3d7 100644 (file)
@@ -16,7 +16,7 @@ struct scatterlist {
 #ifdef CONFIG_NEED_SG_DMA_LENGTH
        unsigned int    dma_length;
 #endif
-#ifdef CONFIG_PCI_P2PDMA
+#ifdef CONFIG_NEED_SG_DMA_FLAGS
        unsigned int    dma_flags;
 #endif
 };
@@ -141,6 +141,30 @@ static inline void sg_set_page(struct scatterlist *sg, struct page *page,
        sg->length = len;
 }
 
+/**
+ * sg_set_folio - Set sg entry to point at given folio
+ * @sg:                 SG entry
+ * @folio:      The folio
+ * @len:        Length of data
+ * @offset:     Offset into folio
+ *
+ * Description:
+ *   Use this function to set an sg entry pointing at a folio, never assign
+ *   the folio directly. We encode sg table information in the lower bits
+ *   of the folio pointer. See sg_page() for looking up the page belonging
+ *   to an sg entry.
+ *
+ **/
+static inline void sg_set_folio(struct scatterlist *sg, struct folio *folio,
+                              size_t len, size_t offset)
+{
+       WARN_ON_ONCE(len > UINT_MAX);
+       WARN_ON_ONCE(offset > UINT_MAX);
+       sg_assign_page(sg, &folio->page);
+       sg->offset = offset;
+       sg->length = len;
+}
+
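As an illustrative snippet (not part of the patch), filling a single SG entry from a folio obtained elsewhere:

    struct scatterlist sg;

    sg_init_table(&sg, 1);                          /* one-entry table, end marker set */
    sg_set_folio(&sg, folio, folio_size(folio), 0); /* map the whole folio at offset 0 */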
 static inline struct page *sg_page(struct scatterlist *sg)
 {
 #ifdef CONFIG_DEBUG_SG
@@ -249,17 +273,18 @@ static inline void sg_unmark_end(struct scatterlist *sg)
 }
 
 /*
- * CONFGI_PCI_P2PDMA depends on CONFIG_64BIT which means there is 4 bytes
- * in struct scatterlist (assuming also CONFIG_NEED_SG_DMA_LENGTH is set).
- * Use this padding for DMA flags bits to indicate when a specific
- * dma address is a bus address.
+ * On 64-bit architectures there is a 4-byte padding in struct scatterlist
+ * (assuming also CONFIG_NEED_SG_DMA_LENGTH is set). Use this padding for DMA
+ * flags bits to indicate when a specific dma address is a bus address or the
+ * buffer may have been bounced via SWIOTLB.
  */
-#ifdef CONFIG_PCI_P2PDMA
+#ifdef CONFIG_NEED_SG_DMA_FLAGS
 
-#define SG_DMA_BUS_ADDRESS (1 << 0)
+#define SG_DMA_BUS_ADDRESS     (1 << 0)
+#define SG_DMA_SWIOTLB         (1 << 1)
 
 /**
- * sg_dma_is_bus address - Return whether a given segment was marked
+ * sg_dma_is_bus_address - Return whether a given segment was marked
  *                        as a bus address
  * @sg:                 SG entry
  *
@@ -267,13 +292,13 @@ static inline void sg_unmark_end(struct scatterlist *sg)
  *   Returns true if sg_dma_mark_bus_address() has been called on
  *   this segment.
  **/
-static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
+static inline bool sg_dma_is_bus_address(struct scatterlist *sg)
 {
        return sg->dma_flags & SG_DMA_BUS_ADDRESS;
 }
 
 /**
- * sg_dma_mark_bus address - Mark the scatterlist entry as a bus address
+ * sg_dma_mark_bus_address - Mark the scatterlist entry as a bus address
  * @sg:                 SG entry
  *
  * Description:
@@ -299,9 +324,37 @@ static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
        sg->dma_flags &= ~SG_DMA_BUS_ADDRESS;
 }
 
+/**
+ * sg_dma_is_swiotlb - Return whether the scatterlist was marked for SWIOTLB
+ *                     bouncing
+ * @sg:                SG entry
+ *
+ * Description:
+ *   Returns true if the scatterlist was marked for SWIOTLB bouncing. Not all
+ *   elements may have been bounced, so the caller would have to check
+ *   individual SG entries with is_swiotlb_buffer().
+ */
+static inline bool sg_dma_is_swiotlb(struct scatterlist *sg)
+{
+       return sg->dma_flags & SG_DMA_SWIOTLB;
+}
+
+/**
+ * sg_dma_mark_swiotlb - Mark the scatterlist for SWIOTLB bouncing
+ * @sg:                SG entry
+ *
+ * Description:
+ *   Marks a scatterlist for SWIOTLB bouncing. Not all SG entries may be
+ *   bounced.
+ */
+static inline void sg_dma_mark_swiotlb(struct scatterlist *sg)
+{
+       sg->dma_flags |= SG_DMA_SWIOTLB;
+}
+
 #else
 
-static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
+static inline bool sg_dma_is_bus_address(struct scatterlist *sg)
 {
        return false;
 }
@@ -311,8 +364,15 @@ static inline void sg_dma_mark_bus_address(struct scatterlist *sg)
 static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
 {
 }
+static inline bool sg_dma_is_swiotlb(struct scatterlist *sg)
+{
+       return false;
+}
+static inline void sg_dma_mark_swiotlb(struct scatterlist *sg)
+{
+}
 
-#endif
+#endif /* CONFIG_NEED_SG_DMA_FLAGS */
 
 /**
  * sg_phys - Return physical address of an sg entry
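A minimal sketch of the new sg_set_folio() helper described above: pointing a single-entry scatterlist at one folio's worth of data. Obtaining and holding the folio is assumed to be handled by the caller.

	struct scatterlist sg;

	sg_init_table(&sg, 1);
	/* Point the single entry at the whole folio, starting at offset 0. */
	sg_set_folio(&sg, folio, folio_size(folio), 0);
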
index eed5d65..609bde8 100644 (file)
@@ -41,7 +41,6 @@
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
-struct backing_dev_info;
 struct bio_list;
 struct blk_plug;
 struct bpf_local_storage;
@@ -1186,8 +1185,6 @@ struct task_struct {
        /* VM state: */
        struct reclaim_state            *reclaim_state;
 
-       struct backing_dev_info         *backing_dev_info;
-
        struct io_context               *io_context;
 
 #ifdef CONFIG_COMPACTION
@@ -1852,7 +1849,9 @@ current_restore_flags(unsigned long orig_flags, unsigned long flags)
 }
 
 extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
-extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_effective_cpus);
+extern int task_can_attach(struct task_struct *p);
+extern int dl_bw_alloc(int cpu, u64 dl_bw);
+extern void dl_bw_free(int cpu, u64 dl_bw);
 #ifdef CONFIG_SMP
 extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
 extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
@@ -2006,15 +2005,12 @@ static __always_inline void scheduler_ipi(void)
         */
        preempt_fold_need_resched();
 }
-extern unsigned long wait_task_inactive(struct task_struct *, unsigned int match_state);
 #else
 static inline void scheduler_ipi(void) { }
-static inline unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
-{
-       return 1;
-}
 #endif
 
+extern unsigned long wait_task_inactive(struct task_struct *, unsigned int match_state);
+
 /*
  * Set thread flags in other task's structures.
  * See asm/thread_info.h for TIF_xxxx flags available:
index ca008f7..196f0ca 100644 (file)
  *
  * Please use one of the three interfaces below.
  */
-extern unsigned long long notrace sched_clock(void);
+extern u64 sched_clock(void);
+
+#if defined(CONFIG_ARCH_WANTS_NO_INSTR) || defined(CONFIG_GENERIC_SCHED_CLOCK)
+extern u64 sched_clock_noinstr(void);
+#else
+static __always_inline u64 sched_clock_noinstr(void)
+{
+       return sched_clock();
+}
+#endif
 
 /*
  * See the comment in kernel/sched/clock.c
@@ -45,6 +54,11 @@ static inline u64 cpu_clock(int cpu)
        return sched_clock();
 }
 
+static __always_inline u64 local_clock_noinstr(void)
+{
+       return sched_clock_noinstr();
+}
+
 static __always_inline u64 local_clock(void)
 {
        return sched_clock();
@@ -79,6 +93,7 @@ static inline u64 cpu_clock(int cpu)
        return sched_clock_cpu(cpu);
 }
 
+extern u64 local_clock_noinstr(void);
 extern u64 local_clock(void);
 
 #endif
index 57bde66..fad77b5 100644 (file)
@@ -132,12 +132,9 @@ SD_FLAG(SD_SERIALIZE, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
 /*
  * Place busy tasks earlier in the domain
  *
- * SHARED_CHILD: Usually set on the SMT level. Technically could be set further
- *               up, but currently assumed to be set from the base domain
- *               upwards (see update_top_cache_domain()).
  * NEEDS_GROUPS: Load balancing flag.
  */
-SD_FLAG(SD_ASYM_PACKING, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
+SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
 
 /*
  * Prefer to place tasks in a sibling domain
index 2009926..669e8cf 100644 (file)
@@ -135,7 +135,7 @@ struct signal_struct {
 #ifdef CONFIG_POSIX_TIMERS
 
        /* POSIX.1b Interval Timers */
-       int                     posix_timer_id;
+       unsigned int            next_posix_timer_id;
        struct list_head        posix_timers;
 
        /* ITIMER_REAL timer for the process */
index 816df6c..67b573d 100644 (file)
@@ -203,7 +203,7 @@ struct sched_domain_topology_level {
 #endif
 };
 
-extern void set_sched_topology(struct sched_domain_topology_level *tl);
+extern void __init set_sched_topology(struct sched_domain_topology_level *tl);
 
 #ifdef CONFIG_SCHED_DEBUG
 # define SD_INIT_NAME(type)            .name = #type
index e2734e9..3282850 100644 (file)
@@ -1465,6 +1465,7 @@ void security_sctp_sk_clone(struct sctp_association *asoc, struct sock *sk,
                            struct sock *newsk);
 int security_sctp_assoc_established(struct sctp_association *asoc,
                                    struct sk_buff *skb);
+int security_mptcp_add_subflow(struct sock *sk, struct sock *ssk);
 
 #else  /* CONFIG_SECURITY_NETWORK */
 static inline int security_unix_stream_connect(struct sock *sock,
@@ -1692,6 +1693,11 @@ static inline int security_sctp_assoc_established(struct sctp_association *asoc,
 {
        return 0;
 }
+
+static inline int security_mptcp_add_subflow(struct sock *sk, struct sock *ssk)
+{
+       return 0;
+}
 #endif /* CONFIG_SECURITY_NETWORK */
 
 #ifdef CONFIG_SECURITY_INFINIBAND
index 3926e90..987a59d 100644 (file)
@@ -671,9 +671,9 @@ typedef struct {
  *
  * Return: sequence counter raw value. Use the lowest bit as an index for
  * picking which data copy to read. The full counter must then be checked
- * with read_seqcount_latch_retry().
+ * with raw_read_seqcount_latch_retry().
  */
-static inline unsigned raw_read_seqcount_latch(const seqcount_latch_t *s)
+static __always_inline unsigned raw_read_seqcount_latch(const seqcount_latch_t *s)
 {
        /*
         * Pairs with the first smp_wmb() in raw_write_seqcount_latch().
@@ -683,16 +683,17 @@ static inline unsigned raw_read_seqcount_latch(const seqcount_latch_t *s)
 }
 
 /**
- * read_seqcount_latch_retry() - end a seqcount_latch_t read section
+ * raw_read_seqcount_latch_retry() - end a seqcount_latch_t read section
  * @s:         Pointer to seqcount_latch_t
  * @start:     count, from raw_read_seqcount_latch()
  *
  * Return: true if a read section retry is required, else false
  */
-static inline int
-read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
+static __always_inline int
+raw_read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
 {
-       return read_seqcount_retry(&s->seqcount, start);
+       smp_rmb();
+       return unlikely(READ_ONCE(s->seqcount.sequence) != start);
 }
 
 /**
@@ -752,7 +753,7 @@ read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
  *                     entry = data_query(latch->data[idx], ...);
  *
  *             // This includes needed smp_rmb()
- *             } while (read_seqcount_latch_retry(&latch->seq, seq));
+ *             } while (raw_read_seqcount_latch_retry(&latch->seq, seq));
  *
  *             return entry;
  *     }
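A reader-side sketch matching the documented loop above, using the renamed raw_read_seqcount_latch_retry(); the latch_struct layout and the .value field are illustrative assumptions, not taken from the patch.

	static u64 latch_query(struct latch_struct *latch)
	{
		unsigned int seq, idx;
		u64 val;

		do {
			seq = raw_read_seqcount_latch(&latch->seq);
			idx = seq & 0x1;	/* lowest bit picks the stable copy */
			val = latch->data[idx].value;
		} while (raw_read_seqcount_latch_retry(&latch->seq, seq));

		return val;
	}
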
index 6b3e155..ca53425 100644 (file)
@@ -12,6 +12,7 @@
 #ifndef _LINUX_SLAB_H
 #define        _LINUX_SLAB_H
 
+#include <linux/cache.h>
 #include <linux/gfp.h>
 #include <linux/overflow.h>
 #include <linux/types.h>
@@ -235,12 +236,17 @@ void kmem_dump_obj(void *object);
  * alignment larger than the alignment of a 64-bit integer.
  * Setting ARCH_DMA_MINALIGN in arch headers allows that.
  */
-#if defined(ARCH_DMA_MINALIGN) && ARCH_DMA_MINALIGN > 8
+#ifdef ARCH_HAS_DMA_MINALIGN
+#if ARCH_DMA_MINALIGN > 8 && !defined(ARCH_KMALLOC_MINALIGN)
 #define ARCH_KMALLOC_MINALIGN ARCH_DMA_MINALIGN
-#define KMALLOC_MIN_SIZE ARCH_DMA_MINALIGN
-#define KMALLOC_SHIFT_LOW ilog2(ARCH_DMA_MINALIGN)
-#else
+#endif
+#endif
+
+#ifndef ARCH_KMALLOC_MINALIGN
 #define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
+#elif ARCH_KMALLOC_MINALIGN > 8
+#define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN
+#define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
 #endif
 
 /*
index f6df03f..deb90cf 100644 (file)
@@ -39,7 +39,8 @@ enum stat_item {
        CPU_PARTIAL_FREE,       /* Refill cpu partial on free */
        CPU_PARTIAL_NODE,       /* Refill cpu partial from node partial */
        CPU_PARTIAL_DRAIN,      /* Drain cpu partial to node partial */
-       NR_SLUB_STAT_ITEMS };
+       NR_SLUB_STAT_ITEMS
+};
 
 #ifndef CONFIG_SLUB_TINY
 /*
@@ -47,8 +48,13 @@ enum stat_item {
  * with this_cpu_cmpxchg_double() alignment requirements.
  */
 struct kmem_cache_cpu {
-       void **freelist;        /* Pointer to next available object */
-       unsigned long tid;      /* Globally unique transaction id */
+       union {
+               struct {
+                       void **freelist;        /* Pointer to next available object */
+                       unsigned long tid;      /* Globally unique transaction id */
+               };
+               freelist_aba_t freelist_tid;
+       };
        struct slab *slab;      /* The slab from which we are allocating */
 #ifdef CONFIG_SLUB_CPU_PARTIAL
        struct slab *partial;   /* Partially allocated frozen slabs */
index c55a0bc..821a191 100644 (file)
@@ -490,9 +490,13 @@ int geni_se_clk_freq_match(struct geni_se *se, unsigned long req_freq,
                           unsigned int *index, unsigned long *res_freq,
                           bool exact);
 
+void geni_se_tx_init_dma(struct geni_se *se, dma_addr_t iova, size_t len);
+
 int geni_se_tx_dma_prep(struct geni_se *se, void *buf, size_t len,
                        dma_addr_t *iova);
 
+void geni_se_rx_init_dma(struct geni_se *se, dma_addr_t iova, size_t len);
+
 int geni_se_rx_dma_prep(struct geni_se *se, void *buf, size_t len,
                        dma_addr_t *iova);
 
index cfe42f8..32c94ea 100644 (file)
@@ -1261,6 +1261,23 @@ static inline bool spi_is_bpw_supported(struct spi_device *spi, u32 bpw)
        return false;
 }
 
+/**
+ * spi_controller_xfer_timeout - Compute a suitable timeout value
+ * @ctlr: SPI controller
+ * @xfer: Transfer descriptor
+ *
+ * Compute a relevant timeout value for the given transfer. We derive the time
+ * that it would take on a single data line and take twice this amount of time
+ * with a minimum of 500ms to avoid false positives on loaded systems.
+ *
+ * Returns: Transfer timeout value in milliseconds.
+ */
+static inline unsigned int spi_controller_xfer_timeout(struct spi_controller *ctlr,
+                                                      struct spi_transfer *xfer)
+{
+       return max(xfer->len * 8 * 2 / (xfer->speed_hz / 1000), 500U);
+}
+
 /*---------------------------------------------------------------------------*/
 
 /* SPI transfer replacement methods which make use of spi_res */
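Two worked examples of the timeout formula above (transfer sizes and clock rates chosen for illustration only):

	4096-byte transfer at 1 MHz:    4096 * 8 * 2 / (1000000 / 1000) =    65 ms -> max(65, 500)    =   500 ms
	65536-byte transfer at 100 kHz: 65536 * 8 * 2 / (100000 / 1000) = 10485 ms -> max(10485, 500) = 10485 ms

On fast buses the 500 ms floor dominates; on slow buses the computed transfer time wins.
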
index 4fab18a..6c46157 100644 (file)
@@ -77,6 +77,9 @@ extern ssize_t splice_to_pipe(struct pipe_inode_info *,
                              struct splice_pipe_desc *);
 extern ssize_t add_to_pipe(struct pipe_inode_info *,
                              struct pipe_buffer *);
+long vfs_splice_read(struct file *in, loff_t *ppos,
+                    struct pipe_inode_info *pipe, size_t len,
+                    unsigned int flags);
 extern ssize_t splice_direct_to_actor(struct file *, struct splice_desc *,
                                      splice_direct_actor *);
 extern long do_splice(struct file *in, loff_t *off_in,
index 41c4b26..eb92a50 100644 (file)
@@ -212,7 +212,7 @@ static inline int srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp)
 
        srcu_check_nmi_safety(ssp, false);
        retval = __srcu_read_lock(ssp);
-       srcu_lock_acquire(&(ssp)->dep_map);
+       srcu_lock_acquire(&ssp->dep_map);
        return retval;
 }
 
@@ -229,7 +229,7 @@ static inline int srcu_read_lock_nmisafe(struct srcu_struct *ssp) __acquires(ssp
 
        srcu_check_nmi_safety(ssp, true);
        retval = __srcu_read_lock_nmisafe(ssp);
-       rcu_lock_acquire(&(ssp)->dep_map);
+       rcu_lock_acquire(&ssp->dep_map);
        return retval;
 }
 
@@ -284,7 +284,7 @@ static inline void srcu_read_unlock(struct srcu_struct *ssp, int idx)
 {
        WARN_ON_ONCE(idx & ~0x1);
        srcu_check_nmi_safety(ssp, false);
-       srcu_lock_release(&(ssp)->dep_map);
+       srcu_lock_release(&ssp->dep_map);
        __srcu_read_unlock(ssp, idx);
 }
 
@@ -300,7 +300,7 @@ static inline void srcu_read_unlock_nmisafe(struct srcu_struct *ssp, int idx)
 {
        WARN_ON_ONCE(idx & ~0x1);
        srcu_check_nmi_safety(ssp, true);
-       rcu_lock_release(&(ssp)->dep_map);
+       rcu_lock_release(&ssp->dep_map);
        __srcu_read_unlock_nmisafe(ssp, idx);
 }
 
index c062c58..dbfc664 100644 (file)
@@ -169,7 +169,7 @@ static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
 #endif
 
 void *memchr_inv(const void *s, int c, size_t n);
-char *strreplace(char *s, char old, char new);
+char *strreplace(char *str, char old, char new);
 
 extern void kfree_const(const void *x);
 
index f66ec8f..f875111 100644 (file)
@@ -222,7 +222,7 @@ struct svc_rqst {
        struct page *           *rq_next_page; /* next reply page to use */
        struct page *           *rq_page_end;  /* one past the last page */
 
-       struct pagevec          rq_pvec;
+       struct folio_batch      rq_fbatch;
        struct kvec             rq_vec[RPCSVC_MAXPAGES]; /* generally useful.. */
        struct bio_vec          rq_bvec[RPCSVC_MAXPAGES];
 
@@ -508,6 +508,27 @@ static inline void svcxdr_init_encode(struct svc_rqst *rqstp)
 }
 
 /**
+ * svcxdr_encode_opaque_pages - Insert pages into an xdr_stream
+ * @xdr: xdr_stream to be updated
+ * @pages: array of pages to insert
+ * @base: starting offset of first data byte in @pages
+ * @len: number of data bytes in @pages to insert
+ *
+ * After the @pages are added, the tail iovec is instantiated pointing
+ * to the end of the head buffer, and the stream is set up to encode
+ * subsequent items into the tail.
+ */
+static inline void svcxdr_encode_opaque_pages(struct svc_rqst *rqstp,
+                                             struct xdr_stream *xdr,
+                                             struct page **pages,
+                                             unsigned int base,
+                                             unsigned int len)
+{
+       xdr_write_pages(xdr, pages, base, len);
+       xdr->page_ptr = rqstp->rq_next_page - 1;
+}
+
+/**
  * svcxdr_set_auth_slack -
  * @rqstp: RPC transaction
  * @slack: buffer space to reserve for the transaction's security flavor
index fbc4bd4..a5ee0af 100644 (file)
@@ -135,7 +135,6 @@ struct svc_rdma_recv_ctxt {
        struct ib_sge           rc_recv_sge;
        void                    *rc_recv_buf;
        struct xdr_stream       rc_stream;
-       bool                    rc_temp;
        u32                     rc_byte_len;
        unsigned int            rc_page_count;
        u32                     rc_inv_rkey;
@@ -155,12 +154,12 @@ struct svc_rdma_send_ctxt {
 
        struct ib_send_wr       sc_send_wr;
        struct ib_cqe           sc_cqe;
-       struct completion       sc_done;
        struct xdr_buf          sc_hdrbuf;
        struct xdr_stream       sc_stream;
        void                    *sc_xprt_buf;
+       int                     sc_page_count;
        int                     sc_cur_sge_no;
-
+       struct page             *sc_pages[RPCSVC_MAXPAGES];
        struct ib_sge           sc_sges[];
 };
 
index 72014c9..f89ec4b 100644 (file)
@@ -242,8 +242,7 @@ extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf,
 extern void xdr_init_encode_pages(struct xdr_stream *xdr, struct xdr_buf *buf,
                           struct page **pages, struct rpc_rqst *rqst);
 extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes);
-extern int xdr_reserve_space_vec(struct xdr_stream *xdr, struct kvec *vec,
-               size_t nbytes);
+extern int xdr_reserve_space_vec(struct xdr_stream *xdr, size_t nbytes);
 extern void __xdr_commit_encode(struct xdr_stream *xdr);
 extern void xdr_truncate_encode(struct xdr_stream *xdr, size_t len);
 extern void xdr_truncate_decode(struct xdr_stream *xdr, size_t len);
index d0d4598..ef50308 100644 (file)
@@ -202,6 +202,7 @@ struct platform_s2idle_ops {
 };
 
 #ifdef CONFIG_SUSPEND
+extern suspend_state_t pm_suspend_target_state;
 extern suspend_state_t mem_sleep_current;
 extern suspend_state_t mem_sleep_default;
 
@@ -337,6 +338,8 @@ extern bool sync_on_suspend_enabled;
 #else /* !CONFIG_SUSPEND */
 #define suspend_valid_only_mem NULL
 
+#define pm_suspend_target_state        (PM_SUSPEND_ON)
+
 static inline void pm_suspend_clear_flags(void) {}
 static inline void pm_set_suspend_via_firmware(void) {}
 static inline void pm_set_resume_via_firmware(void) {}
@@ -364,9 +367,6 @@ struct pbe {
        struct pbe *next;
 };
 
-/* mm/page_alloc.c */
-extern void mark_free_pages(struct zone *zone);
-
 /**
  * struct platform_hibernation_ops - hibernation platform support
  *
@@ -452,6 +452,10 @@ extern struct pbe *restore_pblist;
 int pfn_is_nosave(unsigned long pfn);
 
 int hibernate_quiet_exec(int (*func)(void *data), void *data);
+int hibernate_resume_nonboot_cpu_disable(void);
+int arch_hibernation_header_save(void *addr, unsigned int max_size);
+int arch_hibernation_header_restore(void *addr);
+
 #else /* CONFIG_HIBERNATION */
 static inline void register_nosave_region(unsigned long b, unsigned long e) {}
 static inline int swsusp_page_is_forbidden(struct page *p) { return 0; }
@@ -468,6 +472,8 @@ static inline int hibernate_quiet_exec(int (*func)(void *data), void *data) {
 }
 #endif /* CONFIG_HIBERNATION */
 
+int arch_resume_nosmt(void);
+
 #ifdef CONFIG_HIBERNATION_SNAPSHOT_DEV
 int is_hibernate_resume_dev(dev_t dev);
 #else
@@ -503,7 +509,11 @@ extern void pm_report_max_hw_sleep(u64 t);
 
 /* drivers/base/power/wakeup.c */
 extern bool events_check_enabled;
-extern suspend_state_t pm_suspend_target_state;
+
+static inline bool pm_suspended_storage(void)
+{
+       return !gfp_has_io_fs(gfp_allowed_mask);
+}
 
 extern bool pm_wakeup_pending(void);
 extern void pm_system_wakeup(void);
@@ -538,6 +548,7 @@ static inline void ksys_sync_helper(void) {}
 
 #define pm_notifier(fn, pri)   do { (void)(fn); } while (0)
 
+static inline bool pm_suspended_storage(void) { return false; }
 static inline bool pm_wakeup_pending(void) { return false; }
 static inline void pm_system_wakeup(void) {}
 static inline void pm_wakeup_clear(bool reset) {}
@@ -551,6 +562,7 @@ static inline void unlock_system_sleep(unsigned int flags) {}
 #ifdef CONFIG_PM_SLEEP_DEBUG
 extern bool pm_print_times_enabled;
 extern bool pm_debug_messages_on;
+extern bool pm_debug_messages_should_print(void);
 static inline int pm_dyn_debug_messages_on(void)
 {
 #ifdef CONFIG_DYNAMIC_DEBUG
@@ -564,14 +576,14 @@ static inline int pm_dyn_debug_messages_on(void)
 #endif
 #define __pm_pr_dbg(fmt, ...)                                  \
        do {                                                    \
-               if (pm_debug_messages_on)                       \
+               if (pm_debug_messages_should_print())           \
                        printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__);  \
                else if (pm_dyn_debug_messages_on())            \
                        pr_debug(fmt, ##__VA_ARGS__);   \
        } while (0)
 #define __pm_deferred_pr_dbg(fmt, ...)                         \
        do {                                                    \
-               if (pm_debug_messages_on)                       \
+               if (pm_debug_messages_should_print())           \
                        printk_deferred(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__); \
        } while (0)
 #else
@@ -589,7 +601,8 @@ static inline int pm_dyn_debug_messages_on(void)
 /**
  * pm_pr_dbg - print pm sleep debug messages
  *
- * If pm_debug_messages_on is enabled, print message.
+ * If pm_debug_messages_on is enabled and the system is entering/leaving
+ *      suspend, print message.
  * If pm_debug_messages_on is disabled and CONFIG_DYNAMIC_DEBUG is enabled,
  *     print message only from instances explicitly enabled on dynamic debug's
  *     control.
index 3c69cb6..4565464 100644 (file)
@@ -337,25 +337,6 @@ struct swap_info_struct {
                                           */
 };
 
-#ifdef CONFIG_64BIT
-#define SWAP_RA_ORDER_CEILING  5
-#else
-/* Avoid stack overflow, because we need to save part of page table */
-#define SWAP_RA_ORDER_CEILING  3
-#define SWAP_RA_PTE_CACHE_SIZE (1 << SWAP_RA_ORDER_CEILING)
-#endif
-
-struct vma_swap_readahead {
-       unsigned short win;
-       unsigned short offset;
-       unsigned short nr_pte;
-#ifdef CONFIG_64BIT
-       pte_t *ptes;
-#else
-       pte_t ptes[SWAP_RA_PTE_CACHE_SIZE];
-#endif
-};
-
 static inline swp_entry_t folio_swap_entry(struct folio *folio)
 {
        swp_entry_t entry = { .val = page_private(&folio->page) };
@@ -368,6 +349,7 @@ static inline void folio_set_swap_entry(struct folio *folio, swp_entry_t entry)
 }
 
 /* linux/mm/workingset.c */
+bool workingset_test_recent(void *shadow, bool file, bool *workingset);
 void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg);
 void workingset_refault(struct folio *folio, void *shadow);
@@ -457,10 +439,9 @@ static inline bool node_reclaim_enabled(void)
 }
 
 void check_move_unevictable_folios(struct folio_batch *fbatch);
-void check_move_unevictable_pages(struct pagevec *pvec);
 
-extern void kswapd_run(int nid);
-extern void kswapd_stop(int nid);
+extern void __meminit kswapd_run(int nid);
+extern void __meminit kswapd_stop(int nid);
 
 #ifdef CONFIG_SWAP
 
@@ -512,7 +493,7 @@ int find_first_swap(dev_t *device);
 extern unsigned int count_swap_pages(int, int);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int __swap_count(swp_entry_t entry);
-extern int __swp_swapcount(swp_entry_t entry);
+extern int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
@@ -590,7 +571,7 @@ static inline int __swap_count(swp_entry_t entry)
        return 0;
 }
 
-static inline int __swp_swapcount(swp_entry_t entry)
+static inline int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 {
        return 0;
 }
index 3a451b7..4c932cb 100644 (file)
@@ -332,15 +332,9 @@ static inline bool is_migration_entry_dirty(swp_entry_t entry)
        return false;
 }
 
-extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
-                                       spinlock_t *ptl);
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
                                        unsigned long address);
-#ifdef CONFIG_HUGETLB_PAGE
-extern void __migration_entry_wait_huge(struct vm_area_struct *vma,
-                                       pte_t *ptep, spinlock_t *ptl);
 extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
-#endif /* CONFIG_HUGETLB_PAGE */
 #else  /* CONFIG_MIGRATION */
 static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
@@ -362,15 +356,10 @@ static inline int is_migration_entry(swp_entry_t swp)
        return 0;
 }
 
-static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
-                                       spinlock_t *ptl) { }
 static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
-                                        unsigned long address) { }
-#ifdef CONFIG_HUGETLB_PAGE
-static inline void __migration_entry_wait_huge(struct vm_area_struct *vma,
-                                              pte_t *ptep, spinlock_t *ptl) { }
-static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
-#endif /* CONFIG_HUGETLB_PAGE */
+                                       unsigned long address) { }
+static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
+                                       pte_t *pte) { }
 static inline int is_writable_migration_entry(swp_entry_t entry)
 {
        return 0;
index 33a0ee3..d18ce14 100644 (file)
@@ -72,6 +72,8 @@ struct open_how;
 struct mount_attr;
 struct landlock_ruleset_attr;
 enum landlock_rule_type;
+struct cachestat_range;
+struct cachestat;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -1058,6 +1060,9 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
                                            unsigned long home_node,
                                            unsigned long flags);
+asmlinkage long sys_cachestat(unsigned int fd,
+               struct cachestat_range __user *cstat_range,
+               struct cachestat __user *cstat, unsigned int flags);
 
 /*
  * Architecture-specific system calls
@@ -1280,6 +1285,7 @@ asmlinkage long sys_ni_syscall(void);
 
 #endif /* CONFIG_ARCH_HAS_SYSCALL_WRAPPER */
 
+asmlinkage long sys_ni_posix_timers(void);
 
 /*
  * Kernel code should not call syscalls (i.e., sys_xyzyyz()) directly.
index 3d08277..59d451f 100644 (file)
@@ -89,7 +89,7 @@ int proc_do_static_key(struct ctl_table *table, int write, void *buffer,
                size_t *lenp, loff_t *ppos);
 
 /*
- * Register a set of sysctl names by calling register_sysctl_table
+ * Register a set of sysctl names by calling register_sysctl
  * with an initialised array of struct ctl_table's.  An entry with 
  * NULL procname terminates the table.  table->de will be
  * set up by the registration and need not be initialised in advance.
@@ -137,7 +137,17 @@ struct ctl_table {
        void *data;
        int maxlen;
        umode_t mode;
-       struct ctl_table *child;        /* Deprecated */
+       /**
+        * enum type - Enumeration to differentiate between ctl target types
+        * @SYSCTL_TABLE_TYPE_DEFAULT: ctl target with no special considerations
+        * @SYSCTL_TABLE_TYPE_PERMANENTLY_EMPTY: Used to identify a permanently
+        *                                       empty directory target to serve
+        *                                       as a mount point.
+        */
+       enum {
+               SYSCTL_TABLE_TYPE_DEFAULT,
+               SYSCTL_TABLE_TYPE_PERMANENTLY_EMPTY
+       } type;
        proc_handler *proc_handler;     /* Callback for text formatting */
        struct ctl_table_poll *poll;
        void *extra1;
@@ -197,20 +207,6 @@ struct ctl_path {
 
 #ifdef CONFIG_SYSCTL
 
-#define DECLARE_SYSCTL_BASE(_name, _table)                             \
-static struct ctl_table _name##_base_table[] = {                       \
-       {                                                               \
-               .procname       = #_name,                               \
-               .mode           = 0555,                                 \
-               .child          = _table,                               \
-       },                                                              \
-       { },                                                            \
-}
-
-extern int __register_sysctl_base(struct ctl_table *base_table);
-
-#define register_sysctl_base(_name) __register_sysctl_base(_name##_base_table)
-
 void proc_sys_poll_notify(struct ctl_table_poll *poll);
 
 extern void setup_sysctl_set(struct ctl_table_set *p,
@@ -222,7 +218,6 @@ struct ctl_table_header *__register_sysctl_table(
        struct ctl_table_set *set,
        const char *path, struct ctl_table *table);
 struct ctl_table_header *register_sysctl(const char *path, struct ctl_table *table);
-struct ctl_table_header *register_sysctl_table(struct ctl_table * table);
 void unregister_sysctl_table(struct ctl_table_header * table);
 
 extern int sysctl_init_bases(void);
@@ -244,24 +239,10 @@ extern int unaligned_enabled;
 extern int unaligned_dump_stack;
 extern int no_unaligned_warning;
 
-extern struct ctl_table sysctl_mount_point[];
+#define SYSCTL_PERM_EMPTY_DIR  (1 << 0)
 
 #else /* CONFIG_SYSCTL */
 
-#define DECLARE_SYSCTL_BASE(_name, _table)
-
-static inline int __register_sysctl_base(struct ctl_table *base_table)
-{
-       return 0;
-}
-
-#define register_sysctl_base(table) __register_sysctl_base(table)
-
-static inline struct ctl_table_header *register_sysctl_table(struct ctl_table * table)
-{
-       return NULL;
-}
-
 static inline void register_sysctl_init(const char *path, struct ctl_table *table)
 {
 }
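With register_sysctl_table() and the base-table helpers gone, new code registers tables directly with register_sysctl(). A minimal sketch (the path, value, and handler are illustrative assumptions):

	static int example_value;

	static struct ctl_table example_table[] = {
		{
			.procname	= "example_value",
			.data		= &example_value,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec,
		},
		{ }	/* NULL procname terminates the table */
	};

	static struct ctl_table_header *example_header;

	static int __init example_sysctl_init(void)
	{
		example_header = register_sysctl("kernel/example", example_table);
		return example_header ? 0 : -ENOMEM;
	}
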
index c026468..9ea0b28 100644 (file)
@@ -256,6 +256,11 @@ check_copy_size(const void *addr, size_t bytes, bool is_source)
 static inline void arch_setup_new_exec(void) { }
 #endif
 
+void arch_task_cache_init(void); /* for CONFIG_SH */
+void arch_release_task_struct(struct task_struct *tsk);
+int arch_dup_task_struct(struct task_struct *dst,
+                               struct task_struct *src);
+
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_THREAD_INFO_H */
index bb9d3f5..03d9c5a 100644 (file)
@@ -44,7 +44,6 @@ struct time_namespace *copy_time_ns(unsigned long flags,
                                    struct time_namespace *old_ns);
 void free_time_ns(struct time_namespace *ns);
 void timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk);
-struct vdso_data *arch_get_vdso_data(void *vvar_page);
 struct page *find_timens_vvar_page(struct vm_area_struct *vma);
 
 static inline void put_time_ns(struct time_namespace *ns)
@@ -163,4 +162,6 @@ static inline ktime_t timens_ktime_to_host(clockid_t clockid, ktime_t tim)
 }
 #endif
 
+struct vdso_data *arch_get_vdso_data(void *vvar_page);
+
 #endif /* _LINUX_TIMENS_H */
index 688fb94..253168b 100644 (file)
 #define DECLARE_BITMAP(name,bits) \
        unsigned long name[BITS_TO_LONGS(bits)]
 
+#ifdef __SIZEOF_INT128__
+typedef __s128 s128;
+typedef __u128 u128;
+#endif
+
 typedef u32 __kernel_dev_t;
 
 typedef __kernel_fd_set                fd_set;
@@ -35,6 +40,7 @@ typedef __kernel_uid16_t        uid16_t;
 typedef __kernel_gid16_t        gid16_t;
 
 typedef unsigned long          uintptr_t;
+typedef long                   intptr_t;
 
 #ifdef CONFIG_HAVE_UID16
 /* This is defined by include/asm-{arch}/posix_types.h */
index 0ccb983..ff81e5c 100644 (file)
@@ -11,7 +11,6 @@
 #include <uapi/linux/uio.h>
 
 struct page;
-struct pipe_inode_info;
 
 typedef unsigned int __bitwise iov_iter_extraction_t;
 
@@ -25,7 +24,6 @@ enum iter_type {
        ITER_IOVEC,
        ITER_KVEC,
        ITER_BVEC,
-       ITER_PIPE,
        ITER_XARRAY,
        ITER_DISCARD,
        ITER_UBUF,
@@ -74,7 +72,6 @@ struct iov_iter {
                                const struct kvec *kvec;
                                const struct bio_vec *bvec;
                                struct xarray *xarray;
-                               struct pipe_inode_info *pipe;
                                void __user *ubuf;
                        };
                        size_t count;
@@ -82,10 +79,6 @@ struct iov_iter {
        };
        union {
                unsigned long nr_segs;
-               struct {
-                       unsigned int head;
-                       unsigned int start_head;
-               };
                loff_t xarray_start;
        };
 };
@@ -133,11 +126,6 @@ static inline bool iov_iter_is_bvec(const struct iov_iter *i)
        return iov_iter_type(i) == ITER_BVEC;
 }
 
-static inline bool iov_iter_is_pipe(const struct iov_iter *i)
-{
-       return iov_iter_type(i) == ITER_PIPE;
-}
-
 static inline bool iov_iter_is_discard(const struct iov_iter *i)
 {
        return iov_iter_type(i) == ITER_DISCARD;
@@ -286,19 +274,11 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction, const struct kvec
                        unsigned long nr_segs, size_t count);
 void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_vec *bvec,
                        unsigned long nr_segs, size_t count);
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode_info *pipe,
-                       size_t count);
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
                     loff_t start, size_t count);
-ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
-               size_t maxsize, unsigned maxpages, size_t *start,
-               iov_iter_extraction_t extraction_flags);
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
                        size_t maxsize, unsigned maxpages, size_t *start);
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
-               struct page ***pages, size_t maxsize, size_t *start,
-               iov_iter_extraction_t extraction_flags);
 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
                        size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
index 5d1f612..daa6a70 100644 (file)
@@ -42,8 +42,6 @@ call_usermodehelper_setup(const char *path, char **argv, char **envp,
 extern int
 call_usermodehelper_exec(struct subprocess_info *info, int wait);
 
-extern struct ctl_table usermodehelper_table[];
-
 enum umh_disable_depth {
        UMH_ENABLED = 0,
        UMH_FREEZING,
index d78b015..ac7b0c9 100644 (file)
@@ -188,8 +188,8 @@ extern bool userfaultfd_remove(struct vm_area_struct *vma,
                               unsigned long start,
                               unsigned long end);
 
-extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
-                                 unsigned long end, struct list_head *uf);
+extern int userfaultfd_unmap_prep(struct vm_area_struct *vma,
+               unsigned long start, unsigned long end, struct list_head *uf);
 extern void userfaultfd_unmap_complete(struct mm_struct *mm,
                                       struct list_head *uf);
 extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma);
@@ -271,7 +271,7 @@ static inline bool userfaultfd_remove(struct vm_area_struct *vma,
        return true;
 }
 
-static inline int userfaultfd_unmap_prep(struct mm_struct *mm,
+static inline int userfaultfd_unmap_prep(struct vm_area_struct *vma,
                                         unsigned long start, unsigned long end,
                                         struct list_head *uf)
 {
index fc6bba2..45cd42f 100644 (file)
@@ -38,7 +38,7 @@ struct watch_filter {
 struct watch_queue {
        struct rcu_head         rcu;
        struct watch_filter __rcu *filter;
-       struct pipe_inode_info  *pipe;          /* The pipe we're using as a buffer */
+       struct pipe_inode_info  *pipe;          /* Pipe we use as a buffer, NULL if queue closed */
        struct hlist_head       watches;        /* Contributory watches */
        struct page             **notes;        /* Preallocated notifications */
        unsigned long           *notes_bitmap;  /* Allocation bitmap for notes */
@@ -46,7 +46,6 @@ struct watch_queue {
        spinlock_t              lock;
        unsigned int            nr_notes;       /* Number of notes */
        unsigned int            nr_pages;       /* Number of pages in notes[] */
-       bool                    defunct;        /* T when queues closed */
 };
 
 /*
index 3992c99..683efe2 100644 (file)
@@ -68,7 +68,6 @@ enum {
        WORK_OFFQ_FLAG_BASE     = WORK_STRUCT_COLOR_SHIFT,
 
        __WORK_OFFQ_CANCELING   = WORK_OFFQ_FLAG_BASE,
-       WORK_OFFQ_CANCELING     = (1 << __WORK_OFFQ_CANCELING),
 
        /*
         * When a work item is off queue, its high bits point to the last
@@ -79,12 +78,6 @@ enum {
        WORK_OFFQ_POOL_SHIFT    = WORK_OFFQ_FLAG_BASE + WORK_OFFQ_FLAG_BITS,
        WORK_OFFQ_LEFT          = BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT,
        WORK_OFFQ_POOL_BITS     = WORK_OFFQ_LEFT <= 31 ? WORK_OFFQ_LEFT : 31,
-       WORK_OFFQ_POOL_NONE     = (1LU << WORK_OFFQ_POOL_BITS) - 1,
-
-       /* convenience constants */
-       WORK_STRUCT_FLAG_MASK   = (1UL << WORK_STRUCT_FLAG_BITS) - 1,
-       WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
-       WORK_STRUCT_NO_POOL     = (unsigned long)WORK_OFFQ_POOL_NONE << WORK_OFFQ_POOL_SHIFT,
 
        /* bit mask for work_busy() return values */
        WORK_BUSY_PENDING       = 1 << 0,
@@ -94,6 +87,14 @@ enum {
        WORKER_DESC_LEN         = 24,
 };
 
+/* Convenience constants - of type 'unsigned long', not 'enum'! */
+#define WORK_OFFQ_CANCELING    (1ul << __WORK_OFFQ_CANCELING)
+#define WORK_OFFQ_POOL_NONE    ((1ul << WORK_OFFQ_POOL_BITS) - 1)
+#define WORK_STRUCT_NO_POOL    (WORK_OFFQ_POOL_NONE << WORK_OFFQ_POOL_SHIFT)
+
+#define WORK_STRUCT_FLAG_MASK    ((1ul << WORK_STRUCT_FLAG_BITS) - 1)
+#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
+
 struct work_struct {
        atomic_long_t data;
        struct list_head entry;
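The switch from enum members to #defines matters because enumerators have type int: a mask such as WORK_STRUCT_WQ_DATA_MASK must be a full-width unsigned long, since the high bits of work->data carry a pointer. A sketch of the kind of use this enables (not taken from the patch):

	unsigned long data = atomic_long_read(&work->data);
	/* Flag bits live at the bottom; everything above is pool/wq data. */
	void *wq_data = (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
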
index e899701..3296438 100644 (file)
 
 struct zpool;
 
-struct zpool_ops {
-       int (*evict)(struct zpool *pool, unsigned long handle);
-};
-
 /*
  * Control how a handle is mapped.  It will be ignored if the
  * implementation does not support it.  Its use is optional.
@@ -39,8 +35,7 @@ enum zpool_mapmode {
 
 bool zpool_has_pool(char *type);
 
-struct zpool *zpool_create_pool(const char *type, const char *name,
-                       gfp_t gfp, const struct zpool_ops *ops);
+struct zpool *zpool_create_pool(const char *type, const char *name, gfp_t gfp);
 
 const char *zpool_get_type(struct zpool *pool);
 
@@ -53,9 +48,6 @@ int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
 
 void zpool_free(struct zpool *pool, unsigned long handle);
 
-int zpool_shrink(struct zpool *pool, unsigned int pages,
-                       unsigned int *reclaimed);
-
 void *zpool_map_handle(struct zpool *pool, unsigned long handle,
                        enum zpool_mapmode mm);
 
@@ -72,7 +64,6 @@ u64 zpool_get_total_size(struct zpool *pool);
  * @destroy:   destroy a pool.
  * @malloc:    allocate mem from a pool.
  * @free:      free mem from a pool.
- * @shrink:    shrink the pool.
  * @sleep_mapped: whether zpool driver can sleep during map.
  * @map:       map a handle.
  * @unmap:     unmap a handle.
@@ -87,10 +78,7 @@ struct zpool_driver {
        atomic_t refcount;
        struct list_head list;
 
-       void *(*create)(const char *name,
-                       gfp_t gfp,
-                       const struct zpool_ops *ops,
-                       struct zpool *zpool);
+       void *(*create)(const char *name, gfp_t gfp);
        void (*destroy)(void *pool);
 
        bool malloc_support_movable;
@@ -98,9 +86,6 @@ struct zpool_driver {
                                unsigned long *handle);
        void (*free)(void *pool, unsigned long handle);
 
-       int (*shrink)(void *pool, unsigned int pages,
-                               unsigned int *reclaimed);
-
        bool sleep_mapped;
        void *(*map)(void *pool, unsigned long handle,
                                enum zpool_mapmode mm);
@@ -113,7 +98,6 @@ void zpool_register_driver(struct zpool_driver *driver);
 
 int zpool_unregister_driver(struct zpool_driver *driver);
 
-bool zpool_evictable(struct zpool *pool);
 bool zpool_can_sleep_mapped(struct zpool *pool);
 
 #endif
index beac64e..a207c07 100644 (file)
@@ -45,11 +45,11 @@ typedef struct scsi_fctargaddress {
 
 int scsi_ioctl_block_when_processing_errors(struct scsi_device *sdev,
                int cmd, bool ndelay);
-int scsi_ioctl(struct scsi_device *sdev, fmode_t mode, int cmd,
+int scsi_ioctl(struct scsi_device *sdev, bool open_for_write, int cmd,
                void __user *arg);
 int get_sg_io_hdr(struct sg_io_hdr *hdr, const void __user *argp);
 int put_sg_io_hdr(const struct sg_io_hdr *hdr, void __user *argp);
-bool scsi_cmd_allowed(unsigned char *cmd, fmode_t mode);
+bool scsi_cmd_allowed(unsigned char *cmd, bool open_for_write);
 
 #endif /* __KERNEL__ */
 #endif /* _SCSI_IOCTL_H */
diff --git a/include/soc/imx/timer.h b/include/soc/imx/timer.h
deleted file mode 100644 (file)
index 25f29c6..0000000
+++ /dev/null
@@ -1,16 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright 2015 Linaro Ltd.
- */
-
-#ifndef __SOC_IMX_TIMER_H__
-#define __SOC_IMX_TIMER_H__
-
-enum imx_gpt_type {
-       GPT_TYPE_IMX1,          /* i.MX1 */
-       GPT_TYPE_IMX21,         /* i.MX21/27 */
-       GPT_TYPE_IMX31,         /* i.MX31/35/25/37/51/6Q */
-       GPT_TYPE_IMX6DL,        /* i.MX6DL/SX/SL */
-};
-
-#endif  /* __SOC_IMX_TIMER_H__ */
index 7f4dfbd..40e60c3 100644 (file)
@@ -246,6 +246,32 @@ DEFINE_EVENT(block_rq, block_rq_merge,
 );
 
 /**
+ * block_io_start - insert a request for execution
+ * @rq: block IO operation request
+ *
+ * Called when block operation request @rq is queued for execution
+ */
+DEFINE_EVENT(block_rq, block_io_start,
+
+       TP_PROTO(struct request *rq),
+
+       TP_ARGS(rq)
+);
+
+/**
+ * block_io_done - block IO operation request completed
+ * @rq: block IO operation request
+ *
+ * Called when block operation request @rq is completed
+ */
+DEFINE_EVENT(block_rq, block_io_done,
+
+       TP_PROTO(struct request *rq),
+
+       TP_ARGS(rq)
+);
+
+/**
  * block_bio_complete - completed all work on the block operation
  * @q: queue holding the block operation
  * @bio: block operation completed
index 8ea9cea..a8206f5 100644 (file)
@@ -661,6 +661,35 @@ DEFINE_EVENT(btrfs__ordered_extent, btrfs_ordered_extent_mark_finished,
             TP_ARGS(inode, ordered)
 );
 
+TRACE_EVENT(btrfs_finish_ordered_extent,
+
+       TP_PROTO(const struct btrfs_inode *inode, u64 start, u64 len,
+                bool uptodate),
+
+       TP_ARGS(inode, start, len, uptodate),
+
+       TP_STRUCT__entry_btrfs(
+               __field(        u64,     ino            )
+               __field(        u64,     start          )
+               __field(        u64,     len            )
+               __field(        bool,    uptodate       )
+               __field(        u64,     root_objectid  )
+       ),
+
+       TP_fast_assign_btrfs(inode->root->fs_info,
+               __entry->ino    = btrfs_ino(inode);
+               __entry->start  = start;
+               __entry->len    = len;
+               __entry->uptodate = uptodate;
+               __entry->root_objectid = inode->root->root_key.objectid;
+       ),
+
+       TP_printk_btrfs("root=%llu(%s) ino=%llu start=%llu len=%llu uptodate=%d",
+                 show_root_type(__entry->root_objectid),
+                 __entry->ino, __entry->start,
+                 __entry->len, !!__entry->uptodate)
+);
+
 DECLARE_EVENT_CLASS(btrfs__writepage,
 
        TP_PROTO(const struct page *page, const struct inode *inode,
@@ -1982,25 +2011,27 @@ DEFINE_EVENT(btrfs__prelim_ref, btrfs_prelim_ref_insert,
 );
 
 TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
-       TP_PROTO(const struct btrfs_root *root, u64 ino, int mod),
+       TP_PROTO(const struct btrfs_root *root, u64 ino, int mod, unsigned outstanding),
 
-       TP_ARGS(root, ino, mod),
+       TP_ARGS(root, ino, mod, outstanding),
 
        TP_STRUCT__entry_btrfs(
                __field(        u64, root_objectid      )
                __field(        u64, ino                )
                __field(        int, mod                )
+               __field(        unsigned, outstanding   )
        ),
 
        TP_fast_assign_btrfs(root->fs_info,
                __entry->root_objectid  = root->root_key.objectid;
                __entry->ino            = ino;
                __entry->mod            = mod;
+               __entry->outstanding    = outstanding;
        ),
 
-       TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%d",
+       TP_printk_btrfs("root=%llu(%s) ino=%llu mod=%d outstanding=%u",
                        show_root_type(__entry->root_objectid),
-                       __entry->ino, __entry->mod)
+                       __entry->ino, __entry->mod, __entry->outstanding)
 );
 
 DECLARE_EVENT_CLASS(btrfs__block_group,
index 3313eb8..2b2a975 100644 (file)
@@ -64,6 +64,17 @@ DEFINE_EVENT(mm_compaction_isolate_template, mm_compaction_isolate_freepages,
        TP_ARGS(start_pfn, end_pfn, nr_scanned, nr_taken)
 );
 
+DEFINE_EVENT(mm_compaction_isolate_template, mm_compaction_fast_isolate_freepages,
+
+       TP_PROTO(
+               unsigned long start_pfn,
+               unsigned long end_pfn,
+               unsigned long nr_scanned,
+               unsigned long nr_taken),
+
+       TP_ARGS(start_pfn, end_pfn, nr_scanned, nr_taken)
+);
+
 #ifdef CONFIG_COMPACTION
 TRACE_EVENT(mm_compaction_migratepages,
 
diff --git a/include/trace/events/csd.h b/include/trace/events/csd.h
new file mode 100644 (file)
index 0000000..67e9d01
--- /dev/null
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM csd
+
+#if !defined(_TRACE_CSD_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_CSD_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(csd_queue_cpu,
+
+       TP_PROTO(const unsigned int cpu,
+               unsigned long callsite,
+               smp_call_func_t func,
+               struct __call_single_data *csd),
+
+       TP_ARGS(cpu, callsite, func, csd),
+
+       TP_STRUCT__entry(
+               __field(unsigned int, cpu)
+               __field(void *, callsite)
+               __field(void *, func)
+               __field(void *, csd)
+	),
+
+	TP_fast_assign(
+		__entry->cpu = cpu;
+		__entry->callsite = (void *)callsite;
+		__entry->func = func;
+		__entry->csd = csd;
+	),
+
+       TP_printk("cpu=%u callsite=%pS func=%ps csd=%p",
+               __entry->cpu, __entry->callsite, __entry->func, __entry->csd)
+);
+
+/*
+ * Tracepoints for a function which is called as an effect of smp_call_function.*
+ */
+DECLARE_EVENT_CLASS(csd_function,
+
+       TP_PROTO(smp_call_func_t func, struct __call_single_data *csd),
+
+       TP_ARGS(func, csd),
+
+       TP_STRUCT__entry(
+               __field(void *, func)
+               __field(void *, csd)
+       ),
+
+       TP_fast_assign(
+               __entry->func   = func;
+               __entry->csd    = csd;
+       ),
+
+       TP_printk("func=%ps, csd=%p", __entry->func, __entry->csd)
+);
+
+DEFINE_EVENT(csd_function, csd_function_entry,
+       TP_PROTO(smp_call_func_t func, struct __call_single_data *csd),
+       TP_ARGS(func, csd)
+);
+
+DEFINE_EVENT(csd_function, csd_function_exit,
+       TP_PROTO(smp_call_func_t func, struct __call_single_data *csd),
+       TP_ARGS(func, csd)
+);
+
+#endif /* _TRACE_CSD_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
index b63e7c0..1478b9d 100644 (file)
@@ -223,8 +223,8 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,  "softdirty"     )               \
 #define compact_result_to_feedback(result)     \
 ({                                             \
        enum compact_result __result = result;  \
-       (compaction_failed(__result)) ? COMPACTION_FAILED : \
-               (compaction_withdrawn(__result)) ? COMPACTION_WITHDRAWN : COMPACTION_PROGRESS; \
+       (__result == COMPACT_COMPLETE) ? COMPACTION_FAILED : \
+               (__result == COMPACT_SUCCESS) ? COMPACTION_PROGRESS : COMPACTION_WITHDRAWN; \
 })
 
 #define COMPACTION_FEEDBACK            \
index 8f461e0..f8069ef 100644 (file)
@@ -2112,6 +2112,14 @@ DEFINE_POST_CHUNK_EVENT(read);
 DEFINE_POST_CHUNK_EVENT(write);
 DEFINE_POST_CHUNK_EVENT(reply);
 
+DEFINE_EVENT(svcrdma_post_chunk_class, svcrdma_cc_release,
+       TP_PROTO(
+               const struct rpc_rdma_cid *cid,
+               int sqecount
+       ),
+       TP_ARGS(cid, sqecount)
+);
+
 TRACE_EVENT(svcrdma_wc_read,
        TP_PROTO(
                const struct ib_wc *wc,
index 31bc702..69e42ef 100644 (file)
@@ -2104,31 +2104,46 @@ DEFINE_SVC_DEFERRED_EVENT(drop);
 DEFINE_SVC_DEFERRED_EVENT(queue);
 DEFINE_SVC_DEFERRED_EVENT(recv);
 
-TRACE_EVENT(svcsock_new_socket,
+DECLARE_EVENT_CLASS(svcsock_lifetime_class,
        TP_PROTO(
+               const void *svsk,
                const struct socket *socket
        ),
-
-       TP_ARGS(socket),
-
+       TP_ARGS(svsk, socket),
        TP_STRUCT__entry(
+               __field(unsigned int, netns_ino)
+               __field(const void *, svsk)
+               __field(const void *, sk)
                __field(unsigned long, type)
                __field(unsigned long, family)
-               __field(bool, listener)
+               __field(unsigned long, state)
        ),
-
        TP_fast_assign(
+               struct sock *sk = socket->sk;
+
+               __entry->netns_ino = sock_net(sk)->ns.inum;
+               __entry->svsk = svsk;
+               __entry->sk = sk;
                __entry->type = socket->type;
-               __entry->family = socket->sk->sk_family;
-               __entry->listener = (socket->sk->sk_state == TCP_LISTEN);
+               __entry->family = sk->sk_family;
+               __entry->state = sk->sk_state;
        ),
-
-       TP_printk("type=%s family=%s%s",
-               show_socket_type(__entry->type),
+       TP_printk("svsk=%p type=%s family=%s%s",
+               __entry->svsk, show_socket_type(__entry->type),
                rpc_show_address_family(__entry->family),
-               __entry->listener ? " (listener)" : ""
+               __entry->state == TCP_LISTEN ? " (listener)" : ""
        )
 );
+#define DEFINE_SVCSOCK_LIFETIME_EVENT(name) \
+       DEFINE_EVENT(svcsock_lifetime_class, name, \
+               TP_PROTO( \
+                       const void *svsk, \
+                       const struct socket *socket \
+               ), \
+               TP_ARGS(svsk, socket))
+
+DEFINE_SVCSOCK_LIFETIME_EVENT(svcsock_new);
+DEFINE_SVCSOCK_LIFETIME_EVENT(svcsock_free);
 
 TRACE_EVENT(svcsock_marker,
        TP_PROTO(
index 3e8619c..b4bc282 100644 (file)
@@ -158,7 +158,11 @@ DEFINE_EVENT(timer_class, timer_cancel,
                { HRTIMER_MODE_ABS_SOFT,        "ABS|SOFT"      },      \
                { HRTIMER_MODE_REL_SOFT,        "REL|SOFT"      },      \
                { HRTIMER_MODE_ABS_PINNED_SOFT, "ABS|PINNED|SOFT" },    \
-               { HRTIMER_MODE_REL_PINNED_SOFT, "REL|PINNED|SOFT" })
+               { HRTIMER_MODE_REL_PINNED_SOFT, "REL|PINNED|SOFT" },    \
+               { HRTIMER_MODE_ABS_HARD,        "ABS|HARD" },           \
+               { HRTIMER_MODE_REL_HARD,        "REL|HARD" },           \
+               { HRTIMER_MODE_ABS_PINNED_HARD, "ABS|PINNED|HARD" },    \
+               { HRTIMER_MODE_REL_PINNED_HARD, "REL|PINNED|HARD" })
 
 /**
  * hrtimer_init - called when the hrtimer is initialized
index 45fa180..cd639fa 100644 (file)
@@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#define __NR_cachestat 451
+__SYSCALL(__NR_cachestat, sys_cachestat)
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
index 5e2fb84..a5aff2e 100644 (file)
@@ -7,42 +7,42 @@
 /* Just the needed definitions for the RDB of an Amiga HD. */
 
 struct RigidDiskBlock {
-       __u32   rdb_ID;
+       __be32  rdb_ID;
        __be32  rdb_SummedLongs;
-       __s32   rdb_ChkSum;
-       __u32   rdb_HostID;
+       __be32  rdb_ChkSum;
+       __be32  rdb_HostID;
        __be32  rdb_BlockBytes;
-       __u32   rdb_Flags;
-       __u32   rdb_BadBlockList;
+       __be32  rdb_Flags;
+       __be32  rdb_BadBlockList;
        __be32  rdb_PartitionList;
-       __u32   rdb_FileSysHeaderList;
-       __u32   rdb_DriveInit;
-       __u32   rdb_Reserved1[6];
-       __u32   rdb_Cylinders;
-       __u32   rdb_Sectors;
-       __u32   rdb_Heads;
-       __u32   rdb_Interleave;
-       __u32   rdb_Park;
-       __u32   rdb_Reserved2[3];
-       __u32   rdb_WritePreComp;
-       __u32   rdb_ReducedWrite;
-       __u32   rdb_StepRate;
-       __u32   rdb_Reserved3[5];
-       __u32   rdb_RDBBlocksLo;
-       __u32   rdb_RDBBlocksHi;
-       __u32   rdb_LoCylinder;
-       __u32   rdb_HiCylinder;
-       __u32   rdb_CylBlocks;
-       __u32   rdb_AutoParkSeconds;
-       __u32   rdb_HighRDSKBlock;
-       __u32   rdb_Reserved4;
+       __be32  rdb_FileSysHeaderList;
+       __be32  rdb_DriveInit;
+       __be32  rdb_Reserved1[6];
+       __be32  rdb_Cylinders;
+       __be32  rdb_Sectors;
+       __be32  rdb_Heads;
+       __be32  rdb_Interleave;
+       __be32  rdb_Park;
+       __be32  rdb_Reserved2[3];
+       __be32  rdb_WritePreComp;
+       __be32  rdb_ReducedWrite;
+       __be32  rdb_StepRate;
+       __be32  rdb_Reserved3[5];
+       __be32  rdb_RDBBlocksLo;
+       __be32  rdb_RDBBlocksHi;
+       __be32  rdb_LoCylinder;
+       __be32  rdb_HiCylinder;
+       __be32  rdb_CylBlocks;
+       __be32  rdb_AutoParkSeconds;
+       __be32  rdb_HighRDSKBlock;
+       __be32  rdb_Reserved4;
        char    rdb_DiskVendor[8];
        char    rdb_DiskProduct[16];
        char    rdb_DiskRevision[4];
        char    rdb_ControllerVendor[8];
        char    rdb_ControllerProduct[16];
        char    rdb_ControllerRevision[4];
-       __u32   rdb_Reserved5[10];
+       __be32  rdb_Reserved5[10];
 };
 
 #define        IDNAME_RIGIDDISK        0x5244534B      /* "RDSK" */
@@ -50,16 +50,16 @@ struct RigidDiskBlock {
 struct PartitionBlock {
        __be32  pb_ID;
        __be32  pb_SummedLongs;
-       __s32   pb_ChkSum;
-       __u32   pb_HostID;
+       __be32  pb_ChkSum;
+       __be32  pb_HostID;
        __be32  pb_Next;
-       __u32   pb_Flags;
-       __u32   pb_Reserved1[2];
-       __u32   pb_DevFlags;
+       __be32  pb_Flags;
+       __be32  pb_Reserved1[2];
+       __be32  pb_DevFlags;
        __u8    pb_DriveName[32];
-       __u32   pb_Reserved2[15];
+       __be32  pb_Reserved2[15];
        __be32  pb_Environment[17];
-       __u32   pb_EReserved[15];
+       __be32  pb_EReserved[15];
 };
 
 #define        IDNAME_PARTITION        0x50415254      /* "PART" */
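With the on-disk fields now typed __be32, readers must byte-swap explicitly regardless of host endianness. A minimal sketch (the geometry arithmetic is illustrative):

	u32 cylinders = be32_to_cpu(rdb->rdb_Cylinders);
	u32 heads     = be32_to_cpu(rdb->rdb_Heads);
	u32 sectors   = be32_to_cpu(rdb->rdb_Sectors);
	u64 blocks    = (u64)cylinders * heads * sectors;
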
index 62e6253..08be539 100644 (file)
@@ -109,7 +109,7 @@ struct autofs_dev_ioctl {
                struct args_ismountpoint        ismountpoint;
        };
 
-       char path[0];
+       char path[];
 };
 
 static inline void init_autofs_dev_ioctl(struct autofs_dev_ioctl *in)
index 3d61a0a..5bb9060 100644 (file)
@@ -41,11 +41,12 @@ typedef struct __user_cap_header_struct {
        int pid;
 } __user *cap_user_header_t;
 
-typedef struct __user_cap_data_struct {
+struct __user_cap_data_struct {
         __u32 effective;
         __u32 permitted;
         __u32 inheritable;
-} __user *cap_user_data_t;
+};
+typedef struct __user_cap_data_struct __user *cap_user_data_t;
 
 
 #define VFS_CAP_REVISION_MASK  0xFF000000
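
With the structure split out from the pointer typedef, userspace can declare struct __user_cap_data_struct objects directly instead of only through cap_user_data_t. A hedged sketch of the usual capget(2) pattern (raw syscall; a _LINUX_CAPABILITY_VERSION_3 header takes _LINUX_CAPABILITY_U32S_3 == 2 data elements):

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/capability.h>

int main(void)
{
        struct __user_cap_header_struct hdr = {
                .version = _LINUX_CAPABILITY_VERSION_3,
                .pid = 0,                       /* 0 selects the calling thread */
        };
        /* VERSION_3 reports capabilities as two 32-bit words per set */
        struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3] = {};

        if (syscall(SYS_capget, &hdr, data))
                return 1;
        printf("effective caps: %08x%08x\n", data[1].effective, data[0].effective);
        return 0;
}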
index ac3da85..4d1c8d4 100644 (file)
@@ -372,7 +372,8 @@ typedef struct elf64_shdr {
  * Notes used in ET_CORE. Architectures export some of the arch register sets
  * using the corresponding note types via the PTRACE_GETREGSET and
  * PTRACE_SETREGSET requests.
- * The note name for all these is "LINUX".
+ * The note name for these types is "LINUX", except NT_PRFPREG, which is
+ * named "CORE".
  */
 #define NT_PRSTATUS    1
 #define NT_PRFPREG     2
diff --git a/include/uapi/linux/eventfd.h b/include/uapi/linux/eventfd.h
new file mode 100644 (file)
index 0000000..2eb9ab6
--- /dev/null
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_EVENTFD_H
+#define _UAPI_LINUX_EVENTFD_H
+
+#include <linux/fcntl.h>
+
+#define EFD_SEMAPHORE (1 << 0)
+#define EFD_CLOEXEC O_CLOEXEC
+#define EFD_NONBLOCK O_NONBLOCK
+
+#endif /* _UAPI_LINUX_EVENTFD_H */
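
These constants match the definitions glibc already exposes in <sys/eventfd.h>; the new UAPI header gives programs that build against the raw kernel headers a canonical source for them. A small usage sketch of the flags with the eventfd(2) wrapper:

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>        /* glibc wrapper with equivalent EFD_* values */

int main(void)
{
        /* EFD_SEMAPHORE: each read() returns 1 and decrements the counter */
        int efd = eventfd(0, EFD_SEMAPHORE | EFD_CLOEXEC | EFD_NONBLOCK);
        uint64_t val = 3;

        if (efd < 0)
                return 1;
        write(efd, &val, sizeof(val));  /* post three tokens */
        read(efd, &val, sizeof(val));   /* consumes one: val == 1, counter == 2 */
        printf("read %llu, two tokens left\n", (unsigned long long)val);
        close(efd);
        return 0;
}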
index 0716cb1..f222d26 100644 (file)
@@ -173,6 +173,18 @@ enum {
  */
 #define IORING_SETUP_DEFER_TASKRUN     (1U << 13)
 
+/*
+ * Application provides the memory for the rings
+ */
+#define IORING_SETUP_NO_MMAP           (1U << 14)
+
+/*
+ * Register the ring fd in itself for use with
+ * IORING_REGISTER_USE_REGISTERED_RING; return a registered fd index rather
+ * than an fd.
+ */
+#define IORING_SETUP_REGISTERED_FD_ONLY        (1U << 15)
+
 enum io_uring_op {
        IORING_OP_NOP,
        IORING_OP_READV,
@@ -406,7 +418,7 @@ struct io_sqring_offsets {
        __u32 dropped;
        __u32 array;
        __u32 resv1;
-       __u64 resv2;
+       __u64 user_addr;
 };
 
 /*
@@ -425,7 +437,7 @@ struct io_cqring_offsets {
        __u32 cqes;
        __u32 flags;
        __u32 resv1;
-       __u64 resv2;
+       __u64 user_addr;
 };
 
 /*
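
The resv2 fields becoming user_addr pairs with IORING_SETUP_NO_MMAP above: instead of mmap()ing the rings from the io_uring fd, the application allocates the memory itself and tells the kernel where it is. The sketch below assumes, based on the kernel side of this change, that the ring memory address is passed in cq_off.user_addr and the SQE array address in sq_off.user_addr before io_uring_setup(2) is called, and that the backing memory must be physically contiguous (hence a huge page); treat it as illustrative only.

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>     /* needs headers new enough for IORING_SETUP_NO_MMAP */

/* hedged sketch: create a ring backed by memory we allocate ourselves */
static int setup_no_mmap_ring(unsigned int entries)
{
        struct io_uring_params p;
        /* a 2 MB huge page is the simple way to get contiguous memory */
        void *ring_mem = mmap(NULL, 2 * 1024 * 1024, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        void *sqe_mem  = mmap(NULL, 2 * 1024 * 1024, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (ring_mem == MAP_FAILED || sqe_mem == MAP_FAILED)
                return -1;

        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_NO_MMAP;
        p.cq_off.user_addr = (uint64_t)(uintptr_t)ring_mem;    /* SQ/CQ rings */
        p.sq_off.user_addr = (uint64_t)(uintptr_t)sqe_mem;     /* SQE array   */

        return syscall(__NR_io_uring_setup, entries, &p);
}

If IORING_SETUP_REGISTERED_FD_ONLY is set as well, the value returned by io_uring_setup(2) is a registered ring index rather than an ordinary file descriptor, as noted in the flag's comment above.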
index f55bc68..a246e11 100644 (file)
@@ -4,6 +4,7 @@
 
 #include <asm/mman.h>
 #include <asm-generic/hugetlb_encode.h>
+#include <linux/types.h>
 
 #define MREMAP_MAYMOVE         1
 #define MREMAP_FIXED           2
 #define MAP_HUGE_2GB   HUGETLB_FLAG_ENCODE_2GB
 #define MAP_HUGE_16GB  HUGETLB_FLAG_ENCODE_16GB
 
+struct cachestat_range {
+       __u64 off;
+       __u64 len;
+};
+
+struct cachestat {
+       __u64 nr_cache;
+       __u64 nr_dirty;
+       __u64 nr_writeback;
+       __u64 nr_evicted;
+       __u64 nr_recently_evicted;
+};
+
 #endif /* _UAPI_LINUX_MMAN_H */
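
Together with syscall number 451 added to unistd.h above, these two structures are the whole userspace interface of the new call. A minimal sketch (there is no glibc wrapper at this point, so syscall(2) is used directly; the four-argument form cachestat(fd, range, stats, flags) matches the kernel definition, and a len of 0 is taken here to mean "to the end of the file"):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mman.h>         /* struct cachestat_range, struct cachestat */

#ifndef __NR_cachestat
#define __NR_cachestat 451
#endif

int main(int argc, char **argv)
{
        struct cachestat_range range = { .off = 0, .len = 0 }; /* whole file */
        struct cachestat cs;
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                return 1;
        if (syscall(__NR_cachestat, fd, &range, &cs, 0))
                return 1;
        printf("cached %llu dirty %llu writeback %llu evicted %llu recently evicted %llu\n",
               (unsigned long long)cs.nr_cache, (unsigned long long)cs.nr_dirty,
               (unsigned long long)cs.nr_writeback, (unsigned long long)cs.nr_evicted,
               (unsigned long long)cs.nr_recently_evicted);
        close(fd);
        return 0;
}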
index 4d93967..8eb0d7b 100644 (file)
@@ -74,7 +74,8 @@
 #define MOVE_MOUNT_T_AUTOMOUNTS                0x00000020 /* Follow automounts on to path */
 #define MOVE_MOUNT_T_EMPTY_PATH                0x00000040 /* Empty to path permitted */
 #define MOVE_MOUNT_SET_GROUP           0x00000100 /* Set sharing group instead */
-#define MOVE_MOUNT__MASK               0x00000177
+#define MOVE_MOUNT_BENEATH             0x00000200 /* Mount beneath top mount */
+#define MOVE_MOUNT__MASK               0x00000377
 
 /*
  * fsopen() flags.
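
MOVE_MOUNT_BENEATH makes move_mount(2) attach the new mount underneath the current top mount at the target instead of on top of it, which allows replacing a mount without a window where nothing is mounted there. A hedged sketch of the intended pattern (raw syscalls; fs_fd is assumed to be a detached mount obtained elsewhere, e.g. via fsmount() or open_tree(OPEN_TREE_CLONE), and the kernel applies additional sanity rules not shown here):

#include <fcntl.h>              /* AT_FDCWD */
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mount.h>        /* MOVE_MOUNT_* */

#ifndef MNT_DETACH
#define MNT_DETACH 2            /* from <sys/mount.h>, redefined to avoid header clashes */
#endif

/* hypothetical helper: swap what is mounted on @target without a gap */
static int replace_mount(int fs_fd, const char *target)
{
        /* first slide the new mount in *beneath* the existing top mount ... */
        if (syscall(__NR_move_mount, fs_fd, "", AT_FDCWD, target,
                    MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH) < 0)
                return -1;
        /* ... then lazily drop the old top mount to reveal the new one */
        return syscall(__NR_umount2, target, MNT_DETACH);
}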
index 6a5552d..987a302 100644 (file)
@@ -16,6 +16,7 @@
 #include <linux/types.h>
 
 /*
+ * UNUSED:
  * 1 for normal debug messages, 2 is very verbose. 0 to turn it off.
  */
 #define PACKET_DEBUG           1
index 9d5f580..ca56e47 100644 (file)
@@ -28,6 +28,7 @@
 #define        SPI_RX_OCTAL            _BITUL(14)      /* receive with 8 wires */
 #define        SPI_3WIRE_HIZ           _BITUL(15)      /* high impedance turnaround */
 #define        SPI_RX_CPHA_FLIP        _BITUL(16)      /* flip CPHA on Rx only xfer */
+#define SPI_MOSI_IDLE_LOW      _BITUL(17)      /* leave mosi line low when idle */
 
 /*
  * All the bits defined above should be covered by SPI_MODE_USER_MASK.
@@ -37,6 +38,6 @@
  * These bits must not overlap. A static assert check should make sure of that.
  * If adding extra bits, make sure to increase the bit index below as well.
  */
-#define SPI_MODE_USER_MASK     (_BITUL(17) - 1)
+#define SPI_MODE_USER_MASK     (_BITUL(18) - 1)
 
 #endif /* _UAPI_SPI_H */
index 308433b..6375a06 100644 (file)
 
 #include <linux/posix_types.h>
 
+#ifdef __SIZEOF_INT128__
+typedef __signed__ __int128 __s128 __attribute__((aligned(16)));
+typedef unsigned __int128 __u128 __attribute__((aligned(16)));
+#endif
 
 /*
  * Below are truly Linux-specific types that should never collide with
index 640bf68..4b8558d 100644 (file)
        _IOWR('u', UBLK_CMD_END_USER_RECOVERY, struct ublksrv_ctrl_cmd)
 #define UBLK_U_CMD_GET_DEV_INFO2       \
        _IOR('u', UBLK_CMD_GET_DEV_INFO2, struct ublksrv_ctrl_cmd)
+#define UBLK_U_CMD_GET_FEATURES        \
+       _IOR('u', 0x13, struct ublksrv_ctrl_cmd)
+
+/*
+ * 64 bits are enough for now, and it should be easy to extend in case we
+ * run out of feature flags
+ */
+#define UBLK_FEATURES_LEN  8
 
 /*
  * IO commands, issued by ublk server, and handled by ublk driver.
 #define UBLKSRV_CMD_BUF_OFFSET 0
 #define UBLKSRV_IO_BUF_OFFSET  0x80000000
 
-/* tag bit is 12bit, so at most 4096 IOs for each queue */
+/* the tag is 16 bits, but for now limit to at most 4096 IOs for each queue */
 #define UBLK_MAX_QUEUE_DEPTH   4096
 
+/* single IO buffer max size is 32MB */
+#define UBLK_IO_BUF_OFF                0
+#define UBLK_IO_BUF_BITS       25
+#define UBLK_IO_BUF_BITS_MASK  ((1ULL << UBLK_IO_BUF_BITS) - 1)
+
+/* so at most 64K IOs for each queue */
+#define UBLK_TAG_OFF           UBLK_IO_BUF_BITS
+#define UBLK_TAG_BITS          16
+#define UBLK_TAG_BITS_MASK     ((1ULL << UBLK_TAG_BITS) - 1)
+
+/* max 4096 queues */
+#define UBLK_QID_OFF           (UBLK_TAG_OFF + UBLK_TAG_BITS)
+#define UBLK_QID_BITS          12
+#define UBLK_QID_BITS_MASK     ((1ULL << UBLK_QID_BITS) - 1)
+
+#define UBLK_MAX_NR_QUEUES     (1U << UBLK_QID_BITS)
+
+#define UBLKSRV_IO_BUF_TOTAL_BITS      (UBLK_QID_OFF + UBLK_QID_BITS)
+#define UBLKSRV_IO_BUF_TOTAL_SIZE      (1ULL << UBLKSRV_IO_BUF_TOTAL_BITS)
+
 /*
  * zero copy requires 4k block size, and can remap ublk driver's io
  * request into ublksrv's vm space
 /* use ioctl encoding for uring command */
 #define UBLK_F_CMD_IOCTL_ENCODE        (1UL << 6)
 
+/* Copy between request and user buffer by pread()/pwrite() */
+#define UBLK_F_USER_COPY       (1UL << 7)
+
 /* device state */
 #define UBLK_S_DEV_DEAD        0
 #define UBLK_S_DEV_LIVE        1
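
The UBLK_IO_BUF/TAG/QID bit fields above define how a (queue id, tag, byte offset) triple is packed into one 64-bit position; with UBLK_F_USER_COPY the ublk server uses that position as a plain file offset for pread()/pwrite() on the /dev/ublkcN char device instead of mapping the request pages. A sketch of the packing, assuming the position is applied on top of UBLKSRV_IO_BUF_OFFSET (helper name made up; the canonical code lives in the ublk driver and ublksrv):

#include <stdint.h>
#include <stdio.h>
#include <linux/ublk_cmd.h>

/* hypothetical helper: char-dev offset of request <q_id, tag>'s data */
static inline uint64_t ublk_user_copy_pos(uint16_t q_id, uint16_t tag,
                                          uint32_t bytes_into_request)
{
        return UBLKSRV_IO_BUF_OFFSET +
               (((uint64_t)q_id << UBLK_QID_OFF) |
                ((uint64_t)tag << UBLK_TAG_OFF) |
                (bytes_into_request & UBLK_IO_BUF_BITS_MASK));
}

int main(void)
{
        /* e.g. a server would pread() WRITE payloads from, and pwrite()
         * READ results to, this offset on /dev/ublkcN */
        printf("queue 1, tag 7: 0x%llx\n",
               (unsigned long long)ublk_user_copy_pos(1, 7, 0));
        return 0;
}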
index 0552e8d..b71276b 100644 (file)
@@ -646,6 +646,15 @@ enum {
        VFIO_CCW_NUM_IRQS
 };
 
+/*
+ * The vfio-ap bus driver makes use of the following IRQ index mapping.
+ * Unimplemented IRQ types return a count of zero.
+ */
+enum {
+       VFIO_AP_REQ_IRQ_INDEX,
+       VFIO_AP_NUM_IRQS
+};
+
 /**
  * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
  *                                           struct vfio_pci_hot_reset_info)
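
Userspace discovers the new vfio-ap request IRQ the same way as any other vfio IRQ index: query it with VFIO_DEVICE_GET_IRQ_INFO and treat a count of zero as "not implemented", as the comment above states. A short sketch:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* returns the IRQ count for the vfio-ap "request" index, or -1 on error */
static int ap_req_irq_count(int device_fd)
{
        struct vfio_irq_info info;

        memset(&info, 0, sizeof(info));
        info.argsz = sizeof(info);
        info.index = VFIO_AP_REQ_IRQ_INDEX;
        if (ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &info))
                return -1;
        return (int)info.count;         /* 0 == unimplemented on this device */
}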
index 44c2855..ac1281c 100644 (file)
@@ -138,4 +138,7 @@ int xen_test_irq_shared(int irq);
 
 /* initialize Xen IRQ subsystem */
 void xen_init_IRQ(void);
+
+irqreturn_t xen_debug_interrupt(int irq, void *dev_id);
+
 #endif /* _XEN_EVENTS_H */
index 0efeb65..f989162 100644 (file)
@@ -31,6 +31,9 @@ extern uint32_t xen_start_flags;
 
 #include <xen/interface/hvm/start_info.h>
 extern struct hvm_start_info pvh_start_info;
+void xen_prepare_pvh(void);
+struct pt_regs;
+void xen_pv_evtchn_do_upcall(struct pt_regs *regs);
 
 #ifdef CONFIG_XEN_DOM0
 #include <xen/interface/xen.h>
index 32c2495..f7f65af 100644 (file)
@@ -1771,6 +1771,16 @@ config RSEQ
 
          If unsure, say Y.
 
+config CACHESTAT_SYSCALL
+       bool "Enable cachestat() system call" if EXPERT
+       default y
+       help
+         Enable the cachestat system call, which queries the page cache
+         statistics of a file (number of cached pages, dirty pages,
+         pages marked for writeback, (recently) evicted pages).
+
+         If unsure, say Y here.
+
 config DEBUG_RSEQ
        default n
        bool "Enabled debugging of rseq() system call" if EXPERT
index 811e94d..1aa0158 100644 (file)
@@ -28,7 +28,6 @@
 #include "do_mounts.h"
 
 int root_mountflags = MS_RDONLY | MS_SILENT;
-static char * __initdata root_device_name;
 static char __initdata saved_root_name[64];
 static int root_wait;
 
@@ -60,240 +59,6 @@ static int __init readwrite(char *str)
 __setup("ro", readonly);
 __setup("rw", readwrite);
 
-#ifdef CONFIG_BLOCK
-struct uuidcmp {
-       const char *uuid;
-       int len;
-};
-
-/**
- * match_dev_by_uuid - callback for finding a partition using its uuid
- * @dev:       device passed in by the caller
- * @data:      opaque pointer to the desired struct uuidcmp to match
- *
- * Returns 1 if the device matches, and 0 otherwise.
- */
-static int match_dev_by_uuid(struct device *dev, const void *data)
-{
-       struct block_device *bdev = dev_to_bdev(dev);
-       const struct uuidcmp *cmp = data;
-
-       if (!bdev->bd_meta_info ||
-           strncasecmp(cmp->uuid, bdev->bd_meta_info->uuid, cmp->len))
-               return 0;
-       return 1;
-}
-
-/**
- * devt_from_partuuid - looks up the dev_t of a partition by its UUID
- * @uuid_str:  char array containing ascii UUID
- *
- * The function will return the first partition which contains a matching
- * UUID value in its partition_meta_info struct.  This does not search
- * by filesystem UUIDs.
- *
- * If @uuid_str is followed by a "/PARTNROFF=%d", then the number will be
- * extracted and used as an offset from the partition identified by the UUID.
- *
- * Returns the matching dev_t on success or 0 on failure.
- */
-static dev_t devt_from_partuuid(const char *uuid_str)
-{
-       struct uuidcmp cmp;
-       struct device *dev = NULL;
-       dev_t devt = 0;
-       int offset = 0;
-       char *slash;
-
-       cmp.uuid = uuid_str;
-
-       slash = strchr(uuid_str, '/');
-       /* Check for optional partition number offset attributes. */
-       if (slash) {
-               char c = 0;
-
-               /* Explicitly fail on poor PARTUUID syntax. */
-               if (sscanf(slash + 1, "PARTNROFF=%d%c", &offset, &c) != 1)
-                       goto clear_root_wait;
-               cmp.len = slash - uuid_str;
-       } else {
-               cmp.len = strlen(uuid_str);
-       }
-
-       if (!cmp.len)
-               goto clear_root_wait;
-
-       dev = class_find_device(&block_class, NULL, &cmp, &match_dev_by_uuid);
-       if (!dev)
-               return 0;
-
-       if (offset) {
-               /*
-                * Attempt to find the requested partition by adding an offset
-                * to the partition number found by UUID.
-                */
-               devt = part_devt(dev_to_disk(dev),
-                                dev_to_bdev(dev)->bd_partno + offset);
-       } else {
-               devt = dev->devt;
-       }
-
-       put_device(dev);
-       return devt;
-
-clear_root_wait:
-       pr_err("VFS: PARTUUID= is invalid.\n"
-              "Expected PARTUUID=<valid-uuid-id>[/PARTNROFF=%%d]\n");
-       if (root_wait)
-               pr_err("Disabling rootwait; root= is invalid.\n");
-       root_wait = 0;
-       return 0;
-}
-
-/**
- * match_dev_by_label - callback for finding a partition using its label
- * @dev:       device passed in by the caller
- * @data:      opaque pointer to the label to match
- *
- * Returns 1 if the device matches, and 0 otherwise.
- */
-static int match_dev_by_label(struct device *dev, const void *data)
-{
-       struct block_device *bdev = dev_to_bdev(dev);
-       const char *label = data;
-
-       if (!bdev->bd_meta_info || strcmp(label, bdev->bd_meta_info->volname))
-               return 0;
-       return 1;
-}
-
-static dev_t devt_from_partlabel(const char *label)
-{
-       struct device *dev;
-       dev_t devt = 0;
-
-       dev = class_find_device(&block_class, NULL, label, &match_dev_by_label);
-       if (dev) {
-               devt = dev->devt;
-               put_device(dev);
-       }
-
-       return devt;
-}
-
-static dev_t devt_from_devname(const char *name)
-{
-       dev_t devt = 0;
-       int part;
-       char s[32];
-       char *p;
-
-       if (strlen(name) > 31)
-               return 0;
-       strcpy(s, name);
-       for (p = s; *p; p++) {
-               if (*p == '/')
-                       *p = '!';
-       }
-
-       devt = blk_lookup_devt(s, 0);
-       if (devt)
-               return devt;
-
-       /*
-        * Try non-existent, but valid partition, which may only exist after
-        * opening the device, like partitioned md devices.
-        */
-       while (p > s && isdigit(p[-1]))
-               p--;
-       if (p == s || !*p || *p == '0')
-               return 0;
-
-       /* try disk name without <part number> */
-       part = simple_strtoul(p, NULL, 10);
-       *p = '\0';
-       devt = blk_lookup_devt(s, part);
-       if (devt)
-               return devt;
-
-       /* try disk name without p<part number> */
-       if (p < s + 2 || !isdigit(p[-2]) || p[-1] != 'p')
-               return 0;
-       p[-1] = '\0';
-       return blk_lookup_devt(s, part);
-}
-#endif /* CONFIG_BLOCK */
-
-static dev_t devt_from_devnum(const char *name)
-{
-       unsigned maj, min, offset;
-       dev_t devt = 0;
-       char *p, dummy;
-
-       if (sscanf(name, "%u:%u%c", &maj, &min, &dummy) == 2 ||
-           sscanf(name, "%u:%u:%u:%c", &maj, &min, &offset, &dummy) == 3) {
-               devt = MKDEV(maj, min);
-               if (maj != MAJOR(devt) || min != MINOR(devt))
-                       return 0;
-       } else {
-               devt = new_decode_dev(simple_strtoul(name, &p, 16));
-               if (*p)
-                       return 0;
-       }
-
-       return devt;
-}
-
-/*
- *     Convert a name into device number.  We accept the following variants:
- *
- *     1) <hex_major><hex_minor> device number in hexadecimal represents itself
- *         no leading 0x, for example b302.
- *     2) /dev/nfs represents Root_NFS (0xff)
- *     3) /dev/<disk_name> represents the device number of disk
- *     4) /dev/<disk_name><decimal> represents the device number
- *         of partition - device number of disk plus the partition number
- *     5) /dev/<disk_name>p<decimal> - same as the above, that form is
- *        used when disk name of partitioned disk ends on a digit.
- *     6) PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF representing the
- *        unique id of a partition if the partition table provides it.
- *        The UUID may be either an EFI/GPT UUID, or refer to an MSDOS
- *        partition using the format SSSSSSSS-PP, where SSSSSSSS is a zero-
- *        filled hex representation of the 32-bit "NT disk signature", and PP
- *        is a zero-filled hex representation of the 1-based partition number.
- *     7) PARTUUID=<UUID>/PARTNROFF=<int> to select a partition in relation to
- *        a partition with a known unique id.
- *     8) <major>:<minor> major and minor number of the device separated by
- *        a colon.
- *     9) PARTLABEL=<name> with name being the GPT partition label.
- *        MSDOS partitions do not support labels!
- *     10) /dev/cifs represents Root_CIFS (0xfe)
- *
- *     If name doesn't have fall into the categories above, we return (0,0).
- *     block_class is used to check if something is a disk name. If the disk
- *     name contains slashes, the device name has them replaced with
- *     bangs.
- */
-dev_t name_to_dev_t(const char *name)
-{
-       if (strcmp(name, "/dev/nfs") == 0)
-               return Root_NFS;
-       if (strcmp(name, "/dev/cifs") == 0)
-               return Root_CIFS;
-       if (strcmp(name, "/dev/ram") == 0)
-               return Root_RAM0;
-#ifdef CONFIG_BLOCK
-       if (strncmp(name, "PARTUUID=", 9) == 0)
-               return devt_from_partuuid(name + 9);
-       if (strncmp(name, "PARTLABEL=", 10) == 0)
-               return devt_from_partlabel(name + 10);
-       if (strncmp(name, "/dev/", 5) == 0)
-               return devt_from_devname(name + 5);
-#endif
-       return devt_from_devnum(name);
-}
-EXPORT_SYMBOL_GPL(name_to_dev_t);
-
 static int __init root_dev_setup(char *line)
 {
        strscpy(saved_root_name, line, sizeof(saved_root_name));
@@ -338,7 +103,7 @@ __setup("rootfstype=", fs_names_setup);
 __setup("rootdelay=", root_delay_setup);
 
 /* This can return zero length strings. Caller should check */
-static int __init split_fs_names(char *page, size_t size, char *names)
+static int __init split_fs_names(char *page, size_t size)
 {
        int count = 1;
        char *p = page;
@@ -391,7 +156,7 @@ out:
        return ret;
 }
 
-void __init mount_block_root(char *name, int flags)
+void __init mount_root_generic(char *name, char *pretty_name, int flags)
 {
        struct page *page = alloc_page(GFP_KERNEL);
        char *fs_names = page_address(page);
@@ -402,7 +167,7 @@ void __init mount_block_root(char *name, int flags)
        scnprintf(b, BDEVNAME_SIZE, "unknown-block(%u,%u)",
                  MAJOR(ROOT_DEV), MINOR(ROOT_DEV));
        if (root_fs_names)
-               num_fs = split_fs_names(fs_names, PAGE_SIZE, root_fs_names);
+               num_fs = split_fs_names(fs_names, PAGE_SIZE);
        else
                num_fs = list_bdev_fs_names(fs_names, PAGE_SIZE);
 retry:
@@ -425,10 +190,21 @@ retry:
                 * and give them a list of the available devices
                 */
                printk("VFS: Cannot open root device \"%s\" or %s: error %d\n",
-                               root_device_name, b, err);
+                               pretty_name, b, err);
                printk("Please append a correct \"root=\" boot option; here are the available partitions:\n");
-
                printk_all_partitions();
+
+               if (root_fs_names)
+                       num_fs = list_bdev_fs_names(fs_names, PAGE_SIZE);
+               if (!num_fs)
+                       pr_err("Can't find any bdev filesystem to be used for mount!\n");
+               else {
+                       pr_err("List of all bdev filesystems:\n");
+                       for (i = 0, p = fs_names; i < num_fs; i++, p += strlen(p)+1)
+                               pr_err(" %s", p);
+                       pr_err("\n");
+               }
+
                panic("VFS: Unable to mount root fs on %s", b);
        }
        if (!(flags & SB_RDONLY)) {
@@ -453,15 +229,14 @@ out:
 #define NFSROOT_TIMEOUT_MAX    30
 #define NFSROOT_RETRY_MAX      5
 
-static int __init mount_nfs_root(void)
+static void __init mount_nfs_root(void)
 {
        char *root_dev, *root_data;
        unsigned int timeout;
-       int try, err;
+       int try;
 
-       err = nfs_root_data(&root_dev, &root_data);
-       if (err != 0)
-               return 0;
+       if (nfs_root_data(&root_dev, &root_data))
+               goto fail;
 
        /*
         * The server or network may not be ready, so try several
@@ -470,10 +245,8 @@ static int __init mount_nfs_root(void)
         */
        timeout = NFSROOT_TIMEOUT_MIN;
        for (try = 1; ; try++) {
-               err = do_mount_root(root_dev, "nfs",
-                                       root_mountflags, root_data);
-               if (err == 0)
-                       return 1;
+               if (!do_mount_root(root_dev, "nfs", root_mountflags, root_data))
+                       return;
                if (try > NFSROOT_RETRY_MAX)
                        break;
 
@@ -483,34 +256,35 @@ static int __init mount_nfs_root(void)
                if (timeout > NFSROOT_TIMEOUT_MAX)
                        timeout = NFSROOT_TIMEOUT_MAX;
        }
-       return 0;
+fail:
+       pr_err("VFS: Unable to mount root fs via NFS.\n");
 }
-#endif
+#else
+static inline void mount_nfs_root(void)
+{
+}
+#endif /* CONFIG_ROOT_NFS */
 
 #ifdef CONFIG_CIFS_ROOT
 
-extern int cifs_root_data(char **dev, char **opts);
-
 #define CIFSROOT_TIMEOUT_MIN   5
 #define CIFSROOT_TIMEOUT_MAX   30
 #define CIFSROOT_RETRY_MAX     5
 
-static int __init mount_cifs_root(void)
+static void __init mount_cifs_root(void)
 {
        char *root_dev, *root_data;
        unsigned int timeout;
-       int try, err;
+       int try;
 
-       err = cifs_root_data(&root_dev, &root_data);
-       if (err != 0)
-               return 0;
+       if (cifs_root_data(&root_dev, &root_data))
+               goto fail;
 
        timeout = CIFSROOT_TIMEOUT_MIN;
        for (try = 1; ; try++) {
-               err = do_mount_root(root_dev, "cifs", root_mountflags,
-                                   root_data);
-               if (err == 0)
-                       return 1;
+               if (!do_mount_root(root_dev, "cifs", root_mountflags,
+                                  root_data))
+                       return;
                if (try > CIFSROOT_RETRY_MAX)
                        break;
 
@@ -519,9 +293,14 @@ static int __init mount_cifs_root(void)
                if (timeout > CIFSROOT_TIMEOUT_MAX)
                        timeout = CIFSROOT_TIMEOUT_MAX;
        }
-       return 0;
+fail:
+       pr_err("VFS: Unable to mount root fs via SMB.\n");
+}
+#else
+static inline void mount_cifs_root(void)
+{
 }
-#endif
+#endif /* CONFIG_CIFS_ROOT */
 
 static bool __init fs_is_nodev(char *fstype)
 {
@@ -536,7 +315,7 @@ static bool __init fs_is_nodev(char *fstype)
        return ret;
 }
 
-static int __init mount_nodev_root(void)
+static int __init mount_nodev_root(char *root_device_name)
 {
        char *fs_names, *fstype;
        int err = -EINVAL;
@@ -545,7 +324,7 @@ static int __init mount_nodev_root(void)
        fs_names = (void *)__get_free_page(GFP_KERNEL);
        if (!fs_names)
                return -EINVAL;
-       num_fs = split_fs_names(fs_names, PAGE_SIZE, root_fs_names);
+       num_fs = split_fs_names(fs_names, PAGE_SIZE);
 
        for (i = 0, fstype = fs_names; i < num_fs;
             i++, fstype += strlen(fstype) + 1) {
@@ -563,35 +342,84 @@ static int __init mount_nodev_root(void)
        return err;
 }
 
-void __init mount_root(void)
+#ifdef CONFIG_BLOCK
+static void __init mount_block_root(char *root_device_name)
 {
-#ifdef CONFIG_ROOT_NFS
-       if (ROOT_DEV == Root_NFS) {
-               if (!mount_nfs_root())
-                       printk(KERN_ERR "VFS: Unable to mount root fs via NFS.\n");
-               return;
+       int err = create_dev("/dev/root", ROOT_DEV);
+
+       if (err < 0)
+               pr_emerg("Failed to create /dev/root: %d\n", err);
+       mount_root_generic("/dev/root", root_device_name, root_mountflags);
+}
+#else
+static inline void mount_block_root(char *root_device_name)
+{
+}
+#endif /* CONFIG_BLOCK */
+
+void __init mount_root(char *root_device_name)
+{
+       switch (ROOT_DEV) {
+       case Root_NFS:
+               mount_nfs_root();
+               break;
+       case Root_CIFS:
+               mount_cifs_root();
+               break;
+       case Root_Generic:
+               mount_root_generic(root_device_name, root_device_name,
+                                  root_mountflags);
+               break;
+       case 0:
+               if (root_device_name && root_fs_names &&
+                   mount_nodev_root(root_device_name) == 0)
+                       break;
+               fallthrough;
+       default:
+               mount_block_root(root_device_name);
+               break;
        }
-#endif
-#ifdef CONFIG_CIFS_ROOT
-       if (ROOT_DEV == Root_CIFS) {
-               if (!mount_cifs_root())
-                       printk(KERN_ERR "VFS: Unable to mount root fs via SMB.\n");
+}
+
+/* wait for any asynchronous scanning to complete */
+static void __init wait_for_root(char *root_device_name)
+{
+       if (ROOT_DEV != 0)
                return;
-       }
-#endif
-       if (ROOT_DEV == 0 && root_device_name && root_fs_names) {
-               if (mount_nodev_root() == 0)
-                       return;
-       }
-#ifdef CONFIG_BLOCK
-       {
-               int err = create_dev("/dev/root", ROOT_DEV);
 
-               if (err < 0)
-                       pr_emerg("Failed to create /dev/root: %d\n", err);
-               mount_block_root("/dev/root", root_mountflags);
+       pr_info("Waiting for root device %s...\n", root_device_name);
+
+       while (!driver_probe_done() ||
+              early_lookup_bdev(root_device_name, &ROOT_DEV) < 0)
+               msleep(5);
+       async_synchronize_full();
+
+}
+
+static dev_t __init parse_root_device(char *root_device_name)
+{
+       int error;
+       dev_t dev;
+
+       if (!strncmp(root_device_name, "mtd", 3) ||
+           !strncmp(root_device_name, "ubi", 3))
+               return Root_Generic;
+       if (strcmp(root_device_name, "/dev/nfs") == 0)
+               return Root_NFS;
+       if (strcmp(root_device_name, "/dev/cifs") == 0)
+               return Root_CIFS;
+       if (strcmp(root_device_name, "/dev/ram") == 0)
+               return Root_RAM0;
+
+       error = early_lookup_bdev(root_device_name, &dev);
+       if (error) {
+               if (error == -EINVAL && root_wait) {
+                       pr_err("Disabling rootwait; root= is invalid.\n");
+                       root_wait = 0;
+               }
+               return 0;
        }
-#endif
+       return dev;
 }
 
 /*
@@ -616,32 +444,15 @@ void __init prepare_namespace(void)
 
        md_run_setup();
 
-       if (saved_root_name[0]) {
-               root_device_name = saved_root_name;
-               if (!strncmp(root_device_name, "mtd", 3) ||
-                   !strncmp(root_device_name, "ubi", 3)) {
-                       mount_block_root(root_device_name, root_mountflags);
-                       goto out;
-               }
-               ROOT_DEV = name_to_dev_t(root_device_name);
-               if (strncmp(root_device_name, "/dev/", 5) == 0)
-                       root_device_name += 5;
-       }
+       if (saved_root_name[0])
+               ROOT_DEV = parse_root_device(saved_root_name);
 
-       if (initrd_load())
+       if (initrd_load(saved_root_name))
                goto out;
 
-       /* wait for any asynchronous scanning to complete */
-       if ((ROOT_DEV == 0) && root_wait) {
-               printk(KERN_INFO "Waiting for root device %s...\n",
-                       saved_root_name);
-               while (driver_probe_done() != 0 ||
-                       (ROOT_DEV = name_to_dev_t(saved_root_name)) == 0)
-                       msleep(5);
-               async_synchronize_full();
-       }
-
-       mount_root();
+       if (root_wait)
+               wait_for_root(saved_root_name);
+       mount_root(saved_root_name);
 out:
        devtmpfs_mount();
        init_mount(".", "/", NULL, MS_MOVE, NULL);
index 7a29ac3..15e372b 100644 (file)
@@ -10,8 +10,8 @@
 #include <linux/root_dev.h>
 #include <linux/init_syscalls.h>
 
-void  mount_block_root(char *name, int flags);
-void  mount_root(void);
+void  mount_root_generic(char *name, char *pretty_name, int flags);
+void  mount_root(char *root_device_name);
 extern int root_mountflags;
 
 static inline __init int create_dev(char *name, dev_t dev)
@@ -33,11 +33,11 @@ static inline int rd_load_image(char *from) { return 0; }
 #endif
 
 #ifdef CONFIG_BLK_DEV_INITRD
-
-bool __init initrd_load(void);
-
+bool __init initrd_load(char *root_device_name);
 #else
-
-static inline bool initrd_load(void) { return false; }
+static inline bool initrd_load(char *root_device_name)
+{
+       return false;
+}
 
 #endif
index 3473124..425f4bc 100644 (file)
@@ -83,7 +83,7 @@ static int __init init_linuxrc(struct subprocess_info *info, struct cred *new)
        return 0;
 }
 
-static void __init handle_initrd(void)
+static void __init handle_initrd(char *root_device_name)
 {
        struct subprocess_info *info;
        static char *argv[] = { "linuxrc", NULL, };
@@ -95,7 +95,8 @@ static void __init handle_initrd(void)
        real_root_dev = new_encode_dev(ROOT_DEV);
        create_dev("/dev/root.old", Root_RAM0);
        /* mount initrd on rootfs' /root */
-       mount_block_root("/dev/root.old", root_mountflags & ~MS_RDONLY);
+       mount_root_generic("/dev/root.old", root_device_name,
+                          root_mountflags & ~MS_RDONLY);
        init_mkdir("/old", 0700);
        init_chdir("/old");
 
@@ -117,7 +118,7 @@ static void __init handle_initrd(void)
 
        init_chdir("/");
        ROOT_DEV = new_decode_dev(real_root_dev);
-       mount_root();
+       mount_root(root_device_name);
 
        printk(KERN_NOTICE "Trying to move old root to /initrd ... ");
        error = init_mount("/old", "/root/initrd", NULL, MS_MOVE, NULL);
@@ -133,7 +134,7 @@ static void __init handle_initrd(void)
        }
 }
 
-bool __init initrd_load(void)
+bool __init initrd_load(char *root_device_name)
 {
        if (mount_initrd) {
                create_dev("/dev/ram", Root_RAM0);
@@ -145,7 +146,7 @@ bool __init initrd_load(void)
                 */
                if (rd_load_image("/initrd.image") && ROOT_DEV != Root_RAM0) {
                        init_unlink("/initrd.image");
-                       handle_initrd();
+                       handle_initrd(root_device_name);
                        return true;
                }
        }
index af50044..ad920fa 100644 (file)
@@ -95,7 +95,6 @@
 #include <linux/cache.h>
 #include <linux/rodata_test.h>
 #include <linux/jump_label.h>
-#include <linux/mem_encrypt.h>
 #include <linux/kcsan.h>
 #include <linux/init_syscalls.h>
 #include <linux/stackdepot.h>
 #include <net/net_namespace.h>
 
 #include <asm/io.h>
-#include <asm/bugs.h>
 #include <asm/setup.h>
 #include <asm/sections.h>
 #include <asm/cacheflush.h>
 
 static int kernel_init(void *);
 
-extern void init_IRQ(void);
-extern void radix_tree_init(void);
-extern void maple_tree_init(void);
-
 /*
  * Debug helper: via this flag we know that we are in 'early bootup code'
  * where only the boot processor is running with IRQ disabled.  This means
@@ -137,7 +131,6 @@ EXPORT_SYMBOL(system_state);
 #define MAX_INIT_ARGS CONFIG_INIT_ENV_ARG_LIMIT
 #define MAX_INIT_ENVS CONFIG_INIT_ENV_ARG_LIMIT
 
-extern void time_init(void);
 /* Default late time init is NULL. archs can override this later. */
 void (*__initdata late_time_init)(void);
 
@@ -196,8 +189,6 @@ static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };
 const char *envp_init[MAX_INIT_ENVS+2] = { "HOME=/", "TERM=linux", NULL, };
 static const char *panic_later, *panic_param;
 
-extern const struct obs_kernel_param __setup_start[], __setup_end[];
-
 static bool __init obsolete_checksetup(char *line)
 {
        const struct obs_kernel_param *p;
@@ -787,8 +778,6 @@ void __init __weak thread_stack_cache_init(void)
 }
 #endif
 
-void __init __weak mem_encrypt_init(void) { }
-
 void __init __weak poking_init(void) { }
 
 void __init __weak pgtable_cache_init(void) { }
@@ -877,7 +866,8 @@ static void __init print_unknown_bootoptions(void)
        memblock_free(unknown_options, len);
 }
 
-asmlinkage __visible void __init __no_sanitize_address __noreturn start_kernel(void)
+asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector
+void start_kernel(void)
 {
        char *command_line;
        char *after_dashes;
@@ -1042,15 +1032,7 @@ asmlinkage __visible void __init __no_sanitize_address __noreturn start_kernel(v
        sched_clock_init();
        calibrate_delay();
 
-       /*
-        * This needs to be called before any devices perform DMA
-        * operations that might use the SWIOTLB bounce buffers. It will
-        * mark the bounce buffers as decrypted so that their usage will
-        * not cause "plain-text" data to be decrypted when accessed. It
-        * must be called after late_time_init() so that Hyper-V x86/x64
-        * hypercalls work when the SWIOTLB bounce buffers are decrypted.
-        */
-       mem_encrypt_init();
+       arch_cpu_finalize_init();
 
        pid_idr_init();
        anon_vma_init();
@@ -1078,8 +1060,6 @@ asmlinkage __visible void __init __no_sanitize_address __noreturn start_kernel(v
        taskstats_init_early();
        delayacct_init();
 
-       check_bugs();
-
        acpi_subsystem_init();
        arch_post_acpi_subsys_init();
        kcsan_init();
@@ -1087,7 +1067,13 @@ asmlinkage __visible void __init __no_sanitize_address __noreturn start_kernel(v
        /* Do the rest non-__init'ed, we're now alive */
        arch_call_rest_init();
 
+       /*
+        * Avoid stack canaries in callers of boot_init_stack_canary for gcc-10
+        * and older.
+        */
+#if !__has_attribute(__no_stack_protector__)
        prevent_tail_call_optimization();
+#endif
 }
 
 /* Call all constructor functions linked into the kernel. */
@@ -1263,17 +1249,6 @@ int __init_or_module do_one_initcall(initcall_t fn)
 }
 
 
-extern initcall_entry_t __initcall_start[];
-extern initcall_entry_t __initcall0_start[];
-extern initcall_entry_t __initcall1_start[];
-extern initcall_entry_t __initcall2_start[];
-extern initcall_entry_t __initcall3_start[];
-extern initcall_entry_t __initcall4_start[];
-extern initcall_entry_t __initcall5_start[];
-extern initcall_entry_t __initcall6_start[];
-extern initcall_entry_t __initcall7_start[];
-extern initcall_entry_t __initcall_end[];
-
 static initcall_entry_t *initcall_levels[] __initdata = {
        __initcall0_start,
        __initcall1_start,
index b4f5dfa..58c46c8 100644 (file)
@@ -216,13 +216,10 @@ static int __io_sync_cancel(struct io_uring_task *tctx,
        /* fixed must be grabbed every time since we drop the uring_lock */
        if ((cd->flags & IORING_ASYNC_CANCEL_FD) &&
            (cd->flags & IORING_ASYNC_CANCEL_FD_FIXED)) {
-               unsigned long file_ptr;
-
                if (unlikely(fd >= ctx->nr_user_files))
                        return -EBADF;
                fd = array_index_nospec(fd, ctx->nr_user_files);
-               file_ptr = io_fixed_file_slot(&ctx->file_table, fd)->file_ptr;
-               cd->file = (struct file *) (file_ptr & FFS_MASK);
+               cd->file = io_file_from_index(&ctx->file_table, fd);
                if (!cd->file)
                        return -EBADF;
        }
index 0f6fa79..e7d7499 100644 (file)
@@ -78,10 +78,8 @@ static int io_install_fixed_file(struct io_ring_ctx *ctx, struct file *file,
        file_slot = io_fixed_file_slot(&ctx->file_table, slot_index);
 
        if (file_slot->file_ptr) {
-               struct file *old_file;
-
-               old_file = (struct file *)(file_slot->file_ptr & FFS_MASK);
-               ret = io_queue_rsrc_removal(ctx->file_data, slot_index, old_file);
+               ret = io_queue_rsrc_removal(ctx->file_data, slot_index,
+                                           io_slot_file(file_slot));
                if (ret)
                        return ret;
 
@@ -140,7 +138,6 @@ int io_fixed_fd_install(struct io_kiocb *req, unsigned int issue_flags,
 int io_fixed_fd_remove(struct io_ring_ctx *ctx, unsigned int offset)
 {
        struct io_fixed_file *file_slot;
-       struct file *file;
        int ret;
 
        if (unlikely(!ctx->file_data))
@@ -153,8 +150,8 @@ int io_fixed_fd_remove(struct io_ring_ctx *ctx, unsigned int offset)
        if (!file_slot->file_ptr)
                return -EBADF;
 
-       file = (struct file *)(file_slot->file_ptr & FFS_MASK);
-       ret = io_queue_rsrc_removal(ctx->file_data, offset, file);
+       ret = io_queue_rsrc_removal(ctx->file_data, offset,
+                                   io_slot_file(file_slot));
        if (ret)
                return ret;
 
index 351111f..b47adf1 100644 (file)
@@ -5,10 +5,6 @@
 #include <linux/file.h>
 #include <linux/io_uring_types.h>
 
-#define FFS_NOWAIT             0x1UL
-#define FFS_ISREG              0x2UL
-#define FFS_MASK               ~(FFS_NOWAIT|FFS_ISREG)
-
 bool io_alloc_file_tables(struct io_file_table *table, unsigned nr_files);
 void io_free_file_tables(struct io_file_table *table);
 
@@ -43,21 +39,31 @@ io_fixed_file_slot(struct io_file_table *table, unsigned i)
        return &table->files[i];
 }
 
+#define FFS_NOWAIT             0x1UL
+#define FFS_ISREG              0x2UL
+#define FFS_MASK               ~(FFS_NOWAIT|FFS_ISREG)
+
+static inline unsigned int io_slot_flags(struct io_fixed_file *slot)
+{
+       return (slot->file_ptr & ~FFS_MASK) << REQ_F_SUPPORT_NOWAIT_BIT;
+}
+
+static inline struct file *io_slot_file(struct io_fixed_file *slot)
+{
+       return (struct file *)(slot->file_ptr & FFS_MASK);
+}
+
 static inline struct file *io_file_from_index(struct io_file_table *table,
                                              int index)
 {
-       struct io_fixed_file *slot = io_fixed_file_slot(table, index);
-
-       return (struct file *) (slot->file_ptr & FFS_MASK);
+       return io_slot_file(io_fixed_file_slot(table, index));
 }
 
 static inline void io_fixed_file_set(struct io_fixed_file *file_slot,
                                     struct file *file)
 {
-       unsigned long file_ptr = (unsigned long) file;
-
-       file_ptr |= io_file_get_flags(file);
-       file_slot->file_ptr = file_ptr;
+       file_slot->file_ptr = (unsigned long)file |
+               (io_file_get_flags(file) >> REQ_F_SUPPORT_NOWAIT_BIT);
 }
 
 static inline void io_reset_alloc_hint(struct io_ring_ctx *ctx)
index 3bca7a7..1b53a2a 100644 (file)
@@ -95,6 +95,7 @@
 
 #include "timeout.h"
 #include "poll.h"
+#include "rw.h"
 #include "alloc_cache.h"
 
 #define IORING_MAX_ENTRIES     32768
@@ -145,8 +146,6 @@ static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
                                         struct task_struct *task,
                                         bool cancel_all);
 
-static void io_dismantle_req(struct io_kiocb *req);
-static void io_clean_op(struct io_kiocb *req);
 static void io_queue_sqe(struct io_kiocb *req);
 static void io_move_task_work_from_local(struct io_ring_ctx *ctx);
 static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
@@ -367,6 +366,39 @@ static bool req_need_defer(struct io_kiocb *req, u32 seq)
        return false;
 }
 
+static void io_clean_op(struct io_kiocb *req)
+{
+       if (req->flags & REQ_F_BUFFER_SELECTED) {
+               spin_lock(&req->ctx->completion_lock);
+               io_put_kbuf_comp(req);
+               spin_unlock(&req->ctx->completion_lock);
+       }
+
+       if (req->flags & REQ_F_NEED_CLEANUP) {
+               const struct io_cold_def *def = &io_cold_defs[req->opcode];
+
+               if (def->cleanup)
+                       def->cleanup(req);
+       }
+       if ((req->flags & REQ_F_POLLED) && req->apoll) {
+               kfree(req->apoll->double_poll);
+               kfree(req->apoll);
+               req->apoll = NULL;
+       }
+       if (req->flags & REQ_F_INFLIGHT) {
+               struct io_uring_task *tctx = req->task->io_uring;
+
+               atomic_dec(&tctx->inflight_tracked);
+       }
+       if (req->flags & REQ_F_CREDS)
+               put_cred(req->creds);
+       if (req->flags & REQ_F_ASYNC_DATA) {
+               kfree(req->async_data);
+               req->async_data = NULL;
+       }
+       req->flags &= ~IO_REQ_CLEAN_FLAGS;
+}
+
 static inline void io_req_track_inflight(struct io_kiocb *req)
 {
        if (!(req->flags & REQ_F_INFLIGHT)) {
@@ -423,8 +455,8 @@ static void io_prep_async_work(struct io_kiocb *req)
        if (req->flags & REQ_F_FORCE_ASYNC)
                req->work.flags |= IO_WQ_WORK_CONCURRENT;
 
-       if (req->file && !io_req_ffs_set(req))
-               req->flags |= io_file_get_flags(req->file) << REQ_F_SUPPORT_NOWAIT_BIT;
+       if (req->file && !(req->flags & REQ_F_FIXED_FILE))
+               req->flags |= io_file_get_flags(req->file);
 
        if (req->file && (req->flags & REQ_F_ISREG)) {
                bool should_hash = def->hash_reg_file;
@@ -594,42 +626,18 @@ void __io_commit_cqring_flush(struct io_ring_ctx *ctx)
 }
 
 static inline void __io_cq_lock(struct io_ring_ctx *ctx)
-       __acquires(ctx->completion_lock)
 {
        if (!ctx->task_complete)
                spin_lock(&ctx->completion_lock);
 }
 
-static inline void __io_cq_unlock(struct io_ring_ctx *ctx)
-{
-       if (!ctx->task_complete)
-               spin_unlock(&ctx->completion_lock);
-}
-
 static inline void io_cq_lock(struct io_ring_ctx *ctx)
        __acquires(ctx->completion_lock)
 {
        spin_lock(&ctx->completion_lock);
 }
 
-static inline void io_cq_unlock(struct io_ring_ctx *ctx)
-       __releases(ctx->completion_lock)
-{
-       spin_unlock(&ctx->completion_lock);
-}
-
-/* keep it inlined for io_submit_flush_completions() */
 static inline void __io_cq_unlock_post(struct io_ring_ctx *ctx)
-       __releases(ctx->completion_lock)
-{
-       io_commit_cqring(ctx);
-       __io_cq_unlock(ctx);
-       io_commit_cqring_flush(ctx);
-       io_cqring_wake(ctx);
-}
-
-static void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
-       __releases(ctx->completion_lock)
 {
        io_commit_cqring(ctx);
 
@@ -641,13 +649,13 @@ static void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
                 */
                io_commit_cqring_flush(ctx);
        } else {
-               __io_cq_unlock(ctx);
+               spin_unlock(&ctx->completion_lock);
                io_commit_cqring_flush(ctx);
                io_cqring_wake(ctx);
        }
 }
 
-void io_cq_unlock_post(struct io_ring_ctx *ctx)
+static void io_cq_unlock_post(struct io_ring_ctx *ctx)
        __releases(ctx->completion_lock)
 {
        io_commit_cqring(ctx);
@@ -662,10 +670,10 @@ static void io_cqring_overflow_kill(struct io_ring_ctx *ctx)
        struct io_overflow_cqe *ocqe;
        LIST_HEAD(list);
 
-       io_cq_lock(ctx);
+       spin_lock(&ctx->completion_lock);
        list_splice_init(&ctx->cq_overflow_list, &list);
        clear_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq);
-       io_cq_unlock(ctx);
+       spin_unlock(&ctx->completion_lock);
 
        while (!list_empty(&list)) {
                ocqe = list_first_entry(&list, struct io_overflow_cqe, list);
@@ -722,29 +730,29 @@ static void io_cqring_overflow_flush(struct io_ring_ctx *ctx)
 }
 
 /* can be called by any task */
-static void io_put_task_remote(struct task_struct *task, int nr)
+static void io_put_task_remote(struct task_struct *task)
 {
        struct io_uring_task *tctx = task->io_uring;
 
-       percpu_counter_sub(&tctx->inflight, nr);
+       percpu_counter_sub(&tctx->inflight, 1);
        if (unlikely(atomic_read(&tctx->in_cancel)))
                wake_up(&tctx->wait);
-       put_task_struct_many(task, nr);
+       put_task_struct(task);
 }
 
 /* used by a task to put its own references */
-static void io_put_task_local(struct task_struct *task, int nr)
+static void io_put_task_local(struct task_struct *task)
 {
-       task->io_uring->cached_refs += nr;
+       task->io_uring->cached_refs++;
 }
 
 /* must to be called somewhat shortly after putting a request */
-static inline void io_put_task(struct task_struct *task, int nr)
+static inline void io_put_task(struct task_struct *task)
 {
        if (likely(task == current))
-               io_put_task_local(task, nr);
+               io_put_task_local(task);
        else
-               io_put_task_remote(task, nr);
+               io_put_task_remote(task);
 }
 
 void io_task_refs_refill(struct io_uring_task *tctx)
@@ -934,20 +942,19 @@ bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags
        return __io_post_aux_cqe(ctx, user_data, res, cflags, true);
 }
 
-bool io_aux_cqe(struct io_ring_ctx *ctx, bool defer, u64 user_data, s32 res, u32 cflags,
+bool io_aux_cqe(const struct io_kiocb *req, bool defer, s32 res, u32 cflags,
                bool allow_overflow)
 {
+       struct io_ring_ctx *ctx = req->ctx;
+       u64 user_data = req->cqe.user_data;
        struct io_uring_cqe *cqe;
-       unsigned int length;
 
        if (!defer)
                return __io_post_aux_cqe(ctx, user_data, res, cflags, allow_overflow);
 
-       length = ARRAY_SIZE(ctx->submit_state.cqes);
-
        lockdep_assert_held(&ctx->uring_lock);
 
-       if (ctx->submit_state.cqes_count == length) {
+       if (ctx->submit_state.cqes_count == ARRAY_SIZE(ctx->submit_state.cqes)) {
                __io_cq_lock(ctx);
                __io_flush_post_cqes(ctx);
                /* no need to flush - flush is deferred */
@@ -991,14 +998,18 @@ static void __io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
                        }
                }
                io_put_kbuf_comp(req);
-               io_dismantle_req(req);
+               if (unlikely(req->flags & IO_REQ_CLEAN_FLAGS))
+                       io_clean_op(req);
+               if (!(req->flags & REQ_F_FIXED_FILE))
+                       io_put_file(req->file);
+
                rsrc_node = req->rsrc_node;
                /*
                 * Selected buffer deallocation in io_clean_op() assumes that
                 * we don't hold ->completion_lock. Clean them here to avoid
                 * deadlocks.
                 */
-               io_put_task_remote(req->task, 1);
+               io_put_task_remote(req->task);
                wq_list_add_head(&req->comp_list, &ctx->locked_free_list);
                ctx->locked_free_nr++;
        }
@@ -1111,36 +1122,13 @@ __cold bool __io_alloc_req_refill(struct io_ring_ctx *ctx)
        return true;
 }
 
-static inline void io_dismantle_req(struct io_kiocb *req)
-{
-       unsigned int flags = req->flags;
-
-       if (unlikely(flags & IO_REQ_CLEAN_FLAGS))
-               io_clean_op(req);
-       if (!(flags & REQ_F_FIXED_FILE))
-               io_put_file(req->file);
-}
-
-static __cold void io_free_req_tw(struct io_kiocb *req, struct io_tw_state *ts)
-{
-       struct io_ring_ctx *ctx = req->ctx;
-
-       if (req->rsrc_node) {
-               io_tw_lock(ctx, ts);
-               io_put_rsrc_node(ctx, req->rsrc_node);
-       }
-       io_dismantle_req(req);
-       io_put_task_remote(req->task, 1);
-
-       spin_lock(&ctx->completion_lock);
-       wq_list_add_head(&req->comp_list, &ctx->locked_free_list);
-       ctx->locked_free_nr++;
-       spin_unlock(&ctx->completion_lock);
-}
-
 __cold void io_free_req(struct io_kiocb *req)
 {
-       req->io_task_work.func = io_free_req_tw;
+       /* refs were already put, restore them for io_req_task_complete() */
+       req->flags &= ~REQ_F_REFCOUNT;
+       /* we only want to free it, don't post CQEs */
+       req->flags |= REQ_F_CQE_SKIP;
+       req->io_task_work.func = io_req_task_complete;
        io_req_task_work_add(req);
 }
 
@@ -1205,7 +1193,9 @@ static unsigned int handle_tw_list(struct llist_node *node,
                        ts->locked = mutex_trylock(&(*ctx)->uring_lock);
                        percpu_ref_get(&(*ctx)->refs);
                }
-               req->io_task_work.func(req, ts);
+               INDIRECT_CALL_2(req->io_task_work.func,
+                               io_poll_task_func, io_req_rw_complete,
+                               req, ts);
                node = next;
                count++;
                if (unlikely(need_resched())) {
@@ -1303,7 +1293,7 @@ static __cold void io_fallback_tw(struct io_uring_task *tctx)
        }
 }
 
-static void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
+static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 {
        struct io_ring_ctx *ctx = req->ctx;
        unsigned nr_wait, nr_tw, nr_tw_prev;
@@ -1354,19 +1344,11 @@ static void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
        wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
 }
 
-void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
+static void io_req_normal_work_add(struct io_kiocb *req)
 {
        struct io_uring_task *tctx = req->task->io_uring;
        struct io_ring_ctx *ctx = req->ctx;
 
-       if (!(flags & IOU_F_TWQ_FORCE_NORMAL) &&
-           (ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
-               rcu_read_lock();
-               io_req_local_work_add(req, flags);
-               rcu_read_unlock();
-               return;
-       }
-
        /* task_work already pending, we're done */
        if (!llist_add(&req->io_task_work.node, &tctx->task_list))
                return;
@@ -1380,6 +1362,17 @@ void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
        io_fallback_tw(tctx);
 }
 
+void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
+{
+       if (req->ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+               rcu_read_lock();
+               io_req_local_work_add(req, flags);
+               rcu_read_unlock();
+       } else {
+               io_req_normal_work_add(req);
+       }
+}
+
 static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 {
        struct llist_node *node;
@@ -1390,7 +1383,7 @@ static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
                                                    io_task_work.node);
 
                node = node->next;
-               __io_req_task_work_add(req, IOU_F_TWQ_FORCE_NORMAL);
+               io_req_normal_work_add(req);
        }
 }
 
@@ -1405,13 +1398,19 @@ static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts)
        if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
                atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
 again:
-       node = io_llist_xchg(&ctx->work_llist, NULL);
+       /*
+        * llists are in reverse order, flip it back the right way before
+        * running the pending items.
+        */
+       node = llist_reverse_order(io_llist_xchg(&ctx->work_llist, NULL));
        while (node) {
                struct llist_node *next = node->next;
                struct io_kiocb *req = container_of(node, struct io_kiocb,
                                                    io_task_work.node);
                prefetch(container_of(next, struct io_kiocb, io_task_work.node));
-               req->io_task_work.func(req, ts);
+               INDIRECT_CALL_2(req->io_task_work.func,
+                               io_poll_task_func, io_req_rw_complete,
+                               req, ts);
                ret++;
                node = next;
        }
@@ -1498,9 +1497,6 @@ void io_queue_next(struct io_kiocb *req)
 void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node)
        __must_hold(&ctx->uring_lock)
 {
-       struct task_struct *task = NULL;
-       int task_refs = 0;
-
        do {
                struct io_kiocb *req = container_of(node, struct io_kiocb,
                                                    comp_list);
@@ -1530,19 +1526,10 @@ void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node)
 
                io_req_put_rsrc_locked(req, ctx);
 
-               if (req->task != task) {
-                       if (task)
-                               io_put_task(task, task_refs);
-                       task = req->task;
-                       task_refs = 0;
-               }
-               task_refs++;
+               io_put_task(req->task);
                node = req->comp_list.next;
                io_req_add_to_cache(req, ctx);
        } while (node);
-
-       if (task)
-               io_put_task(task, task_refs);
 }
 
 static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
@@ -1570,7 +1557,7 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
                        }
                }
        }
-       __io_cq_unlock_post_flush(ctx);
+       __io_cq_unlock_post(ctx);
 
        if (!wq_list_empty(&ctx->submit_state.compl_reqs)) {
                io_free_batch_list(ctx, state->compl_reqs.first);
@@ -1578,22 +1565,6 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
        }
 }
 
-/*
- * Drop reference to request, return next in chain (if there is one) if this
- * was the last reference to this request.
- */
-static inline struct io_kiocb *io_put_req_find_next(struct io_kiocb *req)
-{
-       struct io_kiocb *nxt = NULL;
-
-       if (req_ref_put_and_test(req)) {
-               if (unlikely(req->flags & IO_REQ_LINK_FLAGS))
-                       nxt = io_req_find_next(req);
-               io_free_req(req);
-       }
-       return nxt;
-}
-
 static unsigned io_cqring_events(struct io_ring_ctx *ctx)
 {
        /* See comment at the top of this file */
@@ -1758,54 +1729,14 @@ static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags)
        }
 }
 
-static bool io_bdev_nowait(struct block_device *bdev)
-{
-       return !bdev || bdev_nowait(bdev);
-}
-
-/*
- * If we tracked the file through the SCM inflight mechanism, we could support
- * any file. For now, just ensure that anything potentially problematic is done
- * inline.
- */
-static bool __io_file_supports_nowait(struct file *file, umode_t mode)
-{
-       if (S_ISBLK(mode)) {
-               if (IS_ENABLED(CONFIG_BLOCK) &&
-                   io_bdev_nowait(I_BDEV(file->f_mapping->host)))
-                       return true;
-               return false;
-       }
-       if (S_ISSOCK(mode))
-               return true;
-       if (S_ISREG(mode)) {
-               if (IS_ENABLED(CONFIG_BLOCK) &&
-                   io_bdev_nowait(file->f_inode->i_sb->s_bdev) &&
-                   !io_is_uring_fops(file))
-                       return true;
-               return false;
-       }
-
-       /* any ->read/write should understand O_NONBLOCK */
-       if (file->f_flags & O_NONBLOCK)
-               return true;
-       return file->f_mode & FMODE_NOWAIT;
-}
-
-/*
- * If we tracked the file through the SCM inflight mechanism, we could support
- * any file. For now, just ensure that anything potentially problematic is done
- * inline.
- */
 unsigned int io_file_get_flags(struct file *file)
 {
-       umode_t mode = file_inode(file)->i_mode;
        unsigned int res = 0;
 
-       if (S_ISREG(mode))
-               res |= FFS_ISREG;
-       if (__io_file_supports_nowait(file, mode))
-               res |= FFS_NOWAIT;
+       if (S_ISREG(file_inode(file)->i_mode))
+               res |= REQ_F_ISREG;
+       if ((file->f_flags & O_NONBLOCK) || (file->f_mode & FMODE_NOWAIT))
+               res |= REQ_F_SUPPORT_NOWAIT;
        return res;
 }
 
@@ -1891,39 +1822,6 @@ queue:
        spin_unlock(&ctx->completion_lock);
 }
 
-static void io_clean_op(struct io_kiocb *req)
-{
-       if (req->flags & REQ_F_BUFFER_SELECTED) {
-               spin_lock(&req->ctx->completion_lock);
-               io_put_kbuf_comp(req);
-               spin_unlock(&req->ctx->completion_lock);
-       }
-
-       if (req->flags & REQ_F_NEED_CLEANUP) {
-               const struct io_cold_def *def = &io_cold_defs[req->opcode];
-
-               if (def->cleanup)
-                       def->cleanup(req);
-       }
-       if ((req->flags & REQ_F_POLLED) && req->apoll) {
-               kfree(req->apoll->double_poll);
-               kfree(req->apoll);
-               req->apoll = NULL;
-       }
-       if (req->flags & REQ_F_INFLIGHT) {
-               struct io_uring_task *tctx = req->task->io_uring;
-
-               atomic_dec(&tctx->inflight_tracked);
-       }
-       if (req->flags & REQ_F_CREDS)
-               put_cred(req->creds);
-       if (req->flags & REQ_F_ASYNC_DATA) {
-               kfree(req->async_data);
-               req->async_data = NULL;
-       }
-       req->flags &= ~IO_REQ_CLEAN_FLAGS;
-}
-
 static bool io_assign_file(struct io_kiocb *req, const struct io_issue_def *def,
                           unsigned int issue_flags)
 {
@@ -1986,9 +1884,14 @@ int io_poll_issue(struct io_kiocb *req, struct io_tw_state *ts)
 struct io_wq_work *io_wq_free_work(struct io_wq_work *work)
 {
        struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+       struct io_kiocb *nxt = NULL;
 
-       req = io_put_req_find_next(req);
-       return req ? &req->work : NULL;
+       if (req_ref_put_and_test(req)) {
+               if (req->flags & IO_REQ_LINK_FLAGS)
+                       nxt = io_req_find_next(req);
+               io_free_req(req);
+       }
+       return nxt ? &nxt->work : NULL;
 }
 
 void io_wq_submit_work(struct io_wq_work *work)
@@ -2060,19 +1963,17 @@ inline struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
                                      unsigned int issue_flags)
 {
        struct io_ring_ctx *ctx = req->ctx;
+       struct io_fixed_file *slot;
        struct file *file = NULL;
-       unsigned long file_ptr;
 
        io_ring_submit_lock(ctx, issue_flags);
 
        if (unlikely((unsigned int)fd >= ctx->nr_user_files))
                goto out;
        fd = array_index_nospec(fd, ctx->nr_user_files);
-       file_ptr = io_fixed_file_slot(&ctx->file_table, fd)->file_ptr;
-       file = (struct file *) (file_ptr & FFS_MASK);
-       file_ptr &= ~FFS_MASK;
-       /* mask in overlapping REQ_F and FFS bits */
-       req->flags |= (file_ptr << REQ_F_SUPPORT_NOWAIT_BIT);
+       slot = io_fixed_file_slot(&ctx->file_table, fd);
+       file = io_slot_file(slot);
+       req->flags |= io_slot_flags(slot);
        io_req_set_rsrc_node(req, ctx, 0);
 out:
        io_ring_submit_unlock(ctx, issue_flags);
@@ -2709,11 +2610,96 @@ static void io_mem_free(void *ptr)
                free_compound_page(page);
 }
 
+static void io_pages_free(struct page ***pages, int npages)
+{
+       struct page **page_array;
+       int i;
+
+       if (!pages)
+               return;
+       page_array = *pages;
+       for (i = 0; i < npages; i++)
+               unpin_user_page(page_array[i]);
+       kvfree(page_array);
+       *pages = NULL;
+}
+
+static void *__io_uaddr_map(struct page ***pages, unsigned short *npages,
+                           unsigned long uaddr, size_t size)
+{
+       struct page **page_array;
+       unsigned int nr_pages;
+       int ret;
+
+       *npages = 0;
+
+       if (uaddr & (PAGE_SIZE - 1) || !size)
+               return ERR_PTR(-EINVAL);
+
+       nr_pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+       if (nr_pages > USHRT_MAX)
+               return ERR_PTR(-EINVAL);
+       page_array = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
+       if (!page_array)
+               return ERR_PTR(-ENOMEM);
+
+       ret = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE | FOLL_LONGTERM,
+                                       page_array);
+       if (ret != nr_pages) {
+err:
+               io_pages_free(&page_array, ret > 0 ? ret : 0);
+               return ret < 0 ? ERR_PTR(ret) : ERR_PTR(-EFAULT);
+       }
+       /*
+        * Should be a single page. If the ring is small enough that we can
+        * use a normal page, that is fine. If we need multiple pages, then
+        * userspace should use a huge page. That's the only way to guarantee
+        * that we get contiguous memory, outside of just being lucky or
+        * (currently) having low memory fragmentation.
+        */
+       if (page_array[0] != page_array[ret - 1])
+               goto err;
+       *pages = page_array;
+       *npages = nr_pages;
+       return page_to_virt(page_array[0]);
+}
+
+static void *io_rings_map(struct io_ring_ctx *ctx, unsigned long uaddr,
+                         size_t size)
+{
+       return __io_uaddr_map(&ctx->ring_pages, &ctx->n_ring_pages, uaddr,
+                               size);
+}
+
+static void *io_sqes_map(struct io_ring_ctx *ctx, unsigned long uaddr,
+                        size_t size)
+{
+       return __io_uaddr_map(&ctx->sqe_pages, &ctx->n_sqe_pages, uaddr,
+                               size);
+}
+
+static void io_rings_free(struct io_ring_ctx *ctx)
+{
+       if (!(ctx->flags & IORING_SETUP_NO_MMAP)) {
+               io_mem_free(ctx->rings);
+               io_mem_free(ctx->sq_sqes);
+               ctx->rings = NULL;
+               ctx->sq_sqes = NULL;
+       } else {
+               io_pages_free(&ctx->ring_pages, ctx->n_ring_pages);
+               io_pages_free(&ctx->sqe_pages, ctx->n_sqe_pages);
+       }
+}
+
 static void *io_mem_alloc(size_t size)
 {
        gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP;
+       void *ret;
 
-       return (void *) __get_free_pages(gfp, get_order(size));
+       ret = (void *) __get_free_pages(gfp, get_order(size));
+       if (ret)
+               return ret;
+       return ERR_PTR(-ENOMEM);
 }
 
 static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries,
@@ -2869,8 +2855,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
                mmdrop(ctx->mm_account);
                ctx->mm_account = NULL;
        }
-       io_mem_free(ctx->rings);
-       io_mem_free(ctx->sq_sqes);
+       io_rings_free(ctx);
 
        percpu_ref_exit(&ctx->refs);
        free_uid(ctx->user);
@@ -3050,7 +3035,18 @@ static __cold void io_ring_exit_work(struct work_struct *work)
                        /* there is little hope left, don't run it too often */
                        interval = HZ * 60;
                }
-       } while (!wait_for_completion_timeout(&ctx->ref_comp, interval));
+               /*
+                * This is really an uninterruptible wait, as it has to be
+                * complete. But it's also run from a kworker, which doesn't
+                * take signals, so it's fine to make it interruptible. This
+                * avoids scenarios where we knowingly can wait much longer
+                * on completions, for example if someone does a SIGSTOP on
+                * a task that needs to finish task_work to make this loop
+                * complete. That's a synthetic situation that should not
+                * cause a stuck task backtrace, and hence a potential panic
+                * on stuck tasks if that is enabled.
+                */
+       } while (!wait_for_completion_interruptible_timeout(&ctx->ref_comp, interval));
 
        init_completion(&exit.completion);
        init_task_work(&exit.task_work, io_tctx_exit_cb);
@@ -3074,7 +3070,12 @@ static __cold void io_ring_exit_work(struct work_struct *work)
                        continue;
 
                mutex_unlock(&ctx->uring_lock);
-               wait_for_completion(&exit.completion);
+               /*
+                * See comment above for
+                * wait_for_completion_interruptible_timeout() on why this
+                * wait is marked as interruptible.
+                */
+               wait_for_completion_interruptible(&exit.completion);
                mutex_lock(&ctx->uring_lock);
        }
        mutex_unlock(&ctx->uring_lock);
@@ -3348,6 +3349,10 @@ static void *io_uring_validate_mmap_request(struct file *file,
        struct page *page;
        void *ptr;
 
+       /* Don't allow mmap if the ring was setup without it */
+       if (ctx->flags & IORING_SETUP_NO_MMAP)
+               return ERR_PTR(-EINVAL);
+
        switch (offset & IORING_OFF_MMAP_MASK) {
        case IORING_OFF_SQ_RING:
        case IORING_OFF_CQ_RING:
@@ -3673,6 +3678,7 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
 {
        struct io_rings *rings;
        size_t size, sq_array_offset;
+       void *ptr;
 
        /* make sure these are sane, as we already accounted them */
        ctx->sq_entries = p->sq_entries;
@@ -3682,9 +3688,13 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
        if (size == SIZE_MAX)
                return -EOVERFLOW;
 
-       rings = io_mem_alloc(size);
-       if (!rings)
-               return -ENOMEM;
+       if (!(ctx->flags & IORING_SETUP_NO_MMAP))
+               rings = io_mem_alloc(size);
+       else
+               rings = io_rings_map(ctx, p->cq_off.user_addr, size);
+
+       if (IS_ERR(rings))
+               return PTR_ERR(rings);
 
        ctx->rings = rings;
        ctx->sq_array = (u32 *)((char *)rings + sq_array_offset);
@@ -3698,34 +3708,31 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
        else
                size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
        if (size == SIZE_MAX) {
-               io_mem_free(ctx->rings);
-               ctx->rings = NULL;
+               io_rings_free(ctx);
                return -EOVERFLOW;
        }
 
-       ctx->sq_sqes = io_mem_alloc(size);
-       if (!ctx->sq_sqes) {
-               io_mem_free(ctx->rings);
-               ctx->rings = NULL;
-               return -ENOMEM;
+       if (!(ctx->flags & IORING_SETUP_NO_MMAP))
+               ptr = io_mem_alloc(size);
+       else
+               ptr = io_sqes_map(ctx, p->sq_off.user_addr, size);
+
+       if (IS_ERR(ptr)) {
+               io_rings_free(ctx);
+               return PTR_ERR(ptr);
        }
 
+       ctx->sq_sqes = ptr;
        return 0;
 }
 
-static int io_uring_install_fd(struct io_ring_ctx *ctx, struct file *file)
+static int io_uring_install_fd(struct file *file)
 {
-       int ret, fd;
+       int fd;
 
        fd = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
        if (fd < 0)
                return fd;
-
-       ret = __io_uring_add_tctx_node(ctx);
-       if (ret) {
-               put_unused_fd(fd);
-               return ret;
-       }
        fd_install(fd, file);
        return fd;
 }
@@ -3765,6 +3772,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
                                  struct io_uring_params __user *params)
 {
        struct io_ring_ctx *ctx;
+       struct io_uring_task *tctx;
        struct file *file;
        int ret;
 
@@ -3776,6 +3784,10 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
                entries = IORING_MAX_ENTRIES;
        }
 
+       if ((p->flags & IORING_SETUP_REGISTERED_FD_ONLY)
+           && !(p->flags & IORING_SETUP_NO_MMAP))
+               return -EINVAL;
+
        /*
         * Use twice as many entries for the CQ ring. It's possible for the
         * application to drive a higher depth than the size of the SQ ring,
@@ -3887,7 +3899,6 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
        if (ret)
                goto err;
 
-       memset(&p->sq_off, 0, sizeof(p->sq_off));
        p->sq_off.head = offsetof(struct io_rings, sq.head);
        p->sq_off.tail = offsetof(struct io_rings, sq.tail);
        p->sq_off.ring_mask = offsetof(struct io_rings, sq_ring_mask);
@@ -3895,8 +3906,10 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
        p->sq_off.flags = offsetof(struct io_rings, sq_flags);
        p->sq_off.dropped = offsetof(struct io_rings, sq_dropped);
        p->sq_off.array = (char *)ctx->sq_array - (char *)ctx->rings;
+       p->sq_off.resv1 = 0;
+       if (!(ctx->flags & IORING_SETUP_NO_MMAP))
+               p->sq_off.user_addr = 0;
 
-       memset(&p->cq_off, 0, sizeof(p->cq_off));
        p->cq_off.head = offsetof(struct io_rings, cq.head);
        p->cq_off.tail = offsetof(struct io_rings, cq.tail);
        p->cq_off.ring_mask = offsetof(struct io_rings, cq_ring_mask);
@@ -3904,6 +3917,9 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
        p->cq_off.overflow = offsetof(struct io_rings, cq_overflow);
        p->cq_off.cqes = offsetof(struct io_rings, cqes);
        p->cq_off.flags = offsetof(struct io_rings, cq_flags);
+       p->cq_off.resv1 = 0;
+       if (!(ctx->flags & IORING_SETUP_NO_MMAP))
+               p->cq_off.user_addr = 0;
 
        p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP |
                        IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS |
@@ -3928,22 +3944,30 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
                goto err;
        }
 
+       ret = __io_uring_add_tctx_node(ctx);
+       if (ret)
+               goto err_fput;
+       tctx = current->io_uring;
+
        /*
         * Install ring fd as the very last thing, so we don't risk someone
         * having closed it before we finish setup
         */
-       ret = io_uring_install_fd(ctx, file);
-       if (ret < 0) {
-               /* fput will clean it up */
-               fput(file);
-               return ret;
-       }
+       if (p->flags & IORING_SETUP_REGISTERED_FD_ONLY)
+               ret = io_ring_add_registered_file(tctx, file, 0, IO_RINGFD_REG_MAX);
+       else
+               ret = io_uring_install_fd(file);
+       if (ret < 0)
+               goto err_fput;
 
        trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags);
        return ret;
 err:
        io_ring_ctx_wait_and_kill(ctx);
        return ret;
+err_fput:
+       fput(file);
+       return ret;
 }
 
 /*
@@ -3969,7 +3993,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
                        IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL |
                        IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG |
                        IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
-                       IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN))
+                       IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
+                       IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY))
                return -EINVAL;
 
        return io_uring_create(entries, &p, params);
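A minimal userspace sketch (illustrative only, not part of the patch) of the new IORING_SETUP_NO_MMAP path added above: the application pre-allocates the ring and SQE memory itself and passes the addresses in cq_off.user_addr and sq_off.user_addr, which io_rings_map()/io_sqes_map() pin with pin_user_pages_fast(). Huge-page backing is used so the region is physically contiguous, as the __io_uaddr_map() comment requires; the sizes here are placeholder assumptions and a real application must size the regions from the entry counts and returned offsets. IORING_SETUP_REGISTERED_FD_ONLY (which, per the check in io_uring_create(), requires NO_MMAP) would additionally make the call return a registered-ring index instead of a file descriptor.

    #include <linux/io_uring.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int setup_no_mmap_ring(unsigned int entries)
    {
            /* Assumption: one 2 MiB huge page per region is large enough for 'entries'. */
            size_t huge = 2 * 1024 * 1024;
            void *rings = mmap(NULL, huge, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            void *sqes = mmap(NULL, huge, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            struct io_uring_params p;

            if (rings == MAP_FAILED || sqes == MAP_FAILED)
                    return -1;

            memset(&p, 0, sizeof(p));
            p.flags = IORING_SETUP_NO_MMAP;
            /* The kernel reads these user-provided addresses in io_allocate_scq_urings(). */
            p.cq_off.user_addr = (uint64_t)(uintptr_t)rings;
            p.sq_off.user_addr = (uint64_t)(uintptr_t)sqes;

            return (int)syscall(__NR_io_uring_setup, entries, &p);
    }
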
index 259bf79..d3606d3 100644 (file)
@@ -16,9 +16,6 @@
 #endif
 
 enum {
-       /* don't use deferred task_work */
-       IOU_F_TWQ_FORCE_NORMAL                  = 1,
-
        /*
         * A hint to not wake right away but delay until there are enough of
         * tw's queued to match the number of CQEs the task is waiting for.
@@ -26,7 +23,7 @@ enum {
         * Must not be used with requests generating more than one CQE.
         * It's also ignored unless IORING_SETUP_DEFER_TASKRUN is set.
         */
-       IOU_F_TWQ_LAZY_WAKE                     = 2,
+       IOU_F_TWQ_LAZY_WAKE                     = 1,
 };
 
 enum {
@@ -47,7 +44,7 @@ int io_run_task_work_sig(struct io_ring_ctx *ctx);
 void io_req_defer_failed(struct io_kiocb *req, s32 res);
 void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags);
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
-bool io_aux_cqe(struct io_ring_ctx *ctx, bool defer, u64 user_data, s32 res, u32 cflags,
+bool io_aux_cqe(const struct io_kiocb *req, bool defer, s32 res, u32 cflags,
                bool allow_overflow);
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
 
@@ -57,11 +54,6 @@ struct file *io_file_get_normal(struct io_kiocb *req, int fd);
 struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
                               unsigned issue_flags);
 
-static inline bool io_req_ffs_set(struct io_kiocb *req)
-{
-       return req->flags & REQ_F_FIXED_FILE;
-}
-
 void __io_req_task_work_add(struct io_kiocb *req, unsigned flags);
 bool io_is_uring_fops(struct file *file);
 bool io_alloc_async_data(struct io_kiocb *req);
@@ -75,6 +67,9 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd);
 int io_uring_alloc_task_context(struct task_struct *task,
                                struct io_ring_ctx *ctx);
 
+int io_ring_add_registered_file(struct io_uring_task *tctx, struct file *file,
+                                    int start, int end);
+
 int io_poll_issue(struct io_kiocb *req, struct io_tw_state *ts);
 int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr);
 int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin);
@@ -115,8 +110,6 @@ static inline void io_req_task_work_add(struct io_kiocb *req)
 #define io_for_each_link(pos, head) \
        for (pos = (head); pos; pos = pos->link)
 
-void io_cq_unlock_post(struct io_ring_ctx *ctx);
-
 static inline struct io_uring_cqe *io_get_cqe_overflow(struct io_ring_ctx *ctx,
                                                       bool overflow)
 {
index 85fd7ce..cd6dcf6 100644 (file)
@@ -162,14 +162,12 @@ static struct file *io_msg_grab_file(struct io_kiocb *req, unsigned int issue_fl
        struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
        struct io_ring_ctx *ctx = req->ctx;
        struct file *file = NULL;
-       unsigned long file_ptr;
        int idx = msg->src_fd;
 
        io_ring_submit_lock(ctx, issue_flags);
        if (likely(idx < ctx->nr_user_files)) {
                idx = array_index_nospec(idx, ctx->nr_user_files);
-               file_ptr = io_fixed_file_slot(&ctx->file_table, idx)->file_ptr;
-               file = (struct file *) (file_ptr & FFS_MASK);
+               file = io_file_from_index(&ctx->file_table, idx);
                if (file)
                        get_file(file);
        }
index 4b8e847..d7e6efe 100644 (file)
@@ -625,9 +625,15 @@ static inline void io_recv_prep_retry(struct io_kiocb *req)
  * again (for multishot).
  */
 static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
-                                 unsigned int cflags, bool mshot_finished,
+                                 struct msghdr *msg, bool mshot_finished,
                                  unsigned issue_flags)
 {
+       unsigned int cflags;
+
+       cflags = io_put_kbuf(req, issue_flags);
+       if (msg->msg_inq && msg->msg_inq != -1U)
+               cflags |= IORING_CQE_F_SOCK_NONEMPTY;
+
        if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
                io_req_set_res(req, *ret, cflags);
                *ret = IOU_OK;
@@ -635,10 +641,18 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
        }
 
        if (!mshot_finished) {
-               if (io_aux_cqe(req->ctx, issue_flags & IO_URING_F_COMPLETE_DEFER,
-                              req->cqe.user_data, *ret, cflags | IORING_CQE_F_MORE, true)) {
+               if (io_aux_cqe(req, issue_flags & IO_URING_F_COMPLETE_DEFER,
+                              *ret, cflags | IORING_CQE_F_MORE, true)) {
                        io_recv_prep_retry(req);
-                       return false;
+                       /* Known not-empty or unknown state, retry */
+                       if (cflags & IORING_CQE_F_SOCK_NONEMPTY ||
+                           msg->msg_inq == -1U)
+                               return false;
+                       if (issue_flags & IO_URING_F_MULTISHOT)
+                               *ret = IOU_ISSUE_SKIP_COMPLETE;
+                       else
+                               *ret = -EAGAIN;
+                       return true;
                }
                /* Otherwise stop multishot but use the current result. */
        }
@@ -741,7 +755,6 @@ int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags)
        struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
        struct io_async_msghdr iomsg, *kmsg;
        struct socket *sock;
-       unsigned int cflags;
        unsigned flags;
        int ret, min_ret = 0;
        bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
@@ -792,6 +805,7 @@ retry_multishot:
                flags |= MSG_DONTWAIT;
 
        kmsg->msg.msg_get_inq = 1;
+       kmsg->msg.msg_inq = -1U;
        if (req->flags & REQ_F_APOLL_MULTISHOT) {
                ret = io_recvmsg_multishot(sock, sr, kmsg, flags,
                                           &mshot_finished);
@@ -832,11 +846,7 @@ retry_multishot:
        else
                io_kbuf_recycle(req, issue_flags);
 
-       cflags = io_put_kbuf(req, issue_flags);
-       if (kmsg->msg.msg_inq)
-               cflags |= IORING_CQE_F_SOCK_NONEMPTY;
-
-       if (!io_recv_finish(req, &ret, cflags, mshot_finished, issue_flags))
+       if (!io_recv_finish(req, &ret, &kmsg->msg, mshot_finished, issue_flags))
                goto retry_multishot;
 
        if (mshot_finished) {
@@ -855,7 +865,6 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
        struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
        struct msghdr msg;
        struct socket *sock;
-       unsigned int cflags;
        unsigned flags;
        int ret, min_ret = 0;
        bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
@@ -872,6 +881,14 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
        if (unlikely(!sock))
                return -ENOTSOCK;
 
+       msg.msg_name = NULL;
+       msg.msg_namelen = 0;
+       msg.msg_control = NULL;
+       msg.msg_get_inq = 1;
+       msg.msg_controllen = 0;
+       msg.msg_iocb = NULL;
+       msg.msg_ubuf = NULL;
+
 retry_multishot:
        if (io_do_buffer_select(req)) {
                void __user *buf;
@@ -886,14 +903,8 @@ retry_multishot:
        if (unlikely(ret))
                goto out_free;
 
-       msg.msg_name = NULL;
-       msg.msg_namelen = 0;
-       msg.msg_control = NULL;
-       msg.msg_get_inq = 1;
+       msg.msg_inq = -1U;
        msg.msg_flags = 0;
-       msg.msg_controllen = 0;
-       msg.msg_iocb = NULL;
-       msg.msg_ubuf = NULL;
 
        flags = sr->msg_flags;
        if (force_nonblock)
@@ -933,11 +944,7 @@ out_free:
        else
                io_kbuf_recycle(req, issue_flags);
 
-       cflags = io_put_kbuf(req, issue_flags);
-       if (msg.msg_inq)
-               cflags |= IORING_CQE_F_SOCK_NONEMPTY;
-
-       if (!io_recv_finish(req, &ret, cflags, ret <= 0, issue_flags))
+       if (!io_recv_finish(req, &ret, &msg, ret <= 0, issue_flags))
                goto retry_multishot;
 
        return ret;
@@ -1310,7 +1317,6 @@ int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 int io_accept(struct io_kiocb *req, unsigned int issue_flags)
 {
-       struct io_ring_ctx *ctx = req->ctx;
        struct io_accept *accept = io_kiocb_to_cmd(req, struct io_accept);
        bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
        unsigned int file_flags = force_nonblock ? O_NONBLOCK : 0;
@@ -1360,8 +1366,8 @@ retry:
 
        if (ret < 0)
                return ret;
-       if (io_aux_cqe(ctx, issue_flags & IO_URING_F_COMPLETE_DEFER,
-                      req->cqe.user_data, ret, IORING_CQE_F_MORE, true))
+       if (io_aux_cqe(req, issue_flags & IO_URING_F_COMPLETE_DEFER, ret,
+                      IORING_CQE_F_MORE, true))
                goto retry;
 
        return -ECANCELED;
index a78b8af..d4597ef 100644 (file)
@@ -300,8 +300,8 @@ static int io_poll_check_events(struct io_kiocb *req, struct io_tw_state *ts)
                        __poll_t mask = mangle_poll(req->cqe.res &
                                                    req->apoll_events);
 
-                       if (!io_aux_cqe(req->ctx, ts->locked, req->cqe.user_data,
-                                       mask, IORING_CQE_F_MORE, false)) {
+                       if (!io_aux_cqe(req, ts->locked, mask,
+                                       IORING_CQE_F_MORE, false)) {
                                io_req_set_res(req, mask, 0);
                                return IOU_POLL_REMOVE_POLL_USE_RES;
                        }
@@ -326,7 +326,7 @@ static int io_poll_check_events(struct io_kiocb *req, struct io_tw_state *ts)
        return IOU_POLL_NO_ACTION;
 }
 
-static void io_poll_task_func(struct io_kiocb *req, struct io_tw_state *ts)
+void io_poll_task_func(struct io_kiocb *req, struct io_tw_state *ts)
 {
        int ret;
 
index b2393b4..ff4d5d7 100644 (file)
@@ -38,3 +38,5 @@ bool io_poll_remove_all(struct io_ring_ctx *ctx, struct task_struct *tsk,
                        bool cancel_all);
 
 void io_apoll_cache_free(struct io_cache_entry *entry);
+
+void io_poll_task_func(struct io_kiocb *req, struct io_tw_state *ts);
index d46f72a..5e8fdd9 100644 (file)
@@ -354,7 +354,6 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
        __s32 __user *fds = u64_to_user_ptr(up->data);
        struct io_rsrc_data *data = ctx->file_data;
        struct io_fixed_file *file_slot;
-       struct file *file;
        int fd, i, err = 0;
        unsigned int done;
 
@@ -382,15 +381,16 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
                file_slot = io_fixed_file_slot(&ctx->file_table, i);
 
                if (file_slot->file_ptr) {
-                       file = (struct file *)(file_slot->file_ptr & FFS_MASK);
-                       err = io_queue_rsrc_removal(data, i, file);
+                       err = io_queue_rsrc_removal(data, i,
+                                                   io_slot_file(file_slot));
                        if (err)
                                break;
                        file_slot->file_ptr = 0;
                        io_file_bitmap_clear(&ctx->file_table, i);
                }
                if (fd != -1) {
-                       file = fget(fd);
+                       struct file *file = fget(fd);
+
                        if (!file) {
                                err = -EBADF;
                                break;
@@ -1030,9 +1030,8 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
 struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages)
 {
        unsigned long start, end, nr_pages;
-       struct vm_area_struct **vmas = NULL;
        struct page **pages = NULL;
-       int i, pret, ret = -ENOMEM;
+       int pret, ret = -ENOMEM;
 
        end = (ubuf + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
        start = ubuf >> PAGE_SHIFT;
@@ -1042,45 +1041,24 @@ struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages)
        if (!pages)
                goto done;
 
-       vmas = kvmalloc_array(nr_pages, sizeof(struct vm_area_struct *),
-                             GFP_KERNEL);
-       if (!vmas)
-               goto done;
-
        ret = 0;
        mmap_read_lock(current->mm);
        pret = pin_user_pages(ubuf, nr_pages, FOLL_WRITE | FOLL_LONGTERM,
-                             pages, vmas);
-       if (pret == nr_pages) {
-               /* don't support file backed memory */
-               for (i = 0; i < nr_pages; i++) {
-                       struct vm_area_struct *vma = vmas[i];
-
-                       if (vma_is_shmem(vma))
-                               continue;
-                       if (vma->vm_file &&
-                           !is_file_hugepages(vma->vm_file)) {
-                               ret = -EOPNOTSUPP;
-                               break;
-                       }
-               }
+                             pages);
+       if (pret == nr_pages)
                *npages = nr_pages;
-       } else {
+       else
                ret = pret < 0 ? pret : -EFAULT;
-       }
+
        mmap_read_unlock(current->mm);
        if (ret) {
-               /*
-                * if we did partial map, or found file backed vmas,
-                * release any pages we did get
-                */
+               /* if we did partial map, release any pages we did get */
                if (pret > 0)
                        unpin_user_pages(pages, pret);
                goto done;
        }
        ret = 0;
 done:
-       kvfree(vmas);
        if (ret < 0) {
                kvfree(pages);
                pages = ERR_PTR(ret);
index 3f118ed..1bce220 100644 (file)
@@ -283,7 +283,7 @@ static inline int io_fixup_rw_res(struct io_kiocb *req, long res)
        return res;
 }
 
-static void io_req_rw_complete(struct io_kiocb *req, struct io_tw_state *ts)
+void io_req_rw_complete(struct io_kiocb *req, struct io_tw_state *ts)
 {
        io_req_io_end(req);
 
@@ -666,8 +666,8 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
        if (unlikely(!file || !(file->f_mode & mode)))
                return -EBADF;
 
-       if (!io_req_ffs_set(req))
-               req->flags |= io_file_get_flags(file) << REQ_F_SUPPORT_NOWAIT_BIT;
+       if (!(req->flags & REQ_F_FIXED_FILE))
+               req->flags |= io_file_get_flags(file);
 
        kiocb->ki_flags = file->f_iocb_flags;
        ret = kiocb_set_rw_flags(kiocb, rw->flags);
index 3b733f4..4b89f96 100644 (file)
@@ -22,3 +22,4 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags);
 int io_writev_prep_async(struct io_kiocb *req);
 void io_readv_writev_cleanup(struct io_kiocb *req);
 void io_rw_fail(struct io_kiocb *req);
+void io_req_rw_complete(struct io_kiocb *req, struct io_tw_state *ts);
index 3a8d1dd..c043fe9 100644 (file)
@@ -208,31 +208,40 @@ void io_uring_unreg_ringfd(void)
        }
 }
 
-static int io_ring_add_registered_fd(struct io_uring_task *tctx, int fd,
+int io_ring_add_registered_file(struct io_uring_task *tctx, struct file *file,
                                     int start, int end)
 {
-       struct file *file;
        int offset;
-
        for (offset = start; offset < end; offset++) {
                offset = array_index_nospec(offset, IO_RINGFD_REG_MAX);
                if (tctx->registered_rings[offset])
                        continue;
 
-               file = fget(fd);
-               if (!file) {
-                       return -EBADF;
-               } else if (!io_is_uring_fops(file)) {
-                       fput(file);
-                       return -EOPNOTSUPP;
-               }
                tctx->registered_rings[offset] = file;
                return offset;
        }
-
        return -EBUSY;
 }
 
+static int io_ring_add_registered_fd(struct io_uring_task *tctx, int fd,
+                                    int start, int end)
+{
+       struct file *file;
+       int offset;
+
+       file = fget(fd);
+       if (!file) {
+               return -EBADF;
+       } else if (!io_is_uring_fops(file)) {
+               fput(file);
+               return -EOPNOTSUPP;
+       }
+       offset = io_ring_add_registered_file(tctx, file, start, end);
+       if (offset < 0)
+               fput(file);
+       return offset;
+}
+
 /*
  * Register a ring fd to avoid fdget/fdput for each io_uring_enter()
  * invocation. User passes in an array of struct io_uring_rsrc_update
index fc95017..fb0547b 100644 (file)
@@ -73,8 +73,8 @@ static void io_timeout_complete(struct io_kiocb *req, struct io_tw_state *ts)
 
        if (!io_timeout_finish(timeout, data)) {
                bool filled;
-               filled = io_aux_cqe(ctx, ts->locked, req->cqe.user_data, -ETIME,
-                                   IORING_CQE_F_MORE, false);
+               filled = io_aux_cqe(req, ts->locked, -ETIME, IORING_CQE_F_MORE,
+                                   false);
                if (filled) {
                        /* re-arm timer */
                        spin_lock_irq(&ctx->timeout_lock);
@@ -594,7 +594,7 @@ int io_timeout(struct io_kiocb *req, unsigned int issue_flags)
                goto add;
        }
 
-       tail = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts);
+       tail = data_race(ctx->cached_cq_tail) - atomic_read(&ctx->cq_timeouts);
        timeout->target_seq = tail + off;
 
        /* Update the last seq here in case io_flush_timeouts() hasn't.
index 5e32db4..476c787 100644 (file)
@@ -20,16 +20,24 @@ static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts)
        ioucmd->task_work_cb(ioucmd, issue_flags);
 }
 
-void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
-                       void (*task_work_cb)(struct io_uring_cmd *, unsigned))
+void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
+                       void (*task_work_cb)(struct io_uring_cmd *, unsigned),
+                       unsigned flags)
 {
        struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
 
        ioucmd->task_work_cb = task_work_cb;
        req->io_task_work.func = io_uring_cmd_work;
-       io_req_task_work_add(req);
+       __io_req_task_work_add(req, flags);
+}
+EXPORT_SYMBOL_GPL(__io_uring_cmd_do_in_task);
+
+void io_uring_cmd_do_in_task_lazy(struct io_uring_cmd *ioucmd,
+                       void (*task_work_cb)(struct io_uring_cmd *, unsigned))
+{
+       __io_uring_cmd_do_in_task(ioucmd, task_work_cb, IOU_F_TWQ_LAZY_WAKE);
 }
-EXPORT_SYMBOL_GPL(io_uring_cmd_complete_in_task);
+EXPORT_SYMBOL_GPL(io_uring_cmd_do_in_task_lazy);
 
 static inline void io_req_set_cqe32_extra(struct io_kiocb *req,
                                          u64 extra1, u64 extra2)
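A hypothetical driver-side sketch (not from the patch) of how a ->uring_cmd() implementation might use the renamed helper above: completion work is punted from interrupt context to the submitting task with io_uring_cmd_do_in_task_lazy(), and IOU_F_TWQ_LAZY_WAKE lets DEFER_TASKRUN rings batch wakeups. The callback and interrupt-handler names are invented; io_uring_cmd_done() is assumed to keep its current four-argument signature.

    #include <linux/io_uring.h>

    static void my_cmd_tw_cb(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
    {
            /* Runs in the submitting task's context; safe to post the CQE here. */
            io_uring_cmd_done(ioucmd, 0, 0, issue_flags);
    }

    static void my_completion_irq(struct io_uring_cmd *ioucmd)
    {
            /* Hard-IRQ/bottom-half context: defer completion to task work. */
            io_uring_cmd_do_in_task_lazy(ioucmd, my_cmd_tw_cb);
    }
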
index b69c953..3947122 100644 (file)
@@ -10,7 +10,7 @@ obj-y     = fork.o exec_domain.o panic.o \
            extable.o params.o \
            kthread.o sys_ni.o nsproxy.o \
            notifier.o ksysfs.o cred.o reboot.o \
-           async.o range.o smpboot.o ucount.o regset.o
+           async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
 
 obj-$(CONFIG_USERMODE_DRIVER) += usermode_driver.o
 obj-$(CONFIG_MULTIUSER) += groups.o
@@ -91,7 +91,8 @@ obj-$(CONFIG_FAIL_FUNCTION) += fail_function.o
 obj-$(CONFIG_KGDB) += debug/
 obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
-obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
+obj-$(CONFIG_HARDLOCKUP_DETECTOR_BUDDY) += watchdog_buddy.o
+obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_perf.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
index c57b008..94738bc 100644 (file)
@@ -259,8 +259,8 @@ extern struct tty_struct *audit_get_tty(void);
 extern void audit_put_tty(struct tty_struct *tty);
 
 /* audit watch/mark/tree functions */
-#ifdef CONFIG_AUDITSYSCALL
 extern unsigned int audit_serial(void);
+#ifdef CONFIG_AUDITSYSCALL
 extern int auditsc_get_stamp(struct audit_context *ctx,
                              struct timespec64 *t, unsigned int *serial);
 
index 3e058f4..1a27951 100644 (file)
@@ -467,6 +467,7 @@ EXPORT_SYMBOL(file_ns_capable);
 /**
  * privileged_wrt_inode_uidgid - Do capabilities in the namespace work over the inode?
  * @ns: The user namespace in question
+ * @idmap: idmap of the mount @inode was found from
  * @inode: The inode in question
  *
  * Return true if the inode uid and gid are within the namespace.
@@ -481,6 +482,7 @@ bool privileged_wrt_inode_uidgid(struct user_namespace *ns,
 
 /**
  * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped
+ * @idmap: idmap of the mount @inode was found from
  * @inode: The inode in question
  * @cap: The capability in question
  *
index 367b0a4..c56071f 100644 (file)
@@ -220,8 +220,6 @@ static inline void get_css_set(struct css_set *cset)
 
 bool cgroup_ssid_enabled(int ssid);
 bool cgroup_on_dfl(const struct cgroup *cgrp);
-bool cgroup_is_thread_root(struct cgroup *cgrp);
-bool cgroup_is_threaded(struct cgroup *cgrp);
 
 struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root);
 struct cgroup *task_cgroup_from_root(struct task_struct *task,
index 5407241..8304431 100644 (file)
@@ -563,7 +563,7 @@ static ssize_t cgroup_release_agent_write(struct kernfs_open_file *of,
        if (!cgrp)
                return -ENODEV;
        spin_lock(&release_agent_path_lock);
-       strlcpy(cgrp->root->release_agent_path, strstrip(buf),
+       strscpy(cgrp->root->release_agent_path, strstrip(buf),
                sizeof(cgrp->root->release_agent_path));
        spin_unlock(&release_agent_path_lock);
        cgroup_kn_unlock(of->kn);
@@ -797,7 +797,7 @@ void cgroup1_release_agent(struct work_struct *work)
                goto out_free;
 
        spin_lock(&release_agent_path_lock);
-       strlcpy(agentbuf, cgrp->root->release_agent_path, PATH_MAX);
+       strscpy(agentbuf, cgrp->root->release_agent_path, PATH_MAX);
        spin_unlock(&release_agent_path_lock);
        if (!agentbuf[0])
                goto out_free;
index 4d42f0c..bfe3cd8 100644 (file)
@@ -57,6 +57,7 @@
 #include <linux/file.h>
 #include <linux/fs_parser.h>
 #include <linux/sched/cputime.h>
+#include <linux/sched/deadline.h>
 #include <linux/psi.h>
 #include <net/sock.h>
 
@@ -312,8 +313,6 @@ bool cgroup_ssid_enabled(int ssid)
  *   masks of ancestors.
  *
  * - blkcg: blk-throttle becomes properly hierarchical.
- *
- * - debug: disallowed on the default hierarchy.
  */
 bool cgroup_on_dfl(const struct cgroup *cgrp)
 {
@@ -356,7 +355,7 @@ static bool cgroup_has_tasks(struct cgroup *cgrp)
        return cgrp->nr_populated_csets;
 }
 
-bool cgroup_is_threaded(struct cgroup *cgrp)
+static bool cgroup_is_threaded(struct cgroup *cgrp)
 {
        return cgrp->dom_cgrp != cgrp;
 }
@@ -395,7 +394,7 @@ static bool cgroup_can_be_thread_root(struct cgroup *cgrp)
 }
 
 /* is @cgrp root of a threaded subtree? */
-bool cgroup_is_thread_root(struct cgroup *cgrp)
+static bool cgroup_is_thread_root(struct cgroup *cgrp)
 {
        /* thread root should be a domain */
        if (cgroup_is_threaded(cgrp))
@@ -618,7 +617,7 @@ EXPORT_SYMBOL_GPL(cgroup_get_e_css);
 static void cgroup_get_live(struct cgroup *cgrp)
 {
        WARN_ON_ONCE(cgroup_is_dead(cgrp));
-       css_get(&cgrp->self);
+       cgroup_get(cgrp);
 }
 
 /**
@@ -690,21 +689,6 @@ EXPORT_SYMBOL_GPL(of_css);
                else
 
 /**
- * for_each_e_css - iterate all effective css's of a cgroup
- * @css: the iteration cursor
- * @ssid: the index of the subsystem, CGROUP_SUBSYS_COUNT after reaching the end
- * @cgrp: the target cgroup to iterate css's of
- *
- * Should be called under cgroup_[tree_]mutex.
- */
-#define for_each_e_css(css, ssid, cgrp)                                            \
-       for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT; (ssid)++)            \
-               if (!((css) = cgroup_e_css_by_mask(cgrp,                    \
-                                                  cgroup_subsys[(ssid)]))) \
-                       ;                                                   \
-               else
-
-/**
  * do_each_subsys_mask - filter for_each_subsys with a bitmask
  * @ss: the iteration cursor
  * @ssid: the index of @ss, CGROUP_SUBSYS_COUNT after reaching the end
@@ -2393,45 +2377,6 @@ int cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
 EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
 /**
- * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
- * @task: target task
- * @buf: the buffer to write the path into
- * @buflen: the length of the buffer
- *
- * Determine @task's cgroup on the first (the one with the lowest non-zero
- * hierarchy_id) cgroup hierarchy and copy its path into @buf.  This
- * function grabs cgroup_mutex and shouldn't be used inside locks used by
- * cgroup controller callbacks.
- *
- * Return value is the same as kernfs_path().
- */
-int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
-{
-       struct cgroup_root *root;
-       struct cgroup *cgrp;
-       int hierarchy_id = 1;
-       int ret;
-
-       cgroup_lock();
-       spin_lock_irq(&css_set_lock);
-
-       root = idr_get_next(&cgroup_hierarchy_idr, &hierarchy_id);
-
-       if (root) {
-               cgrp = task_cgroup_from_root(task, root);
-               ret = cgroup_path_ns_locked(cgrp, buf, buflen, &init_cgroup_ns);
-       } else {
-               /* if no hierarchy exists, everyone is in "/" */
-               ret = strscpy(buf, "/", buflen);
-       }
-
-       spin_unlock_irq(&css_set_lock);
-       cgroup_unlock();
-       return ret;
-}
-EXPORT_SYMBOL_GPL(task_cgroup_path);
-
-/**
  * cgroup_attach_lock - Lock for ->attach()
  * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
  *
@@ -2885,9 +2830,9 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
        struct task_struct *task;
 
        /*
-        * Prevent freeing of tasks while we take a snapshot. Tasks that are
-        * already PF_EXITING could be freed from underneath us unless we
-        * take an rcu_read_lock.
+        * The following thread iteration should be inside an RCU critical
+        * section to prevent tasks from being freed while taking the snapshot.
+        * spin_lock_irq() implies RCU critical section here.
         */
        spin_lock_irq(&css_set_lock);
        task = leader;
@@ -3891,6 +3836,14 @@ static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
        return psi_trigger_poll(&ctx->psi.trigger, of->file, pt);
 }
 
+static int cgroup_pressure_open(struct kernfs_open_file *of)
+{
+       if (of->file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE))
+               return -EPERM;
+
+       return 0;
+}
+
 static void cgroup_pressure_release(struct kernfs_open_file *of)
 {
        struct cgroup_file_ctx *ctx = of->priv;
@@ -5290,6 +5243,7 @@ static struct cftype cgroup_psi_files[] = {
        {
                .name = "io.pressure",
                .file_offset = offsetof(struct cgroup, psi_files[PSI_IO]),
+               .open = cgroup_pressure_open,
                .seq_show = cgroup_io_pressure_show,
                .write = cgroup_io_pressure_write,
                .poll = cgroup_pressure_poll,
@@ -5298,6 +5252,7 @@ static struct cftype cgroup_psi_files[] = {
        {
                .name = "memory.pressure",
                .file_offset = offsetof(struct cgroup, psi_files[PSI_MEM]),
+               .open = cgroup_pressure_open,
                .seq_show = cgroup_memory_pressure_show,
                .write = cgroup_memory_pressure_write,
                .poll = cgroup_pressure_poll,
@@ -5306,6 +5261,7 @@ static struct cftype cgroup_psi_files[] = {
        {
                .name = "cpu.pressure",
                .file_offset = offsetof(struct cgroup, psi_files[PSI_CPU]),
+               .open = cgroup_pressure_open,
                .seq_show = cgroup_cpu_pressure_show,
                .write = cgroup_cpu_pressure_write,
                .poll = cgroup_pressure_poll,
@@ -5315,6 +5271,7 @@ static struct cftype cgroup_psi_files[] = {
        {
                .name = "irq.pressure",
                .file_offset = offsetof(struct cgroup, psi_files[PSI_IRQ]),
+               .open = cgroup_pressure_open,
                .seq_show = cgroup_irq_pressure_show,
                .write = cgroup_irq_pressure_write,
                .poll = cgroup_pressure_poll,
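With cgroup_pressure_open() wired up above, opening a cgroup PSI file for writing now requires CAP_SYS_RESOURCE; read-only monitoring is unaffected. A hedged userspace sketch of registering a trigger, following the Documentation/accounting/psi.rst format (the cgroup path and thresholds are illustrative, error handling is minimal):

    #include <fcntl.h>
    #include <poll.h>
    #include <string.h>
    #include <unistd.h>

    static int watch_memory_pressure(void)
    {
            /* 150 ms of "some" memory stall within a 1 s window. */
            const char *trig = "some 150000 1000000";
            int fd = open("/sys/fs/cgroup/mygroup/memory.pressure",
                          O_RDWR | O_NONBLOCK);
            struct pollfd pfd = { .fd = fd, .events = POLLPRI };

            /* open() for write now fails with -EPERM without CAP_SYS_RESOURCE. */
            if (fd < 0 || write(fd, trig, strlen(trig) + 1) < 0)
                    return -1;
            return poll(&pfd, 1, -1);       /* wakes when the trigger fires */
    }
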
@@ -6696,6 +6653,9 @@ void cgroup_exit(struct task_struct *tsk)
        list_add_tail(&tsk->cg_list, &cset->dying_tasks);
        cset->nr_tasks--;
 
+       if (dl_task(tsk))
+               dec_dl_tasks_cs(tsk);
+
        WARN_ON_ONCE(cgroup_task_frozen(tsk));
        if (unlikely(!(tsk->flags & PF_KTHREAD) &&
                     test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags)))
index e4ca2dd..58e6f18 100644 (file)
 #include <linux/cpu.h>
 #include <linux/cpumask.h>
 #include <linux/cpuset.h>
-#include <linux/err.h>
-#include <linux/errno.h>
-#include <linux/file.h>
-#include <linux/fs.h>
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
-#include <linux/kmod.h>
-#include <linux/kthread.h>
-#include <linux/list.h>
 #include <linux/mempolicy.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
 #include <linux/export.h>
-#include <linux/mount.h>
-#include <linux/fs_context.h>
-#include <linux/namei.h>
-#include <linux/pagemap.h>
-#include <linux/proc_fs.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
 #include <linux/sched/deadline.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/task.h>
-#include <linux/seq_file.h>
 #include <linux/security.h>
-#include <linux/slab.h>
 #include <linux/spinlock.h>
-#include <linux/stat.h>
-#include <linux/string.h>
-#include <linux/time.h>
-#include <linux/time64.h>
-#include <linux/backing-dev.h>
-#include <linux/sort.h>
 #include <linux/oom.h>
 #include <linux/sched/isolation.h>
-#include <linux/uaccess.h>
-#include <linux/atomic.h>
-#include <linux/mutex.h>
 #include <linux/cgroup.h>
 #include <linux/wait.h>
 
@@ -193,6 +170,14 @@ struct cpuset {
        int use_parent_ecpus;
        int child_ecpus_count;
 
+       /*
+        * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
+        * know when to rebuild associated root domain bandwidth information.
+        */
+       int nr_deadline_tasks;
+       int nr_migrate_dl_tasks;
+       u64 sum_migrate_dl_bw;
+
        /* Invalid partition error code, not lock protected */
        enum prs_errcode prs_err;
 
@@ -245,6 +230,20 @@ static inline struct cpuset *parent_cs(struct cpuset *cs)
        return css_cs(cs->css.parent);
 }
 
+void inc_dl_tasks_cs(struct task_struct *p)
+{
+       struct cpuset *cs = task_cs(p);
+
+       cs->nr_deadline_tasks++;
+}
+
+void dec_dl_tasks_cs(struct task_struct *p)
+{
+       struct cpuset *cs = task_cs(p);
+
+       cs->nr_deadline_tasks--;
+}
+
 /* bits in struct cpuset flags field */
 typedef enum {
        CS_ONLINE,
@@ -366,22 +365,23 @@ static struct cpuset top_cpuset = {
                if (is_cpuset_online(((des_cs) = css_cs((pos_css)))))
 
 /*
- * There are two global locks guarding cpuset structures - cpuset_rwsem and
+ * There are two global locks guarding cpuset structures - cpuset_mutex and
  * callback_lock. We also require taking task_lock() when dereferencing a
  * task's cpuset pointer. See "The task_lock() exception", at the end of this
- * comment.  The cpuset code uses only cpuset_rwsem write lock.  Other
- * kernel subsystems can use cpuset_read_lock()/cpuset_read_unlock() to
- * prevent change to cpuset structures.
+ * comment.  The cpuset code uses only cpuset_mutex. Other kernel subsystems
+ * can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
+ * structures. Note that cpuset_mutex needs to be a mutex as it is used in
+ * paths that rely on priority inheritance (e.g. scheduler - on RT) for
+ * correctness.
  *
  * A task must hold both locks to modify cpusets.  If a task holds
- * cpuset_rwsem, it blocks others wanting that rwsem, ensuring that it
- * is the only task able to also acquire callback_lock and be able to
- * modify cpusets.  It can perform various checks on the cpuset structure
- * first, knowing nothing will change.  It can also allocate memory while
- * just holding cpuset_rwsem.  While it is performing these checks, various
- * callback routines can briefly acquire callback_lock to query cpusets.
- * Once it is ready to make the changes, it takes callback_lock, blocking
- * everyone else.
+ * cpuset_mutex, it blocks others, ensuring that it is the only task able to
+ * also acquire callback_lock and be able to modify cpusets.  It can perform
+ * various checks on the cpuset structure first, knowing nothing will change.
+ * It can also allocate memory while just holding cpuset_mutex.  While it is
+ * performing these checks, various callback routines can briefly acquire
+ * callback_lock to query cpusets.  Once it is ready to make the changes, it
+ * takes callback_lock, blocking everyone else.
  *
  * Calls to the kernel memory allocator can not be made while holding
  * callback_lock, as that would risk double tripping on callback_lock
@@ -403,16 +403,16 @@ static struct cpuset top_cpuset = {
  * guidelines for accessing subsystem state in kernel/cgroup.c
  */
 
-DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem);
+static DEFINE_MUTEX(cpuset_mutex);
 
-void cpuset_read_lock(void)
+void cpuset_lock(void)
 {
-       percpu_down_read(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
 }
 
-void cpuset_read_unlock(void)
+void cpuset_unlock(void)
 {
-       percpu_up_read(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
 }
 
 static DEFINE_SPINLOCK(callback_lock);
@@ -496,7 +496,7 @@ static inline bool partition_is_populated(struct cpuset *cs,
  * One way or another, we guarantee to return some non-empty subset
  * of cpu_online_mask.
  *
- * Call with callback_lock or cpuset_rwsem held.
+ * Call with callback_lock or cpuset_mutex held.
  */
 static void guarantee_online_cpus(struct task_struct *tsk,
                                  struct cpumask *pmask)
@@ -538,7 +538,7 @@ out_unlock:
  * One way or another, we guarantee to return some non-empty subset
  * of node_states[N_MEMORY].
  *
- * Call with callback_lock or cpuset_rwsem held.
+ * Call with callback_lock or cpuset_mutex held.
  */
 static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
 {
@@ -550,7 +550,7 @@ static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
 /*
  * update task's spread flag if cpuset's page/slab spread flag is set
  *
- * Call with callback_lock or cpuset_rwsem held. The check can be skipped
+ * Call with callback_lock or cpuset_mutex held. The check can be skipped
  * if on default hierarchy.
  */
 static void cpuset_update_task_spread_flags(struct cpuset *cs,
@@ -575,7 +575,7 @@ static void cpuset_update_task_spread_flags(struct cpuset *cs,
  *
  * One cpuset is a subset of another if all its allowed CPUs and
  * Memory Nodes are a subset of the other, and its exclusive flags
- * are only set if the other's are set.  Call holding cpuset_rwsem.
+ * are only set if the other's are set.  Call holding cpuset_mutex.
  */
 
 static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
@@ -713,7 +713,7 @@ out:
  * If we replaced the flag and mask values of the current cpuset
  * (cur) with those values in the trial cpuset (trial), would
  * our various subset and exclusive rules still be valid?  Presumes
- * cpuset_rwsem held.
+ * cpuset_mutex held.
  *
  * 'cur' is the address of an actual, in-use cpuset.  Operations
  * such as list traversal that depend on the actual address of the
@@ -829,7 +829,7 @@ static void update_domain_attr_tree(struct sched_domain_attr *dattr,
        rcu_read_unlock();
 }
 
-/* Must be called with cpuset_rwsem held.  */
+/* Must be called with cpuset_mutex held.  */
 static inline int nr_cpusets(void)
 {
        /* jump label reference count + the top-level cpuset */
@@ -855,7 +855,7 @@ static inline int nr_cpusets(void)
  * domains when operating in the severe memory shortage situations
  * that could cause allocation failures below.
  *
- * Must be called with cpuset_rwsem held.
+ * Must be called with cpuset_mutex held.
  *
  * The three key local variables below are:
  *    cp - cpuset pointer, used (together with pos_css) to perform a
@@ -1066,11 +1066,14 @@ done:
        return ndoms;
 }
 
-static void update_tasks_root_domain(struct cpuset *cs)
+static void dl_update_tasks_root_domain(struct cpuset *cs)
 {
        struct css_task_iter it;
        struct task_struct *task;
 
+       if (cs->nr_deadline_tasks == 0)
+               return;
+
        css_task_iter_start(&cs->css, 0, &it);
 
        while ((task = css_task_iter_next(&it)))
@@ -1079,12 +1082,12 @@ static void update_tasks_root_domain(struct cpuset *cs)
        css_task_iter_end(&it);
 }
 
-static void rebuild_root_domains(void)
+static void dl_rebuild_rd_accounting(void)
 {
        struct cpuset *cs = NULL;
        struct cgroup_subsys_state *pos_css;
 
-       percpu_rwsem_assert_held(&cpuset_rwsem);
+       lockdep_assert_held(&cpuset_mutex);
        lockdep_assert_cpus_held();
        lockdep_assert_held(&sched_domains_mutex);
 
@@ -1107,7 +1110,7 @@ static void rebuild_root_domains(void)
 
                rcu_read_unlock();
 
-               update_tasks_root_domain(cs);
+               dl_update_tasks_root_domain(cs);
 
                rcu_read_lock();
                css_put(&cs->css);
@@ -1121,7 +1124,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 {
        mutex_lock(&sched_domains_mutex);
        partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
-       rebuild_root_domains();
+       dl_rebuild_rd_accounting();
        mutex_unlock(&sched_domains_mutex);
 }
 
@@ -1134,7 +1137,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
  * 'cpus' is removed, then call this routine to rebuild the
  * scheduler's dynamic sched domains.
  *
- * Call with cpuset_rwsem held.  Takes cpus_read_lock().
+ * Call with cpuset_mutex held.  Takes cpus_read_lock().
  */
 static void rebuild_sched_domains_locked(void)
 {
@@ -1145,7 +1148,7 @@ static void rebuild_sched_domains_locked(void)
        int ndoms;
 
        lockdep_assert_cpus_held();
-       percpu_rwsem_assert_held(&cpuset_rwsem);
+       lockdep_assert_held(&cpuset_mutex);
 
        /*
         * If we have raced with CPU hotplug, return early to avoid
@@ -1196,9 +1199,9 @@ static void rebuild_sched_domains_locked(void)
 void rebuild_sched_domains(void)
 {
        cpus_read_lock();
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
        rebuild_sched_domains_locked();
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
        cpus_read_unlock();
 }
 
@@ -1208,7 +1211,7 @@ void rebuild_sched_domains(void)
  * @new_cpus: the temp variable for the new effective_cpus mask
  *
  * Iterate through each task of @cs updating its cpus_allowed to the
- * effective cpuset's.  As this function is called with cpuset_rwsem held,
+ * effective cpuset's.  As this function is called with cpuset_mutex held,
  * cpuset membership stays stable. For top_cpuset, task_cpu_possible_mask()
  * is used instead of effective_cpus to make sure all offline CPUs are also
  * included as hotplug code won't update cpumasks for tasks in top_cpuset.
@@ -1322,7 +1325,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
        int old_prs, new_prs;
        int part_error = PERR_NONE;     /* Partition error? */
 
-       percpu_rwsem_assert_held(&cpuset_rwsem);
+       lockdep_assert_held(&cpuset_mutex);
 
        /*
         * The parent must be a partition root.
@@ -1545,7 +1548,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
  *
  * On legacy hierarchy, effective_cpus will be the same with cpu_allowed.
  *
- * Called with cpuset_rwsem held
+ * Called with cpuset_mutex held
  */
 static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
                                 bool force)
@@ -1705,7 +1708,7 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
        struct cpuset *sibling;
        struct cgroup_subsys_state *pos_css;
 
-       percpu_rwsem_assert_held(&cpuset_rwsem);
+       lockdep_assert_held(&cpuset_mutex);
 
        /*
         * Check all its siblings and call update_cpumasks_hier()
@@ -1955,12 +1958,12 @@ static void *cpuset_being_rebound;
  * @cs: the cpuset in which each task's mems_allowed mask needs to be changed
  *
  * Iterate through each task of @cs updating its mems_allowed to the
- * effective cpuset's.  As this function is called with cpuset_rwsem held,
+ * effective cpuset's.  As this function is called with cpuset_mutex held,
  * cpuset membership stays stable.
  */
 static void update_tasks_nodemask(struct cpuset *cs)
 {
-       static nodemask_t newmems;      /* protected by cpuset_rwsem */
+       static nodemask_t newmems;      /* protected by cpuset_mutex */
        struct css_task_iter it;
        struct task_struct *task;
 
@@ -1973,7 +1976,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
         * take while holding tasklist_lock.  Forks can happen - the
         * mpol_dup() cpuset_being_rebound check will catch such forks,
         * and rebind their vma mempolicies too.  Because we still hold
-        * the global cpuset_rwsem, we know that no other rebind effort
+        * the global cpuset_mutex, we know that no other rebind effort
         * will be contending for the global variable cpuset_being_rebound.
         * It's ok if we rebind the same mm twice; mpol_rebind_mm()
         * is idempotent.  Also migrate pages in each mm to new nodes.
@@ -2019,7 +2022,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
  *
  * On legacy hierarchy, effective_mems will be the same with mems_allowed.
  *
- * Called with cpuset_rwsem held
+ * Called with cpuset_mutex held
  */
 static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 {
@@ -2072,7 +2075,7 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
  * mempolicies and if the cpuset is marked 'memory_migrate',
  * migrate the tasks pages to the new memory.
  *
- * Call with cpuset_rwsem held. May take callback_lock during call.
+ * Call with cpuset_mutex held. May take callback_lock during call.
  * Will take tasklist_lock, scan tasklist for tasks in cpuset cs,
  * lock each such tasks mm->mmap_lock, scan its vma's and rebind
  * their mempolicies to the cpusets new mems_allowed.
@@ -2164,7 +2167,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
  * @cs: the cpuset in which each task's spread flags needs to be changed
  *
  * Iterate through each task of @cs updating its spread flags.  As this
- * function is called with cpuset_rwsem held, cpuset membership stays
+ * function is called with cpuset_mutex held, cpuset membership stays
  * stable.
  */
 static void update_tasks_flags(struct cpuset *cs)
@@ -2184,7 +2187,7 @@ static void update_tasks_flags(struct cpuset *cs)
  * cs:         the cpuset to update
  * turning_on:         whether the flag is being set or cleared
  *
- * Call with cpuset_rwsem held.
+ * Call with cpuset_mutex held.
  */
 
 static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
@@ -2234,7 +2237,7 @@ out:
  * @new_prs: new partition root state
  * Return: 0 if successful, != 0 if error
  *
- * Call with cpuset_rwsem held.
+ * Call with cpuset_mutex held.
  */
 static int update_prstate(struct cpuset *cs, int new_prs)
 {
@@ -2472,19 +2475,26 @@ static int cpuset_can_attach_check(struct cpuset *cs)
        return 0;
 }
 
-/* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
+static void reset_migrate_dl_data(struct cpuset *cs)
+{
+       cs->nr_migrate_dl_tasks = 0;
+       cs->sum_migrate_dl_bw = 0;
+}
+
+/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
 static int cpuset_can_attach(struct cgroup_taskset *tset)
 {
        struct cgroup_subsys_state *css;
-       struct cpuset *cs;
+       struct cpuset *cs, *oldcs;
        struct task_struct *task;
        int ret;
 
        /* used later by cpuset_attach() */
        cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
+       oldcs = cpuset_attach_old_cs;
        cs = css_cs(css);
 
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
 
        /* Check to see if task is allowed in the cpuset */
        ret = cpuset_can_attach_check(cs);
@@ -2492,21 +2502,46 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
                goto out_unlock;
 
        cgroup_taskset_for_each(task, css, tset) {
-               ret = task_can_attach(task, cs->effective_cpus);
+               ret = task_can_attach(task);
                if (ret)
                        goto out_unlock;
                ret = security_task_setscheduler(task);
                if (ret)
                        goto out_unlock;
+
+               if (dl_task(task)) {
+                       cs->nr_migrate_dl_tasks++;
+                       cs->sum_migrate_dl_bw += task->dl.dl_bw;
+               }
        }
 
+       if (!cs->nr_migrate_dl_tasks)
+               goto out_success;
+
+       if (!cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)) {
+               int cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+
+               if (unlikely(cpu >= nr_cpu_ids)) {
+                       reset_migrate_dl_data(cs);
+                       ret = -EINVAL;
+                       goto out_unlock;
+               }
+
+               ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+               if (ret) {
+                       reset_migrate_dl_data(cs);
+                       goto out_unlock;
+               }
+       }
+
+out_success:
        /*
         * Mark attach is in progress.  This makes validate_change() fail
         * changes which zero cpus/mems_allowed.
         */
        cs->attach_in_progress++;
 out_unlock:
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
        return ret;
 }
 
@@ -2518,15 +2553,23 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
        cgroup_taskset_first(tset, &css);
        cs = css_cs(css);
 
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
        cs->attach_in_progress--;
        if (!cs->attach_in_progress)
                wake_up(&cpuset_attach_wq);
-       percpu_up_write(&cpuset_rwsem);
+
+       if (cs->nr_migrate_dl_tasks) {
+               int cpu = cpumask_any(cs->effective_cpus);
+
+               dl_bw_free(cpu, cs->sum_migrate_dl_bw);
+               reset_migrate_dl_data(cs);
+       }
+
+       mutex_unlock(&cpuset_mutex);
 }
 
 /*
- * Protected by cpuset_rwsem. cpus_attach is used only by cpuset_attach_task()
+ * Protected by cpuset_mutex. cpus_attach is used only by cpuset_attach_task()
  * but we can't allocate it dynamically there.  Define it global and
  * allocate from cpuset_init().
  */
@@ -2535,7 +2578,7 @@ static nodemask_t cpuset_attach_nodemask_to;
 
 static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
 {
-       percpu_rwsem_assert_held(&cpuset_rwsem);
+       lockdep_assert_held(&cpuset_mutex);
 
        if (cs != &top_cpuset)
                guarantee_online_cpus(task, cpus_attach);
@@ -2565,7 +2608,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
        cs = css_cs(css);
 
        lockdep_assert_cpus_held();     /* see cgroup_attach_lock() */
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
        cpus_updated = !cpumask_equal(cs->effective_cpus,
                                      oldcs->effective_cpus);
        mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
@@ -2622,11 +2665,17 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 out:
        cs->old_mems_allowed = cpuset_attach_nodemask_to;
 
+       if (cs->nr_migrate_dl_tasks) {
+               cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+               oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
+               reset_migrate_dl_data(cs);
+       }
+
        cs->attach_in_progress--;
        if (!cs->attach_in_progress)
                wake_up(&cpuset_attach_wq);
 
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
 }
 
 /* The various types of files and directories in a cpuset file system */
@@ -2658,7 +2707,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
        int retval = 0;
 
        cpus_read_lock();
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
        if (!is_cpuset_online(cs)) {
                retval = -ENODEV;
                goto out_unlock;
@@ -2694,7 +2743,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
                break;
        }
 out_unlock:
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
        cpus_read_unlock();
        return retval;
 }
@@ -2707,7 +2756,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
        int retval = -ENODEV;
 
        cpus_read_lock();
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
        if (!is_cpuset_online(cs))
                goto out_unlock;
 
@@ -2720,7 +2769,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
                break;
        }
 out_unlock:
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
        cpus_read_unlock();
        return retval;
 }
@@ -2753,7 +2802,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
         * operation like this one can lead to a deadlock through kernfs
         * active_ref protection.  Let's break the protection.  Losing the
         * protection is okay as we check whether @cs is online after
-        * grabbing cpuset_rwsem anyway.  This only happens on the legacy
+        * grabbing cpuset_mutex anyway.  This only happens on the legacy
         * hierarchies.
         */
        css_get(&cs->css);
@@ -2761,7 +2810,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
        flush_work(&cpuset_hotplug_work);
 
        cpus_read_lock();
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
        if (!is_cpuset_online(cs))
                goto out_unlock;
 
@@ -2785,7 +2834,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 
        free_cpuset(trialcs);
 out_unlock:
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
        cpus_read_unlock();
        kernfs_unbreak_active_protection(of->kn);
        css_put(&cs->css);
@@ -2933,13 +2982,13 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 
        css_get(&cs->css);
        cpus_read_lock();
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
        if (!is_cpuset_online(cs))
                goto out_unlock;
 
        retval = update_prstate(cs, val);
 out_unlock:
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
        cpus_read_unlock();
        css_put(&cs->css);
        return retval ?: nbytes;
@@ -3156,7 +3205,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
                return 0;
 
        cpus_read_lock();
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
 
        set_bit(CS_ONLINE, &cs->flags);
        if (is_spread_page(parent))
@@ -3207,7 +3256,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
        cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
        spin_unlock_irq(&callback_lock);
 out_unlock:
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
        cpus_read_unlock();
        return 0;
 }
@@ -3228,7 +3277,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
        struct cpuset *cs = css_cs(css);
 
        cpus_read_lock();
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
 
        if (is_partition_valid(cs))
                update_prstate(cs, 0);
@@ -3247,7 +3296,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
        cpuset_dec();
        clear_bit(CS_ONLINE, &cs->flags);
 
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
        cpus_read_unlock();
 }
 
@@ -3260,7 +3309,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
        spin_lock_irq(&callback_lock);
 
        if (is_in_v2_mode()) {
@@ -3273,7 +3322,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
        }
 
        spin_unlock_irq(&callback_lock);
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
 }
 
 /*
@@ -3294,14 +3343,14 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
                return 0;
 
        lockdep_assert_held(&cgroup_mutex);
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
 
        /* Check to see if task is allowed in the cpuset */
        ret = cpuset_can_attach_check(cs);
        if (ret)
                goto out_unlock;
 
-       ret = task_can_attach(task, cs->effective_cpus);
+       ret = task_can_attach(task);
        if (ret)
                goto out_unlock;
 
@@ -3315,7 +3364,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
         */
        cs->attach_in_progress++;
 out_unlock:
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
        return ret;
 }
 
@@ -3331,11 +3380,11 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset)
        if (same_cs)
                return;
 
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
        cs->attach_in_progress--;
        if (!cs->attach_in_progress)
                wake_up(&cpuset_attach_wq);
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
 }
 
 /*
@@ -3363,7 +3412,7 @@ static void cpuset_fork(struct task_struct *task)
        }
 
        /* CLONE_INTO_CGROUP */
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
        guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
        cpuset_attach_task(cs, task);
 
@@ -3371,7 +3420,7 @@ static void cpuset_fork(struct task_struct *task)
        if (!cs->attach_in_progress)
                wake_up(&cpuset_attach_wq);
 
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
 }
 
 struct cgroup_subsys cpuset_cgrp_subsys = {
@@ -3472,7 +3521,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
        is_empty = cpumask_empty(cs->cpus_allowed) ||
                   nodes_empty(cs->mems_allowed);
 
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
 
        /*
         * Move tasks to the nearest ancestor with execution resources,
@@ -3482,7 +3531,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
        if (is_empty)
                remove_tasks_in_empty_cpuset(cs);
 
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
 }
 
 static void
@@ -3533,14 +3582,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 retry:
        wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
 
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
 
        /*
         * We have raced with task attaching. We wait until attaching
         * is finished, so we won't attach a task to an empty cpuset.
         */
        if (cs->attach_in_progress) {
-               percpu_up_write(&cpuset_rwsem);
+               mutex_unlock(&cpuset_mutex);
                goto retry;
        }
 
@@ -3637,7 +3686,7 @@ update_tasks:
                                            cpus_updated, mems_updated);
 
 unlock:
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
 }
 
 /**
@@ -3667,7 +3716,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
        if (on_dfl && !alloc_cpumasks(NULL, &tmp))
                ptmp = &tmp;
 
-       percpu_down_write(&cpuset_rwsem);
+       mutex_lock(&cpuset_mutex);
 
        /* fetch the available cpus/mems and find out which changed how */
        cpumask_copy(&new_cpus, cpu_active_mask);
@@ -3724,7 +3773,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
                update_tasks_nodemask(&top_cpuset);
        }
 
-       percpu_up_write(&cpuset_rwsem);
+       mutex_unlock(&cpuset_mutex);
 
        /* if cpus or mems changed, we need to propagate to descendants */
        if (cpus_updated || mems_updated) {
@@ -4155,7 +4204,7 @@ void __cpuset_memory_pressure_bump(void)
  *  - Used for /proc/<pid>/cpuset.
  *  - No need to task_lock(tsk) on this tsk->cpuset reference, as it
  *    doesn't really matter if tsk->cpuset changes after we read it,
- *    and we take cpuset_rwsem, keeping cpuset_attach() from changing it
+ *    and we take cpuset_mutex, keeping cpuset_attach() from changing it
  *    anyway.
  */
 int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
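
A hedged sketch of the new deadline-bandwidth handling in the cpuset attach path above: cpuset_can_attach() counts the migrating SCHED_DEADLINE tasks and, when the old and new cpusets share no effective CPUs, pre-allocates their summed bandwidth on the destination; cpuset_cancel_attach() gives it back if the move is aborted; cpuset_attach() commits it by transferring the per-cpuset counts. The userspace model below only illustrates that accounting flow; the *_model names and the fixed capacity are made up, and the real code calls dl_bw_alloc()/dl_bw_free() against the destination's root domain.

/*
 * Toy model of the can_attach / cancel_attach / attach bandwidth flow.
 * Hypothetical sketch, not kernel code.
 */
#include <stdio.h>

struct cpuset_model {
	int nr_deadline_tasks;
	int nr_migrate_dl_tasks;
	unsigned long sum_migrate_dl_bw;
};

static unsigned long dest_free_bw = 100;	/* stand-in for the destination's DL capacity */

static int dl_bw_alloc_model(unsigned long bw)
{
	if (bw > dest_free_bw)
		return -1;			/* destination cannot take the extra bandwidth */
	dest_free_bw -= bw;
	return 0;
}

static void dl_bw_free_model(unsigned long bw)
{
	dest_free_bw += bw;
}

/* can_attach: count migrating DL tasks and pre-allocate their summed bandwidth */
static int can_attach_model(struct cpuset_model *cs, const unsigned long *task_bw, int n)
{
	for (int i = 0; i < n; i++) {
		cs->nr_migrate_dl_tasks++;
		cs->sum_migrate_dl_bw += task_bw[i];
	}
	if (cs->nr_migrate_dl_tasks && dl_bw_alloc_model(cs->sum_migrate_dl_bw)) {
		cs->nr_migrate_dl_tasks = 0;	/* mirrors reset_migrate_dl_data() */
		cs->sum_migrate_dl_bw = 0;
		return -1;
	}
	return 0;
}

/* cancel_attach: return the pre-allocated bandwidth if the move is aborted */
static void cancel_attach_model(struct cpuset_model *cs)
{
	if (cs->nr_migrate_dl_tasks) {
		dl_bw_free_model(cs->sum_migrate_dl_bw);
		cs->nr_migrate_dl_tasks = 0;
		cs->sum_migrate_dl_bw = 0;
	}
}

/* attach: commit by transferring the DL task count from the old cpuset */
static void attach_model(struct cpuset_model *newcs, struct cpuset_model *oldcs)
{
	newcs->nr_deadline_tasks += newcs->nr_migrate_dl_tasks;
	oldcs->nr_deadline_tasks -= newcs->nr_migrate_dl_tasks;
	newcs->nr_migrate_dl_tasks = 0;
	newcs->sum_migrate_dl_bw = 0;
}

int main(void)
{
	struct cpuset_model oldcs = { .nr_deadline_tasks = 2 }, newcs = { 0 };
	unsigned long bw[2] = { 30, 40 };

	if (can_attach_model(&newcs, bw, 2) == 0)
		attach_model(&newcs, &oldcs);
	else
		cancel_attach_model(&newcs);	/* no-op here; in the kernel it runs when a later controller aborts the attach */

	printf("dest DL tasks %d, old DL tasks %d, free bw %lu\n",
	       newcs.nr_deadline_tasks, oldcs.nr_deadline_tasks, dest_free_bw);
	return 0;
}
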
index fe3e8a0..ae2f4dd 100644 (file)
@@ -357,7 +357,6 @@ static struct cftype misc_cg_files[] = {
        {
                .name = "current",
                .seq_show = misc_cg_current_show,
-               .flags = CFTYPE_NOT_ON_ROOT,
        },
        {
                .name = "capacity",
index 3135406..ef5878f 100644 (file)
@@ -197,6 +197,7 @@ uncharge_cg_locked(struct rdma_cgroup *cg,
 
 /**
  * rdmacg_uncharge_hierarchy - hierarchically uncharge rdma resource count
+ * @cg: pointer to cg to uncharge and all parents in hierarchy
  * @device: pointer to rdmacg device
  * @stop_cg: while traversing the hierarchy, stop uncharging when the
  *           stop_cg cgroup is reached
@@ -221,6 +222,7 @@ static void rdmacg_uncharge_hierarchy(struct rdma_cgroup *cg,
 
 /**
  * rdmacg_uncharge - hierarchically uncharge rdma resource count
+ * @cg: pointer to cg to uncharge and all parents in hierarchy
  * @device: pointer to rdmacg device
  * @index: index of the resource to uncharge in cgroup in given resource pool
  */
index 9c4c552..2542c21 100644 (file)
@@ -171,7 +171,7 @@ __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
 __diag_pop();
 
 /* see cgroup_rstat_flush() */
-static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
+static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
        __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
 {
        int cpu;
@@ -207,9 +207,8 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
                }
                raw_spin_unlock_irqrestore(cpu_lock, flags);
 
-               /* if @may_sleep, play nice and yield if necessary */
-               if (may_sleep && (need_resched() ||
-                                 spin_needbreak(&cgroup_rstat_lock))) {
+               /* play nice and yield if necessary */
+               if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
                        spin_unlock_irq(&cgroup_rstat_lock);
                        if (!cond_resched())
                                cpu_relax();
@@ -236,26 +235,11 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
        might_sleep();
 
        spin_lock_irq(&cgroup_rstat_lock);
-       cgroup_rstat_flush_locked(cgrp, true);
+       cgroup_rstat_flush_locked(cgrp);
        spin_unlock_irq(&cgroup_rstat_lock);
 }
 
 /**
- * cgroup_rstat_flush_atomic- atomic version of cgroup_rstat_flush()
- * @cgrp: target cgroup
- *
- * This function can be called from any context.
- */
-void cgroup_rstat_flush_atomic(struct cgroup *cgrp)
-{
-       unsigned long flags;
-
-       spin_lock_irqsave(&cgroup_rstat_lock, flags);
-       cgroup_rstat_flush_locked(cgrp, false);
-       spin_unlock_irqrestore(&cgroup_rstat_lock, flags);
-}
-
-/**
  * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
  * @cgrp: target cgroup
  *
@@ -269,7 +253,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
 {
        might_sleep();
        spin_lock_irq(&cgroup_rstat_lock);
-       cgroup_rstat_flush_locked(cgrp, true);
+       cgroup_rstat_flush_locked(cgrp);
 }
 
 /**
index a09f1c1..6ef0b35 100644 (file)
@@ -510,7 +510,7 @@ void noinstr __ct_user_enter(enum ctx_state state)
                         * In this case we don't care about any concurrency/ordering.
                         */
                        if (!IS_ENABLED(CONFIG_CONTEXT_TRACKING_IDLE))
-                               arch_atomic_set(&ct->state, state);
+                               raw_atomic_set(&ct->state, state);
                } else {
                        /*
                         * Even if context tracking is disabled on this CPU, because it's outside
@@ -527,7 +527,7 @@ void noinstr __ct_user_enter(enum ctx_state state)
                         */
                        if (!IS_ENABLED(CONFIG_CONTEXT_TRACKING_IDLE)) {
                                /* Tracking for vtime only, no concurrent RCU EQS accounting */
-                               arch_atomic_set(&ct->state, state);
+                               raw_atomic_set(&ct->state, state);
                        } else {
                                /*
                                 * Tracking for vtime and RCU EQS. Make sure we don't race
@@ -535,7 +535,7 @@ void noinstr __ct_user_enter(enum ctx_state state)
                                 * RCU only requires RCU_DYNTICKS_IDX increments to be fully
                                 * ordered.
                                 */
-                               arch_atomic_add(state, &ct->state);
+                               raw_atomic_add(state, &ct->state);
                        }
                }
        }
@@ -630,12 +630,12 @@ void noinstr __ct_user_exit(enum ctx_state state)
                         * In this case we don't care about any concurrency/ordering.
                         */
                        if (!IS_ENABLED(CONFIG_CONTEXT_TRACKING_IDLE))
-                               arch_atomic_set(&ct->state, CONTEXT_KERNEL);
+                               raw_atomic_set(&ct->state, CONTEXT_KERNEL);
 
                } else {
                        if (!IS_ENABLED(CONFIG_CONTEXT_TRACKING_IDLE)) {
                                /* Tracking for vtime only, no concurrent RCU EQS accounting */
-                               arch_atomic_set(&ct->state, CONTEXT_KERNEL);
+                               raw_atomic_set(&ct->state, CONTEXT_KERNEL);
                        } else {
                                /*
                                 * Tracking for vtime and RCU EQS. Make sure we don't race
@@ -643,7 +643,7 @@ void noinstr __ct_user_exit(enum ctx_state state)
                                 * RCU only requires RCU_DYNTICKS_IDX increments to be fully
                                 * ordered.
                                 */
-                               arch_atomic_sub(state, &ct->state);
+                               raw_atomic_sub(state, &ct->state);
                        }
                }
        }
index f4a2c58..88a7ede 100644 (file)
@@ -17,6 +17,7 @@
 #include <linux/cpu.h>
 #include <linux/oom.h>
 #include <linux/rcupdate.h>
+#include <linux/delay.h>
 #include <linux/export.h>
 #include <linux/bug.h>
 #include <linux/kthread.h>
@@ -59,6 +60,7 @@
  * @last:      For multi-instance rollback, remember how far we got
  * @cb_state:  The state for a single callback (install/uninstall)
  * @result:    Result of the operation
+ * @ap_sync_state:     State for AP synchronization
  * @done_up:   Signal completion to the issuer of the task for cpu-up
  * @done_down: Signal completion to the issuer of the task for cpu-down
  */
@@ -76,6 +78,7 @@ struct cpuhp_cpu_state {
        struct hlist_node       *last;
        enum cpuhp_state        cb_state;
        int                     result;
+       atomic_t                ap_sync_state;
        struct completion       done_up;
        struct completion       done_down;
 #endif
@@ -276,6 +279,182 @@ static bool cpuhp_is_atomic_state(enum cpuhp_state state)
        return CPUHP_AP_IDLE_DEAD <= state && state < CPUHP_AP_ONLINE;
 }
 
+/* Synchronization state management */
+enum cpuhp_sync_state {
+       SYNC_STATE_DEAD,
+       SYNC_STATE_KICKED,
+       SYNC_STATE_SHOULD_DIE,
+       SYNC_STATE_ALIVE,
+       SYNC_STATE_SHOULD_ONLINE,
+       SYNC_STATE_ONLINE,
+};
+
+#ifdef CONFIG_HOTPLUG_CORE_SYNC
+/**
+ * cpuhp_ap_update_sync_state - Update synchronization state during bringup/teardown
+ * @state:     The synchronization state to set
+ *
+ * No synchronization point. Just update of the synchronization state, but implies
+ * a full barrier so that the AP changes are visible before the control CPU proceeds.
+ */
+static inline void cpuhp_ap_update_sync_state(enum cpuhp_sync_state state)
+{
+       atomic_t *st = this_cpu_ptr(&cpuhp_state.ap_sync_state);
+
+       (void)atomic_xchg(st, state);
+}
+
+void __weak arch_cpuhp_sync_state_poll(void) { cpu_relax(); }
+
+static bool cpuhp_wait_for_sync_state(unsigned int cpu, enum cpuhp_sync_state state,
+                                     enum cpuhp_sync_state next_state)
+{
+       atomic_t *st = per_cpu_ptr(&cpuhp_state.ap_sync_state, cpu);
+       ktime_t now, end, start = ktime_get();
+       int sync;
+
+       end = start + 10ULL * NSEC_PER_SEC;
+
+       sync = atomic_read(st);
+       while (1) {
+               if (sync == state) {
+                       if (!atomic_try_cmpxchg(st, &sync, next_state))
+                               continue;
+                       return true;
+               }
+
+               now = ktime_get();
+               if (now > end) {
+                       /* Timeout. Leave the state unchanged */
+                       return false;
+               } else if (now - start < NSEC_PER_MSEC) {
+                       /* Poll for one millisecond */
+                       arch_cpuhp_sync_state_poll();
+               } else {
+                       usleep_range_state(USEC_PER_MSEC, 2 * USEC_PER_MSEC, TASK_UNINTERRUPTIBLE);
+               }
+               sync = atomic_read(st);
+       }
+       return true;
+}
+#else  /* CONFIG_HOTPLUG_CORE_SYNC */
+static inline void cpuhp_ap_update_sync_state(enum cpuhp_sync_state state) { }
+#endif /* !CONFIG_HOTPLUG_CORE_SYNC */
+
+#ifdef CONFIG_HOTPLUG_CORE_SYNC_DEAD
+/**
+ * cpuhp_ap_report_dead - Update synchronization state to DEAD
+ *
+ * No synchronization point. Just update of the synchronization state.
+ */
+void cpuhp_ap_report_dead(void)
+{
+       cpuhp_ap_update_sync_state(SYNC_STATE_DEAD);
+}
+
+void __weak arch_cpuhp_cleanup_dead_cpu(unsigned int cpu) { }
+
+/*
+ * Late CPU shutdown synchronization point. Cannot use cpuhp_state::done_down
+ * because the AP cannot issue complete() at this stage.
+ */
+static void cpuhp_bp_sync_dead(unsigned int cpu)
+{
+       atomic_t *st = per_cpu_ptr(&cpuhp_state.ap_sync_state, cpu);
+       int sync = atomic_read(st);
+
+       do {
+               /* CPU can have reported dead already. Don't overwrite that! */
+               if (sync == SYNC_STATE_DEAD)
+                       break;
+       } while (!atomic_try_cmpxchg(st, &sync, SYNC_STATE_SHOULD_DIE));
+
+       if (cpuhp_wait_for_sync_state(cpu, SYNC_STATE_DEAD, SYNC_STATE_DEAD)) {
+               /* CPU reached dead state. Invoke the cleanup function */
+               arch_cpuhp_cleanup_dead_cpu(cpu);
+               return;
+       }
+
+       /* No further action possible. Emit message and give up. */
+       pr_err("CPU%u failed to report dead state\n", cpu);
+}
+#else /* CONFIG_HOTPLUG_CORE_SYNC_DEAD */
+static inline void cpuhp_bp_sync_dead(unsigned int cpu) { }
+#endif /* !CONFIG_HOTPLUG_CORE_SYNC_DEAD */
+
+#ifdef CONFIG_HOTPLUG_CORE_SYNC_FULL
+/**
+ * cpuhp_ap_sync_alive - Synchronize AP with the control CPU once it is alive
+ *
+ * Updates the AP synchronization state to SYNC_STATE_ALIVE and waits
+ * for the BP to release it.
+ */
+void cpuhp_ap_sync_alive(void)
+{
+       atomic_t *st = this_cpu_ptr(&cpuhp_state.ap_sync_state);
+
+       cpuhp_ap_update_sync_state(SYNC_STATE_ALIVE);
+
+       /* Wait for the control CPU to release it. */
+       while (atomic_read(st) != SYNC_STATE_SHOULD_ONLINE)
+               cpu_relax();
+}
+
+static bool cpuhp_can_boot_ap(unsigned int cpu)
+{
+       atomic_t *st = per_cpu_ptr(&cpuhp_state.ap_sync_state, cpu);
+       int sync = atomic_read(st);
+
+again:
+       switch (sync) {
+       case SYNC_STATE_DEAD:
+               /* CPU is properly dead */
+               break;
+       case SYNC_STATE_KICKED:
+               /* CPU did not come up in previous attempt */
+               break;
+       case SYNC_STATE_ALIVE:
+               /* CPU is stuck in cpuhp_ap_sync_alive(). */
+               break;
+       default:
+               /* CPU failed to report online or dead and is in limbo state. */
+               return false;
+       }
+
+       /* Prepare for booting */
+       if (!atomic_try_cmpxchg(st, &sync, SYNC_STATE_KICKED))
+               goto again;
+
+       return true;
+}
+
+void __weak arch_cpuhp_cleanup_kick_cpu(unsigned int cpu) { }
+
+/*
+ * Early CPU bringup synchronization point. Cannot use cpuhp_state::done_up
+ * because the AP cannot issue complete() so early in the bringup.
+ */
+static int cpuhp_bp_sync_alive(unsigned int cpu)
+{
+       int ret = 0;
+
+       if (!IS_ENABLED(CONFIG_HOTPLUG_CORE_SYNC_FULL))
+               return 0;
+
+       if (!cpuhp_wait_for_sync_state(cpu, SYNC_STATE_ALIVE, SYNC_STATE_SHOULD_ONLINE)) {
+               pr_err("CPU%u failed to report alive state\n", cpu);
+               ret = -EIO;
+       }
+
+       /* Let the architecture cleanup the kick alive mechanics. */
+       arch_cpuhp_cleanup_kick_cpu(cpu);
+       return ret;
+}
+#else /* CONFIG_HOTPLUG_CORE_SYNC_FULL */
+static inline int cpuhp_bp_sync_alive(unsigned int cpu) { return 0; }
+static inline bool cpuhp_can_boot_ap(unsigned int cpu) { return true; }
+#endif /* !CONFIG_HOTPLUG_CORE_SYNC_FULL */
+
 /* Serializes the updates to cpu_online_mask, cpu_present_mask */
 static DEFINE_MUTEX(cpu_add_remove_lock);
 bool cpuhp_tasks_frozen;
@@ -470,8 +649,23 @@ bool cpu_smt_possible(void)
                cpu_smt_control != CPU_SMT_NOT_SUPPORTED;
 }
 EXPORT_SYMBOL_GPL(cpu_smt_possible);
+
+static inline bool cpuhp_smt_aware(void)
+{
+       return topology_smt_supported();
+}
+
+static inline const struct cpumask *cpuhp_get_primary_thread_mask(void)
+{
+       return cpu_primary_thread_mask;
+}
 #else
 static inline bool cpu_smt_allowed(unsigned int cpu) { return true; }
+static inline bool cpuhp_smt_aware(void) { return false; }
+static inline const struct cpumask *cpuhp_get_primary_thread_mask(void)
+{
+       return cpu_present_mask;
+}
 #endif
 
 static inline enum cpuhp_state
@@ -558,7 +752,7 @@ static int cpuhp_kick_ap(int cpu, struct cpuhp_cpu_state *st,
        return ret;
 }
 
-static int bringup_wait_for_ap(unsigned int cpu)
+static int bringup_wait_for_ap_online(unsigned int cpu)
 {
        struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
 
@@ -579,38 +773,94 @@ static int bringup_wait_for_ap(unsigned int cpu)
         */
        if (!cpu_smt_allowed(cpu))
                return -ECANCELED;
+       return 0;
+}
+
+#ifdef CONFIG_HOTPLUG_SPLIT_STARTUP
+static int cpuhp_kick_ap_alive(unsigned int cpu)
+{
+       if (!cpuhp_can_boot_ap(cpu))
+               return -EAGAIN;
+
+       return arch_cpuhp_kick_ap_alive(cpu, idle_thread_get(cpu));
+}
+
+static int cpuhp_bringup_ap(unsigned int cpu)
+{
+       struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
+       int ret;
+
+       /*
+        * Some architectures have to walk the irq descriptors to
+        * setup the vector space for the cpu which comes online.
+        * Prevent irq alloc/free across the bringup.
+        */
+       irq_lock_sparse();
+
+       ret = cpuhp_bp_sync_alive(cpu);
+       if (ret)
+               goto out_unlock;
+
+       ret = bringup_wait_for_ap_online(cpu);
+       if (ret)
+               goto out_unlock;
+
+       irq_unlock_sparse();
 
        if (st->target <= CPUHP_AP_ONLINE_IDLE)
                return 0;
 
        return cpuhp_kick_ap(cpu, st, st->target);
-}
 
+out_unlock:
+       irq_unlock_sparse();
+       return ret;
+}
+#else
 static int bringup_cpu(unsigned int cpu)
 {
+       struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
        struct task_struct *idle = idle_thread_get(cpu);
        int ret;
 
-       /*
-        * Reset stale stack state from the last time this CPU was online.
-        */
-       scs_task_reset(idle);
-       kasan_unpoison_task_stack(idle);
+       if (!cpuhp_can_boot_ap(cpu))
+               return -EAGAIN;
 
        /*
         * Some architectures have to walk the irq descriptors to
         * setup the vector space for the cpu which comes online.
-        * Prevent irq alloc/free across the bringup.
+        *
+        * Prevent irq alloc/free across the bringup by acquiring the
+        * sparse irq lock. Hold it until the upcoming CPU completes the
+        * startup in cpuhp_online_idle(), which makes it possible to avoid
+        * intermediate synchronization points in the architecture code.
         */
        irq_lock_sparse();
 
-       /* Arch-specific enabling code. */
        ret = __cpu_up(cpu, idle);
-       irq_unlock_sparse();
        if (ret)
-               return ret;
-       return bringup_wait_for_ap(cpu);
+               goto out_unlock;
+
+       ret = cpuhp_bp_sync_alive(cpu);
+       if (ret)
+               goto out_unlock;
+
+       ret = bringup_wait_for_ap_online(cpu);
+       if (ret)
+               goto out_unlock;
+
+       irq_unlock_sparse();
+
+       if (st->target <= CPUHP_AP_ONLINE_IDLE)
+               return 0;
+
+       return cpuhp_kick_ap(cpu, st, st->target);
+
+out_unlock:
+       irq_unlock_sparse();
+       return ret;
 }
+#endif
 
 static int finish_cpu(unsigned int cpu)
 {
@@ -1099,6 +1349,8 @@ static int takedown_cpu(unsigned int cpu)
        /* This actually kills the CPU. */
        __cpu_die(cpu);
 
+       cpuhp_bp_sync_dead(cpu);
+
        tick_cleanup_dead_cpu(cpu);
        rcutree_migrate_callbacks(cpu);
        return 0;
@@ -1345,8 +1597,10 @@ void cpuhp_online_idle(enum cpuhp_state state)
        if (state != CPUHP_AP_ONLINE_IDLE)
                return;
 
+       cpuhp_ap_update_sync_state(SYNC_STATE_ONLINE);
+
        /*
-        * Unpart the stopper thread before we start the idle loop (and start
+        * Unpark the stopper thread before we start the idle loop (and start
         * scheduling); this ensures the stopper task is always available.
         */
        stop_machine_unpark(smp_processor_id());
@@ -1383,6 +1637,12 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
                        ret = PTR_ERR(idle);
                        goto out;
                }
+
+               /*
+                * Reset stale stack state from the last time this CPU was online.
+                */
+               scs_task_reset(idle);
+               kasan_unpoison_task_stack(idle);
        }
 
        cpuhp_tasks_frozen = tasks_frozen;
@@ -1502,18 +1762,96 @@ int bringup_hibernate_cpu(unsigned int sleep_cpu)
        return 0;
 }
 
-void bringup_nonboot_cpus(unsigned int setup_max_cpus)
+static void __init cpuhp_bringup_mask(const struct cpumask *mask, unsigned int ncpus,
+                                     enum cpuhp_state target)
 {
        unsigned int cpu;
 
-       for_each_present_cpu(cpu) {
-               if (num_online_cpus() >= setup_max_cpus)
+       for_each_cpu(cpu, mask) {
+               struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
+
+               if (cpu_up(cpu, target) && can_rollback_cpu(st)) {
+                       /*
+                        * If this failed then cpu_up() might have only
+                        * rolled back to CPUHP_BP_KICK_AP for the final
+                        * online. Clean it up. NOOP if already rolled back.
+                        */
+                       WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, CPUHP_OFFLINE));
+               }
+
+               if (!--ncpus)
                        break;
-               if (!cpu_online(cpu))
-                       cpu_up(cpu, CPUHP_ONLINE);
        }
 }
 
+#ifdef CONFIG_HOTPLUG_PARALLEL
+static bool __cpuhp_parallel_bringup __ro_after_init = true;
+
+static int __init parallel_bringup_parse_param(char *arg)
+{
+       return kstrtobool(arg, &__cpuhp_parallel_bringup);
+}
+early_param("cpuhp.parallel", parallel_bringup_parse_param);
+
+/*
+ * On architectures which have enabled parallel bringup this invokes all BP
+ * prepare states for each of the to be onlined APs first. The last state
+ * sends the startup IPI to the APs. The APs proceed through the low level
+ * bringup code in parallel and then wait for the control CPU to release
+ * them one by one for the final onlining procedure.
+ *
+ * This avoids waiting for each AP to respond to the startup IPI in
+ * CPUHP_BRINGUP_CPU.
+ */
+static bool __init cpuhp_bringup_cpus_parallel(unsigned int ncpus)
+{
+       const struct cpumask *mask = cpu_present_mask;
+
+       if (__cpuhp_parallel_bringup)
+               __cpuhp_parallel_bringup = arch_cpuhp_init_parallel_bringup();
+       if (!__cpuhp_parallel_bringup)
+               return false;
+
+       if (cpuhp_smt_aware()) {
+               const struct cpumask *pmask = cpuhp_get_primary_thread_mask();
+               static struct cpumask tmp_mask __initdata;
+
+               /*
+                * For various reasons, x86 requires that the SMT siblings are
+                * not brought up while the primary thread performs a microcode
+                * update. Bring the primary threads up first.
+                */
+               cpumask_and(&tmp_mask, mask, pmask);
+               cpuhp_bringup_mask(&tmp_mask, ncpus, CPUHP_BP_KICK_AP);
+               cpuhp_bringup_mask(&tmp_mask, ncpus, CPUHP_ONLINE);
+               /* Account for the online CPUs */
+               ncpus -= num_online_cpus();
+               if (!ncpus)
+                       return true;
+               /* Create the mask for secondary CPUs */
+               cpumask_andnot(&tmp_mask, mask, pmask);
+               mask = &tmp_mask;
+       }
+
+       /* Bring the not-yet started CPUs up */
+       cpuhp_bringup_mask(mask, ncpus, CPUHP_BP_KICK_AP);
+       cpuhp_bringup_mask(mask, ncpus, CPUHP_ONLINE);
+       return true;
+}
+#else
+static inline bool cpuhp_bringup_cpus_parallel(unsigned int ncpus) { return false; }
+#endif /* CONFIG_HOTPLUG_PARALLEL */
+
+void __init bringup_nonboot_cpus(unsigned int setup_max_cpus)
+{
+       /* Try parallel bringup optimization if enabled */
+       if (cpuhp_bringup_cpus_parallel(setup_max_cpus))
+               return;
+
+       /* Full per CPU serialized bringup */
+       cpuhp_bringup_mask(cpu_present_mask, setup_max_cpus, CPUHP_ONLINE);
+}
+
 #ifdef CONFIG_PM_SLEEP_SMP
 static cpumask_var_t frozen_cpus;
 
@@ -1740,13 +2078,38 @@ static struct cpuhp_step cpuhp_hp_states[] = {
                .startup.single         = timers_prepare_cpu,
                .teardown.single        = timers_dead_cpu,
        },
-       /* Kicks the plugged cpu into life */
+
+#ifdef CONFIG_HOTPLUG_SPLIT_STARTUP
+       /*
+        * Kicks the AP alive. The AP will wait in cpuhp_ap_sync_alive() until
+        * the next step releases it.
+        */
+       [CPUHP_BP_KICK_AP] = {
+               .name                   = "cpu:kick_ap",
+               .startup.single         = cpuhp_kick_ap_alive,
+       },
+
+       /*
+        * Waits for the AP to reach cpuhp_ap_sync_alive() and then
+        * releases it for the complete bringup.
+        */
+       [CPUHP_BRINGUP_CPU] = {
+               .name                   = "cpu:bringup",
+               .startup.single         = cpuhp_bringup_ap,
+               .teardown.single        = finish_cpu,
+               .cant_stop              = true,
+       },
+#else
+       /*
+        * All-in-one CPU bringup state which includes the kick alive.
+        */
        [CPUHP_BRINGUP_CPU] = {
                .name                   = "cpu:bringup",
                .startup.single         = bringup_cpu,
                .teardown.single        = finish_cpu,
                .cant_stop              = true,
        },
+#endif
        /* Final state before CPU kills itself */
        [CPUHP_AP_IDLE_DEAD] = {
                .name                   = "idle:dead",
@@ -2723,6 +3086,7 @@ void __init boot_cpu_hotplug_init(void)
 {
 #ifdef CONFIG_SMP
        cpumask_set_cpu(smp_processor_id(), &cpus_booted_once_mask);
+       atomic_set(this_cpu_ptr(&cpuhp_state.ap_sync_state), SYNC_STATE_ONLINE);
 #endif
        this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
        this_cpu_write(cpuhp_state.target, CPUHP_ONLINE);
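
The CONFIG_HOTPLUG_CORE_SYNC_FULL handshake introduced above reduces to a small state machine on ap_sync_state: the control CPU moves it DEAD -> KICKED and sends the startup IPI, the AP announces ALIVE in cpuhp_ap_sync_alive() and spins, the control CPU flips ALIVE -> SHOULD_ONLINE in cpuhp_bp_sync_alive(), and the AP finally reports ONLINE from cpuhp_online_idle(). The sketch below models only that ordering with two threads and C11 atomics; it is an illustration of the protocol, not kernel code, and it omits the timeout and sleep/poll handling of cpuhp_wait_for_sync_state().

/* Userspace model of the KICKED -> ALIVE -> SHOULD_ONLINE -> ONLINE handshake. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

enum sync_state { DEAD, KICKED, SHOULD_DIE, ALIVE, SHOULD_ONLINE, ONLINE };

static _Atomic enum sync_state ap_sync_state = DEAD;

/* Models cpuhp_ap_sync_alive(): the "AP" announces itself and waits for release. */
static void *ap_thread(void *arg)
{
	atomic_store(&ap_sync_state, ALIVE);
	while (atomic_load(&ap_sync_state) != SHOULD_ONLINE)
		;				/* cpu_relax() in the kernel */
	/* ... the rest of the low level bringup would run here ... */
	atomic_store(&ap_sync_state, ONLINE);	/* cpuhp_online_idle() equivalent */
	return NULL;
}

/* Models cpuhp_can_boot_ap() + cpuhp_bp_sync_alive() on the control CPU. */
int main(void)
{
	pthread_t ap;
	enum sync_state expected = DEAD;

	/* Prepare for booting: DEAD -> KICKED, then "send the startup IPI". */
	if (!atomic_compare_exchange_strong(&ap_sync_state, &expected, KICKED))
		return 1;
	pthread_create(&ap, NULL, ap_thread, NULL);

	/* Wait for ALIVE and release the AP: ALIVE -> SHOULD_ONLINE. */
	for (;;) {
		expected = ALIVE;
		if (atomic_compare_exchange_strong(&ap_sync_state, &expected, SHOULD_ONLINE))
			break;
	}

	pthread_join(ap, NULL);
	printf("final state: %d (ONLINE=%d)\n", atomic_load(&ap_sync_state), ONLINE);
	return 0;
}

The same cmpxchg-based release is what lets the parallel bringup path kick every AP first (CPUHP_BP_KICK_AP) and only later walk them through the final online stage one by one.
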
index 6677d0e..abea182 100644 (file)
@@ -24,6 +24,9 @@ config DMA_OPS_BYPASS
 config ARCH_HAS_DMA_MAP_DIRECT
        bool
 
+config NEED_SG_DMA_FLAGS
+       bool
+
 config NEED_SG_DMA_LENGTH
        bool
 
@@ -87,6 +90,10 @@ config SWIOTLB
        bool
        select NEED_DMA_MAP_STATE
 
+config DMA_BOUNCE_UNALIGNED_KMALLOC
+       bool
+       depends on SWIOTLB
+
 config DMA_RESTRICTED_POOL
        bool "DMA Restricted Pool"
        depends on OF && OF_RESERVED_MEM && SWIOTLB
index 5595d1d..d29cade 100644 (file)
@@ -463,7 +463,7 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
        int i;
 
        for_each_sg(sgl,  sg, nents, i) {
-               if (sg_is_dma_bus_address(sg))
+               if (sg_dma_is_bus_address(sg))
                        sg_dma_unmark_bus_address(sg);
                else
                        dma_direct_unmap_page(dev, sg->dma_address,
index e38ffc5..97ec892 100644 (file)
@@ -94,7 +94,8 @@ static inline dma_addr_t dma_direct_map_page(struct device *dev,
                return swiotlb_map(dev, phys, size, dir, attrs);
        }
 
-       if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
+       if (unlikely(!dma_capable(dev, dma_addr, size, true)) ||
+           dma_kmalloc_needs_bounce(dev, size, dir)) {
                if (is_pci_p2pdma_page(page))
                        return DMA_MAPPING_ERROR;
                if (is_swiotlb_active(dev))
index db016e4..3060427 100644 (file)
@@ -6647,7 +6647,7 @@ static void perf_sigtrap(struct perf_event *event)
                return;
 
        send_sig_perf((void __user *)event->pending_addr,
-                     event->attr.type, event->attr.sig_data);
+                     event->orig_type, event->attr.sig_data);
 }
 
 /*
@@ -7490,6 +7490,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
                return pud_leaf_size(pud);
 
        pmdp = pmd_offset_lockless(pudp, pud, addr);
+again:
        pmd = pmdp_get_lockless(pmdp);
        if (!pmd_present(pmd))
                return 0;
@@ -7498,6 +7499,9 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
                return pmd_leaf_size(pmd);
 
        ptep = pte_offset_map(&pmd, addr);
+       if (!ptep)
+               goto again;
+
        pte = ptep_get_lockless(ptep);
        if (pte_present(pte))
                size = pte_leaf_size(pte);
@@ -9951,6 +9955,9 @@ static void sw_perf_event_destroy(struct perf_event *event)
        swevent_hlist_put();
 }
 
+static struct pmu perf_cpu_clock; /* fwd declaration */
+static struct pmu perf_task_clock;
+
 static int perf_swevent_init(struct perf_event *event)
 {
        u64 event_id = event->attr.config;
@@ -9966,7 +9973,10 @@ static int perf_swevent_init(struct perf_event *event)
 
        switch (event_id) {
        case PERF_COUNT_SW_CPU_CLOCK:
+               event->attr.type = perf_cpu_clock.type;
+               return -ENOENT;
        case PERF_COUNT_SW_TASK_CLOCK:
+               event->attr.type = perf_task_clock.type;
                return -ENOENT;
 
        default:
@@ -11098,7 +11108,7 @@ static void cpu_clock_event_read(struct perf_event *event)
 
 static int cpu_clock_event_init(struct perf_event *event)
 {
-       if (event->attr.type != PERF_TYPE_SOFTWARE)
+       if (event->attr.type != perf_cpu_clock.type)
                return -ENOENT;
 
        if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
@@ -11119,6 +11129,7 @@ static struct pmu perf_cpu_clock = {
        .task_ctx_nr    = perf_sw_context,
 
        .capabilities   = PERF_PMU_CAP_NO_NMI,
+       .dev            = PMU_NULL_DEV,
 
        .event_init     = cpu_clock_event_init,
        .add            = cpu_clock_event_add,
@@ -11179,7 +11190,7 @@ static void task_clock_event_read(struct perf_event *event)
 
 static int task_clock_event_init(struct perf_event *event)
 {
-       if (event->attr.type != PERF_TYPE_SOFTWARE)
+       if (event->attr.type != perf_task_clock.type)
                return -ENOENT;
 
        if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
@@ -11200,6 +11211,7 @@ static struct pmu perf_task_clock = {
        .task_ctx_nr    = perf_sw_context,
 
        .capabilities   = PERF_PMU_CAP_NO_NMI,
+       .dev            = PMU_NULL_DEV,
 
        .event_init     = task_clock_event_init,
        .add            = task_clock_event_add,
@@ -11427,31 +11439,31 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
                goto unlock;
 
        pmu->type = -1;
-       if (!name)
-               goto skip_type;
+       if (WARN_ONCE(!name, "Can not register anonymous pmu.\n")) {
+               ret = -EINVAL;
+               goto free_pdc;
+       }
+
        pmu->name = name;
 
-       if (type != PERF_TYPE_SOFTWARE) {
-               if (type >= 0)
-                       max = type;
+       if (type >= 0)
+               max = type;
 
-               ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
-               if (ret < 0)
-                       goto free_pdc;
+       ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
+       if (ret < 0)
+               goto free_pdc;
 
-               WARN_ON(type >= 0 && ret != type);
+       WARN_ON(type >= 0 && ret != type);
 
-               type = ret;
-       }
+       type = ret;
        pmu->type = type;
 
-       if (pmu_bus_running) {
+       if (pmu_bus_running && !pmu->dev) {
                ret = pmu_dev_alloc(pmu);
                if (ret)
                        goto free_idr;
        }
 
-skip_type:
        ret = -ENOMEM;
        pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context);
        if (!pmu->cpu_pmu_context)
@@ -11493,16 +11505,7 @@ skip_type:
        if (!pmu->event_idx)
                pmu->event_idx = perf_event_idx_default;
 
-       /*
-        * Ensure the TYPE_SOFTWARE PMUs are at the head of the list,
-        * since these cannot be in the IDR. This way the linear search
-        * is fast, provided a valid software event is provided.
-        */
-       if (type == PERF_TYPE_SOFTWARE || !name)
-               list_add_rcu(&pmu->entry, &pmus);
-       else
-               list_add_tail_rcu(&pmu->entry, &pmus);
-
+       list_add_rcu(&pmu->entry, &pmus);
        atomic_set(&pmu->exclusive_cnt, 0);
        ret = 0;
 unlock:
@@ -11511,12 +11514,13 @@ unlock:
        return ret;
 
 free_dev:
-       device_del(pmu->dev);
-       put_device(pmu->dev);
+       if (pmu->dev && pmu->dev != PMU_NULL_DEV) {
+               device_del(pmu->dev);
+               put_device(pmu->dev);
+       }
 
 free_idr:
-       if (pmu->type != PERF_TYPE_SOFTWARE)
-               idr_remove(&pmu_idr, pmu->type);
+       idr_remove(&pmu_idr, pmu->type);
 
 free_pdc:
        free_percpu(pmu->pmu_disable_count);
@@ -11537,9 +11541,8 @@ void perf_pmu_unregister(struct pmu *pmu)
        synchronize_rcu();
 
        free_percpu(pmu->pmu_disable_count);
-       if (pmu->type != PERF_TYPE_SOFTWARE)
-               idr_remove(&pmu_idr, pmu->type);
-       if (pmu_bus_running) {
+       idr_remove(&pmu_idr, pmu->type);
+       if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) {
                if (pmu->nr_addr_filters)
                        device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
                device_del(pmu->dev);
@@ -11613,6 +11616,12 @@ static struct pmu *perf_init_event(struct perf_event *event)
 
        idx = srcu_read_lock(&pmus_srcu);
 
+       /*
+        * Save the original type before calling pmu->event_init(), since
+        * certain PMUs overwrite event->attr.type to forward the event to
+        * another PMU.
+        */
+       event->orig_type = event->attr.type;
+
        /* Try parent's PMU first: */
        if (event->parent && event->parent->pmu) {
                pmu = event->parent->pmu;
@@ -13652,8 +13661,8 @@ void __init perf_event_init(void)
        perf_event_init_all_cpus();
        init_srcu_struct(&pmus_srcu);
        perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
-       perf_pmu_register(&perf_cpu_clock, NULL, -1);
-       perf_pmu_register(&perf_task_clock, NULL, -1);
+       perf_pmu_register(&perf_cpu_clock, "cpu_clock", -1);
+       perf_pmu_register(&perf_task_clock, "task_clock", -1);
        perf_tp_register();
        perf_event_init_cpu(smp_processor_id());
        register_reboot_notifier(&perf_reboot_notifier);
@@ -13696,7 +13705,7 @@ static int __init perf_event_sysfs_init(void)
                goto unlock;
 
        list_for_each_entry(pmu, &pmus, entry) {
-               if (!pmu->name || pmu->type < 0)
+               if (pmu->dev)
                        continue;
 
                ret = pmu_dev_alloc(pmu);
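
With the change above, the cpu-clock and task-clock PMUs get their own names and dynamic types instead of being special-cased under PERF_TYPE_SOFTWARE; perf_swevent_init() forwards PERF_COUNT_SW_CPU_CLOCK / PERF_COUNT_SW_TASK_CLOCK to them by rewriting event->attr.type and returning -ENOENT so the lookup continues with the right PMU, while event->orig_type preserves what userspace asked for. Userspace should be unaffected: a conventional perf_event_open() of the software cpu-clock counter still works. A small self-contained example using the standard syscall wrapper (the wrapper and the busy loop are ours, not part of the patch):

/* Open PERF_COUNT_SW_CPU_CLOCK via PERF_TYPE_SOFTWARE; the kernel forwards it
 * to the dynamic cpu_clock PMU transparently. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
			    int group_fd, unsigned long flags)
{
	return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	long long count = -1;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;		/* forwarded to the cpu_clock PMU */
	attr.config = PERF_COUNT_SW_CPU_CLOCK;
	attr.disabled = 1;

	fd = perf_event_open(&attr, 0, -1, -1, 0);	/* this process, any CPU */
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	for (volatile int i = 0; i < 10000000; i++)
		;				/* burn some CPU time */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("cpu-clock: %lld ns\n", count);
	close(fd);
	return 0;
}
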
index 59887c6..f0ac5b8 100644 (file)
@@ -192,7 +192,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
                inc_mm_counter(mm, MM_ANONPAGES);
        }
 
-       flush_cache_page(vma, addr, pte_pfn(*pvmw.pte));
+       flush_cache_page(vma, addr, pte_pfn(ptep_get(pvmw.pte)));
        ptep_clear_flush_notify(vma, addr, pvmw.pte);
        if (new_page)
                set_pte_at_notify(mm, addr, pvmw.pte,
@@ -365,7 +365,6 @@ __update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
 {
        void *kaddr;
        struct page *page;
-       struct vm_area_struct *vma;
        int ret;
        short *ptr;
 
@@ -373,7 +372,7 @@ __update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
                return -EINVAL;
 
        ret = get_user_pages_remote(mm, vaddr, 1,
-                       FOLL_WRITE, &page, &vma, NULL);
+                                   FOLL_WRITE, &page, NULL);
        if (unlikely(ret <= 0)) {
                /*
                 * We are asking for 1 page. If get_user_pages_remote() fails,
@@ -474,10 +473,9 @@ retry:
        if (is_register)
                gup_flags |= FOLL_SPLIT_PMD;
        /* Read the page with vaddr into memory */
-       ret = get_user_pages_remote(mm, vaddr, 1, gup_flags,
-                                   &old_page, &vma, NULL);
-       if (ret <= 0)
-               return ret;
+       old_page = get_user_page_vma_remote(mm, vaddr, gup_flags, &vma);
+       if (IS_ERR_OR_NULL(old_page))
+               return old_page ? PTR_ERR(old_page) : 0;
 
        ret = verify_opcode(old_page, vaddr, &opcode);
        if (ret <= 0)
@@ -2027,8 +2025,7 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
         * but we treat this as a 'remote' access since it is
         * essentially a kernel access to the memory.
         */
-       result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page,
-                       NULL, NULL);
+       result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);
        if (result < 0)
                return result;
 
index 41c9641..b85814e 100644 (file)
@@ -252,23 +252,19 @@ static int memcg_charge_kernel_stack(struct vm_struct *vm)
 {
        int i;
        int ret;
+       int nr_charged = 0;
 
-       BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
        BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
 
        for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
                ret = memcg_kmem_charge_page(vm->pages[i], GFP_KERNEL, 0);
                if (ret)
                        goto err;
+               nr_charged++;
        }
        return 0;
 err:
-       /*
-        * If memcg_kmem_charge_page() fails, page's memory cgroup pointer is
-        * NULL, and memcg_kmem_uncharge_page() in free_thread_stack() will
-        * ignore this page.
-        */
-       for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+       for (i = 0; i < nr_charged; i++)
                memcg_kmem_uncharge_page(vm->pages[i], 0);
        return ret;
 }
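
The error-path change above replaces "uncharge everything and rely on memcg_kmem_uncharge_page() ignoring pages that were never charged" (the removed comment) with explicit tracking of how many pages were successfully charged. A minimal userspace illustration of that rollback pattern, with invented charge_one()/uncharge_one() helpers standing in for the memcg calls:

/* Remember how many items were charged and only undo those on failure. */
#include <stdio.h>

#define NR_ITEMS 4

static int charge_one(int i)
{
	return i == 2 ? -1 : 0;		/* simulate a failure on the third item */
}

static void uncharge_one(int i)
{
	printf("uncharging item %d\n", i);
}

static int charge_all(void)
{
	int nr_charged = 0;

	for (int i = 0; i < NR_ITEMS; i++) {
		if (charge_one(i))
			goto err;
		nr_charged++;
	}
	return 0;
err:
	/* Only roll back what was actually charged. */
	for (int i = 0; i < nr_charged; i++)
		uncharge_one(i);
	return -1;
}

int main(void)
{
	if (charge_all())
		printf("charge failed, rolled back cleanly\n");
	return 0;
}
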
index 49e7bc8..ee8c0ac 100644 (file)
@@ -306,6 +306,7 @@ static void __irq_disable(struct irq_desc *desc, bool mask);
 void irq_shutdown(struct irq_desc *desc)
 {
        if (irqd_is_started(&desc->irq_data)) {
+               clear_irq_resend(desc);
                desc->depth = 1;
                if (desc->irq_data.chip->irq_shutdown) {
                        desc->irq_data.chip->irq_shutdown(&desc->irq_data);
@@ -692,8 +693,16 @@ void handle_fasteoi_irq(struct irq_desc *desc)
 
        raw_spin_lock(&desc->lock);
 
-       if (!irq_may_run(desc))
+       /*
+        * When an affinity change races with IRQ handling, the next interrupt
+        * can arrive on the new CPU before the original CPU has completed
+        * handling the previous one - it may need to be resent.
+        */
+       if (!irq_may_run(desc)) {
+               if (irqd_needs_resend_when_in_progress(&desc->irq_data))
+                       desc->istate |= IRQS_PENDING;
                goto out;
+       }
 
        desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
 
@@ -715,6 +724,12 @@ void handle_fasteoi_irq(struct irq_desc *desc)
 
        cond_unmask_eoi_irq(desc, chip);
 
+       /*
+        * When the race described above happens this will resend the interrupt.
+        */
+       if (unlikely(desc->istate & IRQS_PENDING))
+               check_irq_resend(desc, false);
+
        raw_spin_unlock(&desc->lock);
        return;
 out:
index bbcaac6..5971a66 100644 (file)
@@ -133,6 +133,8 @@ static const struct irq_bit_descr irqdata_states[] = {
        BIT_MASK_DESCR(IRQD_HANDLE_ENFORCE_IRQCTX),
 
        BIT_MASK_DESCR(IRQD_IRQ_ENABLED_ON_SUSPEND),
+
+       BIT_MASK_DESCR(IRQD_RESEND_WHEN_IN_PROGRESS),
 };
 
 static const struct irq_bit_descr irqdesc_states[] = {
index 5fdc0b5..bdd35bb 100644 (file)
@@ -12,9 +12,9 @@
 #include <linux/sched/clock.h>
 
 #ifdef CONFIG_SPARSE_IRQ
-# define IRQ_BITMAP_BITS       (NR_IRQS + 8196)
+# define MAX_SPARSE_IRQS       INT_MAX
 #else
-# define IRQ_BITMAP_BITS       NR_IRQS
+# define MAX_SPARSE_IRQS       NR_IRQS
 #endif
 
 #define istate core_internal_state__do_not_mess_with_it
@@ -47,9 +47,12 @@ enum {
  *                               detection
  * IRQS_POLL_INPROGRESS                - polling in progress
  * IRQS_ONESHOT                        - irq is not unmasked in primary handler
- * IRQS_REPLAY                 - irq is replayed
+ * IRQS_REPLAY                 - irq has been resent and will not be resent
+ *                               again until the handler has run and cleared
+ *                               this flag.
  * IRQS_WAITING                        - irq is waiting
- * IRQS_PENDING                        - irq is pending and replayed later
+ * IRQS_PENDING                        - irq needs to be resent and should be resent
+ *                               at the next available opportunity.
  * IRQS_SUSPENDED              - irq is suspended
  * IRQS_NMI                    - irq line is used to deliver NMIs
  * IRQS_SYSFS                  - descriptor has been added to sysfs
@@ -113,6 +116,8 @@ irqreturn_t handle_irq_event(struct irq_desc *desc);
 
 /* Resending of interrupts :*/
 int check_irq_resend(struct irq_desc *desc, bool inject);
+void clear_irq_resend(struct irq_desc *desc);
+void irq_resend_init(struct irq_desc *desc);
 bool irq_wait_for_poll(struct irq_desc *desc);
 void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action);
 
index 240e145..27ca1c8 100644 (file)
@@ -12,8 +12,7 @@
 #include <linux/export.h>
 #include <linux/interrupt.h>
 #include <linux/kernel_stat.h>
-#include <linux/radix-tree.h>
-#include <linux/bitmap.h>
+#include <linux/maple_tree.h>
 #include <linux/irqdomain.h>
 #include <linux/sysfs.h>
 
@@ -131,7 +130,40 @@ int nr_irqs = NR_IRQS;
 EXPORT_SYMBOL_GPL(nr_irqs);
 
 static DEFINE_MUTEX(sparse_irq_lock);
-static DECLARE_BITMAP(allocated_irqs, IRQ_BITMAP_BITS);
+static struct maple_tree sparse_irqs = MTREE_INIT_EXT(sparse_irqs,
+                                       MT_FLAGS_ALLOC_RANGE |
+                                       MT_FLAGS_LOCK_EXTERN |
+                                       MT_FLAGS_USE_RCU,
+                                       sparse_irq_lock);
+
+static int irq_find_free_area(unsigned int from, unsigned int cnt)
+{
+       MA_STATE(mas, &sparse_irqs, 0, 0);
+
+       if (mas_empty_area(&mas, from, MAX_SPARSE_IRQS, cnt))
+               return -ENOSPC;
+       return mas.index;
+}
+
+static unsigned int irq_find_at_or_after(unsigned int offset)
+{
+       unsigned long index = offset;
+       struct irq_desc *desc = mt_find(&sparse_irqs, &index, nr_irqs);
+
+       return desc ? irq_desc_get_irq(desc) : nr_irqs;
+}
+
+static void irq_insert_desc(unsigned int irq, struct irq_desc *desc)
+{
+       MA_STATE(mas, &sparse_irqs, irq, irq);
+       WARN_ON(mas_store_gfp(&mas, desc, GFP_KERNEL) != 0);
+}
+
+static void delete_irq_desc(unsigned int irq)
+{
+       MA_STATE(mas, &sparse_irqs, irq, irq);
+       mas_erase(&mas);
+}
 
 #ifdef CONFIG_SPARSE_IRQ
 
@@ -344,26 +376,14 @@ static void irq_sysfs_del(struct irq_desc *desc) {}
 
 #endif /* CONFIG_SYSFS */
 
-static RADIX_TREE(irq_desc_tree, GFP_KERNEL);
-
-static void irq_insert_desc(unsigned int irq, struct irq_desc *desc)
-{
-       radix_tree_insert(&irq_desc_tree, irq, desc);
-}
-
 struct irq_desc *irq_to_desc(unsigned int irq)
 {
-       return radix_tree_lookup(&irq_desc_tree, irq);
+       return mtree_load(&sparse_irqs, irq);
 }
 #ifdef CONFIG_KVM_BOOK3S_64_HV_MODULE
 EXPORT_SYMBOL_GPL(irq_to_desc);
 #endif
 
-static void delete_irq_desc(unsigned int irq)
-{
-       radix_tree_delete(&irq_desc_tree, irq);
-}
-
 #ifdef CONFIG_SMP
 static void free_masks(struct irq_desc *desc)
 {
@@ -415,6 +435,7 @@ static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
        desc_set_defaults(irq, desc, node, affinity, owner);
        irqd_set(&desc->irq_data, flags);
        kobject_init(&desc->kobj, &irq_kobj_type);
+       irq_resend_init(desc);
 
        return desc;
 
@@ -505,7 +526,6 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
                irq_sysfs_add(start + i, desc);
                irq_add_debugfs_entry(start + i, desc);
        }
-       bitmap_set(allocated_irqs, start, cnt);
        return start;
 
 err:
@@ -516,7 +536,7 @@ err:
 
 static int irq_expand_nr_irqs(unsigned int nr)
 {
-       if (nr > IRQ_BITMAP_BITS)
+       if (nr > MAX_SPARSE_IRQS)
                return -ENOMEM;
        nr_irqs = nr;
        return 0;
@@ -534,18 +554,17 @@ int __init early_irq_init(void)
        printk(KERN_INFO "NR_IRQS: %d, nr_irqs: %d, preallocated irqs: %d\n",
               NR_IRQS, nr_irqs, initcnt);
 
-       if (WARN_ON(nr_irqs > IRQ_BITMAP_BITS))
-               nr_irqs = IRQ_BITMAP_BITS;
+       if (WARN_ON(nr_irqs > MAX_SPARSE_IRQS))
+               nr_irqs = MAX_SPARSE_IRQS;
 
-       if (WARN_ON(initcnt > IRQ_BITMAP_BITS))
-               initcnt = IRQ_BITMAP_BITS;
+       if (WARN_ON(initcnt > MAX_SPARSE_IRQS))
+               initcnt = MAX_SPARSE_IRQS;
 
        if (initcnt > nr_irqs)
                nr_irqs = initcnt;
 
        for (i = 0; i < initcnt; i++) {
                desc = alloc_desc(i, node, 0, NULL, NULL);
-               set_bit(i, allocated_irqs);
                irq_insert_desc(i, desc);
        }
        return arch_early_irq_init();
@@ -581,6 +600,7 @@ int __init early_irq_init(void)
                mutex_init(&desc[i].request_mutex);
                init_waitqueue_head(&desc[i].wait_for_threads);
                desc_set_defaults(i, &desc[i], node, NULL, NULL);
+               irq_resend_init(desc);
        }
        return arch_early_irq_init();
 }
@@ -599,6 +619,7 @@ static void free_desc(unsigned int irq)
        raw_spin_lock_irqsave(&desc->lock, flags);
        desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL, NULL);
        raw_spin_unlock_irqrestore(&desc->lock, flags);
+       delete_irq_desc(irq);
 }
 
 static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
@@ -611,8 +632,8 @@ static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
                struct irq_desc *desc = irq_to_desc(start + i);
 
                desc->owner = owner;
+               irq_insert_desc(start + i, desc);
        }
-       bitmap_set(allocated_irqs, start, cnt);
        return start;
 }
 
@@ -624,7 +645,7 @@ static int irq_expand_nr_irqs(unsigned int nr)
 void irq_mark_irq(unsigned int irq)
 {
        mutex_lock(&sparse_irq_lock);
-       bitmap_set(allocated_irqs, irq, 1);
+       irq_insert_desc(irq, irq_desc + irq);
        mutex_unlock(&sparse_irq_lock);
 }
 
@@ -768,7 +789,6 @@ void irq_free_descs(unsigned int from, unsigned int cnt)
        for (i = 0; i < cnt; i++)
                free_desc(from + i);
 
-       bitmap_clear(allocated_irqs, from, cnt);
        mutex_unlock(&sparse_irq_lock);
 }
 EXPORT_SYMBOL_GPL(irq_free_descs);
@@ -810,8 +830,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
 
        mutex_lock(&sparse_irq_lock);
 
-       start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS,
-                                          from, cnt, 0);
+       start = irq_find_free_area(from, cnt);
        ret = -EEXIST;
        if (irq >=0 && start != irq)
                goto unlock;
@@ -836,7 +855,7 @@ EXPORT_SYMBOL_GPL(__irq_alloc_descs);
  */
 unsigned int irq_get_next_irq(unsigned int offset)
 {
-       return find_next_bit(allocated_irqs, nr_irqs, offset);
+       return irq_find_at_or_after(offset);
 }
 
 struct irq_desc *
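
The irqdesc.c hunks above replace the allocated_irqs bitmap plus the radix tree with a single maple tree. A minimal, self-contained sketch of the same store/load/erase pattern, using the self-locking mtree_*() wrappers rather than the mas_*() advanced API that the patch runs under sparse_irq_lock; the tree and function names here are illustrative only:

    #include <linux/maple_tree.h>
    #include <linux/gfp.h>

    static DEFINE_MTREE(demo_tree);         /* stand-in for sparse_irqs */

    static int demo_insert(unsigned long irq, void *desc)
    {
            /* Store @desc at index @irq (counterpart of irq_insert_desc()). */
            return mtree_insert(&demo_tree, irq, desc, GFP_KERNEL);
    }

    static void *demo_lookup(unsigned long irq)
    {
            /* RCU-protected lookup (counterpart of the new irq_to_desc()). */
            return mtree_load(&demo_tree, irq);
    }

    static void demo_remove(unsigned long irq)
    {
            /* Counterpart of delete_irq_desc(). */
            mtree_erase(&demo_tree, irq);
    }
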
index f34760a..5bd0162 100644 (file)
@@ -1915,6 +1915,8 @@ static void irq_domain_check_hierarchy(struct irq_domain *domain)
 #endif /* CONFIG_IRQ_DOMAIN_HIERARCHY */
 
 #ifdef CONFIG_GENERIC_IRQ_DEBUGFS
+#include "internals.h"
+
 static struct dentry *domain_dir;
 
 static void
index 0c46e9f..edec335 100644 (file)
@@ -21,8 +21,9 @@
 
 #ifdef CONFIG_HARDIRQS_SW_RESEND
 
-/* Bitmap to handle software resend of interrupts: */
-static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
+/* hlist_head to handle software resend of interrupts: */
+static HLIST_HEAD(irq_resend_list);
+static DEFINE_RAW_SPINLOCK(irq_resend_lock);
 
 /*
  * Run software resends of IRQ's
@@ -30,18 +31,17 @@ static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
 static void resend_irqs(struct tasklet_struct *unused)
 {
        struct irq_desc *desc;
-       int irq;
-
-       while (!bitmap_empty(irqs_resend, nr_irqs)) {
-               irq = find_first_bit(irqs_resend, nr_irqs);
-               clear_bit(irq, irqs_resend);
-               desc = irq_to_desc(irq);
-               if (!desc)
-                       continue;
-               local_irq_disable();
+
+       raw_spin_lock_irq(&irq_resend_lock);
+       while (!hlist_empty(&irq_resend_list)) {
+               desc = hlist_entry(irq_resend_list.first, struct irq_desc,
+                                  resend_node);
+               hlist_del_init(&desc->resend_node);
+               raw_spin_unlock(&irq_resend_lock);
                desc->handle_irq(desc);
-               local_irq_enable();
+               raw_spin_lock(&irq_resend_lock);
        }
+       raw_spin_unlock_irq(&irq_resend_lock);
 }
 
 /* Tasklet to handle resend: */
@@ -49,8 +49,6 @@ static DECLARE_TASKLET(resend_tasklet, resend_irqs);
 
 static int irq_sw_resend(struct irq_desc *desc)
 {
-       unsigned int irq = irq_desc_get_irq(desc);
-
        /*
         * Validate whether this interrupt can be safely injected from
         * non interrupt context
@@ -70,16 +68,31 @@ static int irq_sw_resend(struct irq_desc *desc)
                 */
                if (!desc->parent_irq)
                        return -EINVAL;
-               irq = desc->parent_irq;
        }
 
-       /* Set it pending and activate the softirq: */
-       set_bit(irq, irqs_resend);
+       /* Add to resend_list and activate the softirq: */
+       raw_spin_lock(&irq_resend_lock);
+       hlist_add_head(&desc->resend_node, &irq_resend_list);
+       raw_spin_unlock(&irq_resend_lock);
        tasklet_schedule(&resend_tasklet);
        return 0;
 }
 
+void clear_irq_resend(struct irq_desc *desc)
+{
+       raw_spin_lock(&irq_resend_lock);
+       hlist_del_init(&desc->resend_node);
+       raw_spin_unlock(&irq_resend_lock);
+}
+
+void irq_resend_init(struct irq_desc *desc)
+{
+       INIT_HLIST_NODE(&desc->resend_node);
+}
 #else
+void clear_irq_resend(struct irq_desc *desc) {}
+void irq_resend_init(struct irq_desc *desc) {}
+
 static int irq_sw_resend(struct irq_desc *desc)
 {
        return -EINVAL;
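
The resend path above now queues pending descriptors on an hlist instead of scanning a bitmap. One property worth spelling out: pairing INIT_HLIST_NODE() in irq_resend_init() with hlist_del_init() in clear_irq_resend() makes the removal safe whether or not the descriptor is currently queued, because hlist_del_init() is a no-op on an unhashed node. A tiny standalone illustration (not kernel/irq code):

    #include <linux/list.h>

    struct demo_item {
            struct hlist_node node;
    };

    static HLIST_HEAD(demo_pending);

    static void demo_item_init(struct demo_item *it)
    {
            INIT_HLIST_NODE(&it->node);     /* starts out "unhashed" */
    }

    static void demo_item_queue(struct demo_item *it)
    {
            hlist_add_head(&it->node, &demo_pending);
    }

    static void demo_item_cancel(struct demo_item *it)
    {
            /* Safe even if demo_item_queue() never ran: no-op when unhashed. */
            hlist_del_init(&it->node);
    }
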
index 7774739..7982cc9 100644 (file)
@@ -484,34 +484,6 @@ found:
        return 0;
 }
 
-int lookup_symbol_attrs(unsigned long addr, unsigned long *size,
-                       unsigned long *offset, char *modname, char *name)
-{
-       int res;
-
-       name[0] = '\0';
-       name[KSYM_NAME_LEN - 1] = '\0';
-
-       if (is_ksym_addr(addr)) {
-               unsigned long pos;
-
-               pos = get_symbol_pos(addr, size, offset);
-               /* Grab name */
-               kallsyms_expand_symbol(get_symbol_offset(pos),
-                                      name, KSYM_NAME_LEN);
-               modname[0] = '\0';
-               goto found;
-       }
-       /* See if it's in a module. */
-       res = lookup_module_symbol_attrs(addr, size, offset, modname, name);
-       if (res)
-               return res;
-
-found:
-       cleanup_symbol_name(name);
-       return 0;
-}
-
 /* Look up a kernel symbol and return it in a text buffer. */
 static int __sprint_symbol(char *buffer, unsigned long address,
                           int symbol_offset, int add_offset, int add_buildid)
@@ -646,7 +618,6 @@ int sprint_backtrace_build_id(char *buffer, unsigned long address)
 /* To avoid using get_symbol_offset for every symbol, we carry prefix along. */
 struct kallsym_iter {
        loff_t pos;
-       loff_t pos_arch_end;
        loff_t pos_mod_end;
        loff_t pos_ftrace_mod_end;
        loff_t pos_bpf_end;
@@ -659,29 +630,9 @@ struct kallsym_iter {
        int show_value;
 };
 
-int __weak arch_get_kallsym(unsigned int symnum, unsigned long *value,
-                           char *type, char *name)
-{
-       return -EINVAL;
-}
-
-static int get_ksymbol_arch(struct kallsym_iter *iter)
-{
-       int ret = arch_get_kallsym(iter->pos - kallsyms_num_syms,
-                                  &iter->value, &iter->type,
-                                  iter->name);
-
-       if (ret < 0) {
-               iter->pos_arch_end = iter->pos;
-               return 0;
-       }
-
-       return 1;
-}
-
 static int get_ksymbol_mod(struct kallsym_iter *iter)
 {
-       int ret = module_get_kallsym(iter->pos - iter->pos_arch_end,
+       int ret = module_get_kallsym(iter->pos - kallsyms_num_syms,
                                     &iter->value, &iter->type,
                                     iter->name, iter->module_name,
                                     &iter->exported);
@@ -716,7 +667,7 @@ static int get_ksymbol_bpf(struct kallsym_iter *iter)
 {
        int ret;
 
-       strlcpy(iter->module_name, "bpf", MODULE_NAME_LEN);
+       strscpy(iter->module_name, "bpf", MODULE_NAME_LEN);
        iter->exported = 0;
        ret = bpf_get_kallsym(iter->pos - iter->pos_ftrace_mod_end,
                              &iter->value, &iter->type,
@@ -736,7 +687,7 @@ static int get_ksymbol_bpf(struct kallsym_iter *iter)
  */
 static int get_ksymbol_kprobe(struct kallsym_iter *iter)
 {
-       strlcpy(iter->module_name, "__builtin__kprobes", MODULE_NAME_LEN);
+       strscpy(iter->module_name, "__builtin__kprobes", MODULE_NAME_LEN);
        iter->exported = 0;
        return kprobe_get_kallsym(iter->pos - iter->pos_bpf_end,
                                  &iter->value, &iter->type,
@@ -764,7 +715,6 @@ static void reset_iter(struct kallsym_iter *iter, loff_t new_pos)
        iter->nameoff = get_symbol_offset(new_pos);
        iter->pos = new_pos;
        if (new_pos == 0) {
-               iter->pos_arch_end = 0;
                iter->pos_mod_end = 0;
                iter->pos_ftrace_mod_end = 0;
                iter->pos_bpf_end = 0;
@@ -780,10 +730,6 @@ static int update_iter_mod(struct kallsym_iter *iter, loff_t pos)
 {
        iter->pos = pos;
 
-       if ((!iter->pos_arch_end || iter->pos_arch_end > pos) &&
-           get_ksymbol_arch(iter))
-               return 1;
-
        if ((!iter->pos_mod_end || iter->pos_mod_end > pos) &&
            get_ksymbol_mod(iter))
                return 1;
@@ -961,41 +907,6 @@ late_initcall(bpf_ksym_iter_register);
 
 #endif /* CONFIG_BPF_SYSCALL */
 
-static inline int kallsyms_for_perf(void)
-{
-#ifdef CONFIG_PERF_EVENTS
-       extern int sysctl_perf_event_paranoid;
-       if (sysctl_perf_event_paranoid <= 1)
-               return 1;
-#endif
-       return 0;
-}
-
-/*
- * We show kallsyms information even to normal users if we've enabled
- * kernel profiling and are explicitly not paranoid (so kptr_restrict
- * is clear, and sysctl_perf_event_paranoid isn't set).
- *
- * Otherwise, require CAP_SYSLOG (assuming kptr_restrict isn't set to
- * block even that).
- */
-bool kallsyms_show_value(const struct cred *cred)
-{
-       switch (kptr_restrict) {
-       case 0:
-               if (kallsyms_for_perf())
-                       return true;
-               fallthrough;
-       case 1:
-               if (security_capable(cred, &init_user_ns, CAP_SYSLOG,
-                                    CAP_OPT_NOAUDIT) == 0)
-                       return true;
-               fallthrough;
-       default:
-               return false;
-       }
-}
-
 static int kallsyms_open(struct inode *inode, struct file *file)
 {
        /*
index 84c7173..f9ac2e9 100644 (file)
@@ -279,7 +279,7 @@ void notrace __sanitizer_cov_trace_cmp4(u32 arg1, u32 arg2)
 }
 EXPORT_SYMBOL(__sanitizer_cov_trace_cmp4);
 
-void notrace __sanitizer_cov_trace_cmp8(u64 arg1, u64 arg2)
+void notrace __sanitizer_cov_trace_cmp8(kcov_u64 arg1, kcov_u64 arg2)
 {
        write_comp_data(KCOV_CMP_SIZE(3), arg1, arg2, _RET_IP_);
 }
@@ -306,16 +306,17 @@ void notrace __sanitizer_cov_trace_const_cmp4(u32 arg1, u32 arg2)
 }
 EXPORT_SYMBOL(__sanitizer_cov_trace_const_cmp4);
 
-void notrace __sanitizer_cov_trace_const_cmp8(u64 arg1, u64 arg2)
+void notrace __sanitizer_cov_trace_const_cmp8(kcov_u64 arg1, kcov_u64 arg2)
 {
        write_comp_data(KCOV_CMP_SIZE(3) | KCOV_CMP_CONST, arg1, arg2,
                        _RET_IP_);
 }
 EXPORT_SYMBOL(__sanitizer_cov_trace_const_cmp8);
 
-void notrace __sanitizer_cov_trace_switch(u64 val, u64 *cases)
+void notrace __sanitizer_cov_trace_switch(kcov_u64 val, void *arg)
 {
        u64 i;
+       u64 *cases = arg;
        u64 count = cases[0];
        u64 size = cases[1];
        u64 type = KCOV_CMP_CONST;
index 3d578c6..e2f2574 100644 (file)
@@ -1091,6 +1091,11 @@ __bpf_kfunc void crash_kexec(struct pt_regs *regs)
        }
 }
 
+static inline resource_size_t crash_resource_size(const struct resource *res)
+{
+       return !res->end ? 0 : resource_size(res);
+}
+
 ssize_t crash_get_memory_size(void)
 {
        ssize_t size = 0;
@@ -1098,19 +1103,45 @@ ssize_t crash_get_memory_size(void)
        if (!kexec_trylock())
                return -EBUSY;
 
-       if (crashk_res.end != crashk_res.start)
-               size = resource_size(&crashk_res);
+       size += crash_resource_size(&crashk_res);
+       size += crash_resource_size(&crashk_low_res);
 
        kexec_unlock();
        return size;
 }
 
+static int __crash_shrink_memory(struct resource *old_res,
+                                unsigned long new_size)
+{
+       struct resource *ram_res;
+
+       ram_res = kzalloc(sizeof(*ram_res), GFP_KERNEL);
+       if (!ram_res)
+               return -ENOMEM;
+
+       ram_res->start = old_res->start + new_size;
+       ram_res->end   = old_res->end;
+       ram_res->flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM;
+       ram_res->name  = "System RAM";
+
+       if (!new_size) {
+               release_resource(old_res);
+               old_res->start = 0;
+               old_res->end   = 0;
+       } else {
+               crashk_res.end = ram_res->start - 1;
+       }
+
+       crash_free_reserved_phys_range(ram_res->start, ram_res->end);
+       insert_resource(&iomem_resource, ram_res);
+
+       return 0;
+}
+
 int crash_shrink_memory(unsigned long new_size)
 {
        int ret = 0;
-       unsigned long start, end;
-       unsigned long old_size;
-       struct resource *ram_res;
+       unsigned long old_size, low_size;
 
        if (!kexec_trylock())
                return -EBUSY;
@@ -1119,36 +1150,42 @@ int crash_shrink_memory(unsigned long new_size)
                ret = -ENOENT;
                goto unlock;
        }
-       start = crashk_res.start;
-       end = crashk_res.end;
-       old_size = (end == 0) ? 0 : end - start + 1;
+
+       low_size = crash_resource_size(&crashk_low_res);
+       old_size = crash_resource_size(&crashk_res) + low_size;
+       new_size = roundup(new_size, KEXEC_CRASH_MEM_ALIGN);
        if (new_size >= old_size) {
                ret = (new_size == old_size) ? 0 : -EINVAL;
                goto unlock;
        }
 
-       ram_res = kzalloc(sizeof(*ram_res), GFP_KERNEL);
-       if (!ram_res) {
-               ret = -ENOMEM;
-               goto unlock;
-       }
-
-       start = roundup(start, KEXEC_CRASH_MEM_ALIGN);
-       end = roundup(start + new_size, KEXEC_CRASH_MEM_ALIGN);
-
-       crash_free_reserved_phys_range(end, crashk_res.end);
-
-       if ((start == end) && (crashk_res.parent != NULL))
-               release_resource(&crashk_res);
-
-       ram_res->start = end;
-       ram_res->end = crashk_res.end;
-       ram_res->flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM;
-       ram_res->name = "System RAM";
+       /*
+        * (low_size > new_size) implies that low_size is greater than zero.
+        * This also means that if low_size is zero, the else branch is taken.
+        *
+        * If low_size is greater than 0, (low_size > new_size) indicates that
+        * crashk_low_res also needs to be shrunken. Otherwise, only crashk_res
+        * needs to be shrunken.
+        */
+       if (low_size > new_size) {
+               ret = __crash_shrink_memory(&crashk_res, 0);
+               if (ret)
+                       goto unlock;
 
-       crashk_res.end = end - 1;
+               ret = __crash_shrink_memory(&crashk_low_res, new_size);
+       } else {
+               ret = __crash_shrink_memory(&crashk_res, new_size - low_size);
+       }
 
-       insert_resource(&iomem_resource, ram_res);
+       /* Swap crashk_res and crashk_low_res if needed */
+       if (!crashk_res.end && crashk_low_res.end) {
+               crashk_res.start = crashk_low_res.start;
+               crashk_res.end   = crashk_low_res.end;
+               release_resource(&crashk_low_res);
+               crashk_low_res.start = 0;
+               crashk_low_res.end   = 0;
+               insert_resource(&iomem_resource, &crashk_res);
+       }
 
 unlock:
        kexec_unlock();
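
To make the new branching in crash_shrink_memory() concrete (illustrative numbers only): with crashk_low_res covering 128M and crashk_res covering 256M, old_size is 384M. Requesting new_size = 300M takes the else branch and shrinks only crashk_res, to 300M - 128M = 172M; requesting new_size = 64M means low_size > new_size, so crashk_res is released entirely and crashk_low_res is shrunk to 64M. When crashk_low_res is absent, low_size is 0 and the else branch degenerates to the old single-region behaviour.
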
index 69ee4a2..881ba0d 100644 (file)
@@ -867,6 +867,7 @@ static int kexec_purgatory_setup_sechdrs(struct purgatory_info *pi,
 {
        unsigned long bss_addr;
        unsigned long offset;
+       size_t sechdrs_size;
        Elf_Shdr *sechdrs;
        int i;
 
@@ -874,11 +875,11 @@ static int kexec_purgatory_setup_sechdrs(struct purgatory_info *pi,
         * The section headers in kexec_purgatory are read-only. In order to
         * have them modifiable make a temporary copy.
         */
-       sechdrs = vzalloc(array_size(sizeof(Elf_Shdr), pi->ehdr->e_shnum));
+       sechdrs_size = array_size(sizeof(Elf_Shdr), pi->ehdr->e_shnum);
+       sechdrs = vzalloc(sechdrs_size);
        if (!sechdrs)
                return -ENOMEM;
-       memcpy(sechdrs, (void *)pi->ehdr + pi->ehdr->e_shoff,
-              pi->ehdr->e_shnum * sizeof(Elf_Shdr));
+       memcpy(sechdrs, (void *)pi->ehdr + pi->ehdr->e_shoff, sechdrs_size);
        pi->sechdrs = sechdrs;
 
        offset = 0;
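
The kexec_purgatory_setup_sechdrs() change computes the section-header size once with array_size() and reuses it for both the allocation and the copy, so the two can no longer disagree; array_size() also saturates to SIZE_MAX on multiplication overflow, which makes the vzalloc() fail rather than under-allocate. A hedged sketch of that pattern (helper name is made up):

    #include <linux/overflow.h>
    #include <linux/vmalloc.h>
    #include <linux/string.h>

    static void *demo_dup_array(const void *src, size_t nmemb, size_t elem_size)
    {
            size_t bytes = array_size(elem_size, nmemb);   /* SIZE_MAX on overflow */
            void *copy = vzalloc(bytes);

            if (!copy)
                    return NULL;

            memcpy(copy, src, bytes);
            return copy;
    }
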
diff --git a/kernel/ksyms_common.c b/kernel/ksyms_common.c
new file mode 100644 (file)
index 0000000..cf1a73c
--- /dev/null
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * ksyms_common.c: A split of kernel/kallsyms.c
+ * Contains a few generic function definitions independent of config KALLSYMS.
+ */
+#include <linux/kallsyms.h>
+#include <linux/security.h>
+
+static inline int kallsyms_for_perf(void)
+{
+#ifdef CONFIG_PERF_EVENTS
+       extern int sysctl_perf_event_paranoid;
+
+       if (sysctl_perf_event_paranoid <= 1)
+               return 1;
+#endif
+       return 0;
+}
+
+/*
+ * We show kallsyms information even to normal users if we've enabled
+ * kernel profiling and are explicitly not paranoid (so kptr_restrict
+ * is clear, and sysctl_perf_event_paranoid isn't set).
+ *
+ * Otherwise, require CAP_SYSLOG (assuming kptr_restrict isn't set to
+ * block even that).
+ */
+bool kallsyms_show_value(const struct cred *cred)
+{
+       switch (kptr_restrict) {
+       case 0:
+               if (kallsyms_for_perf())
+                       return true;
+               fallthrough;
+       case 1:
+               if (security_capable(cred, &init_user_ns, CAP_SYSLOG,
+                                    CAP_OPT_NOAUDIT) == 0)
+                       return true;
+               fallthrough;
+       default:
+               return false;
+       }
+}
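
The relocated kallsyms_show_value() keeps the existing policy: with kptr_restrict == 0 addresses are shown when perf paranoia allows it or the caller has CAP_SYSLOG, with kptr_restrict == 1 only with CAP_SYSLOG, and anything stricter hides them. Callers typically substitute 0 for the address when it returns false, roughly (illustrative helper, not part of this patch):

    #include <linux/kallsyms.h>
    #include <linux/cred.h>

    static unsigned long demo_value_to_show(unsigned long addr, const struct cred *cred)
    {
            return kallsyms_show_value(cred) ? addr : 0;
    }
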
index 490792b..4fff7df 100644 (file)
@@ -182,6 +182,16 @@ bool kthread_should_park(void)
 }
 EXPORT_SYMBOL_GPL(kthread_should_park);
 
+bool kthread_should_stop_or_park(void)
+{
+       struct kthread *kthread = __to_kthread(current);
+
+       if (!kthread)
+               return false;
+
+       return kthread->flags & (BIT(KTHREAD_SHOULD_STOP) | BIT(KTHREAD_SHOULD_PARK));
+}
+
 /**
  * kthread_freezable_should_stop - should this freezable kthread return now?
  * @was_frozen: optional out parameter, indicates whether %current was frozen
@@ -312,10 +322,10 @@ void __noreturn kthread_exit(long result)
  * @comp: Completion to complete
  * @code: The integer value to return to kthread_stop().
  *
- * If present complete @comp and the reuturn code to kthread_stop().
+ * If present, complete @comp and then return code to kthread_stop().
  *
  * A kernel thread whose module may be removed after the completion of
- * @comp can use this function exit safely.
+ * @comp can use this function to exit safely.
  *
  * Does not return.
  */
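
kthread_should_stop_or_park() gives kthread main loops a single predicate covering both the stop and the park request, which is convenient as a wait condition. A minimal sketch of such a loop, assuming the declaration is visible through <linux/kthread.h>; the wait queue and the work check are placeholders:

    #include <linux/kthread.h>
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(demo_wq);

    static bool demo_have_work(void)
    {
            return false;           /* placeholder: no real work source here */
    }

    static int demo_thread_fn(void *data)
    {
            for (;;) {
                    wait_event_interruptible(demo_wq,
                                             kthread_should_stop_or_park() ||
                                             demo_have_work());

                    if (kthread_should_stop())
                            break;

                    if (kthread_should_park()) {
                            kthread_parkme();
                            continue;
                    }

                    /* ... handle the pending work here ... */
            }
            return 0;
    }
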
index 8c7e7d2..a6016b9 100644 (file)
@@ -57,4 +57,8 @@ static inline void __lockevent_add(enum lock_events event, int inc)
 #define lockevent_cond_inc(ev, c)
 
 #endif /* CONFIG_LOCK_EVENT_COUNTS */
+
+ssize_t lockevent_read(struct file *file, char __user *user_buf,
+                      size_t count, loff_t *ppos);
+
 #endif /* __LOCKING_LOCK_EVENTS_H */
index 4dfd2f3..111607d 100644 (file)
@@ -709,7 +709,7 @@ void get_usage_chars(struct lock_class *class, char usage[LOCK_USAGE_CHARS])
        usage[i] = '\0';
 }
 
-static void __print_lock_name(struct lock_class *class)
+static void __print_lock_name(struct held_lock *hlock, struct lock_class *class)
 {
        char str[KSYM_NAME_LEN];
        const char *name;
@@ -724,17 +724,19 @@ static void __print_lock_name(struct lock_class *class)
                        printk(KERN_CONT "#%d", class->name_version);
                if (class->subclass)
                        printk(KERN_CONT "/%d", class->subclass);
+               if (hlock && class->print_fn)
+                       class->print_fn(hlock->instance);
        }
 }
 
-static void print_lock_name(struct lock_class *class)
+static void print_lock_name(struct held_lock *hlock, struct lock_class *class)
 {
        char usage[LOCK_USAGE_CHARS];
 
        get_usage_chars(class, usage);
 
        printk(KERN_CONT " (");
-       __print_lock_name(class);
+       __print_lock_name(hlock, class);
        printk(KERN_CONT "){%s}-{%d:%d}", usage,
                        class->wait_type_outer ?: class->wait_type_inner,
                        class->wait_type_inner);
@@ -772,7 +774,7 @@ static void print_lock(struct held_lock *hlock)
        }
 
        printk(KERN_CONT "%px", hlock->instance);
-       print_lock_name(lock);
+       print_lock_name(hlock, lock);
        printk(KERN_CONT ", at: %pS\n", (void *)hlock->acquire_ip);
 }
 
@@ -1868,7 +1870,7 @@ print_circular_bug_entry(struct lock_list *target, int depth)
        if (debug_locks_silent)
                return;
        printk("\n-> #%u", depth);
-       print_lock_name(target->class);
+       print_lock_name(NULL, target->class);
        printk(KERN_CONT ":\n");
        print_lock_trace(target->trace, 6);
 }
@@ -1899,11 +1901,11 @@ print_circular_lock_scenario(struct held_lock *src,
         */
        if (parent != source) {
                printk("Chain exists of:\n  ");
-               __print_lock_name(source);
+               __print_lock_name(src, source);
                printk(KERN_CONT " --> ");
-               __print_lock_name(parent);
+               __print_lock_name(NULL, parent);
                printk(KERN_CONT " --> ");
-               __print_lock_name(target);
+               __print_lock_name(tgt, target);
                printk(KERN_CONT "\n\n");
        }
 
@@ -1914,13 +1916,13 @@ print_circular_lock_scenario(struct held_lock *src,
                printk("  rlock(");
        else
                printk("  lock(");
-       __print_lock_name(target);
+       __print_lock_name(tgt, target);
        printk(KERN_CONT ");\n");
        printk("                               lock(");
-       __print_lock_name(parent);
+       __print_lock_name(NULL, parent);
        printk(KERN_CONT ");\n");
        printk("                               lock(");
-       __print_lock_name(target);
+       __print_lock_name(tgt, target);
        printk(KERN_CONT ");\n");
        if (src_read != 0)
                printk("  rlock(");
@@ -1928,7 +1930,7 @@ print_circular_lock_scenario(struct held_lock *src,
                printk("  sync(");
        else
                printk("  lock(");
-       __print_lock_name(source);
+       __print_lock_name(src, source);
        printk(KERN_CONT ");\n");
        printk("\n *** DEADLOCK ***\n\n");
 }
@@ -2154,6 +2156,8 @@ check_path(struct held_lock *target, struct lock_list *src_entry,
        return ret;
 }
 
+static void print_deadlock_bug(struct task_struct *, struct held_lock *, struct held_lock *);
+
 /*
  * Prove that the dependency graph starting at <src> can not
  * lead to <target>. If it can, there is a circle when adding
@@ -2185,7 +2189,10 @@ check_noncircular(struct held_lock *src, struct held_lock *target,
                        *trace = save_trace();
                }
 
-               print_circular_bug(&src_entry, target_entry, src, target);
+               if (src->class_idx == target->class_idx)
+                       print_deadlock_bug(current, src, target);
+               else
+                       print_circular_bug(&src_entry, target_entry, src, target);
        }
 
        return ret;
@@ -2346,7 +2353,7 @@ static void print_lock_class_header(struct lock_class *class, int depth)
        int bit;
 
        printk("%*s->", depth, "");
-       print_lock_name(class);
+       print_lock_name(NULL, class);
 #ifdef CONFIG_DEBUG_LOCKDEP
        printk(KERN_CONT " ops: %lu", debug_class_ops_read(class));
 #endif
@@ -2528,11 +2535,11 @@ print_irq_lock_scenario(struct lock_list *safe_entry,
         */
        if (middle_class != unsafe_class) {
                printk("Chain exists of:\n  ");
-               __print_lock_name(safe_class);
+               __print_lock_name(NULL, safe_class);
                printk(KERN_CONT " --> ");
-               __print_lock_name(middle_class);
+               __print_lock_name(NULL, middle_class);
                printk(KERN_CONT " --> ");
-               __print_lock_name(unsafe_class);
+               __print_lock_name(NULL, unsafe_class);
                printk(KERN_CONT "\n\n");
        }
 
@@ -2540,18 +2547,18 @@ print_irq_lock_scenario(struct lock_list *safe_entry,
        printk("       CPU0                    CPU1\n");
        printk("       ----                    ----\n");
        printk("  lock(");
-       __print_lock_name(unsafe_class);
+       __print_lock_name(NULL, unsafe_class);
        printk(KERN_CONT ");\n");
        printk("                               local_irq_disable();\n");
        printk("                               lock(");
-       __print_lock_name(safe_class);
+       __print_lock_name(NULL, safe_class);
        printk(KERN_CONT ");\n");
        printk("                               lock(");
-       __print_lock_name(middle_class);
+       __print_lock_name(NULL, middle_class);
        printk(KERN_CONT ");\n");
        printk("  <Interrupt>\n");
        printk("    lock(");
-       __print_lock_name(safe_class);
+       __print_lock_name(NULL, safe_class);
        printk(KERN_CONT ");\n");
        printk("\n *** DEADLOCK ***\n\n");
 }
@@ -2588,20 +2595,20 @@ print_bad_irq_dependency(struct task_struct *curr,
        pr_warn("\nand this task is already holding:\n");
        print_lock(prev);
        pr_warn("which would create a new lock dependency:\n");
-       print_lock_name(hlock_class(prev));
+       print_lock_name(prev, hlock_class(prev));
        pr_cont(" ->");
-       print_lock_name(hlock_class(next));
+       print_lock_name(next, hlock_class(next));
        pr_cont("\n");
 
        pr_warn("\nbut this new dependency connects a %s-irq-safe lock:\n",
                irqclass);
-       print_lock_name(backwards_entry->class);
+       print_lock_name(NULL, backwards_entry->class);
        pr_warn("\n... which became %s-irq-safe at:\n", irqclass);
 
        print_lock_trace(backwards_entry->class->usage_traces[bit1], 1);
 
        pr_warn("\nto a %s-irq-unsafe lock:\n", irqclass);
-       print_lock_name(forwards_entry->class);
+       print_lock_name(NULL, forwards_entry->class);
        pr_warn("\n... which became %s-irq-unsafe at:\n", irqclass);
        pr_warn("...");
 
@@ -2971,10 +2978,10 @@ print_deadlock_scenario(struct held_lock *nxt, struct held_lock *prv)
        printk("       CPU0\n");
        printk("       ----\n");
        printk("  lock(");
-       __print_lock_name(prev);
+       __print_lock_name(prv, prev);
        printk(KERN_CONT ");\n");
        printk("  lock(");
-       __print_lock_name(next);
+       __print_lock_name(nxt, next);
        printk(KERN_CONT ");\n");
        printk("\n *** DEADLOCK ***\n\n");
        printk(" May be due to missing lock nesting notation\n\n");
@@ -2984,6 +2991,8 @@ static void
 print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
                   struct held_lock *next)
 {
+       struct lock_class *class = hlock_class(prev);
+
        if (!debug_locks_off_graph_unlock() || debug_locks_silent)
                return;
 
@@ -2998,6 +3007,11 @@ print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
        pr_warn("\nbut task is already holding lock:\n");
        print_lock(prev);
 
+       if (class->cmp_fn) {
+               pr_warn("and the lock comparison function returns %i:\n",
+                       class->cmp_fn(prev->instance, next->instance));
+       }
+
        pr_warn("\nother info that might help us debug this:\n");
        print_deadlock_scenario(next, prev);
        lockdep_print_held_locks(curr);
@@ -3019,6 +3033,7 @@ print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
 static int
 check_deadlock(struct task_struct *curr, struct held_lock *next)
 {
+       struct lock_class *class;
        struct held_lock *prev;
        struct held_lock *nest = NULL;
        int i;
@@ -3039,6 +3054,12 @@ check_deadlock(struct task_struct *curr, struct held_lock *next)
                if ((next->read == 2) && prev->read)
                        continue;
 
+               class = hlock_class(prev);
+
+               if (class->cmp_fn &&
+                   class->cmp_fn(prev->instance, next->instance) < 0)
+                       continue;
+
                /*
                 * We're holding the nest_lock, which serializes this lock's
                 * nesting behaviour.
@@ -3100,6 +3121,14 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev,
                return 2;
        }
 
+       if (prev->class_idx == next->class_idx) {
+               struct lock_class *class = hlock_class(prev);
+
+               if (class->cmp_fn &&
+                   class->cmp_fn(prev->instance, next->instance) < 0)
+                       return 2;
+       }
+
        /*
         * Prove that the new <prev> -> <next> dependency would not
         * create a circular dependency in the graph. (We do this by
@@ -3576,7 +3605,7 @@ static void print_chain_keys_chain(struct lock_chain *chain)
                hlock_id = chain_hlocks[chain->base + i];
                chain_key = print_chain_key_iteration(hlock_id, chain_key);
 
-               print_lock_name(lock_classes + chain_hlock_class_idx(hlock_id));
+               print_lock_name(NULL, lock_classes + chain_hlock_class_idx(hlock_id));
                printk("\n");
        }
 }
@@ -3933,11 +3962,11 @@ static void print_usage_bug_scenario(struct held_lock *lock)
        printk("       CPU0\n");
        printk("       ----\n");
        printk("  lock(");
-       __print_lock_name(class);
+       __print_lock_name(lock, class);
        printk(KERN_CONT ");\n");
        printk("  <Interrupt>\n");
        printk("    lock(");
-       __print_lock_name(class);
+       __print_lock_name(lock, class);
        printk(KERN_CONT ");\n");
        printk("\n *** DEADLOCK ***\n\n");
 }
@@ -4023,7 +4052,7 @@ print_irq_inversion_bug(struct task_struct *curr,
                pr_warn("but this lock took another, %s-unsafe lock in the past:\n", irqclass);
        else
                pr_warn("but this lock was taken by another, %s-safe lock in the past:\n", irqclass);
-       print_lock_name(other->class);
+       print_lock_name(NULL, other->class);
        pr_warn("\n\nand interrupts could create inverse lock ordering between them.\n\n");
 
        pr_warn("\nother info that might help us debug this:\n");
@@ -4896,6 +4925,33 @@ EXPORT_SYMBOL_GPL(lockdep_init_map_type);
 struct lock_class_key __lockdep_no_validate__;
 EXPORT_SYMBOL_GPL(__lockdep_no_validate__);
 
+#ifdef CONFIG_PROVE_LOCKING
+void lockdep_set_lock_cmp_fn(struct lockdep_map *lock, lock_cmp_fn cmp_fn,
+                            lock_print_fn print_fn)
+{
+       struct lock_class *class = lock->class_cache[0];
+       unsigned long flags;
+
+       raw_local_irq_save(flags);
+       lockdep_recursion_inc();
+
+       if (!class)
+               class = register_lock_class(lock, 0, 0);
+
+       if (class) {
+               WARN_ON(class->cmp_fn   && class->cmp_fn != cmp_fn);
+               WARN_ON(class->print_fn && class->print_fn != print_fn);
+
+               class->cmp_fn   = cmp_fn;
+               class->print_fn = print_fn;
+       }
+
+       lockdep_recursion_finish();
+       raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lockdep_set_lock_cmp_fn);
+#endif
+
 static void
 print_lock_nested_lock_not_held(struct task_struct *curr,
                                struct held_lock *hlock)
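
lockdep_set_lock_cmp_fn() lets a class whose instances nest in a well-defined order teach that order to lockdep: check_deadlock() and check_prev_add() skip the report when cmp_fn() returns a negative value for the prev/next pair, and print_fn() annotates the lock in the resulting splats. A hedged usage sketch for a hypothetical object embedding a mutex; the callback prototypes are inferred from the call sites above, and reaching the lock's dep_map this way assumes CONFIG_PROVE_LOCKING:

    #include <linux/lockdep.h>
    #include <linux/mutex.h>
    #include <linux/container_of.h>
    #include <linux/printk.h>

    /* Many demo_node instances share one lock class; nesting in increasing
     * ->level order is considered legal.
     */
    struct demo_node {
            struct mutex    lock;
            int             level;
    };

    #ifdef CONFIG_PROVE_LOCKING
    static int demo_cmp_fn(const struct lockdep_map *a, const struct lockdep_map *b)
    {
            const struct demo_node *na = container_of(a, struct demo_node, lock.dep_map);
            const struct demo_node *nb = container_of(b, struct demo_node, lock.dep_map);

            /* Negative return: the prev -> next order is expected, don't report. */
            return na->level - nb->level;
    }

    static void demo_print_fn(const struct lockdep_map *map)
    {
            const struct demo_node *n = container_of(map, struct demo_node, lock.dep_map);

            printk(KERN_CONT " level=%d", n->level);
    }

    static void demo_node_lockdep_init(struct demo_node *n)
    {
            lockdep_set_lock_cmp_fn(&n->lock.dep_map, demo_cmp_fn, demo_print_fn);
    }
    #endif /* CONFIG_PROVE_LOCKING */
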
index 153ddc4..949d3de 100644 (file)
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Paul E. McKenney <paulmck@linux.ibm.com>");
 
-torture_param(int, nwriters_stress, -1,
-            "Number of write-locking stress-test threads");
-torture_param(int, nreaders_stress, -1,
-            "Number of read-locking stress-test threads");
+torture_param(int, nwriters_stress, -1, "Number of write-locking stress-test threads");
+torture_param(int, nreaders_stress, -1, "Number of read-locking stress-test threads");
+torture_param(int, long_hold, 100, "Do occasional long hold of lock (ms), 0=disable");
 torture_param(int, onoff_holdoff, 0, "Time after boot before CPU hotplugs (s)");
-torture_param(int, onoff_interval, 0,
-            "Time between CPU hotplugs (s), 0=disable");
-torture_param(int, shuffle_interval, 3,
-            "Number of jiffies between shuffles, 0=disable");
+torture_param(int, onoff_interval, 0, "Time between CPU hotplugs (s), 0=disable");
+torture_param(int, shuffle_interval, 3, "Number of jiffies between shuffles, 0=disable");
 torture_param(int, shutdown_secs, 0, "Shutdown time (j), <= zero to disable.");
-torture_param(int, stat_interval, 60,
-            "Number of seconds between stats printk()s");
+torture_param(int, stat_interval, 60, "Number of seconds between stats printk()s");
 torture_param(int, stutter, 5, "Number of jiffies to run/halt test, 0=disable");
 torture_param(int, rt_boost, 2,
-               "Do periodic rt-boost. 0=Disable, 1=Only for rt_mutex, 2=For all lock types.");
+                  "Do periodic rt-boost. 0=Disable, 1=Only for rt_mutex, 2=For all lock types.");
 torture_param(int, rt_boost_factor, 50, "A factor determining how often rt-boost happens.");
-torture_param(int, verbose, 1,
-            "Enable verbose debugging printk()s");
+torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
 torture_param(int, nested_locks, 0, "Number of nested locks (max = 8)");
 /* Going much higher trips "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!" errors */
 #define MAX_NESTED_LOCKS 8
@@ -120,7 +115,7 @@ static int torture_lock_busted_write_lock(int tid __maybe_unused)
 
 static void torture_lock_busted_write_delay(struct torture_random_state *trsp)
 {
-       const unsigned long longdelay_ms = 100;
+       const unsigned long longdelay_ms = long_hold ? long_hold : ULONG_MAX;
 
        /* We want a long delay occasionally to force massive contention.  */
        if (!(torture_random(trsp) %
@@ -198,16 +193,18 @@ __acquires(torture_spinlock)
 static void torture_spin_lock_write_delay(struct torture_random_state *trsp)
 {
        const unsigned long shortdelay_us = 2;
-       const unsigned long longdelay_ms = 100;
+       const unsigned long longdelay_ms = long_hold ? long_hold : ULONG_MAX;
+       unsigned long j;
 
        /* We want a short delay mostly to emulate likely code, and
         * we want a long delay occasionally to force massive contention.
         */
-       if (!(torture_random(trsp) %
-             (cxt.nrealwriters_stress * 2000 * longdelay_ms)))
+       if (!(torture_random(trsp) % (cxt.nrealwriters_stress * 2000 * longdelay_ms))) {
+               j = jiffies;
                mdelay(longdelay_ms);
-       if (!(torture_random(trsp) %
-             (cxt.nrealwriters_stress * 2 * shortdelay_us)))
+               pr_alert("%s: delay = %lu jiffies.\n", __func__, jiffies - j);
+       }
+       if (!(torture_random(trsp) % (cxt.nrealwriters_stress * 200 * shortdelay_us)))
                udelay(shortdelay_us);
        if (!(torture_random(trsp) % (cxt.nrealwriters_stress * 20000)))
                torture_preempt_schedule();  /* Allow test to be preempted. */
@@ -322,7 +319,7 @@ __acquires(torture_rwlock)
 static void torture_rwlock_write_delay(struct torture_random_state *trsp)
 {
        const unsigned long shortdelay_us = 2;
-       const unsigned long longdelay_ms = 100;
+       const unsigned long longdelay_ms = long_hold ? long_hold : ULONG_MAX;
 
        /* We want a short delay mostly to emulate likely code, and
         * we want a long delay occasionally to force massive contention.
@@ -455,14 +452,12 @@ __acquires(torture_mutex)
 
 static void torture_mutex_delay(struct torture_random_state *trsp)
 {
-       const unsigned long longdelay_ms = 100;
+       const unsigned long longdelay_ms = long_hold ? long_hold : ULONG_MAX;
 
        /* We want a long delay occasionally to force massive contention.  */
        if (!(torture_random(trsp) %
              (cxt.nrealwriters_stress * 2000 * longdelay_ms)))
                mdelay(longdelay_ms * 5);
-       else
-               mdelay(longdelay_ms / 5);
        if (!(torture_random(trsp) % (cxt.nrealwriters_stress * 20000)))
                torture_preempt_schedule();  /* Allow test to be preempted. */
 }
@@ -630,7 +625,7 @@ __acquires(torture_rtmutex)
 static void torture_rtmutex_delay(struct torture_random_state *trsp)
 {
        const unsigned long shortdelay_us = 2;
-       const unsigned long longdelay_ms = 100;
+       const unsigned long longdelay_ms = long_hold ? long_hold : ULONG_MAX;
 
        /*
         * We want a short delay mostly to emulate likely code, and
@@ -640,7 +635,7 @@ static void torture_rtmutex_delay(struct torture_random_state *trsp)
              (cxt.nrealwriters_stress * 2000 * longdelay_ms)))
                mdelay(longdelay_ms);
        if (!(torture_random(trsp) %
-             (cxt.nrealwriters_stress * 2 * shortdelay_us)))
+             (cxt.nrealwriters_stress * 200 * shortdelay_us)))
                udelay(shortdelay_us);
        if (!(torture_random(trsp) % (cxt.nrealwriters_stress * 20000)))
                torture_preempt_schedule();  /* Allow test to be preempted. */
@@ -695,14 +690,12 @@ __acquires(torture_rwsem)
 
 static void torture_rwsem_write_delay(struct torture_random_state *trsp)
 {
-       const unsigned long longdelay_ms = 100;
+       const unsigned long longdelay_ms = long_hold ? long_hold : ULONG_MAX;
 
        /* We want a long delay occasionally to force massive contention.  */
        if (!(torture_random(trsp) %
              (cxt.nrealwriters_stress * 2000 * longdelay_ms)))
                mdelay(longdelay_ms * 10);
-       else
-               mdelay(longdelay_ms / 10);
        if (!(torture_random(trsp) % (cxt.nrealwriters_stress * 20000)))
                torture_preempt_schedule();  /* Allow test to be preempted. */
 }
@@ -848,8 +841,8 @@ static int lock_torture_writer(void *arg)
 
                        lwsp->n_lock_acquired++;
                }
-               cxt.cur_ops->write_delay(&rand);
                if (!skip_main_lock) {
+                       cxt.cur_ops->write_delay(&rand);
                        lock_is_write_held = false;
                        WRITE_ONCE(last_lock_release, jiffies);
                        cxt.cur_ops->writeunlock(tid);
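
With the new long_hold parameter, the odds of hitting the long-hold path in torture_spin_lock_write_delay() are roughly 1 in nrealwriters_stress * 2000 * long_hold per call to the delay function: with 4 writers and the default long_hold=100 (numbers purely illustrative) that is one ~100 ms hold per 800,000 calls. Setting long_hold=0 substitutes ULONG_MAX for longdelay_ms, making the divisor so large that the long-hold branch is effectively never taken, matching the "0=disable" wording in the parameter description.
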
index c550d7d..ef73ae7 100644 (file)
@@ -381,34 +381,6 @@ out:
        return -ERANGE;
 }
 
-int lookup_module_symbol_attrs(unsigned long addr, unsigned long *size,
-                              unsigned long *offset, char *modname, char *name)
-{
-       struct module *mod;
-
-       preempt_disable();
-       list_for_each_entry_rcu(mod, &modules, list) {
-               if (mod->state == MODULE_STATE_UNFORMED)
-                       continue;
-               if (within_module(addr, mod)) {
-                       const char *sym;
-
-                       sym = find_kallsyms_symbol(mod, addr, size, offset);
-                       if (!sym)
-                               goto out;
-                       if (modname)
-                               strscpy(modname, mod->name, MODULE_NAME_LEN);
-                       if (name)
-                               strscpy(name, sym, KSYM_NAME_LEN);
-                       preempt_enable();
-                       return 0;
-               }
-       }
-out:
-       preempt_enable();
-       return -ERANGE;
-}
-
 int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
                       char *name, char *module_name, int *exported)
 {
index 4e2cf78..834de86 100644 (file)
@@ -820,10 +820,8 @@ static struct module_attribute modinfo_refcnt =
 void __module_get(struct module *module)
 {
        if (module) {
-               preempt_disable();
                atomic_inc(&module->refcnt);
                trace_module_get(module, _RET_IP_);
-               preempt_enable();
        }
 }
 EXPORT_SYMBOL(__module_get);
@@ -833,15 +831,12 @@ bool try_module_get(struct module *module)
        bool ret = true;
 
        if (module) {
-               preempt_disable();
                /* Note: here, we can fail to get a reference */
                if (likely(module_is_live(module) &&
                           atomic_inc_not_zero(&module->refcnt) != 0))
                        trace_module_get(module, _RET_IP_);
                else
                        ret = false;
-
-               preempt_enable();
        }
        return ret;
 }
@@ -852,11 +847,9 @@ void module_put(struct module *module)
        int ret;
 
        if (module) {
-               preempt_disable();
                ret = atomic_dec_if_positive(&module->refcnt);
                WARN_ON(ret < 0);       /* Failed to put refcount */
                trace_module_put(module, _RET_IP_);
-               preempt_enable();
        }
 }
 EXPORT_SYMBOL(module_put);
@@ -3057,26 +3050,83 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
        return load_module(&info, uargs, 0);
 }
 
-SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
+struct idempotent {
+       const void *cookie;
+       struct hlist_node entry;
+       struct completion complete;
+       int ret;
+};
+
+#define IDEM_HASH_BITS 8
+static struct hlist_head idem_hash[1 << IDEM_HASH_BITS];
+static DEFINE_SPINLOCK(idem_lock);
+
+static bool idempotent(struct idempotent *u, const void *cookie)
+{
+       int hash = hash_ptr(cookie, IDEM_HASH_BITS);
+       struct hlist_head *head = idem_hash + hash;
+       struct idempotent *existing;
+       bool first;
+
+       u->ret = 0;
+       u->cookie = cookie;
+       init_completion(&u->complete);
+
+       spin_lock(&idem_lock);
+       first = true;
+       hlist_for_each_entry(existing, head, entry) {
+               if (existing->cookie != cookie)
+                       continue;
+               first = false;
+               break;
+       }
+       hlist_add_head(&u->entry, idem_hash + hash);
+       spin_unlock(&idem_lock);
+
+       return !first;
+}
+
+/*
+ * We were the first one with 'cookie' on the list, and we ended
+ * up completing the operation. We now need to walk the list,
+ * remove everybody - which includes ourselves - fill in the return
+ * value, and then complete the operation.
+ */
+static void idempotent_complete(struct idempotent *u, int ret)
+{
+       const void *cookie = u->cookie;
+       int hash = hash_ptr(cookie, IDEM_HASH_BITS);
+       struct hlist_head *head = idem_hash + hash;
+       struct hlist_node *next;
+       struct idempotent *pos;
+
+       spin_lock(&idem_lock);
+       hlist_for_each_entry_safe(pos, next, head, entry) {
+               if (pos->cookie != cookie)
+                       continue;
+               hlist_del(&pos->entry);
+               pos->ret = ret;
+               complete(&pos->complete);
+       }
+       spin_unlock(&idem_lock);
+}
+
+static int init_module_from_file(struct file *f, const char __user * uargs, int flags)
 {
+       struct idempotent idem;
        struct load_info info = { };
        void *buf = NULL;
-       int len;
-       int err;
-
-       err = may_init_module();
-       if (err)
-               return err;
+       int len, ret;
 
-       pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags);
+       if (!f || !(f->f_mode & FMODE_READ))
+               return -EBADF;
 
-       if (flags & ~(MODULE_INIT_IGNORE_MODVERSIONS
-                     |MODULE_INIT_IGNORE_VERMAGIC
-                     |MODULE_INIT_COMPRESSED_FILE))
-               return -EINVAL;
+       if (idempotent(&idem, file_inode(f))) {
+               wait_for_completion(&idem.complete);
+               return idem.ret;
+       }
 
-       len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
-                                      READING_MODULE);
+       len = kernel_read_file(f, 0, &buf, INT_MAX, NULL, READING_MODULE);
        if (len < 0) {
                mod_stat_inc(&failed_kreads);
                mod_stat_add_long(len, &invalid_kread_bytes);
@@ -3084,7 +3134,7 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
        }
 
        if (flags & MODULE_INIT_COMPRESSED_FILE) {
-               err = module_decompress(&info, buf, len);
+               int err = module_decompress(&info, buf, len);
                vfree(buf); /* compressed data is no longer needed */
                if (err) {
                        mod_stat_inc(&failed_decompress);
@@ -3096,7 +3146,31 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
                info.len = len;
        }
 
-       return load_module(&info, uargs, flags);
+       ret = load_module(&info, uargs, flags);
+       idempotent_complete(&idem, ret);
+       return ret;
+}
+
+SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
+{
+       int err;
+       struct fd f;
+
+       err = may_init_module();
+       if (err)
+               return err;
+
+       pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags);
+
+       if (flags & ~(MODULE_INIT_IGNORE_MODVERSIONS
+                     |MODULE_INIT_IGNORE_VERMAGIC
+                     |MODULE_INIT_COMPRESSED_FILE))
+               return -EINVAL;
+
+       f = fdget(fd);
+       err = init_module_from_file(f.file, uargs, flags);
+       fdput(f);
+       return err;
 }
 
 /* Keep in sync with MODULE_FLAGS_BUF_SIZE !!! */
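
The idempotent()/idempotent_complete() pair above collapses concurrent finit_module() calls on the same file (the inode is the cookie): only the first caller reads and loads the module, every later caller sleeps on the completion and inherits the first caller's return value. The caller-side shape, restated in isolation as a sketch that would live next to those helpers in module/main.c (expensive_operation() is a placeholder for load_module()):

    static int expensive_operation(void)
    {
            return 0;       /* placeholder for the real work */
    }

    static int demo_do_once(const void *cookie)
    {
            struct idempotent idem;
            int ret;

            if (idempotent(&idem, cookie)) {
                    /* Someone else got here first: wait for their result. */
                    wait_for_completion(&idem.complete);
                    return idem.ret;
            }

            ret = expensive_operation();
            /* Wake every waiter queued on @cookie, ourselves included. */
            idempotent_complete(&idem, ret);
            return ret;
    }
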
index 886d2eb..10effe4 100644 (file)
@@ -684,6 +684,7 @@ void __warn(const char *file, int line, void *caller, unsigned taint,
        add_taint(taint, LOCKDEP_STILL_OK);
 }
 
+#ifdef CONFIG_BUG
 #ifndef __WARN_FLAGS
 void warn_slowpath_fmt(const char *file, int line, unsigned taint,
                       const char *fmt, ...)
@@ -722,8 +723,6 @@ void __warn_printk(const char *fmt, ...)
 EXPORT_SYMBOL(__warn_printk);
 #endif
 
-#ifdef CONFIG_BUG
-
 /* Support resetting WARN*_ONCE state */
 
 static int clear_warn_once_set(void *data, u64 val)
index 6a75489..07d01f6 100644 (file)
@@ -847,7 +847,7 @@ static void __init param_sysfs_builtin(void)
                        name_len = 0;
                } else {
                        name_len = dot - kp->name + 1;
-                       strlcpy(modname, kp->name, name_len);
+                       strscpy(modname, kp->name, name_len);
                }
                kernel_add_sysfs_param(modname, kp, name_len);
        }
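
The one-line param_sysfs_builtin() change is another strlcpy() to strscpy() conversion, like the ones in kallsyms.c further up. The practical difference: strscpy() never reads more than the destination size worth of the source, always NUL-terminates the destination (for a non-zero size), and returns the number of characters copied or -E2BIG on truncation, whereas strlcpy() returns strlen(src) and can over-read an unterminated source. A small checking pattern:

    #include <linux/string.h>
    #include <linux/errno.h>

    static int demo_copy_name(char *dst, size_t dst_size, const char *src)
    {
            ssize_t n = strscpy(dst, src, dst_size);

            if (n == -E2BIG)
                    return -E2BIG;  /* truncated, but dst is still NUL-terminated */

            return 0;               /* n characters copied, excluding the NUL */
    }
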
index d67a4d4..b26e027 100644 (file)
@@ -52,7 +52,6 @@ static inline void register_pid_ns_sysctl_table_vm(void)
 }
 #else
 static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns) {}
-static inline void set_memfd_noexec_scope(struct pid_namespace *ns) {}
 static inline void register_pid_ns_sysctl_table_vm(void) {}
 #endif
 
index 30d1274..f62e89d 100644 (file)
@@ -11,6 +11,7 @@
 
 #define pr_fmt(fmt) "PM: hibernation: " fmt
 
+#include <linux/blkdev.h>
 #include <linux/export.h>
 #include <linux/suspend.h>
 #include <linux/reboot.h>
@@ -64,7 +65,6 @@ enum {
 static int hibernation_mode = HIBERNATION_SHUTDOWN;
 
 bool freezer_test_done;
-bool snapshot_test;
 
 static const struct platform_hibernation_ops *hibernation_ops;
 
@@ -684,26 +684,22 @@ static void power_down(void)
                cpu_relax();
 }
 
-static int load_image_and_restore(void)
+static int load_image_and_restore(bool snapshot_test)
 {
        int error;
        unsigned int flags;
-       fmode_t mode = FMODE_READ;
-
-       if (snapshot_test)
-               mode |= FMODE_EXCL;
 
        pm_pr_dbg("Loading hibernation image.\n");
 
        lock_device_hotplug();
        error = create_basic_memory_bitmaps();
        if (error) {
-               swsusp_close(mode);
+               swsusp_close(snapshot_test);
                goto Unlock;
        }
 
        error = swsusp_read(&flags);
-       swsusp_close(mode);
+       swsusp_close(snapshot_test);
        if (!error)
                error = hibernation_restore(flags & SF_PLATFORM_MODE);
 
@@ -721,6 +717,7 @@ static int load_image_and_restore(void)
  */
 int hibernate(void)
 {
+       bool snapshot_test = false;
        unsigned int sleep_flags;
        int error;
 
@@ -748,9 +745,6 @@ int hibernate(void)
        if (error)
                goto Exit;
 
-       /* protected by system_transition_mutex */
-       snapshot_test = false;
-
        lock_device_hotplug();
        /* Allocate memory management structures */
        error = create_basic_memory_bitmaps();
@@ -792,9 +786,9 @@ int hibernate(void)
        unlock_device_hotplug();
        if (snapshot_test) {
                pm_pr_dbg("Checking hibernation image\n");
-               error = swsusp_check();
+               error = swsusp_check(snapshot_test);
                if (!error)
-                       error = load_image_and_restore();
+                       error = load_image_and_restore(snapshot_test);
        }
        thaw_processes();
 
@@ -910,52 +904,10 @@ unlock:
 }
 EXPORT_SYMBOL_GPL(hibernate_quiet_exec);
 
-/**
- * software_resume - Resume from a saved hibernation image.
- *
- * This routine is called as a late initcall, when all devices have been
- * discovered and initialized already.
- *
- * The image reading code is called to see if there is a hibernation image
- * available for reading.  If that is the case, devices are quiesced and the
- * contents of memory is restored from the saved image.
- *
- * If this is successful, control reappears in the restored target kernel in
- * hibernation_snapshot() which returns to hibernate().  Otherwise, the routine
- * attempts to recover gracefully and make the kernel return to the normal mode
- * of operation.
- */
-static int software_resume(void)
+static int __init find_resume_device(void)
 {
-       int error;
-
-       /*
-        * If the user said "noresume".. bail out early.
-        */
-       if (noresume || !hibernation_available())
-               return 0;
-
-       /*
-        * name_to_dev_t() below takes a sysfs buffer mutex when sysfs
-        * is configured into the kernel. Since the regular hibernate
-        * trigger path is via sysfs which takes a buffer mutex before
-        * calling hibernate functions (which take system_transition_mutex)
-        * this can cause lockdep to complain about a possible ABBA deadlock
-        * which cannot happen since we're in the boot code here and
-        * sysfs can't be invoked yet. Therefore, we use a subclass
-        * here to avoid lockdep complaining.
-        */
-       mutex_lock_nested(&system_transition_mutex, SINGLE_DEPTH_NESTING);
-
-       snapshot_test = false;
-
-       if (swsusp_resume_device)
-               goto Check_image;
-
-       if (!strlen(resume_file)) {
-               error = -ENOENT;
-               goto Unlock;
-       }
+       if (!strlen(resume_file))
+               return -ENOENT;
 
        pm_pr_dbg("Checking hibernation image partition %s\n", resume_file);
 
@@ -966,40 +918,41 @@ static int software_resume(void)
        }
 
        /* Check if the device is there */
-       swsusp_resume_device = name_to_dev_t(resume_file);
-       if (!swsusp_resume_device) {
-               /*
-                * Some device discovery might still be in progress; we need
-                * to wait for this to finish.
-                */
-               wait_for_device_probe();
-
-               if (resume_wait) {
-                       while ((swsusp_resume_device = name_to_dev_t(resume_file)) == 0)
-                               msleep(10);
-                       async_synchronize_full();
-               }
+       if (!early_lookup_bdev(resume_file, &swsusp_resume_device))
+               return 0;
 
-               swsusp_resume_device = name_to_dev_t(resume_file);
-               if (!swsusp_resume_device) {
-                       error = -ENODEV;
-                       goto Unlock;
-               }
+       /*
+        * Some device discovery might still be in progress; we need to wait for
+        * this to finish.
+        */
+       wait_for_device_probe();
+       if (resume_wait) {
+               while (early_lookup_bdev(resume_file, &swsusp_resume_device))
+                       msleep(10);
+               async_synchronize_full();
        }
 
- Check_image:
+       return early_lookup_bdev(resume_file, &swsusp_resume_device);
+}
+
+static int software_resume(void)
+{
+       int error;
+
        pm_pr_dbg("Hibernation image partition %d:%d present\n",
                MAJOR(swsusp_resume_device), MINOR(swsusp_resume_device));
 
        pm_pr_dbg("Looking for hibernation image.\n");
-       error = swsusp_check();
+
+       mutex_lock(&system_transition_mutex);
+       error = swsusp_check(false);
        if (error)
                goto Unlock;
 
        /* The snapshot device should not be opened while we're running */
        if (!hibernate_acquire()) {
                error = -EBUSY;
-               swsusp_close(FMODE_READ | FMODE_EXCL);
+               swsusp_close(false);
                goto Unlock;
        }
 
@@ -1020,7 +973,7 @@ static int software_resume(void)
                goto Close_Finish;
        }
 
-       error = load_image_and_restore();
+       error = load_image_and_restore(false);
        thaw_processes();
  Finish:
        pm_notifier_call_chain(PM_POST_RESTORE);
@@ -1034,11 +987,43 @@ static int software_resume(void)
        pm_pr_dbg("Hibernation image not present or could not be loaded.\n");
        return error;
  Close_Finish:
-       swsusp_close(FMODE_READ | FMODE_EXCL);
+       swsusp_close(false);
        goto Finish;
 }
 
-late_initcall_sync(software_resume);
+/**
+ * software_resume_initcall - Resume from a saved hibernation image.
+ *
+ * This routine is called as a late initcall, when all devices have been
+ * discovered and initialized already.
+ *
+ * The image reading code is called to see if there is a hibernation image
+ * available for reading.  If that is the case, devices are quiesced and the
+ * contents of memory is restored from the saved image.
+ *
+ * If this is successful, control reappears in the restored target kernel in
+ * hibernation_snapshot() which returns to hibernate().  Otherwise, the routine
+ * attempts to recover gracefully and make the kernel return to the normal mode
+ * of operation.
+ */
+static int __init software_resume_initcall(void)
+{
+       /*
+        * If the user said "noresume".. bail out early.
+        */
+       if (noresume || !hibernation_available())
+               return 0;
+
+       if (!swsusp_resume_device) {
+               int error = find_resume_device();
+
+               if (error)
+                       return error;
+       }
+
+       return software_resume();
+}
+late_initcall_sync(software_resume_initcall);
 
 
 static const char * const hibernation_modes[] = {
@@ -1177,7 +1162,11 @@ static ssize_t resume_store(struct kobject *kobj, struct kobj_attribute *attr,
        unsigned int sleep_flags;
        int len = n;
        char *name;
-       dev_t res;
+       dev_t dev;
+       int error;
+
+       if (!hibernation_available())
+               return 0;
 
        if (len && buf[len-1] == '\n')
                len--;
@@ -1185,13 +1174,29 @@ static ssize_t resume_store(struct kobject *kobj, struct kobj_attribute *attr,
        if (!name)
                return -ENOMEM;
 
-       res = name_to_dev_t(name);
+       error = lookup_bdev(name, &dev);
+       if (error) {
+               unsigned maj, min, offset;
+               char *p, dummy;
+
+               if (sscanf(name, "%u:%u%c", &maj, &min, &dummy) == 2 ||
+                   sscanf(name, "%u:%u:%u:%c", &maj, &min, &offset,
+                               &dummy) == 3) {
+                       dev = MKDEV(maj, min);
+                       if (maj != MAJOR(dev) || min != MINOR(dev))
+                               error = -EINVAL;
+               } else {
+                       dev = new_decode_dev(simple_strtoul(name, &p, 16));
+                       if (*p)
+                               error = -EINVAL;
+               }
+       }
        kfree(name);
-       if (!res)
-               return -EINVAL;
+       if (error)
+               return error;
 
        sleep_flags = lock_system_sleep();
-       swsusp_resume_device = res;
+       swsusp_resume_device = dev;
        unlock_system_sleep(sleep_flags);
 
        pm_pr_dbg("Configured hibernation resume from disk to %u\n",
index 3113ec2..f6425ae 100644 (file)
 #include "power.h"
 
 #ifdef CONFIG_PM_SLEEP
+/*
+ * The following functions are used by the suspend/hibernate code to temporarily
+ * change gfp_allowed_mask in order to avoid using I/O during memory allocations
+ * while devices are suspended.  To avoid races with the suspend/hibernate code,
+ * they should always be called with system_transition_mutex held
+ * (gfp_allowed_mask also should only be modified with system_transition_mutex
+ * held, unless the suspend/hibernate code is guaranteed not to run in parallel
+ * with that modification).
+ */
+static gfp_t saved_gfp_mask;
+
+void pm_restore_gfp_mask(void)
+{
+       WARN_ON(!mutex_is_locked(&system_transition_mutex));
+       if (saved_gfp_mask) {
+               gfp_allowed_mask = saved_gfp_mask;
+               saved_gfp_mask = 0;
+       }
+}
+
+void pm_restrict_gfp_mask(void)
+{
+       WARN_ON(!mutex_is_locked(&system_transition_mutex));
+       WARN_ON(saved_gfp_mask);
+       saved_gfp_mask = gfp_allowed_mask;
+       gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
+}
 
 unsigned int lock_system_sleep(void)
 {
@@ -556,6 +583,12 @@ power_attr_ro(pm_wakeup_irq);
 
 bool pm_debug_messages_on __read_mostly;
 
+bool pm_debug_messages_should_print(void)
+{
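+       /*
+        * Print debug messages only while a sleep state other than
+        * PM_SUSPEND_ON is being targeted.
+        */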
+       return pm_debug_messages_on && pm_suspend_target_state != PM_SUSPEND_ON;
+}
+EXPORT_SYMBOL_GPL(pm_debug_messages_should_print);
+
 static ssize_t pm_debug_messages_show(struct kobject *kobj,
                                      struct kobj_attribute *attr, char *buf)
 {
index b83c8d5..46eb14d 100644 (file)
@@ -26,9 +26,6 @@ extern void __init hibernate_image_size_init(void);
 /* Maximum size of architecture specific data in a hibernation header */
 #define MAX_ARCH_HEADER_SIZE   (sizeof(struct new_utsname) + 4)
 
-extern int arch_hibernation_header_save(void *addr, unsigned int max_size);
-extern int arch_hibernation_header_restore(void *addr);
-
 static inline int init_header_complete(struct swsusp_info *info)
 {
        return arch_hibernation_header_save(info, MAX_ARCH_HEADER_SIZE);
@@ -41,8 +38,6 @@ static inline const char *check_image_kernel(struct swsusp_info *info)
 }
 #endif /* CONFIG_ARCH_HIBERNATION_HEADER */
 
-extern int hibernate_resume_nonboot_cpu_disable(void);
-
 /*
  * Keep some memory free so that I/O operations can succeed without paging
  * [Might this be more than 4 MB?]
@@ -59,7 +54,6 @@ asmlinkage int swsusp_save(void);
 
 /* kernel/power/hibernate.c */
 extern bool freezer_test_done;
-extern bool snapshot_test;
 
 extern int hibernation_snapshot(int platform_mode);
 extern int hibernation_restore(int platform_mode);
@@ -174,11 +168,11 @@ extern int swsusp_swap_in_use(void);
 #define SF_HW_SIG              8
 
 /* kernel/power/hibernate.c */
-extern int swsusp_check(void);
+int swsusp_check(bool snapshot_test);
 extern void swsusp_free(void);
 extern int swsusp_read(unsigned int *flags_p);
 extern int swsusp_write(unsigned int flags);
-extern void swsusp_close(fmode_t);
+void swsusp_close(bool snapshot_test);
 #ifdef CONFIG_SUSPEND
 extern int swsusp_unmark(void);
 #endif
@@ -216,6 +210,11 @@ static inline void suspend_test_finish(const char *label) {}
 /* kernel/power/main.c */
 extern int pm_notifier_call_chain_robust(unsigned long val_up, unsigned long val_down);
 extern int pm_notifier_call_chain(unsigned long val);
+void pm_restrict_gfp_mask(void);
+void pm_restore_gfp_mask(void);
+#else
+static inline void pm_restrict_gfp_mask(void) {}
+static inline void pm_restore_gfp_mask(void) {}
 #endif
 
 #ifdef CONFIG_HIGHMEM
index cd8b7b3..0415d5e 100644 (file)
@@ -398,7 +398,7 @@ struct mem_zone_bm_rtree {
        unsigned int blocks;            /* Number of Bitmap Blocks     */
 };
 
-/* strcut bm_position is used for browsing memory bitmaps */
+/* struct bm_position is used for browsing memory bitmaps */
 
 struct bm_position {
        struct mem_zone_bm_rtree *zone;
@@ -1228,6 +1228,58 @@ unsigned int snapshot_additional_pages(struct zone *zone)
        return 2 * rtree;
 }
 
+/*
+ * Touch the watchdog for every WD_PAGE_COUNT pages.
+ */
+#define WD_PAGE_COUNT  (128*1024)
+
+static void mark_free_pages(struct zone *zone)
+{
+       unsigned long pfn, max_zone_pfn, page_count = WD_PAGE_COUNT;
+       unsigned long flags;
+       unsigned int order, t;
+       struct page *page;
+
+       if (zone_is_empty(zone))
+               return;
+
+       spin_lock_irqsave(&zone->lock, flags);
+
+       max_zone_pfn = zone_end_pfn(zone);
+       for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++)
+               if (pfn_valid(pfn)) {
+                       page = pfn_to_page(pfn);
+
+                       if (!--page_count) {
+                               touch_nmi_watchdog();
+                               page_count = WD_PAGE_COUNT;
+                       }
+
+                       if (page_zone(page) != zone)
+                               continue;
+
+                       if (!swsusp_page_is_forbidden(page))
+                               swsusp_unset_page_free(page);
+               }
+
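+       /* Mark pages sitting on the buddy free lists as free for the snapshot. */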
+       for_each_migratetype_order(order, t) {
+               list_for_each_entry(page,
+                               &zone->free_area[order].free_list[t], buddy_list) {
+                       unsigned long i;
+
+                       pfn = page_to_pfn(page);
+                       for (i = 0; i < (1UL << order); i++) {
+                               if (!--page_count) {
+                                       touch_nmi_watchdog();
+                                       page_count = WD_PAGE_COUNT;
+                               }
+                               swsusp_set_page_free(pfn_to_page(pfn + i));
+                       }
+               }
+       }
+       spin_unlock_irqrestore(&zone->lock, flags);
+}
+
 #ifdef CONFIG_HIGHMEM
 /**
  * count_free_highmem_pages - Compute the total number of free highmem pages.
index 92e41ed..f6ebcd0 100644 (file)
@@ -356,14 +356,14 @@ static int swsusp_swap_check(void)
                return res;
        root_swap = res;
 
-       hib_resume_bdev = blkdev_get_by_dev(swsusp_resume_device, FMODE_WRITE,
-                       NULL);
+       hib_resume_bdev = blkdev_get_by_dev(swsusp_resume_device,
+                       BLK_OPEN_WRITE, NULL, NULL);
        if (IS_ERR(hib_resume_bdev))
                return PTR_ERR(hib_resume_bdev);
 
        res = set_blocksize(hib_resume_bdev, PAGE_SIZE);
        if (res < 0)
-               blkdev_put(hib_resume_bdev, FMODE_WRITE);
+               blkdev_put(hib_resume_bdev, NULL);
 
        return res;
 }
@@ -443,7 +443,7 @@ static int get_swap_writer(struct swap_map_handle *handle)
 err_rel:
        release_swap_writer(handle);
 err_close:
-       swsusp_close(FMODE_WRITE);
+       swsusp_close(false);
        return ret;
 }
 
@@ -508,7 +508,7 @@ static int swap_writer_finish(struct swap_map_handle *handle,
        if (error)
                free_all_swap_pages(root_swap);
        release_swap_writer(handle);
-       swsusp_close(FMODE_WRITE);
+       swsusp_close(false);
 
        return error;
 }
@@ -1510,21 +1510,19 @@ end:
        return error;
 }
 
+static void *swsusp_holder;
+
 /**
  *      swsusp_check - Check for swsusp signature in the resume device
  */
 
-int swsusp_check(void)
+int swsusp_check(bool snapshot_test)
 {
+       void *holder = snapshot_test ? &swsusp_holder : NULL;
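+       /* A non-NULL holder claims the device exclusively, as FMODE_EXCL used to. */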
        int error;
-       void *holder;
-       fmode_t mode = FMODE_READ;
 
-       if (snapshot_test)
-               mode |= FMODE_EXCL;
-
-       hib_resume_bdev = blkdev_get_by_dev(swsusp_resume_device,
-                                           mode, &holder);
+       hib_resume_bdev = blkdev_get_by_dev(swsusp_resume_device, BLK_OPEN_READ,
+                                           holder, NULL);
        if (!IS_ERR(hib_resume_bdev)) {
                set_blocksize(hib_resume_bdev, PAGE_SIZE);
                clear_page(swsusp_header);
@@ -1551,7 +1549,7 @@ int swsusp_check(void)
 
 put:
                if (error)
-                       blkdev_put(hib_resume_bdev, mode);
+                       blkdev_put(hib_resume_bdev, holder);
                else
                        pr_debug("Image signature found, resuming\n");
        } else {
@@ -1568,14 +1566,14 @@ put:
  *     swsusp_close - close swap device.
  */
 
-void swsusp_close(fmode_t mode)
+void swsusp_close(bool snapshot_test)
 {
        if (IS_ERR(hib_resume_bdev)) {
                pr_debug("Image device not initialised\n");
                return;
        }
 
-       blkdev_put(hib_resume_bdev, mode);
+       blkdev_put(hib_resume_bdev, snapshot_test ? &swsusp_holder : NULL);
 }
 
 /**
index 6a333ad..357a4d1 100644 (file)
@@ -528,7 +528,7 @@ static u64 latched_seq_read_nolock(struct latched_seq *ls)
                seq = raw_read_seqcount_latch(&ls->latch);
                idx = seq & 0x1;
                val = ls->val[idx];
-       } while (read_seqcount_latch_retry(&ls->latch, seq));
+       } while (raw_read_seqcount_latch_retry(&ls->latch, seq));
 
        return val;
 }
index 9071182..bdd7ead 100644 (file)
@@ -314,4 +314,22 @@ config RCU_LAZY
          To save power, batch RCU callbacks and flush after delay, memory
          pressure, or callback list growing too big.
 
+config RCU_DOUBLE_CHECK_CB_TIME
+       bool "RCU callback-batch backup time check"
+       depends on RCU_EXPERT
+       default n
+       help
+         Use this option to provide more precise enforcement of the
+         rcutree.rcu_resched_ns module parameter in situations where
+         a single RCU callback might run for hundreds of microseconds,
+         thus defeating the 32-callback batching used to amortize the
+         cost of the fine-grained but expensive local_clock() function.
+
+         This option rounds rcutree.rcu_resched_ns up to the next
+         jiffy, and overrides the 32-callback batching if this limit
+         is exceeded.
+
+         Say Y here if you need tighter callback-limit enforcement.
+         Say N here if you are unsure.
+
 endmenu # "RCU Subsystem"
index 4a1b962..98c1544 100644 (file)
@@ -642,4 +642,10 @@ void show_rcu_tasks_trace_gp_kthread(void);
 static inline void show_rcu_tasks_trace_gp_kthread(void) {}
 #endif
 
+#ifdef CONFIG_TINY_RCU
+static inline bool rcu_cpu_beenfullyonline(int cpu) { return true; }
+#else
+bool rcu_cpu_beenfullyonline(int cpu);
+#endif
+
 #endif /* __LINUX_RCU_H */
index e82ec9f..d122173 100644 (file)
@@ -522,89 +522,6 @@ rcu_scale_print_module_parms(struct rcu_scale_ops *cur_ops, const char *tag)
                 scale_type, tag, nrealreaders, nrealwriters, verbose, shutdown);
 }
 
-static void
-rcu_scale_cleanup(void)
-{
-       int i;
-       int j;
-       int ngps = 0;
-       u64 *wdp;
-       u64 *wdpp;
-
-       /*
-        * Would like warning at start, but everything is expedited
-        * during the mid-boot phase, so have to wait till the end.
-        */
-       if (rcu_gp_is_expedited() && !rcu_gp_is_normal() && !gp_exp)
-               SCALEOUT_ERRSTRING("All grace periods expedited, no normal ones to measure!");
-       if (rcu_gp_is_normal() && gp_exp)
-               SCALEOUT_ERRSTRING("All grace periods normal, no expedited ones to measure!");
-       if (gp_exp && gp_async)
-               SCALEOUT_ERRSTRING("No expedited async GPs, so went with async!");
-
-       if (torture_cleanup_begin())
-               return;
-       if (!cur_ops) {
-               torture_cleanup_end();
-               return;
-       }
-
-       if (reader_tasks) {
-               for (i = 0; i < nrealreaders; i++)
-                       torture_stop_kthread(rcu_scale_reader,
-                                            reader_tasks[i]);
-               kfree(reader_tasks);
-       }
-
-       if (writer_tasks) {
-               for (i = 0; i < nrealwriters; i++) {
-                       torture_stop_kthread(rcu_scale_writer,
-                                            writer_tasks[i]);
-                       if (!writer_n_durations)
-                               continue;
-                       j = writer_n_durations[i];
-                       pr_alert("%s%s writer %d gps: %d\n",
-                                scale_type, SCALE_FLAG, i, j);
-                       ngps += j;
-               }
-               pr_alert("%s%s start: %llu end: %llu duration: %llu gps: %d batches: %ld\n",
-                        scale_type, SCALE_FLAG,
-                        t_rcu_scale_writer_started, t_rcu_scale_writer_finished,
-                        t_rcu_scale_writer_finished -
-                        t_rcu_scale_writer_started,
-                        ngps,
-                        rcuscale_seq_diff(b_rcu_gp_test_finished,
-                                          b_rcu_gp_test_started));
-               for (i = 0; i < nrealwriters; i++) {
-                       if (!writer_durations)
-                               break;
-                       if (!writer_n_durations)
-                               continue;
-                       wdpp = writer_durations[i];
-                       if (!wdpp)
-                               continue;
-                       for (j = 0; j < writer_n_durations[i]; j++) {
-                               wdp = &wdpp[j];
-                               pr_alert("%s%s %4d writer-duration: %5d %llu\n",
-                                       scale_type, SCALE_FLAG,
-                                       i, j, *wdp);
-                               if (j % 100 == 0)
-                                       schedule_timeout_uninterruptible(1);
-                       }
-                       kfree(writer_durations[i]);
-               }
-               kfree(writer_tasks);
-               kfree(writer_durations);
-               kfree(writer_n_durations);
-       }
-
-       /* Do torture-type-specific cleanup operations.  */
-       if (cur_ops->cleanup != NULL)
-               cur_ops->cleanup();
-
-       torture_cleanup_end();
-}
-
 /*
  * Return the number if non-negative.  If -1, the number of CPUs.
  * If less than -1, that much less than the number of CPUs, but
@@ -625,20 +542,6 @@ static int compute_real(int n)
 }
 
 /*
- * RCU scalability shutdown kthread.  Just waits to be awakened, then shuts
- * down system.
- */
-static int
-rcu_scale_shutdown(void *arg)
-{
-       wait_event_idle(shutdown_wq, atomic_read(&n_rcu_scale_writer_finished) >= nrealwriters);
-       smp_mb(); /* Wake before output. */
-       rcu_scale_cleanup();
-       kernel_power_off();
-       return -EINVAL;
-}
-
-/*
  * kfree_rcu() scalability tests: Start a kfree_rcu() loop on all CPUs for number
  * of iterations and measure total time and number of GP for all iterations to complete.
  */
@@ -874,6 +777,108 @@ unwind:
        return firsterr;
 }
 
+static void
+rcu_scale_cleanup(void)
+{
+       int i;
+       int j;
+       int ngps = 0;
+       u64 *wdp;
+       u64 *wdpp;
+
+       /*
+        * Would like warning at start, but everything is expedited
+        * during the mid-boot phase, so have to wait till the end.
+        */
+       if (rcu_gp_is_expedited() && !rcu_gp_is_normal() && !gp_exp)
+               SCALEOUT_ERRSTRING("All grace periods expedited, no normal ones to measure!");
+       if (rcu_gp_is_normal() && gp_exp)
+               SCALEOUT_ERRSTRING("All grace periods normal, no expedited ones to measure!");
+       if (gp_exp && gp_async)
+               SCALEOUT_ERRSTRING("No expedited async GPs, so went with async!");
+
+       if (kfree_rcu_test) {
+               kfree_scale_cleanup();
+               return;
+       }
+
+       if (torture_cleanup_begin())
+               return;
+       if (!cur_ops) {
+               torture_cleanup_end();
+               return;
+       }
+
+       if (reader_tasks) {
+               for (i = 0; i < nrealreaders; i++)
+                       torture_stop_kthread(rcu_scale_reader,
+                                            reader_tasks[i]);
+               kfree(reader_tasks);
+       }
+
+       if (writer_tasks) {
+               for (i = 0; i < nrealwriters; i++) {
+                       torture_stop_kthread(rcu_scale_writer,
+                                            writer_tasks[i]);
+                       if (!writer_n_durations)
+                               continue;
+                       j = writer_n_durations[i];
+                       pr_alert("%s%s writer %d gps: %d\n",
+                                scale_type, SCALE_FLAG, i, j);
+                       ngps += j;
+               }
+               pr_alert("%s%s start: %llu end: %llu duration: %llu gps: %d batches: %ld\n",
+                        scale_type, SCALE_FLAG,
+                        t_rcu_scale_writer_started, t_rcu_scale_writer_finished,
+                        t_rcu_scale_writer_finished -
+                        t_rcu_scale_writer_started,
+                        ngps,
+                        rcuscale_seq_diff(b_rcu_gp_test_finished,
+                                          b_rcu_gp_test_started));
+               for (i = 0; i < nrealwriters; i++) {
+                       if (!writer_durations)
+                               break;
+                       if (!writer_n_durations)
+                               continue;
+                       wdpp = writer_durations[i];
+                       if (!wdpp)
+                               continue;
+                       for (j = 0; j < writer_n_durations[i]; j++) {
+                               wdp = &wdpp[j];
+                               pr_alert("%s%s %4d writer-duration: %5d %llu\n",
+                                       scale_type, SCALE_FLAG,
+                                       i, j, *wdp);
+                               if (j % 100 == 0)
+                                       schedule_timeout_uninterruptible(1);
+                       }
+                       kfree(writer_durations[i]);
+               }
+               kfree(writer_tasks);
+               kfree(writer_durations);
+               kfree(writer_n_durations);
+       }
+
+       /* Do torture-type-specific cleanup operations.  */
+       if (cur_ops->cleanup != NULL)
+               cur_ops->cleanup();
+
+       torture_cleanup_end();
+}
+
+/*
+ * RCU scalability shutdown kthread.  Just waits to be awakened, then shuts
+ * down the system.
+ */
+static int
+rcu_scale_shutdown(void *arg)
+{
+       wait_event_idle(shutdown_wq, atomic_read(&n_rcu_scale_writer_finished) >= nrealwriters);
+       smp_mb(); /* Wake before output. */
+       rcu_scale_cleanup();
+       kernel_power_off();
+       return -EINVAL;
+}
+
 static int __init
 rcu_scale_init(void)
 {
index 5f4fc81..b770add 100644 (file)
@@ -241,7 +241,6 @@ static void cblist_init_generic(struct rcu_tasks *rtp)
        if (rcu_task_enqueue_lim < 0) {
                rcu_task_enqueue_lim = 1;
                rcu_task_cb_adjust = true;
-               pr_info("%s: Setting adjustable number of callback queues.\n", __func__);
        } else if (rcu_task_enqueue_lim == 0) {
                rcu_task_enqueue_lim = 1;
        }
@@ -272,7 +271,9 @@ static void cblist_init_generic(struct rcu_tasks *rtp)
                raw_spin_unlock_rcu_node(rtpcp); // irqs remain disabled.
        }
        raw_spin_unlock_irqrestore(&rtp->cbs_gbl_lock, flags);
-       pr_info("%s: Setting shift to %d and lim to %d.\n", __func__, data_race(rtp->percpu_enqueue_shift), data_race(rtp->percpu_enqueue_lim));
+
+       pr_info("%s: Setting shift to %d and lim to %d rcu_task_cb_adjust=%d.\n", rtp->name,
+                       data_race(rtp->percpu_enqueue_shift), data_race(rtp->percpu_enqueue_lim), rcu_task_cb_adjust);
 }
 
 // IRQ-work handler that does deferred wakeup for call_rcu_tasks_generic().
@@ -463,6 +464,7 @@ static void rcu_tasks_invoke_cbs(struct rcu_tasks *rtp, struct rcu_tasks_percpu
 {
        int cpu;
        int cpunext;
+       int cpuwq;
        unsigned long flags;
        int len;
        struct rcu_head *rhp;
@@ -473,11 +475,13 @@ static void rcu_tasks_invoke_cbs(struct rcu_tasks *rtp, struct rcu_tasks_percpu
        cpunext = cpu * 2 + 1;
        if (cpunext < smp_load_acquire(&rtp->percpu_dequeue_lim)) {
                rtpcp_next = per_cpu_ptr(rtp->rtpcpu, cpunext);
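+               /* Avoid queueing work on a CPU that has never been fully online. */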
-               queue_work_on(cpunext, system_wq, &rtpcp_next->rtp_work);
+               cpuwq = rcu_cpu_beenfullyonline(cpunext) ? cpunext : WORK_CPU_UNBOUND;
+               queue_work_on(cpuwq, system_wq, &rtpcp_next->rtp_work);
                cpunext++;
                if (cpunext < smp_load_acquire(&rtp->percpu_dequeue_lim)) {
                        rtpcp_next = per_cpu_ptr(rtp->rtpcpu, cpunext);
-                       queue_work_on(cpunext, system_wq, &rtpcp_next->rtp_work);
+                       cpuwq = rcu_cpu_beenfullyonline(cpunext) ? cpunext : WORK_CPU_UNBOUND;
+                       queue_work_on(cpuwq, system_wq, &rtpcp_next->rtp_work);
                }
        }
 
index f52ff72..1449cb6 100644 (file)
@@ -2046,19 +2046,35 @@ rcu_check_quiescent_state(struct rcu_data *rdp)
        rcu_report_qs_rdp(rdp);
 }
 
+/* Return true if callback-invocation time limit exceeded. */
+static bool rcu_do_batch_check_time(long count, long tlimit,
+                                   bool jlimit_check, unsigned long jlimit)
+{
+       // Invoke local_clock() only once per 32 consecutive callbacks.
+       return unlikely(tlimit) &&
+              (!likely(count & 31) ||
+               (IS_ENABLED(CONFIG_RCU_DOUBLE_CHECK_CB_TIME) &&
+                jlimit_check && time_after(jiffies, jlimit))) &&
+              local_clock() >= tlimit;
+}
+
 /*
  * Invoke any RCU callbacks that have made it to the end of their grace
  * period.  Throttle as specified by rdp->blimit.
  */
 static void rcu_do_batch(struct rcu_data *rdp)
 {
+       long bl;
+       long count = 0;
        int div;
        bool __maybe_unused empty;
        unsigned long flags;
-       struct rcu_head *rhp;
+       unsigned long jlimit;
+       bool jlimit_check = false;
+       long pending;
        struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
-       long bl, count = 0;
-       long pending, tlimit = 0;
+       struct rcu_head *rhp;
+       long tlimit = 0;
 
        /* If no callbacks are ready, just return. */
        if (!rcu_segcblist_ready_cbs(&rdp->cblist)) {
@@ -2082,11 +2098,15 @@ static void rcu_do_batch(struct rcu_data *rdp)
        div = READ_ONCE(rcu_divisor);
        div = div < 0 ? 7 : div > sizeof(long) * 8 - 2 ? sizeof(long) * 8 - 2 : div;
        bl = max(rdp->blimit, pending >> div);
-       if (in_serving_softirq() && unlikely(bl > 100)) {
+       if ((in_serving_softirq() || rdp->rcu_cpu_kthread_status == RCU_KTHREAD_RUNNING) &&
+           (IS_ENABLED(CONFIG_RCU_DOUBLE_CHECK_CB_TIME) || unlikely(bl > 100))) {
+               const long npj = NSEC_PER_SEC / HZ;
                long rrn = READ_ONCE(rcu_resched_ns);
 
                rrn = rrn < NSEC_PER_MSEC ? NSEC_PER_MSEC : rrn > NSEC_PER_SEC ? NSEC_PER_SEC : rrn;
                tlimit = local_clock() + rrn;
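+               /* Backup limit in jiffies: the time limit rounded up to the next jiffy. */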
+               jlimit = jiffies + (rrn + npj + 1) / npj;
+               jlimit_check = true;
        }
        trace_rcu_batch_start(rcu_state.name,
                              rcu_segcblist_n_cbs(&rdp->cblist), bl);
@@ -2126,21 +2146,23 @@ static void rcu_do_batch(struct rcu_data *rdp)
                         * Make sure we don't spend too much time here and deprive other
                         * softirq vectors of CPU cycles.
                         */
-                       if (unlikely(tlimit)) {
-                               /* only call local_clock() every 32 callbacks */
-                               if (likely((count & 31) || local_clock() < tlimit))
-                                       continue;
-                               /* Exceeded the time limit, so leave. */
+                       if (rcu_do_batch_check_time(count, tlimit, jlimit_check, jlimit))
                                break;
-                       }
                } else {
-                       // In rcuoc context, so no worries about depriving
-                       // other softirq vectors of CPU cycles.
+                       // In rcuc/rcuoc context, so no worries about
+                       // depriving other softirq vectors of CPU cycles.
                        local_bh_enable();
                        lockdep_assert_irqs_enabled();
                        cond_resched_tasks_rcu_qs();
                        lockdep_assert_irqs_enabled();
                        local_bh_disable();
+                       // But rcuc kthreads can delay quiescent-state
+                       // reporting, so check time limits for them.
+                       if (rdp->rcu_cpu_kthread_status == RCU_KTHREAD_RUNNING &&
+                           rcu_do_batch_check_time(count, tlimit, jlimit_check, jlimit)) {
+                               rdp->rcu_cpu_has_work = 1;
+                               break;
+                       }
                }
        }
 
@@ -2459,12 +2481,12 @@ static void rcu_cpu_kthread(unsigned int cpu)
                *statusp = RCU_KTHREAD_RUNNING;
                local_irq_disable();
                work = *workp;
-               *workp = 0;
+               WRITE_ONCE(*workp, 0);
                local_irq_enable();
                if (work)
                        rcu_core();
                local_bh_enable();
-               if (*workp == 0) {
+               if (!READ_ONCE(*workp)) {
                        trace_rcu_utilization(TPS("End CPU kthread@rcu_wait"));
                        *statusp = RCU_KTHREAD_WAITING;
                        return;
@@ -2756,7 +2778,7 @@ EXPORT_SYMBOL_GPL(call_rcu);
  */
 struct kvfree_rcu_bulk_data {
        struct list_head list;
-       unsigned long gp_snap;
+       struct rcu_gp_oldstate gp_snap;
        unsigned long nr_records;
        void *records[];
 };
@@ -2773,6 +2795,7 @@ struct kvfree_rcu_bulk_data {
  * struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests
  * @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period
  * @head_free: List of kfree_rcu() objects waiting for a grace period
+ * @head_free_gp_snap: Grace-period snapshot to check for attempted premature frees.
  * @bulk_head_free: Bulk-List of kvfree_rcu() objects waiting for a grace period
  * @krcp: Pointer to @kfree_rcu_cpu structure
  */
@@ -2780,6 +2803,7 @@ struct kvfree_rcu_bulk_data {
 struct kfree_rcu_cpu_work {
        struct rcu_work rcu_work;
        struct rcu_head *head_free;
+       struct rcu_gp_oldstate head_free_gp_snap;
        struct list_head bulk_head_free[FREE_N_CHANNELS];
        struct kfree_rcu_cpu *krcp;
 };
@@ -2900,6 +2924,9 @@ drain_page_cache(struct kfree_rcu_cpu *krcp)
        struct llist_node *page_list, *pos, *n;
        int freed = 0;
 
+       if (!rcu_min_cached_objs)
+               return 0;
+
        raw_spin_lock_irqsave(&krcp->lock, flags);
        page_list = llist_del_all(&krcp->bkvcache);
        WRITE_ONCE(krcp->nr_bkv_objs, 0);
@@ -2920,24 +2947,25 @@ kvfree_rcu_bulk(struct kfree_rcu_cpu *krcp,
        unsigned long flags;
        int i;
 
-       debug_rcu_bhead_unqueue(bnode);
-
-       rcu_lock_acquire(&rcu_callback_map);
-       if (idx == 0) { // kmalloc() / kfree().
-               trace_rcu_invoke_kfree_bulk_callback(
-                       rcu_state.name, bnode->nr_records,
-                       bnode->records);
-
-               kfree_bulk(bnode->nr_records, bnode->records);
-       } else { // vmalloc() / vfree().
-               for (i = 0; i < bnode->nr_records; i++) {
-                       trace_rcu_invoke_kvfree_callback(
-                               rcu_state.name, bnode->records[i], 0);
-
-                       vfree(bnode->records[i]);
+       if (!WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&bnode->gp_snap))) {
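+               /*
+                * Free the objects only if a full grace period has elapsed since
+                * they were queued; otherwise warn and leave them unfreed.
+                */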
+               debug_rcu_bhead_unqueue(bnode);
+               rcu_lock_acquire(&rcu_callback_map);
+               if (idx == 0) { // kmalloc() / kfree().
+                       trace_rcu_invoke_kfree_bulk_callback(
+                               rcu_state.name, bnode->nr_records,
+                               bnode->records);
+
+                       kfree_bulk(bnode->nr_records, bnode->records);
+               } else { // vmalloc() / vfree().
+                       for (i = 0; i < bnode->nr_records; i++) {
+                               trace_rcu_invoke_kvfree_callback(
+                                       rcu_state.name, bnode->records[i], 0);
+
+                               vfree(bnode->records[i]);
+                       }
                }
+               rcu_lock_release(&rcu_callback_map);
        }
-       rcu_lock_release(&rcu_callback_map);
 
        raw_spin_lock_irqsave(&krcp->lock, flags);
        if (put_cached_bnode(krcp, bnode))
@@ -2984,6 +3012,7 @@ static void kfree_rcu_work(struct work_struct *work)
        struct rcu_head *head;
        struct kfree_rcu_cpu *krcp;
        struct kfree_rcu_cpu_work *krwp;
+       struct rcu_gp_oldstate head_gp_snap;
        int i;
 
        krwp = container_of(to_rcu_work(work),
@@ -2998,6 +3027,7 @@ static void kfree_rcu_work(struct work_struct *work)
        // Channel 3.
        head = krwp->head_free;
        krwp->head_free = NULL;
+       head_gp_snap = krwp->head_free_gp_snap;
        raw_spin_unlock_irqrestore(&krcp->lock, flags);
 
        // Handle the first two channels.
@@ -3014,7 +3044,8 @@ static void kfree_rcu_work(struct work_struct *work)
         * queued on a linked list through their rcu_head structures.
         * This list is named "Channel 3".
         */
-       kvfree_rcu_list(head);
+       if (head && !WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&head_gp_snap)))
+               kvfree_rcu_list(head);
 }
 
 static bool
@@ -3081,7 +3112,7 @@ kvfree_rcu_drain_ready(struct kfree_rcu_cpu *krcp)
                INIT_LIST_HEAD(&bulk_ready[i]);
 
                list_for_each_entry_safe_reverse(bnode, n, &krcp->bulk_head[i], list) {
-                       if (!poll_state_synchronize_rcu(bnode->gp_snap))
+                       if (!poll_state_synchronize_rcu_full(&bnode->gp_snap))
                                break;
 
                        atomic_sub(bnode->nr_records, &krcp->bulk_count[i]);
@@ -3146,6 +3177,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
                        // objects queued on the linked list.
                        if (!krwp->head_free) {
                                krwp->head_free = krcp->head;
+                               get_state_synchronize_rcu_full(&krwp->head_free_gp_snap);
                                atomic_set(&krcp->head_count, 0);
                                WRITE_ONCE(krcp->head, NULL);
                        }
@@ -3194,7 +3226,7 @@ static void fill_page_cache_func(struct work_struct *work)
        nr_pages = atomic_read(&krcp->backoff_page_cache_fill) ?
                1 : rcu_min_cached_objs;
 
-       for (i = 0; i < nr_pages; i++) {
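+       /* Allocate only the pages still missing from the per-CPU cache. */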
+       for (i = READ_ONCE(krcp->nr_bkv_objs); i < nr_pages; i++) {
                bnode = (struct kvfree_rcu_bulk_data *)
                        __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
 
@@ -3218,6 +3250,10 @@ static void fill_page_cache_func(struct work_struct *work)
 static void
 run_page_cache_worker(struct kfree_rcu_cpu *krcp)
 {
+       // If the cache is disabled, bail out.
+       if (!rcu_min_cached_objs)
+               return;
+
        if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
                        !atomic_xchg(&krcp->work_in_progress, 1)) {
                if (atomic_read(&krcp->backoff_page_cache_fill)) {
@@ -3272,7 +3308,7 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
                        // scenarios.
                        bnode = (struct kvfree_rcu_bulk_data *)
                                __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
-                       *krcp = krc_this_cpu_lock(flags);
+                       raw_spin_lock_irqsave(&(*krcp)->lock, *flags);
                }
 
                if (!bnode)
@@ -3285,7 +3321,7 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
 
        // Finally insert and update the GP for this page.
        bnode->records[bnode->nr_records++] = ptr;
-       bnode->gp_snap = get_state_synchronize_rcu();
+       get_state_synchronize_rcu_full(&bnode->gp_snap);
        atomic_inc(&(*krcp)->bulk_count[idx]);
 
        return true;
@@ -4283,7 +4319,6 @@ int rcutree_prepare_cpu(unsigned int cpu)
         */
        rnp = rdp->mynode;
        raw_spin_lock_rcu_node(rnp);            /* irqs already disabled. */
-       rdp->beenonline = true;  /* We have now been online. */
        rdp->gp_seq = READ_ONCE(rnp->gp_seq);
        rdp->gp_seq_needed = rdp->gp_seq;
        rdp->cpu_no_qs.b.norm = true;
@@ -4311,6 +4346,16 @@ static void rcutree_affinity_setting(unsigned int cpu, int outgoing)
 }
 
 /*
+ * Has the specified (known valid) CPU ever been fully online?
+ */
+bool rcu_cpu_beenfullyonline(int cpu)
+{
+       struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+
+       return smp_load_acquire(&rdp->beenonline);
+}
+
+/*
  * Near the end of the CPU-online process.  Pretty much all services
  * enabled, and the CPU is now very much alive.
  */
@@ -4368,15 +4413,16 @@ int rcutree_offline_cpu(unsigned int cpu)
  * Note that this function is special in that it is invoked directly
  * from the incoming CPU rather than from the cpuhp_step mechanism.
  * This is because this function must be invoked at a precise location.
+ * This incoming CPU must not have enabled interrupts yet.
  */
 void rcu_cpu_starting(unsigned int cpu)
 {
-       unsigned long flags;
        unsigned long mask;
        struct rcu_data *rdp;
        struct rcu_node *rnp;
        bool newcpu;
 
+       lockdep_assert_irqs_disabled();
        rdp = per_cpu_ptr(&rcu_data, cpu);
        if (rdp->cpu_started)
                return;
@@ -4384,7 +4430,6 @@ void rcu_cpu_starting(unsigned int cpu)
 
        rnp = rdp->mynode;
        mask = rdp->grpmask;
-       local_irq_save(flags);
        arch_spin_lock(&rcu_state.ofl_lock);
        rcu_dynticks_eqs_online();
        raw_spin_lock(&rcu_state.barrier_lock);
@@ -4403,17 +4448,17 @@ void rcu_cpu_starting(unsigned int cpu)
        /* An incoming CPU should never be blocking a grace period. */
        if (WARN_ON_ONCE(rnp->qsmask & mask)) { /* RCU waiting on incoming CPU? */
                /* rcu_report_qs_rnp() *really* wants some flags to restore */
-               unsigned long flags2;
+               unsigned long flags;
 
-               local_irq_save(flags2);
+               local_irq_save(flags);
                rcu_disable_urgency_upon_qs(rdp);
                /* Report QS -after- changing ->qsmaskinitnext! */
-               rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags2);
+               rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
        } else {
                raw_spin_unlock_rcu_node(rnp);
        }
        arch_spin_unlock(&rcu_state.ofl_lock);
-       local_irq_restore(flags);
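+       /* Pairs with the acquire load in rcu_cpu_beenfullyonline(). */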
+       smp_store_release(&rdp->beenonline, true);
        smp_mb(); /* Ensure RCU read-side usage follows above initialization. */
 }
 
index 3b7abb5..8239b39 100644 (file)
@@ -643,7 +643,7 @@ static void synchronize_rcu_expedited_wait(void)
                                        "O."[!!cpu_online(cpu)],
                                        "o."[!!(rdp->grpmask & rnp->expmaskinit)],
                                        "N."[!!(rdp->grpmask & rnp->expmaskinitnext)],
-                                       "D."[!!(rdp->cpu_no_qs.b.exp)]);
+                                       "D."[!!data_race(rdp->cpu_no_qs.b.exp)]);
                        }
                }
                pr_cont(" } %lu jiffies s: %lu root: %#lx/%c\n",
index f228061..43229d2 100644 (file)
@@ -1319,13 +1319,22 @@ lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
        int cpu;
        unsigned long count = 0;
 
+       if (WARN_ON_ONCE(!cpumask_available(rcu_nocb_mask)))
+               return 0;
+
+       /*  Protect rcu_nocb_mask against concurrent (de-)offloading. */
+       if (!mutex_trylock(&rcu_state.barrier_mutex))
+               return 0;
+
        /* Snapshot count of all CPUs */
-       for_each_possible_cpu(cpu) {
+       for_each_cpu(cpu, rcu_nocb_mask) {
                struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
 
                count +=  READ_ONCE(rdp->lazy_len);
        }
 
+       mutex_unlock(&rcu_state.barrier_mutex);
+
        return count ? count : SHRINK_EMPTY;
 }
 
@@ -1336,15 +1345,45 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
        unsigned long flags;
        unsigned long count = 0;
 
+       if (WARN_ON_ONCE(!cpumask_available(rcu_nocb_mask)))
+               return 0;
+       /*
+        * Protect against concurrent (de-)offloading. Otherwise nocb locking
+        * may be ignored or imbalanced.
+        */
+       if (!mutex_trylock(&rcu_state.barrier_mutex)) {
+               /*
+                * But really don't insist if barrier_mutex is contended since we
+                * can't guarantee that it will never engage in a dependency
+                * chain involving memory allocation. The lock is seldom contended
+                * anyway.
+                */
+               return 0;
+       }
+
        /* Snapshot count of all CPUs */
-       for_each_possible_cpu(cpu) {
+       for_each_cpu(cpu, rcu_nocb_mask) {
                struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
-               int _count = READ_ONCE(rdp->lazy_len);
+               int _count;
+
+               if (WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp)))
+                       continue;
 
-               if (_count == 0)
+               if (!READ_ONCE(rdp->lazy_len))
                        continue;
+
                rcu_nocb_lock_irqsave(rdp, flags);
-               WRITE_ONCE(rdp->lazy_len, 0);
+               /*
+                * Recheck under the nocb lock. Since we are not holding the bypass
+                * lock, we may still race with increments from the enqueuer, but we
+                * know for sure whether there is at least one lazy callback.
+                */
+               _count = READ_ONCE(rdp->lazy_len);
+               if (!_count) {
+                       rcu_nocb_unlock_irqrestore(rdp, flags);
+                       continue;
+               }
+               WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false));
                rcu_nocb_unlock_irqrestore(rdp, flags);
                wake_nocb_gp(rdp, false);
                sc->nr_to_scan -= _count;
@@ -1352,6 +1391,9 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
                if (sc->nr_to_scan <= 0)
                        break;
        }
+
+       mutex_unlock(&rcu_state.barrier_mutex);
+
        return count ? count : SHRINK_STOP;
 }
 
index 7b0fe74..4102108 100644 (file)
@@ -257,6 +257,8 @@ static void rcu_preempt_ctxt_queue(struct rcu_node *rnp, struct rcu_data *rdp)
         * GP should not be able to end until we report, so there should be
         * no need to check for a subsequent expedited GP.  (Though we are
         * still in a quiescent state in any case.)
+        *
+        * Interrupts are disabled, so ->cpu_no_qs.b.exp cannot change.
         */
        if (blkd_state & RCU_EXP_BLKD && rdp->cpu_no_qs.b.exp)
                rcu_report_exp_rdp(rdp);
@@ -941,7 +943,7 @@ notrace void rcu_preempt_deferred_qs(struct task_struct *t)
 {
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 
-       if (rdp->cpu_no_qs.b.exp)
+       if (READ_ONCE(rdp->cpu_no_qs.b.exp))
                rcu_report_exp_rdp(rdp);
 }
 
index b5cc2b5..3c6193d 100644 (file)
@@ -266,7 +266,7 @@ static __always_inline u64 sched_clock_local(struct sched_clock_data *scd)
        s64 delta;
 
 again:
-       now = sched_clock();
+       now = sched_clock_noinstr();
        delta = now - scd->tick_raw;
        if (unlikely(delta < 0))
                delta = 0;
@@ -287,28 +287,35 @@ again:
        clock = wrap_max(clock, min_clock);
        clock = wrap_min(clock, max_clock);
 
-       if (!arch_try_cmpxchg64(&scd->clock, &old_clock, clock))
+       if (!raw_try_cmpxchg64(&scd->clock, &old_clock, clock))
                goto again;
 
        return clock;
 }
 
-noinstr u64 local_clock(void)
+noinstr u64 local_clock_noinstr(void)
 {
        u64 clock;
 
        if (static_branch_likely(&__sched_clock_stable))
-               return sched_clock() + __sched_clock_offset;
+               return sched_clock_noinstr() + __sched_clock_offset;
 
        if (!static_branch_likely(&sched_clock_running))
-               return sched_clock();
+               return sched_clock_noinstr();
 
-       preempt_disable_notrace();
        clock = sched_clock_local(this_scd());
-       preempt_enable_notrace();
 
        return clock;
 }
+
+u64 local_clock(void)
+{
+       u64 now;
+       preempt_disable_notrace();
+       now = local_clock_noinstr();
+       preempt_enable_notrace();
+       return now;
+}
 EXPORT_SYMBOL_GPL(local_clock);
 
 static notrace u64 sched_clock_remote(struct sched_clock_data *scd)
index a68d127..c52c2eb 100644 (file)
@@ -2213,6 +2213,154 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
                rq_clock_skip_update(rq);
 }
 
+static __always_inline
+int __task_state_match(struct task_struct *p, unsigned int state)
+{
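+       /*
+        * Return 1 if p->__state matches @state, -1 if only p->saved_state
+        * matches (PREEMPT_RT), and 0 if neither matches.
+        */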
+       if (READ_ONCE(p->__state) & state)
+               return 1;
+
+#ifdef CONFIG_PREEMPT_RT
+       if (READ_ONCE(p->saved_state) & state)
+               return -1;
+#endif
+       return 0;
+}
+
+static __always_inline
+int task_state_match(struct task_struct *p, unsigned int state)
+{
+#ifdef CONFIG_PREEMPT_RT
+       int match;
+
+       /*
+        * Serialize against current_save_and_set_rtlock_wait_state() and
+        * current_restore_rtlock_saved_state().
+        */
+       raw_spin_lock_irq(&p->pi_lock);
+       match = __task_state_match(p, state);
+       raw_spin_unlock_irq(&p->pi_lock);
+
+       return match;
+#else
+       return __task_state_match(p, state);
+#endif
+}
+
+/*
+ * wait_task_inactive - wait for a thread to unschedule.
+ *
+ * Wait for the thread to block in any of the states set in @match_state.
+ * If it changes, i.e. @p might have woken up, then return zero.  When we
+ * succeed in waiting for @p to be off its CPU, we return a positive number
+ * (its total switch count).  If a second call a short while later returns the
+ * same number, the caller can be sure that @p has remained unscheduled the
+ * whole time.
+ *
+ * The caller must ensure that the task *will* unschedule sometime soon,
+ * else this function might spin for a *long* time. This function can't
+ * be called with interrupts off, or it may introduce deadlock with
+ * smp_call_function() if an IPI is sent by the same process we are
+ * waiting to become inactive.
+ */
+unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
+{
+       int running, queued, match;
+       struct rq_flags rf;
+       unsigned long ncsw;
+       struct rq *rq;
+
+       for (;;) {
+               /*
+                * We do the initial early heuristics without holding
+                * any task-queue locks at all. We'll only try to get
+                * the runqueue lock when things look like they will
+                * work out!
+                */
+               rq = task_rq(p);
+
+               /*
+                * If the task is actively running on another CPU
+                * still, just relax and busy-wait without holding
+                * any locks.
+                *
+                * NOTE! Since we don't hold any locks, it's not
+                * even sure that "rq" stays as the right runqueue!
+                * But we don't care, since "task_on_cpu()" will
+                * return false if the runqueue has changed and p
+                * is actually now running somewhere else!
+                */
+               while (task_on_cpu(rq, p)) {
+                       if (!task_state_match(p, match_state))
+                               return 0;
+                       cpu_relax();
+               }
+
+               /*
+                * Ok, time to look more closely! We need the rq
+                * lock now, to be *sure*. If we're wrong, we'll
+                * just go back and repeat.
+                */
+               rq = task_rq_lock(p, &rf);
+               trace_sched_wait_task(p);
+               running = task_on_cpu(rq, p);
+               queued = task_on_rq_queued(p);
+               ncsw = 0;
+               if ((match = __task_state_match(p, match_state))) {
+                       /*
+                        * When matching on p->saved_state, consider this task
+                        * still queued so it will wait.
+                        */
+                       if (match < 0)
+                               queued = 1;
+                       ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
+               }
+               task_rq_unlock(rq, p, &rf);
+
+               /*
+                * If it changed from the expected state, bail out now.
+                */
+               if (unlikely(!ncsw))
+                       break;
+
+               /*
+                * Was it really running after all now that we
+                * checked with the proper locks actually held?
+                *
+                * Oops. Go back and try again..
+                */
+               if (unlikely(running)) {
+                       cpu_relax();
+                       continue;
+               }
+
+               /*
+                * It's not enough that it's not actively running,
+                * it must be off the runqueue _entirely_, and not
+                * preempted!
+                *
+                * So if it was still runnable (but just not actively
+                * running right now), it's preempted, and we should
+                * yield - it could be a while.
+                */
+               if (unlikely(queued)) {
+                       ktime_t to = NSEC_PER_SEC / HZ;
+
+                       set_current_state(TASK_UNINTERRUPTIBLE);
+                       schedule_hrtimeout(&to, HRTIMER_MODE_REL_HARD);
+                       continue;
+               }
+
+               /*
+                * Ahh, all good. It wasn't running, and it wasn't
+                * runnable, which means that it will never become
+                * running in the future either. We're all done!
+                */
+               break;
+       }
+
+       return ncsw;
+}
+
 #ifdef CONFIG_SMP
 
 static void
@@ -2398,7 +2546,6 @@ static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
        if (!is_cpu_allowed(p, dest_cpu))
                return rq;
 
-       update_rq_clock(rq);
        rq = move_queued_task(rq, rf, p, dest_cpu);
 
        return rq;
@@ -2456,10 +2603,12 @@ static int migration_cpu_stop(void *data)
                                goto out;
                }
 
-               if (task_on_rq_queued(p))
+               if (task_on_rq_queued(p)) {
+                       update_rq_clock(rq);
                        rq = __migrate_task(rq, &rf, p, arg->dest_cpu);
-               else
+               } else {
                        p->wake_cpu = arg->dest_cpu;
+               }
 
                /*
                 * XXX __migrate_task() can fail, at which point we might end
@@ -3341,114 +3490,6 @@ out:
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
-/*
- * wait_task_inactive - wait for a thread to unschedule.
- *
- * Wait for the thread to block in any of the states set in @match_state.
- * If it changes, i.e. @p might have woken up, then return zero.  When we
- * succeed in waiting for @p to be off its CPU, we return a positive number
- * (its total switch count).  If a second call a short while later returns the
- * same number, the caller can be sure that @p has remained unscheduled the
- * whole time.
- *
- * The caller must ensure that the task *will* unschedule sometime soon,
- * else this function might spin for a *long* time. This function can't
- * be called with interrupts off, or it may introduce deadlock with
- * smp_call_function() if an IPI is sent by the same process we are
- * waiting to become inactive.
- */
-unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
-{
-       int running, queued;
-       struct rq_flags rf;
-       unsigned long ncsw;
-       struct rq *rq;
-
-       for (;;) {
-               /*
-                * We do the initial early heuristics without holding
-                * any task-queue locks at all. We'll only try to get
-                * the runqueue lock when things look like they will
-                * work out!
-                */
-               rq = task_rq(p);
-
-               /*
-                * If the task is actively running on another CPU
-                * still, just relax and busy-wait without holding
-                * any locks.
-                *
-                * NOTE! Since we don't hold any locks, it's not
-                * even sure that "rq" stays as the right runqueue!
-                * But we don't care, since "task_on_cpu()" will
-                * return false if the runqueue has changed and p
-                * is actually now running somewhere else!
-                */
-               while (task_on_cpu(rq, p)) {
-                       if (!(READ_ONCE(p->__state) & match_state))
-                               return 0;
-                       cpu_relax();
-               }
-
-               /*
-                * Ok, time to look more closely! We need the rq
-                * lock now, to be *sure*. If we're wrong, we'll
-                * just go back and repeat.
-                */
-               rq = task_rq_lock(p, &rf);
-               trace_sched_wait_task(p);
-               running = task_on_cpu(rq, p);
-               queued = task_on_rq_queued(p);
-               ncsw = 0;
-               if (READ_ONCE(p->__state) & match_state)
-                       ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
-               task_rq_unlock(rq, p, &rf);
-
-               /*
-                * If it changed from the expected state, bail out now.
-                */
-               if (unlikely(!ncsw))
-                       break;
-
-               /*
-                * Was it really running after all now that we
-                * checked with the proper locks actually held?
-                *
-                * Oops. Go back and try again..
-                */
-               if (unlikely(running)) {
-                       cpu_relax();
-                       continue;
-               }
-
-               /*
-                * It's not enough that it's not actively running,
-                * it must be off the runqueue _entirely_, and not
-                * preempted!
-                *
-                * So if it was still runnable (but just not actively
-                * running right now), it's preempted, and we should
-                * yield - it could be a while.
-                */
-               if (unlikely(queued)) {
-                       ktime_t to = NSEC_PER_SEC / HZ;
-
-                       set_current_state(TASK_UNINTERRUPTIBLE);
-                       schedule_hrtimeout(&to, HRTIMER_MODE_REL_HARD);
-                       continue;
-               }
-
-               /*
-                * Ahh, all good. It wasn't running, and it wasn't
-                * runnable, which means that it will never become
-                * running in the future either. We're all done!
-                */
-               break;
-       }
-
-       return ncsw;
-}
-
 /***
  * kick_process - kick a running thread to enter/exit the kernel
  * @p: the to-be-kicked thread
@@ -4003,15 +4044,14 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 static __always_inline
 bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
 {
+       int match;
+
        if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)) {
                WARN_ON_ONCE((state & TASK_RTLOCK_WAIT) &&
                             state != TASK_RTLOCK_WAIT);
        }
 
-       if (READ_ONCE(p->__state) & state) {
-               *success = 1;
-               return true;
-       }
+       *success = !!(match = __task_state_match(p, state));
 
 #ifdef CONFIG_PREEMPT_RT
        /*
@@ -4027,12 +4067,10 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
         * p::saved_state to TASK_RUNNING so any further tests will
         * not result in false positives vs. @success
         */
-       if (p->saved_state & state) {
+       if (match < 0)
                p->saved_state = TASK_RUNNING;
-               *success = 1;
-       }
 #endif
-       return false;
+       return match > 0;
 }
 
 /*
@@ -5632,6 +5670,9 @@ void scheduler_tick(void)
 
        perf_event_task_tick();
 
+       if (curr->flags & PF_WQ_WORKER)
+               wq_worker_tick(curr);
+
 #ifdef CONFIG_SMP
        rq->idle_balance = idle_cpu(cpu);
        trigger_load_balance(rq);
@@ -7590,6 +7631,7 @@ static int __sched_setscheduler(struct task_struct *p,
        int reset_on_fork;
        int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
        struct rq *rq;
+       bool cpuset_locked = false;
 
        /* The pi code expects interrupts enabled */
        BUG_ON(pi && in_interrupt());
@@ -7639,8 +7681,14 @@ recheck:
                        return retval;
        }
 
-       if (pi)
-               cpuset_read_lock();
+       /*
+        * SCHED_DEADLINE bandwidth accounting relies on stable cpusets
+        * information.
+        */
+       if (dl_policy(policy) || dl_policy(p->policy)) {
+               cpuset_locked = true;
+               cpuset_lock();
+       }
 
        /*
         * Make sure no PI-waiters arrive (or leave) while we are
@@ -7716,8 +7764,8 @@ change:
        if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
                policy = oldpolicy = -1;
                task_rq_unlock(rq, p, &rf);
-               if (pi)
-                       cpuset_read_unlock();
+               if (cpuset_locked)
+                       cpuset_unlock();
                goto recheck;
        }
 
@@ -7784,7 +7832,8 @@ change:
        task_rq_unlock(rq, p, &rf);
 
        if (pi) {
-               cpuset_read_unlock();
+               if (cpuset_locked)
+                       cpuset_unlock();
                rt_mutex_adjust_pi(p);
        }
 
@@ -7796,8 +7845,8 @@ change:
 
 unlock:
        task_rq_unlock(rq, p, &rf);
-       if (pi)
-               cpuset_read_unlock();
+       if (cpuset_locked)
+               cpuset_unlock();
        return retval;
 }
 
@@ -9286,8 +9335,7 @@ int cpuset_cpumask_can_shrink(const struct cpumask *cur,
        return ret;
 }
 
-int task_can_attach(struct task_struct *p,
-                   const struct cpumask *cs_effective_cpus)
+int task_can_attach(struct task_struct *p)
 {
        int ret = 0;
 
@@ -9300,21 +9348,9 @@ int task_can_attach(struct task_struct *p,
         * success of set_cpus_allowed_ptr() on all attached tasks
         * before cpus_mask may be changed.
         */
-       if (p->flags & PF_NO_SETAFFINITY) {
+       if (p->flags & PF_NO_SETAFFINITY)
                ret = -EINVAL;
-               goto out;
-       }
-
-       if (dl_task(p) && !cpumask_intersects(task_rq(p)->rd->span,
-                                             cs_effective_cpus)) {
-               int cpu = cpumask_any_and(cpu_active_mask, cs_effective_cpus);
-
-               if (unlikely(cpu >= nr_cpu_ids))
-                       return -EINVAL;
-               ret = dl_cpu_busy(cpu, p);
-       }
 
-out:
        return ret;
 }
 
@@ -9548,6 +9584,7 @@ void set_rq_offline(struct rq *rq)
        if (rq->online) {
                const struct sched_class *class;
 
+               update_rq_clock(rq);
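+               /* The rq_offline() callbacks below may need an up-to-date rq clock. */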
                for_each_class(class) {
                        if (class->rq_offline)
                                class->rq_offline(rq);
@@ -9596,7 +9633,7 @@ static void cpuset_cpu_active(void)
 static int cpuset_cpu_inactive(unsigned int cpu)
 {
        if (!cpuhp_tasks_frozen) {
-               int ret = dl_cpu_busy(cpu, NULL);
+               int ret = dl_bw_check_overflow(cpu);
 
                if (ret)
                        return ret;
@@ -9689,7 +9726,6 @@ int sched_cpu_deactivate(unsigned int cpu)
 
        rq_lock_irqsave(rq, &rf);
        if (rq->rd) {
-               update_rq_clock(rq);
                BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
                set_rq_offline(rq);
        }
index e321145..4492608 100644 (file)
@@ -155,10 +155,11 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 
 static void sugov_get_util(struct sugov_cpu *sg_cpu)
 {
+       unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu);
        struct rq *rq = cpu_rq(sg_cpu->cpu);
 
        sg_cpu->bw_dl = cpu_bw_dl(rq);
-       sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu),
+       sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
                                          FREQUENCY_UTIL, NULL);
 }
 
index 5a9a4b8..58b542b 100644 (file)
@@ -16,6 +16,8 @@
  *                    Fabio Checconi <fchecconi@gmail.com>
  */
 
+#include <linux/cpuset.h>
+
 /*
  * Default limits for DL period; on the top end we guard against small util
  * tasks still getting ridiculously long effective runtimes, on the bottom end we
@@ -489,13 +491,6 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 
 static void init_dl_rq_bw_ratio(struct dl_rq *dl_rq);
 
-void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
-{
-       raw_spin_lock_init(&dl_b->dl_runtime_lock);
-       dl_b->dl_period = period;
-       dl_b->dl_runtime = runtime;
-}
-
 void init_dl_bw(struct dl_bw *dl_b)
 {
        raw_spin_lock_init(&dl_b->lock);
@@ -1260,43 +1255,39 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
 }
 
 /*
- * This function implements the GRUB accounting rule:
- * according to the GRUB reclaiming algorithm, the runtime is
- * not decreased as "dq = -dt", but as
- * "dq = -max{u / Umax, (1 - Uinact - Uextra)} dt",
+ * This function implements the GRUB accounting rule. According to the
+ * GRUB reclaiming algorithm, the runtime is not decreased as "dq = -dt",
+ * but as "dq = -(max{u, (Umax - Uinact - Uextra)} / Umax) dt",
  * where u is the utilization of the task, Umax is the maximum reclaimable
  * utilization, Uinact is the (per-runqueue) inactive utilization, computed
  * as the difference between the "total runqueue utilization" and the
- * runqueue active utilization, and Uextra is the (per runqueue) extra
+ * "runqueue active utilization", and Uextra is the (per runqueue) extra
  * reclaimable utilization.
- * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations
- * multiplied by 2^BW_SHIFT, the result has to be shifted right by
- * BW_SHIFT.
- * Since rq->dl.bw_ratio contains 1 / Umax multiplied by 2^RATIO_SHIFT,
- * dl_bw is multiped by rq->dl.bw_ratio and shifted right by RATIO_SHIFT.
- * Since delta is a 64 bit variable, to have an overflow its value
- * should be larger than 2^(64 - 20 - 8), which is more than 64 seconds.
- * So, overflow is not an issue here.
+ * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations multiplied
+ * by 2^BW_SHIFT, the result has to be shifted right by BW_SHIFT.
+ * Since rq->dl.bw_ratio contains 1 / Umax multiplied by 2^RATIO_SHIFT, dl_bw
+ * is multiplied by rq->dl.bw_ratio and shifted right by RATIO_SHIFT.
+ * Since delta is a 64 bit variable, to have an overflow its value should be
+ * larger than 2^(64 - 20 - 8), which is more than 64 seconds. So, overflow is
+ * not an issue here.
  */
 static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
 {
-       u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
        u64 u_act;
-       u64 u_act_min = (dl_se->dl_bw * rq->dl.bw_ratio) >> RATIO_SHIFT;
+       u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
 
        /*
-        * Instead of computing max{u * bw_ratio, (1 - u_inact - u_extra)},
-        * we compare u_inact + rq->dl.extra_bw with
-        * 1 - (u * rq->dl.bw_ratio >> RATIO_SHIFT), because
-        * u_inact + rq->dl.extra_bw can be larger than
-        * 1 * (so, 1 - u_inact - rq->dl.extra_bw would be negative
-        * leading to wrong results)
+        * Instead of computing max{u, (u_max - u_inact - u_extra)}, we
+        * compare u_inact + u_extra with u_max - u, because u_inact + u_extra
+        * can be larger than u_max. So, u_max - u_inact - u_extra would be
+        * negative, leading to wrong results.
         */
-       if (u_inact + rq->dl.extra_bw > BW_UNIT - u_act_min)
-               u_act = u_act_min;
+       if (u_inact + rq->dl.extra_bw > rq->dl.max_bw - dl_se->dl_bw)
+               u_act = dl_se->dl_bw;
        else
-               u_act = BW_UNIT - u_inact - rq->dl.extra_bw;
+               u_act = rq->dl.max_bw - u_inact - rq->dl.extra_bw;
 
+       u_act = (u_act * rq->dl.bw_ratio) >> RATIO_SHIFT;
        return (delta * u_act) >> BW_SHIFT;
 }
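
For reference, the reworked accounting can be exercised in isolation. Below is a standalone userspace sketch that mirrors the shape of grub_reclaim() above with the same fixed-point conventions (BW_SHIFT, RATIO_SHIFT); the struct fields and the sample utilizations are invented for illustration and are not real runqueue state.

    #include <stdint.h>
    #include <stdio.h>

    #define BW_SHIFT    20
    #define BW_UNIT     (1 << BW_SHIFT)
    #define RATIO_SHIFT 8

    /* Toy stand-ins for rq->dl fields; values are made up for illustration. */
    struct toy_dl_rq {
        uint64_t this_bw;    /* Utot << BW_SHIFT */
        uint64_t running_bw; /* Uact << BW_SHIFT */
        uint64_t extra_bw;   /* Uextra << BW_SHIFT */
        uint64_t max_bw;     /* Umax << BW_SHIFT */
        uint64_t bw_ratio;   /* (1 / Umax) << RATIO_SHIFT */
    };

    /* dq = -(max{u, (Umax - Uinact - Uextra)} / Umax) dt, mirroring grub_reclaim(). */
    static uint64_t toy_grub_reclaim(uint64_t delta, struct toy_dl_rq *dl, uint64_t dl_bw)
    {
        uint64_t u_inact = dl->this_bw - dl->running_bw;  /* Utot - Uact */
        uint64_t u_act;

        if (u_inact + dl->extra_bw > dl->max_bw - dl_bw)
            u_act = dl_bw;
        else
            u_act = dl->max_bw - u_inact - dl->extra_bw;

        u_act = (u_act * dl->bw_ratio) >> RATIO_SHIFT;    /* divide by Umax */
        return (delta * u_act) >> BW_SHIFT;
    }

    int main(void)
    {
        /* Umax = 95%, one 10% task, 30% of the bandwidth currently inactive. */
        struct toy_dl_rq dl = {
            .this_bw    = (uint64_t)(0.40 * BW_UNIT),
            .running_bw = (uint64_t)(0.10 * BW_UNIT),
            .extra_bw   = (uint64_t)(0.55 * BW_UNIT),
            .max_bw     = (uint64_t)(0.95 * BW_UNIT),
            .bw_ratio   = (uint64_t)((1.0 / 0.95) * (1 << RATIO_SHIFT)),
        };
        uint64_t dl_bw = (uint64_t)(0.10 * BW_UNIT);
        uint64_t delta = 1000000;   /* 1 ms in ns */

        printf("scaled runtime consumption: %llu ns\n",
               (unsigned long long)toy_grub_reclaim(delta, &dl, dl_bw));
        return 0;
    }
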
 
@@ -2596,6 +2587,12 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
        if (task_on_rq_queued(p) && p->dl.dl_runtime)
                task_non_contending(p);
 
+       /*
+        * In case a task is setscheduled out from SCHED_DEADLINE we need to
+        * keep track of that on its cpuset (for correct bandwidth tracking).
+        */
+       dec_dl_tasks_cs(p);
+
        if (!task_on_rq_queued(p)) {
                /*
                 * Inactive timer is armed. However, p is leaving DEADLINE and
@@ -2636,6 +2633,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
        if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
                put_task_struct(p);
 
+       /*
+        * In case a task is setscheduled to SCHED_DEADLINE we need to keep
+        * track of that on its cpuset (for correct bandwidth tracking).
+        */
+       inc_dl_tasks_cs(p);
+
        /* If p is not queued we will update its parameters at next wakeup. */
        if (!task_on_rq_queued(p)) {
                add_rq_bw(&p->dl, &rq->dl);
@@ -2795,12 +2798,12 @@ static void init_dl_rq_bw_ratio(struct dl_rq *dl_rq)
 {
        if (global_rt_runtime() == RUNTIME_INF) {
                dl_rq->bw_ratio = 1 << RATIO_SHIFT;
-               dl_rq->extra_bw = 1 << BW_SHIFT;
+               dl_rq->max_bw = dl_rq->extra_bw = 1 << BW_SHIFT;
        } else {
                dl_rq->bw_ratio = to_ratio(global_rt_runtime(),
                          global_rt_period()) >> (BW_SHIFT - RATIO_SHIFT);
-               dl_rq->extra_bw = to_ratio(global_rt_period(),
-                                                   global_rt_runtime());
+               dl_rq->max_bw = dl_rq->extra_bw =
+                       to_ratio(global_rt_period(), global_rt_runtime());
        }
 }
 
@@ -3044,26 +3047,38 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur,
        return ret;
 }
 
-int dl_cpu_busy(int cpu, struct task_struct *p)
+enum dl_bw_request {
+       dl_bw_req_check_overflow = 0,
+       dl_bw_req_alloc,
+       dl_bw_req_free
+};
+
+static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
 {
-       unsigned long flags, cap;
+       unsigned long flags;
        struct dl_bw *dl_b;
-       bool overflow;
+       bool overflow = 0;
 
        rcu_read_lock_sched();
        dl_b = dl_bw_of(cpu);
        raw_spin_lock_irqsave(&dl_b->lock, flags);
-       cap = dl_bw_capacity(cpu);
-       overflow = __dl_overflow(dl_b, cap, 0, p ? p->dl.dl_bw : 0);
 
-       if (!overflow && p) {
-               /*
-                * We reserve space for this task in the destination
-                * root_domain, as we can't fail after this point.
-                * We will free resources in the source root_domain
-                * later on (see set_cpus_allowed_dl()).
-                */
-               __dl_add(dl_b, p->dl.dl_bw, dl_bw_cpus(cpu));
+       if (req == dl_bw_req_free) {
+               __dl_sub(dl_b, dl_bw, dl_bw_cpus(cpu));
+       } else {
+               unsigned long cap = dl_bw_capacity(cpu);
+
+               overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
+
+               if (req == dl_bw_req_alloc && !overflow) {
+                       /*
+                        * We reserve space in the destination
+                        * root_domain, as we can't fail after this point.
+                        * We will free resources in the source root_domain
+                        * later on (see set_cpus_allowed_dl()).
+                        */
+                       __dl_add(dl_b, dl_bw, dl_bw_cpus(cpu));
+               }
        }
 
        raw_spin_unlock_irqrestore(&dl_b->lock, flags);
@@ -3071,6 +3086,21 @@ int dl_cpu_busy(int cpu, struct task_struct *p)
 
        return overflow ? -EBUSY : 0;
 }
+
+int dl_bw_check_overflow(int cpu)
+{
+       return dl_bw_manage(dl_bw_req_check_overflow, cpu, 0);
+}
+
+int dl_bw_alloc(int cpu, u64 dl_bw)
+{
+       return dl_bw_manage(dl_bw_req_alloc, cpu, dl_bw);
+}
+
+void dl_bw_free(int cpu, u64 dl_bw)
+{
+       dl_bw_manage(dl_bw_req_free, cpu, dl_bw);
+}
 #endif
 
 #ifdef CONFIG_SCHED_DEBUG
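
dl_bw_alloc()/dl_bw_free() give callers (the cpuset code in this series) a reserve/release pair built on one dispatch helper. A self-contained toy model of that dispatch pattern, with a made-up bandwidth pool standing in for struct dl_bw and plain integers instead of fixed-point bandwidth:

    #include <stdio.h>

    /* Toy bandwidth pool standing in for struct dl_bw; numbers are illustrative. */
    struct toy_bw_pool {
        long cap;    /* total bandwidth available */
        long total;  /* bandwidth currently allocated */
    };

    enum bw_request {
        BW_REQ_CHECK_OVERFLOW = 0,
        BW_REQ_ALLOC,
        BW_REQ_FREE,
    };

    /* Single entry point, mirroring the structure of dl_bw_manage(). */
    static int bw_manage(struct toy_bw_pool *pool, enum bw_request req, long bw)
    {
        int overflow = 0;

        if (req == BW_REQ_FREE) {
            pool->total -= bw;
        } else {
            overflow = pool->total + bw > pool->cap;
            if (req == BW_REQ_ALLOC && !overflow)
                pool->total += bw;  /* reserve now, so a later step cannot fail */
        }

        return overflow ? -1 : 0;   /* -EBUSY in the kernel */
    }

    int main(void)
    {
        struct toy_bw_pool pool = { .cap = 100, .total = 70 };

        printf("alloc 20: %d\n", bw_manage(&pool, BW_REQ_ALLOC, 20));  /* fits */
        printf("alloc 20: %d\n", bw_manage(&pool, BW_REQ_ALLOC, 20));  /* overflows */
        bw_manage(&pool, BW_REQ_FREE, 20);
        printf("check  0: %d\n", bw_manage(&pool, BW_REQ_CHECK_OVERFLOW, 0));
        return 0;
    }
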
index 0b2340a..066ff1c 100644 (file)
@@ -777,7 +777,7 @@ static void print_cpu(struct seq_file *m, int cpu)
 #define P(x)                                                           \
 do {                                                                   \
        if (sizeof(rq->x) == 4)                                         \
-               SEQ_printf(m, "  .%-30s: %ld\n", #x, (long)(rq->x));    \
+               SEQ_printf(m, "  .%-30s: %d\n", #x, (int)(rq->x));      \
        else                                                            \
                SEQ_printf(m, "  .%-30s: %Ld\n", #x, (long long)(rq->x));\
 } while (0)
index 373ff5f..a80a739 100644 (file)
@@ -1064,6 +1064,23 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
+static inline bool is_core_idle(int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+       int sibling;
+
+       for_each_cpu(sibling, cpu_smt_mask(cpu)) {
+               if (cpu == sibling)
+                       continue;
+
+               if (!idle_cpu(sibling))
+                       return false;
+       }
+#endif
+
+       return true;
+}
+
 #ifdef CONFIG_NUMA
 #define NUMA_IMBALANCE_MIN 2
 
@@ -1700,23 +1717,6 @@ struct numa_stats {
        int idle_cpu;
 };
 
-static inline bool is_core_idle(int cpu)
-{
-#ifdef CONFIG_SCHED_SMT
-       int sibling;
-
-       for_each_cpu(sibling, cpu_smt_mask(cpu)) {
-               if (cpu == sibling)
-                       continue;
-
-               if (!idle_cpu(sibling))
-                       return false;
-       }
-#endif
-
-       return true;
-}
-
 struct task_numa_env {
        struct task_struct *p;
 
@@ -5577,6 +5577,14 @@ static void __cfsb_csd_unthrottle(void *arg)
        rq_lock(rq, &rf);
 
        /*
+        * Iterating over the list can trigger several calls to
+        * update_rq_clock() in unthrottle_cfs_rq().
+        * Do it once and skip the potential next ones.
+        */
+       update_rq_clock(rq);
+       rq_clock_start_loop_update(rq);
+
+       /*
         * Since we hold rq lock we're safe from concurrent manipulation of
         * the CSD list. However, this RCU critical section annotates the
         * fact that we pair with sched_free_group_rcu(), so that we cannot
@@ -5595,6 +5603,7 @@ static void __cfsb_csd_unthrottle(void *arg)
 
        rcu_read_unlock();
 
+       rq_clock_stop_loop_update(rq);
        rq_unlock(rq, &rf);
 }
 
@@ -6115,6 +6124,13 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 
        lockdep_assert_rq_held(rq);
 
+       /*
+        * The rq clock has already been updated in the
+        * set_rq_offline(), so we should skip updating
+        * the rq clock again in unthrottle_cfs_rq().
+        */
+       rq_clock_start_loop_update(rq);
+
        rcu_read_lock();
        list_for_each_entry_rcu(tg, &task_groups, list) {
                struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
@@ -6137,6 +6153,8 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
                        unthrottle_cfs_rq(cfs_rq);
        }
        rcu_read_unlock();
+
+       rq_clock_stop_loop_update(rq);
 }
 
 #else /* CONFIG_CFS_BANDWIDTH */
@@ -7202,14 +7220,58 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
        return target;
 }
 
-/*
- * Predicts what cpu_util(@cpu) would return if @p was removed from @cpu
- * (@dst_cpu = -1) or migrated to @dst_cpu.
- */
-static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
+/**
+ * cpu_util() - Estimates the amount of CPU capacity used by CFS tasks.
+ * @cpu: the CPU to get the utilization for
+ * @p: task for which the CPU utilization should be predicted or NULL
+ * @dst_cpu: CPU @p migrates to, -1 if @p moves from @cpu or @p == NULL
+ * @boost: 1 to enable boosting, otherwise 0
+ *
+ * The unit of the return value must be the same as the one of CPU capacity
+ * so that CPU utilization can be compared with CPU capacity.
+ *
+ * CPU utilization is the sum of running time of runnable tasks plus the
+ * recent utilization of currently non-runnable tasks on that CPU.
+ * It represents the amount of CPU capacity currently used by CFS tasks in
+ * the range [0..max CPU capacity] with max CPU capacity being the CPU
+ * capacity at f_max.
+ *
+ * The estimated CPU utilization is defined as the maximum between CPU
+ * utilization and sum of the estimated utilization of the currently
+ * runnable tasks on that CPU. It preserves a utilization "snapshot" of
+ * previously-executed tasks, which helps better deduce how busy a CPU will
+ * be when a long-sleeping task wakes up. The contribution to CPU utilization
+ * of such a task would be significantly decayed at this point of time.
+ *
+ * Boosted CPU utilization is defined as max(CPU runnable, CPU utilization).
+ * CPU contention for CFS tasks can be detected by CPU runnable > CPU
+ * utilization. Boosting is implemented in cpu_util() so that internal
+ * users (e.g. EAS) can use it next to external users (e.g. schedutil),
+ * the latter via cpu_util_cfs_boost().
+ *
+ * CPU utilization can be higher than the current CPU capacity
+ * (f_curr/f_max * max CPU capacity) or even the max CPU capacity because
+ * of rounding errors as well as task migrations or wakeups of new tasks.
+ * CPU utilization has to be capped to fit into the [0..max CPU capacity]
+ * range. Otherwise a group of CPUs (CPU0 util = 121% + CPU1 util = 80%)
+ * could be seen as over-utilized even though CPU1 has 20% of spare CPU
+ * capacity. CPU utilization is allowed to overshoot current CPU capacity
+ * though since this is useful for predicting the CPU capacity required
+ * after task migrations (scheduler-driven DVFS).
+ *
+ * Return: (Boosted) (estimated) utilization for the specified CPU.
+ */
+static unsigned long
+cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
 {
        struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
        unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
+       unsigned long runnable;
+
+       if (boost) {
+               runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
+               util = max(util, runnable);
+       }
 
        /*
         * If @dst_cpu is -1 or @p migrates from @cpu to @dst_cpu remove its
@@ -7217,9 +7279,9 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
         * contribution. In all the other cases @cpu is not impacted by the
         * migration so its util_avg is already correct.
         */
-       if (task_cpu(p) == cpu && dst_cpu != cpu)
+       if (p && task_cpu(p) == cpu && dst_cpu != cpu)
                lsub_positive(&util, task_util(p));
-       else if (task_cpu(p) != cpu && dst_cpu == cpu)
+       else if (p && task_cpu(p) != cpu && dst_cpu == cpu)
                util += task_util(p);
 
        if (sched_feat(UTIL_EST)) {
@@ -7227,6 +7289,9 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
 
                util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
 
+               if (boost)
+                       util_est = max(util_est, runnable);
+
                /*
                 * During wake-up @p isn't enqueued yet and doesn't contribute
                 * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
@@ -7255,7 +7320,7 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
                 */
                if (dst_cpu == cpu)
                        util_est += _task_util_est(p);
-               else if (unlikely(task_on_rq_queued(p) || current == p))
+               else if (p && unlikely(task_on_rq_queued(p) || current == p))
                        lsub_positive(&util_est, _task_util_est(p));
 
                util = max(util, util_est);
@@ -7264,6 +7329,16 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
        return min(util, capacity_orig_of(cpu));
 }
 
+unsigned long cpu_util_cfs(int cpu)
+{
+       return cpu_util(cpu, NULL, -1, 0);
+}
+
+unsigned long cpu_util_cfs_boost(int cpu)
+{
+       return cpu_util(cpu, NULL, -1, 1);
+}
+
 /*
  * cpu_util_without: compute cpu utilization without any contributions from *p
  * @cpu: the CPU which utilization is requested
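
To make the boost semantics concrete, here is a minimal userspace sketch of how util_avg, runnable_avg and util_est are combined along the lines of cpu_util() above; the PELT numbers are picked for illustration only.

    #include <stdio.h>

    /* Toy PELT snapshot for one CPU; values are illustrative. */
    struct toy_cfs_avg {
        unsigned long util_avg;
        unsigned long runnable_avg;
        unsigned long util_est;     /* avg.util_est.enqueued */
    };

    static unsigned long toy_cpu_util(const struct toy_cfs_avg *avg,
                                      unsigned long capacity_orig, int boost)
    {
        unsigned long util = avg->util_avg;
        unsigned long util_est = avg->util_est;

        if (boost) {
            /* Boosting: contention shows up as runnable > util. */
            util = util > avg->runnable_avg ? util : avg->runnable_avg;
            util_est = util_est > avg->runnable_avg ? util_est : avg->runnable_avg;
        }

        if (util_est > util)
            util = util_est;

        /* Cap to the CPU's original capacity, as cpu_util() does. */
        return util < capacity_orig ? util : capacity_orig;
    }

    int main(void)
    {
        struct toy_cfs_avg avg = { .util_avg = 300, .runnable_avg = 450, .util_est = 350 };

        printf("plain  : %lu\n", toy_cpu_util(&avg, 1024, 0));  /* 350 */
        printf("boosted: %lu\n", toy_cpu_util(&avg, 1024, 1));  /* 450 */
        return 0;
    }
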
@@ -7281,9 +7356,9 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
 {
        /* Task has no contribution or is new */
        if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
-               return cpu_util_cfs(cpu);
+               p = NULL;
 
-       return cpu_util_next(cpu, p, -1);
+       return cpu_util(cpu, p, -1, 0);
 }
 
 /*
@@ -7330,7 +7405,7 @@ static inline void eenv_task_busy_time(struct energy_env *eenv,
  * cpu_capacity.
  *
  * The contribution of the task @p for which we want to estimate the
- * energy cost is removed (by cpu_util_next()) and must be calculated
+ * energy cost is removed (by cpu_util()) and must be calculated
  * separately (see eenv_task_busy_time). This ensures:
  *
  *   - A stable PD utilization, no matter which CPU of that PD we want to place
@@ -7351,7 +7426,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
        int cpu;
 
        for_each_cpu(cpu, pd_cpus) {
-               unsigned long util = cpu_util_next(cpu, p, -1);
+               unsigned long util = cpu_util(cpu, p, -1, 0);
 
                busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
        }
@@ -7375,8 +7450,8 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
 
        for_each_cpu(cpu, pd_cpus) {
                struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
-               unsigned long util = cpu_util_next(cpu, p, dst_cpu);
-               unsigned long cpu_util;
+               unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
+               unsigned long eff_util;
 
                /*
                 * Performance domain frequency: utilization clamping
@@ -7385,8 +7460,8 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
                 * NOTE: in case RT tasks are running, by default the
                 * FREQUENCY_UTIL's utilization can be max OPP.
                 */
-               cpu_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
-               max_util = max(max_util, cpu_util);
+               eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
+               max_util = max(max_util, eff_util);
        }
 
        return min(max_util, eenv->cpu_cap);
@@ -7521,7 +7596,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
                        if (!cpumask_test_cpu(cpu, p->cpus_ptr))
                                continue;
 
-                       util = cpu_util_next(cpu, p, cpu);
+                       util = cpu_util(cpu, p, cpu, 0);
                        cpu_cap = capacity_of(cpu);
 
                        /*
@@ -9331,96 +9406,61 @@ group_type group_classify(unsigned int imbalance_pct,
 }
 
 /**
- * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
- * @dst_cpu:   Destination CPU of the load balancing
+ * sched_use_asym_prio - Check whether asym_packing priority must be used
+ * @sd:                The scheduling domain of the load balancing
+ * @cpu:       A CPU
+ *
+ * Always use CPU priority when balancing load between SMT siblings. When
+ * balancing load between cores, it is not sufficient that @cpu is idle. Only
+ * use CPU priority if the whole core is idle.
+ *
+ * Returns: True if the priority of @cpu must be followed. False otherwise.
+ */
+static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
+{
+       if (!sched_smt_active())
+               return true;
+
+       return sd->flags & SD_SHARE_CPUCAPACITY || is_core_idle(cpu);
+}
+
+/**
+ * sched_asym - Check if the destination CPU can do asym_packing load balance
+ * @env:       The load balancing environment
  * @sds:       Load-balancing data with statistics of the local group
  * @sgs:       Load-balancing statistics of the candidate busiest group
- * @sg:                The candidate busiest group
+ * @group:     The candidate busiest group
  *
- * Check the state of the SMT siblings of both @sds::local and @sg and decide
- * if @dst_cpu can pull tasks.
+ * @env::dst_cpu can do asym_packing if it has higher priority than the
+ * preferred CPU of @group.
  *
- * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
- * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
- * only if @dst_cpu has higher priority.
+ * SMT is a special case. If we are balancing load between cores, @env::dst_cpu
+ * can do asym_packing balance only if all its SMT siblings are idle. Also, it
+ * can only do it if @group is an SMT group and has exactly one busy CPU. Larger
+ * imbalances in the number of CPUs are dealt with in find_busiest_group().
  *
- * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
- * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher priority.
- * Bigger imbalances in the number of busy CPUs will be dealt with in
- * update_sd_pick_busiest().
+ * If we are balancing load within an SMT core, or at DIE domain level, always
+ * proceed.
  *
- * If @sg does not have SMT siblings, only pull tasks if all of the SMT siblings
- * of @dst_cpu are idle and @sg has lower priority.
- *
- * Return: true if @dst_cpu can pull tasks, false otherwise.
+ * Return: true if @env::dst_cpu can do asym_packing load balance. False
+ * otherwise.
  */
-static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
-                                   struct sg_lb_stats *sgs,
-                                   struct sched_group *sg)
+static inline bool
+sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
+          struct sched_group *group)
 {
-#ifdef CONFIG_SCHED_SMT
-       bool local_is_smt, sg_is_smt;
-       int sg_busy_cpus;
-
-       local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
-       sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
-
-       sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
-
-       if (!local_is_smt) {
-               /*
-                * If we are here, @dst_cpu is idle and does not have SMT
-                * siblings. Pull tasks if candidate group has two or more
-                * busy CPUs.
-                */
-               if (sg_busy_cpus >= 2) /* implies sg_is_smt */
-                       return true;
-
-               /*
-                * @dst_cpu does not have SMT siblings. @sg may have SMT
-                * siblings and only one is busy. In such case, @dst_cpu
-                * can help if it has higher priority and is idle (i.e.,
-                * it has no running tasks).
-                */
-               return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-       }
-
-       /* @dst_cpu has SMT siblings. */
-
-       if (sg_is_smt) {
-               int local_busy_cpus = sds->local->group_weight -
-                                     sds->local_stat.idle_cpus;
-               int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
-
-               if (busy_cpus_delta == 1)
-                       return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
+       /* Ensure that the whole local core is idle, if applicable. */
+       if (!sched_use_asym_prio(env->sd, env->dst_cpu))
                return false;
-       }
 
        /*
-        * @sg does not have SMT siblings. Ensure that @sds::local does not end
-        * up with more than one busy SMT sibling and only pull tasks if there
-        * are not busy CPUs (i.e., no CPU has running tasks).
+        * CPU priorities do not make sense for SMT cores with more than one
+        * busy sibling.
         */
-       if (!sds->local_stat.sum_nr_running)
-               return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
-       return false;
-#else
-       /* Always return false so that callers deal with non-SMT cases. */
-       return false;
-#endif
-}
-
-static inline bool
-sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
-          struct sched_group *group)
-{
-       /* Only do SMT checks if either local or candidate have SMT siblings */
-       if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
-           (group->flags & SD_SHARE_CPUCAPACITY))
-               return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
+       if (group->flags & SD_SHARE_CPUCAPACITY) {
+               if (sgs->group_weight - sgs->idle_cpus != 1)
+                       return false;
+       }
 
        return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
 }
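
The rule above, always follow CPU priority between SMT siblings but between cores only when the destination core is fully idle, can be sketched in isolation as follows. The topology, priorities and idle states are invented inputs, and the helper names are placeholders.

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_CPUS 4

    /* Toy topology: CPUs 0/1 and 2/3 are SMT siblings; priorities are invented. */
    static const int smt_sibling[NR_CPUS] = { 1, 0, 3, 2 };
    static const int asym_prio[NR_CPUS]   = { 10, 10, 20, 20 };
    static bool cpu_idle[NR_CPUS]         = { true, false, true, true };

    static bool core_is_idle(int cpu)
    {
        return cpu_idle[cpu] && cpu_idle[smt_sibling[cpu]];
    }

    /* Within a core always use priority; across cores only if the core is idle. */
    static bool use_asym_prio(bool balancing_within_core, int dst_cpu)
    {
        if (balancing_within_core)
            return true;

        return core_is_idle(dst_cpu);
    }

    static bool can_pull(bool within_core, int dst_cpu, int preferred_cpu)
    {
        if (!use_asym_prio(within_core, dst_cpu))
            return false;

        /* Higher number means higher priority in this toy model. */
        return asym_prio[dst_cpu] > asym_prio[preferred_cpu];
    }

    int main(void)
    {
        /* CPU2 (prio 20, core 2/3 idle) vs. a group whose preferred CPU is 0. */
        printf("cross-core pull to CPU2: %d\n", can_pull(false, 2, 0));  /* 1 */
        /* CPU0's sibling is busy, so cross-core balancing ignores its priority. */
        printf("cross-core pull to CPU0: %d\n", can_pull(false, 0, 2));  /* 0 */
        return 0;
    }
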
@@ -9610,10 +9650,22 @@ static bool update_sd_pick_busiest(struct lb_env *env,
                 * contention when accessing shared HW resources.
                 *
                 * XXX for now avg_load is not computed and always 0 so we
-                * select the 1st one.
+                * select the 1st one, except if @sg is composed of SMT
+                * siblings.
                 */
-               if (sgs->avg_load <= busiest->avg_load)
+
+               if (sgs->avg_load < busiest->avg_load)
                        return false;
+
+               if (sgs->avg_load == busiest->avg_load) {
+                       /*
+                        * SMT sched groups need more help than non-SMT groups.
+                        * If @sg happens to also be SMT, either choice is good.
+                        */
+                       if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
+                               return false;
+               }
+
                break;
 
        case group_has_spare:
@@ -10088,7 +10140,6 @@ static void update_idle_cpu_scan(struct lb_env *env,
 
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-       struct sched_domain *child = env->sd->child;
        struct sched_group *sg = env->sd->groups;
        struct sg_lb_stats *local = &sds->local_stat;
        struct sg_lb_stats tmp_sgs;
@@ -10129,8 +10180,13 @@ next_group:
                sg = sg->next;
        } while (sg != env->sd->groups);
 
-       /* Tag domain that child domain prefers tasks go to siblings first */
-       sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+       /*
+        * Indicate that the child domain of the busiest group prefers that
+        * tasks go to a child's sibling domains first. NB: the flags of a
+        * sched group are those of the child domain.
+        */
+       if (sds->busiest)
+               sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING);
 
 
        if (env->sd->flags & SD_NUMA)
@@ -10440,7 +10496,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
                        goto out_balanced;
        }
 
-       /* Try to move all excess tasks to child's sibling domain */
+       /*
+        * Try to move all excess tasks to a sibling domain of the busiest
+        * group's child domain.
+        */
        if (sds.prefer_sibling && local->group_type == group_has_spare &&
            busiest->sum_nr_running > local->sum_nr_running + 1)
                goto force_balance;
@@ -10542,8 +10601,15 @@ static struct rq *find_busiest_queue(struct lb_env *env,
                    nr_running == 1)
                        continue;
 
-               /* Make sure we only pull tasks from a CPU of lower priority */
+               /*
+                * Make sure we only pull tasks from a CPU of lower priority
+                * when balancing between SMT siblings.
+                *
+                * If balancing between cores, let lower priority CPUs help
+                * SMT cores with more than one busy sibling.
+                */
                if ((env->sd->flags & SD_ASYM_PACKING) &&
+                   sched_use_asym_prio(env->sd, i) &&
                    sched_asym_prefer(i, env->dst_cpu) &&
                    nr_running == 1)
                        continue;
@@ -10581,7 +10647,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
                        break;
 
                case migrate_util:
-                       util = cpu_util_cfs(i);
+                       util = cpu_util_cfs_boost(i);
 
                        /*
                         * Don't try to pull utilization from a CPU with one
@@ -10632,12 +10698,19 @@ static inline bool
 asym_active_balance(struct lb_env *env)
 {
        /*
-        * ASYM_PACKING needs to force migrate tasks from busy but
-        * lower priority CPUs in order to pack all tasks in the
-        * highest priority CPUs.
+        * ASYM_PACKING needs to force migrate tasks from busy but lower
+        * priority CPUs in order to pack all tasks in the highest priority
+        * CPUs. When done between cores, do it only if the whole core is
+        * idle.
+        *
+        * If @env::src_cpu is an SMT core with busy siblings, let
+        * the lower priority @env::dst_cpu help it. Do not follow
+        * CPU priority.
         */
        return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
-              sched_asym_prefer(env->dst_cpu, env->src_cpu);
+              sched_use_asym_prio(env->sd, env->dst_cpu) &&
+              (sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
+               !sched_use_asym_prio(env->sd, env->src_cpu));
 }
 
 static inline bool
@@ -10744,7 +10817,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
                .sd             = sd,
                .dst_cpu        = this_cpu,
                .dst_rq         = this_rq,
-               .dst_grpmask    = sched_group_span(sd->groups),
+               .dst_grpmask    = group_balance_mask(sd->groups),
                .idle           = idle,
                .loop_break     = SCHED_NR_MIGRATE_BREAK,
                .cpus           = cpus,
@@ -11371,9 +11444,13 @@ static void nohz_balancer_kick(struct rq *rq)
                 * When ASYM_PACKING; see if there's a more preferred CPU
                 * currently idle; in which case, kick the ILB to move tasks
                 * around.
+                *
+                * When balancing between cores, all the SMT siblings of the
+                * preferred CPU must be idle.
                 */
                for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
-                       if (sched_asym_prefer(i, cpu)) {
+                       if (sched_use_asym_prio(sd, i) &&
+                           sched_asym_prefer(i, cpu)) {
                                flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
                                goto unlock;
                        }
index e072f6b..81fca77 100644 (file)
@@ -160,7 +160,6 @@ __setup("psi=", setup_psi);
 #define EXP_300s       2034            /* 1/exp(2s/300s) */
 
 /* PSI trigger definitions */
-#define WINDOW_MIN_US 500000   /* Min window size is 500ms */
 #define WINDOW_MAX_US 10000000 /* Max window size is 10s */
 #define UPDATES_PER_WINDOW 10  /* 10 updates per window */
 
@@ -1305,8 +1304,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
        if (state >= PSI_NONIDLE)
                return ERR_PTR(-EINVAL);
 
-       if (window_us < WINDOW_MIN_US ||
-               window_us > WINDOW_MAX_US)
+       if (window_us == 0 || window_us > WINDOW_MAX_US)
                return ERR_PTR(-EINVAL);
 
        /*
@@ -1409,11 +1407,16 @@ void psi_trigger_destroy(struct psi_trigger *t)
                        group->rtpoll_nr_triggers[t->state]--;
                        if (!group->rtpoll_nr_triggers[t->state])
                                group->rtpoll_states &= ~(1 << t->state);
-                       /* reset min update period for the remaining triggers */
-                       list_for_each_entry(tmp, &group->rtpoll_triggers, node)
-                               period = min(period, div_u64(tmp->win.size,
-                                               UPDATES_PER_WINDOW));
-                       group->rtpoll_min_period = period;
+                       /*
+                        * Reset min update period for the remaining triggers
+                        * iff the destroying trigger had the min window size.
+                        */
+                       if (group->rtpoll_min_period == div_u64(t->win.size, UPDATES_PER_WINDOW)) {
+                               list_for_each_entry(tmp, &group->rtpoll_triggers, node)
+                                       period = min(period, div_u64(tmp->win.size,
+                                                       UPDATES_PER_WINDOW));
+                               group->rtpoll_min_period = period;
+                       }
                        /* Destroy rtpoll_task when the last trigger is destroyed */
                        if (group->rtpoll_states == 0) {
                                group->rtpoll_until = 0;
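
The destroy path now rescans for the minimum polling period only when the trigger being removed actually held that minimum. A toy version of the same bookkeeping over a plain array (the kernel walks the group's trigger list instead; the window sizes below are invented):

    #include <stdio.h>

    #define UPDATES_PER_WINDOW 10

    /* Toy trigger set: window sizes in microseconds; values are made up. */
    static unsigned long windows[] = { 2000000, 500000, 8000000 };
    static int nr_windows = 3;
    static unsigned long min_period;

    static unsigned long period_of(unsigned long win_us)
    {
        return win_us / UPDATES_PER_WINDOW;
    }

    static void recompute_min_period(void)
    {
        min_period = (unsigned long)-1;
        for (int i = 0; i < nr_windows; i++)
            if (period_of(windows[i]) < min_period)
                min_period = period_of(windows[i]);
    }

    /* Remove windows[idx]; only rescan when the removed trigger defined the minimum. */
    static void destroy_trigger(int idx)
    {
        unsigned long removed = period_of(windows[idx]);

        windows[idx] = windows[--nr_windows];
        if (removed == min_period)
            recompute_min_period();
    }

    int main(void)
    {
        recompute_min_period();
        printf("min period: %lu us\n", min_period);  /* 50000, from the 500ms window */
        destroy_trigger(1);                          /* removing it forces a rescan */
        printf("min period: %lu us\n", min_period);  /* 200000 */
        return 0;
    }
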
index ec7b3e0..e93e006 100644 (file)
@@ -286,12 +286,6 @@ struct rt_bandwidth {
 
 void __dl_clear_params(struct task_struct *p);
 
-struct dl_bandwidth {
-       raw_spinlock_t          dl_runtime_lock;
-       u64                     dl_runtime;
-       u64                     dl_period;
-};
-
 static inline int dl_bandwidth_enabled(void)
 {
        return sysctl_sched_rt_runtime >= 0;
@@ -330,7 +324,7 @@ extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr);
 extern bool __checkparam_dl(const struct sched_attr *attr);
 extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr);
 extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
-extern int  dl_cpu_busy(int cpu, struct task_struct *p);
+extern int  dl_bw_check_overflow(int cpu);
 
 #ifdef CONFIG_CGROUP_SCHED
 
@@ -754,6 +748,12 @@ struct dl_rq {
        u64                     extra_bw;
 
        /*
+        * Maximum available bandwidth for reclaiming by SCHED_FLAG_RECLAIM
+        * tasks of this rq. Used in the calculation of reclaimable bandwidth (GRUB).
+        */
+       u64                     max_bw;
+
+       /*
         * Inverse of the fraction of CPU utilization that can be reclaimed
         * by the GRUB algorithm.
         */
@@ -1546,6 +1546,28 @@ static inline void rq_clock_cancel_skipupdate(struct rq *rq)
        rq->clock_update_flags &= ~RQCF_REQ_SKIP;
 }
 
+/*
+ * During CPU offlining and rq-wide unthrottling, we can trigger an
+ * update_rq_clock() for several cfs and rt runqueues (typically when
+ * using list_for_each_entry_*()).
+ * rq_clock_start_loop_update() can be called after updating the clock
+ * once and before iterating over the list to prevent multiple updates.
+ * After the iterative traversal, call rq_clock_stop_loop_update() to
+ * clear RQCF_ACT_SKIP in rq->clock_update_flags.
+ */
+static inline void rq_clock_start_loop_update(struct rq *rq)
+{
+       lockdep_assert_rq_held(rq);
+       SCHED_WARN_ON(rq->clock_update_flags & RQCF_ACT_SKIP);
+       rq->clock_update_flags |= RQCF_ACT_SKIP;
+}
+
+static inline void rq_clock_stop_loop_update(struct rq *rq)
+{
+       lockdep_assert_rq_held(rq);
+       rq->clock_update_flags &= ~RQCF_ACT_SKIP;
+}
+
 struct rq_flags {
        unsigned long flags;
        struct pin_cookie cookie;
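
The start/stop helpers simply set and clear a "clock is already fresh" flag around a loop so that per-item updates become no-ops. A toy standalone model of that idea (the flag and clock source here are invented):

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    /* Toy runqueue: one clock plus a flag meaning "clock already fresh, skip updates". */
    struct toy_rq {
        long long clock_ns;
        bool skip_update;
    };

    static void toy_update_clock(struct toy_rq *rq)
    {
        struct timespec ts;

        if (rq->skip_update)    /* analogous to RQCF_ACT_SKIP */
            return;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        rq->clock_ns = ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    static void toy_clock_start_loop_update(struct toy_rq *rq) { rq->skip_update = true; }
    static void toy_clock_stop_loop_update(struct toy_rq *rq)  { rq->skip_update = false; }

    int main(void)
    {
        struct toy_rq rq = { 0, false };

        toy_update_clock(&rq);              /* one real update ... */
        toy_clock_start_loop_update(&rq);
        for (int i = 0; i < 1000; i++)
            toy_update_clock(&rq);          /* ... the per-item ones become no-ops */
        toy_clock_stop_loop_update(&rq);

        printf("clock: %lld ns\n", rq.clock_ns);
        return 0;
    }
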
@@ -1772,6 +1794,13 @@ queue_balance_callback(struct rq *rq,
        for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
                        __sd; __sd = __sd->parent)
 
+/* A mask of all the SD flags that have the SDF_SHARED_CHILD metaflag */
+#define SD_FLAG(name, mflags) (name * !!((mflags) & SDF_SHARED_CHILD)) |
+static const unsigned int SD_SHARED_CHILD_MASK =
+#include <linux/sched/sd_flags.h>
+0;
+#undef SD_FLAG
+
 /**
  * highest_flag_domain - Return highest sched_domain containing flag.
  * @cpu:       The CPU whose highest level of sched domain is to
@@ -1779,16 +1808,25 @@ queue_balance_callback(struct rq *rq,
  * @flag:      The flag to check for the highest sched_domain
  *             for the given CPU.
  *
- * Returns the highest sched_domain of a CPU which contains the given flag.
+ * Returns the highest sched_domain of a CPU which contains @flag. If @flag has
+ * the SDF_SHARED_CHILD metaflag, all the children domains also have @flag.
  */
 static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
 {
        struct sched_domain *sd, *hsd = NULL;
 
        for_each_domain(cpu, sd) {
-               if (!(sd->flags & flag))
+               if (sd->flags & flag) {
+                       hsd = sd;
+                       continue;
+               }
+
+               /*
+                * Stop the search if @flag is known to be shared at lower
+                * levels. It will not be found further up.
+                */
+               if (flag & SD_SHARED_CHILD_MASK)
                        break;
-               hsd = sd;
        }
 
        return hsd;
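
SD_SHARED_CHILD_MASK above is built by re-including <linux/sched/sd_flags.h> under a temporary SD_FLAG() definition. The same X-macro trick, reduced to a single standalone file with an invented flag list:

    #include <stdio.h>

    #define SDF_SHARED_CHILD 0x1    /* metaflag: also set on all child domains */
    #define SDF_NEEDS_GROUPS 0x2

    /* Stand-in for sd_flags.h: each entry is SD_FLAG(name, metaflags). */
    #define TOY_SD_FLAGS                                    \
        SD_FLAG(TOY_BALANCE_NEWIDLE, SDF_SHARED_CHILD)      \
        SD_FLAG(TOY_SHARE_CPUCAPACITY, SDF_SHARED_CHILD)    \
        SD_FLAG(TOY_ASYM_PACKING, SDF_NEEDS_GROUPS)

    /* First expansion: bit positions. */
    #define SD_FLAG(name, mflags) __##name,
    enum { TOY_SD_FLAGS __TOY_SD_FLAG_CNT };
    #undef SD_FLAG

    /* Second expansion: the flag values themselves. */
    #define SD_FLAG(name, mflags) name = 1 << __##name,
    enum { TOY_SD_FLAGS };
    #undef SD_FLAG

    /* Third expansion: OR together only the flags carrying SDF_SHARED_CHILD. */
    #define SD_FLAG(name, mflags) ((name) * !!((mflags) & SDF_SHARED_CHILD)) |
    static const unsigned int TOY_SHARED_CHILD_MASK =
        TOY_SD_FLAGS
        0;
    #undef SD_FLAG

    int main(void)
    {
        printf("shared-child mask: 0x%x\n", TOY_SHARED_CHILD_MASK);
        printf("TOY_ASYM_PACKING shared? %d\n",
               !!(TOY_SHARED_CHILD_MASK & TOY_ASYM_PACKING));
        return 0;
    }
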
@@ -2378,7 +2416,6 @@ extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
-extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
 
@@ -2946,53 +2983,9 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
        return READ_ONCE(rq->avg_dl.util_avg);
 }
 
-/**
- * cpu_util_cfs() - Estimates the amount of CPU capacity used by CFS tasks.
- * @cpu: the CPU to get the utilization for.
- *
- * The unit of the return value must be the same as the one of CPU capacity
- * so that CPU utilization can be compared with CPU capacity.
- *
- * CPU utilization is the sum of running time of runnable tasks plus the
- * recent utilization of currently non-runnable tasks on that CPU.
- * It represents the amount of CPU capacity currently used by CFS tasks in
- * the range [0..max CPU capacity] with max CPU capacity being the CPU
- * capacity at f_max.
- *
- * The estimated CPU utilization is defined as the maximum between CPU
- * utilization and sum of the estimated utilization of the currently
- * runnable tasks on that CPU. It preserves a utilization "snapshot" of
- * previously-executed tasks, which helps better deduce how busy a CPU will
- * be when a long-sleeping task wakes up. The contribution to CPU utilization
- * of such a task would be significantly decayed at this point of time.
- *
- * CPU utilization can be higher than the current CPU capacity
- * (f_curr/f_max * max CPU capacity) or even the max CPU capacity because
- * of rounding errors as well as task migrations or wakeups of new tasks.
- * CPU utilization has to be capped to fit into the [0..max CPU capacity]
- * range. Otherwise a group of CPUs (CPU0 util = 121% + CPU1 util = 80%)
- * could be seen as over-utilized even though CPU1 has 20% of spare CPU
- * capacity. CPU utilization is allowed to overshoot current CPU capacity
- * though since this is useful for predicting the CPU capacity required
- * after task migrations (scheduler-driven DVFS).
- *
- * Return: (Estimated) utilization for the specified CPU.
- */
-static inline unsigned long cpu_util_cfs(int cpu)
-{
-       struct cfs_rq *cfs_rq;
-       unsigned long util;
-
-       cfs_rq = &cpu_rq(cpu)->cfs;
-       util = READ_ONCE(cfs_rq->avg.util_avg);
-
-       if (sched_feat(UTIL_EST)) {
-               util = max_t(unsigned long, util,
-                            READ_ONCE(cfs_rq->avg.util_est.enqueued));
-       }
 
-       return min(util, capacity_orig_of(cpu));
-}
+extern unsigned long cpu_util_cfs(int cpu);
+extern unsigned long cpu_util_cfs_boost(int cpu);
 
 static inline unsigned long cpu_util_rt(struct rq *rq)
 {
index 6682535..d3a3b26 100644 (file)
@@ -487,9 +487,9 @@ static void free_rootdomain(struct rcu_head *rcu)
 void rq_attach_root(struct rq *rq, struct root_domain *rd)
 {
        struct root_domain *old_rd = NULL;
-       unsigned long flags;
+       struct rq_flags rf;
 
-       raw_spin_rq_lock_irqsave(rq, flags);
+       rq_lock_irqsave(rq, &rf);
 
        if (rq->rd) {
                old_rd = rq->rd;
@@ -515,7 +515,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
        if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
                set_rq_online(rq);
 
-       raw_spin_rq_unlock_irqrestore(rq, flags);
+       rq_unlock_irqrestore(rq, &rf);
 
        if (old_rd)
                call_rcu(&old_rd->rcu, free_rootdomain);
@@ -719,8 +719,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 
                if (sd_parent_degenerate(tmp, parent)) {
                        tmp->parent = parent->parent;
-                       if (parent->parent)
+
+                       if (parent->parent) {
                                parent->parent->child = tmp;
+                               if (tmp->flags & SD_SHARE_CPUCAPACITY)
+                                       parent->parent->groups->flags |= SD_SHARE_CPUCAPACITY;
+                       }
+
                        /*
                         * Transfer SD_PREFER_SIBLING down in case of a
                         * degenerate parent; the spans match for this
@@ -1676,7 +1681,7 @@ static struct sched_domain_topology_level *sched_domain_topology_saved;
 #define for_each_sd_topology(tl)                       \
        for (tl = sched_domain_topology; tl->mask; tl++)
 
-void set_sched_topology(struct sched_domain_topology_level *tl)
+void __init set_sched_topology(struct sched_domain_topology_level *tl)
 {
        if (WARN_ON_ONCE(sched_smp_initialized))
                return;
index 133b747..48c53e4 100644 (file)
@@ -425,11 +425,6 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i
 }
 EXPORT_SYMBOL(autoremove_wake_function);
 
-static inline bool is_kthread_should_stop(void)
-{
-       return (current->flags & PF_KTHREAD) && kthread_should_stop();
-}
-
 /*
  * DEFINE_WAIT_FUNC(wait, woken_wake_func);
  *
@@ -459,7 +454,7 @@ long wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout)
         * or woken_wake_function() sees our store to current->state.
         */
        set_current_state(mode); /* A */
-       if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !is_kthread_should_stop())
+       if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !kthread_should_stop_or_park())
                timeout = schedule_timeout(timeout);
        __set_current_state(TASK_RUNNING);
 
index 2547fa7..b5370fe 100644 (file)
@@ -45,6 +45,7 @@
 #include <linux/posix-timers.h>
 #include <linux/cgroup.h>
 #include <linux/audit.h>
+#include <linux/sysctl.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -4773,6 +4774,28 @@ static inline void siginfo_buildtime_checks(void)
 #endif
 }
 
+#if defined(CONFIG_SYSCTL)
+static struct ctl_table signal_debug_table[] = {
+#ifdef CONFIG_SYSCTL_EXCEPTION_TRACE
+       {
+               .procname       = "exception-trace",
+               .data           = &show_unhandled_signals,
+               .maxlen         = sizeof(int),
+               .mode           = 0644,
+               .proc_handler   = proc_dointvec
+       },
+#endif
+       { }
+};
+
+static int __init init_signal_sysctls(void)
+{
+       register_sysctl_init("debug", signal_debug_table);
+       return 0;
+}
+early_initcall(init_signal_sysctls);
+#endif /* CONFIG_SYSCTL */
+
 void __init signals_init(void)
 {
        siginfo_buildtime_checks();
index ab3e5da..385179d 100644 (file)
@@ -27,6 +27,9 @@
 #include <linux/jump_label.h>
 
 #include <trace/events/ipi.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/csd.h>
+#undef CREATE_TRACE_POINTS
 
 #include "smpboot.h"
 #include "sched/smp.h"
@@ -121,6 +124,14 @@ send_call_function_ipi_mask(struct cpumask *mask)
        arch_send_call_function_ipi_mask(mask);
 }
 
+static __always_inline void
+csd_do_func(smp_call_func_t func, void *info, struct __call_single_data *csd)
+{
+       trace_csd_function_entry(func, csd);
+       func(info);
+       trace_csd_function_exit(func, csd);
+}
+
 #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
 
 static DEFINE_STATIC_KEY_MAYBE(CONFIG_CSD_LOCK_WAIT_DEBUG_DEFAULT, csdlock_debug_enabled);
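
csd_do_func() is a thin bracket around every cross-CPU callback so that entry/exit trace events fire consistently on all call paths. The shape of that wrapper, with printf() standing in for the tracepoints and placeholder names throughout:

    #include <stdio.h>

    typedef void (*smp_call_func_t)(void *info);

    /* Stand-ins for the entry/exit tracepoints; in the kernel these are trace events. */
    static void trace_entry(void) { printf("csd function entry\n"); }
    static void trace_exit(void)  { printf("csd function exit\n"); }

    static void toy_csd_do_func(smp_call_func_t func, void *info)
    {
        trace_entry();
        func(info);     /* the actual cross-CPU callback */
        trace_exit();
    }

    static void bump_counter(void *info)
    {
        int *counter = info;
        (*counter)++;
    }

    int main(void)
    {
        int counter = 0;

        toy_csd_do_func(bump_counter, &counter);
        printf("counter = %d\n", counter);
        return 0;
    }
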
@@ -329,7 +340,7 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
         * even if we haven't sent the smp_call IPI yet (e.g. the stopper
         * executes migration_cpu_stop() on the remote CPU).
         */
-       if (trace_ipi_send_cpu_enabled()) {
+       if (trace_csd_queue_cpu_enabled()) {
                call_single_data_t *csd;
                smp_call_func_t func;
 
@@ -337,7 +348,7 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
                func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
                        sched_ttwu_pending : csd->func;
 
-               trace_ipi_send_cpu(cpu, _RET_IP_, func);
+               trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
        }
 
        /*
@@ -375,7 +386,7 @@ static int generic_exec_single(int cpu, struct __call_single_data *csd)
                csd_lock_record(csd);
                csd_unlock(csd);
                local_irq_save(flags);
-               func(info);
+               csd_do_func(func, info, NULL);
                csd_lock_record(NULL);
                local_irq_restore(flags);
                return 0;
@@ -477,7 +488,7 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
                        }
 
                        csd_lock_record(csd);
-                       func(info);
+                       csd_do_func(func, info, csd);
                        csd_unlock(csd);
                        csd_lock_record(NULL);
                } else {
@@ -508,7 +519,7 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
 
                                csd_lock_record(csd);
                                csd_unlock(csd);
-                               func(info);
+                               csd_do_func(func, info, csd);
                                csd_lock_record(NULL);
                        } else if (type == CSD_TYPE_IRQ_WORK) {
                                irq_work_single(csd);
@@ -522,8 +533,10 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
        /*
         * Third; only CSD_TYPE_TTWU is left, issue those.
         */
-       if (entry)
-               sched_ttwu_pending(entry);
+       if (entry) {
+               csd = llist_entry(entry, typeof(*csd), node.llist);
+               csd_do_func(sched_ttwu_pending, entry, csd);
+       }
 }
 
 
@@ -728,7 +741,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
        int cpu, last_cpu, this_cpu = smp_processor_id();
        struct call_function_data *cfd;
        bool wait = scf_flags & SCF_WAIT;
-       int nr_cpus = 0, nr_queued = 0;
+       int nr_cpus = 0;
        bool run_remote = false;
        bool run_local = false;
 
@@ -786,22 +799,16 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
                        csd->node.src = smp_processor_id();
                        csd->node.dst = cpu;
 #endif
+                       trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
+
                        if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
                                __cpumask_set_cpu(cpu, cfd->cpumask_ipi);
                                nr_cpus++;
                                last_cpu = cpu;
                        }
-                       nr_queued++;
                }
 
                /*
-                * Trace each smp_function_call_*() as an IPI, actual IPIs
-                * will be traced with func==generic_smp_call_function_single_ipi().
-                */
-               if (nr_queued)
-                       trace_ipi_send_cpumask(cfd->cpumask, _RET_IP_, func);
-
-               /*
                 * Choose the most efficient way to send an IPI. Note that the
                 * number of CPUs might be zero due to concurrent changes to the
                 * provided mask.
@@ -816,7 +823,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
                unsigned long flags;
 
                local_irq_save(flags);
-               func(info);
+               csd_do_func(func, info, NULL);
                local_irq_restore(flags);
        }
 
@@ -892,7 +899,7 @@ EXPORT_SYMBOL(setup_max_cpus);
  * SMP mode to <NUM>.
  */
 
-void __weak arch_disable_smp_support(void) { }
+void __weak __init arch_disable_smp_support(void) { }
 
 static int __init nosmp(char *str)
 {
index 2c7396d..f47d8f3 100644 (file)
@@ -325,166 +325,3 @@ void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread)
        cpus_read_unlock();
 }
 EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread);
-
-static DEFINE_PER_CPU(atomic_t, cpu_hotplug_state) = ATOMIC_INIT(CPU_POST_DEAD);
-
-/*
- * Called to poll specified CPU's state, for example, when waiting for
- * a CPU to come online.
- */
-int cpu_report_state(int cpu)
-{
-       return atomic_read(&per_cpu(cpu_hotplug_state, cpu));
-}
-
-/*
- * If CPU has died properly, set its state to CPU_UP_PREPARE and
- * return success.  Otherwise, return -EBUSY if the CPU died after
- * cpu_wait_death() timed out.  And yet otherwise again, return -EAGAIN
- * if cpu_wait_death() timed out and the CPU still hasn't gotten around
- * to dying.  In the latter two cases, the CPU might not be set up
- * properly, but it is up to the arch-specific code to decide.
- * Finally, -EIO indicates an unanticipated problem.
- *
- * Note that it is permissible to omit this call entirely, as is
- * done in architectures that do no CPU-hotplug error checking.
- */
-int cpu_check_up_prepare(int cpu)
-{
-       if (!IS_ENABLED(CONFIG_HOTPLUG_CPU)) {
-               atomic_set(&per_cpu(cpu_hotplug_state, cpu), CPU_UP_PREPARE);
-               return 0;
-       }
-
-       switch (atomic_read(&per_cpu(cpu_hotplug_state, cpu))) {
-
-       case CPU_POST_DEAD:
-
-               /* The CPU died properly, so just start it up again. */
-               atomic_set(&per_cpu(cpu_hotplug_state, cpu), CPU_UP_PREPARE);
-               return 0;
-
-       case CPU_DEAD_FROZEN:
-
-               /*
-                * Timeout during CPU death, so let caller know.
-                * The outgoing CPU completed its processing, but after
-                * cpu_wait_death() timed out and reported the error. The
-                * caller is free to proceed, in which case the state
-                * will be reset properly by cpu_set_state_online().
-                * Proceeding despite this -EBUSY return makes sense
-                * for systems where the outgoing CPUs take themselves
-                * offline, with no post-death manipulation required from
-                * a surviving CPU.
-                */
-               return -EBUSY;
-
-       case CPU_BROKEN:
-
-               /*
-                * The most likely reason we got here is that there was
-                * a timeout during CPU death, and the outgoing CPU never
-                * did complete its processing.  This could happen on
-                * a virtualized system if the outgoing VCPU gets preempted
-                * for more than five seconds, and the user attempts to
-                * immediately online that same CPU.  Trying again later
-                * might return -EBUSY above, hence -EAGAIN.
-                */
-               return -EAGAIN;
-
-       case CPU_UP_PREPARE:
-               /*
-                * Timeout while waiting for the CPU to show up. Allow to try
-                * again later.
-                */
-               return 0;
-
-       default:
-
-               /* Should not happen.  Famous last words. */
-               return -EIO;
-       }
-}
-
-/*
- * Mark the specified CPU online.
- *
- * Note that it is permissible to omit this call entirely, as is
- * done in architectures that do no CPU-hotplug error checking.
- */
-void cpu_set_state_online(int cpu)
-{
-       (void)atomic_xchg(&per_cpu(cpu_hotplug_state, cpu), CPU_ONLINE);
-}
-
-#ifdef CONFIG_HOTPLUG_CPU
-
-/*
- * Wait for the specified CPU to exit the idle loop and die.
- */
-bool cpu_wait_death(unsigned int cpu, int seconds)
-{
-       int jf_left = seconds * HZ;
-       int oldstate;
-       bool ret = true;
-       int sleep_jf = 1;
-
-       might_sleep();
-
-       /* The outgoing CPU will normally get done quite quickly. */
-       if (atomic_read(&per_cpu(cpu_hotplug_state, cpu)) == CPU_DEAD)
-               goto update_state_early;
-       udelay(5);
-
-       /* But if the outgoing CPU dawdles, wait increasingly long times. */
-       while (atomic_read(&per_cpu(cpu_hotplug_state, cpu)) != CPU_DEAD) {
-               schedule_timeout_uninterruptible(sleep_jf);
-               jf_left -= sleep_jf;
-               if (jf_left <= 0)
-                       break;
-               sleep_jf = DIV_ROUND_UP(sleep_jf * 11, 10);
-       }
-update_state_early:
-       oldstate = atomic_read(&per_cpu(cpu_hotplug_state, cpu));
-update_state:
-       if (oldstate == CPU_DEAD) {
-               /* Outgoing CPU died normally, update state. */
-               smp_mb(); /* atomic_read() before update. */
-               atomic_set(&per_cpu(cpu_hotplug_state, cpu), CPU_POST_DEAD);
-       } else {
-               /* Outgoing CPU still hasn't died, set state accordingly. */
-               if (!atomic_try_cmpxchg(&per_cpu(cpu_hotplug_state, cpu),
-                                       &oldstate, CPU_BROKEN))
-                       goto update_state;
-               ret = false;
-       }
-       return ret;
-}
-
-/*
- * Called by the outgoing CPU to report its successful death.  Return
- * false if this report follows the surviving CPU's timing out.
- *
- * A separate "CPU_DEAD_FROZEN" is used when the surviving CPU
- * timed out.  This approach allows architectures to omit calls to
- * cpu_check_up_prepare() and cpu_set_state_online() without defeating
- * the next cpu_wait_death()'s polling loop.
- */
-bool cpu_report_death(void)
-{
-       int oldstate;
-       int newstate;
-       int cpu = smp_processor_id();
-
-       oldstate = atomic_read(&per_cpu(cpu_hotplug_state, cpu));
-       do {
-               if (oldstate != CPU_BROKEN)
-                       newstate = CPU_DEAD;
-               else
-                       newstate = CPU_DEAD_FROZEN;
-       } while (!atomic_try_cmpxchg(&per_cpu(cpu_hotplug_state, cpu),
-                                    &oldstate, newstate));
-       return newstate == CPU_DEAD;
-}
-
-#endif /* #ifdef CONFIG_HOTPLUG_CPU */
index 1b72551..807b34c 100644 (file)
@@ -80,21 +80,6 @@ static void wakeup_softirqd(void)
                wake_up_process(tsk);
 }
 
-/*
- * If ksoftirqd is scheduled, we do not want to process pending softirqs
- * right now. Let ksoftirqd handle this at its own rate, to get fairness,
- * unless we're doing some of the synchronous softirqs.
- */
-#define SOFTIRQ_NOW_MASK ((1 << HI_SOFTIRQ) | (1 << TASKLET_SOFTIRQ))
-static bool ksoftirqd_running(unsigned long pending)
-{
-       struct task_struct *tsk = __this_cpu_read(ksoftirqd);
-
-       if (pending & SOFTIRQ_NOW_MASK)
-               return false;
-       return tsk && task_is_running(tsk) && !__kthread_should_park(tsk);
-}
-
 #ifdef CONFIG_TRACE_IRQFLAGS
 DEFINE_PER_CPU(int, hardirqs_enabled);
 DEFINE_PER_CPU(int, hardirq_context);
@@ -236,7 +221,7 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
                goto out;
 
        pending = local_softirq_pending();
-       if (!pending || ksoftirqd_running(pending))
+       if (!pending)
                goto out;
 
        /*
@@ -432,9 +417,6 @@ static inline bool should_wake_ksoftirqd(void)
 
 static inline void invoke_softirq(void)
 {
-       if (ksoftirqd_running(local_softirq_pending()))
-               return;
-
        if (!force_irqthreads() || !__this_cpu_read(ksoftirqd)) {
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
                /*
@@ -468,7 +450,7 @@ asmlinkage __visible void do_softirq(void)
 
        pending = local_softirq_pending();
 
-       if (pending && !ksoftirqd_running(pending))
+       if (pending)
                do_softirq_own_stack();
 
        local_irq_restore(flags);
index 860b2dc..04bfb1e 100644 (file)
@@ -299,6 +299,7 @@ COND_SYSCALL(set_mempolicy);
 COND_SYSCALL(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL(set_mempolicy_home_node);
+COND_SYSCALL(cachestat);
 
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
index bfe53e8..354a2d2 100644 (file)
@@ -1783,11 +1783,6 @@ static struct ctl_table kern_table[] = {
                .proc_handler   = sysctl_max_threads,
        },
        {
-               .procname       = "usermodehelper",
-               .mode           = 0555,
-               .child          = usermodehelper_table,
-       },
-       {
                .procname       = "overflowuid",
                .data           = &overflowuid,
                .maxlen         = sizeof(int),
@@ -1962,13 +1957,6 @@ static struct ctl_table kern_table[] = {
                .proc_handler   = proc_dointvec,
        },
 #endif
-#ifdef CONFIG_KEYS
-       {
-               .procname       = "keys",
-               .mode           = 0555,
-               .child          = key_sysctls,
-       },
-#endif
 #ifdef CONFIG_PERF_EVENTS
        /*
         * User-space scripts rely on the existence of this file
@@ -2120,13 +2108,6 @@ static struct ctl_table vm_table[] = {
        },
 #endif
        {
-               .procname       = "lowmem_reserve_ratio",
-               .data           = &sysctl_lowmem_reserve_ratio,
-               .maxlen         = sizeof(sysctl_lowmem_reserve_ratio),
-               .mode           = 0644,
-               .proc_handler   = lowmem_reserve_ratio_sysctl_handler,
-       },
-       {
                .procname       = "drop_caches",
                .data           = &sysctl_drop_caches,
                .maxlen         = sizeof(int),
@@ -2136,39 +2117,6 @@ static struct ctl_table vm_table[] = {
                .extra2         = SYSCTL_FOUR,
        },
        {
-               .procname       = "min_free_kbytes",
-               .data           = &min_free_kbytes,
-               .maxlen         = sizeof(min_free_kbytes),
-               .mode           = 0644,
-               .proc_handler   = min_free_kbytes_sysctl_handler,
-               .extra1         = SYSCTL_ZERO,
-       },
-       {
-               .procname       = "watermark_boost_factor",
-               .data           = &watermark_boost_factor,
-               .maxlen         = sizeof(watermark_boost_factor),
-               .mode           = 0644,
-               .proc_handler   = proc_dointvec_minmax,
-               .extra1         = SYSCTL_ZERO,
-       },
-       {
-               .procname       = "watermark_scale_factor",
-               .data           = &watermark_scale_factor,
-               .maxlen         = sizeof(watermark_scale_factor),
-               .mode           = 0644,
-               .proc_handler   = watermark_scale_factor_sysctl_handler,
-               .extra1         = SYSCTL_ONE,
-               .extra2         = SYSCTL_THREE_THOUSAND,
-       },
-       {
-               .procname       = "percpu_pagelist_high_fraction",
-               .data           = &percpu_pagelist_high_fraction,
-               .maxlen         = sizeof(percpu_pagelist_high_fraction),
-               .mode           = 0644,
-               .proc_handler   = percpu_pagelist_high_fraction_sysctl_handler,
-               .extra1         = SYSCTL_ZERO,
-       },
-       {
                .procname       = "page_lock_unfairness",
                .data           = &sysctl_page_lock_unfairness,
                .maxlen         = sizeof(sysctl_page_lock_unfairness),
@@ -2223,24 +2171,6 @@ static struct ctl_table vm_table[] = {
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = SYSCTL_ZERO,
        },
-       {
-               .procname       = "min_unmapped_ratio",
-               .data           = &sysctl_min_unmapped_ratio,
-               .maxlen         = sizeof(sysctl_min_unmapped_ratio),
-               .mode           = 0644,
-               .proc_handler   = sysctl_min_unmapped_ratio_sysctl_handler,
-               .extra1         = SYSCTL_ZERO,
-               .extra2         = SYSCTL_ONE_HUNDRED,
-       },
-       {
-               .procname       = "min_slab_ratio",
-               .data           = &sysctl_min_slab_ratio,
-               .maxlen         = sizeof(sysctl_min_slab_ratio),
-               .mode           = 0644,
-               .proc_handler   = sysctl_min_slab_ratio_sysctl_handler,
-               .extra1         = SYSCTL_ZERO,
-               .extra2         = SYSCTL_ONE_HUNDRED,
-       },
 #endif
 #ifdef CONFIG_SMP
        {
@@ -2267,15 +2197,6 @@ static struct ctl_table vm_table[] = {
                .proc_handler   = mmap_min_addr_handler,
        },
 #endif
-#ifdef CONFIG_NUMA
-       {
-               .procname       = "numa_zonelist_order",
-               .data           = &numa_zonelist_order,
-               .maxlen         = NUMA_ZONELIST_ORDER_LEN,
-               .mode           = 0644,
-               .proc_handler   = numa_zonelist_order_handler,
-       },
-#endif
 #if (defined(CONFIG_X86_32) && !defined(CONFIG_UML))|| \
    (defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
        {
@@ -2331,34 +2252,10 @@ static struct ctl_table vm_table[] = {
        { }
 };
 
-static struct ctl_table debug_table[] = {
-#ifdef CONFIG_SYSCTL_EXCEPTION_TRACE
-       {
-               .procname       = "exception-trace",
-               .data           = &show_unhandled_signals,
-               .maxlen         = sizeof(int),
-               .mode           = 0644,
-               .proc_handler   = proc_dointvec
-       },
-#endif
-       { }
-};
-
-static struct ctl_table dev_table[] = {
-       { }
-};
-
-DECLARE_SYSCTL_BASE(kernel, kern_table);
-DECLARE_SYSCTL_BASE(vm, vm_table);
-DECLARE_SYSCTL_BASE(debug, debug_table);
-DECLARE_SYSCTL_BASE(dev, dev_table);
-
 int __init sysctl_init_bases(void)
 {
-       register_sysctl_base(kernel);
-       register_sysctl_base(vm);
-       register_sysctl_base(debug);
-       register_sysctl_base(dev);
+       register_sysctl_init("kernel", kern_table);
+       register_sysctl_init("vm", vm_table);
 
        return 0;
 }
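
The child-directory entries ("keys", "usermodehelper", the MM watermark knobs, etc.) dropped from kern_table and vm_table above are now registered by their owning subsystems, as the usermodehelper conversion further down in this series does. A minimal sketch of that pattern, assuming a made-up "kernel/example" directory and example_knob variable:

    /* Hypothetical subsystem-local sysctl registration.
     * Needs <linux/sysctl.h> and <linux/init.h>.
     */
    static int example_knob;

    static struct ctl_table example_table[] = {
            {
                    .procname       = "example_knob",
                    .data           = &example_knob,
                    .maxlen         = sizeof(int),
                    .mode           = 0644,
                    .proc_handler   = proc_dointvec,
            },
            { }
    };

    static int __init example_sysctl_init(void)
    {
            /* Creates /proc/sys/kernel/example/example_knob */
            register_sysctl_init("kernel/example", example_table);
            return 0;
    }
    early_initcall(example_sysctl_init);
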
index 82b28ab..8d9f13d 100644 (file)
@@ -751,7 +751,7 @@ static int alarm_timer_create(struct k_itimer *new_timer)
 static enum alarmtimer_restart alarmtimer_nsleep_wakeup(struct alarm *alarm,
                                                                ktime_t now)
 {
-       struct task_struct *task = (struct task_struct *)alarm->data;
+       struct task_struct *task = alarm->data;
 
        alarm->data = NULL;
        if (task)
@@ -847,7 +847,7 @@ static int alarm_timer_nsleep(const clockid_t which_clock, int flags,
        struct restart_block *restart = &current->restart_block;
        struct alarm alarm;
        ktime_t exp;
-       int ret = 0;
+       int ret;
 
        if (!alarmtimer_get_rtcdev())
                return -EOPNOTSUPP;
index 91836b7..88cbc11 100644 (file)
@@ -1480,7 +1480,7 @@ static int __init boot_override_clocksource(char* str)
 {
        mutex_lock(&clocksource_mutex);
        if (str)
-               strlcpy(override_name, str, sizeof(override_name));
+               strscpy(override_name, str, sizeof(override_name));
        mutex_unlock(&clocksource_mutex);
        return 1;
 }
index e8c0829..238262e 100644 (file)
@@ -164,6 +164,7 @@ static inline bool is_migration_base(struct hrtimer_clock_base *base)
 static
 struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
                                             unsigned long *flags)
+       __acquires(&timer->base->lock)
 {
        struct hrtimer_clock_base *base;
 
@@ -280,6 +281,7 @@ static inline bool is_migration_base(struct hrtimer_clock_base *base)
 
 static inline struct hrtimer_clock_base *
 lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
+       __acquires(&timer->base->cpu_base->lock)
 {
        struct hrtimer_clock_base *base = timer->base;
 
@@ -1013,6 +1015,7 @@ void hrtimers_resume_local(void)
  */
 static inline
 void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
+       __releases(&timer->base->cpu_base->lock)
 {
        raw_spin_unlock_irqrestore(&timer->base->cpu_base->lock, *flags);
 }
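
The __acquires()/__releases() annotations added above are sparse-only hints ("make C=1"): they record that lock_hrtimer_base() returns with the base lock held and that unlock_hrtimer_base() drops it. A self-contained sketch of how such a pair is annotated; demo_lock()/demo_unlock() are made-up names:

    /* Kernel-side sketch, needs <linux/spinlock.h>. */
    static void demo_lock(spinlock_t *lock)
            __acquires(lock)
    {
            spin_lock(lock);        /* returns with the lock held */
    }

    static void demo_unlock(spinlock_t *lock)
            __releases(lock)
    {
            spin_unlock(lock);      /* balances demo_lock() for sparse */
    }
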
index 808a247..b924f0f 100644 (file)
 #include "timekeeping.h"
 #include "posix-timers.h"
 
-/*
- * Management arrays for POSIX timers. Timers are now kept in static hash table
- * with 512 entries.
- * Timer ids are allocated by local routine, which selects proper hash head by
- * key, constructed from current->signal address and per signal struct counter.
- * This keeps timer ids unique per process, but now they can intersect between
- * processes.
- */
+static struct kmem_cache *posix_timers_cache;
 
 /*
- * Lets keep our timers in a slab cache :-)
+ * Timers are managed in a hash table for lockless lookup. The hash key is
+ * constructed from current::signal and the timer ID, and the timer is
+ * matched against current::signal and the timer ID when walking the hash
+ * bucket list.
+ *
+ * This allows checkpoint/restore to reconstruct the exact timer IDs for
+ * a process.
  */
-static struct kmem_cache *posix_timers_cache;
-
 static DEFINE_HASHTABLE(posix_timers_hashtable, 9);
 static DEFINE_SPINLOCK(hash_lock);
 
@@ -56,52 +53,12 @@ static const struct k_clock * const posix_clocks[];
 static const struct k_clock *clockid_to_kclock(const clockid_t id);
 static const struct k_clock clock_realtime, clock_monotonic;
 
-/*
- * we assume that the new SIGEV_THREAD_ID shares no bits with the other
- * SIGEV values.  Here we put out an error if this assumption fails.
- */
+/* SIGEV_THREAD_ID cannot share a bit with the other SIGEV values. */
 #if SIGEV_THREAD_ID != (SIGEV_THREAD_ID & \
-                       ~(SIGEV_SIGNAL | SIGEV_NONE | SIGEV_THREAD))
+                       ~(SIGEV_SIGNAL | SIGEV_NONE | SIGEV_THREAD))
 #error "SIGEV_THREAD_ID must not share bit with other SIGEV values!"
 #endif
 
-/*
- * The timer ID is turned into a timer address by idr_find().
- * Verifying a valid ID consists of:
- *
- * a) checking that idr_find() returns other than -1.
- * b) checking that the timer id matches the one in the timer itself.
- * c) that the timer owner is in the callers thread group.
- */
-
-/*
- * CLOCKs: The POSIX standard calls for a couple of clocks and allows us
- *         to implement others.  This structure defines the various
- *         clocks.
- *
- * RESOLUTION: Clock resolution is used to round up timer and interval
- *         times, NOT to report clock times, which are reported with as
- *         much resolution as the system can muster.  In some cases this
- *         resolution may depend on the underlying clock hardware and
- *         may not be quantifiable until run time, and only then is the
- *         necessary code is written.  The standard says we should say
- *         something about this issue in the documentation...
- *
- * FUNCTIONS: The CLOCKs structure defines possible functions to
- *         handle various clock functions.
- *
- *         The standard POSIX timer management code assumes the
- *         following: 1.) The k_itimer struct (sched.h) is used for
- *         the timer.  2.) The list, it_lock, it_clock, it_id and
- *         it_pid fields are not modified by timer code.
- *
- * Permissions: It is assumed that the clock_settime() function defined
- *         for each clock will take care of permission checks.  Some
- *         clocks may be set able by any user (i.e. local process
- *         clocks) others not.  Currently the only set able clock we
- *         have is CLOCK_REALTIME and its high res counter part, both of
- *         which we beg off on and pass to do_sys_settimeofday().
- */
 static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags);
 
 #define lock_timer(tid, flags)                                            \
@@ -121,9 +78,9 @@ static struct k_itimer *__posix_timers_find(struct hlist_head *head,
 {
        struct k_itimer *timer;
 
-       hlist_for_each_entry_rcu(timer, head, t_hash,
-                                lockdep_is_held(&hash_lock)) {
-               if ((timer->it_signal == sig) && (timer->it_id == id))
+       hlist_for_each_entry_rcu(timer, head, t_hash, lockdep_is_held(&hash_lock)) {
+               /* timer->it_signal can be set concurrently */
+               if ((READ_ONCE(timer->it_signal) == sig) && (timer->it_id == id))
                        return timer;
        }
        return NULL;
@@ -140,25 +97,30 @@ static struct k_itimer *posix_timer_by_id(timer_t id)
 static int posix_timer_add(struct k_itimer *timer)
 {
        struct signal_struct *sig = current->signal;
-       int first_free_id = sig->posix_timer_id;
        struct hlist_head *head;
-       int ret = -ENOENT;
+       unsigned int cnt, id;
 
-       do {
+       /*
+        * FIXME: Replace this by a per signal struct xarray once there is
+        * a plan to handle the resulting CRIU regression gracefully.
+        */
+       for (cnt = 0; cnt <= INT_MAX; cnt++) {
                spin_lock(&hash_lock);
-               head = &posix_timers_hashtable[hash(sig, sig->posix_timer_id)];
-               if (!__posix_timers_find(head, sig, sig->posix_timer_id)) {
+               id = sig->next_posix_timer_id;
+
+               /* Write the next ID back. Clamp it to the positive space */
+               sig->next_posix_timer_id = (id + 1) & INT_MAX;
+
+               head = &posix_timers_hashtable[hash(sig, id)];
+               if (!__posix_timers_find(head, sig, id)) {
                        hlist_add_head_rcu(&timer->t_hash, head);
-                       ret = sig->posix_timer_id;
+                       spin_unlock(&hash_lock);
+                       return id;
                }
-               if (++sig->posix_timer_id < 0)
-                       sig->posix_timer_id = 0;
-               if ((sig->posix_timer_id == first_free_id) && (ret == -ENOENT))
-                       /* Loop over all possible ids completed */
-                       ret = -EAGAIN;
                spin_unlock(&hash_lock);
-       } while (ret == -ENOENT);
-       return ret;
+       }
+       /* POSIX return code when no timer ID could be allocated */
+       return -EAGAIN;
 }
 
 static inline void unlock_timer(struct k_itimer *timr, unsigned long flags)
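
The sequential per-process IDs allocated above exist, per the comment at the top of this file, so that checkpoint/restore can reconstruct exact timer IDs. A purely hypothetical restore-side sketch of what that relies on (restore_timer_id() is illustrative, not CRIU's actual code, and assumes the wanted ID has not already been consumed):

    #include <signal.h>
    #include <time.h>

    /* Burn IDs with create/delete pairs until the kernel's per-process
     * counter hands out the ID recorded at checkpoint time.
     */
    static int restore_timer_id(timer_t wanted, struct sigevent *sev,
                                timer_t *out)
    {
            for (;;) {
                    timer_t tid;

                    if (timer_create(CLOCK_MONOTONIC, sev, &tid))
                            return -1;
                    if (tid == wanted) {
                            *out = tid;
                            return 0;
                    }
                    /* Not the ID we need; free it and advance the counter. */
                    timer_delete(tid);
            }
    }
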
@@ -166,7 +128,6 @@ static inline void unlock_timer(struct k_itimer *timr, unsigned long flags)
        spin_unlock_irqrestore(&timr->it_lock, flags);
 }
 
-/* Get clock_realtime */
 static int posix_get_realtime_timespec(clockid_t which_clock, struct timespec64 *tp)
 {
        ktime_get_real_ts64(tp);
@@ -178,7 +139,6 @@ static ktime_t posix_get_realtime_ktime(clockid_t which_clock)
        return ktime_get_real();
 }
 
-/* Set clock_realtime */
 static int posix_clock_realtime_set(const clockid_t which_clock,
                                    const struct timespec64 *tp)
 {
@@ -191,9 +151,6 @@ static int posix_clock_realtime_adj(const clockid_t which_clock,
        return do_adjtimex(t);
 }
 
-/*
- * Get monotonic time for posix timers
- */
 static int posix_get_monotonic_timespec(clockid_t which_clock, struct timespec64 *tp)
 {
        ktime_get_ts64(tp);
@@ -206,9 +163,6 @@ static ktime_t posix_get_monotonic_ktime(clockid_t which_clock)
        return ktime_get();
 }
 
-/*
- * Get monotonic-raw time for posix timers
- */
 static int posix_get_monotonic_raw(clockid_t which_clock, struct timespec64 *tp)
 {
        ktime_get_raw_ts64(tp);
@@ -216,7 +170,6 @@ static int posix_get_monotonic_raw(clockid_t which_clock, struct timespec64 *tp)
        return 0;
 }
 
-
 static int posix_get_realtime_coarse(clockid_t which_clock, struct timespec64 *tp)
 {
        ktime_get_coarse_real_ts64(tp);
@@ -267,9 +220,6 @@ static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
        return 0;
 }
 
-/*
- * Initialize everything, well, just everything in Posix clocks/timers ;)
- */
 static __init int init_posix_timers(void)
 {
        posix_timers_cache = kmem_cache_create("posix_timers_cache",
@@ -300,15 +250,9 @@ static void common_hrtimer_rearm(struct k_itimer *timr)
 }
 
 /*
- * This function is exported for use by the signal deliver code.  It is
- * called just prior to the info block being released and passes that
- * block to us.  It's function is to update the overrun entry AND to
- * restart the timer.  It should only be called if the timer is to be
- * restarted (i.e. we have flagged this in the sys_private entry of the
- * info block).
- *
- * To protect against the timer going away while the interrupt is queued,
- * we require that the it_requeue_pending flag be set.
+ * This function is called from the signal delivery code if
+ * info->si_sys_private is not zero, which indicates that the timer has to
+ * be rearmed. Restart the timer and update info::si_overrun.
  */
 void posixtimer_rearm(struct kernel_siginfo *info)
 {
@@ -357,18 +301,18 @@ int posix_timer_event(struct k_itimer *timr, int si_private)
 }
 
 /*
- * This function gets called when a POSIX.1b interval timer expires.  It
- * is used as a callback from the kernel internal timer.  The
- * run_timer_list code ALWAYS calls with interrupts on.
-
- * This code is for CLOCK_REALTIME* and CLOCK_MONOTONIC* timers.
+ * This function gets called when a POSIX.1b interval timer expires from
+ * the HRTIMER interrupt (soft interrupt on RT kernels).
+ *
+ * Handles CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME and CLOCK_TAI
+ * based timers.
  */
 static enum hrtimer_restart posix_timer_fn(struct hrtimer *timer)
 {
+       enum hrtimer_restart ret = HRTIMER_NORESTART;
        struct k_itimer *timr;
        unsigned long flags;
        int si_private = 0;
-       enum hrtimer_restart ret = HRTIMER_NORESTART;
 
        timr = container_of(timer, struct k_itimer, it.real.timer);
        spin_lock_irqsave(&timr->it_lock, flags);
@@ -379,9 +323,10 @@ static enum hrtimer_restart posix_timer_fn(struct hrtimer *timer)
 
        if (posix_timer_event(timr, si_private)) {
                /*
-                * signal was not sent because of sig_ignor
-                * we will not get a call back to restart it AND
-                * it should be restarted.
+                * The signal was not queued due to SIG_IGN. As a
+                * consequence the timer is not going to be rearmed from
+                * the signal delivery path. But as a real signal handler
+                * can be installed later the timer must be rearmed here.
                 */
                if (timr->it_interval != 0) {
                        ktime_t now = hrtimer_cb_get_time(timer);
@@ -390,34 +335,35 @@ static enum hrtimer_restart posix_timer_fn(struct hrtimer *timer)
                         * FIXME: What we really want, is to stop this
                         * timer completely and restart it in case the
                         * SIG_IGN is removed. This is a non trivial
-                        * change which involves sighand locking
-                        * (sigh !), which we don't want to do late in
-                        * the release cycle.
+                        * change to the signal handling code.
+                        *
+                        * For now let timers with an interval less than a
+                        * jiffie expire every jiffie and recheck for a
+                        * valid signal handler.
+                        *
+                        * This avoids interrupt starvation in case of a
+                        * very small interval, which would expire the
+                        * timer immediately again.
+                        *
+                        * Moving now ahead of time by one jiffie tricks
+                        * hrtimer_forward() to expire the timer later,
+                        * while it still maintains the overrun accuracy
+                        * for the price of a slight inconsistency in the
+                        * timer_gettime() case. This is at least better
+                        * than a timer storm.
                         *
-                        * For now we just let timers with an interval
-                        * less than a jiffie expire every jiffie to
-                        * avoid softirq starvation in case of SIG_IGN
-                        * and a very small interval, which would put
-                        * the timer right back on the softirq pending
-                        * list. By moving now ahead of time we trick
-                        * hrtimer_forward() to expire the timer
-                        * later, while we still maintain the overrun
-                        * accuracy, but have some inconsistency in
-                        * the timer_gettime() case. This is at least
-                        * better than a starved softirq. A more
-                        * complex fix which solves also another related
-                        * inconsistency is already in the pipeline.
+                        * Only required when high resolution timers are
+                        * enabled as the periodic tick based timers are
+                        * automatically aligned to the next tick.
                         */
-#ifdef CONFIG_HIGH_RES_TIMERS
-                       {
-                               ktime_t kj = NSEC_PER_SEC / HZ;
+                       if (IS_ENABLED(CONFIG_HIGH_RES_TIMERS)) {
+                               ktime_t kj = TICK_NSEC;
 
                                if (timr->it_interval < kj)
                                        now = ktime_add(now, kj);
                        }
-#endif
-                       timr->it_overrun += hrtimer_forward(timer, now,
-                                                           timr->it_interval);
+
+                       timr->it_overrun += hrtimer_forward(timer, now, timr->it_interval);
                        ret = HRTIMER_RESTART;
                        ++timr->it_requeue_pending;
                        timr->it_active = 1;
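
A worked example of the one-tick adjustment described in the comment above, with illustrative numbers: at HZ=250, TICK_NSEC is 4,000,000 ns. For a SIG_IGN'ed timer with a 100 ns interval, passing "now + TICK_NSEC" to hrtimer_forward() pushes the next expiry to just past one tick from now instead of 100 ns from now, so the handler re-check happens at tick rate rather than as an immediate re-expiry storm, while the skipped 100 ns intervals are still accumulated into it_overrun.
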
@@ -454,8 +400,8 @@ static struct pid *good_sigevent(sigevent_t * event)
 
 static struct k_itimer * alloc_posix_timer(void)
 {
-       struct k_itimer *tmr;
-       tmr = kmem_cache_zalloc(posix_timers_cache, GFP_KERNEL);
+       struct k_itimer *tmr = kmem_cache_zalloc(posix_timers_cache, GFP_KERNEL);
+
        if (!tmr)
                return tmr;
        if (unlikely(!(tmr->sigq = sigqueue_alloc()))) {
@@ -473,21 +419,21 @@ static void k_itimer_rcu_free(struct rcu_head *head)
        kmem_cache_free(posix_timers_cache, tmr);
 }
 
-#define IT_ID_SET      1
-#define IT_ID_NOT_SET  0
-static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
+static void posix_timer_free(struct k_itimer *tmr)
 {
-       if (it_id_set) {
-               unsigned long flags;
-               spin_lock_irqsave(&hash_lock, flags);
-               hlist_del_rcu(&tmr->t_hash);
-               spin_unlock_irqrestore(&hash_lock, flags);
-       }
        put_pid(tmr->it_pid);
        sigqueue_free(tmr->sigq);
        call_rcu(&tmr->rcu, k_itimer_rcu_free);
 }
 
+static void posix_timer_unhash_and_free(struct k_itimer *tmr)
+{
+       spin_lock(&hash_lock);
+       hlist_del_rcu(&tmr->t_hash);
+       spin_unlock(&hash_lock);
+       posix_timer_free(tmr);
+}
+
 static int common_timer_create(struct k_itimer *new_timer)
 {
        hrtimer_init(&new_timer->it.real.timer, new_timer->it_clock, 0);
@@ -501,7 +447,6 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
        const struct k_clock *kc = clockid_to_kclock(which_clock);
        struct k_itimer *new_timer;
        int error, new_timer_id;
-       int it_id_set = IT_ID_NOT_SET;
 
        if (!kc)
                return -EINVAL;
@@ -513,13 +458,18 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
                return -EAGAIN;
 
        spin_lock_init(&new_timer->it_lock);
+
+       /*
+        * Add the timer to the hash table. The timer is not yet valid
+        * because new_timer::it_signal is still NULL. The timer id is also
+        * not yet visible to user space.
+        */
        new_timer_id = posix_timer_add(new_timer);
        if (new_timer_id < 0) {
-               error = new_timer_id;
-               goto out;
+               posix_timer_free(new_timer);
+               return new_timer_id;
        }
 
-       it_id_set = IT_ID_SET;
        new_timer->it_id = (timer_t) new_timer_id;
        new_timer->it_clock = which_clock;
        new_timer->kclock = kc;
@@ -547,30 +497,33 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
        new_timer->sigq->info.si_tid   = new_timer->it_id;
        new_timer->sigq->info.si_code  = SI_TIMER;
 
-       if (copy_to_user(created_timer_id,
-                        &new_timer_id, sizeof (new_timer_id))) {
+       if (copy_to_user(created_timer_id, &new_timer_id, sizeof (new_timer_id))) {
                error = -EFAULT;
                goto out;
        }
-
+       /*
+        * After successful copy out, the timer ID is visible to user space
+        * now but not yet valid because new_timer::it_signal is still NULL.
+        *
+        * Complete the initialization with the clock specific create
+        * callback.
+        */
        error = kc->timer_create(new_timer);
        if (error)
                goto out;
 
        spin_lock_irq(&current->sighand->siglock);
-       new_timer->it_signal = current->signal;
+       /* This makes the timer valid in the hash table */
+       WRITE_ONCE(new_timer->it_signal, current->signal);
        list_add(&new_timer->list, &current->signal->posix_timers);
        spin_unlock_irq(&current->sighand->siglock);
-
-       return 0;
        /*
-        * In the case of the timer belonging to another task, after
-        * the task is unlocked, the timer is owned by the other task
-        * and may cease to exist at any time.  Don't use or modify
-        * new_timer after the unlock call.
+        * After unlocking sighand::siglock @new_timer is subject to
+        * concurrent removal and cannot be touched anymore
         */
+       return 0;
 out:
-       release_posix_timer(new_timer, it_id_set);
+       posix_timer_unhash_and_free(new_timer);
        return error;
 }
 
@@ -604,13 +557,6 @@ COMPAT_SYSCALL_DEFINE3(timer_create, clockid_t, which_clock,
 }
 #endif
 
-/*
- * Locking issues: We need to protect the result of the id look up until
- * we get the timer locked down so it is not deleted under us.  The
- * removal is done under the idr spinlock so we use that here to bridge
- * the find to the timer lock.  To avoid a dead lock, the timer id MUST
- * be release with out holding the timer lock.
- */
 static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
 {
        struct k_itimer *timr;
@@ -622,10 +568,35 @@ static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
        if ((unsigned long long)timer_id > INT_MAX)
                return NULL;
 
+       /*
+        * The hash lookup and the timers are RCU protected.
+        *
+        * Timers are added to the hash in invalid state where
+        * timr::it_signal == NULL. timer::it_signal is only set after the
+        * rest of the initialization succeeded.
+        *
+        * Timer destruction happens in steps:
+        *  1) Set timr::it_signal to NULL with timr::it_lock held
+        *  2) Release timr::it_lock
+        *  3) Remove from the hash under hash_lock
+        *  4) Call RCU for removal after the grace period
+        *
+        * Holding rcu_read_lock() across the lookup ensures that
+        * the timer cannot be freed.
+        *
+        * The lookup validates locklessly that timr::it_signal ==
+        * current::signal and timr::it_id == @timer_id. timr::it_id
+        * can't change, but timr::it_signal becomes NULL during
+        * destruction.
+        */
        rcu_read_lock();
        timr = posix_timer_by_id(timer_id);
        if (timr) {
                spin_lock_irqsave(&timr->it_lock, *flags);
+               /*
+                * Validate under timr::it_lock that timr::it_signal is
+                * still valid. Pairs with #1 above.
+                */
                if (timr->it_signal == current->signal) {
                        rcu_read_unlock();
                        return timr;
@@ -652,20 +623,16 @@ static s64 common_hrtimer_forward(struct k_itimer *timr, ktime_t now)
 }
 
 /*
- * Get the time remaining on a POSIX.1b interval timer.  This function
- * is ALWAYS called with spin_lock_irq on the timer, thus it must not
- * mess with irq.
+ * Get the time remaining on a POSIX.1b interval timer.
  *
- * We have a couple of messes to clean up here.  First there is the case
- * of a timer that has a requeue pending.  These timers should appear to
- * be in the timer list with an expiry as if we were to requeue them
- * now.
+ * Two issues to handle here:
  *
- * The second issue is the SIGEV_NONE timer which may be active but is
- * not really ever put in the timer list (to save system resources).
- * This timer may be expired, and if so, we will do it here.  Otherwise
- * it is the same as a requeue pending timer WRT to what we should
- * report.
+ *  1) The timer has a requeue pending. The return value must appear as
+ *     if the timer has been requeued right now.
+ *
+ *  2) The timer is a SIGEV_NONE timer. These timers are never enqueued
+ *     into the hrtimer queue and therefore never expired. Emulate expiry
+ *     here taking #1 into account.
  */
 void common_timer_get(struct k_itimer *timr, struct itimerspec64 *cur_setting)
 {
@@ -681,8 +648,12 @@ void common_timer_get(struct k_itimer *timr, struct itimerspec64 *cur_setting)
                cur_setting->it_interval = ktime_to_timespec64(iv);
        } else if (!timr->it_active) {
                /*
-                * SIGEV_NONE oneshot timers are never queued. Check them
-                * below.
+                * SIGEV_NONE oneshot timers are never queued and therefore
+                * timr->it_active is always false. The check below
+                * vs. remaining time will handle this case.
+                *
+                * For all other timers there is nothing to update here, so
+                * return.
                 */
                if (!sig_none)
                        return;
@@ -691,18 +662,29 @@ void common_timer_get(struct k_itimer *timr, struct itimerspec64 *cur_setting)
        now = kc->clock_get_ktime(timr->it_clock);
 
        /*
-        * When a requeue is pending or this is a SIGEV_NONE timer move the
-        * expiry time forward by intervals, so expiry is > now.
+        * If this is an interval timer and either has requeue pending or
+        * is a SIGEV_NONE timer move the expiry time forward by intervals,
+        * so expiry is > now.
         */
        if (iv && (timr->it_requeue_pending & REQUEUE_PENDING || sig_none))
                timr->it_overrun += kc->timer_forward(timr, now);
 
        remaining = kc->timer_remaining(timr, now);
-       /* Return 0 only, when the timer is expired and not pending */
+       /*
+        * As @now is retrieved before a possible timer_forward() and
+        * cannot be reevaluated by the compiler, @remaining is based on the
+        * same @now value. Therefore @remaining is consistent vs. @now.
+        *
+        * Consequently all interval timers, i.e. @iv > 0, cannot have a
+        * remaining time <= 0 because timer_forward() guarantees to move
+        * them forward so that the next timer expiry is > @now.
+        */
        if (remaining <= 0) {
                /*
-                * A single shot SIGEV_NONE timer must return 0, when
-                * it is expired !
+                * A single shot SIGEV_NONE timer must return 0, when it is
+                * expired! Timers which have a real signal delivery mode
+                * must return a remaining time greater than 0 because the
+                * signal has not yet been delivered.
                 */
                if (!sig_none)
                        cur_setting->it_value.tv_nsec = 1;
@@ -711,11 +693,10 @@ void common_timer_get(struct k_itimer *timr, struct itimerspec64 *cur_setting)
        }
 }
 
-/* Get the time remaining on a POSIX.1b interval timer. */
 static int do_timer_gettime(timer_t timer_id,  struct itimerspec64 *setting)
 {
-       struct k_itimer *timr;
        const struct k_clock *kc;
+       struct k_itimer *timr;
        unsigned long flags;
        int ret = 0;
 
@@ -765,20 +746,29 @@ SYSCALL_DEFINE2(timer_gettime32, timer_t, timer_id,
 
 #endif
 
-/*
- * Get the number of overruns of a POSIX.1b interval timer.  This is to
- * be the overrun of the timer last delivered.  At the same time we are
- * accumulating overruns on the next timer.  The overrun is frozen when
- * the signal is delivered, either at the notify time (if the info block
- * is not queued) or at the actual delivery time (as we are informed by
- * the call back to posixtimer_rearm().  So all we need to do is
- * to pick up the frozen overrun.
+/**
+ * sys_timer_getoverrun - Get the number of overruns of a POSIX.1b interval timer
+ * @timer_id:  The timer ID which identifies the timer
+ *
+ * The "overrun count" of a timer is one plus the number of expiration
+ * intervals which have elapsed between the first expiry, which queues the
+ * signal, and the actual signal delivery. On signal delivery the "overrun
+ * count" is calculated and cached, so it can be returned directly here.
+ *
+ * As this is relative to the last queued signal the returned overrun count
+ * is meaningless outside of the signal delivery path and even there it
+ * does not accurately reflect the current state when user space evaluates
+ * it.
+ *
+ * Returns:
+ *     -EINVAL         @timer_id is invalid
+ *     1..INT_MAX      The number of overruns related to the last delivered signal
  */
 SYSCALL_DEFINE1(timer_getoverrun, timer_t, timer_id)
 {
        struct k_itimer *timr;
-       int overrun;
        unsigned long flags;
+       int overrun;
 
        timr = lock_timer(timer_id, &flags);
        if (!timr)
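
A hypothetical user-space counterpart to the kernel-doc above: reading the cached overrun count from the handler of the timer's signal. The timer is assumed to have been created with sigev_value.sival_ptr pointing at its timer_t, as in the timer_create(2) example; handler() is illustrative only.

    #include <signal.h>
    #include <time.h>

    static void handler(int sig, siginfo_t *si, void *uctx)
    {
            timer_t tid = *(timer_t *)si->si_value.sival_ptr;

            /* Expirations merged into this one queued signal;
             * timer_getoverrun() is async-signal-safe.
             */
            int overruns = timer_getoverrun(tid);

            (void)overruns;
    }
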
@@ -831,10 +821,18 @@ static void common_timer_wait_running(struct k_itimer *timer)
 }
 
 /*
- * On PREEMPT_RT this prevent priority inversion against softirq kthread in
- * case it gets preempted while executing a timer callback. See comments in
- * hrtimer_cancel_wait_running. For PREEMPT_RT=n this just results in a
- * cpu_relax().
+ * On PREEMPT_RT this prevents priority inversion and a potential livelock
+ * against the ksoftirqd thread in case that ksoftirqd gets preempted while
+ * executing a hrtimer callback.
+ *
+ * See the comments in hrtimer_cancel_wait_running(). For PREEMPT_RT=n this
+ * just results in a cpu_relax().
+ *
+ * For POSIX CPU timers with CONFIG_POSIX_CPU_TIMERS_TASK_WORK=n this is
+ * just a cpu_relax(). With CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y this
+ * prevents spinning on an eventually scheduled out task and a livelock
+ * when the task which tries to delete or disarm the timer has preempted
+ * the task which runs the expiry in task work context.
  */
 static struct k_itimer *timer_wait_running(struct k_itimer *timer,
                                           unsigned long *flags)
@@ -943,8 +941,7 @@ SYSCALL_DEFINE4(timer_settime, timer_t, timer_id, int, flags,
                const struct __kernel_itimerspec __user *, new_setting,
                struct __kernel_itimerspec __user *, old_setting)
 {
-       struct itimerspec64 new_spec, old_spec;
-       struct itimerspec64 *rtn = old_setting ? &old_spec : NULL;
+       struct itimerspec64 new_spec, old_spec, *rtn;
        int error = 0;
 
        if (!new_setting)
@@ -953,6 +950,7 @@ SYSCALL_DEFINE4(timer_settime, timer_t, timer_id, int, flags,
        if (get_itimerspec64(&new_spec, new_setting))
                return -EFAULT;
 
+       rtn = old_setting ? &old_spec : NULL;
        error = do_timer_settime(timer_id, flags, &new_spec, rtn);
        if (!error && old_setting) {
                if (put_itimerspec64(&old_spec, old_setting))
@@ -1026,38 +1024,71 @@ retry_delete:
        list_del(&timer->list);
        spin_unlock(&current->sighand->siglock);
        /*
-        * This keeps any tasks waiting on the spin lock from thinking
-        * they got something (see the lock code above).
+        * A concurrent lookup could check timer::it_signal lockless. It
+        * will reevaluate with timer::it_lock held and observe the NULL.
         */
-       timer->it_signal = NULL;
+       WRITE_ONCE(timer->it_signal, NULL);
 
        unlock_timer(timer, flags);
-       release_posix_timer(timer, IT_ID_SET);
+       posix_timer_unhash_and_free(timer);
        return 0;
 }
 
 /*
- * return timer owned by the process, used by exit_itimers
+ * Delete a timer if it is armed, remove it from the hash and schedule it
+ * for RCU freeing.
  */
 static void itimer_delete(struct k_itimer *timer)
 {
-retry_delete:
-       spin_lock_irq(&timer->it_lock);
+       unsigned long flags;
 
+       /*
+        * irqsave is required to make timer_wait_running() work.
+        */
+       spin_lock_irqsave(&timer->it_lock, flags);
+
+retry_delete:
+       /*
+        * Even if the timer is no longer accessible from other tasks
+        * it still might be armed and queued in the underlying timer
+        * mechanism. Worse, that timer mechanism might run the expiry
+        * function concurrently.
+        */
        if (timer_delete_hook(timer) == TIMER_RETRY) {
-               spin_unlock_irq(&timer->it_lock);
+               /*
+                * Timer is expired concurrently, prevent livelocks
+                * and pointless spinning on RT.
+                *
+                * timer_wait_running() drops timer::it_lock, which opens
+                * the possibility for another task to delete the timer.
+                *
+                * That's not possible here because this is invoked from
+                * do_exit() only for the last thread of the thread group.
+                * So no other task can access and delete that timer.
+                */
+               if (WARN_ON_ONCE(timer_wait_running(timer, &flags) != timer))
+                       return;
+
                goto retry_delete;
        }
        list_del(&timer->list);
 
-       spin_unlock_irq(&timer->it_lock);
-       release_posix_timer(timer, IT_ID_SET);
+       /*
+        * Setting timer::it_signal to NULL is technically not required
+        * here as nothing can access the timer anymore legitimately via
+        * the hash table. Set it to NULL nevertheless so that all deletion
+        * paths are consistent.
+        */
+       WRITE_ONCE(timer->it_signal, NULL);
+
+       spin_unlock_irqrestore(&timer->it_lock, flags);
+       posix_timer_unhash_and_free(timer);
 }
 
 /*
- * This is called by do_exit or de_thread, only when nobody else can
- * modify the signal->posix_timers list. Yet we need sighand->siglock
- * to prevent the race with /proc/pid/timers.
+ * Invoked from do_exit() when the last thread of a thread group exits.
+ * At that point no other task can access the timers of the dying
+ * task anymore.
  */
 void exit_itimers(struct task_struct *tsk)
 {
@@ -1067,10 +1098,12 @@ void exit_itimers(struct task_struct *tsk)
        if (list_empty(&tsk->signal->posix_timers))
                return;
 
+       /* Protect against concurrent read via /proc/$PID/timers */
        spin_lock_irq(&tsk->sighand->siglock);
        list_replace_init(&tsk->signal->posix_timers, &timers);
        spin_unlock_irq(&tsk->sighand->siglock);
 
+       /* The timers are no longer accessible via tsk::signal */
        while (!list_empty(&timers)) {
                tmr = list_first_entry(&timers, struct k_itimer, list);
                itimer_delete(tmr);
@@ -1089,6 +1122,10 @@ SYSCALL_DEFINE2(clock_settime, const clockid_t, which_clock,
        if (get_timespec64(&new_tp, tp))
                return -EFAULT;
 
+       /*
+        * Permission checks have to be done inside the clock specific
+        * setter callback.
+        */
        return kc->clock_set(which_clock, &new_tp);
 }
 
@@ -1139,6 +1176,79 @@ SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock,
        return err;
 }
 
+/**
+ * sys_clock_getres - Get the resolution of a clock
+ * @which_clock:       The clock to get the resolution for
+ * @tp:                        Pointer to a user space timespec64 for storage
+ *
+ * POSIX defines:
+ *
+ * "The clock_getres() function shall return the resolution of any
+ * clock. Clock resolutions are implementation-defined and cannot be set by
+ * a process. If the argument res is not NULL, the resolution of the
+ * specified clock shall be stored in the location pointed to by res. If
+ * res is NULL, the clock resolution is not returned. If the time argument
+ * of clock_settime() is not a multiple of res, then the value is truncated
+ * to a multiple of res."
+ *
+ * Due to the various hardware constraints the real resolution can vary
+ * wildly and even change during runtime when the underlying devices are
+ * replaced. The kernel also can use hardware devices with different
+ * resolutions for reading the time and for arming timers.
+ *
+ * The kernel therefore deviates from the POSIX spec in various aspects:
+ *
+ * 1) The resolution returned to user space
+ *
+ *    For CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME, CLOCK_TAI,
+ *    CLOCK_REALTIME_ALARM, CLOCK_BOOTTIME_ALARM and CLOCK_MONOTONIC_RAW
+ *    the kernel differentiates only two cases:
+ *
+ *    I)  Low resolution mode:
+ *
+ *       When high resolution timers are disabled at compile or runtime
+ *       the resolution returned is nanoseconds per tick, which represents
+ *       the precision at which timers expire.
+ *
+ *    II) High resolution mode:
+ *
+ *       When high resolution timers are enabled the resolution returned
+ *       is always one nanosecond independent of the actual resolution of
+ *       the underlying hardware devices.
+ *
+ *       For CLOCK_*_ALARM the actual resolution depends on system
+ *       state. When the system is running the resolution is the same as the
+ *       resolution of the other clocks. During suspend the actual
+ *       resolution is the resolution of the underlying RTC device which
+ *       might be way less precise than the clockevent device used during
+ *       running state.
+ *
+ *   For CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE the resolution
+ *   returned is always nanoseconds per tick.
+ *
+ *   For CLOCK_PROCESS_CPUTIME and CLOCK_THREAD_CPUTIME the resolution
+ *   returned is always one nanosecond under the assumption that the
+ *   underlying scheduler clock has a better resolution than nanoseconds
+ *   per tick.
+ *
+ *   For dynamic POSIX clocks (PTP devices) the resolution returned is
+ *   always one nanosecond.
+ *
+ * 2) Effect on sys_clock_settime()
+ *
+ *    The kernel does not truncate the time which is handed in to
+ *    sys_clock_settime(). The kernel internal timekeeping is always using
+ *    nanoseconds precision independent of the clocksource device which is
+ *    used to read the time from. The resolution of that device only
+ *    affects the precision of the time returned by sys_clock_gettime().
+ *
+ * Returns:
+ *     0               Success. @tp contains the resolution
+ *     -EINVAL         @which_clock is not a valid clock ID
+ *     -EFAULT         Copying the resolution to @tp faulted
+ *     -ENODEV         Dynamic POSIX clock is not backed by a device
+ *     -EOPNOTSUPP     Dynamic POSIX clock does not support getres()
+ */
 SYSCALL_DEFINE2(clock_getres, const clockid_t, which_clock,
                struct __kernel_timespec __user *, tp)
 {
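
A small user-space probe of the behaviour the kernel-doc above describes; on a kernel with high resolution timers enabled it typically reports 1 ns for CLOCK_MONOTONIC and nanoseconds-per-tick for CLOCK_MONOTONIC_COARSE:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
            struct timespec res;

            if (!clock_getres(CLOCK_MONOTONIC, &res))
                    printf("CLOCK_MONOTONIC:        %ld ns\n", res.tv_nsec);
            if (!clock_getres(CLOCK_MONOTONIC_COARSE, &res))
                    printf("CLOCK_MONOTONIC_COARSE: %ld ns\n", res.tv_nsec);
            return 0;
    }
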
@@ -1230,7 +1340,7 @@ SYSCALL_DEFINE2(clock_getres_time32, clockid_t, which_clock,
 #endif
 
 /*
- * nanosleep for monotonic and realtime clocks
+ * sys_clock_nanosleep() for CLOCK_REALTIME and CLOCK_TAI
  */
 static int common_nsleep(const clockid_t which_clock, int flags,
                         const struct timespec64 *rqtp)
@@ -1242,8 +1352,13 @@ static int common_nsleep(const clockid_t which_clock, int flags,
                                 which_clock);
 }
 
+/*
+ * sys_clock_nanosleep() for CLOCK_MONOTONIC and CLOCK_BOOTTIME
+ *
+ * Absolute nanosleeps for these clocks are time-namespace adjusted.
+ */
 static int common_nsleep_timens(const clockid_t which_clock, int flags,
-                        const struct timespec64 *rqtp)
+                               const struct timespec64 *rqtp)
 {
        ktime_t texp = timespec64_to_ktime(*rqtp);
 
index 8464c5a..68d6c11 100644 (file)
@@ -64,7 +64,7 @@ static struct clock_data cd ____cacheline_aligned = {
        .actual_read_sched_clock = jiffy_sched_clock_read,
 };
 
-static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
+static __always_inline u64 cyc_to_ns(u64 cyc, u32 mult, u32 shift)
 {
        return (cyc * mult) >> shift;
 }
@@ -77,26 +77,36 @@ notrace struct clock_read_data *sched_clock_read_begin(unsigned int *seq)
 
 notrace int sched_clock_read_retry(unsigned int seq)
 {
-       return read_seqcount_latch_retry(&cd.seq, seq);
+       return raw_read_seqcount_latch_retry(&cd.seq, seq);
 }
 
-unsigned long long notrace sched_clock(void)
+unsigned long long noinstr sched_clock_noinstr(void)
 {
-       u64 cyc, res;
-       unsigned int seq;
        struct clock_read_data *rd;
+       unsigned int seq;
+       u64 cyc, res;
 
        do {
-               rd = sched_clock_read_begin(&seq);
+               seq = raw_read_seqcount_latch(&cd.seq);
+               rd = cd.read_data + (seq & 1);
 
                cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
                      rd->sched_clock_mask;
                res = rd->epoch_ns + cyc_to_ns(cyc, rd->mult, rd->shift);
-       } while (sched_clock_read_retry(seq));
+       } while (raw_read_seqcount_latch_retry(&cd.seq, seq));
 
        return res;
 }
 
+unsigned long long notrace sched_clock(void)
+{
+       unsigned long long ns;
+       preempt_disable_notrace();
+       ns = sched_clock_noinstr();
+       preempt_enable_notrace();
+       return ns;
+}
+
 /*
  * Updating the data required to read the clock.
  *
index 42c0be3..4df14db 100644 (file)
@@ -1041,7 +1041,7 @@ static bool report_idle_softirq(void)
                        return false;
        }
 
-       if (ratelimit < 10)
+       if (ratelimit >= 10)
                return false;
 
        /* On RT, softirqs handling may be waiting on some lock */
index 09d5949..266d028 100644 (file)
@@ -450,7 +450,7 @@ static __always_inline u64 __ktime_get_fast_ns(struct tk_fast *tkf)
                tkr = tkf->base + (seq & 0x01);
                now = ktime_to_ns(tkr->base);
                now += fast_tk_get_delta_ns(tkr);
-       } while (read_seqcount_latch_retry(&tkf->seq, seq));
+       } while (raw_read_seqcount_latch_retry(&tkf->seq, seq));
 
        return now;
 }
@@ -566,7 +566,7 @@ static __always_inline u64 __ktime_get_real_fast(struct tk_fast *tkf, u64 *mono)
                basem = ktime_to_ns(tkr->base);
                baser = ktime_to_ns(tkr->base_real);
                delta = fast_tk_get_delta_ns(tkr);
-       } while (read_seqcount_latch_retry(&tkf->seq, seq));
+       } while (raw_read_seqcount_latch_retry(&tkf->seq, seq));
 
        if (mono)
                *mono = basem + delta;
index 7646684..6a77edb 100644 (file)
@@ -5743,7 +5743,7 @@ bool ftrace_filter_param __initdata;
 static int __init set_ftrace_notrace(char *str)
 {
        ftrace_filter_param = true;
-       strlcpy(ftrace_notrace_buf, str, FTRACE_FILTER_SIZE);
+       strscpy(ftrace_notrace_buf, str, FTRACE_FILTER_SIZE);
        return 1;
 }
 __setup("ftrace_notrace=", set_ftrace_notrace);
@@ -5751,7 +5751,7 @@ __setup("ftrace_notrace=", set_ftrace_notrace);
 static int __init set_ftrace_filter(char *str)
 {
        ftrace_filter_param = true;
-       strlcpy(ftrace_filter_buf, str, FTRACE_FILTER_SIZE);
+       strscpy(ftrace_filter_buf, str, FTRACE_FILTER_SIZE);
        return 1;
 }
 __setup("ftrace_filter=", set_ftrace_filter);
@@ -5763,14 +5763,14 @@ static int ftrace_graph_set_hash(struct ftrace_hash *hash, char *buffer);
 
 static int __init set_graph_function(char *str)
 {
-       strlcpy(ftrace_graph_buf, str, FTRACE_FILTER_SIZE);
+       strscpy(ftrace_graph_buf, str, FTRACE_FILTER_SIZE);
        return 1;
 }
 __setup("ftrace_graph_filter=", set_graph_function);
 
 static int __init set_graph_notrace_function(char *str)
 {
-       strlcpy(ftrace_graph_notrace_buf, str, FTRACE_FILTER_SIZE);
+       strscpy(ftrace_graph_notrace_buf, str, FTRACE_FILTER_SIZE);
        return 1;
 }
 __setup("ftrace_graph_notrace=", set_graph_notrace_function);
@@ -6569,8 +6569,8 @@ static int ftrace_get_trampoline_kallsym(unsigned int symnum,
                        continue;
                *value = op->trampoline;
                *type = 't';
-               strlcpy(name, FTRACE_TRAMPOLINE_SYM, KSYM_NAME_LEN);
-               strlcpy(module_name, FTRACE_TRAMPOLINE_MOD, MODULE_NAME_LEN);
+               strscpy(name, FTRACE_TRAMPOLINE_SYM, KSYM_NAME_LEN);
+               strscpy(module_name, FTRACE_TRAMPOLINE_MOD, MODULE_NAME_LEN);
                *exported = 0;
                return 0;
        }
@@ -6933,7 +6933,7 @@ ftrace_func_address_lookup(struct ftrace_mod_map *mod_map,
                if (off)
                        *off = addr - found_func->ip;
                if (sym)
-                       strlcpy(sym, found_func->name, KSYM_NAME_LEN);
+                       strscpy(sym, found_func->name, KSYM_NAME_LEN);
 
                return found_func->name;
        }
@@ -6987,8 +6987,8 @@ int ftrace_mod_get_kallsym(unsigned int symnum, unsigned long *value,
 
                        *value = mod_func->ip;
                        *type = 'T';
-                       strlcpy(name, mod_func->name, KSYM_NAME_LEN);
-                       strlcpy(module_name, mod_map->mod->name, MODULE_NAME_LEN);
+                       strscpy(name, mod_func->name, KSYM_NAME_LEN);
+                       strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN);
                        *exported = 1;
                        preempt_enable();
                        return 0;
index 64a4dde..074d0b2 100644 (file)
@@ -199,7 +199,7 @@ static int boot_snapshot_index;
 
 static int __init set_cmdline_ftrace(char *str)
 {
-       strlcpy(bootup_tracer_buf, str, MAX_TRACER_SIZE);
+       strscpy(bootup_tracer_buf, str, MAX_TRACER_SIZE);
        default_bootup_tracer = bootup_tracer_buf;
        /* We are using ftrace early, expand it */
        ring_buffer_expanded = true;
@@ -284,7 +284,7 @@ static char trace_boot_options_buf[MAX_TRACER_SIZE] __initdata;
 
 static int __init set_trace_boot_options(char *str)
 {
-       strlcpy(trace_boot_options_buf, str, MAX_TRACER_SIZE);
+       strscpy(trace_boot_options_buf, str, MAX_TRACER_SIZE);
        return 1;
 }
 __setup("trace_options=", set_trace_boot_options);
@@ -294,7 +294,7 @@ static char *trace_boot_clock __initdata;
 
 static int __init set_trace_boot_clock(char *str)
 {
-       strlcpy(trace_boot_clock_buf, str, MAX_TRACER_SIZE);
+       strscpy(trace_boot_clock_buf, str, MAX_TRACER_SIZE);
        trace_boot_clock = trace_boot_clock_buf;
        return 1;
 }
@@ -2546,7 +2546,7 @@ static void __trace_find_cmdline(int pid, char comm[])
        if (map != NO_CMDLINE_MAP) {
                tpid = savedcmd->map_cmdline_to_pid[map];
                if (tpid == pid) {
-                       strlcpy(comm, get_saved_cmdlines(map), TASK_COMM_LEN);
+                       strscpy(comm, get_saved_cmdlines(map), TASK_COMM_LEN);
                        return;
                }
        }
@@ -5199,7 +5199,7 @@ static const struct file_operations tracing_fops = {
        .open           = tracing_open,
        .read           = seq_read,
        .read_iter      = seq_read_iter,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = copy_splice_read,
        .write          = tracing_write_stub,
        .llseek         = tracing_lseek,
        .release        = tracing_release,
index 57e539d..5d6ae4e 100644 (file)
@@ -2833,7 +2833,7 @@ static __init int setup_trace_triggers(char *str)
        char *buf;
        int i;
 
-       strlcpy(bootup_trigger_buf, str, COMMAND_LINE_SIZE);
+       strscpy(bootup_trigger_buf, str, COMMAND_LINE_SIZE);
        ring_buffer_expanded = true;
        disable_tracing_selftest("running event triggers");
 
@@ -3623,7 +3623,7 @@ static char bootup_event_buf[COMMAND_LINE_SIZE] __initdata;
 
 static __init int setup_trace_event(char *str)
 {
-       strlcpy(bootup_event_buf, str, COMMAND_LINE_SIZE);
+       strscpy(bootup_event_buf, str, COMMAND_LINE_SIZE);
        ring_buffer_expanded = true;
        disable_tracing_selftest("running event tracing");
 
index d6b4935..abe805d 100644 (file)
@@ -217,7 +217,7 @@ static int parse_entry(char *str, struct trace_event_call *call, void **pentry)
                        char *addr = (char *)(unsigned long) val;
 
                        if (field->filter_type == FILTER_STATIC_STRING) {
-                               strlcpy(entry + field->offset, addr, field->size);
+                               strscpy(entry + field->offset, addr, field->size);
                        } else if (field->filter_type == FILTER_DYN_STRING ||
                                   field->filter_type == FILTER_RDYN_STRING) {
                                int str_len = strlen(addr) + 1;
@@ -232,7 +232,7 @@ static int parse_entry(char *str, struct trace_event_call *call, void **pentry)
                                }
                                entry = *pentry;
 
-                               strlcpy(entry + (entry_size - str_len), addr, str_len);
+                               strscpy(entry + (entry_size - str_len), addr, str_len);
                                str_item = (u32 *)(entry + field->offset);
                                if (field->filter_type == FILTER_RDYN_STRING)
                                        str_loc -= field->offset + field->size;
index 8df0550..0536db7 100644 (file)
@@ -498,7 +498,7 @@ static int user_event_enabler_write(struct user_event_mm *mm,
                return -EBUSY;
 
        ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
-                                   &page, NULL, NULL);
+                                   &page, NULL);
 
        if (unlikely(ret <= 0)) {
                if (!fixup_fault)
index 59cda19..1b3fa7b 100644 (file)
@@ -30,7 +30,7 @@ static char kprobe_boot_events_buf[COMMAND_LINE_SIZE] __initdata;
 
 static int __init set_kprobe_boot_events(char *str)
 {
-       strlcpy(kprobe_boot_events_buf, str, COMMAND_LINE_SIZE);
+       strscpy(kprobe_boot_events_buf, str, COMMAND_LINE_SIZE);
        disable_tracing_selftest("running kprobe events");
 
        return 1;
index 2d26166..73055ba 100644 (file)
@@ -254,7 +254,7 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup,
                        trace_probe_log_err(offset, GROUP_TOO_LONG);
                        return -EINVAL;
                }
-               strlcpy(buf, event, slash - event + 1);
+               strscpy(buf, event, slash - event + 1);
                if (!is_good_system_name(buf)) {
                        trace_probe_log_err(offset, BAD_GROUP_NAME);
                        return -EINVAL;
index 60aa9e7..41088c5 100644 (file)
@@ -544,7 +544,8 @@ static int proc_cap_handler(struct ctl_table *table, int write,
        return 0;
 }
 
-struct ctl_table usermodehelper_table[] = {
+#if defined(CONFIG_SYSCTL)
+static struct ctl_table usermodehelper_table[] = {
        {
                .procname       = "bset",
                .data           = &usermodehelper_bset,
@@ -561,3 +562,11 @@ struct ctl_table usermodehelper_table[] = {
        },
        { }
 };
+
+static int __init init_umh_sysctls(void)
+{
+       register_sysctl_init("kernel/usermodehelper", usermodehelper_table);
+       return 0;
+}
+early_initcall(init_umh_sysctls);
+#endif /* CONFIG_SYSCTL */
index e91cb4c..d0b6b39 100644 (file)
@@ -42,7 +42,7 @@ MODULE_AUTHOR("Red Hat, Inc.");
 static inline bool lock_wqueue(struct watch_queue *wqueue)
 {
        spin_lock_bh(&wqueue->lock);
-       if (unlikely(wqueue->defunct)) {
+       if (unlikely(!wqueue->pipe)) {
                spin_unlock_bh(&wqueue->lock);
                return false;
        }
@@ -104,9 +104,6 @@ static bool post_one_notification(struct watch_queue *wqueue,
        unsigned int head, tail, mask, note, offset, len;
        bool done = false;
 
-       if (!pipe)
-               return false;
-
        spin_lock_irq(&pipe->rd_wait.lock);
 
        mask = pipe->ring_size - 1;
@@ -603,8 +600,11 @@ void watch_queue_clear(struct watch_queue *wqueue)
        rcu_read_lock();
        spin_lock_bh(&wqueue->lock);
 
-       /* Prevent new notifications from being stored. */
-       wqueue->defunct = true;
+       /*
+        * This pipe can be freed by callers like free_pipe_info().
+        * Removing this reference also prevents new notifications.
+        */
+       wqueue->pipe = NULL;
 
        while (!hlist_empty(&wqueue->watches)) {
                watch = hlist_entry(wqueue->watches.first, struct watch, queue_node);
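
A sketch of the calling pattern this change relies on (an assumption mirroring the callers in kernel/watch_queue.c, where lock_wqueue() and post_one_notification() are file-local): clearing wqueue->pipe under wqueue->lock now serves both as the old "defunct" flag and as the guarantee that the pipe cannot be freed out from under a poster.

static void example_post(struct watch_queue *wqueue,
			 struct watch_notification *n)
{
	if (lock_wqueue(wqueue)) {
		/* wqueue->pipe stays non-NULL for as long as wqueue->lock is held. */
		post_one_notification(wqueue, n);
		unlock_wqueue(wqueue);
	}
}
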
index 8e61f21..be38276 100644 (file)
 
 static DEFINE_MUTEX(watchdog_mutex);
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HAVE_NMI_WATCHDOG)
-# define WATCHDOG_DEFAULT      (SOFT_WATCHDOG_ENABLED | NMI_WATCHDOG_ENABLED)
-# define NMI_WATCHDOG_DEFAULT  1
+#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
+# define WATCHDOG_HARDLOCKUP_DEFAULT   1
 #else
-# define WATCHDOG_DEFAULT      (SOFT_WATCHDOG_ENABLED)
-# define NMI_WATCHDOG_DEFAULT  0
+# define WATCHDOG_HARDLOCKUP_DEFAULT   0
 #endif
 
 unsigned long __read_mostly watchdog_enabled;
 int __read_mostly watchdog_user_enabled = 1;
-int __read_mostly nmi_watchdog_user_enabled = NMI_WATCHDOG_DEFAULT;
-int __read_mostly soft_watchdog_user_enabled = 1;
+static int __read_mostly watchdog_hardlockup_user_enabled = WATCHDOG_HARDLOCKUP_DEFAULT;
+static int __read_mostly watchdog_softlockup_user_enabled = 1;
 int __read_mostly watchdog_thresh = 10;
-static int __read_mostly nmi_watchdog_available;
+static int __read_mostly watchdog_hardlockup_available;
 
 struct cpumask watchdog_cpumask __read_mostly;
 unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask);
@@ -68,7 +66,7 @@ unsigned int __read_mostly hardlockup_panic =
  */
 void __init hardlockup_detector_disable(void)
 {
-       nmi_watchdog_user_enabled = 0;
+       watchdog_hardlockup_user_enabled = 0;
 }
 
 static int __init hardlockup_panic_setup(char *str)
@@ -78,54 +76,163 @@ static int __init hardlockup_panic_setup(char *str)
        else if (!strncmp(str, "nopanic", 7))
                hardlockup_panic = 0;
        else if (!strncmp(str, "0", 1))
-               nmi_watchdog_user_enabled = 0;
+               watchdog_hardlockup_user_enabled = 0;
        else if (!strncmp(str, "1", 1))
-               nmi_watchdog_user_enabled = 1;
+               watchdog_hardlockup_user_enabled = 1;
        return 1;
 }
 __setup("nmi_watchdog=", hardlockup_panic_setup);
 
 #endif /* CONFIG_HARDLOCKUP_DETECTOR */
 
-/*
- * These functions can be overridden if an architecture implements its
- * own hardlockup detector.
- *
- * watchdog_nmi_enable/disable can be implemented to start and stop when
- * softlockup watchdog start and stop. The arch must select the
- * SOFTLOCKUP_DETECTOR Kconfig.
- */
-int __weak watchdog_nmi_enable(unsigned int cpu)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER)
+
+static DEFINE_PER_CPU(atomic_t, hrtimer_interrupts);
+static DEFINE_PER_CPU(int, hrtimer_interrupts_saved);
+static DEFINE_PER_CPU(bool, watchdog_hardlockup_warned);
+static DEFINE_PER_CPU(bool, watchdog_hardlockup_touched);
+static unsigned long watchdog_hardlockup_all_cpu_dumped;
+
+notrace void arch_touch_nmi_watchdog(void)
 {
-       hardlockup_detector_perf_enable();
-       return 0;
+       /*
+        * Using __raw here because some code paths have
+        * preemption enabled.  If preemption is enabled
+        * then interrupts should be enabled too, in which
+        * case we shouldn't have to worry about the watchdog
+        * going off.
+        */
+       raw_cpu_write(watchdog_hardlockup_touched, true);
+}
+EXPORT_SYMBOL(arch_touch_nmi_watchdog);
+
+void watchdog_hardlockup_touch_cpu(unsigned int cpu)
+{
+       per_cpu(watchdog_hardlockup_touched, cpu) = true;
 }
 
-void __weak watchdog_nmi_disable(unsigned int cpu)
+static bool is_hardlockup(unsigned int cpu)
 {
-       hardlockup_detector_perf_disable();
+       int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
+
+       if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
+               return true;
+
+       /*
+        * NOTE: we don't need any fancy atomic_t or READ_ONCE/WRITE_ONCE
+        * for hrtimer_interrupts_saved. hrtimer_interrupts_saved is
+        * written/read by a single CPU.
+        */
+       per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
+
+       return false;
 }
 
-/* Return 0, if a NMI watchdog is available. Error code otherwise */
-int __weak __init watchdog_nmi_probe(void)
+static void watchdog_hardlockup_kick(void)
 {
-       return hardlockup_detector_perf_init();
+       int new_interrupts;
+
+       new_interrupts = atomic_inc_return(this_cpu_ptr(&hrtimer_interrupts));
+       watchdog_buddy_check_hardlockup(new_interrupts);
+}
+
+void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
+{
+       if (per_cpu(watchdog_hardlockup_touched, cpu)) {
+               per_cpu(watchdog_hardlockup_touched, cpu) = false;
+               return;
+       }
+
+       /*
+        * Check for a hardlockup by making sure the CPU's timer
+        * interrupt is incrementing. The timer interrupt should have
+        * fired multiple times before we overflow'd. If it hasn't
+        * then this is a good indication the cpu is stuck
+        */
+       if (is_hardlockup(cpu)) {
+               unsigned int this_cpu = smp_processor_id();
+               struct cpumask backtrace_mask;
+
+               cpumask_copy(&backtrace_mask, cpu_online_mask);
+
+               /* Only print hardlockups once. */
+               if (per_cpu(watchdog_hardlockup_warned, cpu))
+                       return;
+
+               pr_emerg("Watchdog detected hard LOCKUP on cpu %d\n", cpu);
+               print_modules();
+               print_irqtrace_events(current);
+               if (cpu == this_cpu) {
+                       if (regs)
+                               show_regs(regs);
+                       else
+                               dump_stack();
+                       cpumask_clear_cpu(cpu, &backtrace_mask);
+               } else {
+                       if (trigger_single_cpu_backtrace(cpu))
+                               cpumask_clear_cpu(cpu, &backtrace_mask);
+               }
+
+               /*
+                * Perform multi-CPU dump only once to avoid multiple
+                * hardlockups generating interleaving traces
+                */
+               if (sysctl_hardlockup_all_cpu_backtrace &&
+                   !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped))
+                       trigger_cpumask_backtrace(&backtrace_mask);
+
+               if (hardlockup_panic)
+                       nmi_panic(regs, "Hard LOCKUP");
+
+               per_cpu(watchdog_hardlockup_warned, cpu) = true;
+       } else {
+               per_cpu(watchdog_hardlockup_warned, cpu) = false;
+       }
+}
+
+#else /* CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER */
+
+static inline void watchdog_hardlockup_kick(void) { }
+
+#endif /* !CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER */
+
+/*
+ * These functions can be overridden based on the configured hardlockup detector.
+ *
+ * watchdog_hardlockup_enable/disable can be implemented to start and stop when
+ * the softlockup watchdog starts and stops. The detector must select the
+ * SOFTLOCKUP_DETECTOR Kconfig.
+ */
+void __weak watchdog_hardlockup_enable(unsigned int cpu) { }
+
+void __weak watchdog_hardlockup_disable(unsigned int cpu) { }
+
+/*
+ * Watchdog-detector specific API.
+ *
+ * Return 0 when hardlockup watchdog is available, negative value otherwise.
+ * Note that the negative value means that a delayed probe might
+ * succeed later.
+ */
+int __weak __init watchdog_hardlockup_probe(void)
+{
+       return -ENODEV;
 }
 
 /**
- * watchdog_nmi_stop - Stop the watchdog for reconfiguration
+ * watchdog_hardlockup_stop - Stop the watchdog for reconfiguration
  *
  * The reconfiguration steps are:
- * watchdog_nmi_stop();
+ * watchdog_hardlockup_stop();
  * update_variables();
- * watchdog_nmi_start();
+ * watchdog_hardlockup_start();
  */
-void __weak watchdog_nmi_stop(void) { }
+void __weak watchdog_hardlockup_stop(void) { }
 
 /**
- * watchdog_nmi_start - Start the watchdog after reconfiguration
+ * watchdog_hardlockup_start - Start the watchdog after reconfiguration
  *
- * Counterpart to watchdog_nmi_stop().
+ * Counterpart to watchdog_hardlockup_stop().
  *
  * The following variables have been updated in update_variables() and
  * contain the currently valid configuration:
@@ -133,23 +240,23 @@ void __weak watchdog_nmi_stop(void) { }
  * - watchdog_thresh
  * - watchdog_cpumask
  */
-void __weak watchdog_nmi_start(void) { }
+void __weak watchdog_hardlockup_start(void) { }
 
 /**
  * lockup_detector_update_enable - Update the sysctl enable bit
  *
- * Caller needs to make sure that the NMI/perf watchdogs are off, so this
- * can't race with watchdog_nmi_disable().
+ * Caller needs to make sure that the hard watchdogs are off, so this
+ * can't race with watchdog_hardlockup_disable().
  */
 static void lockup_detector_update_enable(void)
 {
        watchdog_enabled = 0;
        if (!watchdog_user_enabled)
                return;
-       if (nmi_watchdog_available && nmi_watchdog_user_enabled)
-               watchdog_enabled |= NMI_WATCHDOG_ENABLED;
-       if (soft_watchdog_user_enabled)
-               watchdog_enabled |= SOFT_WATCHDOG_ENABLED;
+       if (watchdog_hardlockup_available && watchdog_hardlockup_user_enabled)
+               watchdog_enabled |= WATCHDOG_HARDLOCKUP_ENABLED;
+       if (watchdog_softlockup_user_enabled)
+               watchdog_enabled |= WATCHDOG_SOFTOCKUP_ENABLED;
 }
 
 #ifdef CONFIG_SOFTLOCKUP_DETECTOR
@@ -179,8 +286,6 @@ static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
 static DEFINE_PER_CPU(unsigned long, watchdog_report_ts);
 static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer);
 static DEFINE_PER_CPU(bool, softlockup_touch_sync);
-static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
-static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
 static unsigned long soft_lockup_nmi_warn;
 
 static int __init nowatchdog_setup(char *str)
@@ -192,7 +297,7 @@ __setup("nowatchdog", nowatchdog_setup);
 
 static int __init nosoftlockup_setup(char *str)
 {
-       soft_watchdog_user_enabled = 0;
+       watchdog_softlockup_user_enabled = 0;
        return 1;
 }
 __setup("nosoftlockup", nosoftlockup_setup);
@@ -306,7 +411,7 @@ static int is_softlockup(unsigned long touch_ts,
                         unsigned long period_ts,
                         unsigned long now)
 {
-       if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh){
+       if ((watchdog_enabled & WATCHDOG_SOFTOCKUP_ENABLED) && watchdog_thresh) {
                /* Warn about unreasonable delays. */
                if (time_after(now, period_ts + get_softlockup_thresh()))
                        return now - touch_ts;
@@ -315,22 +420,6 @@ static int is_softlockup(unsigned long touch_ts,
 }
 
 /* watchdog detector functions */
-bool is_hardlockup(void)
-{
-       unsigned long hrint = __this_cpu_read(hrtimer_interrupts);
-
-       if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
-               return true;
-
-       __this_cpu_write(hrtimer_interrupts_saved, hrint);
-       return false;
-}
-
-static void watchdog_interrupt_count(void)
-{
-       __this_cpu_inc(hrtimer_interrupts);
-}
-
 static DEFINE_PER_CPU(struct completion, softlockup_completion);
 static DEFINE_PER_CPU(struct cpu_stop_work, softlockup_stop_work);
 
@@ -361,8 +450,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
        if (!watchdog_enabled)
                return HRTIMER_NORESTART;
 
-       /* kick the hardlockup detector */
-       watchdog_interrupt_count();
+       watchdog_hardlockup_kick();
 
        /* kick the softlockup detector */
        if (completion_done(this_cpu_ptr(&softlockup_completion))) {
@@ -458,7 +546,7 @@ static void watchdog_enable(unsigned int cpu)
        complete(done);
 
        /*
-        * Start the timer first to prevent the NMI watchdog triggering
+        * Start the timer first to prevent the hardlockup watchdog triggering
         * before the timer has a chance to fire.
         */
        hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
@@ -468,9 +556,9 @@ static void watchdog_enable(unsigned int cpu)
 
        /* Initialize timestamp */
        update_touch_ts();
-       /* Enable the perf event */
-       if (watchdog_enabled & NMI_WATCHDOG_ENABLED)
-               watchdog_nmi_enable(cpu);
+       /* Enable the hardlockup detector */
+       if (watchdog_enabled & WATCHDOG_HARDLOCKUP_ENABLED)
+               watchdog_hardlockup_enable(cpu);
 }
 
 static void watchdog_disable(unsigned int cpu)
@@ -480,11 +568,11 @@ static void watchdog_disable(unsigned int cpu)
        WARN_ON_ONCE(cpu != smp_processor_id());
 
        /*
-        * Disable the perf event first. That prevents that a large delay
-        * between disabling the timer and disabling the perf event causes
-        * the perf NMI to detect a false positive.
+        * Disable the hardlockup detector first. That prevents that a large
+        * delay between disabling the timer and disabling the hardlockup
+        * detector causes a false positive.
         */
-       watchdog_nmi_disable(cpu);
+       watchdog_hardlockup_disable(cpu);
        hrtimer_cancel(hrtimer);
        wait_for_completion(this_cpu_ptr(&softlockup_completion));
 }
@@ -540,7 +628,7 @@ int lockup_detector_offline_cpu(unsigned int cpu)
 static void __lockup_detector_reconfigure(void)
 {
        cpus_read_lock();
-       watchdog_nmi_stop();
+       watchdog_hardlockup_stop();
 
        softlockup_stop_all();
        set_sample_period();
@@ -548,7 +636,7 @@ static void __lockup_detector_reconfigure(void)
        if (watchdog_enabled && watchdog_thresh)
                softlockup_start_all();
 
-       watchdog_nmi_start();
+       watchdog_hardlockup_start();
        cpus_read_unlock();
        /*
         * Must be called outside the cpus locked section to prevent
@@ -589,9 +677,9 @@ static __init void lockup_detector_setup(void)
 static void __lockup_detector_reconfigure(void)
 {
        cpus_read_lock();
-       watchdog_nmi_stop();
+       watchdog_hardlockup_stop();
        lockup_detector_update_enable();
-       watchdog_nmi_start();
+       watchdog_hardlockup_start();
        cpus_read_unlock();
 }
 void lockup_detector_reconfigure(void)
@@ -646,14 +734,14 @@ static void proc_watchdog_update(void)
 /*
  * common function for watchdog, nmi_watchdog and soft_watchdog parameter
  *
- * caller             | table->data points to      | 'which'
- * -------------------|----------------------------|--------------------------
- * proc_watchdog      | watchdog_user_enabled      | NMI_WATCHDOG_ENABLED |
- *                    |                            | SOFT_WATCHDOG_ENABLED
- * -------------------|----------------------------|--------------------------
- * proc_nmi_watchdog  | nmi_watchdog_user_enabled  | NMI_WATCHDOG_ENABLED
- * -------------------|----------------------------|--------------------------
- * proc_soft_watchdog | soft_watchdog_user_enabled | SOFT_WATCHDOG_ENABLED
+ * caller             | table->data points to            | 'which'
+ * -------------------|----------------------------------|-------------------------------
+ * proc_watchdog      | watchdog_user_enabled            | WATCHDOG_HARDLOCKUP_ENABLED |
+ *                    |                                  | WATCHDOG_SOFTOCKUP_ENABLED
+ * -------------------|----------------------------------|-------------------------------
+ * proc_nmi_watchdog  | watchdog_hardlockup_user_enabled | WATCHDOG_HARDLOCKUP_ENABLED
+ * -------------------|----------------------------------|-------------------------------
+ * proc_soft_watchdog | watchdog_softlockup_user_enabled | WATCHDOG_SOFTOCKUP_ENABLED
  */
 static int proc_watchdog_common(int which, struct ctl_table *table, int write,
                                void *buffer, size_t *lenp, loff_t *ppos)
@@ -685,7 +773,8 @@ static int proc_watchdog_common(int which, struct ctl_table *table, int write,
 int proc_watchdog(struct ctl_table *table, int write,
                  void *buffer, size_t *lenp, loff_t *ppos)
 {
-       return proc_watchdog_common(NMI_WATCHDOG_ENABLED|SOFT_WATCHDOG_ENABLED,
+       return proc_watchdog_common(WATCHDOG_HARDLOCKUP_ENABLED |
+                                   WATCHDOG_SOFTOCKUP_ENABLED,
                                    table, write, buffer, lenp, ppos);
 }
 
@@ -695,9 +784,9 @@ int proc_watchdog(struct ctl_table *table, int write,
 int proc_nmi_watchdog(struct ctl_table *table, int write,
                      void *buffer, size_t *lenp, loff_t *ppos)
 {
-       if (!nmi_watchdog_available && write)
+       if (!watchdog_hardlockup_available && write)
                return -ENOTSUPP;
-       return proc_watchdog_common(NMI_WATCHDOG_ENABLED,
+       return proc_watchdog_common(WATCHDOG_HARDLOCKUP_ENABLED,
                                    table, write, buffer, lenp, ppos);
 }
 
@@ -707,7 +796,7 @@ int proc_nmi_watchdog(struct ctl_table *table, int write,
 int proc_soft_watchdog(struct ctl_table *table, int write,
                        void *buffer, size_t *lenp, loff_t *ppos)
 {
-       return proc_watchdog_common(SOFT_WATCHDOG_ENABLED,
+       return proc_watchdog_common(WATCHDOG_SOFTOCKUP_ENABLED,
                                    table, write, buffer, lenp, ppos);
 }
 
@@ -774,15 +863,6 @@ static struct ctl_table watchdog_sysctls[] = {
                .extra2         = (void *)&sixty,
        },
        {
-               .procname       = "nmi_watchdog",
-               .data           = &nmi_watchdog_user_enabled,
-               .maxlen         = sizeof(int),
-               .mode           = NMI_WATCHDOG_SYSCTL_PERM,
-               .proc_handler   = proc_nmi_watchdog,
-               .extra1         = SYSCTL_ZERO,
-               .extra2         = SYSCTL_ONE,
-       },
-       {
                .procname       = "watchdog_cpumask",
                .data           = &watchdog_cpumask_bits,
                .maxlen         = NR_CPUS,
@@ -792,7 +872,7 @@ static struct ctl_table watchdog_sysctls[] = {
 #ifdef CONFIG_SOFTLOCKUP_DETECTOR
        {
                .procname       = "soft_watchdog",
-               .data           = &soft_watchdog_user_enabled,
+               .data           = &watchdog_softlockup_user_enabled,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = proc_soft_watchdog,
@@ -845,14 +925,90 @@ static struct ctl_table watchdog_sysctls[] = {
        {}
 };
 
+static struct ctl_table watchdog_hardlockup_sysctl[] = {
+       {
+               .procname       = "nmi_watchdog",
+               .data           = &watchdog_hardlockup_user_enabled,
+               .maxlen         = sizeof(int),
+               .mode           = 0444,
+               .proc_handler   = proc_nmi_watchdog,
+               .extra1         = SYSCTL_ZERO,
+               .extra2         = SYSCTL_ONE,
+       },
+       {}
+};
+
 static void __init watchdog_sysctl_init(void)
 {
        register_sysctl_init("kernel", watchdog_sysctls);
+
+       if (watchdog_hardlockup_available)
+               watchdog_hardlockup_sysctl[0].mode = 0644;
+       register_sysctl_init("kernel", watchdog_hardlockup_sysctl);
 }
+
 #else
 #define watchdog_sysctl_init() do { } while (0)
 #endif /* CONFIG_SYSCTL */
 
+static void __init lockup_detector_delay_init(struct work_struct *work);
+static bool allow_lockup_detector_init_retry __initdata;
+
+static struct work_struct detector_work __initdata =
+               __WORK_INITIALIZER(detector_work, lockup_detector_delay_init);
+
+static void __init lockup_detector_delay_init(struct work_struct *work)
+{
+       int ret;
+
+       ret = watchdog_hardlockup_probe();
+       if (ret) {
+               pr_info("Delayed init of the lockup detector failed: %d\n", ret);
+               pr_info("Hard watchdog permanently disabled\n");
+               return;
+       }
+
+       allow_lockup_detector_init_retry = false;
+
+       watchdog_hardlockup_available = true;
+       lockup_detector_setup();
+}
+
+/*
+ * lockup_detector_retry_init - retry lockup detector init if possible.
+ *
+ * Retry the hardlockup detector init. This is useful when the detector
+ * requires functionality that is only initialized later on a particular
+ * platform.
+ */
+void __init lockup_detector_retry_init(void)
+{
+       /* Must be called before late init calls */
+       if (!allow_lockup_detector_init_retry)
+               return;
+
+       schedule_work(&detector_work);
+}
+
+/*
+ * Ensure that the optional delayed hardlockup init has completed before
+ * the init code and memory are freed.
+ */
+static int __init lockup_detector_check(void)
+{
+       /* Prevent any later retry. */
+       allow_lockup_detector_init_retry = false;
+
+       /* Make sure no work is pending. */
+       flush_work(&detector_work);
+
+       watchdog_sysctl_init();
+
+       return 0;
+
+}
+late_initcall_sync(lockup_detector_check);
+
 void __init lockup_detector_init(void)
 {
        if (tick_nohz_full_enabled())
@@ -861,8 +1017,10 @@ void __init lockup_detector_init(void)
        cpumask_copy(&watchdog_cpumask,
                     housekeeping_cpumask(HK_TYPE_TIMER));
 
-       if (!watchdog_nmi_probe())
-               nmi_watchdog_available = true;
+       if (!watchdog_hardlockup_probe())
+               watchdog_hardlockup_available = true;
+       else
+               allow_lockup_detector_init_retry = true;
+
        lockup_detector_setup();
-       watchdog_sysctl_init();
 }
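
A hypothetical sketch of the new retry path (names prefixed "example_" are assumptions): a platform whose perf NMI only becomes usable after its PMU driver has probed can ask the core to re-run watchdog_hardlockup_probe() through the delayed-init work item.

#include <linux/init.h>
#include <linux/nmi.h>

static int __init example_pmu_ready(void)
{
	/* ... PMU bring-up that watchdog_hardlockup_probe() depends on ... */

	/* No-op unless the early probe failed and retries are still allowed. */
	lockup_detector_retry_init();
	return 0;
}
device_initcall(example_pmu_ready);
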
diff --git a/kernel/watchdog_buddy.c b/kernel/watchdog_buddy.c
new file mode 100644 (file)
index 0000000..34dbfe0
--- /dev/null
@@ -0,0 +1,113 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/kernel.h>
+#include <linux/nmi.h>
+#include <linux/percpu-defs.h>
+
+static cpumask_t __read_mostly watchdog_cpus;
+
+static unsigned int watchdog_next_cpu(unsigned int cpu)
+{
+       unsigned int next_cpu;
+
+       next_cpu = cpumask_next(cpu, &watchdog_cpus);
+       if (next_cpu >= nr_cpu_ids)
+               next_cpu = cpumask_first(&watchdog_cpus);
+
+       if (next_cpu == cpu)
+               return nr_cpu_ids;
+
+       return next_cpu;
+}
+
+int __init watchdog_hardlockup_probe(void)
+{
+       return 0;
+}
+
+void watchdog_hardlockup_enable(unsigned int cpu)
+{
+       unsigned int next_cpu;
+
+       /*
+        * The new CPU will be marked online before the hrtimer interrupt
+        * gets a chance to run on it. If another CPU tests for a
+        * hardlockup on the new CPU before it has run its hrtimer
+        * interrupt, it will get a false positive. Touch the watchdog on
+        * the new CPU to delay the check for at least 3 sampling periods
+        * to guarantee one hrtimer has run on the new CPU.
+        */
+       watchdog_hardlockup_touch_cpu(cpu);
+
+       /*
+        * We are going to check the next CPU. Our watchdog_hrtimer
+        * need not be zero if the CPU has already been online earlier.
+        * Touch the watchdog on the next CPU to avoid a false positive
+        * if we try to check it in less than 3 interrupts.
+        */
+       next_cpu = watchdog_next_cpu(cpu);
+       if (next_cpu < nr_cpu_ids)
+               watchdog_hardlockup_touch_cpu(next_cpu);
+
+       /*
+        * Make sure that the watchdog is touched on this CPU before
+        * other CPUs can see it in watchdog_cpus. The counterpart
+        * is in watchdog_buddy_check_hardlockup().
+        */
+       smp_wmb();
+
+       cpumask_set_cpu(cpu, &watchdog_cpus);
+}
+
+void watchdog_hardlockup_disable(unsigned int cpu)
+{
+       unsigned int next_cpu = watchdog_next_cpu(cpu);
+
+       /*
+        * Offlining this CPU will cause the CPU before this one to start
+        * checking the one after this one. If this CPU just finished checking
+        * the next CPU and updating hrtimer_interrupts_saved, and then the
+        * previous CPU checks it within one sample period, it will trigger a
+        * false positive. Touch the watchdog on the next CPU to prevent it.
+        */
+       if (next_cpu < nr_cpu_ids)
+               watchdog_hardlockup_touch_cpu(next_cpu);
+
+       /*
+        * Make sure that the watchdog is touched on the next CPU before
+        * this CPU disappears from watchdog_cpus. The counterpart is in
+        * watchdog_buddy_check_hardlockup().
+        */
+       smp_wmb();
+
+       cpumask_clear_cpu(cpu, &watchdog_cpus);
+}
+
+void watchdog_buddy_check_hardlockup(int hrtimer_interrupts)
+{
+       unsigned int next_cpu;
+
+       /*
+        * Test for hardlockups every 3 samples. The sample period is
+        *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
+        *  watchdog_thresh (over by 20%).
+        */
+       if (hrtimer_interrupts % 3 != 0)
+               return;
+
+       /* check for a hardlockup on the next CPU */
+       next_cpu = watchdog_next_cpu(smp_processor_id());
+       if (next_cpu >= nr_cpu_ids)
+               return;
+
+       /*
+        * Make sure that the watchdog was touched on the next CPU if
+        * watchdog_next_cpu() returned a different CPU because of
+        * a change in watchdog_hardlockup_enable()/disable().
+        */
+       smp_rmb();
+
+       watchdog_hardlockup_check(next_cpu, NULL);
+}
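
A worked example of the cadence described above: with the default watchdog_thresh of 10 seconds, the softlockup sample period is 10 * 2 / 5 = 4 seconds, so each CPU checks its buddy on every third sample, i.e. every 12 seconds, 20% past the nominal hardlockup threshold. And with CPUs 0, 2 and 5 set in watchdog_cpus, CPU 0 checks CPU 2, CPU 2 checks CPU 5, and CPU 5 wraps around to check CPU 0.
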
similarity index 72%
rename from kernel/watchdog_hld.c
rename to kernel/watchdog_perf.c
index 247bf0b..8ea00c4 100644 (file)
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
- * Detect hard lockups on a system
+ * Detect hard lockups on a system using perf
  *
  * started by Don Zickus, Copyright (C) 2010 Red Hat, Inc.
  *
 #include <asm/irq_regs.h>
 #include <linux/perf_event.h>
 
-static DEFINE_PER_CPU(bool, hard_watchdog_warn);
-static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
 static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
 static DEFINE_PER_CPU(struct perf_event *, dead_event);
 static struct cpumask dead_events_mask;
 
-static unsigned long hardlockup_allcpu_dumped;
 static atomic_t watchdog_cpus = ATOMIC_INIT(0);
 
-notrace void arch_touch_nmi_watchdog(void)
-{
-       /*
-        * Using __raw here because some code paths have
-        * preemption enabled.  If preemption is enabled
-        * then interrupts should be enabled too, in which
-        * case we shouldn't have to worry about the watchdog
-        * going off.
-        */
-       raw_cpu_write(watchdog_nmi_touch, true);
-}
-EXPORT_SYMBOL(arch_touch_nmi_watchdog);
-
 #ifdef CONFIG_HARDLOCKUP_CHECK_TIMESTAMP
 static DEFINE_PER_CPU(ktime_t, last_timestamp);
 static DEFINE_PER_CPU(unsigned int, nmi_rearmed);
@@ -114,61 +98,24 @@ static void watchdog_overflow_callback(struct perf_event *event,
        /* Ensure the watchdog never gets throttled */
        event->hw.interrupts = 0;
 
-       if (__this_cpu_read(watchdog_nmi_touch) == true) {
-               __this_cpu_write(watchdog_nmi_touch, false);
-               return;
-       }
-
        if (!watchdog_check_timestamp())
                return;
 
-       /* check for a hardlockup
-        * This is done by making sure our timer interrupt
-        * is incrementing.  The timer interrupt should have
-        * fired multiple times before we overflow'd.  If it hasn't
-        * then this is a good indication the cpu is stuck
-        */
-       if (is_hardlockup()) {
-               int this_cpu = smp_processor_id();
-
-               /* only print hardlockups once */
-               if (__this_cpu_read(hard_watchdog_warn) == true)
-                       return;
-
-               pr_emerg("Watchdog detected hard LOCKUP on cpu %d\n",
-                        this_cpu);
-               print_modules();
-               print_irqtrace_events(current);
-               if (regs)
-                       show_regs(regs);
-               else
-                       dump_stack();
-
-               /*
-                * Perform all-CPU dump only once to avoid multiple hardlockups
-                * generating interleaving traces
-                */
-               if (sysctl_hardlockup_all_cpu_backtrace &&
-                               !test_and_set_bit(0, &hardlockup_allcpu_dumped))
-                       trigger_allbutself_cpu_backtrace();
-
-               if (hardlockup_panic)
-                       nmi_panic(regs, "Hard LOCKUP");
-
-               __this_cpu_write(hard_watchdog_warn, true);
-               return;
-       }
-
-       __this_cpu_write(hard_watchdog_warn, false);
-       return;
+       watchdog_hardlockup_check(smp_processor_id(), regs);
 }
 
 static int hardlockup_detector_event_create(void)
 {
-       unsigned int cpu = smp_processor_id();
+       unsigned int cpu;
        struct perf_event_attr *wd_attr;
        struct perf_event *evt;
 
+       /*
+        * Preemption is not disabled because memory will be allocated.
+        * Ensure CPU-locality by calling this from a per-CPU kthread.
+        */
+       WARN_ON(!is_percpu_thread());
+       cpu = raw_smp_processor_id();
        wd_attr = &wd_hw_attr;
        wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);
 
@@ -185,10 +132,14 @@ static int hardlockup_detector_event_create(void)
 }
 
 /**
- * hardlockup_detector_perf_enable - Enable the local event
+ * watchdog_hardlockup_enable - Enable the local event
+ *
+ * @cpu: The CPU to enable hard lockup detection on.
  */
-void hardlockup_detector_perf_enable(void)
+void watchdog_hardlockup_enable(unsigned int cpu)
 {
+       WARN_ON_ONCE(cpu != smp_processor_id());
+
        if (hardlockup_detector_event_create())
                return;
 
@@ -200,12 +151,16 @@ void hardlockup_detector_perf_enable(void)
 }
 
 /**
- * hardlockup_detector_perf_disable - Disable the local event
+ * watchdog_hardlockup_disable - Disable the local event
+ *
+ * @cpu: The CPU to disable hard lockup detection on.
  */
-void hardlockup_detector_perf_disable(void)
+void watchdog_hardlockup_disable(unsigned int cpu)
 {
        struct perf_event *event = this_cpu_read(watchdog_ev);
 
+       WARN_ON_ONCE(cpu != smp_processor_id());
+
        if (event) {
                perf_event_disable(event);
                this_cpu_write(watchdog_ev, NULL);
@@ -268,7 +223,7 @@ void __init hardlockup_detector_perf_restart(void)
 
        lockdep_assert_cpus_held();
 
-       if (!(watchdog_enabled & NMI_WATCHDOG_ENABLED))
+       if (!(watchdog_enabled & WATCHDOG_HARDLOCKUP_ENABLED))
                return;
 
        for_each_online_cpu(cpu) {
@@ -279,12 +234,22 @@ void __init hardlockup_detector_perf_restart(void)
        }
 }
 
+bool __weak __init arch_perf_nmi_is_available(void)
+{
+       return true;
+}
+
 /**
- * hardlockup_detector_perf_init - Probe whether NMI event is available at all
+ * watchdog_hardlockup_probe - Probe whether NMI event is available at all
  */
-int __init hardlockup_detector_perf_init(void)
+int __init watchdog_hardlockup_probe(void)
 {
-       int ret = hardlockup_detector_event_create();
+       int ret;
+
+       if (!arch_perf_nmi_is_available())
+               return -ENODEV;
+
+       ret = hardlockup_detector_event_create();
 
        if (ret) {
                pr_info("Perf NMI watchdog permanently disabled\n");
index 4666a1a..02a8f40 100644 (file)
@@ -126,6 +126,12 @@ enum {
  *    cpu or grabbing pool->lock is enough for read access.  If
  *    POOL_DISASSOCIATED is set, it's identical to L.
  *
+ * K: Only modified by worker while holding pool->lock. Can be safely read by
+ *    self, while holding pool->lock or from IRQ context if %current is the
+ *    kworker.
+ *
+ * S: Only modified by worker self.
+ *
  * A: wq_pool_attach_mutex protected.
  *
  * PL: wq_pool_mutex protected.
@@ -200,6 +206,22 @@ struct worker_pool {
 };
 
 /*
+ * Per-pool_workqueue statistics. These can be monitored using
+ * tools/workqueue/wq_monitor.py.
+ */
+enum pool_workqueue_stats {
+       PWQ_STAT_STARTED,       /* work items started execution */
+       PWQ_STAT_COMPLETED,     /* work items completed execution */
+       PWQ_STAT_CPU_TIME,      /* total CPU time consumed */
+       PWQ_STAT_CPU_INTENSIVE, /* wq_cpu_intensive_thresh_us violations */
+       PWQ_STAT_CM_WAKEUP,     /* concurrency-management worker wakeups */
+       PWQ_STAT_MAYDAY,        /* maydays to rescuer */
+       PWQ_STAT_RESCUED,       /* linked work items executed by rescuer */
+
+       PWQ_NR_STATS,
+};
+
+/*
  * The per-pool workqueue.  While queued, the lower WORK_STRUCT_FLAG_BITS
  * of work_struct->data are used for flags and the remaining high bits
  * point to the pwq; thus, pwqs need to be aligned at two's power of the
@@ -236,6 +258,8 @@ struct pool_workqueue {
        struct list_head        pwqs_node;      /* WR: node on wq->pwqs */
        struct list_head        mayday_node;    /* MD: node on wq->maydays */
 
+       u64                     stats[PWQ_NR_STATS];
+
        /*
         * Release of unbound pwq is punted to system_wq.  See put_pwq()
         * and pwq_unbound_release_workfn() for details.  pool_workqueue
@@ -310,6 +334,14 @@ static struct kmem_cache *pwq_cache;
 static cpumask_var_t *wq_numa_possible_cpumask;
                                        /* possible CPUs of each node */
 
+/*
+ * Per-cpu work items which run for longer than the following threshold are
+ * automatically considered CPU intensive and excluded from concurrency
+ * management to prevent them from noticeably delaying other per-cpu work items.
+ */
+static unsigned long wq_cpu_intensive_thresh_us = 10000;
+module_param_named(cpu_intensive_thresh_us, wq_cpu_intensive_thresh_us, ulong, 0644);
+
 static bool wq_disable_numa;
 module_param_named(disable_numa, wq_disable_numa, bool, 0444);
 
@@ -705,12 +737,17 @@ static void clear_work_data(struct work_struct *work)
        set_work_data(work, WORK_STRUCT_NO_POOL, 0);
 }
 
+static inline struct pool_workqueue *work_struct_pwq(unsigned long data)
+{
+       return (struct pool_workqueue *)(data & WORK_STRUCT_WQ_DATA_MASK);
+}
+
 static struct pool_workqueue *get_work_pwq(struct work_struct *work)
 {
        unsigned long data = atomic_long_read(&work->data);
 
        if (data & WORK_STRUCT_PWQ)
-               return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
+               return work_struct_pwq(data);
        else
                return NULL;
 }
@@ -738,8 +775,7 @@ static struct worker_pool *get_work_pool(struct work_struct *work)
        assert_rcu_or_pool_mutex();
 
        if (data & WORK_STRUCT_PWQ)
-               return ((struct pool_workqueue *)
-                       (data & WORK_STRUCT_WQ_DATA_MASK))->pool;
+               return work_struct_pwq(data)->pool;
 
        pool_id = data >> WORK_OFFQ_POOL_SHIFT;
        if (pool_id == WORK_OFFQ_POOL_NONE)
@@ -760,8 +796,7 @@ static int get_work_pool_id(struct work_struct *work)
        unsigned long data = atomic_long_read(&work->data);
 
        if (data & WORK_STRUCT_PWQ)
-               return ((struct pool_workqueue *)
-                       (data & WORK_STRUCT_WQ_DATA_MASK))->pool->id;
+               return work_struct_pwq(data)->pool->id;
 
        return data >> WORK_OFFQ_POOL_SHIFT;
 }
@@ -864,6 +899,152 @@ static void wake_up_worker(struct worker_pool *pool)
 }
 
 /**
+ * worker_set_flags - set worker flags and adjust nr_running accordingly
+ * @worker: self
+ * @flags: flags to set
+ *
+ * Set @flags in @worker->flags and adjust nr_running accordingly.
+ *
+ * CONTEXT:
+ * raw_spin_lock_irq(pool->lock)
+ */
+static inline void worker_set_flags(struct worker *worker, unsigned int flags)
+{
+       struct worker_pool *pool = worker->pool;
+
+       WARN_ON_ONCE(worker->task != current);
+
+       /* If transitioning into NOT_RUNNING, adjust nr_running. */
+       if ((flags & WORKER_NOT_RUNNING) &&
+           !(worker->flags & WORKER_NOT_RUNNING)) {
+               pool->nr_running--;
+       }
+
+       worker->flags |= flags;
+}
+
+/**
+ * worker_clr_flags - clear worker flags and adjust nr_running accordingly
+ * @worker: self
+ * @flags: flags to clear
+ *
+ * Clear @flags in @worker->flags and adjust nr_running accordingly.
+ *
+ * CONTEXT:
+ * raw_spin_lock_irq(pool->lock)
+ */
+static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
+{
+       struct worker_pool *pool = worker->pool;
+       unsigned int oflags = worker->flags;
+
+       WARN_ON_ONCE(worker->task != current);
+
+       worker->flags &= ~flags;
+
+       /*
+        * If transitioning out of NOT_RUNNING, increment nr_running.  Note
+        * that the nested NOT_RUNNING is not a noop.  NOT_RUNNING is mask
+        * of multiple flags, not a single flag.
+        */
+       if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING))
+               if (!(worker->flags & WORKER_NOT_RUNNING))
+                       pool->nr_running++;
+}
+
+#ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT
+
+/*
+ * Concurrency-managed per-cpu work items that hog CPU for longer than
+ * wq_cpu_intensive_thresh_us trigger the automatic CPU_INTENSIVE mechanism,
+ * which prevents them from stalling other concurrency-managed work items. If a
+ * work function keeps triggering this mechanism, it's likely that the work item
+ * should be using an unbound workqueue instead.
+ *
+ * wq_cpu_intensive_report() tracks work functions which trigger such conditions
+ * and reports them so that they can be examined and converted to use unbound
+ * workqueues as appropriate. To avoid flooding the console, each violating work
+ * function is tracked and reported with exponential backoff.
+ */
+#define WCI_MAX_ENTS 128
+
+struct wci_ent {
+       work_func_t             func;
+       atomic64_t              cnt;
+       struct hlist_node       hash_node;
+};
+
+static struct wci_ent wci_ents[WCI_MAX_ENTS];
+static int wci_nr_ents;
+static DEFINE_RAW_SPINLOCK(wci_lock);
+static DEFINE_HASHTABLE(wci_hash, ilog2(WCI_MAX_ENTS));
+
+static struct wci_ent *wci_find_ent(work_func_t func)
+{
+       struct wci_ent *ent;
+
+       hash_for_each_possible_rcu(wci_hash, ent, hash_node,
+                                  (unsigned long)func) {
+               if (ent->func == func)
+                       return ent;
+       }
+       return NULL;
+}
+
+static void wq_cpu_intensive_report(work_func_t func)
+{
+       struct wci_ent *ent;
+
+restart:
+       ent = wci_find_ent(func);
+       if (ent) {
+               u64 cnt;
+
+               /*
+                * Start reporting from the fourth time and back off
+                * exponentially.
+                */
+               cnt = atomic64_inc_return_relaxed(&ent->cnt);
+               if (cnt >= 4 && is_power_of_2(cnt))
+                       printk_deferred(KERN_WARNING "workqueue: %ps hogged CPU for >%luus %llu times, consider switching to WQ_UNBOUND\n",
+                                       ent->func, wq_cpu_intensive_thresh_us,
+                                       atomic64_read(&ent->cnt));
+               return;
+       }
+
+       /*
+        * @func is a new violation. Allocate a new entry for it. If wci_ents[]
+        * is exhausted, something went really wrong and we probably made enough
+        * noise already.
+        */
+       if (wci_nr_ents >= WCI_MAX_ENTS)
+               return;
+
+       raw_spin_lock(&wci_lock);
+
+       if (wci_nr_ents >= WCI_MAX_ENTS) {
+               raw_spin_unlock(&wci_lock);
+               return;
+       }
+
+       if (wci_find_ent(func)) {
+               raw_spin_unlock(&wci_lock);
+               goto restart;
+       }
+
+       ent = &wci_ents[wci_nr_ents++];
+       ent->func = func;
+       atomic64_set(&ent->cnt, 1);
+       hash_add_rcu(wci_hash, &ent->hash_node, (unsigned long)func);
+
+       raw_spin_unlock(&wci_lock);
+}
+
+#else  /* CONFIG_WQ_CPU_INTENSIVE_REPORT */
+static void wq_cpu_intensive_report(work_func_t func) {}
+#endif /* CONFIG_WQ_CPU_INTENSIVE_REPORT */
+
+/**
  * wq_worker_running - a worker is running again
  * @task: task waking up
  *
@@ -873,7 +1054,7 @@ void wq_worker_running(struct task_struct *task)
 {
        struct worker *worker = kthread_data(task);
 
-       if (!worker->sleeping)
+       if (!READ_ONCE(worker->sleeping))
                return;
 
        /*
@@ -886,7 +1067,14 @@ void wq_worker_running(struct task_struct *task)
        if (!(worker->flags & WORKER_NOT_RUNNING))
                worker->pool->nr_running++;
        preempt_enable();
-       worker->sleeping = 0;
+
+       /*
+        * CPU intensive auto-detection cares about how long a work item hogged
+        * CPU without sleeping. Reset the starting timestamp on wakeup.
+        */
+       worker->current_at = worker->task->se.sum_exec_runtime;
+
+       WRITE_ONCE(worker->sleeping, 0);
 }
 
 /**
@@ -912,10 +1100,10 @@ void wq_worker_sleeping(struct task_struct *task)
        pool = worker->pool;
 
        /* Return if preempted before wq_worker_running() was reached */
-       if (worker->sleeping)
+       if (READ_ONCE(worker->sleeping))
                return;
 
-       worker->sleeping = 1;
+       WRITE_ONCE(worker->sleeping, 1);
        raw_spin_lock_irq(&pool->lock);
 
        /*
@@ -929,12 +1117,66 @@ void wq_worker_sleeping(struct task_struct *task)
        }
 
        pool->nr_running--;
-       if (need_more_worker(pool))
+       if (need_more_worker(pool)) {
+               worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;
                wake_up_worker(pool);
+       }
        raw_spin_unlock_irq(&pool->lock);
 }
 
 /**
+ * wq_worker_tick - a scheduler tick occurred while a kworker is running
+ * @task: task currently running
+ *
+ * Called from scheduler_tick(). We're in the IRQ context and the current
+ * worker's fields which follow the 'K' locking rule can be accessed safely.
+ */
+void wq_worker_tick(struct task_struct *task)
+{
+       struct worker *worker = kthread_data(task);
+       struct pool_workqueue *pwq = worker->current_pwq;
+       struct worker_pool *pool = worker->pool;
+
+       if (!pwq)
+               return;
+
+       pwq->stats[PWQ_STAT_CPU_TIME] += TICK_USEC;
+
+       if (!wq_cpu_intensive_thresh_us)
+               return;
+
+       /*
+        * If the current worker is concurrency managed and hogged the CPU for
+        * longer than wq_cpu_intensive_thresh_us, it's automatically marked
+        * CPU_INTENSIVE to avoid stalling other concurrency-managed work items.
+        *
+        * If @worker->sleeping is set, @worker is in the process of
+        * switching out voluntarily and won't be contributing to
+        * @pool->nr_running until it wakes up. As wq_worker_sleeping() also
+        * decrements ->nr_running, setting CPU_INTENSIVE here can lead to
+        * double decrements. The task is releasing the CPU anyway. Let's skip.
+        * We probably want to make this prettier in the future.
+        */
+       if ((worker->flags & WORKER_NOT_RUNNING) || READ_ONCE(worker->sleeping) ||
+           worker->task->se.sum_exec_runtime - worker->current_at <
+           wq_cpu_intensive_thresh_us * NSEC_PER_USEC)
+               return;
+
+       raw_spin_lock(&pool->lock);
+
+       worker_set_flags(worker, WORKER_CPU_INTENSIVE);
+       wq_cpu_intensive_report(worker->current_func);
+       pwq->stats[PWQ_STAT_CPU_INTENSIVE]++;
+
+       if (need_more_worker(pool)) {
+               pwq->stats[PWQ_STAT_CM_WAKEUP]++;
+               wake_up_worker(pool);
+       }
+
+       raw_spin_unlock(&pool->lock);
+}
+
+/**
  * wq_worker_last_func - retrieve worker's last work function
  * @task: Task to retrieve last work function of.
  *
@@ -966,60 +1208,6 @@ work_func_t wq_worker_last_func(struct task_struct *task)
 }
 
 /**
- * worker_set_flags - set worker flags and adjust nr_running accordingly
- * @worker: self
- * @flags: flags to set
- *
- * Set @flags in @worker->flags and adjust nr_running accordingly.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock)
- */
-static inline void worker_set_flags(struct worker *worker, unsigned int flags)
-{
-       struct worker_pool *pool = worker->pool;
-
-       WARN_ON_ONCE(worker->task != current);
-
-       /* If transitioning into NOT_RUNNING, adjust nr_running. */
-       if ((flags & WORKER_NOT_RUNNING) &&
-           !(worker->flags & WORKER_NOT_RUNNING)) {
-               pool->nr_running--;
-       }
-
-       worker->flags |= flags;
-}
-
-/**
- * worker_clr_flags - clear worker flags and adjust nr_running accordingly
- * @worker: self
- * @flags: flags to clear
- *
- * Clear @flags in @worker->flags and adjust nr_running accordingly.
- *
- * CONTEXT:
- * raw_spin_lock_irq(pool->lock)
- */
-static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
-{
-       struct worker_pool *pool = worker->pool;
-       unsigned int oflags = worker->flags;
-
-       WARN_ON_ONCE(worker->task != current);
-
-       worker->flags &= ~flags;
-
-       /*
-        * If transitioning out of NOT_RUNNING, increment nr_running.  Note
-        * that the nested NOT_RUNNING is not a noop.  NOT_RUNNING is mask
-        * of multiple flags, not a single flag.
-        */
-       if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING))
-               if (!(worker->flags & WORKER_NOT_RUNNING))
-                       pool->nr_running++;
-}
-
-/**
  * find_worker_executing_work - find worker which is executing a work
  * @pool: pool of interest
  * @work: work to find worker for
@@ -1539,6 +1727,8 @@ out:
  * We queue the work to a specific CPU, the caller must ensure it
  * can't go away.  Callers that fail to ensure that the specified
  * CPU cannot go away will execute on a randomly chosen CPU.
+ * But note well that callers specifying a CPU that has never been
+ * online will get a splat.
  *
  * Return: %false if @work was already on a queue, %true otherwise.
  */
@@ -2163,6 +2353,7 @@ static void send_mayday(struct work_struct *work)
                get_pwq(pwq);
                list_add_tail(&pwq->mayday_node, &wq->maydays);
                wake_up_process(wq->rescuer->task);
+               pwq->stats[PWQ_STAT_MAYDAY]++;
        }
 }
 
@@ -2300,7 +2491,6 @@ __acquires(&pool->lock)
 {
        struct pool_workqueue *pwq = get_work_pwq(work);
        struct worker_pool *pool = worker->pool;
-       bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
        unsigned long work_data;
        struct worker *collision;
 #ifdef CONFIG_LOCKDEP
@@ -2337,6 +2527,7 @@ __acquires(&pool->lock)
        worker->current_work = work;
        worker->current_func = work->func;
        worker->current_pwq = pwq;
+       worker->current_at = worker->task->se.sum_exec_runtime;
        work_data = *work_data_bits(work);
        worker->current_color = get_work_color(work_data);
 
@@ -2354,7 +2545,7 @@ __acquires(&pool->lock)
         * of concurrency management and the next code block will chain
         * execution of the pending work items.
         */
-       if (unlikely(cpu_intensive))
+       if (unlikely(pwq->wq->flags & WQ_CPU_INTENSIVE))
                worker_set_flags(worker, WORKER_CPU_INTENSIVE);
 
        /*
@@ -2401,6 +2592,7 @@ __acquires(&pool->lock)
         * workqueues), so hiding them isn't a problem.
         */
        lockdep_invariant_state(true);
+       pwq->stats[PWQ_STAT_STARTED]++;
        trace_workqueue_execute_start(work);
        worker->current_func(work);
        /*
@@ -2408,6 +2600,7 @@ __acquires(&pool->lock)
         * point will only record its address.
         */
        trace_workqueue_execute_end(work, worker->current_func);
+       pwq->stats[PWQ_STAT_COMPLETED]++;
        lock_map_release(&lockdep_map);
        lock_map_release(&pwq->wq->lockdep_map);
 
@@ -2432,9 +2625,12 @@ __acquires(&pool->lock)
 
        raw_spin_lock_irq(&pool->lock);
 
-       /* clear cpu intensive status */
-       if (unlikely(cpu_intensive))
-               worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
+       /*
+        * In addition to %WQ_CPU_INTENSIVE, @worker may also have been marked
+        * CPU intensive by wq_worker_tick() if @work hogged CPU longer than
+        * wq_cpu_intensive_thresh_us. Clear it.
+        */
+       worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
 
        /* tag the worker for identification in schedule() */
        worker->last_func = worker->current_func;
@@ -2651,6 +2847,7 @@ repeat:
                                if (first)
                                        pool->watchdog_ts = jiffies;
                                move_linked_works(work, scheduled, &n);
+                               pwq->stats[PWQ_STAT_RESCUED]++;
                        }
                        first = false;
                }
index e00b120..6b1d66e 100644 (file)
@@ -28,13 +28,18 @@ struct worker {
                struct hlist_node       hentry; /* L: while busy */
        };
 
-       struct work_struct      *current_work;  /* L: work being processed */
-       work_func_t             current_func;   /* L: current_work's fn */
-       struct pool_workqueue   *current_pwq;   /* L: current_work's pwq */
-       unsigned int            current_color;  /* L: current_work's color */
-       struct list_head        scheduled;      /* L: scheduled works */
+       struct work_struct      *current_work;  /* K: work being processed and its */
+       work_func_t             current_func;   /* K: function */
+       struct pool_workqueue   *current_pwq;   /* K: pwq */
+       u64                     current_at;     /* K: runtime at start or last wakeup */
+       unsigned int            current_color;  /* K: color */
+
+       int                     sleeping;       /* S: is worker sleeping? */
 
-       /* 64 bytes boundary on 64bit, 32 on 32bit */
+       /* used by the scheduler to determine a worker's last known identity */
+       work_func_t             last_func;      /* K: last work's fn */
+
+       struct list_head        scheduled;      /* L: scheduled works */
 
        struct task_struct      *task;          /* I: worker task */
        struct worker_pool      *pool;          /* A: the associated pool */
@@ -42,10 +47,9 @@ struct worker {
        struct list_head        node;           /* A: anchored at pool->workers */
                                                /* A: runs through worker->node */
 
-       unsigned long           last_active;    /* L: last active timestamp */
+       unsigned long           last_active;    /* K: last active timestamp */
        unsigned int            flags;          /* X: flags */
        int                     id;             /* I: worker id */
-       int                     sleeping;       /* None */
 
        /*
         * Opaque string set with work_set_desc().  Printed out with task
@@ -55,9 +59,6 @@ struct worker {
 
        /* used only by rescuers to point to the target workqueue */
        struct workqueue_struct *rescue_wq;     /* I: the workqueue to rescue */
-
-       /* used by the scheduler to determine a worker's last known identity */
-       work_func_t             last_func;
 };
 
 /**
@@ -76,6 +77,7 @@ static inline struct worker *current_wq_worker(void)
  */
 void wq_worker_running(struct task_struct *task);
 void wq_worker_sleeping(struct task_struct *task);
+void wq_worker_tick(struct task_struct *task);
 work_func_t wq_worker_last_func(struct task_struct *task);
 
 #endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
index ce51d4d..781f061 100644 (file)
@@ -1035,27 +1035,30 @@ config BOOTPARAM_SOFTLOCKUP_PANIC
 
          Say N if unsure.
 
-config HARDLOCKUP_DETECTOR_PERF
+config HAVE_HARDLOCKUP_DETECTOR_BUDDY
        bool
-       select SOFTLOCKUP_DETECTOR
+       depends on SMP
+       default y
 
 #
-# Enables a timestamp based low pass filter to compensate for perf based
-# hard lockup detection which runs too fast due to turbo modes.
+# Global switch whether to build a hardlockup detector at all. It is available
+# only when the architecture supports at least one implementation. There are
+# two exceptions. The hardlockup detector is never enabled on:
 #
-config HARDLOCKUP_CHECK_TIMESTAMP
-       bool
-
+#      s390: it reported many false positives there
 #
-# arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their own hard
-# lockup detector rather than the perf based detector.
+#      sparc64: has a custom implementation which is not using the common
+#              hardlockup command line options and sysctl interface.
 #
 config HARDLOCKUP_DETECTOR
        bool "Detect Hard Lockups"
-       depends on DEBUG_KERNEL && !S390
-       depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_ARCH
+       depends on DEBUG_KERNEL && !S390 && !HARDLOCKUP_DETECTOR_SPARC64
+       depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH
+       imply HARDLOCKUP_DETECTOR_PERF
+       imply HARDLOCKUP_DETECTOR_BUDDY
+       imply HARDLOCKUP_DETECTOR_ARCH
        select LOCKUP_DETECTOR
-       select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF
+
        help
          Say Y here to enable the kernel to act as a watchdog to detect
          hard lockups.
@@ -1065,6 +1068,63 @@ config HARDLOCKUP_DETECTOR
          chance to run.  The current stack trace is displayed upon detection
          and the system will stay locked up.
 
+#
+# Note that arch-specific variants are always preferred.
+#
+config HARDLOCKUP_DETECTOR_PREFER_BUDDY
+       bool "Prefer the buddy CPU hardlockup detector"
+       depends on HARDLOCKUP_DETECTOR
+       depends on HAVE_HARDLOCKUP_DETECTOR_PERF && HAVE_HARDLOCKUP_DETECTOR_BUDDY
+       depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH
+       help
+         Say Y here to prefer the buddy hardlockup detector over the perf one.
+
+         With the buddy detector, each CPU uses its softlockup hrtimer
+         to check that the next CPU is processing hrtimer interrupts by
+         verifying that a counter is increasing.
+
+         This hardlockup detector is useful on systems that don't have
+         an arch-specific hardlockup detector or if resources needed
+         for the hardlockup detector are better used for other things.
+
+config HARDLOCKUP_DETECTOR_PERF
+       bool
+       depends on HARDLOCKUP_DETECTOR
+       depends on HAVE_HARDLOCKUP_DETECTOR_PERF && !HARDLOCKUP_DETECTOR_PREFER_BUDDY
+       depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH
+       select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
+
+config HARDLOCKUP_DETECTOR_BUDDY
+       bool
+       depends on HARDLOCKUP_DETECTOR
+       depends on HAVE_HARDLOCKUP_DETECTOR_BUDDY
+       depends on !HAVE_HARDLOCKUP_DETECTOR_PERF || HARDLOCKUP_DETECTOR_PREFER_BUDDY
+       depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH
+       select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
+
+config HARDLOCKUP_DETECTOR_ARCH
+       bool
+       depends on HARDLOCKUP_DETECTOR
+       depends on HAVE_HARDLOCKUP_DETECTOR_ARCH
+       help
+         The arch-specific implementation of the hardlockup detector will
+         be used.
+
+#
+# Both the "perf" and "buddy" hardlockup detectors count hrtimer
+# interrupts. This config enables functions managing this common code.
+#
+config HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
+       bool
+       select SOFTLOCKUP_DETECTOR
+
+#
+# Enables a timestamp based low pass filter to compensate for perf based
+# hard lockup detection which runs too fast due to turbo modes.
+#
+config HARDLOCKUP_CHECK_TIMESTAMP
+       bool
+
 config BOOTPARAM_HARDLOCKUP_PANIC
        bool "Panic (Reboot) On Hard Lockups"
        depends on HARDLOCKUP_DETECTOR
@@ -1134,6 +1194,19 @@ config WQ_WATCHDOG
          state.  This can be configured through kernel parameter
          "workqueue.watchdog_thresh" and its sysfs counterpart.
 
+config WQ_CPU_INTENSIVE_REPORT
+       bool "Report per-cpu work items which hog CPU for too long"
+       depends on DEBUG_KERNEL
+       help
+         Say Y here to enable reporting of concurrency-managed per-cpu work
+         items that hog CPUs for longer than
+         workqueue.cpu_intensive_thresh_us. Workqueue automatically
+         detects and excludes them from concurrency management to prevent
+         them from stalling other per-cpu work items. Occasional
+         triggering may not necessarily indicate a problem. Repeated
+         triggering likely indicates that the work item should be switched
+         to use an unbound workqueue.
+
 config TEST_LOCKUP
        tristate "Test module to generate lockups"
        depends on m
@@ -2302,9 +2375,13 @@ config TEST_XARRAY
        tristate "Test the XArray code at runtime"
 
 config TEST_MAPLE_TREE
-       depends on DEBUG_KERNEL
-       select DEBUG_MAPLE_TREE
-       tristate "Test the Maple Tree code at runtime"
+       tristate "Test the Maple Tree code at runtime or module load"
+       help
+         Enable this option to test the maple tree code functions at boot, or
+         when the module is loaded. Enabling "Debug Maple Trees" will enable
+         more verbose output on failures.
+
+         If unsure, say N.
 
 config TEST_RHASHTABLE
        tristate "Perform selftest on resizable hash table"
@@ -2453,6 +2530,23 @@ config BITFIELD_KUNIT
 
          If unsure, say N.
 
+config CHECKSUM_KUNIT
+       tristate "KUnit test checksum functions at runtime" if !KUNIT_ALL_TESTS
+       depends on KUNIT
+       default KUNIT_ALL_TESTS
+       help
+         Enable this option to test the checksum functions at boot.
+
+         KUnit tests run during boot and output the results to the debug log
+         in TAP format (http://testanything.org/). Only useful for kernel devs
+         running the KUnit test harness, and not intended for inclusion into a
+         production build.
+
+         For more information on KUnit and unit tests in general please refer
+         to the KUnit documentation in Documentation/dev-tools/kunit/.
+
+         If unsure, say N.
+
 config HASH_KUNIT_TEST
        tristate "KUnit Test for integer hash functions" if !KUNIT_ALL_TESTS
        depends on KUNIT
@@ -2645,7 +2739,7 @@ config STACKINIT_KUNIT_TEST
 
 config FORTIFY_KUNIT_TEST
        tristate "Test fortified str*() and mem*() function internals at runtime" if !KUNIT_ALL_TESTS
-       depends on KUNIT && FORTIFY_SOURCE
+       depends on KUNIT
        default KUNIT_ALL_TESTS
        help
          Builds unit tests for checking internals of FORTIFY_SOURCE as used
@@ -2662,6 +2756,11 @@ config HW_BREAKPOINT_KUNIT_TEST
 
          If unsure, say N.
 
+config STRCAT_KUNIT_TEST
+       tristate "Test strcat() family of functions at runtime" if !KUNIT_ALL_TESTS
+       depends on KUNIT
+       default KUNIT_ALL_TESTS
+
 config STRSCPY_KUNIT_TEST
        tristate "Test strscpy*() family of functions at runtime" if !KUNIT_ALL_TESTS
        depends on KUNIT
index fd15230..efae7e0 100644 (file)
@@ -15,7 +15,6 @@ if UBSAN
 config UBSAN_TRAP
        bool "On Sanitizer warnings, abort the running kernel code"
        depends on !COMPILE_TEST
-       depends on $(cc-option, -fsanitize-undefined-trap-on-error)
        help
          Building kernels with Sanitizer features enabled tends to grow
          the kernel size by around 5%, due to adding all the debugging
@@ -27,16 +26,29 @@ config UBSAN_TRAP
          the system. For some system builders this is an acceptable
          trade-off.
 
-config CC_HAS_UBSAN_BOUNDS
-       def_bool $(cc-option,-fsanitize=bounds)
+config CC_HAS_UBSAN_BOUNDS_STRICT
+       def_bool $(cc-option,-fsanitize=bounds-strict)
+       help
+         The -fsanitize=bounds-strict option is only available with GCC.
+         It uses stricter handling of arrays, including knowledge of
+         flexible arrays, and is comparable to Clang's regular
+         -fsanitize=bounds.
 
 config CC_HAS_UBSAN_ARRAY_BOUNDS
        def_bool $(cc-option,-fsanitize=array-bounds)
+       help
+         Under Clang, the -fsanitize=bounds option is actually composed
+         of two more specific options, -fsanitize=array-bounds and
+         -fsanitize=local-bounds. However, -fsanitize=local-bounds can
+         only be used when trap mode is enabled. (See also the help for
+         CONFIG_LOCAL_BOUNDS.) Explicitly check for -fsanitize=array-bounds
+         so that we can build up the options needed for UBSAN_BOUNDS
+         with or without UBSAN_TRAP.
 
 config UBSAN_BOUNDS
        bool "Perform array index bounds checking"
        default UBSAN
-       depends on CC_HAS_UBSAN_ARRAY_BOUNDS || CC_HAS_UBSAN_BOUNDS
+       depends on CC_HAS_UBSAN_ARRAY_BOUNDS || CC_HAS_UBSAN_BOUNDS_STRICT
        help
          This option enables detection of directly indexed out of bounds
          array accesses, where the array size is known at compile time.
@@ -44,33 +56,26 @@ config UBSAN_BOUNDS
          to the {str,mem}*cpy() family of functions (that is addressed
          by CONFIG_FORTIFY_SOURCE).
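As a reference point, this is the class of bug UBSAN_BOUNDS targets; a minimal
illustrative snippet, not code from the tree:

        /* Illustrative only: a directly indexed out-of-bounds access on an
         * array whose size is known at compile time.  With UBSAN_BOUNDS the
         * compiler instruments the index and reports it at runtime (or traps,
         * when UBSAN_TRAP is also enabled).
         */
        static int table[16];

        int read_entry(int idx)
        {
                return table[idx];      /* e.g. idx == 20 trips the bounds check */
        }
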
 
-config UBSAN_ONLY_BOUNDS
-       def_bool CC_HAS_UBSAN_BOUNDS && !CC_HAS_UBSAN_ARRAY_BOUNDS
-       depends on UBSAN_BOUNDS
+config UBSAN_BOUNDS_STRICT
+       def_bool UBSAN_BOUNDS && CC_HAS_UBSAN_BOUNDS_STRICT
        help
-         This is a weird case: Clang's -fsanitize=bounds includes
-         -fsanitize=local-bounds, but it's trapping-only, so for
-         Clang, we must use -fsanitize=array-bounds when we want
-         traditional array bounds checking enabled. For GCC, we
-         want -fsanitize=bounds.
+         GCC's bounds sanitizer. This option is used to select the
+         correct options in Makefile.ubsan.
 
 config UBSAN_ARRAY_BOUNDS
-       def_bool CC_HAS_UBSAN_ARRAY_BOUNDS
-       depends on UBSAN_BOUNDS
+       def_bool UBSAN_BOUNDS && CC_HAS_UBSAN_ARRAY_BOUNDS
+       help
+         Clang's array bounds sanitizer. This option is used to select
+         the correct options in Makefile.ubsan.
 
 config UBSAN_LOCAL_BOUNDS
-       bool "Perform array local bounds checking"
-       depends on UBSAN_TRAP
-       depends on $(cc-option,-fsanitize=local-bounds)
-       help
-         This option enables -fsanitize=local-bounds which traps when an
-         exception/error is detected. Therefore, it may only be enabled
-         with CONFIG_UBSAN_TRAP.
-
-         Enabling this option detects errors due to accesses through a
-         pointer that is derived from an object of a statically-known size,
-         where an added offset (which may not be known statically) is
-         out-of-bounds.
+       def_bool UBSAN_ARRAY_BOUNDS && UBSAN_TRAP
+       help
+         This option enables Clang's -fsanitize=local-bounds, which traps
+         on accesses through a pointer derived from an object of a
+         statically-known size when the added offset (which may not be
+         known statically) is out-of-bounds. Since this option is
+         trap-only, it depends on CONFIG_UBSAN_TRAP.
 
 config UBSAN_SHIFT
        bool "Perform checking for bit-shift overflows"
index 876fcde..42d307a 100644 (file)
@@ -30,7 +30,7 @@ endif
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
         rbtree.o radix-tree.o timerqueue.o xarray.o \
         maple_tree.o idr.o extable.o irq_regs.o argv_split.o \
-        flex_proportions.o ratelimit.o show_mem.o \
+        flex_proportions.o ratelimit.o \
         is_single_threaded.o plist.o decompress.o kobject_uevent.o \
         earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
         nmi_backtrace.o win_minmax.o memcat_p.o \
@@ -377,6 +377,7 @@ obj-$(CONFIG_PLDMFW) += pldmfw/
 # KUnit tests
 CFLAGS_bitfield_kunit.o := $(DISABLE_STRUCTLEAK_PLUGIN)
 obj-$(CONFIG_BITFIELD_KUNIT) += bitfield_kunit.o
+obj-$(CONFIG_CHECKSUM_KUNIT) += checksum_kunit.o
 obj-$(CONFIG_LIST_KUNIT_TEST) += list-test.o
 obj-$(CONFIG_HASHTABLE_KUNIT_TEST) += hashtable_test.o
 obj-$(CONFIG_LINEAR_RANGES_TEST) += test_linear_ranges.o
@@ -392,6 +393,7 @@ obj-$(CONFIG_STACKINIT_KUNIT_TEST) += stackinit_kunit.o
 CFLAGS_fortify_kunit.o += $(call cc-disable-warning, unsequenced)
 CFLAGS_fortify_kunit.o += $(DISABLE_STRUCTLEAK_PLUGIN)
 obj-$(CONFIG_FORTIFY_KUNIT_TEST) += fortify_kunit.o
+obj-$(CONFIG_STRCAT_KUNIT_TEST) += strcat_kunit.o
 obj-$(CONFIG_STRSCPY_KUNIT_TEST) += strscpy_kunit.o
 obj-$(CONFIG_SIPHASH_KUNIT_TEST) += siphash_kunit.o
 
diff --git a/lib/checksum_kunit.c b/lib/checksum_kunit.c
new file mode 100644 (file)
index 0000000..ace3c47
--- /dev/null
@@ -0,0 +1,334 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Test cases for csum_partial and csum_fold
+ */
+
+#include <kunit/test.h>
+#include <asm/checksum.h>
+
+#define MAX_LEN 512
+#define MAX_ALIGN 64
+#define TEST_BUFLEN (MAX_LEN + MAX_ALIGN)
+
+static const __wsum random_init_sum = 0x2847aab;
+static const u8 random_buf[] = {
+       0xac, 0xd7, 0x76, 0x69, 0x6e, 0xf2, 0x93, 0x2c, 0x1f, 0xe0, 0xde, 0x86,
+       0x8f, 0x54, 0x33, 0x90, 0x95, 0xbf, 0xff, 0xb9, 0xea, 0x62, 0x6e, 0xb5,
+       0xd3, 0x4f, 0xf5, 0x60, 0x50, 0x5c, 0xc7, 0xfa, 0x6d, 0x1a, 0xc7, 0xf0,
+       0xd2, 0x2c, 0x12, 0x3d, 0x88, 0xe3, 0x14, 0x21, 0xb1, 0x5e, 0x45, 0x31,
+       0xa2, 0x85, 0x36, 0x76, 0xba, 0xd8, 0xad, 0xbb, 0x9e, 0x49, 0x8f, 0xf7,
+       0xce, 0xea, 0xef, 0xca, 0x2c, 0x29, 0xf7, 0x15, 0x5c, 0x1d, 0x4d, 0x09,
+       0x1f, 0xe2, 0x14, 0x31, 0x8c, 0x07, 0x57, 0x23, 0x1f, 0x6f, 0x03, 0xe1,
+       0x93, 0x19, 0x53, 0x03, 0x45, 0x49, 0x9a, 0x3b, 0x8e, 0x0c, 0x12, 0x5d,
+       0x8a, 0xb8, 0x9b, 0x8c, 0x9a, 0x03, 0xe5, 0xa2, 0x43, 0xd2, 0x3b, 0x4e,
+       0x7e, 0x30, 0x3c, 0x22, 0x2d, 0xc5, 0xfc, 0x9e, 0xdb, 0xc6, 0xf9, 0x69,
+       0x12, 0x39, 0x1f, 0xa0, 0x11, 0x0c, 0x3f, 0xf5, 0x53, 0xc9, 0x30, 0xfb,
+       0xb0, 0xdd, 0x21, 0x1d, 0x34, 0xe2, 0x65, 0x30, 0xf1, 0xe8, 0x1b, 0xe7,
+       0x55, 0x0d, 0xeb, 0xbd, 0xcc, 0x9d, 0x24, 0xa4, 0xad, 0xa7, 0x93, 0x47,
+       0x19, 0x2e, 0xc4, 0x5c, 0x3b, 0xc7, 0x6d, 0x95, 0x0c, 0x47, 0x60, 0xaf,
+       0x5b, 0x47, 0xee, 0xdc, 0x31, 0x31, 0x14, 0x12, 0x7e, 0x9e, 0x45, 0xb1,
+       0xc1, 0x69, 0x4b, 0x84, 0xfc, 0x88, 0xc1, 0x9e, 0x46, 0xb4, 0xc2, 0x25,
+       0xc5, 0x6c, 0x4c, 0x22, 0x58, 0x5c, 0xbe, 0xff, 0xea, 0x88, 0x88, 0x7a,
+       0xcb, 0x1c, 0x5d, 0x63, 0xa1, 0xf2, 0x33, 0x0c, 0xa2, 0x16, 0x0b, 0x6e,
+       0x2b, 0x79, 0x58, 0xf7, 0xac, 0xd3, 0x6a, 0x3f, 0x81, 0x57, 0x48, 0x45,
+       0xe3, 0x7c, 0xdc, 0xd6, 0x34, 0x7e, 0xe6, 0x73, 0xfa, 0xcb, 0x31, 0x18,
+       0xa9, 0x0b, 0xee, 0x6b, 0x99, 0xb9, 0x2d, 0xde, 0x22, 0x0e, 0x71, 0x57,
+       0x0e, 0x9b, 0x11, 0xd1, 0x15, 0x41, 0xd0, 0x6b, 0x50, 0x8a, 0x23, 0x64,
+       0xe3, 0x9c, 0xb3, 0x55, 0x09, 0xe9, 0x32, 0x67, 0xf9, 0xe0, 0x73, 0xf1,
+       0x60, 0x66, 0x0b, 0x88, 0x79, 0x8d, 0x4b, 0x52, 0x83, 0x20, 0x26, 0x78,
+       0x49, 0x27, 0xe7, 0x3e, 0x29, 0xa8, 0x18, 0x82, 0x41, 0xdd, 0x1e, 0xcc,
+       0x3b, 0xc4, 0x65, 0xd1, 0x21, 0x40, 0x72, 0xb2, 0x87, 0x5e, 0x16, 0x10,
+       0x80, 0x3f, 0x4b, 0x58, 0x1c, 0xc2, 0x79, 0x20, 0xf0, 0xe0, 0x80, 0xd3,
+       0x52, 0xa5, 0x19, 0x6e, 0x47, 0x90, 0x08, 0xf5, 0x50, 0xe2, 0xd6, 0xae,
+       0xe9, 0x2e, 0xdc, 0xd5, 0xb4, 0x90, 0x1f, 0x79, 0x49, 0x82, 0x21, 0x84,
+       0xa0, 0xb5, 0x2f, 0xff, 0x30, 0x71, 0xed, 0x80, 0x68, 0xb1, 0x6d, 0xef,
+       0xf6, 0xcf, 0xb8, 0x41, 0x79, 0xf5, 0x01, 0xbc, 0x0c, 0x9b, 0x0e, 0x06,
+       0xf3, 0xb0, 0xbb, 0x97, 0xb8, 0xb1, 0xfd, 0x51, 0x4e, 0xef, 0x0a, 0x3d,
+       0x7a, 0x3d, 0xbd, 0x61, 0x00, 0xa2, 0xb3, 0xf0, 0x1d, 0x77, 0x7b, 0x6c,
+       0x01, 0x61, 0xa5, 0xa3, 0xdb, 0xd5, 0xd5, 0xf4, 0xb5, 0x28, 0x9f, 0x0a,
+       0xa3, 0x82, 0x5f, 0x4b, 0x40, 0x0f, 0x05, 0x0e, 0x78, 0xed, 0xbf, 0x17,
+       0xf6, 0x5a, 0x8a, 0x7d, 0xf9, 0x45, 0xc1, 0xd7, 0x1b, 0x9d, 0x6c, 0x07,
+       0x88, 0xf3, 0xbc, 0xf1, 0xea, 0x28, 0x1f, 0xb8, 0x7a, 0x60, 0x3c, 0xce,
+       0x3e, 0x50, 0xb2, 0x0b, 0xcf, 0xe5, 0x08, 0x1f, 0x48, 0x04, 0xf9, 0x35,
+       0x29, 0x15, 0xbe, 0x82, 0x96, 0xc2, 0x55, 0x04, 0x6c, 0x19, 0x45, 0x29,
+       0x0b, 0xb6, 0x49, 0x12, 0xfb, 0x8d, 0x1b, 0x75, 0x8b, 0xd9, 0x6a, 0x5c,
+       0xbe, 0x46, 0x2b, 0x41, 0xfe, 0x21, 0xad, 0x1f, 0x75, 0xe7, 0x90, 0x3d,
+       0xe1, 0xdf, 0x4b, 0xe1, 0x81, 0xe2, 0x17, 0x02, 0x7b, 0x58, 0x8b, 0x92,
+       0x1a, 0xac, 0x46, 0xdd, 0x2e, 0xce, 0x40, 0x09
+};
+static const __sum16 expected_results[] = {
+       0x82d0, 0x8224, 0xab23, 0xaaad, 0x41ad, 0x413f, 0x4f3e, 0x4eab, 0x22ab,
+       0x228c, 0x428b, 0x41ad, 0xbbac, 0xbb1d, 0x671d, 0x66ea, 0xd6e9, 0xd654,
+       0x1754, 0x1655, 0x5d54, 0x5c6a, 0xfa69, 0xf9fb, 0x44fb, 0x4428, 0xf527,
+       0xf432, 0x9432, 0x93e2, 0x37e2, 0x371b, 0x3d1a, 0x3cad, 0x22ad, 0x21e6,
+       0x31e5, 0x3113, 0x0513, 0x0501, 0xc800, 0xc778, 0xe477, 0xe463, 0xc363,
+       0xc2b2, 0x64b2, 0x646d, 0x336d, 0x32cb, 0xadca, 0xad94, 0x3794, 0x36da,
+       0x5ed9, 0x5e2c, 0xa32b, 0xa28d, 0x598d, 0x58fe, 0x61fd, 0x612f, 0x772e,
+       0x763f, 0xac3e, 0xac12, 0x8312, 0x821b, 0x6d1b, 0x6cbf, 0x4fbf, 0x4f72,
+       0x4672, 0x4653, 0x6452, 0x643e, 0x333e, 0x32b2, 0x2bb2, 0x2b5b, 0x085b,
+       0x083c, 0x993b, 0x9938, 0xb837, 0xb7a4, 0x9ea4, 0x9e51, 0x9b51, 0x9b0c,
+       0x520c, 0x5172, 0x1672, 0x15e4, 0x09e4, 0x09d2, 0xacd1, 0xac47, 0xf446,
+       0xf3ab, 0x67ab, 0x6711, 0x6411, 0x632c, 0xc12b, 0xc0e8, 0xeee7, 0xeeac,
+       0xa0ac, 0xa02e, 0x702e, 0x6ff2, 0x4df2, 0x4dc5, 0x88c4, 0x87c8, 0xe9c7,
+       0xe8ec, 0x22ec, 0x21f3, 0xb8f2, 0xb8e0, 0x7fe0, 0x7fc1, 0xdfc0, 0xdfaf,
+       0xd3af, 0xd370, 0xde6f, 0xde1c, 0x151c, 0x14ec, 0x19eb, 0x193b, 0x3c3a,
+       0x3c19, 0x1f19, 0x1ee5, 0x3ce4, 0x3c7f, 0x0c7f, 0x0b8e, 0x238d, 0x2372,
+       0x3c71, 0x3c1c, 0x2f1c, 0x2e31, 0x7130, 0x7064, 0xd363, 0xd33f, 0x2f3f,
+       0x2e92, 0x8791, 0x86fe, 0x3ffe, 0x3fe5, 0x11e5, 0x1121, 0xb520, 0xb4e5,
+       0xede4, 0xed77, 0x5877, 0x586b, 0x116b, 0x110b, 0x620a, 0x61af, 0x1aaf,
+       0x19c1, 0x3dc0, 0x3d8f, 0x0c8f, 0x0c7b, 0xfa7a, 0xf9fc, 0x5bfc, 0x5bb7,
+       0xaab6, 0xa9f5, 0x40f5, 0x40aa, 0xbca9, 0xbbad, 0x33ad, 0x32ec, 0x94eb,
+       0x94a5, 0xe0a4, 0xdfe2, 0xbae2, 0xba1d, 0x4e1d, 0x4dd1, 0x2bd1, 0x2b79,
+       0xcf78, 0xceba, 0xcfb9, 0xcecf, 0x46cf, 0x4647, 0xcc46, 0xcb7b, 0xaf7b,
+       0xaf1e, 0x4c1e, 0x4b7d, 0x597c, 0x5949, 0x4d49, 0x4ca7, 0x36a7, 0x369c,
+       0xc89b, 0xc870, 0x4f70, 0x4f18, 0x5817, 0x576b, 0x846a, 0x8400, 0x4500,
+       0x447f, 0xed7e, 0xed36, 0xa836, 0xa753, 0x2b53, 0x2a77, 0x5476, 0x5442,
+       0xd641, 0xd55b, 0x625b, 0x6161, 0x9660, 0x962f, 0x7e2f, 0x7d86, 0x7286,
+       0x7198, 0x0698, 0x05ff, 0x4cfe, 0x4cd1, 0x6ed0, 0x6eae, 0x60ae, 0x603d,
+       0x093d, 0x092f, 0x6e2e, 0x6e1d, 0x9d1c, 0x9d07, 0x5c07, 0x5b37, 0xf036,
+       0xefe6, 0x65e6, 0x65c3, 0x01c3, 0x00e0, 0x64df, 0x642c, 0x0f2c, 0x0f23,
+       0x2622, 0x25f0, 0xbeef, 0xbdf6, 0xddf5, 0xdd82, 0xec81, 0xec21, 0x8621,
+       0x8616, 0xfe15, 0xfd9c, 0x709c, 0x7051, 0x1e51, 0x1dce, 0xfdcd, 0xfda7,
+       0x85a7, 0x855e, 0x5e5e, 0x5d77, 0x1f77, 0x1f4e, 0x774d, 0x7735, 0xf534,
+       0xf4f3, 0x17f3, 0x17d5, 0x4bd4, 0x4b99, 0x8798, 0x8733, 0xb632, 0xb611,
+       0x7611, 0x759f, 0xc39e, 0xc317, 0x6517, 0x6501, 0x5501, 0x5481, 0x1581,
+       0x1536, 0xbd35, 0xbd19, 0xfb18, 0xfa9f, 0xda9f, 0xd9af, 0xf9ae, 0xf92e,
+       0x262e, 0x25dc, 0x80db, 0x80c2, 0x12c2, 0x127b, 0x827a, 0x8272, 0x8d71,
+       0x8d21, 0xab20, 0xaa4a, 0xfc49, 0xfb60, 0xcd60, 0xcc84, 0xf783, 0xf6cf,
+       0x66cf, 0x66b0, 0xedaf, 0xed66, 0x6b66, 0x6b45, 0xe744, 0xe6a4, 0x31a4,
+       0x3175, 0x3274, 0x3244, 0xc143, 0xc056, 0x4056, 0x3fee, 0x8eed, 0x8e80,
+       0x9f7f, 0x9e89, 0xcf88, 0xced0, 0x8dd0, 0x8d57, 0x9856, 0x9855, 0xdc54,
+       0xdc48, 0x4148, 0x413a, 0x3b3a, 0x3a47, 0x8a46, 0x898b, 0xf28a, 0xf1d2,
+       0x40d2, 0x3fd5, 0xeed4, 0xee86, 0xff85, 0xff7b, 0xc27b, 0xc201, 0x8501,
+       0x8444, 0x2344, 0x2344, 0x8143, 0x8090, 0x908f, 0x9072, 0x1972, 0x18f7,
+       0xacf6, 0xacf5, 0x4bf5, 0x4b50, 0xa84f, 0xa774, 0xd273, 0xd19e, 0xdd9d,
+       0xdce8, 0xb4e8, 0xb449, 0xaa49, 0xa9a6, 0x27a6, 0x2747, 0xdc46, 0xdc06,
+       0xcd06, 0xcd01, 0xbf01, 0xbe89, 0xd188, 0xd0c9, 0xb9c9, 0xb8d3, 0x5ed3,
+       0x5e49, 0xe148, 0xe04f, 0x9b4f, 0x9a8e, 0xc38d, 0xc372, 0x2672, 0x2606,
+       0x1f06, 0x1e7e, 0x2b7d, 0x2ac1, 0x39c0, 0x38d6, 0x10d6, 0x10b7, 0x58b6,
+       0x583c, 0xf83b, 0xf7ff, 0x29ff, 0x29c1, 0xd9c0, 0xd90e, 0xce0e, 0xcd3f,
+       0xe83e, 0xe836, 0xc936, 0xc8ee, 0xc4ee, 0xc3f5, 0x8ef5, 0x8ecc, 0x79cc,
+       0x790e, 0xf70d, 0xf677, 0x3477, 0x3422, 0x3022, 0x2fb6, 0x16b6, 0x1671,
+       0xed70, 0xed65, 0x3765, 0x371c, 0x251c, 0x2421, 0x9720, 0x9705, 0x2205,
+       0x217a, 0x4879, 0x480f, 0xec0e, 0xeb50, 0xa550, 0xa525, 0x6425, 0x6327,
+       0x4227, 0x417a, 0x227a, 0x2205, 0x3b04, 0x3a74, 0xfd73, 0xfc92, 0x1d92,
+       0x1d47, 0x3c46, 0x3bc5, 0x59c4, 0x59ad, 0x57ad, 0x5732, 0xff31, 0xfea6,
+       0x6ca6, 0x6c8c, 0xc08b, 0xc045, 0xe344, 0xe316, 0x1516, 0x14d6,
+};
+static const __wsum init_sums_no_overflow[] = {
+       0xffffffff, 0xfffffffb, 0xfffffbfb, 0xfffffbf7, 0xfffff7f7, 0xfffff7f3,
+       0xfffff3f3, 0xfffff3ef, 0xffffefef, 0xffffefeb, 0xffffebeb, 0xffffebe7,
+       0xffffe7e7, 0xffffe7e3, 0xffffe3e3, 0xffffe3df, 0xffffdfdf, 0xffffdfdb,
+       0xffffdbdb, 0xffffdbd7, 0xffffd7d7, 0xffffd7d3, 0xffffd3d3, 0xffffd3cf,
+       0xffffcfcf, 0xffffcfcb, 0xffffcbcb, 0xffffcbc7, 0xffffc7c7, 0xffffc7c3,
+       0xffffc3c3, 0xffffc3bf, 0xffffbfbf, 0xffffbfbb, 0xffffbbbb, 0xffffbbb7,
+       0xffffb7b7, 0xffffb7b3, 0xffffb3b3, 0xffffb3af, 0xffffafaf, 0xffffafab,
+       0xffffabab, 0xffffaba7, 0xffffa7a7, 0xffffa7a3, 0xffffa3a3, 0xffffa39f,
+       0xffff9f9f, 0xffff9f9b, 0xffff9b9b, 0xffff9b97, 0xffff9797, 0xffff9793,
+       0xffff9393, 0xffff938f, 0xffff8f8f, 0xffff8f8b, 0xffff8b8b, 0xffff8b87,
+       0xffff8787, 0xffff8783, 0xffff8383, 0xffff837f, 0xffff7f7f, 0xffff7f7b,
+       0xffff7b7b, 0xffff7b77, 0xffff7777, 0xffff7773, 0xffff7373, 0xffff736f,
+       0xffff6f6f, 0xffff6f6b, 0xffff6b6b, 0xffff6b67, 0xffff6767, 0xffff6763,
+       0xffff6363, 0xffff635f, 0xffff5f5f, 0xffff5f5b, 0xffff5b5b, 0xffff5b57,
+       0xffff5757, 0xffff5753, 0xffff5353, 0xffff534f, 0xffff4f4f, 0xffff4f4b,
+       0xffff4b4b, 0xffff4b47, 0xffff4747, 0xffff4743, 0xffff4343, 0xffff433f,
+       0xffff3f3f, 0xffff3f3b, 0xffff3b3b, 0xffff3b37, 0xffff3737, 0xffff3733,
+       0xffff3333, 0xffff332f, 0xffff2f2f, 0xffff2f2b, 0xffff2b2b, 0xffff2b27,
+       0xffff2727, 0xffff2723, 0xffff2323, 0xffff231f, 0xffff1f1f, 0xffff1f1b,
+       0xffff1b1b, 0xffff1b17, 0xffff1717, 0xffff1713, 0xffff1313, 0xffff130f,
+       0xffff0f0f, 0xffff0f0b, 0xffff0b0b, 0xffff0b07, 0xffff0707, 0xffff0703,
+       0xffff0303, 0xffff02ff, 0xfffffefe, 0xfffffefa, 0xfffffafa, 0xfffffaf6,
+       0xfffff6f6, 0xfffff6f2, 0xfffff2f2, 0xfffff2ee, 0xffffeeee, 0xffffeeea,
+       0xffffeaea, 0xffffeae6, 0xffffe6e6, 0xffffe6e2, 0xffffe2e2, 0xffffe2de,
+       0xffffdede, 0xffffdeda, 0xffffdada, 0xffffdad6, 0xffffd6d6, 0xffffd6d2,
+       0xffffd2d2, 0xffffd2ce, 0xffffcece, 0xffffceca, 0xffffcaca, 0xffffcac6,
+       0xffffc6c6, 0xffffc6c2, 0xffffc2c2, 0xffffc2be, 0xffffbebe, 0xffffbeba,
+       0xffffbaba, 0xffffbab6, 0xffffb6b6, 0xffffb6b2, 0xffffb2b2, 0xffffb2ae,
+       0xffffaeae, 0xffffaeaa, 0xffffaaaa, 0xffffaaa6, 0xffffa6a6, 0xffffa6a2,
+       0xffffa2a2, 0xffffa29e, 0xffff9e9e, 0xffff9e9a, 0xffff9a9a, 0xffff9a96,
+       0xffff9696, 0xffff9692, 0xffff9292, 0xffff928e, 0xffff8e8e, 0xffff8e8a,
+       0xffff8a8a, 0xffff8a86, 0xffff8686, 0xffff8682, 0xffff8282, 0xffff827e,
+       0xffff7e7e, 0xffff7e7a, 0xffff7a7a, 0xffff7a76, 0xffff7676, 0xffff7672,
+       0xffff7272, 0xffff726e, 0xffff6e6e, 0xffff6e6a, 0xffff6a6a, 0xffff6a66,
+       0xffff6666, 0xffff6662, 0xffff6262, 0xffff625e, 0xffff5e5e, 0xffff5e5a,
+       0xffff5a5a, 0xffff5a56, 0xffff5656, 0xffff5652, 0xffff5252, 0xffff524e,
+       0xffff4e4e, 0xffff4e4a, 0xffff4a4a, 0xffff4a46, 0xffff4646, 0xffff4642,
+       0xffff4242, 0xffff423e, 0xffff3e3e, 0xffff3e3a, 0xffff3a3a, 0xffff3a36,
+       0xffff3636, 0xffff3632, 0xffff3232, 0xffff322e, 0xffff2e2e, 0xffff2e2a,
+       0xffff2a2a, 0xffff2a26, 0xffff2626, 0xffff2622, 0xffff2222, 0xffff221e,
+       0xffff1e1e, 0xffff1e1a, 0xffff1a1a, 0xffff1a16, 0xffff1616, 0xffff1612,
+       0xffff1212, 0xffff120e, 0xffff0e0e, 0xffff0e0a, 0xffff0a0a, 0xffff0a06,
+       0xffff0606, 0xffff0602, 0xffff0202, 0xffff01fe, 0xfffffdfd, 0xfffffdf9,
+       0xfffff9f9, 0xfffff9f5, 0xfffff5f5, 0xfffff5f1, 0xfffff1f1, 0xfffff1ed,
+       0xffffeded, 0xffffede9, 0xffffe9e9, 0xffffe9e5, 0xffffe5e5, 0xffffe5e1,
+       0xffffe1e1, 0xffffe1dd, 0xffffdddd, 0xffffddd9, 0xffffd9d9, 0xffffd9d5,
+       0xffffd5d5, 0xffffd5d1, 0xffffd1d1, 0xffffd1cd, 0xffffcdcd, 0xffffcdc9,
+       0xffffc9c9, 0xffffc9c5, 0xffffc5c5, 0xffffc5c1, 0xffffc1c1, 0xffffc1bd,
+       0xffffbdbd, 0xffffbdb9, 0xffffb9b9, 0xffffb9b5, 0xffffb5b5, 0xffffb5b1,
+       0xffffb1b1, 0xffffb1ad, 0xffffadad, 0xffffada9, 0xffffa9a9, 0xffffa9a5,
+       0xffffa5a5, 0xffffa5a1, 0xffffa1a1, 0xffffa19d, 0xffff9d9d, 0xffff9d99,
+       0xffff9999, 0xffff9995, 0xffff9595, 0xffff9591, 0xffff9191, 0xffff918d,
+       0xffff8d8d, 0xffff8d89, 0xffff8989, 0xffff8985, 0xffff8585, 0xffff8581,
+       0xffff8181, 0xffff817d, 0xffff7d7d, 0xffff7d79, 0xffff7979, 0xffff7975,
+       0xffff7575, 0xffff7571, 0xffff7171, 0xffff716d, 0xffff6d6d, 0xffff6d69,
+       0xffff6969, 0xffff6965, 0xffff6565, 0xffff6561, 0xffff6161, 0xffff615d,
+       0xffff5d5d, 0xffff5d59, 0xffff5959, 0xffff5955, 0xffff5555, 0xffff5551,
+       0xffff5151, 0xffff514d, 0xffff4d4d, 0xffff4d49, 0xffff4949, 0xffff4945,
+       0xffff4545, 0xffff4541, 0xffff4141, 0xffff413d, 0xffff3d3d, 0xffff3d39,
+       0xffff3939, 0xffff3935, 0xffff3535, 0xffff3531, 0xffff3131, 0xffff312d,
+       0xffff2d2d, 0xffff2d29, 0xffff2929, 0xffff2925, 0xffff2525, 0xffff2521,
+       0xffff2121, 0xffff211d, 0xffff1d1d, 0xffff1d19, 0xffff1919, 0xffff1915,
+       0xffff1515, 0xffff1511, 0xffff1111, 0xffff110d, 0xffff0d0d, 0xffff0d09,
+       0xffff0909, 0xffff0905, 0xffff0505, 0xffff0501, 0xffff0101, 0xffff00fd,
+       0xfffffcfc, 0xfffffcf8, 0xfffff8f8, 0xfffff8f4, 0xfffff4f4, 0xfffff4f0,
+       0xfffff0f0, 0xfffff0ec, 0xffffecec, 0xffffece8, 0xffffe8e8, 0xffffe8e4,
+       0xffffe4e4, 0xffffe4e0, 0xffffe0e0, 0xffffe0dc, 0xffffdcdc, 0xffffdcd8,
+       0xffffd8d8, 0xffffd8d4, 0xffffd4d4, 0xffffd4d0, 0xffffd0d0, 0xffffd0cc,
+       0xffffcccc, 0xffffccc8, 0xffffc8c8, 0xffffc8c4, 0xffffc4c4, 0xffffc4c0,
+       0xffffc0c0, 0xffffc0bc, 0xffffbcbc, 0xffffbcb8, 0xffffb8b8, 0xffffb8b4,
+       0xffffb4b4, 0xffffb4b0, 0xffffb0b0, 0xffffb0ac, 0xffffacac, 0xffffaca8,
+       0xffffa8a8, 0xffffa8a4, 0xffffa4a4, 0xffffa4a0, 0xffffa0a0, 0xffffa09c,
+       0xffff9c9c, 0xffff9c98, 0xffff9898, 0xffff9894, 0xffff9494, 0xffff9490,
+       0xffff9090, 0xffff908c, 0xffff8c8c, 0xffff8c88, 0xffff8888, 0xffff8884,
+       0xffff8484, 0xffff8480, 0xffff8080, 0xffff807c, 0xffff7c7c, 0xffff7c78,
+       0xffff7878, 0xffff7874, 0xffff7474, 0xffff7470, 0xffff7070, 0xffff706c,
+       0xffff6c6c, 0xffff6c68, 0xffff6868, 0xffff6864, 0xffff6464, 0xffff6460,
+       0xffff6060, 0xffff605c, 0xffff5c5c, 0xffff5c58, 0xffff5858, 0xffff5854,
+       0xffff5454, 0xffff5450, 0xffff5050, 0xffff504c, 0xffff4c4c, 0xffff4c48,
+       0xffff4848, 0xffff4844, 0xffff4444, 0xffff4440, 0xffff4040, 0xffff403c,
+       0xffff3c3c, 0xffff3c38, 0xffff3838, 0xffff3834, 0xffff3434, 0xffff3430,
+       0xffff3030, 0xffff302c, 0xffff2c2c, 0xffff2c28, 0xffff2828, 0xffff2824,
+       0xffff2424, 0xffff2420, 0xffff2020, 0xffff201c, 0xffff1c1c, 0xffff1c18,
+       0xffff1818, 0xffff1814, 0xffff1414, 0xffff1410, 0xffff1010, 0xffff100c,
+       0xffff0c0c, 0xffff0c08, 0xffff0808, 0xffff0804, 0xffff0404, 0xffff0400,
+       0xffff0000, 0xfffffffb,
+};
+
+static u8 tmp_buf[TEST_BUFLEN];
+
+#define full_csum(buff, len, sum) csum_fold(csum_partial(buff, len, sum))
+
+#define CHECK_EQ(lhs, rhs) KUNIT_ASSERT_EQ(test, lhs, rhs)
+
+static void assert_setup_correct(struct kunit *test)
+{
+       CHECK_EQ(sizeof(random_buf) / sizeof(random_buf[0]), MAX_LEN);
+       CHECK_EQ(sizeof(expected_results) / sizeof(expected_results[0]),
+                MAX_LEN);
+       CHECK_EQ(sizeof(init_sums_no_overflow) /
+                        sizeof(init_sums_no_overflow[0]),
+                MAX_LEN);
+}
+
+/*
+ * Test with randomized input (predetermined random data with known results).
+ */
+static void test_csum_fixed_random_inputs(struct kunit *test)
+{
+       int len, align;
+       __wsum result, expec, sum;
+
+       assert_setup_correct(test);
+       for (align = 0; align < TEST_BUFLEN; ++align) {
+               memcpy(&tmp_buf[align], random_buf,
+                      min(MAX_LEN, TEST_BUFLEN - align));
+               for (len = 0; len < MAX_LEN && (align + len) < TEST_BUFLEN;
+                    ++len) {
+                       /*
+                        * Test the precomputed random input.
+                        */
+                       sum = random_init_sum;
+                       result = full_csum(&tmp_buf[align], len, sum);
+                       expec = expected_results[len];
+                       CHECK_EQ(result, expec);
+               }
+       }
+}
+
+/*
+ * All ones input test. If there are any missing carry operations, it fails.
+ */
+static void test_csum_all_carry_inputs(struct kunit *test)
+{
+       int len, align;
+       __wsum result, expec, sum;
+
+       assert_setup_correct(test);
+       memset(tmp_buf, 0xff, TEST_BUFLEN);
+       for (align = 0; align < TEST_BUFLEN; ++align) {
+               for (len = 0; len < MAX_LEN && (align + len) < TEST_BUFLEN;
+                    ++len) {
+                       /*
+                        * All carries from input and initial sum.
+                        */
+                       sum = 0xffffffff;
+                       result = full_csum(&tmp_buf[align], len, sum);
+                       expec = (len & 1) ? 0xff00 : 0;
+                       CHECK_EQ(result, expec);
+
+                       /*
+                        * All carries from input.
+                        */
+                       sum = 0;
+                       result = full_csum(&tmp_buf[align], len, sum);
+                       if (len & 1)
+                               expec = 0xff00;
+                       else if (len)
+                               expec = 0;
+                       else
+                               expec = 0xffff;
+                       CHECK_EQ(result, expec);
+               }
+       }
+}
+
+/*
+ * Test with input that alone doesn't cause any carries. By selecting the
+ * maximum initial sum, this allows us to test that there are no carries
+ * where there shouldn't be.
+ */
+static void test_csum_no_carry_inputs(struct kunit *test)
+{
+       int len, align;
+       __wsum result, expec, sum;
+
+       assert_setup_correct(test);
+       memset(tmp_buf, 0x4, TEST_BUFLEN);
+       for (align = 0; align < TEST_BUFLEN; ++align) {
+               for (len = 0; len < MAX_LEN && (align + len) < TEST_BUFLEN;
+                    ++len) {
+                       /*
+                        * Expect no carries.
+                        */
+                       sum = init_sums_no_overflow[len];
+                       result = full_csum(&tmp_buf[align], len, sum);
+                       expec = 0;
+                       CHECK_EQ(result, expec);
+
+                       /*
+                        * Expect one carry.
+                        */
+                       sum = init_sums_no_overflow[len] + 1;
+                       result = full_csum(&tmp_buf[align], len, sum);
+                       expec = len ? 0xfffe : 0xffff;
+                       CHECK_EQ(result, expec);
+               }
+       }
+}
+
+static struct kunit_case __refdata checksum_test_cases[] = {
+       KUNIT_CASE(test_csum_fixed_random_inputs),
+       KUNIT_CASE(test_csum_all_carry_inputs),
+       KUNIT_CASE(test_csum_no_carry_inputs),
+       {}
+};
+
+static struct kunit_suite checksum_test_suite = {
+       .name = "checksum",
+       .test_cases = checksum_test_cases,
+};
+
+kunit_test_suites(&checksum_test_suite);
+
+MODULE_AUTHOR("Noah Goldstein <goldstein.w.n@gmail.com>");
+MODULE_LICENSE("GPL");
index 771d82d..c40e5d9 100644 (file)
@@ -14,8 +14,6 @@
 #include <crypto/curve25519.h>
 #include <linux/string.h>
 
-typedef __uint128_t u128;
-
 static __always_inline u64 u64_eq_mask(u64 a, u64 b)
 {
        u64 x = a ^ b;
index d34cf40..988702c 100644 (file)
@@ -10,8 +10,6 @@
 #include <asm/unaligned.h>
 #include <crypto/internal/poly1305.h>
 
-typedef __uint128_t u128;
-
 void poly1305_core_setkey(struct poly1305_core_key *key,
                          const u8 raw_key[POLY1305_BLOCK_SIZE])
 {
index 984985c..a517256 100644 (file)
@@ -498,6 +498,15 @@ static void debug_print_object(struct debug_obj *obj, char *msg)
        const struct debug_obj_descr *descr = obj->descr;
        static int limit;
 
+       /*
+        * Don't report if lookup_object_or_alloc() by the current thread
+        * failed because lookup_object_or_alloc()/debug_objects_oom() by a
+        * concurrent thread turned off debug_objects_enabled and cleared
+        * the hash buckets.
+        */
+       if (!debug_objects_enabled)
+               return;
+
        if (limit < 5 && descr != descr_test) {
                void *hint = descr->debug_hint ?
                        descr->debug_hint(obj->object) : NULL;
index 6130c42..e19199f 100644 (file)
@@ -39,7 +39,7 @@ static long INIT nofill(void *buffer, unsigned long len)
 }
 
 /* Included from initramfs et al code */
-STATIC int INIT __gunzip(unsigned char *buf, long len,
+static int INIT __gunzip(unsigned char *buf, long len,
                       long (*fill)(void*, unsigned long),
                       long (*flush)(void*, unsigned long),
                       unsigned char *out_buf, long out_len,
index 9f4262e..353268b 100644 (file)
  */
 #ifdef STATIC
 #      define XZ_PREBOOT
+#else
+#include <linux/decompress/unxz.h>
 #endif
 #ifdef __KERNEL__
 #      include <linux/decompress/mm.h>
index a512b99..bba2c0b 100644 (file)
@@ -69,6 +69,8 @@
 # define UNZSTD_PREBOOT
 # include "xxhash.c"
 # include "zstd/decompress_sources.h"
+#else
+#include <linux/decompress/unzstd.h>
 #endif
 
 #include <linux/decompress/mm.h>
index 60be9e2..9c060c6 100644 (file)
@@ -10,6 +10,7 @@
 
 #include <linux/mm.h>
 #include <linux/ioport.h>
+#include <linux/io.h>
 
 /*
  * devmem_is_allowed() checks to see if /dev/mem access to a certain address
index 6baf439..c44f104 100644 (file)
@@ -129,7 +129,7 @@ __devm_ioremap_resource(struct device *dev, const struct resource *res,
        BUG_ON(!dev);
 
        if (!res || resource_type(res) != IORESOURCE_MEM) {
-               dev_err(dev, "invalid resource\n");
+               dev_err(dev, "invalid resource %pR\n", res);
                return IOMEM_ERR_PTR(-EINVAL);
        }
 
index c8c33cb..524132f 100644 (file)
@@ -25,6 +25,11 @@ static const char array_of_10[] = "this is 10";
 static const char *ptr_of_11 = "this is 11!";
 static char array_unknown[] = "compiler thinks I might change";
 
+/* Handle being built without CONFIG_FORTIFY_SOURCE */
+#ifndef __compiletime_strlen
+# define __compiletime_strlen __builtin_strlen
+#endif
+
 static void known_sizes_test(struct kunit *test)
 {
        KUNIT_EXPECT_EQ(test, __compiletime_strlen("88888888"), 8);
@@ -307,6 +312,14 @@ DEFINE_ALLOC_SIZE_TEST_PAIR(kvmalloc)
 } while (0)
 DEFINE_ALLOC_SIZE_TEST_PAIR(devm_kmalloc)
 
+static int fortify_test_init(struct kunit *test)
+{
+       if (!IS_ENABLED(CONFIG_FORTIFY_SOURCE))
+               kunit_skip(test, "Not built with CONFIG_FORTIFY_SOURCE=y");
+
+       return 0;
+}
+
 static struct kunit_case fortify_test_cases[] = {
        KUNIT_CASE(known_sizes_test),
        KUNIT_CASE(control_flow_split_test),
@@ -323,6 +336,7 @@ static struct kunit_case fortify_test_cases[] = {
 
 static struct kunit_suite fortify_test_suite = {
        .name = "fortify",
+       .init = fortify_test_init,
        .test_cases = fortify_test_cases,
 };
 
index 960223e..b667b1e 100644 (file)
@@ -14,8 +14,6 @@
 #include <linux/scatterlist.h>
 #include <linux/instrumented.h>
 
-#define PIPE_PARANOIA /* for now */
-
 /* covers ubuf and kbuf alike */
 #define iterate_buf(i, n, base, len, off, __p, STEP) {         \
        size_t __maybe_unused off = 0;                          \
@@ -198,150 +196,6 @@ static int copyin(void *to, const void __user *from, size_t n)
        return res;
 }
 
-#ifdef PIPE_PARANOIA
-static bool sanity(const struct iov_iter *i)
-{
-       struct pipe_inode_info *pipe = i->pipe;
-       unsigned int p_head = pipe->head;
-       unsigned int p_tail = pipe->tail;
-       unsigned int p_occupancy = pipe_occupancy(p_head, p_tail);
-       unsigned int i_head = i->head;
-       unsigned int idx;
-
-       if (i->last_offset) {
-               struct pipe_buffer *p;
-               if (unlikely(p_occupancy == 0))
-                       goto Bad;       // pipe must be non-empty
-               if (unlikely(i_head != p_head - 1))
-                       goto Bad;       // must be at the last buffer...
-
-               p = pipe_buf(pipe, i_head);
-               if (unlikely(p->offset + p->len != abs(i->last_offset)))
-                       goto Bad;       // ... at the end of segment
-       } else {
-               if (i_head != p_head)
-                       goto Bad;       // must be right after the last buffer
-       }
-       return true;
-Bad:
-       printk(KERN_ERR "idx = %d, offset = %d\n", i_head, i->last_offset);
-       printk(KERN_ERR "head = %d, tail = %d, buffers = %d\n",
-                       p_head, p_tail, pipe->ring_size);
-       for (idx = 0; idx < pipe->ring_size; idx++)
-               printk(KERN_ERR "[%p %p %d %d]\n",
-                       pipe->bufs[idx].ops,
-                       pipe->bufs[idx].page,
-                       pipe->bufs[idx].offset,
-                       pipe->bufs[idx].len);
-       WARN_ON(1);
-       return false;
-}
-#else
-#define sanity(i) true
-#endif
-
-static struct page *push_anon(struct pipe_inode_info *pipe, unsigned size)
-{
-       struct page *page = alloc_page(GFP_USER);
-       if (page) {
-               struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);
-               *buf = (struct pipe_buffer) {
-                       .ops = &default_pipe_buf_ops,
-                       .page = page,
-                       .offset = 0,
-                       .len = size
-               };
-       }
-       return page;
-}
-
-static void push_page(struct pipe_inode_info *pipe, struct page *page,
-                       unsigned int offset, unsigned int size)
-{
-       struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);
-       *buf = (struct pipe_buffer) {
-               .ops = &page_cache_pipe_buf_ops,
-               .page = page,
-               .offset = offset,
-               .len = size
-       };
-       get_page(page);
-}
-
-static inline int last_offset(const struct pipe_buffer *buf)
-{
-       if (buf->ops == &default_pipe_buf_ops)
-               return buf->len;        // buf->offset is 0 for those
-       else
-               return -(buf->offset + buf->len);
-}
-
-static struct page *append_pipe(struct iov_iter *i, size_t size,
-                               unsigned int *off)
-{
-       struct pipe_inode_info *pipe = i->pipe;
-       int offset = i->last_offset;
-       struct pipe_buffer *buf;
-       struct page *page;
-
-       if (offset > 0 && offset < PAGE_SIZE) {
-               // some space in the last buffer; add to it
-               buf = pipe_buf(pipe, pipe->head - 1);
-               size = min_t(size_t, size, PAGE_SIZE - offset);
-               buf->len += size;
-               i->last_offset += size;
-               i->count -= size;
-               *off = offset;
-               return buf->page;
-       }
-       // OK, we need a new buffer
-       *off = 0;
-       size = min_t(size_t, size, PAGE_SIZE);
-       if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
-               return NULL;
-       page = push_anon(pipe, size);
-       if (!page)
-               return NULL;
-       i->head = pipe->head - 1;
-       i->last_offset = size;
-       i->count -= size;
-       return page;
-}
-
-static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
-                        struct iov_iter *i)
-{
-       struct pipe_inode_info *pipe = i->pipe;
-       unsigned int head = pipe->head;
-
-       if (unlikely(bytes > i->count))
-               bytes = i->count;
-
-       if (unlikely(!bytes))
-               return 0;
-
-       if (!sanity(i))
-               return 0;
-
-       if (offset && i->last_offset == -offset) { // could we merge it?
-               struct pipe_buffer *buf = pipe_buf(pipe, head - 1);
-               if (buf->page == page) {
-                       buf->len += bytes;
-                       i->last_offset -= bytes;
-                       i->count -= bytes;
-                       return bytes;
-               }
-       }
-       if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
-               return 0;
-
-       push_page(pipe, page, offset, bytes);
-       i->last_offset = -(offset + bytes);
-       i->head = head;
-       i->count -= bytes;
-       return bytes;
-}
-
 /*
  * fault_in_iov_iter_readable - fault in iov iterator for reading
  * @i: iterator
@@ -446,46 +300,6 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_init);
 
-// returns the offset in partial buffer (if any)
-static inline unsigned int pipe_npages(const struct iov_iter *i, int *npages)
-{
-       struct pipe_inode_info *pipe = i->pipe;
-       int used = pipe->head - pipe->tail;
-       int off = i->last_offset;
-
-       *npages = max((int)pipe->max_usage - used, 0);
-
-       if (off > 0 && off < PAGE_SIZE) { // anon and not full
-               (*npages)++;
-               return off;
-       }
-       return 0;
-}
-
-static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
-                               struct iov_iter *i)
-{
-       unsigned int off, chunk;
-
-       if (unlikely(bytes > i->count))
-               bytes = i->count;
-       if (unlikely(!bytes))
-               return 0;
-
-       if (!sanity(i))
-               return 0;
-
-       for (size_t n = bytes; n; n -= chunk) {
-               struct page *page = append_pipe(i, n, &off);
-               chunk = min_t(size_t, n, PAGE_SIZE - off);
-               if (!page)
-                       return bytes - n;
-               memcpy_to_page(page, off, addr, chunk);
-               addr += chunk;
-       }
-       return bytes;
-}
-
 static __wsum csum_and_memcpy(void *to, const void *from, size_t len,
                              __wsum sum, size_t off)
 {
@@ -493,44 +307,10 @@ static __wsum csum_and_memcpy(void *to, const void *from, size_t len,
        return csum_block_add(sum, next, off);
 }
 
-static size_t csum_and_copy_to_pipe_iter(const void *addr, size_t bytes,
-                                        struct iov_iter *i, __wsum *sump)
-{
-       __wsum sum = *sump;
-       size_t off = 0;
-       unsigned int chunk, r;
-
-       if (unlikely(bytes > i->count))
-               bytes = i->count;
-       if (unlikely(!bytes))
-               return 0;
-
-       if (!sanity(i))
-               return 0;
-
-       while (bytes) {
-               struct page *page = append_pipe(i, bytes, &r);
-               char *p;
-
-               if (!page)
-                       break;
-               chunk = min_t(size_t, bytes, PAGE_SIZE - r);
-               p = kmap_local_page(page);
-               sum = csum_and_memcpy(p + r, addr + off, chunk, sum, off);
-               kunmap_local(p);
-               off += chunk;
-               bytes -= chunk;
-       }
-       *sump = sum;
-       return off;
-}
-
 size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
        if (WARN_ON_ONCE(i->data_source))
                return 0;
-       if (unlikely(iov_iter_is_pipe(i)))
-               return copy_pipe_to_iter(addr, bytes, i);
        if (user_backed_iter(i))
                might_fault();
        iterate_and_advance(i, bytes, base, len, off,
@@ -552,42 +332,6 @@ static int copyout_mc(void __user *to, const void *from, size_t n)
        return n;
 }
 
-static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
-                               struct iov_iter *i)
-{
-       size_t xfer = 0;
-       unsigned int off, chunk;
-
-       if (unlikely(bytes > i->count))
-               bytes = i->count;
-       if (unlikely(!bytes))
-               return 0;
-
-       if (!sanity(i))
-               return 0;
-
-       while (bytes) {
-               struct page *page = append_pipe(i, bytes, &off);
-               unsigned long rem;
-               char *p;
-
-               if (!page)
-                       break;
-               chunk = min_t(size_t, bytes, PAGE_SIZE - off);
-               p = kmap_local_page(page);
-               rem = copy_mc_to_kernel(p + off, addr + xfer, chunk);
-               chunk -= rem;
-               kunmap_local(p);
-               xfer += chunk;
-               bytes -= chunk;
-               if (rem) {
-                       iov_iter_revert(i, rem);
-                       break;
-               }
-       }
-       return xfer;
-}
-
 /**
  * _copy_mc_to_iter - copy to iter with source memory error exception handling
  * @addr: source kernel address
@@ -607,9 +351,8 @@ static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
  *   alignment and poison alignment assumptions to avoid re-triggering
  *   hardware exceptions.
  *
- * * ITER_KVEC, ITER_PIPE, and ITER_BVEC can return short copies.
- *   Compare to copy_to_iter() where only ITER_IOVEC attempts might return
- *   a short copy.
+ * * ITER_KVEC and ITER_BVEC can return short copies.  Compare to
+ *   copy_to_iter() where only ITER_IOVEC attempts might return a short copy.
  *
  * Return: number of bytes copied (may be %0)
  */
@@ -617,8 +360,6 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
        if (WARN_ON_ONCE(i->data_source))
                return 0;
-       if (unlikely(iov_iter_is_pipe(i)))
-               return copy_mc_pipe_to_iter(addr, bytes, i);
        if (user_backed_iter(i))
                might_fault();
        __iterate_and_advance(i, bytes, base, len, off,
@@ -732,8 +473,6 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
                return 0;
        if (WARN_ON_ONCE(i->data_source))
                return 0;
-       if (unlikely(iov_iter_is_pipe(i)))
-               return copy_page_to_iter_pipe(page, offset, bytes, i);
        page += offset / PAGE_SIZE; // first subpage
        offset %= PAGE_SIZE;
        while (1) {
@@ -764,8 +503,6 @@ size_t copy_page_to_iter_nofault(struct page *page, unsigned offset, size_t byte
                return 0;
        if (WARN_ON_ONCE(i->data_source))
                return 0;
-       if (unlikely(iov_iter_is_pipe(i)))
-               return copy_page_to_iter_pipe(page, offset, bytes, i);
        page += offset / PAGE_SIZE; // first subpage
        offset %= PAGE_SIZE;
        while (1) {
@@ -818,36 +555,8 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
-static size_t pipe_zero(size_t bytes, struct iov_iter *i)
-{
-       unsigned int chunk, off;
-
-       if (unlikely(bytes > i->count))
-               bytes = i->count;
-       if (unlikely(!bytes))
-               return 0;
-
-       if (!sanity(i))
-               return 0;
-
-       for (size_t n = bytes; n; n -= chunk) {
-               struct page *page = append_pipe(i, n, &off);
-               char *p;
-
-               if (!page)
-                       return bytes - n;
-               chunk = min_t(size_t, n, PAGE_SIZE - off);
-               p = kmap_local_page(page);
-               memset(p + off, 0, chunk);
-               kunmap_local(p);
-       }
-       return bytes;
-}
-
 size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 {
-       if (unlikely(iov_iter_is_pipe(i)))
-               return pipe_zero(bytes, i);
        iterate_and_advance(i, bytes, base, len, count,
                clear_user(base, len),
                memset(base, 0, len)
@@ -878,32 +587,6 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t byt
 }
 EXPORT_SYMBOL(copy_page_from_iter_atomic);
 
-static void pipe_advance(struct iov_iter *i, size_t size)
-{
-       struct pipe_inode_info *pipe = i->pipe;
-       int off = i->last_offset;
-
-       if (!off && !size) {
-               pipe_discard_from(pipe, i->start_head); // discard everything
-               return;
-       }
-       i->count -= size;
-       while (1) {
-               struct pipe_buffer *buf = pipe_buf(pipe, i->head);
-               if (off) /* make it relative to the beginning of buffer */
-                       size += abs(off) - buf->offset;
-               if (size <= buf->len) {
-                       buf->len = size;
-                       i->last_offset = last_offset(buf);
-                       break;
-               }
-               size -= buf->len;
-               i->head++;
-               off = 0;
-       }
-       pipe_discard_from(pipe, i->head + 1); // discard everything past this one
-}
-
 static void iov_iter_bvec_advance(struct iov_iter *i, size_t size)
 {
        const struct bio_vec *bvec, *end;
@@ -955,8 +638,6 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
                iov_iter_iovec_advance(i, size);
        } else if (iov_iter_is_bvec(i)) {
                iov_iter_bvec_advance(i, size);
-       } else if (iov_iter_is_pipe(i)) {
-               pipe_advance(i, size);
        } else if (iov_iter_is_discard(i)) {
                i->count -= size;
        }
@@ -970,26 +651,6 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
        if (WARN_ON(unroll > MAX_RW_COUNT))
                return;
        i->count += unroll;
-       if (unlikely(iov_iter_is_pipe(i))) {
-               struct pipe_inode_info *pipe = i->pipe;
-               unsigned int head = pipe->head;
-
-               while (head > i->start_head) {
-                       struct pipe_buffer *b = pipe_buf(pipe, --head);
-                       if (unroll < b->len) {
-                               b->len -= unroll;
-                               i->last_offset = last_offset(b);
-                               i->head = head;
-                               return;
-                       }
-                       unroll -= b->len;
-                       pipe_buf_release(pipe, b);
-                       pipe->head--;
-               }
-               i->last_offset = 0;
-               i->head = head;
-               return;
-       }
        if (unlikely(iov_iter_is_discard(i)))
                return;
        if (unroll <= i->iov_offset) {
@@ -1079,24 +740,6 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction,
-                       struct pipe_inode_info *pipe,
-                       size_t count)
-{
-       BUG_ON(direction != READ);
-       WARN_ON(pipe_full(pipe->head, pipe->tail, pipe->ring_size));
-       *i = (struct iov_iter){
-               .iter_type = ITER_PIPE,
-               .data_source = false,
-               .pipe = pipe,
-               .head = pipe->head,
-               .start_head = pipe->head,
-               .last_offset = 0,
-               .count = count
-       };
-}
-EXPORT_SYMBOL(iov_iter_pipe);
-
 /**
  * iov_iter_xarray - Initialise an I/O iterator to use the pages in an xarray
  * @i: The iterator to initialise.
@@ -1224,19 +867,6 @@ bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
        if (iov_iter_is_bvec(i))
                return iov_iter_aligned_bvec(i, addr_mask, len_mask);
 
-       if (iov_iter_is_pipe(i)) {
-               size_t size = i->count;
-
-               if (size & len_mask)
-                       return false;
-               if (size && i->last_offset > 0) {
-                       if (i->last_offset & addr_mask)
-                               return false;
-               }
-
-               return true;
-       }
-
        if (iov_iter_is_xarray(i)) {
                if (i->count & len_mask)
                        return false;
@@ -1307,14 +937,6 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
        if (iov_iter_is_bvec(i))
                return iov_iter_alignment_bvec(i);
 
-       if (iov_iter_is_pipe(i)) {
-               size_t size = i->count;
-
-               if (size && i->last_offset > 0)
-                       return size | i->last_offset;
-               return size;
-       }
-
        if (iov_iter_is_xarray(i))
                return (i->xarray_start + i->iov_offset) | i->count;
 
@@ -1367,36 +989,6 @@ static int want_pages_array(struct page ***res, size_t size,
        return count;
 }
 
-static ssize_t pipe_get_pages(struct iov_iter *i,
-                  struct page ***pages, size_t maxsize, unsigned maxpages,
-                  size_t *start)
-{
-       unsigned int npages, count, off, chunk;
-       struct page **p;
-       size_t left;
-
-       if (!sanity(i))
-               return -EFAULT;
-
-       *start = off = pipe_npages(i, &npages);
-       if (!npages)
-               return -EFAULT;
-       count = want_pages_array(pages, maxsize, off, min(npages, maxpages));
-       if (!count)
-               return -ENOMEM;
-       p = *pages;
-       for (npages = 0, left = maxsize ; npages < count; npages++, left -= chunk) {
-               struct page *page = append_pipe(i, left, &off);
-               if (!page)
-                       break;
-               chunk = min_t(size_t, left, PAGE_SIZE - off);
-               get_page(*p++ = page);
-       }
-       if (!npages)
-               return -EFAULT;
-       return maxsize - left;
-}
-
 static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
                                          pgoff_t index, unsigned int nr_pages)
 {
@@ -1490,8 +1082,7 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
 
 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
                   struct page ***pages, size_t maxsize,
-                  unsigned int maxpages, size_t *start,
-                  iov_iter_extraction_t extraction_flags)
+                  unsigned int maxpages, size_t *start)
 {
        unsigned int n, gup_flags = 0;
 
@@ -1501,8 +1092,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
                return 0;
        if (maxsize > MAX_RW_COUNT)
                maxsize = MAX_RW_COUNT;
-       if (extraction_flags & ITER_ALLOW_P2PDMA)
-               gup_flags |= FOLL_PCI_P2PDMA;
 
        if (likely(user_backed_iter(i))) {
                unsigned long addr;
@@ -1547,56 +1136,36 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
                }
                return maxsize;
        }
-       if (iov_iter_is_pipe(i))
-               return pipe_get_pages(i, pages, maxsize, maxpages, start);
        if (iov_iter_is_xarray(i))
                return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
        return -EFAULT;
 }
 
-ssize_t iov_iter_get_pages(struct iov_iter *i,
-                  struct page **pages, size_t maxsize, unsigned maxpages,
-                  size_t *start, iov_iter_extraction_t extraction_flags)
+ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
+               size_t maxsize, unsigned maxpages, size_t *start)
 {
        if (!maxpages)
                return 0;
        BUG_ON(!pages);
 
-       return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
-                                         start, extraction_flags);
-}
-EXPORT_SYMBOL_GPL(iov_iter_get_pages);
-
-ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
-               size_t maxsize, unsigned maxpages, size_t *start)
-{
-       return iov_iter_get_pages(i, pages, maxsize, maxpages, start, 0);
+       return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
 }
 EXPORT_SYMBOL(iov_iter_get_pages2);
 
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
-                  struct page ***pages, size_t maxsize,
-                  size_t *start, iov_iter_extraction_t extraction_flags)
+ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
+               struct page ***pages, size_t maxsize, size_t *start)
 {
        ssize_t len;
 
        *pages = NULL;
 
-       len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
-                                        extraction_flags);
+       len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start);
        if (len <= 0) {
                kvfree(*pages);
                *pages = NULL;
        }
        return len;
 }
-EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc);
-
-ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
-               struct page ***pages, size_t maxsize, size_t *start)
-{
-       return iov_iter_get_pages_alloc(i, pages, maxsize, start, 0);
-}
 EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
 
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
@@ -1638,9 +1207,7 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
        }
 
        sum = csum_shift(csstate->csum, csstate->off);
-       if (unlikely(iov_iter_is_pipe(i)))
-               bytes = csum_and_copy_to_pipe_iter(addr, bytes, i, &sum);
-       else iterate_and_advance(i, bytes, base, len, off, ({
+       iterate_and_advance(i, bytes, base, len, off, ({
                next = csum_and_copy_to_user(addr + off, base, len);
                sum = csum_block_add(sum, next, off);
                next ? 0 : len;
@@ -1725,15 +1292,6 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
                return iov_npages(i, maxpages);
        if (iov_iter_is_bvec(i))
                return bvec_npages(i, maxpages);
-       if (iov_iter_is_pipe(i)) {
-               int npages;
-
-               if (!sanity(i))
-                       return 0;
-
-               pipe_npages(i, &npages);
-               return min(npages, maxpages);
-       }
        if (iov_iter_is_xarray(i)) {
                unsigned offset = (i->xarray_start + i->iov_offset) % PAGE_SIZE;
                int npages = DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
@@ -1746,10 +1304,6 @@ EXPORT_SYMBOL(iov_iter_npages);
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags)
 {
        *new = *old;
-       if (unlikely(iov_iter_is_pipe(new))) {
-               WARN_ON(1);
-               return NULL;
-       }
        if (iov_iter_is_bvec(new))
                return new->bvec = kmemdup(new->bvec,
                                    new->nr_segs * sizeof(struct bio_vec),
index f79a434..16d530f 100644 (file)
@@ -281,8 +281,7 @@ int kobject_set_name_vargs(struct kobject *kobj, const char *fmt,
                kfree_const(s);
                if (!t)
                        return -ENOMEM;
-               strreplace(t, '/', '!');
-               s = t;
+               s = strreplace(t, '/', '!');
        }
        kfree_const(kobj->name);
        kobj->name = s;
index b08bb1f..22c5c49 100644 (file)
@@ -10,6 +10,7 @@
 #include <kunit/test.h>
 
 #include "string-stream.h"
+#include "debugfs.h"
 
 #define KUNIT_DEBUGFS_ROOT             "kunit"
 #define KUNIT_DEBUGFS_RESULTS          "results"
index 0cea31c..ce6749a 100644 (file)
@@ -125,11 +125,6 @@ kunit_test_suites(&executor_test_suite);
 
 /* Test helpers */
 
-static void kfree_res_free(struct kunit_resource *res)
-{
-       kfree(res->data);
-}
-
 /* Use the resource API to register a call to kfree(to_free).
  * Since we never actually use the resource, it's safe to use on const data.
  */
@@ -138,8 +133,10 @@ static void kfree_at_end(struct kunit *test, const void *to_free)
        /* kfree() handles NULL already, but avoid allocating a no-op cleanup. */
        if (IS_ERR_OR_NULL(to_free))
                return;
-       kunit_alloc_resource(test, NULL, kfree_res_free, GFP_KERNEL,
-                            (void *)to_free);
+
+       kunit_add_action(test,
+                       (kunit_action_t *)kfree,
+                       (void *)to_free);
 }
 
 static struct kunit_suite *alloc_fake_suite(struct kunit *test,
index cd8b7e5..b69b689 100644 (file)
@@ -42,6 +42,16 @@ static int example_test_init(struct kunit *test)
 }
 
 /*
+ * This is run once after each test case, see the comment on
+ * example_test_suite for more information.
+ */
+static void example_test_exit(struct kunit *test)
+{
+       kunit_info(test, "cleaning up\n");
+}
+
+
+/*
  * This is run once before all test cases in the suite.
  * See the comment on example_test_suite for more information.
  */
@@ -53,6 +63,16 @@ static int example_test_init_suite(struct kunit_suite *suite)
 }
 
 /*
+ * This is run once after all test cases in the suite.
+ * See the comment on example_test_suite for more information.
+ */
+static void example_test_exit_suite(struct kunit_suite *suite)
+{
+       kunit_info(suite, "exiting suite\n");
+}
+
+
+/*
  * This test should always be skipped.
  */
 static void example_skip_test(struct kunit *test)
@@ -167,6 +187,39 @@ static void example_static_stub_test(struct kunit *test)
        KUNIT_EXPECT_EQ(test, add_one(1), 2);
 }
 
+static const struct example_param {
+       int value;
+} example_params_array[] = {
+       { .value = 2, },
+       { .value = 1, },
+       { .value = 0, },
+};
+
+static void example_param_get_desc(const struct example_param *p, char *desc)
+{
+       snprintf(desc, KUNIT_PARAM_DESC_SIZE, "example value %d", p->value);
+}
+
+KUNIT_ARRAY_PARAM(example, example_params_array, example_param_get_desc);
+
+/*
+ * This test shows the use of params.
+ */
+static void example_params_test(struct kunit *test)
+{
+       const struct example_param *param = test->param_value;
+
+       /* By design, param pointer will not be NULL */
+       KUNIT_ASSERT_NOT_NULL(test, param);
+
+       /* Test can be skipped on unsupported param values */
+       if (!param->value)
+               kunit_skip(test, "unsupported param value");
+
+       /* You can use param values for parameterized testing */
+       KUNIT_EXPECT_EQ(test, param->value % param->value, 0);
+}
+
 /*
  * Here we make a list of all the test cases we want to add to the test suite
  * below.
@@ -183,6 +236,7 @@ static struct kunit_case example_test_cases[] = {
        KUNIT_CASE(example_mark_skipped_test),
        KUNIT_CASE(example_all_expect_macros_test),
        KUNIT_CASE(example_static_stub_test),
+       KUNIT_CASE_PARAM(example_params_test, example_gen_params),
        {}
 };
 
@@ -211,7 +265,9 @@ static struct kunit_case example_test_cases[] = {
 static struct kunit_suite example_test_suite = {
        .name = "example",
        .init = example_test_init,
+       .exit = example_test_exit,
        .suite_init = example_test_init_suite,
+       .suite_exit = example_test_exit_suite,
        .test_cases = example_test_cases,
 };
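
For context, KUNIT_ARRAY_PARAM(example, ...) is expected to generate the example_gen_params() generator referenced in KUNIT_CASE_PARAM() above: it walks example_params_array one element per parameterized run and fills the description via example_param_get_desc(). A rough sketch of the generated function (the real macro lives in include/kunit/test.h and may differ in detail):

	static const void *example_gen_params(const void *prev, char *desc)
	{
		const struct example_param *p = prev;

		/* Start at the first element, or advance past the previous one. */
		p = p ? p + 1 : example_params_array;
		if (p >= example_params_array + ARRAY_SIZE(example_params_array))
			return NULL;

		example_param_get_desc(p, desc);
		return p;
	}
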
 
index 42e44ca..83d8e90 100644
@@ -112,7 +112,7 @@ struct kunit_test_resource_context {
        struct kunit test;
        bool is_resource_initialized;
        int allocate_order[2];
-       int free_order[2];
+       int free_order[4];
 };
 
 static int fake_resource_init(struct kunit_resource *res, void *context)
@@ -403,6 +403,88 @@ static void kunit_resource_test_named(struct kunit *test)
        KUNIT_EXPECT_TRUE(test, list_empty(&test->resources));
 }
 
+static void increment_int(void *ctx)
+{
+       int *i = (int *)ctx;
+       (*i)++;
+}
+
+static void kunit_resource_test_action(struct kunit *test)
+{
+       int num_actions = 0;
+
+       kunit_add_action(test, increment_int, &num_actions);
+       KUNIT_EXPECT_EQ(test, num_actions, 0);
+       kunit_cleanup(test);
+       KUNIT_EXPECT_EQ(test, num_actions, 1);
+
+       /* Once we've cleaned up, the action queue is empty. */
+       kunit_cleanup(test);
+       KUNIT_EXPECT_EQ(test, num_actions, 1);
+
+       /* Check the same function can be deferred multiple times. */
+       kunit_add_action(test, increment_int, &num_actions);
+       kunit_add_action(test, increment_int, &num_actions);
+       kunit_cleanup(test);
+       KUNIT_EXPECT_EQ(test, num_actions, 3);
+}
+static void kunit_resource_test_remove_action(struct kunit *test)
+{
+       int num_actions = 0;
+
+       kunit_add_action(test, increment_int, &num_actions);
+       KUNIT_EXPECT_EQ(test, num_actions, 0);
+
+       kunit_remove_action(test, increment_int, &num_actions);
+       kunit_cleanup(test);
+       KUNIT_EXPECT_EQ(test, num_actions, 0);
+}
+static void kunit_resource_test_release_action(struct kunit *test)
+{
+       int num_actions = 0;
+
+       kunit_add_action(test, increment_int, &num_actions);
+       KUNIT_EXPECT_EQ(test, num_actions, 0);
+       /* Runs immediately on trigger. */
+       kunit_release_action(test, increment_int, &num_actions);
+       KUNIT_EXPECT_EQ(test, num_actions, 1);
+
+       /* Doesn't run again on test exit. */
+       kunit_cleanup(test);
+       KUNIT_EXPECT_EQ(test, num_actions, 1);
+}
+static void action_order_1(void *ctx)
+{
+       struct kunit_test_resource_context *res_ctx = (struct kunit_test_resource_context *)ctx;
+
+       KUNIT_RESOURCE_TEST_MARK_ORDER(res_ctx, free_order, 1);
+       kunit_log(KERN_INFO, current->kunit_test, "action_order_1");
+}
+static void action_order_2(void *ctx)
+{
+       struct kunit_test_resource_context *res_ctx = (struct kunit_test_resource_context *)ctx;
+
+       KUNIT_RESOURCE_TEST_MARK_ORDER(res_ctx, free_order, 2);
+       kunit_log(KERN_INFO, current->kunit_test, "action_order_2");
+}
+static void kunit_resource_test_action_ordering(struct kunit *test)
+{
+       struct kunit_test_resource_context *ctx = test->priv;
+
+       kunit_add_action(test, action_order_1, ctx);
+       kunit_add_action(test, action_order_2, ctx);
+       kunit_add_action(test, action_order_1, ctx);
+       kunit_add_action(test, action_order_2, ctx);
+       kunit_remove_action(test, action_order_1, ctx);
+       kunit_release_action(test, action_order_2, ctx);
+       kunit_cleanup(test);
+
+       /* release runs a 2 now; cleanup runs the remaining 2, skips the removed 1, then runs 1 */
+       KUNIT_EXPECT_EQ(test, ctx->free_order[0], 2);
+       KUNIT_EXPECT_EQ(test, ctx->free_order[1], 2);
+       KUNIT_EXPECT_EQ(test, ctx->free_order[2], 1);
+}
+
 static int kunit_resource_test_init(struct kunit *test)
 {
        struct kunit_test_resource_context *ctx =
@@ -434,6 +516,10 @@ static struct kunit_case kunit_resource_test_cases[] = {
        KUNIT_CASE(kunit_resource_test_proper_free_ordering),
        KUNIT_CASE(kunit_resource_test_static),
        KUNIT_CASE(kunit_resource_test_named),
+       KUNIT_CASE(kunit_resource_test_action),
+       KUNIT_CASE(kunit_resource_test_remove_action),
+       KUNIT_CASE(kunit_resource_test_release_action),
+       KUNIT_CASE(kunit_resource_test_action_ordering),
        {}
 };
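
The [2, 2, 1] expectations above encode the cleanup ordering the action API is assumed to provide: actions run in reverse registration order (LIFO), removed actions are skipped, and released actions run immediately. A stand-alone sketch of just the LIFO property (names are illustrative, not part of the patch):

	static int lifo_order[2];
	static int lifo_idx;

	static void lifo_push_1(void *unused) { lifo_order[lifo_idx++] = 1; }
	static void lifo_push_2(void *unused) { lifo_order[lifo_idx++] = 2; }

	static void lifo_example(struct kunit *test)
	{
		kunit_add_action(test, lifo_push_1, NULL);	/* registered first, runs last */
		kunit_add_action(test, lifo_push_2, NULL);	/* registered last, runs first */
		kunit_cleanup(test);

		KUNIT_EXPECT_EQ(test, lifo_order[0], 2);
		KUNIT_EXPECT_EQ(test, lifo_order[1], 1);
	}
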
 
index c414df9..f020925 100644
@@ -77,3 +77,102 @@ int kunit_destroy_resource(struct kunit *test, kunit_resource_match_t match,
        return 0;
 }
 EXPORT_SYMBOL_GPL(kunit_destroy_resource);
+
+struct kunit_action_ctx {
+       struct kunit_resource res;
+       kunit_action_t *func;
+       void *ctx;
+};
+
+static void __kunit_action_free(struct kunit_resource *res)
+{
+       struct kunit_action_ctx *action_ctx = container_of(res, struct kunit_action_ctx, res);
+
+       action_ctx->func(action_ctx->ctx);
+}
+
+
+int kunit_add_action(struct kunit *test, void (*action)(void *), void *ctx)
+{
+       struct kunit_action_ctx *action_ctx;
+
+       KUNIT_ASSERT_NOT_NULL_MSG(test, action, "Tried to register a NULL action function!");
+
+       action_ctx = kzalloc(sizeof(*action_ctx), GFP_KERNEL);
+       if (!action_ctx)
+               return -ENOMEM;
+
+       action_ctx->func = action;
+       action_ctx->ctx = ctx;
+
+       action_ctx->res.should_kfree = true;
+       /* As init is NULL, this cannot fail. */
+       __kunit_add_resource(test, NULL, __kunit_action_free, &action_ctx->res, action_ctx);
+
+       return 0;
+}
+EXPORT_SYMBOL_GPL(kunit_add_action);
+
+int kunit_add_action_or_reset(struct kunit *test, void (*action)(void *),
+                             void *ctx)
+{
+       int res = kunit_add_action(test, action, ctx);
+
+       if (res)
+               action(ctx);
+       return res;
+}
+EXPORT_SYMBOL_GPL(kunit_add_action_or_reset);
+
+static bool __kunit_action_match(struct kunit *test,
+                               struct kunit_resource *res, void *match_data)
+{
+       struct kunit_action_ctx *match_ctx = (struct kunit_action_ctx *)match_data;
+       struct kunit_action_ctx *res_ctx = container_of(res, struct kunit_action_ctx, res);
+
+       /* Make sure this is a free function. */
+       if (res->free != __kunit_action_free)
+               return false;
+
+       /* Both the function and context data should match. */
+       return (match_ctx->func == res_ctx->func) && (match_ctx->ctx == res_ctx->ctx);
+}
+
+void kunit_remove_action(struct kunit *test,
+                       kunit_action_t *action,
+                       void *ctx)
+{
+       struct kunit_action_ctx match_ctx;
+       struct kunit_resource *res;
+
+       match_ctx.func = action;
+       match_ctx.ctx = ctx;
+
+       res = kunit_find_resource(test, __kunit_action_match, &match_ctx);
+       if (res) {
+               /* Remove the free function so we don't run the action. */
+               res->free = NULL;
+               kunit_remove_resource(test, res);
+               kunit_put_resource(res);
+       }
+}
+EXPORT_SYMBOL_GPL(kunit_remove_action);
+
+void kunit_release_action(struct kunit *test,
+                        kunit_action_t *action,
+                        void *ctx)
+{
+       struct kunit_action_ctx match_ctx;
+       struct kunit_resource *res;
+
+       match_ctx.func = action;
+       match_ctx.ctx = ctx;
+
+       res = kunit_find_resource(test, __kunit_action_match, &match_ctx);
+       if (res) {
+               kunit_remove_resource(test, res);
+               /* We have to put() this here, else free won't be called. */
+               kunit_put_resource(res);
+       }
+}
+EXPORT_SYMBOL_GPL(kunit_release_action);
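
A short usage sketch of the new API from a test's point of view (not part of the patch): register a deferred kfree(), then run and drop it early with kunit_release_action() so it does not fire a second time at test exit.

	static void action_usage_example(struct kunit *test)
	{
		char *buf = kmalloc(64, GFP_KERNEL);

		KUNIT_ASSERT_NOT_NULL(test, buf);

		/* On registration failure the action runs immediately, freeing buf. */
		KUNIT_ASSERT_EQ(test, 0,
				kunit_add_action_or_reset(test, (kunit_action_t *)kfree, buf));

		/* ... use buf ... */

		/* Free early; the deferred action is consumed and will not run again. */
		kunit_release_action(test, (kunit_action_t *)kfree, buf);
	}
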
index e2910b2..84e4666 100644
@@ -185,16 +185,28 @@ static void kunit_print_suite_start(struct kunit_suite *suite)
                  kunit_suite_num_test_cases(suite));
 }
 
-static void kunit_print_ok_not_ok(void *test_or_suite,
-                                 bool is_test,
+/* Currently supported test levels */
+enum {
+       KUNIT_LEVEL_SUITE = 0,
+       KUNIT_LEVEL_CASE,
+       KUNIT_LEVEL_CASE_PARAM,
+};
+
+static void kunit_print_ok_not_ok(struct kunit *test,
+                                 unsigned int test_level,
                                  enum kunit_status status,
                                  size_t test_number,
                                  const char *description,
                                  const char *directive)
 {
-       struct kunit_suite *suite = is_test ? NULL : test_or_suite;
-       struct kunit *test = is_test ? test_or_suite : NULL;
        const char *directive_header = (status == KUNIT_SKIPPED) ? " # SKIP " : "";
+       const char *directive_body = (status == KUNIT_SKIPPED) ? directive : "";
+
+       /*
+        * A NULL test means the results belong to the suite; suite
+        * results are currently only expected at level 0.
+        */
+       WARN(!test && test_level, "suite test level can't be %u!\n", test_level);
 
        /*
         * We do not log the test suite results as doing so would
@@ -203,17 +215,18 @@ static void kunit_print_ok_not_ok(void *test_or_suite,
         * separately seq_printf() the suite results for the debugfs
         * representation.
         */
-       if (suite)
+       if (!test)
                pr_info("%s %zd %s%s%s\n",
                        kunit_status_to_ok_not_ok(status),
                        test_number, description, directive_header,
-                       (status == KUNIT_SKIPPED) ? directive : "");
+                       directive_body);
        else
                kunit_log(KERN_INFO, test,
-                         KUNIT_SUBTEST_INDENT "%s %zd %s%s%s",
+                         "%*s%s %zd %s%s%s",
+                         KUNIT_INDENT_LEN * test_level, "",
                          kunit_status_to_ok_not_ok(status),
                          test_number, description, directive_header,
-                         (status == KUNIT_SKIPPED) ? directive : "");
+                         directive_body);
 }
 
 enum kunit_status kunit_suite_has_succeeded(struct kunit_suite *suite)
@@ -239,7 +252,7 @@ static size_t kunit_suite_counter = 1;
 
 static void kunit_print_suite_end(struct kunit_suite *suite)
 {
-       kunit_print_ok_not_ok((void *)suite, false,
+       kunit_print_ok_not_ok(NULL, KUNIT_LEVEL_SUITE,
                              kunit_suite_has_succeeded(suite),
                              kunit_suite_counter++,
                              suite->name,
@@ -310,7 +323,7 @@ static void kunit_fail(struct kunit *test, const struct kunit_loc *loc,
        string_stream_destroy(stream);
 }
 
-static void __noreturn kunit_abort(struct kunit *test)
+void __noreturn __kunit_abort(struct kunit *test)
 {
        kunit_try_catch_throw(&test->try_catch); /* Does not return. */
 
@@ -322,8 +335,9 @@ static void __noreturn kunit_abort(struct kunit *test)
         */
        WARN_ONCE(true, "Throw could not abort from test!\n");
 }
+EXPORT_SYMBOL_GPL(__kunit_abort);
 
-void kunit_do_failed_assertion(struct kunit *test,
+void __kunit_do_failed_assertion(struct kunit *test,
                               const struct kunit_loc *loc,
                               enum kunit_assert_type type,
                               const struct kunit_assert *assert,
@@ -340,11 +354,8 @@ void kunit_do_failed_assertion(struct kunit *test,
        kunit_fail(test, loc, type, assert, assert_format, &message);
 
        va_end(args);
-
-       if (type == KUNIT_ASSERTION)
-               kunit_abort(test);
 }
-EXPORT_SYMBOL_GPL(kunit_do_failed_assertion);
+EXPORT_SYMBOL_GPL(__kunit_do_failed_assertion);
 
 void kunit_init_test(struct kunit *test, const char *name, char *log)
 {
@@ -419,15 +430,54 @@ static void kunit_try_run_case(void *data)
         * thread will resume control and handle any necessary clean up.
         */
        kunit_run_case_internal(test, suite, test_case);
-       /* This line may never be reached. */
+}
+
+static void kunit_try_run_case_cleanup(void *data)
+{
+       struct kunit_try_catch_context *ctx = data;
+       struct kunit *test = ctx->test;
+       struct kunit_suite *suite = ctx->suite;
+
+       current->kunit_test = test;
+
        kunit_run_case_cleanup(test, suite);
 }
 
+static void kunit_catch_run_case_cleanup(void *data)
+{
+       struct kunit_try_catch_context *ctx = data;
+       struct kunit *test = ctx->test;
+       int try_exit_code = kunit_try_catch_get_result(&test->try_catch);
+
+       /* It is always a failure if cleanup aborts. */
+       kunit_set_failure(test);
+
+       if (try_exit_code) {
+               /*
+                * Cleanup could not finish; we have no idea what state the
+                * test is in, so do not attempt any further clean up.
+                */
+               if (try_exit_code == -ETIMEDOUT) {
+                       kunit_err(test, "test case cleanup timed out\n");
+               /*
+                * An unknown internal error occurred while running the
+                * cleanup, so there is nothing left to clean up.
+                */
+               } else {
+                       kunit_err(test, "internal error occurred during test case cleanup: %d\n",
+                                 try_exit_code);
+               }
+               return;
+       }
+
+       kunit_err(test, "test aborted during cleanup. continuing without cleaning up\n");
+}
+
+
 static void kunit_catch_run_case(void *data)
 {
        struct kunit_try_catch_context *ctx = data;
        struct kunit *test = ctx->test;
-       struct kunit_suite *suite = ctx->suite;
        int try_exit_code = kunit_try_catch_get_result(&test->try_catch);
 
        if (try_exit_code) {
@@ -448,12 +498,6 @@ static void kunit_catch_run_case(void *data)
                }
                return;
        }
-
-       /*
-        * Test case was run, but aborted. It is the test case's business as to
-        * whether it failed or not, we just need to clean up.
-        */
-       kunit_run_case_cleanup(test, suite);
 }
 
 /*
@@ -478,6 +522,13 @@ static void kunit_run_case_catch_errors(struct kunit_suite *suite,
        context.test_case = test_case;
        kunit_try_catch_run(try_catch, &context);
 
+       /* Now run the cleanup */
+       kunit_try_catch_init(try_catch,
+                            test,
+                            kunit_try_run_case_cleanup,
+                            kunit_catch_run_case_cleanup);
+       kunit_try_catch_run(try_catch, &context);
+
        /* Propagate the parameter result to the test case. */
        if (test->status == KUNIT_FAILURE)
                test_case->status = KUNIT_FAILURE;
@@ -585,11 +636,11 @@ int kunit_run_tests(struct kunit_suite *suite)
                                                 "param-%d", test.param_index);
                                }
 
-                               kunit_log(KERN_INFO, &test,
-                                         KUNIT_SUBTEST_INDENT KUNIT_SUBTEST_INDENT
-                                         "%s %d %s",
-                                         kunit_status_to_ok_not_ok(test.status),
-                                         test.param_index + 1, param_desc);
+                               kunit_print_ok_not_ok(&test, KUNIT_LEVEL_CASE_PARAM,
+                                                     test.status,
+                                                     test.param_index + 1,
+                                                     param_desc,
+                                                     test.status_comment);
 
                                /* Get next param. */
                                param_desc[0] = '\0';
@@ -603,7 +654,7 @@ int kunit_run_tests(struct kunit_suite *suite)
 
                kunit_print_test_stats(&test, param_stats);
 
-               kunit_print_ok_not_ok(&test, true, test_case->status,
+               kunit_print_ok_not_ok(&test, KUNIT_LEVEL_CASE, test_case->status,
                                      kunit_test_case_num(suite, test_case),
                                      test_case->name,
                                      test.status_comment);
@@ -712,58 +763,28 @@ static struct notifier_block kunit_mod_nb = {
 };
 #endif
 
-struct kunit_kmalloc_array_params {
-       size_t n;
-       size_t size;
-       gfp_t gfp;
-};
-
-static int kunit_kmalloc_array_init(struct kunit_resource *res, void *context)
+void *kunit_kmalloc_array(struct kunit *test, size_t n, size_t size, gfp_t gfp)
 {
-       struct kunit_kmalloc_array_params *params = context;
+       void *data;
 
-       res->data = kmalloc_array(params->n, params->size, params->gfp);
-       if (!res->data)
-               return -ENOMEM;
+       data = kmalloc_array(n, size, gfp);
 
-       return 0;
-}
+       if (!data)
+               return NULL;
 
-static void kunit_kmalloc_array_free(struct kunit_resource *res)
-{
-       kfree(res->data);
-}
-
-void *kunit_kmalloc_array(struct kunit *test, size_t n, size_t size, gfp_t gfp)
-{
-       struct kunit_kmalloc_array_params params = {
-               .size = size,
-               .n = n,
-               .gfp = gfp
-       };
+       if (kunit_add_action_or_reset(test, (kunit_action_t *)kfree, data) != 0)
+               return NULL;
 
-       return kunit_alloc_resource(test,
-                                   kunit_kmalloc_array_init,
-                                   kunit_kmalloc_array_free,
-                                   gfp,
-                                   &params);
+       return data;
 }
 EXPORT_SYMBOL_GPL(kunit_kmalloc_array);
 
-static inline bool kunit_kfree_match(struct kunit *test,
-                                    struct kunit_resource *res, void *match_data)
-{
-       /* Only match resources allocated with kunit_kmalloc() and friends. */
-       return res->free == kunit_kmalloc_array_free && res->data == match_data;
-}
-
 void kunit_kfree(struct kunit *test, const void *ptr)
 {
        if (!ptr)
                return;
 
-       if (kunit_destroy_resource(test, kunit_kfree_match, (void *)ptr))
-               KUNIT_FAIL(test, "kunit_kfree: %px already freed or not allocated by kunit", ptr);
+       kunit_release_action(test, (kunit_action_t *)kfree, (void *)ptr);
 }
 EXPORT_SYMBOL_GPL(kunit_kfree);
 
index 8ebc43d..bfffbb7 100644
@@ -194,7 +194,7 @@ static void mas_set_height(struct ma_state *mas)
        unsigned int new_flags = mas->tree->ma_flags;
 
        new_flags &= ~MT_FLAGS_HEIGHT_MASK;
-       BUG_ON(mas->depth > MAPLE_HEIGHT_MAX);
+       MAS_BUG_ON(mas, mas->depth > MAPLE_HEIGHT_MAX);
        new_flags |= mas->depth << MT_FLAGS_HEIGHT_OFFSET;
        mas->tree->ma_flags = new_flags;
 }
@@ -240,12 +240,12 @@ static inline void mas_set_err(struct ma_state *mas, long err)
        mas->node = MA_ERROR(err);
 }
 
-static inline bool mas_is_ptr(struct ma_state *mas)
+static inline bool mas_is_ptr(const struct ma_state *mas)
 {
        return mas->node == MAS_ROOT;
 }
 
-static inline bool mas_is_start(struct ma_state *mas)
+static inline bool mas_is_start(const struct ma_state *mas)
 {
        return mas->node == MAS_START;
 }
@@ -425,28 +425,26 @@ static inline unsigned long mte_parent_slot_mask(unsigned long parent)
 }
 
 /*
- * mas_parent_enum() - Return the maple_type of the parent from the stored
+ * mas_parent_type() - Return the maple_type of the parent from the stored
  * parent type.
  * @mas: The maple state
- * @node: The maple_enode to extract the parent's enum
+ * @enode: The maple_enode to extract the parent's enum
  * Return: The node->parent maple_type
  */
 static inline
-enum maple_type mte_parent_enum(struct maple_enode *p_enode,
-                               struct maple_tree *mt)
+enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
 {
        unsigned long p_type;
 
-       p_type = (unsigned long)p_enode;
-       if (p_type & MAPLE_PARENT_ROOT)
-               return 0; /* Validated in the caller. */
+       p_type = (unsigned long)mte_to_node(enode)->parent;
+       if (WARN_ON(p_type & MAPLE_PARENT_ROOT))
+               return 0;
 
        p_type &= MAPLE_NODE_MASK;
-       p_type = p_type & ~(MAPLE_PARENT_ROOT | mte_parent_slot_mask(p_type));
-
+       p_type &= ~mte_parent_slot_mask(p_type);
        switch (p_type) {
        case MAPLE_PARENT_RANGE64: /* or MAPLE_PARENT_ARANGE64 */
-               if (mt_is_alloc(mt))
+               if (mt_is_alloc(mas->tree))
                        return maple_arange_64;
                return maple_range_64;
        }
@@ -454,14 +452,8 @@ enum maple_type mte_parent_enum(struct maple_enode *p_enode,
        return 0;
 }
 
-static inline
-enum maple_type mas_parent_enum(struct ma_state *mas, struct maple_enode *enode)
-{
-       return mte_parent_enum(ma_enode_ptr(mte_to_node(enode)->parent), mas->tree);
-}
-
 /*
- * mte_set_parent() - Set the parent node and encode the slot
+ * mas_set_parent() - Set the parent node and encode the slot
  * @enode: The encoded maple node.
  * @parent: The encoded maple node that is the parent of @enode.
  * @slot: The slot that @enode resides in @parent.
@@ -470,16 +462,16 @@ enum maple_type mas_parent_enum(struct ma_state *mas, struct maple_enode *enode)
  * parent type.
  */
 static inline
-void mte_set_parent(struct maple_enode *enode, const struct maple_enode *parent,
-                   unsigned char slot)
+void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
+                   const struct maple_enode *parent, unsigned char slot)
 {
        unsigned long val = (unsigned long)parent;
        unsigned long shift;
        unsigned long type;
        enum maple_type p_type = mte_node_type(parent);
 
-       BUG_ON(p_type == maple_dense);
-       BUG_ON(p_type == maple_leaf_64);
+       MAS_BUG_ON(mas, p_type == maple_dense);
+       MAS_BUG_ON(mas, p_type == maple_leaf_64);
 
        switch (p_type) {
        case maple_range_64:
@@ -671,22 +663,22 @@ static inline unsigned long *ma_gaps(struct maple_node *node,
 }
 
 /*
- * mte_pivot() - Get the pivot at @piv of the maple encoded node.
- * @mn: The maple encoded node.
+ * mas_pivot() - Get the pivot at @piv of the node in the maple state.
+ * @mas: The maple state.
  * @piv: The pivot.
  *
  * Return: the pivot at @piv of @mn.
  */
-static inline unsigned long mte_pivot(const struct maple_enode *mn,
-                                unsigned char piv)
+static inline unsigned long mas_pivot(struct ma_state *mas, unsigned char piv)
 {
-       struct maple_node *node = mte_to_node(mn);
-       enum maple_type type = mte_node_type(mn);
+       struct maple_node *node = mas_mn(mas);
+       enum maple_type type = mte_node_type(mas->node);
 
-       if (piv >= mt_pivots[type]) {
-               WARN_ON(1);
+       if (MAS_WARN_ON(mas, piv >= mt_pivots[type])) {
+               mas_set_err(mas, -EIO);
                return 0;
        }
+
        switch (type) {
        case maple_arange_64:
                return node->ma64.pivot[piv];
@@ -971,8 +963,6 @@ static inline unsigned char ma_meta_end(struct maple_node *mn,
 static inline unsigned char ma_meta_gap(struct maple_node *mn,
                                        enum maple_type mt)
 {
-       BUG_ON(mt != maple_arange_64);
-
        return mn->ma64.meta.gap;
 }
 
@@ -1111,7 +1101,6 @@ static int mas_ascend(struct ma_state *mas)
        enum maple_type a_type;
        unsigned long min, max;
        unsigned long *pivots;
-       unsigned char offset;
        bool set_max = false, set_min = false;
 
        a_node = mas_mn(mas);
@@ -1123,8 +1112,9 @@ static int mas_ascend(struct ma_state *mas)
        p_node = mte_parent(mas->node);
        if (unlikely(a_node == p_node))
                return 1;
-       a_type = mas_parent_enum(mas, mas->node);
-       offset = mte_parent_slot(mas->node);
+
+       a_type = mas_parent_type(mas, mas->node);
+       mas->offset = mte_parent_slot(mas->node);
        a_enode = mt_mk_node(p_node, a_type);
 
        /* Check to make sure all parent information is still accurate */
@@ -1132,7 +1122,6 @@ static int mas_ascend(struct ma_state *mas)
                return 1;
 
        mas->node = a_enode;
-       mas->offset = offset;
 
        if (mte_is_root(a_enode)) {
                mas->max = ULONG_MAX;
@@ -1140,11 +1129,17 @@ static int mas_ascend(struct ma_state *mas)
                return 0;
        }
 
+       if (!mas->min)
+               set_min = true;
+
+       if (mas->max == ULONG_MAX)
+               set_max = true;
+
        min = 0;
        max = ULONG_MAX;
        do {
                p_enode = a_enode;
-               a_type = mas_parent_enum(mas, p_enode);
+               a_type = mas_parent_type(mas, p_enode);
                a_node = mte_parent(p_enode);
                a_slot = mte_parent_slot(p_enode);
                a_enode = mt_mk_node(a_node, a_type);
@@ -1401,9 +1396,9 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
 
                mas->min = 0;
                mas->max = ULONG_MAX;
-               mas->depth = 0;
 
 retry:
+               mas->depth = 0;
                root = mas_root(mas);
                /* Tree with nodes */
                if (likely(xa_is_node(root))) {
@@ -1631,6 +1626,7 @@ static inline unsigned long mas_max_gap(struct ma_state *mas)
                return mas_leaf_max_gap(mas);
 
        node = mas_mn(mas);
+       MAS_BUG_ON(mas, mt != maple_arange_64);
        offset = ma_meta_gap(node, mt);
        if (offset == MAPLE_ARANGE64_META_MAX)
                return 0;
@@ -1659,11 +1655,12 @@ static inline void mas_parent_gap(struct ma_state *mas, unsigned char offset,
        enum maple_type pmt;
 
        pnode = mte_parent(mas->node);
-       pmt = mas_parent_enum(mas, mas->node);
+       pmt = mas_parent_type(mas, mas->node);
        penode = mt_mk_node(pnode, pmt);
        pgaps = ma_gaps(pnode, pmt);
 
 ascend:
+       MAS_BUG_ON(mas, pmt != maple_arange_64);
        meta_offset = ma_meta_gap(pnode, pmt);
        if (meta_offset == MAPLE_ARANGE64_META_MAX)
                meta_gap = 0;
@@ -1691,7 +1688,7 @@ ascend:
 
        /* Go to the parent node. */
        pnode = mte_parent(penode);
-       pmt = mas_parent_enum(mas, penode);
+       pmt = mas_parent_type(mas, penode);
        pgaps = ma_gaps(pnode, pmt);
        offset = mte_parent_slot(penode);
        penode = mt_mk_node(pnode, pmt);
@@ -1718,7 +1715,7 @@ static inline void mas_update_gap(struct ma_state *mas)
 
        pslot = mte_parent_slot(mas->node);
        p_gap = ma_gaps(mte_parent(mas->node),
-                       mas_parent_enum(mas, mas->node))[pslot];
+                       mas_parent_type(mas, mas->node))[pslot];
 
        if (p_gap != max_gap)
                mas_parent_gap(mas, pslot, max_gap);
@@ -1743,7 +1740,7 @@ static inline void mas_adopt_children(struct ma_state *mas,
        offset = ma_data_end(node, type, pivots, mas->max);
        do {
                child = mas_slot_locked(mas, slots, offset);
-               mte_set_parent(child, parent, offset);
+               mas_set_parent(mas, child, parent, offset);
        } while (offset--);
 }
 
@@ -1755,7 +1752,7 @@ static inline void mas_adopt_children(struct ma_state *mas,
  * leave the node (true) and handle the adoption and free elsewhere.
  */
 static inline void mas_replace(struct ma_state *mas, bool advanced)
-       __must_hold(mas->tree->lock)
+       __must_hold(mas->tree->ma_lock)
 {
        struct maple_node *mn = mas_mn(mas);
        struct maple_enode *old_enode;
@@ -1767,7 +1764,7 @@ static inline void mas_replace(struct ma_state *mas, bool advanced)
        } else {
                offset = mte_parent_slot(mas->node);
                slots = ma_slots(mte_parent(mas->node),
-                                mas_parent_enum(mas, mas->node));
+                                mas_parent_type(mas, mas->node));
                old_enode = mas_slot_locked(mas, slots, offset);
        }
 
@@ -1795,7 +1792,7 @@ static inline void mas_replace(struct ma_state *mas, bool advanced)
  * @child: the maple state to store the child.
  */
 static inline bool mas_new_child(struct ma_state *mas, struct ma_state *child)
-       __must_hold(mas->tree->lock)
+       __must_hold(mas->tree->ma_lock)
 {
        enum maple_type mt;
        unsigned char offset;
@@ -1943,8 +1940,9 @@ static inline int mab_calc_split(struct ma_state *mas,
                 * causes one node to be deficient.
                 * NOTE: mt_min_slots is 1 based, b_end and split are zero.
                 */
-               while (((bn->pivot[split] - min) < slot_count - 1) &&
-                      (split < slot_count - 1) && (b_end - split > slot_min))
+               while ((split < slot_count - 1) &&
+                      ((bn->pivot[split] - min) < slot_count - 1) &&
+                      (b_end - split > slot_min))
                        split++;
        }
 
@@ -2347,7 +2345,8 @@ static inline void mas_topiary_range(struct ma_state *mas,
        void __rcu **slots;
        unsigned char offset;
 
-       MT_BUG_ON(mas->tree, mte_is_leaf(mas->node));
+       MAS_BUG_ON(mas, mte_is_leaf(mas->node));
+
        slots = ma_slots(mas_mn(mas), mte_node_type(mas->node));
        for (offset = start; offset <= end; offset++) {
                struct maple_enode *enode = mas_slot_locked(mas, slots, offset);
@@ -2707,9 +2706,9 @@ static inline void mas_set_split_parent(struct ma_state *mas,
                return;
 
        if ((*slot) <= split)
-               mte_set_parent(mas->node, left, *slot);
+               mas_set_parent(mas, mas->node, left, *slot);
        else if (right)
-               mte_set_parent(mas->node, right, (*slot) - split - 1);
+               mas_set_parent(mas, mas->node, right, (*slot) - split - 1);
 
        (*slot)++;
 }
@@ -3106,12 +3105,12 @@ static int mas_spanning_rebalance(struct ma_state *mas,
                                mte_node_type(mast->orig_l->node));
        mast->orig_l->depth++;
        mab_mas_cp(mast->bn, 0, mt_slots[mast->bn->type] - 1, &l_mas, true);
-       mte_set_parent(left, l_mas.node, slot);
+       mas_set_parent(mas, left, l_mas.node, slot);
        if (middle)
-               mte_set_parent(middle, l_mas.node, ++slot);
+               mas_set_parent(mas, middle, l_mas.node, ++slot);
 
        if (right)
-               mte_set_parent(right, l_mas.node, ++slot);
+               mas_set_parent(mas, right, l_mas.node, ++slot);
 
        if (mas_is_root_limits(mast->l)) {
 new_root:
@@ -3250,7 +3249,7 @@ static inline void mas_destroy_rebalance(struct ma_state *mas, unsigned char end
        l_mas.max = l_pivs[split];
        mas->min = l_mas.max + 1;
        eparent = mt_mk_node(mte_parent(l_mas.node),
-                            mas_parent_enum(&l_mas, l_mas.node));
+                            mas_parent_type(&l_mas, l_mas.node));
        tmp += end;
        if (!in_rcu) {
                unsigned char max_p = mt_pivots[mt];
@@ -3293,7 +3292,7 @@ static inline void mas_destroy_rebalance(struct ma_state *mas, unsigned char end
 
        /* replace parent. */
        offset = mte_parent_slot(mas->node);
-       mt = mas_parent_enum(&l_mas, l_mas.node);
+       mt = mas_parent_type(&l_mas, l_mas.node);
        parent = mas_pop_node(mas);
        slots = ma_slots(parent, mt);
        pivs = ma_pivots(parent, mt);
@@ -3338,8 +3337,8 @@ static inline bool mas_split_final_node(struct maple_subtree_state *mast,
         * The Big_node data should just fit in a single node.
         */
        ancestor = mas_new_ma_node(mas, mast->bn);
-       mte_set_parent(mast->l->node, ancestor, mast->l->offset);
-       mte_set_parent(mast->r->node, ancestor, mast->r->offset);
+       mas_set_parent(mas, mast->l->node, ancestor, mast->l->offset);
+       mas_set_parent(mas, mast->r->node, ancestor, mast->r->offset);
        mte_to_node(ancestor)->parent = mas_mn(mas)->parent;
 
        mast->l->node = ancestor;
@@ -3729,43 +3728,31 @@ static inline void mas_store_root(struct ma_state *mas, void *entry)
  */
 static bool mas_is_span_wr(struct ma_wr_state *wr_mas)
 {
-       unsigned long max;
+       unsigned long max = wr_mas->r_max;
        unsigned long last = wr_mas->mas->last;
-       unsigned long piv = wr_mas->r_max;
        enum maple_type type = wr_mas->type;
        void *entry = wr_mas->entry;
 
-       /* Contained in this pivot */
-       if (piv > last)
+       /* Contained in this pivot, fast path */
+       if (last < max)
                return false;
 
-       max = wr_mas->mas->max;
-       if (unlikely(ma_is_leaf(type))) {
-               /* Fits in the node, but may span slots. */
+       if (ma_is_leaf(type)) {
+               max = wr_mas->mas->max;
                if (last < max)
                        return false;
+       }
 
-               /* Writes to the end of the node but not null. */
-               if ((last == max) && entry)
-                       return false;
-
+       if (last == max) {
                /*
-                * Writing ULONG_MAX is not a spanning write regardless of the
-                * value being written as long as the range fits in the node.
+                * The last entry of a leaf node cannot be NULL unless it is the
+                * rightmost node (writing ULONG_MAX); otherwise the write spans slots.
                 */
-               if ((last == ULONG_MAX) && (last == max))
-                       return false;
-       } else if (piv == last) {
-               if (entry)
-                       return false;
-
-               /* Detect spanning store wr walk */
-               if (last == ULONG_MAX)
+               if (entry || last == ULONG_MAX)
                        return false;
        }
 
-       trace_ma_write(__func__, wr_mas->mas, piv, entry);
-
+       trace_ma_write(__func__, wr_mas->mas, wr_mas->r_max, entry);
        return true;
 }
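
To make the spanning/non-spanning distinction concrete, a userspace-style sketch using the public API (not part of the patch; entries use xa_mk_value() so they are valid tree entries):

	DEFINE_MTREE(example_mt);

	static void store_examples(void)
	{
		mtree_store_range(&example_mt, 0, 9, xa_mk_value(1), GFP_KERNEL);
		mtree_store_range(&example_mt, 10, 19, xa_mk_value(2), GFP_KERNEL);

		/* Ends inside the pivot it starts in (2..5 within 0..9): never spanning. */
		mtree_store_range(&example_mt, 2, 5, xa_mk_value(3), GFP_KERNEL);

		/*
		 * Runs past its starting pivot (5..15). It only becomes a spanning
		 * store if 15 also lies beyond the maximum of the leaf holding 5;
		 * with both ranges in one leaf this is still an ordinary store.
		 */
		mtree_store_range(&example_mt, 5, 15, xa_mk_value(4), GFP_KERNEL);
	}
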
 
@@ -4087,52 +4074,27 @@ static inline int mas_wr_spanning_store(struct ma_wr_state *wr_mas)
  *
  * Return: True if stored, false otherwise
  */
-static inline bool mas_wr_node_store(struct ma_wr_state *wr_mas)
+static inline bool mas_wr_node_store(struct ma_wr_state *wr_mas,
+                                    unsigned char new_end)
 {
        struct ma_state *mas = wr_mas->mas;
        void __rcu **dst_slots;
        unsigned long *dst_pivots;
-       unsigned char dst_offset;
-       unsigned char new_end = wr_mas->node_end;
-       unsigned char offset;
-       unsigned char node_slots = mt_slots[wr_mas->type];
+       unsigned char dst_offset, offset_end = wr_mas->offset_end;
        struct maple_node reuse, *newnode;
-       unsigned char copy_size, max_piv = mt_pivots[wr_mas->type];
+       unsigned char copy_size, node_pivots = mt_pivots[wr_mas->type];
        bool in_rcu = mt_in_rcu(mas->tree);
 
-       offset = mas->offset;
-       if (mas->last == wr_mas->r_max) {
-               /* runs right to the end of the node */
-               if (mas->last == mas->max)
-                       new_end = offset;
-               /* don't copy this offset */
-               wr_mas->offset_end++;
-       } else if (mas->last < wr_mas->r_max) {
-               /* new range ends in this range */
-               if (unlikely(wr_mas->r_max == ULONG_MAX))
-                       mas_bulk_rebalance(mas, wr_mas->node_end, wr_mas->type);
-
-               new_end++;
-       } else {
-               if (wr_mas->end_piv == mas->last)
-                       wr_mas->offset_end++;
-
-               new_end -= wr_mas->offset_end - offset - 1;
-       }
-
-       /* new range starts within a range */
-       if (wr_mas->r_min < mas->index)
-               new_end++;
-
-       /* Not enough room */
-       if (new_end >= node_slots)
-               return false;
-
-       /* Not enough data. */
+       /* Room was already checked by the caller; verify there is enough data. */
        if (!mte_is_root(mas->node) && (new_end <= mt_min_slots[wr_mas->type]) &&
            !(mas->mas_flags & MA_STATE_BULK))
                return false;
 
+       if (mas->last == wr_mas->end_piv)
+               offset_end++; /* don't copy this offset */
+       else if (unlikely(wr_mas->r_max == ULONG_MAX))
+               mas_bulk_rebalance(mas, wr_mas->node_end, wr_mas->type);
+
        /* set up node. */
        if (in_rcu) {
                mas_node_count(mas, 1);
@@ -4149,47 +4111,36 @@ static inline bool mas_wr_node_store(struct ma_wr_state *wr_mas)
        dst_pivots = ma_pivots(newnode, wr_mas->type);
        dst_slots = ma_slots(newnode, wr_mas->type);
        /* Copy from start to insert point */
-       memcpy(dst_pivots, wr_mas->pivots, sizeof(unsigned long) * (offset + 1));
-       memcpy(dst_slots, wr_mas->slots, sizeof(void *) * (offset + 1));
-       dst_offset = offset;
+       memcpy(dst_pivots, wr_mas->pivots, sizeof(unsigned long) * mas->offset);
+       memcpy(dst_slots, wr_mas->slots, sizeof(void *) * mas->offset);
 
        /* Handle insert of new range starting after old range */
        if (wr_mas->r_min < mas->index) {
-               mas->offset++;
-               rcu_assign_pointer(dst_slots[dst_offset], wr_mas->content);
-               dst_pivots[dst_offset++] = mas->index - 1;
+               rcu_assign_pointer(dst_slots[mas->offset], wr_mas->content);
+               dst_pivots[mas->offset++] = mas->index - 1;
        }
 
        /* Store the new entry and range end. */
-       if (dst_offset < max_piv)
-               dst_pivots[dst_offset] = mas->last;
-       mas->offset = dst_offset;
-       rcu_assign_pointer(dst_slots[dst_offset], wr_mas->entry);
+       if (mas->offset < node_pivots)
+               dst_pivots[mas->offset] = mas->last;
+       rcu_assign_pointer(dst_slots[mas->offset], wr_mas->entry);
 
        /*
         * this range wrote to the end of the node or it overwrote the rest of
         * the data
         */
-       if (wr_mas->offset_end > wr_mas->node_end || mas->last >= mas->max) {
-               new_end = dst_offset;
+       if (offset_end > wr_mas->node_end)
                goto done;
-       }
 
-       dst_offset++;
+       dst_offset = mas->offset + 1;
        /* Copy to the end of node if necessary. */
-       copy_size = wr_mas->node_end - wr_mas->offset_end + 1;
-       memcpy(dst_slots + dst_offset, wr_mas->slots + wr_mas->offset_end,
+       copy_size = wr_mas->node_end - offset_end + 1;
+       memcpy(dst_slots + dst_offset, wr_mas->slots + offset_end,
               sizeof(void *) * copy_size);
-       if (dst_offset < max_piv) {
-               if (copy_size > max_piv - dst_offset)
-                       copy_size = max_piv - dst_offset;
-
-               memcpy(dst_pivots + dst_offset,
-                      wr_mas->pivots + wr_mas->offset_end,
-                      sizeof(unsigned long) * copy_size);
-       }
+       memcpy(dst_pivots + dst_offset, wr_mas->pivots + offset_end,
+              sizeof(unsigned long) * (copy_size - 1));
 
-       if ((wr_mas->node_end == node_slots - 1) && (new_end < node_slots - 1))
+       if (new_end < node_pivots)
                dst_pivots[new_end] = mas->max;
 
 done:
@@ -4215,59 +4166,46 @@ done:
 static inline bool mas_wr_slot_store(struct ma_wr_state *wr_mas)
 {
        struct ma_state *mas = wr_mas->mas;
-       unsigned long lmax; /* Logical max. */
        unsigned char offset = mas->offset;
+       bool gap = false;
 
-       if ((wr_mas->r_max > mas->last) && ((wr_mas->r_min != mas->index) ||
-                                 (offset != wr_mas->node_end)))
-               return false;
-
-       if (offset == wr_mas->node_end - 1)
-               lmax = mas->max;
-       else
-               lmax = wr_mas->pivots[offset + 1];
-
-       /* going to overwrite too many slots. */
-       if (lmax < mas->last)
+       if (wr_mas->offset_end - offset != 1)
                return false;
 
-       if (wr_mas->r_min == mas->index) {
-               /* overwriting two or more ranges with one. */
-               if (lmax == mas->last)
-                       return false;
+       gap |= !mt_slot_locked(mas->tree, wr_mas->slots, offset);
+       gap |= !mt_slot_locked(mas->tree, wr_mas->slots, offset + 1);
 
-               /* Overwriting all of offset and a portion of offset + 1. */
+       if (mas->index == wr_mas->r_min) {
+               /* Overwriting all of this range and part of the next one. */
                rcu_assign_pointer(wr_mas->slots[offset], wr_mas->entry);
                wr_mas->pivots[offset] = mas->last;
-               goto done;
+       } else {
+               /* Overwriting part of this range and all of the next one. */
+               rcu_assign_pointer(wr_mas->slots[offset + 1], wr_mas->entry);
+               wr_mas->pivots[offset] = mas->index - 1;
+               mas->offset++; /* Keep mas accurate. */
        }
 
-       /* Doesn't end on the next range end. */
-       if (lmax != mas->last)
-               return false;
-
-       /* Overwriting a portion of offset and all of offset + 1 */
-       if ((offset + 1 < mt_pivots[wr_mas->type]) &&
-           (wr_mas->entry || wr_mas->pivots[offset + 1]))
-               wr_mas->pivots[offset + 1] = mas->last;
-
-       rcu_assign_pointer(wr_mas->slots[offset + 1], wr_mas->entry);
-       wr_mas->pivots[offset] = mas->index - 1;
-       mas->offset++; /* Keep mas accurate. */
-
-done:
        trace_ma_write(__func__, mas, 0, wr_mas->entry);
-       mas_update_gap(mas);
+       /*
+        * Only update gap when the new entry is empty or there is an empty
+        * entry in the original two ranges.
+        */
+       if (!wr_mas->entry || gap)
+               mas_update_gap(mas);
+
        return true;
 }
 
 static inline void mas_wr_end_piv(struct ma_wr_state *wr_mas)
 {
-       while ((wr_mas->mas->last > wr_mas->end_piv) &&
-              (wr_mas->offset_end < wr_mas->node_end))
-               wr_mas->end_piv = wr_mas->pivots[++wr_mas->offset_end];
+       while ((wr_mas->offset_end < wr_mas->node_end) &&
+              (wr_mas->mas->last > wr_mas->pivots[wr_mas->offset_end]))
+               wr_mas->offset_end++;
 
-       if (wr_mas->mas->last > wr_mas->end_piv)
+       if (wr_mas->offset_end < wr_mas->node_end)
+               wr_mas->end_piv = wr_mas->pivots[wr_mas->offset_end];
+       else
                wr_mas->end_piv = wr_mas->mas->max;
 }
 
@@ -4275,19 +4213,21 @@ static inline void mas_wr_extend_null(struct ma_wr_state *wr_mas)
 {
        struct ma_state *mas = wr_mas->mas;
 
-       if (mas->last < wr_mas->end_piv && !wr_mas->slots[wr_mas->offset_end])
+       if (!wr_mas->slots[wr_mas->offset_end]) {
+               /* If this one is null, the next and prev are not */
                mas->last = wr_mas->end_piv;
-
-       /* Check next slot(s) if we are overwriting the end */
-       if ((mas->last == wr_mas->end_piv) &&
-           (wr_mas->node_end != wr_mas->offset_end) &&
-           !wr_mas->slots[wr_mas->offset_end + 1]) {
-               wr_mas->offset_end++;
-               if (wr_mas->offset_end == wr_mas->node_end)
-                       mas->last = mas->max;
-               else
-                       mas->last = wr_mas->pivots[wr_mas->offset_end];
-               wr_mas->end_piv = mas->last;
+       } else {
+               /* Check next slot(s) if we are overwriting the end */
+               if ((mas->last == wr_mas->end_piv) &&
+                   (wr_mas->node_end != wr_mas->offset_end) &&
+                   !wr_mas->slots[wr_mas->offset_end + 1]) {
+                       wr_mas->offset_end++;
+                       if (wr_mas->offset_end == wr_mas->node_end)
+                               mas->last = mas->max;
+                       else
+                               mas->last = wr_mas->pivots[wr_mas->offset_end];
+                       wr_mas->end_piv = mas->last;
+               }
        }
 
        if (!wr_mas->content) {
@@ -4305,6 +4245,27 @@ static inline void mas_wr_extend_null(struct ma_wr_state *wr_mas)
        }
 }
 
+static inline unsigned char mas_wr_new_end(struct ma_wr_state *wr_mas)
+{
+       struct ma_state *mas = wr_mas->mas;
+       unsigned char new_end = wr_mas->node_end + 2;
+
+       new_end -= wr_mas->offset_end - mas->offset;
+       if (wr_mas->r_min == mas->index)
+               new_end--;
+
+       if (wr_mas->end_piv == mas->last)
+               new_end--;
+
+       return new_end;
+}
+
+/*
+ * mas_wr_append() - Attempt to append
+ * @wr_mas: the maple write state
+ *
+ * Return: True if appended, false otherwise
+ */
 static inline bool mas_wr_append(struct ma_wr_state *wr_mas)
 {
        unsigned char end = wr_mas->node_end;
@@ -4312,34 +4273,30 @@ static inline bool mas_wr_append(struct ma_wr_state *wr_mas)
        struct ma_state *mas = wr_mas->mas;
        unsigned char node_pivots = mt_pivots[wr_mas->type];
 
-       if ((mas->index != wr_mas->r_min) && (mas->last == wr_mas->r_max)) {
-               if (new_end < node_pivots)
-                       wr_mas->pivots[new_end] = wr_mas->pivots[end];
+       if (mas->offset != wr_mas->node_end)
+               return false;
 
-               if (new_end < node_pivots)
-                       ma_set_meta(wr_mas->node, maple_leaf_64, 0, new_end);
+       if (new_end < node_pivots) {
+               wr_mas->pivots[new_end] = wr_mas->pivots[end];
+               ma_set_meta(wr_mas->node, maple_leaf_64, 0, new_end);
+       }
 
+       if (mas->last == wr_mas->r_max) {
+               /* Append to end of range */
                rcu_assign_pointer(wr_mas->slots[new_end], wr_mas->entry);
-               mas->offset = new_end;
                wr_mas->pivots[end] = mas->index - 1;
-
-               return true;
-       }
-
-       if ((mas->index == wr_mas->r_min) && (mas->last < wr_mas->r_max)) {
-               if (new_end < node_pivots)
-                       wr_mas->pivots[new_end] = wr_mas->pivots[end];
-
+               mas->offset = new_end;
+       } else {
+               /* Append to start of range */
                rcu_assign_pointer(wr_mas->slots[new_end], wr_mas->content);
-               if (new_end < node_pivots)
-                       ma_set_meta(wr_mas->node, maple_leaf_64, 0, new_end);
-
                wr_mas->pivots[end] = mas->last;
                rcu_assign_pointer(wr_mas->slots[end], wr_mas->entry);
-               return true;
        }
 
-       return false;
+       if (!wr_mas->content || !wr_mas->entry)
+               mas_update_gap(mas);
+
+       return true;
 }
 
 /*
@@ -4360,9 +4317,8 @@ static void mas_wr_bnode(struct ma_wr_state *wr_mas)
 
 static inline void mas_wr_modify(struct ma_wr_state *wr_mas)
 {
-       unsigned char node_slots;
-       unsigned char node_size;
        struct ma_state *mas = wr_mas->mas;
+       unsigned char new_end;
 
        /* Direct replacement */
        if (wr_mas->r_min == mas->index && wr_mas->r_max == mas->last) {
@@ -4372,26 +4328,22 @@ static inline void mas_wr_modify(struct ma_wr_state *wr_mas)
                return;
        }
 
-       /* Attempt to append */
-       node_slots = mt_slots[wr_mas->type];
-       node_size = wr_mas->node_end - wr_mas->offset_end + mas->offset + 2;
-       if (mas->max == ULONG_MAX)
-               node_size++;
-
-       /* slot and node store will not fit, go to the slow path */
-       if (unlikely(node_size >= node_slots))
+       /*
+        * If new_end exceeds the size of the maple node, the store cannot
+        * take the fast path.
+        */
+       new_end = mas_wr_new_end(wr_mas);
+       if (new_end >= mt_slots[wr_mas->type])
                goto slow_path;
 
-       if (wr_mas->entry && (wr_mas->node_end < node_slots - 1) &&
-           (mas->offset == wr_mas->node_end) && mas_wr_append(wr_mas)) {
-               if (!wr_mas->content || !wr_mas->entry)
-                       mas_update_gap(mas);
+       /* Attempt to append */
+       if (new_end == wr_mas->node_end + 1 && mas_wr_append(wr_mas))
                return;
-       }
 
-       if ((wr_mas->offset_end - mas->offset <= 1) && mas_wr_slot_store(wr_mas))
+       if (new_end == wr_mas->node_end && mas_wr_slot_store(wr_mas))
                return;
-       else if (mas_wr_node_store(wr_mas))
+
+       if (mas_wr_node_store(wr_mas, new_end))
                return;
 
        if (mas_is_err(mas))
@@ -4424,7 +4376,6 @@ static inline void *mas_wr_store_entry(struct ma_wr_state *wr_mas)
        }
 
        /* At this point, we are at the leaf node that needs to be altered. */
-       wr_mas->end_piv = wr_mas->r_max;
        mas_wr_end_piv(wr_mas);
 
        if (!wr_mas->entry)
@@ -4498,6 +4449,25 @@ exists:
 
 }
 
+static inline void mas_rewalk(struct ma_state *mas, unsigned long index)
+{
+retry:
+       mas_set(mas, index);
+       mas_state_walk(mas);
+       if (mas_is_start(mas))
+               goto retry;
+}
+
+static inline bool mas_rewalk_if_dead(struct ma_state *mas,
+               struct maple_node *node, const unsigned long index)
+{
+       if (unlikely(ma_dead_node(node))) {
+               mas_rewalk(mas, index);
+               return true;
+       }
+       return false;
+}
+
 /*
  * mas_prev_node() - Find the prev non-null entry at the same level in the
  * tree.  The prev value will be mas->node[mas->offset] or MAS_NONE.
@@ -4513,15 +4483,19 @@ static inline int mas_prev_node(struct ma_state *mas, unsigned long min)
        int offset, level;
        void __rcu **slots;
        struct maple_node *node;
-       struct maple_enode *enode;
        unsigned long *pivots;
+       unsigned long max;
 
-       if (mas_is_none(mas))
-               return 0;
+       node = mas_mn(mas);
+       if (!mas->min)
+               goto no_entry;
+
+       max = mas->min - 1;
+       if (max < min)
+               goto no_entry;
 
        level = 0;
        do {
-               node = mas_mn(mas);
                if (ma_is_root(node))
                        goto no_entry;
 
@@ -4530,64 +4504,41 @@ static inline int mas_prev_node(struct ma_state *mas, unsigned long min)
                        return 1;
                offset = mas->offset;
                level++;
+               node = mas_mn(mas);
        } while (!offset);
 
        offset--;
        mt = mte_node_type(mas->node);
-       node = mas_mn(mas);
-       slots = ma_slots(node, mt);
-       pivots = ma_pivots(node, mt);
-       if (unlikely(ma_dead_node(node)))
-               return 1;
-
-       mas->max = pivots[offset];
-       if (offset)
-               mas->min = pivots[offset - 1] + 1;
-       if (unlikely(ma_dead_node(node)))
-               return 1;
-
-       if (mas->max < min)
-               goto no_entry_min;
-
        while (level > 1) {
                level--;
-               enode = mas_slot(mas, slots, offset);
+               slots = ma_slots(node, mt);
+               mas->node = mas_slot(mas, slots, offset);
                if (unlikely(ma_dead_node(node)))
                        return 1;
 
-               mas->node = enode;
                mt = mte_node_type(mas->node);
                node = mas_mn(mas);
-               slots = ma_slots(node, mt);
                pivots = ma_pivots(node, mt);
-               offset = ma_data_end(node, mt, pivots, mas->max);
+               offset = ma_data_end(node, mt, pivots, max);
                if (unlikely(ma_dead_node(node)))
                        return 1;
-
-               if (offset)
-                       mas->min = pivots[offset - 1] + 1;
-
-               if (offset < mt_pivots[mt])
-                       mas->max = pivots[offset];
-
-               if (mas->max < min)
-                       goto no_entry;
        }
 
+       slots = ma_slots(node, mt);
        mas->node = mas_slot(mas, slots, offset);
+       pivots = ma_pivots(node, mt);
        if (unlikely(ma_dead_node(node)))
                return 1;
 
+       if (likely(offset))
+               mas->min = pivots[offset - 1] + 1;
+       mas->max = max;
        mas->offset = mas_data_end(mas);
        if (unlikely(mte_dead_node(mas->node)))
                return 1;
 
        return 0;
 
-no_entry_min:
-       mas->offset = offset;
-       if (offset)
-               mas->min = pivots[offset - 1] + 1;
 no_entry:
        if (unlikely(ma_dead_node(node)))
                return 1;
@@ -4597,6 +4548,76 @@ no_entry:
 }
 
 /*
+ * mas_prev_slot() - Get the entry in the previous slot
+ *
+ * @mas: The maple state
+ * @min: The minimum starting range
+ * @empty: Can be empty
+ *
+ * Return: The entry in the previous slot, which may be NULL
+ */
+static void *mas_prev_slot(struct ma_state *mas, unsigned long min, bool empty)
+{
+       void *entry;
+       void __rcu **slots;
+       unsigned long pivot;
+       enum maple_type type;
+       unsigned long *pivots;
+       struct maple_node *node;
+       unsigned long save_point = mas->index;
+
+retry:
+       node = mas_mn(mas);
+       type = mte_node_type(mas->node);
+       pivots = ma_pivots(node, type);
+       if (unlikely(mas_rewalk_if_dead(mas, node, save_point)))
+               goto retry;
+
+again:
+       if (mas->min <= min) {
+               pivot = mas_safe_min(mas, pivots, mas->offset);
+
+               if (unlikely(mas_rewalk_if_dead(mas, node, save_point)))
+                       goto retry;
+
+               if (pivot <= min)
+                       return NULL;
+       }
+
+       if (likely(mas->offset)) {
+               mas->offset--;
+               mas->last = mas->index - 1;
+               mas->index = mas_safe_min(mas, pivots, mas->offset);
+       } else  {
+               if (mas_prev_node(mas, min)) {
+                       mas_rewalk(mas, save_point);
+                       goto retry;
+               }
+
+               if (mas_is_none(mas))
+                       return NULL;
+
+               mas->last = mas->max;
+               node = mas_mn(mas);
+               type = mte_node_type(mas->node);
+               pivots = ma_pivots(node, type);
+               mas->index = pivots[mas->offset - 1] + 1;
+       }
+
+       slots = ma_slots(node, type);
+       entry = mas_slot(mas, slots, mas->offset);
+       if (unlikely(mas_rewalk_if_dead(mas, node, save_point)))
+               goto retry;
+
+       if (likely(entry))
+               return entry;
+
+       if (!empty)
+               goto again;
+
+       return entry;
+}
+
+/*
  * mas_next_node() - Get the next node at the same level in the tree.
  * @mas: The maple state
  * @max: The maximum pivot value to check.
@@ -4607,11 +4628,10 @@ no_entry:
 static inline int mas_next_node(struct ma_state *mas, struct maple_node *node,
                                unsigned long max)
 {
-       unsigned long min, pivot;
+       unsigned long min;
        unsigned long *pivots;
        struct maple_enode *enode;
        int level = 0;
-       unsigned char offset;
        unsigned char node_end;
        enum maple_type mt;
        void __rcu **slots;
@@ -4619,19 +4639,16 @@ static inline int mas_next_node(struct ma_state *mas, struct maple_node *node,
        if (mas->max >= max)
                goto no_entry;
 
+       min = mas->max + 1;
        level = 0;
        do {
                if (ma_is_root(node))
                        goto no_entry;
 
-               min = mas->max + 1;
-               if (min > max)
-                       goto no_entry;
-
+               /* Walk up. */
                if (unlikely(mas_ascend(mas)))
                        return 1;
 
-               offset = mas->offset;
                level++;
                node = mas_mn(mas);
                mt = mte_node_type(mas->node);
@@ -4640,36 +4657,37 @@ static inline int mas_next_node(struct ma_state *mas, struct maple_node *node,
                if (unlikely(ma_dead_node(node)))
                        return 1;
 
-       } while (unlikely(offset == node_end));
+       } while (unlikely(mas->offset == node_end));
 
        slots = ma_slots(node, mt);
-       pivot = mas_safe_pivot(mas, pivots, ++offset, mt);
-       while (unlikely(level > 1)) {
-               /* Descend, if necessary */
-               enode = mas_slot(mas, slots, offset);
-               if (unlikely(ma_dead_node(node)))
-                       return 1;
+       mas->offset++;
+       enode = mas_slot(mas, slots, mas->offset);
+       if (unlikely(ma_dead_node(node)))
+               return 1;
 
-               mas->node = enode;
+       if (level > 1)
+               mas->offset = 0;
+
+       while (unlikely(level > 1)) {
                level--;
+               mas->node = enode;
                node = mas_mn(mas);
                mt = mte_node_type(mas->node);
                slots = ma_slots(node, mt);
-               pivots = ma_pivots(node, mt);
+               enode = mas_slot(mas, slots, 0);
                if (unlikely(ma_dead_node(node)))
                        return 1;
-
-               offset = 0;
-               pivot = pivots[0];
        }
 
-       enode = mas_slot(mas, slots, offset);
+       if (!mas->offset)
+               pivots = ma_pivots(node, mt);
+
+       mas->max = mas_safe_pivot(mas, pivots, mas->offset, mt);
        if (unlikely(ma_dead_node(node)))
                return 1;
 
        mas->node = enode;
        mas->min = min;
-       mas->max = pivot;
        return 0;
 
 no_entry:
@@ -4681,92 +4699,88 @@ no_entry:
 }
 
 /*
- * mas_next_nentry() - Get the next node entry
- * @mas: The maple state
- * @max: The maximum value to check
- * @*range_start: Pointer to store the start of the range.
+ * mas_next_slot() - Get the entry in the next slot
  *
- * Sets @mas->offset to the offset of the next node entry, @mas->last to the
- * pivot of the entry.
+ * @mas: The maple state
+ * @max: The maximum starting range
+ * @empty: Can be empty
  *
- * Return: The next entry, %NULL otherwise
+ * Return: The entry in the next slot which is possibly NULL
  */
-static inline void *mas_next_nentry(struct ma_state *mas,
-           struct maple_node *node, unsigned long max, enum maple_type type)
+static void *mas_next_slot(struct ma_state *mas, unsigned long max, bool empty)
 {
-       unsigned char count;
-       unsigned long pivot;
-       unsigned long *pivots;
        void __rcu **slots;
+       unsigned long *pivots;
+       unsigned long pivot;
+       enum maple_type type;
+       struct maple_node *node;
+       unsigned char data_end;
+       unsigned long save_point = mas->last;
        void *entry;
 
-       if (mas->last == mas->max) {
-               mas->index = mas->max;
-               return NULL;
-       }
-
-       slots = ma_slots(node, type);
+retry:
+       node = mas_mn(mas);
+       type = mte_node_type(mas->node);
        pivots = ma_pivots(node, type);
-       count = ma_data_end(node, type, pivots, mas->max);
-       if (unlikely(ma_dead_node(node)))
-               return NULL;
+       data_end = ma_data_end(node, type, pivots, mas->max);
+       if (unlikely(mas_rewalk_if_dead(mas, node, save_point)))
+               goto retry;
 
-       mas->index = mas_safe_min(mas, pivots, mas->offset);
-       if (unlikely(ma_dead_node(node)))
-               return NULL;
+again:
+       if (mas->max >= max) {
+               if (likely(mas->offset < data_end))
+                       pivot = pivots[mas->offset];
+               else
+                       return NULL; /* must be mas->max */
 
-       if (mas->index > max)
-               return NULL;
-
-       if (mas->offset > count)
-               return NULL;
-
-       while (mas->offset < count) {
-               pivot = pivots[mas->offset];
-               entry = mas_slot(mas, slots, mas->offset);
-               if (ma_dead_node(node))
-                       return NULL;
-
-               if (entry)
-                       goto found;
+               if (unlikely(mas_rewalk_if_dead(mas, node, save_point)))
+                       goto retry;
 
                if (pivot >= max)
                        return NULL;
+       }
 
-               mas->index = pivot + 1;
+       if (likely(mas->offset < data_end)) {
+               mas->index = pivots[mas->offset] + 1;
                mas->offset++;
-       }
+               if (likely(mas->offset < data_end))
+                       mas->last = pivots[mas->offset];
+               else
+                       mas->last = mas->max;
+       } else  {
+               if (mas_next_node(mas, node, max)) {
+                       mas_rewalk(mas, save_point);
+                       goto retry;
+               }
 
-       if (mas->index > mas->max) {
-               mas->index = mas->last;
-               return NULL;
+               if (mas_is_none(mas))
+                       return NULL;
+
+               mas->offset = 0;
+               mas->index = mas->min;
+               node = mas_mn(mas);
+               type = mte_node_type(mas->node);
+               pivots = ma_pivots(node, type);
+               mas->last = pivots[0];
        }
 
-       pivot = mas_safe_pivot(mas, pivots, mas->offset, type);
-       entry = mas_slot(mas, slots, mas->offset);
-       if (ma_dead_node(node))
-               return NULL;
+       slots = ma_slots(node, type);
+       entry = mt_slot(mas->tree, slots, mas->offset);
+       if (unlikely(mas_rewalk_if_dead(mas, node, save_point)))
+               goto retry;
 
-       if (!pivot)
-               return NULL;
+       if (entry)
+               return entry;
 
-       if (!entry)
-               return NULL;
+       if (!empty) {
+               if (!mas->offset)
+                       data_end = 2;
+               goto again;
+       }
 
-found:
-       mas->last = pivot;
        return entry;
 }
 
-static inline void mas_rewalk(struct ma_state *mas, unsigned long index)
-{
-retry:
-       mas_set(mas, index);
-       mas_state_walk(mas);
-       if (mas_is_start(mas))
-               goto retry;
-}
-
 /*
  * mas_next_entry() - Internal function to get the next entry.
  * @mas: The maple state
@@ -4781,155 +4795,10 @@ retry:
  */
 static inline void *mas_next_entry(struct ma_state *mas, unsigned long limit)
 {
-       void *entry = NULL;
-       struct maple_enode *prev_node;
-       struct maple_node *node;
-       unsigned char offset;
-       unsigned long last;
-       enum maple_type mt;
-
-       if (mas->index > limit) {
-               mas->index = mas->last = limit;
-               mas_pause(mas);
+       if (mas->last >= limit)
                return NULL;
-       }
-       last = mas->last;
-retry:
-       offset = mas->offset;
-       prev_node = mas->node;
-       node = mas_mn(mas);
-       mt = mte_node_type(mas->node);
-       mas->offset++;
-       if (unlikely(mas->offset >= mt_slots[mt])) {
-               mas->offset = mt_slots[mt] - 1;
-               goto next_node;
-       }
-
-       while (!mas_is_none(mas)) {
-               entry = mas_next_nentry(mas, node, limit, mt);
-               if (unlikely(ma_dead_node(node))) {
-                       mas_rewalk(mas, last);
-                       goto retry;
-               }
-
-               if (likely(entry))
-                       return entry;
-
-               if (unlikely((mas->index > limit)))
-                       break;
-
-next_node:
-               prev_node = mas->node;
-               offset = mas->offset;
-               if (unlikely(mas_next_node(mas, node, limit))) {
-                       mas_rewalk(mas, last);
-                       goto retry;
-               }
-               mas->offset = 0;
-               node = mas_mn(mas);
-               mt = mte_node_type(mas->node);
-       }
 
-       mas->index = mas->last = limit;
-       mas->offset = offset;
-       mas->node = prev_node;
-       return NULL;
-}
-
-/*
- * mas_prev_nentry() - Get the previous node entry.
- * @mas: The maple state.
- * @limit: The lower limit to check for a value.
- *
- * Return: the entry, %NULL otherwise.
- */
-static inline void *mas_prev_nentry(struct ma_state *mas, unsigned long limit,
-                                   unsigned long index)
-{
-       unsigned long pivot, min;
-       unsigned char offset;
-       struct maple_node *mn;
-       enum maple_type mt;
-       unsigned long *pivots;
-       void __rcu **slots;
-       void *entry;
-
-retry:
-       if (!mas->offset)
-               return NULL;
-
-       mn = mas_mn(mas);
-       mt = mte_node_type(mas->node);
-       offset = mas->offset - 1;
-       if (offset >= mt_slots[mt])
-               offset = mt_slots[mt] - 1;
-
-       slots = ma_slots(mn, mt);
-       pivots = ma_pivots(mn, mt);
-       if (unlikely(ma_dead_node(mn))) {
-               mas_rewalk(mas, index);
-               goto retry;
-       }
-
-       if (offset == mt_pivots[mt])
-               pivot = mas->max;
-       else
-               pivot = pivots[offset];
-
-       if (unlikely(ma_dead_node(mn))) {
-               mas_rewalk(mas, index);
-               goto retry;
-       }
-
-       while (offset && ((!mas_slot(mas, slots, offset) && pivot >= limit) ||
-              !pivot))
-               pivot = pivots[--offset];
-
-       min = mas_safe_min(mas, pivots, offset);
-       entry = mas_slot(mas, slots, offset);
-       if (unlikely(ma_dead_node(mn))) {
-               mas_rewalk(mas, index);
-               goto retry;
-       }
-
-       if (likely(entry)) {
-               mas->offset = offset;
-               mas->last = pivot;
-               mas->index = min;
-       }
-       return entry;
-}
-
-static inline void *mas_prev_entry(struct ma_state *mas, unsigned long min)
-{
-       void *entry;
-
-       if (mas->index < min) {
-               mas->index = mas->last = min;
-               mas->node = MAS_NONE;
-               return NULL;
-       }
-retry:
-       while (likely(!mas_is_none(mas))) {
-               entry = mas_prev_nentry(mas, min, mas->index);
-               if (unlikely(mas->last < min))
-                       goto not_found;
-
-               if (likely(entry))
-                       return entry;
-
-               if (unlikely(mas_prev_node(mas, min))) {
-                       mas_rewalk(mas, mas->index);
-                       goto retry;
-               }
-
-               mas->offset++;
-       }
-
-       mas->offset--;
-not_found:
-       mas->index = mas->last = min;
-       return NULL;
+       return mas_next_slot(mas, limit, false);
 }
 
 /*
@@ -5105,24 +4974,25 @@ void *mas_walk(struct ma_state *mas)
 {
        void *entry;
 
+       if (mas_is_none(mas) || mas_is_paused(mas) || mas_is_ptr(mas))
+               mas->node = MAS_START;
 retry:
        entry = mas_state_walk(mas);
-       if (mas_is_start(mas))
+       if (mas_is_start(mas)) {
                goto retry;
-
-       if (mas_is_ptr(mas)) {
+       } else if (mas_is_none(mas)) {
+               mas->index = 0;
+               mas->last = ULONG_MAX;
+       } else if (mas_is_ptr(mas)) {
                if (!mas->index) {
                        mas->last = 0;
-               } else {
-                       mas->index = 1;
-                       mas->last = ULONG_MAX;
+                       return entry;
                }
-               return entry;
-       }
 
-       if (mas_is_none(mas)) {
-               mas->index = 0;
+               mas->index = 1;
                mas->last = ULONG_MAX;
+               mas->node = MAS_NONE;
+               return NULL;
        }
 
        return entry;
@@ -5202,46 +5072,6 @@ static inline void mas_awalk(struct ma_state *mas, unsigned long size)
 }
 
 /*
- * mas_fill_gap() - Fill a located gap with @entry.
- * @mas: The maple state
- * @entry: The value to store
- * @slot: The offset into the node to store the @entry
- * @size: The size of the entry
- * @index: The start location
- */
-static inline void mas_fill_gap(struct ma_state *mas, void *entry,
-               unsigned char slot, unsigned long size, unsigned long *index)
-{
-       MA_WR_STATE(wr_mas, mas, entry);
-       unsigned char pslot = mte_parent_slot(mas->node);
-       struct maple_enode *mn = mas->node;
-       unsigned long *pivots;
-       enum maple_type ptype;
-       /*
-        * mas->index is the start address for the search
-        *  which may no longer be needed.
-        * mas->last is the end address for the search
-        */
-
-       *index = mas->index;
-       mas->last = mas->index + size - 1;
-
-       /*
-        * It is possible that using mas->max and mas->min to correctly
-        * calculate the index and last will cause an issue in the gap
-        * calculation, so fix the ma_state here
-        */
-       mas_ascend(mas);
-       ptype = mte_node_type(mas->node);
-       pivots = ma_pivots(mas_mn(mas), ptype);
-       mas->max = mas_safe_pivot(mas, pivots, pslot, ptype);
-       mas->min = mas_safe_min(mas, pivots, pslot);
-       mas->node = mn;
-       mas->offset = slot;
-       mas_wr_store_entry(&wr_mas);
-}
-
-/*
  * mas_sparse_area() - Internal function.  Return upper or lower limit when
  * searching for a gap in an empty tree.
  * @mas: The maple state
@@ -5289,7 +5119,10 @@ int mas_empty_area(struct ma_state *mas, unsigned long min,
        unsigned long *pivots;
        enum maple_type mt;
 
-       if (min >= max)
+       if (min > max)
+               return -EINVAL;
+
+       if (size == 0 || max - min < size - 1)
                return -EINVAL;
 
        if (mas_is_start(mas))
@@ -5338,7 +5171,10 @@ int mas_empty_area_rev(struct ma_state *mas, unsigned long min,
 {
        struct maple_enode *last = mas->node;
 
-       if (min >= max)
+       if (min > max)
+               return -EINVAL;
+
+       if (size == 0 || max - min < size - 1)
                return -EINVAL;
 
        if (mas_is_start(mas)) {
@@ -5374,7 +5210,7 @@ int mas_empty_area_rev(struct ma_state *mas, unsigned long min,
                return -EBUSY;
 
        /* Trim the upper limit to the max. */
-       if (max <= mas->last)
+       if (max < mas->last)
                mas->last = max;
 
        mas->index = mas->last - size + 1;
@@ -5382,71 +5218,6 @@ int mas_empty_area_rev(struct ma_state *mas, unsigned long min,
 }
 EXPORT_SYMBOL_GPL(mas_empty_area_rev);
 
-static inline int mas_alloc(struct ma_state *mas, void *entry,
-               unsigned long size, unsigned long *index)
-{
-       unsigned long min;
-
-       mas_start(mas);
-       if (mas_is_none(mas) || mas_is_ptr(mas)) {
-               mas_root_expand(mas, entry);
-               if (mas_is_err(mas))
-                       return xa_err(mas->node);
-
-               if (!mas->index)
-                       return mte_pivot(mas->node, 0);
-               return mte_pivot(mas->node, 1);
-       }
-
-       /* Must be walking a tree. */
-       mas_awalk(mas, size);
-       if (mas_is_err(mas))
-               return xa_err(mas->node);
-
-       if (mas->offset == MAPLE_NODE_SLOTS)
-               goto no_gap;
-
-       /*
-        * At this point, mas->node points to the right node and we have an
-        * offset that has a sufficient gap.
-        */
-       min = mas->min;
-       if (mas->offset)
-               min = mte_pivot(mas->node, mas->offset - 1) + 1;
-
-       if (mas->index < min)
-               mas->index = min;
-
-       mas_fill_gap(mas, entry, mas->offset, size, index);
-       return 0;
-
-no_gap:
-       return -EBUSY;
-}
-
-static inline int mas_rev_alloc(struct ma_state *mas, unsigned long min,
-                               unsigned long max, void *entry,
-                               unsigned long size, unsigned long *index)
-{
-       int ret = 0;
-
-       ret = mas_empty_area_rev(mas, min, max, size);
-       if (ret)
-               return ret;
-
-       if (mas_is_err(mas))
-               return xa_err(mas->node);
-
-       if (mas->offset == MAPLE_NODE_SLOTS)
-               goto no_gap;
-
-       mas_fill_gap(mas, entry, mas->offset, size, index);
-       return 0;
-
-no_gap:
-       return -EBUSY;
-}
-
 /*
  * mte_dead_leaves() - Mark all leaves of a node as dead.
  * @mas: The maple state
@@ -5694,9 +5465,9 @@ void *mas_store(struct ma_state *mas, void *entry)
 
        trace_ma_write(__func__, mas, 0, entry);
 #ifdef CONFIG_DEBUG_MAPLE_TREE
-       if (mas->index > mas->last)
-               pr_err("Error %lu > %lu %p\n", mas->index, mas->last, entry);
-       MT_BUG_ON(mas->tree, mas->index > mas->last);
+       if (MAS_WARN_ON(mas, mas->index > mas->last))
+               pr_err("Error %lX > %lX %p\n", mas->index, mas->last, entry);
+
        if (mas->index > mas->last) {
                mas_set_err(mas, -EINVAL);
                return NULL;
@@ -5756,7 +5527,7 @@ void mas_store_prealloc(struct ma_state *mas, void *entry)
        mas_wr_store_setup(&wr_mas);
        trace_ma_write(__func__, mas, 0, entry);
        mas_wr_store_entry(&wr_mas);
-       BUG_ON(mas_is_err(mas));
+       MAS_WR_BUG_ON(&wr_mas, mas_is_err(mas));
        mas_destroy(mas);
 }
 EXPORT_SYMBOL_GPL(mas_store_prealloc);
@@ -5808,9 +5579,7 @@ void mas_destroy(struct ma_state *mas)
        if (mas->mas_flags & MA_STATE_REBALANCE) {
                unsigned char end;
 
-               if (mas_is_start(mas))
-                       mas_start(mas);
-
+               mas_start(mas);
                mtree_range_walk(mas);
                end = mas_data_end(mas) + 1;
                if (end < mt_min_slot_count(mas->node) - 1)
@@ -5900,6 +5669,34 @@ int mas_expected_entries(struct ma_state *mas, unsigned long nr_entries)
 }
 EXPORT_SYMBOL_GPL(mas_expected_entries);
 
+static inline bool mas_next_setup(struct ma_state *mas, unsigned long max,
+               void **entry)
+{
+       bool was_none = mas_is_none(mas);
+
+       if (mas_is_none(mas) || mas_is_paused(mas))
+               mas->node = MAS_START;
+
+       if (mas_is_start(mas))
+               *entry = mas_walk(mas); /* Retries on dead nodes handled by mas_walk */
+
+       if (mas_is_ptr(mas)) {
+               *entry = NULL;
+               if (was_none && mas->index == 0) {
+                       mas->index = mas->last = 0;
+                       return true;
+               }
+               mas->index = 1;
+               mas->last = ULONG_MAX;
+               mas->node = MAS_NONE;
+               return true;
+       }
+
+       if (mas_is_none(mas))
+               return true;
+       return false;
+}
+
 /**
  * mas_next() - Get the next entry.
  * @mas: The maple state
@@ -5913,92 +5710,144 @@ EXPORT_SYMBOL_GPL(mas_expected_entries);
  */
 void *mas_next(struct ma_state *mas, unsigned long max)
 {
+       void *entry = NULL;
+
+       if (mas_next_setup(mas, max, &entry))
+               return entry;
+
+       /* Retries on dead nodes handled by mas_next_slot */
+       return mas_next_slot(mas, max, false);
+}
+EXPORT_SYMBOL_GPL(mas_next);
+
+/**
+ * mas_next_range() - Advance the maple state to the next range
+ * @mas: The maple state
+ * @max: The maximum index to check.
+ *
+ * Sets @mas->index and @mas->last to the range.
+ * Must hold rcu_read_lock or the write lock.
+ * Can return the zero entry.
+ *
+ * Return: The next entry or %NULL
+ */
+void *mas_next_range(struct ma_state *mas, unsigned long max)
+{
+       void *entry = NULL;
+
+       if (mas_next_setup(mas, max, &entry))
+               return entry;
+
+       /* Retries on dead nodes handled by mas_next_slot */
+       return mas_next_slot(mas, max, true);
+}
+EXPORT_SYMBOL_GPL(mas_next_range);
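An illustrative sketch (not part of the patch; tree contents and caller are assumed) of the practical difference between mas_next() and the new mas_next_range(): the former keeps advancing until a non-NULL entry is found, the latter stops at the very next range and may return NULL for it, with mas.index/mas.last describing that range.

static void next_vs_next_range(struct maple_tree *mt)
{
	MA_STATE(mas, mt, 0, 0);
	void *entry;

	rcu_read_lock();
	/* Skips empty ranges; returns the next non-NULL entry. */
	entry = mas_next(&mas, ULONG_MAX);
	pr_info("next entry %p at %lx-%lx\n", entry, mas.index, mas.last);

	/* Advances exactly one range; NULL means that range is empty. */
	entry = mas_next_range(&mas, ULONG_MAX);
	pr_info("next range %lx-%lx holds %p\n", mas.index, mas.last, entry);
	rcu_read_unlock();
}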
+
+/**
+ * mt_next() - get the next value in the maple tree
+ * @mt: The maple tree
+ * @index: The start index
+ * @max: The maximum index to check
+ *
+ * Return: The entry at @index or higher, or %NULL if nothing is found.
+ */
+void *mt_next(struct maple_tree *mt, unsigned long index, unsigned long max)
+{
+       void *entry = NULL;
+       MA_STATE(mas, mt, index, index);
+
+       rcu_read_lock();
+       entry = mas_next(&mas, max);
+       rcu_read_unlock();
+       return entry;
+}
+EXPORT_SYMBOL_GPL(mt_next);
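A one-call sketch of the mt_next() wrapper (tree and index names are assumptions, not from the patch); it takes the RCU read lock internally, so no locking is needed at the call site.

/* Sketch: the next value in the tree after @index, or NULL if none up to the limit. */
void *next = mt_next(&example_tree, index, ULONG_MAX);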
+
+static inline bool mas_prev_setup(struct ma_state *mas, unsigned long min,
+               void **entry)
+{
+       if (mas->index <= min)
+               goto none;
+
        if (mas_is_none(mas) || mas_is_paused(mas))
                mas->node = MAS_START;
 
-       if (mas_is_start(mas))
-               mas_walk(mas); /* Retries on dead nodes handled by mas_walk */
+       if (mas_is_start(mas)) {
+               mas_walk(mas);
+               if (!mas->index)
+                       goto none;
+       }
+
+       if (unlikely(mas_is_ptr(mas))) {
+               if (!mas->index)
+                       goto none;
+               mas->index = mas->last = 0;
+               *entry = mas_root(mas);
+               return true;
+       }
 
-       if (mas_is_ptr(mas)) {
-               if (!mas->index) {
-                       mas->index = 1;
-                       mas->last = ULONG_MAX;
+       if (mas_is_none(mas)) {
+               if (mas->index) {
+                       /* Walked to out-of-range pointer? */
+                       mas->index = mas->last = 0;
+                       mas->node = MAS_ROOT;
+                       *entry = mas_root(mas);
+                       return true;
                }
-               return NULL;
+               return true;
        }
 
-       if (mas->last == ULONG_MAX)
-               return NULL;
+       return false;
 
-       /* Retries on dead nodes handled by mas_next_entry */
-       return mas_next_entry(mas, max);
+none:
+       mas->node = MAS_NONE;
+       return true;
 }
-EXPORT_SYMBOL_GPL(mas_next);
 
 /**
- * mt_next() - get the next value in the maple tree
- * @mt: The maple tree
- * @index: The start index
- * @max: The maximum index to check
+ * mas_prev() - Get the previous entry
+ * @mas: The maple state
+ * @min: The minimum value to check.
  *
- * Return: The entry at @index or higher, or %NULL if nothing is found.
+ * Must hold rcu_read_lock or the write lock.
+ * Will reset mas to MAS_START if the node is MAS_NONE.  Will stop on not
+ * searchable nodes.
+ *
+ * Return: the previous value or %NULL.
  */
-void *mt_next(struct maple_tree *mt, unsigned long index, unsigned long max)
+void *mas_prev(struct ma_state *mas, unsigned long min)
 {
        void *entry = NULL;
-       MA_STATE(mas, mt, index, index);
 
-       rcu_read_lock();
-       entry = mas_next(&mas, max);
-       rcu_read_unlock();
-       return entry;
+       if (mas_prev_setup(mas, min, &entry))
+               return entry;
+
+       return mas_prev_slot(mas, min, false);
 }
-EXPORT_SYMBOL_GPL(mt_next);
+EXPORT_SYMBOL_GPL(mas_prev);
 
 /**
- * mas_prev() - Get the previous entry
+ * mas_prev_range() - Advance to the previous range
  * @mas: The maple state
  * @min: The minimum value to check.
  *
+ * Sets @mas->index and @mas->last to the range.
  * Must hold rcu_read_lock or the write lock.
  * Will reset mas to MAS_START if the node is MAS_NONE.  Will stop on not
  * searchable nodes.
  *
  * Return: the previous value or %NULL.
  */
-void *mas_prev(struct ma_state *mas, unsigned long min)
+void *mas_prev_range(struct ma_state *mas, unsigned long min)
 {
-       if (!mas->index) {
-               /* Nothing comes before 0 */
-               mas->last = 0;
-               mas->node = MAS_NONE;
-               return NULL;
-       }
-
-       if (unlikely(mas_is_ptr(mas)))
-               return NULL;
-
-       if (mas_is_none(mas) || mas_is_paused(mas))
-               mas->node = MAS_START;
-
-       if (mas_is_start(mas)) {
-               mas_walk(mas);
-               if (!mas->index)
-                       return NULL;
-       }
+       void *entry = NULL;
 
-       if (mas_is_ptr(mas)) {
-               if (!mas->index) {
-                       mas->last = 0;
-                       return NULL;
-               }
+       if (mas_prev_setup(mas, min, &entry))
+               return entry;
 
-               mas->index = mas->last = 0;
-               return mas_root_locked(mas);
-       }
-       return mas_prev_entry(mas, min);
+       return mas_prev_slot(mas, min, true);
 }
-EXPORT_SYMBOL_GPL(mas_prev);
+EXPORT_SYMBOL_GPL(mas_prev_range);
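A sketch of a reverse lookup with mas_prev() and the new mas_prev_range() (caller and tree contents assumed, not from the patch); as with the forward variants, only the _range form reports empty ranges.

static void prev_vs_prev_range(struct maple_tree *mt, unsigned long start)
{
	MA_STATE(mas, mt, start, start);
	void *entry;

	rcu_read_lock();
	/* Previous non-NULL entry below @start, if any. */
	entry = mas_prev(&mas, 0);
	pr_info("previous entry %p at %lx-%lx\n", entry, mas.index, mas.last);

	/* Previous range, occupied or not; NULL means it is empty. */
	entry = mas_prev_range(&mas, 0);
	pr_info("previous range %lx-%lx holds %p\n", mas.index, mas.last, entry);
	rcu_read_unlock();
}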
 
 /**
  * mt_prev() - get the previous value in the maple tree
@@ -6040,6 +5889,64 @@ void mas_pause(struct ma_state *mas)
 EXPORT_SYMBOL_GPL(mas_pause);
 
 /**
+ * mas_find_setup() - Internal function to set up mas_find*().
+ * @mas: The maple state
+ * @max: The maximum index
+ * @entry: Pointer to the entry
+ *
+ * Return: True if entry is the answer, false otherwise.
+ */
+static inline bool mas_find_setup(struct ma_state *mas, unsigned long max,
+               void **entry)
+{
+       *entry = NULL;
+
+       if (unlikely(mas_is_none(mas))) {
+               if (unlikely(mas->last >= max))
+                       return true;
+
+               mas->index = mas->last;
+               mas->node = MAS_START;
+       } else if (unlikely(mas_is_paused(mas))) {
+               if (unlikely(mas->last >= max))
+                       return true;
+
+               mas->node = MAS_START;
+               mas->index = ++mas->last;
+       } else if (unlikely(mas_is_ptr(mas)))
+               goto ptr_out_of_range;
+
+       if (unlikely(mas_is_start(mas))) {
+               /* First run or continue */
+               if (mas->index > max)
+                       return true;
+
+               *entry = mas_walk(mas);
+               if (*entry)
+                       return true;
+
+       }
+
+       if (unlikely(!mas_searchable(mas))) {
+               if (unlikely(mas_is_ptr(mas)))
+                       goto ptr_out_of_range;
+
+               return true;
+       }
+
+       if (mas->index == max)
+               return true;
+
+       return false;
+
+ptr_out_of_range:
+       mas->node = MAS_NONE;
+       mas->index = 1;
+       mas->last = ULONG_MAX;
+       return true;
+}
+
+/**
  * mas_find() - On the first call, find the entry at or after mas->index up to
  * %max.  Otherwise, find the entry after mas->index.
  * @mas: The maple state
@@ -6053,37 +5960,105 @@ EXPORT_SYMBOL_GPL(mas_pause);
  */
 void *mas_find(struct ma_state *mas, unsigned long max)
 {
+       void *entry = NULL;
+
+       if (mas_find_setup(mas, max, &entry))
+               return entry;
+
+       /* Retries on dead nodes handled by mas_next_slot */
+       return mas_next_slot(mas, max, false);
+}
+EXPORT_SYMBOL_GPL(mas_find);
+
+/**
+ * mas_find_range() - On the first call, find the entry at or after
+ * mas->index up to %max.  Otherwise, advance to the next slot after mas->index.
+ * @mas: The maple state
+ * @max: The maximum value to check.
+ *
+ * Must hold rcu_read_lock or the write lock.
+ * If an entry exists, last and index are updated accordingly.
+ * May set @mas->node to MAS_NONE.
+ *
+ * Return: The entry or %NULL.
+ */
+void *mas_find_range(struct ma_state *mas, unsigned long max)
+{
+       void *entry;
+
+       if (mas_find_setup(mas, max, &entry))
+               return entry;
+
+       /* Retries on dead nodes handled by mas_next_slot */
+       return mas_next_slot(mas, max, true);
+}
+EXPORT_SYMBOL_GPL(mas_find_range);
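A sketch of the usual mas_find() loop (assumed caller, not from the patch); mas_find_range() is the variant for callers that also want to see the empty ranges in between, reported as NULL entries with mas.index/mas.last set.

static unsigned long count_entries(struct maple_tree *mt)
{
	MA_STATE(mas, mt, 0, 0);
	unsigned long nr = 0;
	void *entry;

	rcu_read_lock();
	/* Same pattern as the mas_for_each() helper macro. */
	while ((entry = mas_find(&mas, ULONG_MAX)) != NULL)
		nr++;
	rcu_read_unlock();

	return nr;
}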
+
+/**
+ * mas_find_rev_setup() - Internal function to set up mas_find_*_rev()
+ * @mas: The maple state
+ * @min: The minimum index
+ * @entry: Pointer to the entry
+ *
+ * Return: True if entry is the answer, false otherwise.
+ */
+static inline bool mas_find_rev_setup(struct ma_state *mas, unsigned long min,
+               void **entry)
+{
+       *entry = NULL;
+
+       if (unlikely(mas_is_none(mas))) {
+               if (mas->index <= min)
+                       goto none;
+
+               mas->last = mas->index;
+               mas->node = MAS_START;
+       }
+
        if (unlikely(mas_is_paused(mas))) {
-               if (unlikely(mas->last == ULONG_MAX)) {
+               if (unlikely(mas->index <= min)) {
                        mas->node = MAS_NONE;
-                       return NULL;
+                       return true;
                }
                mas->node = MAS_START;
-               mas->index = ++mas->last;
+               mas->last = --mas->index;
        }
 
-       if (unlikely(mas_is_none(mas)))
-               mas->node = MAS_START;
-
        if (unlikely(mas_is_start(mas))) {
                /* First run or continue */
-               void *entry;
+               if (mas->index < min)
+                       return true;
 
-               if (mas->index > max)
-                       return NULL;
+               *entry = mas_walk(mas);
+               if (*entry)
+                       return true;
+       }
 
-               entry = mas_walk(mas);
-               if (entry)
-                       return entry;
+       if (unlikely(!mas_searchable(mas))) {
+               if (mas_is_ptr(mas))
+                       goto none;
+
+               if (mas_is_none(mas)) {
+                       /*
+                        * Walked to the location, and there was nothing so the
+                        * previous location is 0.
+                        */
+                       mas->last = mas->index = 0;
+                       mas->node = MAS_ROOT;
+                       *entry = mas_root(mas);
+                       return true;
+               }
        }
 
-       if (unlikely(!mas_searchable(mas)))
-               return NULL;
+       if (mas->index < min)
+               return true;
 
-       /* Retries on dead nodes handled by mas_next_entry */
-       return mas_next_entry(mas, max);
+       return false;
+
+none:
+       mas->node = MAS_NONE;
+       return true;
 }
-EXPORT_SYMBOL_GPL(mas_find);
 
 /**
  * mas_find_rev: On the first call, find the first non-null entry at or below
@@ -6100,37 +6075,41 @@ EXPORT_SYMBOL_GPL(mas_find);
  */
 void *mas_find_rev(struct ma_state *mas, unsigned long min)
 {
-       if (unlikely(mas_is_paused(mas))) {
-               if (unlikely(mas->last == ULONG_MAX)) {
-                       mas->node = MAS_NONE;
-                       return NULL;
-               }
-               mas->node = MAS_START;
-               mas->last = --mas->index;
-       }
+       void *entry;
 
-       if (unlikely(mas_is_start(mas))) {
-               /* First run or continue */
-               void *entry;
+       if (mas_find_rev_setup(mas, min, &entry))
+               return entry;
 
-               if (mas->index < min)
-                       return NULL;
+       /* Retries on dead nodes handled by mas_prev_slot */
+       return mas_prev_slot(mas, min, false);
 
-               entry = mas_walk(mas);
-               if (entry)
-                       return entry;
-       }
+}
+EXPORT_SYMBOL_GPL(mas_find_rev);
 
-       if (unlikely(!mas_searchable(mas)))
-               return NULL;
+/**
+ * mas_find_range_rev: On the first call, find the first non-null entry at or
+ * below mas->index down to %min.  Otherwise advance to the previous slot
+ * before mas->index down to %min.
+ * @mas: The maple state
+ * @min: The minimum value to check.
+ *
+ * Must hold rcu_read_lock or the write lock.
+ * If an entry exists, last and index are updated accordingly.
+ * May set @mas->node to MAS_NONE.
+ *
+ * Return: The entry or %NULL.
+ */
+void *mas_find_range_rev(struct ma_state *mas, unsigned long min)
+{
+       void *entry;
 
-       if (mas->index < min)
-               return NULL;
+       if (mas_find_rev_setup(mas, min, &entry))
+               return entry;
 
-       /* Retries on dead nodes handled by mas_prev_entry */
-       return mas_prev_entry(mas, min);
+       /* Retries on dead nodes handled by mas_prev_slot */
+       return mas_prev_slot(mas, min, true);
 }
-EXPORT_SYMBOL_GPL(mas_find_rev);
+EXPORT_SYMBOL_GPL(mas_find_range_rev);
 
 /**
  * mas_erase() - Find the range in which index resides and erase the entire
@@ -6176,7 +6155,7 @@ EXPORT_SYMBOL_GPL(mas_erase);
  * Return: true on allocation, false otherwise.
  */
 bool mas_nomem(struct ma_state *mas, gfp_t gfp)
-       __must_hold(mas->tree->lock)
+       __must_hold(mas->tree->ma_lock)
 {
        if (likely(mas->node != MA_ERROR(-ENOMEM))) {
                mas_destroy(mas);
@@ -6357,31 +6336,33 @@ int mtree_alloc_range(struct maple_tree *mt, unsigned long *startp,
 {
        int ret = 0;
 
-       MA_STATE(mas, mt, min, max - size);
+       MA_STATE(mas, mt, 0, 0);
        if (!mt_is_alloc(mt))
                return -EINVAL;
 
        if (WARN_ON_ONCE(mt_is_reserved(entry)))
                return -EINVAL;
 
-       if (min > max)
-               return -EINVAL;
-
-       if (max < size)
-               return -EINVAL;
-
-       if (!size)
-               return -EINVAL;
-
        mtree_lock(mt);
 retry:
-       mas.offset = 0;
-       mas.index = min;
-       mas.last = max - size;
-       ret = mas_alloc(&mas, entry, size, startp);
+       ret = mas_empty_area(&mas, min, max, size);
+       if (ret)
+               goto unlock;
+
+       mas_insert(&mas, entry);
+       /*
+        * mas_nomem() may release the lock, causing the allocated area
+        * to be unavailable, so try to allocate a free area again.
+        */
        if (mas_nomem(&mas, gfp))
                goto retry;
 
+       if (mas_is_err(&mas))
+               ret = xa_err(mas.node);
+       else
+               *startp = mas.index;
+
+unlock:
        mtree_unlock(mt);
        return ret;
 }
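A minimal caller sketch for the reworked allocator path (function and variable names are assumptions, not from the patch); the tree must have been initialised with MT_FLAGS_ALLOC_RANGE so that gaps are tracked.

static int reserve_window(struct maple_tree *mt, void *owner,
			  unsigned long *start)
{
	int ret;

	/* Find and claim a free range of 0x1000 indices within [0, 0xffffffff]. */
	ret = mtree_alloc_range(mt, start, owner, 0x1000, 0, 0xffffffff,
				GFP_KERNEL);
	if (ret)	/* -EBUSY: no gap large enough; -EINVAL: bad arguments */
		return ret;

	pr_info("reserved 0x1000 indices starting at %lx\n", *start);
	return 0;
}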
@@ -6393,28 +6374,33 @@ int mtree_alloc_rrange(struct maple_tree *mt, unsigned long *startp,
 {
        int ret = 0;
 
-       MA_STATE(mas, mt, min, max - size);
+       MA_STATE(mas, mt, 0, 0);
        if (!mt_is_alloc(mt))
                return -EINVAL;
 
        if (WARN_ON_ONCE(mt_is_reserved(entry)))
                return -EINVAL;
 
-       if (min >= max)
-               return -EINVAL;
-
-       if (max < size - 1)
-               return -EINVAL;
-
-       if (!size)
-               return -EINVAL;
-
        mtree_lock(mt);
 retry:
-       ret = mas_rev_alloc(&mas, min, max, entry, size, startp);
+       ret = mas_empty_area_rev(&mas, min, max, size);
+       if (ret)
+               goto unlock;
+
+       mas_insert(&mas, entry);
+       /*
+        * mas_nomem() may release the lock, causing the allocated area
+        * to be unavailable, so try to allocate a free area again.
+        */
        if (mas_nomem(&mas, gfp))
                goto retry;
 
+       if (mas_is_err(&mas))
+               ret = xa_err(mas.node);
+       else
+               *startp = mas.index;
+
+unlock:
        mtree_unlock(mt);
        return ret;
 }
@@ -6512,7 +6498,7 @@ retry:
        if (entry)
                goto unlock;
 
-       while (mas_searchable(&mas) && (mas.index < max)) {
+       while (mas_searchable(&mas) && (mas.last < max)) {
                entry = mas_next_entry(&mas, max);
                if (likely(entry && !xa_is_zero(entry)))
                        break;
@@ -6525,10 +6511,9 @@ unlock:
        if (likely(entry)) {
                *index = mas.last + 1;
 #ifdef CONFIG_DEBUG_MAPLE_TREE
-               if ((*index) && (*index) <= copy)
+               if (MT_WARN_ON(mt, (*index) && ((*index) <= copy)))
                        pr_err("index not increased! %lx <= %lx\n",
                               *index, copy);
-               MT_BUG_ON(mt, (*index) && ((*index) <= copy));
 #endif
        }
 
@@ -6674,7 +6659,7 @@ static inline void *mas_first_entry(struct ma_state *mas, struct maple_node *mn,
        max = mas->max;
        mas->offset = 0;
        while (likely(!ma_is_leaf(mt))) {
-               MT_BUG_ON(mas->tree, mte_dead_node(mas->node));
+               MAS_WARN_ON(mas, mte_dead_node(mas->node));
                slots = ma_slots(mn, mt);
                entry = mas_slot(mas, slots, 0);
                pivots = ma_pivots(mn, mt);
@@ -6685,7 +6670,7 @@ static inline void *mas_first_entry(struct ma_state *mas, struct maple_node *mn,
                mn = mas_mn(mas);
                mt = mte_node_type(mas->node);
        }
-       MT_BUG_ON(mas->tree, mte_dead_node(mas->node));
+       MAS_WARN_ON(mas, mte_dead_node(mas->node));
 
        mas->max = max;
        slots = ma_slots(mn, mt);
@@ -6735,15 +6720,12 @@ static void mas_dfs_postorder(struct ma_state *mas, unsigned long max)
 
        mas->node = mn;
        mas_ascend(mas);
-       while (mas->node != MAS_NONE) {
+       do {
                p = mas->node;
                p_min = mas->min;
                p_max = mas->max;
                mas_prev_node(mas, 0);
-       }
-
-       if (p == MAS_NONE)
-               return;
+       } while (!mas_is_none(mas));
 
        mas->node = p;
        mas->max = p_max;
@@ -6752,22 +6734,33 @@ static void mas_dfs_postorder(struct ma_state *mas, unsigned long max)
 
 /* Tree validations */
 static void mt_dump_node(const struct maple_tree *mt, void *entry,
-               unsigned long min, unsigned long max, unsigned int depth);
+               unsigned long min, unsigned long max, unsigned int depth,
+               enum mt_dump_format format);
 static void mt_dump_range(unsigned long min, unsigned long max,
-                         unsigned int depth)
+                         unsigned int depth, enum mt_dump_format format)
 {
        static const char spaces[] = "                                ";
 
-       if (min == max)
-               pr_info("%.*s%lu: ", depth * 2, spaces, min);
-       else
-               pr_info("%.*s%lu-%lu: ", depth * 2, spaces, min, max);
+       switch(format) {
+       case mt_dump_hex:
+               if (min == max)
+                       pr_info("%.*s%lx: ", depth * 2, spaces, min);
+               else
+                       pr_info("%.*s%lx-%lx: ", depth * 2, spaces, min, max);
+               break;
+       default:
+       case mt_dump_dec:
+               if (min == max)
+                       pr_info("%.*s%lu: ", depth * 2, spaces, min);
+               else
+                       pr_info("%.*s%lu-%lu: ", depth * 2, spaces, min, max);
+       }
 }
 
 static void mt_dump_entry(void *entry, unsigned long min, unsigned long max,
-                         unsigned int depth)
+                         unsigned int depth, enum mt_dump_format format)
 {
-       mt_dump_range(min, max, depth);
+       mt_dump_range(min, max, depth, format);
 
        if (xa_is_value(entry))
                pr_cont("value %ld (0x%lx) [%p]\n", xa_to_value(entry),
@@ -6781,7 +6774,8 @@ static void mt_dump_entry(void *entry, unsigned long min, unsigned long max,
 }
 
 static void mt_dump_range64(const struct maple_tree *mt, void *entry,
-                       unsigned long min, unsigned long max, unsigned int depth)
+               unsigned long min, unsigned long max, unsigned int depth,
+               enum mt_dump_format format)
 {
        struct maple_range_64 *node = &mte_to_node(entry)->mr64;
        bool leaf = mte_is_leaf(entry);
@@ -6789,8 +6783,16 @@ static void mt_dump_range64(const struct maple_tree *mt, void *entry,
        int i;
 
        pr_cont(" contents: ");
-       for (i = 0; i < MAPLE_RANGE64_SLOTS - 1; i++)
-               pr_cont("%p %lu ", node->slot[i], node->pivot[i]);
+       for (i = 0; i < MAPLE_RANGE64_SLOTS - 1; i++) {
+               switch(format) {
+               case mt_dump_hex:
+                       pr_cont("%p %lX ", node->slot[i], node->pivot[i]);
+                       break;
+               default:
+               case mt_dump_dec:
+                       pr_cont("%p %lu ", node->slot[i], node->pivot[i]);
+               }
+       }
        pr_cont("%p\n", node->slot[i]);
        for (i = 0; i < MAPLE_RANGE64_SLOTS; i++) {
                unsigned long last = max;
@@ -6803,24 +6805,32 @@ static void mt_dump_range64(const struct maple_tree *mt, void *entry,
                        break;
                if (leaf)
                        mt_dump_entry(mt_slot(mt, node->slot, i),
-                                       first, last, depth + 1);
+                                       first, last, depth + 1, format);
                else if (node->slot[i])
                        mt_dump_node(mt, mt_slot(mt, node->slot, i),
-                                       first, last, depth + 1);
+                                       first, last, depth + 1, format);
 
                if (last == max)
                        break;
                if (last > max) {
-                       pr_err("node %p last (%lu) > max (%lu) at pivot %d!\n",
+                       switch(format) {
+                       case mt_dump_hex:
+                               pr_err("node %p last (%lx) > max (%lx) at pivot %d!\n",
                                        node, last, max, i);
-                       break;
+                               break;
+                       default:
+                       case mt_dump_dec:
+                               pr_err("node %p last (%lu) > max (%lu) at pivot %d!\n",
+                                       node, last, max, i);
+                       }
                }
                first = last + 1;
        }
 }
 
 static void mt_dump_arange64(const struct maple_tree *mt, void *entry,
-                       unsigned long min, unsigned long max, unsigned int depth)
+       unsigned long min, unsigned long max, unsigned int depth,
+       enum mt_dump_format format)
 {
        struct maple_arange_64 *node = &mte_to_node(entry)->ma64;
        bool leaf = mte_is_leaf(entry);
@@ -6845,10 +6855,10 @@ static void mt_dump_arange64(const struct maple_tree *mt, void *entry,
                        break;
                if (leaf)
                        mt_dump_entry(mt_slot(mt, node->slot, i),
-                                       first, last, depth + 1);
+                                       first, last, depth + 1, format);
                else if (node->slot[i])
                        mt_dump_node(mt, mt_slot(mt, node->slot, i),
-                                       first, last, depth + 1);
+                                       first, last, depth + 1, format);
 
                if (last == max)
                        break;
@@ -6862,13 +6872,14 @@ static void mt_dump_arange64(const struct maple_tree *mt, void *entry,
 }
 
 static void mt_dump_node(const struct maple_tree *mt, void *entry,
-               unsigned long min, unsigned long max, unsigned int depth)
+               unsigned long min, unsigned long max, unsigned int depth,
+               enum mt_dump_format format)
 {
        struct maple_node *node = mte_to_node(entry);
        unsigned int type = mte_node_type(entry);
        unsigned int i;
 
-       mt_dump_range(min, max, depth);
+       mt_dump_range(min, max, depth, format);
 
        pr_cont("node %p depth %d type %d parent %p", node, depth, type,
                        node ? node->parent : NULL);
@@ -6879,15 +6890,15 @@ static void mt_dump_node(const struct maple_tree *mt, void *entry,
                        if (min + i > max)
                                pr_cont("OUT OF RANGE: ");
                        mt_dump_entry(mt_slot(mt, node->slot, i),
-                                       min + i, min + i, depth);
+                                       min + i, min + i, depth, format);
                }
                break;
        case maple_leaf_64:
        case maple_range_64:
-               mt_dump_range64(mt, entry, min, max, depth);
+               mt_dump_range64(mt, entry, min, max, depth, format);
                break;
        case maple_arange_64:
-               mt_dump_arange64(mt, entry, min, max, depth);
+               mt_dump_arange64(mt, entry, min, max, depth, format);
                break;
 
        default:
@@ -6895,16 +6906,16 @@ static void mt_dump_node(const struct maple_tree *mt, void *entry,
        }
 }
 
-void mt_dump(const struct maple_tree *mt)
+void mt_dump(const struct maple_tree *mt, enum mt_dump_format format)
 {
        void *entry = rcu_dereference_check(mt->ma_root, mt_locked(mt));
 
        pr_info("maple_tree(%p) flags %X, height %u root %p\n",
                 mt, mt->ma_flags, mt_height(mt), entry);
        if (!xa_is_node(entry))
-               mt_dump_entry(entry, 0, 0, 0);
+               mt_dump_entry(entry, 0, 0, 0, format);
        else if (entry)
-               mt_dump_node(mt, entry, 0, mt_node_max(entry), 0);
+               mt_dump_node(mt, entry, 0, mt_node_max(entry), 0, format);
 }
 EXPORT_SYMBOL_GPL(mt_dump);
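With the new format argument, callers pick decimal or hexadecimal index output; a one-line sketch follows (tree name assumed, CONFIG_DEBUG_MAPLE_TREE required).

mt_dump(&example_tree, mt_dump_hex);	/* same dump as before, hex indices */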
 
@@ -6957,7 +6968,7 @@ static void mas_validate_gaps(struct ma_state *mas)
                                                mas_mn(mas), i,
                                                mas_get_slot(mas, i), gap,
                                                p_end, p_start);
-                                       mt_dump(mas->tree);
+                                       mt_dump(mas->tree, mt_dump_hex);
 
                                        MT_BUG_ON(mas->tree,
                                                gap != p_end - p_start + 1);
@@ -6988,27 +6999,29 @@ counted:
        p_slot = mte_parent_slot(mas->node);
        p_mn = mte_parent(mte);
        MT_BUG_ON(mas->tree, max_gap > mas->max);
-       if (ma_gaps(p_mn, mas_parent_enum(mas, mte))[p_slot] != max_gap) {
+       if (ma_gaps(p_mn, mas_parent_type(mas, mte))[p_slot] != max_gap) {
                pr_err("gap %p[%u] != %lu\n", p_mn, p_slot, max_gap);
-               mt_dump(mas->tree);
+               mt_dump(mas->tree, mt_dump_hex);
        }
 
        MT_BUG_ON(mas->tree,
-                 ma_gaps(p_mn, mas_parent_enum(mas, mte))[p_slot] != max_gap);
+                 ma_gaps(p_mn, mas_parent_type(mas, mte))[p_slot] != max_gap);
 }
 
 static void mas_validate_parent_slot(struct ma_state *mas)
 {
        struct maple_node *parent;
        struct maple_enode *node;
-       enum maple_type p_type = mas_parent_enum(mas, mas->node);
-       unsigned char p_slot = mte_parent_slot(mas->node);
+       enum maple_type p_type;
+       unsigned char p_slot;
        void __rcu **slots;
        int i;
 
        if (mte_is_root(mas->node))
                return;
 
+       p_slot = mte_parent_slot(mas->node);
+       p_type = mas_parent_type(mas, mas->node);
        parent = mte_parent(mas->node);
        slots = ma_slots(parent, p_type);
        MT_BUG_ON(mas->tree, mas_mn(mas) == parent);
@@ -7101,18 +7114,18 @@ static void mas_validate_limits(struct ma_state *mas)
                if (prev_piv > piv) {
                        pr_err("%p[%u] piv %lu < prev_piv %lu\n",
                                mas_mn(mas), i, piv, prev_piv);
-                       MT_BUG_ON(mas->tree, piv < prev_piv);
+                       MAS_WARN_ON(mas, piv < prev_piv);
                }
 
                if (piv < mas->min) {
                        pr_err("%p[%u] %lu < %lu\n", mas_mn(mas), i,
                                piv, mas->min);
-                       MT_BUG_ON(mas->tree, piv < mas->min);
+                       MAS_WARN_ON(mas, piv < mas->min);
                }
                if (piv > mas->max) {
                        pr_err("%p[%u] %lu > %lu\n", mas_mn(mas), i,
                                piv, mas->max);
-                       MT_BUG_ON(mas->tree, piv > mas->max);
+                       MAS_WARN_ON(mas, piv > mas->max);
                }
                prev_piv = piv;
                if (piv == mas->max)
@@ -7135,7 +7148,7 @@ static void mas_validate_limits(struct ma_state *mas)
 
                        pr_err("%p[%u] should not have piv %lu\n",
                               mas_mn(mas), i, piv);
-                       MT_BUG_ON(mas->tree, i < mt_pivots[type] - 1);
+                       MAS_WARN_ON(mas, i < mt_pivots[type] - 1);
                }
        }
 }
@@ -7194,16 +7207,15 @@ void mt_validate(struct maple_tree *mt)
 
        mas_first_entry(&mas, mas_mn(&mas), ULONG_MAX, mte_node_type(mas.node));
        while (!mas_is_none(&mas)) {
-               MT_BUG_ON(mas.tree, mte_dead_node(mas.node));
+               MAS_WARN_ON(&mas, mte_dead_node(mas.node));
                if (!mte_is_root(mas.node)) {
                        end = mas_data_end(&mas);
-                       if ((end < mt_min_slot_count(mas.node)) &&
-                           (mas.max != ULONG_MAX)) {
+                       if (MAS_WARN_ON(&mas,
+                                       (end < mt_min_slot_count(mas.node)) &&
+                                       (mas.max != ULONG_MAX))) {
                                pr_err("Invalid size %u of %p\n", end,
-                               mas_mn(&mas));
-                               MT_BUG_ON(mas.tree, 1);
+                                      mas_mn(&mas));
                        }
-
                }
                mas_validate_parent_slot(&mas);
                mas_validate_child_slot(&mas);
@@ -7219,4 +7231,34 @@ done:
 }
 EXPORT_SYMBOL_GPL(mt_validate);
 
+void mas_dump(const struct ma_state *mas)
+{
+       pr_err("MAS: tree=%p enode=%p ", mas->tree, mas->node);
+       if (mas_is_none(mas))
+               pr_err("(MAS_NONE) ");
+       else if (mas_is_ptr(mas))
+               pr_err("(MAS_ROOT) ");
+       else if (mas_is_start(mas))
+               pr_err("(MAS_START) ");
+       else if (mas_is_paused(mas))
+               pr_err("(MAS_PAUSED) ");
+
+       pr_err("[%u] index=%lx last=%lx\n", mas->offset, mas->index, mas->last);
+       pr_err("     min=%lx max=%lx alloc=%p, depth=%u, flags=%x\n",
+              mas->min, mas->max, mas->alloc, mas->depth, mas->mas_flags);
+       if (mas->index > mas->last)
+               pr_err("Check index & last\n");
+}
+EXPORT_SYMBOL_GPL(mas_dump);
+
+void mas_wr_dump(const struct ma_wr_state *wr_mas)
+{
+       pr_err("WR_MAS: node=%p r_min=%lx r_max=%lx\n",
+              wr_mas->node, wr_mas->r_min, wr_mas->r_max);
+       pr_err("        type=%u off_end=%u, node_end=%u, end_piv=%lx\n",
+              wr_mas->type, wr_mas->offset_end, wr_mas->node_end,
+              wr_mas->end_piv);
+}
+EXPORT_SYMBOL_GPL(mas_wr_dump);
+
 #endif /* CONFIG_DEBUG_MAPLE_TREE */
index dcd3ba1..34db0b3 100644 (file)
@@ -649,7 +649,7 @@ struct __test_flex_array {
 static void overflow_size_helpers_test(struct kunit *test)
 {
        /* Make sure struct_size() can be used in a constant expression. */
-       u8 ce_array[struct_size((struct __test_flex_array *)0, data, 55)];
+       u8 ce_array[struct_size_t(struct __test_flex_array, data, 55)];
        struct __test_flex_array *obj;
        int count = 0;
        int var;
diff --git a/lib/raid6/neon.h b/lib/raid6/neon.h
new file mode 100644 (file)
index 0000000..2ca41ee
--- /dev/null
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+void raid6_neon1_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs);
+void raid6_neon1_xor_syndrome_real(int disks, int start, int stop,
+                                   unsigned long bytes, void **ptrs);
+void raid6_neon2_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs);
+void raid6_neon2_xor_syndrome_real(int disks, int start, int stop,
+                                   unsigned long bytes, void **ptrs);
+void raid6_neon4_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs);
+void raid6_neon4_xor_syndrome_real(int disks, int start, int stop,
+                                   unsigned long bytes, void **ptrs);
+void raid6_neon8_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs);
+void raid6_neon8_xor_syndrome_real(int disks, int start, int stop,
+                                   unsigned long bytes, void **ptrs);
+void __raid6_2data_recov_neon(int bytes, uint8_t *p, uint8_t *q, uint8_t *dp,
+                             uint8_t *dq, const uint8_t *pbmul,
+                             const uint8_t *qmul);
+
+void __raid6_datap_recov_neon(int bytes, uint8_t *p, uint8_t *q, uint8_t *dq,
+                             const uint8_t *qmul);
+
+
index b7c6803..355270a 100644 (file)
@@ -25,6 +25,7 @@
  */
 
 #include <arm_neon.h>
+#include "neon.h"
 
 typedef uint8x16_t unative_t;
 
index d6fba8b..1bfc141 100644 (file)
@@ -8,6 +8,7 @@
 
 #ifdef __KERNEL__
 #include <asm/neon.h>
+#include "neon.h"
 #else
 #define kernel_neon_begin()
 #define kernel_neon_end()
@@ -19,13 +20,6 @@ static int raid6_has_neon(void)
        return cpu_has_neon();
 }
 
-void __raid6_2data_recov_neon(int bytes, uint8_t *p, uint8_t *q, uint8_t *dp,
-                             uint8_t *dq, const uint8_t *pbmul,
-                             const uint8_t *qmul);
-
-void __raid6_datap_recov_neon(int bytes, uint8_t *p, uint8_t *q, uint8_t *dq,
-                             const uint8_t *qmul);
-
 static void raid6_2data_recov_neon(int disks, size_t bytes, int faila,
                int failb, void **ptrs)
 {
index 90eb80d..f9e7e8f 100644 (file)
@@ -5,6 +5,7 @@
  */
 
 #include <arm_neon.h>
+#include "neon.h"
 
 #ifdef CONFIG_ARM
 /*
diff --git a/lib/show_mem.c b/lib/show_mem.c
deleted file mode 100644 (file)
index 1485c87..0000000
+++ /dev/null
@@ -1,37 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Generic show_mem() implementation
- *
- * Copyright (C) 2008 Johannes Weiner <hannes@saeurebad.de>
- */
-
-#include <linux/mm.h>
-#include <linux/cma.h>
-
-void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
-{
-       unsigned long total = 0, reserved = 0, highmem = 0;
-       struct zone *zone;
-
-       printk("Mem-Info:\n");
-       __show_free_areas(filter, nodemask, max_zone_idx);
-
-       for_each_populated_zone(zone) {
-
-               total += zone->present_pages;
-               reserved += zone->present_pages - zone_managed_pages(zone);
-
-               if (is_highmem(zone))
-                       highmem += zone->present_pages;
-       }
-
-       printk("%lu pages RAM\n", total);
-       printk("%lu pages HighMem/MovableOnly\n", highmem);
-       printk("%lu pages reserved\n", reserved);
-#ifdef CONFIG_CMA
-       printk("%lu pages cma reserved\n", totalcma_pages);
-#endif
-#ifdef CONFIG_MEMORY_FAILURE
-       printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
-#endif
-}
diff --git a/lib/strcat_kunit.c b/lib/strcat_kunit.c
new file mode 100644 (file)
index 0000000..e21be95
--- /dev/null
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Kernel module for testing 'strcat' family of functions.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <kunit/test.h>
+#include <linux/string.h>
+
+static volatile int unconst;
+
+static void strcat_test(struct kunit *test)
+{
+       char dest[8];
+
+       /* Destination is terminated. */
+       memset(dest, 0, sizeof(dest));
+       KUNIT_EXPECT_EQ(test, strlen(dest), 0);
+       /* Empty copy does nothing. */
+       KUNIT_EXPECT_TRUE(test, strcat(dest, "") == dest);
+       KUNIT_EXPECT_STREQ(test, dest, "");
+       /* 4 characters copied in, stops at %NUL. */
+       KUNIT_EXPECT_TRUE(test, strcat(dest, "four\000123") == dest);
+       KUNIT_EXPECT_STREQ(test, dest, "four");
+       KUNIT_EXPECT_EQ(test, dest[5], '\0');
+       /* 2 more characters copied in okay. */
+       KUNIT_EXPECT_TRUE(test, strcat(dest, "AB") == dest);
+       KUNIT_EXPECT_STREQ(test, dest, "fourAB");
+}
+
+static void strncat_test(struct kunit *test)
+{
+       char dest[8];
+
+       /* Destination is terminated. */
+       memset(dest, 0, sizeof(dest));
+       KUNIT_EXPECT_EQ(test, strlen(dest), 0);
+       /* Empty copy of size 0 does nothing. */
+       KUNIT_EXPECT_TRUE(test, strncat(dest, "", 0 + unconst) == dest);
+       KUNIT_EXPECT_STREQ(test, dest, "");
+       /* Empty copy of size 1 does nothing too. */
+       KUNIT_EXPECT_TRUE(test, strncat(dest, "", 1 + unconst) == dest);
+       KUNIT_EXPECT_STREQ(test, dest, "");
+       /* Copy of max 0 characters should do nothing. */
+       KUNIT_EXPECT_TRUE(test, strncat(dest, "asdf", 0 + unconst) == dest);
+       KUNIT_EXPECT_STREQ(test, dest, "");
+
+       /* 4 characters copied in, even if max is 8. */
+       KUNIT_EXPECT_TRUE(test, strncat(dest, "four\000123", 8 + unconst) == dest);
+       KUNIT_EXPECT_STREQ(test, dest, "four");
+       KUNIT_EXPECT_EQ(test, dest[5], '\0');
+       KUNIT_EXPECT_EQ(test, dest[6], '\0');
+       /* 2 characters copied in okay, 2 ignored. */
+       KUNIT_EXPECT_TRUE(test, strncat(dest, "ABCD", 2 + unconst) == dest);
+       KUNIT_EXPECT_STREQ(test, dest, "fourAB");
+}
+
+static void strlcat_test(struct kunit *test)
+{
+       char dest[8] = "";
+       int len = sizeof(dest) + unconst;
+
+       /* Destination is terminated. */
+       KUNIT_EXPECT_EQ(test, strlen(dest), 0);
+       /* Empty copy is size 0. */
+       KUNIT_EXPECT_EQ(test, strlcat(dest, "", len), 0);
+       KUNIT_EXPECT_STREQ(test, dest, "");
+       /* Size 1 should keep buffer terminated, report size of source only. */
+       KUNIT_EXPECT_EQ(test, strlcat(dest, "four", 1 + unconst), 4);
+       KUNIT_EXPECT_STREQ(test, dest, "");
+
+       /* 4 characters copied in. */
+       KUNIT_EXPECT_EQ(test, strlcat(dest, "four", len), 4);
+       KUNIT_EXPECT_STREQ(test, dest, "four");
+       /* 2 characters copied in okay, gets to 6 total. */
+       KUNIT_EXPECT_EQ(test, strlcat(dest, "AB", len), 6);
+       KUNIT_EXPECT_STREQ(test, dest, "fourAB");
+       /* 2 characters ignored if max size (7) reached. */
+       KUNIT_EXPECT_EQ(test, strlcat(dest, "CD", 7 + unconst), 8);
+       KUNIT_EXPECT_STREQ(test, dest, "fourAB");
+       /* 1 of 2 characters skipped, now at true max size. */
+       KUNIT_EXPECT_EQ(test, strlcat(dest, "EFG", len), 9);
+       KUNIT_EXPECT_STREQ(test, dest, "fourABE");
+       /* Everything else ignored, now at full size. */
+       KUNIT_EXPECT_EQ(test, strlcat(dest, "1234", len), 11);
+       KUNIT_EXPECT_STREQ(test, dest, "fourABE");
+}
+
+static struct kunit_case strcat_test_cases[] = {
+       KUNIT_CASE(strcat_test),
+       KUNIT_CASE(strncat_test),
+       KUNIT_CASE(strlcat_test),
+       {}
+};
+
+static struct kunit_suite strcat_test_suite = {
+       .name = "strcat",
+       .test_cases = strcat_test_cases,
+};
+
+kunit_test_suite(strcat_test_suite);
+
+MODULE_LICENSE("GPL");
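
The strlcat_test() cases above rely on strlcat()'s return convention, and the volatile "unconst" term added to the size arguments appears intended to keep the compiler from treating those sizes as compile-time constants. As a standalone illustration of the return convention (not part of the patch; the function name and values below are made up for this sketch, which assumes <linux/string.h>):

#include <linux/string.h>

/* Illustration only: strlcat() returns the length of the string it tried
 * to create, so a result >= the destination size means truncation.
 */
static void strlcat_return_example(void)
{
	char buf[8] = "fourAB";	/* 6 characters + NUL, as in the test */
	size_t want = strlcat(buf, "EFG", sizeof(buf));

	/* want == 9 (6 existing + 3 requested) while buf now holds
	 * "fourABE"; want >= sizeof(buf) tells the caller the copy
	 * was truncated.
	 */
	(void)want;
}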
index 3d55ef8..be26623 100644 (file)
@@ -110,7 +110,7 @@ size_t strlcpy(char *dest, const char *src, size_t size)
 
        if (size) {
                size_t len = (ret >= size) ? size - 1 : ret;
-               memcpy(dest, src, len);
+               __builtin_memcpy(dest, src, len);
                dest[len] = '\0';
        }
        return ret;
@@ -260,7 +260,7 @@ size_t strlcat(char *dest, const char *src, size_t count)
        count -= dsize;
        if (len >= count)
                len = count-1;
-       memcpy(dest, src, len);
+       __builtin_memcpy(dest, src, len);
        dest[len] = 0;
        return res;
 }
index 230020a..d3b1dd7 100644 (file)
@@ -979,18 +979,22 @@ EXPORT_SYMBOL(__sysfs_match_string);
 
 /**
  * strreplace - Replace all occurrences of character in string.
- * @s: The string to operate on.
+ * @str: The string to operate on.
  * @old: The character being replaced.
  * @new: The character @old is replaced with.
  *
- * Returns pointer to the nul byte at the end of @s.
+ * Replaces each @old character with a @new one in the given string @str.
+ *
+ * Return: pointer to the string @str itself.
  */
-char *strreplace(char *s, char old, char new)
+char *strreplace(char *str, char old, char new)
 {
+       char *s = str;
+
        for (; *s; ++s)
                if (*s == old)
                        *s = new;
-       return s;
+       return str;
 }
 EXPORT_SYMBOL(strreplace);
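
Worth noting for callers: the kernel-doc update above documents a behavioural change as well, since strreplace() now returns the start of the string rather than a pointer to its terminating NUL. A minimal usage sketch of what the new return value allows (illustration only; the wrapper function and string are made up):

#include <linux/printk.h>
#include <linux/string.h>

static void strreplace_example(void)
{
	char path[] = "a/b/c";

	/* With the new return value the call can be used inline: the whole
	 * modified string is printed, whereas the old end-of-string pointer
	 * would have printed nothing.
	 */
	pr_info("%s\n", strreplace(path, '/', '!'));	/* prints "a!b!c" */
}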
 
index f1db333..9939be3 100644 (file)
 #include <linux/module.h>
 
 #define MTREE_ALLOC_MAX 0x2000000000000Ul
-#ifndef CONFIG_DEBUG_MAPLE_TREE
-#define CONFIG_DEBUG_MAPLE_TREE
-#endif
 #define CONFIG_MAPLE_SEARCH
 #define MAPLE_32BIT (MAPLE_NODE_SLOTS > 31)
 
+#ifndef CONFIG_DEBUG_MAPLE_TREE
+#define mt_dump(mt, fmt)               do {} while (0)
+#define mt_validate(mt)                        do {} while (0)
+#define mt_cache_shrink()              do {} while (0)
+#define mas_dump(mas)                  do {} while (0)
+#define mas_wr_dump(mas)               do {} while (0)
+atomic_t maple_tree_tests_run;
+atomic_t maple_tree_tests_passed;
+#undef MT_BUG_ON
+
+#define MT_BUG_ON(__tree, __x) do {                                    \
+       atomic_inc(&maple_tree_tests_run);                              \
+       if (__x) {                                                      \
+               pr_info("BUG at %s:%d (%u)\n",                          \
+               __func__, __LINE__, __x);                               \
+               pr_info("Pass: %u Run:%u\n",                            \
+                       atomic_read(&maple_tree_tests_passed),          \
+                       atomic_read(&maple_tree_tests_run));            \
+       } else {                                                        \
+               atomic_inc(&maple_tree_tests_passed);                   \
+       }                                                               \
+} while (0)
+#endif
+
 /* #define BENCH_SLOT_STORE */
 /* #define BENCH_NODE_STORE */
 /* #define BENCH_AWALK */
 #else
 #define cond_resched()                 do {} while (0)
 #endif
-static
-int mtree_insert_index(struct maple_tree *mt, unsigned long index, gfp_t gfp)
+static int __init mtree_insert_index(struct maple_tree *mt,
+                                    unsigned long index, gfp_t gfp)
 {
        return mtree_insert(mt, index, xa_mk_value(index & LONG_MAX), gfp);
 }
 
-static void mtree_erase_index(struct maple_tree *mt, unsigned long index)
+static void __init mtree_erase_index(struct maple_tree *mt, unsigned long index)
 {
        MT_BUG_ON(mt, mtree_erase(mt, index) != xa_mk_value(index & LONG_MAX));
        MT_BUG_ON(mt, mtree_load(mt, index) != NULL);
 }
 
-static int mtree_test_insert(struct maple_tree *mt, unsigned long index,
+static int __init mtree_test_insert(struct maple_tree *mt, unsigned long index,
                                void *ptr)
 {
        return mtree_insert(mt, index, ptr, GFP_KERNEL);
 }
 
-static int mtree_test_store_range(struct maple_tree *mt, unsigned long start,
-                               unsigned long end, void *ptr)
+static int __init mtree_test_store_range(struct maple_tree *mt,
+                       unsigned long start, unsigned long end, void *ptr)
 {
        return mtree_store_range(mt, start, end, ptr, GFP_KERNEL);
 }
 
-static int mtree_test_store(struct maple_tree *mt, unsigned long start,
+static int __init mtree_test_store(struct maple_tree *mt, unsigned long start,
                                void *ptr)
 {
        return mtree_test_store_range(mt, start, start, ptr);
 }
 
-static int mtree_test_insert_range(struct maple_tree *mt, unsigned long start,
-                               unsigned long end, void *ptr)
+static int __init mtree_test_insert_range(struct maple_tree *mt,
+                       unsigned long start, unsigned long end, void *ptr)
 {
        return mtree_insert_range(mt, start, end, ptr, GFP_KERNEL);
 }
 
-static void *mtree_test_load(struct maple_tree *mt, unsigned long index)
+static void __init *mtree_test_load(struct maple_tree *mt, unsigned long index)
 {
        return mtree_load(mt, index);
 }
 
-static void *mtree_test_erase(struct maple_tree *mt, unsigned long index)
+static void __init *mtree_test_erase(struct maple_tree *mt, unsigned long index)
 {
        return mtree_erase(mt, index);
 }
 
 #if defined(CONFIG_64BIT)
-static noinline void check_mtree_alloc_range(struct maple_tree *mt,
+static noinline void __init check_mtree_alloc_range(struct maple_tree *mt,
                unsigned long start, unsigned long end, unsigned long size,
                unsigned long expected, int eret, void *ptr)
 {
@@ -94,7 +115,7 @@ static noinline void check_mtree_alloc_range(struct maple_tree *mt,
        MT_BUG_ON(mt, result != expected);
 }
 
-static noinline void check_mtree_alloc_rrange(struct maple_tree *mt,
+static noinline void __init check_mtree_alloc_rrange(struct maple_tree *mt,
                unsigned long start, unsigned long end, unsigned long size,
                unsigned long expected, int eret, void *ptr)
 {
@@ -102,7 +123,7 @@ static noinline void check_mtree_alloc_rrange(struct maple_tree *mt,
        unsigned long result = expected + 1;
        int ret;
 
-       ret = mtree_alloc_rrange(mt, &result, ptr, size, start, end - 1,
+       ret = mtree_alloc_rrange(mt, &result, ptr, size, start, end,
                        GFP_KERNEL);
        MT_BUG_ON(mt, ret != eret);
        if (ret)
@@ -112,8 +133,8 @@ static noinline void check_mtree_alloc_rrange(struct maple_tree *mt,
 }
 #endif
 
-static noinline void check_load(struct maple_tree *mt, unsigned long index,
-                               void *ptr)
+static noinline void __init check_load(struct maple_tree *mt,
+                                      unsigned long index, void *ptr)
 {
        void *ret = mtree_test_load(mt, index);
 
@@ -122,7 +143,7 @@ static noinline void check_load(struct maple_tree *mt, unsigned long index,
        MT_BUG_ON(mt, ret != ptr);
 }
 
-static noinline void check_store_range(struct maple_tree *mt,
+static noinline void __init check_store_range(struct maple_tree *mt,
                unsigned long start, unsigned long end, void *ptr, int expected)
 {
        int ret = -EINVAL;
@@ -138,7 +159,7 @@ static noinline void check_store_range(struct maple_tree *mt,
                check_load(mt, i, ptr);
 }
 
-static noinline void check_insert_range(struct maple_tree *mt,
+static noinline void __init check_insert_range(struct maple_tree *mt,
                unsigned long start, unsigned long end, void *ptr, int expected)
 {
        int ret = -EINVAL;
@@ -154,8 +175,8 @@ static noinline void check_insert_range(struct maple_tree *mt,
                check_load(mt, i, ptr);
 }
 
-static noinline void check_insert(struct maple_tree *mt, unsigned long index,
-               void *ptr)
+static noinline void __init check_insert(struct maple_tree *mt,
+                                        unsigned long index, void *ptr)
 {
        int ret = -EINVAL;
 
@@ -163,7 +184,7 @@ static noinline void check_insert(struct maple_tree *mt, unsigned long index,
        MT_BUG_ON(mt, ret != 0);
 }
 
-static noinline void check_dup_insert(struct maple_tree *mt,
+static noinline void __init check_dup_insert(struct maple_tree *mt,
                                      unsigned long index, void *ptr)
 {
        int ret = -EINVAL;
@@ -173,13 +194,13 @@ static noinline void check_dup_insert(struct maple_tree *mt,
 }
 
 
-static noinline
-void check_index_load(struct maple_tree *mt, unsigned long index)
+static noinline void __init check_index_load(struct maple_tree *mt,
+                                            unsigned long index)
 {
        return check_load(mt, index, xa_mk_value(index & LONG_MAX));
 }
 
-static inline int not_empty(struct maple_node *node)
+static inline __init int not_empty(struct maple_node *node)
 {
        int i;
 
@@ -194,8 +215,8 @@ static inline int not_empty(struct maple_node *node)
 }
 
 
-static noinline void check_rev_seq(struct maple_tree *mt, unsigned long max,
-               bool verbose)
+static noinline void __init check_rev_seq(struct maple_tree *mt,
+                                         unsigned long max, bool verbose)
 {
        unsigned long i = max, j;
 
@@ -219,7 +240,7 @@ static noinline void check_rev_seq(struct maple_tree *mt, unsigned long max,
 #ifndef __KERNEL__
        if (verbose) {
                rcu_barrier();
-               mt_dump(mt);
+               mt_dump(mt, mt_dump_dec);
                pr_info(" %s test of 0-%lu %luK in %d active (%d total)\n",
                        __func__, max, mt_get_alloc_size()/1024, mt_nr_allocated(),
                        mt_nr_tallocated());
@@ -227,7 +248,7 @@ static noinline void check_rev_seq(struct maple_tree *mt, unsigned long max,
 #endif
 }
 
-static noinline void check_seq(struct maple_tree *mt, unsigned long max,
+static noinline void __init check_seq(struct maple_tree *mt, unsigned long max,
                bool verbose)
 {
        unsigned long i, j;
@@ -248,7 +269,7 @@ static noinline void check_seq(struct maple_tree *mt, unsigned long max,
 #ifndef __KERNEL__
        if (verbose) {
                rcu_barrier();
-               mt_dump(mt);
+               mt_dump(mt, mt_dump_dec);
                pr_info(" seq test of 0-%lu %luK in %d active (%d total)\n",
                        max, mt_get_alloc_size()/1024, mt_nr_allocated(),
                        mt_nr_tallocated());
@@ -256,7 +277,7 @@ static noinline void check_seq(struct maple_tree *mt, unsigned long max,
 #endif
 }
 
-static noinline void check_lb_not_empty(struct maple_tree *mt)
+static noinline void __init check_lb_not_empty(struct maple_tree *mt)
 {
        unsigned long i, j;
        unsigned long huge = 4000UL * 1000 * 1000;
@@ -275,13 +296,13 @@ static noinline void check_lb_not_empty(struct maple_tree *mt)
        mtree_destroy(mt);
 }
 
-static noinline void check_lower_bound_split(struct maple_tree *mt)
+static noinline void __init check_lower_bound_split(struct maple_tree *mt)
 {
        MT_BUG_ON(mt, !mtree_empty(mt));
        check_lb_not_empty(mt);
 }
 
-static noinline void check_upper_bound_split(struct maple_tree *mt)
+static noinline void __init check_upper_bound_split(struct maple_tree *mt)
 {
        unsigned long i, j;
        unsigned long huge;
@@ -306,7 +327,7 @@ static noinline void check_upper_bound_split(struct maple_tree *mt)
        mtree_destroy(mt);
 }
 
-static noinline void check_mid_split(struct maple_tree *mt)
+static noinline void __init check_mid_split(struct maple_tree *mt)
 {
        unsigned long huge = 8000UL * 1000 * 1000;
 
@@ -315,7 +336,7 @@ static noinline void check_mid_split(struct maple_tree *mt)
        check_lb_not_empty(mt);
 }
 
-static noinline void check_rev_find(struct maple_tree *mt)
+static noinline void __init check_rev_find(struct maple_tree *mt)
 {
        int i, nr_entries = 200;
        void *val;
@@ -354,7 +375,7 @@ static noinline void check_rev_find(struct maple_tree *mt)
        rcu_read_unlock();
 }
 
-static noinline void check_find(struct maple_tree *mt)
+static noinline void __init check_find(struct maple_tree *mt)
 {
        unsigned long val = 0;
        unsigned long count;
@@ -571,7 +592,7 @@ static noinline void check_find(struct maple_tree *mt)
        mtree_destroy(mt);
 }
 
-static noinline void check_find_2(struct maple_tree *mt)
+static noinline void __init check_find_2(struct maple_tree *mt)
 {
        unsigned long i, j;
        void *entry;
@@ -616,7 +637,7 @@ static noinline void check_find_2(struct maple_tree *mt)
 
 
 #if defined(CONFIG_64BIT)
-static noinline void check_alloc_rev_range(struct maple_tree *mt)
+static noinline void __init check_alloc_rev_range(struct maple_tree *mt)
 {
        /*
         * Generated by:
@@ -624,7 +645,7 @@ static noinline void check_alloc_rev_range(struct maple_tree *mt)
         * awk -F "-" '{printf "0x%s, 0x%s, ", $1, $2}'
         */
 
-       unsigned long range[] = {
+       static const unsigned long range[] = {
        /*      Inclusive     , Exclusive. */
                0x565234af2000, 0x565234af4000,
                0x565234af4000, 0x565234af9000,
@@ -652,7 +673,7 @@ static noinline void check_alloc_rev_range(struct maple_tree *mt)
                0x7fff58791000, 0x7fff58793000,
        };
 
-       unsigned long holes[] = {
+       static const unsigned long holes[] = {
                /*
                 * Note: start of hole is INCLUSIVE
                 *        end of hole is EXCLUSIVE
@@ -672,7 +693,7 @@ static noinline void check_alloc_rev_range(struct maple_tree *mt)
         * 4. number that should be returned.
         * 5. return value
         */
-       unsigned long req_range[] = {
+       static const unsigned long req_range[] = {
                0x565234af9000, /* Min */
                0x7fff58791000, /* Max */
                0x1000,         /* Size */
@@ -680,7 +701,7 @@ static noinline void check_alloc_rev_range(struct maple_tree *mt)
                0,              /* Return value success. */
 
                0x0,            /* Min */
-               0x565234AF1 << 12,    /* Max */
+               0x565234AF0 << 12,    /* Max */
                0x3000,         /* Size */
                0x565234AEE << 12,  /* max - 3. */
                0,              /* Return value success. */
@@ -692,14 +713,14 @@ static noinline void check_alloc_rev_range(struct maple_tree *mt)
                0,              /* Return value success. */
 
                0x0,            /* Min */
-               0x7F36D510A << 12,    /* Max */
+               0x7F36D5109 << 12,    /* Max */
                0x4000,         /* Size */
                0x7F36D5106 << 12,    /* First rev hole of size 0x4000 */
                0,              /* Return value success. */
 
                /* Ascend test. */
                0x0,
-               34148798629 << 12,
+               34148798628 << 12,
                19 << 12,
                34148797418 << 12,
                0x0,
@@ -711,6 +732,12 @@ static noinline void check_alloc_rev_range(struct maple_tree *mt)
                0x0,
                -EBUSY,
 
+               /* Single space test. */
+               34148798725 << 12,
+               34148798725 << 12,
+               1 << 12,
+               34148798725 << 12,
+               0,
        };
 
        int i, range_count = ARRAY_SIZE(range);
@@ -759,9 +786,9 @@ static noinline void check_alloc_rev_range(struct maple_tree *mt)
        mas_unlock(&mas);
        for (i = 0; i < req_range_count; i += 5) {
 #if DEBUG_REV_RANGE
-               pr_debug("\tReverse request between %lu-%lu size %lu, should get %lu\n",
-                               req_range[i] >> 12,
-                               (req_range[i + 1] >> 12) - 1,
+               pr_debug("\tReverse request %d between %lu-%lu size %lu, should get %lu\n",
+                               i, req_range[i] >> 12,
+                               (req_range[i + 1] >> 12),
                                req_range[i+2] >> 12,
                                req_range[i+3] >> 12);
 #endif
@@ -777,13 +804,14 @@ static noinline void check_alloc_rev_range(struct maple_tree *mt)
 
        mt_set_non_kernel(1);
        mtree_erase(mt, 34148798727); /* create a deleted range. */
+       mtree_erase(mt, 34148798725);
        check_mtree_alloc_rrange(mt, 0, 34359052173, 210253414,
                        34148798725, 0, mt);
 
        mtree_destroy(mt);
 }
 
-static noinline void check_alloc_range(struct maple_tree *mt)
+static noinline void __init check_alloc_range(struct maple_tree *mt)
 {
        /*
         * Generated by:
@@ -791,7 +819,7 @@ static noinline void check_alloc_range(struct maple_tree *mt)
         * awk -F "-" '{printf "0x%s, 0x%s, ", $1, $2}'
         */
 
-       unsigned long range[] = {
+       static const unsigned long range[] = {
        /*      Inclusive     , Exclusive. */
                0x565234af2000, 0x565234af4000,
                0x565234af4000, 0x565234af9000,
@@ -818,7 +846,7 @@ static noinline void check_alloc_range(struct maple_tree *mt)
                0x7fff5878e000, 0x7fff58791000,
                0x7fff58791000, 0x7fff58793000,
        };
-       unsigned long holes[] = {
+       static const unsigned long holes[] = {
                /* Start of hole, end of hole,  size of hole (+1) */
                0x565234afb000, 0x565234afc000, 0x1000,
                0x565234afe000, 0x565235def000, 0x12F1000,
@@ -833,7 +861,7 @@ static noinline void check_alloc_range(struct maple_tree *mt)
         * 4. number that should be returned.
         * 5. return value
         */
-       unsigned long req_range[] = {
+       static const unsigned long req_range[] = {
                0x565234af9000, /* Min */
                0x7fff58791000, /* Max */
                0x1000,         /* Size */
@@ -880,6 +908,13 @@ static noinline void check_alloc_range(struct maple_tree *mt)
                4503599618982063UL << 12,  /* Size */
                34359052178 << 12,  /* Expected location */
                -EBUSY,             /* Return failure. */
+
+               /* Test a single entry */
+               34148798648 << 12,              /* Min */
+               34148798648 << 12,              /* Max */
+               4096,                   /* Size of 1 */
+               34148798648 << 12,      /* Location is the same as min/max */
+               0,                      /* Success */
        };
        int i, range_count = ARRAY_SIZE(range);
        int req_range_count = ARRAY_SIZE(req_range);
@@ -893,7 +928,7 @@ static noinline void check_alloc_range(struct maple_tree *mt)
 #if DEBUG_ALLOC_RANGE
                pr_debug("\tInsert %lu-%lu\n", range[i] >> 12,
                         (range[i + 1] >> 12) - 1);
-               mt_dump(mt);
+               mt_dump(mt, mt_dump_hex);
 #endif
                check_insert_range(mt, range[i] >> 12, (range[i + 1] >> 12) - 1,
                                xa_mk_value(range[i] >> 12), 0);
@@ -934,7 +969,7 @@ static noinline void check_alloc_range(struct maple_tree *mt)
                                xa_mk_value(req_range[i] >> 12)); /* pointer */
                mt_validate(mt);
 #if DEBUG_ALLOC_RANGE
-               mt_dump(mt);
+               mt_dump(mt, mt_dump_hex);
 #endif
        }
 
@@ -942,10 +977,10 @@ static noinline void check_alloc_range(struct maple_tree *mt)
 }
 #endif
 
-static noinline void check_ranges(struct maple_tree *mt)
+static noinline void __init check_ranges(struct maple_tree *mt)
 {
        int i, val, val2;
-       unsigned long r[] = {
+       static const unsigned long r[] = {
                10, 15,
                20, 25,
                17, 22, /* Overlaps previous range. */
@@ -1210,7 +1245,7 @@ static noinline void check_ranges(struct maple_tree *mt)
                MT_BUG_ON(mt, mt_height(mt) != 4);
 }
 
-static noinline void check_next_entry(struct maple_tree *mt)
+static noinline void __init check_next_entry(struct maple_tree *mt)
 {
        void *entry = NULL;
        unsigned long limit = 30, i = 0;
@@ -1234,7 +1269,7 @@ static noinline void check_next_entry(struct maple_tree *mt)
        mtree_destroy(mt);
 }
 
-static noinline void check_prev_entry(struct maple_tree *mt)
+static noinline void __init check_prev_entry(struct maple_tree *mt)
 {
        unsigned long index = 16;
        void *value;
@@ -1278,7 +1313,7 @@ static noinline void check_prev_entry(struct maple_tree *mt)
        mas_unlock(&mas);
 }
 
-static noinline void check_root_expand(struct maple_tree *mt)
+static noinline void __init check_root_expand(struct maple_tree *mt)
 {
        MA_STATE(mas, mt, 0, 0);
        void *ptr;
@@ -1287,6 +1322,7 @@ static noinline void check_root_expand(struct maple_tree *mt)
        mas_lock(&mas);
        mas_set(&mas, 3);
        ptr = mas_walk(&mas);
+       MT_BUG_ON(mt, mas.index != 0);
        MT_BUG_ON(mt, ptr != NULL);
        MT_BUG_ON(mt, mas.index != 0);
        MT_BUG_ON(mt, mas.last != ULONG_MAX);
@@ -1356,7 +1392,7 @@ static noinline void check_root_expand(struct maple_tree *mt)
        mas_store_gfp(&mas, ptr, GFP_KERNEL);
        ptr = mas_next(&mas, ULONG_MAX);
        MT_BUG_ON(mt, ptr != NULL);
-       MT_BUG_ON(mt, (mas.index != 1) && (mas.last != ULONG_MAX));
+       MT_BUG_ON(mt, (mas.index != ULONG_MAX) && (mas.last != ULONG_MAX));
 
        mas_set(&mas, 1);
        ptr = mas_prev(&mas, 0);
@@ -1367,13 +1403,13 @@ static noinline void check_root_expand(struct maple_tree *mt)
        mas_unlock(&mas);
 }
 
-static noinline void check_gap_combining(struct maple_tree *mt)
+static noinline void __init check_gap_combining(struct maple_tree *mt)
 {
        struct maple_enode *mn1, *mn2;
        void *entry;
        unsigned long singletons = 100;
-       unsigned long *seq100;
-       unsigned long seq100_64[] = {
+       static const unsigned long *seq100;
+       static const unsigned long seq100_64[] = {
                /* 0-5 */
                74, 75, 76,
                50, 100, 2,
@@ -1387,7 +1423,7 @@ static noinline void check_gap_combining(struct maple_tree *mt)
                76, 2, 79, 85, 4,
        };
 
-       unsigned long seq100_32[] = {
+       static const unsigned long seq100_32[] = {
                /* 0-5 */
                61, 62, 63,
                50, 100, 2,
@@ -1401,11 +1437,11 @@ static noinline void check_gap_combining(struct maple_tree *mt)
                76, 2, 79, 85, 4,
        };
 
-       unsigned long seq2000[] = {
+       static const unsigned long seq2000[] = {
                1152, 1151,
                1100, 1200, 2,
        };
-       unsigned long seq400[] = {
+       static const unsigned long seq400[] = {
                286, 318,
                256, 260, 266, 270, 275, 280, 290, 398,
                286, 310,
@@ -1564,7 +1600,7 @@ static noinline void check_gap_combining(struct maple_tree *mt)
        mt_set_non_kernel(0);
        mtree_destroy(mt);
 }
-static noinline void check_node_overwrite(struct maple_tree *mt)
+static noinline void __init check_node_overwrite(struct maple_tree *mt)
 {
        int i, max = 4000;
 
@@ -1572,12 +1608,12 @@ static noinline void check_node_overwrite(struct maple_tree *mt)
                mtree_test_store_range(mt, i*100, i*100 + 50, xa_mk_value(i*100));
 
        mtree_test_store_range(mt, 319951, 367950, NULL);
-       /*mt_dump(mt); */
+       /*mt_dump(mt, mt_dump_dec); */
        mt_validate(mt);
 }
 
 #if defined(BENCH_SLOT_STORE)
-static noinline void bench_slot_store(struct maple_tree *mt)
+static noinline void __init bench_slot_store(struct maple_tree *mt)
 {
        int i, brk = 105, max = 1040, brk_start = 100, count = 20000000;
 
@@ -1593,7 +1629,7 @@ static noinline void bench_slot_store(struct maple_tree *mt)
 #endif
 
 #if defined(BENCH_NODE_STORE)
-static noinline void bench_node_store(struct maple_tree *mt)
+static noinline void __init bench_node_store(struct maple_tree *mt)
 {
        int i, overwrite = 76, max = 240, count = 20000000;
 
@@ -1612,7 +1648,7 @@ static noinline void bench_node_store(struct maple_tree *mt)
 #endif
 
 #if defined(BENCH_AWALK)
-static noinline void bench_awalk(struct maple_tree *mt)
+static noinline void __init bench_awalk(struct maple_tree *mt)
 {
        int i, max = 2500, count = 50000000;
        MA_STATE(mas, mt, 1470, 1470);
@@ -1629,7 +1665,7 @@ static noinline void bench_awalk(struct maple_tree *mt)
 }
 #endif
 #if defined(BENCH_WALK)
-static noinline void bench_walk(struct maple_tree *mt)
+static noinline void __init bench_walk(struct maple_tree *mt)
 {
        int i, max = 2500, count = 550000000;
        MA_STATE(mas, mt, 1470, 1470);
@@ -1646,7 +1682,7 @@ static noinline void bench_walk(struct maple_tree *mt)
 #endif
 
 #if defined(BENCH_MT_FOR_EACH)
-static noinline void bench_mt_for_each(struct maple_tree *mt)
+static noinline void __init bench_mt_for_each(struct maple_tree *mt)
 {
        int i, count = 1000000;
        unsigned long max = 2500, index = 0;
@@ -1670,7 +1706,7 @@ static noinline void bench_mt_for_each(struct maple_tree *mt)
 #endif
 
 /* check_forking - simulate the kernel forking sequence with the tree. */
-static noinline void check_forking(struct maple_tree *mt)
+static noinline void __init check_forking(struct maple_tree *mt)
 {
 
        struct maple_tree newmt;
@@ -1709,7 +1745,7 @@ static noinline void check_forking(struct maple_tree *mt)
        mtree_destroy(&newmt);
 }
 
-static noinline void check_iteration(struct maple_tree *mt)
+static noinline void __init check_iteration(struct maple_tree *mt)
 {
        int i, nr_entries = 125;
        void *val;
@@ -1765,7 +1801,6 @@ static noinline void check_iteration(struct maple_tree *mt)
                        mas.index = 760;
                        mas.last = 765;
                        mas_store(&mas, val);
-                       mas_next(&mas, ULONG_MAX);
                }
                i++;
        }
@@ -1777,7 +1812,7 @@ static noinline void check_iteration(struct maple_tree *mt)
        mt_set_non_kernel(0);
 }
 
-static noinline void check_mas_store_gfp(struct maple_tree *mt)
+static noinline void __init check_mas_store_gfp(struct maple_tree *mt)
 {
 
        struct maple_tree newmt;
@@ -1810,7 +1845,7 @@ static noinline void check_mas_store_gfp(struct maple_tree *mt)
 }
 
 #if defined(BENCH_FORK)
-static noinline void bench_forking(struct maple_tree *mt)
+static noinline void __init bench_forking(struct maple_tree *mt)
 {
 
        struct maple_tree newmt;
@@ -1852,15 +1887,17 @@ static noinline void bench_forking(struct maple_tree *mt)
 }
 #endif
 
-static noinline void next_prev_test(struct maple_tree *mt)
+static noinline void __init next_prev_test(struct maple_tree *mt)
 {
        int i, nr_entries;
        void *val;
        MA_STATE(mas, mt, 0, 0);
        struct maple_enode *mn;
-       unsigned long *level2;
-       unsigned long level2_64[] = {707, 1000, 710, 715, 720, 725};
-       unsigned long level2_32[] = {1747, 2000, 1750, 1755, 1760, 1765};
+       static const unsigned long *level2;
+       static const unsigned long level2_64[] = { 707, 1000, 710, 715, 720,
+                                                  725};
+       static const unsigned long level2_32[] = { 1747, 2000, 1750, 1755,
+                                                  1760, 1765};
 
        if (MAPLE_32BIT) {
                nr_entries = 500;
@@ -1974,7 +2011,7 @@ static noinline void next_prev_test(struct maple_tree *mt)
 
        val = mas_next(&mas, ULONG_MAX);
        MT_BUG_ON(mt, val != NULL);
-       MT_BUG_ON(mt, mas.index != ULONG_MAX);
+       MT_BUG_ON(mt, mas.index != 0x7d6);
        MT_BUG_ON(mt, mas.last != ULONG_MAX);
 
        val = mas_prev(&mas, 0);
@@ -1998,7 +2035,8 @@ static noinline void next_prev_test(struct maple_tree *mt)
        val = mas_prev(&mas, 0);
        MT_BUG_ON(mt, val != NULL);
        MT_BUG_ON(mt, mas.index != 0);
-       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.last != 5);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
 
        mas.index = 0;
        mas.last = 5;
@@ -2010,7 +2048,7 @@ static noinline void next_prev_test(struct maple_tree *mt)
        val = mas_prev(&mas, 0);
        MT_BUG_ON(mt, val != NULL);
        MT_BUG_ON(mt, mas.index != 0);
-       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.last != 9);
        mas_unlock(&mas);
 
        mtree_destroy(mt);
@@ -2028,7 +2066,7 @@ static noinline void next_prev_test(struct maple_tree *mt)
 
 
 /* Test spanning writes that require balancing right sibling or right cousin */
-static noinline void check_spanning_relatives(struct maple_tree *mt)
+static noinline void __init check_spanning_relatives(struct maple_tree *mt)
 {
 
        unsigned long i, nr_entries = 1000;
@@ -2041,7 +2079,7 @@ static noinline void check_spanning_relatives(struct maple_tree *mt)
        mtree_store_range(mt, 9365, 9955, NULL, GFP_KERNEL);
 }
 
-static noinline void check_fuzzer(struct maple_tree *mt)
+static noinline void __init check_fuzzer(struct maple_tree *mt)
 {
        /*
         * 1. Causes a spanning rebalance of a single root node.
@@ -2438,7 +2476,7 @@ static noinline void check_fuzzer(struct maple_tree *mt)
 }
 
 /* duplicate the tree with a specific gap */
-static noinline void check_dup_gaps(struct maple_tree *mt,
+static noinline void __init check_dup_gaps(struct maple_tree *mt,
                                    unsigned long nr_entries, bool zero_start,
                                    unsigned long gap)
 {
@@ -2478,7 +2516,7 @@ static noinline void check_dup_gaps(struct maple_tree *mt,
 }
 
 /* Duplicate many sizes of trees.  Mainly to test expected entry values */
-static noinline void check_dup(struct maple_tree *mt)
+static noinline void __init check_dup(struct maple_tree *mt)
 {
        int i;
        int big_start = 100010;
@@ -2566,7 +2604,7 @@ static noinline void check_dup(struct maple_tree *mt)
        }
 }
 
-static noinline void check_bnode_min_spanning(struct maple_tree *mt)
+static noinline void __init check_bnode_min_spanning(struct maple_tree *mt)
 {
        int i = 50;
        MA_STATE(mas, mt, 0, 0);
@@ -2585,7 +2623,7 @@ static noinline void check_bnode_min_spanning(struct maple_tree *mt)
        mt_set_non_kernel(0);
 }
 
-static noinline void check_empty_area_window(struct maple_tree *mt)
+static noinline void __init check_empty_area_window(struct maple_tree *mt)
 {
        unsigned long i, nr_entries = 20;
        MA_STATE(mas, mt, 0, 0);
@@ -2660,7 +2698,7 @@ static noinline void check_empty_area_window(struct maple_tree *mt)
        MT_BUG_ON(mt, mas_empty_area(&mas, 5, 100, 6) != -EBUSY);
 
        mas_reset(&mas);
-       MT_BUG_ON(mt, mas_empty_area(&mas, 0, 8, 10) != -EBUSY);
+       MT_BUG_ON(mt, mas_empty_area(&mas, 0, 8, 10) != -EINVAL);
 
        mas_reset(&mas);
        mas_empty_area(&mas, 100, 165, 3);
@@ -2670,7 +2708,7 @@ static noinline void check_empty_area_window(struct maple_tree *mt)
        rcu_read_unlock();
 }
 
-static noinline void check_empty_area_fill(struct maple_tree *mt)
+static noinline void __init check_empty_area_fill(struct maple_tree *mt)
 {
        const unsigned long max = 0x25D78000;
        unsigned long size;
@@ -2713,12 +2751,635 @@ static noinline void check_empty_area_fill(struct maple_tree *mt)
        mt_set_non_kernel(0);
 }
 
+/*
+ * Check MAS_START, MAS_PAUSE, active (implied), and MAS_NONE transitions.
+ *
+ * The table below shows the single entry tree (0-0 pointer) and normal tree
+ * with nodes.
+ *
+ * Function    ENTRY   Start           Result          index & last
+ *     ┬          ┬       ┬               ┬                ┬
+ *     │          │       │               │                └─ the final range
+ *     │          │       │               └─ The node value after execution
+ *     │          │       └─ The node value before execution
+ *     │          └─ If the entry exists or does not exist (DNE)
+ *     └─ The function name
+ *
+ * Function    ENTRY   Start           Result          index & last
+ * mas_next()
+ *  - after last
+ *                     Single entry tree at 0-0
+ *                     ------------------------
+ *             DNE     MAS_START       MAS_NONE        1 - oo
+ *             DNE     MAS_PAUSE       MAS_NONE        1 - oo
+ *             DNE     MAS_ROOT        MAS_NONE        1 - oo
+ *                     when index = 0
+ *             DNE     MAS_NONE        MAS_ROOT        0
+ *                     when index > 0
+ *             DNE     MAS_NONE        MAS_NONE        1 - oo
+ *
+ *                     Normal tree
+ *                     -----------
+ *             exists  MAS_START       active          range
+ *             DNE     MAS_START       active          set to last range
+ *             exists  MAS_PAUSE       active          range
+ *             DNE     MAS_PAUSE       active          set to last range
+ *             exists  MAS_NONE        active          range
+ *             exists  active          active          range
+ *             DNE     active          active          set to last range
+ *
+ * Function    ENTRY   Start           Result          index & last
+ * mas_prev()
+ * - before index
+ *                     Single entry tree at 0-0
+ *                     ------------------------
+ *                             if index > 0
+ *             exists  MAS_START       MAS_ROOT        0
+ *             exists  MAS_PAUSE       MAS_ROOT        0
+ *             exists  MAS_NONE        MAS_ROOT        0
+ *
+ *                             if index == 0
+ *             DNE     MAS_START       MAS_NONE        0
+ *             DNE     MAS_PAUSE       MAS_NONE        0
+ *             DNE     MAS_NONE        MAS_NONE        0
+ *             DNE     MAS_ROOT        MAS_NONE        0
+ *
+ *                     Normal tree
+ *                     -----------
+ *             exists  MAS_START       active          range
+ *             DNE     MAS_START       active          set to min
+ *             exists  MAS_PAUSE       active          range
+ *             DNE     MAS_PAUSE       active          set to min
+ *             exists  MAS_NONE        active          range
+ *             DNE     MAS_NONE        MAS_NONE        set to min
+ *             any     MAS_ROOT        MAS_NONE        0
+ *             exists  active          active          range
+ *             DNE     active          active          last range
+ *
+ * Function    ENTRY   Start           Result          index & last
+ * mas_find()
+ *  - at index or next
+ *                     Single entry tree at 0-0
+ *                     ------------------------
+ *                             if index >  0
+ *             DNE     MAS_START       MAS_NONE        0
+ *             DNE     MAS_PAUSE       MAS_NONE        0
+ *             DNE     MAS_ROOT        MAS_NONE        0
+ *             DNE     MAS_NONE        MAS_NONE        0
+ *                             if index ==  0
+ *             exists  MAS_START       MAS_ROOT        0
+ *             exists  MAS_PAUSE       MAS_ROOT        0
+ *             exists  MAS_NONE        MAS_ROOT        0
+ *
+ *                     Normal tree
+ *                     -----------
+ *             exists  MAS_START       active          range
+ *             DNE     MAS_START       active          set to max
+ *             exists  MAS_PAUSE       active          range
+ *             DNE     MAS_PAUSE       active          set to max
+ *             exists  MAS_NONE        active          range
+ *             exists  active          active          range
+ *             DNE     active          active          last range (max < last)
+ *
+ * Function    ENTRY   Start           Result          index & last
+ * mas_find_rev()
+ *  - at index or before
+ *                     Single entry tree at 0-0
+ *                     ------------------------
+ *                             if index >  0
+ *             exists  MAS_START       MAS_ROOT        0
+ *             exists  MAS_PAUSE       MAS_ROOT        0
+ *             exists  MAS_NONE        MAS_ROOT        0
+ *                             if index ==  0
+ *             DNE     MAS_START       MAS_NONE        0
+ *             DNE     MAS_PAUSE       MAS_NONE        0
+ *             DNE     MAS_NONE        MAS_NONE        0
+ *             DNE     MAS_ROOT        MAS_NONE        0
+ *
+ *                     Normal tree
+ *                     -----------
+ *             exists  MAS_START       active          range
+ *             DNE     MAS_START       active          set to min
+ *             exists  MAS_PAUSE       active          range
+ *             DNE     MAS_PAUSE       active          set to min
+ *             exists  MAS_NONE        active          range
+ *             exists  active          active          range
+ *             DNE     active          active          last range (min > index)
+ *
+ * Function    ENTRY   Start           Result          index & last
+ * mas_walk()
+ * - Look up index
+ *                     Single entry tree at 0-0
+ *                     ------------------------
+ *                             if index >  0
+ *             DNE     MAS_START       MAS_ROOT        1 - oo
+ *             DNE     MAS_PAUSE       MAS_ROOT        1 - oo
+ *             DNE     MAS_NONE        MAS_ROOT        1 - oo
+ *             DNE     MAS_ROOT        MAS_ROOT        1 - oo
+ *                             if index ==  0
+ *             exists  MAS_START       MAS_ROOT        0
+ *             exists  MAS_PAUSE       MAS_ROOT        0
+ *             exists  MAS_NONE        MAS_ROOT        0
+ *             exists  MAS_ROOT        MAS_ROOT        0
+ *
+ *                     Normal tree
+ *                     -----------
+ *             exists  MAS_START       active          range
+ *             DNE     MAS_START       active          range of NULL
+ *             exists  MAS_PAUSE       active          range
+ *             DNE     MAS_PAUSE       active          range of NULL
+ *             exists  MAS_NONE        active          range
+ *             DNE     MAS_NONE        active          range of NULL
+ *             exists  active          active          range
+ *             DNE     active          active          range of NULL
+ */
+
+#define mas_active(x)          (((x).node != MAS_ROOT) && \
+                                ((x).node != MAS_START) && \
+                                ((x).node != MAS_PAUSE) && \
+                                ((x).node != MAS_NONE))
+static noinline void __init check_state_handling(struct maple_tree *mt)
+{
+       MA_STATE(mas, mt, 0, 0);
+       void *entry, *ptr = (void *) 0x1234500;
+       void *ptr2 = &ptr;
+       void *ptr3 = &ptr2;
+
+       /* Check MAS_ROOT First */
+       mtree_store_range(mt, 0, 0, ptr, GFP_KERNEL);
+
+       mas_lock(&mas);
+       /* prev: Start -> none */
+       entry = mas_prev(&mas, 0);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
+
+       /* prev: Start -> root */
+       mas_set(&mas, 10);
+       entry = mas_prev(&mas, 0);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.node != MAS_ROOT);
+
+       /* prev: pause -> root */
+       mas_set(&mas, 10);
+       mas_pause(&mas);
+       entry = mas_prev(&mas, 0);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.node != MAS_ROOT);
+
+       /* next: start -> none */
+       mas_set(&mas, 0);
+       entry = mas_next(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, mas.index != 1);
+       MT_BUG_ON(mt, mas.last != ULONG_MAX);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
+
+       /* next: start -> none */
+       mas_set(&mas, 10);
+       entry = mas_next(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, mas.index != 1);
+       MT_BUG_ON(mt, mas.last != ULONG_MAX);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
+
+       /* find: start -> root */
+       mas_set(&mas, 0);
+       entry = mas_find(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.node != MAS_ROOT);
+
+       /* find: root -> none */
+       entry = mas_find(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 1);
+       MT_BUG_ON(mt, mas.last != ULONG_MAX);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
+
+       /* find: none -> none */
+       entry = mas_find(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 1);
+       MT_BUG_ON(mt, mas.last != ULONG_MAX);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
+
+       /* find: start -> none */
+       mas_set(&mas, 10);
+       entry = mas_find(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 1);
+       MT_BUG_ON(mt, mas.last != ULONG_MAX);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
+
+       /* find_rev: none -> root */
+       entry = mas_find_rev(&mas, 0);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.node != MAS_ROOT);
+
+       /* find_rev: start -> root */
+       mas_set(&mas, 0);
+       entry = mas_find_rev(&mas, 0);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.node != MAS_ROOT);
+
+       /* find_rev: root -> none */
+       entry = mas_find_rev(&mas, 0);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
+
+       /* find_rev: none -> none */
+       entry = mas_find_rev(&mas, 0);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
+
+       /* find_rev: start -> root */
+       mas_set(&mas, 10);
+       entry = mas_find_rev(&mas, 0);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.node != MAS_ROOT);
+
+       /* walk: start -> none */
+       mas_set(&mas, 10);
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 1);
+       MT_BUG_ON(mt, mas.last != ULONG_MAX);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
+
+       /* walk: pause -> none*/
+       mas_set(&mas, 10);
+       mas_pause(&mas);
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 1);
+       MT_BUG_ON(mt, mas.last != ULONG_MAX);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
+
+       /* walk: none -> none */
+       mas.index = mas.last = 10;
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 1);
+       MT_BUG_ON(mt, mas.last != ULONG_MAX);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
+
+       /* walk: none -> none */
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 1);
+       MT_BUG_ON(mt, mas.last != ULONG_MAX);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
+
+       /* walk: start -> root */
+       mas_set(&mas, 0);
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.node != MAS_ROOT);
+
+       /* walk: pause -> root */
+       mas_set(&mas, 0);
+       mas_pause(&mas);
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.node != MAS_ROOT);
+
+       /* walk: none -> root */
+       mas.node = MAS_NONE;
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.node != MAS_ROOT);
+
+       /* walk: root -> root */
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.node != MAS_ROOT);
+
+       /* walk: root -> none */
+       mas_set(&mas, 10);
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 1);
+       MT_BUG_ON(mt, mas.last != ULONG_MAX);
+       MT_BUG_ON(mt, mas.node != MAS_NONE);
+
+       /* walk: none -> root */
+       mas.index = mas.last = 0;
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0);
+       MT_BUG_ON(mt, mas.node != MAS_ROOT);
+
+       mas_unlock(&mas);
+
+       /* Check when there is an actual node */
+       mtree_store_range(mt, 0, 0, NULL, GFP_KERNEL);
+       mtree_store_range(mt, 0x1000, 0x1500, ptr, GFP_KERNEL);
+       mtree_store_range(mt, 0x2000, 0x2500, ptr2, GFP_KERNEL);
+       mtree_store_range(mt, 0x3000, 0x3500, ptr3, GFP_KERNEL);
+
+       mas_lock(&mas);
+
+       /* next: start ->active */
+       mas_set(&mas, 0);
+       entry = mas_next(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* next: pause ->active */
+       mas_set(&mas, 0);
+       mas_pause(&mas);
+       entry = mas_next(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* next: none ->active */
+       mas.index = mas.last = 0;
+       mas.offset = 0;
+       mas.node = MAS_NONE;
+       entry = mas_next(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* next:active ->active */
+       entry = mas_next(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != ptr2);
+       MT_BUG_ON(mt, mas.index != 0x2000);
+       MT_BUG_ON(mt, mas.last != 0x2500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* next:active -> active out of range*/
+       entry = mas_next(&mas, 0x2999);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 0x2501);
+       MT_BUG_ON(mt, mas.last != 0x2fff);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* Continue after out of range*/
+       entry = mas_next(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != ptr3);
+       MT_BUG_ON(mt, mas.index != 0x3000);
+       MT_BUG_ON(mt, mas.last != 0x3500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* next:active -> active out of range*/
+       entry = mas_next(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 0x3501);
+       MT_BUG_ON(mt, mas.last != ULONG_MAX);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* next: none -> active, skip value at location */
+       mas_set(&mas, 0);
+       entry = mas_next(&mas, ULONG_MAX);
+       mas.node = MAS_NONE;
+       mas.offset = 0;
+       entry = mas_next(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != ptr2);
+       MT_BUG_ON(mt, mas.index != 0x2000);
+       MT_BUG_ON(mt, mas.last != 0x2500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* prev:active ->active */
+       entry = mas_prev(&mas, 0);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* prev:active -> active out of range*/
+       entry = mas_prev(&mas, 0);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0x0FFF);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* prev: pause ->active */
+       mas_set(&mas, 0x3600);
+       entry = mas_prev(&mas, 0);
+       MT_BUG_ON(mt, entry != ptr3);
+       mas_pause(&mas);
+       entry = mas_prev(&mas, 0);
+       MT_BUG_ON(mt, entry != ptr2);
+       MT_BUG_ON(mt, mas.index != 0x2000);
+       MT_BUG_ON(mt, mas.last != 0x2500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* prev:active -> active out of range*/
+       entry = mas_prev(&mas, 0x1600);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 0x1501);
+       MT_BUG_ON(mt, mas.last != 0x1FFF);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* prev: active ->active, continue*/
+       entry = mas_prev(&mas, 0);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* find: start ->active */
+       mas_set(&mas, 0);
+       entry = mas_find(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* find: pause ->active */
+       mas_set(&mas, 0);
+       mas_pause(&mas);
+       entry = mas_find(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* find: start ->active on value */
+       mas_set(&mas, 1200);
+       entry = mas_find(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* find:active ->active */
+       entry = mas_find(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != ptr2);
+       MT_BUG_ON(mt, mas.index != 0x2000);
+       MT_BUG_ON(mt, mas.last != 0x2500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+
+       /* find:active -> active (NULL)*/
+       entry = mas_find(&mas, 0x2700);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 0x2501);
+       MT_BUG_ON(mt, mas.last != 0x2FFF);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* find: none ->active */
+       entry = mas_find(&mas, 0x5000);
+       MT_BUG_ON(mt, entry != ptr3);
+       MT_BUG_ON(mt, mas.index != 0x3000);
+       MT_BUG_ON(mt, mas.last != 0x3500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* find:active -> active (NULL) end*/
+       entry = mas_find(&mas, ULONG_MAX);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 0x3501);
+       MT_BUG_ON(mt, mas.last != ULONG_MAX);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* find_rev: active (END) ->active */
+       entry = mas_find_rev(&mas, 0);
+       MT_BUG_ON(mt, entry != ptr3);
+       MT_BUG_ON(mt, mas.index != 0x3000);
+       MT_BUG_ON(mt, mas.last != 0x3500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* find_rev:active ->active */
+       entry = mas_find_rev(&mas, 0);
+       MT_BUG_ON(mt, entry != ptr2);
+       MT_BUG_ON(mt, mas.index != 0x2000);
+       MT_BUG_ON(mt, mas.last != 0x2500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* find_rev: pause ->active */
+       mas_pause(&mas);
+       entry = mas_find_rev(&mas, 0);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* find_rev:active -> active */
+       entry = mas_find_rev(&mas, 0);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 0);
+       MT_BUG_ON(mt, mas.last != 0x0FFF);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* find_rev: start ->active */
+       mas_set(&mas, 0x1200);
+       entry = mas_find_rev(&mas, 0);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* mas_walk start ->active */
+       mas_set(&mas, 0x1200);
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* mas_walk start ->active */
+       mas_set(&mas, 0x1600);
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 0x1501);
+       MT_BUG_ON(mt, mas.last != 0x1fff);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* mas_walk pause ->active */
+       mas_set(&mas, 0x1200);
+       mas_pause(&mas);
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* mas_walk pause -> active */
+       mas_set(&mas, 0x1600);
+       mas_pause(&mas);
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 0x1501);
+       MT_BUG_ON(mt, mas.last != 0x1fff);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* mas_walk none -> active */
+       mas_set(&mas, 0x1200);
+       mas.node = MAS_NONE;
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* mas_walk none -> active */
+       mas_set(&mas, 0x1600);
+       mas.node = MAS_NONE;
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 0x1501);
+       MT_BUG_ON(mt, mas.last != 0x1fff);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* mas_walk active -> active */
+       mas.index = 0x1200;
+       mas.last = 0x1200;
+       mas.offset = 0;
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != ptr);
+       MT_BUG_ON(mt, mas.index != 0x1000);
+       MT_BUG_ON(mt, mas.last != 0x1500);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       /* mas_walk active -> active */
+       mas.index = 0x1600;
+       mas.last = 0x1600;
+       entry = mas_walk(&mas);
+       MT_BUG_ON(mt, entry != NULL);
+       MT_BUG_ON(mt, mas.index != 0x1501);
+       MT_BUG_ON(mt, mas.last != 0x1fff);
+       MT_BUG_ON(mt, !mas_active(mas));
+
+       mas_unlock(&mas);
+}
+
 static DEFINE_MTREE(tree);
-static int maple_tree_seed(void)
+static int __init maple_tree_seed(void)
 {
-       unsigned long set[] = {5015, 5014, 5017, 25, 1000,
-                              1001, 1002, 1003, 1005, 0,
-                              5003, 5002};
+       unsigned long set[] = { 5015, 5014, 5017, 25, 1000,
+                               1001, 1002, 1003, 1005, 0,
+                               5003, 5002};
        void *ptr = &set;
 
        pr_info("\nTEST STARTING\n\n");
@@ -2974,6 +3635,10 @@ static int maple_tree_seed(void)
        mtree_destroy(&tree);
 
 
+       mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+       check_state_handling(&tree);
+       mtree_destroy(&tree);
+
 #if defined(BENCH)
 skip:
 #endif
@@ -2988,7 +3653,7 @@ skip:
        return -EINVAL;
 }
 
-static void maple_tree_harvest(void)
+static void __exit maple_tree_harvest(void)
 {
 
 }
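
The state table documented at the top of check_state_handling() above is easier to digest next to a concrete call sequence. Below is a compressed sketch of the single-entry-tree rows, mirroring assertions the new test already makes (illustration only; the wrapper function is made up and the headers listed are assumptions):

#include <linux/gfp.h>
#include <linux/maple_tree.h>

static void mas_state_sketch(struct maple_tree *mt)
{
	MA_STATE(mas, mt, 0, 0);

	/* Tree holding a single entry at index 0 (the "0-0 pointer"). */
	mtree_store_range(mt, 0, 0, (void *)0x1234500, GFP_KERNEL);

	mas_lock(&mas);

	/* mas_next() from MAS_START past the only entry: nothing is found,
	 * the state ends up as MAS_NONE and the range is set to 1 - ULONG_MAX.
	 */
	mas_set(&mas, 0);
	mas_next(&mas, ULONG_MAX);

	/* mas_prev() from MAS_START with index > 0: the 0-0 entry is found,
	 * the state ends up as MAS_ROOT and the range is 0 - 0.
	 */
	mas_set(&mas, 10);
	mas_prev(&mas, 0);

	mas_unlock(&mas);
}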
index e2a816d..8036aa9 100644 (file)
@@ -30,6 +30,13 @@ static int i_zero;
 static int i_one_hundred = 100;
 static int match_int_ok = 1;
 
+
+static struct {
+       struct ctl_table_header *test_h_setup_node;
+       struct ctl_table_header *test_h_mnt;
+       struct ctl_table_header *test_h_mnterror;
+} sysctl_test_headers;
+
 struct test_sysctl_data {
        int int_0001;
        int int_0002;
@@ -126,9 +133,7 @@ static struct ctl_table test_table[] = {
        { }
 };
 
-static struct ctl_table_header *test_sysctl_header;
-
-static int __init test_sysctl_init(void)
+static void test_sysctl_calc_match_int_ok(void)
 {
        int i;
 
@@ -153,24 +158,96 @@ static int __init test_sysctl_init(void)
        for (i = 0; i < ARRAY_SIZE(match_int); i++)
                if (match_int[i].defined != match_int[i].wanted)
                        match_int_ok = 0;
+}
 
+static int test_sysctl_setup_node_tests(void)
+{
+       test_sysctl_calc_match_int_ok();
        test_data.bitmap_0001 = kzalloc(SYSCTL_TEST_BITMAP_SIZE/8, GFP_KERNEL);
        if (!test_data.bitmap_0001)
                return -ENOMEM;
-       test_sysctl_header = register_sysctl("debug/test_sysctl", test_table);
-       if (!test_sysctl_header) {
+       sysctl_test_headers.test_h_setup_node = register_sysctl("debug/test_sysctl", test_table);
+       if (!sysctl_test_headers.test_h_setup_node) {
                kfree(test_data.bitmap_0001);
                return -ENOMEM;
        }
+
        return 0;
 }
+
+/* Used to test that unregister actually removes the directory */
+static struct ctl_table test_table_unregister[] = {
+       {
+               .procname       = "unregister_error",
+               .data           = &test_data.int_0001,
+               .maxlen         = sizeof(int),
+               .mode           = 0644,
+               .proc_handler   = proc_dointvec_minmax,
+       },
+       {}
+};
+
+static int test_sysctl_run_unregister_nested(void)
+{
+       struct ctl_table_header *unregister;
+
+       unregister = register_sysctl("debug/test_sysctl/unregister_error",
+                                  test_table_unregister);
+       if (!unregister)
+               return -ENOMEM;
+
+       unregister_sysctl_table(unregister);
+       return 0;
+}
+
+static int test_sysctl_run_register_mount_point(void)
+{
+       sysctl_test_headers.test_h_mnt
+               = register_sysctl_mount_point("debug/test_sysctl/mnt");
+       if (!sysctl_test_headers.test_h_mnt)
+               return -ENOMEM;
+
+       sysctl_test_headers.test_h_mnterror
+               = register_sysctl("debug/test_sysctl/mnt/mnt_error",
+                                 test_table_unregister);
+       /*
+        * Don't check the result:
+        * If it fails (expected behavior), return 0.
+        * If successful (misbehavior of register mount point), we want to see
+        * mnt_error when we run the sysctl test script.
+        */
+
+       return 0;
+}
+
+static int __init test_sysctl_init(void)
+{
+       int err;
+
+       err = test_sysctl_setup_node_tests();
+       if (err)
+               goto out;
+
+       err = test_sysctl_run_unregister_nested();
+       if (err)
+               goto out;
+
+       err = test_sysctl_run_register_mount_point();
+
+out:
+       return err;
+}
 module_init(test_sysctl_init);
 
 static void __exit test_sysctl_exit(void)
 {
        kfree(test_data.bitmap_0001);
-       if (test_sysctl_header)
-               unregister_sysctl_table(test_sysctl_header);
+       if (sysctl_test_headers.test_h_setup_node)
+               unregister_sysctl_table(sysctl_test_headers.test_h_setup_node);
+       if (sysctl_test_headers.test_h_mnt)
+               unregister_sysctl_table(sysctl_test_headers.test_h_mnt);
+       if (sysctl_test_headers.test_h_mnterror)
+               unregister_sysctl_table(sysctl_test_headers.test_h_mnterror);
 }
 
 module_exit(test_sysctl_exit);
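
In practice the mount-point case above is verified from user space: register_sysctl_mount_point() is expected to make the subsequent register_sysctl() of "debug/test_sysctl/mnt/mnt_error" fail, so if an mnt_error entry ever appears under /proc/sys/debug/test_sysctl/mnt/, the sysctl selftest script can flag the misbehaving kernel. That is why the result of the second registration is deliberately not propagated as an error.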
index e2cc4a7..3f90810 100644
@@ -425,9 +425,6 @@ EXPORT_SYMBOL(__ubsan_handle_load_invalid_value);
 
 void __ubsan_handle_alignment_assumption(void *_data, unsigned long ptr,
                                         unsigned long align,
-                                        unsigned long offset);
-void __ubsan_handle_alignment_assumption(void *_data, unsigned long ptr,
-                                        unsigned long align,
                                         unsigned long offset)
 {
        struct alignment_assumption_data *data = _data;
index cc5cb94..5d99ab8 100644
@@ -124,4 +124,15 @@ typedef s64 s_max;
 typedef u64 u_max;
 #endif
 
+void __ubsan_handle_divrem_overflow(void *_data, void *lhs, void *rhs);
+void __ubsan_handle_type_mismatch(struct type_mismatch_data *data, void *ptr);
+void __ubsan_handle_type_mismatch_v1(void *_data, void *ptr);
+void __ubsan_handle_out_of_bounds(void *_data, void *index);
+void __ubsan_handle_shift_out_of_bounds(void *_data, void *lhs, void *rhs);
+void __ubsan_handle_builtin_unreachable(void *_data);
+void __ubsan_handle_load_invalid_value(void *_data, void *val);
+void __ubsan_handle_alignment_assumption(void *_data, unsigned long ptr,
+                                        unsigned long align,
+                                        unsigned long offset);
+
 #endif
index f06df06..2c34e8a 100644
@@ -105,21 +105,3 @@ static uint64_t ZSTD_div64(uint64_t dividend, uint32_t divisor) {
 
 #endif /* ZSTD_DEPS_IO */
 #endif /* ZSTD_DEPS_NEED_IO */
-
-/*
- * Only requested when MSAN is enabled.
- * Need:
- * intptr_t
- */
-#ifdef ZSTD_DEPS_NEED_STDINT
-#ifndef ZSTD_DEPS_STDINT
-#define ZSTD_DEPS_STDINT
-
-/*
- * The Linux Kernel doesn't provide intptr_t, only uintptr_t, which
- * is an unsigned long.
- */
-typedef long intptr_t;
-
-#endif /* ZSTD_DEPS_STDINT */
-#endif /* ZSTD_DEPS_NEED_STDINT */
index 7672a22..12f32f8 100644
@@ -46,6 +46,22 @@ config ZSWAP_DEFAULT_ON
          The selection made here can be overridden by using the kernel
          command line 'zswap.enabled=' option.
 
+config ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON
+       bool "Invalidate zswap entries when pages are loaded"
+       depends on ZSWAP
+       help
+         If selected, exclusive loads for zswap will be enabled at boot,
+         otherwise they will be disabled.
+
+         If exclusive loads are enabled, when a page is loaded from zswap,
+         the zswap entry is invalidated at once, as opposed to leaving it
+         in zswap until the swap entry is freed.
+
+         This avoids having two copies of the same page in memory
+         (compressed and uncompressed) after faulting in a page from zswap.
+         The cost is that if the page was never dirtied and needs to be
+         swapped out again, it will be re-compressed.
+
 choice
        prompt "Default compressor"
        depends on ZSWAP
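
A usage note on ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON above: as with ZSWAP_DEFAULT_ON, the Kconfig choice only selects the boot-time default. Assuming the matching runtime knob is the zswap module parameter named exclusive_loads (introduced alongside this option; treat the exact name here as an assumption), the default could be overridden roughly like this:

	zswap.exclusive_loads=Y                                   (kernel command line)
	echo Y > /sys/module/zswap/parameters/exclusive_loads     (at runtime, if the parameter is writable)
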
index e29afc8..678530a 100644
@@ -51,7 +51,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o fadvise.o \
                           readahead.o swap.o truncate.o vmscan.o shmem.o \
                           util.o mmzone.o vmstat.o backing-dev.o \
                           mm_init.o percpu.o slab_common.o \
-                          compaction.o \
+                          compaction.o show_mem.o\
                           interval_tree.o list_lru.o workingset.o \
                           debug.o gup.o mmap_lock.o $(mmu-y)
 
@@ -89,6 +89,7 @@ obj-$(CONFIG_KASAN)   += kasan/
 obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_KMSAN)    += kmsan/
 obj-$(CONFIG_FAILSLAB) += failslab.o
+obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
 obj-$(CONFIG_MEMTEST)          += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
@@ -123,6 +124,7 @@ obj-$(CONFIG_SECRETMEM) += secretmem.o
 obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
+obj-$(CONFIG_DEBUG_PAGEALLOC) += debug_page_alloc.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_DAMON) += damon/
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
index 7da9727..3ffc3cf 100644
@@ -20,7 +20,6 @@
 struct backing_dev_info noop_backing_dev_info;
 EXPORT_SYMBOL_GPL(noop_backing_dev_info);
 
-static struct class *bdi_class;
 static const char *bdi_unknown_name = "(unknown)";
 
 /*
@@ -345,13 +344,19 @@ static struct attribute *bdi_dev_attrs[] = {
 };
 ATTRIBUTE_GROUPS(bdi_dev);
 
+static const struct class bdi_class = {
+       .name           = "bdi",
+       .dev_groups     = bdi_dev_groups,
+};
+
 static __init int bdi_class_init(void)
 {
-       bdi_class = class_create("bdi");
-       if (IS_ERR(bdi_class))
-               return PTR_ERR(bdi_class);
+       int ret;
+
+       ret = class_register(&bdi_class);
+       if (ret)
+               return ret;
 
-       bdi_class->dev_groups = bdi_dev_groups;
        bdi_debug_init();
 
        return 0;
@@ -1001,7 +1006,7 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
                return 0;
 
        vsnprintf(bdi->dev_name, sizeof(bdi->dev_name), fmt, args);
-       dev = device_create(bdi_class, NULL, MKDEV(0, 0), bdi, bdi->dev_name);
+       dev = device_create(&bdi_class, NULL, MKDEV(0, 0), bdi, bdi->dev_name);
        if (IS_ERR(dev))
                return PTR_ERR(dev);
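
As a design note, the bdi change above follows the ongoing driver-core conversion away from class_create(): the class becomes a statically defined const object with .dev_groups set at definition time, class_register() replaces the allocation and its error path, and callers pass &bdi_class directly to device_create().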
 
index 6268d66..a4cfe99 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -483,8 +483,8 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
                if (ret != -EBUSY)
                        break;
 
-               pr_debug("%s(): memory range at %p is busy, retrying\n",
-                        __func__, pfn_to_page(pfn));
+               pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n",
+                        __func__, pfn, pfn_to_page(pfn));
 
                trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn),
                                           count, align);
index c8bcdea..dbc9f86 100644
@@ -229,6 +229,33 @@ static void reset_cached_positions(struct zone *zone)
                                pageblock_start_pfn(zone_end_pfn(zone) - 1);
 }
 
+#ifdef CONFIG_SPARSEMEM
+/*
+ * If the PFN falls into an offline section, return the start PFN of the
+ * next online section. If the PFN falls into an online section or if
+ * there is no next online section, return 0.
+ */
+static unsigned long skip_offline_sections(unsigned long start_pfn)
+{
+       unsigned long start_nr = pfn_to_section_nr(start_pfn);
+
+       if (online_section_nr(start_nr))
+               return 0;
+
+       while (++start_nr <= __highest_present_section_nr) {
+               if (online_section_nr(start_nr))
+                       return section_nr_to_pfn(start_nr);
+       }
+
+       return 0;
+}
+#else
+static unsigned long skip_offline_sections(unsigned long start_pfn)
+{
+       return 0;
+}
+#endif
+
 /*
  * Compound pages of >= pageblock_order should consistently be skipped until
  * released. It is always pointless to compact pages of such order (if they are
@@ -392,18 +419,14 @@ void reset_isolation_suitable(pg_data_t *pgdat)
  * Sets the pageblock skip bit if it was clear. Note that this is a hint as
  * locks are not required for read/writers. Returns true if it was already set.
  */
-static bool test_and_set_skip(struct compact_control *cc, struct page *page,
-                                                       unsigned long pfn)
+static bool test_and_set_skip(struct compact_control *cc, struct page *page)
 {
        bool skip;
 
-       /* Do no update if skip hint is being ignored */
+       /* Do not update if skip hint is being ignored */
        if (cc->ignore_skip_hint)
                return false;
 
-       if (!pageblock_aligned(pfn))
-               return false;
-
        skip = get_pageblock_skip(page);
        if (!skip && !cc->no_set_skip_hint)
                set_pageblock_skip(page);
@@ -440,9 +463,6 @@ static void update_pageblock_skip(struct compact_control *cc,
        if (cc->no_set_skip_hint)
                return;
 
-       if (!page)
-               return;
-
        set_pageblock_skip(page);
 
        /* Update where async and sync compaction should restart */
@@ -470,8 +490,7 @@ static void update_cached_migrate(struct compact_control *cc, unsigned long pfn)
 {
 }
 
-static bool test_and_set_skip(struct compact_control *cc, struct page *page,
-                                                       unsigned long pfn)
+static bool test_and_set_skip(struct compact_control *cc, struct page *page)
 {
        return false;
 }
@@ -745,8 +764,9 @@ isolate_freepages_range(struct compact_control *cc,
 }
 
 /* Similar to reclaim, but different enough that they don't share logic */
-static bool too_many_isolated(pg_data_t *pgdat)
+static bool too_many_isolated(struct compact_control *cc)
 {
+       pg_data_t *pgdat = cc->zone->zone_pgdat;
        bool too_many;
 
        unsigned long active, inactive, isolated;
@@ -758,6 +778,17 @@ static bool too_many_isolated(pg_data_t *pgdat)
        isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
                        node_page_state(pgdat, NR_ISOLATED_ANON);
 
+       /*
+        * Allow GFP_NOFS to isolate past the limit set for regular
+        * compaction runs. This prevents an ABBA deadlock when other
+        * compactors have already isolated to the limit, but are
+        * blocked on filesystem locks held by the GFP_NOFS thread.
+        */
+       if (cc->gfp_mask & __GFP_FS) {
+               inactive >>= 3;
+               active >>= 3;
+       }
+
        too_many = isolated > (inactive + active) / 2;
        if (!too_many)
                wake_throttle_isolated(pgdat);
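
To make the effect of the new __GFP_FS check concrete (numbers purely illustrative): with inactive + active at 4096 pages, a __GFP_FS compactor is considered "too many isolated" once the node-wide isolated count exceeds (4096 >> 3) / 2 = 256 pages, while a GFP_NOFS compactor keeps the original 4096 / 2 = 2048 limit, so it can proceed past the tighter limit that regular compactors observe instead of deadlocking while holding filesystem locks.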
@@ -791,6 +822,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
        struct lruvec *lruvec;
        unsigned long flags = 0;
        struct lruvec *locked = NULL;
+       struct folio *folio = NULL;
        struct page *page = NULL, *valid_page = NULL;
        struct address_space *mapping;
        unsigned long start_pfn = low_pfn;
@@ -806,7 +838,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
         * list by either parallel reclaimers or compaction. If there are,
         * delay for some time until fewer pages are isolated
         */
-       while (unlikely(too_many_isolated(pgdat))) {
+       while (unlikely(too_many_isolated(cc))) {
                /* stop isolation if there are still pages not migrated */
                if (cc->nr_migratepages)
                        return -EAGAIN;
@@ -887,7 +919,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
                if (!valid_page && pageblock_aligned(low_pfn)) {
                        if (!isolation_suitable(cc, page)) {
                                low_pfn = end_pfn;
-                               page = NULL;
+                               folio = NULL;
                                goto isolate_abort;
                        }
                        valid_page = page;
@@ -919,7 +951,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
                                 * Hugepage was successfully isolated and placed
                                 * on the cc->migratepages list.
                                 */
-                               low_pfn += compound_nr(page) - 1;
+                               folio = page_folio(page);
+                               low_pfn += folio_nr_pages(folio) - 1;
                                goto isolate_success_no_list;
                        }
 
@@ -987,8 +1020,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
                                        locked = NULL;
                                }
 
-                               if (isolate_movable_page(page, mode))
+                               if (isolate_movable_page(page, mode)) {
+                                       folio = page_folio(page);
                                        goto isolate_success;
+                               }
                        }
 
                        goto isolate_fail;
@@ -999,7 +1034,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
                 * sure the page is not being freed elsewhere -- the
                 * page release code relies on it.
                 */
-               if (unlikely(!get_page_unless_zero(page)))
+               folio = folio_get_nontail_page(page);
+               if (unlikely(!folio))
                        goto isolate_fail;
 
                /*
@@ -1007,8 +1043,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
                 * so avoid taking lru_lock and isolating it unnecessarily in an
                 * admittedly racy check.
                 */
-               mapping = page_mapping(page);
-               if (!mapping && (page_count(page) - 1) > total_mapcount(page))
+               mapping = folio_mapping(folio);
+               if (!mapping && (folio_ref_count(folio) - 1) > folio_mapcount(folio))
                        goto isolate_fail_put;
 
                /*
@@ -1019,11 +1055,11 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
                        goto isolate_fail_put;
 
                /* Only take pages on LRU: a check now makes later tests safe */
-               if (!PageLRU(page))
+               if (!folio_test_lru(folio))
                        goto isolate_fail_put;
 
                /* Compaction might skip unevictable pages but CMA takes them */
-               if (!(mode & ISOLATE_UNEVICTABLE) && PageUnevictable(page))
+               if (!(mode & ISOLATE_UNEVICTABLE) && folio_test_unevictable(folio))
                        goto isolate_fail_put;
 
                /*
@@ -1032,10 +1068,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
                 * it will be able to migrate without blocking - clean pages
                 * for the most part.  PageWriteback would require blocking.
                 */
-               if ((mode & ISOLATE_ASYNC_MIGRATE) && PageWriteback(page))
+               if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_writeback(folio))
                        goto isolate_fail_put;
 
-               if ((mode & ISOLATE_ASYNC_MIGRATE) && PageDirty(page)) {
+               if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {
                        bool migrate_dirty;
 
                        /*
@@ -1047,22 +1083,22 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
                         * the page lock until after the page is removed
                         * from the page cache.
                         */
-                       if (!trylock_page(page))
+                       if (!folio_trylock(folio))
                                goto isolate_fail_put;
 
-                       mapping = page_mapping(page);
+                       mapping = folio_mapping(folio);
                        migrate_dirty = !mapping ||
                                        mapping->a_ops->migrate_folio;
-                       unlock_page(page);
+                       folio_unlock(folio);
                        if (!migrate_dirty)
                                goto isolate_fail_put;
                }
 
-               /* Try isolate the page */
-               if (!TestClearPageLRU(page))
+               /* Try isolate the folio */
+               if (!folio_test_clear_lru(folio))
                        goto isolate_fail_put;
 
-               lruvec = folio_lruvec(page_folio(page));
+               lruvec = folio_lruvec(folio);
 
                /* If we already hold the lock, we can skip some rechecking */
                if (lruvec != locked) {
@@ -1072,44 +1108,49 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
                        compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
                        locked = lruvec;
 
-                       lruvec_memcg_debug(lruvec, page_folio(page));
+                       lruvec_memcg_debug(lruvec, folio);
 
-                       /* Try get exclusive access under lock */
-                       if (!skip_updated) {
+                       /*
+                        * Try get exclusive access under lock. If marked for
+                        * skip, the scan is aborted unless the current context
+                        * is a rescan to reach the end of the pageblock.
+                        */
+                       if (!skip_updated && valid_page) {
                                skip_updated = true;
-                               if (test_and_set_skip(cc, page, low_pfn))
+                               if (test_and_set_skip(cc, valid_page) &&
+                                   !cc->finish_pageblock) {
                                        goto isolate_abort;
+                               }
                        }
 
                        /*
-                        * Page become compound since the non-locked check,
-                        * and it's on LRU. It can only be a THP so the order
-                        * is safe to read and it's 0 for tail pages.
+                        * The folio may have become large since the
+                        * non-locked check, and it is still on the LRU.
                         */
-                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
-                               low_pfn += compound_nr(page) - 1;
-                               nr_scanned += compound_nr(page) - 1;
-                               SetPageLRU(page);
+                       if (unlikely(folio_test_large(folio) && !cc->alloc_contig)) {
+                               low_pfn += folio_nr_pages(folio) - 1;
+                               nr_scanned += folio_nr_pages(folio) - 1;
+                               folio_set_lru(folio);
                                goto isolate_fail_put;
                        }
                }
 
-               /* The whole page is taken off the LRU; skip the tail pages. */
-               if (PageCompound(page))
-                       low_pfn += compound_nr(page) - 1;
+               /* The folio is taken off the LRU */
+               if (folio_test_large(folio))
+                       low_pfn += folio_nr_pages(folio) - 1;
 
                /* Successfully isolated */
-               del_page_from_lru_list(page, lruvec);
-               mod_node_page_state(page_pgdat(page),
-                               NR_ISOLATED_ANON + page_is_file_lru(page),
-                               thp_nr_pages(page));
+               lruvec_del_folio(lruvec, folio);
+               node_stat_mod_folio(folio,
+                               NR_ISOLATED_ANON + folio_is_file_lru(folio),
+                               folio_nr_pages(folio));
 
 isolate_success:
-               list_add(&page->lru, &cc->migratepages);
+               list_add(&folio->lru, &cc->migratepages);
 isolate_success_no_list:
-               cc->nr_migratepages += compound_nr(page);
-               nr_isolated += compound_nr(page);
-               nr_scanned += compound_nr(page) - 1;
+               cc->nr_migratepages += folio_nr_pages(folio);
+               nr_isolated += folio_nr_pages(folio);
+               nr_scanned += folio_nr_pages(folio) - 1;
 
                /*
                 * Avoid isolating too much unless this block is being
@@ -1131,7 +1172,7 @@ isolate_fail_put:
                        unlock_page_lruvec_irqrestore(locked, flags);
                        locked = NULL;
                }
-               put_page(page);
+               folio_put(folio);
 
 isolate_fail:
                if (!skip_on_failure && ret != -ENOMEM)
@@ -1172,14 +1213,14 @@ isolate_fail:
        if (unlikely(low_pfn > end_pfn))
                low_pfn = end_pfn;
 
-       page = NULL;
+       folio = NULL;
 
 isolate_abort:
        if (locked)
                unlock_page_lruvec_irqrestore(locked, flags);
-       if (page) {
-               SetPageLRU(page);
-               put_page(page);
+       if (folio) {
+               folio_set_lru(folio);
+               folio_put(folio);
        }
 
        /*
@@ -1191,7 +1232,7 @@ isolate_abort:
         * rescanned twice in a row.
         */
        if (low_pfn == end_pfn && (!nr_isolated || cc->finish_pageblock)) {
-               if (valid_page && !skip_updated)
+               if (!cc->no_set_skip_hint && valid_page && !skip_updated)
                        set_pageblock_skip(valid_page);
                update_cached_migrate(cc, low_pfn);
        }
@@ -1379,7 +1420,7 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
        isolate_freepages_block(cc, &start_pfn, end_pfn, &cc->freepages, 1, false);
 
        /* Skip this pageblock in the future as it's full or nearly full */
-       if (cc->nr_freepages < cc->nr_migratepages)
+       if (start_pfn == end_pfn)
                set_pageblock_skip(page);
 
        return;
@@ -1403,11 +1444,10 @@ static int next_search_order(struct compact_control *cc, int order)
        return order;
 }
 
-static unsigned long
-fast_isolate_freepages(struct compact_control *cc)
+static void fast_isolate_freepages(struct compact_control *cc)
 {
        unsigned int limit = max(1U, freelist_scan_limit(cc) >> 1);
-       unsigned int nr_scanned = 0;
+       unsigned int nr_scanned = 0, total_isolated = 0;
        unsigned long low_pfn, min_pfn, highest = 0;
        unsigned long nr_isolated = 0;
        unsigned long distance;
@@ -1417,7 +1457,7 @@ fast_isolate_freepages(struct compact_control *cc)
 
        /* Full compaction passes in a negative order */
        if (cc->order <= 0)
-               return cc->free_pfn;
+               return;
 
        /*
         * If starting the scan, use a deeper search and use the highest
@@ -1506,6 +1546,7 @@ fast_isolate_freepages(struct compact_control *cc)
                                set_page_private(page, order);
                                nr_isolated = 1 << order;
                                nr_scanned += nr_isolated - 1;
+                               total_isolated += nr_isolated;
                                cc->nr_freepages += nr_isolated;
                                list_add_tail(&page->lru, &cc->freepages);
                                count_compact_events(COMPACTISOLATED, nr_isolated);
@@ -1518,6 +1559,10 @@ fast_isolate_freepages(struct compact_control *cc)
 
                spin_unlock_irqrestore(&cc->zone->lock, flags);
 
+               /* Skip fast search if enough freepages isolated */
+               if (cc->nr_freepages >= cc->nr_migratepages)
+                       break;
+
                /*
                 * Smaller scan on next order so the total scan is related
                 * to freelist_scan_limit.
@@ -1526,6 +1571,9 @@ fast_isolate_freepages(struct compact_control *cc)
                        limit = max(1U, limit >> 1);
        }
 
+       trace_mm_compaction_fast_isolate_freepages(min_pfn, cc->free_pfn,
+                                                  nr_scanned, total_isolated);
+
        if (!page) {
                cc->fast_search_fail++;
                if (scan_start) {
@@ -1556,11 +1604,10 @@ fast_isolate_freepages(struct compact_control *cc)
 
        cc->total_free_scanned += nr_scanned;
        if (!page)
-               return cc->free_pfn;
+               return;
 
        low_pfn = page_to_pfn(page);
        fast_isolate_around(cc, low_pfn);
-       return low_pfn;
 }
 
 /*
@@ -1684,11 +1731,10 @@ splitmap:
  * This is a migrate-callback that "allocates" freepages by taking pages
  * from the isolated freelists in the block we are migrating to.
  */
-static struct page *compaction_alloc(struct page *migratepage,
-                                       unsigned long data)
+static struct folio *compaction_alloc(struct folio *src, unsigned long data)
 {
        struct compact_control *cc = (struct compact_control *)data;
-       struct page *freepage;
+       struct folio *dst;
 
        if (list_empty(&cc->freepages)) {
                isolate_freepages(cc);
@@ -1697,11 +1743,11 @@ static struct page *compaction_alloc(struct page *migratepage,
                        return NULL;
        }
 
-       freepage = list_entry(cc->freepages.next, struct page, lru);
-       list_del(&freepage->lru);
+       dst = list_entry(cc->freepages.next, struct folio, lru);
+       list_del(&dst->lru);
        cc->nr_freepages--;
 
-       return freepage;
+       return dst;
 }
 
 /*
@@ -1709,11 +1755,11 @@ static struct page *compaction_alloc(struct page *migratepage,
  * freelist.  All pages on the freelist are from the same zone, so there is no
  * special handling needed for NUMA.
  */
-static void compaction_free(struct page *page, unsigned long data)
+static void compaction_free(struct folio *dst, unsigned long data)
 {
        struct compact_control *cc = (struct compact_control *)data;
 
-       list_add(&page->lru, &cc->freepages);
+       list_add(&dst->lru, &cc->freepages);
        cc->nr_freepages++;
 }
 
@@ -1736,6 +1782,7 @@ static int sysctl_compact_unevictable_allowed __read_mostly = CONFIG_COMPACT_UNE
  */
 static unsigned int __read_mostly sysctl_compaction_proactiveness = 20;
 static int sysctl_extfrag_threshold = 500;
+static int __read_mostly sysctl_compact_memory;
 
 static inline void
 update_fast_start_pfn(struct compact_control *cc, unsigned long pfn)
@@ -1864,7 +1911,6 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
                                        pfn = cc->zone->zone_start_pfn;
                                cc->fast_search_fail = 0;
                                found_block = true;
-                               set_pageblock_skip(freepage);
                                break;
                        }
                }
@@ -1940,8 +1986,14 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
 
                page = pageblock_pfn_to_page(block_start_pfn,
                                                block_end_pfn, cc->zone);
-               if (!page)
+               if (!page) {
+                       unsigned long next_pfn;
+
+                       next_pfn = skip_offline_sections(block_start_pfn);
+                       if (next_pfn)
+                               block_end_pfn = min(next_pfn, cc->free_pfn);
                        continue;
+               }
 
                /*
                 * If isolation recently failed, do not retry. Only check the
@@ -2193,25 +2245,11 @@ static enum compact_result compact_finished(struct compact_control *cc)
        return ret;
 }
 
-static enum compact_result __compaction_suitable(struct zone *zone, int order,
-                                       unsigned int alloc_flags,
-                                       int highest_zoneidx,
-                                       unsigned long wmark_target)
+static bool __compaction_suitable(struct zone *zone, int order,
+                                 int highest_zoneidx,
+                                 unsigned long wmark_target)
 {
        unsigned long watermark;
-
-       if (is_via_compact_memory(order))
-               return COMPACT_CONTINUE;
-
-       watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
-       /*
-        * If watermarks for high-order allocation are already met, there
-        * should be no need for compaction at all.
-        */
-       if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
-                                                               alloc_flags))
-               return COMPACT_SUCCESS;
-
        /*
         * Watermarks for order-0 must be met for compaction to be able to
         * isolate free pages for migration targets. This means that the
@@ -2229,29 +2267,20 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
        watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                                low_wmark_pages(zone) : min_wmark_pages(zone);
        watermark += compact_gap(order);
-       if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
-                                               ALLOC_CMA, wmark_target))
-               return COMPACT_SKIPPED;
-
-       return COMPACT_CONTINUE;
+       return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
+                                  ALLOC_CMA, wmark_target);
 }
 
 /*
  * compaction_suitable: Is this suitable to run compaction on this zone now?
- * Returns
- *   COMPACT_SKIPPED  - If there are too few free pages for compaction
- *   COMPACT_SUCCESS  - If the allocation would succeed without compaction
- *   COMPACT_CONTINUE - If compaction should run now
  */
-enum compact_result compaction_suitable(struct zone *zone, int order,
-                                       unsigned int alloc_flags,
-                                       int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
 {
-       enum compact_result ret;
-       int fragindex;
+       enum compact_result compact_result;
+       bool suitable;
 
-       ret = __compaction_suitable(zone, order, alloc_flags, highest_zoneidx,
-                                   zone_page_state(zone, NR_FREE_PAGES));
+       suitable = __compaction_suitable(zone, order, highest_zoneidx,
+                                        zone_page_state(zone, NR_FREE_PAGES));
        /*
         * fragmentation index determines if allocation failures are due to
         * low memory or external fragmentation
@@ -2268,17 +2297,24 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
         * excessive compaction for costly orders, but it should not be at the
         * expense of system stability.
         */
-       if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
-               fragindex = fragmentation_index(zone, order);
-               if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
-                       ret = COMPACT_NOT_SUITABLE_ZONE;
+       if (suitable) {
+               compact_result = COMPACT_CONTINUE;
+               if (order > PAGE_ALLOC_COSTLY_ORDER) {
+                       int fragindex = fragmentation_index(zone, order);
+
+                       if (fragindex >= 0 &&
+                           fragindex <= sysctl_extfrag_threshold) {
+                               suitable = false;
+                               compact_result = COMPACT_NOT_SUITABLE_ZONE;
+                       }
+               }
+       } else {
+               compact_result = COMPACT_SKIPPED;
        }
 
-       trace_mm_compaction_suitable(zone, order, ret);
-       if (ret == COMPACT_NOT_SUITABLE_ZONE)
-               ret = COMPACT_SKIPPED;
+       trace_mm_compaction_suitable(zone, order, compact_result);
 
-       return ret;
+       return suitable;
 }
 
 bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
@@ -2294,7 +2330,6 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
        for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
                                ac->highest_zoneidx, ac->nodemask) {
                unsigned long available;
-               enum compact_result compact_result;
 
                /*
                 * Do not consider all the reclaimable memory because we do not
@@ -2304,9 +2339,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
                 */
                available = zone_reclaimable_pages(zone) / order;
                available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
-               compact_result = __compaction_suitable(zone, order, alloc_flags,
-                               ac->highest_zoneidx, available);
-               if (compact_result == COMPACT_CONTINUE)
+               if (__compaction_suitable(zone, order, ac->highest_zoneidx,
+                                         available))
                        return true;
        }
 
@@ -2336,11 +2370,22 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
        INIT_LIST_HEAD(&cc->migratepages);
 
        cc->migratetype = gfp_migratetype(cc->gfp_mask);
-       ret = compaction_suitable(cc->zone, cc->order, cc->alloc_flags,
-                                                       cc->highest_zoneidx);
-       /* Compaction is likely to fail */
-       if (ret == COMPACT_SUCCESS || ret == COMPACT_SKIPPED)
-               return ret;
+
+       if (!is_via_compact_memory(cc->order)) {
+               unsigned long watermark;
+
+               /* Allocation can already succeed, nothing to do */
+               watermark = wmark_pages(cc->zone,
+                                       cc->alloc_flags & ALLOC_WMARK_MASK);
+               if (zone_watermark_ok(cc->zone, cc->order, watermark,
+                                     cc->highest_zoneidx, cc->alloc_flags))
+                       return COMPACT_SUCCESS;
+
+               /* Compaction is likely to fail */
+               if (!compaction_suitable(cc->zone, cc->order,
+                                        cc->highest_zoneidx))
+                       return COMPACT_SKIPPED;
+       }
 
        /*
         * Clear pageblock skip if there were failures recently and compaction
@@ -2456,7 +2501,8 @@ rescan:
                        }
                        /*
                         * If an ASYNC or SYNC_LIGHT fails to migrate a page
-                        * within the current order-aligned block, scan the
+                        * within the current order-aligned block and
+                        * within the current order-aligned block and
+                        * fast_find_migrateblock may be used, scan the
                         * pageblock "skip" to avoid rescanning in the near
                         * future. This will isolate more pages than necessary
@@ -2464,8 +2510,9 @@ rescan:
                         * fast_find_migrateblock revisiting blocks that were
                         * recently partially scanned.
                         */
-                       if (cc->direct_compaction && !cc->finish_pageblock &&
-                                               (cc->mode < MIGRATE_SYNC)) {
+                       if (!pageblock_aligned(cc->migrate_pfn) &&
+                           !cc->ignore_skip_hint && !cc->finish_pageblock &&
+                           (cc->mode < MIGRATE_SYNC)) {
                                cc->finish_pageblock = true;
 
                                /*
@@ -2780,6 +2827,15 @@ static int compaction_proactiveness_sysctl_handler(struct ctl_table *table, int
 static int sysctl_compaction_handler(struct ctl_table *table, int write,
                        void *buffer, size_t *length, loff_t *ppos)
 {
+       int ret;
+
+       ret = proc_dointvec(table, write, buffer, length, ppos);
+       if (ret)
+               return ret;
+
+       if (sysctl_compact_memory != 1)
+               return -EINVAL;
+
        if (write)
                compact_nodes();
 
@@ -2833,8 +2889,14 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
                if (!populated_zone(zone))
                        continue;
 
-               if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
-                                       highest_zoneidx) == COMPACT_CONTINUE)
+               /* Allocation can already succeed, check other zones */
+               if (zone_watermark_ok(zone, pgdat->kcompactd_max_order,
+                                     min_wmark_pages(zone),
+                                     highest_zoneidx, 0))
+                       continue;
+
+               if (compaction_suitable(zone, pgdat->kcompactd_max_order,
+                                       highest_zoneidx))
                        return true;
        }
 
@@ -2871,8 +2933,12 @@ static void kcompactd_do_work(pg_data_t *pgdat)
                if (compaction_deferred(zone, cc.order))
                        continue;
 
-               if (compaction_suitable(zone, cc.order, 0, zoneid) !=
-                                                       COMPACT_CONTINUE)
+               /* Allocation can already succeed, nothing to do */
+               if (zone_watermark_ok(zone, cc.order,
+                                     min_wmark_pages(zone), zoneid, 0))
+                       continue;
+
+               if (!compaction_suitable(zone, cc.order, zoneid))
                        continue;
 
                if (kthread_should_stop())
@@ -3021,7 +3087,7 @@ static int kcompactd(void *p)
  * This kcompactd start function will be called by init and node-hot-add.
  * On node-hot-add, kcompactd will moved to proper cpus if cpus are hot-added.
  */
-void kcompactd_run(int nid)
+void __meminit kcompactd_run(int nid)
 {
        pg_data_t *pgdat = NODE_DATA(nid);
 
@@ -3039,7 +3105,7 @@ void kcompactd_run(int nid)
  * Called by memory hotplug when all memory in a node is offlined. Caller must
  * be holding mem_hotplug_begin/done().
  */
-void kcompactd_stop(int nid)
+void __meminit kcompactd_stop(int nid)
 {
        struct task_struct *kcompactd = NODE_DATA(nid)->kcompactd;
 
@@ -3095,7 +3161,7 @@ static int proc_dointvec_minmax_warn_RT_change(struct ctl_table *table,
 static struct ctl_table vm_compaction[] = {
        {
                .procname       = "compact_memory",
-               .data           = NULL,
+               .data           = &sysctl_compact_memory,
                .maxlen         = sizeof(int),
                .mode           = 0200,
                .proc_handler   = sysctl_compaction_handler,
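
With the handler change above, a write to the (write-only, mode 0200) compact_memory sysctl is first parsed into sysctl_compact_memory and anything other than 1 is rejected with -EINVAL. Assuming the table is registered under "vm" as before, triggering full compaction still looks like:

	echo 1 > /proc/sys/vm/compact_memory      (any other value now fails with EINVAL)
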
index fae64d3..c112101 100644
@@ -318,6 +318,29 @@ static void damon_test_update_monitoring_result(struct kunit *test)
        KUNIT_EXPECT_EQ(test, r->age, 20);
 }
 
+static void damon_test_set_attrs(struct kunit *test)
+{
+       struct damon_ctx ctx;
+       struct damon_attrs valid_attrs = {
+               .min_nr_regions = 10, .max_nr_regions = 1000,
+               .sample_interval = 5000, .aggr_interval = 100000,};
+       struct damon_attrs invalid_attrs;
+
+       KUNIT_EXPECT_EQ(test, damon_set_attrs(&ctx, &valid_attrs), 0);
+
+       invalid_attrs = valid_attrs;
+       invalid_attrs.min_nr_regions = 1;
+       KUNIT_EXPECT_EQ(test, damon_set_attrs(&ctx, &invalid_attrs), -EINVAL);
+
+       invalid_attrs = valid_attrs;
+       invalid_attrs.max_nr_regions = 9;
+       KUNIT_EXPECT_EQ(test, damon_set_attrs(&ctx, &invalid_attrs), -EINVAL);
+
+       invalid_attrs = valid_attrs;
+       invalid_attrs.aggr_interval = 4999;
+       KUNIT_EXPECT_EQ(test, damon_set_attrs(&ctx, &invalid_attrs), -EINVAL);
+}
+
 static struct kunit_case damon_test_cases[] = {
        KUNIT_CASE(damon_test_target),
        KUNIT_CASE(damon_test_regions),
@@ -329,6 +352,7 @@ static struct kunit_case damon_test_cases[] = {
        KUNIT_CASE(damon_test_ops_registration),
        KUNIT_CASE(damon_test_set_regions),
        KUNIT_CASE(damon_test_update_monitoring_result),
+       KUNIT_CASE(damon_test_set_attrs),
        {},
 };
 
index cc63cf9..e940802 100644
@@ -37,51 +37,29 @@ struct folio *damon_get_folio(unsigned long pfn)
        return folio;
 }
 
-void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm, unsigned long addr)
+void damon_ptep_mkold(pte_t *pte, struct vm_area_struct *vma, unsigned long addr)
 {
-       bool referenced = false;
-       struct folio *folio = damon_get_folio(pte_pfn(*pte));
+       struct folio *folio = damon_get_folio(pte_pfn(ptep_get(pte)));
 
        if (!folio)
                return;
 
-       if (pte_young(*pte)) {
-               referenced = true;
-               *pte = pte_mkold(*pte);
-       }
-
-#ifdef CONFIG_MMU_NOTIFIER
-       if (mmu_notifier_clear_young(mm, addr, addr + PAGE_SIZE))
-               referenced = true;
-#endif /* CONFIG_MMU_NOTIFIER */
-
-       if (referenced)
+       if (ptep_clear_young_notify(vma, addr, pte))
                folio_set_young(folio);
 
        folio_set_idle(folio);
        folio_put(folio);
 }
 
-void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm, unsigned long addr)
+void damon_pmdp_mkold(pmd_t *pmd, struct vm_area_struct *vma, unsigned long addr)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-       bool referenced = false;
        struct folio *folio = damon_get_folio(pmd_pfn(*pmd));
 
        if (!folio)
                return;
 
-       if (pmd_young(*pmd)) {
-               referenced = true;
-               *pmd = pmd_mkold(*pmd);
-       }
-
-#ifdef CONFIG_MMU_NOTIFIER
-       if (mmu_notifier_clear_young(mm, addr, addr + HPAGE_PMD_SIZE))
-               referenced = true;
-#endif /* CONFIG_MMU_NOTIFIER */
-
-       if (referenced)
+       if (pmdp_clear_young_notify(vma, addr, pmd))
                folio_set_young(folio);
 
        folio_set_idle(folio);
index 14f4bc6..18d837d 100644
@@ -9,8 +9,8 @@
 
 struct folio *damon_get_folio(unsigned long pfn);
 
-void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm, unsigned long addr);
-void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm, unsigned long addr);
+void damon_ptep_mkold(pte_t *pte, struct vm_area_struct *vma, unsigned long addr);
+void damon_pmdp_mkold(pmd_t *pmd, struct vm_area_struct *vma, unsigned long addr);
 
 int damon_cold_score(struct damon_ctx *c, struct damon_region *r,
                        struct damos *s);
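
The mm_struct to vm_area_struct change in these prototypes is what enables the simplification visible in ops-common.c above: ptep_clear_young_notify() and pmdp_clear_young_notify() need the VMA, and they fold the open-coded pte_young()/pte_mkold() (or pmd equivalents) plus the #ifdef'd mmu_notifier_clear_young() sequence into a single call.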
index 467b991..40801e3 100644
@@ -24,9 +24,9 @@ static bool __damon_pa_mkold(struct folio *folio, struct vm_area_struct *vma,
        while (page_vma_mapped_walk(&pvmw)) {
                addr = pvmw.address;
                if (pvmw.pte)
-                       damon_ptep_mkold(pvmw.pte, vma->vm_mm, addr);
+                       damon_ptep_mkold(pvmw.pte, vma, addr);
                else
-                       damon_pmdp_mkold(pvmw.pmd, vma->vm_mm, addr);
+                       damon_pmdp_mkold(pvmw.pmd, vma, addr);
        }
        return true;
 }
@@ -89,7 +89,7 @@ static bool __damon_pa_young(struct folio *folio, struct vm_area_struct *vma,
        while (page_vma_mapped_walk(&pvmw)) {
                addr = pvmw.address;
                if (pvmw.pte) {
-                       *accessed = pte_young(*pvmw.pte) ||
+                       *accessed = pte_young(ptep_get(pvmw.pte)) ||
                                !folio_test_idle(folio) ||
                                mmu_notifier_test_young(vma->vm_mm, addr);
                } else {
index 1fec16d..2fcc973 100644
@@ -311,19 +311,21 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
                }
 
                if (pmd_trans_huge(*pmd)) {
-                       damon_pmdp_mkold(pmd, walk->mm, addr);
+                       damon_pmdp_mkold(pmd, walk->vma, addr);
                        spin_unlock(ptl);
                        return 0;
                }
                spin_unlock(ptl);
        }
 
-       if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
-               return 0;
        pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
-       if (!pte_present(*pte))
+       if (!pte) {
+               walk->action = ACTION_AGAIN;
+               return 0;
+       }
+       if (!pte_present(ptep_get(pte)))
                goto out;
-       damon_ptep_mkold(pte, walk->mm, addr);
+       damon_ptep_mkold(pte, walk->vma, addr);
 out:
        pte_unmap_unlock(pte, ptl);
        return 0;
@@ -431,6 +433,7 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
                unsigned long next, struct mm_walk *walk)
 {
        pte_t *pte;
+       pte_t ptent;
        spinlock_t *ptl;
        struct folio *folio;
        struct damon_young_walk_private *priv = walk->private;
@@ -464,15 +467,18 @@ huge_out:
 regular_page:
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-       if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
-               return -EINVAL;
        pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
-       if (!pte_present(*pte))
+       if (!pte) {
+               walk->action = ACTION_AGAIN;
+               return 0;
+       }
+       ptent = ptep_get(pte);
+       if (!pte_present(ptent))
                goto out;
-       folio = damon_get_folio(pte_pfn(*pte));
+       folio = damon_get_folio(pte_pfn(ptent));
        if (!folio)
                goto out;
-       if (pte_young(*pte) || !folio_test_idle(folio) ||
+       if (pte_young(ptent) || !folio_test_idle(folio) ||
                        mmu_notifier_test_young(walk->mm, addr))
                priv->young = true;
        *priv->folio_sz = folio_size(folio);
index c7b2280..ee533a5 100644
@@ -268,4 +268,13 @@ void page_init_poison(struct page *page, size_t size)
        if (page_init_poisoning)
                memset(page, PAGE_POISON_PATTERN, size);
 }
+
+void vma_iter_dump_tree(const struct vma_iterator *vmi)
+{
+#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
+       mas_dump(&vmi->mas);
+       mt_dump(vmi->mas.tree, mt_dump_hex);
+#endif /* CONFIG_DEBUG_VM_MAPLE_TREE */
+}
+
 #endif         /* CONFIG_DEBUG_VM */
diff --git a/mm/debug_page_alloc.c b/mm/debug_page_alloc.c
new file mode 100644
index 0000000..f9d1457
--- /dev/null
+++ b/mm/debug_page_alloc.c
@@ -0,0 +1,59 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/page-isolation.h>
+
+unsigned int _debug_guardpage_minorder;
+
+bool _debug_pagealloc_enabled_early __read_mostly
+                       = IS_ENABLED(CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT);
+EXPORT_SYMBOL(_debug_pagealloc_enabled_early);
+DEFINE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);
+EXPORT_SYMBOL(_debug_pagealloc_enabled);
+
+DEFINE_STATIC_KEY_FALSE(_debug_guardpage_enabled);
+
+static int __init early_debug_pagealloc(char *buf)
+{
+       return kstrtobool(buf, &_debug_pagealloc_enabled_early);
+}
+early_param("debug_pagealloc", early_debug_pagealloc);
+
+static int __init debug_guardpage_minorder_setup(char *buf)
+{
+       unsigned long res;
+
+       if (kstrtoul(buf, 10, &res) < 0 ||  res > MAX_ORDER / 2) {
+               pr_err("Bad debug_guardpage_minorder value\n");
+               return 0;
+       }
+       _debug_guardpage_minorder = res;
+       pr_info("Setting debug_guardpage_minorder to %lu\n", res);
+       return 0;
+}
+early_param("debug_guardpage_minorder", debug_guardpage_minorder_setup);
+
+bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order,
+                     int migratetype)
+{
+       if (order >= debug_guardpage_minorder())
+               return false;
+
+       __SetPageGuard(page);
+       INIT_LIST_HEAD(&page->buddy_list);
+       set_page_private(page, order);
+       /* Guard pages are not available for any usage */
+       if (!is_migrate_isolate(migratetype))
+               __mod_zone_freepage_state(zone, -(1 << order), migratetype);
+
+       return true;
+}
+
+void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order,
+                     int migratetype)
+{
+       __ClearPageGuard(page);
+
+       set_page_private(page, 0);
+       if (!is_migrate_isolate(migratetype))
+               __mod_zone_freepage_state(zone, (1 << order), migratetype);
+}
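
Both knobs split out into the new mm/debug_page_alloc.c remain early boot parameters, as the early_param() hooks above show. A typical debugging setup (values illustrative; debug_pagealloc takes a kstrtobool string and the guard-page minorder must not exceed MAX_ORDER / 2) might be:

	debug_pagealloc=on debug_guardpage_minorder=1
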
index c54177a..ee119e3 100644
@@ -138,6 +138,9 @@ static void __init pte_advanced_tests(struct pgtable_debug_args *args)
                return;
 
        pr_debug("Validating PTE advanced\n");
+       if (WARN_ON(!args->ptep))
+               return;
+
        pte = pfn_pte(args->pte_pfn, args->page_prot);
        set_pte_at(args->mm, args->vaddr, args->ptep, pte);
        flush_dcache_page(page);
@@ -619,6 +622,9 @@ static void __init pte_clear_tests(struct pgtable_debug_args *args)
         * the unexpected overhead of cache flushing is acceptable.
         */
        pr_debug("Validating PTE clear\n");
+       if (WARN_ON(!args->ptep))
+               return;
+
 #ifndef CONFIG_RISCV
        pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
 #endif
@@ -1377,7 +1383,8 @@ static int __init debug_vm_pgtable(void)
        args.ptep = pte_offset_map_lock(args.mm, args.pmdp, args.vaddr, &ptl);
        pte_clear_tests(&args);
        pte_advanced_tests(&args);
-       pte_unmap_unlock(args.ptep, ptl);
+       if (args.ptep)
+               pte_unmap_unlock(args.ptep, ptl);
 
        ptl = pmd_lock(args.mm, args.pmdp);
        pmd_clear_tests(&args);
index d2b0f8f..a151a21 100644
@@ -226,7 +226,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
 {
        struct dma_pool *retval;
        size_t allocation;
-       bool empty = false;
+       bool empty;
 
        if (!dev)
                return NULL;
@@ -276,8 +276,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
         */
        mutex_lock(&pools_reg_lock);
        mutex_lock(&pools_lock);
-       if (list_empty(&dev->dma_pools))
-               empty = true;
+       empty = list_empty(&dev->dma_pools);
        list_add(&retval->pools, &dev->dma_pools);
        mutex_unlock(&pools_lock);
        if (empty) {
@@ -361,7 +360,7 @@ static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
 void dma_pool_destroy(struct dma_pool *pool)
 {
        struct dma_page *page, *tmp;
-       bool empty = false, busy = false;
+       bool empty, busy = false;
 
        if (unlikely(!pool))
                return;
@@ -369,8 +368,7 @@ void dma_pool_destroy(struct dma_pool *pool)
        mutex_lock(&pools_reg_lock);
        mutex_lock(&pools_lock);
        list_del(&pool->pools);
-       if (list_empty(&pool->dev->dma_pools))
-               empty = true;
+       empty = list_empty(&pool->dev->dma_pools);
        mutex_unlock(&pools_lock);
        if (empty)
                device_remove_file(pool->dev, &dev_attr_pools);
index 9bc12e5..ce06b28 100644
@@ -72,12 +72,10 @@ void __init early_ioremap_setup(void)
 {
        int i;
 
-       for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
-               if (WARN_ON(prev_map[i]))
-                       break;
-
-       for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
+       for (i = 0; i < FIX_BTMAPS_SLOTS; i++) {
+               WARN_ON_ONCE(prev_map[i]);
                slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
+       }
 }
 
 static int __init check_early_ioremap_leak(void)
index fb7c5f4..6c39d42 100644
@@ -14,7 +14,6 @@
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/backing-dev.h>
-#include <linux/pagevec.h>
 #include <linux/fadvise.h>
 #include <linux/writeback.h>
 #include <linux/syscalls.h>
@@ -143,7 +142,7 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
                }
 
                if (end_index >= start_index) {
-                       unsigned long nr_pagevec = 0;
+                       unsigned long nr_failed = 0;
 
                        /*
                         * It's common to FADV_DONTNEED right after
@@ -156,17 +155,15 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
                         */
                        lru_add_drain();
 
-                       invalidate_mapping_pagevec(mapping,
-                                               start_index, end_index,
-                                               &nr_pagevec);
+                       mapping_try_invalidate(mapping, start_index, end_index,
+                                       &nr_failed);
 
                        /*
-                        * If fewer pages were invalidated than expected then
-                        * it is possible that some of the pages were on
-                        * a per-cpu pagevec for a remote CPU. Drain all
-                        * pagevecs and try again.
+                        * The failures may be due to the folio being
+                        * in the LRU cache of a remote CPU. Drain all
+                        * caches and try again.
                         */
-                       if (nr_pagevec) {
+                       if (nr_failed) {
                                lru_add_drain_all();
                                invalidate_mapping_pages(mapping, start_index,
                                                end_index);
diff --git a/mm/fail_page_alloc.c b/mm/fail_page_alloc.c
new file mode 100644
index 0000000..b1b09cc
--- /dev/null
+++ b/mm/fail_page_alloc.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/fault-inject.h>
+#include <linux/mm.h>
+
+static struct {
+       struct fault_attr attr;
+
+       bool ignore_gfp_highmem;
+       bool ignore_gfp_reclaim;
+       u32 min_order;
+} fail_page_alloc = {
+       .attr = FAULT_ATTR_INITIALIZER,
+       .ignore_gfp_reclaim = true,
+       .ignore_gfp_highmem = true,
+       .min_order = 1,
+};
+
+static int __init setup_fail_page_alloc(char *str)
+{
+       return setup_fault_attr(&fail_page_alloc.attr, str);
+}
+__setup("fail_page_alloc=", setup_fail_page_alloc);
+
+bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
+{
+       int flags = 0;
+
+       if (order < fail_page_alloc.min_order)
+               return false;
+       if (gfp_mask & __GFP_NOFAIL)
+               return false;
+       if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
+               return false;
+       if (fail_page_alloc.ignore_gfp_reclaim &&
+                       (gfp_mask & __GFP_DIRECT_RECLAIM))
+               return false;
+
+       /* See comment in __should_failslab() */
+       if (gfp_mask & __GFP_NOWARN)
+               flags |= FAULT_NOWARN;
+
+       return should_fail_ex(&fail_page_alloc.attr, 1 << order, flags);
+}
+
+#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
+
+static int __init fail_page_alloc_debugfs(void)
+{
+       umode_t mode = S_IFREG | 0600;
+       struct dentry *dir;
+
+       dir = fault_create_debugfs_attr("fail_page_alloc", NULL,
+                                       &fail_page_alloc.attr);
+
+       debugfs_create_bool("ignore-gfp-wait", mode, dir,
+                           &fail_page_alloc.ignore_gfp_reclaim);
+       debugfs_create_bool("ignore-gfp-highmem", mode, dir,
+                           &fail_page_alloc.ignore_gfp_highmem);
+       debugfs_create_u32("min-order", mode, dir, &fail_page_alloc.min_order);
+
+       return 0;
+}
+
+late_initcall(fail_page_alloc_debugfs);
+
+#endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */
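
The new mm/fail_page_alloc.c carries the page-allocation fault-injection machinery that used to live in page_alloc.c; it is driven by the fail_page_alloc= boot parameter or by the debugfs knobs created above. For illustration only (not part of the patch), a user-space sketch that arms it, assuming debugfs is mounted at /sys/kernel/debug and the usual fault_attr files such as "probability" and "times" are present:

/* Sketch only: arm page-allocation fault injection from user space. */
#include <stdio.h>

static int write_knob(const char *name, const char *val)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/kernel/debug/fail_page_alloc/%s", name);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fputs(val, f);
        return fclose(f);
}

int main(void)
{
        write_knob("probability", "10");    /* fail ~10% of eligible allocations */
        write_knob("times", "100");         /* stop after 100 injected failures */
        write_knob("min-order", "1");       /* leave order-0 allocations alone */
        write_knob("ignore-gfp-wait", "1"); /* skip allocations that may reclaim */
        return 0;
}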
index 83dda76..9e44a49 100644
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/syscalls.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/file.h>
@@ -58,6 +59,8 @@
 
 #include <asm/mman.h>
 
+#include "swap.h"
+
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
  *    ->i_pages lock           (page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock                (page_remove_rmap->set_page_dirty)
  *    ->inode->i_lock          (page_remove_rmap->set_page_dirty)
- *    ->memcg->move_lock       (page_remove_rmap->lock_page_memcg)
+ *    ->memcg->move_lock       (page_remove_rmap->folio_memcg_lock)
  *    bdi.wb->list_lock                (zap_pte_range->set_page_dirty)
  *    ->inode->i_lock          (zap_pte_range->set_page_dirty)
  *    ->private_lock           (zap_pte_range->block_dirty_folio)
@@ -1359,8 +1362,6 @@ repeat:
 /**
  * migration_entry_wait_on_locked - Wait for a migration entry to be removed
  * @entry: migration swap entry.
- * @ptep: mapped pte pointer. Will return with the ptep unmapped. Only required
- *        for pte entries, pass NULL for pmd entries.
  * @ptl: already locked ptl. This function will drop the lock.
  *
  * Wait for a migration entry referencing the given page to be removed. This is
@@ -1369,13 +1370,13 @@ repeat:
  * should be called while holding the ptl for the migration entry referencing
  * the page.
  *
- * Returns after unmapping and unlocking the pte/ptl with pte_unmap_unlock().
+ * Returns after unlocking the ptl.
  *
  * This follows the same logic as folio_wait_bit_common() so see the comments
  * there.
  */
-void migration_entry_wait_on_locked(swp_entry_t entry, pte_t *ptep,
-                               spinlock_t *ptl)
+void migration_entry_wait_on_locked(swp_entry_t entry, spinlock_t *ptl)
+       __releases(ptl)
 {
        struct wait_page_queue wait_page;
        wait_queue_entry_t *wait = &wait_page.wait;
@@ -1409,10 +1410,7 @@ void migration_entry_wait_on_locked(swp_entry_t entry, pte_t *ptep,
         * a valid reference to the page, and it must take the ptl to remove the
         * migration entry. So the page is valid until the ptl is dropped.
         */
-       if (ptep)
-               pte_unmap_unlock(ptep, ptl);
-       else
-               spin_unlock(ptl);
+       spin_unlock(ptl);
 
        for (;;) {
                unsigned int flags;
@@ -1625,36 +1623,6 @@ void folio_end_writeback(struct folio *folio)
 }
 EXPORT_SYMBOL(folio_end_writeback);
 
-/*
- * After completing I/O on a page, call this routine to update the page
- * flags appropriately
- */
-void page_endio(struct page *page, bool is_write, int err)
-{
-       struct folio *folio = page_folio(page);
-
-       if (!is_write) {
-               if (!err) {
-                       folio_mark_uptodate(folio);
-               } else {
-                       folio_clear_uptodate(folio);
-                       folio_set_error(folio);
-               }
-               folio_unlock(folio);
-       } else {
-               if (err) {
-                       struct address_space *mapping;
-
-                       folio_set_error(folio);
-                       mapping = folio_mapping(folio);
-                       if (mapping)
-                               mapping_set_error(mapping, err);
-               }
-               folio_end_writeback(folio);
-       }
-}
-EXPORT_SYMBOL_GPL(page_endio);
-
 /**
  * __folio_lock - Get a lock on the folio, assuming we need to sleep to get it.
  * @folio: The folio to lock
@@ -1760,9 +1728,7 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
  *
  * Return: The index of the gap if found, otherwise an index outside the
  * range specified (in which case 'return - index >= max_scan' will be true).
- * In the rare case of index wrap-around, 0 will be returned.  0 will also
- * be returned if index == 0 and there is a gap at the index.  We can not
- * wrap-around if passed index == 0.
+ * In the rare case of index wrap-around, 0 will be returned.
  */
 pgoff_t page_cache_next_miss(struct address_space *mapping,
                             pgoff_t index, unsigned long max_scan)
@@ -1772,13 +1738,12 @@ pgoff_t page_cache_next_miss(struct address_space *mapping,
        while (max_scan--) {
                void *entry = xas_next(&xas);
                if (!entry || xa_is_value(entry))
-                       return xas.xa_index;
-               if (xas.xa_index == 0 && index != 0)
-                       return xas.xa_index;
+                       break;
+               if (xas.xa_index == 0)
+                       break;
        }
 
-       /* No gaps in range and no wrap-around, return index beyond range */
-       return xas.xa_index + 1;
+       return xas.xa_index;
 }
 EXPORT_SYMBOL(page_cache_next_miss);
 
@@ -1799,9 +1764,7 @@ EXPORT_SYMBOL(page_cache_next_miss);
  *
  * Return: The index of the gap if found, otherwise an index outside the
  * range specified (in which case 'index - return >= max_scan' will be true).
- * In the rare case of wrap-around, ULONG_MAX will be returned.  ULONG_MAX
- * will also be returned if index == ULONG_MAX and there is a gap at the
- * index.  We can not wrap-around if passed index == ULONG_MAX.
+ * In the rare case of wrap-around, ULONG_MAX will be returned.
  */
 pgoff_t page_cache_prev_miss(struct address_space *mapping,
                             pgoff_t index, unsigned long max_scan)
@@ -1811,13 +1774,12 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
        while (max_scan--) {
                void *entry = xas_prev(&xas);
                if (!entry || xa_is_value(entry))
-                       return xas.xa_index;
-               if (xas.xa_index == ULONG_MAX && index != ULONG_MAX)
-                       return xas.xa_index;
+                       break;
+               if (xas.xa_index == ULONG_MAX)
+                       break;
        }
 
-       /* No gaps in range and no wrap-around, return index beyond range */
-       return xas.xa_index - 1;
+       return xas.xa_index;
 }
 EXPORT_SYMBOL(page_cache_prev_miss);
 
@@ -2693,8 +2655,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
                if (unlikely(iocb->ki_pos >= i_size_read(inode)))
                        break;
 
-               error = filemap_get_pages(iocb, iter->count, &fbatch,
-                                         iov_iter_is_pipe(iter));
+               error = filemap_get_pages(iocb, iter->count, &fbatch, false);
                if (error < 0)
                        break;
 
@@ -2768,6 +2729,48 @@ put_folios:
 }
 EXPORT_SYMBOL_GPL(filemap_read);
 
+int kiocb_write_and_wait(struct kiocb *iocb, size_t count)
+{
+       struct address_space *mapping = iocb->ki_filp->f_mapping;
+       loff_t pos = iocb->ki_pos;
+       loff_t end = pos + count - 1;
+
+       if (iocb->ki_flags & IOCB_NOWAIT) {
+               if (filemap_range_needs_writeback(mapping, pos, end))
+                       return -EAGAIN;
+               return 0;
+       }
+
+       return filemap_write_and_wait_range(mapping, pos, end);
+}
+
+int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
+{
+       struct address_space *mapping = iocb->ki_filp->f_mapping;
+       loff_t pos = iocb->ki_pos;
+       loff_t end = pos + count - 1;
+       int ret;
+
+       if (iocb->ki_flags & IOCB_NOWAIT) {
+               /* we could block if there are any pages in the range */
+               if (filemap_range_has_page(mapping, pos, end))
+                       return -EAGAIN;
+       } else {
+               ret = filemap_write_and_wait_range(mapping, pos, end);
+               if (ret)
+                       return ret;
+       }
+
+       /*
+        * After a write we want buffered reads to be sure to go to disk to get
+        * the new data.  We invalidate clean cached page from the region we're
+        * about to write.  We do this *before* the write so that we can return
+        * without clobbering -EIOCBQUEUED from ->direct_IO().
+        */
+       return invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
+                                            end >> PAGE_SHIFT);
+}
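
kiocb_write_and_wait() and kiocb_invalidate_pages() factor the common direct-I/O prologue out of generic_file_read_iter() and generic_file_direct_write() (both converted below) so that other direct-I/O paths can share it. For illustration only, a rough sketch of a hypothetical direct-write path using the invalidation helper; privilege and timestamp handling are omitted:

/* Sketch only: mirrors the structure of generic_file_direct_write() below. */
static ssize_t example_dio_write(struct kiocb *iocb, struct iov_iter *from)
{
        ssize_t ret = kiocb_invalidate_pages(iocb, iov_iter_count(from));

        if (ret)
                /* -EBUSY: a page could not be invalidated; return 0 so the
                 * caller falls back to buffered I/O instead of failing. */
                return ret == -EBUSY ? 0 : ret;

        return iocb->ki_filp->f_mapping->a_ops->direct_IO(iocb, from);
}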
+
 /**
  * generic_file_read_iter - generic filesystem read routine
  * @iocb:      kernel I/O control block
@@ -2803,18 +2806,9 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
                struct address_space *mapping = file->f_mapping;
                struct inode *inode = mapping->host;
 
-               if (iocb->ki_flags & IOCB_NOWAIT) {
-                       if (filemap_range_needs_writeback(mapping, iocb->ki_pos,
-                                               iocb->ki_pos + count - 1))
-                               return -EAGAIN;
-               } else {
-                       retval = filemap_write_and_wait_range(mapping,
-                                               iocb->ki_pos,
-                                               iocb->ki_pos + count - 1);
-                       if (retval < 0)
-                               return retval;
-               }
-
+               retval = kiocb_write_and_wait(iocb, count);
+               if (retval < 0)
+                       return retval;
                file_accessed(file);
 
                retval = mapping->a_ops->direct_IO(iocb, iter);
@@ -2878,9 +2872,24 @@ size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
        return spliced;
 }
 
-/*
- * Splice folios from the pagecache of a buffered (ie. non-O_DIRECT) file into
- * a pipe.
+/**
+ * filemap_splice_read -  Splice data from a file's pagecache into a pipe
+ * @in: The file to read from
+ * @ppos: Pointer to the file position to read from
+ * @pipe: The pipe to splice into
+ * @len: The amount to splice
+ * @flags: The SPLICE_F_* flags
+ *
+ * This function gets folios from a file's pagecache and splices them into the
+ * pipe.  Readahead will be called as necessary to fill more folios.  This may
+ * be used for blockdevs also.
+ *
+ * Return: On success, the number of bytes read will be returned and *@ppos
+ * will be updated if appropriate; 0 will be returned if there is no more data
+ * to be read; -EAGAIN will be returned if the pipe had no space, and some
+ * other negative error code will be returned on error.  A short read may occur
+ * if the pipe has insufficient space, we reach the end of the data or we hit a
+ * hole.
  */
 ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
                            struct pipe_inode_info *pipe,
@@ -2893,6 +2902,9 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
        bool writably_mapped;
        int i, error = 0;
 
+       if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes))
+               return 0;
+
        init_sync_kiocb(&iocb, in);
        iocb.ki_pos = *ppos;
 
@@ -2906,7 +2918,7 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
        do {
                cond_resched();
 
-               if (*ppos >= i_size_read(file_inode(in)))
+               if (*ppos >= i_size_read(in->f_mapping->host))
                        break;
 
                iocb.ki_pos = *ppos;
@@ -2922,7 +2934,7 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
                 * part of the page is not copied back to userspace (unless
                 * another truncate extends the file - this is desired though).
                 */
-               isize = i_size_read(file_inode(in));
+               isize = i_size_read(in->f_mapping->host);
                if (unlikely(*ppos >= isize))
                        break;
                end_offset = min_t(loff_t, isize, *ppos + len);
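
With the kerneldoc added above, filemap_splice_read() is the documented entry point for splicing out of the page cache, and pagecache-backed filesystems (and now block devices) can point their file_operations at it directly. An illustrative wiring, with a hypothetical "examplefs" ops table:

/* Sketch only: use the generic helper documented above for ->splice_read. */
static const struct file_operations examplefs_file_ops = {
        .llseek         = generic_file_llseek,
        .read_iter      = generic_file_read_iter,
        .splice_read    = filemap_splice_read,
};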
@@ -3419,13 +3431,6 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
        if (pmd_none(*vmf->pmd))
                pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
 
-       /* See comment in handle_pte_fault() */
-       if (pmd_devmap_trans_unstable(vmf->pmd)) {
-               folio_unlock(folio);
-               folio_put(folio);
-               return true;
-       }
-
        return false;
 }
 
@@ -3512,6 +3517,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 
        addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+       if (!vmf->pte) {
+               folio_unlock(folio);
+               folio_put(folio);
+               goto out;
+       }
        do {
 again:
                page = folio_file_page(folio, xas.xa_index);
@@ -3530,7 +3540,7 @@ again:
                 * handled in the specific fault path, and it'll prohibit the
                 * fault-around logic.
                 */
-               if (!pte_none(*vmf->pte))
+               if (!pte_none(ptep_get(vmf->pte)))
                        goto unlock;
 
                /* We're about to handle the fault */
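
Throughout this series pte_offset_map_lock() (and pte_offset_map()) may return NULL when the page table has been freed or is unstable, and callers such as filemap_map_pages() above must check for that and read the entry through ptep_get(). A sketch of the pattern, for illustration only:

/* Sketch only: the checking pattern callers are now expected to follow. */
static int example_walk_one(struct mm_struct *mm, pmd_t *pmd,
                            unsigned long addr)
{
        spinlock_t *ptl;
        pte_t *pte, entry;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return 0;               /* no page table here: retry or bail out */

        entry = ptep_get(pte);          /* read the entry once, not *pte */
        /* ... act on "entry" ... */
        pte_unmap_unlock(pte, ptl);
        return 1;
}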
@@ -3789,7 +3799,7 @@ EXPORT_SYMBOL(read_cache_page_gfp);
 /*
  * Warn about a page cache invalidation failure during a direct I/O write.
  */
-void dio_warn_stale_pagecache(struct file *filp)
+static void dio_warn_stale_pagecache(struct file *filp)
 {
        static DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST);
        char pathname[128];
@@ -3806,48 +3816,33 @@ void dio_warn_stale_pagecache(struct file *filp)
        }
 }
 
-ssize_t
-generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
+void kiocb_invalidate_post_direct_write(struct kiocb *iocb, size_t count)
 {
-       struct file     *file = iocb->ki_filp;
-       struct address_space *mapping = file->f_mapping;
-       struct inode    *inode = mapping->host;
-       loff_t          pos = iocb->ki_pos;
-       ssize_t         written;
-       size_t          write_len;
-       pgoff_t         end;
+       struct address_space *mapping = iocb->ki_filp->f_mapping;
 
-       write_len = iov_iter_count(from);
-       end = (pos + write_len - 1) >> PAGE_SHIFT;
+       if (mapping->nrpages &&
+           invalidate_inode_pages2_range(mapping,
+                       iocb->ki_pos >> PAGE_SHIFT,
+                       (iocb->ki_pos + count - 1) >> PAGE_SHIFT))
+               dio_warn_stale_pagecache(iocb->ki_filp);
+}
 
-       if (iocb->ki_flags & IOCB_NOWAIT) {
-               /* If there are pages to writeback, return */
-               if (filemap_range_has_page(file->f_mapping, pos,
-                                          pos + write_len - 1))
-                       return -EAGAIN;
-       } else {
-               written = filemap_write_and_wait_range(mapping, pos,
-                                                       pos + write_len - 1);
-               if (written)
-                       goto out;
-       }
+ssize_t
+generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
+{
+       struct address_space *mapping = iocb->ki_filp->f_mapping;
+       size_t write_len = iov_iter_count(from);
+       ssize_t written;
 
        /*
-        * After a write we want buffered reads to be sure to go to disk to get
-        * the new data.  We invalidate clean cached page from the region we're
-        * about to write.  We do this *before* the write so that we can return
-        * without clobbering -EIOCBQUEUED from ->direct_IO().
-        */
-       written = invalidate_inode_pages2_range(mapping,
-                                       pos >> PAGE_SHIFT, end);
-       /*
         * If a page can not be invalidated, return 0 to fall back
         * to buffered write.
         */
+       written = kiocb_invalidate_pages(iocb, write_len);
        if (written) {
                if (written == -EBUSY)
                        return 0;
-               goto out;
+               return written;
        }
 
        written = mapping->a_ops->direct_IO(iocb, from);
@@ -3869,11 +3864,11 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
         *
         * Skip invalidation for async writes or if mapping has no pages.
         */
-       if (written > 0 && mapping->nrpages &&
-           invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT, end))
-               dio_warn_stale_pagecache(file);
-
        if (written > 0) {
+               struct inode *inode = mapping->host;
+               loff_t pos = iocb->ki_pos;
+
+               kiocb_invalidate_post_direct_write(iocb, written);
                pos += written;
                write_len -= written;
                if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
@@ -3884,7 +3879,6 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
        }
        if (written != -EIOCBQUEUED)
                iov_iter_revert(from, write_len - iov_iter_count(from));
-out:
        return written;
 }
 EXPORT_SYMBOL(generic_file_direct_write);
@@ -3963,7 +3957,10 @@ again:
                balance_dirty_pages_ratelimited(mapping);
        } while (iov_iter_count(i));
 
-       return written ? written : status;
+       if (!written)
+               return status;
+       iocb->ki_pos += written;
+       return written;
 }
 EXPORT_SYMBOL(generic_perform_write);
 
@@ -3992,25 +3989,19 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
        struct file *file = iocb->ki_filp;
        struct address_space *mapping = file->f_mapping;
-       struct inode    *inode = mapping->host;
-       ssize_t         written = 0;
-       ssize_t         err;
-       ssize_t         status;
-
-       /* We can write back this queue in page reclaim */
-       current->backing_dev_info = inode_to_bdi(inode);
-       err = file_remove_privs(file);
-       if (err)
-               goto out;
+       struct inode *inode = mapping->host;
+       ssize_t ret;
 
-       err = file_update_time(file);
-       if (err)
-               goto out;
+       ret = file_remove_privs(file);
+       if (ret)
+               return ret;
 
-       if (iocb->ki_flags & IOCB_DIRECT) {
-               loff_t pos, endbyte;
+       ret = file_update_time(file);
+       if (ret)
+               return ret;
 
-               written = generic_file_direct_write(iocb, from);
+       if (iocb->ki_flags & IOCB_DIRECT) {
+               ret = generic_file_direct_write(iocb, from);
                /*
                 * If the write stopped short of completing, fall back to
                 * buffered writes.  Some filesystems do this for writes to
@@ -4018,49 +4009,13 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
                 * not succeed (even if it did, DAX does not handle dirty
                 * page-cache pages correctly).
                 */
-               if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
-                       goto out;
-
-               pos = iocb->ki_pos;
-               status = generic_perform_write(iocb, from);
-               /*
-                * If generic_perform_write() returned a synchronous error
-                * then we want to return the number of bytes which were
-                * direct-written, or the error code if that was zero.  Note
-                * that this differs from normal direct-io semantics, which
-                * will return -EFOO even if some bytes were written.
-                */
-               if (unlikely(status < 0)) {
-                       err = status;
-                       goto out;
-               }
-               /*
-                * We need to ensure that the page cache pages are written to
-                * disk and invalidated to preserve the expected O_DIRECT
-                * semantics.
-                */
-               endbyte = pos + status - 1;
-               err = filemap_write_and_wait_range(mapping, pos, endbyte);
-               if (err == 0) {
-                       iocb->ki_pos = endbyte + 1;
-                       written += status;
-                       invalidate_mapping_pages(mapping,
-                                                pos >> PAGE_SHIFT,
-                                                endbyte >> PAGE_SHIFT);
-               } else {
-                       /*
-                        * We don't know how much we wrote, so just return
-                        * the number of bytes which were direct-written
-                        */
-               }
-       } else {
-               written = generic_perform_write(iocb, from);
-               if (likely(written > 0))
-                       iocb->ki_pos += written;
+               if (ret < 0 || !iov_iter_count(from) || IS_DAX(inode))
+                       return ret;
+               return direct_write_fallback(iocb, from, ret,
+                               generic_perform_write(iocb, from));
        }
-out:
-       current->backing_dev_info = NULL;
-       return written ? written : err;
+
+       return generic_perform_write(iocb, from);
 }
 EXPORT_SYMBOL(__generic_file_write_iter);
 
@@ -4125,3 +4080,171 @@ bool filemap_release_folio(struct folio *folio, gfp_t gfp)
        return try_to_free_buffers(folio);
 }
 EXPORT_SYMBOL(filemap_release_folio);
+
+#ifdef CONFIG_CACHESTAT_SYSCALL
+/**
+ * filemap_cachestat() - compute the page cache statistics of a mapping
+ * @mapping:   The mapping to compute the statistics for.
+ * @first_index:       The starting page cache index.
+ * @last_index:        The final page index (inclusive).
+ * @cs:        the cachestat struct to write the result to.
+ *
+ * This will query the page cache statistics of a mapping in the
+ * page range of [first_index, last_index] (inclusive). The statistics
+ * queried include: number of dirty pages, number of pages marked for
+ * writeback, and the number of (recently) evicted pages.
+ */
+static void filemap_cachestat(struct address_space *mapping,
+               pgoff_t first_index, pgoff_t last_index, struct cachestat *cs)
+{
+       XA_STATE(xas, &mapping->i_pages, first_index);
+       struct folio *folio;
+
+       rcu_read_lock();
+       xas_for_each(&xas, folio, last_index) {
+               unsigned long nr_pages;
+               pgoff_t folio_first_index, folio_last_index;
+
+               if (xas_retry(&xas, folio))
+                       continue;
+
+               if (xa_is_value(folio)) {
+                       /* page is evicted */
+                       void *shadow = (void *)folio;
+                       bool workingset; /* not used */
+                       int order = xa_get_order(xas.xa, xas.xa_index);
+
+                       nr_pages = 1 << order;
+                       folio_first_index = round_down(xas.xa_index, 1 << order);
+                       folio_last_index = folio_first_index + nr_pages - 1;
+
+                       /* Folios might straddle the range boundaries, only count covered pages */
+                       if (folio_first_index < first_index)
+                               nr_pages -= first_index - folio_first_index;
+
+                       if (folio_last_index > last_index)
+                               nr_pages -= folio_last_index - last_index;
+
+                       cs->nr_evicted += nr_pages;
+
+#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
+                       if (shmem_mapping(mapping)) {
+                               /* shmem file - in swap cache */
+                               swp_entry_t swp = radix_to_swp_entry(folio);
+
+                               shadow = get_shadow_from_swap_cache(swp);
+                       }
+#endif
+                       if (workingset_test_recent(shadow, true, &workingset))
+                               cs->nr_recently_evicted += nr_pages;
+
+                       goto resched;
+               }
+
+               nr_pages = folio_nr_pages(folio);
+               folio_first_index = folio_pgoff(folio);
+               folio_last_index = folio_first_index + nr_pages - 1;
+
+               /* Folios might straddle the range boundaries, only count covered pages */
+               if (folio_first_index < first_index)
+                       nr_pages -= first_index - folio_first_index;
+
+               if (folio_last_index > last_index)
+                       nr_pages -= folio_last_index - last_index;
+
+               /* page is in cache */
+               cs->nr_cache += nr_pages;
+
+               if (folio_test_dirty(folio))
+                       cs->nr_dirty += nr_pages;
+
+               if (folio_test_writeback(folio))
+                       cs->nr_writeback += nr_pages;
+
+resched:
+               if (need_resched()) {
+                       xas_pause(&xas);
+                       cond_resched_rcu();
+               }
+       }
+       rcu_read_unlock();
+}
+
+/*
+ * The cachestat(2) system call.
+ *
+ * cachestat() returns the page cache statistics of a file in the
+ * bytes range specified by `off` and `len`: number of cached pages,
+ * number of dirty pages, number of pages marked for writeback,
+ * number of evicted pages, and number of recently evicted pages.
+ *
+ * An evicted page is a page that is previously in the page cache
+ * but has been evicted since. A page is recently evicted if its last
+ * eviction was recent enough that its reentry to the cache would
+ * indicate that it is actively being used by the system, and that
+ * there is memory pressure on the system.
+ *
+ * `off` and `len` must be non-negative integers. If `len` > 0,
+ * the queried range is [`off`, `off` + `len`]. If `len` == 0,
+ * we will query in the range from `off` to the end of the file.
+ *
+ * The `flags` argument is unused for now, but is included for future
+ * extensibility. User should pass 0 (i.e no flag specified).
+ *
+ * Currently, hugetlbfs is not supported.
+ *
+ * Because the status of a page can change after cachestat() checks it
+ * but before it returns to the application, the returned values may
+ * contain stale information.
+ *
+ * return values:
+ *  zero        - success
+ *  -EFAULT     - cstat or cstat_range points to an illegal address
+ *  -EINVAL     - invalid flags
+ *  -EBADF      - invalid file descriptor
+ *  -EOPNOTSUPP - file descriptor is of a hugetlbfs file
+ */
+SYSCALL_DEFINE4(cachestat, unsigned int, fd,
+               struct cachestat_range __user *, cstat_range,
+               struct cachestat __user *, cstat, unsigned int, flags)
+{
+       struct fd f = fdget(fd);
+       struct address_space *mapping;
+       struct cachestat_range csr;
+       struct cachestat cs;
+       pgoff_t first_index, last_index;
+
+       if (!f.file)
+               return -EBADF;
+
+       if (copy_from_user(&csr, cstat_range,
+                       sizeof(struct cachestat_range))) {
+               fdput(f);
+               return -EFAULT;
+       }
+
+       /* hugetlbfs is not supported */
+       if (is_file_hugepages(f.file)) {
+               fdput(f);
+               return -EOPNOTSUPP;
+       }
+
+       if (flags != 0) {
+               fdput(f);
+               return -EINVAL;
+       }
+
+       first_index = csr.off >> PAGE_SHIFT;
+       last_index =
+               csr.len == 0 ? ULONG_MAX : (csr.off + csr.len - 1) >> PAGE_SHIFT;
+       memset(&cs, 0, sizeof(struct cachestat));
+       mapping = f.file->f_mapping;
+       filemap_cachestat(mapping, first_index, last_index, &cs);
+       fdput(f);
+
+       if (copy_to_user(cstat, &cs, sizeof(struct cachestat)))
+               return -EFAULT;
+
+       return 0;
+}
+#endif /* CONFIG_CACHESTAT_SYSCALL */
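
The comment block above is effectively the manual page for the new system call. Until libc wrappers appear, cachestat(2) is reachable through syscall(2); for illustration only, a user-space sketch assuming the uapi definitions in <linux/mman.h> and the __NR_cachestat number (451 in this release):

/* Sketch only: query page cache statistics for a whole file. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mman.h>         /* struct cachestat, struct cachestat_range */

int main(int argc, char **argv)
{
        struct cachestat_range range = { .off = 0, .len = 0 }; /* len == 0: to EOF */
        struct cachestat cs;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || syscall(__NR_cachestat, fd, &range, &cs, 0) != 0) {
                perror("cachestat");
                return 1;
        }
        printf("cached %llu dirty %llu writeback %llu evicted %llu recently evicted %llu\n",
               (unsigned long long)cs.nr_cache,
               (unsigned long long)cs.nr_dirty,
               (unsigned long long)cs.nr_writeback,
               (unsigned long long)cs.nr_evicted,
               (unsigned long long)cs.nr_recently_evicted);
        return 0;
}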
index 279e55b..2fb5df3 100644
@@ -206,6 +206,7 @@ int __frontswap_load(struct page *page)
        int type = swp_type(entry);
        struct swap_info_struct *sis = swap_info[type];
        pgoff_t offset = swp_offset(entry);
+       bool exclusive = false;
 
        VM_BUG_ON(!frontswap_ops);
        VM_BUG_ON(!PageLocked(page));
@@ -215,9 +216,14 @@ int __frontswap_load(struct page *page)
                return -1;
 
        /* Try loading from each implementation, until one succeeds. */
-       ret = frontswap_ops->load(type, offset, page);
-       if (ret == 0)
+       ret = frontswap_ops->load(type, offset, page, &exclusive);
+       if (ret == 0) {
                inc_frontswap_loads();
+               if (exclusive) {
+                       SetPageDirty(page);
+                       __frontswap_clear(sis, offset);
+               }
+       }
        return ret;
 }
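
A frontswap backend can now report a loaded page as exclusive, meaning it has already dropped its own copy; the core then marks the page dirty and clears the frontswap slot, as shown above. For illustration only, a sketch of a backend ->load() using the new four-argument signature (the backend name is hypothetical):

/* Sketch only: a frontswap_ops.load() for the signature called above. */
static int example_backend_load(unsigned type, pgoff_t offset,
                                struct page *page, bool *exclusive)
{
        /* ... copy the data stored for (type, offset) into @page ... */

        /* The backend freed its copy, so the core must keep the page dirty
         * and forget the frontswap slot. */
        *exclusive = true;
        return 0;
}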
 
index bbe4162..48c1659 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -18,6 +18,7 @@
 #include <linux/migrate.h>
 #include <linux/mm_inline.h>
 #include <linux/sched/mm.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -51,7 +52,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
                struct page *page = *pages;
                struct folio *folio = page_folio(page);
 
-               if (!folio_test_anon(folio))
+               if (is_zero_page(page) ||
+                   !folio_test_anon(folio))
                        continue;
                if (!folio_test_large(folio) || folio_test_hugetlb(folio))
                        VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
@@ -123,63 +125,72 @@ retry:
  */
 struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 {
+       struct folio *folio;
+
+       if (WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == 0))
+               return NULL;
+
        if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
                return NULL;
 
        if (flags & FOLL_GET)
                return try_get_folio(page, refs);
-       else if (flags & FOLL_PIN) {
-               struct folio *folio;
 
-               /*
-                * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
-                * right zone, so fail and let the caller fall back to the slow
-                * path.
-                */
-               if (unlikely((flags & FOLL_LONGTERM) &&
-                            !is_longterm_pinnable_page(page)))
-                       return NULL;
+       /* FOLL_PIN is set */
 
-               /*
-                * CAUTION: Don't use compound_head() on the page before this
-                * point, the result won't be stable.
-                */
-               folio = try_get_folio(page, refs);
-               if (!folio)
-                       return NULL;
-
-               /*
-                * When pinning a large folio, use an exact count to track it.
-                *
-                * However, be sure to *also* increment the normal folio
-                * refcount field at least once, so that the folio really
-                * is pinned.  That's why the refcount from the earlier
-                * try_get_folio() is left intact.
-                */
-               if (folio_test_large(folio))
-                       atomic_add(refs, &folio->_pincount);
-               else
-                       folio_ref_add(folio,
-                                       refs * (GUP_PIN_COUNTING_BIAS - 1));
-               /*
-                * Adjust the pincount before re-checking the PTE for changes.
-                * This is essentially a smp_mb() and is paired with a memory
-                * barrier in page_try_share_anon_rmap().
-                */
-               smp_mb__after_atomic();
+       /*
+        * Don't take a pin on the zero page - it's not going anywhere
+        * and it is used in a *lot* of places.
+        */
+       if (is_zero_page(page))
+               return page_folio(page);
 
-               node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
+       folio = try_get_folio(page, refs);
+       if (!folio)
+               return NULL;
 
-               return folio;
+       /*
+        * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
+        * right zone, so fail and let the caller fall back to the slow
+        * path.
+        */
+       if (unlikely((flags & FOLL_LONGTERM) &&
+                    !folio_is_longterm_pinnable(folio))) {
+               if (!put_devmap_managed_page_refs(&folio->page, refs))
+                       folio_put_refs(folio, refs);
+               return NULL;
        }
 
-       WARN_ON_ONCE(1);
-       return NULL;
+       /*
+        * When pinning a large folio, use an exact count to track it.
+        *
+        * However, be sure to *also* increment the normal folio
+        * refcount field at least once, so that the folio really
+        * is pinned.  That's why the refcount from the earlier
+        * try_get_folio() is left intact.
+        */
+       if (folio_test_large(folio))
+               atomic_add(refs, &folio->_pincount);
+       else
+               folio_ref_add(folio,
+                               refs * (GUP_PIN_COUNTING_BIAS - 1));
+       /*
+        * Adjust the pincount before re-checking the PTE for changes.
+        * This is essentially a smp_mb() and is paired with a memory
+        * barrier in page_try_share_anon_rmap().
+        */
+       smp_mb__after_atomic();
+
+       node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
+
+       return folio;
 }
 
 static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 {
        if (flags & FOLL_PIN) {
+               if (is_zero_folio(folio))
+                       return;
                node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
                if (folio_test_large(folio))
                        atomic_sub(refs, &folio->_pincount);
@@ -225,6 +236,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
                folio_ref_inc(folio);
        else if (flags & FOLL_PIN) {
                /*
+                * Don't take a pin on the zero page - it's not going anywhere
+                * and it is used in a *lot* of places.
+                */
+               if (is_zero_page(page))
+                       return 0;
+
+               /*
                 * Similar to try_grab_folio(): be sure to *also*
                 * increment the normal page refcount field at least once,
                 * so that the page really is pinned.
@@ -258,6 +276,33 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);
 
+/**
+ * folio_add_pin - Try to get an additional pin on a pinned folio
+ * @folio: The folio to be pinned
+ *
+ * Get an additional pin on a folio we already have a pin on.  Makes no change
+ * if the folio is a zero_page.
+ */
+void folio_add_pin(struct folio *folio)
+{
+       if (is_zero_folio(folio))
+               return;
+
+       /*
+        * Similar to try_grab_folio(): be sure to *also* increment the normal
+        * page refcount field at least once, so that the page really is
+        * pinned.
+        */
+       if (folio_test_large(folio)) {
+               WARN_ON_ONCE(atomic_read(&folio->_pincount) < 1);
+               folio_ref_inc(folio);
+               atomic_inc(&folio->_pincount);
+       } else {
+               WARN_ON_ONCE(folio_ref_count(folio) < GUP_PIN_COUNTING_BIAS);
+               folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+       }
+}
+
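
folio_add_pin() lets a holder of an existing pin hand a further pin to another owner without going back through GUP. For illustration only, a minimal sketch of the intended pairing:

/* Sketch only: grant a second owner its own pin on an already-pinned folio.
 * Each owner later releases its pin, e.g. via unpin_user_page(). */
static void example_share_pin(struct folio *folio)
{
        folio_add_pin(folio);   /* pins again; refcount no-op on the zero folio */
}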
 static inline struct folio *gup_folio_range_next(struct page *start,
                unsigned long npages, unsigned long i, unsigned int *ntails)
 {
@@ -476,13 +521,14 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
                pte_t *pte, unsigned int flags)
 {
        if (flags & FOLL_TOUCH) {
-               pte_t entry = *pte;
+               pte_t orig_entry = ptep_get(pte);
+               pte_t entry = orig_entry;
 
                if (flags & FOLL_WRITE)
                        entry = pte_mkdirty(entry);
                entry = pte_mkyoung(entry);
 
-               if (!pte_same(*pte, entry)) {
+               if (!pte_same(orig_entry, entry)) {
                        set_pte_at(vma->vm_mm, address, pte, entry);
                        update_mmu_cache(vma, address, pte);
                }
@@ -544,11 +590,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
        if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
                         (FOLL_PIN | FOLL_GET)))
                return ERR_PTR(-EINVAL);
-       if (unlikely(pmd_bad(*pmd)))
-               return no_page_table(vma, flags);
 
        ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
-       pte = *ptep;
+       if (!ptep)
+               return no_page_table(vma, flags);
+       pte = ptep_get(ptep);
        if (!pte_present(pte))
                goto no_page;
        if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
@@ -653,11 +699,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
        struct mm_struct *mm = vma->vm_mm;
 
        pmd = pmd_offset(pudp, address);
-       /*
-        * The READ_ONCE() will stabilize the pmdval in a register or
-        * on the stack so that it will stop changing under the code.
-        */
-       pmdval = READ_ONCE(*pmd);
+       pmdval = pmdp_get_lockless(pmd);
        if (pmd_none(pmdval))
                return no_page_table(vma, flags);
        if (!pmd_present(pmdval))
@@ -685,21 +727,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
                return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
        }
        if (flags & FOLL_SPLIT_PMD) {
-               int ret;
-               page = pmd_page(*pmd);
-               if (is_huge_zero_page(page)) {
-                       spin_unlock(ptl);
-                       ret = 0;
-                       split_huge_pmd(vma, pmd, address);
-                       if (pmd_trans_unstable(pmd))
-                               ret = -EBUSY;
-               } else {
-                       spin_unlock(ptl);
-                       split_huge_pmd(vma, pmd, address);
-                       ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
-               }
-
-               return ret ? ERR_PTR(ret) :
+               spin_unlock(ptl);
+               split_huge_pmd(vma, pmd, address);
+               /* If pmd was left empty, stuff a page table in there quickly */
+               return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
                        follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
        }
        page = follow_trans_huge_pmd(vma, address, pmd, flags);
@@ -835,6 +866,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
        pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;
+       pte_t entry;
        int ret = -EFAULT;
 
        /* user gate pages are read-only */
@@ -855,18 +887,20 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
        pmd = pmd_offset(pud, address);
        if (!pmd_present(*pmd))
                return -EFAULT;
-       VM_BUG_ON(pmd_trans_huge(*pmd));
        pte = pte_offset_map(pmd, address);
-       if (pte_none(*pte))
+       if (!pte)
+               return -EFAULT;
+       entry = ptep_get(pte);
+       if (pte_none(entry))
                goto unmap;
        *vma = get_gate_vma(mm);
        if (!page)
                goto out;
-       *page = vm_normal_page(*vma, address, *pte);
+       *page = vm_normal_page(*vma, address, entry);
        if (!*page) {
-               if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(*pte)))
+               if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(entry)))
                        goto unmap;
-               *page = pte_page(*pte);
+               *page = pte_page(entry);
        }
        ret = try_grab_page(*page, gup_flags);
        if (unlikely(ret))
@@ -959,16 +993,54 @@ static int faultin_page(struct vm_area_struct *vma,
        return 0;
 }
 
+/*
+ * Writing to file-backed mappings which require folio dirty tracking using GUP
+ * is a fundamentally broken operation, as kernel write access to GUP mappings
+ * do not adhere to the semantics expected by a file system.
+ *
+ * Consider the following scenario:-
+ *
+ * 1. A folio is written to via GUP which write-faults the memory, notifying
+ *    the file system and dirtying the folio.
+ * 2. Later, writeback is triggered, resulting in the folio being cleaned and
+ *    the PTE being marked read-only.
+ * 3. The GUP caller writes to the folio, as it is mapped read/write via the
+ *    direct mapping.
+ * 4. The GUP caller, now done with the page, unpins it and sets it dirty
+ *    (though it does not have to).
+ *
+ * This results in both data being written to a folio without writenotify, and
+ * the folio being dirtied unexpectedly (if the caller decides to do so).
+ */
+static bool writable_file_mapping_allowed(struct vm_area_struct *vma,
+                                         unsigned long gup_flags)
+{
+       /*
+        * If we aren't pinning then no problematic write can occur. A long term
+        * pin is the most egregious case so this is the case we disallow.
+        */
+       if ((gup_flags & (FOLL_PIN | FOLL_LONGTERM)) !=
+           (FOLL_PIN | FOLL_LONGTERM))
+               return true;
+
+       /*
+        * If the VMA does not require dirty tracking then no problematic write
+        * can occur either.
+        */
+       return !vma_needs_dirty_tracking(vma);
+}
+
 static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 {
        vm_flags_t vm_flags = vma->vm_flags;
        int write = (gup_flags & FOLL_WRITE);
        int foreign = (gup_flags & FOLL_REMOTE);
+       bool vma_anon = vma_is_anonymous(vma);
 
        if (vm_flags & (VM_IO | VM_PFNMAP))
                return -EFAULT;
 
-       if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
+       if ((gup_flags & FOLL_ANON) && !vma_anon)
                return -EFAULT;
 
        if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
@@ -978,6 +1050,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
                return -EFAULT;
 
        if (write) {
+               if (!vma_anon &&
+                   !writable_file_mapping_allowed(vma, gup_flags))
+                       return -EFAULT;
+
                if (!(vm_flags & VM_WRITE)) {
                        if (!(gup_flags & FOLL_FORCE))
                                return -EFAULT;
@@ -1024,8 +1100,6 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
  * @pages:     array that receives pointers to the pages pinned.
  *             Should be at least nr_pages long. Or NULL, if caller
  *             only intends to ensure the pages are faulted in.
- * @vmas:      array of pointers to vmas corresponding to each page.
- *             Or NULL if the caller does not require them.
  * @locked:     whether we're still with the mmap_lock held
  *
  * Returns either number of pages pinned (which may be less than the
@@ -1039,8 +1113,6 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
  *
  * The caller is responsible for releasing returned @pages, via put_page().
  *
- * @vmas are valid only as long as mmap_lock is held.
- *
  * Must be called with mmap_lock held.  It may be released.  See below.
  *
  * __get_user_pages walks a process's page tables and takes a reference to
@@ -1076,7 +1148,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 static long __get_user_pages(struct mm_struct *mm,
                unsigned long start, unsigned long nr_pages,
                unsigned int gup_flags, struct page **pages,
-               struct vm_area_struct **vmas, int *locked)
+               int *locked)
 {
        long ret = 0, i = 0;
        struct vm_area_struct *vma = NULL;
@@ -1116,9 +1188,9 @@ static long __get_user_pages(struct mm_struct *mm,
                                goto out;
 
                        if (is_vm_hugetlb_page(vma)) {
-                               i = follow_hugetlb_page(mm, vma, pages, vmas,
-                                               &start, &nr_pages, i,
-                                               gup_flags, locked);
+                               i = follow_hugetlb_page(mm, vma, pages,
+                                                       &start, &nr_pages, i,
+                                                       gup_flags, locked);
                                if (!*locked) {
                                        /*
                                         * We've got a VM_FAULT_RETRY
@@ -1183,10 +1255,6 @@ retry:
                        ctx.page_mask = 0;
                }
 next_page:
-               if (vmas) {
-                       vmas[i] = vma;
-                       ctx.page_mask = 0;
-               }
                page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
                if (page_increm > nr_pages)
                        page_increm = nr_pages;
@@ -1341,7 +1409,6 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
                                                unsigned long start,
                                                unsigned long nr_pages,
                                                struct page **pages,
-                                               struct vm_area_struct **vmas,
                                                int *locked,
                                                unsigned int flags)
 {
@@ -1379,7 +1446,7 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
        pages_done = 0;
        for (;;) {
                ret = __get_user_pages(mm, start, nr_pages, flags, pages,
-                                      vmas, locked);
+                                      locked);
                if (!(flags & FOLL_UNLOCKABLE)) {
                        /* VM_FAULT_RETRY couldn't trigger, bypass */
                        pages_done = ret;
@@ -1443,7 +1510,7 @@ retry:
 
                *locked = 1;
                ret = __get_user_pages(mm, start, 1, flags | FOLL_TRIED,
-                                      pages, NULL, locked);
+                                      pages, locked);
                if (!*locked) {
                        /* Continue to retry until we succeeded */
                        BUG_ON(ret != 0);
@@ -1541,7 +1608,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
         * not result in a stack expansion that recurses back here.
         */
        ret = __get_user_pages(mm, start, nr_pages, gup_flags,
-                               NULL, NULL, locked ? locked : &local_locked);
+                              NULL, locked ? locked : &local_locked);
        lru_add_drain();
        return ret;
 }
@@ -1599,7 +1666,7 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
                return -EINVAL;
 
        ret = __get_user_pages(mm, start, nr_pages, gup_flags,
-                               NULL, NULL, locked);
+                              NULL, locked);
        lru_add_drain();
        return ret;
 }
@@ -1667,8 +1734,7 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 #else /* CONFIG_MMU */
 static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
                unsigned long nr_pages, struct page **pages,
-               struct vm_area_struct **vmas, int *locked,
-               unsigned int foll_flags)
+               int *locked, unsigned int foll_flags)
 {
        struct vm_area_struct *vma;
        bool must_unlock = false;
@@ -1712,8 +1778,7 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
                        if (pages[i])
                                get_page(pages[i]);
                }
-               if (vmas)
-                       vmas[i] = vma;
+
                start = (start + PAGE_SIZE) & PAGE_MASK;
        }
 
@@ -1894,8 +1959,7 @@ struct page *get_dump_page(unsigned long addr)
        int locked = 0;
        int ret;
 
-       ret = __get_user_pages_locked(current->mm, addr, 1, &page, NULL,
-                                     &locked,
+       ret = __get_user_pages_locked(current->mm, addr, 1, &page, &locked,
                                      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
        return (ret == 1) ? page : NULL;
 }
@@ -2068,7 +2132,6 @@ static long __gup_longterm_locked(struct mm_struct *mm,
                                  unsigned long start,
                                  unsigned long nr_pages,
                                  struct page **pages,
-                                 struct vm_area_struct **vmas,
                                  int *locked,
                                  unsigned int gup_flags)
 {
@@ -2076,13 +2139,13 @@ static long __gup_longterm_locked(struct mm_struct *mm,
        long rc, nr_pinned_pages;
 
        if (!(gup_flags & FOLL_LONGTERM))
-               return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+               return __get_user_pages_locked(mm, start, nr_pages, pages,
                                               locked, gup_flags);
 
        flags = memalloc_pin_save();
        do {
                nr_pinned_pages = __get_user_pages_locked(mm, start, nr_pages,
-                                                         pages, vmas, locked,
+                                                         pages, locked,
                                                          gup_flags);
                if (nr_pinned_pages <= 0) {
                        rc = nr_pinned_pages;
@@ -2100,9 +2163,8 @@ static long __gup_longterm_locked(struct mm_struct *mm,
  * Check that the given flags are valid for the exported gup/pup interface, and
  * update them with the required flags that the caller must have set.
  */
-static bool is_valid_gup_args(struct page **pages, struct vm_area_struct **vmas,
-                             int *locked, unsigned int *gup_flags_p,
-                             unsigned int to_set)
+static bool is_valid_gup_args(struct page **pages, int *locked,
+                             unsigned int *gup_flags_p, unsigned int to_set)
 {
        unsigned int gup_flags = *gup_flags_p;
 
@@ -2144,13 +2206,6 @@ static bool is_valid_gup_args(struct page **pages, struct vm_area_struct **vmas,
                         (gup_flags & FOLL_PCI_P2PDMA)))
                return false;
 
-       /*
-        * Can't use VMAs with locked, as locked allows GUP to unlock
-        * which invalidates the vmas array
-        */
-       if (WARN_ON_ONCE(vmas && (gup_flags & FOLL_UNLOCKABLE)))
-               return false;
-
        *gup_flags_p = gup_flags;
        return true;
 }
@@ -2165,8 +2220,6 @@ static bool is_valid_gup_args(struct page **pages, struct vm_area_struct **vmas,
  * @pages:     array that receives pointers to the pages pinned.
  *             Should be at least nr_pages long. Or NULL, if caller
  *             only intends to ensure the pages are faulted in.
- * @vmas:      array of pointers to vmas corresponding to each page.
- *             Or NULL if the caller does not require them.
  * @locked:    pointer to lock flag indicating whether lock is held and
  *             subsequently whether VM_FAULT_RETRY functionality can be
  *             utilised. Lock must initially be held.
@@ -2181,8 +2234,6 @@ static bool is_valid_gup_args(struct page **pages, struct vm_area_struct **vmas,
  *
  * The caller is responsible for releasing returned @pages, via put_page().
  *
- * @vmas are valid only as long as mmap_lock is held.
- *
  * Must be called with mmap_lock held for read or write.
  *
  * get_user_pages_remote walks a process's page tables and takes a reference
@@ -2219,15 +2270,15 @@ static bool is_valid_gup_args(struct page **pages, struct vm_area_struct **vmas,
 long get_user_pages_remote(struct mm_struct *mm,
                unsigned long start, unsigned long nr_pages,
                unsigned int gup_flags, struct page **pages,
-               struct vm_area_struct **vmas, int *locked)
+               int *locked)
 {
        int local_locked = 1;
 
-       if (!is_valid_gup_args(pages, vmas, locked, &gup_flags,
+       if (!is_valid_gup_args(pages, locked, &gup_flags,
                               FOLL_TOUCH | FOLL_REMOTE))
                return -EINVAL;
 
-       return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+       return __get_user_pages_locked(mm, start, nr_pages, pages,
                                       locked ? locked : &local_locked,
                                       gup_flags);
 }
@@ -2237,7 +2288,7 @@ EXPORT_SYMBOL(get_user_pages_remote);
 long get_user_pages_remote(struct mm_struct *mm,
                           unsigned long start, unsigned long nr_pages,
                           unsigned int gup_flags, struct page **pages,
-                          struct vm_area_struct **vmas, int *locked)
+                          int *locked)
 {
        return 0;
 }
@@ -2251,8 +2302,6 @@ long get_user_pages_remote(struct mm_struct *mm,
  * @pages:      array that receives pointers to the pages pinned.
  *              Should be at least nr_pages long. Or NULL, if caller
  *              only intends to ensure the pages are faulted in.
- * @vmas:       array of pointers to vmas corresponding to each page.
- *              Or NULL if the caller does not require them.
  *
  * This is the same as get_user_pages_remote(), just with a less-flexible
  * calling convention where we assume that the mm being operated on belongs to
@@ -2260,16 +2309,15 @@ long get_user_pages_remote(struct mm_struct *mm,
  * obviously don't pass FOLL_REMOTE in here.
  */
 long get_user_pages(unsigned long start, unsigned long nr_pages,
-               unsigned int gup_flags, struct page **pages,
-               struct vm_area_struct **vmas)
+                   unsigned int gup_flags, struct page **pages)
 {
        int locked = 1;
 
-       if (!is_valid_gup_args(pages, vmas, NULL, &gup_flags, FOLL_TOUCH))
+       if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_TOUCH))
                return -EINVAL;
 
        return __get_user_pages_locked(current->mm, start, nr_pages, pages,
-                                      vmas, &locked, gup_flags);
+                                      &locked, gup_flags);
 }
 EXPORT_SYMBOL(get_user_pages);
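
With the vmas argument removed from the whole GUP family, callers that still need the VMAs must look them up themselves under mmap_lock; most callers simply drop the argument. For illustration only, a sketch of the trimmed-down calling convention with hypothetical names:

/* Sketch only: pin user pages with the new four-argument get_user_pages(). */
static long example_grab_pages(unsigned long uaddr, struct page **pages,
                               unsigned long nr)
{
        long pinned;

        mmap_read_lock(current->mm);
        pinned = get_user_pages(uaddr, nr, FOLL_WRITE, pages);
        mmap_read_unlock(current->mm);

        return pinned;  /* caller releases each page with put_page() */
}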
 
@@ -2293,12 +2341,12 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 {
        int locked = 0;
 
-       if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags,
+       if (!is_valid_gup_args(pages, NULL, &gup_flags,
                               FOLL_TOUCH | FOLL_UNLOCKABLE))
                return -EINVAL;
 
        return __get_user_pages_locked(current->mm, start, nr_pages, pages,
-                                      NULL, &locked, gup_flags);
+                                      &locked, gup_flags);
 }
 EXPORT_SYMBOL(get_user_pages_unlocked);
 
@@ -2337,6 +2385,82 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  */
 #ifdef CONFIG_HAVE_FAST_GUP
 
+/*
+ * Used in the GUP-fast path to determine whether a pin is permitted for a
+ * specific folio.
+ *
+ * This call assumes the caller has pinned the folio, that the lowest page table
+ * level still points to this folio, and that interrupts have been disabled.
+ *
+ * Writing to pinned file-backed dirty tracked folios is inherently problematic
+ * (see comment describing the writable_file_mapping_allowed() function). We
+ * therefore try to avoid the most egregious case of a long-term mapping doing
+ * so.
+ *
+ * This function cannot be as thorough as that one, since the VMA is not
+ * available in the fast path; instead we whitelist known-good cases and, if in
+ * doubt, fall back to the slow path.
+ */
+static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
+{
+       struct address_space *mapping;
+       unsigned long mapping_flags;
+
+       /*
+        * If we aren't pinning then no problematic write can occur. A long-term
+        * pin is the most egregious case, so this is the one we disallow.
+        */
+       if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
+           (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
+               return true;
+
+       /* The folio is pinned, so we can safely access folio fields. */
+
+       if (WARN_ON_ONCE(folio_test_slab(folio)))
+               return false;
+
+       /* hugetlb mappings do not require dirty-tracking. */
+       if (folio_test_hugetlb(folio))
+               return true;
+
+       /*
+        * GUP-fast disables IRQs. When IRQs are disabled, RCU grace periods
+        * cannot proceed, which means no actions performed under RCU can
+        * proceed either.
+        *
+        * Inodes and thus their mappings are freed under RCU, which means the
+        * mapping cannot be freed beneath us and thus we can safely dereference
+        * it.
+        */
+       lockdep_assert_irqs_disabled();
+
+       /*
+        * However, there may be operations which _alter_ the mapping, so ensure
+        * we read it once and only once.
+        */
+       mapping = READ_ONCE(folio->mapping);
+
+       /*
+        * The mapping may have been truncated; in any case we cannot determine
+        * whether this mapping is safe - fall back to the slow path to decide how
+        * to proceed.
+        */
+       if (!mapping)
+               return false;
+
+       /* Anonymous folios pose no problem. */
+       mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
+       if (mapping_flags)
+               return mapping_flags & PAGE_MAPPING_ANON;
+
+       /*
+        * At this point, we know the mapping is non-null and points to an
+        * address_space object. The only remaining whitelisted file system is
+        * shmem.
+        */
+       return shmem_mapping(mapping);
+}
+
 static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
                                            unsigned int flags,
                                            struct page **pages)
@@ -2381,6 +2505,8 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
        pte_t *ptep, *ptem;
 
        ptem = ptep = pte_offset_map(&pmd, addr);
+       if (!ptep)
+               return 0;
        do {
                pte_t pte = ptep_get_lockless(ptep);
                struct page *page;
@@ -2417,7 +2543,12 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
                }
 
                if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
-                   unlikely(pte_val(pte) != pte_val(*ptep))) {
+                   unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
+                       gup_put_folio(folio, 1, flags);
+                       goto pte_unmap;
+               }
+
+               if (!folio_fast_pin_allowed(folio, flags)) {
                        gup_put_folio(folio, 1, flags);
                        goto pte_unmap;
                }
@@ -2609,7 +2740,12 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
        if (!folio)
                return 0;
 
-       if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+       if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
+               gup_put_folio(folio, refs, flags);
+               return 0;
+       }
+
+       if (!folio_fast_pin_allowed(folio, flags)) {
                gup_put_folio(folio, refs, flags);
                return 0;
        }
@@ -2680,6 +2816,10 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
                return 0;
        }
 
+       if (!folio_fast_pin_allowed(folio, flags)) {
+               gup_put_folio(folio, refs, flags);
+               return 0;
+       }
        if (!pmd_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) {
                gup_put_folio(folio, refs, flags);
                return 0;
@@ -2720,6 +2860,11 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
                return 0;
        }
 
+       if (!folio_fast_pin_allowed(folio, flags)) {
+               gup_put_folio(folio, refs, flags);
+               return 0;
+       }
+
        if (!pud_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) {
                gup_put_folio(folio, refs, flags);
                return 0;
@@ -2755,6 +2900,16 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
                return 0;
        }
 
+       if (!pgd_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) {
+               gup_put_folio(folio, refs, flags);
+               return 0;
+       }
+
+       if (!folio_fast_pin_allowed(folio, flags)) {
+               gup_put_folio(folio, refs, flags);
+               return 0;
+       }
+
        *nr += refs;
        folio_set_referenced(folio);
        return 1;
@@ -2969,7 +3124,7 @@ static int internal_get_user_pages_fast(unsigned long start,
        start = untagged_addr(start) & PAGE_MASK;
        len = nr_pages << PAGE_SHIFT;
        if (check_add_overflow(start, len, &end))
-               return 0;
+               return -EOVERFLOW;
        if (end > TASK_SIZE_MAX)
                return -EFAULT;
        if (unlikely(!access_ok((void __user *)start, len)))
@@ -2983,7 +3138,7 @@ static int internal_get_user_pages_fast(unsigned long start,
        start += nr_pinned << PAGE_SHIFT;
        pages += nr_pinned;
        ret = __gup_longterm_locked(current->mm, start, nr_pages - nr_pinned,
-                                   pages, NULL, &locked,
+                                   pages, &locked,
                                    gup_flags | FOLL_TOUCH | FOLL_UNLOCKABLE);
        if (ret < 0) {
                /*
@@ -3025,7 +3180,7 @@ int get_user_pages_fast_only(unsigned long start, int nr_pages,
         * FOLL_FAST_ONLY is required in order to match the API description of
         * this routine: no fall back to regular ("slow") GUP.
         */
-       if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags,
+       if (!is_valid_gup_args(pages, NULL, &gup_flags,
                               FOLL_GET | FOLL_FAST_ONLY))
                return -EINVAL;
 
@@ -3058,7 +3213,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
         * FOLL_GET, because gup fast is always a "pin with a +1 page refcount"
         * request.
         */
-       if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags, FOLL_GET))
+       if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_GET))
                return -EINVAL;
        return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
 }
@@ -3079,11 +3234,14 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for further details.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page() will not remove pins from it.
  */
 int pin_user_pages_fast(unsigned long start, int nr_pages,
                        unsigned int gup_flags, struct page **pages)
 {
-       if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags, FOLL_PIN))
+       if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN))
                return -EINVAL;
        return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
 }
@@ -3098,8 +3256,6 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast);
  * @gup_flags: flags modifying lookup behaviour
  * @pages:     array that receives pointers to the pages pinned.
  *             Should be at least nr_pages long.
- * @vmas:      array of pointers to vmas corresponding to each page.
- *             Or NULL if the caller does not require them.
  * @locked:    pointer to lock flag indicating whether lock is held and
  *             subsequently whether VM_FAULT_RETRY functionality can be
  *             utilised. Lock must initially be held.
@@ -3110,18 +3266,21 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast);
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for details.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page*() will not remove pins from it.
  */
 long pin_user_pages_remote(struct mm_struct *mm,
                           unsigned long start, unsigned long nr_pages,
                           unsigned int gup_flags, struct page **pages,
-                          struct vm_area_struct **vmas, int *locked)
+                          int *locked)
 {
        int local_locked = 1;
 
-       if (!is_valid_gup_args(pages, vmas, locked, &gup_flags,
+       if (!is_valid_gup_args(pages, locked, &gup_flags,
                               FOLL_PIN | FOLL_TOUCH | FOLL_REMOTE))
                return 0;
-       return __gup_longterm_locked(mm, start, nr_pages, pages, vmas,
+       return __gup_longterm_locked(mm, start, nr_pages, pages,
                                     locked ? locked : &local_locked,
                                     gup_flags);
 }
@@ -3135,25 +3294,25 @@ EXPORT_SYMBOL(pin_user_pages_remote);
  * @gup_flags: flags modifying lookup behaviour
  * @pages:     array that receives pointers to the pages pinned.
  *             Should be at least nr_pages long.
- * @vmas:      array of pointers to vmas corresponding to each page.
- *             Or NULL if the caller does not require them.
  *
  * Nearly the same as get_user_pages(), except that FOLL_TOUCH is not set, and
  * FOLL_PIN is set.
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for details.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page*() will not remove pins from it.
  */
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
-                   unsigned int gup_flags, struct page **pages,
-                   struct vm_area_struct **vmas)
+                   unsigned int gup_flags, struct page **pages)
 {
        int locked = 1;
 
-       if (!is_valid_gup_args(pages, vmas, NULL, &gup_flags, FOLL_PIN))
+       if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN))
                return 0;
        return __gup_longterm_locked(current->mm, start, nr_pages,
-                                    pages, vmas, &locked, gup_flags);
+                                    pages, &locked, gup_flags);
 }
 EXPORT_SYMBOL(pin_user_pages);
 
@@ -3161,17 +3320,20 @@ EXPORT_SYMBOL(pin_user_pages);
  * pin_user_pages_unlocked() is the FOLL_PIN variant of
  * get_user_pages_unlocked(). Behavior is the same, except that this one sets
  * FOLL_PIN and rejects FOLL_GET.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page*() will not remove pins from it.
  */
 long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
                             struct page **pages, unsigned int gup_flags)
 {
        int locked = 0;
 
-       if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags,
+       if (!is_valid_gup_args(pages, NULL, &gup_flags,
                               FOLL_PIN | FOLL_TOUCH | FOLL_UNLOCKABLE))
                return 0;
 
-       return __gup_longterm_locked(current->mm, start, nr_pages, pages, NULL,
+       return __gup_longterm_locked(current->mm, start, nr_pages, pages,
                                     &locked, gup_flags);
 }
 EXPORT_SYMBOL(pin_user_pages_unlocked);
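
A minimal sketch, not taken from this series, of a caller using the trimmed vmas-free pinning API shown above. The helper name pin_user_buffer() is made up for illustration; it uses pin_user_pages_unlocked(), the variant that takes and drops mmap_lock itself, and releases a partial pin with unpin_user_pages():

#include <linux/mm.h>

/* Hypothetical helper, for illustration only - not part of the patch. */
static long pin_user_buffer(unsigned long uaddr, unsigned long npages,
			    struct page **pages)
{
	long pinned;

	/* Note: no struct vm_area_struct **vmas argument any more. */
	pinned = pin_user_pages_unlocked(uaddr, npages, pages,
					 FOLL_WRITE | FOLL_LONGTERM);
	if (pinned < 0)
		return pinned;
	if (pinned != npages) {
		/* Partial pin: release what we got and report failure. */
		unpin_user_pages(pages, pinned);
		return -EFAULT;
	}
	return pinned;
}

Callers that still need the VMAs must now look them up separately under mmap_lock; the pinning entry points no longer report them.
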
index c0421b7..eeb3f4d 100644 (file)
@@ -40,24 +40,25 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages,
                              unsigned long nr_pages)
 {
        unsigned long i;
-       struct page *page;
+       struct folio *folio;
 
        switch (cmd) {
        case PIN_FAST_BENCHMARK:
        case PIN_BASIC_TEST:
        case PIN_LONGTERM_BENCHMARK:
                for (i = 0; i < nr_pages; i++) {
-                       page = pages[i];
-                       if (WARN(!page_maybe_dma_pinned(page),
+                       folio = page_folio(pages[i]);
+
+                       if (WARN(!folio_maybe_dma_pinned(folio),
                                 "pages[%lu] is NOT dma-pinned\n", i)) {
 
-                               dump_page(page, "gup_test failure");
+                               dump_page(&folio->page, "gup_test failure");
                                break;
                        } else if (cmd == PIN_LONGTERM_BENCHMARK &&
-                               WARN(!is_longterm_pinnable_page(page),
+                               WARN(!folio_is_longterm_pinnable(folio),
                                     "pages[%lu] is NOT pinnable but pinned\n",
                                     i)) {
-                               dump_page(page, "gup_test failure");
+                               dump_page(&folio->page, "gup_test failure");
                                break;
                        }
                }
@@ -139,29 +140,27 @@ static int __gup_test_ioctl(unsigned int cmd,
                                                 pages + i);
                        break;
                case GUP_BASIC_TEST:
-                       nr = get_user_pages(addr, nr, gup->gup_flags, pages + i,
-                                           NULL);
+                       nr = get_user_pages(addr, nr, gup->gup_flags, pages + i);
                        break;
                case PIN_FAST_BENCHMARK:
                        nr = pin_user_pages_fast(addr, nr, gup->gup_flags,
                                                 pages + i);
                        break;
                case PIN_BASIC_TEST:
-                       nr = pin_user_pages(addr, nr, gup->gup_flags, pages + i,
-                                           NULL);
+                       nr = pin_user_pages(addr, nr, gup->gup_flags, pages + i);
                        break;
                case PIN_LONGTERM_BENCHMARK:
                        nr = pin_user_pages(addr, nr,
                                            gup->gup_flags | FOLL_LONGTERM,
-                                           pages + i, NULL);
+                                           pages + i);
                        break;
                case DUMP_USER_PAGES_TEST:
                        if (gup->test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
                                nr = pin_user_pages(addr, nr, gup->gup_flags,
-                                                   pages + i, NULL);
+                                                   pages + i);
                        else
                                nr = get_user_pages(addr, nr, gup->gup_flags,
-                                                   pages + i, NULL);
+                                                   pages + i);
                        break;
                default:
                        ret = -EINVAL;
@@ -271,7 +270,7 @@ static inline int pin_longterm_test_start(unsigned long arg)
                                                        gup_flags, pages);
                else
                        cur_pages = pin_user_pages(addr, remaining_pages,
-                                                  gup_flags, pages, NULL);
+                                                  gup_flags, pages);
                if (cur_pages < 0) {
                        pin_longterm_test_stop();
                        ret = cur_pages;
index db251e7..e192690 100644 (file)
@@ -161,7 +161,7 @@ struct page *__kmap_to_page(void *vaddr)
        /* kmap() mappings */
        if (WARN_ON_ONCE(addr >= PKMAP_ADDR(0) &&
                         addr < PKMAP_ADDR(LAST_PKMAP)))
-               return pte_page(pkmap_page_table[PKMAP_NR(addr)]);
+               return pte_page(ptep_get(&pkmap_page_table[PKMAP_NR(addr)]));
 
        /* kmap_local_page() mappings */
        if (WARN_ON_ONCE(base >= __fix_to_virt(FIX_KMAP_END) &&
@@ -191,6 +191,7 @@ static void flush_all_zero_pkmaps(void)
 
        for (i = 0; i < LAST_PKMAP; i++) {
                struct page *page;
+               pte_t ptent;
 
                /*
                 * zero means we don't have anything to do,
@@ -203,7 +204,8 @@ static void flush_all_zero_pkmaps(void)
                pkmap_count[i] = 0;
 
                /* sanity check */
-               BUG_ON(pte_none(pkmap_page_table[i]));
+               ptent = ptep_get(&pkmap_page_table[i]);
+               BUG_ON(pte_none(ptent));
 
                /*
                 * Don't need an atomic fetch-and-clear op here;
@@ -212,7 +214,7 @@ static void flush_all_zero_pkmaps(void)
                 * getting the kmap_lock (which is held here).
                 * So no dangers, even with speculative execution.
                 */
-               page = pte_page(pkmap_page_table[i]);
+               page = pte_page(ptent);
                pte_clear(&init_mm, PKMAP_ADDR(i), &pkmap_page_table[i]);
 
                set_page_address(page, NULL);
@@ -511,7 +513,7 @@ static inline bool kmap_high_unmap_local(unsigned long vaddr)
 {
 #ifdef ARCH_NEEDS_KMAP_HIGH_GET
        if (vaddr >= PKMAP_ADDR(0) && vaddr < PKMAP_ADDR(LAST_PKMAP)) {
-               kunmap_high(pte_page(pkmap_page_table[PKMAP_NR(vaddr)]));
+               kunmap_high(pte_page(ptep_get(&pkmap_page_table[PKMAP_NR(vaddr)])));
                return true;
        }
 #endif
@@ -548,7 +550,7 @@ void *__kmap_local_pfn_prot(unsigned long pfn, pgprot_t prot)
        idx = arch_kmap_local_map_idx(kmap_local_idx_push(), pfn);
        vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
        kmap_pte = kmap_get_pte(vaddr, idx);
-       BUG_ON(!pte_none(*kmap_pte));
+       BUG_ON(!pte_none(ptep_get(kmap_pte)));
        pteval = pfn_pte(pfn, prot);
        arch_kmap_local_set_pte(&init_mm, vaddr, kmap_pte, pteval);
        arch_kmap_local_post_map(vaddr, pteval);
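
The highmem.c hunks above - like the earlier gup.c ones and several below in hmm.c, huge_memory.c and the hugetlb files - all make the same recurring conversion: a page-table entry is no longer read by dereferencing the pte pointer but through the ptep_get() accessor, which gives architectures a hook to intercept the read. A standalone sketch of the pattern (illustrative only; the function name is made up):

#include <linux/pgtable.h>

/* Illustration only: read a PTE via the accessor, never by dereference. */
static bool example_pte_is_present(pte_t *ptep)
{
	pte_t ptent = ptep_get(ptep);	/* rather than: pte_t ptent = *ptep; */

	return pte_present(ptent);
}
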
index 6a151c0..855e25e 100644 (file)
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -228,7 +228,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
        struct hmm_range *range = hmm_vma_walk->range;
        unsigned int required_fault;
        unsigned long cpu_flags;
-       pte_t pte = *ptep;
+       pte_t pte = ptep_get(ptep);
        uint64_t pfn_req_flags = *hmm_pfn;
 
        if (pte_none_mostly(pte)) {
@@ -332,7 +332,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
        pmd_t pmd;
 
 again:
-       pmd = READ_ONCE(*pmdp);
+       pmd = pmdp_get_lockless(pmdp);
        if (pmd_none(pmd))
                return hmm_vma_walk_hole(start, end, -1, walk);
 
@@ -381,6 +381,8 @@ again:
        }
 
        ptep = pte_offset_map(pmdp, addr);
+       if (!ptep)
+               goto again;
        for (; addr < end; addr += PAGE_SIZE, ptep++, hmm_pfns++) {
                int r;
 
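
Both the gup_pte_range() hunk earlier and the hmm_vma_walk_pmd() hunk above now handle pte_offset_map() returning NULL, which can happen if the page table was freed or the pmd changed since it was read. A rough sketch of that retry shape, with the per-entry work left as a placeholder (an illustration under those assumptions, not code from this series):

#include <linux/mm.h>
#include <linux/pgtable.h>

/* Illustration only: a PTE-table walk that copes with pte_offset_map() failing. */
static void example_walk_pte_table(pmd_t *pmdp, unsigned long addr,
				   unsigned long end)
{
	pte_t *start_ptep, *ptep;

again:
	if (pmd_none(pmdp_get_lockless(pmdp)))
		return;			/* nothing mapped here any more */

	start_ptep = ptep = pte_offset_map(pmdp, addr);
	if (!ptep)
		goto again;		/* the table went away: re-read the pmd */

	for (; addr < end; addr += PAGE_SIZE, ptep++) {
		pte_t pte = ptep_get(ptep);

		if (pte_none(pte))
			continue;
		/* ... handle one present entry ... */
	}
	pte_unmap(start_ptep);
}
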
index 624671a..eb36783 100644 (file)
@@ -583,7 +583,7 @@ void prep_transhuge_page(struct page *page)
 
        VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
        INIT_LIST_HEAD(&folio->_deferred_list);
-       set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
+       folio_set_compound_dtor(folio, TRANSHUGE_PAGE_DTOR);
 }
 
 static inline bool is_transparent_hugepage(struct page *page)
@@ -1344,7 +1344,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
        /*
         * See do_wp_page(): we can only reuse the folio exclusively if
         * there are no additional references. Note that we always drain
-        * the LRU pagevecs immediately after adding a THP.
+        * the LRU cache immediately after adding a THP.
         */
        if (folio_ref_count(folio) >
                        1 + folio_test_swapcache(folio) * folio_nr_pages(folio))
@@ -1760,9 +1760,10 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 
        /*
         * The destination pmd shouldn't be established, free_pgtables()
-        * should have release it.
+        * should have released it; but move_page_tables() might have already
+        * inserted a page table, if racing against shmem/file collapse.
         */
-       if (WARN_ON(!pmd_none(*new_pmd))) {
+       if (!pmd_none(*new_pmd)) {
                VM_BUG_ON(pmd_trans_huge(*new_pmd));
                return false;
        }
@@ -2036,6 +2037,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
        struct mm_struct *mm = vma->vm_mm;
        pgtable_t pgtable;
        pmd_t _pmd, old_pmd;
+       unsigned long addr;
+       pte_t *pte;
        int i;
 
        /*
@@ -2051,17 +2054,20 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
        pgtable = pgtable_trans_huge_withdraw(mm, pmd);
        pmd_populate(mm, &_pmd, pgtable);
 
-       for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
-               pte_t *pte, entry;
-               entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+       pte = pte_offset_map(&_pmd, haddr);
+       VM_BUG_ON(!pte);
+       for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+               pte_t entry;
+
+               entry = pfn_pte(my_zero_pfn(addr), vma->vm_page_prot);
                entry = pte_mkspecial(entry);
                if (pmd_uffd_wp(old_pmd))
                        entry = pte_mkuffd_wp(entry);
-               pte = pte_offset_map(&_pmd, haddr);
-               VM_BUG_ON(!pte_none(*pte));
-               set_pte_at(mm, haddr, pte, entry);
-               pte_unmap(pte);
+               VM_BUG_ON(!pte_none(ptep_get(pte)));
+               set_pte_at(mm, addr, pte, entry);
+               pte++;
        }
+       pte_unmap(pte - 1);
        smp_wmb(); /* make pte visible before pmd */
        pmd_populate(mm, pmd, pgtable);
 }
@@ -2076,6 +2082,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
        bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
        bool anon_exclusive = false, dirty = false;
        unsigned long addr;
+       pte_t *pte;
        int i;
 
        VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
@@ -2204,8 +2211,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
        pgtable = pgtable_trans_huge_withdraw(mm, pmd);
        pmd_populate(mm, &_pmd, pgtable);
 
+       pte = pte_offset_map(&_pmd, haddr);
+       VM_BUG_ON(!pte);
        for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
-               pte_t entry, *pte;
+               pte_t entry;
                /*
                 * Note that NUMA hinting access restrictions are not
                 * transferred to avoid any possibility of altering
@@ -2248,11 +2257,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
                                entry = pte_mkuffd_wp(entry);
                        page_add_anon_rmap(page + i, vma, addr, false);
                }
-               pte = pte_offset_map(&_pmd, addr);
-               BUG_ON(!pte_none(*pte));
+               VM_BUG_ON(!pte_none(ptep_get(pte)));
                set_pte_at(mm, addr, pte, entry);
-               pte_unmap(pte);
+               pte++;
        }
+       pte_unmap(pte - 1);
 
        if (!pmd_migration)
                page_remove_rmap(page, vma, true);
@@ -2792,12 +2801,19 @@ void free_transhuge_page(struct page *page)
        struct deferred_split *ds_queue = get_deferred_split_queue(folio);
        unsigned long flags;
 
-       spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-       if (!list_empty(&folio->_deferred_list)) {
-               ds_queue->split_queue_len--;
-               list_del(&folio->_deferred_list);
+       /*
+        * At this point, there is no one trying to add the folio to
+        * deferred_list. If the folio is not in deferred_list, it's safe
+        * to check without acquiring the split_queue_lock.
+        */
+       if (data_race(!list_empty(&folio->_deferred_list))) {
+               spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+               if (!list_empty(&folio->_deferred_list)) {
+                       ds_queue->split_queue_len--;
+                       list_del(&folio->_deferred_list);
+               }
+               spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
        }
-       spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
        free_compound_page(page);
 }
 
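
The free_transhuge_page() hunk above follows a common check/lock/re-check shape: an intentionally racy peek, annotated with data_race(), decides whether taking split_queue_lock is worth it, and the authoritative test is repeated under the lock before the list is touched. The same shape in isolation (field and lock names as in the hunk, struct layout assumed as in mmzone.h; illustrative only):

#include <linux/mm.h>
#include <linux/mmzone.h>

/* Illustration only: lockless peek, then re-check under the lock. */
static void example_deferred_list_remove(struct deferred_split *ds_queue,
					 struct folio *folio)
{
	unsigned long flags;

	/* Racy peek - skipping the lock is only an optimisation. */
	if (data_race(list_empty(&folio->_deferred_list)))
		return;

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	if (!list_empty(&folio->_deferred_list)) {	/* re-check under lock */
		ds_queue->split_queue_len--;
		list_del(&folio->_deferred_list);
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}
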
index f154019..bce28cc 100644 (file)
@@ -1489,7 +1489,6 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
                        set_page_refcounted(p);
        }
 
-       folio_set_order(folio, 0);
        __folio_clear_head(folio);
 }
 
@@ -1951,9 +1950,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
        struct page *p;
 
        __folio_clear_reserved(folio);
-       __folio_set_head(folio);
-       /* we rely on prep_new_hugetlb_folio to set the destructor */
-       folio_set_order(folio, order);
        for (i = 0; i < nr_pages; i++) {
                p = folio_page(folio, i);
 
@@ -1999,6 +1995,9 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
                if (i != 0)
                        set_compound_head(p, &folio->page);
        }
+       __folio_set_head(folio);
+       /* we rely on prep_new_hugetlb_folio to set the destructor */
+       folio_set_order(folio, order);
        atomic_set(&folio->_entire_mapcount, -1);
        atomic_set(&folio->_nr_pages_mapped, 0);
        atomic_set(&folio->_pincount, 0);
@@ -2017,8 +2016,6 @@ out_error:
                p = folio_page(folio, j);
                __ClearPageReserved(p);
        }
-       folio_set_order(folio, 0);
-       __folio_clear_head(folio);
        return false;
 }
 
@@ -5016,7 +5013,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
                            struct vm_area_struct *src_vma)
 {
        pte_t *src_pte, *dst_pte, entry;
-       struct page *ptepage;
+       struct folio *pte_folio;
        unsigned long addr;
        bool cow = is_cow_mapping(src_vma->vm_flags);
        struct hstate *h = hstate_vma(src_vma);
@@ -5115,8 +5112,8 @@ again:
                                set_huge_pte_at(dst, addr, dst_pte, entry);
                } else {
                        entry = huge_ptep_get(src_pte);
-                       ptepage = pte_page(entry);
-                       get_page(ptepage);
+                       pte_folio = page_folio(pte_page(entry));
+                       folio_get(pte_folio);
 
                        /*
                         * Failing to duplicate the anon rmap is a rare case
@@ -5128,10 +5125,10 @@ again:
                         * need to be without the pgtable locks since we could
                         * sleep during the process.
                         */
-                       if (!PageAnon(ptepage)) {
-                               page_dup_file_rmap(ptepage, true);
-                       } else if (page_try_dup_anon_rmap(ptepage, true,
-                                                         src_vma)) {
+                       if (!folio_test_anon(pte_folio)) {
+                               page_dup_file_rmap(&pte_folio->page, true);
+                       } else if (page_try_dup_anon_rmap(&pte_folio->page,
+                                                         true, src_vma)) {
                                pte_t src_pte_old = entry;
                                struct folio *new_folio;
 
@@ -5140,14 +5137,14 @@ again:
                                /* Do not use reserve as it's private owned */
                                new_folio = alloc_hugetlb_folio(dst_vma, addr, 1);
                                if (IS_ERR(new_folio)) {
-                                       put_page(ptepage);
+                                       folio_put(pte_folio);
                                        ret = PTR_ERR(new_folio);
                                        break;
                                }
                                ret = copy_user_large_folio(new_folio,
-                                                     page_folio(ptepage),
-                                                     addr, dst_vma);
-                               put_page(ptepage);
+                                                           pte_folio,
+                                                           addr, dst_vma);
+                               folio_put(pte_folio);
                                if (ret) {
                                        folio_put(new_folio);
                                        break;
@@ -5540,7 +5537,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
        const bool unshare = flags & FAULT_FLAG_UNSHARE;
        pte_t pte = huge_ptep_get(ptep);
        struct hstate *h = hstate_vma(vma);
-       struct page *old_page;
+       struct folio *old_folio;
        struct folio *new_folio;
        int outside_reserve = 0;
        vm_fault_t ret = 0;
@@ -5571,7 +5568,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
                return 0;
        }
 
-       old_page = pte_page(pte);
+       old_folio = page_folio(pte_page(pte));
 
        delayacct_wpcopy_start();
 
@@ -5580,17 +5577,17 @@ retry_avoidcopy:
         * If no-one else is actually using this page, we're the exclusive
         * owner and can reuse this page.
         */
-       if (page_mapcount(old_page) == 1 && PageAnon(old_page)) {
-               if (!PageAnonExclusive(old_page))
-                       page_move_anon_rmap(old_page, vma);
+       if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
+               if (!PageAnonExclusive(&old_folio->page))
+                       page_move_anon_rmap(&old_folio->page, vma);
                if (likely(!unshare))
                        set_huge_ptep_writable(vma, haddr, ptep);
 
                delayacct_wpcopy_end();
                return 0;
        }
-       VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page),
-                      old_page);
+       VM_BUG_ON_PAGE(folio_test_anon(old_folio) &&
+                      PageAnonExclusive(&old_folio->page), &old_folio->page);
 
        /*
         * If the process that created a MAP_PRIVATE mapping is about to
@@ -5602,10 +5599,10 @@ retry_avoidcopy:
         * of the full address range.
         */
        if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
-                       page_folio(old_page) != pagecache_folio)
+                       old_folio != pagecache_folio)
                outside_reserve = 1;
 
-       get_page(old_page);
+       folio_get(old_folio);
 
        /*
         * Drop page table lock as buddy allocator may be called. It will
@@ -5627,7 +5624,7 @@ retry_avoidcopy:
                        pgoff_t idx;
                        u32 hash;
 
-                       put_page(old_page);
+                       folio_put(old_folio);
                        /*
                         * Drop hugetlb_fault_mutex and vma_lock before
                         * unmapping.  unmapping needs to hold vma_lock
@@ -5642,7 +5639,7 @@ retry_avoidcopy:
                        hugetlb_vma_unlock_read(vma);
                        mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
-                       unmap_ref_private(mm, vma, old_page, haddr);
+                       unmap_ref_private(mm, vma, &old_folio->page, haddr);
 
                        mutex_lock(&hugetlb_fault_mutex_table[hash]);
                        hugetlb_vma_lock_read(vma);
@@ -5672,7 +5669,7 @@ retry_avoidcopy:
                goto out_release_all;
        }
 
-       if (copy_user_large_folio(new_folio, page_folio(old_page), address, vma)) {
+       if (copy_user_large_folio(new_folio, old_folio, address, vma)) {
                ret = VM_FAULT_HWPOISON_LARGE;
                goto out_release_all;
        }
@@ -5694,14 +5691,14 @@ retry_avoidcopy:
                /* Break COW or unshare */
                huge_ptep_clear_flush(vma, haddr, ptep);
                mmu_notifier_invalidate_range(mm, range.start, range.end);
-               page_remove_rmap(old_page, vma, true);
+               page_remove_rmap(&old_folio->page, vma, true);
                hugepage_add_new_anon_rmap(new_folio, vma, haddr);
                if (huge_pte_uffd_wp(pte))
                        newpte = huge_pte_mkuffd_wp(newpte);
                set_huge_pte_at(mm, haddr, ptep, newpte);
                folio_set_hugetlb_migratable(new_folio);
                /* Make the old page be freed below */
-               new_folio = page_folio(old_page);
+               new_folio = old_folio;
        }
        spin_unlock(ptl);
        mmu_notifier_invalidate_range_end(&range);
@@ -5710,11 +5707,11 @@ out_release_all:
         * No restore in case of successful pagetable update (Break COW or
         * unshare)
         */
-       if (new_folio != page_folio(old_page))
+       if (new_folio != old_folio)
                restore_reserve_on_error(h, vma, haddr, new_folio);
        folio_put(new_folio);
 out_release_old:
-       put_page(old_page);
+       folio_put(old_folio);
 
        spin_lock(ptl); /* Caller expects lock to be held */
 
@@ -5731,13 +5728,13 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 {
        struct address_space *mapping = vma->vm_file->f_mapping;
        pgoff_t idx = vma_hugecache_offset(h, vma, address);
-       bool present;
-
-       rcu_read_lock();
-       present = page_cache_next_miss(mapping, idx, 1) != idx;
-       rcu_read_unlock();
+       struct folio *folio;
 
-       return present;
+       folio = filemap_get_folio(mapping, idx);
+       if (IS_ERR(folio))
+               return false;
+       folio_put(folio);
+       return true;
 }
 
 int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping,
@@ -6062,7 +6059,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
        vm_fault_t ret;
        u32 hash;
        pgoff_t idx;
-       struct page *page = NULL;
+       struct folio *folio = NULL;
        struct folio *pagecache_folio = NULL;
        struct hstate *h = hstate_vma(vma);
        struct address_space *mapping;
@@ -6179,16 +6176,16 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
        /*
         * hugetlb_wp() requires page locks of pte_page(entry) and
         * pagecache_folio, so here we need take the former one
-        * when page != pagecache_folio or !pagecache_folio.
+        * when folio != pagecache_folio or !pagecache_folio.
         */
-       page = pte_page(entry);
-       if (page_folio(page) != pagecache_folio)
-               if (!trylock_page(page)) {
+       folio = page_folio(pte_page(entry));
+       if (folio != pagecache_folio)
+               if (!folio_trylock(folio)) {
                        need_wait_lock = 1;
                        goto out_ptl;
                }
 
-       get_page(page);
+       folio_get(folio);
 
        if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
                if (!huge_pte_write(entry)) {
@@ -6204,9 +6201,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                                                flags & FAULT_FLAG_WRITE))
                update_mmu_cache(vma, haddr, ptep);
 out_put_page:
-       if (page_folio(page) != pagecache_folio)
-               unlock_page(page);
-       put_page(page);
+       if (folio != pagecache_folio)
+               folio_unlock(folio);
+       folio_put(folio);
 out_ptl:
        spin_unlock(ptl);
 
@@ -6225,7 +6222,7 @@ out_mutex:
         * here without taking refcount.
         */
        if (need_wait_lock)
-               wait_on_page_locked(page);
+               folio_wait_locked(folio);
        return ret;
 }
 
@@ -6425,17 +6422,14 @@ out_release_nounlock:
 }
 #endif /* CONFIG_USERFAULTFD */
 
-static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
-                                int refs, struct page **pages,
-                                struct vm_area_struct **vmas)
+static void record_subpages(struct page *page, struct vm_area_struct *vma,
+                           int refs, struct page **pages)
 {
        int nr;
 
        for (nr = 0; nr < refs; nr++) {
                if (likely(pages))
                        pages[nr] = nth_page(page, nr);
-               if (vmas)
-                       vmas[nr] = vma;
        }
 }
 
@@ -6508,9 +6502,9 @@ out_unlock:
 }
 
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
-                        struct page **pages, struct vm_area_struct **vmas,
-                        unsigned long *position, unsigned long *nr_pages,
-                        long i, unsigned int flags, int *locked)
+                        struct page **pages, unsigned long *position,
+                        unsigned long *nr_pages, long i, unsigned int flags,
+                        int *locked)
 {
        unsigned long pfn_offset;
        unsigned long vaddr = *position;
@@ -6638,7 +6632,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                 * If subpage information not requested, update counters
                 * and skip the same_page loop below.
                 */
-               if (!pages && !vmas && !pfn_offset &&
+               if (!pages && !pfn_offset &&
                    (vaddr + huge_page_size(h) < vma->vm_end) &&
                    (remainder >= pages_per_huge_page(h))) {
                        vaddr += huge_page_size(h);
@@ -6653,11 +6647,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
                    (vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
 
-               if (pages || vmas)
-                       record_subpages_vmas(nth_page(page, pfn_offset),
-                                            vma, refs,
-                                            likely(pages) ? pages + i : NULL,
-                                            vmas ? vmas + i : NULL);
+               if (pages)
+                       record_subpages(nth_page(page, pfn_offset),
+                                       vma, refs,
+                                       likely(pages) ? pages + i : NULL);
 
                if (pages) {
                        /*
@@ -7137,7 +7130,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
        unsigned long saddr;
        pte_t *spte = NULL;
        pte_t *pte;
-       spinlock_t *ptl;
 
        i_mmap_lock_read(mapping);
        vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
@@ -7158,7 +7150,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
        if (!spte)
                goto out;
 
-       ptl = huge_pte_lock(hstate_vma(vma), mm, spte);
+       spin_lock(&mm->page_table_lock);
        if (pud_none(*pud)) {
                pud_populate(mm, pud,
                                (pmd_t *)((unsigned long)spte & PAGE_MASK));
@@ -7166,7 +7158,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
        } else {
                put_page(virt_to_page(spte));
        }
-       spin_unlock(ptl);
+       spin_unlock(&mm->page_table_lock);
 out:
        pte = (pte_t *)pmd_alloc(mm, pud, addr);
        i_mmap_unlock_read(mapping);
@@ -7254,7 +7246,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
                                pte = (pte_t *)pmd_alloc(mm, pud, addr);
                }
        }
-       BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));
+       BUG_ON(pte && pte_present(ptep_get(pte)) && !pte_huge(ptep_get(pte)));
 
        return pte;
 }
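
hugetlbfs_pagecache_present() above now probes the page cache with filemap_get_folio(); as the IS_ERR() check implies, a miss is reported as an ERR_PTR rather than NULL, and a hit returns a folio carrying an extra reference that must be dropped. The same presence test on its own (illustrative only; the function name is made up):

#include <linux/err.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Illustration only: does the page cache hold a folio at this index? */
static bool example_folio_cached(struct address_space *mapping, pgoff_t index)
{
	struct folio *folio = filemap_get_folio(mapping, index);

	if (IS_ERR(folio))		/* miss: ERR_PTR, not NULL */
		return false;
	folio_put(folio);		/* drop the lookup's reference */
	return true;
}
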
index 27f001e..c2007ef 100644 (file)
@@ -105,7 +105,7 @@ static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
         * remapping (which is calling @walk->remap_pte).
         */
        if (!walk->reuse_page) {
-               walk->reuse_page = pte_page(*pte);
+               walk->reuse_page = pte_page(ptep_get(pte));
                /*
                 * Because the reuse address is part of the range that we are
                 * walking, skip the reuse address range.
@@ -239,7 +239,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
         * to the tail pages.
         */
        pgprot_t pgprot = PAGE_KERNEL_RO;
-       struct page *page = pte_page(*pte);
+       struct page *page = pte_page(ptep_get(pte));
        pte_t entry;
 
        /* Remapping the head page requires r/w */
@@ -286,7 +286,7 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
        struct page *page;
        void *to;
 
-       BUG_ON(pte_page(*pte) != walk->reuse_page);
+       BUG_ON(pte_page(ptep_get(pte)) != walk->reuse_page);
 
        page = list_first_entry(walk->vmemmap_pages, struct page, lru);
        list_del(&page->lru);
@@ -384,8 +384,9 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 }
 
 static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
-                                  gfp_t gfp_mask, struct list_head *list)
+                                  struct list_head *list)
 {
+       gfp_t gfp_mask = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_THISNODE;
        unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
        int nid = page_to_nid((struct page *)start);
        struct page *page, *next;
@@ -413,12 +414,11 @@ out:
  * @end:       end address of the vmemmap virtual address range that we want to
  *             remap.
  * @reuse:     reuse address.
- * @gfp_mask:  GFP flag for allocating vmemmap pages.
  *
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
-                              unsigned long reuse, gfp_t gfp_mask)
+                              unsigned long reuse)
 {
        LIST_HEAD(vmemmap_pages);
        struct vmemmap_remap_walk walk = {
@@ -430,7 +430,7 @@ static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
        /* See the comment in the vmemmap_remap_free(). */
        BUG_ON(start - reuse != PAGE_SIZE);
 
-       if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages))
+       if (alloc_vmemmap_page_list(start, end, &vmemmap_pages))
                return -ENOMEM;
 
        mmap_read_lock(&init_mm);
@@ -476,8 +476,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
         * When a HugeTLB page is freed to the buddy allocator, previously
         * discarded vmemmap pages must be allocated and remapped.
         */
-       ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse,
-                                 GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE);
+       ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse);
        if (!ret) {
                ClearHPageVmemmapOptimized(head);
                static_branch_dec(&hugetlb_optimize_vmemmap_key);
index 68410c6..a7d9e98 100644 (file)
@@ -133,8 +133,8 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio);
 bool truncate_inode_partial_folio(struct folio *folio, loff_t start,
                loff_t end);
 long invalidate_inode_page(struct page *page);
-unsigned long invalidate_mapping_pagevec(struct address_space *mapping,
-               pgoff_t start, pgoff_t end, unsigned long *nr_pagevec);
+unsigned long mapping_try_invalidate(struct address_space *mapping,
+               pgoff_t start, pgoff_t end, unsigned long *nr_failed);
 
 /**
  * folio_evictable - Test whether a folio is evictable.
@@ -179,12 +179,6 @@ extern unsigned long highest_memmap_pfn;
 #define MAX_RECLAIM_RETRIES 16
 
 /*
- * in mm/early_ioremap.c
- */
-pgprot_t __init early_memremap_pgprot_adjust(resource_size_t phys_addr,
-                                       unsigned long size, pgprot_t prot);
-
-/*
  * in mm/vmscan.c:
  */
 bool isolate_lru_page(struct page *page);
@@ -208,10 +202,12 @@ extern char * const zone_names[MAX_NR_ZONES];
 /* perform sanity checks on struct pages being allocated or freed */
 DECLARE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled);
 
-static inline bool is_check_pages_enabled(void)
-{
-       return static_branch_unlikely(&check_pages_enabled);
-}
+extern int min_free_kbytes;
+
+void setup_per_zone_wmarks(void);
+void calculate_min_free_kbytes(void);
+int __meminit init_per_zone_wmark_min(void);
+void page_alloc_sysctl_init(void);
 
 /*
  * Structure for holding the mostly immutable allocation parameters passed
@@ -371,6 +367,13 @@ static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
        return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
 }
 
+void set_zone_contiguous(struct zone *zone);
+
+static inline void clear_zone_contiguous(struct zone *zone)
+{
+       zone->contiguous = false;
+}
+
 extern int __isolate_free_page(struct page *page, unsigned int order);
 extern void __putback_isolated_page(struct page *page, unsigned int order,
                                    int mt);
@@ -378,12 +381,27 @@ extern void memblock_free_pages(struct page *page, unsigned long pfn,
                                        unsigned int order);
 extern void __free_pages_core(struct page *page, unsigned int order);
 
+/*
+ * This will have no effect, other than possibly generating a warning, if the
+ * caller passes in a non-large folio.
+ */
+static inline void folio_set_order(struct folio *folio, unsigned int order)
+{
+       if (WARN_ON_ONCE(!order || !folio_test_large(folio)))
+               return;
+
+       folio->_folio_order = order;
+#ifdef CONFIG_64BIT
+       folio->_folio_nr_pages = 1U << order;
+#endif
+}
+
 static inline void prep_compound_head(struct page *page, unsigned int order)
 {
        struct folio *folio = (struct folio *)page;
 
-       set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
-       set_compound_order(page, order);
+       folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
+       folio_set_order(folio, order);
        atomic_set(&folio->_entire_mapcount, -1);
        atomic_set(&folio->_nr_pages_mapped, 0);
        atomic_set(&folio->_pincount, 0);
@@ -416,27 +434,12 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
                          phys_addr_t min_addr,
                          int nid, bool exact_nid);
 
-int split_free_page(struct page *free_page,
-                       unsigned int order, unsigned long split_pfn_offset);
+void memmap_init_range(unsigned long, int, unsigned long, unsigned long,
+               unsigned long, enum meminit_context, struct vmem_altmap *, int);
 
-/*
- * This will have no effect, other than possibly generating a warning, if the
- * caller passes in a non-large folio.
- */
-static inline void folio_set_order(struct folio *folio, unsigned int order)
-{
-       if (WARN_ON_ONCE(!folio_test_large(folio)))
-               return;
 
-       folio->_folio_order = order;
-#ifdef CONFIG_64BIT
-       /*
-        * When hugetlb dissolves a folio, we need to clear the tail
-        * page, rather than setting nr_pages to 1.
-        */
-       folio->_folio_nr_pages = order ? 1U << order : 0;
-#endif
-}
+int split_free_page(struct page *free_page,
+                       unsigned int order, unsigned long split_pfn_offset);
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
@@ -563,8 +566,8 @@ extern long populate_vma_page_range(struct vm_area_struct *vma,
 extern long faultin_vma_page_range(struct vm_area_struct *vma,
                                   unsigned long start, unsigned long end,
                                   bool write, int *locked);
-extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
-                             unsigned long len);
+extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
+                              unsigned long bytes);
 /*
  * mlock_vma_folio() and munlock_vma_folio():
  * should be called with vma's mmap_lock held for read or write,
@@ -1047,17 +1050,17 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
 {
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
-       if (WARN_ON(vmi->mas.node != MAS_START && vmi->mas.index > vma->vm_start)) {
-               printk("%lu > %lu\n", vmi->mas.index, vma->vm_start);
-               printk("store of vma %lu-%lu", vma->vm_start, vma->vm_end);
-               printk("into slot    %lu-%lu", vmi->mas.index, vmi->mas.last);
-               mt_dump(vmi->mas.tree);
+       if (MAS_WARN_ON(&vmi->mas, vmi->mas.node != MAS_START &&
+                       vmi->mas.index > vma->vm_start)) {
+               pr_warn("%lx > %lx\n store vma %lx-%lx\n into slot %lx-%lx\n",
+                       vmi->mas.index, vma->vm_start, vma->vm_start,
+                       vma->vm_end, vmi->mas.index, vmi->mas.last);
        }
-       if (WARN_ON(vmi->mas.node != MAS_START && vmi->mas.last <  vma->vm_start)) {
-               printk("%lu < %lu\n", vmi->mas.last, vma->vm_start);
-               printk("store of vma %lu-%lu", vma->vm_start, vma->vm_end);
-               printk("into slot    %lu-%lu", vmi->mas.index, vmi->mas.last);
-               mt_dump(vmi->mas.tree);
+       if (MAS_WARN_ON(&vmi->mas, vmi->mas.node != MAS_START &&
+                       vmi->mas.last <  vma->vm_start)) {
+               pr_warn("%lx < %lx\nstore vma %lx-%lx\ninto slot %lx-%lx\n",
+                      vmi->mas.last, vma->vm_start, vma->vm_start, vma->vm_end,
+                      vmi->mas.index, vmi->mas.last);
        }
 #endif
 
index b376a5d..256930d 100644 (file)
@@ -445,7 +445,7 @@ void * __must_check __kasan_krealloc(const void *object, size_t size, gfp_t flag
 bool __kasan_check_byte(const void *address, unsigned long ip)
 {
        if (!kasan_byte_accessible(address)) {
-               kasan_report((unsigned long)address, 1, false, ip);
+               kasan_report(address, 1, false, ip);
                return false;
        }
        return true;
index e5eef67..5b4c97b 100644 (file)
  * depending on memory access size X.
  */
 
-static __always_inline bool memory_is_poisoned_1(unsigned long addr)
+static __always_inline bool memory_is_poisoned_1(const void *addr)
 {
-       s8 shadow_value = *(s8 *)kasan_mem_to_shadow((void *)addr);
+       s8 shadow_value = *(s8 *)kasan_mem_to_shadow(addr);
 
        if (unlikely(shadow_value)) {
-               s8 last_accessible_byte = addr & KASAN_GRANULE_MASK;
+               s8 last_accessible_byte = (unsigned long)addr & KASAN_GRANULE_MASK;
                return unlikely(last_accessible_byte >= shadow_value);
        }
 
        return false;
 }
 
-static __always_inline bool memory_is_poisoned_2_4_8(unsigned long addr,
+static __always_inline bool memory_is_poisoned_2_4_8(const void *addr,
                                                unsigned long size)
 {
-       u8 *shadow_addr = (u8 *)kasan_mem_to_shadow((void *)addr);
+       u8 *shadow_addr = (u8 *)kasan_mem_to_shadow(addr);
 
        /*
         * Access crosses 8(shadow size)-byte boundary. Such access maps
         * into 2 shadow bytes, so we need to check them both.
         */
-       if (unlikely(((addr + size - 1) & KASAN_GRANULE_MASK) < size - 1))
+       if (unlikely((((unsigned long)addr + size - 1) & KASAN_GRANULE_MASK) < size - 1))
                return *shadow_addr || memory_is_poisoned_1(addr + size - 1);
 
        return memory_is_poisoned_1(addr + size - 1);
 }
 
-static __always_inline bool memory_is_poisoned_16(unsigned long addr)
+static __always_inline bool memory_is_poisoned_16(const void *addr)
 {
-       u16 *shadow_addr = (u16 *)kasan_mem_to_shadow((void *)addr);
+       u16 *shadow_addr = (u16 *)kasan_mem_to_shadow(addr);
 
        /* Unaligned 16-bytes access maps into 3 shadow bytes. */
-       if (unlikely(!IS_ALIGNED(addr, KASAN_GRANULE_SIZE)))
+       if (unlikely(!IS_ALIGNED((unsigned long)addr, KASAN_GRANULE_SIZE)))
                return *shadow_addr || memory_is_poisoned_1(addr + 15);
 
        return *shadow_addr;
@@ -120,26 +120,25 @@ static __always_inline unsigned long memory_is_nonzero(const void *start,
        return bytes_is_nonzero(start, (end - start) % 8);
 }
 
-static __always_inline bool memory_is_poisoned_n(unsigned long addr,
-                                               size_t size)
+static __always_inline bool memory_is_poisoned_n(const void *addr, size_t size)
 {
        unsigned long ret;
 
-       ret = memory_is_nonzero(kasan_mem_to_shadow((void *)addr),
-                       kasan_mem_to_shadow((void *)addr + size - 1) + 1);
+       ret = memory_is_nonzero(kasan_mem_to_shadow(addr),
+                       kasan_mem_to_shadow(addr + size - 1) + 1);
 
        if (unlikely(ret)) {
-               unsigned long last_byte = addr + size - 1;
-               s8 *last_shadow = (s8 *)kasan_mem_to_shadow((void *)last_byte);
+               const void *last_byte = addr + size - 1;
+               s8 *last_shadow = (s8 *)kasan_mem_to_shadow(last_byte);
 
                if (unlikely(ret != (unsigned long)last_shadow ||
-                       ((long)(last_byte & KASAN_GRANULE_MASK) >= *last_shadow)))
+                       (((long)last_byte & KASAN_GRANULE_MASK) >= *last_shadow)))
                        return true;
        }
        return false;
 }
 
-static __always_inline bool memory_is_poisoned(unsigned long addr, size_t size)
+static __always_inline bool memory_is_poisoned(const void *addr, size_t size)
 {
        if (__builtin_constant_p(size)) {
                switch (size) {
@@ -159,7 +158,7 @@ static __always_inline bool memory_is_poisoned(unsigned long addr, size_t size)
        return memory_is_poisoned_n(addr, size);
 }
 
-static __always_inline bool check_region_inline(unsigned long addr,
+static __always_inline bool check_region_inline(const void *addr,
                                                size_t size, bool write,
                                                unsigned long ret_ip)
 {
@@ -172,7 +171,7 @@ static __always_inline bool check_region_inline(unsigned long addr,
        if (unlikely(addr + size < addr))
                return !kasan_report(addr, size, write, ret_ip);
 
-       if (unlikely(!addr_has_metadata((void *)addr)))
+       if (unlikely(!addr_has_metadata(addr)))
                return !kasan_report(addr, size, write, ret_ip);
 
        if (likely(!memory_is_poisoned(addr, size)))
@@ -181,7 +180,7 @@ static __always_inline bool check_region_inline(unsigned long addr,
        return !kasan_report(addr, size, write, ret_ip);
 }
 
-bool kasan_check_range(unsigned long addr, size_t size, bool write,
+bool kasan_check_range(const void *addr, size_t size, bool write,
                                        unsigned long ret_ip)
 {
        return check_region_inline(addr, size, write, ret_ip);
@@ -221,36 +220,37 @@ static void register_global(struct kasan_global *global)
                     KASAN_GLOBAL_REDZONE, false);
 }
 
-void __asan_register_globals(struct kasan_global *globals, size_t size)
+void __asan_register_globals(void *ptr, ssize_t size)
 {
        int i;
+       struct kasan_global *globals = ptr;
 
        for (i = 0; i < size; i++)
                register_global(&globals[i]);
 }
 EXPORT_SYMBOL(__asan_register_globals);
 
-void __asan_unregister_globals(struct kasan_global *globals, size_t size)
+void __asan_unregister_globals(void *ptr, ssize_t size)
 {
 }
 EXPORT_SYMBOL(__asan_unregister_globals);
 
 #define DEFINE_ASAN_LOAD_STORE(size)                                   \
-       void __asan_load##size(unsigned long addr)                      \
+       void __asan_load##size(void *addr)                              \
        {                                                               \
                check_region_inline(addr, size, false, _RET_IP_);       \
        }                                                               \
        EXPORT_SYMBOL(__asan_load##size);                               \
        __alias(__asan_load##size)                                      \
-       void __asan_load##size##_noabort(unsigned long);                \
+       void __asan_load##size##_noabort(void *);                       \
        EXPORT_SYMBOL(__asan_load##size##_noabort);                     \
-       void __asan_store##size(unsigned long addr)                     \
+       void __asan_store##size(void *addr)                             \
        {                                                               \
                check_region_inline(addr, size, true, _RET_IP_);        \
        }                                                               \
        EXPORT_SYMBOL(__asan_store##size);                              \
        __alias(__asan_store##size)                                     \
-       void __asan_store##size##_noabort(unsigned long);               \
+       void __asan_store##size##_noabort(void *);                      \
        EXPORT_SYMBOL(__asan_store##size##_noabort)
 
 DEFINE_ASAN_LOAD_STORE(1);
@@ -259,24 +259,24 @@ DEFINE_ASAN_LOAD_STORE(4);
 DEFINE_ASAN_LOAD_STORE(8);
 DEFINE_ASAN_LOAD_STORE(16);
 
-void __asan_loadN(unsigned long addr, size_t size)
+void __asan_loadN(void *addr, ssize_t size)
 {
        kasan_check_range(addr, size, false, _RET_IP_);
 }
 EXPORT_SYMBOL(__asan_loadN);
 
 __alias(__asan_loadN)
-void __asan_loadN_noabort(unsigned long, size_t);
+void __asan_loadN_noabort(void *, ssize_t);
 EXPORT_SYMBOL(__asan_loadN_noabort);
 
-void __asan_storeN(unsigned long addr, size_t size)
+void __asan_storeN(void *addr, ssize_t size)
 {
        kasan_check_range(addr, size, true, _RET_IP_);
 }
 EXPORT_SYMBOL(__asan_storeN);
 
 __alias(__asan_storeN)
-void __asan_storeN_noabort(unsigned long, size_t);
+void __asan_storeN_noabort(void *, ssize_t);
 EXPORT_SYMBOL(__asan_storeN_noabort);
 
 /* to shut up compiler complaints */
@@ -284,7 +284,7 @@ void __asan_handle_no_return(void) {}
 EXPORT_SYMBOL(__asan_handle_no_return);
 
 /* Emitted by compiler to poison alloca()ed objects. */
-void __asan_alloca_poison(unsigned long addr, size_t size)
+void __asan_alloca_poison(void *addr, ssize_t size)
 {
        size_t rounded_up_size = round_up(size, KASAN_GRANULE_SIZE);
        size_t padding_size = round_up(size, KASAN_ALLOCA_REDZONE_SIZE) -
@@ -295,7 +295,7 @@ void __asan_alloca_poison(unsigned long addr, size_t size)
                        KASAN_ALLOCA_REDZONE_SIZE);
        const void *right_redzone = (const void *)(addr + rounded_up_size);
 
-       WARN_ON(!IS_ALIGNED(addr, KASAN_ALLOCA_REDZONE_SIZE));
+       WARN_ON(!IS_ALIGNED((unsigned long)addr, KASAN_ALLOCA_REDZONE_SIZE));
 
        kasan_unpoison((const void *)(addr + rounded_down_size),
                        size - rounded_down_size, false);
@@ -307,18 +307,18 @@ void __asan_alloca_poison(unsigned long addr, size_t size)
 EXPORT_SYMBOL(__asan_alloca_poison);
 
 /* Emitted by compiler to unpoison alloca()ed areas when the stack unwinds. */
-void __asan_allocas_unpoison(const void *stack_top, const void *stack_bottom)
+void __asan_allocas_unpoison(void *stack_top, ssize_t stack_bottom)
 {
-       if (unlikely(!stack_top || stack_top > stack_bottom))
+       if (unlikely(!stack_top || stack_top > (void *)stack_bottom))
                return;
 
-       kasan_unpoison(stack_top, stack_bottom - stack_top, false);
+       kasan_unpoison(stack_top, (void *)stack_bottom - stack_top, false);
 }
 EXPORT_SYMBOL(__asan_allocas_unpoison);
 
 /* Emitted by the compiler to [un]poison local variables. */
 #define DEFINE_ASAN_SET_SHADOW(byte) \
-       void __asan_set_shadow_##byte(const void *addr, size_t size)    \
+       void __asan_set_shadow_##byte(const void *addr, ssize_t size)   \
        {                                                               \
                __memset((void *)addr, 0x##byte, size);                 \
        }                                                               \
@@ -488,7 +488,7 @@ static void __kasan_record_aux_stack(void *addr, bool can_alloc)
                return;
 
        alloc_meta->aux_stack[1] = alloc_meta->aux_stack[0];
-       alloc_meta->aux_stack[0] = kasan_save_stack(GFP_NOWAIT, can_alloc);
+       alloc_meta->aux_stack[0] = kasan_save_stack(0, can_alloc);
 }
 
 void kasan_record_aux_stack(void *addr)
@@ -518,7 +518,7 @@ void kasan_save_free_info(struct kmem_cache *cache, void *object)
        if (!free_meta)
                return;
 
-       kasan_set_track(&free_meta->free_track, GFP_NOWAIT);
+       kasan_set_track(&free_meta->free_track, 0);
        /* The object was freed and has free track set. */
        *(u8 *)kasan_mem_to_shadow(object) = KASAN_SLAB_FREETRACK;
 }
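
The hunks above move the generic KASAN hooks from unsigned long/size_t
parameters to void */ssize_t. A minimal sketch of how instrumented code
reaches these hooks (the wrapper below is hypothetical; only __asan_load8()
comes from the patch):

	/* Roughly what a KASAN_GENERIC build emits around an 8-byte load:
	 * the compiler calls the hook with the raw pointer, which the new
	 * prototype accepts without a cast to unsigned long. */
	static u64 example_instrumented_read(u64 *p)
	{
		__asan_load8(p);	/* reports if *p lies in poisoned memory */
		return *p;
	}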
index cc64ed6..dcfec27 100644
@@ -286,7 +286,7 @@ static void kasan_free_pte(pte_t *pte_start, pmd_t *pmd)
 
        for (i = 0; i < PTRS_PER_PTE; i++) {
                pte = pte_start + i;
-               if (!pte_none(*pte))
+               if (!pte_none(ptep_get(pte)))
                        return;
        }
 
@@ -343,16 +343,19 @@ static void kasan_remove_pte_table(pte_t *pte, unsigned long addr,
                                unsigned long end)
 {
        unsigned long next;
+       pte_t ptent;
 
        for (; addr < end; addr = next, pte++) {
                next = (addr + PAGE_SIZE) & PAGE_MASK;
                if (next > end)
                        next = end;
 
-               if (!pte_present(*pte))
+               ptent = ptep_get(pte);
+
+               if (!pte_present(ptent))
                        continue;
 
-               if (WARN_ON(!kasan_early_shadow_page_entry(*pte)))
+               if (WARN_ON(!kasan_early_shadow_page_entry(ptent)))
                        continue;
                pte_clear(&init_mm, addr, pte);
        }
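
The *pte -> ptep_get(pte) conversions above follow the accessor-based way of
reading page-table entries. A minimal sketch of the pattern, assuming only
the generic pgtable helpers (the function name is hypothetical):

	/* Read the entry once through ptep_get(), then test the snapshot;
	 * the accessor lets architectures make the read tear-free. */
	static bool example_pte_is_present(pte_t *ptep)
	{
		pte_t ptent = ptep_get(ptep);

		return pte_present(ptent);
	}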
index f5e4f5f..b799f11 100644
@@ -198,13 +198,13 @@ enum kasan_report_type {
 struct kasan_report_info {
        /* Filled in by kasan_report_*(). */
        enum kasan_report_type type;
-       void *access_addr;
+       const void *access_addr;
        size_t access_size;
        bool is_write;
        unsigned long ip;
 
        /* Filled in by the common reporting code. */
-       void *first_bad_addr;
+       const void *first_bad_addr;
        struct kmem_cache *cache;
        void *object;
        size_t alloc_size;
@@ -311,7 +311,7 @@ static __always_inline bool addr_has_metadata(const void *addr)
  * @ret_ip: return address
  * @return: true if access was valid, false if invalid
  */
-bool kasan_check_range(unsigned long addr, size_t size, bool write,
+bool kasan_check_range(const void *addr, size_t size, bool write,
                                unsigned long ret_ip);
 
 #else /* CONFIG_KASAN_GENERIC || CONFIG_KASAN_SW_TAGS */
@@ -323,7 +323,7 @@ static __always_inline bool addr_has_metadata(const void *addr)
 
 #endif /* CONFIG_KASAN_GENERIC || CONFIG_KASAN_SW_TAGS */
 
-void *kasan_find_first_bad_addr(void *addr, size_t size);
+const void *kasan_find_first_bad_addr(const void *addr, size_t size);
 size_t kasan_get_alloc_size(void *object, struct kmem_cache *cache);
 void kasan_complete_mode_report_info(struct kasan_report_info *info);
 void kasan_metadata_fetch_row(char *buffer, void *row);
@@ -346,7 +346,7 @@ void kasan_print_aux_stacks(struct kmem_cache *cache, const void *object);
 static inline void kasan_print_aux_stacks(struct kmem_cache *cache, const void *object) { }
 #endif
 
-bool kasan_report(unsigned long addr, size_t size,
+bool kasan_report(const void *addr, size_t size,
                bool is_write, unsigned long ip);
 void kasan_report_invalid_free(void *object, unsigned long ip, enum kasan_report_type type);
 
@@ -571,79 +571,82 @@ void kasan_restore_multi_shot(bool enabled);
  */
 
 asmlinkage void kasan_unpoison_task_stack_below(const void *watermark);
-void __asan_register_globals(struct kasan_global *globals, size_t size);
-void __asan_unregister_globals(struct kasan_global *globals, size_t size);
+void __asan_register_globals(void *globals, ssize_t size);
+void __asan_unregister_globals(void *globals, ssize_t size);
 void __asan_handle_no_return(void);
-void __asan_alloca_poison(unsigned long addr, size_t size);
-void __asan_allocas_unpoison(const void *stack_top, const void *stack_bottom);
-
-void __asan_load1(unsigned long addr);
-void __asan_store1(unsigned long addr);
-void __asan_load2(unsigned long addr);
-void __asan_store2(unsigned long addr);
-void __asan_load4(unsigned long addr);
-void __asan_store4(unsigned long addr);
-void __asan_load8(unsigned long addr);
-void __asan_store8(unsigned long addr);
-void __asan_load16(unsigned long addr);
-void __asan_store16(unsigned long addr);
-void __asan_loadN(unsigned long addr, size_t size);
-void __asan_storeN(unsigned long addr, size_t size);
-
-void __asan_load1_noabort(unsigned long addr);
-void __asan_store1_noabort(unsigned long addr);
-void __asan_load2_noabort(unsigned long addr);
-void __asan_store2_noabort(unsigned long addr);
-void __asan_load4_noabort(unsigned long addr);
-void __asan_store4_noabort(unsigned long addr);
-void __asan_load8_noabort(unsigned long addr);
-void __asan_store8_noabort(unsigned long addr);
-void __asan_load16_noabort(unsigned long addr);
-void __asan_store16_noabort(unsigned long addr);
-void __asan_loadN_noabort(unsigned long addr, size_t size);
-void __asan_storeN_noabort(unsigned long addr, size_t size);
-
-void __asan_report_load1_noabort(unsigned long addr);
-void __asan_report_store1_noabort(unsigned long addr);
-void __asan_report_load2_noabort(unsigned long addr);
-void __asan_report_store2_noabort(unsigned long addr);
-void __asan_report_load4_noabort(unsigned long addr);
-void __asan_report_store4_noabort(unsigned long addr);
-void __asan_report_load8_noabort(unsigned long addr);
-void __asan_report_store8_noabort(unsigned long addr);
-void __asan_report_load16_noabort(unsigned long addr);
-void __asan_report_store16_noabort(unsigned long addr);
-void __asan_report_load_n_noabort(unsigned long addr, size_t size);
-void __asan_report_store_n_noabort(unsigned long addr, size_t size);
-
-void __asan_set_shadow_00(const void *addr, size_t size);
-void __asan_set_shadow_f1(const void *addr, size_t size);
-void __asan_set_shadow_f2(const void *addr, size_t size);
-void __asan_set_shadow_f3(const void *addr, size_t size);
-void __asan_set_shadow_f5(const void *addr, size_t size);
-void __asan_set_shadow_f8(const void *addr, size_t size);
-
-void *__asan_memset(void *addr, int c, size_t len);
-void *__asan_memmove(void *dest, const void *src, size_t len);
-void *__asan_memcpy(void *dest, const void *src, size_t len);
-
-void __hwasan_load1_noabort(unsigned long addr);
-void __hwasan_store1_noabort(unsigned long addr);
-void __hwasan_load2_noabort(unsigned long addr);
-void __hwasan_store2_noabort(unsigned long addr);
-void __hwasan_load4_noabort(unsigned long addr);
-void __hwasan_store4_noabort(unsigned long addr);
-void __hwasan_load8_noabort(unsigned long addr);
-void __hwasan_store8_noabort(unsigned long addr);
-void __hwasan_load16_noabort(unsigned long addr);
-void __hwasan_store16_noabort(unsigned long addr);
-void __hwasan_loadN_noabort(unsigned long addr, size_t size);
-void __hwasan_storeN_noabort(unsigned long addr, size_t size);
-
-void __hwasan_tag_memory(unsigned long addr, u8 tag, unsigned long size);
-
-void *__hwasan_memset(void *addr, int c, size_t len);
-void *__hwasan_memmove(void *dest, const void *src, size_t len);
-void *__hwasan_memcpy(void *dest, const void *src, size_t len);
+void __asan_alloca_poison(void *, ssize_t size);
+void __asan_allocas_unpoison(void *stack_top, ssize_t stack_bottom);
+
+void __asan_load1(void *);
+void __asan_store1(void *);
+void __asan_load2(void *);
+void __asan_store2(void *);
+void __asan_load4(void *);
+void __asan_store4(void *);
+void __asan_load8(void *);
+void __asan_store8(void *);
+void __asan_load16(void *);
+void __asan_store16(void *);
+void __asan_loadN(void *, ssize_t size);
+void __asan_storeN(void *, ssize_t size);
+
+void __asan_load1_noabort(void *);
+void __asan_store1_noabort(void *);
+void __asan_load2_noabort(void *);
+void __asan_store2_noabort(void *);
+void __asan_load4_noabort(void *);
+void __asan_store4_noabort(void *);
+void __asan_load8_noabort(void *);
+void __asan_store8_noabort(void *);
+void __asan_load16_noabort(void *);
+void __asan_store16_noabort(void *);
+void __asan_loadN_noabort(void *, ssize_t size);
+void __asan_storeN_noabort(void *, ssize_t size);
+
+void __asan_report_load1_noabort(void *);
+void __asan_report_store1_noabort(void *);
+void __asan_report_load2_noabort(void *);
+void __asan_report_store2_noabort(void *);
+void __asan_report_load4_noabort(void *);
+void __asan_report_store4_noabort(void *);
+void __asan_report_load8_noabort(void *);
+void __asan_report_store8_noabort(void *);
+void __asan_report_load16_noabort(void *);
+void __asan_report_store16_noabort(void *);
+void __asan_report_load_n_noabort(void *, ssize_t size);
+void __asan_report_store_n_noabort(void *, ssize_t size);
+
+void __asan_set_shadow_00(const void *addr, ssize_t size);
+void __asan_set_shadow_f1(const void *addr, ssize_t size);
+void __asan_set_shadow_f2(const void *addr, ssize_t size);
+void __asan_set_shadow_f3(const void *addr, ssize_t size);
+void __asan_set_shadow_f5(const void *addr, ssize_t size);
+void __asan_set_shadow_f8(const void *addr, ssize_t size);
+
+void *__asan_memset(void *addr, int c, ssize_t len);
+void *__asan_memmove(void *dest, const void *src, ssize_t len);
+void *__asan_memcpy(void *dest, const void *src, ssize_t len);
+
+void __hwasan_load1_noabort(void *);
+void __hwasan_store1_noabort(void *);
+void __hwasan_load2_noabort(void *);
+void __hwasan_store2_noabort(void *);
+void __hwasan_load4_noabort(void *);
+void __hwasan_store4_noabort(void *);
+void __hwasan_load8_noabort(void *);
+void __hwasan_store8_noabort(void *);
+void __hwasan_load16_noabort(void *);
+void __hwasan_store16_noabort(void *);
+void __hwasan_loadN_noabort(void *, ssize_t size);
+void __hwasan_storeN_noabort(void *, ssize_t size);
+
+void __hwasan_tag_memory(void *, u8 tag, ssize_t size);
+
+void *__hwasan_memset(void *addr, int c, ssize_t len);
+void *__hwasan_memmove(void *dest, const void *src, ssize_t len);
+void *__hwasan_memcpy(void *dest, const void *src, ssize_t len);
+
+void kasan_tag_mismatch(void *addr, unsigned long access_info,
+                       unsigned long ret_ip);
 
 #endif /* __MM_KASAN_KASAN_H */
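
Each load/store hook declared above is paired with a *_noabort twin through
__alias(). A hedged sketch of that pattern (example_hook is hypothetical;
__alias(), EXPORT_SYMBOL() and kasan_check_range() are the ones used in the
hunks):

	/* The _noabort name resolves to the same function body; both symbols
	 * are exported so either flavour of instrumentation can link. */
	void example_hook(void *addr)
	{
		kasan_check_range(addr, 1, false, _RET_IP_);
	}
	EXPORT_SYMBOL(example_hook);

	__alias(example_hook)
	void example_hook_noabort(void *addr);
	EXPORT_SYMBOL(example_hook_noabort);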
index 892a9dc..ca4b6ff 100644
@@ -43,6 +43,7 @@ enum kasan_arg_fault {
        KASAN_ARG_FAULT_DEFAULT,
        KASAN_ARG_FAULT_REPORT,
        KASAN_ARG_FAULT_PANIC,
+       KASAN_ARG_FAULT_PANIC_ON_WRITE,
 };
 
 static enum kasan_arg_fault kasan_arg_fault __ro_after_init = KASAN_ARG_FAULT_DEFAULT;
@@ -57,6 +58,8 @@ static int __init early_kasan_fault(char *arg)
                kasan_arg_fault = KASAN_ARG_FAULT_REPORT;
        else if (!strcmp(arg, "panic"))
                kasan_arg_fault = KASAN_ARG_FAULT_PANIC;
+       else if (!strcmp(arg, "panic_on_write"))
+               kasan_arg_fault = KASAN_ARG_FAULT_PANIC_ON_WRITE;
        else
                return -EINVAL;
 
@@ -211,7 +214,7 @@ static void start_report(unsigned long *flags, bool sync)
        pr_err("==================================================================\n");
 }
 
-static void end_report(unsigned long *flags, void *addr)
+static void end_report(unsigned long *flags, const void *addr, bool is_write)
 {
        if (addr)
                trace_error_report_end(ERROR_DETECTOR_KASAN,
@@ -220,8 +223,18 @@ static void end_report(unsigned long *flags, void *addr)
        spin_unlock_irqrestore(&report_lock, *flags);
        if (!test_bit(KASAN_BIT_MULTI_SHOT, &kasan_flags))
                check_panic_on_warn("KASAN");
-       if (kasan_arg_fault == KASAN_ARG_FAULT_PANIC)
+       switch (kasan_arg_fault) {
+       case KASAN_ARG_FAULT_DEFAULT:
+       case KASAN_ARG_FAULT_REPORT:
+               break;
+       case KASAN_ARG_FAULT_PANIC:
                panic("kasan.fault=panic set ...\n");
+               break;
+       case KASAN_ARG_FAULT_PANIC_ON_WRITE:
+               if (is_write)
+                       panic("kasan.fault=panic_on_write set ...\n");
+               break;
+       }
        add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
        lockdep_on();
        report_suppress_stop();
@@ -450,8 +463,8 @@ static void print_memory_metadata(const void *addr)
 
 static void print_report(struct kasan_report_info *info)
 {
-       void *addr = kasan_reset_tag(info->access_addr);
-       u8 tag = get_tag(info->access_addr);
+       void *addr = kasan_reset_tag((void *)info->access_addr);
+       u8 tag = get_tag((void *)info->access_addr);
 
        print_error_description(info);
        if (addr_has_metadata(addr))
@@ -468,12 +481,12 @@ static void print_report(struct kasan_report_info *info)
 
 static void complete_report_info(struct kasan_report_info *info)
 {
-       void *addr = kasan_reset_tag(info->access_addr);
+       void *addr = kasan_reset_tag((void *)info->access_addr);
        struct slab *slab;
 
        if (info->type == KASAN_REPORT_ACCESS)
                info->first_bad_addr = kasan_find_first_bad_addr(
-                                       info->access_addr, info->access_size);
+                                       (void *)info->access_addr, info->access_size);
        else
                info->first_bad_addr = addr;
 
@@ -536,7 +549,11 @@ void kasan_report_invalid_free(void *ptr, unsigned long ip, enum kasan_report_ty
 
        print_report(&info);
 
-       end_report(&flags, ptr);
+       /*
+        * Invalid free is considered a "write" since the allocator's metadata
+        * updates involve writes.
+        */
+       end_report(&flags, ptr, true);
 }
 
 /*
@@ -544,11 +561,10 @@ void kasan_report_invalid_free(void *ptr, unsigned long ip, enum kasan_report_ty
  * user_access_save/restore(): kasan_report_invalid_free() cannot be called
  * from a UACCESS region, and kasan_report_async() is not used on x86.
  */
-bool kasan_report(unsigned long addr, size_t size, bool is_write,
+bool kasan_report(const void *addr, size_t size, bool is_write,
                        unsigned long ip)
 {
        bool ret = true;
-       void *ptr = (void *)addr;
        unsigned long ua_flags = user_access_save();
        unsigned long irq_flags;
        struct kasan_report_info info;
@@ -562,7 +578,7 @@ bool kasan_report(unsigned long addr, size_t size, bool is_write,
 
        memset(&info, 0, sizeof(info));
        info.type = KASAN_REPORT_ACCESS;
-       info.access_addr = ptr;
+       info.access_addr = addr;
        info.access_size = size;
        info.is_write = is_write;
        info.ip = ip;
@@ -571,7 +587,7 @@ bool kasan_report(unsigned long addr, size_t size, bool is_write,
 
        print_report(&info);
 
-       end_report(&irq_flags, ptr);
+       end_report(&irq_flags, (void *)addr, is_write);
 
 out:
        user_access_restore(ua_flags);
@@ -597,7 +613,11 @@ void kasan_report_async(void)
        pr_err("Asynchronous fault: no details available\n");
        pr_err("\n");
        dump_stack_lvl(KERN_ERR);
-       end_report(&flags, NULL);
+       /*
+        * Conservatively set is_write=true, because no details are available.
+        * In this mode, kasan.fault=panic_on_write is like kasan.fault=panic.
+        */
+       end_report(&flags, NULL, true);
 }
 #endif /* CONFIG_KASAN_HW_TAGS */
 
index 87d39bc..51a1e8a 100644
@@ -30,9 +30,9 @@
 #include "kasan.h"
 #include "../slab.h"
 
-void *kasan_find_first_bad_addr(void *addr, size_t size)
+const void *kasan_find_first_bad_addr(const void *addr, size_t size)
 {
-       void *p = addr;
+       const void *p = addr;
 
        if (!addr_has_metadata(p))
                return p;
@@ -362,14 +362,14 @@ void kasan_print_address_stack_frame(const void *addr)
 #endif /* CONFIG_KASAN_STACK */
 
 #define DEFINE_ASAN_REPORT_LOAD(size)                     \
-void __asan_report_load##size##_noabort(unsigned long addr) \
+void __asan_report_load##size##_noabort(void *addr) \
 {                                                         \
        kasan_report(addr, size, false, _RET_IP_);        \
 }                                                         \
 EXPORT_SYMBOL(__asan_report_load##size##_noabort)
 
 #define DEFINE_ASAN_REPORT_STORE(size)                     \
-void __asan_report_store##size##_noabort(unsigned long addr) \
+void __asan_report_store##size##_noabort(void *addr) \
 {                                                          \
        kasan_report(addr, size, true, _RET_IP_);          \
 }                                                          \
@@ -386,13 +386,13 @@ DEFINE_ASAN_REPORT_STORE(4);
 DEFINE_ASAN_REPORT_STORE(8);
 DEFINE_ASAN_REPORT_STORE(16);
 
-void __asan_report_load_n_noabort(unsigned long addr, size_t size)
+void __asan_report_load_n_noabort(void *addr, ssize_t size)
 {
        kasan_report(addr, size, false, _RET_IP_);
 }
 EXPORT_SYMBOL(__asan_report_load_n_noabort);
 
-void __asan_report_store_n_noabort(unsigned long addr, size_t size)
+void __asan_report_store_n_noabort(void *addr, ssize_t size)
 {
        kasan_report(addr, size, true, _RET_IP_);
 }
index 32e80f7..065e1b2 100644
@@ -15,7 +15,7 @@
 
 #include "kasan.h"
 
-void *kasan_find_first_bad_addr(void *addr, size_t size)
+const void *kasan_find_first_bad_addr(const void *addr, size_t size)
 {
        /*
         * Hardware Tag-Based KASAN only calls this function for normal memory
index 8b1f5a7..689e94f 100644
@@ -30,7 +30,7 @@
 #include "kasan.h"
 #include "../slab.h"
 
-void *kasan_find_first_bad_addr(void *addr, size_t size)
+const void *kasan_find_first_bad_addr(const void *addr, size_t size)
 {
        u8 tag = get_tag(addr);
        void *p = kasan_reset_tag(addr);
index c8b86f3..dd772f9 100644
 
 bool __kasan_check_read(const volatile void *p, unsigned int size)
 {
-       return kasan_check_range((unsigned long)p, size, false, _RET_IP_);
+       return kasan_check_range((void *)p, size, false, _RET_IP_);
 }
 EXPORT_SYMBOL(__kasan_check_read);
 
 bool __kasan_check_write(const volatile void *p, unsigned int size)
 {
-       return kasan_check_range((unsigned long)p, size, true, _RET_IP_);
+       return kasan_check_range((void *)p, size, true, _RET_IP_);
 }
 EXPORT_SYMBOL(__kasan_check_write);
 
@@ -50,7 +50,7 @@ EXPORT_SYMBOL(__kasan_check_write);
 #undef memset
 void *memset(void *addr, int c, size_t len)
 {
-       if (!kasan_check_range((unsigned long)addr, len, true, _RET_IP_))
+       if (!kasan_check_range(addr, len, true, _RET_IP_))
                return NULL;
 
        return __memset(addr, c, len);
@@ -60,8 +60,8 @@ void *memset(void *addr, int c, size_t len)
 #undef memmove
 void *memmove(void *dest, const void *src, size_t len)
 {
-       if (!kasan_check_range((unsigned long)src, len, false, _RET_IP_) ||
-           !kasan_check_range((unsigned long)dest, len, true, _RET_IP_))
+       if (!kasan_check_range(src, len, false, _RET_IP_) ||
+           !kasan_check_range(dest, len, true, _RET_IP_))
                return NULL;
 
        return __memmove(dest, src, len);
@@ -71,17 +71,17 @@ void *memmove(void *dest, const void *src, size_t len)
 #undef memcpy
 void *memcpy(void *dest, const void *src, size_t len)
 {
-       if (!kasan_check_range((unsigned long)src, len, false, _RET_IP_) ||
-           !kasan_check_range((unsigned long)dest, len, true, _RET_IP_))
+       if (!kasan_check_range(src, len, false, _RET_IP_) ||
+           !kasan_check_range(dest, len, true, _RET_IP_))
                return NULL;
 
        return __memcpy(dest, src, len);
 }
 #endif
 
-void *__asan_memset(void *addr, int c, size_t len)
+void *__asan_memset(void *addr, int c, ssize_t len)
 {
-       if (!kasan_check_range((unsigned long)addr, len, true, _RET_IP_))
+       if (!kasan_check_range(addr, len, true, _RET_IP_))
                return NULL;
 
        return __memset(addr, c, len);
@@ -89,10 +89,10 @@ void *__asan_memset(void *addr, int c, size_t len)
 EXPORT_SYMBOL(__asan_memset);
 
 #ifdef __HAVE_ARCH_MEMMOVE
-void *__asan_memmove(void *dest, const void *src, size_t len)
+void *__asan_memmove(void *dest, const void *src, ssize_t len)
 {
-       if (!kasan_check_range((unsigned long)src, len, false, _RET_IP_) ||
-           !kasan_check_range((unsigned long)dest, len, true, _RET_IP_))
+       if (!kasan_check_range(src, len, false, _RET_IP_) ||
+           !kasan_check_range(dest, len, true, _RET_IP_))
                return NULL;
 
        return __memmove(dest, src, len);
@@ -100,10 +100,10 @@ void *__asan_memmove(void *dest, const void *src, size_t len)
 EXPORT_SYMBOL(__asan_memmove);
 #endif
 
-void *__asan_memcpy(void *dest, const void *src, size_t len)
+void *__asan_memcpy(void *dest, const void *src, ssize_t len)
 {
-       if (!kasan_check_range((unsigned long)src, len, false, _RET_IP_) ||
-           !kasan_check_range((unsigned long)dest, len, true, _RET_IP_))
+       if (!kasan_check_range(src, len, false, _RET_IP_) ||
+           !kasan_check_range(dest, len, true, _RET_IP_))
                return NULL;
 
        return __memcpy(dest, src, len);
@@ -111,13 +111,13 @@ void *__asan_memcpy(void *dest, const void *src, size_t len)
 EXPORT_SYMBOL(__asan_memcpy);
 
 #ifdef CONFIG_KASAN_SW_TAGS
-void *__hwasan_memset(void *addr, int c, size_t len) __alias(__asan_memset);
+void *__hwasan_memset(void *addr, int c, ssize_t len) __alias(__asan_memset);
 EXPORT_SYMBOL(__hwasan_memset);
 #ifdef __HAVE_ARCH_MEMMOVE
-void *__hwasan_memmove(void *dest, const void *src, size_t len) __alias(__asan_memmove);
+void *__hwasan_memmove(void *dest, const void *src, ssize_t len) __alias(__asan_memmove);
 EXPORT_SYMBOL(__hwasan_memmove);
 #endif
-void *__hwasan_memcpy(void *dest, const void *src, size_t len) __alias(__asan_memcpy);
+void *__hwasan_memcpy(void *dest, const void *src, ssize_t len) __alias(__asan_memcpy);
 EXPORT_SYMBOL(__hwasan_memcpy);
 #endif
 
@@ -226,7 +226,7 @@ static bool shadow_mapped(unsigned long addr)
        if (pmd_bad(*pmd))
                return true;
        pte = pte_offset_kernel(pmd, addr);
-       return !pte_none(*pte);
+       return !pte_none(ptep_get(pte));
 }
 
 static int __meminit kasan_mem_notifier(struct notifier_block *nb,
@@ -317,7 +317,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
        unsigned long page;
        pte_t pte;
 
-       if (likely(!pte_none(*ptep)))
+       if (likely(!pte_none(ptep_get(ptep))))
                return 0;
 
        page = __get_free_page(GFP_KERNEL);
@@ -328,7 +328,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
        pte = pfn_pte(PFN_DOWN(__pa(page)), PAGE_KERNEL);
 
        spin_lock(&init_mm.page_table_lock);
-       if (likely(pte_none(*ptep))) {
+       if (likely(pte_none(ptep_get(ptep)))) {
                set_pte_at(&init_mm, addr, ptep, pte);
                page = 0;
        }
@@ -418,11 +418,11 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 {
        unsigned long page;
 
-       page = (unsigned long)__va(pte_pfn(*ptep) << PAGE_SHIFT);
+       page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
 
        spin_lock(&init_mm.page_table_lock);
 
-       if (likely(!pte_none(*ptep))) {
+       if (likely(!pte_none(ptep_get(ptep)))) {
                pte_clear(&init_mm, addr, ptep);
                free_page(page);
        }
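
The interceptors above now hand pointers straight to kasan_check_range().
Code outside mm/kasan reaches the same checks through the
kasan_check_read()/kasan_check_write() wrappers; a minimal sketch
(example_copy is hypothetical):

	/* Validate both ranges before deferring to the real memcpy. */
	static void example_copy(void *dst, const void *src, size_t len)
	{
		kasan_check_read(src, len);
		kasan_check_write(dst, len);
		memcpy(dst, src, len);
	}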
index 30da65f..220b5d4 100644
@@ -70,8 +70,8 @@ u8 kasan_random_tag(void)
        return (u8)(state % (KASAN_TAG_MAX + 1));
 }
 
-bool kasan_check_range(unsigned long addr, size_t size, bool write,
-                               unsigned long ret_ip)
+bool kasan_check_range(const void *addr, size_t size, bool write,
+                       unsigned long ret_ip)
 {
        u8 tag;
        u8 *shadow_first, *shadow_last, *shadow;
@@ -133,12 +133,12 @@ bool kasan_byte_accessible(const void *addr)
 }
 
 #define DEFINE_HWASAN_LOAD_STORE(size)                                 \
-       void __hwasan_load##size##_noabort(unsigned long addr)          \
+       void __hwasan_load##size##_noabort(void *addr)                  \
        {                                                               \
-               kasan_check_range(addr, size, false, _RET_IP_); \
+               kasan_check_range(addr, size, false, _RET_IP_);         \
        }                                                               \
        EXPORT_SYMBOL(__hwasan_load##size##_noabort);                   \
-       void __hwasan_store##size##_noabort(unsigned long addr)         \
+       void __hwasan_store##size##_noabort(void *addr)                 \
        {                                                               \
                kasan_check_range(addr, size, true, _RET_IP_);          \
        }                                                               \
@@ -150,25 +150,25 @@ DEFINE_HWASAN_LOAD_STORE(4);
 DEFINE_HWASAN_LOAD_STORE(8);
 DEFINE_HWASAN_LOAD_STORE(16);
 
-void __hwasan_loadN_noabort(unsigned long addr, unsigned long size)
+void __hwasan_loadN_noabort(void *addr, ssize_t size)
 {
        kasan_check_range(addr, size, false, _RET_IP_);
 }
 EXPORT_SYMBOL(__hwasan_loadN_noabort);
 
-void __hwasan_storeN_noabort(unsigned long addr, unsigned long size)
+void __hwasan_storeN_noabort(void *addr, ssize_t size)
 {
        kasan_check_range(addr, size, true, _RET_IP_);
 }
 EXPORT_SYMBOL(__hwasan_storeN_noabort);
 
-void __hwasan_tag_memory(unsigned long addr, u8 tag, unsigned long size)
+void __hwasan_tag_memory(void *addr, u8 tag, ssize_t size)
 {
-       kasan_poison((void *)addr, size, tag, false);
+       kasan_poison(addr, size, tag, false);
 }
 EXPORT_SYMBOL(__hwasan_tag_memory);
 
-void kasan_tag_mismatch(unsigned long addr, unsigned long access_info,
+void kasan_tag_mismatch(void *addr, unsigned long access_info,
                        unsigned long ret_ip)
 {
        kasan_report(addr, 1 << (access_info & 0xf), access_info & 0x10,
index 67a2225..7dcfe34 100644
@@ -140,5 +140,5 @@ void kasan_save_alloc_info(struct kmem_cache *cache, void *object, gfp_t flags)
 
 void kasan_save_free_info(struct kmem_cache *cache, void *object)
 {
-       save_stack_info(cache, object, GFP_NOWAIT, true);
+       save_stack_info(cache, object, 0, true);
 }
index 2d0d58f..3beb4ad 100644
@@ -88,7 +88,7 @@ static unsigned int khugepaged_max_ptes_swap __read_mostly;
 static unsigned int khugepaged_max_ptes_shared __read_mostly;
 
 #define MM_SLOTS_HASH_BITS 10
-static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
+static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __read_mostly;
 
@@ -422,19 +422,17 @@ void __khugepaged_enter(struct mm_struct *mm)
        struct mm_slot *slot;
        int wakeup;
 
+       /* __khugepaged_exit() must not run from under us */
+       VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
+       if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags)))
+               return;
+
        mm_slot = mm_slot_alloc(mm_slot_cache);
        if (!mm_slot)
                return;
 
        slot = &mm_slot->slot;
 
-       /* __khugepaged_exit() must not run from under us */
-       VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
-       if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
-               mm_slot_free(mm_slot_cache, mm_slot);
-               return;
-       }
-
        spin_lock(&khugepaged_mm_lock);
        mm_slot_insert(mm_slots_hash, mm, slot);
        /*
@@ -513,7 +511,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
        struct folio *folio, *tmp;
 
        while (--_pte >= pte) {
-               pte_t pteval = *_pte;
+               pte_t pteval = ptep_get(_pte);
                unsigned long pfn;
 
                if (pte_none(pteval))
@@ -557,7 +555,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
        for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
             _pte++, address += PAGE_SIZE) {
-               pte_t pteval = *_pte;
+               pte_t pteval = ptep_get(_pte);
                if (pte_none(pteval) || (pte_present(pteval) &&
                                is_zero_pfn(pte_pfn(pteval)))) {
                        ++none_or_zero;
@@ -701,7 +699,7 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 
        for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
             _pte++, address += PAGE_SIZE) {
-               pteval = *_pte;
+               pteval = ptep_get(_pte);
                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
                        add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
                        if (is_zero_pfn(pte_pfn(pteval))) {
@@ -799,7 +797,7 @@ static int __collapse_huge_page_copy(pte_t *pte,
         */
        for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
             _pte++, page++, _address += PAGE_SIZE) {
-               pteval = *_pte;
+               pteval = ptep_get(_pte);
                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
                        clear_user_highpage(page, _address);
                        continue;
@@ -946,10 +944,6 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
        return SCAN_SUCCEED;
 }
 
-/*
- * See pmd_trans_unstable() for how the result may change out from
- * underneath us, even if we hold mmap_lock in read.
- */
 static int find_pmd_or_thp_or_none(struct mm_struct *mm,
                                   unsigned long address,
                                   pmd_t **pmd)
@@ -961,11 +955,6 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
                return SCAN_PMD_NULL;
 
        pmde = pmdp_get_lockless(*pmd);
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-       /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
-       barrier();
-#endif
        if (pmd_none(pmde))
                return SCAN_PMD_NONE;
        if (!pmd_present(pmde))
@@ -998,9 +987,8 @@ static int check_pmd_still_valid(struct mm_struct *mm,
  * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
  *
  * Called and returns without pte mapped or spinlocks held.
- * Note that if false is returned, mmap_lock will be released.
+ * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
  */
-
 static int __collapse_huge_page_swapin(struct mm_struct *mm,
                                       struct vm_area_struct *vma,
                                       unsigned long haddr, pmd_t *pmd,
@@ -1009,23 +997,37 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
        int swapped_in = 0;
        vm_fault_t ret = 0;
        unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+       int result;
+       pte_t *pte = NULL;
+       spinlock_t *ptl;
 
        for (address = haddr; address < end; address += PAGE_SIZE) {
                struct vm_fault vmf = {
                        .vma = vma,
                        .address = address,
-                       .pgoff = linear_page_index(vma, haddr),
+                       .pgoff = linear_page_index(vma, address),
                        .flags = FAULT_FLAG_ALLOW_RETRY,
                        .pmd = pmd,
                };
 
-               vmf.pte = pte_offset_map(pmd, address);
-               vmf.orig_pte = *vmf.pte;
-               if (!is_swap_pte(vmf.orig_pte)) {
-                       pte_unmap(vmf.pte);
-                       continue;
+               if (!pte++) {
+                       pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
+                       if (!pte) {
+                               mmap_read_unlock(mm);
+                               result = SCAN_PMD_NULL;
+                               goto out;
+                       }
                }
+
+               vmf.orig_pte = ptep_get_lockless(pte);
+               if (!is_swap_pte(vmf.orig_pte))
+                       continue;
+
+               vmf.pte = pte;
+               vmf.ptl = ptl;
                ret = do_swap_page(&vmf);
+               /* Which unmaps pte (after perhaps re-checking the entry) */
+               pte = NULL;
 
                /*
                 * do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
@@ -1034,24 +1036,29 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
                 * resulting in later failure.
                 */
                if (ret & VM_FAULT_RETRY) {
-                       trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
                        /* Likely, but not guaranteed, that page lock failed */
-                       return SCAN_PAGE_LOCK;
+                       result = SCAN_PAGE_LOCK;
+                       goto out;
                }
                if (ret & VM_FAULT_ERROR) {
                        mmap_read_unlock(mm);
-                       trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
-                       return SCAN_FAIL;
+                       result = SCAN_FAIL;
+                       goto out;
                }
                swapped_in++;
        }
 
-       /* Drain LRU add pagevec to remove extra pin on the swapped in pages */
+       if (pte)
+               pte_unmap(pte);
+
+       /* Drain LRU cache to remove extra pin on the swapped in pages */
        if (swapped_in)
                lru_add_drain();
 
-       trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 1);
-       return SCAN_SUCCEED;
+       result = SCAN_SUCCEED;
+out:
+       trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+       return result;
 }
 
 static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
@@ -1151,9 +1158,6 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
                                address + HPAGE_PMD_SIZE);
        mmu_notifier_invalidate_range_start(&range);
 
-       pte = pte_offset_map(pmd, address);
-       pte_ptl = pte_lockptr(mm, pmd);
-
        pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
        /*
         * This removes any huge TLB entry from the CPU so we won't allow
@@ -1168,13 +1172,18 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
        mmu_notifier_invalidate_range_end(&range);
        tlb_remove_table_sync_one();
 
-       spin_lock(pte_ptl);
-       result =  __collapse_huge_page_isolate(vma, address, pte, cc,
-                                              &compound_pagelist);
-       spin_unlock(pte_ptl);
+       pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+       if (pte) {
+               result = __collapse_huge_page_isolate(vma, address, pte, cc,
+                                                     &compound_pagelist);
+               spin_unlock(pte_ptl);
+       } else {
+               result = SCAN_PMD_NULL;
+       }
 
        if (unlikely(result != SCAN_SUCCEED)) {
-               pte_unmap(pte);
+               if (pte)
+                       pte_unmap(pte);
                spin_lock(pmd_ptl);
                BUG_ON(!pmd_none(*pmd));
                /*
@@ -1258,9 +1267,14 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
        memset(cc->node_load, 0, sizeof(cc->node_load));
        nodes_clear(cc->alloc_nmask);
        pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+       if (!pte) {
+               result = SCAN_PMD_NULL;
+               goto out;
+       }
+
        for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
             _pte++, _address += PAGE_SIZE) {
-               pte_t pteval = *_pte;
+               pte_t pteval = ptep_get(_pte);
                if (is_swap_pte(pteval)) {
                        ++unmapped;
                        if (!cc->is_khugepaged ||
@@ -1627,25 +1641,28 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
         * lockless_pages_from_mm() and the hardware page walker can access page
         * tables while all the high-level locks are held in write mode.
         */
-       start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
        result = SCAN_FAIL;
+       start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+       if (!start_pte)
+               goto drop_immap;
 
        /* step 1: check all mapped PTEs are to the right huge page */
        for (i = 0, addr = haddr, pte = start_pte;
             i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
                struct page *page;
+               pte_t ptent = ptep_get(pte);
 
                /* empty pte, skip */
-               if (pte_none(*pte))
+               if (pte_none(ptent))
                        continue;
 
                /* page swapped out, abort */
-               if (!pte_present(*pte)) {
+               if (!pte_present(ptent)) {
                        result = SCAN_PTE_NON_PRESENT;
                        goto abort;
                }
 
-               page = vm_normal_page(vma, addr, *pte);
+               page = vm_normal_page(vma, addr, ptent);
                if (WARN_ON_ONCE(page && is_zone_device_page(page)))
                        page = NULL;
                /*
@@ -1661,10 +1678,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
        for (i = 0, addr = haddr, pte = start_pte;
             i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
                struct page *page;
+               pte_t ptent = ptep_get(pte);
 
-               if (pte_none(*pte))
+               if (pte_none(ptent))
                        continue;
-               page = vm_normal_page(vma, addr, *pte);
+               page = vm_normal_page(vma, addr, ptent);
                if (WARN_ON_ONCE(page && is_zone_device_page(page)))
                        goto abort;
                page_remove_rmap(page, vma, false);
@@ -1702,6 +1720,7 @@ drop_hpage:
 
 abort:
        pte_unmap_unlock(start_pte, ptl);
+drop_immap:
        i_mmap_unlock_write(vma->vm_file->f_mapping);
        goto drop_hpage;
 }
@@ -1953,7 +1972,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
                                        result = SCAN_FAIL;
                                        goto xa_unlocked;
                                }
-                               /* drain pagevecs to help isolate_lru_page() */
+                               /* drain lru cache to help isolate_lru_page() */
                                lru_add_drain();
                                page = folio_file_page(folio, index);
                        } else if (trylock_page(page)) {
@@ -1969,7 +1988,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
                                page_cache_sync_readahead(mapping, &file->f_ra,
                                                          file, index,
                                                          end - index);
-                               /* drain pagevecs to help isolate_lru_page() */
+                               /* drain lru cache to help isolate_lru_page() */
                                lru_add_drain();
                                page = find_lock_page(mapping, index);
                                if (unlikely(page == NULL)) {
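
A recurring pattern in the khugepaged hunks: pte_offset_map_lock() and
pte_offset_map_nolock() may now return NULL, so each walker bails out
instead of assuming the pte table is still mapped. A minimal sketch of the
caller-side handling (example_walk_one and its -EAGAIN return are
hypothetical):

	static int example_walk_one(struct mm_struct *mm, pmd_t *pmd,
				    unsigned long addr)
	{
		spinlock_t *ptl;
		pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

		if (!pte)
			return -EAGAIN;	/* table gone; caller retries or fails */
		if (pte_present(ptep_get(pte))) {
			/* ... operate on the mapped page ... */
		}
		pte_unmap_unlock(pte, ptl);
		return 0;
	}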
index 7d1e4aa..3adb4c1 100644
@@ -74,7 +74,7 @@ depot_stack_handle_t kmsan_save_stack_with_flags(gfp_t flags,
        nr_entries = stack_trace_save(entries, KMSAN_STACK_DEPTH, 0);
 
        /* Don't sleep. */
-       flags &= ~__GFP_DIRECT_RECLAIM;
+       flags &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
 
        handle = __stack_depot_save(entries, nr_entries, flags, true);
        return stack_depot_set_extra_bits(handle, extra);
@@ -245,7 +245,7 @@ depot_stack_handle_t kmsan_internal_chain_origin(depot_stack_handle_t id)
        extra_bits = kmsan_extra_bits(depth, uaf);
 
        entries[0] = KMSAN_CHAIN_MAGIC_ORIGIN;
-       entries[1] = kmsan_save_stack_with_flags(GFP_ATOMIC, 0);
+       entries[1] = kmsan_save_stack_with_flags(__GFP_HIGH, 0);
        entries[2] = id;
        /*
         * @entries is a local var in non-instrumented code, so KMSAN does not
@@ -253,7 +253,7 @@ depot_stack_handle_t kmsan_internal_chain_origin(depot_stack_handle_t id)
         * positives when __stack_depot_save() passes it to instrumented code.
         */
        kmsan_internal_unpoison_memory(entries, sizeof(entries), false);
-       handle = __stack_depot_save(entries, ARRAY_SIZE(entries), GFP_ATOMIC,
+       handle = __stack_depot_save(entries, ARRAY_SIZE(entries), __GFP_HIGH,
                                    true);
        return stack_depot_set_extra_bits(handle, extra_bits);
 }
index cf12e96..cc3907a 100644
@@ -282,7 +282,7 @@ void __msan_poison_alloca(void *address, uintptr_t size, char *descr)
 
        /* stack_depot_save() may allocate memory. */
        kmsan_enter_runtime();
-       handle = stack_depot_save(entries, ARRAY_SIZE(entries), GFP_ATOMIC);
+       handle = stack_depot_save(entries, ARRAY_SIZE(entries), __GFP_HIGH);
        kmsan_leave_runtime();
 
        kmsan_internal_set_shadow_origin(address, size, -1, handle,
index 0156bde..ba26635 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -429,16 +429,17 @@ static int break_ksm_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long nex
        struct page *page = NULL;
        spinlock_t *ptl;
        pte_t *pte;
+       pte_t ptent;
        int ret;
 
-       if (pmd_leaf(*pmd) || !pmd_present(*pmd))
-               return 0;
-
        pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
-       if (pte_present(*pte)) {
-               page = vm_normal_page(walk->vma, addr, *pte);
-       } else if (!pte_none(*pte)) {
-               swp_entry_t entry = pte_to_swp_entry(*pte);
+       if (!pte)
+               return 0;
+       ptent = ptep_get(pte);
+       if (pte_present(ptent)) {
+               page = vm_normal_page(walk->vma, addr, ptent);
+       } else if (!pte_none(ptent)) {
+               swp_entry_t entry = pte_to_swp_entry(ptent);
 
                /*
                 * As KSM pages remain KSM pages until freed, no need to wait
@@ -931,7 +932,7 @@ static int remove_stable_node(struct ksm_stable_node *stable_node)
                 * The stable node did not yet appear stale to get_ksm_page(),
                 * since that allows for an unmapped ksm page to be recognized
                 * right up until it is freed; but the node is safe to remove.
-                * This page might be in a pagevec waiting to be freed,
+                * This page might be in an LRU cache waiting to be freed,
                 * or it might be PageSwapCache (perhaps under writeback),
                 * or it might have been removed from swapcache a moment ago.
                 */
@@ -1086,6 +1087,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
        int err = -EFAULT;
        struct mmu_notifier_range range;
        bool anon_exclusive;
+       pte_t entry;
 
        pvmw.address = page_address_in_vma(page, vma);
        if (pvmw.address == -EFAULT)
@@ -1103,10 +1105,9 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
                goto out_unlock;
 
        anon_exclusive = PageAnonExclusive(page);
-       if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
+       entry = ptep_get(pvmw.pte);
+       if (pte_write(entry) || pte_dirty(entry) ||
            anon_exclusive || mm_tlb_flush_pending(mm)) {
-               pte_t entry;
-
                swapped = PageSwapCache(page);
                flush_cache_page(vma, pvmw.address, page_to_pfn(page));
                /*
@@ -1148,7 +1149,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
                set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
        }
-       *orig_pte = *pvmw.pte;
+       *orig_pte = entry;
        err = 0;
 
 out_unlock:
@@ -1194,8 +1195,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
         * without holding anon_vma lock for write.  So when looking for a
         * genuine pmde (in which to find pte), test present and !THP together.
         */
-       pmde = *pmd;
-       barrier();
+       pmde = pmdp_get_lockless(pmd);
        if (!pmd_present(pmde) || pmd_trans_huge(pmde))
                goto out;
 
@@ -1204,7 +1204,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
        mmu_notifier_invalidate_range_start(&range);
 
        ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
-       if (!pte_same(*ptep, orig_pte)) {
+       if (!ptep)
+               goto out_mn;
+       if (!pte_same(ptep_get(ptep), orig_pte)) {
                pte_unmap_unlock(ptep, ptl);
                goto out_mn;
        }
@@ -1231,7 +1233,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
                dec_mm_counter(mm, MM_ANONPAGES);
        }
 
-       flush_cache_page(vma, addr, pte_pfn(*ptep));
+       flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep)));
        /*
         * No need to notify as we are replacing a read only page with another
         * read only page with the same content.
@@ -2301,8 +2303,8 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
                trace_ksm_start_scan(ksm_scan.seqnr, ksm_rmap_items);
 
                /*
-                * A number of pages can hang around indefinitely on per-cpu
-                * pagevecs, raised page count preventing write_protect_page
+                * A number of pages can hang around indefinitely in per-cpu
+                * LRU cache, raised page count preventing write_protect_page
                 * from merging them.  Though it doesn't really matter much,
                 * it is puzzling to see some stuck in pages_volatile until
                 * other activity jostles them out, and they also prevented
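
In replace_page() above, the "pmde = *pmd; barrier();" pair becomes a single
pmdp_get_lockless() call. A minimal sketch of the same lockless read
(example_pmd_stable is hypothetical):

	/* One consistent snapshot of the pmd, even on configurations where a
	 * pmd entry is wider than a machine word. */
	static bool example_pmd_stable(pmd_t *pmd)
	{
		pmd_t pmde = pmdp_get_lockless(pmd);

		return pmd_present(pmde) && !pmd_trans_huge(pmde);
	}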
index b5ffbaf..886f060 100644
@@ -188,37 +188,43 @@ success:
 
 #ifdef CONFIG_SWAP
 static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
-       unsigned long end, struct mm_walk *walk)
+               unsigned long end, struct mm_walk *walk)
 {
        struct vm_area_struct *vma = walk->private;
-       unsigned long index;
        struct swap_iocb *splug = NULL;
+       pte_t *ptep = NULL;
+       spinlock_t *ptl;
+       unsigned long addr;
 
-       if (pmd_none_or_trans_huge_or_clear_bad(pmd))
-               return 0;
-
-       for (index = start; index != end; index += PAGE_SIZE) {
+       for (addr = start; addr < end; addr += PAGE_SIZE) {
                pte_t pte;
                swp_entry_t entry;
                struct page *page;
-               spinlock_t *ptl;
-               pte_t *ptep;
 
-               ptep = pte_offset_map_lock(vma->vm_mm, pmd, index, &ptl);
-               pte = *ptep;
-               pte_unmap_unlock(ptep, ptl);
+               if (!ptep++) {
+                       ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+                       if (!ptep)
+                               break;
+               }
 
+               pte = ptep_get(ptep);
                if (!is_swap_pte(pte))
                        continue;
                entry = pte_to_swp_entry(pte);
                if (unlikely(non_swap_entry(entry)))
                        continue;
 
+               pte_unmap_unlock(ptep, ptl);
+               ptep = NULL;
+
                page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
-                                            vma, index, false, &splug);
+                                            vma, addr, false, &splug);
                if (page)
                        put_page(page);
        }
+
+       if (ptep)
+               pte_unmap_unlock(ptep, ptl);
        swap_read_unplug(splug);
        cond_resched();
 
@@ -229,30 +235,34 @@ static const struct mm_walk_ops swapin_walk_ops = {
        .pmd_entry              = swapin_walk_pmd_entry,
 };
 
-static void force_shm_swapin_readahead(struct vm_area_struct *vma,
+static void shmem_swapin_range(struct vm_area_struct *vma,
                unsigned long start, unsigned long end,
                struct address_space *mapping)
 {
        XA_STATE(xas, &mapping->i_pages, linear_page_index(vma, start));
-       pgoff_t end_index = linear_page_index(vma, end + PAGE_SIZE - 1);
+       pgoff_t end_index = linear_page_index(vma, end) - 1;
        struct page *page;
        struct swap_iocb *splug = NULL;
 
        rcu_read_lock();
        xas_for_each(&xas, page, end_index) {
-               swp_entry_t swap;
+               unsigned long addr;
+               swp_entry_t entry;
 
                if (!xa_is_value(page))
                        continue;
-               swap = radix_to_swp_entry(page);
+               entry = radix_to_swp_entry(page);
                /* There might be swapin error entries in shmem mapping. */
-               if (non_swap_entry(swap))
+               if (non_swap_entry(entry))
                        continue;
+
+               addr = vma->vm_start +
+                       ((xas.xa_index - vma->vm_pgoff) << PAGE_SHIFT);
                xas_pause(&xas);
                rcu_read_unlock();
 
-               page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
-                                            NULL, 0, false, &splug);
+               page = read_swap_cache_async(entry, mapping_gfp_mask(mapping),
+                                            vma, addr, false, &splug);
                if (page)
                        put_page(page);
 
@@ -260,8 +270,6 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
        }
        rcu_read_unlock();
        swap_read_unplug(splug);
-
-       lru_add_drain();        /* Push any new pages onto the LRU now */
 }
 #endif         /* CONFIG_SWAP */
 
@@ -285,8 +293,8 @@ static long madvise_willneed(struct vm_area_struct *vma,
        }
 
        if (shmem_mapping(file->f_mapping)) {
-               force_shm_swapin_readahead(vma, start, end,
-                                       file->f_mapping);
+               shmem_swapin_range(vma, start, end, file->f_mapping);
+               lru_add_drain(); /* Push any new pages onto the LRU now */
                return 0;
        }
 #else
@@ -340,7 +348,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
        bool pageout = private->pageout;
        struct mm_struct *mm = tlb->mm;
        struct vm_area_struct *vma = walk->vma;
-       pte_t *orig_pte, *pte, ptent;
+       pte_t *start_pte, *pte, ptent;
        spinlock_t *ptl;
        struct folio *folio = NULL;
        LIST_HEAD(folio_list);
@@ -422,15 +430,15 @@ huge_unlock:
        }
 
 regular_folio:
-       if (pmd_trans_unstable(pmd))
-               return 0;
 #endif
        tlb_change_page_size(tlb, PAGE_SIZE);
-       orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+       start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+       if (!start_pte)
+               return 0;
        flush_tlb_batched_pending(mm);
        arch_enter_lazy_mmu_mode();
        for (; addr < end; pte++, addr += PAGE_SIZE) {
-               ptent = *pte;
+               ptent = ptep_get(pte);
 
                if (pte_none(ptent))
                        continue;
@@ -447,25 +455,28 @@ regular_folio:
                 * are sure it's worth. Split it if we are only owner.
                 */
                if (folio_test_large(folio)) {
+                       int err;
+
                        if (folio_mapcount(folio) != 1)
                                break;
                        if (pageout_anon_only_filter && !folio_test_anon(folio))
                                break;
-                       folio_get(folio);
-                       if (!folio_trylock(folio)) {
-                               folio_put(folio);
-                               break;
-                       }
-                       pte_unmap_unlock(orig_pte, ptl);
-                       if (split_folio(folio)) {
-                               folio_unlock(folio);
-                               folio_put(folio);
-                               orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+                       if (!folio_trylock(folio))
                                break;
-                       }
+                       folio_get(folio);
+                       arch_leave_lazy_mmu_mode();
+                       pte_unmap_unlock(start_pte, ptl);
+                       start_pte = NULL;
+                       err = split_folio(folio);
                        folio_unlock(folio);
                        folio_put(folio);
-                       orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+                       if (err)
+                               break;
+                       start_pte = pte =
+                               pte_offset_map_lock(mm, pmd, addr, &ptl);
+                       if (!start_pte)
+                               break;
+                       arch_enter_lazy_mmu_mode();
                        pte--;
                        addr -= PAGE_SIZE;
                        continue;
@@ -510,8 +521,10 @@ regular_folio:
                        folio_deactivate(folio);
        }
 
-       arch_leave_lazy_mmu_mode();
-       pte_unmap_unlock(orig_pte, ptl);
+       if (start_pte) {
+               arch_leave_lazy_mmu_mode();
+               pte_unmap_unlock(start_pte, ptl);
+       }
        if (pageout)
                reclaim_pages(&folio_list);
        cond_resched();
@@ -612,7 +625,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
        struct mm_struct *mm = tlb->mm;
        struct vm_area_struct *vma = walk->vma;
        spinlock_t *ptl;
-       pte_t *orig_pte, *pte, ptent;
+       pte_t *start_pte, *pte, ptent;
        struct folio *folio;
        int nr_swap = 0;
        unsigned long next;
@@ -620,17 +633,16 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
        next = pmd_addr_end(addr, end);
        if (pmd_trans_huge(*pmd))
                if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
-                       goto next;
-
-       if (pmd_trans_unstable(pmd))
-               return 0;
+                       return 0;
 
        tlb_change_page_size(tlb, PAGE_SIZE);
-       orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+       start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+       if (!start_pte)
+               return 0;
        flush_tlb_batched_pending(mm);
        arch_enter_lazy_mmu_mode();
        for (; addr != end; pte++, addr += PAGE_SIZE) {
-               ptent = *pte;
+               ptent = ptep_get(pte);
 
                if (pte_none(ptent))
                        continue;
@@ -664,23 +676,26 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
                 * deactivate all pages.
                 */
                if (folio_test_large(folio)) {
+                       int err;
+
                        if (folio_mapcount(folio) != 1)
-                               goto out;
+                               break;
+                       if (!folio_trylock(folio))
+                               break;
                        folio_get(folio);
-                       if (!folio_trylock(folio)) {
-                               folio_put(folio);
-                               goto out;
-                       }
-                       pte_unmap_unlock(orig_pte, ptl);
-                       if (split_folio(folio)) {
-                               folio_unlock(folio);
-                               folio_put(folio);
-                               orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
-                               goto out;
-                       }
+                       arch_leave_lazy_mmu_mode();
+                       pte_unmap_unlock(start_pte, ptl);
+                       start_pte = NULL;
+                       err = split_folio(folio);
                        folio_unlock(folio);
                        folio_put(folio);
-                       orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+                       if (err)
+                               break;
+                       start_pte = pte =
+                               pte_offset_map_lock(mm, pmd, addr, &ptl);
+                       if (!start_pte)
+                               break;
+                       arch_enter_lazy_mmu_mode();
                        pte--;
                        addr -= PAGE_SIZE;
                        continue;
@@ -725,17 +740,18 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
                }
                folio_mark_lazyfree(folio);
        }
-out:
+
        if (nr_swap) {
                if (current->mm == mm)
                        sync_mm_rss(mm);
-
                add_mm_counter(mm, MM_SWAPENTS, nr_swap);
        }
-       arch_leave_lazy_mmu_mode();
-       pte_unmap_unlock(orig_pte, ptl);
+       if (start_pte) {
+               arch_leave_lazy_mmu_mode();
+               pte_unmap_unlock(start_pte, ptl);
+       }
        cond_resched();
-next:
+
        return 0;
 }
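
The madvise.c hunks above all follow the same new shape: pte_offset_map_lock() may now return NULL, entries are read through ptep_get() instead of dereferencing the pte pointer, and the walker simply gives up on the range when the page table disappears underneath it. A rough sketch of that pattern, for illustration only (example_pte_range() and its body are hypothetical, not code from the tree):

#include <linux/mm.h>
#include <linux/pgtable.h>

static int example_pte_range(struct mm_struct *mm, pmd_t *pmd,
                             unsigned long addr, unsigned long end)
{
        pte_t *start_pte, *pte;
        spinlock_t *ptl;

        start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!start_pte)
                return 0;       /* table vanished or changed: caller just moves on */

        arch_enter_lazy_mmu_mode();
        for (; addr < end; pte++, addr += PAGE_SIZE) {
                pte_t ptent = ptep_get(pte);    /* read the entry, never *pte */

                if (pte_none(ptent) || !pte_present(ptent))
                        continue;
                /* ... act on the present entry ... */
        }
        arch_leave_lazy_mmu_mode();
        pte_unmap_unlock(start_pte, ptl);
        return 0;
}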
 
index e1eb33f..a26dd8b 100644
@@ -35,7 +35,7 @@ static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
                  struct mm_walk *walk)
 {
        struct wp_walk *wpwalk = walk->private;
-       pte_t ptent = *pte;
+       pte_t ptent = ptep_get(pte);
 
        if (pte_write(ptent)) {
                pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
@@ -91,7 +91,7 @@ static int clean_record_pte(pte_t *pte, unsigned long addr,
 {
        struct wp_walk *wpwalk = walk->private;
        struct clean_walk *cwalk = to_clean_walk(wpwalk);
-       pte_t ptent = *pte;
+       pte_t ptent = ptep_get(pte);
 
        if (pte_dirty(ptent)) {
                pgoff_t pgoff = ((addr - walk->vma->vm_start) >> PAGE_SHIFT) +
@@ -128,19 +128,11 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
 {
        pmd_t pmdval = pmdp_get_lockless(pmd);
 
-       if (!pmd_trans_unstable(&pmdval))
-               return 0;
-
-       if (pmd_none(pmdval)) {
-               walk->action = ACTION_AGAIN;
-               return 0;
-       }
-
-       /* Huge pmd, present or migrated */
-       walk->action = ACTION_CONTINUE;
-       if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
+       /* Do not split a huge pmd, present or migrated */
+       if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) {
                WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval));
-
+               walk->action = ACTION_CONTINUE;
+       }
        return 0;
 }
 
@@ -156,23 +148,15 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
 static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
                              struct mm_walk *walk)
 {
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
        pud_t pudval = READ_ONCE(*pud);
 
-       if (!pud_trans_unstable(&pudval))
-               return 0;
-
-       if (pud_none(pudval)) {
-               walk->action = ACTION_AGAIN;
-               return 0;
-       }
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-       /* Huge pud */
-       walk->action = ACTION_CONTINUE;
-       if (pud_trans_huge(pudval) || pud_devmap(pudval))
+       /* Do not split a huge pud */
+       if (pud_trans_huge(pudval) || pud_devmap(pudval)) {
                WARN_ON(pud_write(pudval) || pud_dirty(pudval));
+               walk->action = ACTION_CONTINUE;
+       }
 #endif
-
        return 0;
 }
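
Both handlers above now only flag a huge entry with ACTION_CONTINUE and otherwise let the pagewalk core descend; the core, not the callback, copes with a PTE table that is none or unstable. A hedged sketch of a pmd callback written in that style (example_pmd_entry() is invented for illustration):

#include <linux/pagewalk.h>
#include <linux/pgtable.h>

static int example_pmd_entry(pmd_t *pmd, unsigned long addr,
                             unsigned long end, struct mm_walk *walk)
{
        pmd_t pmdval = pmdp_get_lockless(pmd);

        /* Never split a huge pmd here; just step over it. */
        if (pmd_trans_huge(pmdval)) {
                walk->action = ACTION_CONTINUE;
                return 0;
        }
        /* Otherwise descend; the walk core handles a vanished table. */
        return 0;
}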
 
index 3feafea..388bc0c 100644
@@ -1436,6 +1436,15 @@ done:
                 */
                kmemleak_alloc_phys(found, size, 0);
 
+       /*
+        * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
+        * require memory to be accepted before it can be used by the
+        * guest.
+        *
+        * Accept the memory of the allocated buffer.
+        */
+       accept_memory(found, found + size);
+
        return found;
 }
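
For callers the accept_memory() addition above is invisible: an early-boot memblock allocation comes back already accepted on TDX/SEV-SNP guests and can be touched right away. A minimal sketch under that assumption, with example_early_buffer() being a made-up caller:

#include <linux/memblock.h>
#include <linux/cache.h>
#include <linux/sizes.h>

static void __init example_early_buffer(void)
{
        /* Allocation is zeroed and, on CoCo guests, already accepted. */
        void *buf = memblock_alloc(SZ_64K, SMP_CACHE_BYTES);

        if (!buf)
                panic("example: cannot allocate early buffer\n");
        ((char *)buf)[0] = 1;   /* safe to touch immediately */
}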
 
@@ -2082,19 +2091,30 @@ static void __init memmap_init_reserved_pages(void)
 {
        struct memblock_region *region;
        phys_addr_t start, end;
-       u64 i;
+       int nid;
+
+       /*
+        * set nid on all reserved pages and also treat struct
+        * pages for the NOMAP regions as PageReserved
+        */
+       for_each_mem_region(region) {
+               nid = memblock_get_region_node(region);
+               start = region->base;
+               end = start + region->size;
+
+               if (memblock_is_nomap(region))
+                       reserve_bootmem_region(start, end, nid);
+
+               memblock_set_node(start, end, &memblock.reserved, nid);
+       }
 
        /* initialize struct pages for the reserved regions */
-       for_each_reserved_mem_range(i, &start, &end)
-               reserve_bootmem_region(start, end);
+       for_each_reserved_mem_region(region) {
+               nid = memblock_get_region_node(region);
+               start = region->base;
+               end = start + region->size;
 
-       /* and also treat struct pages for the NOMAP regions as PageReserved */
-       for_each_mem_region(region) {
-               if (memblock_is_nomap(region)) {
-                       start = region->base;
-                       end = start + region->size;
-                       reserve_bootmem_region(start, end);
-               }
+               reserve_bootmem_region(start, end, nid);
        }
 }
 
@@ -2122,7 +2142,7 @@ static unsigned long __init free_low_memory_core_early(void)
 
 static int reset_managed_pages_done __initdata;
 
-void reset_node_managed_pages(pg_data_t *pgdat)
+static void __init reset_node_managed_pages(pg_data_t *pgdat)
 {
        struct zone *z;
 
index 4b27e24..e8ca4bd 100644
@@ -485,7 +485,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid)
 
        if (lru_gen_enabled()) {
                if (soft_limit_excess(memcg))
-                       lru_gen_soft_reclaim(&memcg->nodeinfo[nid]->lruvec);
+                       lru_gen_soft_reclaim(memcg, nid);
                return;
        }
 
@@ -639,7 +639,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
        }
 }
 
-static void do_flush_stats(bool atomic)
+static void do_flush_stats(void)
 {
        /*
         * We always flush the entire tree, so concurrent flushers can just
@@ -652,30 +652,16 @@ static void do_flush_stats(bool atomic)
 
        WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
 
-       if (atomic)
-               cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
-       else
-               cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+       cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
 
        atomic_set(&stats_flush_threshold, 0);
        atomic_set(&stats_flush_ongoing, 0);
 }
 
-static bool should_flush_stats(void)
-{
-       return atomic_read(&stats_flush_threshold) > num_online_cpus();
-}
-
 void mem_cgroup_flush_stats(void)
 {
-       if (should_flush_stats())
-               do_flush_stats(false);
-}
-
-void mem_cgroup_flush_stats_atomic(void)
-{
-       if (should_flush_stats())
-               do_flush_stats(true);
+       if (atomic_read(&stats_flush_threshold) > num_online_cpus())
+               do_flush_stats();
 }
 
 void mem_cgroup_flush_stats_ratelimited(void)
@@ -690,7 +676,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
         * Always flush here so that flushing in latency-sensitive paths is
         * as cheap as possible.
         */
-       do_flush_stats(false);
+       do_flush_stats();
        queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -1273,13 +1259,13 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  *
  * This function iterates over tasks attached to @memcg or to any of its
  * descendants and calls @fn for each task. If @fn returns a non-zero
- * value, the function breaks the iteration loop and returns the value.
- * Otherwise, it will iterate over all tasks and return 0.
+ * value, the function breaks the iteration loop. Otherwise, it will iterate
+ * over all tasks and return 0.
  *
  * This function must not be called for the root memory cgroup.
  */
-int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
-                         int (*fn)(struct task_struct *, void *), void *arg)
+void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+                          int (*fn)(struct task_struct *, void *), void *arg)
 {
        struct mem_cgroup *iter;
        int ret = 0;
@@ -1299,7 +1285,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
                        break;
                }
        }
-       return ret;
 }
 
 #ifdef CONFIG_DEBUG_VM
@@ -1580,13 +1565,10 @@ static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg,
        return memcg_page_state(memcg, item) * memcg_page_state_unit(item);
 }
 
-static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize)
+static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 {
-       struct seq_buf s;
        int i;
 
-       seq_buf_init(&s, buf, bufsize);
-
        /*
         * Provide statistics on the state of the memory subsystem as
         * well as cumulative event counters that show past behavior.
@@ -1603,21 +1585,21 @@ static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize)
                u64 size;
 
                size = memcg_page_state_output(memcg, memory_stats[i].idx);
-               seq_buf_printf(&s, "%s %llu\n", memory_stats[i].name, size);
+               seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);
 
                if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
                        size += memcg_page_state_output(memcg,
                                                        NR_SLAB_RECLAIMABLE_B);
-                       seq_buf_printf(&s, "slab %llu\n", size);
+                       seq_buf_printf(s, "slab %llu\n", size);
                }
        }
 
        /* Accumulated memory events */
-       seq_buf_printf(&s, "pgscan %lu\n",
+       seq_buf_printf(s, "pgscan %lu\n",
                       memcg_events(memcg, PGSCAN_KSWAPD) +
                       memcg_events(memcg, PGSCAN_DIRECT) +
                       memcg_events(memcg, PGSCAN_KHUGEPAGED));
-       seq_buf_printf(&s, "pgsteal %lu\n",
+       seq_buf_printf(s, "pgsteal %lu\n",
                       memcg_events(memcg, PGSTEAL_KSWAPD) +
                       memcg_events(memcg, PGSTEAL_DIRECT) +
                       memcg_events(memcg, PGSTEAL_KHUGEPAGED));
@@ -1627,13 +1609,24 @@ static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize)
                    memcg_vm_event_stat[i] == PGPGOUT)
                        continue;
 
-               seq_buf_printf(&s, "%s %lu\n",
+               seq_buf_printf(s, "%s %lu\n",
                               vm_event_name(memcg_vm_event_stat[i]),
                               memcg_events(memcg, memcg_vm_event_stat[i]));
        }
 
        /* The above should easily fit into one page */
-       WARN_ON_ONCE(seq_buf_has_overflowed(&s));
+       WARN_ON_ONCE(seq_buf_has_overflowed(s));
+}
+
+static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
+
+static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
+{
+       if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+               memcg_stat_format(memcg, s);
+       else
+               memcg1_stat_format(memcg, s);
+       WARN_ON_ONCE(seq_buf_has_overflowed(s));
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
@@ -1671,6 +1664,7 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
 {
        /* Use static buffer, for the caller is holding oom_lock. */
        static char buf[PAGE_SIZE];
+       struct seq_buf s;
 
        lockdep_assert_held(&oom_lock);
 
@@ -1693,8 +1687,9 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
        pr_info("Memory cgroup stats for ");
        pr_cont_cgroup_path(memcg->css.cgroup);
        pr_cont(":");
-       memory_stat_format(memcg, buf, sizeof(buf));
-       pr_info("%s", buf);
+       seq_buf_init(&s, buf, sizeof(buf));
+       memory_stat_format(memcg, &s);
+       seq_buf_do_printk(&s, KERN_INFO);
 }
 
 /*
@@ -2028,26 +2023,12 @@ bool mem_cgroup_oom_synchronize(bool handle)
        if (locked)
                mem_cgroup_oom_notify(memcg);
 
-       if (locked && !READ_ONCE(memcg->oom_kill_disable)) {
-               mem_cgroup_unmark_under_oom(memcg);
-               finish_wait(&memcg_oom_waitq, &owait.wait);
-               mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
-                                        current->memcg_oom_order);
-       } else {
-               schedule();
-               mem_cgroup_unmark_under_oom(memcg);
-               finish_wait(&memcg_oom_waitq, &owait.wait);
-       }
+       schedule();
+       mem_cgroup_unmark_under_oom(memcg);
+       finish_wait(&memcg_oom_waitq, &owait.wait);
 
-       if (locked) {
+       if (locked)
                mem_cgroup_oom_unlock(memcg);
-               /*
-                * There is no guarantee that an OOM-lock contender
-                * sees the wakeups triggered by the OOM kill
-                * uncharges.  Wake any sleepers explicitly.
-                */
-               memcg_oom_recover(memcg);
-       }
 cleanup:
        current->memcg_in_oom = NULL;
        css_put(&memcg->css);
@@ -2166,17 +2147,12 @@ again:
         * When charge migration first begins, we can have multiple
         * critical sections holding the fast-path RCU lock and one
         * holding the slowpath move_lock. Track the task who has the
-        * move_lock for unlock_page_memcg().
+        * move_lock for folio_memcg_unlock().
         */
        memcg->move_lock_task = current;
        memcg->move_lock_flags = flags;
 }
 
-void lock_page_memcg(struct page *page)
-{
-       folio_memcg_lock(page_folio(page));
-}
-
 static void __folio_memcg_unlock(struct mem_cgroup *memcg)
 {
        if (memcg && memcg->move_lock_task == current) {
@@ -2204,11 +2180,6 @@ void folio_memcg_unlock(struct folio *folio)
        __folio_memcg_unlock(folio_memcg(folio));
 }
 
-void unlock_page_memcg(struct page *page)
-{
-       folio_memcg_unlock(page_folio(page));
-}
-
 struct memcg_stock_pcp {
        local_lock_t stock_lock;
        struct mem_cgroup *cached; /* this never be root cgroup */
@@ -2275,7 +2246,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
        local_lock_irqsave(&memcg_stock.stock_lock, flags);
 
        stock = this_cpu_ptr(&memcg_stock);
-       if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
+       if (memcg == READ_ONCE(stock->cached) && stock->nr_pages >= nr_pages) {
                stock->nr_pages -= nr_pages;
                ret = true;
        }
@@ -2290,7 +2261,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
  */
 static void drain_stock(struct memcg_stock_pcp *stock)
 {
-       struct mem_cgroup *old = stock->cached;
+       struct mem_cgroup *old = READ_ONCE(stock->cached);
 
        if (!old)
                return;
@@ -2303,7 +2274,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
        }
 
        css_put(&old->css);
-       stock->cached = NULL;
+       WRITE_ONCE(stock->cached, NULL);
 }
 
 static void drain_local_stock(struct work_struct *dummy)
@@ -2338,10 +2309,10 @@ static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
        struct memcg_stock_pcp *stock;
 
        stock = this_cpu_ptr(&memcg_stock);
-       if (stock->cached != memcg) { /* reset if necessary */
+       if (READ_ONCE(stock->cached) != memcg) { /* reset if necessary */
                drain_stock(stock);
                css_get(&memcg->css);
-               stock->cached = memcg;
+               WRITE_ONCE(stock->cached, memcg);
        }
        stock->nr_pages += nr_pages;
 
@@ -2383,7 +2354,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
                bool flush = false;
 
                rcu_read_lock();
-               memcg = stock->cached;
+               memcg = READ_ONCE(stock->cached);
                if (memcg && stock->nr_pages &&
                    mem_cgroup_is_descendant(memcg, root_memcg))
                        flush = true;
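
The READ_ONCE()/WRITE_ONCE() conversions for stock->cached here (and for stock->cached_objcg further down) annotate an intentional data race: drain_all_stock() peeks at another CPU's stock without taking its local lock. Schematically, with the names below invented for illustration:

#include <linux/compiler.h>
#include <linux/types.h>

struct mem_cgroup;

struct example_stock {
        struct mem_cgroup *cached;
};

/* Writer side: runs with the per-CPU stock lock held. */
static void example_cache(struct example_stock *st, struct mem_cgroup *memcg)
{
        WRITE_ONCE(st->cached, memcg);
}

/* Reader side: lockless peek from another CPU, tolerates a racing update. */
static bool example_peek(struct example_stock *st, struct mem_cgroup *memcg)
{
        return READ_ONCE(st->cached) == memcg;
}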
@@ -2884,7 +2855,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
         *
         * - the page lock
         * - LRU isolation
-        * - lock_page_memcg()
+        * - folio_memcg_lock()
         * - exclusive reference
         * - mem_cgroup_trylock_pages()
         */
@@ -3208,12 +3179,12 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
         * accumulating over a page of vmstat data or when pgdat or idx
         * changes.
         */
-       if (stock->cached_objcg != objcg) {
+       if (READ_ONCE(stock->cached_objcg) != objcg) {
                old = drain_obj_stock(stock);
                obj_cgroup_get(objcg);
                stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
                                ? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
-               stock->cached_objcg = objcg;
+               WRITE_ONCE(stock->cached_objcg, objcg);
                stock->cached_pgdat = pgdat;
        } else if (stock->cached_pgdat != pgdat) {
                /* Flush the existing cached vmstat data */
@@ -3267,7 +3238,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
        local_lock_irqsave(&memcg_stock.stock_lock, flags);
 
        stock = this_cpu_ptr(&memcg_stock);
-       if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
+       if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
                stock->nr_bytes -= nr_bytes;
                ret = true;
        }
@@ -3279,7 +3250,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 
 static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 {
-       struct obj_cgroup *old = stock->cached_objcg;
+       struct obj_cgroup *old = READ_ONCE(stock->cached_objcg);
 
        if (!old)
                return NULL;
@@ -3332,7 +3303,7 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
                stock->cached_pgdat = NULL;
        }
 
-       stock->cached_objcg = NULL;
+       WRITE_ONCE(stock->cached_objcg, NULL);
        /*
         * The `old' objects needs to be released by the caller via
         * obj_cgroup_put() outside of memcg_stock_pcp::stock_lock.
@@ -3343,10 +3314,11 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
                                     struct mem_cgroup *root_memcg)
 {
+       struct obj_cgroup *objcg = READ_ONCE(stock->cached_objcg);
        struct mem_cgroup *memcg;
 
-       if (stock->cached_objcg) {
-               memcg = obj_cgroup_memcg(stock->cached_objcg);
+       if (objcg) {
+               memcg = obj_cgroup_memcg(objcg);
                if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
                        return true;
        }
@@ -3365,10 +3337,10 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
        local_lock_irqsave(&memcg_stock.stock_lock, flags);
 
        stock = this_cpu_ptr(&memcg_stock);
-       if (stock->cached_objcg != objcg) { /* reset if necessary */
+       if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
                old = drain_obj_stock(stock);
                obj_cgroup_get(objcg);
-               stock->cached_objcg = objcg;
+               WRITE_ONCE(stock->cached_objcg, objcg);
                stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
                                ? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
                allow_uncharge = true;  /* Allow uncharge when objcg changes */
@@ -3699,27 +3671,13 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 
        if (mem_cgroup_is_root(memcg)) {
                /*
-                * We can reach here from irq context through:
-                * uncharge_batch()
-                * |--memcg_check_events()
-                *    |--mem_cgroup_threshold()
-                *       |--__mem_cgroup_threshold()
-                *          |--mem_cgroup_usage
-                *
-                * rstat flushing is an expensive operation that should not be
-                * done from irq context; use stale stats in this case.
-                * Arguably, usage threshold events are not reliable on the root
-                * memcg anyway since its usage is ill-defined.
-                *
-                * Additionally, other call paths through memcg_check_events()
-                * disable irqs, so make sure we are flushing stats atomically.
+                * Approximate root's usage from global state. This isn't
+                * perfect, but the root usage was always an approximation.
                 */
-               if (in_task())
-                       mem_cgroup_flush_stats_atomic();
-               val = memcg_page_state(memcg, NR_FILE_PAGES) +
-                       memcg_page_state(memcg, NR_ANON_MAPPED);
+               val = global_node_page_state(NR_FILE_PAGES) +
+                       global_node_page_state(NR_ANON_MAPPED);
                if (swap)
-                       val += memcg_page_state(memcg, MEMCG_SWAP);
+                       val += total_swap_pages - get_nr_swap_pages();
        } else {
                if (!swap)
                        val = page_counter_read(&memcg->memory);
@@ -4135,9 +4093,8 @@ static const unsigned int memcg1_events[] = {
        PGMAJFAULT,
 };
 
-static int memcg_stat_show(struct seq_file *m, void *v)
+static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 {
-       struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
        unsigned long memory, memsw;
        struct mem_cgroup *mi;
        unsigned int i;
@@ -4152,18 +4109,18 @@ static int memcg_stat_show(struct seq_file *m, void *v)
                if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account())
                        continue;
                nr = memcg_page_state_local(memcg, memcg1_stats[i]);
-               seq_printf(m, "%s %lu\n", memcg1_stat_names[i],
+               seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i],
                           nr * memcg_page_state_unit(memcg1_stats[i]));
        }
 
        for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
-               seq_printf(m, "%s %lu\n", vm_event_name(memcg1_events[i]),
-                          memcg_events_local(memcg, memcg1_events[i]));
+               seq_buf_printf(s, "%s %lu\n", vm_event_name(memcg1_events[i]),
+                              memcg_events_local(memcg, memcg1_events[i]));
 
        for (i = 0; i < NR_LRU_LISTS; i++)
-               seq_printf(m, "%s %lu\n", lru_list_name(i),
-                          memcg_page_state_local(memcg, NR_LRU_BASE + i) *
-                          PAGE_SIZE);
+               seq_buf_printf(s, "%s %lu\n", lru_list_name(i),
+                              memcg_page_state_local(memcg, NR_LRU_BASE + i) *
+                              PAGE_SIZE);
 
        /* Hierarchical information */
        memory = memsw = PAGE_COUNTER_MAX;
@@ -4171,11 +4128,11 @@ static int memcg_stat_show(struct seq_file *m, void *v)
                memory = min(memory, READ_ONCE(mi->memory.max));
                memsw = min(memsw, READ_ONCE(mi->memsw.max));
        }
-       seq_printf(m, "hierarchical_memory_limit %llu\n",
-                  (u64)memory * PAGE_SIZE);
+       seq_buf_printf(s, "hierarchical_memory_limit %llu\n",
+                      (u64)memory * PAGE_SIZE);
        if (do_memsw_account())
-               seq_printf(m, "hierarchical_memsw_limit %llu\n",
-                          (u64)memsw * PAGE_SIZE);
+               seq_buf_printf(s, "hierarchical_memsw_limit %llu\n",
+                              (u64)memsw * PAGE_SIZE);
 
        for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
                unsigned long nr;
@@ -4183,19 +4140,19 @@ static int memcg_stat_show(struct seq_file *m, void *v)
                if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account())
                        continue;
                nr = memcg_page_state(memcg, memcg1_stats[i]);
-               seq_printf(m, "total_%s %llu\n", memcg1_stat_names[i],
+               seq_buf_printf(s, "total_%s %llu\n", memcg1_stat_names[i],
                           (u64)nr * memcg_page_state_unit(memcg1_stats[i]));
        }
 
        for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
-               seq_printf(m, "total_%s %llu\n",
-                          vm_event_name(memcg1_events[i]),
-                          (u64)memcg_events(memcg, memcg1_events[i]));
+               seq_buf_printf(s, "total_%s %llu\n",
+                              vm_event_name(memcg1_events[i]),
+                              (u64)memcg_events(memcg, memcg1_events[i]));
 
        for (i = 0; i < NR_LRU_LISTS; i++)
-               seq_printf(m, "total_%s %llu\n", lru_list_name(i),
-                          (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
-                          PAGE_SIZE);
+               seq_buf_printf(s, "total_%s %llu\n", lru_list_name(i),
+                              (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
+                              PAGE_SIZE);
 
 #ifdef CONFIG_DEBUG_VM
        {
@@ -4210,12 +4167,10 @@ static int memcg_stat_show(struct seq_file *m, void *v)
                        anon_cost += mz->lruvec.anon_cost;
                        file_cost += mz->lruvec.file_cost;
                }
-               seq_printf(m, "anon_cost %lu\n", anon_cost);
-               seq_printf(m, "file_cost %lu\n", file_cost);
+               seq_buf_printf(s, "anon_cost %lu\n", anon_cost);
+               seq_buf_printf(s, "file_cost %lu\n", file_cost);
        }
 #endif
-
-       return 0;
 }
 
 static u64 mem_cgroup_swappiness_read(struct cgroup_subsys_state *css,
@@ -4648,11 +4603,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
        struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
        struct mem_cgroup *parent;
 
-       /*
-        * wb_writeback() takes a spinlock and calls
-        * wb_over_bg_thresh()->mem_cgroup_wb_stats(). Do not sleep.
-        */
-       mem_cgroup_flush_stats_atomic();
+       mem_cgroup_flush_stats();
 
        *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
        *pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -5059,6 +5010,8 @@ static int mem_cgroup_slab_show(struct seq_file *m, void *p)
 }
 #endif
 
+static int memory_stat_show(struct seq_file *m, void *v);
+
 static struct cftype mem_cgroup_legacy_files[] = {
        {
                .name = "usage_in_bytes",
@@ -5091,7 +5044,7 @@ static struct cftype mem_cgroup_legacy_files[] = {
        },
        {
                .name = "stat",
-               .seq_show = memcg_stat_show,
+               .seq_show = memory_stat_show,
        },
        {
                .name = "force_empty",
@@ -5464,7 +5417,7 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 
        if (unlikely(mem_cgroup_is_root(memcg)))
                queue_delayed_work(system_unbound_wq, &stats_flush_dwork,
-                                  2UL*HZ);
+                                  FLUSH_TIME);
        lru_gen_online_memcg(memcg);
        return 0;
 offline_kmem:
@@ -5865,7 +5818,7 @@ static int mem_cgroup_move_account(struct page *page,
         * with (un)charging, migration, LRU putback, or anything else
         * that would rely on a stable page's memory cgroup.
         *
-        * Note that lock_page_memcg is a memcg lock, not a page lock,
+        * Note that folio_memcg_lock is a memcg lock, not a page lock,
         * to save space. As soon as we switch page's memory cgroup to a
         * new memcg that isn't locked, the above state can change
         * concurrently again. Make sure we're truly done with it.
@@ -6057,11 +6010,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
                return 0;
        }
 
-       if (pmd_trans_unstable(pmd))
-               return 0;
        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+       if (!pte)
+               return 0;
        for (; addr != end; pte++, addr += PAGE_SIZE)
-               if (get_mctgt_type(vma, addr, *pte, NULL))
+               if (get_mctgt_type(vma, addr, ptep_get(pte), NULL))
                        mc.precharge++; /* increment precharge temporarily */
        pte_unmap_unlock(pte - 1, ptl);
        cond_resched();
@@ -6277,12 +6230,12 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
                return 0;
        }
 
-       if (pmd_trans_unstable(pmd))
-               return 0;
 retry:
        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+       if (!pte)
+               return 0;
        for (; addr != end; addr += PAGE_SIZE) {
-               pte_t ptent = *(pte++);
+               pte_t ptent = ptep_get(pte++);
                bool device = false;
                swp_entry_t ent;
 
@@ -6356,7 +6309,7 @@ static void mem_cgroup_move_charge(void)
 {
        lru_add_drain_all();
        /*
-        * Signal lock_page_memcg() to take the memcg's move_lock
+        * Signal folio_memcg_lock() to take the memcg's move_lock
         * while we're moving its pages to another memcg. Then wait
         * for already started RCU-only updates to finish.
         */
@@ -6634,10 +6587,12 @@ static int memory_stat_show(struct seq_file *m, void *v)
 {
        struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
        char *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+       struct seq_buf s;
 
        if (!buf)
                return -ENOMEM;
-       memory_stat_format(memcg, buf, PAGE_SIZE);
+       seq_buf_init(&s, buf, PAGE_SIZE);
+       memory_stat_format(memcg, &s);
        seq_puts(m, buf);
        kfree(buf);
        return 0;
@@ -6896,7 +6851,7 @@ static unsigned long effective_protection(unsigned long usage,
        protected = min(usage, setting);
        /*
         * If all cgroups at this level combined claim and use more
-        * protection then what the parent affords them, distribute
+        * protection than what the parent affords them, distribute
         * shares in proportion to utilization.
         *
         * We are using actual utilization rather than the statically
@@ -7421,8 +7376,7 @@ static int __init mem_cgroup_init(void)
        for_each_node(node) {
                struct mem_cgroup_tree_per_node *rtpn;
 
-               rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
-                                   node_online(node) ? node : NUMA_NO_NODE);
+               rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node);
 
                rtpn->rb_root = RB_ROOT;
                rtpn->rb_rightmost = NULL;
@@ -7656,6 +7610,14 @@ static u64 swap_current_read(struct cgroup_subsys_state *css,
        return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE;
 }
 
+static u64 swap_peak_read(struct cgroup_subsys_state *css,
+                         struct cftype *cft)
+{
+       struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+       return (u64)memcg->swap.watermark * PAGE_SIZE;
+}
+
 static int swap_high_show(struct seq_file *m, void *v)
 {
        return seq_puts_memcg_tunable(m,
@@ -7735,6 +7697,11 @@ static struct cftype swap_files[] = {
                .write = swap_max_write,
        },
        {
+               .name = "swap.peak",
+               .flags = CFTYPE_NOT_ON_ROOT,
+               .read_u64 = swap_peak_read,
+       },
+       {
                .name = "swap.events",
                .flags = CFTYPE_NOT_ON_ROOT,
                .file_offset = offsetof(struct mem_cgroup, swap_events_file),
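
The new cgroup v2 swap.peak file above simply reports memcg->swap.watermark in bytes. A throwaway user-space reader, where the cgroup path is only an assumption for the example:

#include <stdio.h>

int main(void)
{
        const char *path = "/sys/fs/cgroup/mygroup/memory.swap.peak";
        unsigned long long peak;
        FILE *f = fopen(path, "r");

        if (!f) {
                perror(path);
                return 1;
        }
        if (fscanf(f, "%llu", &peak) != 1) {
                fprintf(stderr, "cannot parse %s\n", path);
                fclose(f);
                return 1;
        }
        fclose(f);
        printf("peak swap usage: %llu bytes\n", peak);
        return 0;
}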
index 5b663ec..e245191 100644
@@ -6,16 +6,16 @@
  * High level machine check handler. Handles pages reported by the
  * hardware as being corrupted usually due to a multi-bit ECC memory or cache
  * failure.
- * 
+ *
  * In addition there is a "soft offline" entry point that allows stop using
  * not-yet-corrupted-by-suspicious pages without killing anything.
  *
  * Handles page cache pages in various states. The tricky part
- * here is that we can access any page asynchronously in respect to 
- * other VM users, because memory failures could happen anytime and 
- * anywhere. This could violate some of their assumptions. This is why 
- * this code has to be extremely careful. Generally it tries to use 
- * normal locking rules, as in get the standard locks, even if that means 
+ * here is that we can access any page asynchronously in respect to
+ * other VM users, because memory failures could happen anytime and
+ * anywhere. This could violate some of their assumptions. This is why
+ * this code has to be extremely careful. Generally it tries to use
+ * normal locking rules, as in get the standard locks, even if that means
  * the error handling takes potentially a long time.
  *
  * It can be very tempting to add handling for obscure cases here.
  *   https://git.kernel.org/cgit/utils/cpu/mce/mce-test.git/
  * - The case actually shows up as a frequent (top 10) page state in
  *   tools/mm/page-types when running a real workload.
- * 
+ *
  * There are several operations here with exponential complexity because
- * of unsuitable VM data structures. For example the operation to map back 
- * from RMAP chains to processes has to walk the complete process list and 
+ * of unsuitable VM data structures. For example the operation to map back
+ * from RMAP chains to processes has to walk the complete process list and
  * has non linear complexity with the number. But since memory corruptions
- * are rare we hope to get away with this. This avoids impacting the core 
+ * are rare we hope to get away with this. This avoids impacting the core
  * VM.
  */
 
@@ -123,7 +123,6 @@ const struct attribute_group memory_failure_attr_group = {
        .attrs = memory_failure_attr,
 };
 
-#ifdef CONFIG_SYSCTL
 static struct ctl_table memory_failure_table[] = {
        {
                .procname       = "memory_failure_early_kill",
@@ -146,14 +145,6 @@ static struct ctl_table memory_failure_table[] = {
        { }
 };
 
-static int __init memory_failure_sysctl_init(void)
-{
-       register_sysctl_init("vm", memory_failure_table);
-       return 0;
-}
-late_initcall(memory_failure_sysctl_init);
-#endif /* CONFIG_SYSCTL */
-
 /*
  * Return values:
  *   1:   the page is dissolved (if needed) and taken off from buddy,
@@ -395,6 +386,7 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
        pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;
+       pte_t ptent;
 
        VM_BUG_ON_VMA(address == -EFAULT, vma);
        pgd = pgd_offset(vma->vm_mm, address);
@@ -414,7 +406,10 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
        if (pmd_devmap(*pmd))
                return PMD_SHIFT;
        pte = pte_offset_map(pmd, address);
-       if (pte_present(*pte) && pte_devmap(*pte))
+       if (!pte)
+               return 0;
+       ptent = ptep_get(pte);
+       if (pte_present(ptent) && pte_devmap(ptent))
                ret = PAGE_SHIFT;
        pte_unmap(pte);
        return ret;
@@ -800,13 +795,13 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
                goto out;
        }
 
-       if (pmd_trans_unstable(pmdp))
-               goto out;
-
        mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp,
                                                addr, &ptl);
+       if (!ptep)
+               goto out;
+
        for (; addr != end; ptep++, addr += PAGE_SIZE) {
-               ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT,
+               ret = check_hwpoisoned_entry(ptep_get(ptep), addr, PAGE_SHIFT,
                                             hwp->pfn, &hwp->tk);
                if (ret == 1)
                        break;
@@ -2441,6 +2436,8 @@ static int __init memory_failure_init(void)
                INIT_WORK(&mf_cpu->work, memory_failure_work_func);
        }
 
+       register_sysctl_init("vm", memory_failure_table);
+
        return 0;
 }
 core_initcall(memory_failure_init);
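
Only the registration point moves here (into memory_failure_init(), dropping the separate late_initcall and the CONFIG_SYSCTL guard); the knobs themselves stay at their usual /proc/sys/vm/ paths. A quick user-space check, purely illustrative:

#include <stdio.h>

int main(void)
{
        static const char *knobs[] = {
                "/proc/sys/vm/memory_failure_early_kill",
                "/proc/sys/vm/memory_failure_recovery",
        };
        for (int i = 0; i < 2; i++) {
                int val;
                FILE *f = fopen(knobs[i], "r");

                if (!f || fscanf(f, "%d", &val) != 1)
                        val = -1;
                if (f)
                        fclose(f);
                printf("%s = %d\n", knobs[i], val);
        }
        return 0;
}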
index e593e56..a516e30 100644
@@ -366,7 +366,7 @@ static void establish_demotion_targets(void)
 
        lockdep_assert_held_once(&memory_tier_lock);
 
-       if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
+       if (!node_demotion)
                return;
 
        disable_all_demotion_targets();
@@ -451,7 +451,6 @@ static void establish_demotion_targets(void)
 }
 
 #else
-static inline void disable_all_demotion_targets(void) {}
 static inline void establish_demotion_targets(void) {}
 #endif /* CONFIG_MIGRATION */
 
index 3e46b4d..58029fd 100644
@@ -700,15 +700,17 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
                                  struct page *page, unsigned long address,
                                  pte_t *ptep)
 {
+       pte_t orig_pte;
        pte_t pte;
        swp_entry_t entry;
 
+       orig_pte = ptep_get(ptep);
        pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
-       if (pte_swp_soft_dirty(*ptep))
+       if (pte_swp_soft_dirty(orig_pte))
                pte = pte_mksoft_dirty(pte);
 
-       entry = pte_to_swp_entry(*ptep);
-       if (pte_swp_uffd_wp(*ptep))
+       entry = pte_to_swp_entry(orig_pte);
+       if (pte_swp_uffd_wp(orig_pte))
                pte = pte_mkuffd_wp(pte);
        else if (is_writable_device_exclusive_entry(entry))
                pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -745,7 +747,7 @@ static int
 try_restore_exclusive_pte(pte_t *src_pte, struct vm_area_struct *vma,
                        unsigned long addr)
 {
-       swp_entry_t entry = pte_to_swp_entry(*src_pte);
+       swp_entry_t entry = pte_to_swp_entry(ptep_get(src_pte));
        struct page *page = pfn_swap_entry_to_page(entry);
 
        if (trylock_page(page)) {
@@ -769,9 +771,10 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                struct vm_area_struct *src_vma, unsigned long addr, int *rss)
 {
        unsigned long vm_flags = dst_vma->vm_flags;
-       pte_t pte = *src_pte;
+       pte_t orig_pte = ptep_get(src_pte);
+       pte_t pte = orig_pte;
        struct page *page;
-       swp_entry_t entry = pte_to_swp_entry(pte);
+       swp_entry_t entry = pte_to_swp_entry(orig_pte);
 
        if (likely(!non_swap_entry(entry))) {
                if (swap_duplicate(entry) < 0)
@@ -786,8 +789,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                        spin_unlock(&mmlist_lock);
                }
                /* Mark the swap entry as shared. */
-               if (pte_swp_exclusive(*src_pte)) {
-                       pte = pte_swp_clear_exclusive(*src_pte);
+               if (pte_swp_exclusive(orig_pte)) {
+                       pte = pte_swp_clear_exclusive(orig_pte);
                        set_pte_at(src_mm, addr, src_pte, pte);
                }
                rss[MM_SWAPENTS]++;
@@ -806,9 +809,9 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                        entry = make_readable_migration_entry(
                                                        swp_offset(entry));
                        pte = swp_entry_to_pte(entry);
-                       if (pte_swp_soft_dirty(*src_pte))
+                       if (pte_swp_soft_dirty(orig_pte))
                                pte = pte_swp_mksoft_dirty(pte);
-                       if (pte_swp_uffd_wp(*src_pte))
+                       if (pte_swp_uffd_wp(orig_pte))
                                pte = pte_swp_mkuffd_wp(pte);
                        set_pte_at(src_mm, addr, src_pte, pte);
                }
@@ -841,7 +844,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                        entry = make_readable_device_private_entry(
                                                        swp_offset(entry));
                        pte = swp_entry_to_pte(entry);
-                       if (pte_swp_uffd_wp(*src_pte))
+                       if (pte_swp_uffd_wp(orig_pte))
                                pte = pte_swp_mkuffd_wp(pte);
                        set_pte_at(src_mm, addr, src_pte, pte);
                }
@@ -905,7 +908,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
        /* All done, just insert the new page copy in the child */
        pte = mk_pte(&new_folio->page, dst_vma->vm_page_prot);
        pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma);
-       if (userfaultfd_pte_wp(dst_vma, *src_pte))
+       if (userfaultfd_pte_wp(dst_vma, ptep_get(src_pte)))
                /* Uffd-wp needs to be delivered to dest pte as well */
                pte = pte_mkuffd_wp(pte);
        set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
@@ -923,7 +926,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 {
        struct mm_struct *src_mm = src_vma->vm_mm;
        unsigned long vm_flags = src_vma->vm_flags;
-       pte_t pte = *src_pte;
+       pte_t pte = ptep_get(src_pte);
        struct page *page;
        struct folio *folio;
 
@@ -1003,6 +1006,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
        struct mm_struct *src_mm = src_vma->vm_mm;
        pte_t *orig_src_pte, *orig_dst_pte;
        pte_t *src_pte, *dst_pte;
+       pte_t ptent;
        spinlock_t *src_ptl, *dst_ptl;
        int progress, ret = 0;
        int rss[NR_MM_COUNTERS];
@@ -1013,13 +1017,25 @@ again:
        progress = 0;
        init_rss_vec(rss);
 
+       /*
+        * copy_pmd_range()'s prior pmd_none_or_clear_bad(src_pmd), and the
+        * error handling here, assume that exclusive mmap_lock on dst and src
+        * protects anon from unexpected THP transitions; with shmem and file
+        * protected by mmap_lock-less collapse skipping areas with anon_vma
+        * (whereas vma_needs_copy() skips areas without anon_vma).  A rework
+        * can remove such assumptions later, but this is good enough for now.
+        */
        dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
        if (!dst_pte) {
                ret = -ENOMEM;
                goto out;
        }
-       src_pte = pte_offset_map(src_pmd, addr);
-       src_ptl = pte_lockptr(src_mm, src_pmd);
+       src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
+       if (!src_pte) {
+               pte_unmap_unlock(dst_pte, dst_ptl);
+               /* ret == 0 */
+               goto out;
+       }
        spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
        orig_src_pte = src_pte;
        orig_dst_pte = dst_pte;
@@ -1036,17 +1052,18 @@ again:
                            spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
                                break;
                }
-               if (pte_none(*src_pte)) {
+               ptent = ptep_get(src_pte);
+               if (pte_none(ptent)) {
                        progress++;
                        continue;
                }
-               if (unlikely(!pte_present(*src_pte))) {
+               if (unlikely(!pte_present(ptent))) {
                        ret = copy_nonpresent_pte(dst_mm, src_mm,
                                                  dst_pte, src_pte,
                                                  dst_vma, src_vma,
                                                  addr, rss);
                        if (ret == -EIO) {
-                               entry = pte_to_swp_entry(*src_pte);
+                               entry = pte_to_swp_entry(ptep_get(src_pte));
                                break;
                        } else if (ret == -EBUSY) {
                                break;
@@ -1084,8 +1101,7 @@ again:
        } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
        arch_leave_lazy_mmu_mode();
-       spin_unlock(src_ptl);
-       pte_unmap(orig_src_pte);
+       pte_unmap_unlock(orig_src_pte, src_ptl);
        add_mm_rss_vec(dst_mm, rss);
        pte_unmap_unlock(orig_dst_pte, dst_ptl);
        cond_resched();
@@ -1389,14 +1405,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
        swp_entry_t entry;
 
        tlb_change_page_size(tlb, PAGE_SIZE);
-again:
        init_rss_vec(rss);
-       start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
-       pte = start_pte;
+       start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+       if (!pte)
+               return addr;
+
        flush_tlb_batched_pending(mm);
        arch_enter_lazy_mmu_mode();
        do {
-               pte_t ptent = *pte;
+               pte_t ptent = ptep_get(pte);
                struct page *page;
 
                if (pte_none(ptent))
@@ -1508,17 +1525,10 @@ again:
         * If we forced a TLB flush (either due to running out of
         * batch buffers or because we needed to flush dirty TLB
         * entries before releasing the ptl), free the batched
-        * memory too. Restart if we didn't do everything.
+        * memory too. Come back again if we didn't do everything.
         */
-       if (force_flush) {
-               force_flush = 0;
+       if (force_flush)
                tlb_flush_mmu(tlb);
-       }
-
-       if (addr != end) {
-               cond_resched();
-               goto again;
-       }
 
        return addr;
 }
@@ -1537,8 +1547,10 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
                if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
                        if (next - addr != HPAGE_PMD_SIZE)
                                __split_huge_pmd(vma, pmd, addr, false, NULL);
-                       else if (zap_huge_pmd(tlb, vma, pmd, addr))
-                               goto next;
+                       else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
+                               addr = next;
+                               continue;
+                       }
                        /* fall through */
                } else if (details && details->single_folio &&
                           folio_test_pmd_mappable(details->single_folio) &&
@@ -1551,20 +1563,14 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
                         */
                        spin_unlock(ptl);
                }
-
-               /*
-                * Here there can be other concurrent MADV_DONTNEED or
-                * trans huge page faults running, and if the pmd is
-                * none or trans huge it can change under us. This is
-                * because MADV_DONTNEED holds the mmap_lock in read
-                * mode.
-                */
-               if (pmd_none_or_trans_huge_or_clear_bad(pmd))
-                       goto next;
-               next = zap_pte_range(tlb, vma, pmd, addr, next, details);
-next:
-               cond_resched();
-       } while (pmd++, addr = next, addr != end);
+               if (pmd_none(*pmd)) {
+                       addr = next;
+                       continue;
+               }
+               addr = zap_pte_range(tlb, vma, pmd, addr, next, details);
+               if (addr != next)
+                       pmd--;
+       } while (pmd++, cond_resched(), addr != end);
 
        return addr;
 }
@@ -1822,7 +1828,7 @@ static int validate_page_before_insert(struct page *page)
 static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
                        unsigned long addr, struct page *page, pgprot_t prot)
 {
-       if (!pte_none(*pte))
+       if (!pte_none(ptep_get(pte)))
                return -EBUSY;
        /* Ok, finally just insert the thing.. */
        get_page(page);
@@ -1906,6 +1912,10 @@ more:
                const int batch_size = min_t(int, pages_to_write_in_pmd, 8);
 
                start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock);
+               if (!start_pte) {
+                       ret = -EFAULT;
+                       goto out;
+               }
                for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) {
                        int err = insert_page_in_batch_locked(vma, pte,
                                addr, pages[curr_page_idx], prot);
@@ -2112,7 +2122,8 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
        pte = get_locked_pte(mm, addr, &ptl);
        if (!pte)
                return VM_FAULT_OOM;
-       if (!pte_none(*pte)) {
+       entry = ptep_get(pte);
+       if (!pte_none(entry)) {
                if (mkwrite) {
                        /*
                         * For read faults on private mappings the PFN passed
@@ -2124,11 +2135,11 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
                         * allocation and mapping invalidation so just skip the
                         * update.
                         */
-                       if (pte_pfn(*pte) != pfn_t_to_pfn(pfn)) {
-                               WARN_ON_ONCE(!is_zero_pfn(pte_pfn(*pte)));
+                       if (pte_pfn(entry) != pfn_t_to_pfn(pfn)) {
+                               WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
                                goto out_unlock;
                        }
-                       entry = pte_mkyoung(*pte);
+                       entry = pte_mkyoung(entry);
                        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
                        if (ptep_set_access_flags(vma, addr, pte, entry, 1))
                                update_mmu_cache(vma, addr, pte);
@@ -2340,7 +2351,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
                return -ENOMEM;
        arch_enter_lazy_mmu_mode();
        do {
-               BUG_ON(!pte_none(*pte));
+               BUG_ON(!pte_none(ptep_get(pte)));
                if (!pfn_modify_allowed(pfn, prot)) {
                        err = -EACCES;
                        break;
@@ -2573,15 +2584,15 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
                mapped_pte = pte = (mm == &init_mm) ?
                        pte_offset_kernel(pmd, addr) :
                        pte_offset_map_lock(mm, pmd, addr, &ptl);
+               if (!pte)
+                       return -EINVAL;
        }
 
-       BUG_ON(pmd_huge(*pmd));
-
        arch_enter_lazy_mmu_mode();
 
        if (fn) {
                do {
-                       if (create || !pte_none(*pte)) {
+                       if (create || !pte_none(ptep_get(pte))) {
                                err = fn(pte++, addr, data);
                                if (err)
                                        break;
@@ -2782,10 +2793,9 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
        int same = 1;
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
        if (sizeof(pte_t) > sizeof(unsigned long)) {
-               spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-               spin_lock(ptl);
-               same = pte_same(*vmf->pte, vmf->orig_pte);
-               spin_unlock(ptl);
+               spin_lock(vmf->ptl);
+               same = pte_same(ptep_get(vmf->pte), vmf->orig_pte);
+               spin_unlock(vmf->ptl);
        }
 #endif
        pte_unmap(vmf->pte);
@@ -2805,7 +2815,6 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
        int ret;
        void *kaddr;
        void __user *uaddr;
-       bool locked = false;
        struct vm_area_struct *vma = vmf->vma;
        struct mm_struct *mm = vma->vm_mm;
        unsigned long addr = vmf->address;
@@ -2831,17 +2840,18 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
         * On architectures with software "accessed" bits, we would
         * take a double page fault, so mark it accessed here.
         */
+       vmf->pte = NULL;
        if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
                pte_t entry;
 
                vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
-               locked = true;
-               if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+               if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
                        /*
                         * Other thread has already handled the fault
                         * and update local tlb only
                         */
-                       update_mmu_tlb(vma, addr, vmf->pte);
+                       if (vmf->pte)
+                               update_mmu_tlb(vma, addr, vmf->pte);
                        ret = -EAGAIN;
                        goto pte_unlock;
                }
@@ -2858,15 +2868,15 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
         * zeroes.
         */
        if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
-               if (locked)
+               if (vmf->pte)
                        goto warn;
 
                /* Re-validate under PTL if the page is still mapped */
                vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
-               locked = true;
-               if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+               if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
                        /* The PTE changed under us, update local tlb */
-                       update_mmu_tlb(vma, addr, vmf->pte);
+                       if (vmf->pte)
+                               update_mmu_tlb(vma, addr, vmf->pte);
                        ret = -EAGAIN;
                        goto pte_unlock;
                }
@@ -2889,7 +2899,7 @@ warn:
        ret = 0;
 
 pte_unlock:
-       if (locked)
+       if (vmf->pte)
                pte_unmap_unlock(vmf->pte, vmf->ptl);
        kunmap_atomic(kaddr);
        flush_dcache_page(dst);
@@ -3111,7 +3121,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
         * Re-check the pte - we dropped the lock
         */
        vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
-       if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+       if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
                if (old_folio) {
                        if (!folio_test_anon(old_folio)) {
                                dec_mm_counter(mm, mm_counter_file(&old_folio->page));
@@ -3179,19 +3189,20 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
                /* Free the old page.. */
                new_folio = old_folio;
                page_copied = 1;
-       } else {
+               pte_unmap_unlock(vmf->pte, vmf->ptl);
+       } else if (vmf->pte) {
                update_mmu_tlb(vma, vmf->address, vmf->pte);
+               pte_unmap_unlock(vmf->pte, vmf->ptl);
        }
 
-       if (new_folio)
-               folio_put(new_folio);
-
-       pte_unmap_unlock(vmf->pte, vmf->ptl);
        /*
         * No need to double call mmu_notifier->invalidate_range() callback as
         * the above ptep_clear_flush_notify() did already call it.
         */
        mmu_notifier_invalidate_range_only_end(&range);
+
+       if (new_folio)
+               folio_put(new_folio);
        if (old_folio) {
                if (page_copied)
                        free_swap_cache(&old_folio->page);
@@ -3231,11 +3242,13 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf)
        WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
        vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
                                       &vmf->ptl);
+       if (!vmf->pte)
+               return VM_FAULT_NOPAGE;
        /*
         * We might have raced with another page fault while we released the
         * pte_offset_map_lock.
         */
-       if (!pte_same(*vmf->pte, vmf->orig_pte)) {
+       if (!pte_same(ptep_get(vmf->pte), vmf->orig_pte)) {
                update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                return VM_FAULT_NOPAGE;
@@ -3330,7 +3343,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
        struct folio *folio = NULL;
 
        if (likely(!unshare)) {
-               if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+               if (userfaultfd_pte_wp(vma, ptep_get(vmf->pte))) {
                        pte_unmap_unlock(vmf->pte, vmf->ptl);
                        return handle_userfault(vmf, VM_UFFD_WP);
                }
@@ -3389,8 +3402,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
                        goto copy;
                if (!folio_test_lru(folio))
                        /*
-                        * Note: We cannot easily detect+handle references from
-                        * remote LRU pagevecs or references to LRU folios.
+                        * We cannot easily detect+handle references from
+                        * remote LRU caches or references to LRU folios.
                         */
                        lru_add_drain();
                if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio))
@@ -3592,10 +3605,11 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                                &vmf->ptl);
-       if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
+       if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                restore_exclusive_pte(vma, vmf->page, vmf->address, vmf->pte);
 
-       pte_unmap_unlock(vmf->pte, vmf->ptl);
+       if (vmf->pte)
+               pte_unmap_unlock(vmf->pte, vmf->ptl);
        folio_unlock(folio);
        folio_put(folio);
 
@@ -3626,6 +3640,8 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
 {
        vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
                                       vmf->address, &vmf->ptl);
+       if (!vmf->pte)
+               return 0;
        /*
         * Be careful so that we will only recover a special uffd-wp pte into a
         * none pte.  Otherwise it means the pte could have changed, so retry.
@@ -3634,7 +3650,7 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
         * quickly from a PTE_MARKER_UFFD_WP into PTE_MARKER_SWAPIN_ERROR.
         * So is_pte_marker() check is not enough to safely drop the pte.
         */
-       if (pte_same(vmf->orig_pte, *vmf->pte))
+       if (pte_same(vmf->orig_pte, ptep_get(vmf->pte)))
                pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
        pte_unmap_unlock(vmf->pte, vmf->ptl);
        return 0;
@@ -3729,10 +3745,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                        vmf->page = pfn_swap_entry_to_page(entry);
                        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                        vmf->address, &vmf->ptl);
-                       if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
-                               spin_unlock(vmf->ptl);
-                               goto out;
-                       }
+                       if (unlikely(!vmf->pte ||
+                                    !pte_same(ptep_get(vmf->pte),
+                                                       vmf->orig_pte)))
+                               goto unlock;
 
                        /*
                         * Get a page reference while we know the page can't be
@@ -3808,7 +3824,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                         */
                        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                        vmf->address, &vmf->ptl);
-                       if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
+                       if (likely(vmf->pte &&
+                                  pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                                ret = VM_FAULT_OOM;
                        goto unlock;
                }
@@ -3864,7 +3881,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                 * If we want to map a page that's in the swapcache writable, we
                 * have to detect via the refcount if we're really the exclusive
                 * owner. Try removing the extra reference from the local LRU
-                * pagevecs if required.
+                * caches if required.
                 */
                if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
                    !folio_test_ksm(folio) && !folio_test_lru(folio))
@@ -3878,7 +3895,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
         */
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                        &vmf->ptl);
-       if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
+       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                goto out_nomap;
 
        if (unlikely(!folio_test_uptodate(folio))) {
@@ -4004,13 +4021,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
        /* No need to invalidate - it was non-present before */
        update_mmu_cache(vma, vmf->address, vmf->pte);
 unlock:
-       pte_unmap_unlock(vmf->pte, vmf->ptl);
+       if (vmf->pte)
+               pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
        if (si)
                put_swap_device(si);
        return ret;
 out_nomap:
-       pte_unmap_unlock(vmf->pte, vmf->ptl);
+       if (vmf->pte)
+               pte_unmap_unlock(vmf->pte, vmf->ptl);
 out_page:
        folio_unlock(folio);
 out_release:
@@ -4042,22 +4061,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
                return VM_FAULT_SIGBUS;
 
        /*
-        * Use pte_alloc() instead of pte_alloc_map().  We can't run
-        * pte_offset_map() on pmds where a huge pmd might be created
-        * from a different thread.
-        *
-        * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
-        * parallel threads are excluded by other means.
-        *
-        * Here we only have mmap_read_lock(mm).
+        * Use pte_alloc() instead of pte_alloc_map(), so that OOM can
+        * be distinguished from a transient failure of pte_offset_map().
         */
        if (pte_alloc(vma->vm_mm, vmf->pmd))
                return VM_FAULT_OOM;
 
-       /* See comment in handle_pte_fault() */
-       if (unlikely(pmd_trans_unstable(vmf->pmd)))
-               return 0;
-
        /* Use the zero-page for reads */
        if (!(vmf->flags & FAULT_FLAG_WRITE) &&
                        !mm_forbids_zeropage(vma->vm_mm)) {
@@ -4065,6 +4074,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
                                                vma->vm_page_prot));
                vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                vmf->address, &vmf->ptl);
+               if (!vmf->pte)
+                       goto unlock;
                if (vmf_pte_changed(vmf)) {
                        update_mmu_tlb(vma, vmf->address, vmf->pte);
                        goto unlock;
@@ -4105,6 +4116,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                        &vmf->ptl);
+       if (!vmf->pte)
+               goto release;
        if (vmf_pte_changed(vmf)) {
                update_mmu_tlb(vma, vmf->address, vmf->pte);
                goto release;
@@ -4132,7 +4145,8 @@ setpte:
        /* No need to invalidate - it was non-present before */
        update_mmu_cache(vma, vmf->address, vmf->pte);
 unlock:
-       pte_unmap_unlock(vmf->pte, vmf->ptl);
+       if (vmf->pte)
+               pte_unmap_unlock(vmf->pte, vmf->ptl);
        return ret;
 release:
        folio_put(folio);
@@ -4326,9 +4340,9 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 static bool vmf_pte_changed(struct vm_fault *vmf)
 {
        if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID)
-               return !pte_same(*vmf->pte, vmf->orig_pte);
+               return !pte_same(ptep_get(vmf->pte), vmf->orig_pte);
 
-       return !pte_none(*vmf->pte);
+       return !pte_none(ptep_get(vmf->pte));
 }
 
 /**
@@ -4381,15 +4395,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
                        return VM_FAULT_OOM;
        }
 
-       /*
-        * See comment in handle_pte_fault() for how this scenario happens, we
-        * need to return NOPAGE so that we drop this page.
-        */
-       if (pmd_devmap_trans_unstable(vmf->pmd))
-               return VM_FAULT_NOPAGE;
-
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                      vmf->address, &vmf->ptl);
+       if (!vmf->pte)
+               return VM_FAULT_NOPAGE;
 
        /* Re-check under ptl */
        if (likely(!vmf_pte_changed(vmf))) {
@@ -4631,17 +4640,11 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
         * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
         */
        if (!vma->vm_ops->fault) {
-               /*
-                * If we find a migration pmd entry or a none pmd entry, which
-                * should never happen, return SIGBUS
-                */
-               if (unlikely(!pmd_present(*vmf->pmd)))
+               vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+                                              vmf->address, &vmf->ptl);
+               if (unlikely(!vmf->pte))
                        ret = VM_FAULT_SIGBUS;
                else {
-                       vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm,
-                                                      vmf->pmd,
-                                                      vmf->address,
-                                                      &vmf->ptl);
                        /*
                         * Make sure this is not a temporary clearing of pte
                         * by holding ptl and checking again. A R/M/W update
@@ -4649,7 +4652,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
                         * we don't have concurrent modification by hardware
                         * followed by an update.
                         */
-                       if (unlikely(pte_none(*vmf->pte)))
+                       if (unlikely(pte_none(ptep_get(vmf->pte))))
                                ret = VM_FAULT_SIGBUS;
                        else
                                ret = VM_FAULT_NOPAGE;
@@ -4704,9 +4707,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
         * validation through pte_unmap_same(). It's of NUMA type but
         * the pfn may be screwed if the read is non atomic.
         */
-       vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
        spin_lock(vmf->ptl);
-       if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+       if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                goto out;
        }
@@ -4775,9 +4777,11 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
                flags |= TNF_MIGRATED;
        } else {
                flags |= TNF_MIGRATE_FAIL;
-               vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
-               spin_lock(vmf->ptl);
-               if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+               vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+                                              vmf->address, &vmf->ptl);
+               if (unlikely(!vmf->pte))
+                       goto out;
+               if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
                        pte_unmap_unlock(vmf->pte, vmf->ptl);
                        goto out;
                }
@@ -4906,38 +4910,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
                vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
        } else {
                /*
-                * If a huge pmd materialized under us just retry later.  Use
-                * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
-                * of pmd_trans_huge() to ensure the pmd didn't become
-                * pmd_trans_huge under us and then back to pmd_none, as a
-                * result of MADV_DONTNEED running immediately after a huge pmd
-                * fault in a different thread of this mm, in turn leading to a
-                * misleading pmd_trans_huge() retval. All we have to ensure is
-                * that it is a regular pmd that we can walk with
-                * pte_offset_map() and we can do that through an atomic read
-                * in C, which is what pmd_trans_unstable() provides.
-                */
-               if (pmd_devmap_trans_unstable(vmf->pmd))
-                       return 0;
-               /*
                 * A regular pmd is established and it can't morph into a huge
-                * pmd from under us anymore at this point because we hold the
-                * mmap_lock read mode and khugepaged takes it in write mode.
-                * So now it's safe to run pte_offset_map().
+                * pmd by anon khugepaged, since that takes mmap_lock in write
+                * mode; but shmem or file collapse to THP could still morph
+                * it into a huge pmd: just retry later if so.
                 */
-               vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
-               vmf->orig_pte = *vmf->pte;
+               vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
+                                                vmf->address, &vmf->ptl);
+               if (unlikely(!vmf->pte))
+                       return 0;
+               vmf->orig_pte = ptep_get_lockless(vmf->pte);
                vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
 
-               /*
-                * some architectures can have larger ptes than wordsize,
-                * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
-                * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
-                * accesses.  The code below just needs a consistent view
-                * for the ifs and we later double check anyway with the
-                * ptl lock held. So here a barrier will do.
-                */
-               barrier();
                if (pte_none(vmf->orig_pte)) {
                        pte_unmap(vmf->pte);
                        vmf->pte = NULL;
@@ -4953,10 +4937,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
        if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
                return do_numa_page(vmf);
 
-       vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
        spin_lock(vmf->ptl);
        entry = vmf->orig_pte;
-       if (unlikely(!pte_same(*vmf->pte, entry))) {
+       if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
                update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
                goto unlock;
        }
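
The new comment above captures the reasoning: with only mmap_read_lock() held, a pmd can still be collapsed to a huge pmd by shmem/file THP collapse, so the fault path now maps the pte without the lock, snapshots it with ptep_get_lockless(), and revalidates under vmf->ptl before touching anything. A hedged sketch of that shape (example_pte_fault() and its return policy are illustrative only):

#include <linux/mm.h>
#include <linux/pgtable.h>

static vm_fault_t example_pte_fault(struct vm_fault *vmf)
{
	pte_t entry;

	vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
					 vmf->address, &vmf->ptl);
	if (!vmf->pte)			/* pmd mutated under us: retry the fault */
		return 0;

	vmf->orig_pte = ptep_get_lockless(vmf->pte);
	if (pte_none(vmf->orig_pte)) {
		pte_unmap(vmf->pte);
		vmf->pte = NULL;	/* fall back to the "no pte" path */
		return 0;
	}

	spin_lock(vmf->ptl);
	entry = vmf->orig_pte;
	if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return 0;		/* raced: the fault will be retried */
	}
	/* ... modify the pte under vmf->ptl ... */
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	return 0;
}
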
@@ -5061,9 +5044,8 @@ retry_pud:
                if (!(ret & VM_FAULT_FALLBACK))
                        return ret;
        } else {
-               vmf.orig_pmd = *vmf.pmd;
+               vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
 
-               barrier();
                if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
                        VM_BUG_ON(thp_migration_supported() &&
                                          !is_pmd_migration_entry(vmf.orig_pmd));
@@ -5440,11 +5422,10 @@ int follow_pte(struct mm_struct *mm, unsigned long address,
        pmd = pmd_offset(pud, address);
        VM_BUG_ON(pmd_trans_huge(*pmd));
 
-       if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
-               goto out;
-
        ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
-       if (!pte_present(*ptep))
+       if (!ptep)
+               goto out;
+       if (!pte_present(ptep_get(ptep)))
                goto unlock;
        *ptepp = ptep;
        return 0;
@@ -5481,7 +5462,7 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
        ret = follow_pte(vma->vm_mm, address, &ptep, &ptl);
        if (ret)
                return ret;
-       *pfn = pte_pfn(*ptep);
+       *pfn = pte_pfn(ptep_get(ptep));
        pte_unmap_unlock(ptep, ptl);
        return 0;
 }
@@ -5501,7 +5482,7 @@ int follow_phys(struct vm_area_struct *vma,
 
        if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
                goto out;
-       pte = *ptep;
+       pte = ptep_get(ptep);
 
        if ((flags & FOLL_WRITE) && !pte_write(pte))
                goto unlock;
@@ -5545,7 +5526,7 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 retry:
        if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
                return -EINVAL;
-       pte = *ptep;
+       pte = ptep_get(ptep);
        pte_unmap_unlock(ptep, ptl);
 
        prot = pgprot_val(pte_pgprot(pte));
@@ -5561,7 +5542,7 @@ retry:
        if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
                goto out_unmap;
 
-       if (!pte_same(pte, *ptep)) {
+       if (!pte_same(pte, ptep_get(ptep))) {
                pte_unmap_unlock(ptep, ptl);
                iounmap(maddr);
 
@@ -5588,7 +5569,6 @@ EXPORT_SYMBOL_GPL(generic_access_phys);
 int __access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf,
                       int len, unsigned int gup_flags)
 {
-       struct vm_area_struct *vma;
        void *old_buf = buf;
        int write = gup_flags & FOLL_WRITE;
 
@@ -5597,16 +5577,18 @@ int __access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf,
 
        /* ignore errors, just check how much was successfully transferred */
        while (len) {
-               int bytes, ret, offset;
+               int bytes, offset;
                void *maddr;
-               struct page *page = NULL;
+               struct vm_area_struct *vma = NULL;
+               struct page *page = get_user_page_vma_remote(mm, addr,
+                                                            gup_flags, &vma);
 
-               ret = get_user_pages_remote(mm, addr, 1,
-                               gup_flags, &page, &vma, NULL);
-               if (ret <= 0) {
+               if (IS_ERR_OR_NULL(page)) {
 #ifndef CONFIG_HAVE_IOREMAP_PROT
                        break;
 #else
+                       int res = 0;
+
                        /*
                         * Check if this is a VM_IO | VM_PFNMAP VMA, which
                         * we can access using slightly different code.
@@ -5615,11 +5597,11 @@ int __access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf,
                        if (!vma)
                                break;
                        if (vma->vm_ops && vma->vm_ops->access)
-                               ret = vma->vm_ops->access(vma, addr, buf,
+                               res = vma->vm_ops->access(vma, addr, buf,
                                                          len, write);
-                       if (ret <= 0)
+                       if (res <= 0)
                                break;
-                       bytes = ret;
+                       bytes = res;
 #endif
                } else {
                        bytes = len;
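
For context, the conversion above replaces the array-based get_user_pages_remote() call with the single-page helper get_user_page_vma_remote(), which returns the page (or ERR_PTR()/NULL) and reports the VMA through an out-parameter. A rough usage sketch, assuming the helper behaves as the hunk implies and that the caller already holds mmap_read_lock(mm); example_peek_remote_byte() is hypothetical:

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/err.h>

static int example_peek_remote_byte(struct mm_struct *mm, unsigned long addr,
				    unsigned char *out)
{
	struct vm_area_struct *vma = NULL;
	struct page *page;
	void *maddr;

	/* caller must hold mmap_read_lock(mm) */
	page = get_user_page_vma_remote(mm, addr, 0, &vma);
	if (IS_ERR_OR_NULL(page))
		return page ? PTR_ERR(page) : -EFAULT;

	maddr = kmap_local_page(page);
	*out = *((unsigned char *)maddr + offset_in_page(addr));
	kunmap_local(maddr);
	put_page(page);
	return 0;
}
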
index 8e0fa20..3f231cf 100644 (file)
@@ -13,7 +13,6 @@
 #include <linux/pagemap.h>
 #include <linux/compiler.h>
 #include <linux/export.h>
-#include <linux/pagevec.h>
 #include <linux/writeback.h>
 #include <linux/slab.h>
 #include <linux/sysctl.h>
@@ -325,7 +324,7 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
        }
 
        if (check_pfn_span(pfn, nr_pages)) {
-               WARN(1, "Misaligned %s start: %#lx end: #%lx\n", __func__, pfn, pfn + nr_pages - 1);
+               WARN(1, "Misaligned %s start: %#lx end: %#lx\n", __func__, pfn, pfn + nr_pages - 1);
                return -EINVAL;
        }
 
@@ -492,18 +491,6 @@ void __ref remove_pfn_range_from_zone(struct zone *zone,
        set_zone_contiguous(zone);
 }
 
-static void __remove_section(unsigned long pfn, unsigned long nr_pages,
-                            unsigned long map_offset,
-                            struct vmem_altmap *altmap)
-{
-       struct mem_section *ms = __pfn_to_section(pfn);
-
-       if (WARN_ON_ONCE(!valid_section(ms)))
-               return;
-
-       sparse_remove_section(ms, pfn, nr_pages, map_offset, altmap);
-}
-
 /**
  * __remove_pages() - remove sections of pages
  * @pfn: starting pageframe (must be aligned to start of a section)
@@ -520,12 +507,9 @@ void __remove_pages(unsigned long pfn, unsigned long nr_pages,
 {
        const unsigned long end_pfn = pfn + nr_pages;
        unsigned long cur_nr_pages;
-       unsigned long map_offset = 0;
-
-       map_offset = vmem_altmap_offset(altmap);
 
        if (check_pfn_span(pfn, nr_pages)) {
-               WARN(1, "Misaligned %s start: %#lx end: #%lx\n", __func__, pfn, pfn + nr_pages - 1);
+               WARN(1, "Misaligned %s start: %#lx end: %#lx\n", __func__, pfn, pfn + nr_pages - 1);
                return;
        }
 
@@ -534,8 +518,7 @@ void __remove_pages(unsigned long pfn, unsigned long nr_pages,
                /* Select all remaining pages up to the next section boundary */
                cur_nr_pages = min(end_pfn - pfn,
                                   SECTION_ALIGN_UP(pfn + 1) - pfn);
-               __remove_section(pfn, cur_nr_pages, map_offset, altmap);
-               map_offset = 0;
+               sparse_remove_section(pfn, cur_nr_pages, altmap);
        }
 }
 
@@ -1172,16 +1155,6 @@ failed_addition:
        return ret;
 }
 
-static void reset_node_present_pages(pg_data_t *pgdat)
-{
-       struct zone *z;
-
-       for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
-               z->present_pages = 0;
-
-       pgdat->node_present_pages = 0;
-}
-
 /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
 static pg_data_t __ref *hotadd_init_pgdat(int nid)
 {
@@ -1204,15 +1177,6 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
         */
        build_all_zonelists(pgdat);
 
-       /*
-        * When memory is hot-added, all the memory is in offline state. So
-        * clear all zones' present_pages because they will be updated in
-        * online_pages() and offline_pages().
-        * TODO: should be in free_area_init_core_hotplug?
-        */
-       reset_node_managed_pages(pgdat);
-       reset_node_present_pages(pgdat);
-
        return pgdat;
 }
 
index 1756389..edc2519 100644 (file)
@@ -508,20 +508,23 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
        unsigned long flags = qp->flags;
        bool has_unmovable = false;
        pte_t *pte, *mapped_pte;
+       pte_t ptent;
        spinlock_t *ptl;
 
        ptl = pmd_trans_huge_lock(pmd, vma);
        if (ptl)
                return queue_folios_pmd(pmd, ptl, addr, end, walk);
 
-       if (pmd_trans_unstable(pmd))
-               return 0;
-
        mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+       if (!pte) {
+               walk->action = ACTION_AGAIN;
+               return 0;
+       }
        for (; addr != end; pte++, addr += PAGE_SIZE) {
-               if (!pte_present(*pte))
+               ptent = ptep_get(pte);
+               if (!pte_present(ptent))
                        continue;
-               folio = vm_normal_folio(vma, addr, *pte);
+               folio = vm_normal_folio(vma, addr, ptent);
                if (!folio || folio_is_zone_device(folio))
                        continue;
                /*
@@ -1195,24 +1198,22 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
  * list of pages handed to migrate_pages()--which is how we get here--
  * is in virtual address order.
  */
-static struct page *new_page(struct page *page, unsigned long start)
+static struct folio *new_folio(struct folio *src, unsigned long start)
 {
-       struct folio *dst, *src = page_folio(page);
        struct vm_area_struct *vma;
        unsigned long address;
        VMA_ITERATOR(vmi, current->mm, start);
        gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL;
 
        for_each_vma(vmi, vma) {
-               address = page_address_in_vma(page, vma);
+               address = page_address_in_vma(&src->page, vma);
                if (address != -EFAULT)
                        break;
        }
 
        if (folio_test_hugetlb(src)) {
-               dst = alloc_hugetlb_folio_vma(folio_hstate(src),
+               return alloc_hugetlb_folio_vma(folio_hstate(src),
                                vma, address);
-               return &dst->page;
        }
 
        if (folio_test_large(src))
@@ -1221,9 +1222,8 @@ static struct page *new_page(struct page *page, unsigned long start)
        /*
         * if !vma, vma_alloc_folio() will use task or system default policy
         */
-       dst = vma_alloc_folio(gfp, folio_order(src), vma, address,
+       return vma_alloc_folio(gfp, folio_order(src), vma, address,
                        folio_test_large(src));
-       return &dst->page;
 }
 #else
 
@@ -1239,7 +1239,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
        return -ENOSYS;
 }
 
-static struct page *new_page(struct page *page, unsigned long start)
+static struct folio *new_folio(struct folio *src, unsigned long start)
 {
        return NULL;
 }
@@ -1334,7 +1334,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 
                if (!list_empty(&pagelist)) {
                        WARN_ON_ONCE(flags & MPOL_MF_LAZY);
-                       nr_failed = migrate_pages(&pagelist, new_page, NULL,
+                       nr_failed = migrate_pages(&pagelist, new_folio, NULL,
                                start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND, NULL);
                        if (nr_failed)
                                putback_movable_pages(&pagelist);
index 01cac26..24baad2 100644 (file)
@@ -21,7 +21,6 @@
 #include <linux/buffer_head.h>
 #include <linux/mm_inline.h>
 #include <linux/nsproxy.h>
-#include <linux/pagevec.h>
 #include <linux/ksm.h>
 #include <linux/rmap.h>
 #include <linux/topology.h>
@@ -188,6 +187,7 @@ static bool remove_migration_pte(struct folio *folio,
 
        while (page_vma_mapped_walk(&pvmw)) {
                rmap_t rmap_flags = RMAP_NONE;
+               pte_t old_pte;
                pte_t pte;
                swp_entry_t entry;
                struct page *new;
@@ -210,17 +210,18 @@ static bool remove_migration_pte(struct folio *folio,
 
                folio_get(folio);
                pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
-               if (pte_swp_soft_dirty(*pvmw.pte))
+               old_pte = ptep_get(pvmw.pte);
+               if (pte_swp_soft_dirty(old_pte))
                        pte = pte_mksoft_dirty(pte);
 
-               entry = pte_to_swp_entry(*pvmw.pte);
+               entry = pte_to_swp_entry(old_pte);
                if (!is_migration_entry_young(entry))
                        pte = pte_mkold(pte);
                if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
                        pte = pte_mkdirty(pte);
                if (is_writable_migration_entry(entry))
                        pte = pte_mkwrite(pte);
-               else if (pte_swp_uffd_wp(*pvmw.pte))
+               else if (pte_swp_uffd_wp(old_pte))
                        pte = pte_mkuffd_wp(pte);
 
                if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
@@ -234,9 +235,9 @@ static bool remove_migration_pte(struct folio *folio,
                                entry = make_readable_device_private_entry(
                                                        page_to_pfn(new));
                        pte = swp_entry_to_pte(entry);
-                       if (pte_swp_soft_dirty(*pvmw.pte))
+                       if (pte_swp_soft_dirty(old_pte))
                                pte = pte_swp_mksoft_dirty(pte);
-                       if (pte_swp_uffd_wp(*pvmw.pte))
+                       if (pte_swp_uffd_wp(old_pte))
                                pte = pte_swp_mkuffd_wp(pte);
                }
 
@@ -296,14 +297,21 @@ void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked)
  * get to the page and wait until migration is finished.
  * When we return from this function the fault will be retried.
  */
-void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
-                               spinlock_t *ptl)
+void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+                         unsigned long address)
 {
+       spinlock_t *ptl;
+       pte_t *ptep;
        pte_t pte;
        swp_entry_t entry;
 
-       spin_lock(ptl);
-       pte = *ptep;
+       ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+       if (!ptep)
+               return;
+
+       pte = ptep_get(ptep);
+       pte_unmap(ptep);
+
        if (!is_swap_pte(pte))
                goto out;
 
@@ -311,18 +319,10 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
        if (!is_migration_entry(entry))
                goto out;
 
-       migration_entry_wait_on_locked(entry, ptep, ptl);
+       migration_entry_wait_on_locked(entry, ptl);
        return;
 out:
-       pte_unmap_unlock(ptep, ptl);
-}
-
-void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
-                               unsigned long address)
-{
-       spinlock_t *ptl = pte_lockptr(mm, pmd);
-       pte_t *ptep = pte_offset_map(pmd, address);
-       __migration_entry_wait(mm, ptep, ptl);
+       spin_unlock(ptl);
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
@@ -332,9 +332,9 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
  *
  * This function will release the vma lock before returning.
  */
-void __migration_entry_wait_huge(struct vm_area_struct *vma,
-                                pte_t *ptep, spinlock_t *ptl)
+void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *ptep)
 {
+       spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, ptep);
        pte_t pte;
 
        hugetlb_vma_assert_locked(vma);
@@ -352,16 +352,9 @@ void __migration_entry_wait_huge(struct vm_area_struct *vma,
                 * lock release in migration_entry_wait_on_locked().
                 */
                hugetlb_vma_unlock_read(vma);
-               migration_entry_wait_on_locked(pte_to_swp_entry(pte), NULL, ptl);
+               migration_entry_wait_on_locked(pte_to_swp_entry(pte), ptl);
        }
 }
-
-void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
-{
-       spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte);
-
-       __migration_entry_wait_huge(vma, pte, ptl);
-}
 #endif
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
@@ -372,7 +365,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
        ptl = pmd_lock(mm, pmd);
        if (!is_pmd_migration_entry(*pmd))
                goto unlock;
-       migration_entry_wait_on_locked(pmd_to_swp_entry(*pmd), NULL, ptl);
+       migration_entry_wait_on_locked(pmd_to_swp_entry(*pmd), ptl);
        return;
 unlock:
        spin_unlock(ptl);
@@ -492,6 +485,11 @@ int folio_migrate_mapping(struct address_space *mapping,
                if (folio_test_swapbacked(folio) && !folio_test_swapcache(folio)) {
                        __mod_lruvec_state(old_lruvec, NR_SHMEM, -nr);
                        __mod_lruvec_state(new_lruvec, NR_SHMEM, nr);
+
+                       if (folio_test_pmd_mappable(folio)) {
+                               __mod_lruvec_state(old_lruvec, NR_SHMEM_THPS, -nr);
+                               __mod_lruvec_state(new_lruvec, NR_SHMEM_THPS, nr);
+                       }
                }
 #ifdef CONFIG_SWAP
                if (folio_test_swapcache(folio)) {
@@ -692,37 +690,32 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
                                                        enum migrate_mode mode)
 {
        struct buffer_head *bh = head;
+       struct buffer_head *failed_bh;
 
-       /* Simple case, sync compaction */
-       if (mode != MIGRATE_ASYNC) {
-               do {
-                       lock_buffer(bh);
-                       bh = bh->b_this_page;
-
-               } while (bh != head);
-
-               return true;
-       }
-
-       /* async case, we cannot block on lock_buffer so use trylock_buffer */
        do {
                if (!trylock_buffer(bh)) {
-                       /*
-                        * We failed to lock the buffer and cannot stall in
-                        * async migration. Release the taken locks
-                        */
-                       struct buffer_head *failed_bh = bh;
-                       bh = head;
-                       while (bh != failed_bh) {
-                               unlock_buffer(bh);
-                               bh = bh->b_this_page;
-                       }
-                       return false;
+                       if (mode == MIGRATE_ASYNC)
+                               goto unlock;
+                       if (mode == MIGRATE_SYNC_LIGHT && !buffer_uptodate(bh))
+                               goto unlock;
+                       lock_buffer(bh);
                }
 
                bh = bh->b_this_page;
        } while (bh != head);
+
        return true;
+
+unlock:
+       /* We failed to lock the buffer and cannot stall. */
+       failed_bh = bh;
+       bh = head;
+       while (bh != failed_bh) {
+               unlock_buffer(bh);
+               bh = bh->b_this_page;
+       }
+
+       return false;
 }
 
 static int __buffer_migrate_folio(struct address_space *mapping,
@@ -1072,15 +1065,13 @@ static void migrate_folio_undo_src(struct folio *src,
 }
 
 /* Restore the destination folio to the original state upon failure */
-static void migrate_folio_undo_dst(struct folio *dst,
-                                  bool locked,
-                                  free_page_t put_new_page,
-                                  unsigned long private)
+static void migrate_folio_undo_dst(struct folio *dst, bool locked,
+               free_folio_t put_new_folio, unsigned long private)
 {
        if (locked)
                folio_unlock(dst);
-       if (put_new_page)
-               put_new_page(&dst->page, private);
+       if (put_new_folio)
+               put_new_folio(dst, private);
        else
                folio_put(dst);
 }
@@ -1104,14 +1095,13 @@ static void migrate_folio_done(struct folio *src,
 }
 
 /* Obtain the lock on page, remove all ptes. */
-static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page,
-                              unsigned long private, struct folio *src,
-                              struct folio **dstp, enum migrate_mode mode,
-                              enum migrate_reason reason, struct list_head *ret)
+static int migrate_folio_unmap(new_folio_t get_new_folio,
+               free_folio_t put_new_folio, unsigned long private,
+               struct folio *src, struct folio **dstp, enum migrate_mode mode,
+               enum migrate_reason reason, struct list_head *ret)
 {
        struct folio *dst;
        int rc = -EAGAIN;
-       struct page *newpage = NULL;
        int page_was_mapped = 0;
        struct anon_vma *anon_vma = NULL;
        bool is_lru = !__PageMovable(&src->page);
@@ -1128,10 +1118,9 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
                return MIGRATEPAGE_SUCCESS;
        }
 
-       newpage = get_new_page(&src->page, private);
-       if (!newpage)
+       dst = get_new_folio(src, private);
+       if (!dst)
                return -ENOMEM;
-       dst = page_folio(newpage);
        *dstp = dst;
 
        dst->private = NULL;
@@ -1156,6 +1145,14 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page
                if (current->flags & PF_MEMALLOC)
                        goto out;
 
+               /*
+                * In "light" mode, we can wait for transient locks (eg
+                * inserting a page into the page table), but it's not
+                * worth waiting for I/O.
+                */
+               if (mode == MIGRATE_SYNC_LIGHT && !folio_test_uptodate(src))
+                       goto out;
+
                folio_lock(src);
        }
        locked = true;
@@ -1251,13 +1248,13 @@ out:
                ret = NULL;
 
        migrate_folio_undo_src(src, page_was_mapped, anon_vma, locked, ret);
-       migrate_folio_undo_dst(dst, dst_locked, put_new_page, private);
+       migrate_folio_undo_dst(dst, dst_locked, put_new_folio, private);
 
        return rc;
 }
 
 /* Migrate the folio to the newly allocated folio in dst. */
-static int migrate_folio_move(free_page_t put_new_page, unsigned long private,
+static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
                              struct folio *src, struct folio *dst,
                              enum migrate_mode mode, enum migrate_reason reason,
                              struct list_head *ret)
@@ -1329,7 +1326,7 @@ out:
        }
 
        migrate_folio_undo_src(src, page_was_mapped, anon_vma, true, ret);
-       migrate_folio_undo_dst(dst, true, put_new_page, private);
+       migrate_folio_undo_dst(dst, true, put_new_folio, private);
 
        return rc;
 }
@@ -1352,16 +1349,14 @@ out:
  * because then pte is replaced with migration swap entry and direct I/O code
  * will wait in the page fault for migration to complete.
  */
-static int unmap_and_move_huge_page(new_page_t get_new_page,
-                               free_page_t put_new_page, unsigned long private,
-                               struct page *hpage, int force,
-                               enum migrate_mode mode, int reason,
-                               struct list_head *ret)
+static int unmap_and_move_huge_page(new_folio_t get_new_folio,
+               free_folio_t put_new_folio, unsigned long private,
+               struct folio *src, int force, enum migrate_mode mode,
+               int reason, struct list_head *ret)
 {
-       struct folio *dst, *src = page_folio(hpage);
+       struct folio *dst;
        int rc = -EAGAIN;
        int page_was_mapped = 0;
-       struct page *new_hpage;
        struct anon_vma *anon_vma = NULL;
        struct address_space *mapping = NULL;
 
@@ -1371,10 +1366,9 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
                return MIGRATEPAGE_SUCCESS;
        }
 
-       new_hpage = get_new_page(hpage, private);
-       if (!new_hpage)
+       dst = get_new_folio(src, private);
+       if (!dst)
                return -ENOMEM;
-       dst = page_folio(new_hpage);
 
        if (!folio_trylock(src)) {
                if (!force)
@@ -1415,7 +1409,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
                         * semaphore in write mode here and set TTU_RMAP_LOCKED
                         * to let lower levels know we have taken the lock.
                         */
-                       mapping = hugetlb_page_mapping_lock_write(hpage);
+                       mapping = hugetlb_page_mapping_lock_write(&src->page);
                        if (unlikely(!mapping))
                                goto unlock_put_anon;
 
@@ -1445,7 +1439,7 @@ put_anon:
 
        if (rc == MIGRATEPAGE_SUCCESS) {
                move_hugetlb_state(src, dst, reason);
-               put_new_page = NULL;
+               put_new_folio = NULL;
        }
 
 out_unlock:
@@ -1461,8 +1455,8 @@ out:
         * it.  Otherwise, put_page() will drop the reference grabbed during
         * isolation.
         */
-       if (put_new_page)
-               put_new_page(new_hpage, private);
+       if (put_new_folio)
+               put_new_folio(dst, private);
        else
                folio_putback_active_hugetlb(dst);
 
@@ -1509,8 +1503,8 @@ struct migrate_pages_stats {
  * exist any more. It is caller's responsibility to call putback_movable_pages()
  * only if ret != 0.
  */
-static int migrate_hugetlbs(struct list_head *from, new_page_t get_new_page,
-                           free_page_t put_new_page, unsigned long private,
+static int migrate_hugetlbs(struct list_head *from, new_folio_t get_new_folio,
+                           free_folio_t put_new_folio, unsigned long private,
                            enum migrate_mode mode, int reason,
                            struct migrate_pages_stats *stats,
                            struct list_head *ret_folios)
@@ -1548,9 +1542,9 @@ static int migrate_hugetlbs(struct list_head *from, new_page_t get_new_page,
                                continue;
                        }
 
-                       rc = unmap_and_move_huge_page(get_new_page,
-                                                     put_new_page, private,
-                                                     &folio->page, pass > 2, mode,
+                       rc = unmap_and_move_huge_page(get_new_folio,
+                                                     put_new_folio, private,
+                                                     folio, pass > 2, mode,
                                                      reason, ret_folios);
                        /*
                         * The rules are:
@@ -1607,20 +1601,17 @@ static int migrate_hugetlbs(struct list_head *from, new_page_t get_new_page,
  * deadlock (e.g., for loop device).  So, if mode != MIGRATE_ASYNC, the
  * length of the from list must be <= 1.
  */
-static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
-               free_page_t put_new_page, unsigned long private,
-               enum migrate_mode mode, int reason, struct list_head *ret_folios,
-               struct list_head *split_folios, struct migrate_pages_stats *stats,
-               int nr_pass)
+static int migrate_pages_batch(struct list_head *from,
+               new_folio_t get_new_folio, free_folio_t put_new_folio,
+               unsigned long private, enum migrate_mode mode, int reason,
+               struct list_head *ret_folios, struct list_head *split_folios,
+               struct migrate_pages_stats *stats, int nr_pass)
 {
        int retry = 1;
-       int large_retry = 1;
        int thp_retry = 1;
        int nr_failed = 0;
        int nr_retry_pages = 0;
-       int nr_large_failed = 0;
        int pass = 0;
-       bool is_large = false;
        bool is_thp = false;
        struct folio *folio, *folio2, *dst = NULL, *dst2;
        int rc, rc_saved = 0, nr_pages;
@@ -1631,20 +1622,13 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
        VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
                        !list_empty(from) && !list_is_singular(from));
 
-       for (pass = 0; pass < nr_pass && (retry || large_retry); pass++) {
+       for (pass = 0; pass < nr_pass && retry; pass++) {
                retry = 0;
-               large_retry = 0;
                thp_retry = 0;
                nr_retry_pages = 0;
 
                list_for_each_entry_safe(folio, folio2, from, lru) {
-                       /*
-                        * Large folio statistics is based on the source large
-                        * folio. Capture required information that might get
-                        * lost during migration.
-                        */
-                       is_large = folio_test_large(folio);
-                       is_thp = is_large && folio_test_pmd_mappable(folio);
+                       is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
                        nr_pages = folio_nr_pages(folio);
 
                        cond_resched();
@@ -1660,7 +1644,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
                         * list is processed.
                         */
                        if (!thp_migration_supported() && is_thp) {
-                               nr_large_failed++;
+                               nr_failed++;
                                stats->nr_thp_failed++;
                                if (!try_split_folio(folio, split_folios)) {
                                        stats->nr_thp_split++;
@@ -1671,8 +1655,9 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
                                continue;
                        }
 
-                       rc = migrate_folio_unmap(get_new_page, put_new_page, private,
-                                                folio, &dst, mode, reason, ret_folios);
+                       rc = migrate_folio_unmap(get_new_folio, put_new_folio,
+                                       private, folio, &dst, mode, reason,
+                                       ret_folios);
                        /*
                         * The rules are:
                         *      Success: folio will be freed
@@ -1688,38 +1673,33 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
                                 * When memory is low, don't bother to try to migrate
                                 * other folios, move unmapped folios, then exit.
                                 */
-                               if (is_large) {
-                                       nr_large_failed++;
-                                       stats->nr_thp_failed += is_thp;
-                                       /* Large folio NUMA faulting doesn't split to retry. */
-                                       if (!nosplit) {
-                                               int ret = try_split_folio(folio, split_folios);
-
-                                               if (!ret) {
-                                                       stats->nr_thp_split += is_thp;
-                                                       break;
-                                               } else if (reason == MR_LONGTERM_PIN &&
-                                                          ret == -EAGAIN) {
-                                                       /*
-                                                        * Try again to split large folio to
-                                                        * mitigate the failure of longterm pinning.
-                                                        */
-                                                       large_retry++;
-                                                       thp_retry += is_thp;
-                                                       nr_retry_pages += nr_pages;
-                                                       /* Undo duplicated failure counting. */
-                                                       nr_large_failed--;
-                                                       stats->nr_thp_failed -= is_thp;
-                                                       break;
-                                               }
+                               nr_failed++;
+                               stats->nr_thp_failed += is_thp;
+                               /* Large folio NUMA faulting doesn't split to retry. */
+                               if (folio_test_large(folio) && !nosplit) {
+                                       int ret = try_split_folio(folio, split_folios);
+
+                                       if (!ret) {
+                                               stats->nr_thp_split += is_thp;
+                                               break;
+                                       } else if (reason == MR_LONGTERM_PIN &&
+                                                  ret == -EAGAIN) {
+                                               /*
+                                                * Try again to split large folio to
+                                                * mitigate the failure of longterm pinning.
+                                                */
+                                               retry++;
+                                               thp_retry += is_thp;
+                                               nr_retry_pages += nr_pages;
+                                               /* Undo duplicated failure counting. */
+                                               nr_failed--;
+                                               stats->nr_thp_failed -= is_thp;
+                                               break;
                                        }
-                               } else {
-                                       nr_failed++;
                                }
 
                                stats->nr_failed_pages += nr_pages + nr_retry_pages;
                                /* nr_failed isn't updated for not used */
-                               nr_large_failed += large_retry;
                                stats->nr_thp_failed += thp_retry;
                                rc_saved = rc;
                                if (list_empty(&unmap_folios))
@@ -1727,12 +1707,8 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
                                else
                                        goto move;
                        case -EAGAIN:
-                               if (is_large) {
-                                       large_retry++;
-                                       thp_retry += is_thp;
-                               } else {
-                                       retry++;
-                               }
+                               retry++;
+                               thp_retry += is_thp;
                                nr_retry_pages += nr_pages;
                                break;
                        case MIGRATEPAGE_SUCCESS:
@@ -1750,20 +1726,14 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page,
                                 * removed from migration folio list and not
                                 * retried in the next outer loop.
                                 */
-                               if (is_large) {
-                                       nr_large_failed++;
-                                       stats->nr_thp_failed += is_thp;
-                               } else {
-                                       nr_failed++;
-                               }
-
+                               nr_failed++;
+                               stats->nr_thp_failed += is_thp;
                                stats->nr_failed_pages += nr_pages;
                                break;
                        }
                }
        }
        nr_failed += retry;
-       nr_large_failed += large_retry;
        stats->nr_thp_failed += thp_retry;
        stats->nr_failed_pages += nr_retry_pages;
 move:
@@ -1771,22 +1741,20 @@ move:
        try_to_unmap_flush();
 
        retry = 1;
-       for (pass = 0; pass < nr_pass && (retry || large_retry); pass++) {
+       for (pass = 0; pass < nr_pass && retry; pass++) {
                retry = 0;
-               large_retry = 0;
                thp_retry = 0;
                nr_retry_pages = 0;
 
                dst = list_first_entry(&dst_folios, struct folio, lru);
                dst2 = list_next_entry(dst, lru);
                list_for_each_entry_safe(folio, folio2, &unmap_folios, lru) {
-                       is_large = folio_test_large(folio);
-                       is_thp = is_large && folio_test_pmd_mappable(folio);
+                       is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
                        nr_pages = folio_nr_pages(folio);
 
                        cond_resched();
 
-                       rc = migrate_folio_move(put_new_page, private,
+                       rc = migrate_folio_move(put_new_folio, private,
                                                folio, dst, mode,
                                                reason, ret_folios);
                        /*
@@ -1797,12 +1765,8 @@ move:
                         */
                        switch(rc) {
                        case -EAGAIN:
-                               if (is_large) {
-                                       large_retry++;
-                                       thp_retry += is_thp;
-                               } else {
-                                       retry++;
-                               }
+                               retry++;
+                               thp_retry += is_thp;
                                nr_retry_pages += nr_pages;
                                break;
                        case MIGRATEPAGE_SUCCESS:
@@ -1810,13 +1774,8 @@ move:
                                stats->nr_thp_succeeded += is_thp;
                                break;
                        default:
-                               if (is_large) {
-                                       nr_large_failed++;
-                                       stats->nr_thp_failed += is_thp;
-                               } else {
-                                       nr_failed++;
-                               }
-
+                               nr_failed++;
+                               stats->nr_thp_failed += is_thp;
                                stats->nr_failed_pages += nr_pages;
                                break;
                        }
@@ -1825,14 +1784,10 @@ move:
                }
        }
        nr_failed += retry;
-       nr_large_failed += large_retry;
        stats->nr_thp_failed += thp_retry;
        stats->nr_failed_pages += nr_retry_pages;
 
-       if (rc_saved)
-               rc = rc_saved;
-       else
-               rc = nr_failed + nr_large_failed;
+       rc = rc_saved ? : nr_failed;
 out:
        /* Cleanup remaining folios */
        dst = list_first_entry(&dst_folios, struct folio, lru);
@@ -1845,7 +1800,7 @@ out:
                migrate_folio_undo_src(folio, page_was_mapped, anon_vma,
                                       true, ret_folios);
                list_del(&dst->lru);
-               migrate_folio_undo_dst(dst, true, put_new_page, private);
+               migrate_folio_undo_dst(dst, true, put_new_folio, private);
                dst = dst2;
                dst2 = list_next_entry(dst, lru);
        }
@@ -1853,10 +1808,11 @@ out:
        return rc;
 }
 
-static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
-               free_page_t put_new_page, unsigned long private,
-               enum migrate_mode mode, int reason, struct list_head *ret_folios,
-               struct list_head *split_folios, struct migrate_pages_stats *stats)
+static int migrate_pages_sync(struct list_head *from, new_folio_t get_new_folio,
+               free_folio_t put_new_folio, unsigned long private,
+               enum migrate_mode mode, int reason,
+               struct list_head *ret_folios, struct list_head *split_folios,
+               struct migrate_pages_stats *stats)
 {
        int rc, nr_failed = 0;
        LIST_HEAD(folios);
@@ -1864,7 +1820,7 @@ static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
 
        memset(&astats, 0, sizeof(astats));
        /* Try to migrate in batch with MIGRATE_ASYNC mode firstly */
-       rc = migrate_pages_batch(from, get_new_page, put_new_page, private, MIGRATE_ASYNC,
+       rc = migrate_pages_batch(from, get_new_folio, put_new_folio, private, MIGRATE_ASYNC,
                                 reason, &folios, split_folios, &astats,
                                 NR_MAX_MIGRATE_ASYNC_RETRY);
        stats->nr_succeeded += astats.nr_succeeded;
@@ -1886,7 +1842,7 @@ static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
        list_splice_tail_init(&folios, from);
        while (!list_empty(from)) {
                list_move(from->next, &folios);
-               rc = migrate_pages_batch(&folios, get_new_page, put_new_page,
+               rc = migrate_pages_batch(&folios, get_new_folio, put_new_folio,
                                         private, mode, reason, ret_folios,
                                         split_folios, stats, NR_MAX_MIGRATE_SYNC_RETRY);
                list_splice_tail_init(&folios, ret_folios);
@@ -1903,11 +1859,11 @@ static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
  *                supplied as the target for the page migration
  *
  * @from:              The list of folios to be migrated.
- * @get_new_page:      The function used to allocate free folios to be used
+ * @get_new_folio:     The function used to allocate free folios to be used
  *                     as the target of the folio migration.
- * @put_new_page:      The function used to free target folios if migration
+ * @put_new_folio:     The function used to free target folios if migration
  *                     fails, or NULL if no special handling is necessary.
- * @private:           Private data to be passed on to get_new_page()
+ * @private:           Private data to be passed on to get_new_folio()
  * @mode:              The migration mode that specifies the constraints for
  *                     folio migration, if any.
  * @reason:            The reason for folio migration.
@@ -1924,8 +1880,8 @@ static int migrate_pages_sync(struct list_head *from, new_page_t get_new_page,
  * considered as the number of non-migrated large folio, no matter how many
  * split folios of the large folio are migrated successfully.
  */
-int migrate_pages(struct list_head *from, new_page_t get_new_page,
-               free_page_t put_new_page, unsigned long private,
+int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
+               free_folio_t put_new_folio, unsigned long private,
                enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
 {
        int rc, rc_gather;
@@ -1940,7 +1896,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 
        memset(&stats, 0, sizeof(stats));
 
-       rc_gather = migrate_hugetlbs(from, get_new_page, put_new_page, private,
+       rc_gather = migrate_hugetlbs(from, get_new_folio, put_new_folio, private,
                                     mode, reason, &stats, &ret_folios);
        if (rc_gather < 0)
                goto out;
@@ -1963,12 +1919,14 @@ again:
        else
                list_splice_init(from, &folios);
        if (mode == MIGRATE_ASYNC)
-               rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
-                                        mode, reason, &ret_folios, &split_folios, &stats,
-                                        NR_MAX_MIGRATE_PAGES_RETRY);
+               rc = migrate_pages_batch(&folios, get_new_folio, put_new_folio,
+                               private, mode, reason, &ret_folios,
+                               &split_folios, &stats,
+                               NR_MAX_MIGRATE_PAGES_RETRY);
        else
-               rc = migrate_pages_sync(&folios, get_new_page, put_new_page, private,
-                                       mode, reason, &ret_folios, &split_folios, &stats);
+               rc = migrate_pages_sync(&folios, get_new_folio, put_new_folio,
+                               private, mode, reason, &ret_folios,
+                               &split_folios, &stats);
        list_splice_tail_init(&folios, &ret_folios);
        if (rc < 0) {
                rc_gather = rc;
@@ -1981,8 +1939,9 @@ again:
                 * is counted as 1 failure already.  And, we only try to migrate
                 * with minimal effort, force MIGRATE_ASYNC mode and retry once.
                 */
-               migrate_pages_batch(&split_folios, get_new_page, put_new_page, private,
-                                   MIGRATE_ASYNC, reason, &ret_folios, NULL, &stats, 1);
+               migrate_pages_batch(&split_folios, get_new_folio,
+                               put_new_folio, private, MIGRATE_ASYNC, reason,
+                               &ret_folios, NULL, &stats, 1);
                list_splice_tail_init(&split_folios, &ret_folios);
        }
        rc_gather += rc;
@@ -2017,14 +1976,11 @@ out:
        return rc_gather;
 }
 
-struct page *alloc_migration_target(struct page *page, unsigned long private)
+struct folio *alloc_migration_target(struct folio *src, unsigned long private)
 {
-       struct folio *folio = page_folio(page);
        struct migration_target_control *mtc;
        gfp_t gfp_mask;
        unsigned int order = 0;
-       struct folio *hugetlb_folio = NULL;
-       struct folio *new_folio = NULL;
        int nid;
        int zidx;
 
@@ -2032,33 +1988,30 @@ struct page *alloc_migration_target(struct page *page, unsigned long private)
        gfp_mask = mtc->gfp_mask;
        nid = mtc->nid;
        if (nid == NUMA_NO_NODE)
-               nid = folio_nid(folio);
+               nid = folio_nid(src);
 
-       if (folio_test_hugetlb(folio)) {
-               struct hstate *h = folio_hstate(folio);
+       if (folio_test_hugetlb(src)) {
+               struct hstate *h = folio_hstate(src);
 
                gfp_mask = htlb_modify_alloc_mask(h, gfp_mask);
-               hugetlb_folio = alloc_hugetlb_folio_nodemask(h, nid,
+               return alloc_hugetlb_folio_nodemask(h, nid,
                                                mtc->nmask, gfp_mask);
-               return &hugetlb_folio->page;
        }
 
-       if (folio_test_large(folio)) {
+       if (folio_test_large(src)) {
                /*
                 * clear __GFP_RECLAIM to make the migration callback
                 * consistent with regular THP allocations.
                 */
                gfp_mask &= ~__GFP_RECLAIM;
                gfp_mask |= GFP_TRANSHUGE;
-               order = folio_order(folio);
+               order = folio_order(src);
        }
-       zidx = zone_idx(folio_zone(folio));
+       zidx = zone_idx(folio_zone(src));
        if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
                gfp_mask |= __GFP_HIGHMEM;
 
-       new_folio = __folio_alloc(gfp_mask, order, nid, mtc->nmask);
-
-       return &new_folio->page;
+       return __folio_alloc(gfp_mask, order, nid, mtc->nmask);
 }
 
 #ifdef CONFIG_NUMA
@@ -2509,13 +2462,12 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
        return false;
 }
 
-static struct page *alloc_misplaced_dst_page(struct page *page,
+static struct folio *alloc_misplaced_dst_folio(struct folio *src,
                                           unsigned long data)
 {
        int nid = (int) data;
-       int order = compound_order(page);
+       int order = folio_order(src);
        gfp_t gfp = __GFP_THISNODE;
-       struct folio *new;
 
        if (order > 0)
                gfp |= GFP_TRANSHUGE_LIGHT;
@@ -2524,9 +2476,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
                        __GFP_NOWARN;
                gfp &= ~__GFP_RECLAIM;
        }
-       new = __folio_alloc_node(gfp, order, nid);
-
-       return &new->page;
+       return __folio_alloc_node(gfp, order, nid);
 }
 
 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
@@ -2604,7 +2554,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
                goto out;
 
        list_add(&page->lru, &migratepages);
-       nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
+       nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio,
                                     NULL, node, MIGRATE_ASYNC,
                                     MR_NUMA_MISPLACED, &nr_succeeded);
        if (nr_remaining) {
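The mm/migrate.c hunks above convert migrate_pages() and its helpers from page-based callbacks (new_page_t/free_page_t) to folio-based ones (new_folio_t/free_folio_t), so allocation callbacks now take and return struct folio pointers directly. A minimal caller-side sketch of the new interface follows; my_alloc_dst_folio() and my_migrate_list() are hypothetical names used only for illustration, and only the callback signature and the migrate_pages() prototype are taken from the hunks above.

/*
 * Sketch only: a real callback would mirror alloc_migration_target() and
 * adjust gfp flags for hugetlb/THP sources; here we simply allocate a
 * folio of the same order on the requested node.
 */
static struct folio *my_alloc_dst_folio(struct folio *src, unsigned long private)
{
	int nid = (int)private;

	return __folio_alloc_node(GFP_HIGHUSER_MOVABLE, folio_order(src), nid);
}

static int my_migrate_list(struct list_head *folios, int nid)
{
	unsigned int nr_succeeded = 0;

	/* NULL put_new_folio: unconsumed destination folios are dropped with folio_put(). */
	return migrate_pages(folios, my_alloc_dst_folio, NULL,
			     (unsigned long)nid, MIGRATE_SYNC,
			     MR_SYSCALL, &nr_succeeded);
}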
index d30c9de..8365158 100644
@@ -83,9 +83,6 @@ again:
                if (is_huge_zero_page(page)) {
                        spin_unlock(ptl);
                        split_huge_pmd(vma, pmdp, addr);
-                       if (pmd_trans_unstable(pmdp))
-                               return migrate_vma_collect_skip(start, end,
-                                                               walk);
                } else {
                        int ret;
 
@@ -100,16 +97,12 @@ again:
                        if (ret)
                                return migrate_vma_collect_skip(start, end,
                                                                walk);
-                       if (pmd_none(*pmdp))
-                               return migrate_vma_collect_hole(start, end, -1,
-                                                               walk);
                }
        }
 
-       if (unlikely(pmd_bad(*pmdp)))
-               return migrate_vma_collect_skip(start, end, walk);
-
        ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+       if (!ptep)
+               goto again;
        arch_enter_lazy_mmu_mode();
 
        for (; addr < end; addr += PAGE_SIZE, ptep++) {
@@ -118,7 +111,7 @@ again:
                swp_entry_t entry;
                pte_t pte;
 
-               pte = *ptep;
+               pte = ptep_get(ptep);
 
                if (pte_none(pte)) {
                        if (vma_is_anonymous(vma)) {
@@ -201,7 +194,7 @@ again:
                        bool anon_exclusive;
                        pte_t swp_pte;
 
-                       flush_cache_page(vma, addr, pte_pfn(*ptep));
+                       flush_cache_page(vma, addr, pte_pfn(pte));
                        anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
                        if (anon_exclusive) {
                                pte = ptep_clear_flush(vma, addr, ptep);
@@ -383,7 +376,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
                /* ZONE_DEVICE pages are not on LRU */
                if (!is_zone_device_page(page)) {
                        if (!PageLRU(page) && allow_drain) {
-                               /* Drain CPU's pagevec */
+                               /* Drain CPU's lru cache */
                                lru_add_drain_all();
                                allow_drain = false;
                        }
@@ -580,6 +573,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
        pud_t *pudp;
        pmd_t *pmdp;
        pte_t *ptep;
+       pte_t orig_pte;
 
        /* Only allow populating anonymous memory */
        if (!vma_is_anonymous(vma))
@@ -595,27 +589,10 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
        pmdp = pmd_alloc(mm, pudp, addr);
        if (!pmdp)
                goto abort;
-
        if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
                goto abort;
-
-       /*
-        * Use pte_alloc() instead of pte_alloc_map().  We can't run
-        * pte_offset_map() on pmds where a huge pmd might be created
-        * from a different thread.
-        *
-        * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
-        * parallel threads are excluded by other means.
-        *
-        * Here we only have mmap_read_lock(mm).
-        */
        if (pte_alloc(mm, pmdp))
                goto abort;
-
-       /* See the comment in pte_alloc_one_map() */
-       if (unlikely(pmd_trans_unstable(pmdp)))
-               goto abort;
-
        if (unlikely(anon_vma_prepare(vma)))
                goto abort;
        if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
@@ -650,17 +627,20 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
        }
 
        ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+       if (!ptep)
+               goto abort;
+       orig_pte = ptep_get(ptep);
 
        if (check_stable_address_space(mm))
                goto unlock_abort;
 
-       if (pte_present(*ptep)) {
-               unsigned long pfn = pte_pfn(*ptep);
+       if (pte_present(orig_pte)) {
+               unsigned long pfn = pte_pfn(orig_pte);
 
                if (!is_zero_pfn(pfn))
                        goto unlock_abort;
                flush = true;
-       } else if (!pte_none(*ptep))
+       } else if (!pte_none(orig_pte))
                goto unlock_abort;
 
        /*
@@ -677,7 +657,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
        get_page(page);
 
        if (flush) {
-               flush_cache_page(vma, addr, pte_pfn(*ptep));
+               flush_cache_page(vma, addr, pte_pfn(orig_pte));
                ptep_clear_flush_notify(vma, addr, ptep);
                set_pte_at_notify(mm, addr, ptep, entry);
                update_mmu_cache(vma, addr, ptep);
index 2d5be01..b7f7a51 100644
@@ -113,14 +113,13 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
                goto out;
        }
 
-       if (pmd_trans_unstable(pmd)) {
-               __mincore_unmapped_range(addr, end, vma, vec);
-               goto out;
-       }
-
        ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+       if (!ptep) {
+               walk->action = ACTION_AGAIN;
+               return 0;
+       }
        for (; addr != end; ptep++, addr += PAGE_SIZE) {
-               pte_t pte = *ptep;
+               pte_t pte = ptep_get(ptep);
 
                /* We need to do cache lookup too for pte markers */
                if (pte_none_mostly(pte))
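Several hunks above (migrate_vma_collect_pmd(), migrate_vma_insert_page(), mincore_pte_range()) follow the same new convention: pte_offset_map_lock() may now return NULL when the PMD has changed under the caller (for example after a concurrent THP collapse), so the pmd_trans_unstable()/pmd_bad() pre-checks go away, the failure is handled explicitly, and PTE values are read with ptep_get() instead of being dereferenced directly. A minimal sketch of that pattern in a pagewalk-style callback, assuming a hypothetical my_pte_range() helper:

static int my_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
			struct mm_walk *walk)
{
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
	if (!pte) {
		/* PMD changed under us (e.g. THP collapse): redo this range */
		walk->action = ACTION_AGAIN;
		return 0;
	}

	for (; addr != end; pte++, addr += PAGE_SIZE) {
		pte_t ptent = ptep_get(pte);	/* no direct *pte dereference */

		if (pte_none(ptent))
			continue;
		/* ... per-PTE work ... */
	}

	pte_unmap_unlock(pte - 1, ptl);
	return 0;
}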
index 40b43f8..d7db945 100644
@@ -312,6 +312,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
        struct vm_area_struct *vma = walk->vma;
        spinlock_t *ptl;
        pte_t *start_pte, *pte;
+       pte_t ptent;
        struct folio *folio;
 
        ptl = pmd_trans_huge_lock(pmd, vma);
@@ -329,10 +330,15 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
        }
 
        start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+       if (!start_pte) {
+               walk->action = ACTION_AGAIN;
+               return 0;
+       }
        for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
-               if (!pte_present(*pte))
+               ptent = ptep_get(pte);
+               if (!pte_present(ptent))
                        continue;
-               folio = vm_normal_folio(vma, addr, *pte);
+               folio = vm_normal_folio(vma, addr, ptent);
                if (!folio || folio_is_zone_device(folio))
                        continue;
                if (folio_test_large(folio))
index 7f7f9c6..a1963c3 100644
@@ -259,6 +259,8 @@ static int __init cmdline_parse_core(char *p, unsigned long *core,
        return 0;
 }
 
+bool mirrored_kernelcore __initdata_memblock;
+
 /*
  * kernelcore=size sets the amount of memory for use for allocations that
  * cannot be reclaimed or migrated.
@@ -644,10 +646,8 @@ static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
 }
 
 /* Returns true if the struct page for the pfn is initialised */
-static inline bool __meminit early_page_initialised(unsigned long pfn)
+static inline bool __meminit early_page_initialised(unsigned long pfn, int nid)
 {
-       int nid = early_pfn_to_nid(pfn);
-
        if (node_online(nid) && pfn >= NODE_DATA(nid)->first_deferred_pfn)
                return false;
 
@@ -693,15 +693,14 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
        return false;
 }
 
-static void __meminit init_reserved_page(unsigned long pfn)
+static void __meminit init_reserved_page(unsigned long pfn, int nid)
 {
        pg_data_t *pgdat;
-       int nid, zid;
+       int zid;
 
-       if (early_page_initialised(pfn))
+       if (early_page_initialised(pfn, nid))
                return;
 
-       nid = early_pfn_to_nid(pfn);
        pgdat = NODE_DATA(nid);
 
        for (zid = 0; zid < MAX_NR_ZONES; zid++) {
@@ -715,7 +714,7 @@ static void __meminit init_reserved_page(unsigned long pfn)
 #else
 static inline void pgdat_set_deferred_range(pg_data_t *pgdat) {}
 
-static inline bool early_page_initialised(unsigned long pfn)
+static inline bool early_page_initialised(unsigned long pfn, int nid)
 {
        return true;
 }
@@ -725,7 +724,7 @@ static inline bool defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
        return false;
 }
 
-static inline void init_reserved_page(unsigned long pfn)
+static inline void init_reserved_page(unsigned long pfn, int nid)
 {
 }
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
@@ -736,7 +735,8 @@ static inline void init_reserved_page(unsigned long pfn)
  * marks the pages PageReserved. The remaining valid pages are later
  * sent to the buddy page allocator.
  */
-void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
+void __meminit reserve_bootmem_region(phys_addr_t start,
+                                     phys_addr_t end, int nid)
 {
        unsigned long start_pfn = PFN_DOWN(start);
        unsigned long end_pfn = PFN_UP(end);
@@ -745,7 +745,7 @@ void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
                if (pfn_valid(start_pfn)) {
                        struct page *page = pfn_to_page(start_pfn);
 
-                       init_reserved_page(start_pfn);
+                       init_reserved_page(start_pfn, nid);
 
                        /* Avoid false-positive PageTail() */
                        INIT_LIST_HEAD(&page->lru);
@@ -1166,24 +1166,15 @@ unsigned long __init absent_pages_in_range(unsigned long start_pfn,
 /* Return the number of page frames in holes in a zone on a node */
 static unsigned long __init zone_absent_pages_in_node(int nid,
                                        unsigned long zone_type,
-                                       unsigned long node_start_pfn,
-                                       unsigned long node_end_pfn)
+                                       unsigned long zone_start_pfn,
+                                       unsigned long zone_end_pfn)
 {
-       unsigned long zone_low = arch_zone_lowest_possible_pfn[zone_type];
-       unsigned long zone_high = arch_zone_highest_possible_pfn[zone_type];
-       unsigned long zone_start_pfn, zone_end_pfn;
        unsigned long nr_absent;
 
-       /* When hotadd a new node from cpu_up(), the node should be empty */
-       if (!node_start_pfn && !node_end_pfn)
+       /* zone is empty, we don't have any absent pages */
+       if (zone_start_pfn == zone_end_pfn)
                return 0;
 
-       zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high);
-       zone_end_pfn = clamp(node_end_pfn, zone_low, zone_high);
-
-       adjust_zone_range_for_zone_movable(nid, zone_type,
-                       node_start_pfn, node_end_pfn,
-                       &zone_start_pfn, &zone_end_pfn);
        nr_absent = __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
 
        /*
@@ -1227,9 +1218,6 @@ static unsigned long __init zone_spanned_pages_in_node(int nid,
 {
        unsigned long zone_low = arch_zone_lowest_possible_pfn[zone_type];
        unsigned long zone_high = arch_zone_highest_possible_pfn[zone_type];
-       /* When hotadd a new node from cpu_up(), the node should be empty */
-       if (!node_start_pfn && !node_end_pfn)
-               return 0;
 
        /* Get the start and end of the zone */
        *zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high);
@@ -1250,6 +1238,24 @@ static unsigned long __init zone_spanned_pages_in_node(int nid,
        return *zone_end_pfn - *zone_start_pfn;
 }
 
+static void __init reset_memoryless_node_totalpages(struct pglist_data *pgdat)
+{
+       struct zone *z;
+
+       for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++) {
+               z->zone_start_pfn = 0;
+               z->spanned_pages = 0;
+               z->present_pages = 0;
+#if defined(CONFIG_MEMORY_HOTPLUG)
+               z->present_early_pages = 0;
+#endif
+       }
+
+       pgdat->node_spanned_pages = 0;
+       pgdat->node_present_pages = 0;
+       pr_debug("On node %d totalpages: 0\n", pgdat->node_id);
+}
+
 static void __init calculate_node_totalpages(struct pglist_data *pgdat,
                                                unsigned long node_start_pfn,
                                                unsigned long node_end_pfn)
@@ -1261,7 +1267,7 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
                struct zone *zone = pgdat->node_zones + i;
                unsigned long zone_start_pfn, zone_end_pfn;
                unsigned long spanned, absent;
-               unsigned long size, real_size;
+               unsigned long real_size;
 
                spanned = zone_spanned_pages_in_node(pgdat->node_id, i,
                                                     node_start_pfn,
@@ -1269,23 +1275,22 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
                                                     &zone_start_pfn,
                                                     &zone_end_pfn);
                absent = zone_absent_pages_in_node(pgdat->node_id, i,
-                                                  node_start_pfn,
-                                                  node_end_pfn);
+                                                  zone_start_pfn,
+                                                  zone_end_pfn);
 
-               size = spanned;
-               real_size = size - absent;
+               real_size = spanned - absent;
 
-               if (size)
+               if (spanned)
                        zone->zone_start_pfn = zone_start_pfn;
                else
                        zone->zone_start_pfn = 0;
-               zone->spanned_pages = size;
+               zone->spanned_pages = spanned;
                zone->present_pages = real_size;
 #if defined(CONFIG_MEMORY_HOTPLUG)
                zone->present_early_pages = real_size;
 #endif
 
-               totalpages += size;
+               totalpages += spanned;
                realtotalpages += real_size;
        }
 
@@ -1375,6 +1380,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
                INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
                zone->free_area[order].nr_free = 0;
        }
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+       INIT_LIST_HEAD(&zone->unaccepted_pages);
+#endif
 }
 
 void __meminit init_currently_empty_zone(struct zone *zone,
@@ -1502,6 +1511,8 @@ void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
        pgdat->kswapd_order = 0;
        pgdat->kswapd_highest_zoneidx = 0;
        pgdat->node_start_pfn = 0;
+       pgdat->node_present_pages = 0;
+
        for_each_online_cpu(cpu) {
                struct per_cpu_nodestat *p;
 
@@ -1509,8 +1520,17 @@ void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
                memset(p, 0, sizeof(*p));
        }
 
-       for (z = 0; z < MAX_NR_ZONES; z++)
-               zone_init_internals(&pgdat->node_zones[z], z, nid, 0);
+       /*
+        * When memory is hot-added, all the memory is in offline state. So
+        * clear all zones' present_pages and managed_pages because they will
+        * be updated in online_pages() and offline_pages().
+        */
+       for (z = 0; z < MAX_NR_ZONES; z++) {
+               struct zone *zone = pgdat->node_zones + z;
+
+               zone->present_pages = 0;
+               zone_init_internals(zone, z, nid, 0);
+       }
 }
 #endif
 
@@ -1578,7 +1598,6 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
                if (!size)
                        continue;
 
-               set_pageblock_order();
                setup_usemap(zone);
                init_currently_empty_zone(zone, zone->zone_start_pfn, size);
        }
@@ -1702,11 +1721,13 @@ static void __init free_area_init_node(int nid)
                pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
                        (u64)start_pfn << PAGE_SHIFT,
                        end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
+
+               calculate_node_totalpages(pgdat, start_pfn, end_pfn);
        } else {
                pr_info("Initmem setup node %d as memoryless\n", nid);
-       }
 
-       calculate_node_totalpages(pgdat, start_pfn, end_pfn);
+               reset_memoryless_node_totalpages(pgdat);
+       }
 
        alloc_node_mem_map(pgdat);
        pgdat_set_deferred_range(pgdat);
@@ -1716,7 +1737,7 @@ static void __init free_area_init_node(int nid)
 }
 
 /* Any regular or high memory on that node ? */
-static void check_for_memory(pg_data_t *pgdat, int nid)
+static void check_for_memory(pg_data_t *pgdat)
 {
        enum zone_type zone_type;
 
@@ -1724,9 +1745,9 @@ static void check_for_memory(pg_data_t *pgdat, int nid)
                struct zone *zone = &pgdat->node_zones[zone_type];
                if (populated_zone(zone)) {
                        if (IS_ENABLED(CONFIG_HIGHMEM))
-                               node_set_state(nid, N_HIGH_MEMORY);
+                               node_set_state(pgdat->node_id, N_HIGH_MEMORY);
                        if (zone_type <= ZONE_NORMAL)
-                               node_set_state(nid, N_NORMAL_MEMORY);
+                               node_set_state(pgdat->node_id, N_NORMAL_MEMORY);
                        break;
                }
        }
@@ -1745,11 +1766,6 @@ void __init setup_nr_node_ids(void)
 }
 #endif
 
-static void __init free_area_init_memoryless_node(int nid)
-{
-       free_area_init_node(nid);
-}
-
 /*
  * Some architectures, e.g. ARC may have ZONE_HIGHMEM below ZONE_NORMAL. For
  * such cases we allow max_zone_pfn sorted in the descending order
@@ -1848,6 +1864,8 @@ void __init free_area_init(unsigned long *max_zone_pfn)
        /* Initialise every node */
        mminit_verify_pageflags_layout();
        setup_nr_node_ids();
+       set_pageblock_order();
+
        for_each_node(nid) {
                pg_data_t *pgdat;
 
@@ -1860,7 +1878,7 @@ void __init free_area_init(unsigned long *max_zone_pfn)
                                panic("Cannot allocate %zuB for node %d.\n",
                                       sizeof(*pgdat), nid);
                        arch_refresh_nodedata(nid, pgdat);
-                       free_area_init_memoryless_node(nid);
+                       free_area_init_node(nid);
 
                        /*
                         * We do not want to confuse userspace by sysfs
@@ -1881,7 +1899,7 @@ void __init free_area_init(unsigned long *max_zone_pfn)
                /* Any memory on that node */
                if (pgdat->node_present_pages)
                        node_set_state(nid, N_MEMORY);
-               check_for_memory(pgdat, nid);
+               check_for_memory(pgdat);
        }
 
        memmap_init();
@@ -1960,6 +1978,9 @@ static void __init deferred_free_range(unsigned long pfn,
                return;
        }
 
+       /* Accept chunks smaller than MAX_ORDER upfront */
+       accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
+
        for (i = 0; i < nr_pages; i++, page++, pfn++) {
                if (pageblock_aligned(pfn))
                        set_pageblock_migratetype(page, MIGRATE_MOVABLE);
@@ -2328,6 +2349,28 @@ void __init init_cma_reserved_pageblock(struct page *page)
 }
 #endif
 
+void set_zone_contiguous(struct zone *zone)
+{
+       unsigned long block_start_pfn = zone->zone_start_pfn;
+       unsigned long block_end_pfn;
+
+       block_end_pfn = pageblock_end_pfn(block_start_pfn);
+       for (; block_start_pfn < zone_end_pfn(zone);
+                       block_start_pfn = block_end_pfn,
+                        block_end_pfn += pageblock_nr_pages) {
+
+               block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
+
+               if (!__pageblock_pfn_to_page(block_start_pfn,
+                                            block_end_pfn, zone))
+                       return;
+               cond_resched();
+       }
+
+       /* We confirm that there is no hole */
+       zone->contiguous = true;
+}
+
 void __init page_alloc_init_late(void)
 {
        struct zone *zone;
@@ -2368,6 +2411,8 @@ void __init page_alloc_init_late(void)
        /* Initialize page ext after all struct pages are initialized. */
        if (deferred_struct_pages)
                page_ext_init();
+
+       page_alloc_sysctl_init();
 }
 
 #ifndef __HAVE_ARCH_RESERVED_KERNEL_PAGES
@@ -2532,8 +2577,14 @@ void __init set_dma_reserve(unsigned long new_dma_reserve)
 void __init memblock_free_pages(struct page *page, unsigned long pfn,
                                                        unsigned int order)
 {
-       if (!early_page_initialised(pfn))
-               return;
+
+       if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT)) {
+               int nid = early_pfn_to_nid(pfn);
+
+               if (!early_page_initialised(pfn, nid))
+                       return;
+       }
+
        if (!kmsan_memblock_free_pages(page, order)) {
                /* KMSAN will take care of these pages. */
                return;
@@ -2541,6 +2592,12 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
        __free_pages_core(page, order);
 }
 
+DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);
+EXPORT_SYMBOL(init_on_alloc);
+
+DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_FREE_DEFAULT_ON, init_on_free);
+EXPORT_SYMBOL(init_on_free);
+
 static bool _init_on_alloc_enabled_early __read_mostly
                                = IS_ENABLED(CONFIG_INIT_ON_ALLOC_DEFAULT_ON);
 static int __init early_init_on_alloc(char *buf)
index d600404..8f1000b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -182,7 +182,8 @@ static int check_brk_limits(unsigned long addr, unsigned long len)
        if (IS_ERR_VALUE(mapped_addr))
                return mapped_addr;
 
-       return mlock_future_check(current->mm, current->mm->def_flags, len);
+       return mlock_future_ok(current->mm, current->mm->def_flags, len)
+               ? 0 : -EAGAIN;
 }
 static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *brkvma,
                unsigned long addr, unsigned long request, unsigned long flags);
@@ -300,61 +301,40 @@ out:
 }
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
-extern void mt_validate(struct maple_tree *mt);
-extern void mt_dump(const struct maple_tree *mt);
-
-/* Validate the maple tree */
-static void validate_mm_mt(struct mm_struct *mm)
-{
-       struct maple_tree *mt = &mm->mm_mt;
-       struct vm_area_struct *vma_mt;
-
-       MA_STATE(mas, mt, 0, 0);
-
-       mt_validate(&mm->mm_mt);
-       mas_for_each(&mas, vma_mt, ULONG_MAX) {
-               if ((vma_mt->vm_start != mas.index) ||
-                   (vma_mt->vm_end - 1 != mas.last)) {
-                       pr_emerg("issue in %s\n", current->comm);
-                       dump_stack();
-                       dump_vma(vma_mt);
-                       pr_emerg("mt piv: %p %lu - %lu\n", vma_mt,
-                                mas.index, mas.last);
-                       pr_emerg("mt vma: %p %lu - %lu\n", vma_mt,
-                                vma_mt->vm_start, vma_mt->vm_end);
-
-                       mt_dump(mas.tree);
-                       if (vma_mt->vm_end != mas.last + 1) {
-                               pr_err("vma: %p vma_mt %lu-%lu\tmt %lu-%lu\n",
-                                               mm, vma_mt->vm_start, vma_mt->vm_end,
-                                               mas.index, mas.last);
-                               mt_dump(mas.tree);
-                       }
-                       VM_BUG_ON_MM(vma_mt->vm_end != mas.last + 1, mm);
-                       if (vma_mt->vm_start != mas.index) {
-                               pr_err("vma: %p vma_mt %p %lu - %lu doesn't match\n",
-                                               mm, vma_mt, vma_mt->vm_start, vma_mt->vm_end);
-                               mt_dump(mas.tree);
-                       }
-                       VM_BUG_ON_MM(vma_mt->vm_start != mas.index, mm);
-               }
-       }
-}
-
 static void validate_mm(struct mm_struct *mm)
 {
        int bug = 0;
        int i = 0;
        struct vm_area_struct *vma;
-       MA_STATE(mas, &mm->mm_mt, 0, 0);
-
-       validate_mm_mt(mm);
+       VMA_ITERATOR(vmi, mm, 0);
 
-       mas_for_each(&mas, vma, ULONG_MAX) {
+       mt_validate(&mm->mm_mt);
+       for_each_vma(vmi, vma) {
 #ifdef CONFIG_DEBUG_VM_RB
                struct anon_vma *anon_vma = vma->anon_vma;
                struct anon_vma_chain *avc;
+#endif
+               unsigned long vmi_start, vmi_end;
+               bool warn = 0;
+
+               vmi_start = vma_iter_addr(&vmi);
+               vmi_end = vma_iter_end(&vmi);
+               if (VM_WARN_ON_ONCE_MM(vma->vm_end != vmi_end, mm))
+                       warn = 1;
+
+               if (VM_WARN_ON_ONCE_MM(vma->vm_start != vmi_start, mm))
+                       warn = 1;
 
+               if (warn) {
+                       pr_emerg("issue in %s\n", current->comm);
+                       dump_stack();
+                       dump_vma(vma);
+                       pr_emerg("tree range: %px start %lx end %lx\n", vma,
+                                vmi_start, vmi_end - 1);
+                       vma_iter_dump_tree(&vmi);
+               }
+
+#ifdef CONFIG_DEBUG_VM_RB
                if (anon_vma) {
                        anon_vma_lock_read(anon_vma);
                        list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
@@ -365,14 +345,13 @@ static void validate_mm(struct mm_struct *mm)
                i++;
        }
        if (i != mm->map_count) {
-               pr_emerg("map_count %d mas_for_each %d\n", mm->map_count, i);
+               pr_emerg("map_count %d vma iterator %d\n", mm->map_count, i);
                bug = 1;
        }
        VM_BUG_ON_MM(bug, mm);
 }
 
 #else /* !CONFIG_DEBUG_VM_MAPLE_TREE */
-#define validate_mm_mt(root) do { } while (0)
 #define validate_mm(mm) do { } while (0)
 #endif /* CONFIG_DEBUG_VM_MAPLE_TREE */
 
@@ -1167,21 +1146,21 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
        return hint;
 }
 
-int mlock_future_check(struct mm_struct *mm, unsigned long flags,
-                      unsigned long len)
+bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
+                       unsigned long bytes)
 {
-       unsigned long locked, lock_limit;
+       unsigned long locked_pages, limit_pages;
 
-       /*  mlock MCL_FUTURE? */
-       if (flags & VM_LOCKED) {
-               locked = len >> PAGE_SHIFT;
-               locked += mm->locked_vm;
-               lock_limit = rlimit(RLIMIT_MEMLOCK);
-               lock_limit >>= PAGE_SHIFT;
-               if (locked > lock_limit && !capable(CAP_IPC_LOCK))
-                       return -EAGAIN;
-       }
-       return 0;
+       if (!(flags & VM_LOCKED) || capable(CAP_IPC_LOCK))
+               return true;
+
+       locked_pages = bytes >> PAGE_SHIFT;
+       locked_pages += mm->locked_vm;
+
+       limit_pages = rlimit(RLIMIT_MEMLOCK);
+       limit_pages >>= PAGE_SHIFT;
+
+       return locked_pages <= limit_pages;
 }
 
 static inline u64 file_mmap_size_max(struct file *file, struct inode *inode)
@@ -1293,7 +1272,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
                if (!can_do_mlock())
                        return -EPERM;
 
-       if (mlock_future_check(mm, vm_flags, len))
+       if (!mlock_future_ok(mm, vm_flags, len))
                return -EAGAIN;
 
        if (file) {
@@ -1475,6 +1454,48 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
 }
 #endif /* __ARCH_WANT_SYS_OLD_MMAP */
 
+static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops)
+{
+       return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite);
+}
+
+static bool vma_is_shared_writable(struct vm_area_struct *vma)
+{
+       return (vma->vm_flags & (VM_WRITE | VM_SHARED)) ==
+               (VM_WRITE | VM_SHARED);
+}
+
+static bool vma_fs_can_writeback(struct vm_area_struct *vma)
+{
+       /* No managed pages to writeback. */
+       if (vma->vm_flags & VM_PFNMAP)
+               return false;
+
+       return vma->vm_file && vma->vm_file->f_mapping &&
+               mapping_can_writeback(vma->vm_file->f_mapping);
+}
+
+/*
+ * Does this VMA require the underlying folios to have their dirty state
+ * tracked?
+ */
+bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
+{
+       /* Only shared, writable VMAs require dirty tracking. */
+       if (!vma_is_shared_writable(vma))
+               return false;
+
+       /* Does the filesystem need to be notified? */
+       if (vm_ops_needs_writenotify(vma->vm_ops))
+               return true;
+
+       /*
+        * Even if the filesystem doesn't indicate a need for writenotify, if it
+        * can writeback, dirty tracking is still required.
+        */
+       return vma_fs_can_writeback(vma);
+}
+
 /*
  * Some shared mappings will want the pages marked read-only
  * to track write events. If so, we'll downgrade vm_page_prot
@@ -1483,21 +1504,18 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
  */
 int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
 {
-       vm_flags_t vm_flags = vma->vm_flags;
-       const struct vm_operations_struct *vm_ops = vma->vm_ops;
-
        /* If it was private or non-writable, the write bit is already clear */
-       if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
+       if (!vma_is_shared_writable(vma))
                return 0;
 
        /* The backer wishes to know when pages are first written to? */
-       if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
+       if (vm_ops_needs_writenotify(vma->vm_ops))
                return 1;
 
        /* The open routine did something to the protections that pgprot_modify
         * won't preserve? */
        if (pgprot_val(vm_page_prot) !=
-           pgprot_val(vm_pgprot_modify(vm_page_prot, vm_flags)))
+           pgprot_val(vm_pgprot_modify(vm_page_prot, vma->vm_flags)))
                return 0;
 
        /*
@@ -1511,13 +1529,8 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
        if (userfaultfd_wp(vma))
                return 1;
 
-       /* Specialty mapping? */
-       if (vm_flags & VM_PFNMAP)
-               return 0;
-
        /* Can the mapping track the dirty pages? */
-       return vma->vm_file && vma->vm_file->f_mapping &&
-               mapping_can_writeback(vma->vm_file->f_mapping);
+       return vma_fs_can_writeback(vma);
 }
 
 /*
@@ -1911,7 +1924,7 @@ static int acct_stack_growth(struct vm_area_struct *vma,
                return -ENOMEM;
 
        /* mlock limit tests */
-       if (mlock_future_check(mm, vma->vm_flags, grow << PAGE_SHIFT))
+       if (!mlock_future_ok(mm, vma->vm_flags, grow << PAGE_SHIFT))
                return -ENOMEM;
 
        /* Check to ensure the stack will not grow into a hugetlb-only region */
@@ -2234,7 +2247,7 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
        struct vm_area_struct *new;
        int err;
 
-       validate_mm_mt(vma->vm_mm);
+       validate_mm(vma->vm_mm);
 
        WARN_ON(vma->vm_start >= addr);
        WARN_ON(vma->vm_end <= addr);
@@ -2292,7 +2305,7 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
        /* Success. */
        if (new_below)
                vma_next(vmi);
-       validate_mm_mt(vma->vm_mm);
+       validate_mm(vma->vm_mm);
        return 0;
 
 out_free_mpol:
@@ -2301,7 +2314,7 @@ out_free_vmi:
        vma_iter_free(vmi);
 out_free_vma:
        vm_area_free(new);
-       validate_mm_mt(vma->vm_mm);
+       validate_mm(vma->vm_mm);
        return err;
 }
 
@@ -2394,28 +2407,32 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
                        locked_vm += vma_pages(next);
 
                count++;
+               if (unlikely(uf)) {
+                       /*
+                        * If userfaultfd_unmap_prep returns an error the vmas
+                        * will remain split, but userland will get a
+                        * highly unexpected error anyway. This is no
+                        * different than the case where the first of the two
+                        * __split_vma fails, but we don't undo the first
+                        * split, despite we could. This is unlikely enough
+                        * failure that it's not worth optimizing it for.
+                        */
+                       error = userfaultfd_unmap_prep(next, start, end, uf);
+
+                       if (error)
+                               goto userfaultfd_error;
+               }
 #ifdef CONFIG_DEBUG_VM_MAPLE_TREE
                BUG_ON(next->vm_start < start);
                BUG_ON(next->vm_start > end);
 #endif
        }
 
-       next = vma_next(vmi);
-       if (unlikely(uf)) {
-               /*
-                * If userfaultfd_unmap_prep returns an error the vmas
-                * will remain split, but userland will get a
-                * highly unexpected error anyway. This is no
-                * different than the case where the first of the two
-                * __split_vma fails, but we don't undo the first
-                * split, despite we could. This is unlikely enough
-                * failure that it's not worth optimizing it for.
-                */
-               error = userfaultfd_unmap_prep(mm, start, end, uf);
+       if (vma_iter_end(vmi) > end)
+               next = vma_iter_load(vmi);
 
-               if (error)
-                       goto userfaultfd_error;
-       }
+       if (!next)
+               next = vma_next(vmi);
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
        /* Make sure no VMAs are about to be lost. */
@@ -2620,6 +2637,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
        }
 
 cannot_expand:
+       if (prev)
+               vma_iter_next_range(&vmi);
+
        /*
         * Determine the object being mapped and call the appropriate
         * specific mapper. the address has already been validated, but
@@ -2933,7 +2953,7 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 
        arch_unmap(mm, start, end);
        ret = do_vmi_align_munmap(vmi, vma, mm, start, end, uf, downgrade);
-       validate_mm_mt(mm);
+       validate_mm(mm);
        return ret;
 }
 
@@ -2955,7 +2975,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
        struct mm_struct *mm = current->mm;
        struct vma_prepare vp;
 
-       validate_mm_mt(mm);
+       validate_mm(mm);
        /*
         * Check against address space limits by the changed size
         * Note: This happens *after* clearing old mappings in some code paths.
@@ -3196,7 +3216,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
        bool faulted_in_anon_vma = true;
        VMA_ITERATOR(vmi, mm, addr);
 
-       validate_mm_mt(mm);
+       validate_mm(mm);
        /*
         * If anonymous vma has not yet been faulted, update new pgoff
         * to match new location, to increase its chance of merging.
@@ -3255,7 +3275,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
                        goto out_vma_link;
                *need_rmap_locks = false;
        }
-       validate_mm_mt(mm);
+       validate_mm(mm);
        return new_vma;
 
 out_vma_link:
@@ -3271,7 +3291,7 @@ out_free_mempol:
 out_free_vma:
        vm_area_free(new_vma);
 out:
-       validate_mm_mt(mm);
+       validate_mm(mm);
        return NULL;
 }
 
@@ -3408,7 +3428,7 @@ static struct vm_area_struct *__install_special_mapping(
        int ret;
        struct vm_area_struct *vma;
 
-       validate_mm_mt(mm);
+       validate_mm(mm);
        vma = vm_area_alloc(mm);
        if (unlikely(vma == NULL))
                return ERR_PTR(-ENOMEM);
@@ -3431,12 +3451,12 @@ static struct vm_area_struct *__install_special_mapping(
 
        perf_event_mmap(vma);
 
-       validate_mm_mt(mm);
+       validate_mm(mm);
        return vma;
 
 out:
        vm_area_free(vma);
-       validate_mm_mt(mm);
+       validate_mm(mm);
        return ERR_PTR(ret);
 }
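The mm/mmap.c hunks above factor vma_wants_writenotify() into reusable predicates and introduce vma_needs_dirty_tracking() for callers that only need to know whether a VMA's folios require dirty-state tracking. A hedged caller sketch; my_may_pin_writable() and its policy are illustrative assumptions, and only vma_needs_dirty_tracking() itself comes from the hunks above:

static bool my_may_pin_writable(struct vm_area_struct *vma)
{
	/* Mappings without writenotify-based dirty tracking are safe. */
	if (!vma_needs_dirty_tracking(vma))
		return true;

	/*
	 * A shared, writable file mapping could miss dirty notifications
	 * while, say, a long-term writable pin is held: refuse it.
	 */
	return false;
}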
 
index c59e756..6f658d4 100644
@@ -93,22 +93,9 @@ static long change_pte_range(struct mmu_gather *tlb,
        bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
        tlb_change_page_size(tlb, PAGE_SIZE);
-
-       /*
-        * Can be called with only the mmap_lock for reading by
-        * prot_numa so we must check the pmd isn't constantly
-        * changing from under us from pmd_none to pmd_trans_huge
-        * and/or the other way around.
-        */
-       if (pmd_trans_unstable(pmd))
-               return 0;
-
-       /*
-        * The pmd points to a regular pte so the pmd can't change
-        * from under us even if the mmap_lock is only hold for
-        * reading.
-        */
        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+       if (!pte)
+               return -EAGAIN;
 
        /* Get target node for single threaded private VMAs */
        if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
@@ -118,7 +105,7 @@ static long change_pte_range(struct mmu_gather *tlb,
        flush_tlb_batched_pending(vma->vm_mm);
        arch_enter_lazy_mmu_mode();
        do {
-               oldpte = *pte;
+               oldpte = ptep_get(pte);
                if (pte_present(oldpte)) {
                        pte_t ptent;
 
@@ -302,31 +289,6 @@ static long change_pte_range(struct mmu_gather *tlb,
 }
 
 /*
- * Used when setting automatic NUMA hinting protection where it is
- * critical that a numa hinting PMD is not confused with a bad PMD.
- */
-static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
-{
-       pmd_t pmdval = pmdp_get_lockless(pmd);
-
-       /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-       barrier();
-#endif
-
-       if (pmd_none(pmdval))
-               return 1;
-       if (pmd_trans_huge(pmdval))
-               return 0;
-       if (unlikely(pmd_bad(pmdval))) {
-               pmd_clear_bad(pmd);
-               return 1;
-       }
-
-       return 0;
-}
-
-/*
  * Return true if we want to split THPs into PTE mappings in change
  * protection procedure, false otherwise.
  */
@@ -403,7 +365,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
        pmd = pmd_offset(pud, addr);
        do {
                long ret;
-
+               pmd_t _pmd;
+again:
                next = pmd_addr_end(addr, end);
 
                ret = change_pmd_prepare(vma, pmd, cp_flags);
@@ -411,16 +374,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
                        pages = ret;
                        break;
                }
-               /*
-                * Automatic NUMA balancing walks the tables with mmap_lock
-                * held for read. It's possible a parallel update to occur
-                * between pmd_trans_huge() and a pmd_none_or_clear_bad()
-                * check leading to a false positive and clearing.
-                * Hence, it's necessary to atomically read the PMD value
-                * for all the checks.
-                */
-               if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
-                    pmd_none_or_clear_bad_unless_trans_huge(pmd))
+
+               if (pmd_none(*pmd))
                        goto next;
 
                /* invoke the mmu notifier if the pmd is populated */
@@ -431,7 +386,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
                        mmu_notifier_invalidate_range_start(&range);
                }
 
-               if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+               _pmd = pmdp_get_lockless(pmd);
+               if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
                        if ((next - addr != HPAGE_PMD_SIZE) ||
                            pgtable_split_needed(vma, cp_flags)) {
                                __split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -446,15 +402,10 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
                                        break;
                                }
                        } else {
-                               /*
-                                * change_huge_pmd() does not defer TLB flushes,
-                                * so no need to propagate the tlb argument.
-                                */
-                               int nr_ptes = change_huge_pmd(tlb, vma, pmd,
+                               ret = change_huge_pmd(tlb, vma, pmd,
                                                addr, newprot, cp_flags);
-
-                               if (nr_ptes) {
-                                       if (nr_ptes == HPAGE_PMD_NR) {
+                               if (ret) {
+                                       if (ret == HPAGE_PMD_NR) {
                                                pages += HPAGE_PMD_NR;
                                                nr_huge_updates++;
                                        }
@@ -465,8 +416,12 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
                        }
                        /* fall through, the trans huge pmd just split */
                }
-               pages += change_pte_range(tlb, vma, pmd, addr, next,
-                                         newprot, cp_flags);
+
+               ret = change_pte_range(tlb, vma, pmd, addr, next, newprot,
+                                      cp_flags);
+               if (ret < 0)
+                       goto again;
+               pages += ret;
 next:
                cond_resched();
        } while (pmd++, addr = next, addr != end);
@@ -589,7 +544,8 @@ long change_protection(struct mmu_gather *tlb,
 static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
                               unsigned long next, struct mm_walk *walk)
 {
-       return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
+       return pfn_modify_allowed(pte_pfn(ptep_get(pte)),
+                                 *(pgprot_t *)(walk->private)) ?
                0 : -EACCES;
 }
 
@@ -597,7 +553,8 @@ static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
                                   unsigned long addr, unsigned long next,
                                   struct mm_walk *walk)
 {
-       return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
+       return pfn_modify_allowed(pte_pfn(ptep_get(pte)),
+                                 *(pgprot_t *)(walk->private)) ?
                0 : -EACCES;
 }
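The change_pte_range()/change_pmd_range() hunks above use the other retry shape for the same pte_offset_map_lock() failure case: the PTE-level helper reports -EAGAIN, and the PMD-level loop re-evaluates that PMD instead of skipping it. A compressed sketch with hypothetical my_pte_level()/my_pmd_level() helpers, not the actual mprotect code:

static long my_pte_level(struct vm_area_struct *vma, pmd_t *pmd,
			 unsigned long addr, unsigned long end)
{
	spinlock_t *ptl;
	pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);

	if (!pte)
		return -EAGAIN;		/* pmd changed: caller retries */
	/* ... update PTEs, counting pages changed ... */
	pte_unmap_unlock(pte, ptl);
	return 0;
}

static long my_pmd_level(struct vm_area_struct *vma, pud_t *pud,
			 unsigned long addr, unsigned long end)
{
	pmd_t *pmd = pmd_offset(pud, addr);
	long pages = 0, ret;
	unsigned long next;

	do {
again:
		next = pmd_addr_end(addr, end);
		if (pmd_none(*pmd))
			continue;
		ret = my_pte_level(vma, pmd, addr, next);
		if (ret < 0)
			goto again;	/* same pmd, fresh look */
		pages += ret;
	} while (pmd++, addr = next, addr != end);

	return pages;
}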
 
index b11ce6c..fe6b722 100644
@@ -133,7 +133,7 @@ static pte_t move_soft_dirty_pte(pte_t pte)
        return pte;
 }
 
-static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
+static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
                unsigned long old_addr, unsigned long old_end,
                struct vm_area_struct *new_vma, pmd_t *new_pmd,
                unsigned long new_addr, bool need_rmap_locks)
@@ -143,6 +143,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
        spinlock_t *old_ptl, *new_ptl;
        bool force_flush = false;
        unsigned long len = old_end - old_addr;
+       int err = 0;
 
        /*
         * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
@@ -170,8 +171,16 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
         * pte locks because exclusive mmap_lock prevents deadlock.
         */
        old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
-       new_pte = pte_offset_map(new_pmd, new_addr);
-       new_ptl = pte_lockptr(mm, new_pmd);
+       if (!old_pte) {
+               err = -EAGAIN;
+               goto out;
+       }
+       new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
+       if (!new_pte) {
+               pte_unmap_unlock(old_pte, old_ptl);
+               err = -EAGAIN;
+               goto out;
+       }
        if (new_ptl != old_ptl)
                spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
        flush_tlb_batched_pending(vma->vm_mm);
@@ -179,7 +188,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 
        for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
                                   new_pte++, new_addr += PAGE_SIZE) {
-               if (pte_none(*old_pte))
+               if (pte_none(ptep_get(old_pte)))
                        continue;
 
                pte = ptep_get_and_clear(mm, old_addr, old_pte);
@@ -208,8 +217,10 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
                spin_unlock(new_ptl);
        pte_unmap(new_pte - 1);
        pte_unmap_unlock(old_pte - 1, old_ptl);
+out:
        if (need_rmap_locks)
                drop_rmap_locks(vma);
+       return err;
 }
 
 #ifndef arch_supports_page_table_move
@@ -537,6 +548,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
                new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
                if (!new_pmd)
                        break;
+again:
                if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
                    pmd_devmap(*old_pmd)) {
                        if (extent == HPAGE_PMD_SIZE &&
@@ -544,8 +556,6 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
                                           old_pmd, new_pmd, need_rmap_locks))
                                continue;
                        split_huge_pmd(vma, old_pmd, old_addr);
-                       if (pmd_trans_unstable(old_pmd))
-                               continue;
                } else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) &&
                           extent == PMD_SIZE) {
                        /*
@@ -556,11 +566,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
                                           old_pmd, new_pmd, true))
                                continue;
                }
-
+               if (pmd_none(*old_pmd))
+                       continue;
                if (pte_alloc(new_vma->vm_mm, new_pmd))
                        break;
-               move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
-                         new_pmd, new_addr, need_rmap_locks);
+               if (move_ptes(vma, old_pmd, old_addr, old_addr + extent,
+                             new_vma, new_pmd, new_addr, need_rmap_locks) < 0)
+                       goto again;
        }
 
        mmu_notifier_invalidate_range_end(&range);
@@ -775,7 +787,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
        if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
                return ERR_PTR(-EFAULT);
 
-       if (mlock_future_check(mm, vma->vm_flags, new_len - old_len))
+       if (!mlock_future_ok(mm, vma->vm_flags, new_len - old_len))
                return ERR_PTR(-EAGAIN);
 
        if (!may_expand_vm(mm, vma->vm_flags,
@@ -914,7 +926,8 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
         * mapping address intact. A non-zero tag will cause the subsequent
         * range checks to reject the address as invalid.
         *
-        * See Documentation/arm64/tagged-address-abi.rst for more information.
+        * See Documentation/arch/arm64/tagged-address-abi.rst for more
+        * information.
         */
        addr = untagged_addr(addr);
 
index 044e1ee..612b559 100644 (file)
@@ -1130,12 +1130,10 @@ bool out_of_memory(struct oom_control *oc)
 
        /*
         * The OOM killer does not compensate for IO-less reclaim.
-        * pagefault_out_of_memory lost its gfp context so we have to
-        * make sure exclude 0 mask - all other users should have at least
-        * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
-        * invoke the OOM killer even if it is a GFP_NOFS allocation.
+        * But mem_cgroup_oom() has to invoke the OOM killer even
+        * if it is a GFP_NOFS allocation.
         */
-       if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
+       if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
                return true;
 
        /*
index db79439..1d17fb1 100644 (file)
@@ -2597,7 +2597,7 @@ EXPORT_SYMBOL(noop_dirty_folio);
 /*
  * Helper function for set_page_dirty family.
  *
- * Caller must hold lock_page_memcg().
+ * Caller must hold folio_memcg_lock().
  *
  * NOTE: This relies on being atomic wrt interrupts.
  */
@@ -2631,7 +2631,7 @@ static void folio_account_dirtied(struct folio *folio,
 /*
  * Helper function for deaccounting dirty page without writeback.
  *
- * Caller must hold lock_page_memcg().
+ * Caller must hold folio_memcg_lock().
  */
 void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
 {
@@ -2650,7 +2650,7 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
  * If warn is true, then emit a warning if the folio is not uptodate and has
  * not been truncated.
  *
- * The caller must hold lock_page_memcg().  Most callers have the folio
+ * The caller must hold folio_memcg_lock().  Most callers have the folio
  * locked.  A few have the folio blocked from truncation through other
  * means (eg zap_vma_pages() has it mapped and is holding the page table
  * lock).  This can also be called from mark_buffer_dirty(), which I
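
The page-writeback.c hunks above only update comments: the documented lock is now folio_memcg_lock() rather than lock_page_memcg(). For orientation, a minimal sketch of the convention those comments describe follows; the helper name is made up and the actual dirty accounting is elided.

    /* Sketch only: hypothetical helper illustrating the documented locking. */
    static void mark_dirty_accounted(struct folio *folio)
    {
            folio_memcg_lock(folio);        /* stabilise folio<->memcg binding */
            if (!folio_test_set_dirty(folio)) {
                    /* ... per-memcg and per-node dirty statistics ... */
            }
            folio_memcg_unlock(folio);
    }
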
index 47421be..7d3460c 100644 (file)
 #include <linux/stddef.h>
 #include <linux/mm.h>
 #include <linux/highmem.h>
-#include <linux/swap.h>
-#include <linux/swapops.h>
 #include <linux/interrupt.h>
-#include <linux/pagemap.h>
 #include <linux/jiffies.h>
-#include <linux/memblock.h>
 #include <linux/compiler.h>
 #include <linux/kernel.h>
 #include <linux/kasan.h>
 #include <linux/kmsan.h>
 #include <linux/module.h>
 #include <linux/suspend.h>
-#include <linux/pagevec.h>
-#include <linux/blkdev.h>
-#include <linux/slab.h>
 #include <linux/ratelimit.h>
 #include <linux/oom.h>
 #include <linux/topology.h>
 #include <linux/cpuset.h>
 #include <linux/memory_hotplug.h>
 #include <linux/nodemask.h>
-#include <linux/vmalloc.h>
 #include <linux/vmstat.h>
-#include <linux/mempolicy.h>
-#include <linux/memremap.h>
-#include <linux/stop_machine.h>
-#include <linux/random.h>
-#include <linux/sort.h>
-#include <linux/pfn.h>
-#include <linux/backing-dev.h>
 #include <linux/fault-inject.h>
-#include <linux/page-isolation.h>
-#include <linux/debugobjects.h>
-#include <linux/kmemleak.h>
 #include <linux/compaction.h>
 #include <trace/events/kmem.h>
 #include <trace/events/oom.h>
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
-#include <linux/hugetlb.h>
-#include <linux/sched/rt.h>
 #include <linux/sched/mm.h>
 #include <linux/page_owner.h>
 #include <linux/page_table_check.h>
-#include <linux/kthread.h>
 #include <linux/memcontrol.h>
 #include <linux/ftrace.h>
 #include <linux/lockdep.h>
-#include <linux/nmi.h>
 #include <linux/psi.h>
 #include <linux/khugepaged.h>
 #include <linux/delayacct.h>
-#include <asm/sections.h>
-#include <asm/tlbflush.h>
 #include <asm/div64.h>
 #include "internal.h"
 #include "shuffle.h"
 #include "page_reporting.h"
-#include "swap.h"
 
 /* Free Page Internal flags: for internal, non-pcp variants of free_pages(). */
 typedef int __bitwise fpi_t;
@@ -227,18 +202,7 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 };
 EXPORT_SYMBOL(node_states);
 
-atomic_long_t _totalram_pages __read_mostly;
-EXPORT_SYMBOL(_totalram_pages);
-unsigned long totalreserve_pages __read_mostly;
-unsigned long totalcma_pages __read_mostly;
-
-int percpu_pagelist_high_fraction;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
-DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);
-EXPORT_SYMBOL(init_on_alloc);
-
-DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_FREE_DEFAULT_ON, init_on_free);
-EXPORT_SYMBOL(init_on_free);
 
 /*
  * A cached value of the page's pageblock's migratetype, used when the page is
@@ -258,44 +222,6 @@ static inline void set_pcppage_migratetype(struct page *page, int migratetype)
        page->index = migratetype;
 }
 
-#ifdef CONFIG_PM_SLEEP
-/*
- * The following functions are used by the suspend/hibernate code to temporarily
- * change gfp_allowed_mask in order to avoid using I/O during memory allocations
- * while devices are suspended.  To avoid races with the suspend/hibernate code,
- * they should always be called with system_transition_mutex held
- * (gfp_allowed_mask also should only be modified with system_transition_mutex
- * held, unless the suspend/hibernate code is guaranteed not to run in parallel
- * with that modification).
- */
-
-static gfp_t saved_gfp_mask;
-
-void pm_restore_gfp_mask(void)
-{
-       WARN_ON(!mutex_is_locked(&system_transition_mutex));
-       if (saved_gfp_mask) {
-               gfp_allowed_mask = saved_gfp_mask;
-               saved_gfp_mask = 0;
-       }
-}
-
-void pm_restrict_gfp_mask(void)
-{
-       WARN_ON(!mutex_is_locked(&system_transition_mutex));
-       WARN_ON(saved_gfp_mask);
-       saved_gfp_mask = gfp_allowed_mask;
-       gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
-}
-
-bool pm_suspended_storage(void)
-{
-       if ((gfp_allowed_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
-               return false;
-       return true;
-}
-#endif /* CONFIG_PM_SLEEP */
-
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
 unsigned int pageblock_order __read_mostly;
 #endif
@@ -314,7 +240,7 @@ static void __free_pages_ok(struct page *page, unsigned int order,
  * TBD: should special case ZONE_DMA32 machines here - in those we normally
  * don't need any ZONE_NORMAL reservation
  */
-int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
+static int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
 #ifdef CONFIG_ZONE_DMA
        [ZONE_DMA] = 256,
 #endif
@@ -358,7 +284,7 @@ const char * const migratetype_names[MIGRATE_TYPES] = {
 #endif
 };
 
-compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
+static compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
        [NULL_COMPOUND_DTOR] = NULL,
        [COMPOUND_PAGE_DTOR] = free_compound_page,
 #ifdef CONFIG_HUGETLB_PAGE
@@ -371,10 +297,8 @@ compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
 
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
-int watermark_boost_factor __read_mostly = 15000;
-int watermark_scale_factor = 10;
-
-bool mirrored_kernelcore __initdata_memblock;
+static int watermark_boost_factor __read_mostly = 15000;
+static int watermark_scale_factor = 10;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -387,6 +311,12 @@ EXPORT_SYMBOL(nr_node_ids);
 EXPORT_SYMBOL(nr_online_nodes);
 #endif
 
+static bool page_contains_unaccepted(struct page *page, unsigned int order);
+static void accept_page(struct page *page, unsigned int order);
+static bool try_to_accept_memory(struct zone *zone, unsigned int order);
+static inline bool has_unaccepted_memory(void);
+static bool __free_unaccepted(struct page *page);
+
 int page_group_by_mobility_disabled __read_mostly;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
@@ -550,13 +480,6 @@ static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
        return ret;
 }
 
-static int page_is_consistent(struct zone *zone, struct page *page)
-{
-       if (zone != page_zone(page))
-               return 0;
-
-       return 1;
-}
 /*
  * Temporary debugging check for pages not lying within a given zone.
  */
@@ -564,7 +487,7 @@ static int __maybe_unused bad_range(struct zone *zone, struct page *page)
 {
        if (page_outside_zone_boundaries(zone, page))
                return 1;
-       if (!page_is_consistent(zone, page))
+       if (zone != page_zone(page))
                return 1;
 
        return 0;
@@ -704,75 +627,6 @@ void destroy_large_folio(struct folio *folio)
        compound_page_dtors[dtor](&folio->page);
 }
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
-unsigned int _debug_guardpage_minorder;
-
-bool _debug_pagealloc_enabled_early __read_mostly
-                       = IS_ENABLED(CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT);
-EXPORT_SYMBOL(_debug_pagealloc_enabled_early);
-DEFINE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);
-EXPORT_SYMBOL(_debug_pagealloc_enabled);
-
-DEFINE_STATIC_KEY_FALSE(_debug_guardpage_enabled);
-
-static int __init early_debug_pagealloc(char *buf)
-{
-       return kstrtobool(buf, &_debug_pagealloc_enabled_early);
-}
-early_param("debug_pagealloc", early_debug_pagealloc);
-
-static int __init debug_guardpage_minorder_setup(char *buf)
-{
-       unsigned long res;
-
-       if (kstrtoul(buf, 10, &res) < 0 ||  res > MAX_ORDER / 2) {
-               pr_err("Bad debug_guardpage_minorder value\n");
-               return 0;
-       }
-       _debug_guardpage_minorder = res;
-       pr_info("Setting debug_guardpage_minorder to %lu\n", res);
-       return 0;
-}
-early_param("debug_guardpage_minorder", debug_guardpage_minorder_setup);
-
-static inline bool set_page_guard(struct zone *zone, struct page *page,
-                               unsigned int order, int migratetype)
-{
-       if (!debug_guardpage_enabled())
-               return false;
-
-       if (order >= debug_guardpage_minorder())
-               return false;
-
-       __SetPageGuard(page);
-       INIT_LIST_HEAD(&page->buddy_list);
-       set_page_private(page, order);
-       /* Guard pages are not available for any usage */
-       if (!is_migrate_isolate(migratetype))
-               __mod_zone_freepage_state(zone, -(1 << order), migratetype);
-
-       return true;
-}
-
-static inline void clear_page_guard(struct zone *zone, struct page *page,
-                               unsigned int order, int migratetype)
-{
-       if (!debug_guardpage_enabled())
-               return;
-
-       __ClearPageGuard(page);
-
-       set_page_private(page, 0);
-       if (!is_migrate_isolate(migratetype))
-               __mod_zone_freepage_state(zone, (1 << order), migratetype);
-}
-#else
-static inline bool set_page_guard(struct zone *zone, struct page *page,
-                       unsigned int order, int migratetype) { return false; }
-static inline void clear_page_guard(struct zone *zone, struct page *page,
-                               unsigned int order, int migratetype) {}
-#endif
-
 static inline void set_buddy_order(struct page *page, unsigned int order)
 {
        set_page_private(page, order);
@@ -879,7 +733,7 @@ static inline struct page *get_page_from_free_area(struct free_area *area,
                                            int migratetype)
 {
        return list_first_entry_or_null(&area->free_list[migratetype],
-                                       struct page, lru);
+                                       struct page, buddy_list);
 }
 
 /*
@@ -1131,6 +985,11 @@ static inline bool free_page_is_bad(struct page *page)
        return true;
 }
 
+static inline bool is_check_pages_enabled(void)
+{
+       return static_branch_unlikely(&check_pages_enabled);
+}
+
 static int free_tail_page_prepare(struct page *head_page, struct page *page)
 {
        struct folio *folio = (struct folio *)head_page;
@@ -1142,7 +1001,7 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
         */
        BUILD_BUG_ON((unsigned long)LIST_POISON1 & 1);
 
-       if (!static_branch_unlikely(&check_pages_enabled)) {
+       if (!is_check_pages_enabled()) {
                ret = 0;
                goto out;
        }
@@ -1481,6 +1340,13 @@ void __free_pages_core(struct page *page, unsigned int order)
 
        atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
 
+       if (page_contains_unaccepted(page, order)) {
+               if (order == MAX_ORDER && __free_unaccepted(page))
+                       return;
+
+               accept_page(page, order);
+       }
+
        /*
         * Bypass PCP and place fresh pages right to the tail, primarily
         * relevant for memory onlining.
@@ -1521,7 +1387,7 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
        /* end_pfn is one past the range we are checking */
        end_pfn--;
 
-       if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
+       if (!pfn_valid(end_pfn))
                return NULL;
 
        start_page = pfn_to_online_page(start_pfn);
@@ -1540,33 +1406,6 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
        return start_page;
 }
 
-void set_zone_contiguous(struct zone *zone)
-{
-       unsigned long block_start_pfn = zone->zone_start_pfn;
-       unsigned long block_end_pfn;
-
-       block_end_pfn = pageblock_end_pfn(block_start_pfn);
-       for (; block_start_pfn < zone_end_pfn(zone);
-                       block_start_pfn = block_end_pfn,
-                        block_end_pfn += pageblock_nr_pages) {
-
-               block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
-
-               if (!__pageblock_pfn_to_page(block_start_pfn,
-                                            block_end_pfn, zone))
-                       return;
-               cond_resched();
-       }
-
-       /* We confirm that there is no hole */
-       zone->contiguous = true;
-}
-
-void clear_zone_contiguous(struct zone *zone)
-{
-       zone->contiguous = false;
-}
-
 /*
  * The order of subdivision here is critical for the IO subsystem.
  * Please do not alter this order without good reasons and regression
@@ -2501,61 +2340,6 @@ void drain_all_pages(struct zone *zone)
        __drain_all_pages(zone, false);
 }
 
-#ifdef CONFIG_HIBERNATION
-
-/*
- * Touch the watchdog for every WD_PAGE_COUNT pages.
- */
-#define WD_PAGE_COUNT  (128*1024)
-
-void mark_free_pages(struct zone *zone)
-{
-       unsigned long pfn, max_zone_pfn, page_count = WD_PAGE_COUNT;
-       unsigned long flags;
-       unsigned int order, t;
-       struct page *page;
-
-       if (zone_is_empty(zone))
-               return;
-
-       spin_lock_irqsave(&zone->lock, flags);
-
-       max_zone_pfn = zone_end_pfn(zone);
-       for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++)
-               if (pfn_valid(pfn)) {
-                       page = pfn_to_page(pfn);
-
-                       if (!--page_count) {
-                               touch_nmi_watchdog();
-                               page_count = WD_PAGE_COUNT;
-                       }
-
-                       if (page_zone(page) != zone)
-                               continue;
-
-                       if (!swsusp_page_is_forbidden(page))
-                               swsusp_unset_page_free(page);
-               }
-
-       for_each_migratetype_order(order, t) {
-               list_for_each_entry(page,
-                               &zone->free_area[order].free_list[t], buddy_list) {
-                       unsigned long i;
-
-                       pfn = page_to_pfn(page);
-                       for (i = 0; i < (1UL << order); i++) {
-                               if (!--page_count) {
-                                       touch_nmi_watchdog();
-                                       page_count = WD_PAGE_COUNT;
-                               }
-                               swsusp_set_page_free(pfn_to_page(pfn + i));
-                       }
-               }
-       }
-       spin_unlock_irqrestore(&zone->lock, flags);
-}
-#endif /* CONFIG_PM */
-
 static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
                                                        unsigned int order)
 {
@@ -3052,7 +2836,8 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 out:
        /* Separate test+clear to avoid unnecessary atomics */
-       if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
+       if ((alloc_flags & ALLOC_KSWAPD) &&
+           unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
                clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
                wakeup_kswapd(zone, 0, 0, zone_idx(zone));
        }
@@ -3061,80 +2846,6 @@ out:
        return page;
 }
 
-#ifdef CONFIG_FAIL_PAGE_ALLOC
-
-static struct {
-       struct fault_attr attr;
-
-       bool ignore_gfp_highmem;
-       bool ignore_gfp_reclaim;
-       u32 min_order;
-} fail_page_alloc = {
-       .attr = FAULT_ATTR_INITIALIZER,
-       .ignore_gfp_reclaim = true,
-       .ignore_gfp_highmem = true,
-       .min_order = 1,
-};
-
-static int __init setup_fail_page_alloc(char *str)
-{
-       return setup_fault_attr(&fail_page_alloc.attr, str);
-}
-__setup("fail_page_alloc=", setup_fail_page_alloc);
-
-static bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
-{
-       int flags = 0;
-
-       if (order < fail_page_alloc.min_order)
-               return false;
-       if (gfp_mask & __GFP_NOFAIL)
-               return false;
-       if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
-               return false;
-       if (fail_page_alloc.ignore_gfp_reclaim &&
-                       (gfp_mask & __GFP_DIRECT_RECLAIM))
-               return false;
-
-       /* See comment in __should_failslab() */
-       if (gfp_mask & __GFP_NOWARN)
-               flags |= FAULT_NOWARN;
-
-       return should_fail_ex(&fail_page_alloc.attr, 1 << order, flags);
-}
-
-#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
-
-static int __init fail_page_alloc_debugfs(void)
-{
-       umode_t mode = S_IFREG | 0600;
-       struct dentry *dir;
-
-       dir = fault_create_debugfs_attr("fail_page_alloc", NULL,
-                                       &fail_page_alloc.attr);
-
-       debugfs_create_bool("ignore-gfp-wait", mode, dir,
-                           &fail_page_alloc.ignore_gfp_reclaim);
-       debugfs_create_bool("ignore-gfp-highmem", mode, dir,
-                           &fail_page_alloc.ignore_gfp_highmem);
-       debugfs_create_u32("min-order", mode, dir, &fail_page_alloc.min_order);
-
-       return 0;
-}
-
-late_initcall(fail_page_alloc_debugfs);
-
-#endif /* CONFIG_FAULT_INJECTION_DEBUG_FS */
-
-#else /* CONFIG_FAIL_PAGE_ALLOC */
-
-static inline bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
-{
-       return false;
-}
-
-#endif /* CONFIG_FAIL_PAGE_ALLOC */
-
 noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 {
        return __should_fail_alloc_page(gfp_mask, order);
@@ -3159,6 +2870,9 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
        if (!(alloc_flags & ALLOC_CMA))
                unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+       unusable_free += zone_page_state(z, NR_UNACCEPTED);
+#endif
 
        return unusable_free;
 }
@@ -3458,6 +3172,11 @@ retry:
                                       gfp_mask)) {
                        int ret;
 
+                       if (has_unaccepted_memory()) {
+                               if (try_to_accept_memory(zone, order))
+                                       goto try_this_zone;
+                       }
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
                        /*
                         * Watermark failed for this zone, but see if we can
@@ -3510,6 +3229,11 @@ try_this_zone:
 
                        return page;
                } else {
+                       if (has_unaccepted_memory()) {
+                               if (try_to_accept_memory(zone, order))
+                                       goto try_this_zone;
+                       }
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
                        /* Try again if zone has deferred pages */
                        if (deferred_pages_enabled()) {
@@ -3768,56 +3492,41 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
        if (fatal_signal_pending(current))
                return false;
 
-       if (compaction_made_progress(compact_result))
-               (*compaction_retries)++;
-
-       /*
-        * compaction considers all the zone as desperately out of memory
-        * so it doesn't really make much sense to retry except when the
-        * failure could be caused by insufficient priority
-        */
-       if (compaction_failed(compact_result))
-               goto check_priority;
-
        /*
-        * compaction was skipped because there are not enough order-0 pages
-        * to work with, so we retry only if it looks like reclaim can help.
+        * Compaction was skipped due to a lack of free order-0
+        * migration targets. Continue if reclaim can help.
         */
-       if (compaction_needs_reclaim(compact_result)) {
+       if (compact_result == COMPACT_SKIPPED) {
                ret = compaction_zonelist_suitable(ac, order, alloc_flags);
                goto out;
        }
 
        /*
-        * make sure the compaction wasn't deferred or didn't bail out early
-        * due to locks contention before we declare that we should give up.
-        * But the next retry should use a higher priority if allowed, so
-        * we don't just keep bailing out endlessly.
+        * Compaction managed to coalesce some page blocks, but the
+        * allocation failed presumably due to a race. Retry some.
         */
-       if (compaction_withdrawn(compact_result)) {
-               goto check_priority;
-       }
+       if (compact_result == COMPACT_SUCCESS) {
+               /*
+                * !costly requests are much more important than
+                * __GFP_RETRY_MAYFAIL costly ones because they are de
+                * facto nofail and invoke OOM killer to move on while
+                * costly can fail and users are ready to cope with
+                * that. 1/4 retries is rather arbitrary but we would
+                * need much more detailed feedback from compaction to
+                * make a better decision.
+                */
+               if (order > PAGE_ALLOC_COSTLY_ORDER)
+                       max_retries /= 4;
 
-       /*
-        * !costly requests are much more important than __GFP_RETRY_MAYFAIL
-        * costly ones because they are de facto nofail and invoke OOM
-        * killer to move on while costly can fail and users are ready
-        * to cope with that. 1/4 retries is rather arbitrary but we
-        * would need much more detailed feedback from compaction to
-        * make a better decision.
-        */
-       if (order > PAGE_ALLOC_COSTLY_ORDER)
-               max_retries /= 4;
-       if (*compaction_retries <= max_retries) {
-               ret = true;
-               goto out;
+               if (++(*compaction_retries) <= max_retries) {
+                       ret = true;
+                       goto out;
+               }
        }
 
        /*
-        * Make sure there are attempts at the highest priority if we exhausted
-        * all retries or failed at the lower priorities.
+        * Compaction failed. Retry with increasing priority.
         */
-check_priority:
        min_priority = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                        MIN_COMPACT_COSTLY_PRIORITY : MIN_COMPACT_PRIORITY;
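
The rewrite of should_compact_retry() above reduces the decision to three cases keyed on compact_result. A condensed, hedged paraphrase of those cases (priority bookkeeping and the tracepoint left out) reads:

    /* Paraphrase of the three cases above, not the function itself. */
    if (compact_result == COMPACT_SKIPPED)
            /* No free order-0 migration targets: only reclaim can help. */
            return compaction_zonelist_suitable(ac, order, alloc_flags);

    if (compact_result == COMPACT_SUCCESS) {
            /* A block was assembled but the allocation lost the race. */
            if (order > PAGE_ALLOC_COSTLY_ORDER)
                    max_retries /= 4;       /* costly orders get fewer tries */
            return ++(*compaction_retries) <= max_retries;
    }

    /*
     * Anything else: compaction failed, so retry at a higher priority
     * (the real code also decrements *compact_priority before retrying).
     */
    return *compact_priority > min_priority;
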
 
@@ -5137,383 +4846,6 @@ unsigned long nr_free_buffer_pages(void)
 }
 EXPORT_SYMBOL_GPL(nr_free_buffer_pages);
 
-static inline void show_node(struct zone *zone)
-{
-       if (IS_ENABLED(CONFIG_NUMA))
-               printk("Node %d ", zone_to_nid(zone));
-}
-
-long si_mem_available(void)
-{
-       long available;
-       unsigned long pagecache;
-       unsigned long wmark_low = 0;
-       unsigned long pages[NR_LRU_LISTS];
-       unsigned long reclaimable;
-       struct zone *zone;
-       int lru;
-
-       for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
-               pages[lru] = global_node_page_state(NR_LRU_BASE + lru);
-
-       for_each_zone(zone)
-               wmark_low += low_wmark_pages(zone);
-
-       /*
-        * Estimate the amount of memory available for userspace allocations,
-        * without causing swapping or OOM.
-        */
-       available = global_zone_page_state(NR_FREE_PAGES) - totalreserve_pages;
-
-       /*
-        * Not all the page cache can be freed, otherwise the system will
-        * start swapping or thrashing. Assume at least half of the page
-        * cache, or the low watermark worth of cache, needs to stay.
-        */
-       pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
-       pagecache -= min(pagecache / 2, wmark_low);
-       available += pagecache;
-
-       /*
-        * Part of the reclaimable slab and other kernel memory consists of
-        * items that are in use, and cannot be freed. Cap this estimate at the
-        * low watermark.
-        */
-       reclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B) +
-               global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
-       available += reclaimable - min(reclaimable / 2, wmark_low);
-
-       if (available < 0)
-               available = 0;
-       return available;
-}
-EXPORT_SYMBOL_GPL(si_mem_available);
-
-void si_meminfo(struct sysinfo *val)
-{
-       val->totalram = totalram_pages();
-       val->sharedram = global_node_page_state(NR_SHMEM);
-       val->freeram = global_zone_page_state(NR_FREE_PAGES);
-       val->bufferram = nr_blockdev_pages();
-       val->totalhigh = totalhigh_pages();
-       val->freehigh = nr_free_highpages();
-       val->mem_unit = PAGE_SIZE;
-}
-
-EXPORT_SYMBOL(si_meminfo);
-
-#ifdef CONFIG_NUMA
-void si_meminfo_node(struct sysinfo *val, int nid)
-{
-       int zone_type;          /* needs to be signed */
-       unsigned long managed_pages = 0;
-       unsigned long managed_highpages = 0;
-       unsigned long free_highpages = 0;
-       pg_data_t *pgdat = NODE_DATA(nid);
-
-       for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
-               managed_pages += zone_managed_pages(&pgdat->node_zones[zone_type]);
-       val->totalram = managed_pages;
-       val->sharedram = node_page_state(pgdat, NR_SHMEM);
-       val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES);
-#ifdef CONFIG_HIGHMEM
-       for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
-               struct zone *zone = &pgdat->node_zones[zone_type];
-
-               if (is_highmem(zone)) {
-                       managed_highpages += zone_managed_pages(zone);
-                       free_highpages += zone_page_state(zone, NR_FREE_PAGES);
-               }
-       }
-       val->totalhigh = managed_highpages;
-       val->freehigh = free_highpages;
-#else
-       val->totalhigh = managed_highpages;
-       val->freehigh = free_highpages;
-#endif
-       val->mem_unit = PAGE_SIZE;
-}
-#endif
-
-/*
- * Determine whether the node should be displayed or not, depending on whether
- * SHOW_MEM_FILTER_NODES was passed to show_free_areas().
- */
-static bool show_mem_node_skip(unsigned int flags, int nid, nodemask_t *nodemask)
-{
-       if (!(flags & SHOW_MEM_FILTER_NODES))
-               return false;
-
-       /*
-        * no node mask - aka implicit memory numa policy. Do not bother with
-        * the synchronization - read_mems_allowed_begin - because we do not
-        * have to be precise here.
-        */
-       if (!nodemask)
-               nodemask = &cpuset_current_mems_allowed;
-
-       return !node_isset(nid, *nodemask);
-}
-
-static void show_migration_types(unsigned char type)
-{
-       static const char types[MIGRATE_TYPES] = {
-               [MIGRATE_UNMOVABLE]     = 'U',
-               [MIGRATE_MOVABLE]       = 'M',
-               [MIGRATE_RECLAIMABLE]   = 'E',
-               [MIGRATE_HIGHATOMIC]    = 'H',
-#ifdef CONFIG_CMA
-               [MIGRATE_CMA]           = 'C',
-#endif
-#ifdef CONFIG_MEMORY_ISOLATION
-               [MIGRATE_ISOLATE]       = 'I',
-#endif
-       };
-       char tmp[MIGRATE_TYPES + 1];
-       char *p = tmp;
-       int i;
-
-       for (i = 0; i < MIGRATE_TYPES; i++) {
-               if (type & (1 << i))
-                       *p++ = types[i];
-       }
-
-       *p = '\0';
-       printk(KERN_CONT "(%s) ", tmp);
-}
-
-static bool node_has_managed_zones(pg_data_t *pgdat, int max_zone_idx)
-{
-       int zone_idx;
-       for (zone_idx = 0; zone_idx <= max_zone_idx; zone_idx++)
-               if (zone_managed_pages(pgdat->node_zones + zone_idx))
-                       return true;
-       return false;
-}
-
-/*
- * Show free area list (used inside shift_scroll-lock stuff)
- * We also calculate the percentage fragmentation. We do this by counting the
- * memory on each free list with the exception of the first item on the list.
- *
- * Bits in @filter:
- * SHOW_MEM_FILTER_NODES: suppress nodes that are not allowed by current's
- *   cpuset.
- */
-void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
-{
-       unsigned long free_pcp = 0;
-       int cpu, nid;
-       struct zone *zone;
-       pg_data_t *pgdat;
-
-       for_each_populated_zone(zone) {
-               if (zone_idx(zone) > max_zone_idx)
-                       continue;
-               if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
-                       continue;
-
-               for_each_online_cpu(cpu)
-                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
-       }
-
-       printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
-               " active_file:%lu inactive_file:%lu isolated_file:%lu\n"
-               " unevictable:%lu dirty:%lu writeback:%lu\n"
-               " slab_reclaimable:%lu slab_unreclaimable:%lu\n"
-               " mapped:%lu shmem:%lu pagetables:%lu\n"
-               " sec_pagetables:%lu bounce:%lu\n"
-               " kernel_misc_reclaimable:%lu\n"
-               " free:%lu free_pcp:%lu free_cma:%lu\n",
-               global_node_page_state(NR_ACTIVE_ANON),
-               global_node_page_state(NR_INACTIVE_ANON),
-               global_node_page_state(NR_ISOLATED_ANON),
-               global_node_page_state(NR_ACTIVE_FILE),
-               global_node_page_state(NR_INACTIVE_FILE),
-               global_node_page_state(NR_ISOLATED_FILE),
-               global_node_page_state(NR_UNEVICTABLE),
-               global_node_page_state(NR_FILE_DIRTY),
-               global_node_page_state(NR_WRITEBACK),
-               global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B),
-               global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B),
-               global_node_page_state(NR_FILE_MAPPED),
-               global_node_page_state(NR_SHMEM),
-               global_node_page_state(NR_PAGETABLE),
-               global_node_page_state(NR_SECONDARY_PAGETABLE),
-               global_zone_page_state(NR_BOUNCE),
-               global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE),
-               global_zone_page_state(NR_FREE_PAGES),
-               free_pcp,
-               global_zone_page_state(NR_FREE_CMA_PAGES));
-
-       for_each_online_pgdat(pgdat) {
-               if (show_mem_node_skip(filter, pgdat->node_id, nodemask))
-                       continue;
-               if (!node_has_managed_zones(pgdat, max_zone_idx))
-                       continue;
-
-               printk("Node %d"
-                       " active_anon:%lukB"
-                       " inactive_anon:%lukB"
-                       " active_file:%lukB"
-                       " inactive_file:%lukB"
-                       " unevictable:%lukB"
-                       " isolated(anon):%lukB"
-                       " isolated(file):%lukB"
-                       " mapped:%lukB"
-                       " dirty:%lukB"
-                       " writeback:%lukB"
-                       " shmem:%lukB"
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-                       " shmem_thp: %lukB"
-                       " shmem_pmdmapped: %lukB"
-                       " anon_thp: %lukB"
-#endif
-                       " writeback_tmp:%lukB"
-                       " kernel_stack:%lukB"
-#ifdef CONFIG_SHADOW_CALL_STACK
-                       " shadow_call_stack:%lukB"
-#endif
-                       " pagetables:%lukB"
-                       " sec_pagetables:%lukB"
-                       " all_unreclaimable? %s"
-                       "\n",
-                       pgdat->node_id,
-                       K(node_page_state(pgdat, NR_ACTIVE_ANON)),
-                       K(node_page_state(pgdat, NR_INACTIVE_ANON)),
-                       K(node_page_state(pgdat, NR_ACTIVE_FILE)),
-                       K(node_page_state(pgdat, NR_INACTIVE_FILE)),
-                       K(node_page_state(pgdat, NR_UNEVICTABLE)),
-                       K(node_page_state(pgdat, NR_ISOLATED_ANON)),
-                       K(node_page_state(pgdat, NR_ISOLATED_FILE)),
-                       K(node_page_state(pgdat, NR_FILE_MAPPED)),
-                       K(node_page_state(pgdat, NR_FILE_DIRTY)),
-                       K(node_page_state(pgdat, NR_WRITEBACK)),
-                       K(node_page_state(pgdat, NR_SHMEM)),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-                       K(node_page_state(pgdat, NR_SHMEM_THPS)),
-                       K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
-                       K(node_page_state(pgdat, NR_ANON_THPS)),
-#endif
-                       K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
-                       node_page_state(pgdat, NR_KERNEL_STACK_KB),
-#ifdef CONFIG_SHADOW_CALL_STACK
-                       node_page_state(pgdat, NR_KERNEL_SCS_KB),
-#endif
-                       K(node_page_state(pgdat, NR_PAGETABLE)),
-                       K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)),
-                       pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
-                               "yes" : "no");
-       }
-
-       for_each_populated_zone(zone) {
-               int i;
-
-               if (zone_idx(zone) > max_zone_idx)
-                       continue;
-               if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
-                       continue;
-
-               free_pcp = 0;
-               for_each_online_cpu(cpu)
-                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
-
-               show_node(zone);
-               printk(KERN_CONT
-                       "%s"
-                       " free:%lukB"
-                       " boost:%lukB"
-                       " min:%lukB"
-                       " low:%lukB"
-                       " high:%lukB"
-                       " reserved_highatomic:%luKB"
-                       " active_anon:%lukB"
-                       " inactive_anon:%lukB"
-                       " active_file:%lukB"
-                       " inactive_file:%lukB"
-                       " unevictable:%lukB"
-                       " writepending:%lukB"
-                       " present:%lukB"
-                       " managed:%lukB"
-                       " mlocked:%lukB"
-                       " bounce:%lukB"
-                       " free_pcp:%lukB"
-                       " local_pcp:%ukB"
-                       " free_cma:%lukB"
-                       "\n",
-                       zone->name,
-                       K(zone_page_state(zone, NR_FREE_PAGES)),
-                       K(zone->watermark_boost),
-                       K(min_wmark_pages(zone)),
-                       K(low_wmark_pages(zone)),
-                       K(high_wmark_pages(zone)),
-                       K(zone->nr_reserved_highatomic),
-                       K(zone_page_state(zone, NR_ZONE_ACTIVE_ANON)),
-                       K(zone_page_state(zone, NR_ZONE_INACTIVE_ANON)),
-                       K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
-                       K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
-                       K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
-                       K(zone_page_state(zone, NR_ZONE_WRITE_PENDING)),
-                       K(zone->present_pages),
-                       K(zone_managed_pages(zone)),
-                       K(zone_page_state(zone, NR_MLOCK)),
-                       K(zone_page_state(zone, NR_BOUNCE)),
-                       K(free_pcp),
-                       K(this_cpu_read(zone->per_cpu_pageset->count)),
-                       K(zone_page_state(zone, NR_FREE_CMA_PAGES)));
-               printk("lowmem_reserve[]:");
-               for (i = 0; i < MAX_NR_ZONES; i++)
-                       printk(KERN_CONT " %ld", zone->lowmem_reserve[i]);
-               printk(KERN_CONT "\n");
-       }
-
-       for_each_populated_zone(zone) {
-               unsigned int order;
-               unsigned long nr[MAX_ORDER + 1], flags, total = 0;
-               unsigned char types[MAX_ORDER + 1];
-
-               if (zone_idx(zone) > max_zone_idx)
-                       continue;
-               if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
-                       continue;
-               show_node(zone);
-               printk(KERN_CONT "%s: ", zone->name);
-
-               spin_lock_irqsave(&zone->lock, flags);
-               for (order = 0; order <= MAX_ORDER; order++) {
-                       struct free_area *area = &zone->free_area[order];
-                       int type;
-
-                       nr[order] = area->nr_free;
-                       total += nr[order] << order;
-
-                       types[order] = 0;
-                       for (type = 0; type < MIGRATE_TYPES; type++) {
-                               if (!free_area_empty(area, type))
-                                       types[order] |= 1 << type;
-                       }
-               }
-               spin_unlock_irqrestore(&zone->lock, flags);
-               for (order = 0; order <= MAX_ORDER; order++) {
-                       printk(KERN_CONT "%lu*%lukB ",
-                              nr[order], K(1UL) << order);
-                       if (nr[order])
-                               show_migration_types(types[order]);
-               }
-               printk(KERN_CONT "= %lukB\n", K(total));
-       }
-
-       for_each_online_node(nid) {
-               if (show_mem_node_skip(filter, nid, nodemask))
-                       continue;
-               hugetlb_show_meminfo_node(nid);
-       }
-
-       printk("%ld total pagecache pages\n", global_node_page_state(NR_FILE_PAGES));
-
-       show_swap_cache_info();
-}
-
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
        zoneref->zone = zone;
@@ -5560,12 +4892,12 @@ static int __parse_numa_zonelist_order(char *s)
        return 0;
 }
 
-char numa_zonelist_order[] = "Node";
-
+static char numa_zonelist_order[] = "Node";
+#define NUMA_ZONELIST_ORDER_LEN        16
 /*
  * sysctl handler for numa_zonelist_order
  */
-int numa_zonelist_order_handler(struct ctl_table *table, int write,
+static int numa_zonelist_order_handler(struct ctl_table *table, int write,
                void *buffer, size_t *length, loff_t *ppos)
 {
        if (write)
@@ -5573,7 +4905,6 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
        return proc_dostring(table, write, buffer, length, ppos);
 }
 
-
 static int node_load[MAX_NUMNODES];
 
 /**
@@ -5976,6 +5307,7 @@ static int zone_batchsize(struct zone *zone)
 #endif
 }
 
+static int percpu_pagelist_high_fraction;
 static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 {
 #ifdef CONFIG_MMU
@@ -6505,7 +5837,7 @@ postcore_initcall(init_per_zone_wmark_min)
  *     that we can call two helper functions whenever min_free_kbytes
  *     changes.
  */
-int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
+static int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
                void *buffer, size_t *length, loff_t *ppos)
 {
        int rc;
@@ -6521,7 +5853,7 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
        return 0;
 }
 
-int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
+static int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
                void *buffer, size_t *length, loff_t *ppos)
 {
        int rc;
@@ -6551,7 +5883,7 @@ static void setup_min_unmapped_ratio(void)
 }
 
 
-int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
+static int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
                void *buffer, size_t *length, loff_t *ppos)
 {
        int rc;
@@ -6578,7 +5910,7 @@ static void setup_min_slab_ratio(void)
                                                     sysctl_min_slab_ratio) / 100;
 }
 
-int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
+static int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
                void *buffer, size_t *length, loff_t *ppos)
 {
        int rc;
@@ -6602,8 +5934,8 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
  * minimum watermarks. The lowmem reserve ratio can only make sense
  * if in function of the boot time zone sizes.
  */
-int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
-               void *buffer, size_t *length, loff_t *ppos)
+static int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table,
+               int write, void *buffer, size_t *length, loff_t *ppos)
 {
        int i;
 
@@ -6623,7 +5955,7 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
  * cpu. It is the fraction of total pages in each zone that a hot per cpu
  * pagelist can have before it gets flushed back to buddy allocator.
  */
-int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *table,
+static int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *table,
                int write, void *buffer, size_t *length, loff_t *ppos)
 {
        struct zone *zone;
@@ -6656,9 +5988,83 @@ out:
        return ret;
 }
 
+static struct ctl_table page_alloc_sysctl_table[] = {
+       {
+               .procname       = "min_free_kbytes",
+               .data           = &min_free_kbytes,
+               .maxlen         = sizeof(min_free_kbytes),
+               .mode           = 0644,
+               .proc_handler   = min_free_kbytes_sysctl_handler,
+               .extra1         = SYSCTL_ZERO,
+       },
+       {
+               .procname       = "watermark_boost_factor",
+               .data           = &watermark_boost_factor,
+               .maxlen         = sizeof(watermark_boost_factor),
+               .mode           = 0644,
+               .proc_handler   = proc_dointvec_minmax,
+               .extra1         = SYSCTL_ZERO,
+       },
+       {
+               .procname       = "watermark_scale_factor",
+               .data           = &watermark_scale_factor,
+               .maxlen         = sizeof(watermark_scale_factor),
+               .mode           = 0644,
+               .proc_handler   = watermark_scale_factor_sysctl_handler,
+               .extra1         = SYSCTL_ONE,
+               .extra2         = SYSCTL_THREE_THOUSAND,
+       },
+       {
+               .procname       = "percpu_pagelist_high_fraction",
+               .data           = &percpu_pagelist_high_fraction,
+               .maxlen         = sizeof(percpu_pagelist_high_fraction),
+               .mode           = 0644,
+               .proc_handler   = percpu_pagelist_high_fraction_sysctl_handler,
+               .extra1         = SYSCTL_ZERO,
+       },
+       {
+               .procname       = "lowmem_reserve_ratio",
+               .data           = &sysctl_lowmem_reserve_ratio,
+               .maxlen         = sizeof(sysctl_lowmem_reserve_ratio),
+               .mode           = 0644,
+               .proc_handler   = lowmem_reserve_ratio_sysctl_handler,
+       },
+#ifdef CONFIG_NUMA
+       {
+               .procname       = "numa_zonelist_order",
+               .data           = &numa_zonelist_order,
+               .maxlen         = NUMA_ZONELIST_ORDER_LEN,
+               .mode           = 0644,
+               .proc_handler   = numa_zonelist_order_handler,
+       },
+       {
+               .procname       = "min_unmapped_ratio",
+               .data           = &sysctl_min_unmapped_ratio,
+               .maxlen         = sizeof(sysctl_min_unmapped_ratio),
+               .mode           = 0644,
+               .proc_handler   = sysctl_min_unmapped_ratio_sysctl_handler,
+               .extra1         = SYSCTL_ZERO,
+               .extra2         = SYSCTL_ONE_HUNDRED,
+       },
+       {
+               .procname       = "min_slab_ratio",
+               .data           = &sysctl_min_slab_ratio,
+               .maxlen         = sizeof(sysctl_min_slab_ratio),
+               .mode           = 0644,
+               .proc_handler   = sysctl_min_slab_ratio_sysctl_handler,
+               .extra1         = SYSCTL_ZERO,
+               .extra2         = SYSCTL_ONE_HUNDRED,
+       },
+#endif
+       {}
+};
+
+void __init page_alloc_sysctl_init(void)
+{
+       register_sysctl_init("vm", page_alloc_sysctl_table);
+}
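
With the hunk above, the page-allocator sysctls are declared next to their handlers and registered from page_alloc_sysctl_init() instead of living in kernel/sysctl.c. The same register_sysctl_init() pattern works for any subsystem; a small self-contained sketch with a made-up knob looks like this:

    /* Sketch: a made-up "example_knob" exposed as /proc/sys/vm/example_knob. */
    static int example_knob;

    static struct ctl_table example_table[] = {
            {
                    .procname       = "example_knob",
                    .data           = &example_knob,
                    .maxlen         = sizeof(example_knob),
                    .mode           = 0644,
                    .proc_handler   = proc_dointvec_minmax,
                    .extra1         = SYSCTL_ZERO,
                    .extra2         = SYSCTL_ONE_HUNDRED,
            },
            {}      /* terminator */
    };

    static void __init example_sysctl_init(void)
    {
            register_sysctl_init("vm", example_table);
    }
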
+
 #ifdef CONFIG_CONTIG_ALLOC
-#if defined(CONFIG_DYNAMIC_DEBUG) || \
-       (defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
 /* Usage: See admin-guide/dynamic-debug-howto.rst */
 static void alloc_contig_dump_pages(struct list_head *page_list)
 {
@@ -6672,11 +6078,6 @@ static void alloc_contig_dump_pages(struct list_head *page_list)
                        dump_page(page, "migration failure");
        }
 }
-#else
-static inline void alloc_contig_dump_pages(struct list_head *page_list)
-{
-}
-#endif
 
 /* [start, end) must belong to a single zone. */
 int __alloc_contig_migrate_range(struct compact_control *cc,
@@ -7215,3 +6616,150 @@ bool has_managed_dma(void)
        return false;
 }
 #endif /* CONFIG_ZONE_DMA */
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+
+/* Counts number of zones with unaccepted pages. */
+static DEFINE_STATIC_KEY_FALSE(zones_with_unaccepted_pages);
+
+static bool lazy_accept = true;
+
+static int __init accept_memory_parse(char *p)
+{
+       if (!strcmp(p, "lazy")) {
+               lazy_accept = true;
+               return 0;
+       } else if (!strcmp(p, "eager")) {
+               lazy_accept = false;
+               return 0;
+       } else {
+               return -EINVAL;
+       }
+}
+early_param("accept_memory", accept_memory_parse);
+
+static bool page_contains_unaccepted(struct page *page, unsigned int order)
+{
+       phys_addr_t start = page_to_phys(page);
+       phys_addr_t end = start + (PAGE_SIZE << order);
+
+       return range_contains_unaccepted_memory(start, end);
+}
+
+static void accept_page(struct page *page, unsigned int order)
+{
+       phys_addr_t start = page_to_phys(page);
+
+       accept_memory(start, start + (PAGE_SIZE << order));
+}
+
+static bool try_to_accept_memory_one(struct zone *zone)
+{
+       unsigned long flags;
+       struct page *page;
+       bool last;
+
+       if (list_empty(&zone->unaccepted_pages))
+               return false;
+
+       spin_lock_irqsave(&zone->lock, flags);
+       page = list_first_entry_or_null(&zone->unaccepted_pages,
+                                       struct page, lru);
+       if (!page) {
+               spin_unlock_irqrestore(&zone->lock, flags);
+               return false;
+       }
+
+       list_del(&page->lru);
+       last = list_empty(&zone->unaccepted_pages);
+
+       __mod_zone_freepage_state(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+       __mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES);
+       spin_unlock_irqrestore(&zone->lock, flags);
+
+       accept_page(page, MAX_ORDER);
+
+       __free_pages_ok(page, MAX_ORDER, FPI_TO_TAIL);
+
+       if (last)
+               static_branch_dec(&zones_with_unaccepted_pages);
+
+       return true;
+}
+
+static bool try_to_accept_memory(struct zone *zone, unsigned int order)
+{
+       long to_accept;
+       int ret = false;
+
+       /* How much to accept to get to high watermark? */
+       to_accept = high_wmark_pages(zone) -
+                   (zone_page_state(zone, NR_FREE_PAGES) -
+                   __zone_watermark_unusable_free(zone, order, 0));
+
+       /* Accept at least one page */
+       do {
+               if (!try_to_accept_memory_one(zone))
+                       break;
+               ret = true;
+               to_accept -= MAX_ORDER_NR_PAGES;
+       } while (to_accept > 0);
+
+       return ret;
+}
+
+static inline bool has_unaccepted_memory(void)
+{
+       return static_branch_unlikely(&zones_with_unaccepted_pages);
+}
+
+static bool __free_unaccepted(struct page *page)
+{
+       struct zone *zone = page_zone(page);
+       unsigned long flags;
+       bool first = false;
+
+       if (!lazy_accept)
+               return false;
+
+       spin_lock_irqsave(&zone->lock, flags);
+       first = list_empty(&zone->unaccepted_pages);
+       list_add_tail(&page->lru, &zone->unaccepted_pages);
+       __mod_zone_freepage_state(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+       __mod_zone_page_state(zone, NR_UNACCEPTED, MAX_ORDER_NR_PAGES);
+       spin_unlock_irqrestore(&zone->lock, flags);
+
+       if (first)
+               static_branch_inc(&zones_with_unaccepted_pages);
+
+       return true;
+}
+
+#else
+
+static bool page_contains_unaccepted(struct page *page, unsigned int order)
+{
+       return false;
+}
+
+static void accept_page(struct page *page, unsigned int order)
+{
+}
+
+static bool try_to_accept_memory(struct zone *zone, unsigned int order)
+{
+       return false;
+}
+
+static inline bool has_unaccepted_memory(void)
+{
+       return false;
+}
+
+static bool __free_unaccepted(struct page *page)
+{
+       BUILD_BUG();
+       return false;
+}
+
+#endif /* CONFIG_UNACCEPTED_MEMORY */
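
The CONFIG_UNACCEPTED_MEMORY code above plugs unaccepted ranges into the allocator at two points: in lazy mode, freshly freed MAX_ORDER chunks that are still unaccepted are parked on zone->unaccepted_pages instead of entering the buddy lists, and watermark failures later pull chunks back off that list through try_to_accept_memory(). A compressed sketch of the accept-on-demand half follows; it mirrors the shape of try_to_accept_memory_one() but drops the static-key, freepage and NR_UNACCEPTED bookkeeping, so it is an illustration rather than the patch's exact code.

    /* Sketch of accepting one parked MAX_ORDER chunk; bookkeeping elided. */
    static bool accept_one_chunk(struct zone *zone)
    {
            unsigned long flags;
            struct page *page;

            spin_lock_irqsave(&zone->lock, flags);
            page = list_first_entry_or_null(&zone->unaccepted_pages,
                                            struct page, lru);
            if (page)
                    list_del(&page->lru);
            spin_unlock_irqrestore(&zone->lock, flags);

            if (!page)
                    return false;

            /* Make the backing memory usable, then hand it to the buddy. */
            accept_memory(page_to_phys(page),
                          page_to_phys(page) + (PAGE_SIZE << MAX_ORDER));
            __free_pages(page, MAX_ORDER);
            return true;
    }
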
index 87b682d..684cd3c 100644 (file)
@@ -338,7 +338,7 @@ static void swap_writepage_bdev_sync(struct page *page,
        bio_init(&bio, sis->bdev, &bv, 1,
                 REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc));
        bio.bi_iter.bi_sector = swap_page_sector(page);
-       bio_add_page(&bio, page, thp_size(page), 0);
+       __bio_add_page(&bio, page, thp_size(page), 0);
 
        bio_associate_blkg_from_page(&bio, page);
        count_swpout_vm_event(page);
@@ -360,7 +360,7 @@ static void swap_writepage_bdev_async(struct page *page,
                        GFP_NOIO);
        bio->bi_iter.bi_sector = swap_page_sector(page);
        bio->bi_end_io = end_swap_bio_write;
-       bio_add_page(bio, page, thp_size(page), 0);
+       __bio_add_page(bio, page, thp_size(page), 0);
 
        bio_associate_blkg_from_page(bio, page);
        count_swpout_vm_event(page);
@@ -468,7 +468,7 @@ static void swap_readpage_bdev_sync(struct page *page,
 
        bio_init(&bio, sis->bdev, &bv, 1, REQ_OP_READ);
        bio.bi_iter.bi_sector = swap_page_sector(page);
-       bio_add_page(&bio, page, thp_size(page), 0);
+       __bio_add_page(&bio, page, thp_size(page), 0);
        /*
         * Keep this task valid during swap readpage because the oom killer may
         * attempt to access it in the page fault retry time check.
@@ -488,7 +488,7 @@ static void swap_readpage_bdev_async(struct page *page,
        bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL);
        bio->bi_iter.bi_sector = swap_page_sector(page);
        bio->bi_end_io = end_swap_bio_read;
-       bio_add_page(bio, page, thp_size(page), 0);
+       __bio_add_page(bio, page, thp_size(page), 0);
        count_vm_event(PSWPIN);
        submit_bio(bio);
 }
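
The page_io.c hunks switch swap I/O from bio_add_page() to __bio_add_page(). The underscored variant skips the fits/merges checks and is only valid when the bio was just set up with enough bio_vecs, which is the case here (one page, one vec). A minimal sketch of that pattern for a synchronous single-page write, with placeholder device and sector values, might look like:

    /* Sketch: synchronous one-page write; bdev and sector are placeholders. */
    static int write_one_page_sync(struct block_device *bdev,
                                   struct page *page, sector_t sector)
    {
            struct bio_vec bv;
            struct bio bio;

            bio_init(&bio, bdev, &bv, 1, REQ_OP_WRITE);
            bio.bi_iter.bi_sector = sector;
            /* Freshly initialised bio with room for one vec: no checks needed. */
            __bio_add_page(&bio, page, PAGE_SIZE, 0);
            return submit_bio_wait(&bio);
    }
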
index c6f3605..6599cc9 100644 (file)
@@ -481,10 +481,9 @@ failed:
 }
 
 /**
- * start_isolate_page_range() - make page-allocation-type of range of pages to
- * be MIGRATE_ISOLATE.
- * @start_pfn:         The lower PFN of the range to be isolated.
- * @end_pfn:           The upper PFN of the range to be isolated.
+ * start_isolate_page_range() - mark page range MIGRATE_ISOLATE
+ * @start_pfn:         The first PFN of the range to be isolated.
+ * @end_pfn:           The last PFN of the range to be isolated.
  * @migratetype:       Migrate type to set in error recovery.
  * @flags:             The following flags are allowed (they can be combined in
  *                     a bit mask)
@@ -571,8 +570,14 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
        return 0;
 }
 
-/*
- * Make isolated pages available again.
+/**
+ * undo_isolate_page_range - undo effects of start_isolate_page_range()
+ * @start_pfn:         The first PFN of the isolated range
+ * @end_pfn:           The last PFN of the isolated range
+ * @migratetype:       New migrate type to set on the range
+ *
+ * This finds every MIGRATE_ISOLATE page block in the given range
+ * and switches it to @migratetype.
  */
 void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
                            int migratetype)
@@ -631,7 +636,21 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
        return pfn;
 }
 
-/* Caller should ensure that requested range is in a single zone */
+/**
+ * test_pages_isolated - check if pageblocks in range are isolated
+ * @start_pfn:         The first PFN of the isolated range
+ * @end_pfn:           The first PFN *after* the isolated range
+ * @isol_flags:                Testing mode flags
+ *
+ * This tests if all pages in the specified range are free.
+ *
+ * If %MEMORY_OFFLINE is specified in @isol_flags, it will consider
+ * poisoned and offlined pages free as well.
+ *
+ * Caller must ensure the requested range doesn't span zones.
+ *
+ * Returns 0 if true, -EBUSY if one or more pages are in use.
+ */
 int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
                        int isol_flags)
 {
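
The new kerneldoc above describes the usual three-step sequence: isolate the pageblocks, check (or migrate) whatever is still in use, then undo the isolation. A minimal sketch of that call sequence follows; real users such as alloc_contig_range() migrate pages between the isolate and test steps rather than merely probing.

    /* Sketch of the isolate -> test -> undo sequence the kerneldoc describes. */
    static int probe_range_free(unsigned long start_pfn, unsigned long end_pfn)
    {
            int ret;

            ret = start_isolate_page_range(start_pfn, end_pfn,
                                           MIGRATE_MOVABLE, 0);
            if (ret)
                    return ret;

            /* Pageblocks are now MIGRATE_ISOLATE; see if anything is in use. */
            ret = test_pages_isolated(start_pfn, end_pfn, 0);

            undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
            return ret;
    }
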
index 31169b3..c93baef 100644 (file)
@@ -418,7 +418,7 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
        pageblock_mt = get_pageblock_migratetype(page);
        page_mt  = gfp_migratetype(page_owner->gfp_mask);
        ret += scnprintf(kbuf + ret, count - ret,
-                       "PFN %lu type %s Block %lu type %s Flags %pGp\n",
+                       "PFN 0x%lx type %s Block %lu type %s Flags %pGp\n",
                        pfn,
                        migratetype_names[page_mt],
                        pfn >> pageblock_order,
index f2baf97..93ec769 100644 (file)
@@ -196,7 +196,7 @@ void __page_table_check_pte_set(struct mm_struct *mm, unsigned long addr,
        if (&init_mm == mm)
                return;
 
-       __page_table_check_pte_clear(mm, addr, *ptep);
+       __page_table_check_pte_clear(mm, addr, ptep_get(ptep));
        if (pte_user_accessible_page(pte)) {
                page_table_check_set(mm, addr, pte_pfn(pte),
                                     PAGE_SIZE >> PAGE_SHIFT,
@@ -246,8 +246,10 @@ void __page_table_check_pte_clear_range(struct mm_struct *mm,
                pte_t *ptep = pte_offset_map(&pmd, addr);
                unsigned long i;
 
+               if (WARN_ON(!ptep))
+                       return;
                for (i = 0; i < PTRS_PER_PTE; i++) {
-                       __page_table_check_pte_clear(mm, addr, *ptep);
+                       __page_table_check_pte_clear(mm, addr, ptep_get(ptep));
                        addr += PAGE_SIZE;
                        ptep++;
                }
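
Several hunks in this section (move_ptes() earlier, the page_table_check changes above) replace bare *ptep dereferences with ptep_get(). The accessor gives architectures and instrumentation a single place to intercept PTE reads and, on most configurations, boils down to a READ_ONCE(), which matters once page tables can be walked while they change. A small before/after fragment, for illustration only:

    /* Before: plain dereference of the PTE slot. */
    if (pte_none(*ptep))
            return;

    /* After: read once through the accessor, then test the snapshot. */
    pte_t pte = ptep_get(ptep);

    if (pte_none(pte))
            return;
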
index 4e448cf..49e0d28 100644 (file)
@@ -13,42 +13,61 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
        return false;
 }
 
-static bool map_pte(struct page_vma_mapped_walk *pvmw)
+static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
 {
-       pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
-       if (!(pvmw->flags & PVMW_SYNC)) {
-               if (pvmw->flags & PVMW_MIGRATION) {
-                       if (!is_swap_pte(*pvmw->pte))
-                               return false;
-               } else {
-                       /*
-                        * We get here when we are trying to unmap a private
-                        * device page from the process address space. Such
-                        * page is not CPU accessible and thus is mapped as
-                        * a special swap entry, nonetheless it still does
-                        * count as a valid regular mapping for the page (and
-                        * is accounted as such in page maps count).
-                        *
-                        * So handle this special case as if it was a normal
-                        * page mapping ie lock CPU page table and returns
-                        * true.
-                        *
-                        * For more details on device private memory see HMM
-                        * (include/linux/hmm.h or mm/hmm.c).
-                        */
-                       if (is_swap_pte(*pvmw->pte)) {
-                               swp_entry_t entry;
+       pte_t ptent;
 
-                               /* Handle un-addressable ZONE_DEVICE memory */
-                               entry = pte_to_swp_entry(*pvmw->pte);
-                               if (!is_device_private_entry(entry) &&
-                                   !is_device_exclusive_entry(entry))
-                                       return false;
-                       } else if (!pte_present(*pvmw->pte))
-                               return false;
-               }
+       if (pvmw->flags & PVMW_SYNC) {
+               /* Use the stricter lookup */
+               pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
+                                               pvmw->address, &pvmw->ptl);
+               *ptlp = pvmw->ptl;
+               return !!pvmw->pte;
        }
-       pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
+
+       /*
+        * It is important to return the ptl corresponding to pte,
+        * in case *pvmw->pmd changes underneath us; so we need to
+        * return it even when choosing not to lock, in case caller
+        * proceeds to loop over next ptes, and finds a match later.
+        * Though, in most cases, page lock already protects this.
+        */
+       pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
+                                         pvmw->address, ptlp);
+       if (!pvmw->pte)
+               return false;
+
+       ptent = ptep_get(pvmw->pte);
+
+       if (pvmw->flags & PVMW_MIGRATION) {
+               if (!is_swap_pte(ptent))
+                       return false;
+       } else if (is_swap_pte(ptent)) {
+               swp_entry_t entry;
+               /*
+                * Handle un-addressable ZONE_DEVICE memory.
+                *
+                * We get here when we are trying to unmap a private
+                * device page from the process address space. Such
+                * page is not CPU accessible and thus is mapped as
+                * a special swap entry, nonetheless it still does
+                * count as a valid regular mapping for the page
+                * (and is accounted as such in page maps count).
+                *
+                * So handle this special case as if it was a normal
+                * page mapping ie lock CPU page table and return true.
+                *
+                * For more details on device private memory see HMM
+                * (include/linux/hmm.h or mm/hmm.c).
+                */
+               entry = pte_to_swp_entry(ptent);
+               if (!is_device_private_entry(entry) &&
+                   !is_device_exclusive_entry(entry))
+                       return false;
+       } else if (!pte_present(ptent)) {
+               return false;
+       }
+       pvmw->ptl = *ptlp;
        spin_lock(pvmw->ptl);
        return true;
 }
@@ -75,33 +94,34 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
 static bool check_pte(struct page_vma_mapped_walk *pvmw)
 {
        unsigned long pfn;
+       pte_t ptent = ptep_get(pvmw->pte);
 
        if (pvmw->flags & PVMW_MIGRATION) {
                swp_entry_t entry;
-               if (!is_swap_pte(*pvmw->pte))
+               if (!is_swap_pte(ptent))
                        return false;
-               entry = pte_to_swp_entry(*pvmw->pte);
+               entry = pte_to_swp_entry(ptent);
 
                if (!is_migration_entry(entry) &&
                    !is_device_exclusive_entry(entry))
                        return false;
 
                pfn = swp_offset_pfn(entry);
-       } else if (is_swap_pte(*pvmw->pte)) {
+       } else if (is_swap_pte(ptent)) {
                swp_entry_t entry;
 
                /* Handle un-addressable ZONE_DEVICE memory */
-               entry = pte_to_swp_entry(*pvmw->pte);
+               entry = pte_to_swp_entry(ptent);
                if (!is_device_private_entry(entry) &&
                    !is_device_exclusive_entry(entry))
                        return false;
 
                pfn = swp_offset_pfn(entry);
        } else {
-               if (!pte_present(*pvmw->pte))
+               if (!pte_present(ptent))
                        return false;
 
-               pfn = pte_pfn(*pvmw->pte);
+               pfn = pte_pfn(ptent);
        }
 
        return (pfn - pvmw->pfn) < pvmw->nr_pages;
@@ -153,6 +173,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
        struct vm_area_struct *vma = pvmw->vma;
        struct mm_struct *mm = vma->vm_mm;
        unsigned long end;
+       spinlock_t *ptl;
        pgd_t *pgd;
        p4d_t *p4d;
        pud_t *pud;
@@ -210,7 +231,7 @@ restart:
                 * compiler and used as a stale value after we've observed a
                 * subsequent update.
                 */
-               pmde = READ_ONCE(*pvmw->pmd);
+               pmde = pmdp_get_lockless(pvmw->pmd);
 
                if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
                    (pmd_present(pmde) && pmd_devmap(pmde))) {
@@ -254,8 +275,11 @@ restart:
                        step_forward(pvmw, PMD_SIZE);
                        continue;
                }
-               if (!map_pte(pvmw))
+               if (!map_pte(pvmw, &ptl)) {
+                       if (!pvmw->pte)
+                               goto restart;
                        goto next_pte;
+               }
 this_pte:
                if (check_pte(pvmw))
                        return true;
@@ -275,14 +299,10 @@ next_pte:
                                goto restart;
                        }
                        pvmw->pte++;
-                       if ((pvmw->flags & PVMW_SYNC) && !pvmw->ptl) {
-                               pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
-                               spin_lock(pvmw->ptl);
-                       }
-               } while (pte_none(*pvmw->pte));
+               } while (pte_none(ptep_get(pvmw->pte)));
 
                if (!pvmw->ptl) {
-                       pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
+                       pvmw->ptl = ptl;
                        spin_lock(pvmw->ptl);
                }
                goto this_pte;
index cb23f8a..6443710 100644 (file)
@@ -46,15 +46,27 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
        spinlock_t *ptl;
 
        if (walk->no_vma) {
-               pte = pte_offset_map(pmd, addr);
-               err = walk_pte_range_inner(pte, addr, end, walk);
-               pte_unmap(pte);
+               /*
+                * pte_offset_map() might apply user-specific validation, so
+                * use pte_offset_kernel() for kernel (init_mm) page tables.
+                */
+               if (walk->mm == &init_mm)
+                       pte = pte_offset_kernel(pmd, addr);
+               else
+                       pte = pte_offset_map(pmd, addr);
+               if (pte) {
+                       err = walk_pte_range_inner(pte, addr, end, walk);
+                       if (walk->mm != &init_mm)
+                               pte_unmap(pte);
+               }
        } else {
                pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
-               err = walk_pte_range_inner(pte, addr, end, walk);
-               pte_unmap_unlock(pte, ptl);
+               if (pte) {
+                       err = walk_pte_range_inner(pte, addr, end, walk);
+                       pte_unmap_unlock(pte, ptl);
+               }
        }
-
+       if (!pte)
+               walk->action = ACTION_AGAIN;
        return err;
 }
 
@@ -141,11 +153,8 @@ again:
                    !(ops->pte_entry))
                        continue;
 
-               if (walk->vma) {
+               if (walk->vma)
                        split_huge_pmd(walk->vma, pmd, addr);
-                       if (pmd_trans_unstable(pmd))
-                               goto again;
-               }
 
                if (is_hugepd(__hugepd(pmd_val(*pmd))))
                        err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT);
@@ -153,6 +162,10 @@ again:
                        err = walk_pte_range(pmd, addr, next, walk);
                if (err)
                        break;
+
+               if (walk->action == ACTION_AGAIN)
+                       goto again;
+
        } while (pmd++, addr = next, addr != end);
 
        return err;
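
The hunks above make walk_pte_range() report a vanished page table through walk->action instead of failing, and walk_pmd_range() then retries that entry. A rough userspace sketch of the same convention (the names and the retry counter are illustrative, not kernel API):

#include <stdio.h>

enum action { ACTION_CONTINUE, ACTION_AGAIN };

struct walk_state {
	enum action action;
	int vanish_count;	/* pretend the table vanishes this many times */
};

static void walk_leaf(struct walk_state *w)
{
	if (w->vanish_count > 0) {
		w->vanish_count--;
		w->action = ACTION_AGAIN;	/* table gone: ask the caller to retry */
		return;
	}
	/* normal walk of the entries would happen here */
}

int main(void)
{
	struct walk_state w = { .action = ACTION_CONTINUE, .vanish_count = 2 };

	do {
		w.action = ACTION_CONTINUE;
		walk_leaf(&w);
	} while (w.action == ACTION_AGAIN);

	puts("walk completed");
	return 0;
}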
index f9847c1..cdd0aa5 100644 (file)
@@ -41,10 +41,17 @@ struct pcpu_chunk {
        struct list_head        list;           /* linked to pcpu_slot lists */
        int                     free_bytes;     /* free bytes in the chunk */
        struct pcpu_block_md    chunk_md;
-       void                    *base_addr;     /* base address of this chunk */
+       unsigned long           *bound_map;     /* boundary map */
+
+       /*
+        * base_addr is the base address of this chunk.
+        * To reduce false sharing, the current layout is optimized to make
+        * sure base_addr is located in a different cacheline from free_bytes
+        * and chunk_md.
+        */
+       void                    *base_addr ____cacheline_aligned_in_smp;
 
        unsigned long           *alloc_map;     /* allocation map */
-       unsigned long           *bound_map;     /* boundary map */
        struct pcpu_block_md    *md_blocks;     /* metadata blocks */
 
        void                    *data;          /* chunk data */
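
A minimal userspace sketch of the same false-sharing fix, assuming 64-byte cachelines: the read-mostly pointer is forced onto its own line so writes to the hot counter next to it do not keep invalidating it.

#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>

struct chunk_layout {
	int free_bytes;				/* frequently written */
	alignas(64) void *base_addr;		/* read-mostly, own cacheline */
};

int main(void)
{
	printf("base_addr offset: %zu, sizeof: %zu\n",
	       offsetof(struct chunk_layout, base_addr),
	       sizeof(struct chunk_layout));
	return 0;
}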
index d2fc52b..4d45495 100644 (file)
@@ -10,6 +10,8 @@
 #include <linux/pagemap.h>
 #include <linux/hugetlb.h>
 #include <linux/pgtable.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/mm_inline.h>
 #include <asm/tlb.h>
 
@@ -66,7 +68,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
                          unsigned long address, pte_t *ptep,
                          pte_t entry, int dirty)
 {
-       int changed = !pte_same(*ptep, entry);
+       int changed = !pte_same(ptep_get(ptep), entry);
        if (changed) {
                set_pte_at(vma->vm_mm, address, ptep, entry);
                flush_tlb_fix_spurious_fault(vma, address, ptep);
@@ -229,3 +231,57 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
+{
+       pmd_t pmdval;
+
+       /* rcu_read_lock() to be added later */
+       pmdval = pmdp_get_lockless(pmd);
+       if (pmdvalp)
+               *pmdvalp = pmdval;
+       if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
+               goto nomap;
+       if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
+               goto nomap;
+       if (unlikely(pmd_bad(pmdval))) {
+               pmd_clear_bad(pmd);
+               goto nomap;
+       }
+       return __pte_map(&pmdval, addr);
+nomap:
+       /* rcu_read_unlock() to be added later */
+       return NULL;
+}
+
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+                            unsigned long addr, spinlock_t **ptlp)
+{
+       pmd_t pmdval;
+       pte_t *pte;
+
+       pte = __pte_offset_map(pmd, addr, &pmdval);
+       if (likely(pte))
+               *ptlp = pte_lockptr(mm, &pmdval);
+       return pte;
+}
+
+pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+                            unsigned long addr, spinlock_t **ptlp)
+{
+       spinlock_t *ptl;
+       pmd_t pmdval;
+       pte_t *pte;
+again:
+       pte = __pte_offset_map(pmd, addr, &pmdval);
+       if (unlikely(!pte))
+               return pte;
+       ptl = pte_lockptr(mm, &pmdval);
+       spin_lock(ptl);
+       if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+               *ptlp = ptl;
+               return pte;
+       }
+       pte_unmap_unlock(pte, ptl);
+       goto again;
+}
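
pte_offset_map_nolock() returns the lock matching the pmd value it saw, and __pte_offset_map_lock() takes that lock and rechecks the pmd, retrying if it changed underneath. A rough userspace sketch of that lock-then-revalidate pattern (types and names here are illustrative, not kernel API):

#include <pthread.h>
#include <stdatomic.h>

struct table {
	_Atomic unsigned long pmdval;	/* stands in for the pmd entry */
	pthread_mutex_t lock;		/* stands in for the pte lock */
};

/* Return the value of pmdval, with t->lock held and pmdval known stable. */
static unsigned long lock_stable_pmdval(struct table *t)
{
	unsigned long snap;

	for (;;) {
		snap = atomic_load(&t->pmdval);		/* lockless snapshot */
		pthread_mutex_lock(&t->lock);
		if (atomic_load(&t->pmdval) == snap)
			return snap;			/* still the same: keep the lock */
		pthread_mutex_unlock(&t->lock);		/* raced with a change: retry */
	}
}

int main(void)
{
	struct table t = { .pmdval = 42, .lock = PTHREAD_MUTEX_INITIALIZER };
	unsigned long v = lock_stable_pmdval(&t);

	pthread_mutex_unlock(&t.lock);
	return v != 42;
}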
index 78dfaf9..0523eda 100644 (file)
@@ -104,7 +104,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
                mmap_read_lock(mm);
                pinned_pages = pin_user_pages_remote(mm, pa, pinned_pages,
                                                     flags, process_pages,
-                                                    NULL, &locked);
+                                                    &locked);
                if (locked)
                        mmap_read_unlock(mm);
                if (pinned_pages <= 0)
index 8adab45..03c1bda 100644 (file)
@@ -119,7 +119,7 @@ static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
                            unsigned long next, struct mm_walk *walk)
 {
        struct ptdump_state *st = walk->private;
-       pte_t val = ptep_get(pte);
+       pte_t val = ptep_get_lockless(pte);
 
        if (st->effective_prot)
                st->effective_prot(st, 4, pte_val(val));
index 47afbca..a9c999a 100644 (file)
 #include <linux/export.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
-#include <linux/pagevec.h>
 #include <linux/pagemap.h>
 #include <linux/psi.h>
 #include <linux/syscalls.h>
index 19392e0..0c0d885 100644 (file)
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -826,7 +826,8 @@ static bool folio_referenced_one(struct folio *folio,
                }
 
                if (pvmw.pte) {
-                       if (lru_gen_enabled() && pte_young(*pvmw.pte)) {
+                       if (lru_gen_enabled() &&
+                           pte_young(ptep_get(pvmw.pte))) {
                                lru_gen_look_around(&pvmw);
                                referenced++;
                        }
@@ -956,13 +957,13 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 
                address = pvmw->address;
                if (pvmw->pte) {
-                       pte_t entry;
                        pte_t *pte = pvmw->pte;
+                       pte_t entry = ptep_get(pte);
 
-                       if (!pte_dirty(*pte) && !pte_write(*pte))
+                       if (!pte_dirty(entry) && !pte_write(entry))
                                continue;
 
-                       flush_cache_page(vma, address, pte_pfn(*pte));
+                       flush_cache_page(vma, address, pte_pfn(entry));
                        entry = ptep_clear_flush(vma, address, pte);
                        entry = pte_wrprotect(entry);
                        entry = pte_mkclean(entry);
@@ -1137,7 +1138,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
  * @folio:     Folio which contains page.
  * @page:      Page to add to rmap.
  * @vma:       VM area to add page to.
- * @address:   User virtual address of the mapping     
+ * @address:   User virtual address of the mapping
  * @exclusive: the page is exclusively owned by the current process
  */
 static void __page_set_anon_rmap(struct folio *folio, struct page *page,
@@ -1458,6 +1459,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
        bool anon_exclusive, ret = true;
        struct mmu_notifier_range range;
        enum ttu_flags flags = (enum ttu_flags)(long)arg;
+       unsigned long pfn;
 
        /*
         * When racing against e.g. zap_pte_range() on another cpu,
@@ -1508,8 +1510,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                        break;
                }
 
-               subpage = folio_page(folio,
-                                       pte_pfn(*pvmw.pte) - folio_pfn(folio));
+               pfn = pte_pfn(ptep_get(pvmw.pte));
+               subpage = folio_page(folio, pfn - folio_pfn(folio));
                address = pvmw.address;
                anon_exclusive = folio_test_anon(folio) &&
                                 PageAnonExclusive(subpage);
@@ -1571,7 +1573,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                        }
                        pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
                } else {
-                       flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+                       flush_cache_page(vma, address, pfn);
                        /* Nuke the page table entry. */
                        if (should_defer_flush(mm, flags)) {
                                /*
@@ -1818,6 +1820,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
        bool anon_exclusive, ret = true;
        struct mmu_notifier_range range;
        enum ttu_flags flags = (enum ttu_flags)(long)arg;
+       unsigned long pfn;
 
        /*
         * When racing against e.g. zap_pte_range() on another cpu,
@@ -1877,6 +1880,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
                /* Unexpected PMD-mapped THP? */
                VM_BUG_ON_FOLIO(!pvmw.pte, folio);
 
+               pfn = pte_pfn(ptep_get(pvmw.pte));
+
                if (folio_is_zone_device(folio)) {
                        /*
                         * Our PTE is a non-present device exclusive entry and
@@ -1891,8 +1896,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
                        VM_BUG_ON_FOLIO(folio_nr_pages(folio) > 1, folio);
                        subpage = &folio->page;
                } else {
-                       subpage = folio_page(folio,
-                                       pte_pfn(*pvmw.pte) - folio_pfn(folio));
+                       subpage = folio_page(folio, pfn - folio_pfn(folio));
                }
                address = pvmw.address;
                anon_exclusive = folio_test_anon(folio) &&
@@ -1952,7 +1956,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
                        /* Nuke the hugetlb page table entry */
                        pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
                } else {
-                       flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+                       flush_cache_page(vma, address, pfn);
                        /* Nuke the page table entry. */
                        if (should_defer_flush(mm, flags)) {
                                /*
@@ -2187,6 +2191,7 @@ static bool page_make_device_exclusive_one(struct folio *folio,
        struct mmu_notifier_range range;
        swp_entry_t entry;
        pte_t swp_pte;
+       pte_t ptent;
 
        mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
                                      vma->vm_mm, address, min(vma->vm_end,
@@ -2198,18 +2203,19 @@ static bool page_make_device_exclusive_one(struct folio *folio,
                /* Unexpected PMD-mapped THP? */
                VM_BUG_ON_FOLIO(!pvmw.pte, folio);
 
-               if (!pte_present(*pvmw.pte)) {
+               ptent = ptep_get(pvmw.pte);
+               if (!pte_present(ptent)) {
                        ret = false;
                        page_vma_mapped_walk_done(&pvmw);
                        break;
                }
 
                subpage = folio_page(folio,
-                               pte_pfn(*pvmw.pte) - folio_pfn(folio));
+                               pte_pfn(ptent) - folio_pfn(folio));
                address = pvmw.address;
 
                /* Nuke the page table entry. */
-               flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+               flush_cache_page(vma, address, pte_pfn(ptent));
                pteval = ptep_clear_flush(vma, address, pvmw.pte);
 
                /* Set the dirty flag on the folio now the pte is gone. */
@@ -2328,7 +2334,7 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
 
        npages = get_user_pages_remote(mm, start, npages,
                                       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
-                                      pages, NULL, NULL);
+                                      pages, NULL);
        if (npages < 0)
                return npages;
 
index 0b50262..86442a1 100644 (file)
@@ -35,7 +35,7 @@
 #define SECRETMEM_MODE_MASK    (0x0)
 #define SECRETMEM_FLAGS_MASK   SECRETMEM_MODE_MASK
 
-static bool secretmem_enable __ro_after_init;
+static bool secretmem_enable __ro_after_init = 1;
 module_param_named(enable, secretmem_enable, bool, 0400);
 MODULE_PARM_DESC(secretmem_enable,
                 "Enable secretmem and memfd_secret(2) system call");
@@ -125,7 +125,7 @@ static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
        if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
                return -EINVAL;
 
-       if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+       if (!mlock_future_ok(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
                return -EAGAIN;
 
        vm_flags_set(vma, VM_LOCKED | VM_DONTDUMP);
index e40a08c..2f2e0e6 100644 (file)
@@ -2731,6 +2731,138 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
        return retval ? retval : error;
 }
 
+static bool zero_pipe_buf_get(struct pipe_inode_info *pipe,
+                             struct pipe_buffer *buf)
+{
+       return true;
+}
+
+static void zero_pipe_buf_release(struct pipe_inode_info *pipe,
+                                 struct pipe_buffer *buf)
+{
+}
+
+static bool zero_pipe_buf_try_steal(struct pipe_inode_info *pipe,
+                                   struct pipe_buffer *buf)
+{
+       return false;
+}
+
+static const struct pipe_buf_operations zero_pipe_buf_ops = {
+       .release        = zero_pipe_buf_release,
+       .try_steal      = zero_pipe_buf_try_steal,
+       .get            = zero_pipe_buf_get,
+};
+
+static size_t splice_zeropage_into_pipe(struct pipe_inode_info *pipe,
+                                       loff_t fpos, size_t size)
+{
+       size_t offset = fpos & ~PAGE_MASK;
+
+       size = min_t(size_t, size, PAGE_SIZE - offset);
+
+       if (!pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
+               struct pipe_buffer *buf = pipe_head_buf(pipe);
+
+               *buf = (struct pipe_buffer) {
+                       .ops    = &zero_pipe_buf_ops,
+                       .page   = ZERO_PAGE(0),
+                       .offset = offset,
+                       .len    = size,
+               };
+               pipe->head++;
+       }
+
+       return size;
+}
+
+static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
+                                     struct pipe_inode_info *pipe,
+                                     size_t len, unsigned int flags)
+{
+       struct inode *inode = file_inode(in);
+       struct address_space *mapping = inode->i_mapping;
+       struct folio *folio = NULL;
+       size_t total_spliced = 0, used, npages, n, part;
+       loff_t isize;
+       int error = 0;
+
+       /* Work out how much data we can actually add into the pipe */
+       used = pipe_occupancy(pipe->head, pipe->tail);
+       npages = max_t(ssize_t, pipe->max_usage - used, 0);
+       len = min_t(size_t, len, npages * PAGE_SIZE);
+
+       do {
+               if (*ppos >= i_size_read(inode))
+                       break;
+
+               error = shmem_get_folio(inode, *ppos / PAGE_SIZE, &folio, SGP_READ);
+               if (error) {
+                       if (error == -EINVAL)
+                               error = 0;
+                       break;
+               }
+               if (folio) {
+                       folio_unlock(folio);
+
+                       if (folio_test_hwpoison(folio)) {
+                               error = -EIO;
+                               break;
+                       }
+               }
+
+               /*
+                * i_size must be checked after we know the pages are Uptodate.
+                *
+                * Checking i_size after the check allows us to calculate
+                * the correct value for "nr", which means the zero-filled
+                * part of the page is not copied back to userspace (unless
+                * another truncate extends the file - this is desired though).
+                */
+               isize = i_size_read(inode);
+               if (unlikely(*ppos >= isize))
+                       break;
+               part = min_t(loff_t, isize - *ppos, len);
+
+               if (folio) {
+                       /*
+                        * If users can be writing to this page using arbitrary
+                        * virtual addresses, take care about potential aliasing
+                        * before reading the page on the kernel side.
+                        */
+                       if (mapping_writably_mapped(mapping))
+                               flush_dcache_folio(folio);
+                       folio_mark_accessed(folio);
+                       /*
+                        * Ok, we have the page, and it's up-to-date, so we can
+                        * now splice it into the pipe.
+                        */
+                       n = splice_folio_into_pipe(pipe, folio, *ppos, part);
+                       folio_put(folio);
+                       folio = NULL;
+               } else {
+                       n = splice_zeropage_into_pipe(pipe, *ppos, len);
+               }
+
+               if (!n)
+                       break;
+               len -= n;
+               total_spliced += n;
+               *ppos += n;
+               in->f_ra.prev_pos = *ppos;
+               if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
+                       break;
+
+               cond_resched();
+       } while (len);
+
+       if (folio)
+               folio_put(folio);
+
+       file_accessed(in);
+       return total_spliced ? total_spliced : error;
+}
+
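
With shmem now wired to its own ->splice_read, an ordinary splice(2) from a tmpfs file into a pipe goes through the function above. A small userspace sketch (the path and length are assumptions, not part of this change):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	int pfd[2];
	int fd = open("/dev/shm/example", O_RDONLY);	/* assumed tmpfs file */
	loff_t off = 0;
	ssize_t n;

	if (fd < 0 || pipe(pfd) < 0)
		return 1;

	/* Move up to 64KiB into the pipe without a userspace copy. */
	n = splice(fd, &off, pfd[1], NULL, 65536, SPLICE_F_MOVE);
	printf("spliced %zd bytes\n", n);
	return n < 0;
}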
 static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 {
        struct address_space *mapping = file->f_mapping;
@@ -3726,6 +3858,7 @@ out:
 static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 {
        struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb);
+       struct mempolicy *mpol;
 
        if (sbinfo->max_blocks != shmem_default_max_blocks())
                seq_printf(seq, ",size=%luk",
@@ -3768,7 +3901,9 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
        if (sbinfo->huge)
                seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
 #endif
-       shmem_show_mpol(seq, sbinfo->mpol);
+       mpol = shmem_get_sbmpol(sbinfo);
+       shmem_show_mpol(seq, mpol);
+       mpol_put(mpol);
        if (sbinfo->noswap)
                seq_printf(seq, ",noswap");
        return 0;
@@ -3971,7 +4106,7 @@ static const struct file_operations shmem_file_operations = {
        .read_iter      = shmem_file_read_iter,
        .write_iter     = generic_file_write_iter,
        .fsync          = noop_fsync,
-       .splice_read    = generic_file_splice_read,
+       .splice_read    = shmem_file_splice_read,
        .splice_write   = iter_file_splice_write,
        .fallocate      = shmem_fallocate,
 #endif
@@ -4196,7 +4331,7 @@ static struct file_system_type shmem_fs_type = {
        .name           = "tmpfs",
        .init_fs_context = ramfs_init_fs_context,
        .parameters     = ramfs_fs_parameters,
-       .kill_sb        = kill_litter_super,
+       .kill_sb        = ramfs_kill_sb,
        .fs_flags       = FS_USERNS_MOUNT,
 };
 
diff --git a/mm/show_mem.c b/mm/show_mem.c
new file mode 100644 (file)
index 0000000..01f8e99
--- /dev/null
+++ b/mm/show_mem.c
@@ -0,0 +1,429 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Generic show_mem() implementation
+ *
+ * Copyright (C) 2008 Johannes Weiner <hannes@saeurebad.de>
+ */
+
+#include <linux/blkdev.h>
+#include <linux/cma.h>
+#include <linux/cpuset.h>
+#include <linux/highmem.h>
+#include <linux/hugetlb.h>
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+
+#include "internal.h"
+#include "swap.h"
+
+atomic_long_t _totalram_pages __read_mostly;
+EXPORT_SYMBOL(_totalram_pages);
+unsigned long totalreserve_pages __read_mostly;
+unsigned long totalcma_pages __read_mostly;
+
+static inline void show_node(struct zone *zone)
+{
+       if (IS_ENABLED(CONFIG_NUMA))
+               printk("Node %d ", zone_to_nid(zone));
+}
+
+long si_mem_available(void)
+{
+       long available;
+       unsigned long pagecache;
+       unsigned long wmark_low = 0;
+       unsigned long pages[NR_LRU_LISTS];
+       unsigned long reclaimable;
+       struct zone *zone;
+       int lru;
+
+       for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
+               pages[lru] = global_node_page_state(NR_LRU_BASE + lru);
+
+       for_each_zone(zone)
+               wmark_low += low_wmark_pages(zone);
+
+       /*
+        * Estimate the amount of memory available for userspace allocations,
+        * without causing swapping or OOM.
+        */
+       available = global_zone_page_state(NR_FREE_PAGES) - totalreserve_pages;
+
+       /*
+        * Not all the page cache can be freed, otherwise the system will
+        * start swapping or thrashing. Assume at least half of the page
+        * cache, or the low watermark worth of cache, needs to stay.
+        */
+       pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
+       pagecache -= min(pagecache / 2, wmark_low);
+       available += pagecache;
+
+       /*
+        * Part of the reclaimable slab and other kernel memory consists of
+        * items that are in use, and cannot be freed. Cap this estimate at the
+        * low watermark.
+        */
+       reclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B) +
+               global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
+       available += reclaimable - min(reclaimable / 2, wmark_low);
+
+       if (available < 0)
+               available = 0;
+       return available;
+}
+EXPORT_SYMBOL_GPL(si_mem_available);
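
A worked example of the estimate above with made-up page counts: free pages minus reserves, plus the part of page cache and reclaimable slab that could be dropped (at least half of each, capped by the low watermark).

#include <stdio.h>

static long min_l(long a, long b) { return a < b ? a : b; }

int main(void)
{
	long free = 200000, reserve = 10000, wmark_low = 20000;	/* assumed */
	long pagecache = 300000, reclaimable = 60000;		/* assumed */
	long avail = free - reserve;

	avail += pagecache - min_l(pagecache / 2, wmark_low);
	avail += reclaimable - min_l(reclaimable / 2, wmark_low);
	printf("MemAvailable estimate: %ld pages\n", avail);	/* 510000 */
	return 0;
}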
+
+void si_meminfo(struct sysinfo *val)
+{
+       val->totalram = totalram_pages();
+       val->sharedram = global_node_page_state(NR_SHMEM);
+       val->freeram = global_zone_page_state(NR_FREE_PAGES);
+       val->bufferram = nr_blockdev_pages();
+       val->totalhigh = totalhigh_pages();
+       val->freehigh = nr_free_highpages();
+       val->mem_unit = PAGE_SIZE;
+}
+
+EXPORT_SYMBOL(si_meminfo);
+
+#ifdef CONFIG_NUMA
+void si_meminfo_node(struct sysinfo *val, int nid)
+{
+       int zone_type;          /* needs to be signed */
+       unsigned long managed_pages = 0;
+       unsigned long managed_highpages = 0;
+       unsigned long free_highpages = 0;
+       pg_data_t *pgdat = NODE_DATA(nid);
+
+       for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
+               managed_pages += zone_managed_pages(&pgdat->node_zones[zone_type]);
+       val->totalram = managed_pages;
+       val->sharedram = node_page_state(pgdat, NR_SHMEM);
+       val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES);
+#ifdef CONFIG_HIGHMEM
+       for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
+               struct zone *zone = &pgdat->node_zones[zone_type];
+
+               if (is_highmem(zone)) {
+                       managed_highpages += zone_managed_pages(zone);
+                       free_highpages += zone_page_state(zone, NR_FREE_PAGES);
+               }
+       }
+       val->totalhigh = managed_highpages;
+       val->freehigh = free_highpages;
+#else
+       val->totalhigh = managed_highpages;
+       val->freehigh = free_highpages;
+#endif
+       val->mem_unit = PAGE_SIZE;
+}
+#endif
+
+/*
+ * Determine whether the node should be displayed or not, depending on whether
+ * SHOW_MEM_FILTER_NODES was passed to show_free_areas().
+ */
+static bool show_mem_node_skip(unsigned int flags, int nid, nodemask_t *nodemask)
+{
+       if (!(flags & SHOW_MEM_FILTER_NODES))
+               return false;
+
+       /*
+        * no node mask - aka implicit memory numa policy. Do not bother with
+        * the synchronization - read_mems_allowed_begin - because we do not
+        * have to be precise here.
+        */
+       if (!nodemask)
+               nodemask = &cpuset_current_mems_allowed;
+
+       return !node_isset(nid, *nodemask);
+}
+
+static void show_migration_types(unsigned char type)
+{
+       static const char types[MIGRATE_TYPES] = {
+               [MIGRATE_UNMOVABLE]     = 'U',
+               [MIGRATE_MOVABLE]       = 'M',
+               [MIGRATE_RECLAIMABLE]   = 'E',
+               [MIGRATE_HIGHATOMIC]    = 'H',
+#ifdef CONFIG_CMA
+               [MIGRATE_CMA]           = 'C',
+#endif
+#ifdef CONFIG_MEMORY_ISOLATION
+               [MIGRATE_ISOLATE]       = 'I',
+#endif
+       };
+       char tmp[MIGRATE_TYPES + 1];
+       char *p = tmp;
+       int i;
+
+       for (i = 0; i < MIGRATE_TYPES; i++) {
+               if (type & (1 << i))
+                       *p++ = types[i];
+       }
+
+       *p = '\0';
+       printk(KERN_CONT "(%s) ", tmp);
+}
+
+static bool node_has_managed_zones(pg_data_t *pgdat, int max_zone_idx)
+{
+       int zone_idx;
+       for (zone_idx = 0; zone_idx <= max_zone_idx; zone_idx++)
+               if (zone_managed_pages(pgdat->node_zones + zone_idx))
+                       return true;
+       return false;
+}
+
+/*
+ * Show free area list (used inside shift_scroll-lock stuff)
+ * We also calculate the percentage fragmentation. We do this by counting the
+ * memory on each free list with the exception of the first item on the list.
+ *
+ * Bits in @filter:
+ * SHOW_MEM_FILTER_NODES: suppress nodes that are not allowed by current's
+ *   cpuset.
+ */
+void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
+{
+       unsigned long free_pcp = 0;
+       int cpu, nid;
+       struct zone *zone;
+       pg_data_t *pgdat;
+
+       for_each_populated_zone(zone) {
+               if (zone_idx(zone) > max_zone_idx)
+                       continue;
+               if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
+                       continue;
+
+               for_each_online_cpu(cpu)
+                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+       }
+
+       printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
+               " active_file:%lu inactive_file:%lu isolated_file:%lu\n"
+               " unevictable:%lu dirty:%lu writeback:%lu\n"
+               " slab_reclaimable:%lu slab_unreclaimable:%lu\n"
+               " mapped:%lu shmem:%lu pagetables:%lu\n"
+               " sec_pagetables:%lu bounce:%lu\n"
+               " kernel_misc_reclaimable:%lu\n"
+               " free:%lu free_pcp:%lu free_cma:%lu\n",
+               global_node_page_state(NR_ACTIVE_ANON),
+               global_node_page_state(NR_INACTIVE_ANON),
+               global_node_page_state(NR_ISOLATED_ANON),
+               global_node_page_state(NR_ACTIVE_FILE),
+               global_node_page_state(NR_INACTIVE_FILE),
+               global_node_page_state(NR_ISOLATED_FILE),
+               global_node_page_state(NR_UNEVICTABLE),
+               global_node_page_state(NR_FILE_DIRTY),
+               global_node_page_state(NR_WRITEBACK),
+               global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B),
+               global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B),
+               global_node_page_state(NR_FILE_MAPPED),
+               global_node_page_state(NR_SHMEM),
+               global_node_page_state(NR_PAGETABLE),
+               global_node_page_state(NR_SECONDARY_PAGETABLE),
+               global_zone_page_state(NR_BOUNCE),
+               global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE),
+               global_zone_page_state(NR_FREE_PAGES),
+               free_pcp,
+               global_zone_page_state(NR_FREE_CMA_PAGES));
+
+       for_each_online_pgdat(pgdat) {
+               if (show_mem_node_skip(filter, pgdat->node_id, nodemask))
+                       continue;
+               if (!node_has_managed_zones(pgdat, max_zone_idx))
+                       continue;
+
+               printk("Node %d"
+                       " active_anon:%lukB"
+                       " inactive_anon:%lukB"
+                       " active_file:%lukB"
+                       " inactive_file:%lukB"
+                       " unevictable:%lukB"
+                       " isolated(anon):%lukB"
+                       " isolated(file):%lukB"
+                       " mapped:%lukB"
+                       " dirty:%lukB"
+                       " writeback:%lukB"
+                       " shmem:%lukB"
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+                       " shmem_thp: %lukB"
+                       " shmem_pmdmapped: %lukB"
+                       " anon_thp: %lukB"
+#endif
+                       " writeback_tmp:%lukB"
+                       " kernel_stack:%lukB"
+#ifdef CONFIG_SHADOW_CALL_STACK
+                       " shadow_call_stack:%lukB"
+#endif
+                       " pagetables:%lukB"
+                       " sec_pagetables:%lukB"
+                       " all_unreclaimable? %s"
+                       "\n",
+                       pgdat->node_id,
+                       K(node_page_state(pgdat, NR_ACTIVE_ANON)),
+                       K(node_page_state(pgdat, NR_INACTIVE_ANON)),
+                       K(node_page_state(pgdat, NR_ACTIVE_FILE)),
+                       K(node_page_state(pgdat, NR_INACTIVE_FILE)),
+                       K(node_page_state(pgdat, NR_UNEVICTABLE)),
+                       K(node_page_state(pgdat, NR_ISOLATED_ANON)),
+                       K(node_page_state(pgdat, NR_ISOLATED_FILE)),
+                       K(node_page_state(pgdat, NR_FILE_MAPPED)),
+                       K(node_page_state(pgdat, NR_FILE_DIRTY)),
+                       K(node_page_state(pgdat, NR_WRITEBACK)),
+                       K(node_page_state(pgdat, NR_SHMEM)),
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+                       K(node_page_state(pgdat, NR_SHMEM_THPS)),
+                       K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
+                       K(node_page_state(pgdat, NR_ANON_THPS)),
+#endif
+                       K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
+                       node_page_state(pgdat, NR_KERNEL_STACK_KB),
+#ifdef CONFIG_SHADOW_CALL_STACK
+                       node_page_state(pgdat, NR_KERNEL_SCS_KB),
+#endif
+                       K(node_page_state(pgdat, NR_PAGETABLE)),
+                       K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)),
+                       pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
+                               "yes" : "no");
+       }
+
+       for_each_populated_zone(zone) {
+               int i;
+
+               if (zone_idx(zone) > max_zone_idx)
+                       continue;
+               if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
+                       continue;
+
+               free_pcp = 0;
+               for_each_online_cpu(cpu)
+                       free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+
+               show_node(zone);
+               printk(KERN_CONT
+                       "%s"
+                       " free:%lukB"
+                       " boost:%lukB"
+                       " min:%lukB"
+                       " low:%lukB"
+                       " high:%lukB"
+                       " reserved_highatomic:%luKB"
+                       " active_anon:%lukB"
+                       " inactive_anon:%lukB"
+                       " active_file:%lukB"
+                       " inactive_file:%lukB"
+                       " unevictable:%lukB"
+                       " writepending:%lukB"
+                       " present:%lukB"
+                       " managed:%lukB"
+                       " mlocked:%lukB"
+                       " bounce:%lukB"
+                       " free_pcp:%lukB"
+                       " local_pcp:%ukB"
+                       " free_cma:%lukB"
+                       "\n",
+                       zone->name,
+                       K(zone_page_state(zone, NR_FREE_PAGES)),
+                       K(zone->watermark_boost),
+                       K(min_wmark_pages(zone)),
+                       K(low_wmark_pages(zone)),
+                       K(high_wmark_pages(zone)),
+                       K(zone->nr_reserved_highatomic),
+                       K(zone_page_state(zone, NR_ZONE_ACTIVE_ANON)),
+                       K(zone_page_state(zone, NR_ZONE_INACTIVE_ANON)),
+                       K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
+                       K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
+                       K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
+                       K(zone_page_state(zone, NR_ZONE_WRITE_PENDING)),
+                       K(zone->present_pages),
+                       K(zone_managed_pages(zone)),
+                       K(zone_page_state(zone, NR_MLOCK)),
+                       K(zone_page_state(zone, NR_BOUNCE)),
+                       K(free_pcp),
+                       K(this_cpu_read(zone->per_cpu_pageset->count)),
+                       K(zone_page_state(zone, NR_FREE_CMA_PAGES)));
+               printk("lowmem_reserve[]:");
+               for (i = 0; i < MAX_NR_ZONES; i++)
+                       printk(KERN_CONT " %ld", zone->lowmem_reserve[i]);
+               printk(KERN_CONT "\n");
+       }
+
+       for_each_populated_zone(zone) {
+               unsigned int order;
+               unsigned long nr[MAX_ORDER + 1], flags, total = 0;
+               unsigned char types[MAX_ORDER + 1];
+
+               if (zone_idx(zone) > max_zone_idx)
+                       continue;
+               if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
+                       continue;
+               show_node(zone);
+               printk(KERN_CONT "%s: ", zone->name);
+
+               spin_lock_irqsave(&zone->lock, flags);
+               for (order = 0; order <= MAX_ORDER; order++) {
+                       struct free_area *area = &zone->free_area[order];
+                       int type;
+
+                       nr[order] = area->nr_free;
+                       total += nr[order] << order;
+
+                       types[order] = 0;
+                       for (type = 0; type < MIGRATE_TYPES; type++) {
+                               if (!free_area_empty(area, type))
+                                       types[order] |= 1 << type;
+                       }
+               }
+               spin_unlock_irqrestore(&zone->lock, flags);
+               for (order = 0; order <= MAX_ORDER; order++) {
+                       printk(KERN_CONT "%lu*%lukB ",
+                              nr[order], K(1UL) << order);
+                       if (nr[order])
+                               show_migration_types(types[order]);
+               }
+               printk(KERN_CONT "= %lukB\n", K(total));
+       }
+
+       for_each_online_node(nid) {
+               if (show_mem_node_skip(filter, nid, nodemask))
+                       continue;
+               hugetlb_show_meminfo_node(nid);
+       }
+
+       printk("%ld total pagecache pages\n", global_node_page_state(NR_FILE_PAGES));
+
+       show_swap_cache_info();
+}
+
+void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
+{
+       unsigned long total = 0, reserved = 0, highmem = 0;
+       struct zone *zone;
+
+       printk("Mem-Info:\n");
+       __show_free_areas(filter, nodemask, max_zone_idx);
+
+       for_each_populated_zone(zone) {
+
+               total += zone->present_pages;
+               reserved += zone->present_pages - zone_managed_pages(zone);
+
+               if (is_highmem(zone))
+                       highmem += zone->present_pages;
+       }
+
+       printk("%lu pages RAM\n", total);
+       printk("%lu pages HighMem/MovableOnly\n", highmem);
+       printk("%lu pages reserved\n", reserved);
+#ifdef CONFIG_CMA
+       printk("%lu pages cma reserved\n", totalcma_pages);
+#endif
+#ifdef CONFIG_MEMORY_FAILURE
+       printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
+#endif
+}
index bb57f7f..b7817dc 100644 (file)
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1240,11 +1240,7 @@ void __init kmem_cache_init(void)
         * Initialize the caches that provide memory for the  kmem_cache_node
         * structures first.  Without this, further allocations will bug.
         */
-       kmalloc_caches[KMALLOC_NORMAL][INDEX_NODE] = create_kmalloc_cache(
-                               kmalloc_info[INDEX_NODE].name[KMALLOC_NORMAL],
-                               kmalloc_info[INDEX_NODE].size,
-                               ARCH_KMALLOC_FLAGS, 0,
-                               kmalloc_info[INDEX_NODE].size);
+       new_kmalloc_cache(INDEX_NODE, KMALLOC_NORMAL, ARCH_KMALLOC_FLAGS);
        slab_state = PARTIAL_NODE;
        setup_kmalloc_cache_index_table();
 
index f01ac25..a59c8e5 100644 (file)
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -6,6 +6,38 @@
  */
 void __init kmem_cache_init(void);
 
+#ifdef CONFIG_64BIT
+# ifdef system_has_cmpxchg128
+# define system_has_freelist_aba()     system_has_cmpxchg128()
+# define try_cmpxchg_freelist          try_cmpxchg128
+# endif
+#define this_cpu_try_cmpxchg_freelist  this_cpu_try_cmpxchg128
+typedef u128 freelist_full_t;
+#else /* CONFIG_64BIT */
+# ifdef system_has_cmpxchg64
+# define system_has_freelist_aba()     system_has_cmpxchg64()
+# define try_cmpxchg_freelist          try_cmpxchg64
+# endif
+#define this_cpu_try_cmpxchg_freelist  this_cpu_try_cmpxchg64
+typedef u64 freelist_full_t;
+#endif /* CONFIG_64BIT */
+
+#if defined(system_has_freelist_aba) && !defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
+#undef system_has_freelist_aba
+#endif
+
+/*
+ * Freelist pointer and counter to cmpxchg together, avoids the typical ABA
+ * problems with cmpxchg of just a pointer.
+ */
+typedef union {
+       struct {
+               void *freelist;
+               unsigned long counter;
+       };
+       freelist_full_t full;
+} freelist_aba_t;
+
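
The union above lets SLUB compare-and-swap the freelist pointer together with a counter, so a stale pointer that happens to reappear ("ABA") still fails the exchange. A simplified userspace sketch of the same idea, packing a 32-bit head and a 32-bit generation counter into one 64-bit word (the kernel uses the full pointer plus counters via cmpxchg128/cmpxchg64):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

union tagged {
	struct {
		uint32_t head;		/* stands in for the freelist head */
		uint32_t counter;	/* bumped on every successful update */
	};
	uint64_t full;
};

static _Atomic uint64_t slot;

static bool update_head(uint32_t old_head, uint32_t new_head)
{
	union tagged old, new;

	old.full = atomic_load(&slot);
	if (old.head != old_head)
		return false;
	new.head = new_head;
	new.counter = old.counter + 1;	/* a stale copy now fails the CAS */
	return atomic_compare_exchange_strong(&slot, &old.full, new.full);
}

int main(void)
{
	atomic_store(&slot, 0);
	return !update_head(0, 1);	/* 0 on success */
}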
 /* Reuses the bits in struct page */
 struct slab {
        unsigned long __page_flags;
@@ -38,14 +70,21 @@ struct slab {
 #endif
                        };
                        /* Double-word boundary */
-                       void *freelist;         /* first free object */
                        union {
-                               unsigned long counters;
                                struct {
-                                       unsigned inuse:16;
-                                       unsigned objects:15;
-                                       unsigned frozen:1;
+                                       void *freelist;         /* first free object */
+                                       union {
+                                               unsigned long counters;
+                                               struct {
+                                                       unsigned inuse:16;
+                                                       unsigned objects:15;
+                                                       unsigned frozen:1;
+                                               };
+                                       };
                                };
+#ifdef system_has_freelist_aba
+                               freelist_aba_t freelist_counter;
+#endif
                        };
                };
                struct rcu_head rcu_head;
@@ -72,8 +111,8 @@ SLAB_MATCH(memcg_data, memcg_data);
 #endif
 #undef SLAB_MATCH
 static_assert(sizeof(struct slab) <= sizeof(struct page));
-#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && defined(CONFIG_SLUB)
-static_assert(IS_ALIGNED(offsetof(struct slab, freelist), 2*sizeof(void *)));
+#if defined(system_has_freelist_aba) && defined(CONFIG_SLUB)
+static_assert(IS_ALIGNED(offsetof(struct slab, freelist), sizeof(freelist_aba_t)));
 #endif
 
 /**
@@ -255,9 +294,8 @@ gfp_t kmalloc_fix_flags(gfp_t flags);
 /* Functions provided by the slab allocators */
 int __kmem_cache_create(struct kmem_cache *, slab_flags_t flags);
 
-struct kmem_cache *create_kmalloc_cache(const char *name, unsigned int size,
-                       slab_flags_t flags, unsigned int useroffset,
-                       unsigned int usersize);
+void __init new_kmalloc_cache(int idx, enum kmalloc_cache_type type,
+                             slab_flags_t flags);
 extern void create_boot_cache(struct kmem_cache *, const char *name,
                        unsigned int size, slab_flags_t flags,
                        unsigned int useroffset, unsigned int usersize);
index 6072497..43c0081 100644 (file)
@@ -17,6 +17,8 @@
 #include <linux/cpu.h>
 #include <linux/uaccess.h>
 #include <linux/seq_file.h>
+#include <linux/dma-mapping.h>
+#include <linux/swiotlb.h>
 #include <linux/proc_fs.h>
 #include <linux/debugfs.h>
 #include <linux/kasan.h>
@@ -658,17 +660,16 @@ void __init create_boot_cache(struct kmem_cache *s, const char *name,
        s->refcount = -1;       /* Exempt from merging for now */
 }
 
-struct kmem_cache *__init create_kmalloc_cache(const char *name,
-               unsigned int size, slab_flags_t flags,
-               unsigned int useroffset, unsigned int usersize)
+static struct kmem_cache *__init create_kmalloc_cache(const char *name,
+                                                     unsigned int size,
+                                                     slab_flags_t flags)
 {
        struct kmem_cache *s = kmem_cache_zalloc(kmem_cache, GFP_NOWAIT);
 
        if (!s)
                panic("Out of memory when creating slab %s\n", name);
 
-       create_boot_cache(s, name, size, flags | SLAB_KMALLOC, useroffset,
-                                                               usersize);
+       create_boot_cache(s, name, size, flags | SLAB_KMALLOC, 0, size);
        list_add(&s->list, &slab_caches);
        s->refcount = 1;
        return s;
@@ -863,9 +864,22 @@ void __init setup_kmalloc_cache_index_table(void)
        }
 }
 
-static void __init
+static unsigned int __kmalloc_minalign(void)
+{
+#ifdef CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC
+       if (io_tlb_default_mem.nslabs)
+               return ARCH_KMALLOC_MINALIGN;
+#endif
+       return dma_get_cache_alignment();
+}
+
+void __init
 new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags)
 {
+       unsigned int minalign = __kmalloc_minalign();
+       unsigned int aligned_size = kmalloc_info[idx].size;
+       int aligned_idx = idx;
+
        if ((KMALLOC_RECLAIM != KMALLOC_NORMAL) && (type == KMALLOC_RECLAIM)) {
                flags |= SLAB_RECLAIM_ACCOUNT;
        } else if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_CGROUP)) {
@@ -878,10 +892,17 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags)
                flags |= SLAB_CACHE_DMA;
        }
 
-       kmalloc_caches[type][idx] = create_kmalloc_cache(
-                                       kmalloc_info[idx].name[type],
-                                       kmalloc_info[idx].size, flags, 0,
-                                       kmalloc_info[idx].size);
+       if (minalign > ARCH_KMALLOC_MINALIGN) {
+               aligned_size = ALIGN(aligned_size, minalign);
+               aligned_idx = __kmalloc_index(aligned_size, false);
+       }
+
+       if (!kmalloc_caches[type][aligned_idx])
+               kmalloc_caches[type][aligned_idx] = create_kmalloc_cache(
+                                       kmalloc_info[aligned_idx].name[type],
+                                       aligned_size, flags);
+       if (idx != aligned_idx)
+               kmalloc_caches[type][idx] = kmalloc_caches[type][aligned_idx];
 
        /*
         * If CONFIG_MEMCG_KMEM is enabled, disable cache merging for
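
The aligned_size/aligned_idx logic above rounds small kmalloc caches up to the DMA cacheline alignment and aliases the smaller index to the bigger cache. A toy illustration of the rounding, assuming a 128-byte dma_get_cache_alignment():

#include <stdio.h>

#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned int minalign = 128;	/* assumed dma_get_cache_alignment() */
	unsigned int sizes[] = { 8, 96, 192, 256 };

	for (unsigned int i = 0; i < 4; i++)
		printf("kmalloc-%u -> served by kmalloc-%u\n",
		       sizes[i], ALIGN_UP(sizes[i], minalign));
	return 0;
}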
index c87628c..7529626 100644 (file)
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -292,7 +292,12 @@ static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
 /* Poison object */
 #define __OBJECT_POISON                ((slab_flags_t __force)0x80000000U)
 /* Use cmpxchg_double */
+
+#ifdef system_has_freelist_aba
 #define __CMPXCHG_DOUBLE       ((slab_flags_t __force)0x40000000U)
+#else
+#define __CMPXCHG_DOUBLE       ((slab_flags_t __force)0U)
+#endif
 
 /*
  * Tracking user of a slab.
@@ -512,6 +517,40 @@ static __always_inline void slab_unlock(struct slab *slab)
        __bit_spin_unlock(PG_locked, &page->flags);
 }
 
+static inline bool
+__update_freelist_fast(struct slab *slab,
+                     void *freelist_old, unsigned long counters_old,
+                     void *freelist_new, unsigned long counters_new)
+{
+#ifdef system_has_freelist_aba
+       freelist_aba_t old = { .freelist = freelist_old, .counter = counters_old };
+       freelist_aba_t new = { .freelist = freelist_new, .counter = counters_new };
+
+       return try_cmpxchg_freelist(&slab->freelist_counter.full, &old.full, new.full);
+#else
+       return false;
+#endif
+}
+
+static inline bool
+__update_freelist_slow(struct slab *slab,
+                     void *freelist_old, unsigned long counters_old,
+                     void *freelist_new, unsigned long counters_new)
+{
+       bool ret = false;
+
+       slab_lock(slab);
+       if (slab->freelist == freelist_old &&
+           slab->counters == counters_old) {
+               slab->freelist = freelist_new;
+               slab->counters = counters_new;
+               ret = true;
+       }
+       slab_unlock(slab);
+
+       return ret;
+}
+
 /*
  * Interrupts must be disabled (for the fallback code to work right), typically
  * by an _irqsave() lock variant. On PREEMPT_RT the preempt_disable(), which is
@@ -519,33 +558,25 @@ static __always_inline void slab_unlock(struct slab *slab)
  * allocation/ free operation in hardirq context. Therefore nothing can
  * interrupt the operation.
  */
-static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab,
+static inline bool __slab_update_freelist(struct kmem_cache *s, struct slab *slab,
                void *freelist_old, unsigned long counters_old,
                void *freelist_new, unsigned long counters_new,
                const char *n)
 {
+       bool ret;
+
        if (USE_LOCKLESS_FAST_PATH())
                lockdep_assert_irqs_disabled();
-#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
-    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
+
        if (s->flags & __CMPXCHG_DOUBLE) {
-               if (cmpxchg_double(&slab->freelist, &slab->counters,
-                                  freelist_old, counters_old,
-                                  freelist_new, counters_new))
-                       return true;
-       } else
-#endif
-       {
-               slab_lock(slab);
-               if (slab->freelist == freelist_old &&
-                                       slab->counters == counters_old) {
-                       slab->freelist = freelist_new;
-                       slab->counters = counters_new;
-                       slab_unlock(slab);
-                       return true;
-               }
-               slab_unlock(slab);
+               ret = __update_freelist_fast(slab, freelist_old, counters_old,
+                                           freelist_new, counters_new);
+       } else {
+               ret = __update_freelist_slow(slab, freelist_old, counters_old,
+                                           freelist_new, counters_new);
        }
+       if (likely(ret))
+               return true;
 
        cpu_relax();
        stat(s, CMPXCHG_DOUBLE_FAIL);
@@ -557,36 +588,26 @@ static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab
        return false;
 }
 
-static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct slab *slab,
+static inline bool slab_update_freelist(struct kmem_cache *s, struct slab *slab,
                void *freelist_old, unsigned long counters_old,
                void *freelist_new, unsigned long counters_new,
                const char *n)
 {
-#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
-    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
+       bool ret;
+
        if (s->flags & __CMPXCHG_DOUBLE) {
-               if (cmpxchg_double(&slab->freelist, &slab->counters,
-                                  freelist_old, counters_old,
-                                  freelist_new, counters_new))
-                       return true;
-       } else
-#endif
-       {
+               ret = __update_freelist_fast(slab, freelist_old, counters_old,
+                                           freelist_new, counters_new);
+       } else {
                unsigned long flags;
 
                local_irq_save(flags);
-               slab_lock(slab);
-               if (slab->freelist == freelist_old &&
-                                       slab->counters == counters_old) {
-                       slab->freelist = freelist_new;
-                       slab->counters = counters_new;
-                       slab_unlock(slab);
-                       local_irq_restore(flags);
-                       return true;
-               }
-               slab_unlock(slab);
+               ret = __update_freelist_slow(slab, freelist_old, counters_old,
+                                           freelist_new, counters_new);
                local_irq_restore(flags);
        }
+       if (likely(ret))
+               return true;
 
        cpu_relax();
        stat(s, CMPXCHG_DOUBLE_FAIL);
@@ -2228,7 +2249,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
        VM_BUG_ON(new.frozen);
        new.frozen = 1;
 
-       if (!__cmpxchg_double_slab(s, slab,
+       if (!__slab_update_freelist(s, slab,
                        freelist, counters,
                        new.freelist, new.counters,
                        "acquire_slab"))
@@ -2554,7 +2575,7 @@ redo:
        }
 
 
-       if (!cmpxchg_double_slab(s, slab,
+       if (!slab_update_freelist(s, slab,
                                old.freelist, old.counters,
                                new.freelist, new.counters,
                                "unfreezing slab")) {
@@ -2611,7 +2632,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
 
                        new.frozen = 0;
 
-               } while (!__cmpxchg_double_slab(s, slab,
+               } while (!__slab_update_freelist(s, slab,
                                old.freelist, old.counters,
                                new.freelist, new.counters,
                                "unfreezing slab"));
@@ -3008,6 +3029,18 @@ static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags)
 }
 
 #ifndef CONFIG_SLUB_TINY
+static inline bool
+__update_cpu_freelist_fast(struct kmem_cache *s,
+                          void *freelist_old, void *freelist_new,
+                          unsigned long tid)
+{
+       freelist_aba_t old = { .freelist = freelist_old, .counter = tid };
+       freelist_aba_t new = { .freelist = freelist_new, .counter = next_tid(tid) };
+
+       return this_cpu_try_cmpxchg_freelist(s->cpu_slab->freelist_tid.full,
+                                            &old.full, new.full);
+}
+
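__update_cpu_freelist_fast() is consumed by the lockless allocation and free
fast paths further down: the per-cpu transaction id (tid) is advanced on every
operation, so the combined {freelist, tid} compare-and-exchange fails whenever
the task was preempted or migrated between reading the freelist and publishing
the update. A simplified sketch of that control flow, with c standing for the
per-cpu kmem_cache_cpu pointer and slow_path() as a placeholder; the real code
in slab_alloc_node() uses a redo: label, barriers and failure statistics that
are omitted here:

        do {
                tid    = READ_ONCE(c->tid);             /* per-cpu transaction id  */
                object = READ_ONCE(c->freelist);        /* current first free obj  */
                if (!object)
                        return slow_path();             /* placeholder, not real   */
                next   = get_freepointer(s, object);    /* next object in the list */
        } while (!__update_cpu_freelist_fast(s, object, next, tid));
        return object;
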
 /*
  * Check the slab->freelist and either transfer the freelist to the
  * per cpu freelist or deactivate the slab.
@@ -3034,7 +3067,7 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
                new.inuse = slab->objects;
                new.frozen = freelist != NULL;
 
-       } while (!__cmpxchg_double_slab(s, slab,
+       } while (!__slab_update_freelist(s, slab,
                freelist, counters,
                NULL, new.counters,
                "get_freelist"));
@@ -3359,11 +3392,7 @@ redo:
                 * against code executing on this cpu *not* from access by
                 * other cpus.
                 */
-               if (unlikely(!this_cpu_cmpxchg_double(
-                               s->cpu_slab->freelist, s->cpu_slab->tid,
-                               object, tid,
-                               next_object, next_tid(tid)))) {
-
+               if (unlikely(!__update_cpu_freelist_fast(s, object, next_object, tid))) {
                        note_cmpxchg_failure("slab_alloc", s, tid);
                        goto redo;
                }
@@ -3631,7 +3660,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
                        }
                }
 
-       } while (!cmpxchg_double_slab(s, slab,
+       } while (!slab_update_freelist(s, slab,
                prior, counters,
                head, new.counters,
                "__slab_free"));
@@ -3736,11 +3765,7 @@ redo:
 
                set_freepointer(s, tail_obj, freelist);
 
-               if (unlikely(!this_cpu_cmpxchg_double(
-                               s->cpu_slab->freelist, s->cpu_slab->tid,
-                               freelist, tid,
-                               head, next_tid(tid)))) {
-
+               if (unlikely(!__update_cpu_freelist_fast(s, freelist, head, tid))) {
                        note_cmpxchg_failure("slab_free", s, tid);
                        goto redo;
                }
@@ -4505,11 +4530,11 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
                }
        }
 
-#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
-    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
-       if (system_has_cmpxchg_double() && (s->flags & SLAB_NO_CMPXCHG) == 0)
+#ifdef system_has_freelist_aba
+       if (system_has_freelist_aba() && !(s->flags & SLAB_NO_CMPXCHG)) {
                /* Enable fast mode */
                s->flags |= __CMPXCHG_DOUBLE;
+       }
 #endif
 
        /*
index 10d73a0..a044a13 100644 (file)
@@ -133,7 +133,7 @@ static void * __meminit altmap_alloc_block_buf(unsigned long size,
 void __meminit vmemmap_verify(pte_t *pte, int node,
                                unsigned long start, unsigned long end)
 {
-       unsigned long pfn = pte_pfn(*pte);
+       unsigned long pfn = pte_pfn(ptep_get(pte));
        int actual_node = early_pfn_to_nid(pfn);
 
        if (node_distance(actual_node, node) > LOCAL_DISTANCE)
@@ -146,7 +146,7 @@ pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
                                       struct page *reuse)
 {
        pte_t *pte = pte_offset_kernel(pmd, addr);
-       if (pte_none(*pte)) {
+       if (pte_none(ptep_get(pte))) {
                pte_t entry;
                void *p;
 
@@ -414,7 +414,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
                 * with just tail struct pages.
                 */
                return vmemmap_populate_range(start, end, node, NULL,
-                                             pte_page(*pte));
+                                             pte_page(ptep_get(pte)));
        }
 
        size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page));
@@ -438,7 +438,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
                 */
                next += PAGE_SIZE;
                rc = vmemmap_populate_range(next, last, node, NULL,
-                                           pte_page(*pte));
+                                           pte_page(ptep_get(pte)));
                if (rc)
                        return -ENOMEM;
        }
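
These vmemmap hunks are part of a broader conversion away from dereferencing
page-table entries directly and toward the ptep_get() accessor, whose generic
implementation is a READ_ONCE() of the slot but which an architecture can
override, for instance to read a multi-part PTE consistently. The pattern in
isolation:

        pte_t val;

        val = *ptep;            /* before: plain dereference of the PTE slot    */
        val = ptep_get(ptep);   /* after: accessor, READ_ONCE(*ptep) by default,
                                 * overridable per architecture                  */
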
index c2afdb2..297a8b7 100644 (file)
@@ -701,7 +701,7 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
        return rc;
 }
 #else
-struct page * __meminit populate_section_memmap(unsigned long pfn,
+static struct page * __meminit populate_section_memmap(unsigned long pfn,
                unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
                struct dev_pagemap *pgmap)
 {
@@ -922,10 +922,14 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
        return 0;
 }
 
-void sparse_remove_section(struct mem_section *ms, unsigned long pfn,
-               unsigned long nr_pages, unsigned long map_offset,
-               struct vmem_altmap *altmap)
+void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
+                          struct vmem_altmap *altmap)
 {
+       struct mem_section *ms = __pfn_to_section(pfn);
+
+       if (WARN_ON_ONCE(!valid_section(ms)))
+               return;
+
        section_deactivate(pfn, nr_pages, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
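
With the section lookup and the sanity check moved inside the function, callers
of the hot-remove path now pass only the pfn range:

        /* The mem_section is derived internally from the pfn: */
        sparse_remove_section(pfn, nr_pages, altmap);
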
index 423199e..cd8f015 100644 (file)
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -76,7 +76,7 @@ static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = {
 
 /*
  * This path almost never happens for VM activity - pages are normally freed
- * via pagevecs.  But it gets used by networking - and for compound pages.
+ * in batches.  But it gets used by networking - and for compound pages.
  */
 static void __page_cache_release(struct folio *folio)
 {
@@ -1044,25 +1044,25 @@ void release_pages(release_pages_arg arg, int nr)
 EXPORT_SYMBOL(release_pages);
 
 /*
- * The pages which we're about to release may be in the deferred lru-addition
+ * The folios which we're about to release may be in the deferred lru-addition
  * queues.  That would prevent them from really being freed right now.  That's
- * OK from a correctness point of view but is inefficient - those pages may be
+ * OK from a correctness point of view but is inefficient - those folios may be
  * cache-warm and we want to give them back to the page allocator ASAP.
  *
- * So __pagevec_release() will drain those queues here.
+ * So __folio_batch_release() will drain those queues here.
  * folio_batch_move_lru() calls folios_put() directly to avoid
  * mutual recursion.
  */
-void __pagevec_release(struct pagevec *pvec)
+void __folio_batch_release(struct folio_batch *fbatch)
 {
-       if (!pvec->percpu_pvec_drained) {
+       if (!fbatch->percpu_pvec_drained) {
                lru_add_drain();
-               pvec->percpu_pvec_drained = true;
+               fbatch->percpu_pvec_drained = true;
        }
-       release_pages(pvec->pages, pagevec_count(pvec));
-       pagevec_reinit(pvec);
+       release_pages(fbatch->folios, folio_batch_count(fbatch));
+       folio_batch_reinit(fbatch);
 }
-EXPORT_SYMBOL(__pagevec_release);
+EXPORT_SYMBOL(__folio_batch_release);
 
 /**
  * folio_batch_remove_exceptionals() - Prune non-folios from a batch.
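
A minimal sketch of how a caller typically drives the renamed helper, assuming
the usual folio_batch API from <linux/pagevec.h>; next_folio() is a stand-in
for whatever produces referenced folios:

        struct folio_batch fbatch;
        struct folio *folio;

        folio_batch_init(&fbatch);
        while ((folio = next_folio()) != NULL) {        /* hypothetical source */
                if (!folio_batch_add(&fbatch, folio))   /* 0 => batch is full  */
                        __folio_batch_release(&fbatch); /* drop refs + reinit  */
        }
        __folio_batch_release(&fbatch);                 /* flush the remainder */
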
index b76a65a..f8ea701 100644 (file)
@@ -16,7 +16,6 @@
 #include <linux/pagemap.h>
 #include <linux/backing-dev.h>
 #include <linux/blkdev.h>
-#include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/vmalloc.h>
 #include <linux/swap_slots.h>
@@ -275,9 +274,9 @@ void clear_shadow_from_swap_cache(int type, unsigned long begin,
        }
 }
 
-/* 
- * If we are the only user, then try to free up the swap cache. 
- * 
+/*
+ * If we are the only user, then try to free up the swap cache.
+ *
  * It's OK to check the swapcache flag without the folio lock
  * here because we are going to recheck again inside
  * folio_free_swap() _with_ the lock.
@@ -294,7 +293,7 @@ void free_swap_cache(struct page *page)
        }
 }
 
-/* 
+/*
  * Perform a free_page(), also freeing any swap cache associated with
  * this page if it is the last user of the page.
  */
@@ -417,9 +416,13 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 {
        struct swap_info_struct *si;
        struct folio *folio;
+       struct page *page;
        void *shadow = NULL;
 
        *new_page_allocated = false;
+       si = get_swap_device(entry);
+       if (!si)
+               return NULL;
 
        for (;;) {
                int err;
@@ -428,14 +431,12 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
                 * called after swap_cache_get_folio() failed, re-calling
                 * that would confuse statistics.
                 */
-               si = get_swap_device(entry);
-               if (!si)
-                       return NULL;
                folio = filemap_get_folio(swap_address_space(entry),
                                                swp_offset(entry));
-               put_swap_device(si);
-               if (!IS_ERR(folio))
-                       return folio_file_page(folio, swp_offset(entry));
+               if (!IS_ERR(folio)) {
+                       page = folio_file_page(folio, swp_offset(entry));
+                       goto got_page;
+               }
 
                /*
                 * Just skip read ahead for unused swap slot.
@@ -445,8 +446,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
                 * as SWAP_HAS_CACHE.  That's done in later part of code or
                 * else swap_off will be aborted if we return NULL.
                 */
-               if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
-                       return NULL;
+               if (!swap_swapcount(si, entry) && swap_slot_cache_enabled)
+                       goto fail_put_swap;
 
                /*
                 * Get a new page to read into from swap.  Allocate it now,
@@ -455,7 +456,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
                 */
                folio = vma_alloc_folio(gfp_mask, 0, vma, addr, false);
                if (!folio)
-                       return NULL;
+                        goto fail_put_swap;
 
                /*
                 * Swap entry may have been freed since our caller observed it.
@@ -466,7 +467,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 
                folio_put(folio);
                if (err != -EEXIST)
-                       return NULL;
+                       goto fail_put_swap;
 
                /*
                 * We might race against __delete_from_swap_cache(), and
@@ -500,12 +501,17 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
        /* Caller will initiate read into locked folio */
        folio_add_lru(folio);
        *new_page_allocated = true;
-       return &folio->page;
+       page = &folio->page;
+got_page:
+       put_swap_device(si);
+       return page;
 
 fail_unlock:
        put_swap_folio(folio, entry);
        folio_unlock(folio);
        folio_put(folio);
+fail_put_swap:
+       put_swap_device(si);
        return NULL;
 }
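
The reshuffled error paths all serve one idea: take a reference on the swap
device once at function entry and drop it on every exit, so a concurrent
swapoff cannot invalidate the entry anywhere in between. The bracket, reduced
to its skeleton:

        struct swap_info_struct *si;

        si = get_swap_device(entry);
        if (!si)
                return NULL;    /* device already gone, the entry is stale */
        /* ... look up or populate the swap cache for 'entry' ... */
        put_swap_device(si);
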
 
@@ -514,6 +520,10 @@ fail_unlock:
  * and reading the disk if it is not already cached.
  * A failure return means that either the page allocation failed or that
  * the swap entry is no longer in use.
+ *
+ * Callers do not need to bracket this function with get/put_swap_device(),
+ * because __read_swap_cache_async() calls them itself and swap_readpage()
+ * holds the swap cache folio lock.
  */
 struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
                                   struct vm_area_struct *vma,
@@ -698,6 +708,14 @@ void exit_swap_address_space(unsigned int type)
        swapper_spaces[type] = NULL;
 }
 
+#define SWAP_RA_ORDER_CEILING  5
+
+struct vma_swap_readahead {
+       unsigned short win;
+       unsigned short offset;
+       unsigned short nr_pte;
+};
+
 static void swap_ra_info(struct vm_fault *vmf,
                         struct vma_swap_readahead *ra_info)
 {
@@ -705,11 +723,7 @@ static void swap_ra_info(struct vm_fault *vmf,
        unsigned long ra_val;
        unsigned long faddr, pfn, fpfn, lpfn, rpfn;
        unsigned long start, end;
-       pte_t *pte, *orig_pte;
        unsigned int max_win, hits, prev_win, win;
-#ifndef CONFIG_64BIT
-       pte_t *tpte;
-#endif
 
        max_win = 1 << min_t(unsigned int, READ_ONCE(page_cluster),
                             SWAP_RA_ORDER_CEILING);
@@ -728,12 +742,9 @@ static void swap_ra_info(struct vm_fault *vmf,
                                               max_win, prev_win);
        atomic_long_set(&vma->swap_readahead_info,
                        SWAP_RA_VAL(faddr, win, 0));
-
        if (win == 1)
                return;
 
-       /* Copy the PTEs because the page table may be unmapped */
-       orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
        if (fpfn == pfn + 1) {
                lpfn = fpfn;
                rpfn = fpfn + win;
@@ -753,15 +764,6 @@ static void swap_ra_info(struct vm_fault *vmf,
 
        ra_info->nr_pte = end - start;
        ra_info->offset = fpfn - start;
-       pte -= ra_info->offset;
-#ifdef CONFIG_64BIT
-       ra_info->ptes = pte;
-#else
-       tpte = ra_info->ptes;
-       for (pfn = start; pfn != end; pfn++)
-               *tpte++ = *pte++;
-#endif
-       pte_unmap(orig_pte);
 }
 
 /**
@@ -785,7 +787,8 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
        struct swap_iocb *splug = NULL;
        struct vm_area_struct *vma = vmf->vma;
        struct page *page;
-       pte_t *pte, pentry;
+       pte_t *pte = NULL, pentry;
+       unsigned long addr;
        swp_entry_t entry;
        unsigned int i;
        bool page_allocated;
@@ -797,17 +800,25 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
        if (ra_info.win == 1)
                goto skip;
 
+       addr = vmf->address - (ra_info.offset * PAGE_SIZE);
+
        blk_start_plug(&plug);
-       for (i = 0, pte = ra_info.ptes; i < ra_info.nr_pte;
-            i++, pte++) {
-               pentry = *pte;
+       for (i = 0; i < ra_info.nr_pte; i++, addr += PAGE_SIZE) {
+               if (!pte++) {
+                       pte = pte_offset_map(vmf->pmd, addr);
+                       if (!pte)
+                               break;
+               }
+               pentry = ptep_get_lockless(pte);
                if (!is_swap_pte(pentry))
                        continue;
                entry = pte_to_swp_entry(pentry);
                if (unlikely(non_swap_entry(entry)))
                        continue;
+               pte_unmap(pte);
+               pte = NULL;
                page = __read_swap_cache_async(entry, gfp_mask, vma,
-                                              vmf->address, &page_allocated);
+                                              addr, &page_allocated);
                if (!page)
                        continue;
                if (page_allocated) {
@@ -819,6 +830,8 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
                }
                put_page(page);
        }
+       if (pte)
+               pte_unmap(pte);
        blk_finish_plug(&plug);
        swap_read_unplug(splug);
        lru_add_drain();
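
The readahead loop now maps the page table on demand and drops the mapping
before anything that can block, since pte_offset_map() may legitimately fail
once page tables can be freed or replaced underneath us. Stripped of the
readahead specifics, the shape is roughly as follows; needs_work() and
do_blocking_work() are placeholders, and the walked range is assumed to stay
within one page table:

        pte_t *pte = NULL;

        for (i = 0; i < nr; i++, addr += PAGE_SIZE) {
                if (!pte) {
                        pte = pte_offset_map(pmd, addr);
                        if (!pte)
                                break;                  /* no page table here   */
                }
                entry = ptep_get_lockless(pte);
                if (!needs_work(entry)) {               /* placeholder check    */
                        pte++;                          /* stay mapped          */
                        continue;
                }
                pte_unmap(pte);                         /* drop before blocking */
                pte = NULL;
                do_blocking_work(entry);                /* placeholder          */
        }
        if (pte)
                pte_unmap(pte);
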
index 274bbf7..8e6dde6 100644 (file)
@@ -41,6 +41,7 @@
 #include <linux/swap_slots.h>
 #include <linux/sort.h>
 #include <linux/completion.h>
+#include <linux/suspend.h>
 
 #include <asm/tlbflush.h>
 #include <linux/swapops.h>
@@ -1219,6 +1220,13 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
 }
 
 /*
+ * When we get a swap entry, if there is nothing else preventing swapoff
+ * (such as holding the lock of a folio in the swap cache, or holding a
+ * page table lock), the swap entry may become invalid because of swapoff.
+ * In that case, all swap-related calls must be enclosed in
+ * get_swap_device() and put_swap_device(), unless the swap function
+ * already calls get/put_swap_device() by itself.
+ *
  * Check whether swap entry is valid in the swap device.  If so,
  * return pointer to swap_info_struct, and keep the swap entry valid
  * via preventing the swap device from being swapoff, until
@@ -1227,9 +1235,8 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
  * Notice that swapoff or swapoff+swapon can still happen before the
  * percpu_ref_tryget_live() in get_swap_device() or after the
  * percpu_ref_put() in put_swap_device() if there isn't any other way
- * to prevent swapoff, such as page lock, page table lock, etc.  The
- * caller must be prepared for that.  For example, the following
- * situation is possible.
+ * to prevent swapoff.  The caller must be prepared for that.  For
+ * example, the following situation is possible.
  *
  *   CPU1                              CPU2
  *   do_swap_page()
@@ -1432,16 +1439,10 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
 
 int __swap_count(swp_entry_t entry)
 {
-       struct swap_info_struct *si;
+       struct swap_info_struct *si = swp_swap_info(entry);
        pgoff_t offset = swp_offset(entry);
-       int count = 0;
 
-       si = get_swap_device(entry);
-       if (si) {
-               count = swap_count(si->swap_map[offset]);
-               put_swap_device(si);
-       }
-       return count;
+       return swap_count(si->swap_map[offset]);
 }
 
 /*
@@ -1449,7 +1450,7 @@ int __swap_count(swp_entry_t entry)
  * This does not give an exact answer when swap count is continued,
  * but does include the high COUNT_CONTINUED flag to allow for that.
  */
-static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
+int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 {
        pgoff_t offset = swp_offset(entry);
        struct swap_cluster_info *ci;
@@ -1463,24 +1464,6 @@ static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 
 /*
  * How many references to @entry are currently swapped out?
- * This does not give an exact answer when swap count is continued,
- * but does include the high COUNT_CONTINUED flag to allow for that.
- */
-int __swp_swapcount(swp_entry_t entry)
-{
-       int count = 0;
-       struct swap_info_struct *si;
-
-       si = get_swap_device(entry);
-       if (si) {
-               count = swap_swapcount(si, entry);
-               put_swap_device(si);
-       }
-       return count;
-}
-
-/*
- * How many references to @entry are currently swapped out?
  * This considers COUNT_CONTINUED so it returns exact answer.
  */
 int swp_swapcount(swp_entry_t entry)
@@ -1762,7 +1745,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
        struct page *page = folio_file_page(folio, swp_offset(entry));
        struct page *swapcache;
        spinlock_t *ptl;
-       pte_t *pte, new_pte;
+       pte_t *pte, new_pte, old_pte;
        bool hwposioned = false;
        int ret = 1;
 
@@ -1774,11 +1757,14 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
                hwposioned = true;
 
        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-       if (unlikely(!pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) {
+       if (unlikely(!pte || !pte_same_as_swp(ptep_get(pte),
+                                               swp_entry_to_pte(entry)))) {
                ret = 0;
                goto out;
        }
 
+       old_pte = ptep_get(pte);
+
        if (unlikely(hwposioned || !PageUptodate(page))) {
                swp_entry_t swp_entry;
 
@@ -1810,7 +1796,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
                 * call and have the page locked.
                 */
                VM_BUG_ON_PAGE(PageWriteback(page), page);
-               if (pte_swp_exclusive(*pte))
+               if (pte_swp_exclusive(old_pte))
                        rmap_flags |= RMAP_EXCLUSIVE;
 
                page_add_anon_rmap(page, vma, addr, rmap_flags);
@@ -1819,15 +1805,16 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
                lru_cache_add_inactive_or_unevictable(page, vma);
        }
        new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot));
-       if (pte_swp_soft_dirty(*pte))
+       if (pte_swp_soft_dirty(old_pte))
                new_pte = pte_mksoft_dirty(new_pte);
-       if (pte_swp_uffd_wp(*pte))
+       if (pte_swp_uffd_wp(old_pte))
                new_pte = pte_mkuffd_wp(new_pte);
 setpte:
        set_pte_at(vma->vm_mm, addr, pte, new_pte);
        swap_free(entry);
 out:
-       pte_unmap_unlock(pte, ptl);
+       if (pte)
+               pte_unmap_unlock(pte, ptl);
        if (page != swapcache) {
                unlock_page(page);
                put_page(page);
@@ -1839,27 +1826,37 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
                        unsigned long addr, unsigned long end,
                        unsigned int type)
 {
-       swp_entry_t entry;
-       pte_t *pte;
+       pte_t *pte = NULL;
        struct swap_info_struct *si;
-       int ret = 0;
 
        si = swap_info[type];
-       pte = pte_offset_map(pmd, addr);
        do {
                struct folio *folio;
                unsigned long offset;
                unsigned char swp_count;
+               swp_entry_t entry;
+               int ret;
+               pte_t ptent;
+
+               if (!pte++) {
+                       pte = pte_offset_map(pmd, addr);
+                       if (!pte)
+                               break;
+               }
+
+               ptent = ptep_get_lockless(pte);
 
-               if (!is_swap_pte(*pte))
+               if (!is_swap_pte(ptent))
                        continue;
 
-               entry = pte_to_swp_entry(*pte);
+               entry = pte_to_swp_entry(ptent);
                if (swp_type(entry) != type)
                        continue;
 
                offset = swp_offset(entry);
                pte_unmap(pte);
+               pte = NULL;
+
                folio = swap_cache_get_folio(entry, vma, addr);
                if (!folio) {
                        struct page *page;
@@ -1878,8 +1875,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
                if (!folio) {
                        swp_count = READ_ONCE(si->swap_map[offset]);
                        if (swp_count == 0 || swp_count == SWAP_MAP_BAD)
-                               goto try_next;
-
+                               continue;
                        return -ENOMEM;
                }
 
@@ -1889,20 +1885,17 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
                if (ret < 0) {
                        folio_unlock(folio);
                        folio_put(folio);
-                       goto out;
+                       return ret;
                }
 
                folio_free_swap(folio);
                folio_unlock(folio);
                folio_put(folio);
-try_next:
-               pte = pte_offset_map(pmd, addr);
-       } while (pte++, addr += PAGE_SIZE, addr != end);
-       pte_unmap(pte - 1);
+       } while (addr += PAGE_SIZE, addr != end);
 
-       ret = 0;
-out:
-       return ret;
+       if (pte)
+               pte_unmap(pte);
+       return 0;
 }
 
 static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
@@ -1917,8 +1910,6 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
        do {
                cond_resched();
                next = pmd_addr_end(addr, end);
-               if (pmd_none_or_trans_huge_or_clear_bad(pmd))
-                       continue;
                ret = unuse_pte_range(vma, pmd, addr, next, type);
                if (ret)
                        return ret;
@@ -2539,7 +2530,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
                struct block_device *bdev = I_BDEV(inode);
 
                set_blocksize(bdev, old_block_size);
-               blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+               blkdev_put(bdev, p);
        }
 
        inode_lock(inode);
@@ -2770,7 +2761,7 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
 
        if (S_ISBLK(inode->i_mode)) {
                p->bdev = blkdev_get_by_dev(inode->i_rdev,
-                                  FMODE_READ | FMODE_WRITE | FMODE_EXCL, p);
+                               BLK_OPEN_READ | BLK_OPEN_WRITE, p, NULL);
                if (IS_ERR(p->bdev)) {
                        error = PTR_ERR(p->bdev);
                        p->bdev = NULL;
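
The FMODE_* flags are gone from this interface in 6.5: opening takes BLK_OPEN_*
modes plus an opaque holder (and optional holder ops), and the same holder
pointer must be handed back when the device is released. In outline, with dev
and holder standing in for the caller's values:

        struct block_device *bdev;

        bdev = blkdev_get_by_dev(dev, BLK_OPEN_READ | BLK_OPEN_WRITE, holder, NULL);
        if (IS_ERR(bdev))
                return PTR_ERR(bdev);
        /* ... use the device ... */
        blkdev_put(bdev, holder);
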
@@ -3221,7 +3212,7 @@ bad_swap:
        p->cluster_next_cpu = NULL;
        if (inode && S_ISBLK(inode->i_mode) && p->bdev) {
                set_blocksize(p->bdev, p->old_block_size);
-               blkdev_put(p->bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+               blkdev_put(p->bdev, p);
        }
        inode = NULL;
        destroy_swap_extents(p);
@@ -3288,9 +3279,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
        unsigned char has_cache;
        int err;
 
-       p = get_swap_device(entry);
-       if (!p)
-               return -EINVAL;
+       p = swp_swap_info(entry);
 
        offset = swp_offset(entry);
        ci = lock_cluster_or_swap_info(p, offset);
@@ -3337,7 +3326,6 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 
 unlock_out:
        unlock_cluster_or_swap_info(p, ci);
-       put_swap_device(p);
        return err;
 }
 
@@ -3468,11 +3456,6 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
                goto out;
        }
 
-       /*
-        * We are fortunate that although vmalloc_to_page uses pte_offset_map,
-        * no architecture is using highmem pages for kernel page tables: so it
-        * will not corrupt the GFP_ATOMIC caller's atomic page table kmaps.
-        */
        head = vmalloc_to_page(si->swap_map + offset);
        offset &= ~PAGE_MASK;
 
index 86de31e..95d1291 100644 (file)
@@ -486,18 +486,17 @@ void truncate_inode_pages_final(struct address_space *mapping)
 EXPORT_SYMBOL(truncate_inode_pages_final);
 
 /**
- * invalidate_mapping_pagevec - Invalidate all the unlocked pages of one inode
- * @mapping: the address_space which holds the pages to invalidate
+ * mapping_try_invalidate - Invalidate all the evictable folios of one inode
+ * @mapping: the address_space which holds the folios to invalidate
  * @start: the offset 'from' which to invalidate
  * @end: the offset 'to' which to invalidate (inclusive)
- * @nr_pagevec: invalidate failed page number for caller
+ * @nr_failed: How many folio invalidations failed
  *
- * This helper is similar to invalidate_mapping_pages(), except that it accounts
- * for pages that are likely on a pagevec and counts them in @nr_pagevec, which
- * will be used by the caller.
+ * This function is similar to invalidate_mapping_pages(), except that it
+ * returns the number of folios which could not be evicted in @nr_failed.
  */
-unsigned long invalidate_mapping_pagevec(struct address_space *mapping,
-               pgoff_t start, pgoff_t end, unsigned long *nr_pagevec)
+unsigned long mapping_try_invalidate(struct address_space *mapping,
+               pgoff_t start, pgoff_t end, unsigned long *nr_failed)
 {
        pgoff_t indices[PAGEVEC_SIZE];
        struct folio_batch fbatch;
@@ -527,9 +526,9 @@ unsigned long invalidate_mapping_pagevec(struct address_space *mapping,
                         */
                        if (!ret) {
                                deactivate_file_folio(folio);
-                               /* It is likely on the pagevec of a remote CPU */
-                               if (nr_pagevec)
-                                       (*nr_pagevec)++;
+                               /* Likely in the lru cache of a remote CPU */
+                               if (nr_failed)
+                                       (*nr_failed)++;
                        }
                        count += ret;
                }
@@ -552,12 +551,12 @@ unsigned long invalidate_mapping_pagevec(struct address_space *mapping,
  * If you want to remove all the pages of one inode, regardless of
  * their use and writeback state, use truncate_inode_pages().
  *
- * Return: the number of the cache entries that were invalidated
+ * Return: The number of indices that had their contents invalidated
  */
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
                pgoff_t start, pgoff_t end)
 {
-       return invalidate_mapping_pagevec(mapping, start, end, NULL);
+       return mapping_try_invalidate(mapping, start, end, NULL);
 }
 EXPORT_SYMBOL(invalidate_mapping_pages);
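
The exported entry point keeps its behaviour; only the internal helper is
renamed and re-documented. A minimal sketch of invalidating the clean,
unlocked page cache over a byte range of an inode, where lstart and lend are
the caller's byte offsets:

        pgoff_t start = lstart >> PAGE_SHIFT;
        pgoff_t end   = lend >> PAGE_SHIFT;             /* 'end' is inclusive */
        unsigned long nr;

        nr = invalidate_mapping_pages(inode->i_mapping, start, end);
        pr_debug("invalidated %lu page cache entries\n", nr);
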
 
@@ -566,7 +565,7 @@ EXPORT_SYMBOL(invalidate_mapping_pages);
  * refcount.  We do this because invalidate_inode_pages2() needs stronger
  * invalidation guarantees, and cannot afford to leave pages behind because
  * shrink_page_list() has a temp ref on them, or because they're transiently
- * sitting in the folio_add_lru() pagevecs.
+ * sitting in the folio_add_lru() caches.
  */
 static int invalidate_complete_folio2(struct address_space *mapping,
                                        struct folio *folio)
index e97a0b4..a2bf37e 100644 (file)
@@ -76,7 +76,10 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
        if (flags & MFILL_ATOMIC_WP)
                _dst_pte = pte_mkuffd_wp(_dst_pte);
 
+       ret = -EAGAIN;
        dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+       if (!dst_pte)
+               goto out;
 
        if (vma_is_shmem(dst_vma)) {
                /* serialize against truncate with the page table lock */
@@ -94,7 +97,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
         * registered, we firstly wr-protect a none pte which has no page cache
         * page backing it, then access the page.
         */
-       if (!pte_none_mostly(*dst_pte))
+       if (!pte_none_mostly(ptep_get(dst_pte)))
                goto out_unlock;
 
        folio = page_folio(page);
@@ -121,6 +124,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
        ret = 0;
 out_unlock:
        pte_unmap_unlock(dst_pte, ptl);
+out:
        return ret;
 }
 
@@ -212,7 +216,10 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
 
        _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
                                         dst_vma->vm_page_prot));
+       ret = -EAGAIN;
        dst_pte = pte_offset_map_lock(dst_vma->vm_mm, dst_pmd, dst_addr, &ptl);
+       if (!dst_pte)
+               goto out;
        if (dst_vma->vm_file) {
                /* the shmem MAP_PRIVATE case requires checking the i_size */
                inode = dst_vma->vm_file->f_inode;
@@ -223,7 +230,7 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
                        goto out_unlock;
        }
        ret = -EEXIST;
-       if (!pte_none(*dst_pte))
+       if (!pte_none(ptep_get(dst_pte)))
                goto out_unlock;
        set_pte_at(dst_vma->vm_mm, dst_addr, dst_pte, _dst_pte);
        /* No need to invalidate - it was non-present before */
@@ -231,6 +238,7 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
        ret = 0;
 out_unlock:
        pte_unmap_unlock(dst_pte, ptl);
+out:
        return ret;
 }
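
Both mfill helpers now follow the convention that pte_offset_map_lock() may
return NULL when the page table has been freed or replaced since the PMD was
read; that is treated as a benign race and reported as -EAGAIN so the caller
retries. The convention in isolation:

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return -EAGAIN;         /* page table changed under us; retry */
        /* ... check and install the PTE ... */
        pte_unmap_unlock(pte, ptl);
        return 0;
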
 
index 1d13d71..93cf99a 100644 (file)
@@ -103,7 +103,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
        if (!pte)
                return -ENOMEM;
        do {
-               BUG_ON(!pte_none(*pte));
+               BUG_ON(!pte_none(ptep_get(pte)));
 
 #ifdef CONFIG_HUGETLB_PAGE
                size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
@@ -472,7 +472,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
        do {
                struct page *page = pages[*nr];
 
-               if (WARN_ON(!pte_none(*pte)))
+               if (WARN_ON(!pte_none(ptep_get(pte))))
                        return -EBUSY;
                if (WARN_ON(!page))
                        return -ENOMEM;
@@ -703,11 +703,10 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
        if (WARN_ON_ONCE(pmd_bad(*pmd)))
                return NULL;
 
-       ptep = pte_offset_map(pmd, addr);
-       pte = *ptep;
+       ptep = pte_offset_kernel(pmd, addr);
+       pte = ptep_get(ptep);
        if (pte_present(pte))
                page = pte_page(pte);
-       pte_unmap(ptep);
 
        return page;
 }
@@ -791,7 +790,7 @@ get_subtree_max_size(struct rb_node *node)
 RB_DECLARE_CALLBACKS_MAX(static, free_vmap_area_rb_augment_cb,
        struct vmap_area, rb_node, unsigned long, subtree_max_size, va_size)
 
-static void purge_vmap_area_lazy(void);
+static void reclaim_and_purge_vmap_areas(void);
 static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
 static void drain_vmap_area_work(struct work_struct *work);
 static DECLARE_WORK(drain_vmap_work, drain_vmap_area_work);
@@ -1649,7 +1648,7 @@ retry:
 
 overflow:
        if (!purged) {
-               purge_vmap_area_lazy();
+               reclaim_and_purge_vmap_areas();
                purged = 1;
                goto retry;
        }
@@ -1785,9 +1784,10 @@ out:
 }
 
 /*
- * Kick off a purge of the outstanding lazy areas.
+ * Reclaim vmap areas by purging fragmented blocks and purge_vmap_area_list.
  */
-static void purge_vmap_area_lazy(void)
+static void reclaim_and_purge_vmap_areas(void)
 {
        mutex_lock(&vmap_purge_lock);
        purge_fragmented_blocks_allcpus();
@@ -1908,6 +1908,12 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 
 #define VMAP_BLOCK_SIZE                (VMAP_BBMAP_BITS * PAGE_SIZE)
 
+/*
+ * Purge threshold to prevent overeager purging of fragmented blocks for
+ * regular operations: Purge if vb->free is less than 1/4 of the capacity.
+ */
+#define VMAP_PURGE_THRESHOLD   (VMAP_BBMAP_BITS / 4)
+
 #define VMAP_RAM               0x1 /* indicates vm_map_ram area*/
 #define VMAP_BLOCK             0x2 /* mark out the vmap_block sub-type*/
 #define VMAP_FLAGS_MASK                0x3
@@ -2086,39 +2092,62 @@ static void free_vmap_block(struct vmap_block *vb)
        kfree_rcu(vb, rcu_head);
 }
 
+static bool purge_fragmented_block(struct vmap_block *vb,
+               struct vmap_block_queue *vbq, struct list_head *purge_list,
+               bool force_purge)
+{
+       if (vb->free + vb->dirty != VMAP_BBMAP_BITS ||
+           vb->dirty == VMAP_BBMAP_BITS)
+               return false;
+
+       /* Don't overeagerly purge usable blocks unless requested */
+       if (!(force_purge || vb->free < VMAP_PURGE_THRESHOLD))
+               return false;
+
+       /* prevent further allocs after releasing lock */
+       WRITE_ONCE(vb->free, 0);
+       /* prevent purging it again */
+       WRITE_ONCE(vb->dirty, VMAP_BBMAP_BITS);
+       vb->dirty_min = 0;
+       vb->dirty_max = VMAP_BBMAP_BITS;
+       spin_lock(&vbq->lock);
+       list_del_rcu(&vb->free_list);
+       spin_unlock(&vbq->lock);
+       list_add_tail(&vb->purge, purge_list);
+       return true;
+}
+
+static void free_purged_blocks(struct list_head *purge_list)
+{
+       struct vmap_block *vb, *n_vb;
+
+       list_for_each_entry_safe(vb, n_vb, purge_list, purge) {
+               list_del(&vb->purge);
+               free_vmap_block(vb);
+       }
+}
+
 static void purge_fragmented_blocks(int cpu)
 {
        LIST_HEAD(purge);
        struct vmap_block *vb;
-       struct vmap_block *n_vb;
        struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
 
        rcu_read_lock();
        list_for_each_entry_rcu(vb, &vbq->free, free_list) {
+               unsigned long free = READ_ONCE(vb->free);
+               unsigned long dirty = READ_ONCE(vb->dirty);
 
-               if (!(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS))
+               if (free + dirty != VMAP_BBMAP_BITS ||
+                   dirty == VMAP_BBMAP_BITS)
                        continue;
 
                spin_lock(&vb->lock);
-               if (vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS) {
-                       vb->free = 0; /* prevent further allocs after releasing lock */
-                       vb->dirty = VMAP_BBMAP_BITS; /* prevent purging it again */
-                       vb->dirty_min = 0;
-                       vb->dirty_max = VMAP_BBMAP_BITS;
-                       spin_lock(&vbq->lock);
-                       list_del_rcu(&vb->free_list);
-                       spin_unlock(&vbq->lock);
-                       spin_unlock(&vb->lock);
-                       list_add_tail(&vb->purge, &purge);
-               } else
-                       spin_unlock(&vb->lock);
+               purge_fragmented_block(vb, vbq, &purge, true);
+               spin_unlock(&vb->lock);
        }
        rcu_read_unlock();
-
-       list_for_each_entry_safe(vb, n_vb, &purge, purge) {
-               list_del(&vb->purge);
-               free_vmap_block(vb);
-       }
+       free_purged_blocks(&purge);
 }
 
 static void purge_fragmented_blocks_allcpus(void)
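
purge_fragmented_block() above concentrates the purge decision in one place: a
block qualifies only when every page is either free or dirty but the block is
not yet entirely dirty, and regular (non-forced) callers additionally require
the free space to have dropped below VMAP_PURGE_THRESHOLD, i.e. below a quarter
of the block (256 pages when VMAP_BBMAP_BITS is 1024). The predicate written
out on its own, with should_purge() as a purely illustrative name:

        static bool should_purge(unsigned long free, unsigned long dirty, bool force)
        {
                /* some pages still allocated, or block is already entirely dirty */
                if (free + dirty != VMAP_BBMAP_BITS || dirty == VMAP_BBMAP_BITS)
                        return false;
                return force || free < VMAP_PURGE_THRESHOLD;
        }
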
@@ -2153,6 +2182,9 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
        list_for_each_entry_rcu(vb, &vbq->free, free_list) {
                unsigned long pages_off;
 
+               if (READ_ONCE(vb->free) < (1UL << order))
+                       continue;
+
                spin_lock(&vb->lock);
                if (vb->free < (1UL << order)) {
                        spin_unlock(&vb->lock);
@@ -2161,7 +2193,7 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
 
                pages_off = VMAP_BBMAP_BITS - vb->free;
                vaddr = vmap_block_vaddr(vb->va->va_start, pages_off);
-               vb->free -= 1UL << order;
+               WRITE_ONCE(vb->free, vb->free - (1UL << order));
                bitmap_set(vb->used_map, pages_off, (1UL << order));
                if (vb->free == 0) {
                        spin_lock(&vbq->lock);
@@ -2211,11 +2243,11 @@ static void vb_free(unsigned long addr, unsigned long size)
 
        spin_lock(&vb->lock);
 
-       /* Expand dirty range */
+       /* Expand the not yet TLB flushed dirty range */
        vb->dirty_min = min(vb->dirty_min, offset);
        vb->dirty_max = max(vb->dirty_max, offset + (1UL << order));
 
-       vb->dirty += 1UL << order;
+       WRITE_ONCE(vb->dirty, vb->dirty + (1UL << order));
        if (vb->dirty == VMAP_BBMAP_BITS) {
                BUG_ON(vb->free);
                spin_unlock(&vb->lock);
@@ -2226,21 +2258,30 @@ static void vb_free(unsigned long addr, unsigned long size)
 
 static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 {
+       LIST_HEAD(purge_list);
        int cpu;
 
        if (unlikely(!vmap_initialized))
                return;
 
-       might_sleep();
+       mutex_lock(&vmap_purge_lock);
 
        for_each_possible_cpu(cpu) {
                struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
                struct vmap_block *vb;
+               unsigned long idx;
 
                rcu_read_lock();
-               list_for_each_entry_rcu(vb, &vbq->free, free_list) {
+               xa_for_each(&vbq->vmap_blocks, idx, vb) {
                        spin_lock(&vb->lock);
-                       if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
+
+                       /*
+                        * Try to purge a fragmented block first. If it's
+                        * not purgeable, check whether there is dirty
+                        * space to be flushed.
+                        */
+                       if (!purge_fragmented_block(vb, vbq, &purge_list, false) &&
+                           vb->dirty_max && vb->dirty != VMAP_BBMAP_BITS) {
                                unsigned long va_start = vb->va->va_start;
                                unsigned long s, e;
 
@@ -2250,15 +2291,18 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
                                start = min(s, start);
                                end   = max(e, end);
 
+                               /* Prevent that this is flushed again */
+                               vb->dirty_min = VMAP_BBMAP_BITS;
+                               vb->dirty_max = 0;
+
                                flush = 1;
                        }
                        spin_unlock(&vb->lock);
                }
                rcu_read_unlock();
        }
+       free_purged_blocks(&purge_list);
 
-       mutex_lock(&vmap_purge_lock);
-       purge_fragmented_blocks_allcpus();
        if (!__purge_vmap_area_lazy(start, end) && flush)
                flush_tlb_kernel_range(start, end);
        mutex_unlock(&vmap_purge_lock);
@@ -2899,10 +2943,16 @@ struct vmap_pfn_data {
 static int vmap_pfn_apply(pte_t *pte, unsigned long addr, void *private)
 {
        struct vmap_pfn_data *data = private;
+       unsigned long pfn = data->pfns[data->idx];
+       pte_t ptent;
 
-       if (WARN_ON_ONCE(pfn_valid(data->pfns[data->idx])))
+       if (WARN_ON_ONCE(pfn_valid(pfn)))
                return -EINVAL;
-       *pte = pte_mkspecial(pfn_pte(data->pfns[data->idx++], data->prot));
+
+       ptent = pte_mkspecial(pfn_pte(pfn, data->prot));
+       set_pte_at(&init_mm, addr, pte, ptent);
+
+       data->idx++;
        return 0;
 }
 
@@ -3520,7 +3570,7 @@ static size_t zero_iter(struct iov_iter *iter, size_t count)
        while (remains > 0) {
                size_t num, copied;
 
-               num = remains < PAGE_SIZE ? remains : PAGE_SIZE;
+               num = min_t(size_t, remains, PAGE_SIZE);
                copied = copy_page_to_iter_nofault(ZERO_PAGE(0), 0, num, iter);
                remains -= copied;
 
@@ -4151,7 +4201,7 @@ recovery:
 overflow:
        spin_unlock(&free_vmap_area_lock);
        if (!purged) {
-               purge_vmap_area_lazy();
+               reclaim_and_purge_vmap_areas();
                purged = true;
 
                /* Before "retry", check if we recover. */
index 5bf98d0..1080209 100644 (file)
@@ -429,12 +429,17 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
        up_read(&shrinker_rwsem);
 }
 
+/* Returns true for reclaim through cgroup limits or cgroup interfaces. */
 static bool cgroup_reclaim(struct scan_control *sc)
 {
        return sc->target_mem_cgroup;
 }
 
-static bool global_reclaim(struct scan_control *sc)
+/*
+ * Returns true for reclaim on the root cgroup. This is true for direct
+ * allocator reclaim and reclaim through cgroup interfaces on the root cgroup.
+ */
+static bool root_reclaim(struct scan_control *sc)
 {
        return !sc->target_mem_cgroup || mem_cgroup_is_root(sc->target_mem_cgroup);
 }
@@ -489,7 +494,7 @@ static bool cgroup_reclaim(struct scan_control *sc)
        return false;
 }
 
-static bool global_reclaim(struct scan_control *sc)
+static bool root_reclaim(struct scan_control *sc)
 {
        return true;
 }
@@ -546,7 +551,7 @@ static void flush_reclaim_state(struct scan_control *sc)
         * memcg reclaim, to make reporting more accurate and reduce
         * underestimation, but it's probably not worth the complexity for now.
         */
-       if (current->reclaim_state && global_reclaim(sc)) {
+       if (current->reclaim_state && root_reclaim(sc)) {
                sc->nr_reclaimed += current->reclaim_state->reclaimed;
                current->reclaim_state->reclaimed = 0;
        }
@@ -1606,9 +1611,10 @@ static void folio_check_dirty_writeback(struct folio *folio,
                mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-static struct page *alloc_demote_page(struct page *page, unsigned long private)
+static struct folio *alloc_demote_folio(struct folio *src,
+               unsigned long private)
 {
-       struct page *target_page;
+       struct folio *dst;
        nodemask_t *allowed_mask;
        struct migration_target_control *mtc;
 
@@ -1626,14 +1632,14 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
         */
        mtc->nmask = NULL;
        mtc->gfp_mask |= __GFP_THISNODE;
-       target_page = alloc_migration_target(page, (unsigned long)mtc);
-       if (target_page)
-               return target_page;
+       dst = alloc_migration_target(src, (unsigned long)mtc);
+       if (dst)
+               return dst;
 
        mtc->gfp_mask &= ~__GFP_THISNODE;
        mtc->nmask = allowed_mask;
 
-       return alloc_migration_target(page, (unsigned long)mtc);
+       return alloc_migration_target(src, (unsigned long)mtc);
 }
 
 /*
@@ -1668,7 +1674,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
        node_get_allowed_targets(pgdat, &allowed_mask);
 
        /* Demotion ignores all cpuset and mempolicy settings */
-       migrate_pages(demote_folios, alloc_demote_page, NULL,
+       migrate_pages(demote_folios, alloc_demote_folio, NULL,
                      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
                      &nr_succeeded);
 
@@ -2255,6 +2261,25 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 
 }
 
+#ifdef CONFIG_CMA
+/*
+ * It is a waste of effort to scan and reclaim CMA pages if they are not
+ * available to the current allocation context. Kswapd cannot be enrolled in
+ * this check because, using sc->gfp_mask = GFP_KERNEL, it cannot distinguish
+ * this scenario.
+ */
+static bool skip_cma(struct folio *folio, struct scan_control *sc)
+{
+       return !current_is_kswapd() &&
+                       gfp_migratetype(sc->gfp_mask) != MIGRATE_MOVABLE &&
+                       get_pageblock_migratetype(&folio->page) == MIGRATE_CMA;
+}
+#else
+static bool skip_cma(struct folio *folio, struct scan_control *sc)
+{
+       return false;
+}
+#endif
+
 /*
  * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
  *
@@ -2301,7 +2326,8 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
                nr_pages = folio_nr_pages(folio);
                total_scan += nr_pages;
 
-               if (folio_zonenum(folio) > sc->reclaim_idx) {
+               if (folio_zonenum(folio) > sc->reclaim_idx ||
+                               skip_cma(folio, sc)) {
                        nr_skipped[folio_zonenum(folio)] += nr_pages;
                        move_to = &folios_skipped;
                        goto move;
@@ -2443,7 +2469,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
         * won't get blocked by normal direct-reclaimers, forming a circular
         * deadlock.
         */
-       if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
+       if (gfp_has_io_fs(sc->gfp_mask))
                inactive >>= 3;
 
        too_many = isolated > inactive;
@@ -3218,6 +3244,16 @@ DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS);
 #define get_cap(cap)   static_branch_unlikely(&lru_gen_caps[cap])
 #endif
 
+static bool should_walk_mmu(void)
+{
+       return arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK);
+}
+
+static bool should_clear_pmd_young(void)
+{
+       return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG);
+}
+
 /******************************************************************************
  *                          shorthand helpers
  ******************************************************************************/
@@ -3978,28 +4014,29 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
        struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
        int old_gen, new_gen = lru_gen_from_seq(walk->max_seq);
 
-       VM_WARN_ON_ONCE(pmd_leaf(*pmd));
-
-       ptl = pte_lockptr(args->mm, pmd);
-       if (!spin_trylock(ptl))
+       pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+       if (!pte)
                return false;
+       if (!spin_trylock(ptl)) {
+               pte_unmap(pte);
+               return false;
+       }
 
        arch_enter_lazy_mmu_mode();
-
-       pte = pte_offset_map(pmd, start & PMD_MASK);
 restart:
        for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
                unsigned long pfn;
                struct folio *folio;
+               pte_t ptent = ptep_get(pte + i);
 
                total++;
                walk->mm_stats[MM_LEAF_TOTAL]++;
 
-               pfn = get_pte_pfn(pte[i], args->vma, addr);
+               pfn = get_pte_pfn(ptent, args->vma, addr);
                if (pfn == -1)
                        continue;
 
-               if (!pte_young(pte[i])) {
+               if (!pte_young(ptent)) {
                        walk->mm_stats[MM_LEAF_OLD]++;
                        continue;
                }
@@ -4014,7 +4051,7 @@ restart:
                young++;
                walk->mm_stats[MM_LEAF_YOUNG]++;
 
-               if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+               if (pte_dirty(ptent) && !folio_test_dirty(folio) &&
                    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
                      !folio_test_swapcache(folio)))
                        folio_mark_dirty(folio);
@@ -4027,10 +4064,8 @@ restart:
        if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
                goto restart;
 
-       pte_unmap(pte);
-
        arch_leave_lazy_mmu_mode();
-       spin_unlock(ptl);
+       pte_unmap_unlock(pte, ptl);
 
        return suitable_to_scan(total, young);
 }
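
walk_pte_range() now obtains the PTE mapping and the lock pointer together
through pte_offset_map_nolock() and only then trylocks, so a page table that
has disappeared or changed is detected instead of assumed. Condensed to the
locking dance alone:

        pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
        if (!pte)
                return false;                   /* no page table (any more)  */
        if (!spin_trylock(ptl)) {
                pte_unmap(pte);                 /* contended: give up softly */
                return false;
        }
        /* ... scan entries via ptep_get(pte + i) ... */
        pte_unmap_unlock(pte, ptl);
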
@@ -4082,7 +4117,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
                        goto next;
 
                if (!pmd_trans_huge(pmd[i])) {
-                       if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
+                       if (should_clear_pmd_young())
                                pmdp_test_and_clear_young(vma, addr, pmd + i);
                        goto next;
                }
@@ -4128,7 +4163,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
        unsigned long next;
        unsigned long addr;
        struct vm_area_struct *vma;
-       unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)];
+       DECLARE_BITMAP(bitmap, MIN_LRU_BATCH);
        unsigned long first = -1;
        struct lru_gen_mm_walk *walk = args->private;
 
@@ -4175,7 +4210,7 @@ restart:
 #endif
                walk->mm_stats[MM_NONLEAF_TOTAL]++;
 
-               if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG)) {
+               if (should_clear_pmd_young()) {
                        if (!pmd_young(val))
                                continue;
 
@@ -4477,7 +4512,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
         * handful of PTEs. Spreading the work out over a period of time usually
         * is less efficient, but it avoids bursty page faults.
         */
-       if (!arch_has_hw_pte_young() || !get_cap(LRU_GEN_MM_WALK)) {
+       if (!should_walk_mmu()) {
                success = iterate_mm_list_nowalk(lruvec, max_seq);
                goto done;
        }
@@ -4659,12 +4694,13 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 
        for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
                unsigned long pfn;
+               pte_t ptent = ptep_get(pte + i);
 
-               pfn = get_pte_pfn(pte[i], pvmw->vma, addr);
+               pfn = get_pte_pfn(ptent, pvmw->vma, addr);
                if (pfn == -1)
                        continue;
 
-               if (!pte_young(pte[i]))
+               if (!pte_young(ptent))
                        continue;
 
                folio = get_pfn_folio(pfn, memcg, pgdat, !walk || walk->can_swap);
@@ -4676,7 +4712,7 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 
                young++;
 
-               if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+               if (pte_dirty(ptent) && !folio_test_dirty(folio) &&
                    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
                      !folio_test_swapcache(folio)))
                        folio_mark_dirty(folio);
@@ -4728,10 +4764,11 @@ static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
 {
        int seg;
        int old, new;
+       unsigned long flags;
        int bin = get_random_u32_below(MEMCG_NR_BINS);
        struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
-       spin_lock(&pgdat->memcg_lru.lock);
+       spin_lock_irqsave(&pgdat->memcg_lru.lock, flags);
 
        VM_WARN_ON_ONCE(hlist_nulls_unhashed(&lruvec->lrugen.list));
 
@@ -4766,7 +4803,7 @@ static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
        if (!pgdat->memcg_lru.nr_memcgs[old] && old == get_memcg_gen(pgdat->memcg_lru.seq))
                WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
 
-       spin_unlock(&pgdat->memcg_lru.lock);
+       spin_unlock_irqrestore(&pgdat->memcg_lru.lock, flags);
 }
 
 void lru_gen_online_memcg(struct mem_cgroup *memcg)
@@ -4779,7 +4816,7 @@ void lru_gen_online_memcg(struct mem_cgroup *memcg)
                struct pglist_data *pgdat = NODE_DATA(nid);
                struct lruvec *lruvec = get_lruvec(memcg, nid);
 
-               spin_lock(&pgdat->memcg_lru.lock);
+               spin_lock_irq(&pgdat->memcg_lru.lock);
 
                VM_WARN_ON_ONCE(!hlist_nulls_unhashed(&lruvec->lrugen.list));
 
@@ -4790,7 +4827,7 @@ void lru_gen_online_memcg(struct mem_cgroup *memcg)
 
                lruvec->lrugen.gen = gen;
 
-               spin_unlock(&pgdat->memcg_lru.lock);
+               spin_unlock_irq(&pgdat->memcg_lru.lock);
        }
 }
 
@@ -4814,7 +4851,7 @@ void lru_gen_release_memcg(struct mem_cgroup *memcg)
                struct pglist_data *pgdat = NODE_DATA(nid);
                struct lruvec *lruvec = get_lruvec(memcg, nid);
 
-               spin_lock(&pgdat->memcg_lru.lock);
+               spin_lock_irq(&pgdat->memcg_lru.lock);
 
                VM_WARN_ON_ONCE(hlist_nulls_unhashed(&lruvec->lrugen.list));
 
@@ -4826,12 +4863,14 @@ void lru_gen_release_memcg(struct mem_cgroup *memcg)
                if (!pgdat->memcg_lru.nr_memcgs[gen] && gen == get_memcg_gen(pgdat->memcg_lru.seq))
                        WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
 
-               spin_unlock(&pgdat->memcg_lru.lock);
+               spin_unlock_irq(&pgdat->memcg_lru.lock);
        }
 }
 
-void lru_gen_soft_reclaim(struct lruvec *lruvec)
+void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
 {
+       struct lruvec *lruvec = get_lruvec(memcg, nid);
+
        /* see the comment on MEMCG_NR_GENS */
        if (lru_gen_memcg_seg(lruvec) != MEMCG_LRU_HEAD)
                lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD);
@@ -4897,7 +4936,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
 
                WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
                           lrugen->protected[hist][type][tier - 1] + delta);
-               __mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
                return true;
        }
 
@@ -5292,7 +5330,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
 static unsigned long get_nr_to_reclaim(struct scan_control *sc)
 {
        /* don't abort memcg reclaim to ensure fairness */
-       if (!global_reclaim(sc))
+       if (!root_reclaim(sc))
                return -1;
 
        return max(sc->nr_to_reclaim, compact_gap(sc->order));
@@ -5444,7 +5482,7 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 {
        struct blk_plug plug;
 
-       VM_WARN_ON_ONCE(global_reclaim(sc));
+       VM_WARN_ON_ONCE(root_reclaim(sc));
        VM_WARN_ON_ONCE(!sc->may_writepage || !sc->may_unmap);
 
        lru_add_drain();
@@ -5505,7 +5543,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
        struct blk_plug plug;
        unsigned long reclaimed = sc->nr_reclaimed;
 
-       VM_WARN_ON_ONCE(!global_reclaim(sc));
+       VM_WARN_ON_ONCE(!root_reclaim(sc));
 
        /*
         * Unmapped clean folios are already prioritized. Scanning for more of
@@ -5712,10 +5750,10 @@ static ssize_t enabled_show(struct kobject *kobj, struct kobj_attribute *attr, c
        if (get_cap(LRU_GEN_CORE))
                caps |= BIT(LRU_GEN_CORE);
 
-       if (arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))
+       if (should_walk_mmu())
                caps |= BIT(LRU_GEN_MM_WALK);
 
-       if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
+       if (should_clear_pmd_young())
                caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
 
        return sysfs_emit(buf, "0x%04x\n", caps);
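should_walk_mmu() and should_clear_pmd_young() are new helpers introduced elsewhere in this series; judging purely from the expressions they replace at these call sites, they presumably reduce to a sketch like the following (inferred, not quoted from the patch):

static bool should_walk_mmu(void)
{
	/* walk page tables only when the CPU sets the accessed bit in PTEs
	 * and the capability has not been disabled via sysfs */
	return arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK);
}

static bool should_clear_pmd_young(void)
{
	/* same idea for the non-leaf PMD accessed bit */
	return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG);
}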
@@ -6227,7 +6265,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
        bool proportional_reclaim;
        struct blk_plug plug;
 
-       if (lru_gen_enabled() && !global_reclaim(sc)) {
+       if (lru_gen_enabled() && !root_reclaim(sc)) {
                lru_gen_shrink_lruvec(lruvec, sc);
                return;
        }
@@ -6383,14 +6421,13 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
                if (!managed_zone(zone))
                        continue;
 
-               switch (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx)) {
-               case COMPACT_SUCCESS:
-               case COMPACT_CONTINUE:
+               /* Allocation can already succeed, nothing to do */
+               if (zone_watermark_ok(zone, sc->order, min_wmark_pages(zone),
+                                     sc->reclaim_idx, 0))
+                       return false;
+
+               if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
                        return false;
-               default:
-                       /* check next zone */
-                       ;
-               }
        }
 
        /*
@@ -6469,7 +6506,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
        struct lruvec *target_lruvec;
        bool reclaimable = false;
 
-       if (lru_gen_enabled() && global_reclaim(sc)) {
+       if (lru_gen_enabled() && root_reclaim(sc)) {
                lru_gen_shrink_node(pgdat, sc);
                return;
        }
@@ -6541,10 +6578,13 @@ again:
         * Legacy memcg will stall in page writeback so avoid forcibly
         * stalling in reclaim_throttle().
         */
-       if ((current_is_kswapd() ||
-            (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
-           sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
-               set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
+       if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested) {
+               if (cgroup_reclaim(sc) && writeback_throttling_sane(sc))
+                       set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags);
+
+               if (current_is_kswapd())
+                       set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags);
+       }
 
        /*
         * Stall direct reclaim for IO completions if the lruvec is
@@ -6554,7 +6594,8 @@ again:
         */
        if (!current_is_kswapd() && current_may_throttle() &&
            !sc->hibernation_mode &&
-           test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
+           (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) ||
+            test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags)))
                reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
 
        if (should_continue_reclaim(pgdat, nr_node_reclaimed, sc))
@@ -6578,14 +6619,14 @@ again:
 static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 {
        unsigned long watermark;
-       enum compact_result suitable;
 
-       suitable = compaction_suitable(zone, sc->order, 0, sc->reclaim_idx);
-       if (suitable == COMPACT_SUCCESS)
-               /* Allocation should succeed already. Don't reclaim. */
+       /* Allocation can already succeed, nothing to do */
+       if (zone_watermark_ok(zone, sc->order, min_wmark_pages(zone),
+                             sc->reclaim_idx, 0))
                return true;
-       if (suitable == COMPACT_SKIPPED)
-               /* Compaction cannot yet proceed. Do reclaim. */
+
+       /* Compaction cannot yet proceed. Do reclaim. */
+       if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
                return false;
 
        /*
@@ -6811,7 +6852,7 @@ retry:
 
                        lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
                                                   zone->zone_pgdat);
-                       clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
+                       clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
                }
        }
 
@@ -6872,7 +6913,7 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
                        continue;
 
                pfmemalloc_reserve += min_wmark_pages(zone);
-               free_pages += zone_page_state(zone, NR_FREE_PAGES);
+               free_pages += zone_page_state_snapshot(zone, NR_FREE_PAGES);
        }
 
        /* If there are no reserves (unexpected config) then do not throttle */
@@ -7200,7 +7241,8 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
 {
        struct lruvec *lruvec = mem_cgroup_lruvec(NULL, pgdat);
 
-       clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
+       clear_bit(LRUVEC_NODE_CONGESTED, &lruvec->flags);
+       clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
        clear_bit(PGDAT_DIRTY, &pgdat->flags);
        clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
 }
@@ -7825,7 +7867,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 /*
  * This kswapd start function will be called by init and node-hot-add.
  */
-void kswapd_run(int nid)
+void __meminit kswapd_run(int nid)
 {
        pg_data_t *pgdat = NODE_DATA(nid);
 
@@ -7846,7 +7888,7 @@ void kswapd_run(int nid)
  * Called by memory hotplug when all memory in a node is offlined.  Caller must
  * be holding mem_hotplug_begin/done().
  */
-void kswapd_stop(int nid)
+void __meminit kswapd_stop(int nid)
 {
        pg_data_t *pgdat = NODE_DATA(nid);
        struct task_struct *kswapd;
@@ -8043,23 +8085,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 }
 #endif
 
-void check_move_unevictable_pages(struct pagevec *pvec)
-{
-       struct folio_batch fbatch;
-       unsigned i;
-
-       folio_batch_init(&fbatch);
-       for (i = 0; i < pvec->nr; i++) {
-               struct page *page = pvec->pages[i];
-
-               if (PageTransTail(page))
-                       continue;
-               folio_batch_add(&fbatch, page_folio(page));
-       }
-       check_move_unevictable_folios(&fbatch);
-}
-EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
-
 /**
  * check_move_unevictable_folios - Move evictable folios to appropriate zone
  * lru list
index c280463..b731d57 100644
@@ -28,6 +28,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/sched/isolation.h>
 
 #include "internal.h"
 
@@ -1180,6 +1181,9 @@ const char * const vmstat_text[] = {
        "nr_zspages",
 #endif
        "nr_free_cma",
+#ifdef CONFIG_UNACCEPTED_MEMORY
+       "nr_unaccepted",
+#endif
 
        /* enum numa_stat_item counters */
 #ifdef CONFIG_NUMA
@@ -2022,6 +2026,20 @@ static void vmstat_shepherd(struct work_struct *w)
        for_each_online_cpu(cpu) {
                struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
 
+               /*
+                * In kernel users of vmstat counters either require the precise value and
+                * they are using zone_page_state_snapshot interface or they can live with
+                * an imprecision as the regular flushing can happen at arbitrary time and
+                * cumulative error can grow (see calculate_normal_threshold).
+                *
+                * From that POV the regular flushing can be postponed for CPUs that have
+                * been isolated from the kernel interference without critical
+                * infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
+                * for all isolated CPUs to avoid interference with the isolated workload.
+                */
+               if (cpu_is_isolated(cpu))
+                       continue;
+
                if (!delayed_work_pending(dw) && need_update(cpu))
                        queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
 
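Skipping isolated CPUs here is safe because any in-kernel reader that needs an exact count already goes through the snapshot interface, which folds the unflushed per-CPU deltas in at read time. A simplified sketch of that read path, with the per-CPU layout abbreviated from current mainline (field names may not match exactly):

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
						     enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
	int cpu;

	/* add the deltas vmstat_shepherd has not flushed yet, including
	 * those parked on isolated CPUs */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_stat_diff[item];
#endif
	return x < 0 ? 0 : x;
}

The same snapshot interface is what allow_direct_reclaim() above now uses for NR_FREE_PAGES, so deferred per-CPU updates do not make the reserve check misjudge free memory.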
index 8177589..4686ae3 100644
@@ -255,45 +255,58 @@ static void *lru_gen_eviction(struct folio *folio)
        return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
 }
 
+/*
+ * Tests if the shadow entry is for a folio that was recently evicted.
+ * Fills in @lruvec, @token, @workingset with the values unpacked from shadow.
+ */
+static bool lru_gen_test_recent(void *shadow, bool file, struct lruvec **lruvec,
+                               unsigned long *token, bool *workingset)
+{
+       int memcg_id;
+       unsigned long min_seq;
+       struct mem_cgroup *memcg;
+       struct pglist_data *pgdat;
+
+       unpack_shadow(shadow, &memcg_id, &pgdat, token, workingset);
+
+       memcg = mem_cgroup_from_id(memcg_id);
+       *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+       min_seq = READ_ONCE((*lruvec)->lrugen.min_seq[file]);
+       return (*token >> LRU_REFS_WIDTH) == (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH));
+}
+
 static void lru_gen_refault(struct folio *folio, void *shadow)
 {
+       bool recent;
        int hist, tier, refs;
-       int memcg_id;
        bool workingset;
        unsigned long token;
-       unsigned long min_seq;
        struct lruvec *lruvec;
        struct lru_gen_folio *lrugen;
-       struct mem_cgroup *memcg;
-       struct pglist_data *pgdat;
        int type = folio_is_file_lru(folio);
        int delta = folio_nr_pages(folio);
 
-       unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
-
-       if (pgdat != folio_pgdat(folio))
-               return;
-
        rcu_read_lock();
 
-       memcg = folio_memcg_rcu(folio);
-       if (memcg_id != mem_cgroup_id(memcg))
+       recent = lru_gen_test_recent(shadow, type, &lruvec, &token, &workingset);
+       if (lruvec != folio_lruvec(folio))
                goto unlock;
 
-       lruvec = mem_cgroup_lruvec(memcg, pgdat);
-       lrugen = &lruvec->lrugen;
+       mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
 
-       min_seq = READ_ONCE(lrugen->min_seq[type]);
-       if ((token >> LRU_REFS_WIDTH) != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
+       if (!recent)
                goto unlock;
 
-       hist = lru_hist_from_seq(min_seq);
+       lrugen = &lruvec->lrugen;
+
+       hist = lru_hist_from_seq(READ_ONCE(lrugen->min_seq[type]));
        /* see the comment in folio_lru_refs() */
        refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset;
        tier = lru_tier_from_refs(refs);
 
        atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
-       mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
+       mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
 
        /*
         * Count the following two cases as stalls:
@@ -317,6 +330,12 @@ static void *lru_gen_eviction(struct folio *folio)
        return NULL;
 }
 
+static bool lru_gen_test_recent(void *shadow, bool file, struct lruvec **lruvec,
+                               unsigned long *token, bool *workingset)
+{
+       return false;
+}
+
 static void lru_gen_refault(struct folio *folio, void *shadow)
 {
 }
@@ -385,42 +404,33 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
 }
 
 /**
- * workingset_refault - Evaluate the refault of a previously evicted folio.
- * @folio: The freshly allocated replacement folio.
- * @shadow: Shadow entry of the evicted folio.
- *
- * Calculates and evaluates the refault distance of the previously
- * evicted folio in the context of the node and the memcg whose memory
- * pressure caused the eviction.
+ * workingset_test_recent - tests if the shadow entry is for a folio that was
+ * recently evicted. Also fills in @workingset with the value unpacked from
+ * shadow.
+ * @shadow: the shadow entry to be tested.
+ * @file: whether the corresponding folio is from the file lru.
+ * @workingset: where the workingset value unpacked from shadow should
+ * be stored.
+ *
+ * Return: true if the shadow is for a recently evicted folio; false otherwise.
  */
-void workingset_refault(struct folio *folio, void *shadow)
+bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 {
-       bool file = folio_is_file_lru(folio);
        struct mem_cgroup *eviction_memcg;
        struct lruvec *eviction_lruvec;
        unsigned long refault_distance;
        unsigned long workingset_size;
-       struct pglist_data *pgdat;
-       struct mem_cgroup *memcg;
-       unsigned long eviction;
-       struct lruvec *lruvec;
        unsigned long refault;
-       bool workingset;
        int memcgid;
-       long nr;
+       struct pglist_data *pgdat;
+       unsigned long eviction;
 
-       if (lru_gen_enabled()) {
-               lru_gen_refault(folio, shadow);
-               return;
-       }
+       if (lru_gen_enabled())
+               return lru_gen_test_recent(shadow, file, &eviction_lruvec, &eviction, workingset);
 
-       unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+       unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
        eviction <<= bucket_order;
 
-       /* Flush stats (and potentially sleep) before holding RCU read lock */
-       mem_cgroup_flush_stats_ratelimited();
-
-       rcu_read_lock();
        /*
         * Look up the memcg associated with the stored ID. It might
         * have been deleted since the folio's eviction.
@@ -439,7 +449,8 @@ void workingset_refault(struct folio *folio, void *shadow)
         */
        eviction_memcg = mem_cgroup_from_id(memcgid);
        if (!mem_cgroup_disabled() && !eviction_memcg)
-               goto out;
+               return false;
+
        eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
        refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
@@ -462,20 +473,6 @@ void workingset_refault(struct folio *folio, void *shadow)
        refault_distance = (refault - eviction) & EVICTION_MASK;
 
        /*
-        * The activation decision for this folio is made at the level
-        * where the eviction occurred, as that is where the LRU order
-        * during folio reclaim is being determined.
-        *
-        * However, the cgroup that will own the folio is the one that
-        * is actually experiencing the refault event.
-        */
-       nr = folio_nr_pages(folio);
-       memcg = folio_memcg(folio);
-       pgdat = folio_pgdat(folio);
-       lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
-       mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
-       /*
         * Compare the distance to the existing workingset size. We
         * don't activate pages that couldn't stay resident even if
         * all the memory was available to the workingset. Whether
@@ -495,7 +492,54 @@ void workingset_refault(struct folio *folio, void *shadow)
                                                     NR_INACTIVE_ANON);
                }
        }
-       if (refault_distance > workingset_size)
+
+       return refault_distance <= workingset_size;
+}
+
+/**
+ * workingset_refault - Evaluate the refault of a previously evicted folio.
+ * @folio: The freshly allocated replacement folio.
+ * @shadow: Shadow entry of the evicted folio.
+ *
+ * Calculates and evaluates the refault distance of the previously
+ * evicted folio in the context of the node and the memcg whose memory
+ * pressure caused the eviction.
+ */
+void workingset_refault(struct folio *folio, void *shadow)
+{
+       bool file = folio_is_file_lru(folio);
+       struct pglist_data *pgdat;
+       struct mem_cgroup *memcg;
+       struct lruvec *lruvec;
+       bool workingset;
+       long nr;
+
+       if (lru_gen_enabled()) {
+               lru_gen_refault(folio, shadow);
+               return;
+       }
+
+       /* Flush stats (and potentially sleep) before holding RCU read lock */
+       mem_cgroup_flush_stats_ratelimited();
+
+       rcu_read_lock();
+
+       /*
+        * The activation decision for this folio is made at the level
+        * where the eviction occurred, as that is where the LRU order
+        * during folio reclaim is being determined.
+        *
+        * However, the cgroup that will own the folio is the one that
+        * is actually experiencing the refault event.
+        */
+       nr = folio_nr_pages(folio);
+       memcg = folio_memcg(folio);
+       pgdat = folio_pgdat(folio);
+       lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+       mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
+
+       if (!workingset_test_recent(shadow, file, &workingset))
                goto out;
 
        folio_set_active(folio);
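To make the refault arithmetic concrete: eviction stashes the lruvec's nonresident_age in the shadow entry, and workingset_test_recent() measures how many (de)activations happened while the folio was out, modulo the shadow-entry width. A tiny standalone sketch with made-up numbers (the real bucket_order scaling and counter width are elided):

#include <stdio.h>

#define EVICTION_MASK	((1UL << 40) - 1)	/* illustrative width only */

int main(void)
{
	unsigned long eviction = 1000;		/* nonresident_age at eviction time */
	unsigned long refault = 1500;		/* nonresident_age at refault time */
	unsigned long workingset_size = 400;	/* roughly NR_ACTIVE_FILE (+ anon if swap) */
	unsigned long refault_distance = (refault - eviction) & EVICTION_MASK;

	/* same test workingset_test_recent() now returns: activate only if the
	 * folio could have stayed resident in a workingset of this size */
	printf("distance %lu -> %s\n", refault_distance,
	       refault_distance <= workingset_size ? "activate" : "stay inactive");
	return 0;
}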
index 0cef845..e84de91 100644
@@ -125,13 +125,11 @@ struct z3fold_header {
 /**
  * struct z3fold_pool - stores metadata for each z3fold pool
  * @name:      pool name
- * @lock:      protects pool unbuddied/lru lists
+ * @lock:      protects pool unbuddied lists
  * @stale_lock:        protects pool stale page list
  * @unbuddied: per-cpu array of lists tracking z3fold pages that contain 2-
  *             buddies; the list each z3fold page is added to depends on
  *             the size of its free region.
- * @lru:       list tracking the z3fold pages in LRU order by most recently
- *             added buddy.
  * @stale:     list of pages marked for freeing
  * @pages_nr:  number of z3fold pages in the pool.
  * @c_handle:  cache for z3fold_buddy_slots allocation
@@ -149,12 +147,9 @@ struct z3fold_pool {
        spinlock_t lock;
        spinlock_t stale_lock;
        struct list_head *unbuddied;
-       struct list_head lru;
        struct list_head stale;
        atomic64_t pages_nr;
        struct kmem_cache *c_handle;
-       struct zpool *zpool;
-       const struct zpool_ops *zpool_ops;
        struct workqueue_struct *compact_wq;
        struct workqueue_struct *release_wq;
        struct work_struct work;
@@ -329,7 +324,6 @@ static struct z3fold_header *init_z3fold_page(struct page *page, bool headless,
        struct z3fold_header *zhdr = page_address(page);
        struct z3fold_buddy_slots *slots;
 
-       INIT_LIST_HEAD(&page->lru);
        clear_bit(PAGE_HEADLESS, &page->private);
        clear_bit(MIDDLE_CHUNK_MAPPED, &page->private);
        clear_bit(NEEDS_COMPACTING, &page->private);
@@ -451,8 +445,6 @@ static void __release_z3fold_page(struct z3fold_header *zhdr, bool locked)
        set_bit(PAGE_STALE, &page->private);
        clear_bit(NEEDS_COMPACTING, &page->private);
        spin_lock(&pool->lock);
-       if (!list_empty(&page->lru))
-               list_del_init(&page->lru);
        spin_unlock(&pool->lock);
 
        if (locked)
@@ -930,7 +922,6 @@ static struct z3fold_pool *z3fold_create_pool(const char *name, gfp_t gfp)
                for_each_unbuddied_list(i, 0)
                        INIT_LIST_HEAD(&unbuddied[i]);
        }
-       INIT_LIST_HEAD(&pool->lru);
        INIT_LIST_HEAD(&pool->stale);
        atomic64_set(&pool->pages_nr, 0);
        pool->name = name;
@@ -1073,12 +1064,6 @@ found:
 
 headless:
        spin_lock(&pool->lock);
-       /* Add/move z3fold page to beginning of LRU */
-       if (!list_empty(&page->lru))
-               list_del(&page->lru);
-
-       list_add(&page->lru, &pool->lru);
-
        *handle = encode_handle(zhdr, bud);
        spin_unlock(&pool->lock);
        if (bud != HEADLESS)
@@ -1115,9 +1100,6 @@ static void z3fold_free(struct z3fold_pool *pool, unsigned long handle)
                 * immediately so we don't care about its value any more.
                 */
                if (!page_claimed) {
-                       spin_lock(&pool->lock);
-                       list_del(&page->lru);
-                       spin_unlock(&pool->lock);
                        put_z3fold_header(zhdr);
                        free_z3fold_page(page, true);
                        atomic64_dec(&pool->pages_nr);
@@ -1173,194 +1155,6 @@ static void z3fold_free(struct z3fold_pool *pool, unsigned long handle)
 }
 
 /**
- * z3fold_reclaim_page() - evicts allocations from a pool page and frees it
- * @pool:      pool from which a page will attempt to be evicted
- * @retries:   number of pages on the LRU list for which eviction will
- *             be attempted before failing
- *
- * z3fold reclaim is different from normal system reclaim in that it is done
- * from the bottom, up. This is because only the bottom layer, z3fold, has
- * information on how the allocations are organized within each z3fold page.
- * This has the potential to create interesting locking situations between
- * z3fold and the user, however.
- *
- * To avoid these, this is how z3fold_reclaim_page() should be called:
- *
- * The user detects a page should be reclaimed and calls z3fold_reclaim_page().
- * z3fold_reclaim_page() will remove a z3fold page from the pool LRU list and
- * call the user-defined eviction handler with the pool and handle as
- * arguments.
- *
- * If the handle can not be evicted, the eviction handler should return
- * non-zero. z3fold_reclaim_page() will add the z3fold page back to the
- * appropriate list and try the next z3fold page on the LRU up to
- * a user defined number of retries.
- *
- * If the handle is successfully evicted, the eviction handler should
- * return 0 _and_ should have called z3fold_free() on the handle. z3fold_free()
- * contains logic to delay freeing the page if the page is under reclaim,
- * as indicated by the setting of the PG_reclaim flag on the underlying page.
- *
- * If all buddies in the z3fold page are successfully evicted, then the
- * z3fold page can be freed.
- *
- * Returns: 0 if page is successfully freed, otherwise -EINVAL if there are
- * no pages to evict or an eviction handler is not registered, -EAGAIN if
- * the retry limit was hit.
- */
-static int z3fold_reclaim_page(struct z3fold_pool *pool, unsigned int retries)
-{
-       int i, ret = -1;
-       struct z3fold_header *zhdr = NULL;
-       struct page *page = NULL;
-       struct list_head *pos;
-       unsigned long first_handle = 0, middle_handle = 0, last_handle = 0;
-       struct z3fold_buddy_slots slots __attribute__((aligned(SLOTS_ALIGN)));
-
-       rwlock_init(&slots.lock);
-       slots.pool = (unsigned long)pool | (1 << HANDLES_NOFREE);
-
-       spin_lock(&pool->lock);
-       for (i = 0; i < retries; i++) {
-               if (list_empty(&pool->lru)) {
-                       spin_unlock(&pool->lock);
-                       return -EINVAL;
-               }
-               list_for_each_prev(pos, &pool->lru) {
-                       page = list_entry(pos, struct page, lru);
-
-                       zhdr = page_address(page);
-                       if (test_bit(PAGE_HEADLESS, &page->private)) {
-                               /*
-                                * For non-headless pages, we wait to do this
-                                * until we have the page lock to avoid racing
-                                * with __z3fold_alloc(). Headless pages don't
-                                * have a lock (and __z3fold_alloc() will never
-                                * see them), but we still need to test and set
-                                * PAGE_CLAIMED to avoid racing with
-                                * z3fold_free(), so just do it now before
-                                * leaving the loop.
-                                */
-                               if (test_and_set_bit(PAGE_CLAIMED, &page->private))
-                                       continue;
-
-                               break;
-                       }
-
-                       if (!z3fold_page_trylock(zhdr)) {
-                               zhdr = NULL;
-                               continue; /* can't evict at this point */
-                       }
-
-                       /* test_and_set_bit is of course atomic, but we still
-                        * need to do it under page lock, otherwise checking
-                        * that bit in __z3fold_alloc wouldn't make sense
-                        */
-                       if (zhdr->foreign_handles ||
-                           test_and_set_bit(PAGE_CLAIMED, &page->private)) {
-                               z3fold_page_unlock(zhdr);
-                               zhdr = NULL;
-                               continue; /* can't evict such page */
-                       }
-                       list_del_init(&zhdr->buddy);
-                       zhdr->cpu = -1;
-                       /* See comment in __z3fold_alloc. */
-                       kref_get(&zhdr->refcount);
-                       break;
-               }
-
-               if (!zhdr)
-                       break;
-
-               list_del_init(&page->lru);
-               spin_unlock(&pool->lock);
-
-               if (!test_bit(PAGE_HEADLESS, &page->private)) {
-                       /*
-                        * We need encode the handles before unlocking, and
-                        * use our local slots structure because z3fold_free
-                        * can zero out zhdr->slots and we can't do much
-                        * about that
-                        */
-                       first_handle = 0;
-                       last_handle = 0;
-                       middle_handle = 0;
-                       memset(slots.slot, 0, sizeof(slots.slot));
-                       if (zhdr->first_chunks)
-                               first_handle = __encode_handle(zhdr, &slots,
-                                                               FIRST);
-                       if (zhdr->middle_chunks)
-                               middle_handle = __encode_handle(zhdr, &slots,
-                                                               MIDDLE);
-                       if (zhdr->last_chunks)
-                               last_handle = __encode_handle(zhdr, &slots,
-                                                               LAST);
-                       /*
-                        * it's safe to unlock here because we hold a
-                        * reference to this page
-                        */
-                       z3fold_page_unlock(zhdr);
-               } else {
-                       first_handle = encode_handle(zhdr, HEADLESS);
-                       last_handle = middle_handle = 0;
-               }
-               /* Issue the eviction callback(s) */
-               if (middle_handle) {
-                       ret = pool->zpool_ops->evict(pool->zpool, middle_handle);
-                       if (ret)
-                               goto next;
-               }
-               if (first_handle) {
-                       ret = pool->zpool_ops->evict(pool->zpool, first_handle);
-                       if (ret)
-                               goto next;
-               }
-               if (last_handle) {
-                       ret = pool->zpool_ops->evict(pool->zpool, last_handle);
-                       if (ret)
-                               goto next;
-               }
-next:
-               if (test_bit(PAGE_HEADLESS, &page->private)) {
-                       if (ret == 0) {
-                               free_z3fold_page(page, true);
-                               atomic64_dec(&pool->pages_nr);
-                               return 0;
-                       }
-                       spin_lock(&pool->lock);
-                       list_add(&page->lru, &pool->lru);
-                       spin_unlock(&pool->lock);
-                       clear_bit(PAGE_CLAIMED, &page->private);
-               } else {
-                       struct z3fold_buddy_slots *slots = zhdr->slots;
-                       z3fold_page_lock(zhdr);
-                       if (kref_put(&zhdr->refcount,
-                                       release_z3fold_page_locked)) {
-                               kmem_cache_free(pool->c_handle, slots);
-                               return 0;
-                       }
-                       /*
-                        * if we are here, the page is still not completely
-                        * free. Take the global pool lock then to be able
-                        * to add it back to the lru list
-                        */
-                       spin_lock(&pool->lock);
-                       list_add(&page->lru, &pool->lru);
-                       spin_unlock(&pool->lock);
-                       if (list_empty(&zhdr->buddy))
-                               add_to_unbuddied(pool, zhdr);
-                       clear_bit(PAGE_CLAIMED, &page->private);
-                       z3fold_page_unlock(zhdr);
-               }
-
-               /* We started off locked to we need to lock the pool back */
-               spin_lock(&pool->lock);
-       }
-       spin_unlock(&pool->lock);
-       return -EAGAIN;
-}
-
-/**
  * z3fold_map() - maps the allocation associated with the given handle
  * @pool:      pool in which the allocation resides
  * @handle:    handle associated with the allocation to be mapped
@@ -1470,8 +1264,6 @@ static bool z3fold_page_isolate(struct page *page, isolate_mode_t mode)
        spin_lock(&pool->lock);
        if (!list_empty(&zhdr->buddy))
                list_del_init(&zhdr->buddy);
-       if (!list_empty(&page->lru))
-               list_del_init(&page->lru);
        spin_unlock(&pool->lock);
 
        kref_get(&zhdr->refcount);
@@ -1531,9 +1323,6 @@ static int z3fold_page_migrate(struct page *newpage, struct page *page,
                encode_handle(new_zhdr, MIDDLE);
        set_bit(NEEDS_COMPACTING, &newpage->private);
        new_zhdr->cpu = smp_processor_id();
-       spin_lock(&pool->lock);
-       list_add(&newpage->lru, &pool->lru);
-       spin_unlock(&pool->lock);
        __SetPageMovable(newpage, &z3fold_mops);
        z3fold_page_unlock(new_zhdr);
 
@@ -1559,9 +1348,6 @@ static void z3fold_page_putback(struct page *page)
        INIT_LIST_HEAD(&page->lru);
        if (kref_put(&zhdr->refcount, release_z3fold_page_locked))
                return;
-       spin_lock(&pool->lock);
-       list_add(&page->lru, &pool->lru);
-       spin_unlock(&pool->lock);
        if (list_empty(&zhdr->buddy))
                add_to_unbuddied(pool, zhdr);
        clear_bit(PAGE_CLAIMED, &page->private);
@@ -1578,18 +1364,9 @@ static const struct movable_operations z3fold_mops = {
  * zpool
  ****************/
 
-static void *z3fold_zpool_create(const char *name, gfp_t gfp,
-                              const struct zpool_ops *zpool_ops,
-                              struct zpool *zpool)
+static void *z3fold_zpool_create(const char *name, gfp_t gfp)
 {
-       struct z3fold_pool *pool;
-
-       pool = z3fold_create_pool(name, gfp);
-       if (pool) {
-               pool->zpool = zpool;
-               pool->zpool_ops = zpool_ops;
-       }
-       return pool;
+       return z3fold_create_pool(name, gfp);
 }
 
 static void z3fold_zpool_destroy(void *pool)
@@ -1607,25 +1384,6 @@ static void z3fold_zpool_free(void *pool, unsigned long handle)
        z3fold_free(pool, handle);
 }
 
-static int z3fold_zpool_shrink(void *pool, unsigned int pages,
-                       unsigned int *reclaimed)
-{
-       unsigned int total = 0;
-       int ret = -EINVAL;
-
-       while (total < pages) {
-               ret = z3fold_reclaim_page(pool, 8);
-               if (ret < 0)
-                       break;
-               total++;
-       }
-
-       if (reclaimed)
-               *reclaimed = total;
-
-       return ret;
-}
-
 static void *z3fold_zpool_map(void *pool, unsigned long handle,
                        enum zpool_mapmode mm)
 {
@@ -1649,7 +1407,6 @@ static struct zpool_driver z3fold_zpool_driver = {
        .destroy =      z3fold_zpool_destroy,
        .malloc =       z3fold_zpool_malloc,
        .free =         z3fold_zpool_free,
-       .shrink =       z3fold_zpool_shrink,
        .map =          z3fold_zpool_map,
        .unmap =        z3fold_zpool_unmap,
        .total_size =   z3fold_zpool_total_size,
index 3acd261..2190cc1 100644
--- a/mm/zbud.c
+++ b/mm/zbud.c
@@ -83,11 +83,7 @@ struct zbud_pool;
  *             its free region.
  * @buddied:   list tracking the zbud pages that contain two buddies;
  *             these zbud pages are full
- * @lru:       list tracking the zbud pages in LRU order by most recently
- *             added buddy.
  * @pages_nr:  number of zbud pages in the pool.
- * @zpool:     zpool driver
- * @zpool_ops: zpool operations structure with an evict callback
  *
  * This structure is allocated at pool creation time and maintains metadata
  * pertaining to a particular zbud pool.
@@ -102,26 +98,20 @@ struct zbud_pool {
                struct list_head buddied;
                struct list_head unbuddied[NCHUNKS];
        };
-       struct list_head lru;
        u64 pages_nr;
-       struct zpool *zpool;
-       const struct zpool_ops *zpool_ops;
 };
 
 /*
  * struct zbud_header - zbud page metadata occupying the first chunk of each
  *                     zbud page.
  * @buddy:     links the zbud page into the unbuddied/buddied lists in the pool
- * @lru:       links the zbud page into the lru list in the pool
  * @first_chunks:      the size of the first buddy in chunks, 0 if free
  * @last_chunks:       the size of the last buddy in chunks, 0 if free
  */
 struct zbud_header {
        struct list_head buddy;
-       struct list_head lru;
        unsigned int first_chunks;
        unsigned int last_chunks;
-       bool under_reclaim;
 };
 
 /*****************
@@ -149,8 +139,6 @@ static struct zbud_header *init_zbud_page(struct page *page)
        zhdr->first_chunks = 0;
        zhdr->last_chunks = 0;
        INIT_LIST_HEAD(&zhdr->buddy);
-       INIT_LIST_HEAD(&zhdr->lru);
-       zhdr->under_reclaim = false;
        return zhdr;
 }
 
@@ -221,7 +209,6 @@ static struct zbud_pool *zbud_create_pool(gfp_t gfp)
        for_each_unbuddied_list(i, 0)
                INIT_LIST_HEAD(&pool->unbuddied[i]);
        INIT_LIST_HEAD(&pool->buddied);
-       INIT_LIST_HEAD(&pool->lru);
        pool->pages_nr = 0;
        return pool;
 }
@@ -310,11 +297,6 @@ found:
                list_add(&zhdr->buddy, &pool->buddied);
        }
 
-       /* Add/move zbud page to beginning of LRU */
-       if (!list_empty(&zhdr->lru))
-               list_del(&zhdr->lru);
-       list_add(&zhdr->lru, &pool->lru);
-
        *handle = encode_handle(zhdr, bud);
        spin_unlock(&pool->lock);
 
@@ -325,11 +307,6 @@ found:
  * zbud_free() - frees the allocation associated with the given handle
  * @pool:      pool in which the allocation resided
  * @handle:    handle associated with the allocation returned by zbud_alloc()
- *
- * In the case that the zbud page in which the allocation resides is under
- * reclaim, as indicated by the PG_reclaim flag being set, this function
- * only sets the first|last_chunks to 0.  The page is actually freed
- * once both buddies are evicted (see zbud_reclaim_page() below).
  */
 static void zbud_free(struct zbud_pool *pool, unsigned long handle)
 {
@@ -345,18 +322,11 @@ static void zbud_free(struct zbud_pool *pool, unsigned long handle)
        else
                zhdr->first_chunks = 0;
 
-       if (zhdr->under_reclaim) {
-               /* zbud page is under reclaim, reclaim will free */
-               spin_unlock(&pool->lock);
-               return;
-       }
-
        /* Remove from existing buddy list */
        list_del(&zhdr->buddy);
 
        if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
                /* zbud page is empty, free */
-               list_del(&zhdr->lru);
                free_zbud_page(zhdr);
                pool->pages_nr--;
        } else {
@@ -369,110 +339,6 @@ static void zbud_free(struct zbud_pool *pool, unsigned long handle)
 }
 
 /**
- * zbud_reclaim_page() - evicts allocations from a pool page and frees it
- * @pool:      pool from which a page will attempt to be evicted
- * @retries:   number of pages on the LRU list for which eviction will
- *             be attempted before failing
- *
- * zbud reclaim is different from normal system reclaim in that the reclaim is
- * done from the bottom, up.  This is because only the bottom layer, zbud, has
- * information on how the allocations are organized within each zbud page. This
- * has the potential to create interesting locking situations between zbud and
- * the user, however.
- *
- * To avoid these, this is how zbud_reclaim_page() should be called:
- *
- * The user detects a page should be reclaimed and calls zbud_reclaim_page().
- * zbud_reclaim_page() will remove a zbud page from the pool LRU list and call
- * the user-defined eviction handler with the pool and handle as arguments.
- *
- * If the handle can not be evicted, the eviction handler should return
- * non-zero. zbud_reclaim_page() will add the zbud page back to the
- * appropriate list and try the next zbud page on the LRU up to
- * a user defined number of retries.
- *
- * If the handle is successfully evicted, the eviction handler should
- * return 0 _and_ should have called zbud_free() on the handle. zbud_free()
- * contains logic to delay freeing the page if the page is under reclaim,
- * as indicated by the setting of the PG_reclaim flag on the underlying page.
- *
- * If all buddies in the zbud page are successfully evicted, then the
- * zbud page can be freed.
- *
- * Returns: 0 if page is successfully freed, otherwise -EINVAL if there are
- * no pages to evict or an eviction handler is not registered, -EAGAIN if
- * the retry limit was hit.
- */
-static int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
-{
-       int i, ret, freechunks;
-       struct zbud_header *zhdr;
-       unsigned long first_handle = 0, last_handle = 0;
-
-       spin_lock(&pool->lock);
-       if (list_empty(&pool->lru)) {
-               spin_unlock(&pool->lock);
-               return -EINVAL;
-       }
-       for (i = 0; i < retries; i++) {
-               zhdr = list_last_entry(&pool->lru, struct zbud_header, lru);
-               list_del(&zhdr->lru);
-               list_del(&zhdr->buddy);
-               /* Protect zbud page against free */
-               zhdr->under_reclaim = true;
-               /*
-                * We need encode the handles before unlocking, since we can
-                * race with free that will set (first|last)_chunks to 0
-                */
-               first_handle = 0;
-               last_handle = 0;
-               if (zhdr->first_chunks)
-                       first_handle = encode_handle(zhdr, FIRST);
-               if (zhdr->last_chunks)
-                       last_handle = encode_handle(zhdr, LAST);
-               spin_unlock(&pool->lock);
-
-               /* Issue the eviction callback(s) */
-               if (first_handle) {
-                       ret = pool->zpool_ops->evict(pool->zpool, first_handle);
-                       if (ret)
-                               goto next;
-               }
-               if (last_handle) {
-                       ret = pool->zpool_ops->evict(pool->zpool, last_handle);
-                       if (ret)
-                               goto next;
-               }
-next:
-               spin_lock(&pool->lock);
-               zhdr->under_reclaim = false;
-               if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
-                       /*
-                        * Both buddies are now free, free the zbud page and
-                        * return success.
-                        */
-                       free_zbud_page(zhdr);
-                       pool->pages_nr--;
-                       spin_unlock(&pool->lock);
-                       return 0;
-               } else if (zhdr->first_chunks == 0 ||
-                               zhdr->last_chunks == 0) {
-                       /* add to unbuddied list */
-                       freechunks = num_free_chunks(zhdr);
-                       list_add(&zhdr->buddy, &pool->unbuddied[freechunks]);
-               } else {
-                       /* add to buddied list */
-                       list_add(&zhdr->buddy, &pool->buddied);
-               }
-
-               /* add to beginning of LRU */
-               list_add(&zhdr->lru, &pool->lru);
-       }
-       spin_unlock(&pool->lock);
-       return -EAGAIN;
-}
-
-/**
  * zbud_map() - maps the allocation associated with the given handle
  * @pool:      pool in which the allocation resides
  * @handle:    handle associated with the allocation to be mapped
@@ -514,18 +380,9 @@ static u64 zbud_get_pool_size(struct zbud_pool *pool)
  * zpool
  ****************/
 
-static void *zbud_zpool_create(const char *name, gfp_t gfp,
-                              const struct zpool_ops *zpool_ops,
-                              struct zpool *zpool)
+static void *zbud_zpool_create(const char *name, gfp_t gfp)
 {
-       struct zbud_pool *pool;
-
-       pool = zbud_create_pool(gfp);
-       if (pool) {
-               pool->zpool = zpool;
-               pool->zpool_ops = zpool_ops;
-       }
-       return pool;
+       return zbud_create_pool(gfp);
 }
 
 static void zbud_zpool_destroy(void *pool)
@@ -543,25 +400,6 @@ static void zbud_zpool_free(void *pool, unsigned long handle)
        zbud_free(pool, handle);
 }
 
-static int zbud_zpool_shrink(void *pool, unsigned int pages,
-                       unsigned int *reclaimed)
-{
-       unsigned int total = 0;
-       int ret = -EINVAL;
-
-       while (total < pages) {
-               ret = zbud_reclaim_page(pool, 8);
-               if (ret < 0)
-                       break;
-               total++;
-       }
-
-       if (reclaimed)
-               *reclaimed = total;
-
-       return ret;
-}
-
 static void *zbud_zpool_map(void *pool, unsigned long handle,
                        enum zpool_mapmode mm)
 {
@@ -585,7 +423,6 @@ static struct zpool_driver zbud_zpool_driver = {
        .destroy =      zbud_zpool_destroy,
        .malloc =       zbud_zpool_malloc,
        .free =         zbud_zpool_free,
-       .shrink =       zbud_zpool_shrink,
        .map =          zbud_zpool_map,
        .unmap =        zbud_zpool_unmap,
        .total_size =   zbud_zpool_total_size,
index 6a19c4a..8464104 100644
@@ -133,7 +133,6 @@ EXPORT_SYMBOL(zpool_has_pool);
  * @type:      The type of the zpool to create (e.g. zbud, zsmalloc)
  * @name:      The name of the zpool (e.g. zram0, zswap)
  * @gfp:       The GFP flags to use when allocating the pool.
- * @ops:       The optional ops callback.
  *
  * This creates a new zpool of the specified type.  The gfp flags will be
  * used when allocating memory, if the implementation supports it.  If the
@@ -145,8 +144,7 @@ EXPORT_SYMBOL(zpool_has_pool);
  *
  * Returns: New zpool on success, NULL on failure.
  */
-struct zpool *zpool_create_pool(const char *type, const char *name, gfp_t gfp,
-               const struct zpool_ops *ops)
+struct zpool *zpool_create_pool(const char *type, const char *name, gfp_t gfp)
 {
        struct zpool_driver *driver;
        struct zpool *zpool;
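With the ops and zpool back-pointers gone from the driver create() hook, callers now create a pool from just a type, a name and gfp flags; a hedged usage sketch (the surrounding variable names and error handling are illustrative, not taken from any caller in this patch):

	struct zpool *pool;

	pool = zpool_create_pool("zsmalloc", "my_pool", GFP_KERNEL);
	if (!pool)
		return -ENOMEM;

	/* zpool_malloc()/zpool_free()/zpool_map_handle() work as before;
	 * zpool_shrink() and zpool_evictable() no longer exist */

	zpool_destroy_pool(pool);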
@@ -173,7 +171,7 @@ struct zpool *zpool_create_pool(const char *type, const char *name, gfp_t gfp,
        }
 
        zpool->driver = driver;
-       zpool->pool = driver->create(name, gfp, ops, zpool);
+       zpool->pool = driver->create(name, gfp);
 
        if (!zpool->pool) {
                pr_err("couldn't create %s pool\n", type);
@@ -280,30 +278,6 @@ void zpool_free(struct zpool *zpool, unsigned long handle)
 }
 
 /**
- * zpool_shrink() - Shrink the pool size
- * @zpool:     The zpool to shrink.
- * @pages:     The number of pages to shrink the pool.
- * @reclaimed: The number of pages successfully evicted.
- *
- * This attempts to shrink the actual memory size of the pool
- * by evicting currently used handle(s).  If the pool was
- * created with no zpool_ops, or the evict call fails for any
- * of the handles, this will fail.  If non-NULL, the @reclaimed
- * parameter will be set to the number of pages reclaimed,
- * which may be more than the number of pages requested.
- *
- * Implementations must guarantee this to be thread-safe.
- *
- * Returns: 0 on success, negative value on error/failure.
- */
-int zpool_shrink(struct zpool *zpool, unsigned int pages,
-                       unsigned int *reclaimed)
-{
-       return zpool->driver->shrink ?
-              zpool->driver->shrink(zpool->pool, pages, reclaimed) : -EINVAL;
-}
-
-/**
  * zpool_map_handle() - Map a previously allocated handle into memory
  * @zpool:     The zpool that the handle was allocated from
  * @handle:    The handle to map
@@ -360,24 +334,6 @@ u64 zpool_get_total_size(struct zpool *zpool)
 }
 
 /**
- * zpool_evictable() - Test if zpool is potentially evictable
- * @zpool:     The zpool to test
- *
- * Zpool is only potentially evictable when it's created with struct
- * zpool_ops.evict and its driver implements struct zpool_driver.shrink.
- *
- * However, it doesn't necessarily mean driver will use zpool_ops.evict
- * in its implementation of zpool_driver.shrink. It could do internal
- * defragmentation instead.
- *
- * Returns: true if potentially evictable; false otherwise.
- */
-bool zpool_evictable(struct zpool *zpool)
-{
-       return zpool->driver->shrink;
-}
-
-/**
  * zpool_can_sleep_mapped - Test if zpool can sleep when do mapped.
  * @zpool:     The zpool to test
  *
index 02f7f41..3f05797 100644
  */
 #define OBJ_ALLOCATED_TAG 1
 
-#ifdef CONFIG_ZPOOL
-/*
- * The second least-significant bit in the object's header identifies if the
- * value stored at the header is a deferred handle from the last reclaim
- * attempt.
- *
- * As noted above, this is valid because we have room for two bits.
- */
-#define OBJ_DEFERRED_HANDLE_TAG        2
-#define OBJ_TAG_BITS   2
-#define OBJ_TAG_MASK   (OBJ_ALLOCATED_TAG | OBJ_DEFERRED_HANDLE_TAG)
-#else
 #define OBJ_TAG_BITS   1
 #define OBJ_TAG_MASK   OBJ_ALLOCATED_TAG
-#endif /* CONFIG_ZPOOL */
 
 #define OBJ_INDEX_BITS (BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS)
 #define OBJ_INDEX_MASK ((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
@@ -227,12 +214,6 @@ struct link_free {
                 * Handle of allocated object.
                 */
                unsigned long handle;
-#ifdef CONFIG_ZPOOL
-               /*
-                * Deferred handle of a reclaimed object.
-                */
-               unsigned long deferred_handle;
-#endif
        };
 };
 
@@ -250,13 +231,6 @@ struct zs_pool {
        /* Compact classes */
        struct shrinker shrinker;
 
-#ifdef CONFIG_ZPOOL
-       /* List tracking the zspages in LRU order by most recently added object */
-       struct list_head lru;
-       struct zpool *zpool;
-       const struct zpool_ops *zpool_ops;
-#endif
-
 #ifdef CONFIG_ZSMALLOC_STAT
        struct dentry *stat_dentry;
 #endif
@@ -279,13 +253,6 @@ struct zspage {
        unsigned int freeobj;
        struct page *first_page;
        struct list_head list; /* fullness list */
-
-#ifdef CONFIG_ZPOOL
-       /* links the zspage to the lru list in the pool */
-       struct list_head lru;
-       bool under_reclaim;
-#endif
-
        struct zs_pool *pool;
        rwlock_t lock;
 };
@@ -384,23 +351,14 @@ static void record_obj(unsigned long handle, unsigned long obj)
 
 #ifdef CONFIG_ZPOOL
 
-static void *zs_zpool_create(const char *name, gfp_t gfp,
-                            const struct zpool_ops *zpool_ops,
-                            struct zpool *zpool)
+static void *zs_zpool_create(const char *name, gfp_t gfp)
 {
        /*
         * Ignore global gfp flags: zs_malloc() may be invoked from
         * different contexts and its caller must provide a valid
         * gfp mask.
         */
-       struct zs_pool *pool = zs_create_pool(name);
-
-       if (pool) {
-               pool->zpool = zpool;
-               pool->zpool_ops = zpool_ops;
-       }
-
-       return pool;
+       return zs_create_pool(name);
 }
 
 static void zs_zpool_destroy(void *pool)
@@ -422,27 +380,6 @@ static void zs_zpool_free(void *pool, unsigned long handle)
        zs_free(pool, handle);
 }
 
-static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries);
-
-static int zs_zpool_shrink(void *pool, unsigned int pages,
-                       unsigned int *reclaimed)
-{
-       unsigned int total = 0;
-       int ret = -EINVAL;
-
-       while (total < pages) {
-               ret = zs_reclaim_page(pool, 8);
-               if (ret < 0)
-                       break;
-               total++;
-       }
-
-       if (reclaimed)
-               *reclaimed = total;
-
-       return ret;
-}
-
 static void *zs_zpool_map(void *pool, unsigned long handle,
                        enum zpool_mapmode mm)
 {
@@ -481,7 +418,6 @@ static struct zpool_driver zs_zpool_driver = {
        .malloc_support_movable = true,
        .malloc =                 zs_zpool_malloc,
        .free =                   zs_zpool_free,
-       .shrink =                 zs_zpool_shrink,
        .map =                    zs_zpool_map,
        .unmap =                  zs_zpool_unmap,
        .total_size =             zs_zpool_total_size,
@@ -884,14 +820,6 @@ static inline bool obj_allocated(struct page *page, void *obj, unsigned long *ph
        return obj_tagged(page, obj, phandle, OBJ_ALLOCATED_TAG);
 }
 
-#ifdef CONFIG_ZPOOL
-static bool obj_stores_deferred_handle(struct page *page, void *obj,
-               unsigned long *phandle)
-{
-       return obj_tagged(page, obj, phandle, OBJ_DEFERRED_HANDLE_TAG);
-}
-#endif
-
 static void reset_page(struct page *page)
 {
        __ClearPageMovable(page);
@@ -922,39 +850,6 @@ unlock:
        return 0;
 }
 
-#ifdef CONFIG_ZPOOL
-static unsigned long find_deferred_handle_obj(struct size_class *class,
-               struct page *page, int *obj_idx);
-
-/*
- * Free all the deferred handles whose objects are freed in zs_free.
- */
-static void free_handles(struct zs_pool *pool, struct size_class *class,
-               struct zspage *zspage)
-{
-       int obj_idx = 0;
-       struct page *page = get_first_page(zspage);
-       unsigned long handle;
-
-       while (1) {
-               handle = find_deferred_handle_obj(class, page, &obj_idx);
-               if (!handle) {
-                       page = get_next_page(page);
-                       if (!page)
-                               break;
-                       obj_idx = 0;
-                       continue;
-               }
-
-               cache_free_handle(pool, handle);
-               obj_idx++;
-       }
-}
-#else
-static inline void free_handles(struct zs_pool *pool, struct size_class *class,
-               struct zspage *zspage) {}
-#endif
-
 static void __free_zspage(struct zs_pool *pool, struct size_class *class,
                                struct zspage *zspage)
 {
@@ -969,9 +864,6 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
        VM_BUG_ON(get_zspage_inuse(zspage));
        VM_BUG_ON(fg != ZS_INUSE_RATIO_0);
 
-       /* Free all deferred handles from zs_free */
-       free_handles(pool, class, zspage);
-
        next = page = get_first_page(zspage);
        do {
                VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1006,9 +898,6 @@ static void free_zspage(struct zs_pool *pool, struct size_class *class,
        }
 
        remove_zspage(class, zspage, ZS_INUSE_RATIO_0);
-#ifdef CONFIG_ZPOOL
-       list_del(&zspage->lru);
-#endif
        __free_zspage(pool, class, zspage);
 }
 
@@ -1054,11 +943,6 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
                off %= PAGE_SIZE;
        }
 
-#ifdef CONFIG_ZPOOL
-       INIT_LIST_HEAD(&zspage->lru);
-       zspage->under_reclaim = false;
-#endif
-
        set_freeobj(zspage, 0);
 }
 
@@ -1341,7 +1225,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
        spin_unlock(&pool->lock);
 
        class = zspage_class(pool, zspage);
-       off = (class->size * obj_idx) & ~PAGE_MASK;
+       off = offset_in_page(class->size * obj_idx);
 
        local_lock(&zs_map_area.lock);
        area = this_cpu_ptr(&zs_map_area);
@@ -1381,7 +1265,7 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
        obj_to_location(obj, &page, &obj_idx);
        zspage = get_zspage(page);
        class = zspage_class(pool, zspage);
-       off = (class->size * obj_idx) & ~PAGE_MASK;
+       off = offset_in_page(class->size * obj_idx);
 
        area = this_cpu_ptr(&zs_map_area);
        if (off + class->size <= PAGE_SIZE)
@@ -1438,7 +1322,7 @@ static unsigned long obj_malloc(struct zs_pool *pool,
 
        offset = obj * class->size;
        nr_page = offset >> PAGE_SHIFT;
-       m_offset = offset & ~PAGE_MASK;
+       m_offset = offset_in_page(offset);
        m_page = get_first_page(zspage);
 
        for (i = 0; i < nr_page; i++)
@@ -1525,20 +1409,13 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
        /* We completely set up zspage so mark them as movable */
        SetZsPageMovable(pool, zspage);
 out:
-#ifdef CONFIG_ZPOOL
-       /* Add/move zspage to beginning of LRU */
-       if (!list_empty(&zspage->lru))
-               list_del(&zspage->lru);
-       list_add(&zspage->lru, &pool->lru);
-#endif
-
        spin_unlock(&pool->lock);
 
        return handle;
 }
 EXPORT_SYMBOL_GPL(zs_malloc);
 
-static void obj_free(int class_size, unsigned long obj, unsigned long *handle)
+static void obj_free(int class_size, unsigned long obj)
 {
        struct link_free *link;
        struct zspage *zspage;
@@ -1548,31 +1425,18 @@ static void obj_free(int class_size, unsigned long obj, unsigned long *handle)
        void *vaddr;
 
        obj_to_location(obj, &f_page, &f_objidx);
-       f_offset = (class_size * f_objidx) & ~PAGE_MASK;
+       f_offset = offset_in_page(class_size * f_objidx);
        zspage = get_zspage(f_page);
 
        vaddr = kmap_atomic(f_page);
        link = (struct link_free *)(vaddr + f_offset);
 
-       if (handle) {
-#ifdef CONFIG_ZPOOL
-               /* Stores the (deferred) handle in the object's header */
-               *handle |= OBJ_DEFERRED_HANDLE_TAG;
-               *handle &= ~OBJ_ALLOCATED_TAG;
-
-               if (likely(!ZsHugePage(zspage)))
-                       link->deferred_handle = *handle;
-               else
-                       f_page->index = *handle;
-#endif
-       } else {
-               /* Insert this object in containing zspage's freelist */
-               if (likely(!ZsHugePage(zspage)))
-                       link->next = get_freeobj(zspage) << OBJ_TAG_BITS;
-               else
-                       f_page->index = 0;
-               set_freeobj(zspage, f_objidx);
-       }
+       /* Insert this object in containing zspage's freelist */
+       if (likely(!ZsHugePage(zspage)))
+               link->next = get_freeobj(zspage) << OBJ_TAG_BITS;
+       else
+               f_page->index = 0;
+       set_freeobj(zspage, f_objidx);
 
        kunmap_atomic(vaddr);
        mod_zspage_inuse(zspage, -1);
@@ -1600,21 +1464,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
        class = zspage_class(pool, zspage);
 
        class_stat_dec(class, ZS_OBJS_INUSE, 1);
-
-#ifdef CONFIG_ZPOOL
-       if (zspage->under_reclaim) {
-               /*
-                * Reclaim needs the handles during writeback. It'll free
-                * them along with the zspage when it's done with them.
-                *
-                * Record current deferred handle in the object's header.
-                */
-               obj_free(class->size, obj, &handle);
-               spin_unlock(&pool->lock);
-               return;
-       }
-#endif
-       obj_free(class->size, obj, NULL);
+       obj_free(class->size, obj);
 
        fullness = fix_fullness_group(class, zspage);
        if (fullness == ZS_INUSE_RATIO_0)
@@ -1640,8 +1490,8 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
        obj_to_location(src, &s_page, &s_objidx);
        obj_to_location(dst, &d_page, &d_objidx);
 
-       s_off = (class->size * s_objidx) & ~PAGE_MASK;
-       d_off = (class->size * d_objidx) & ~PAGE_MASK;
+       s_off = offset_in_page(class->size * s_objidx);
+       d_off = offset_in_page(class->size * d_objidx);
 
        if (s_off + class->size > PAGE_SIZE)
                s_size = PAGE_SIZE - s_off;
@@ -1735,18 +1585,6 @@ static unsigned long find_alloced_obj(struct size_class *class,
        return find_tagged_obj(class, page, obj_idx, OBJ_ALLOCATED_TAG);
 }
 
-#ifdef CONFIG_ZPOOL
-/*
- * Find object storing a deferred handle in header in zspage from index object
- * and return handle.
- */
-static unsigned long find_deferred_handle_obj(struct size_class *class,
-               struct page *page, int *obj_idx)
-{
-       return find_tagged_obj(class, page, obj_idx, OBJ_DEFERRED_HANDLE_TAG);
-}
-#endif
-
 struct zs_compact_control {
        /* Source spage for migration which could be a subpage of zspage */
        struct page *s_page;
@@ -1786,7 +1624,7 @@ static void migrate_zspage(struct zs_pool *pool, struct size_class *class,
                zs_object_copy(class, free_obj, used_obj);
                obj_idx++;
                record_obj(handle, free_obj);
-               obj_free(class->size, used_obj, NULL);
+               obj_free(class->size, used_obj);
        }
 
        /* Remember last position in this iteration */
@@ -1846,7 +1684,7 @@ static int putback_zspage(struct size_class *class, struct zspage *zspage)
        return fullness;
 }
 
-#if defined(CONFIG_ZPOOL) || defined(CONFIG_COMPACTION)
+#ifdef CONFIG_COMPACTION
 /*
  * To prevent zspage destroy during migration, zspage freeing should
  * hold locks of all pages in the zspage.
@@ -1888,24 +1726,7 @@ static void lock_zspage(struct zspage *zspage)
        }
        migrate_read_unlock(zspage);
 }
-#endif /* defined(CONFIG_ZPOOL) || defined(CONFIG_COMPACTION) */
-
-#ifdef CONFIG_ZPOOL
-/*
- * Unlocks all the pages of the zspage.
- *
- * pool->lock must be held before this function is called
- * to prevent the underlying pages from migrating.
- */
-static void unlock_zspage(struct zspage *zspage)
-{
-       struct page *page = get_first_page(zspage);
-
-       do {
-               unlock_page(page);
-       } while ((page = get_next_page(page)) != NULL);
-}
-#endif /* CONFIG_ZPOOL */
+#endif /* CONFIG_COMPACTION */
 
 static void migrate_lock_init(struct zspage *zspage)
 {
@@ -2126,9 +1947,6 @@ static void async_free_zspage(struct work_struct *work)
                VM_BUG_ON(fullness != ZS_INUSE_RATIO_0);
                class = pool->size_class[class_idx];
                spin_lock(&pool->lock);
-#ifdef CONFIG_ZPOOL
-               list_del(&zspage->lru);
-#endif
                __free_zspage(pool, class, zspage);
                spin_unlock(&pool->lock);
        }
@@ -2474,10 +2292,6 @@ struct zs_pool *zs_create_pool(const char *name)
         */
        zs_register_shrinker(pool);
 
-#ifdef CONFIG_ZPOOL
-       INIT_LIST_HEAD(&pool->lru);
-#endif
-
        return pool;
 
 err:
@@ -2520,190 +2334,6 @@ void zs_destroy_pool(struct zs_pool *pool)
 }
 EXPORT_SYMBOL_GPL(zs_destroy_pool);
 
-#ifdef CONFIG_ZPOOL
-static void restore_freelist(struct zs_pool *pool, struct size_class *class,
-               struct zspage *zspage)
-{
-       unsigned int obj_idx = 0;
-       unsigned long handle, off = 0; /* off is within-page offset */
-       struct page *page = get_first_page(zspage);
-       struct link_free *prev_free = NULL;
-       void *prev_page_vaddr = NULL;
-
-       /* in case no free object found */
-       set_freeobj(zspage, (unsigned int)(-1UL));
-
-       while (page) {
-               void *vaddr = kmap_atomic(page);
-               struct page *next_page;
-
-               while (off < PAGE_SIZE) {
-                       void *obj_addr = vaddr + off;
-
-                       /* skip allocated object */
-                       if (obj_allocated(page, obj_addr, &handle)) {
-                               obj_idx++;
-                               off += class->size;
-                               continue;
-                       }
-
-                       /* free deferred handle from reclaim attempt */
-                       if (obj_stores_deferred_handle(page, obj_addr, &handle))
-                               cache_free_handle(pool, handle);
-
-                       if (prev_free)
-                               prev_free->next = obj_idx << OBJ_TAG_BITS;
-                       else /* first free object found */
-                               set_freeobj(zspage, obj_idx);
-
-                       prev_free = (struct link_free *)vaddr + off / sizeof(*prev_free);
-                       /* if last free object in a previous page, need to unmap */
-                       if (prev_page_vaddr) {
-                               kunmap_atomic(prev_page_vaddr);
-                               prev_page_vaddr = NULL;
-                       }
-
-                       obj_idx++;
-                       off += class->size;
-               }
-
-               /*
-                * Handle the last (full or partial) object on this page.
-                */
-               next_page = get_next_page(page);
-               if (next_page) {
-                       if (!prev_free || prev_page_vaddr) {
-                               /*
-                                * There is no free object in this page, so we can safely
-                                * unmap it.
-                                */
-                               kunmap_atomic(vaddr);
-                       } else {
-                               /* update prev_page_vaddr since prev_free is on this page */
-                               prev_page_vaddr = vaddr;
-                       }
-               } else { /* this is the last page */
-                       if (prev_free) {
-                               /*
-                                * Reset OBJ_TAG_BITS bit to last link to tell
-                                * whether it's allocated object or not.
-                                */
-                               prev_free->next = -1UL << OBJ_TAG_BITS;
-                       }
-
-                       /* unmap previous page (if not done yet) */
-                       if (prev_page_vaddr) {
-                               kunmap_atomic(prev_page_vaddr);
-                               prev_page_vaddr = NULL;
-                       }
-
-                       kunmap_atomic(vaddr);
-               }
-
-               page = next_page;
-               off %= PAGE_SIZE;
-       }
-}
-
-static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries)
-{
-       int i, obj_idx, ret = 0;
-       unsigned long handle;
-       struct zspage *zspage;
-       struct page *page;
-       int fullness;
-
-       /* Lock LRU and fullness list */
-       spin_lock(&pool->lock);
-       if (list_empty(&pool->lru)) {
-               spin_unlock(&pool->lock);
-               return -EINVAL;
-       }
-
-       for (i = 0; i < retries; i++) {
-               struct size_class *class;
-
-               zspage = list_last_entry(&pool->lru, struct zspage, lru);
-               list_del(&zspage->lru);
-
-               /* zs_free may free objects, but not the zspage and handles */
-               zspage->under_reclaim = true;
-
-               class = zspage_class(pool, zspage);
-               fullness = get_fullness_group(class, zspage);
-
-               /* Lock out object allocations and object compaction */
-               remove_zspage(class, zspage, fullness);
-
-               spin_unlock(&pool->lock);
-               cond_resched();
-
-               /* Lock backing pages into place */
-               lock_zspage(zspage);
-
-               obj_idx = 0;
-               page = get_first_page(zspage);
-               while (1) {
-                       handle = find_alloced_obj(class, page, &obj_idx);
-                       if (!handle) {
-                               page = get_next_page(page);
-                               if (!page)
-                                       break;
-                               obj_idx = 0;
-                               continue;
-                       }
-
-                       /*
-                        * This will write the object and call zs_free.
-                        *
-                        * zs_free will free the object, but the
-                        * under_reclaim flag prevents it from freeing
-                        * the zspage altogether. This is necessary so
-                        * that we can continue working with the
-                        * zspage potentially after the last object
-                        * has been freed.
-                        */
-                       ret = pool->zpool_ops->evict(pool->zpool, handle);
-                       if (ret)
-                               goto next;
-
-                       obj_idx++;
-               }
-
-next:
-               /* For freeing the zspage, or putting it back in the pool and LRU list. */
-               spin_lock(&pool->lock);
-               zspage->under_reclaim = false;
-
-               if (!get_zspage_inuse(zspage)) {
-                       /*
-                        * Fullness went stale as zs_free() won't touch it
-                        * while the page is removed from the pool. Fix it
-                        * up for the check in __free_zspage().
-                        */
-                       zspage->fullness = ZS_INUSE_RATIO_0;
-
-                       __free_zspage(pool, class, zspage);
-                       spin_unlock(&pool->lock);
-                       return 0;
-               }
-
-               /*
-                * Eviction fails on one of the handles, so we need to restore zspage.
-                * We need to rebuild its freelist (and free stored deferred handles),
-                * put it back to the correct size class, and add it to the LRU list.
-                */
-               restore_freelist(pool, class, zspage);
-               putback_zspage(class, zspage);
-               list_add(&zspage->lru, &pool->lru);
-               unlock_zspage(zspage);
-       }
-
-       spin_unlock(&pool->lock);
-       return -EAGAIN;
-}
-#endif /* CONFIG_ZPOOL */
-
 static int __init zs_init(void)
 {
        int ret;
index 30092d9..62195f7 100644 (file)
@@ -37,6 +37,7 @@
 #include <linux/workqueue.h>
 
 #include "swap.h"
+#include "internal.h"
 
 /*********************************
 * statistics
@@ -137,6 +138,10 @@ static bool zswap_non_same_filled_pages_enabled = true;
 module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
                   bool, 0644);
 
+static bool zswap_exclusive_loads_enabled = IS_ENABLED(
+               CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON);
+module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
+
 /*********************************
 * data structures
 **********************************/
@@ -149,6 +154,12 @@ struct crypto_acomp_ctx {
        struct mutex *mutex;
 };
 
+/*
+ * The lock ordering is zswap_tree.lock -> zswap_pool.lru_lock.
+ * The only case where lru_lock is not acquired while holding tree.lock is
+ * when a zswap_entry is taken off the lru for writeback; in that case it
+ * needs to be verified that it's still valid in the tree.
+ */
 struct zswap_pool {
        struct zpool *zpool;
        struct crypto_acomp_ctx __percpu *acomp_ctx;
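A sketch of the lock ordering documented above, with pthread mutexes standing in for zswap_tree.lock and zswap_pool.lru_lock (names and structure are illustrative, not zswap's actual code): the LRU lock only nests inside the tree lock, and the one path that takes the LRU lock first must drop it and re-validate the entry under the tree lock.

        #include <pthread.h>

        /* Illustrative stand-ins for zswap_tree.lock and zswap_pool.lru_lock. */
        static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t lru_lock  = PTHREAD_MUTEX_INITIALIZER;

        /* Common case: lru_lock is only ever taken with tree_lock held. */
        static void store_entry(void)
        {
                pthread_mutex_lock(&tree_lock);
                pthread_mutex_lock(&lru_lock);
                /* ... insert into the tree and onto the LRU ... */
                pthread_mutex_unlock(&lru_lock);
                pthread_mutex_unlock(&tree_lock);
        }

        /* Writeback case: the entry is popped off the LRU first, so its key is
         * copied out and the entry is looked up again once tree_lock is held. */
        static void reclaim_entry(void)
        {
                pthread_mutex_lock(&lru_lock);
                /* ... pop the tail entry, copy its swap offset to the stack ... */
                pthread_mutex_unlock(&lru_lock);

                pthread_mutex_lock(&tree_lock);
                /* ... re-search the tree; bail out if the entry was invalidated ... */
                pthread_mutex_unlock(&tree_lock);
        }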
@@ -158,6 +169,8 @@ struct zswap_pool {
        struct work_struct shrink_work;
        struct hlist_node node;
        char tfm_name[CRYPTO_MAX_ALG_NAME];
+       struct list_head lru;
+       spinlock_t lru_lock;
 };
 
 /*
@@ -175,14 +188,16 @@ struct zswap_pool {
  *            be held while changing the refcount.  Since the lock must
  *            be held, there is no reason to also make refcount atomic.
  * length - the length in bytes of the compressed page data.  Needed during
- *          decompression. For a same value filled page length is 0.
+ *          decompression. For a same-value filled page, length is 0, and both
+ *          pool and lru are invalid and must be ignored.
  * pool - the zswap_pool the entry's data is in
  * handle - zpool allocation handle that stores the compressed page data
  * value - value of the same-value filled pages which have same content
+ * lru - handle to the pool's lru used to evict pages.
  */
 struct zswap_entry {
        struct rb_node rbnode;
-       pgoff_t offset;
+       swp_entry_t swpentry;
        int refcount;
        unsigned int length;
        struct zswap_pool *pool;
@@ -191,10 +206,7 @@ struct zswap_entry {
                unsigned long value;
        };
        struct obj_cgroup *objcg;
-};
-
-struct zswap_header {
-       swp_entry_t swpentry;
+       struct list_head lru;
 };
 
 /*
@@ -238,14 +250,11 @@ static bool zswap_has_pool;
        pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,         \
                 zpool_get_type((p)->zpool))
 
-static int zswap_writeback_entry(struct zpool *pool, unsigned long handle);
+static int zswap_writeback_entry(struct zswap_entry *entry,
+                                struct zswap_tree *tree);
 static int zswap_pool_get(struct zswap_pool *pool);
 static void zswap_pool_put(struct zswap_pool *pool);
 
-static const struct zpool_ops zswap_zpool_ops = {
-       .evict = zswap_writeback_entry
-};
-
 static bool zswap_is_full(void)
 {
        return totalram_pages() * zswap_max_pool_percent / 100 <
@@ -302,12 +311,14 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
 {
        struct rb_node *node = root->rb_node;
        struct zswap_entry *entry;
+       pgoff_t entry_offset;
 
        while (node) {
                entry = rb_entry(node, struct zswap_entry, rbnode);
-               if (entry->offset > offset)
+               entry_offset = swp_offset(entry->swpentry);
+               if (entry_offset > offset)
                        node = node->rb_left;
-               else if (entry->offset < offset)
+               else if (entry_offset < offset)
                        node = node->rb_right;
                else
                        return entry;
@@ -324,13 +335,15 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
 {
        struct rb_node **link = &root->rb_node, *parent = NULL;
        struct zswap_entry *myentry;
+       pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
 
        while (*link) {
                parent = *link;
                myentry = rb_entry(parent, struct zswap_entry, rbnode);
-               if (myentry->offset > entry->offset)
+               myentry_offset = swp_offset(myentry->swpentry);
+               if (myentry_offset > entry_offset)
                        link = &(*link)->rb_left;
-               else if (myentry->offset < entry->offset)
+               else if (myentry_offset < entry_offset)
                        link = &(*link)->rb_right;
                else {
                        *dupentry = myentry;
@@ -342,12 +355,14 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
        return 0;
 }
 
-static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
+static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
 {
        if (!RB_EMPTY_NODE(&entry->rbnode)) {
                rb_erase(&entry->rbnode, root);
                RB_CLEAR_NODE(&entry->rbnode);
+               return true;
        }
+       return false;
 }
 
 /*
@@ -363,6 +378,9 @@ static void zswap_free_entry(struct zswap_entry *entry)
        if (!entry->length)
                atomic_dec(&zswap_same_filled_pages);
        else {
+               spin_lock(&entry->pool->lru_lock);
+               list_del(&entry->lru);
+               spin_unlock(&entry->pool->lru_lock);
                zpool_free(entry->pool->zpool, entry->handle);
                zswap_pool_put(entry->pool);
        }
@@ -583,13 +601,95 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
        return NULL;
 }
 
+/*
+ * If the entry is still valid in the tree, drop the initial ref and remove it
+ * from the tree. This function must be called with an additional ref held,
+ * otherwise it may race with another invalidation freeing the entry.
+ */
+static void zswap_invalidate_entry(struct zswap_tree *tree,
+                                  struct zswap_entry *entry)
+{
+       if (zswap_rb_erase(&tree->rbroot, entry))
+               zswap_entry_put(tree, entry);
+}
+
+static int zswap_reclaim_entry(struct zswap_pool *pool)
+{
+       struct zswap_entry *entry;
+       struct zswap_tree *tree;
+       pgoff_t swpoffset;
+       int ret;
+
+       /* Get an entry off the LRU */
+       spin_lock(&pool->lru_lock);
+       if (list_empty(&pool->lru)) {
+               spin_unlock(&pool->lru_lock);
+               return -EINVAL;
+       }
+       entry = list_last_entry(&pool->lru, struct zswap_entry, lru);
+       list_del_init(&entry->lru);
+       /*
+        * Once the lru lock is dropped, the entry might get freed. The
+        * swpoffset is copied to the stack, and entry isn't deref'd again
+        * until the entry is verified to still be alive in the tree.
+        */
+       swpoffset = swp_offset(entry->swpentry);
+       tree = zswap_trees[swp_type(entry->swpentry)];
+       spin_unlock(&pool->lru_lock);
+
+       /* Check for invalidate() race */
+       spin_lock(&tree->lock);
+       if (entry != zswap_rb_search(&tree->rbroot, swpoffset)) {
+               ret = -EAGAIN;
+               goto unlock;
+       }
+       /* Hold a reference to prevent a free during writeback */
+       zswap_entry_get(entry);
+       spin_unlock(&tree->lock);
+
+       ret = zswap_writeback_entry(entry, tree);
+
+       spin_lock(&tree->lock);
+       if (ret) {
+               /* Writeback failed, put entry back on LRU */
+               spin_lock(&pool->lru_lock);
+               list_move(&entry->lru, &pool->lru);
+               spin_unlock(&pool->lru_lock);
+               goto put_unlock;
+       }
+
+       /*
+        * Writeback started successfully, the page now belongs to the
+        * swapcache. Drop the entry from zswap - unless invalidate already
+        * took it out while we had the tree->lock released for IO.
+        */
+       zswap_invalidate_entry(tree, entry);
+
+put_unlock:
+       /* Drop local reference */
+       zswap_entry_put(tree, entry);
+unlock:
+       spin_unlock(&tree->lock);
+       return ret ? -EAGAIN : 0;
+}
+
 static void shrink_worker(struct work_struct *w)
 {
        struct zswap_pool *pool = container_of(w, typeof(*pool),
                                                shrink_work);
+       int ret, failures = 0;
 
-       if (zpool_shrink(pool->zpool, 1, NULL))
-               zswap_reject_reclaim_fail++;
+       do {
+               ret = zswap_reclaim_entry(pool);
+               if (ret) {
+                       zswap_reject_reclaim_fail++;
+                       if (ret != -EAGAIN)
+                               break;
+                       if (++failures == MAX_RECLAIM_RETRIES)
+                               break;
+               }
+               cond_resched();
+       } while (!zswap_can_accept());
        zswap_pool_put(pool);
 }
 
@@ -618,7 +718,7 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
        /* unique name for each pool specifically required by zsmalloc */
        snprintf(name, 38, "zswap%x", atomic_inc_return(&zswap_pools_count));
 
-       pool->zpool = zpool_create_pool(type, name, gfp, &zswap_zpool_ops);
+       pool->zpool = zpool_create_pool(type, name, gfp);
        if (!pool->zpool) {
                pr_err("%s zpool not available\n", type);
                goto error;
@@ -644,6 +744,8 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
         */
        kref_init(&pool->kref);
        INIT_LIST_HEAD(&pool->list);
+       INIT_LIST_HEAD(&pool->lru);
+       spin_lock_init(&pool->lru_lock);
        INIT_WORK(&pool->shrink_work, shrink_worker);
 
        zswap_pool_debug("created", pool);
@@ -964,16 +1066,14 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
  * the swap cache, the compressed version stored by zswap can be
  * freed.
  */
-static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
+static int zswap_writeback_entry(struct zswap_entry *entry,
+                                struct zswap_tree *tree)
 {
-       struct zswap_header *zhdr;
-       swp_entry_t swpentry;
-       struct zswap_tree *tree;
-       pgoff_t offset;
-       struct zswap_entry *entry;
+       swp_entry_t swpentry = entry->swpentry;
        struct page *page;
        struct scatterlist input, output;
        struct crypto_acomp_ctx *acomp_ctx;
+       struct zpool *pool = entry->pool->zpool;
 
        u8 *src, *tmp = NULL;
        unsigned int dlen;
@@ -988,25 +1088,6 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
                        return -ENOMEM;
        }
 
-       /* extract swpentry from data */
-       zhdr = zpool_map_handle(pool, handle, ZPOOL_MM_RO);
-       swpentry = zhdr->swpentry; /* here */
-       tree = zswap_trees[swp_type(swpentry)];
-       offset = swp_offset(swpentry);
-       zpool_unmap_handle(pool, handle);
-
-       /* find and ref zswap entry */
-       spin_lock(&tree->lock);
-       entry = zswap_entry_find_get(&tree->rbroot, offset);
-       if (!entry) {
-               /* entry was invalidated */
-               spin_unlock(&tree->lock);
-               kfree(tmp);
-               return 0;
-       }
-       spin_unlock(&tree->lock);
-       BUG_ON(offset != entry->offset);
-
        /* try to allocate swap cache page */
        switch (zswap_get_swap_cache_page(swpentry, &page)) {
        case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
@@ -1028,7 +1109,7 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
                 * writing.
                 */
                spin_lock(&tree->lock);
-               if (zswap_rb_search(&tree->rbroot, entry->offset) != entry) {
+               if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) {
                        spin_unlock(&tree->lock);
                        delete_from_swap_cache(page_folio(page));
                        ret = -ENOMEM;
@@ -1040,12 +1121,11 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
                acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
                dlen = PAGE_SIZE;
 
-               zhdr = zpool_map_handle(pool, handle, ZPOOL_MM_RO);
-               src = (u8 *)zhdr + sizeof(struct zswap_header);
+               src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO);
                if (!zpool_can_sleep_mapped(pool)) {
                        memcpy(tmp, src, entry->length);
                        src = tmp;
-                       zpool_unmap_handle(pool, handle);
+                       zpool_unmap_handle(pool, entry->handle);
                }
 
                mutex_lock(acomp_ctx->mutex);
@@ -1060,7 +1140,7 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
                if (!zpool_can_sleep_mapped(pool))
                        kfree(tmp);
                else
-                       zpool_unmap_handle(pool, handle);
+                       zpool_unmap_handle(pool, entry->handle);
 
                BUG_ON(ret);
                BUG_ON(dlen != PAGE_SIZE);
@@ -1077,23 +1157,7 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
        put_page(page);
        zswap_written_back_pages++;
 
-       spin_lock(&tree->lock);
-       /* drop local reference */
-       zswap_entry_put(tree, entry);
-
-       /*
-       * There are two possible situations for entry here:
-       * (1) refcount is 1(normal case),  entry is valid and on the tree
-       * (2) refcount is 0, entry is freed and not on the tree
-       *     because invalidate happened during writeback
-       *  search the tree and free the entry if find entry
-       */
-       if (entry == zswap_rb_search(&tree->rbroot, offset))
-               zswap_entry_put(tree, entry);
-       spin_unlock(&tree->lock);
-
        return ret;
-
 fail:
        if (!zpool_can_sleep_mapped(pool))
                kfree(tmp);
@@ -1102,13 +1166,8 @@ fail:
        * if we get here due to ZSWAP_SWAPCACHE_EXIST
        * a load may be happening concurrently.
        * it is safe and okay to not free the entry.
-       * if we free the entry in the following put
        * it is also okay to return !0
        */
-       spin_lock(&tree->lock);
-       zswap_entry_put(tree, entry);
-       spin_unlock(&tree->lock);
-
        return ret;
 }
 
@@ -1156,11 +1215,10 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
        struct obj_cgroup *objcg = NULL;
        struct zswap_pool *pool;
        int ret;
-       unsigned int hlen, dlen = PAGE_SIZE;
+       unsigned int dlen = PAGE_SIZE;
        unsigned long handle, value;
        char *buf;
        u8 *src, *dst;
-       struct zswap_header zhdr = { .swpentry = swp_entry(type, offset) };
        gfp_t gfp;
 
        /* THP isn't supported */
@@ -1195,7 +1253,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
        if (zswap_pool_reached_full) {
               if (!zswap_can_accept()) {
                        ret = -ENOMEM;
-                       goto reject;
+                       goto shrink;
                } else
                        zswap_pool_reached_full = false;
        }
@@ -1212,7 +1270,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
                src = kmap_atomic(page);
                if (zswap_is_page_same_filled(src, &value)) {
                        kunmap_atomic(src);
-                       entry->offset = offset;
+                       entry->swpentry = swp_entry(type, offset);
                        entry->length = 0;
                        entry->value = value;
                        atomic_inc(&zswap_same_filled_pages);
@@ -1266,11 +1324,10 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
        }
 
        /* store */
-       hlen = zpool_evictable(entry->pool->zpool) ? sizeof(zhdr) : 0;
        gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
        if (zpool_malloc_support_movable(entry->pool->zpool))
                gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
-       ret = zpool_malloc(entry->pool->zpool, hlen + dlen, gfp, &handle);
+       ret = zpool_malloc(entry->pool->zpool, dlen, gfp, &handle);
        if (ret == -ENOSPC) {
                zswap_reject_compress_poor++;
                goto put_dstmem;
@@ -1280,13 +1337,12 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
                goto put_dstmem;
        }
        buf = zpool_map_handle(entry->pool->zpool, handle, ZPOOL_MM_WO);
-       memcpy(buf, &zhdr, hlen);
-       memcpy(buf + hlen, dst, dlen);
+       memcpy(buf, dst, dlen);
        zpool_unmap_handle(entry->pool->zpool, handle);
        mutex_unlock(acomp_ctx->mutex);
 
        /* populate entry */
-       entry->offset = offset;
+       entry->swpentry = swp_entry(type, offset);
        entry->handle = handle;
        entry->length = dlen;
 
@@ -1309,6 +1365,11 @@ insert_entry:
                        zswap_entry_put(tree, dupentry);
                }
        } while (ret == -EEXIST);
+       if (entry->length) {
+               spin_lock(&entry->pool->lru_lock);
+               list_add(&entry->lru, &entry->pool->lru);
+               spin_unlock(&entry->pool->lru_lock);
+       }
        spin_unlock(&tree->lock);
 
        /* update stats */
@@ -1341,7 +1402,7 @@ shrink:
  * return -1 on entry not found or error
 */
 static int zswap_frontswap_load(unsigned type, pgoff_t offset,
-                               struct page *page)
+                               struct page *page, bool *exclusive)
 {
        struct zswap_tree *tree = zswap_trees[type];
        struct zswap_entry *entry;
@@ -1380,8 +1441,6 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
        /* decompress */
        dlen = PAGE_SIZE;
        src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
-       if (zpool_evictable(entry->pool->zpool))
-               src += sizeof(struct zswap_header);
 
        if (!zpool_can_sleep_mapped(entry->pool->zpool)) {
                memcpy(tmp, src, entry->length);
@@ -1410,6 +1469,14 @@ stats:
                count_objcg_event(entry->objcg, ZSWPIN);
 freeentry:
        spin_lock(&tree->lock);
+       if (!ret && zswap_exclusive_loads_enabled) {
+               zswap_invalidate_entry(tree, entry);
+               *exclusive = true;
+       } else if (entry->length) {
+               spin_lock(&entry->pool->lru_lock);
+               list_move(&entry->lru, &entry->pool->lru);
+               spin_unlock(&entry->pool->lru_lock);
+       }
        zswap_entry_put(tree, entry);
        spin_unlock(&tree->lock);
 
@@ -1430,13 +1497,7 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
                spin_unlock(&tree->lock);
                return;
        }
-
-       /* remove from rbtree */
-       zswap_rb_erase(&tree->rbroot, entry);
-
-       /* drop the initial reference from entry creation */
-       zswap_entry_put(tree, entry);
-
+       zswap_invalidate_entry(tree, entry);
        spin_unlock(&tree->lock);
 }
 
index 285e8ff..9ee3b7a 100644 (file)
@@ -1670,6 +1670,10 @@ int mptcp_subflow_create_socket(struct sock *sk, unsigned short family,
 
        lock_sock_nested(sf->sk, SINGLE_DEPTH_NESTING);
 
+       err = security_mptcp_add_subflow(sk, sf->sk);
+       if (err)
+               goto release_ssk;
+
        /* the newly created socket has to be in the same cgroup as its parent */
        mptcp_attach_cgroup(sk, sf->sk);
 
@@ -1682,6 +1686,8 @@ int mptcp_subflow_create_socket(struct sock *sk, unsigned short family,
        get_net_track(net, &sf->sk->ns_tracker, GFP_KERNEL);
        sock_inuse_add(net, 1);
        err = tcp_set_ulp(sf->sk, "mptcp");
+
+release_ssk:
        release_sock(sf->sk);
 
        if (err) {
index 0310732..95aeb31 100644 (file)
@@ -40,7 +40,7 @@ MODULE_ALIAS("ip_set_hash:net,iface");
 #define IP_SET_HASH_WITH_MULTI
 #define IP_SET_HASH_WITH_NET0
 
-#define STRLCPY(a, b)  strlcpy(a, b, IFNAMSIZ)
+#define STRSCPY(a, b)  strscpy(a, b, IFNAMSIZ)
 
 /* IPv4 variant */
 
@@ -182,11 +182,11 @@ hash_netiface4_kadt(struct ip_set *set, const struct sk_buff *skb,
 
                if (!eiface)
                        return -EINVAL;
-               STRLCPY(e.iface, eiface);
+               STRSCPY(e.iface, eiface);
                e.physdev = 1;
 #endif
        } else {
-               STRLCPY(e.iface, SRCDIR ? IFACE(in) : IFACE(out));
+               STRSCPY(e.iface, SRCDIR ? IFACE(in) : IFACE(out));
        }
 
        if (strlen(e.iface) == 0)
@@ -400,11 +400,11 @@ hash_netiface6_kadt(struct ip_set *set, const struct sk_buff *skb,
 
                if (!eiface)
                        return -EINVAL;
-               STRLCPY(e.iface, eiface);
+               STRSCPY(e.iface, eiface);
                e.physdev = 1;
 #endif
        } else {
-               STRLCPY(e.iface, SRCDIR ? IFACE(in) : IFACE(out));
+               STRSCPY(e.iface, SRCDIR ? IFACE(in) : IFACE(out));
        }
 
        if (strlen(e.iface) == 0)
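The STRLCPY-to-STRSCPY wrapper change above switches to strscpy(), which never reads the source beyond what fits in the destination and reports truncation directly: it returns the number of bytes copied, or -E2BIG if the NUL-terminated source did not fit. A rough userspace model of that contract (strscpy_model is a made-up name; the real implementation lives in lib/string.c):

        #include <errno.h>
        #include <stddef.h>
        #include <sys/types.h>

        /* Rough model of strscpy(): copy at most size - 1 bytes, always
         * NUL-terminate, return bytes copied or -E2BIG on truncation. */
        static ssize_t strscpy_model(char *dst, const char *src, size_t size)
        {
                size_t i;

                if (size == 0)
                        return -E2BIG;

                for (i = 0; i < size - 1 && src[i] != '\0'; i++)
                        dst[i] = src[i];
                dst[i] = '\0';

                return src[i] == '\0' ? (ssize_t)i : -E2BIG;
        }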
index 0f25a38..0f7a729 100644 (file)
@@ -783,7 +783,7 @@ int qrtr_ns_init(void)
                goto err_sock;
        }
 
-       qrtr_ns.workqueue = alloc_workqueue("qrtr_ns_handler", WQ_UNBOUND, 1);
+       qrtr_ns.workqueue = alloc_ordered_workqueue("qrtr_ns_handler", 0);
        if (!qrtr_ns.workqueue) {
                ret = -ENOMEM;
                goto err_sock;
index f2cf4aa..fa8aec7 100644 (file)
@@ -988,7 +988,7 @@ static int __init af_rxrpc_init(void)
                goto error_call_jar;
        }
 
-       rxrpc_workqueue = alloc_workqueue("krxrpcd", WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+       rxrpc_workqueue = alloc_ordered_workqueue("krxrpcd", WQ_HIGHPRI | WQ_MEM_RECLAIM);
        if (!rxrpc_workqueue) {
                pr_notice("Failed to allocate work queue\n");
                goto error_work_queue;
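The two workqueue conversions above (qrtr_ns_handler and krxrpcd) move from alloc_workqueue(..., WQ_UNBOUND, 1) to alloc_ordered_workqueue(). An ordered workqueue is unbound and executes at most one work item at a time, in queueing order, which is the guarantee the max_active == 1 form was relying on; the helper states that intent explicitly. A hedged kernel-style sketch of the call-site pattern (handler_wq_setup and the queue name are made-up names):

        #include <linux/workqueue.h>

        static struct workqueue_struct *handler_wq;

        /* Sketch only: request strict one-at-a-time, in-order execution
         * explicitly instead of relying on WQ_UNBOUND with max_active == 1. */
        static int handler_wq_setup(void)
        {
                handler_wq = alloc_ordered_workqueue("example_handler", WQ_MEM_RECLAIM);
                if (!handler_wq)
                        return -ENOMEM;
                return 0;
        }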
index 8c3c8b2..2b0e54b 100644 (file)
@@ -471,6 +471,7 @@ struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
                return file;
        }
 
+       file->f_mode |= FMODE_NOWAIT;
        sock->file = file;
        file->private_data = sock;
        stream_open(SOCK_INODE(sock), file);
@@ -1073,7 +1074,7 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
        struct socket *sock = file->private_data;
 
        if (unlikely(!sock->ops->splice_read))
-               return generic_file_splice_read(file, ppos, pipe, len, flags);
+               return copy_splice_read(file, ppos, pipe, len, flags);
 
        return sock->ops->splice_read(sock, ppos, pipe, len, flags);
 }
index 79967b6..587811a 100644 (file)
@@ -109,15 +109,15 @@ param_get_pool_mode(char *buf, const struct kernel_param *kp)
        switch (*ip)
        {
        case SVC_POOL_AUTO:
-               return strlcpy(buf, "auto\n", 20);
+               return sysfs_emit(buf, "auto\n");
        case SVC_POOL_GLOBAL:
-               return strlcpy(buf, "global\n", 20);
+               return sysfs_emit(buf, "global\n");
        case SVC_POOL_PERCPU:
-               return strlcpy(buf, "percpu\n", 20);
+               return sysfs_emit(buf, "percpu\n");
        case SVC_POOL_PERNODE:
-               return strlcpy(buf, "pernode\n", 20);
+               return sysfs_emit(buf, "pernode\n");
        default:
-               return sprintf(buf, "%d\n", *ip);
+               return sysfs_emit(buf, "%d\n", *ip);
        }
 }
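sysfs_emit() bounds the output to the PAGE_SIZE buffer that sysfs/module-param show handlers receive, so each branch above no longer needs to pick an arbitrary length for strlcpy() or fall back to an unbounded sprintf(). A userspace approximation of the bounded-formatting behaviour (sysfs_emit_model is a made-up name; the kernel version additionally warns if buf is not page-aligned):

        #include <stdarg.h>
        #include <stdio.h>

        #define PAGE_SIZE 4096      /* assumption for the model */

        /* Format into buf, never writing more than PAGE_SIZE bytes; return
         * the number of characters placed in buf (excluding the NUL). */
        static int sysfs_emit_model(char *buf, const char *fmt, ...)
        {
                va_list args;
                int len;

                va_start(args, fmt);
                len = vsnprintf(buf, PAGE_SIZE, fmt, args);
                va_end(args);

                return len < PAGE_SIZE ? len : PAGE_SIZE - 1;
        }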
 
@@ -597,34 +597,25 @@ svc_destroy(struct kref *ref)
 }
 EXPORT_SYMBOL_GPL(svc_destroy);
 
-/*
- * Allocate an RPC server's buffer space.
- * We allocate pages and place them in rq_pages.
- */
-static int
+static bool
 svc_init_buffer(struct svc_rqst *rqstp, unsigned int size, int node)
 {
-       unsigned int pages, arghi;
+       unsigned long pages, ret;
 
        /* bc_xprt uses fore channel allocated buffers */
        if (svc_is_backchannel(rqstp))
-               return 1;
+               return true;
 
        pages = size / PAGE_SIZE + 1; /* extra page as we hold both request and reply.
                                       * We assume one is at most one page
                                       */
-       arghi = 0;
        WARN_ON_ONCE(pages > RPCSVC_MAXPAGES);
        if (pages > RPCSVC_MAXPAGES)
                pages = RPCSVC_MAXPAGES;
-       while (pages) {
-               struct page *p = alloc_pages_node(node, GFP_KERNEL, 0);
-               if (!p)
-                       break;
-               rqstp->rq_pages[arghi++] = p;
-               pages--;
-       }
-       return pages == 0;
+
+       ret = alloc_pages_bulk_array_node(GFP_KERNEL, node, pages,
+                                         rqstp->rq_pages);
+       return ret == pages;
 }
 
 /*
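svc_init_buffer() now requests the whole rq_pages array in one bulk call on the caller's NUMA node and treats anything short of a full array as failure, instead of looping over single-page allocations. A userspace model of that all-or-nothing pattern (bulk_alloc_model and init_buffer_model are made-up names; the real helper fills only the NULL slots and returns how many entries are populated):

        #include <stdbool.h>
        #include <stdlib.h>

        #define MODEL_PAGE_SIZE 4096

        /* Fill NULL slots in pages[] with page-sized buffers; return the
         * number of populated entries (stand-in for
         * alloc_pages_bulk_array_node()). */
        static unsigned long bulk_alloc_model(unsigned long nr, void **pages)
        {
                unsigned long i, filled = 0;

                for (i = 0; i < nr; i++) {
                        if (!pages[i])
                                pages[i] = malloc(MODEL_PAGE_SIZE);
                        if (!pages[i])
                                break;
                        filled++;
                }
                return filled;
        }

        static bool init_buffer_model(void **pages, unsigned long wanted)
        {
                /* Same success criterion as the new svc_init_buffer():
                 * only a fully populated array counts as success. */
                return bulk_alloc_model(wanted, pages) == wanted;
        }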
@@ -649,7 +640,7 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
        if (!rqstp)
                return rqstp;
 
-       pagevec_init(&rqstp->rq_pvec);
+       folio_batch_init(&rqstp->rq_fbatch);
 
        __set_bit(RQ_BUSY, &rqstp->rq_flags);
        rqstp->rq_server = serv;
@@ -860,9 +851,9 @@ bool svc_rqst_replace_page(struct svc_rqst *rqstp, struct page *page)
        }
 
        if (*rqstp->rq_next_page) {
-               if (!pagevec_space(&rqstp->rq_pvec))
-                       __pagevec_release(&rqstp->rq_pvec);
-               pagevec_add(&rqstp->rq_pvec, *rqstp->rq_next_page);
+               if (!folio_batch_add(&rqstp->rq_fbatch,
+                               page_folio(*rqstp->rq_next_page)))
+                       __folio_batch_release(&rqstp->rq_fbatch);
        }
 
        get_page(page);
@@ -896,7 +887,7 @@ void svc_rqst_release_pages(struct svc_rqst *rqstp)
 void
 svc_rqst_free(struct svc_rqst *rqstp)
 {
-       pagevec_release(&rqstp->rq_pvec);
+       folio_batch_release(&rqstp->rq_fbatch);
        svc_release_buffer(rqstp);
        if (rqstp->rq_scratch_page)
                put_page(rqstp->rq_scratch_page);
@@ -1173,6 +1164,7 @@ static void __svc_unregister(struct net *net, const u32 program, const u32 versi
  */
 static void svc_unregister(const struct svc_serv *serv, struct net *net)
 {
+       struct sighand_struct *sighand;
        struct svc_program *progp;
        unsigned long flags;
        unsigned int i;
@@ -1189,9 +1181,12 @@ static void svc_unregister(const struct svc_serv *serv, struct net *net)
                }
        }
 
-       spin_lock_irqsave(&current->sighand->siglock, flags);
+       rcu_read_lock();
+       sighand = rcu_dereference(current->sighand);
+       spin_lock_irqsave(&sighand->siglock, flags);
        recalc_sigpending();
-       spin_unlock_irqrestore(&current->sighand->siglock, flags);
+       spin_unlock_irqrestore(&sighand->siglock, flags);
+       rcu_read_unlock();
 }
 
 /*
index 13a1489..62c7919 100644 (file)
@@ -74,13 +74,18 @@ static LIST_HEAD(svc_xprt_class_list);
  *               that no other thread will be using the transport or will
  *               try to set XPT_DEAD.
  */
+
+/**
+ * svc_reg_xprt_class - Register a server-side RPC transport class
+ * @xcl: New transport class to be registered
+ *
+ * Returns zero on success; otherwise a negative errno is returned.
+ */
 int svc_reg_xprt_class(struct svc_xprt_class *xcl)
 {
        struct svc_xprt_class *cl;
        int res = -EEXIST;
 
-       dprintk("svc: Adding svc transport class '%s'\n", xcl->xcl_name);
-
        INIT_LIST_HEAD(&xcl->xcl_list);
        spin_lock(&svc_xprt_class_lock);
        /* Make sure there isn't already a class with the same name */
@@ -96,9 +101,13 @@ out:
 }
 EXPORT_SYMBOL_GPL(svc_reg_xprt_class);
 
+/**
+ * svc_unreg_xprt_class - Unregister a server-side RPC transport class
+ * @xcl: Transport class to be unregistered
+ *
+ */
 void svc_unreg_xprt_class(struct svc_xprt_class *xcl)
 {
-       dprintk("svc: Removing svc transport class '%s'\n", xcl->xcl_name);
        spin_lock(&svc_xprt_class_lock);
        list_del_init(&xcl->xcl_list);
        spin_unlock(&svc_xprt_class_lock);
@@ -685,8 +694,9 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
        }
 
        for (filled = 0; filled < pages; filled = ret) {
-               ret = alloc_pages_bulk_array(GFP_KERNEL, pages,
-                                            rqstp->rq_pages);
+               ret = alloc_pages_bulk_array_node(GFP_KERNEL,
+                                                 rqstp->rq_pool->sp_id,
+                                                 pages, rqstp->rq_pages);
                if (ret > filled)
                        /* Made progress, don't sleep yet */
                        continue;
@@ -843,15 +853,11 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
                svc_xprt_received(xprt);
        } else if (svc_xprt_reserve_slot(rqstp, xprt)) {
                /* XPT_DATA|XPT_DEFERRED case: */
-               dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
-                       rqstp, rqstp->rq_pool->sp_id, xprt,
-                       kref_read(&xprt->xpt_ref));
                rqstp->rq_deferred = svc_deferred_dequeue(xprt);
                if (rqstp->rq_deferred)
                        len = svc_deferred_recv(rqstp);
                else
                        len = xprt->xpt_ops->xpo_recvfrom(rqstp);
-               rqstp->rq_stime = ktime_get();
                rqstp->rq_reserved = serv->sv_max_mesg;
                atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
        } else
@@ -894,6 +900,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
        err = -EAGAIN;
        if (len <= 0)
                goto out_release;
+
        trace_svc_xdr_recvfrom(&rqstp->rq_arg);
 
        clear_bit(XPT_OLD, &xprt->xpt_flags);
@@ -902,6 +909,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 
        if (serv->sv_stats)
                serv->sv_stats->netcnt++;
+       rqstp->rq_stime = ktime_get();
        return len;
 out_release:
        rqstp->rq_res.len = 0;
index 9d9f522..e43f263 100644 (file)
@@ -826,12 +826,6 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
 
        trace_sk_data_ready(sk);
 
-       if (svsk) {
-               /* Refer to svc_setup_socket() for details. */
-               rmb();
-               svsk->sk_odata(sk);
-       }
-
        /*
         * This callback may be called twice when a new connection
         * is established as a child socket inherits everything
@@ -840,13 +834,18 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
         *    when one of child sockets becomes ESTABLISHED.
         * 2) data_ready method of the child socket may be called
         *    when it receives data before the socket is accepted.
-        * In case of 2, we should ignore it silently.
+        * In case of 2, we should ignore it silently and DO NOT
+        * dereference svsk.
         */
-       if (sk->sk_state == TCP_LISTEN) {
-               if (svsk) {
-                       set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
-                       svc_xprt_enqueue(&svsk->sk_xprt);
-               }
+       if (sk->sk_state != TCP_LISTEN)
+               return;
+
+       if (svsk) {
+               /* Refer to svc_setup_socket() for details. */
+               rmb();
+               svsk->sk_odata(sk);
+               set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
+               svc_xprt_enqueue(&svsk->sk_xprt);
        }
 }
 
@@ -887,13 +886,8 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
        clear_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
        err = kernel_accept(sock, &newsock, O_NONBLOCK);
        if (err < 0) {
-               if (err == -ENOMEM)
-                       printk(KERN_WARNING "%s: no more sockets!\n",
-                              serv->sv_name);
-               else if (err != -EAGAIN)
-                       net_warn_ratelimited("%s: accept failed (err %d)!\n",
-                                            serv->sv_name, -err);
-               trace_svcsock_accept_err(xprt, serv->sv_name, err);
+               if (err != -EAGAIN)
+                       trace_svcsock_accept_err(xprt, serv->sv_name, err);
                return NULL;
        }
        if (IS_ERR(sock_alloc_file(newsock, O_NONBLOCK, NULL)))
@@ -1450,7 +1444,7 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
        svsk->sk_owspace = inet->sk_write_space;
        /*
         * This barrier is necessary in order to prevent race condition
-        * with svc_data_ready(), svc_listen_data_ready() and others
+        * with svc_data_ready(), svc_tcp_listen_data_ready(), and others
         * when calling callbacks above.
         */
        wmb();
@@ -1462,7 +1456,7 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
        else
                svc_tcp_init(svsk, serv);
 
-       trace_svcsock_new_socket(sock);
+       trace_svcsock_new(svsk, sock);
        return svsk;
 }
 
@@ -1643,6 +1637,8 @@ static void svc_sock_free(struct svc_xprt *xprt)
        struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
        struct socket *sock = svsk->sk_sock;
 
+       trace_svcsock_free(svsk, sock);
+
        tls_handshake_cancel(sock->sk);
        if (sock->file)
                sockfd_put(sock);
index 36835b2..2a22e78 100644 (file)
@@ -1070,22 +1070,22 @@ __be32 * xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes)
 }
 EXPORT_SYMBOL_GPL(xdr_reserve_space);
 
-
 /**
  * xdr_reserve_space_vec - Reserves a large amount of buffer space for sending
  * @xdr: pointer to xdr_stream
- * @vec: pointer to a kvec array
  * @nbytes: number of bytes to reserve
  *
- * Reserves enough buffer space to encode 'nbytes' of data and stores the
- * pointers in 'vec'. The size argument passed to xdr_reserve_space() is
- * determined based on the number of bytes remaining in the current page to
- * avoid invalidating iov_base pointers when xdr_commit_encode() is called.
+ * The size argument passed to xdr_reserve_space() is determined based
+ * on the number of bytes remaining in the current page to avoid
+ * invalidating iov_base pointers when xdr_commit_encode() is called.
+ *
+ * Return values:
+ *   %0: success
+ *   %-EMSGSIZE: not enough space is available in @xdr
  */
-int xdr_reserve_space_vec(struct xdr_stream *xdr, struct kvec *vec, size_t nbytes)
+int xdr_reserve_space_vec(struct xdr_stream *xdr, size_t nbytes)
 {
-       int thislen;
-       int v = 0;
+       size_t thislen;
        __be32 *p;
 
        /*
@@ -1097,21 +1097,19 @@ int xdr_reserve_space_vec(struct xdr_stream *xdr, struct kvec *vec, size_t nbyte
                xdr->end = xdr->p;
        }
 
+       /* XXX: Let's find a way to make this more efficient */
        while (nbytes) {
                thislen = xdr->buf->page_len % PAGE_SIZE;
                thislen = min_t(size_t, nbytes, PAGE_SIZE - thislen);
 
                p = xdr_reserve_space(xdr, thislen);
                if (!p)
-                       return -EIO;
+                       return -EMSGSIZE;
 
-               vec[v].iov_base = p;
-               vec[v].iov_len = thislen;
-               v++;
                nbytes -= thislen;
        }
 
-       return v;
+       return 0;
 }
 EXPORT_SYMBOL_GPL(xdr_reserve_space_vec);
 
index aa2227a..7420a2c 100644 (file)
@@ -93,13 +93,7 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
         */
        get_page(virt_to_page(rqst->rq_buffer));
        sctxt->sc_send_wr.opcode = IB_WR_SEND;
-       ret = svc_rdma_send(rdma, sctxt);
-       if (ret < 0)
-               return ret;
-
-       ret = wait_for_completion_killable(&sctxt->sc_done);
-       svc_rdma_send_ctxt_put(rdma, sctxt);
-       return ret;
+       return svc_rdma_send(rdma, sctxt);
 }
 
 /* Server-side transport endpoint wants a whole page for its send
index a22fe75..85c8bca 100644 (file)
@@ -125,14 +125,15 @@ static void svc_rdma_recv_cid_init(struct svcxprt_rdma *rdma,
 static struct svc_rdma_recv_ctxt *
 svc_rdma_recv_ctxt_alloc(struct svcxprt_rdma *rdma)
 {
+       int node = ibdev_to_node(rdma->sc_cm_id->device);
        struct svc_rdma_recv_ctxt *ctxt;
        dma_addr_t addr;
        void *buffer;
 
-       ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
+       ctxt = kmalloc_node(sizeof(*ctxt), GFP_KERNEL, node);
        if (!ctxt)
                goto fail0;
-       buffer = kmalloc(rdma->sc_max_req_size, GFP_KERNEL);
+       buffer = kmalloc_node(rdma->sc_max_req_size, GFP_KERNEL, node);
        if (!buffer)
                goto fail1;
        addr = ib_dma_map_single(rdma->sc_pd->device, buffer,
@@ -155,7 +156,6 @@ svc_rdma_recv_ctxt_alloc(struct svcxprt_rdma *rdma)
        ctxt->rc_recv_sge.length = rdma->sc_max_req_size;
        ctxt->rc_recv_sge.lkey = rdma->sc_pd->local_dma_lkey;
        ctxt->rc_recv_buf = buffer;
-       ctxt->rc_temp = false;
        return ctxt;
 
 fail2:
@@ -232,10 +232,7 @@ void svc_rdma_recv_ctxt_put(struct svcxprt_rdma *rdma,
        pcl_free(&ctxt->rc_write_pcl);
        pcl_free(&ctxt->rc_reply_pcl);
 
-       if (!ctxt->rc_temp)
-               llist_add(&ctxt->rc_node, &rdma->sc_recv_ctxts);
-       else
-               svc_rdma_recv_ctxt_destroy(rdma, ctxt);
+       llist_add(&ctxt->rc_node, &rdma->sc_recv_ctxts);
 }
 
 /**
@@ -258,7 +255,7 @@ void svc_rdma_release_ctxt(struct svc_xprt *xprt, void *vctxt)
 }
 
 static bool svc_rdma_refresh_recvs(struct svcxprt_rdma *rdma,
-                                  unsigned int wanted, bool temp)
+                                  unsigned int wanted)
 {
        const struct ib_recv_wr *bad_wr = NULL;
        struct svc_rdma_recv_ctxt *ctxt;
@@ -275,7 +272,6 @@ static bool svc_rdma_refresh_recvs(struct svcxprt_rdma *rdma,
                        break;
 
                trace_svcrdma_post_recv(ctxt);
-               ctxt->rc_temp = temp;
                ctxt->rc_recv_wr.next = recv_chain;
                recv_chain = &ctxt->rc_recv_wr;
                rdma->sc_pending_recvs++;
@@ -309,7 +305,7 @@ err_free:
  */
 bool svc_rdma_post_recvs(struct svcxprt_rdma *rdma)
 {
-       return svc_rdma_refresh_recvs(rdma, rdma->sc_max_requests, true);
+       return svc_rdma_refresh_recvs(rdma, rdma->sc_max_requests);
 }
 
 /**
@@ -343,7 +339,7 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
         * client reconnects.
         */
        if (rdma->sc_pending_recvs < rdma->sc_max_requests)
-               if (!svc_rdma_refresh_recvs(rdma, rdma->sc_recv_batch, false))
+               if (!svc_rdma_refresh_recvs(rdma, rdma->sc_recv_batch))
                        goto dropped;
 
        /* All wc fields are now known to be valid */
@@ -775,9 +771,6 @@ static bool svc_rdma_is_reverse_direction_reply(struct svc_xprt *xprt,
  *
  * The next ctxt is removed from the "receive" lists.
  *
- * - If the ctxt completes a Read, then finish assembling the Call
- *   message and return the number of bytes in the message.
- *
  * - If the ctxt completes a Receive, then construct the Call
  *   message from the contents of the Receive buffer.
  *
@@ -786,7 +779,8 @@ static bool svc_rdma_is_reverse_direction_reply(struct svc_xprt *xprt,
  *     in the message.
  *
  *   - If there are Read chunks in this message, post Read WRs to
- *     pull that payload and return 0.
+ *     pull that payload. When the Read WRs complete, build the
+ *     full message and return the number of bytes in it.
  */
 int svc_rdma_recvfrom(struct svc_rqst *rqstp)
 {
@@ -796,6 +790,12 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
        struct svc_rdma_recv_ctxt *ctxt;
        int ret;
 
+       /* Prevent svc_xprt_release() from releasing pages in rq_pages
+        * when returning 0 or an error.
+        */
+       rqstp->rq_respages = rqstp->rq_pages;
+       rqstp->rq_next_page = rqstp->rq_respages;
+
        rqstp->rq_xprt_ctxt = NULL;
 
        ctxt = NULL;
@@ -819,12 +819,6 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
                                   DMA_FROM_DEVICE);
        svc_rdma_build_arg_xdr(rqstp, ctxt);
 
-       /* Prevent svc_xprt_release from releasing pages in rq_pages
-        * if we return 0 or an error.
-        */
-       rqstp->rq_respages = rqstp->rq_pages;
-       rqstp->rq_next_page = rqstp->rq_respages;
-
        ret = svc_rdma_xdr_decode_req(&rqstp->rq_arg, ctxt);
        if (ret < 0)
                goto out_err;
index 11cf7c6..e460e25 100644 (file)
@@ -62,8 +62,8 @@ svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma, unsigned int sges)
        if (node) {
                ctxt = llist_entry(node, struct svc_rdma_rw_ctxt, rw_node);
        } else {
-               ctxt = kmalloc(struct_size(ctxt, rw_first_sgl, SG_CHUNK_SIZE),
-                              GFP_KERNEL);
+               ctxt = kmalloc_node(struct_size(ctxt, rw_first_sgl, SG_CHUNK_SIZE),
+                                   GFP_KERNEL, ibdev_to_node(rdma->sc_cm_id->device));
                if (!ctxt)
                        goto out_noctx;
 
@@ -84,8 +84,7 @@ out_noctx:
        return NULL;
 }
 
-static void __svc_rdma_put_rw_ctxt(struct svcxprt_rdma *rdma,
-                                  struct svc_rdma_rw_ctxt *ctxt,
+static void __svc_rdma_put_rw_ctxt(struct svc_rdma_rw_ctxt *ctxt,
                                   struct llist_head *list)
 {
        sg_free_table_chained(&ctxt->rw_sg_table, SG_CHUNK_SIZE);
@@ -95,7 +94,7 @@ static void __svc_rdma_put_rw_ctxt(struct svcxprt_rdma *rdma,
 static void svc_rdma_put_rw_ctxt(struct svcxprt_rdma *rdma,
                                 struct svc_rdma_rw_ctxt *ctxt)
 {
-       __svc_rdma_put_rw_ctxt(rdma, ctxt, &rdma->sc_rw_ctxts);
+       __svc_rdma_put_rw_ctxt(ctxt, &rdma->sc_rw_ctxts);
 }
 
 /**
@@ -191,6 +190,8 @@ static void svc_rdma_cc_release(struct svc_rdma_chunk_ctxt *cc,
        struct svc_rdma_rw_ctxt *ctxt;
        LLIST_HEAD(free);
 
+       trace_svcrdma_cc_release(&cc->cc_cid, cc->cc_sqecount);
+
        first = last = NULL;
        while ((ctxt = svc_rdma_next_ctxt(&cc->cc_rwctxts)) != NULL) {
                list_del(&ctxt->rw_list);
@@ -198,7 +199,7 @@ static void svc_rdma_cc_release(struct svc_rdma_chunk_ctxt *cc,
                rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
                                    rdma->sc_port_num, ctxt->rw_sg_table.sgl,
                                    ctxt->rw_nents, dir);
-               __svc_rdma_put_rw_ctxt(rdma, ctxt, &free);
+               __svc_rdma_put_rw_ctxt(ctxt, &free);
 
                ctxt->rw_node.next = first;
                first = &ctxt->rw_node;
@@ -234,7 +235,8 @@ svc_rdma_write_info_alloc(struct svcxprt_rdma *rdma,
 {
        struct svc_rdma_write_info *info;
 
-       info = kmalloc(sizeof(*info), GFP_KERNEL);
+       info = kmalloc_node(sizeof(*info), GFP_KERNEL,
+                           ibdev_to_node(rdma->sc_cm_id->device));
        if (!info)
                return info;
 
@@ -304,7 +306,8 @@ svc_rdma_read_info_alloc(struct svcxprt_rdma *rdma)
 {
        struct svc_rdma_read_info *info;
 
-       info = kmalloc(sizeof(*info), GFP_KERNEL);
+       info = kmalloc_node(sizeof(*info), GFP_KERNEL,
+                           ibdev_to_node(rdma->sc_cm_id->device));
        if (!info)
                return info;
 
@@ -351,8 +354,7 @@ static void svc_rdma_wc_read_done(struct ib_cq *cq, struct ib_wc *wc)
        return;
 }
 
-/* This function sleeps when the transport's Send Queue is congested.
- *
+/*
  * Assumptions:
  * - If ib_post_send() succeeds, only one completion is expected,
  *   even if one or more WRs are flushed. This is true when posting
@@ -367,6 +369,8 @@ static int svc_rdma_post_chunk_ctxt(struct svc_rdma_chunk_ctxt *cc)
        struct ib_cqe *cqe;
        int ret;
 
+       might_sleep();
+
        if (cc->cc_sqecount > rdma->sc_sq_depth)
                return -EINVAL;
 
index 22a871e..c6644cc 100644 (file)
@@ -123,18 +123,17 @@ static void svc_rdma_send_cid_init(struct svcxprt_rdma *rdma,
 static struct svc_rdma_send_ctxt *
 svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
 {
+       int node = ibdev_to_node(rdma->sc_cm_id->device);
        struct svc_rdma_send_ctxt *ctxt;
        dma_addr_t addr;
        void *buffer;
-       size_t size;
        int i;
 
-       size = sizeof(*ctxt);
-       size += rdma->sc_max_send_sges * sizeof(struct ib_sge);
-       ctxt = kmalloc(size, GFP_KERNEL);
+       ctxt = kmalloc_node(struct_size(ctxt, sc_sges, rdma->sc_max_send_sges),
+                           GFP_KERNEL, node);
        if (!ctxt)
                goto fail0;
-       buffer = kmalloc(rdma->sc_max_req_size, GFP_KERNEL);
+       buffer = kmalloc_node(rdma->sc_max_req_size, GFP_KERNEL, node);
        if (!buffer)
                goto fail1;
        addr = ib_dma_map_single(rdma->sc_pd->device, buffer,
@@ -148,7 +147,6 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
        ctxt->sc_send_wr.wr_cqe = &ctxt->sc_cqe;
        ctxt->sc_send_wr.sg_list = ctxt->sc_sges;
        ctxt->sc_send_wr.send_flags = IB_SEND_SIGNALED;
-       init_completion(&ctxt->sc_done);
        ctxt->sc_cqe.done = svc_rdma_wc_send;
        ctxt->sc_xprt_buf = buffer;
        xdr_buf_init(&ctxt->sc_hdrbuf, ctxt->sc_xprt_buf,
@@ -214,6 +212,7 @@ out:
 
        ctxt->sc_send_wr.num_sge = 0;
        ctxt->sc_cur_sge_no = 0;
+       ctxt->sc_page_count = 0;
        return ctxt;
 
 out_empty:
@@ -228,6 +227,8 @@ out_empty:
  * svc_rdma_send_ctxt_put - Return send_ctxt to free list
  * @rdma: controlling svcxprt_rdma
  * @ctxt: object to return to the free list
+ *
+ * Pages left in sc_pages are DMA unmapped and released.
  */
 void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
                            struct svc_rdma_send_ctxt *ctxt)
@@ -235,6 +236,9 @@ void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
        struct ib_device *device = rdma->sc_cm_id->device;
        unsigned int i;
 
+       if (ctxt->sc_page_count)
+               release_pages(ctxt->sc_pages, ctxt->sc_page_count);
+
        /* The first SGE contains the transport header, which
         * remains mapped until @ctxt is destroyed.
         */
@@ -281,12 +285,12 @@ static void svc_rdma_wc_send(struct ib_cq *cq, struct ib_wc *wc)
                container_of(cqe, struct svc_rdma_send_ctxt, sc_cqe);
 
        svc_rdma_wake_send_waiters(rdma, 1);
-       complete(&ctxt->sc_done);
 
        if (unlikely(wc->status != IB_WC_SUCCESS))
                goto flushed;
 
        trace_svcrdma_wc_send(wc, &ctxt->sc_cid);
+       svc_rdma_send_ctxt_put(rdma, ctxt);
        return;
 
 flushed:
@@ -294,6 +298,7 @@ flushed:
                trace_svcrdma_wc_send_err(wc, &ctxt->sc_cid);
        else
                trace_svcrdma_wc_send_flush(wc, &ctxt->sc_cid);
+       svc_rdma_send_ctxt_put(rdma, ctxt);
        svc_xprt_deferred_close(&rdma->sc_xprt);
 }
 
@@ -310,7 +315,7 @@ int svc_rdma_send(struct svcxprt_rdma *rdma, struct svc_rdma_send_ctxt *ctxt)
        struct ib_send_wr *wr = &ctxt->sc_send_wr;
        int ret;
 
-       reinit_completion(&ctxt->sc_done);
+       might_sleep();
 
        /* Sync the transport header buffer */
        ib_dma_sync_single_for_device(rdma->sc_pd->device,
@@ -799,6 +804,25 @@ int svc_rdma_map_reply_msg(struct svcxprt_rdma *rdma,
                                       svc_rdma_xb_dma_map, &args);
 }
 
+/* The svc_rqst and all resources it owns are released as soon as
+ * svc_rdma_sendto returns. Transfer pages under I/O to the ctxt
+ * so they are released by the Send completion handler.
+ */
+static void svc_rdma_save_io_pages(struct svc_rqst *rqstp,
+                                  struct svc_rdma_send_ctxt *ctxt)
+{
+       int i, pages = rqstp->rq_next_page - rqstp->rq_respages;
+
+       ctxt->sc_page_count += pages;
+       for (i = 0; i < pages; i++) {
+               ctxt->sc_pages[i] = rqstp->rq_respages[i];
+               rqstp->rq_respages[i] = NULL;
+       }
+
+       /* Prevent svc_xprt_release from releasing pages in rq_pages */
+       rqstp->rq_next_page = rqstp->rq_respages;
+}
+
 /* Prepare the portion of the RPC Reply that will be transmitted
  * via RDMA Send. The RPC-over-RDMA transport header is prepared
  * in sc_sges[0], and the RPC xdr_buf is prepared in following sges.
@@ -828,6 +852,8 @@ static int svc_rdma_send_reply_msg(struct svcxprt_rdma *rdma,
        if (ret < 0)
                return ret;
 
+       svc_rdma_save_io_pages(rqstp, sctxt);
+
        if (rctxt->rc_inv_rkey) {
                sctxt->sc_send_wr.opcode = IB_WR_SEND_WITH_INV;
                sctxt->sc_send_wr.ex.invalidate_rkey = rctxt->rc_inv_rkey;
@@ -835,13 +861,7 @@ static int svc_rdma_send_reply_msg(struct svcxprt_rdma *rdma,
                sctxt->sc_send_wr.opcode = IB_WR_SEND;
        }
 
-       ret = svc_rdma_send(rdma, sctxt);
-       if (ret < 0)
-               return ret;
-
-       ret = wait_for_completion_killable(&sctxt->sc_done);
-       svc_rdma_send_ctxt_put(rdma, sctxt);
-       return ret;
+       return svc_rdma_send(rdma, sctxt);
 }
 
 /**
@@ -907,8 +927,7 @@ void svc_rdma_send_error_msg(struct svcxprt_rdma *rdma,
        sctxt->sc_sges[0].length = sctxt->sc_hdrbuf.len;
        if (svc_rdma_send(rdma, sctxt))
                goto put_ctxt;
-
-       wait_for_completion_killable(&sctxt->sc_done);
+       return;
 
 put_ctxt:
        svc_rdma_send_ctxt_put(rdma, sctxt);
@@ -976,17 +995,16 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
        ret = svc_rdma_send_reply_msg(rdma, sctxt, rctxt, rqstp);
        if (ret < 0)
                goto put_ctxt;
-
-       /* Prevent svc_xprt_release() from releasing the page backing
-        * rq_res.head[0].iov_base. It's no longer being accessed by
-        * the I/O device. */
-       rqstp->rq_respages++;
        return 0;
 
 reply_chunk:
        if (ret != -E2BIG && ret != -EINVAL)
                goto put_ctxt;
 
+       /* Send completion releases payload pages that were part
+        * of previously posted RDMA Writes.
+        */
+       svc_rdma_save_io_pages(rqstp, sctxt);
        svc_rdma_send_error_msg(rdma, sctxt, rctxt, ret);
        return 0;
 
index ca04f7a..2abd895 100644
@@ -64,7 +64,7 @@
 #define RPCDBG_FACILITY        RPCDBG_SVCXPRT
 
 static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
-                                                struct net *net);
+                                                struct net *net, int node);
 static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
                                        struct net *net,
                                        struct sockaddr *sa, int salen,
@@ -123,14 +123,14 @@ static void qp_event_handler(struct ib_event *event, void *context)
 }
 
 static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
-                                                struct net *net)
+                                                struct net *net, int node)
 {
-       struct svcxprt_rdma *cma_xprt = kzalloc(sizeof *cma_xprt, GFP_KERNEL);
+       struct svcxprt_rdma *cma_xprt;
 
-       if (!cma_xprt) {
-               dprintk("svcrdma: failed to create new transport\n");
+       cma_xprt = kzalloc_node(sizeof(*cma_xprt), GFP_KERNEL, node);
+       if (!cma_xprt)
                return NULL;
-       }
+
        svc_xprt_init(net, &svc_rdma_class, &cma_xprt->sc_xprt, serv);
        INIT_LIST_HEAD(&cma_xprt->sc_accept_q);
        INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q);
@@ -193,9 +193,9 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
        struct svcxprt_rdma *newxprt;
        struct sockaddr *sa;
 
-       /* Create a new transport */
        newxprt = svc_rdma_create_xprt(listen_xprt->sc_xprt.xpt_server,
-                                      listen_xprt->sc_xprt.xpt_net);
+                                      listen_xprt->sc_xprt.xpt_net,
+                                      ibdev_to_node(new_cma_id->device));
        if (!newxprt)
                return;
        newxprt->sc_cm_id = new_cma_id;
@@ -304,7 +304,7 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
 
        if (sa->sa_family != AF_INET && sa->sa_family != AF_INET6)
                return ERR_PTR(-EAFNOSUPPORT);
-       cma_xprt = svc_rdma_create_xprt(serv, net);
+       cma_xprt = svc_rdma_create_xprt(serv, net, NUMA_NO_NODE);
        if (!cma_xprt)
                return ERR_PTR(-ENOMEM);
        set_bit(XPT_LISTENER, &cma_xprt->sc_xprt.xpt_flags);
index 02207e8..06cead2 100644
@@ -103,7 +103,7 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
 
        mmap_read_lock(current->mm);
        npgs = pin_user_pages(address, umem->npgs,
-                             gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
+                             gup_flags | FOLL_LONGTERM, &umem->pgs[0]);
        mmap_read_unlock(current->mm);
 
        if (npgs != umem->npgs) {
index c89c753..eb6f22e 100644
@@ -10,6 +10,9 @@ upstream. In general, only additions should be performed (e.g. new
 methods). Eventually, changes should make it into upstream so that,
 at some point, this fork can be dropped from the kernel tree.
 
+The Rust upstream version on top of which these files are based matches
+the output of `scripts/min-tool-version.sh rustc`.
+
 
 ## Rationale
 
index ca224a5..acf22d4 100644
@@ -22,21 +22,24 @@ use core::marker::Destruct;
 mod tests;
 
 extern "Rust" {
-    // These are the magic symbols to call the global allocator.  rustc generates
+    // These are the magic symbols to call the global allocator. rustc generates
     // them to call `__rg_alloc` etc. if there is a `#[global_allocator]` attribute
     // (the code expanding that attribute macro generates those functions), or to call
-    // the default implementations in libstd (`__rdl_alloc` etc. in `library/std/src/alloc.rs`)
+    // the default implementations in std (`__rdl_alloc` etc. in `library/std/src/alloc.rs`)
     // otherwise.
-    // The rustc fork of LLVM also special-cases these function names to be able to optimize them
+    // The rustc fork of LLVM 14 and earlier also special-cases these function names to be able to optimize them
     // like `malloc`, `realloc`, and `free`, respectively.
     #[rustc_allocator]
-    #[rustc_allocator_nounwind]
+    #[rustc_nounwind]
     fn __rust_alloc(size: usize, align: usize) -> *mut u8;
-    #[rustc_allocator_nounwind]
+    #[rustc_deallocator]
+    #[rustc_nounwind]
     fn __rust_dealloc(ptr: *mut u8, size: usize, align: usize);
-    #[rustc_allocator_nounwind]
+    #[rustc_reallocator]
+    #[rustc_nounwind]
     fn __rust_realloc(ptr: *mut u8, old_size: usize, align: usize, new_size: usize) -> *mut u8;
-    #[rustc_allocator_nounwind]
+    #[rustc_allocator_zeroed]
+    #[rustc_nounwind]
     fn __rust_alloc_zeroed(size: usize, align: usize) -> *mut u8;
 }
 
@@ -72,11 +75,14 @@ pub use std::alloc::Global;
 /// # Examples
 ///
 /// ```
-/// use std::alloc::{alloc, dealloc, Layout};
+/// use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};
 ///
 /// unsafe {
 ///     let layout = Layout::new::<u16>();
 ///     let ptr = alloc(layout);
+///     if ptr.is_null() {
+///         handle_alloc_error(layout);
+///     }
 ///
 ///     *(ptr as *mut u16) = 42;
 ///     assert_eq!(*(ptr as *mut u16), 42);
@@ -349,7 +355,7 @@ pub(crate) const unsafe fn box_free<T: ?Sized, A: ~const Allocator + ~const Dest
 
 #[cfg(not(no_global_oom_handling))]
 extern "Rust" {
-    // This is the magic symbol to call the global alloc error handler.  rustc generates
+    // This is the magic symbol to call the global alloc error handler. rustc generates
     // it to call `__rg_oom` if there is a `#[alloc_error_handler]`, or to call the
     // default implementations below (`__rdl_oom`) otherwise.
     fn __rust_alloc_error_handler(size: usize, align: usize) -> !;
@@ -394,25 +400,24 @@ pub use std::alloc::handle_alloc_error;
 #[allow(unused_attributes)]
 #[unstable(feature = "alloc_internals", issue = "none")]
 pub mod __alloc_error_handler {
-    use crate::alloc::Layout;
-
-    // called via generated `__rust_alloc_error_handler`
-
-    // if there is no `#[alloc_error_handler]`
+    // called via generated `__rust_alloc_error_handler` if there is no
+    // `#[alloc_error_handler]`.
     #[rustc_std_internal_symbol]
-    pub unsafe extern "C-unwind" fn __rdl_oom(size: usize, _align: usize) -> ! {
-        panic!("memory allocation of {size} bytes failed")
-    }
-
-    // if there is an `#[alloc_error_handler]`
-    #[rustc_std_internal_symbol]
-    pub unsafe extern "C-unwind" fn __rg_oom(size: usize, align: usize) -> ! {
-        let layout = unsafe { Layout::from_size_align_unchecked(size, align) };
+    pub unsafe fn __rdl_oom(size: usize, _align: usize) -> ! {
         extern "Rust" {
-            #[lang = "oom"]
-            fn oom_impl(layout: Layout) -> !;
+            // This symbol is emitted by rustc next to __rust_alloc_error_handler.
+            // Its value depends on the -Zoom={panic,abort} compiler option.
+            static __rust_alloc_error_handler_should_panic: u8;
+        }
+
+        #[allow(unused_unsafe)]
+        if unsafe { __rust_alloc_error_handler_should_panic != 0 } {
+            panic!("memory allocation of {size} bytes failed")
+        } else {
+            core::panicking::panic_nounwind_fmt(format_args!(
+                "memory allocation of {size} bytes failed"
+            ))
         }
-        unsafe { oom_impl(layout) }
     }
 }
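
The null-check idiom added to the alloc() doc example above applies equally to the
zeroed entry point; `std::alloc::alloc_zeroed` is how user code reaches the
`__rust_alloc_zeroed` symbol annotated here. A minimal stand-alone sketch, plain
userspace Rust rather than part of the patch:

    use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};

    fn main() {
        let layout = Layout::new::<u32>();
        unsafe {
            // Ask the global allocator for one zero-initialized u32.
            let ptr = alloc_zeroed(layout) as *mut u32;
            if ptr.is_null() {
                // Diverges, reporting the failed layout, as in the doc example above.
                handle_alloc_error(layout);
            }
            assert_eq!(*ptr, 0);
            dealloc(ptr as *mut u8, layout);
        }
    }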
 
index dcfe87b..14af986 100644
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: Apache-2.0 OR MIT
 
-//! A pointer type for heap allocation.
+//! The `Box<T>` type for heap allocation.
 //!
 //! [`Box<T>`], casually referred to as a 'box', provides the simplest form of
 //! heap allocation in Rust. Boxes provide ownership for this allocation, and
 //! definition is just using `T*` can lead to undefined behavior, as
 //! described in [rust-lang/unsafe-code-guidelines#198][ucg#198].
 //!
+//! # Considerations for unsafe code
+//!
+//! **Warning: This section is not normative and is subject to change, possibly
+//! being relaxed in the future! It is a simplified summary of the rules
+//! currently implemented in the compiler.**
+//!
+//! The aliasing rules for `Box<T>` are the same as for `&mut T`. `Box<T>`
+//! asserts uniqueness over its content. Using raw pointers derived from a box
+//! after that box has been mutated through, moved or borrowed as `&mut T`
+//! is not allowed. For more guidance on working with box from unsafe code, see
+//! [rust-lang/unsafe-code-guidelines#326][ucg#326].
+//!
+//!
 //! [ucg#198]: https://github.com/rust-lang/unsafe-code-guidelines/issues/198
+//! [ucg#326]: https://github.com/rust-lang/unsafe-code-guidelines/issues/326
 //! [dereferencing]: core::ops::Deref
 //! [`Box::<T>::from_raw(value)`]: Box::from_raw
 //! [`Global`]: crate::alloc::Global
@@ -139,12 +153,14 @@ use core::async_iter::AsyncIterator;
 use core::borrow;
 use core::cmp::Ordering;
 use core::convert::{From, TryFrom};
+use core::error::Error;
 use core::fmt;
 use core::future::Future;
 use core::hash::{Hash, Hasher};
 #[cfg(not(no_global_oom_handling))]
 use core::iter::FromIterator;
 use core::iter::{FusedIterator, Iterator};
+use core::marker::Tuple;
 use core::marker::{Destruct, Unpin, Unsize};
 use core::mem;
 use core::ops::{
@@ -163,6 +179,8 @@ use crate::raw_vec::RawVec;
 #[cfg(not(no_global_oom_handling))]
 use crate::str::from_boxed_utf8_unchecked;
 #[cfg(not(no_global_oom_handling))]
+use crate::string::String;
+#[cfg(not(no_global_oom_handling))]
 use crate::vec::Vec;
 
 #[cfg(not(no_thin))]
@@ -172,7 +190,7 @@ pub use thin::ThinBox;
 #[cfg(not(no_thin))]
 mod thin;
 
-/// A pointer type for heap allocation.
+/// A pointer type that uniquely owns a heap allocation of type `T`.
 ///
 /// See the [module-level documentation](../../std/boxed/index.html) for more.
 #[lang = "owned_box"]
@@ -196,12 +214,13 @@ impl<T> Box<T> {
     /// ```
     /// let five = Box::new(5);
     /// ```
-    #[cfg(not(no_global_oom_handling))]
+    #[cfg(all(not(no_global_oom_handling)))]
     #[inline(always)]
     #[stable(feature = "rust1", since = "1.0.0")]
     #[must_use]
     pub fn new(x: T) -> Self {
-        box x
+        #[rustc_box]
+        Box::new(x)
     }
 
     /// Constructs a new box with uninitialized contents.
@@ -256,14 +275,21 @@ impl<T> Box<T> {
         Self::new_zeroed_in(Global)
     }
 
-    /// Constructs a new `Pin<Box<T>>`. If `T` does not implement `Unpin`, then
+    /// Constructs a new `Pin<Box<T>>`. If `T` does not implement [`Unpin`], then
     /// `x` will be pinned in memory and unable to be moved.
+    ///
+    /// Constructing and pinning of the `Box` can also be done in two steps: `Box::pin(x)`
+    /// does the same as <code>[Box::into_pin]\([Box::new]\(x))</code>. Consider using
+    /// [`into_pin`](Box::into_pin) if you already have a `Box<T>`, or if you want to
+    /// construct a (pinned) `Box` in a different way than with [`Box::new`].
     #[cfg(not(no_global_oom_handling))]
     #[stable(feature = "pin", since = "1.33.0")]
     #[must_use]
     #[inline(always)]
     pub fn pin(x: T) -> Pin<Box<T>> {
-        (box x).into()
+        (#[rustc_box]
+        Box::new(x))
+        .into()
     }
 
     /// Allocates memory on the heap then places `x` into it,
@@ -543,8 +569,13 @@ impl<T, A: Allocator> Box<T, A> {
         unsafe { Ok(Box::from_raw_in(ptr.as_ptr(), alloc)) }
     }
 
-    /// Constructs a new `Pin<Box<T, A>>`. If `T` does not implement `Unpin`, then
+    /// Constructs a new `Pin<Box<T, A>>`. If `T` does not implement [`Unpin`], then
     /// `x` will be pinned in memory and unable to be moved.
+    ///
+    /// Constructing and pinning of the `Box` can also be done in two steps: `Box::pin_in(x, alloc)`
+    /// does the same as <code>[Box::into_pin]\([Box::new_in]\(x, alloc))</code>. Consider using
+    /// [`into_pin`](Box::into_pin) if you already have a `Box<T, A>`, or if you want to
+    /// construct a (pinned) `Box` in a different way than with [`Box::new_in`].
     #[cfg(not(no_global_oom_handling))]
     #[unstable(feature = "allocator_api", issue = "32838")]
     #[rustc_const_unstable(feature = "const_box", issue = "92521")]
@@ -926,6 +957,7 @@ impl<T: ?Sized> Box<T> {
     /// [`Layout`]: crate::Layout
     #[stable(feature = "box_raw", since = "1.4.0")]
     #[inline]
+    #[must_use = "call `drop(Box::from_raw(ptr))` if you intend to drop the `Box`"]
     pub unsafe fn from_raw(raw: *mut T) -> Self {
         unsafe { Self::from_raw_in(raw, Global) }
     }
@@ -1160,19 +1192,44 @@ impl<T: ?Sized, A: Allocator> Box<T, A> {
         unsafe { &mut *mem::ManuallyDrop::new(b).0.as_ptr() }
     }
 
-    /// Converts a `Box<T>` into a `Pin<Box<T>>`
+    /// Converts a `Box<T>` into a `Pin<Box<T>>`. If `T` does not implement [`Unpin`], then
+    /// `*boxed` will be pinned in memory and unable to be moved.
     ///
     /// This conversion does not allocate on the heap and happens in place.
     ///
     /// This is also available via [`From`].
-    #[unstable(feature = "box_into_pin", issue = "62370")]
+    ///
+    /// Constructing and pinning a `Box` with <code>Box::into_pin([Box::new]\(x))</code>
+    /// can also be written more concisely using <code>[Box::pin]\(x)</code>.
+    /// This `into_pin` method is useful if you already have a `Box<T>`, or you are
+    /// constructing a (pinned) `Box` in a different way than with [`Box::new`].
+    ///
+    /// # Notes
+    ///
+    /// It's not recommended that crates add an impl like `From<Box<T>> for Pin<T>`,
+    /// as it'll introduce an ambiguity when calling `Pin::from`.
+    /// A demonstration of such a poor impl is shown below.
+    ///
+    /// ```compile_fail
+    /// # use std::pin::Pin;
+    /// struct Foo; // A type defined in this crate.
+    /// impl From<Box<()>> for Pin<Foo> {
+    ///     fn from(_: Box<()>) -> Pin<Foo> {
+    ///         Pin::new(Foo)
+    ///     }
+    /// }
+    ///
+    /// let foo = Box::new(());
+    /// let bar = Pin::from(foo);
+    /// ```
+    #[stable(feature = "box_into_pin", since = "1.63.0")]
     #[rustc_const_unstable(feature = "const_box", issue = "92521")]
     pub const fn into_pin(boxed: Self) -> Pin<Self>
     where
         A: 'static,
     {
         // It's not possible to move or replace the insides of a `Pin<Box<T>>`
-        // when `T: !Unpin`,  so it's safe to pin it directly without any
+        // when `T: !Unpin`, so it's safe to pin it directly without any
         // additional requirements.
         unsafe { Pin::new_unchecked(boxed) }
     }
@@ -1190,7 +1247,8 @@ unsafe impl<#[may_dangle] T: ?Sized, A: Allocator> Drop for Box<T, A> {
 impl<T: Default> Default for Box<T> {
     /// Creates a `Box<T>`, with the `Default` value for T.
     fn default() -> Self {
-        box T::default()
+        #[rustc_box]
+        Box::new(T::default())
     }
 }
 
@@ -1408,9 +1466,17 @@ impl<T: ?Sized, A: Allocator> const From<Box<T, A>> for Pin<Box<T, A>>
 where
     A: 'static,
 {
-    /// Converts a `Box<T>` into a `Pin<Box<T>>`
+    /// Converts a `Box<T>` into a `Pin<Box<T>>`. If `T` does not implement [`Unpin`], then
+    /// `*boxed` will be pinned in memory and unable to be moved.
     ///
     /// This conversion does not allocate on the heap and happens in place.
+    ///
+    /// This is also available via [`Box::into_pin`].
+    ///
+    /// Constructing and pinning a `Box` with <code><Pin<Box\<T>>>::from([Box::new]\(x))</code>
+    /// can also be written more concisely using <code>[Box::pin]\(x)</code>.
+    /// This `From` implementation is useful if you already have a `Box<T>`, or you are
+    /// constructing a (pinned) `Box` in a different way than with [`Box::new`].
     fn from(boxed: Box<T, A>) -> Self {
         Box::into_pin(boxed)
     }
@@ -1422,7 +1488,7 @@ impl<T: Copy> From<&[T]> for Box<[T]> {
     /// Converts a `&[T]` into a `Box<[T]>`
     ///
     /// This conversion allocates on the heap
-    /// and performs a copy of `slice`.
+    /// and performs a copy of `slice` and its contents.
     ///
     /// # Examples
     /// ```rust
@@ -1554,10 +1620,27 @@ impl<T, const N: usize> From<[T; N]> for Box<[T]> {
     /// println!("{boxed:?}");
     /// ```
     fn from(array: [T; N]) -> Box<[T]> {
-        box array
+        #[rustc_box]
+        Box::new(array)
     }
 }
 
+/// Casts a boxed slice to a boxed array.
+///
+/// # Safety
+///
+/// `boxed_slice.len()` must be exactly `N`.
+unsafe fn boxed_slice_as_array_unchecked<T, A: Allocator, const N: usize>(
+    boxed_slice: Box<[T], A>,
+) -> Box<[T; N], A> {
+    debug_assert_eq!(boxed_slice.len(), N);
+
+    let (ptr, alloc) = Box::into_raw_with_allocator(boxed_slice);
+    // SAFETY: Pointer and allocator came from an existing box,
+    // and our safety condition requires that the length is exactly `N`
+    unsafe { Box::from_raw_in(ptr as *mut [T; N], alloc) }
+}
+
 #[stable(feature = "boxed_slice_try_from", since = "1.43.0")]
 impl<T, const N: usize> TryFrom<Box<[T]>> for Box<[T; N]> {
     type Error = Box<[T]>;
@@ -1573,13 +1656,46 @@ impl<T, const N: usize> TryFrom<Box<[T]>> for Box<[T; N]> {
     /// `boxed_slice.len()` does not equal `N`.
     fn try_from(boxed_slice: Box<[T]>) -> Result<Self, Self::Error> {
         if boxed_slice.len() == N {
-            Ok(unsafe { Box::from_raw(Box::into_raw(boxed_slice) as *mut [T; N]) })
+            Ok(unsafe { boxed_slice_as_array_unchecked(boxed_slice) })
         } else {
             Err(boxed_slice)
         }
     }
 }
 
+#[cfg(not(no_global_oom_handling))]
+#[stable(feature = "boxed_array_try_from_vec", since = "1.66.0")]
+impl<T, const N: usize> TryFrom<Vec<T>> for Box<[T; N]> {
+    type Error = Vec<T>;
+
+    /// Attempts to convert a `Vec<T>` into a `Box<[T; N]>`.
+    ///
+    /// Like [`Vec::into_boxed_slice`], this is in-place if `vec.capacity() == N`,
+    /// but will require a reallocation otherwise.
+    ///
+    /// # Errors
+    ///
+    /// Returns the original `Vec<T>` in the `Err` variant if
+    /// `boxed_slice.len()` does not equal `N`.
+    ///
+    /// # Examples
+    ///
+    /// This can be used with [`vec!`] to create an array on the heap:
+    ///
+    /// ```
+    /// let state: Box<[f32; 100]> = vec![1.0; 100].try_into().unwrap();
+    /// assert_eq!(state.len(), 100);
+    /// ```
+    fn try_from(vec: Vec<T>) -> Result<Self, Self::Error> {
+        if vec.len() == N {
+            let boxed_slice = vec.into_boxed_slice();
+            Ok(unsafe { boxed_slice_as_array_unchecked(boxed_slice) })
+        } else {
+            Err(vec)
+        }
+    }
+}
+
 impl<A: Allocator> Box<dyn Any, A> {
     /// Attempt to downcast the box to a concrete type.
     ///
@@ -1869,7 +1985,7 @@ impl<I: ExactSizeIterator + ?Sized, A: Allocator> ExactSizeIterator for Box<I, A
 impl<I: FusedIterator + ?Sized, A: Allocator> FusedIterator for Box<I, A> {}
 
 #[stable(feature = "boxed_closure_impls", since = "1.35.0")]
-impl<Args, F: FnOnce<Args> + ?Sized, A: Allocator> FnOnce<Args> for Box<F, A> {
+impl<Args: Tuple, F: FnOnce<Args> + ?Sized, A: Allocator> FnOnce<Args> for Box<F, A> {
     type Output = <F as FnOnce<Args>>::Output;
 
     extern "rust-call" fn call_once(self, args: Args) -> Self::Output {
@@ -1878,20 +1994,20 @@ impl<Args, F: FnOnce<Args> + ?Sized, A: Allocator> FnOnce<Args> for Box<F, A> {
 }
 
 #[stable(feature = "boxed_closure_impls", since = "1.35.0")]
-impl<Args, F: FnMut<Args> + ?Sized, A: Allocator> FnMut<Args> for Box<F, A> {
+impl<Args: Tuple, F: FnMut<Args> + ?Sized, A: Allocator> FnMut<Args> for Box<F, A> {
     extern "rust-call" fn call_mut(&mut self, args: Args) -> Self::Output {
         <F as FnMut<Args>>::call_mut(self, args)
     }
 }
 
 #[stable(feature = "boxed_closure_impls", since = "1.35.0")]
-impl<Args, F: Fn<Args> + ?Sized, A: Allocator> Fn<Args> for Box<F, A> {
+impl<Args: Tuple, F: Fn<Args> + ?Sized, A: Allocator> Fn<Args> for Box<F, A> {
     extern "rust-call" fn call(&self, args: Args) -> Self::Output {
         <F as Fn<Args>>::call(self, args)
     }
 }
 
-#[unstable(feature = "coerce_unsized", issue = "27732")]
+#[unstable(feature = "coerce_unsized", issue = "18598")]
 impl<T: ?Sized + Unsize<U>, U: ?Sized, A: Allocator> CoerceUnsized<Box<U, A>> for Box<T, A> {}
 
 #[unstable(feature = "dispatch_from_dyn", issue = "none")]
@@ -1973,8 +2089,7 @@ impl<T: ?Sized, A: Allocator> AsMut<T> for Box<T, A> {
  *  could have a method to project a Pin<T> from it.
  */
 #[stable(feature = "pin", since = "1.33.0")]
-#[rustc_const_unstable(feature = "const_box", issue = "92521")]
-impl<T: ?Sized, A: Allocator> const Unpin for Box<T, A> where A: 'static {}
+impl<T: ?Sized, A: Allocator> Unpin for Box<T, A> where A: 'static {}
 
 #[unstable(feature = "generator_trait", issue = "43122")]
 impl<G: ?Sized + Generator<R> + Unpin, R, A: Allocator> Generator<R> for Box<G, A>
@@ -2026,3 +2141,292 @@ impl<S: ?Sized + AsyncIterator + Unpin> AsyncIterator for Box<S> {
         (**self).size_hint()
     }
 }
+
+impl dyn Error {
+    #[inline]
+    #[stable(feature = "error_downcast", since = "1.3.0")]
+    #[rustc_allow_incoherent_impl]
+    /// Attempts to downcast the box to a concrete type.
+    pub fn downcast<T: Error + 'static>(self: Box<Self>) -> Result<Box<T>, Box<dyn Error>> {
+        if self.is::<T>() {
+            unsafe {
+                let raw: *mut dyn Error = Box::into_raw(self);
+                Ok(Box::from_raw(raw as *mut T))
+            }
+        } else {
+            Err(self)
+        }
+    }
+}
+
+impl dyn Error + Send {
+    #[inline]
+    #[stable(feature = "error_downcast", since = "1.3.0")]
+    #[rustc_allow_incoherent_impl]
+    /// Attempts to downcast the box to a concrete type.
+    pub fn downcast<T: Error + 'static>(self: Box<Self>) -> Result<Box<T>, Box<dyn Error + Send>> {
+        let err: Box<dyn Error> = self;
+        <dyn Error>::downcast(err).map_err(|s| unsafe {
+            // Reapply the `Send` marker.
+            mem::transmute::<Box<dyn Error>, Box<dyn Error + Send>>(s)
+        })
+    }
+}
+
+impl dyn Error + Send + Sync {
+    #[inline]
+    #[stable(feature = "error_downcast", since = "1.3.0")]
+    #[rustc_allow_incoherent_impl]
+    /// Attempts to downcast the box to a concrete type.
+    pub fn downcast<T: Error + 'static>(self: Box<Self>) -> Result<Box<T>, Box<Self>> {
+        let err: Box<dyn Error> = self;
+        <dyn Error>::downcast(err).map_err(|s| unsafe {
+            // Reapply the `Send + Sync` marker.
+            mem::transmute::<Box<dyn Error>, Box<dyn Error + Send + Sync>>(s)
+        })
+    }
+}
+
+#[cfg(not(no_global_oom_handling))]
+#[stable(feature = "rust1", since = "1.0.0")]
+impl<'a, E: Error + 'a> From<E> for Box<dyn Error + 'a> {
+    /// Converts a type of [`Error`] into a box of dyn [`Error`].
+    ///
+    /// # Examples
+    ///
+    /// ```
+    /// use std::error::Error;
+    /// use std::fmt;
+    /// use std::mem;
+    ///
+    /// #[derive(Debug)]
+    /// struct AnError;
+    ///
+    /// impl fmt::Display for AnError {
+    ///     fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+    ///         write!(f, "An error")
+    ///     }
+    /// }
+    ///
+    /// impl Error for AnError {}
+    ///
+    /// let an_error = AnError;
+    /// assert!(0 == mem::size_of_val(&an_error));
+    /// let a_boxed_error = Box::<dyn Error>::from(an_error);
+    /// assert!(mem::size_of::<Box<dyn Error>>() == mem::size_of_val(&a_boxed_error))
+    /// ```
+    fn from(err: E) -> Box<dyn Error + 'a> {
+        Box::new(err)
+    }
+}
+
+#[cfg(not(no_global_oom_handling))]
+#[stable(feature = "rust1", since = "1.0.0")]
+impl<'a, E: Error + Send + Sync + 'a> From<E> for Box<dyn Error + Send + Sync + 'a> {
+    /// Converts a type of [`Error`] + [`Send`] + [`Sync`] into a box of
+    /// dyn [`Error`] + [`Send`] + [`Sync`].
+    ///
+    /// # Examples
+    ///
+    /// ```
+    /// use std::error::Error;
+    /// use std::fmt;
+    /// use std::mem;
+    ///
+    /// #[derive(Debug)]
+    /// struct AnError;
+    ///
+    /// impl fmt::Display for AnError {
+    ///     fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+    ///         write!(f, "An error")
+    ///     }
+    /// }
+    ///
+    /// impl Error for AnError {}
+    ///
+    /// unsafe impl Send for AnError {}
+    ///
+    /// unsafe impl Sync for AnError {}
+    ///
+    /// let an_error = AnError;
+    /// assert!(0 == mem::size_of_val(&an_error));
+    /// let a_boxed_error = Box::<dyn Error + Send + Sync>::from(an_error);
+    /// assert!(
+    ///     mem::size_of::<Box<dyn Error + Send + Sync>>() == mem::size_of_val(&a_boxed_error))
+    /// ```
+    fn from(err: E) -> Box<dyn Error + Send + Sync + 'a> {
+        Box::new(err)
+    }
+}
+
+#[cfg(not(no_global_oom_handling))]
+#[stable(feature = "rust1", since = "1.0.0")]
+impl From<String> for Box<dyn Error + Send + Sync> {
+    /// Converts a [`String`] into a box of dyn [`Error`] + [`Send`] + [`Sync`].
+    ///
+    /// # Examples
+    ///
+    /// ```
+    /// use std::error::Error;
+    /// use std::mem;
+    ///
+    /// let a_string_error = "a string error".to_string();
+    /// let a_boxed_error = Box::<dyn Error + Send + Sync>::from(a_string_error);
+    /// assert!(
+    ///     mem::size_of::<Box<dyn Error + Send + Sync>>() == mem::size_of_val(&a_boxed_error))
+    /// ```
+    #[inline]
+    fn from(err: String) -> Box<dyn Error + Send + Sync> {
+        struct StringError(String);
+
+        impl Error for StringError {
+            #[allow(deprecated)]
+            fn description(&self) -> &str {
+                &self.0
+            }
+        }
+
+        impl fmt::Display for StringError {
+            fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+                fmt::Display::fmt(&self.0, f)
+            }
+        }
+
+        // Purposefully skip printing "StringError(..)"
+        impl fmt::Debug for StringError {
+            fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+                fmt::Debug::fmt(&self.0, f)
+            }
+        }
+
+        Box::new(StringError(err))
+    }
+}
+
+#[cfg(not(no_global_oom_handling))]
+#[stable(feature = "string_box_error", since = "1.6.0")]
+impl From<String> for Box<dyn Error> {
+    /// Converts a [`String`] into a box of dyn [`Error`].
+    ///
+    /// # Examples
+    ///
+    /// ```
+    /// use std::error::Error;
+    /// use std::mem;
+    ///
+    /// let a_string_error = "a string error".to_string();
+    /// let a_boxed_error = Box::<dyn Error>::from(a_string_error);
+    /// assert!(mem::size_of::<Box<dyn Error>>() == mem::size_of_val(&a_boxed_error))
+    /// ```
+    fn from(str_err: String) -> Box<dyn Error> {
+        let err1: Box<dyn Error + Send + Sync> = From::from(str_err);
+        let err2: Box<dyn Error> = err1;
+        err2
+    }
+}
+
+#[cfg(not(no_global_oom_handling))]
+#[stable(feature = "rust1", since = "1.0.0")]
+impl<'a> From<&str> for Box<dyn Error + Send + Sync + 'a> {
+    /// Converts a [`str`] into a box of dyn [`Error`] + [`Send`] + [`Sync`].
+    ///
+    /// [`str`]: prim@str
+    ///
+    /// # Examples
+    ///
+    /// ```
+    /// use std::error::Error;
+    /// use std::mem;
+    ///
+    /// let a_str_error = "a str error";
+    /// let a_boxed_error = Box::<dyn Error + Send + Sync>::from(a_str_error);
+    /// assert!(
+    ///     mem::size_of::<Box<dyn Error + Send + Sync>>() == mem::size_of_val(&a_boxed_error))
+    /// ```
+    #[inline]
+    fn from(err: &str) -> Box<dyn Error + Send + Sync + 'a> {
+        From::from(String::from(err))
+    }
+}
+
+#[cfg(not(no_global_oom_handling))]
+#[stable(feature = "string_box_error", since = "1.6.0")]
+impl From<&str> for Box<dyn Error> {
+    /// Converts a [`str`] into a box of dyn [`Error`].
+    ///
+    /// [`str`]: prim@str
+    ///
+    /// # Examples
+    ///
+    /// ```
+    /// use std::error::Error;
+    /// use std::mem;
+    ///
+    /// let a_str_error = "a str error";
+    /// let a_boxed_error = Box::<dyn Error>::from(a_str_error);
+    /// assert!(mem::size_of::<Box<dyn Error>>() == mem::size_of_val(&a_boxed_error))
+    /// ```
+    fn from(err: &str) -> Box<dyn Error> {
+        From::from(String::from(err))
+    }
+}
+
+#[cfg(not(no_global_oom_handling))]
+#[stable(feature = "cow_box_error", since = "1.22.0")]
+impl<'a, 'b> From<Cow<'b, str>> for Box<dyn Error + Send + Sync + 'a> {
+    /// Converts a [`Cow`] into a box of dyn [`Error`] + [`Send`] + [`Sync`].
+    ///
+    /// # Examples
+    ///
+    /// ```
+    /// use std::error::Error;
+    /// use std::mem;
+    /// use std::borrow::Cow;
+    ///
+    /// let a_cow_str_error = Cow::from("a str error");
+    /// let a_boxed_error = Box::<dyn Error + Send + Sync>::from(a_cow_str_error);
+    /// assert!(
+    ///     mem::size_of::<Box<dyn Error + Send + Sync>>() == mem::size_of_val(&a_boxed_error))
+    /// ```
+    fn from(err: Cow<'b, str>) -> Box<dyn Error + Send + Sync + 'a> {
+        From::from(String::from(err))
+    }
+}
+
+#[cfg(not(no_global_oom_handling))]
+#[stable(feature = "cow_box_error", since = "1.22.0")]
+impl<'a> From<Cow<'a, str>> for Box<dyn Error> {
+    /// Converts a [`Cow`] into a box of dyn [`Error`].
+    ///
+    /// # Examples
+    ///
+    /// ```
+    /// use std::error::Error;
+    /// use std::mem;
+    /// use std::borrow::Cow;
+    ///
+    /// let a_cow_str_error = Cow::from("a str error");
+    /// let a_boxed_error = Box::<dyn Error>::from(a_cow_str_error);
+    /// assert!(mem::size_of::<Box<dyn Error>>() == mem::size_of_val(&a_boxed_error))
+    /// ```
+    fn from(err: Cow<'a, str>) -> Box<dyn Error> {
+        From::from(String::from(err))
+    }
+}
+
+#[stable(feature = "box_error", since = "1.8.0")]
+impl<T: core::error::Error> core::error::Error for Box<T> {
+    #[allow(deprecated, deprecated_in_future)]
+    fn description(&self) -> &str {
+        core::error::Error::description(&**self)
+    }
+
+    #[allow(deprecated)]
+    fn cause(&self) -> Option<&dyn core::error::Error> {
+        core::error::Error::cause(&**self)
+    }
+
+    fn source(&self) -> Option<&(dyn core::error::Error + 'static)> {
+        core::error::Error::source(&**self)
+    }
+}
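
Taken together, the user-visible additions above (the stabilized `Box::into_pin`,
the `Vec<T>` to `Box<[T; N]>` conversion, and the `dyn Error` downcast helpers)
can be exercised from ordinary std code. An illustrative sketch, independent of
the kernel build:

    use std::error::Error;
    use std::fmt;
    use std::pin::Pin;

    #[derive(Debug)]
    struct AnError;

    impl fmt::Display for AnError {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            write!(f, "an error")
        }
    }

    impl Error for AnError {}

    fn main() {
        // Vec<T> -> Box<[T; N]>: in place when the capacity already equals N.
        let arr: Box<[u8; 4]> = vec![1, 2, 3, 4].try_into().unwrap();
        assert_eq!(arr.len(), 4);

        // Box::into_pin is the explicit form of Box::pin.
        let pinned: Pin<Box<u32>> = Box::into_pin(Box::new(7));
        assert_eq!(*pinned, 7);

        // Round-trip a concrete error through Box<dyn Error> and downcast it back.
        let boxed: Box<dyn Error> = Box::new(AnError);
        let concrete: Box<AnError> = boxed.downcast::<AnError>().unwrap();
        assert_eq!(concrete.to_string(), "an error");
    }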
index 1eec265..2506065 100644
@@ -141,7 +141,7 @@ impl Display for TryReserveError {
                 " because the computed capacity exceeded the collection's maximum"
             }
             TryReserveErrorKind::AllocError { .. } => {
-                " because the memory allocator returned a error"
+                " because the memory allocator returned an error"
             }
         };
         fmt.write_str(reason)
@@ -154,3 +154,6 @@ trait SpecExtend<I: IntoIterator> {
     /// Extends `self` with the contents of the given iterator.
     fn spec_extend(&mut self, iter: I);
 }
+
+#[stable(feature = "try_reserve", since = "1.57.0")]
+impl core::error::Error for TryReserveError {}
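
With the `Error` impl above, a fallible-allocation failure can be propagated
through `Box<dyn Error>` like any other error. A small illustrative sketch in
plain std Rust:

    use std::error::Error;

    // try_reserve surfaces allocation failure as a value instead of aborting;
    // the new impl lets `?` convert that value into a boxed dyn Error.
    fn grow(buf: &mut Vec<u8>, additional: usize) -> Result<(), Box<dyn Error>> {
        buf.try_reserve(additional)?;
        Ok(())
    }

    fn main() {
        let mut buf = Vec::new();
        grow(&mut buf, 4096).expect("reservation failed");
        assert!(buf.capacity() >= 4096);
    }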
index 3aebf83..5f37437 100644
@@ -5,7 +5,7 @@
 //! This library provides smart pointers and collections for managing
 //! heap-allocated values.
 //!
-//! This library, like libcore, normally doesn’t need to be used directly
+//! This library, like core, normally doesn’t need to be used directly
 //! since its contents are re-exported in the [`std` crate](../std/index.html).
 //! Crates that use the `#![no_std]` attribute however will typically
 //! not depend on `std`, so they’d use this crate instead.
 //! [`Rc`]: rc
 //! [`RefCell`]: core::cell
 
-// To run liballoc tests without x.py without ending up with two copies of liballoc, Miri needs to be
-// able to "empty" this crate. See <https://github.com/rust-lang/miri-test-libstd/issues/4>.
-// rustc itself never sets the feature, so this line has no affect there.
-#![cfg(any(not(feature = "miri-test-libstd"), test, doctest))]
 #![allow(unused_attributes)]
 #![stable(feature = "alloc", since = "1.36.0")]
 #![doc(
     any(not(feature = "miri-test-libstd"), test, doctest),
     no_global_oom_handling,
     not(no_global_oom_handling),
+    not(no_rc),
+    not(no_sync),
     target_has_atomic = "ptr"
 ))]
 #![no_std]
 #![needs_allocator]
+// To run alloc tests without x.py without ending up with two copies of alloc, Miri needs to be
+// able to "empty" this crate. See <https://github.com/rust-lang/miri-test-libstd/issues/4>.
+// rustc itself never sets the feature, so this line has no affect there.
+#![cfg(any(not(feature = "miri-test-libstd"), test, doctest))]
 //
 // Lints:
 #![deny(unsafe_op_in_unsafe_fn)]
+#![deny(fuzzy_provenance_casts)]
 #![warn(deprecated_in_future)]
 #![warn(missing_debug_implementations)]
 #![warn(missing_docs)]
 #![allow(explicit_outlives_requirements)]
 //
 // Library features:
-#![cfg_attr(not(no_global_oom_handling), feature(alloc_c_string))]
 #![feature(alloc_layout_extra)]
 #![feature(allocator_api)]
 #![feature(array_chunks)]
+#![feature(array_into_iter_constructors)]
 #![feature(array_methods)]
 #![feature(array_windows)]
 #![feature(assert_matches)]
 #![feature(coerce_unsized)]
 #![cfg_attr(not(no_global_oom_handling), feature(const_alloc_error))]
 #![feature(const_box)]
-#![cfg_attr(not(no_global_oom_handling), feature(const_btree_new))]
+#![cfg_attr(not(no_global_oom_handling), feature(const_btree_len))]
 #![cfg_attr(not(no_borrow), feature(const_cow_is_borrowed))]
 #![feature(const_convert)]
 #![feature(const_size_of_val)]
 #![feature(const_align_of_val)]
 #![feature(const_ptr_read)]
+#![feature(const_maybe_uninit_zeroed)]
 #![feature(const_maybe_uninit_write)]
 #![feature(const_maybe_uninit_as_mut_ptr)]
 #![feature(const_refs_to_cell)]
-#![feature(core_c_str)]
 #![feature(core_intrinsics)]
-#![feature(core_ffi_c)]
+#![feature(core_panic)]
 #![feature(const_eval_select)]
 #![feature(const_pin)]
+#![feature(const_waker)]
 #![feature(cstr_from_bytes_until_nul)]
 #![feature(dispatch_from_dyn)]
+#![feature(error_generic_member_access)]
+#![feature(error_in_core)]
 #![feature(exact_size_is_empty)]
 #![feature(extend_one)]
 #![feature(fmt_internals)]
 #![feature(fn_traits)]
 #![feature(hasher_prefixfree_extras)]
+#![feature(inline_const)]
 #![feature(inplace_iteration)]
+#![cfg_attr(test, feature(is_sorted))]
 #![feature(iter_advance_by)]
+#![feature(iter_next_chunk)]
+#![feature(iter_repeat_n)]
 #![feature(layout_for_ptr)]
 #![feature(maybe_uninit_slice)]
+#![feature(maybe_uninit_uninit_array)]
+#![feature(maybe_uninit_uninit_array_transpose)]
 #![cfg_attr(test, feature(new_uninit))]
 #![feature(nonnull_slice_from_raw_parts)]
 #![feature(pattern)]
+#![feature(pointer_byte_offsets)]
+#![feature(provide_any)]
 #![feature(ptr_internals)]
 #![feature(ptr_metadata)]
 #![feature(ptr_sub_ptr)]
 #![feature(receiver_trait)]
+#![feature(saturating_int_impl)]
 #![feature(set_ptr_value)]
+#![feature(sized_type_properties)]
+#![feature(slice_from_ptr_range)]
 #![feature(slice_group_by)]
 #![feature(slice_ptr_get)]
 #![feature(slice_ptr_len)]
 #![feature(trusted_len)]
 #![feature(trusted_random_access)]
 #![feature(try_trait_v2)]
+#![feature(tuple_trait)]
 #![feature(unchecked_math)]
 #![feature(unicode_internals)]
 #![feature(unsize)]
+#![feature(utf8_chunks)]
+#![feature(std_internals)]
 //
 // Language features:
 #![feature(allocator_internals)]
 #![feature(allow_internal_unstable)]
 #![feature(associated_type_bounds)]
-#![feature(box_syntax)]
 #![feature(cfg_sanitize)]
 #![feature(const_deref)]
 #![feature(const_mut_refs)]
 #![cfg_attr(not(test), feature(generator_trait))]
 #![feature(hashmap_internals)]
 #![feature(lang_items)]
-#![feature(let_else)]
 #![feature(min_specialization)]
 #![feature(negative_impls)]
 #![feature(never_type)]
-#![feature(nll)] // Not necessary, but here to test the `nll` feature.
 #![feature(rustc_allow_const_fn_unstable)]
 #![feature(rustc_attrs)]
+#![feature(pointer_is_aligned)]
 #![feature(slice_internals)]
 #![feature(staged_api)]
+#![feature(stmt_expr_attributes)]
 #![cfg_attr(test, feature(test))]
 #![feature(unboxed_closures)]
 #![feature(unsized_fn_params)]
 #![feature(c_unwind)]
+#![feature(with_negative_coherence)]
+#![cfg_attr(test, feature(panic_update_hook))]
 //
 // Rustdoc features:
 #![feature(doc_cfg)]
 extern crate std;
 #[cfg(test)]
 extern crate test;
+#[cfg(test)]
+mod testing;
 
 // Module with internal macros used by other modules (needs to be included before other modules).
 #[cfg(not(no_macros))]
@@ -218,7 +241,7 @@ mod boxed {
 #[cfg(not(no_borrow))]
 pub mod borrow;
 pub mod collections;
-#[cfg(not(no_global_oom_handling))]
+#[cfg(all(not(no_rc), not(no_sync), not(no_global_oom_handling)))]
 pub mod ffi;
 #[cfg(not(no_fmt))]
 pub mod fmt;
@@ -229,10 +252,9 @@ pub mod slice;
 pub mod str;
 #[cfg(not(no_string))]
 pub mod string;
-#[cfg(not(no_sync))]
-#[cfg(target_has_atomic = "ptr")]
+#[cfg(all(not(no_rc), not(no_sync), target_has_atomic = "ptr"))]
 pub mod sync;
-#[cfg(all(not(no_global_oom_handling), target_has_atomic = "ptr"))]
+#[cfg(all(not(no_global_oom_handling), not(no_rc), not(no_sync), target_has_atomic = "ptr"))]
 pub mod task;
 #[cfg(test)]
 mod tests;
@@ -243,3 +265,20 @@ pub mod vec;
 pub mod __export {
     pub use core::format_args;
 }
+
+#[cfg(test)]
+#[allow(dead_code)] // Not used in all configurations
+pub(crate) mod test_helpers {
+    /// Copied from `std::test_helpers::test_rng`, since these tests rely on the
+    /// seed not being the same for every RNG invocation too.
+    pub(crate) fn test_rng() -> rand_xorshift::XorShiftRng {
+        use std::hash::{BuildHasher, Hash, Hasher};
+        let mut hasher = std::collections::hash_map::RandomState::new().build_hasher();
+        std::panic::Location::caller().hash(&mut hasher);
+        let hc64 = hasher.finish();
+        let seed_vec =
+            hc64.to_le_bytes().into_iter().chain(0u8..8).collect::<crate::vec::Vec<u8>>();
+        let seed: [u8; 16] = seed_vec.as_slice().try_into().unwrap();
+        rand::SeedableRng::from_seed(seed)
+    }
+}
index eb77db5..5db87ea 100644
@@ -5,7 +5,7 @@
 use core::alloc::LayoutError;
 use core::cmp;
 use core::intrinsics;
-use core::mem::{self, ManuallyDrop, MaybeUninit};
+use core::mem::{self, ManuallyDrop, MaybeUninit, SizedTypeProperties};
 use core::ops::Drop;
 use core::ptr::{self, NonNull, Unique};
 use core::slice;
@@ -177,7 +177,7 @@ impl<T, A: Allocator> RawVec<T, A> {
     #[cfg(not(no_global_oom_handling))]
     fn allocate_in(capacity: usize, init: AllocInit, alloc: A) -> Self {
         // Don't allocate here because `Drop` will not deallocate when `capacity` is 0.
-        if mem::size_of::<T>() == 0 || capacity == 0 {
+        if T::IS_ZST || capacity == 0 {
             Self::new_in(alloc)
         } else {
             // We avoid `unwrap_or_else` here because it bloats the amount of
@@ -212,7 +212,7 @@ impl<T, A: Allocator> RawVec<T, A> {
 
     fn try_allocate_in(capacity: usize, init: AllocInit, alloc: A) -> Result<Self, TryReserveError> {
         // Don't allocate here because `Drop` will not deallocate when `capacity` is 0.
-        if mem::size_of::<T>() == 0 || capacity == 0 {
+        if T::IS_ZST || capacity == 0 {
             return Ok(Self::new_in(alloc));
         }
 
@@ -262,7 +262,7 @@ impl<T, A: Allocator> RawVec<T, A> {
     /// This will always be `usize::MAX` if `T` is zero-sized.
     #[inline(always)]
     pub fn capacity(&self) -> usize {
-        if mem::size_of::<T>() == 0 { usize::MAX } else { self.cap }
+        if T::IS_ZST { usize::MAX } else { self.cap }
     }
 
     /// Returns a shared reference to the allocator backing this `RawVec`.
@@ -271,7 +271,7 @@ impl<T, A: Allocator> RawVec<T, A> {
     }
 
     fn current_memory(&self) -> Option<(NonNull<u8>, Layout)> {
-        if mem::size_of::<T>() == 0 || self.cap == 0 {
+        if T::IS_ZST || self.cap == 0 {
             None
         } else {
             // We have an allocated chunk of memory, so we can bypass runtime
@@ -419,7 +419,7 @@ impl<T, A: Allocator> RawVec<T, A> {
         // This is ensured by the calling contexts.
         debug_assert!(additional > 0);
 
-        if mem::size_of::<T>() == 0 {
+        if T::IS_ZST {
             // Since we return a capacity of `usize::MAX` when `elem_size` is
             // 0, getting to here necessarily means the `RawVec` is overfull.
             return Err(CapacityOverflow.into());
@@ -445,7 +445,7 @@ impl<T, A: Allocator> RawVec<T, A> {
     // `grow_amortized`, but this method is usually instantiated less often so
     // it's less critical.
     fn grow_exact(&mut self, len: usize, additional: usize) -> Result<(), TryReserveError> {
-        if mem::size_of::<T>() == 0 {
+        if T::IS_ZST {
             // Since we return a capacity of `usize::MAX` when the type size is
             // 0, getting to here necessarily means the `RawVec` is overfull.
             return Err(CapacityOverflow.into());
@@ -460,7 +460,7 @@ impl<T, A: Allocator> RawVec<T, A> {
         Ok(())
     }
 
-    #[allow(dead_code)]
+    #[cfg(not(no_global_oom_handling))]
     fn shrink(&mut self, cap: usize) -> Result<(), TryReserveError> {
         assert!(cap <= self.capacity(), "Tried to shrink to a larger capacity");
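
The `T::IS_ZST` shorthand used above guards the same zero-sized-type special case
that is visible from safe code: a vector of ZSTs never allocates and reports the
sentinel capacity `usize::MAX`. A tiny illustrative check, plain std rather than
kernel code:

    fn main() {
        // Zero-sized element type: RawVec never allocates, capacity is usize::MAX.
        let v: Vec<()> = Vec::new();
        assert_eq!(v.capacity(), usize::MAX);

        let mut w: Vec<()> = Vec::with_capacity(16);
        w.push(());
        assert_eq!(w.capacity(), usize::MAX);
    }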
 
index e444e97..245e015 100644
@@ -1,84 +1,14 @@
 // SPDX-License-Identifier: Apache-2.0 OR MIT
 
-//! A dynamically-sized view into a contiguous sequence, `[T]`.
+//! Utilities for the slice primitive type.
 //!
 //! *[See also the slice primitive type](slice).*
 //!
-//! Slices are a view into a block of memory represented as a pointer and a
-//! length.
+//! Most of the structs in this module are iterator types which can only be created
+//! using a certain function. For example, `slice.iter()` yields an [`Iter`].
 //!
-//! ```
-//! // slicing a Vec
-//! let vec = vec![1, 2, 3];
-//! let int_slice = &vec[..];
-//! // coercing an array to a slice
-//! let str_slice: &[&str] = &["one", "two", "three"];
-//! ```
-//!
-//! Slices are either mutable or shared. The shared slice type is `&[T]`,
-//! while the mutable slice type is `&mut [T]`, where `T` represents the element
-//! type. For example, you can mutate the block of memory that a mutable slice
-//! points to:
-//!
-//! ```
-//! let x = &mut [1, 2, 3];
-//! x[1] = 7;
-//! assert_eq!(x, &[1, 7, 3]);
-//! ```
-//!
-//! Here are some of the things this module contains:
-//!
-//! ## Structs
-//!
-//! There are several structs that are useful for slices, such as [`Iter`], which
-//! represents iteration over a slice.
-//!
-//! ## Trait Implementations
-//!
-//! There are several implementations of common traits for slices. Some examples
-//! include:
-//!
-//! * [`Clone`]
-//! * [`Eq`], [`Ord`] - for slices whose element type are [`Eq`] or [`Ord`].
-//! * [`Hash`] - for slices whose element type is [`Hash`].
-//!
-//! ## Iteration
-//!
-//! The slices implement `IntoIterator`. The iterator yields references to the
-//! slice elements.
-//!
-//! ```
-//! let numbers = &[0, 1, 2];
-//! for n in numbers {
-//!     println!("{n} is a number!");
-//! }
-//! ```
-//!
-//! The mutable slice yields mutable references to the elements:
-//!
-//! ```
-//! let mut scores = [7, 8, 9];
-//! for score in &mut scores[..] {
-//!     *score += 1;
-//! }
-//! ```
-//!
-//! This iterator yields mutable references to the slice's elements, so while
-//! the element type of the slice is `i32`, the element type of the iterator is
-//! `&mut i32`.
-//!
-//! * [`.iter`] and [`.iter_mut`] are the explicit methods to return the default
-//!   iterators.
-//! * Further methods that return iterators are [`.split`], [`.splitn`],
-//!   [`.chunks`], [`.windows`] and more.
-//!
-//! [`Hash`]: core::hash::Hash
-//! [`.iter`]: slice::iter
-//! [`.iter_mut`]: slice::iter_mut
-//! [`.split`]: slice::split
-//! [`.splitn`]: slice::splitn
-//! [`.chunks`]: slice::chunks
-//! [`.windows`]: slice::windows
+//! A few functions are provided to create a slice from a value reference
+//! or from a raw pointer.
 #![stable(feature = "rust1", since = "1.0.0")]
 // Many of the usings in this module are only used in the test configuration.
 // It's cleaner to just turn off the unused_imports warning than to fix them.
@@ -88,20 +18,23 @@ use core::borrow::{Borrow, BorrowMut};
 #[cfg(not(no_global_oom_handling))]
 use core::cmp::Ordering::{self, Less};
 #[cfg(not(no_global_oom_handling))]
-use core::mem;
-#[cfg(not(no_global_oom_handling))]
-use core::mem::size_of;
+use core::mem::{self, SizedTypeProperties};
 #[cfg(not(no_global_oom_handling))]
 use core::ptr;
+#[cfg(not(no_global_oom_handling))]
+use core::slice::sort;
 
 use crate::alloc::Allocator;
 #[cfg(not(no_global_oom_handling))]
-use crate::alloc::Global;
+use crate::alloc::{self, Global};
 #[cfg(not(no_global_oom_handling))]
 use crate::borrow::ToOwned;
 use crate::boxed::Box;
 use crate::vec::Vec;
 
+#[cfg(test)]
+mod tests;
+
 #[unstable(feature = "slice_range", issue = "76393")]
 pub use core::slice::range;
 #[unstable(feature = "array_chunks", issue = "74985")]
@@ -116,6 +49,8 @@ pub use core::slice::EscapeAscii;
 pub use core::slice::SliceIndex;
 #[stable(feature = "from_ref", since = "1.28.0")]
 pub use core::slice::{from_mut, from_ref};
+#[unstable(feature = "slice_from_ptr_range", issue = "89792")]
+pub use core::slice::{from_mut_ptr_range, from_ptr_range};
 #[stable(feature = "rust1", since = "1.0.0")]
 pub use core::slice::{from_raw_parts, from_raw_parts_mut};
 #[stable(feature = "rust1", since = "1.0.0")]
@@ -275,7 +210,7 @@ impl<T> [T] {
     where
         T: Ord,
     {
-        merge_sort(self, |a, b| a.lt(b));
+        stable_sort(self, T::lt);
     }
 
     /// Sorts the slice with a comparator function.
@@ -331,7 +266,7 @@ impl<T> [T] {
     where
         F: FnMut(&T, &T) -> Ordering,
     {
-        merge_sort(self, |a, b| compare(a, b) == Less);
+        stable_sort(self, |a, b| compare(a, b) == Less);
     }
 
     /// Sorts the slice with a key extraction function.
@@ -374,7 +309,7 @@ impl<T> [T] {
         F: FnMut(&T) -> K,
         K: Ord,
     {
-        merge_sort(self, |a, b| f(a).lt(&f(b)));
+        stable_sort(self, |a, b| f(a).lt(&f(b)));
     }
 
     /// Sorts the slice with a key extraction function.
@@ -530,7 +465,7 @@ impl<T> [T] {
         hack::into_vec(self)
     }
 
-    /// Creates a vector by repeating a slice `n` times.
+    /// Creates a vector by copying a slice `n` times.
     ///
     /// # Panics
     ///
@@ -725,7 +660,7 @@ impl [u8] {
 ///
 /// ```error
 /// error[E0207]: the type parameter `T` is not constrained by the impl trait, self type, or predica
-///    --> src/liballoc/slice.rs:608:6
+///    --> library/alloc/src/slice.rs:608:6
 ///     |
 /// 608 | impl<T: Clone, V: Borrow<[T]>> Concat for [V] {
 ///     |      ^ unconstrained type parameter
@@ -836,14 +771,14 @@ impl<T: Clone, V: Borrow<[T]>> Join<&[T]> for [V] {
 ////////////////////////////////////////////////////////////////////////////////
 
 #[stable(feature = "rust1", since = "1.0.0")]
-impl<T> Borrow<[T]> for Vec<T> {
+impl<T, A: Allocator> Borrow<[T]> for Vec<T, A> {
     fn borrow(&self) -> &[T] {
         &self[..]
     }
 }
 
 #[stable(feature = "rust1", since = "1.0.0")]
-impl<T> BorrowMut<[T]> for Vec<T> {
+impl<T, A: Allocator> BorrowMut<[T]> for Vec<T, A> {
     fn borrow_mut(&mut self) -> &mut [T] {
         &mut self[..]
     }
@@ -881,324 +816,52 @@ impl<T: Clone> ToOwned for [T] {
 // Sorting
 ////////////////////////////////////////////////////////////////////////////////
 
-/// Inserts `v[0]` into pre-sorted sequence `v[1..]` so that whole `v[..]` becomes sorted.
-///
-/// This is the integral subroutine of insertion sort.
+#[inline]
 #[cfg(not(no_global_oom_handling))]
-fn insert_head<T, F>(v: &mut [T], is_less: &mut F)
+fn stable_sort<T, F>(v: &mut [T], mut is_less: F)
 where
     F: FnMut(&T, &T) -> bool,
 {
-    if v.len() >= 2 && is_less(&v[1], &v[0]) {
-        unsafe {
-            // There are three ways to implement insertion here:
-            //
-            // 1. Swap adjacent elements until the first one gets to its final destination.
-            //    However, this way we copy data around more than is necessary. If elements are big
-            //    structures (costly to copy), this method will be slow.
-            //
-            // 2. Iterate until the right place for the first element is found. Then shift the
-            //    elements succeeding it to make room for it and finally place it into the
-            //    remaining hole. This is a good method.
-            //
-            // 3. Copy the first element into a temporary variable. Iterate until the right place
-            //    for it is found. As we go along, copy every traversed element into the slot
-            //    preceding it. Finally, copy data from the temporary variable into the remaining
-            //    hole. This method is very good. Benchmarks demonstrated slightly better
-            //    performance than with the 2nd method.
-            //
-            // All methods were benchmarked, and the 3rd showed best results. So we chose that one.
-            let tmp = mem::ManuallyDrop::new(ptr::read(&v[0]));
-
-            // Intermediate state of the insertion process is always tracked by `hole`, which
-            // serves two purposes:
-            // 1. Protects integrity of `v` from panics in `is_less`.
-            // 2. Fills the remaining hole in `v` in the end.
-            //
-            // Panic safety:
-            //
-            // If `is_less` panics at any point during the process, `hole` will get dropped and
-            // fill the hole in `v` with `tmp`, thus ensuring that `v` still holds every object it
-            // initially held exactly once.
-            let mut hole = InsertionHole { src: &*tmp, dest: &mut v[1] };
-            ptr::copy_nonoverlapping(&v[1], &mut v[0], 1);
-
-            for i in 2..v.len() {
-                if !is_less(&v[i], &*tmp) {
-                    break;
-                }
-                ptr::copy_nonoverlapping(&v[i], &mut v[i - 1], 1);
-                hole.dest = &mut v[i];
-            }
-            // `hole` gets dropped and thus copies `tmp` into the remaining hole in `v`.
-        }
-    }
-
-    // When dropped, copies from `src` into `dest`.
-    struct InsertionHole<T> {
-        src: *const T,
-        dest: *mut T,
-    }
-
-    impl<T> Drop for InsertionHole<T> {
-        fn drop(&mut self) {
-            unsafe {
-                ptr::copy_nonoverlapping(self.src, self.dest, 1);
-            }
-        }
+    if T::IS_ZST {
+        // Sorting has no meaningful behavior on zero-sized types. Do nothing.
+        return;
     }
-}
-
-/// Merges non-decreasing runs `v[..mid]` and `v[mid..]` using `buf` as temporary storage, and
-/// stores the result into `v[..]`.
-///
-/// # Safety
-///
-/// The two slices must be non-empty and `mid` must be in bounds. Buffer `buf` must be long enough
-/// to hold a copy of the shorter slice. Also, `T` must not be a zero-sized type.
-#[cfg(not(no_global_oom_handling))]
-unsafe fn merge<T, F>(v: &mut [T], mid: usize, buf: *mut T, is_less: &mut F)
-where
-    F: FnMut(&T, &T) -> bool,
-{
-    let len = v.len();
-    let v = v.as_mut_ptr();
-    let (v_mid, v_end) = unsafe { (v.add(mid), v.add(len)) };
 
-    // The merge process first copies the shorter run into `buf`. Then it traces the newly copied
-    // run and the longer run forwards (or backwards), comparing their next unconsumed elements and
-    // copying the lesser (or greater) one into `v`.
-    //
-    // As soon as the shorter run is fully consumed, the process is done. If the longer run gets
-    // consumed first, then we must copy whatever is left of the shorter run into the remaining
-    // hole in `v`.
-    //
-    // Intermediate state of the process is always tracked by `hole`, which serves two purposes:
-    // 1. Protects integrity of `v` from panics in `is_less`.
-    // 2. Fills the remaining hole in `v` if the longer run gets consumed first.
-    //
-    // Panic safety:
-    //
-    // If `is_less` panics at any point during the process, `hole` will get dropped and fill the
-    // hole in `v` with the unconsumed range in `buf`, thus ensuring that `v` still holds every
-    // object it initially held exactly once.
-    let mut hole;
+    let elem_alloc_fn = |len: usize| -> *mut T {
+        // SAFETY: Creating the layout is safe as long as merge_sort never calls this with len >
+        // v.len(). Alloc in general will only be used as 'shadow-region' to store temporary swap
+        // elements.
+        unsafe { alloc::alloc(alloc::Layout::array::<T>(len).unwrap_unchecked()) as *mut T }
+    };
 
-    if mid <= len - mid {
-        // The left run is shorter.
+    let elem_dealloc_fn = |buf_ptr: *mut T, len: usize| {
+        // SAFETY: Creating the layout is safe as long as merge_sort never calls this with len >
+        // v.len(). The caller must ensure that buf_ptr was created by elem_alloc_fn with the same
+        // len.
         unsafe {
-            ptr::copy_nonoverlapping(v, buf, mid);
-            hole = MergeHole { start: buf, end: buf.add(mid), dest: v };
+            alloc::dealloc(buf_ptr as *mut u8, alloc::Layout::array::<T>(len).unwrap_unchecked());
         }
+    };
 
-        // Initially, these pointers point to the beginnings of their arrays.
-        let left = &mut hole.start;
-        let mut right = v_mid;
-        let out = &mut hole.dest;
-
-        while *left < hole.end && right < v_end {
-            // Consume the lesser side.
-            // If equal, prefer the left run to maintain stability.
-            unsafe {
-                let to_copy = if is_less(&*right, &**left) {
-                    get_and_increment(&mut right)
-                } else {
-                    get_and_increment(left)
-                };
-                ptr::copy_nonoverlapping(to_copy, get_and_increment(out), 1);
-            }
-        }
-    } else {
-        // The right run is shorter.
+    let run_alloc_fn = |len: usize| -> *mut sort::TimSortRun {
+        // SAFETY: Creating the layout is safe as long as merge_sort never calls this with an
+        // obscene length or 0.
         unsafe {
-            ptr::copy_nonoverlapping(v_mid, buf, len - mid);
-            hole = MergeHole { start: buf, end: buf.add(len - mid), dest: v_mid };
+            alloc::alloc(alloc::Layout::array::<sort::TimSortRun>(len).unwrap_unchecked())
+                as *mut sort::TimSortRun
         }
+    };
 
-        // Initially, these pointers point past the ends of their arrays.
-        let left = &mut hole.dest;
-        let right = &mut hole.end;
-        let mut out = v_end;
-
-        while v < *left && buf < *right {
-            // Consume the greater side.
-            // If equal, prefer the right run to maintain stability.
-            unsafe {
-                let to_copy = if is_less(&*right.offset(-1), &*left.offset(-1)) {
-                    decrement_and_get(left)
-                } else {
-                    decrement_and_get(right)
-                };
-                ptr::copy_nonoverlapping(to_copy, decrement_and_get(&mut out), 1);
-            }
-        }
-    }
-    // Finally, `hole` gets dropped. If the shorter run was not fully consumed, whatever remains of
-    // it will now be copied into the hole in `v`.
-
-    unsafe fn get_and_increment<T>(ptr: &mut *mut T) -> *mut T {
-        let old = *ptr;
-        *ptr = unsafe { ptr.offset(1) };
-        old
-    }
-
-    unsafe fn decrement_and_get<T>(ptr: &mut *mut T) -> *mut T {
-        *ptr = unsafe { ptr.offset(-1) };
-        *ptr
-    }
-
-    // When dropped, copies the range `start..end` into `dest..`.
-    struct MergeHole<T> {
-        start: *mut T,
-        end: *mut T,
-        dest: *mut T,
-    }
-
-    impl<T> Drop for MergeHole<T> {
-        fn drop(&mut self) {
-            // `T` is not a zero-sized type, and these are pointers into a slice's elements.
-            unsafe {
-                let len = self.end.sub_ptr(self.start);
-                ptr::copy_nonoverlapping(self.start, self.dest, len);
-            }
-        }
-    }
-}
-
-/// This merge sort borrows some (but not all) ideas from TimSort, which is described in detail
-/// [here](https://github.com/python/cpython/blob/main/Objects/listsort.txt).
-///
-/// The algorithm identifies strictly descending and non-descending subsequences, which are called
-/// natural runs. There is a stack of pending runs yet to be merged. Each newly found run is pushed
-/// onto the stack, and then some pairs of adjacent runs are merged until these two invariants are
-/// satisfied:
-///
-/// 1. for every `i` in `1..runs.len()`: `runs[i - 1].len > runs[i].len`
-/// 2. for every `i` in `2..runs.len()`: `runs[i - 2].len > runs[i - 1].len + runs[i].len`
-///
-/// The invariants ensure that the total running time is *O*(*n* \* log(*n*)) worst-case.
-#[cfg(not(no_global_oom_handling))]
-fn merge_sort<T, F>(v: &mut [T], mut is_less: F)
-where
-    F: FnMut(&T, &T) -> bool,
-{
-    // Slices of up to this length get sorted using insertion sort.
-    const MAX_INSERTION: usize = 20;
-    // Very short runs are extended using insertion sort to span at least this many elements.
-    const MIN_RUN: usize = 10;
-
-    // Sorting has no meaningful behavior on zero-sized types.
-    if size_of::<T>() == 0 {
-        return;
-    }
-
-    let len = v.len();
-
-    // Short arrays get sorted in-place via insertion sort to avoid allocations.
-    if len <= MAX_INSERTION {
-        if len >= 2 {
-            for i in (0..len - 1).rev() {
-                insert_head(&mut v[i..], &mut is_less);
-            }
-        }
-        return;
-    }
-
-    // Allocate a buffer to use as scratch memory. We keep the length 0 so we can keep in it
-    // shallow copies of the contents of `v` without risking the dtors running on copies if
-    // `is_less` panics. When merging two sorted runs, this buffer holds a copy of the shorter run,
-    // which will always have length at most `len / 2`.
-    let mut buf = Vec::with_capacity(len / 2);
-
-    // In order to identify natural runs in `v`, we traverse it backwards. That might seem like a
-    // strange decision, but consider the fact that merges more often go in the opposite direction
-    // (forwards). According to benchmarks, merging forwards is slightly faster than merging
-    // backwards. To conclude, identifying runs by traversing backwards improves performance.
-    let mut runs = vec![];
-    let mut end = len;
-    while end > 0 {
-        // Find the next natural run, and reverse it if it's strictly descending.
-        let mut start = end - 1;
-        if start > 0 {
-            start -= 1;
-            unsafe {
-                if is_less(v.get_unchecked(start + 1), v.get_unchecked(start)) {
-                    while start > 0 && is_less(v.get_unchecked(start), v.get_unchecked(start - 1)) {
-                        start -= 1;
-                    }
-                    v[start..end].reverse();
-                } else {
-                    while start > 0 && !is_less(v.get_unchecked(start), v.get_unchecked(start - 1))
-                    {
-                        start -= 1;
-                    }
-                }
-            }
-        }
-
-        // Insert some more elements into the run if it's too short. Insertion sort is faster than
-        // merge sort on short sequences, so this significantly improves performance.
-        while start > 0 && end - start < MIN_RUN {
-            start -= 1;
-            insert_head(&mut v[start..end], &mut is_less);
-        }
-
-        // Push this run onto the stack.
-        runs.push(Run { start, len: end - start });
-        end = start;
-
-        // Merge some pairs of adjacent runs to satisfy the invariants.
-        while let Some(r) = collapse(&runs) {
-            let left = runs[r + 1];
-            let right = runs[r];
-            unsafe {
-                merge(
-                    &mut v[left.start..right.start + right.len],
-                    left.len,
-                    buf.as_mut_ptr(),
-                    &mut is_less,
-                );
-            }
-            runs[r] = Run { start: left.start, len: left.len + right.len };
-            runs.remove(r + 1);
-        }
-    }
-
-    // Finally, exactly one run must remain in the stack.
-    debug_assert!(runs.len() == 1 && runs[0].start == 0 && runs[0].len == len);
-
-    // Examines the stack of runs and identifies the next pair of runs to merge. More specifically,
-    // if `Some(r)` is returned, that means `runs[r]` and `runs[r + 1]` must be merged next. If the
-    // algorithm should continue building a new run instead, `None` is returned.
-    //
-    // TimSort is infamous for its buggy implementations, as described here:
-    // http://envisage-project.eu/timsort-specification-and-verification/
-    //
-    // The gist of the story is: we must enforce the invariants on the top four runs on the stack.
-    // Enforcing them on just top three is not sufficient to ensure that the invariants will still
-    // hold for *all* runs in the stack.
-    //
-    // This function correctly checks invariants for the top four runs. Additionally, if the top
-    // run starts at index 0, it will always demand a merge operation until the stack is fully
-    // collapsed, in order to complete the sort.
-    #[inline]
-    fn collapse(runs: &[Run]) -> Option<usize> {
-        let n = runs.len();
-        if n >= 2
-            && (runs[n - 1].start == 0
-                || runs[n - 2].len <= runs[n - 1].len
-                || (n >= 3 && runs[n - 3].len <= runs[n - 2].len + runs[n - 1].len)
-                || (n >= 4 && runs[n - 4].len <= runs[n - 3].len + runs[n - 2].len))
-        {
-            if n >= 3 && runs[n - 3].len < runs[n - 1].len { Some(n - 3) } else { Some(n - 2) }
-        } else {
-            None
+    let run_dealloc_fn = |buf_ptr: *mut sort::TimSortRun, len: usize| {
+        // SAFETY: The caller must ensure that `buf_ptr` was created by `run_alloc_fn` with the
+        // same `len`.
+        unsafe {
+            alloc::dealloc(
+                buf_ptr as *mut u8,
+                alloc::Layout::array::<sort::TimSortRun>(len).unwrap_unchecked(),
+            );
         }
-    }
+    };
 
-    #[derive(Clone, Copy)]
-    struct Run {
-        start: usize,
-        len: usize,
-    }
+    sort::merge_sort(v, &mut is_less, elem_alloc_fn, elem_dealloc_fn, run_alloc_fn, run_dealloc_fn);
 }
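
The closures above pair `alloc::alloc`/`alloc::dealloc` with `Layout::array` so that `sort::merge_sort` can obtain its element and run scratch buffers without the sort module touching the global allocator directly. A minimal standalone sketch of that allocate/use/deallocate pairing; the element type and length are illustrative and not taken from the patch:

    use std::alloc::{alloc, dealloc, Layout};

    fn main() {
        // Hypothetical scratch buffer of 8 u32s, mirroring the Layout::array
        // pattern used by elem_alloc_fn/elem_dealloc_fn above.
        let layout = Layout::array::<u32>(8).expect("tiny array layout cannot overflow");
        unsafe {
            let buf = alloc(layout) as *mut u32;
            if buf.is_null() {
                return; // allocation failure; a real caller would report OOM
            }
            // ... a sort would stage temporary copies of elements here ...
            dealloc(buf as *mut u8, layout); // must pass the same layout as the allocation
        }
    }
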
index b6a5f98..d503d2f 100644 (file)
@@ -3,7 +3,7 @@
 use crate::alloc::{Allocator, Global};
 use core::fmt;
 use core::iter::{FusedIterator, TrustedLen};
-use core::mem;
+use core::mem::{self, ManuallyDrop, SizedTypeProperties};
 use core::ptr::{self, NonNull};
 use core::slice::{self};
 
@@ -67,6 +67,77 @@ impl<'a, T, A: Allocator> Drain<'a, T, A> {
     pub fn allocator(&self) -> &A {
         unsafe { self.vec.as_ref().allocator() }
     }
+
+    /// Keep unyielded elements in the source `Vec`.
+    ///
+    /// # Examples
+    ///
+    /// ```
+    /// #![feature(drain_keep_rest)]
+    ///
+    /// let mut vec = vec!['a', 'b', 'c'];
+    /// let mut drain = vec.drain(..);
+    ///
+    /// assert_eq!(drain.next().unwrap(), 'a');
+    ///
+    /// // This call keeps 'b' and 'c' in the vec.
+    /// drain.keep_rest();
+    ///
+    /// // If we hadn't called `keep_rest()`,
+    /// // `vec` would be empty.
+    /// assert_eq!(vec, ['b', 'c']);
+    /// ```
+    #[unstable(feature = "drain_keep_rest", issue = "101122")]
+    pub fn keep_rest(self) {
+        // At this moment layout looks like this:
+        //
+        // [head] [yielded by next] [unyielded] [yielded by next_back] [tail]
+        //        ^-- start         \_________/-- unyielded_len        \____/-- self.tail_len
+        //                          ^-- unyielded_ptr                  ^-- tail
+        //
+        // Normally `Drop` impl would drop [unyielded] and then move [tail] to the `start`.
+        // Here we want to
+        // 1. Move [unyielded] to `start`
+        // 2. Move [tail] to a new start at `start + len(unyielded)`
+        // 3. Update length of the original vec to `len(head) + len(unyielded) + len(tail)`
+        //    a. In case of ZST, this is the only thing we want to do
+        // 4. Do *not* drop `self`; everything is already in a consistent state and there is nothing left to do
+        let mut this = ManuallyDrop::new(self);
+
+        unsafe {
+            let source_vec = this.vec.as_mut();
+
+            let start = source_vec.len();
+            let tail = this.tail_start;
+
+            let unyielded_len = this.iter.len();
+            let unyielded_ptr = this.iter.as_slice().as_ptr();
+
+            // ZSTs have no identity, so we don't need to move them around.
+            let needs_move = mem::size_of::<T>() != 0;
+
+            if needs_move {
+                let start_ptr = source_vec.as_mut_ptr().add(start);
+
+                // memmove back unyielded elements
+                if unyielded_ptr != start_ptr {
+                    let src = unyielded_ptr;
+                    let dst = start_ptr;
+
+                    ptr::copy(src, dst, unyielded_len);
+                }
+
+                // memmove back untouched tail
+                if tail != (start + unyielded_len) {
+                    let src = source_vec.as_ptr().add(tail);
+                    let dst = start_ptr.add(unyielded_len);
+                    ptr::copy(src, dst, this.tail_len);
+                }
+            }
+
+            source_vec.set_len(start + unyielded_len + this.tail_len);
+        }
+    }
 }
 
 #[stable(feature = "vec_drain_as_slice", since = "1.46.0")]
@@ -133,7 +204,7 @@ impl<T, A: Allocator> Drop for Drain<'_, T, A> {
 
         let mut vec = self.vec;
 
-        if mem::size_of::<T>() == 0 {
+        if T::IS_ZST {
             // ZSTs have no identity, so we don't need to move them around, we only need to drop the correct amount.
             // this can be achieved by manipulating the Vec length instead of moving values out from `iter`.
             unsafe {
@@ -154,9 +225,9 @@ impl<T, A: Allocator> Drop for Drain<'_, T, A> {
         }
 
         // as_slice() must only be called when iter.len() is > 0 because
-        // vec::Splice modifies vec::Drain fields and may grow the vec which would invalidate
-        // the iterator's internal pointers. Creating a reference to deallocated memory
-        // is invalid even when it is zero-length
+        // it also gets touched by vec::Splice, which may turn it into a dangling pointer;
+        // that would make it and the vec pointer point to different allocations, which would
+        // lead to invalid pointer arithmetic below.
         let drop_ptr = iter.as_slice().as_ptr();
 
         unsafe {
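
`keep_rest` compacts the vector with `ptr::copy`, whose memmove semantics tolerate overlapping source and destination ranges, unlike `ptr::copy_nonoverlapping`. A small sketch of the overlapping shift it performs inside the Vec's buffer, using made-up values:

    use std::ptr;

    fn main() {
        // Shift the tail [3, 4, 5] left by one slot; src (index 2) and dst
        // (index 1) overlap, so ptr::copy (memmove) is required.
        let mut data = [1u32, 2, 3, 4, 5];
        unsafe {
            let p = data.as_mut_ptr();
            ptr::copy(p.add(2), p.add(1), 3);
        }
        assert_eq!(data, [1, 3, 4, 5, 5]);
    }
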
index b04fce0..4b01922 100644 (file)
@@ -1,8 +1,9 @@
 // SPDX-License-Identifier: Apache-2.0 OR MIT
 
 use crate::alloc::{Allocator, Global};
-use core::ptr::{self};
-use core::slice::{self};
+use core::mem::{self, ManuallyDrop};
+use core::ptr;
+use core::slice;
 
 use super::Vec;
 
@@ -56,6 +57,61 @@ where
     pub fn allocator(&self) -> &A {
         self.vec.allocator()
     }
+
+    /// Keep unyielded elements in the source `Vec`.
+    ///
+    /// # Examples
+    ///
+    /// ```
+    /// #![feature(drain_filter)]
+    /// #![feature(drain_keep_rest)]
+    ///
+    /// let mut vec = vec!['a', 'b', 'c'];
+    /// let mut drain = vec.drain_filter(|_| true);
+    ///
+    /// assert_eq!(drain.next().unwrap(), 'a');
+    ///
+    /// // This call keeps 'b' and 'c' in the vec.
+    /// drain.keep_rest();
+    ///
+    /// // If we hadn't called `keep_rest()`,
+    /// // `vec` would be empty.
+    /// assert_eq!(vec, ['b', 'c']);
+    /// ```
+    #[unstable(feature = "drain_keep_rest", issue = "101122")]
+    pub fn keep_rest(self) {
+        // At this moment layout looks like this:
+        //
+        //  _____________________/-- old_len
+        // /                     \
+        // [kept] [yielded] [tail]
+        //        \_______/ ^-- idx
+        //                \-- del
+        //
+        // Normally the `Drop` impl would drop [tail] (via `.for_each(drop)`, i.e. still calling `pred`)
+        //
+        // 1. Move [tail] after [kept]
+        // 2. Update length of the original vec to `old_len - del`
+        //    a. In case of ZST, this is the only thing we want to do
+        // 3. Do *not* drop `self`; everything is already in a consistent state and there is nothing left to do
+        let mut this = ManuallyDrop::new(self);
+
+        unsafe {
+            // ZSTs have no identity, so we don't need to move them around.
+            let needs_move = mem::size_of::<T>() != 0;
+
+            if needs_move && this.idx < this.old_len && this.del > 0 {
+                let ptr = this.vec.as_mut_ptr();
+                let src = ptr.add(this.idx);
+                let dst = src.sub(this.del);
+                let tail_len = this.old_len - this.idx;
+                src.copy_to(dst, tail_len);
+            }
+
+            let new_len = this.old_len - this.del;
+            this.vec.set_len(new_len);
+        }
+    }
 }
 
 #[unstable(feature = "drain_filter", reason = "recently added", issue = "43244")]
index f7a50e7..34a2a70 100644 (file)
@@ -3,14 +3,16 @@
 #[cfg(not(no_global_oom_handling))]
 use super::AsVecIntoIter;
 use crate::alloc::{Allocator, Global};
+#[cfg(not(no_global_oom_handling))]
+use crate::collections::VecDeque;
 use crate::raw_vec::RawVec;
+use core::array;
 use core::fmt;
-use core::intrinsics::arith_offset;
 use core::iter::{
     FusedIterator, InPlaceIterable, SourceIter, TrustedLen, TrustedRandomAccessNoCoerce,
 };
 use core::marker::PhantomData;
-use core::mem::{self, ManuallyDrop};
+use core::mem::{self, ManuallyDrop, MaybeUninit, SizedTypeProperties};
 #[cfg(not(no_global_oom_handling))]
 use core::ops::Deref;
 use core::ptr::{self, NonNull};
@@ -40,7 +42,9 @@ pub struct IntoIter<
     // to avoid dropping the allocator twice we need to wrap it into ManuallyDrop
     pub(super) alloc: ManuallyDrop<A>,
     pub(super) ptr: *const T,
-    pub(super) end: *const T,
+    pub(super) end: *const T, // If T is a ZST, this is actually ptr+len. This encoding is picked so that
+                              // ptr == end is a quick test for the Iterator being empty, that works
+                              // for both ZST and non-ZST.
 }
 
 #[stable(feature = "vec_intoiter_debug", since = "1.13.0")]
@@ -97,13 +101,16 @@ impl<T, A: Allocator> IntoIter<T, A> {
     }
 
     /// Drops remaining elements and relinquishes the backing allocation.
+    /// This method guarantees it won't panic before relinquishing
+    /// the backing allocation.
     ///
     /// This is roughly equivalent to the following, but more efficient
     ///
     /// ```
     /// # let mut into_iter = Vec::<u8>::with_capacity(10).into_iter();
+    /// let mut into_iter = std::mem::replace(&mut into_iter, Vec::new().into_iter());
     /// (&mut into_iter).for_each(core::mem::drop);
-    /// unsafe { core::ptr::write(&mut into_iter, Vec::new().into_iter()); }
+    /// std::mem::forget(into_iter);
     /// ```
     ///
     /// This method is used by in-place iteration, refer to the vec::in_place_collect
@@ -120,15 +127,45 @@ impl<T, A: Allocator> IntoIter<T, A> {
         self.ptr = self.buf.as_ptr();
         self.end = self.buf.as_ptr();
 
+        // Dropping the remaining elements can panic, so this needs to be
+        // done only after updating the other fields.
         unsafe {
             ptr::drop_in_place(remaining);
         }
     }
 
     /// Forgets to Drop the remaining elements while still allowing the backing allocation to be freed.
-    #[allow(dead_code)]
     pub(crate) fn forget_remaining_elements(&mut self) {
-        self.ptr = self.end;
+        // For the ZST case, it is crucial that we mutate `end` here, not `ptr`.
+        // `ptr` must stay aligned, while `end` may be unaligned.
+        self.end = self.ptr;
+    }
+
+    #[cfg(not(no_global_oom_handling))]
+    #[inline]
+    pub(crate) fn into_vecdeque(self) -> VecDeque<T, A> {
+        // Keep our `Drop` impl from dropping the elements and the allocator
+        let mut this = ManuallyDrop::new(self);
+
+        // SAFETY: This allocation originally came from a `Vec`, so it passes
+        // all those checks. We have `this.buf` ≤ `this.ptr` ≤ `this.end`,
+        // so the `sub_ptr`s below cannot wrap, and will produce a well-formed
+        // range. `end` ≤ `buf + cap`, so the range will be in-bounds.
+        // Taking `alloc` is ok because nothing else is going to look at it,
+        // since our `Drop` impl isn't going to run so there's no more code.
+        unsafe {
+            let buf = this.buf.as_ptr();
+            let initialized = if T::IS_ZST {
+                // All the pointers are the same for ZSTs, so it's fine to
+                // say that they're all at the beginning of the "allocation".
+                0..this.len()
+            } else {
+                this.ptr.sub_ptr(buf)..this.end.sub_ptr(buf)
+            };
+            let cap = this.cap;
+            let alloc = ManuallyDrop::take(&mut this.alloc);
+            VecDeque::from_contiguous_raw_parts_in(buf, initialized, cap, alloc)
+        }
     }
 }
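
`into_vecdeque` hands the existing buffer, its capacity, and the initialized range over to `VecDeque::from_contiguous_raw_parts_in`, presumably so conversions such as collecting a `vec::IntoIter` into a `VecDeque` can keep the original allocation. Whether a given conversion actually reuses the buffer is an implementation detail, so this sketch only checks the resulting contents:

    use std::collections::VecDeque;

    fn main() {
        // A partially consumed vec::IntoIter collected into a VecDeque.
        let mut it = vec![1, 2, 3, 4].into_iter();
        assert_eq!(it.next(), Some(1));
        let dq: VecDeque<_> = it.collect();
        assert_eq!(dq, VecDeque::from(vec![2, 3, 4]));
    }
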
 
@@ -150,19 +187,18 @@ impl<T, A: Allocator> Iterator for IntoIter<T, A> {
 
     #[inline]
     fn next(&mut self) -> Option<T> {
-        if self.ptr as *const _ == self.end {
+        if self.ptr == self.end {
             None
-        } else if mem::size_of::<T>() == 0 {
-            // purposefully don't use 'ptr.offset' because for
-            // vectors with 0-size elements this would return the
-            // same pointer.
-            self.ptr = unsafe { arith_offset(self.ptr as *const i8, 1) as *mut T };
+        } else if T::IS_ZST {
+            // `ptr` has to stay where it is to remain aligned, so we reduce the length by 1 by
+            // reducing the `end`.
+            self.end = self.end.wrapping_byte_sub(1);
 
             // Make up a value of this ZST.
             Some(unsafe { mem::zeroed() })
         } else {
             let old = self.ptr;
-            self.ptr = unsafe { self.ptr.offset(1) };
+            self.ptr = unsafe { self.ptr.add(1) };
 
             Some(unsafe { ptr::read(old) })
         }
@@ -170,7 +206,7 @@ impl<T, A: Allocator> Iterator for IntoIter<T, A> {
 
     #[inline]
     fn size_hint(&self) -> (usize, Option<usize>) {
-        let exact = if mem::size_of::<T>() == 0 {
+        let exact = if T::IS_ZST {
             self.end.addr().wrapping_sub(self.ptr.addr())
         } else {
             unsafe { self.end.sub_ptr(self.ptr) }
@@ -182,11 +218,9 @@ impl<T, A: Allocator> Iterator for IntoIter<T, A> {
     fn advance_by(&mut self, n: usize) -> Result<(), usize> {
         let step_size = self.len().min(n);
         let to_drop = ptr::slice_from_raw_parts_mut(self.ptr as *mut T, step_size);
-        if mem::size_of::<T>() == 0 {
-            // SAFETY: due to unchecked casts of unsigned amounts to signed offsets the wraparound
-            // effectively results in unsigned pointers representing positions 0..usize::MAX,
-            // which is valid for ZSTs.
-            self.ptr = unsafe { arith_offset(self.ptr as *const i8, step_size as isize) as *mut T }
+        if T::IS_ZST {
+            // See `next` for why we sub `end` here.
+            self.end = self.end.wrapping_byte_sub(step_size);
         } else {
             // SAFETY: the min() above ensures that step_size is in bounds
             self.ptr = unsafe { self.ptr.add(step_size) };
@@ -206,6 +240,43 @@ impl<T, A: Allocator> Iterator for IntoIter<T, A> {
         self.len()
     }
 
+    #[inline]
+    fn next_chunk<const N: usize>(&mut self) -> Result<[T; N], core::array::IntoIter<T, N>> {
+        let mut raw_ary = MaybeUninit::uninit_array();
+
+        let len = self.len();
+
+        if T::IS_ZST {
+            if len < N {
+                self.forget_remaining_elements();
+                // Safety: ZSTs can be conjured ex nihilo, only the amount has to be correct
+                return Err(unsafe { array::IntoIter::new_unchecked(raw_ary, 0..len) });
+            }
+
+            self.end = self.end.wrapping_byte_sub(N);
+            // Safety: ditto
+            return Ok(unsafe { raw_ary.transpose().assume_init() });
+        }
+
+        if len < N {
+            // Safety: `len` indicates that this many elements are available and we just checked that
+            // it fits into the array.
+            unsafe {
+                ptr::copy_nonoverlapping(self.ptr, raw_ary.as_mut_ptr() as *mut T, len);
+                self.forget_remaining_elements();
+                return Err(array::IntoIter::new_unchecked(raw_ary, 0..len));
+            }
+        }
+
+        // Safety: `len` is at least the array size `N`. Copy a fixed amount here to fully initialize
+        // the array.
+        return unsafe {
+            ptr::copy_nonoverlapping(self.ptr, raw_ary.as_mut_ptr() as *mut T, N);
+            self.ptr = self.ptr.add(N);
+            Ok(raw_ary.transpose().assume_init())
+        };
+    }
+
     unsafe fn __iterator_get_unchecked(&mut self, i: usize) -> Self::Item
     where
         Self: TrustedRandomAccessNoCoerce,
@@ -219,7 +290,7 @@ impl<T, A: Allocator> Iterator for IntoIter<T, A> {
         // that `T: Copy` so reading elements from the buffer doesn't invalidate
         // them for `Drop`.
         unsafe {
-            if mem::size_of::<T>() == 0 { mem::zeroed() } else { ptr::read(self.ptr.add(i)) }
+            if T::IS_ZST { mem::zeroed() } else { ptr::read(self.ptr.add(i)) }
         }
     }
 }
@@ -230,14 +301,14 @@ impl<T, A: Allocator> DoubleEndedIterator for IntoIter<T, A> {
     fn next_back(&mut self) -> Option<T> {
         if self.end == self.ptr {
             None
-        } else if mem::size_of::<T>() == 0 {
+        } else if T::IS_ZST {
             // See above for why 'ptr.offset' isn't used
-            self.end = unsafe { arith_offset(self.end as *const i8, -1) as *mut T };
+            self.end = self.end.wrapping_byte_sub(1);
 
             // Make up a value of this ZST.
             Some(unsafe { mem::zeroed() })
         } else {
-            self.end = unsafe { self.end.offset(-1) };
+            self.end = unsafe { self.end.sub(1) };
 
             Some(unsafe { ptr::read(self.end) })
         }
@@ -246,14 +317,12 @@ impl<T, A: Allocator> DoubleEndedIterator for IntoIter<T, A> {
     #[inline]
     fn advance_back_by(&mut self, n: usize) -> Result<(), usize> {
         let step_size = self.len().min(n);
-        if mem::size_of::<T>() == 0 {
+        if T::IS_ZST {
             // SAFETY: same as for advance_by()
-            self.end = unsafe {
-                arith_offset(self.end as *const i8, step_size.wrapping_neg() as isize) as *mut T
-            }
+            self.end = self.end.wrapping_byte_sub(step_size);
         } else {
             // SAFETY: same as for advance_by()
-            self.end = unsafe { self.end.offset(step_size.wrapping_neg() as isize) };
+            self.end = unsafe { self.end.sub(step_size) };
         }
         let to_drop = ptr::slice_from_raw_parts_mut(self.end as *mut T, step_size);
         // SAFETY: same as for advance_by()
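
With the changes above, a ZST iterator keeps `ptr` fixed (so it stays aligned) and carries the remaining count in the byte distance to `end`, adjusted with `wrapping_byte_sub`. A quick check of the observable behaviour that encoding has to preserve:

    fn main() {
        // A vector of zero-sized values still iterates from both ends and
        // reports exact lengths, even though no real pointers move.
        let mut it = vec![(), (), ()].into_iter();
        assert_eq!(it.len(), 3);
        assert_eq!(it.next(), Some(()));
        assert_eq!(it.next_back(), Some(()));
        assert_eq!(it.len(), 1);
    }
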
index 377f3d1..d928dcf 100644 (file)
@@ -1,10 +1,13 @@
 // SPDX-License-Identifier: Apache-2.0 OR MIT
 
+use core::num::{Saturating, Wrapping};
+
 use crate::boxed::Box;
 
 #[rustc_specialization_trait]
 pub(super) unsafe trait IsZero {
-    /// Whether this value's representation is all zeros
+    /// Whether this value's representation is all zeros,
+    /// or can be represented with all zeroes.
     fn is_zero(&self) -> bool;
 }
 
@@ -19,12 +22,14 @@ macro_rules! impl_is_zero {
     };
 }
 
+impl_is_zero!(i8, |x| x == 0); // Needed so arrays and tuples of i8 get an impl.
 impl_is_zero!(i16, |x| x == 0);
 impl_is_zero!(i32, |x| x == 0);
 impl_is_zero!(i64, |x| x == 0);
 impl_is_zero!(i128, |x| x == 0);
 impl_is_zero!(isize, |x| x == 0);
 
+impl_is_zero!(u8, |x| x == 0); // Needed so arrays and tuples of u8 get an impl.
 impl_is_zero!(u16, |x| x == 0);
 impl_is_zero!(u32, |x| x == 0);
 impl_is_zero!(u64, |x| x == 0);
@@ -55,16 +60,42 @@ unsafe impl<T: IsZero, const N: usize> IsZero for [T; N] {
     #[inline]
     fn is_zero(&self) -> bool {
         // Because this is generated as a runtime check, it's not obvious that
-        // it's worth doing if the array is really long.  The threshold here
-        // is largely arbitrary, but was picked because as of 2022-05-01 LLVM
-        // can const-fold the check in `vec![[0; 32]; n]` but not in
-        // `vec![[0; 64]; n]`: https://godbolt.org/z/WTzjzfs5b
+        // it's worth doing if the array is really long. The threshold here
+        // is largely arbitrary, but was picked because as of 2022-07-01 LLVM
+        // fails to const-fold the check in `vec![[1; 32]; n]`
+        // See https://github.com/rust-lang/rust/pull/97581#issuecomment-1166628022
         // Feel free to tweak if you have better evidence.
 
-        N <= 32 && self.iter().all(IsZero::is_zero)
+        N <= 16 && self.iter().all(IsZero::is_zero)
+    }
+}
+
+// This is a recursive macro.
+macro_rules! impl_for_tuples {
+    // Stopper
+    () => {
+        // No point implementing for the empty tuple because it is a ZST.
+    };
+    ($first_arg:ident $(,$rest:ident)*) => {
+        unsafe impl <$first_arg: IsZero, $($rest: IsZero,)*> IsZero for ($first_arg, $($rest,)*){
+            #[inline]
+            fn is_zero(&self) -> bool{
+                // Destructure tuple to N references
+                // Rust allows shadowing generic params with local variable names.
+                #[allow(non_snake_case)]
+                let ($first_arg, $($rest,)*) = self;
+
+                $first_arg.is_zero()
+                    $( && $rest.is_zero() )*
+            }
+        }
+
+        impl_for_tuples!($($rest),*);
     }
 }
 
+impl_for_tuples!(A, B, C, D, E, F, G, H);
+
 // `Option<&T>` and `Option<Box<T>>` are guaranteed to represent `None` as null.
 // For fat pointers, the bytes that would be the pointer metadata in the `Some`
 // variant are padding in the `None` variant, so ignoring them and
@@ -118,3 +149,56 @@ impl_is_zero_option_of_nonzero!(
     NonZeroUsize,
     NonZeroIsize,
 );
+
+macro_rules! impl_is_zero_option_of_num {
+    ($($t:ty,)+) => {$(
+        unsafe impl IsZero for Option<$t> {
+            #[inline]
+            fn is_zero(&self) -> bool {
+                const {
+                    let none: Self = unsafe { core::mem::MaybeUninit::zeroed().assume_init() };
+                    assert!(none.is_none());
+                }
+                self.is_none()
+            }
+        }
+    )+};
+}
+
+impl_is_zero_option_of_num!(u8, u16, u32, u64, u128, i8, i16, i32, i64, i128, usize, isize,);
+
+unsafe impl<T: IsZero> IsZero for Wrapping<T> {
+    #[inline]
+    fn is_zero(&self) -> bool {
+        self.0.is_zero()
+    }
+}
+
+unsafe impl<T: IsZero> IsZero for Saturating<T> {
+    #[inline]
+    fn is_zero(&self) -> bool {
+        self.0.is_zero()
+    }
+}
+
+macro_rules! impl_for_optional_bool {
+    ($($t:ty,)+) => {$(
+        unsafe impl IsZero for $t {
+            #[inline]
+            fn is_zero(&self) -> bool {
+                // SAFETY: This is *not* a stable layout guarantee, but
+                // inside `core` we're allowed to rely on the current rustc
+                // behaviour that options of bools will be one byte with
+                // no padding, so long as they're nested less than 254 deep.
+                let raw: u8 = unsafe { core::mem::transmute(*self) };
+                raw == 0
+            }
+        }
+    )+};
+}
+impl_for_optional_bool! {
+    Option<bool>,
+    Option<Option<bool>>,
+    Option<Option<Option<bool>>>,
+    // Could go further, but not worth the metadata overhead
+}
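
These `IsZero` impls widen the specialization that lets `vec![x; n]`-style construction take a zeroed-allocation fast path when `x`'s representation is all zero bytes. A rough illustration of the shape `impl_for_tuples!` generates for a pair; the trait name is a stand-in here, since the real trait is an unsafe `#[rustc_specialization_trait]` internal to alloc:

    // Stand-in trait for illustration only.
    trait IsZeroDemo {
        fn is_zero(&self) -> bool;
    }

    impl IsZeroDemo for u8 {
        fn is_zero(&self) -> bool {
            *self == 0
        }
    }

    // What the recursive macro expands to for a two-element tuple.
    impl<A: IsZeroDemo, B: IsZeroDemo> IsZeroDemo for (A, B) {
        fn is_zero(&self) -> bool {
            let (a, b) = self;
            a.is_zero() && b.is_zero()
        }
    }

    fn main() {
        assert!((0u8, 0u8).is_zero());
        assert!(!(0u8, 1u8).is_zero());
    }
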
index fe4fff5..9499591 100644 (file)
@@ -61,12 +61,12 @@ use core::cmp::Ordering;
 use core::convert::TryFrom;
 use core::fmt;
 use core::hash::{Hash, Hasher};
-use core::intrinsics::{arith_offset, assume};
+use core::intrinsics::assume;
 use core::iter;
 #[cfg(not(no_global_oom_handling))]
 use core::iter::FromIterator;
 use core::marker::PhantomData;
-use core::mem::{self, ManuallyDrop, MaybeUninit};
+use core::mem::{self, ManuallyDrop, MaybeUninit, SizedTypeProperties};
 use core::ops::{self, Index, IndexMut, Range, RangeBounds};
 use core::ptr::{self, NonNull};
 use core::slice::{self, SliceIndex};
@@ -75,7 +75,7 @@ use crate::alloc::{Allocator, Global};
 #[cfg(not(no_borrow))]
 use crate::borrow::{Cow, ToOwned};
 use crate::boxed::Box;
-use crate::collections::TryReserveError;
+use crate::collections::{TryReserveError, TryReserveErrorKind};
 use crate::raw_vec::RawVec;
 
 #[unstable(feature = "drain_filter", reason = "recently added", issue = "43244")]
@@ -127,7 +127,7 @@ use self::set_len_on_drop::SetLenOnDrop;
 mod set_len_on_drop;
 
 #[cfg(not(no_global_oom_handling))]
-use self::in_place_drop::InPlaceDrop;
+use self::in_place_drop::{InPlaceDrop, InPlaceDstBufDrop};
 
 #[cfg(not(no_global_oom_handling))]
 mod in_place_drop;
@@ -169,7 +169,7 @@ mod spec_extend;
 /// vec[0] = 7;
 /// assert_eq!(vec[0], 7);
 ///
-/// vec.extend([1, 2, 3].iter().copied());
+/// vec.extend([1, 2, 3]);
 ///
 /// for x in &vec {
 ///     println!("{x}");
@@ -428,17 +428,25 @@ impl<T> Vec<T> {
         Vec { buf: RawVec::NEW, len: 0 }
     }
 
-    /// Constructs a new, empty `Vec<T>` with the specified capacity.
+    /// Constructs a new, empty `Vec<T>` with at least the specified capacity.
     ///
-    /// The vector will be able to hold exactly `capacity` elements without
-    /// reallocating. If `capacity` is 0, the vector will not allocate.
+    /// The vector will be able to hold at least `capacity` elements without
+    /// reallocating. This method is allowed to allocate for more elements than
+    /// `capacity`. If `capacity` is 0, the vector will not allocate.
     ///
     /// It is important to note that although the returned vector has the
-    /// *capacity* specified, the vector will have a zero *length*. For an
-    /// explanation of the difference between length and capacity, see
+    /// minimum *capacity* specified, the vector will have a zero *length*. For
+    /// an explanation of the difference between length and capacity, see
     /// *[Capacity and reallocation]*.
     ///
+    /// If it is important to know the exact allocated capacity of a `Vec`,
+    /// always use the [`capacity`] method after construction.
+    ///
+    /// For `Vec<T>` where `T` is a zero-sized type, there will be no allocation
+    /// and the capacity will always be `usize::MAX`.
+    ///
     /// [Capacity and reallocation]: #capacity-and-reallocation
+    /// [`capacity`]: Vec::capacity
     ///
     /// # Panics
     ///
@@ -451,19 +459,24 @@ impl<T> Vec<T> {
     ///
     /// // The vector contains no items, even though it has capacity for more
     /// assert_eq!(vec.len(), 0);
-    /// assert_eq!(vec.capacity(), 10);
+    /// assert!(vec.capacity() >= 10);
     ///
     /// // These are all done without reallocating...
     /// for i in 0..10 {
     ///     vec.push(i);
     /// }
     /// assert_eq!(vec.len(), 10);
-    /// assert_eq!(vec.capacity(), 10);
+    /// assert!(vec.capacity() >= 10);
     ///
     /// // ...but this may make the vector reallocate
     /// vec.push(11);
     /// assert_eq!(vec.len(), 11);
     /// assert!(vec.capacity() >= 11);
+    ///
+    /// // A vector of a zero-sized type will always over-allocate, since no
+    /// // allocation is necessary
+    /// let vec_units = Vec::<()>::with_capacity(10);
+    /// assert_eq!(vec_units.capacity(), usize::MAX);
     /// ```
     #[cfg(not(no_global_oom_handling))]
     #[inline]
@@ -473,17 +486,25 @@ impl<T> Vec<T> {
         Self::with_capacity_in(capacity, Global)
     }
 
-    /// Tries to construct a new, empty `Vec<T>` with the specified capacity.
+    /// Tries to construct a new, empty `Vec<T>` with at least the specified capacity.
     ///
-    /// The vector will be able to hold exactly `capacity` elements without
-    /// reallocating. If `capacity` is 0, the vector will not allocate.
+    /// The vector will be able to hold at least `capacity` elements without
+    /// reallocating. This method is allowed to allocate for more elements than
+    /// `capacity`. If `capacity` is 0, the vector will not allocate.
     ///
     /// It is important to note that although the returned vector has the
-    /// *capacity* specified, the vector will have a zero *length*. For an
-    /// explanation of the difference between length and capacity, see
+    /// minimum *capacity* specified, the vector will have a zero *length*. For
+    /// an explanation of the difference between length and capacity, see
     /// *[Capacity and reallocation]*.
     ///
+    /// If it is important to know the exact allocated capacity of a `Vec`,
+    /// always use the [`capacity`] method after construction.
+    ///
+    /// For `Vec<T>` where `T` is a zero-sized type, there will be no allocation
+    /// and the capacity will always be `usize::MAX`.
+    ///
     /// [Capacity and reallocation]: #capacity-and-reallocation
+    /// [`capacity`]: Vec::capacity
     ///
     /// # Examples
     ///
@@ -492,14 +513,14 @@ impl<T> Vec<T> {
     ///
     /// // The vector contains no items, even though it has capacity for more
     /// assert_eq!(vec.len(), 0);
-    /// assert_eq!(vec.capacity(), 10);
+    /// assert!(vec.capacity() >= 10);
     ///
     /// // These are all done without reallocating...
     /// for i in 0..10 {
     ///     vec.push(i);
     /// }
     /// assert_eq!(vec.len(), 10);
-    /// assert_eq!(vec.capacity(), 10);
+    /// assert!(vec.capacity() >= 10);
     ///
     /// // ...but this may make the vector reallocate
     /// vec.push(11);
@@ -508,6 +529,11 @@ impl<T> Vec<T> {
     ///
     /// let mut result = Vec::try_with_capacity(usize::MAX);
     /// assert!(result.is_err());
+    ///
+    /// // A vector of a zero-sized type will always over-allocate, since no
+    /// // allocation is necessary
+    /// let vec_units = Vec::<()>::try_with_capacity(10).unwrap();
+    /// assert_eq!(vec_units.capacity(), usize::MAX);
     /// ```
     #[inline]
     #[stable(feature = "kernel", since = "1.0.0")]
@@ -515,15 +541,15 @@ impl<T> Vec<T> {
         Self::try_with_capacity_in(capacity, Global)
     }
 
-    /// Creates a `Vec<T>` directly from the raw components of another vector.
+    /// Creates a `Vec<T>` directly from a pointer, a capacity, and a length.
     ///
     /// # Safety
     ///
     /// This is highly unsafe, due to the number of invariants that aren't
     /// checked:
     ///
-    /// * `ptr` needs to have been previously allocated via [`String`]/`Vec<T>`
-    ///   (at least, it's highly likely to be incorrect if it wasn't).
+    /// * `ptr` must have been allocated using the global allocator, such as via
+    ///   the [`alloc::alloc`] function.
     /// * `T` needs to have the same alignment as what `ptr` was allocated with.
     ///   (`T` having a less strict alignment is not sufficient, the alignment really
     ///   needs to be equal to satisfy the [`dealloc`] requirement that memory must be
@@ -532,6 +558,14 @@ impl<T> Vec<T> {
     ///   to be the same size as the pointer was allocated with. (Because similar to
     ///   alignment, [`dealloc`] must be called with the same layout `size`.)
     /// * `length` needs to be less than or equal to `capacity`.
+    /// * The first `length` values must be properly initialized values of type `T`.
+    /// * `capacity` needs to be the capacity that the pointer was allocated with.
+    /// * The allocated size in bytes must be no larger than `isize::MAX`.
+    ///   See the safety documentation of [`pointer::offset`].
+    ///
+    /// These requirements are always upheld by any `ptr` that has been allocated
+    /// via `Vec<T>`. Other allocation sources are allowed if the invariants are
+    /// upheld.
     ///
     /// Violating these may cause problems like corrupting the allocator's
     /// internal data structures. For example it is normally **not** safe
@@ -552,6 +586,7 @@ impl<T> Vec<T> {
     /// function.
     ///
     /// [`String`]: crate::string::String
+    /// [`alloc::alloc`]: crate::alloc::alloc
     /// [`dealloc`]: crate::alloc::GlobalAlloc::dealloc
     ///
     /// # Examples
@@ -574,8 +609,8 @@ impl<T> Vec<T> {
     ///
     /// unsafe {
     ///     // Overwrite memory with 4, 5, 6
-    ///     for i in 0..len as isize {
-    ///         ptr::write(p.offset(i), 4 + i);
+    ///     for i in 0..len {
+    ///         ptr::write(p.add(i), 4 + i);
     ///     }
     ///
     ///     // Put everything back together into a Vec
@@ -583,6 +618,32 @@ impl<T> Vec<T> {
     ///     assert_eq!(rebuilt, [4, 5, 6]);
     /// }
     /// ```
+    ///
+    /// Using memory that was allocated elsewhere:
+    ///
+    /// ```rust
+    /// #![feature(allocator_api)]
+    ///
+    /// use std::alloc::{AllocError, Allocator, Global, Layout};
+    ///
+    /// fn main() {
+    ///     let layout = Layout::array::<u32>(16).expect("overflow cannot happen");
+    ///
+    ///     let vec = unsafe {
+    ///         let mem = match Global.allocate(layout) {
+    ///             Ok(mem) => mem.cast::<u32>().as_ptr(),
+    ///             Err(AllocError) => return,
+    ///         };
+    ///
+    ///         mem.write(1_000_000);
+    ///
+    ///         Vec::from_raw_parts_in(mem, 1, 16, Global)
+    ///     };
+    ///
+    ///     assert_eq!(vec, &[1_000_000]);
+    ///     assert_eq!(vec.capacity(), 16);
+    /// }
+    /// ```
     #[inline]
     #[stable(feature = "rust1", since = "1.0.0")]
     pub unsafe fn from_raw_parts(ptr: *mut T, length: usize, capacity: usize) -> Self {
@@ -611,18 +672,26 @@ impl<T, A: Allocator> Vec<T, A> {
         Vec { buf: RawVec::new_in(alloc), len: 0 }
     }
 
-    /// Constructs a new, empty `Vec<T, A>` with the specified capacity with the provided
-    /// allocator.
+    /// Constructs a new, empty `Vec<T, A>` with at least the specified capacity
+    /// with the provided allocator.
     ///
-    /// The vector will be able to hold exactly `capacity` elements without
-    /// reallocating. If `capacity` is 0, the vector will not allocate.
+    /// The vector will be able to hold at least `capacity` elements without
+    /// reallocating. This method is allowed to allocate for more elements than
+    /// `capacity`. If `capacity` is 0, the vector will not allocate.
     ///
     /// It is important to note that although the returned vector has the
-    /// *capacity* specified, the vector will have a zero *length*. For an
-    /// explanation of the difference between length and capacity, see
+    /// minimum *capacity* specified, the vector will have a zero *length*. For
+    /// an explanation of the difference between length and capacity, see
     /// *[Capacity and reallocation]*.
     ///
+    /// If it is important to know the exact allocated capacity of a `Vec`,
+    /// always use the [`capacity`] method after construction.
+    ///
+    /// For `Vec<T, A>` where `T` is a zero-sized type, there will be no allocation
+    /// and the capacity will always be `usize::MAX`.
+    ///
     /// [Capacity and reallocation]: #capacity-and-reallocation
+    /// [`capacity`]: Vec::capacity
     ///
     /// # Panics
     ///
@@ -652,6 +721,11 @@ impl<T, A: Allocator> Vec<T, A> {
     /// vec.push(11);
     /// assert_eq!(vec.len(), 11);
     /// assert!(vec.capacity() >= 11);
+    ///
+    /// // A vector of a zero-sized type will always over-allocate, since no
+    /// // allocation is necessary
+    /// let vec_units = Vec::<(), System>::with_capacity_in(10, System);
+    /// assert_eq!(vec_units.capacity(), usize::MAX);
     /// ```
     #[cfg(not(no_global_oom_handling))]
     #[inline]
@@ -660,18 +734,26 @@ impl<T, A: Allocator> Vec<T, A> {
         Vec { buf: RawVec::with_capacity_in(capacity, alloc), len: 0 }
     }
 
-    /// Tries to construct a new, empty `Vec<T, A>` with the specified capacity
+    /// Tries to construct a new, empty `Vec<T, A>` with at least the specified capacity
     /// with the provided allocator.
     ///
-    /// The vector will be able to hold exactly `capacity` elements without
-    /// reallocating. If `capacity` is 0, the vector will not allocate.
+    /// The vector will be able to hold at least `capacity` elements without
+    /// reallocating. This method is allowed to allocate for more elements than
+    /// `capacity`. If `capacity` is 0, the vector will not allocate.
     ///
     /// It is important to note that although the returned vector has the
-    /// *capacity* specified, the vector will have a zero *length*. For an
-    /// explanation of the difference between length and capacity, see
+    /// minimum *capacity* specified, the vector will have a zero *length*. For
+    /// an explanation of the difference between length and capacity, see
     /// *[Capacity and reallocation]*.
     ///
+    /// If it is important to know the exact allocated capacity of a `Vec`,
+    /// always use the [`capacity`] method after construction.
+    ///
+    /// For `Vec<T, A>` where `T` is a zero-sized type, there will be no allocation
+    /// and the capacity will always be `usize::MAX`.
+    ///
     /// [Capacity and reallocation]: #capacity-and-reallocation
+    /// [`capacity`]: Vec::capacity
     ///
     /// # Examples
     ///
@@ -700,6 +782,11 @@ impl<T, A: Allocator> Vec<T, A> {
     ///
     /// let mut result = Vec::try_with_capacity_in(usize::MAX, System);
     /// assert!(result.is_err());
+    ///
+    /// // A vector of a zero-sized type will always over-allocate, since no
+    /// // allocation is necessary
+    /// let vec_units = Vec::<(), System>::try_with_capacity_in(10, System).unwrap();
+    /// assert_eq!(vec_units.capacity(), usize::MAX);
     /// ```
     #[inline]
     #[stable(feature = "kernel", since = "1.0.0")]
@@ -707,21 +794,31 @@ impl<T, A: Allocator> Vec<T, A> {
         Ok(Vec { buf: RawVec::try_with_capacity_in(capacity, alloc)?, len: 0 })
     }
 
-    /// Creates a `Vec<T, A>` directly from the raw components of another vector.
+    /// Creates a `Vec<T, A>` directly from a pointer, a capacity, a length,
+    /// and an allocator.
     ///
     /// # Safety
     ///
     /// This is highly unsafe, due to the number of invariants that aren't
     /// checked:
     ///
-    /// * `ptr` needs to have been previously allocated via [`String`]/`Vec<T>`
-    ///   (at least, it's highly likely to be incorrect if it wasn't).
-    /// * `T` needs to have the same size and alignment as what `ptr` was allocated with.
+    /// * `ptr` must be [*currently allocated*] via the given allocator `alloc`.
+    /// * `T` needs to have the same alignment as what `ptr` was allocated with.
     ///   (`T` having a less strict alignment is not sufficient, the alignment really
     ///   needs to be equal to satisfy the [`dealloc`] requirement that memory must be
     ///   allocated and deallocated with the same layout.)
+    /// * The size of `T` times the `capacity` (ie. the allocated size in bytes) needs
+    ///   to be the same size as the pointer was allocated with. (Because similar to
+    ///   alignment, [`dealloc`] must be called with the same layout `size`.)
     /// * `length` needs to be less than or equal to `capacity`.
-    /// * `capacity` needs to be the capacity that the pointer was allocated with.
+    /// * The first `length` values must be properly initialized values of type `T`.
+    /// * `capacity` needs to [*fit*] the layout size that the pointer was allocated with.
+    /// * The allocated size in bytes must be no larger than `isize::MAX`.
+    ///   See the safety documentation of [`pointer::offset`].
+    ///
+    /// These requirements are always upheld by any `ptr` that has been allocated
+    /// via `Vec<T, A>`. Other allocation sources are allowed if the invariants are
+    /// upheld.
     ///
     /// Violating these may cause problems like corrupting the allocator's
     /// internal data structures. For example it is **not** safe
@@ -739,6 +836,8 @@ impl<T, A: Allocator> Vec<T, A> {
     ///
     /// [`String`]: crate::string::String
     /// [`dealloc`]: crate::alloc::GlobalAlloc::dealloc
+    /// [*currently allocated*]: crate::alloc::Allocator#currently-allocated-memory
+    /// [*fit*]: crate::alloc::Allocator#memory-fitting
     ///
     /// # Examples
     ///
@@ -768,8 +867,8 @@ impl<T, A: Allocator> Vec<T, A> {
     ///
     /// unsafe {
     ///     // Overwrite memory with 4, 5, 6
-    ///     for i in 0..len as isize {
-    ///         ptr::write(p.offset(i), 4 + i);
+    ///     for i in 0..len {
+    ///         ptr::write(p.add(i), 4 + i);
     ///     }
     ///
     ///     // Put everything back together into a Vec
@@ -777,6 +876,29 @@ impl<T, A: Allocator> Vec<T, A> {
     ///     assert_eq!(rebuilt, [4, 5, 6]);
     /// }
     /// ```
+    ///
+    /// Using memory that was allocated elsewhere:
+    ///
+    /// ```rust
+    /// use std::alloc::{alloc, Layout};
+    ///
+    /// fn main() {
+    ///     let layout = Layout::array::<u32>(16).expect("overflow cannot happen");
+    ///     let vec = unsafe {
+    ///         let mem = alloc(layout).cast::<u32>();
+    ///         if mem.is_null() {
+    ///             return;
+    ///         }
+    ///
+    ///         mem.write(1_000_000);
+    ///
+    ///         Vec::from_raw_parts(mem, 1, 16)
+    ///     };
+    ///
+    ///     assert_eq!(vec, &[1_000_000]);
+    ///     assert_eq!(vec.capacity(), 16);
+    /// }
+    /// ```
     #[inline]
     #[unstable(feature = "allocator_api", issue = "32838")]
     pub unsafe fn from_raw_parts_in(ptr: *mut T, length: usize, capacity: usize, alloc: A) -> Self {
@@ -869,13 +991,14 @@ impl<T, A: Allocator> Vec<T, A> {
         (ptr, len, capacity, alloc)
     }
 
-    /// Returns the number of elements the vector can hold without
+    /// Returns the total number of elements the vector can hold without
     /// reallocating.
     ///
     /// # Examples
     ///
     /// ```
-    /// let vec: Vec<i32> = Vec::with_capacity(10);
+    /// let mut vec: Vec<i32> = Vec::with_capacity(10);
+    /// vec.push(42);
     /// assert_eq!(vec.capacity(), 10);
     /// ```
     #[inline]
@@ -885,10 +1008,10 @@ impl<T, A: Allocator> Vec<T, A> {
     }
 
     /// Reserves capacity for at least `additional` more elements to be inserted
-    /// in the given `Vec<T>`. The collection may reserve more space to avoid
-    /// frequent reallocations. After calling `reserve`, capacity will be
-    /// greater than or equal to `self.len() + additional`. Does nothing if
-    /// capacity is already sufficient.
+    /// in the given `Vec<T>`. The collection may reserve more space to
+    /// speculatively avoid frequent reallocations. After calling `reserve`,
+    /// capacity will be greater than or equal to `self.len() + additional`.
+    /// Does nothing if capacity is already sufficient.
     ///
     /// # Panics
     ///
@@ -907,10 +1030,12 @@ impl<T, A: Allocator> Vec<T, A> {
         self.buf.reserve(self.len, additional);
     }
 
-    /// Reserves the minimum capacity for exactly `additional` more elements to
-    /// be inserted in the given `Vec<T>`. After calling `reserve_exact`,
-    /// capacity will be greater than or equal to `self.len() + additional`.
-    /// Does nothing if the capacity is already sufficient.
+    /// Reserves the minimum capacity for at least `additional` more elements to
+    /// be inserted in the given `Vec<T>`. Unlike [`reserve`], this will not
+    /// deliberately over-allocate to speculatively avoid frequent allocations.
+    /// After calling `reserve_exact`, capacity will be greater than or equal to
+    /// `self.len() + additional`. Does nothing if the capacity is already
+    /// sufficient.
     ///
     /// Note that the allocator may give the collection more space than it
     /// requests. Therefore, capacity can not be relied upon to be precisely
@@ -936,10 +1061,11 @@ impl<T, A: Allocator> Vec<T, A> {
     }
 
     /// Tries to reserve capacity for at least `additional` more elements to be inserted
-    /// in the given `Vec<T>`. The collection may reserve more space to avoid
+    /// in the given `Vec<T>`. The collection may reserve more space to speculatively avoid
     /// frequent reallocations. After calling `try_reserve`, capacity will be
-    /// greater than or equal to `self.len() + additional`. Does nothing if
-    /// capacity is already sufficient.
+    /// greater than or equal to `self.len() + additional` if it returns
+    /// `Ok(())`. Does nothing if capacity is already sufficient. This method
+    /// preserves the contents even if an error occurs.
     ///
     /// # Errors
     ///
@@ -971,10 +1097,11 @@ impl<T, A: Allocator> Vec<T, A> {
         self.buf.try_reserve(self.len, additional)
     }
 
-    /// Tries to reserve the minimum capacity for exactly `additional`
-    /// elements to be inserted in the given `Vec<T>`. After calling
-    /// `try_reserve_exact`, capacity will be greater than or equal to
-    /// `self.len() + additional` if it returns `Ok(())`.
+    /// Tries to reserve the minimum capacity for at least `additional`
+    /// elements to be inserted in the given `Vec<T>`. Unlike [`try_reserve`],
+    /// this will not deliberately over-allocate to speculatively avoid frequent
+    /// allocations. After calling `try_reserve_exact`, capacity will be greater
+    /// than or equal to `self.len() + additional` if it returns `Ok(())`.
     /// Does nothing if the capacity is already sufficient.
     ///
     /// Note that the allocator may give the collection more space than it
@@ -1066,7 +1193,8 @@ impl<T, A: Allocator> Vec<T, A> {
 
     /// Converts the vector into [`Box<[T]>`][owned slice].
     ///
-    /// Note that this will drop any excess capacity.
+    /// If the vector has excess capacity, its items will be moved into a
+    /// newly-allocated buffer with exactly the right capacity.
     ///
     /// [owned slice]: Box
     ///
@@ -1199,7 +1327,8 @@ impl<T, A: Allocator> Vec<T, A> {
         self
     }
 
-    /// Returns a raw pointer to the vector's buffer.
+    /// Returns a raw pointer to the vector's buffer, or a dangling raw pointer
+    /// valid for zero sized reads if the vector didn't allocate.
     ///
     /// The caller must ensure that the vector outlives the pointer this
     /// function returns, or else it will end up pointing to garbage.
@@ -1236,7 +1365,8 @@ impl<T, A: Allocator> Vec<T, A> {
         ptr
     }
 
-    /// Returns an unsafe mutable pointer to the vector's buffer.
+    /// Returns an unsafe mutable pointer to the vector's buffer, or a dangling
+    /// raw pointer valid for zero sized reads if the vector didn't allocate.
     ///
     /// The caller must ensure that the vector outlives the pointer this
     /// function returns, or else it will end up pointing to garbage.
@@ -1440,9 +1570,6 @@ impl<T, A: Allocator> Vec<T, A> {
         }
 
         let len = self.len();
-        if index > len {
-            assert_failed(index, len);
-        }
 
         // space for the new element
         if len == self.buf.capacity() {
@@ -1454,9 +1581,15 @@ impl<T, A: Allocator> Vec<T, A> {
             // The spot to put the new value
             {
                 let p = self.as_mut_ptr().add(index);
-                // Shift everything over to make space. (Duplicating the
-                // `index`th element into two consecutive places.)
-                ptr::copy(p, p.offset(1), len - index);
+                if index < len {
+                    // Shift everything over to make space. (Duplicating the
+                    // `index`th element into two consecutive places.)
+                    ptr::copy(p, p.add(1), len - index);
+                } else if index == len {
+                    // No elements need shifting.
+                } else {
+                    assert_failed(index, len);
+                }
                 // Write it in, overwriting the first copy of the `index`th
                 // element.
                 ptr::write(p, element);
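
The reordered bounds check above still makes room with an overlapping copy that duplicates the `index`th slot forward, then overwrites the first copy with the new element. A safe-code sketch of that shift-then-write sequence, using `copy_within` purely for illustration:

    fn main() {
        let mut v = vec![1, 2, 4, 5];
        let index = 2;
        v.push(0); // grow by one slot, as the capacity check above ensures
        v.copy_within(index..v.len() - 1, index + 1); // shift [4, 5] right by one
        v[index] = 3; // write the new element into the freed slot
        assert_eq!(v, [1, 2, 3, 4, 5]);
    }
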
@@ -1513,7 +1646,7 @@ impl<T, A: Allocator> Vec<T, A> {
                 ret = ptr::read(ptr);
 
                 // Shift everything down to fill in that spot.
-                ptr::copy(ptr.offset(1), ptr, len - index - 1);
+                ptr::copy(ptr.add(1), ptr, len - index - 1);
             }
             self.set_len(len - 1);
             ret
@@ -1562,11 +1695,11 @@ impl<T, A: Allocator> Vec<T, A> {
     ///
     /// ```
     /// let mut vec = vec![1, 2, 3, 4];
-    /// vec.retain_mut(|x| if *x > 3 {
-    ///     false
-    /// } else {
+    /// vec.retain_mut(|x| if *x <= 3 {
     ///     *x += 1;
     ///     true
+    /// } else {
+    ///     false
     /// });
     /// assert_eq!(vec, [2, 3, 4]);
     /// ```
@@ -1854,6 +1987,51 @@ impl<T, A: Allocator> Vec<T, A> {
         Ok(())
     }
 
+    /// Appends an element if there is sufficient spare capacity, otherwise an error is returned
+    /// with the element.
+    ///
+    /// Unlike [`push`] this method will not reallocate when there's insufficient capacity.
+    /// The caller should use [`reserve`] or [`try_reserve`] to ensure that there is enough capacity.
+    ///
+    /// [`push`]: Vec::push
+    /// [`reserve`]: Vec::reserve
+    /// [`try_reserve`]: Vec::try_reserve
+    ///
+    /// # Examples
+    ///
+    /// A manual, panic-free alternative to [`FromIterator`]:
+    ///
+    /// ```
+    /// #![feature(vec_push_within_capacity)]
+    ///
+    /// use std::collections::TryReserveError;
+    /// fn from_iter_fallible<T>(iter: impl Iterator<Item=T>) -> Result<Vec<T>, TryReserveError> {
+    ///     let mut vec = Vec::new();
+    ///     for value in iter {
+    ///         if let Err(value) = vec.push_within_capacity(value) {
+    ///             vec.try_reserve(1)?;
+    ///             // this cannot fail, the previous line either returned or added at least 1 free slot
+    ///             let _ = vec.push_within_capacity(value);
+    ///         }
+    ///     }
+    ///     Ok(vec)
+    /// }
+    /// assert_eq!(from_iter_fallible(0..100), Ok(Vec::from_iter(0..100)));
+    /// ```
+    #[inline]
+    #[unstable(feature = "vec_push_within_capacity", issue = "100486")]
+    pub fn push_within_capacity(&mut self, value: T) -> Result<(), T> {
+        if self.len == self.buf.capacity() {
+            return Err(value);
+        }
+        unsafe {
+            let end = self.as_mut_ptr().add(self.len);
+            ptr::write(end, value);
+            self.len += 1;
+        }
+        Ok(())
+    }
+
     /// Removes the last element from a vector and returns it, or [`None`] if it
     /// is empty.
     ///
@@ -1886,7 +2064,7 @@ impl<T, A: Allocator> Vec<T, A> {
     ///
     /// # Panics
     ///
-    /// Panics if the number of elements in the vector overflows a `usize`.
+    /// Panics if the new capacity exceeds `isize::MAX` bytes.
     ///
     /// # Examples
     ///
@@ -1980,9 +2158,7 @@ impl<T, A: Allocator> Vec<T, A> {
         unsafe {
             // set self.vec length's to start, to be safe in case Drain is leaked
             self.set_len(start);
-            // Use the borrow in the IterMut to indicate borrowing behavior of the
-            // whole Drain iterator (like &mut T).
-            let range_slice = slice::from_raw_parts_mut(self.as_mut_ptr().add(start), end - start);
+            let range_slice = slice::from_raw_parts(self.as_ptr().add(start), end - start);
             Drain {
                 tail_start: end,
                 tail_len: len - end,
@@ -2145,7 +2321,7 @@ impl<T, A: Allocator> Vec<T, A> {
     {
         let len = self.len();
         if new_len > len {
-            self.extend_with(new_len - len, ExtendFunc(f));
+            self.extend_trusted(iter::repeat_with(f).take(new_len - len));
         } else {
             self.truncate(new_len);
         }
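Switching `resize_with` to the new `extend_trusted` path does not change its public behaviour; a brief sketch (not part of the patch):

    let mut v = vec![1, 2, 3];
    let mut next = 3;
    v.resize_with(5, || { next += 1; next });   // grows by calling the closure twice
    assert_eq!(v, [1, 2, 3, 4, 5]);
    v.resize_with(2, Default::default);         // shrinking just truncates
    assert_eq!(v, [1, 2]);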
@@ -2174,7 +2350,6 @@ impl<T, A: Allocator> Vec<T, A> {
     /// static_ref[0] += 1;
     /// assert_eq!(static_ref, &[2, 2, 3]);
     /// ```
-    #[cfg(not(no_global_oom_handling))]
     #[stable(feature = "vec_leak", since = "1.47.0")]
     #[inline]
     pub fn leak<'a>(self) -> &'a mut [T]
@@ -2469,7 +2644,7 @@ impl<T: Clone, A: Allocator> Vec<T, A> {
         self.reserve(range.len());
 
         // SAFETY:
-        // - `slice::range` guarantees  that the given range is valid for indexing self
+        // - `slice::range` guarantees that the given range is valid for indexing self
         unsafe {
             self.spec_extend_from_within(range);
         }
@@ -2501,7 +2676,7 @@ impl<T, A: Allocator, const N: usize> Vec<[T; N], A> {
     #[unstable(feature = "slice_flatten", issue = "95629")]
     pub fn into_flattened(self) -> Vec<T, A> {
         let (ptr, len, cap, alloc) = self.into_raw_parts_with_alloc();
-        let (new_len, new_cap) = if mem::size_of::<T>() == 0 {
+        let (new_len, new_cap) = if T::IS_ZST {
             (len.checked_mul(N).expect("vec len overflow"), usize::MAX)
         } else {
             // SAFETY:
@@ -2537,16 +2712,6 @@ impl<T: Clone> ExtendWith<T> for ExtendElement<T> {
     }
 }
 
-struct ExtendFunc<F>(F);
-impl<T, F: FnMut() -> T> ExtendWith<T> for ExtendFunc<F> {
-    fn next(&mut self) -> T {
-        (self.0)()
-    }
-    fn last(mut self) -> T {
-        (self.0)()
-    }
-}
-
 impl<T, A: Allocator> Vec<T, A> {
     #[cfg(not(no_global_oom_handling))]
     /// Extend the vector by `n` values, using the given generator.
@@ -2563,7 +2728,7 @@ impl<T, A: Allocator> Vec<T, A> {
             // Write all elements except the last one
             for _ in 1..n {
                 ptr::write(ptr, value.next());
-                ptr = ptr.offset(1);
+                ptr = ptr.add(1);
                 // Increment the length in every step in case next() panics
                 local_len.increment_len(1);
             }
@@ -2592,7 +2757,7 @@ impl<T, A: Allocator> Vec<T, A> {
             // Write all elements except the last one
             for _ in 1..n {
                 ptr::write(ptr, value.next());
-                ptr = ptr.offset(1);
+                ptr = ptr.add(1);
                 // Increment the length in every step in case next() panics
                 local_len.increment_len(1);
             }
@@ -2664,7 +2829,7 @@ impl<T: Clone, A: Allocator> ExtendFromWithinSpec for Vec<T, A> {
         let (this, spare, len) = unsafe { self.split_at_spare_mut_with_len() };
 
         // SAFETY:
-        // - caller guaratees that src is a valid index
+        // - caller guarantees that src is a valid index
         let to_clone = unsafe { this.get_unchecked(src) };
 
         iter::zip(to_clone, spare)
@@ -2683,13 +2848,13 @@ impl<T: Copy, A: Allocator> ExtendFromWithinSpec for Vec<T, A> {
             let (init, spare) = self.split_at_spare_mut();
 
             // SAFETY:
-            // - caller guaratees that `src` is a valid index
+            // - caller guarantees that `src` is a valid index
             let source = unsafe { init.get_unchecked(src) };
 
             // SAFETY:
             // - Both pointers are created from unique slice references (`&mut [_]`)
             //   so they are valid and do not overlap.
-            // - Elements are :Copy so it's OK to to copy them, without doing
+            // - Elements are :Copy so it's OK to copy them, without doing
             //   anything with the original values
             // - `count` is equal to the len of `source`, so source is valid for
             //   `count` reads
@@ -2712,6 +2877,7 @@ impl<T: Copy, A: Allocator> ExtendFromWithinSpec for Vec<T, A> {
 impl<T, A: Allocator> ops::Deref for Vec<T, A> {
     type Target = [T];
 
+    #[inline]
     fn deref(&self) -> &[T] {
         unsafe { slice::from_raw_parts(self.as_ptr(), self.len) }
     }
@@ -2719,6 +2885,7 @@ impl<T, A: Allocator> ops::Deref for Vec<T, A> {
 
 #[stable(feature = "rust1", since = "1.0.0")]
 impl<T, A: Allocator> ops::DerefMut for Vec<T, A> {
+    #[inline]
     fn deref_mut(&mut self) -> &mut [T] {
         unsafe { slice::from_raw_parts_mut(self.as_mut_ptr(), self.len) }
     }
@@ -2764,7 +2931,7 @@ impl<T: Clone, A: Allocator + Clone> Clone for Vec<T, A> {
 
     // HACK(japaric): with cfg(test) the inherent `[T]::to_vec` method, which is
     // required for this method definition, is not available. Instead use the
-    // `slice::to_vec`  function which is only available with cfg(test)
+    // `slice::to_vec` function which is only available with cfg(test)
     // NB see the slice::hack module in slice.rs for more information
     #[cfg(test)]
     fn clone(&self) -> Self {
@@ -2845,19 +3012,22 @@ impl<T, A: Allocator> IntoIterator for Vec<T, A> {
     ///
     /// ```
     /// let v = vec!["a".to_string(), "b".to_string()];
-    /// for s in v.into_iter() {
-    ///     // s has type String, not &String
-    ///     println!("{s}");
-    /// }
+    /// let mut v_iter = v.into_iter();
+    ///
+    /// let first_element: Option<String> = v_iter.next();
+    ///
+    /// assert_eq!(first_element, Some("a".to_string()));
+    /// assert_eq!(v_iter.next(), Some("b".to_string()));
+    /// assert_eq!(v_iter.next(), None);
     /// ```
     #[inline]
-    fn into_iter(self) -> IntoIter<T, A> {
+    fn into_iter(self) -> Self::IntoIter {
         unsafe {
             let mut me = ManuallyDrop::new(self);
             let alloc = ManuallyDrop::new(ptr::read(me.allocator()));
             let begin = me.as_mut_ptr();
-            let end = if mem::size_of::<T>() == 0 {
-                arith_offset(begin as *const i8, me.len() as isize) as *const T
+            let end = if T::IS_ZST {
+                begin.wrapping_byte_add(me.len())
             } else {
                 begin.add(me.len()) as *const T
             };
@@ -2879,7 +3049,7 @@ impl<'a, T, A: Allocator> IntoIterator for &'a Vec<T, A> {
     type Item = &'a T;
     type IntoIter = slice::Iter<'a, T>;
 
-    fn into_iter(self) -> slice::Iter<'a, T> {
+    fn into_iter(self) -> Self::IntoIter {
         self.iter()
     }
 }
@@ -2889,7 +3059,7 @@ impl<'a, T, A: Allocator> IntoIterator for &'a mut Vec<T, A> {
     type Item = &'a mut T;
     type IntoIter = slice::IterMut<'a, T>;
 
-    fn into_iter(self) -> slice::IterMut<'a, T> {
+    fn into_iter(self) -> Self::IntoIter {
         self.iter_mut()
     }
 }
@@ -2969,6 +3139,69 @@ impl<T, A: Allocator> Vec<T, A> {
         Ok(())
     }
 
+    // specific extend for `TrustedLen` iterators, called both by the specializations
+    // and internal places where resolving specialization makes compilation slower
+    #[cfg(not(no_global_oom_handling))]
+    fn extend_trusted(&mut self, iterator: impl iter::TrustedLen<Item = T>) {
+        let (low, high) = iterator.size_hint();
+        if let Some(additional) = high {
+            debug_assert_eq!(
+                low,
+                additional,
+                "TrustedLen iterator's size hint is not exact: {:?}",
+                (low, high)
+            );
+            self.reserve(additional);
+            unsafe {
+                let ptr = self.as_mut_ptr();
+                let mut local_len = SetLenOnDrop::new(&mut self.len);
+                iterator.for_each(move |element| {
+                    ptr::write(ptr.add(local_len.current_len()), element);
+                    // Since the loop executes user code which can panic we have to update
+                    // the length every step to correctly drop what we've written.
+                    // NB can't overflow since we would have had to alloc the address space
+                    local_len.increment_len(1);
+                });
+            }
+        } else {
+            // Per TrustedLen contract a `None` upper bound means that the iterator length
+            // truly exceeds usize::MAX, which would eventually lead to a capacity overflow anyway.
+            // Since the other branch already panics eagerly (via `reserve()`) we do the same here.
+            // This avoids additional codegen for a fallback code path which would eventually
+            // panic anyway.
+            panic!("capacity overflow");
+        }
+    }
+
+    // specific extend for `TrustedLen` iterators, called both by the specializations
+    // and internal places where resolving specialization makes compilation slower
+    fn try_extend_trusted(&mut self, iterator: impl iter::TrustedLen<Item = T>) -> Result<(), TryReserveError> {
+        let (low, high) = iterator.size_hint();
+        if let Some(additional) = high {
+            debug_assert_eq!(
+                low,
+                additional,
+                "TrustedLen iterator's size hint is not exact: {:?}",
+                (low, high)
+            );
+            self.try_reserve(additional)?;
+            unsafe {
+                let ptr = self.as_mut_ptr();
+                let mut local_len = SetLenOnDrop::new(&mut self.len);
+                iterator.for_each(move |element| {
+                    ptr::write(ptr.add(local_len.current_len()), element);
+                    // Since the loop executes user code which can panic we have to update
+                    // the length every step to correctly drop what we've written.
+                    // NB can't overflow since we would have had to alloc the address space
+                    local_len.increment_len(1);
+                });
+            }
+            Ok(())
+        } else {
+            Err(TryReserveErrorKind::CapacityOverflow.into())
+        }
+    }
+
     /// Creates a splicing iterator that replaces the specified range in the vector
     /// with the given `replace_with` iterator and yields the removed items.
     /// `replace_with` does not need to be the same length as `range`.
@@ -3135,6 +3368,8 @@ unsafe impl<#[may_dangle] T, A: Allocator> Drop for Vec<T, A> {
 #[rustc_const_unstable(feature = "const_default_impls", issue = "87864")]
 impl<T> const Default for Vec<T> {
     /// Creates an empty `Vec<T>`.
+    ///
+    /// The vector will not allocate until elements are pushed onto it.
     fn default() -> Vec<T> {
         Vec::new()
     }
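A tiny sketch (not part of the patch) of the guarantee spelled out in the new doc line:

    let v: Vec<i32> = Vec::default();
    assert_eq!(v.capacity(), 0);    // `default()` is `Vec::new()` and does not allocate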
@@ -3227,12 +3462,15 @@ impl<T, const N: usize> From<[T; N]> for Vec<T> {
     /// ```
     #[cfg(not(test))]
     fn from(s: [T; N]) -> Vec<T> {
-        <[T]>::into_vec(box s)
+        <[T]>::into_vec(
+            #[rustc_box]
+            Box::new(s),
+        )
     }
 
     #[cfg(test)]
     fn from(s: [T; N]) -> Vec<T> {
-        crate::slice::into_vec(box s)
+        crate::slice::into_vec(Box::new(s))
     }
 }
 
@@ -3261,7 +3499,7 @@ where
     }
 }
 
-// note: test pulls in libstd, which causes errors here
+// note: test pulls in std, which causes errors here
 #[cfg(not(test))]
 #[stable(feature = "vec_from_box", since = "1.18.0")]
 impl<T, A: Allocator> From<Box<[T], A>> for Vec<T, A> {
@@ -3279,7 +3517,7 @@ impl<T, A: Allocator> From<Box<[T], A>> for Vec<T, A> {
     }
 }
 
-// note: test pulls in libstd, which causes errors here
+// note: test pulls in std, which causes errors here
 #[cfg(not(no_global_oom_handling))]
 #[cfg(not(test))]
 #[stable(feature = "box_from_vec", since = "1.20.0")]
@@ -3294,6 +3532,14 @@ impl<T, A: Allocator> From<Vec<T, A>> for Box<[T], A> {
     /// ```
     /// assert_eq!(Box::from(vec![1, 2, 3]), vec![1, 2, 3].into_boxed_slice());
     /// ```
+    ///
+    /// Any excess capacity is removed:
+    /// ```
+    /// let mut vec = Vec::with_capacity(10);
+    /// vec.extend([1, 2, 3]);
+    ///
+    /// assert_eq!(Box::from(vec), vec![1, 2, 3].into_boxed_slice());
+    /// ```
     fn from(v: Vec<T, A>) -> Self {
         v.into_boxed_slice()
     }
index 448bf50..d3c7297 100644 (file)
@@ -20,6 +20,11 @@ impl<'a> SetLenOnDrop<'a> {
     pub(super) fn increment_len(&mut self, increment: usize) {
         self.local_len += increment;
     }
+
+    #[inline]
+    pub(super) fn current_len(&self) -> usize {
+        self.local_len
+    }
 }
 
 impl Drop for SetLenOnDrop<'_> {
index 5ce2d00..a6a7352 100644 (file)
@@ -1,12 +1,11 @@
 // SPDX-License-Identifier: Apache-2.0 OR MIT
 
 use crate::alloc::Allocator;
-use crate::collections::{TryReserveError, TryReserveErrorKind};
+use crate::collections::TryReserveError;
 use core::iter::TrustedLen;
-use core::ptr::{self};
 use core::slice::{self};
 
-use super::{IntoIter, SetLenOnDrop, Vec};
+use super::{IntoIter, Vec};
 
 // Specialization trait used for Vec::extend
 #[cfg(not(no_global_oom_handling))]
@@ -44,36 +43,7 @@ where
     I: TrustedLen<Item = T>,
 {
     default fn spec_extend(&mut self, iterator: I) {
-        // This is the case for a TrustedLen iterator.
-        let (low, high) = iterator.size_hint();
-        if let Some(additional) = high {
-            debug_assert_eq!(
-                low,
-                additional,
-                "TrustedLen iterator's size hint is not exact: {:?}",
-                (low, high)
-            );
-            self.reserve(additional);
-            unsafe {
-                let mut ptr = self.as_mut_ptr().add(self.len());
-                let mut local_len = SetLenOnDrop::new(&mut self.len);
-                iterator.for_each(move |element| {
-                    ptr::write(ptr, element);
-                    ptr = ptr.offset(1);
-                    // Since the loop executes user code which can panic we have to bump the pointer
-                    // after each step.
-                    // NB can't overflow since we would have had to alloc the address space
-                    local_len.increment_len(1);
-                });
-            }
-        } else {
-            // Per TrustedLen contract a `None` upper bound means that the iterator length
-            // truly exceeds usize::MAX, which would eventually lead to a capacity overflow anyway.
-            // Since the other branch already panics eagerly (via `reserve()`) we do the same here.
-            // This avoids additional codegen for a fallback code path which would eventually
-            // panic anyway.
-            panic!("capacity overflow");
-        }
+        self.extend_trusted(iterator)
     }
 }
 
@@ -82,32 +52,7 @@ where
     I: TrustedLen<Item = T>,
 {
     default fn try_spec_extend(&mut self, iterator: I) -> Result<(), TryReserveError> {
-        // This is the case for a TrustedLen iterator.
-        let (low, high) = iterator.size_hint();
-        if let Some(additional) = high {
-            debug_assert_eq!(
-                low,
-                additional,
-                "TrustedLen iterator's size hint is not exact: {:?}",
-                (low, high)
-            );
-            self.try_reserve(additional)?;
-            unsafe {
-                let mut ptr = self.as_mut_ptr().add(self.len());
-                let mut local_len = SetLenOnDrop::new(&mut self.len);
-                iterator.for_each(move |element| {
-                    ptr::write(ptr, element);
-                    ptr = ptr.offset(1);
-                    // Since the loop executes user code which can panic we have to bump the pointer
-                    // after each step.
-                    // NB can't overflow since we would have had to alloc the address space
-                    local_len.increment_len(1);
-                });
-            }
-            Ok(())
-        } else {
-            Err(TryReserveErrorKind::CapacityOverflow.into())
-        }
+        self.try_extend_trusted(iterator)
     }
 }
 
index 50e7a76..3e601ce 100644 (file)
@@ -6,6 +6,7 @@
  * Sorted alphabetically.
  */
 
+#include <linux/errname.h>
 #include <linux/slab.h>
 #include <linux/refcount.h>
 #include <linux/wait.h>
index 7b24645..9bcbea0 100644 (file)
@@ -9,7 +9,6 @@
 //! using this crate.
 
 #![no_std]
-#![feature(core_ffi_c)]
 // See <https://github.com/rust-lang/rust-bindgen/issues/1651>.
 #![cfg_attr(test, allow(deref_nullptr))]
 #![cfg_attr(test, allow(unaligned_references))]
index 81e8026..bb594da 100644 (file)
@@ -21,6 +21,7 @@
 #include <linux/bug.h>
 #include <linux/build_bug.h>
 #include <linux/err.h>
+#include <linux/errname.h>
 #include <linux/refcount.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
@@ -110,6 +111,12 @@ long rust_helper_PTR_ERR(__force const void *ptr)
 }
 EXPORT_SYMBOL_GPL(rust_helper_PTR_ERR);
 
+const char *rust_helper_errname(int err)
+{
+       return errname(err);
+}
+EXPORT_SYMBOL_GPL(rust_helper_errname);
+
 struct task_struct *rust_helper_get_current(void)
 {
        return current;
index 6595423..9e37120 100644 (file)
@@ -67,6 +67,8 @@ macro_rules! build_error {
 ///     assert!(n > 1); // Run-time check
 /// }
 /// ```
+///
+/// [`static_assert!`]: crate::static_assert!
 #[macro_export]
 macro_rules! build_assert {
     ($cond:expr $(,)?) => {{
index 5f4114b..05fcab6 100644 (file)
@@ -4,16 +4,20 @@
 //!
 //! C header: [`include/uapi/asm-generic/errno-base.h`](../../../include/uapi/asm-generic/errno-base.h)
 
+use crate::str::CStr;
+
 use alloc::{
     alloc::{AllocError, LayoutError},
     collections::TryReserveError,
 };
 
 use core::convert::From;
+use core::fmt;
 use core::num::TryFromIntError;
 use core::str::Utf8Error;
 
 /// Contains the C-compatible error codes.
+#[rustfmt::skip]
 pub mod code {
     macro_rules! declare_err {
         ($err:tt $(,)? $($doc:expr),+) => {
@@ -58,6 +62,25 @@ pub mod code {
     declare_err!(EPIPE, "Broken pipe.");
     declare_err!(EDOM, "Math argument out of domain of func.");
     declare_err!(ERANGE, "Math result not representable.");
+    declare_err!(ERESTARTSYS, "Restart the system call.");
+    declare_err!(ERESTARTNOINTR, "System call was interrupted by a signal and will be restarted.");
+    declare_err!(ERESTARTNOHAND, "Restart if no handler.");
+    declare_err!(ENOIOCTLCMD, "No ioctl command.");
+    declare_err!(ERESTART_RESTARTBLOCK, "Restart by calling sys_restart_syscall.");
+    declare_err!(EPROBE_DEFER, "Driver requests probe retry.");
+    declare_err!(EOPENSTALE, "Open found a stale dentry.");
+    declare_err!(ENOPARAM, "Parameter not supported.");
+    declare_err!(EBADHANDLE, "Illegal NFS file handle.");
+    declare_err!(ENOTSYNC, "Update synchronization mismatch.");
+    declare_err!(EBADCOOKIE, "Cookie is stale.");
+    declare_err!(ENOTSUPP, "Operation is not supported.");
+    declare_err!(ETOOSMALL, "Buffer or request is too small.");
+    declare_err!(ESERVERFAULT, "An untranslatable error occurred.");
+    declare_err!(EBADTYPE, "Type not supported by server.");
+    declare_err!(EJUKEBOX, "Request initiated, but will not complete before timeout.");
+    declare_err!(EIOCBQUEUED, "iocb queued, will get completion event.");
+    declare_err!(ERECALLCONFLICT, "Conflict with recalled state.");
+    declare_err!(ENOGRACE, "NFS file lock reclaim refused.");
 }
 
 /// Generic integer kernel error.
@@ -113,6 +136,42 @@ impl Error {
         // SAFETY: self.0 is a valid error due to its invariant.
         unsafe { bindings::ERR_PTR(self.0.into()) as *mut _ }
     }
+
+    /// Returns a string representing the error, if one exists.
+    #[cfg(not(testlib))]
+    pub fn name(&self) -> Option<&'static CStr> {
+        // SAFETY: Just an FFI call, there are no extra safety requirements.
+        let ptr = unsafe { bindings::errname(-self.0) };
+        if ptr.is_null() {
+            None
+        } else {
+            // SAFETY: The string returned by `errname` is static and `NUL`-terminated.
+            Some(unsafe { CStr::from_char_ptr(ptr) })
+        }
+    }
+
+    /// Returns a string representing the error, if one exists.
+    ///
+    /// When `testlib` is configured, this always returns `None` to avoid the dependency on a
+    /// kernel function so that tests that use this (e.g., by calling [`Result::unwrap`]) can still
+    /// run in userspace.
+    #[cfg(testlib)]
+    pub fn name(&self) -> Option<&'static CStr> {
+        None
+    }
+}
+
+impl fmt::Debug for Error {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        match self.name() {
+            // Print out number if no name can be found.
+            None => f.debug_tuple("Error").field(&-self.0).finish(),
+            // SAFETY: These strings are ASCII-only.
+            Some(name) => f
+                .debug_tuple(unsafe { core::str::from_utf8_unchecked(name) })
+                .finish(),
+        }
+    }
 }
 
 impl From<AllocError> for Error {
@@ -177,7 +236,7 @@ impl From<core::convert::Infallible> for Error {
 /// Note that even if a function does not return anything when it succeeds,
 /// it should still be modeled as returning a `Result` rather than
 /// just an [`Error`].
-pub type Result<T = ()> = core::result::Result<T, Error>;
+pub type Result<T = (), E = Error> = core::result::Result<T, E>;
 
 /// Converts an integer as returned by a C kernel function to an error if it's negative, and
 /// `Ok(())` otherwise.
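A hedged sketch (not part of the patch) of how the new `Error::name()` and the `fmt::Debug` impl added above are expected to look from kernel-module code; it assumes the usual `kernel` crate prelude and the `pr_info!` macro:

    // `code::EINVAL` is one of the constants declared through `declare_err!`.
    let err = kernel::error::code::EINVAL;
    // `err.name()` now returns the symbolic name as a `&'static CStr`; the new
    // Debug impl prints that name, falling back to the raw errno for unknown codes.
    pr_info!("request failed: {:?}\n", err);   // -> "request failed: EINVAL"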
index 4ebfb08..b4332a4 100644 (file)
 //! [`Opaque`]: kernel::types::Opaque
 //! [`Opaque::ffi_init`]: kernel::types::Opaque::ffi_init
 //! [`pin_data`]: ::macros::pin_data
+//! [`pin_init!`]: crate::pin_init!
 
 use crate::{
     error::{self, Error},
@@ -255,6 +256,8 @@ pub mod macros;
 /// A normal `let` binding with optional type annotation. The expression is expected to implement
 /// [`PinInit`]/[`Init`] with the error type [`Infallible`]. If you want to use a different error
 /// type, then use [`stack_try_pin_init!`].
+///
+/// [`stack_try_pin_init!`]: crate::stack_try_pin_init!
 #[macro_export]
 macro_rules! stack_pin_init {
     (let $var:ident $(: $t:ty)? = $val:expr) => {
@@ -804,6 +807,8 @@ macro_rules! try_pin_init {
 ///
 /// This initializer is for initializing data in-place that might later be moved. If you want to
 /// pin-initialize, use [`pin_init!`].
+///
+/// [`try_init!`]: crate::try_init!
 // For a detailed example of how this macro works, see the module documentation of the hidden
 // module `__internal` inside of `init/__internal.rs`.
 #[macro_export]
index 541cfad..00aa4e9 100644 (file)
@@ -16,8 +16,9 @@
 //!
 //! We will look at the following example:
 //!
-//! ```rust
+//! ```rust,ignore
 //! # use kernel::init::*;
+//! # use core::pin::Pin;
 //! #[pin_data]
 //! #[repr(C)]
 //! struct Bar<T> {
 //!
 //! Here is the definition of `Bar` from our example:
 //!
-//! ```rust
+//! ```rust,ignore
 //! # use kernel::init::*;
 //! #[pin_data]
 //! #[repr(C)]
 //! struct Bar<T> {
+//!     #[pin]
 //!     t: T,
 //!     pub x: usize,
 //! }
@@ -83,7 +85,7 @@
 //!
 //! This expands to the following code:
 //!
-//! ```rust
+//! ```rust,ignore
 //! // Firstly the normal definition of the struct, attributes are preserved:
 //! #[repr(C)]
 //! struct Bar<T> {
 //!         unsafe fn t<E>(
 //!             self,
 //!             slot: *mut T,
-//!             init: impl ::kernel::init::Init<T, E>,
+//!             // Since `t` is `#[pin]`, this is `PinInit`.
+//!             init: impl ::kernel::init::PinInit<T, E>,
 //!         ) -> ::core::result::Result<(), E> {
-//!             unsafe { ::kernel::init::Init::__init(init, slot) }
+//!             unsafe { ::kernel::init::PinInit::__pinned_init(init, slot) }
 //!         }
 //!         pub unsafe fn x<E>(
 //!             self,
 //!             slot: *mut usize,
+//!             // Since `x` is not `#[pin]`, this is `Init`.
 //!             init: impl ::kernel::init::Init<usize, E>,
 //!         ) -> ::core::result::Result<(), E> {
 //!             unsafe { ::kernel::init::Init::__init(init, slot) }
 //!         }
 //!     }
 //!     // Implement the internal `HasPinData` trait that associates `Bar` with the pin-data struct
-//!     // that we constructed beforehand.
+//!     // that we constructed above.
 //!     unsafe impl<T> ::kernel::init::__internal::HasPinData for Bar<T> {
 //!         type PinData = __ThePinData<T>;
 //!         unsafe fn __pin_data() -> Self::PinData {
 //!     struct __Unpin<'__pin, T> {
 //!         __phantom_pin: ::core::marker::PhantomData<fn(&'__pin ()) -> &'__pin ()>,
 //!         __phantom: ::core::marker::PhantomData<fn(Bar<T>) -> Bar<T>>,
+//!         // Our only `#[pin]` field is `t`.
+//!         t: T,
 //!     }
 //!     #[doc(hidden)]
 //!     impl<'__pin, T>
 //!
 //! Here is the impl on `Bar` defining the new function:
 //!
-//! ```rust
+//! ```rust,ignore
 //! impl<T> Bar<T> {
 //!     fn new(t: T) -> impl PinInit<Self> {
 //!         pin_init!(Self { t, x: 0 })
 //!
 //! This expands to the following code:
 //!
-//! ```rust
+//! ```rust,ignore
 //! impl<T> Bar<T> {
 //!     fn new(t: T) -> impl PinInit<Self> {
 //!         {
 //!                     // that will refer to this struct instead of the one defined above.
 //!                     struct __InitOk;
 //!                     // This is the expansion of `t,`, which is syntactic sugar for `t: t,`.
-//!                     unsafe { ::core::ptr::write(&raw mut (*slot).t, t) };
+//!                     unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).t), t) };
 //!                     // Since initialization could fail later (not in this case, since the error
-//!                     // type is `Infallible`) we will need to drop this field if it fails. This
-//!                     // `DropGuard` will drop the field when it gets dropped and has not yet
-//!                     // been forgotten. We make a reference to it, so users cannot `mem::forget`
-//!                     // it from the initializer, since the name is the same as the field.
+//!                     // type is `Infallible`) we will need to drop this field if there is an
+//!                     // error later. This `DropGuard` will drop the field when it gets dropped
+//!                     // and has not yet been forgotten. We make a reference to it, so users
+//!                     // cannot `mem::forget` it from the initializer, since the name is the same
+//!                     // as the field (including hygiene).
 //!                     let t = &unsafe {
-//!                         ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).t)
+//!                         ::kernel::init::__internal::DropGuard::new(
+//!                             ::core::addr_of_mut!((*slot).t),
+//!                         )
 //!                     };
 //!                     // Expansion of `x: 0,`:
 //!                     // Since this can be an arbitrary expression we cannot place it inside of
 //!                     // the `unsafe` block, so we bind it here.
 //!                     let x = 0;
-//!                     unsafe { ::core::ptr::write(&raw mut (*slot).x, x) };
+//!                     unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).x), x) };
+//!                     // We again create a `DropGuard`.
 //!                     let x = &unsafe {
-//!                         ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).x)
+//!                         ::kernel::init::__internal::DropGuard::new(
+//!                             ::core::addr_of_mut!((*slot).x),
+//!                         )
 //!                     };
 //!
-//!                     // Here we use the type checker to ensuer that every field has been
+//!                     // Here we use the type checker to ensure that every field has been
 //!                     // initialized exactly once, since this is `if false` it will never get
 //!                     // executed, but still type-checked.
 //!                     // Additionally we abuse `slot` to automatically infer the correct type for
 //!                         };
 //!                     }
 //!                     // Since initialization has successfully completed, we can now forget the
-//!                     // guards.
+//!                     // guards. This is not `mem::forget`, since we only have `&DropGuard`.
 //!                     unsafe { ::kernel::init::__internal::DropGuard::forget(t) };
 //!                     unsafe { ::kernel::init::__internal::DropGuard::forget(x) };
 //!                 }
 //!                 // `__InitOk` that we need to return.
 //!                 Ok(__InitOk)
 //!             });
-//!             // Change the return type of the closure.
+//!             // Change the return type from `__InitOk` to `()`.
 //!             let init = move |slot| -> ::core::result::Result<(), ::core::convert::Infallible> {
 //!                 init(slot).map(|__InitOk| ())
 //!             };
 //! Since we already took a look at `#[pin_data]` on `Bar`, this section will only explain the
 //! differences/new things in the expansion of the `Foo` definition:
 //!
-//! ```rust
+//! ```rust,ignore
 //! #[pin_data(PinnedDrop)]
 //! struct Foo {
 //!     a: usize,
 //!
 //! This expands to the following code:
 //!
-//! ```rust
+//! ```rust,ignore
 //! struct Foo {
 //!     a: usize,
 //!     b: Bar<u32>,
 //!         unsafe fn b<E>(
 //!             self,
 //!             slot: *mut Bar<u32>,
-//!             // Note that this is `PinInit` instead of `Init`, this is because `b` is
-//!             // structurally pinned, as marked by the `#[pin]` attribute.
 //!             init: impl ::kernel::init::PinInit<Bar<u32>, E>,
 //!         ) -> ::core::result::Result<(), E> {
 //!             unsafe { ::kernel::init::PinInit::__pinned_init(init, slot) }
 //!     struct __Unpin<'__pin> {
 //!         __phantom_pin: ::core::marker::PhantomData<fn(&'__pin ()) -> &'__pin ()>,
 //!         __phantom: ::core::marker::PhantomData<fn(Foo) -> Foo>,
-//!         // Since this field is `#[pin]`, it is listed here.
 //!         b: Bar<u32>,
 //!     }
 //!     #[doc(hidden)]
 //!     impl<'__pin> ::core::marker::Unpin for Foo where __Unpin<'__pin>: ::core::marker::Unpin {}
 //!     // Since we specified `PinnedDrop` as the argument to `#[pin_data]`, we expect `Foo` to
 //!     // implement `PinnedDrop`. Thus we do not need to prevent `Drop` implementations like
-//!     // before, instead we implement it here and delegate to `PinnedDrop`.
+//!     // before, instead we implement `Drop` here and delegate to `PinnedDrop`.
 //!     impl ::core::ops::Drop for Foo {
 //!         fn drop(&mut self) {
 //!             // Since we are getting dropped, no one else has a reference to `self` and thus we
 //!
 //! Here is the `PinnedDrop` impl for `Foo`:
 //!
-//! ```rust
+//! ```rust,ignore
 //! #[pinned_drop]
 //! impl PinnedDrop for Foo {
 //!     fn drop(self: Pin<&mut Self>) {
 //!
 //! This expands to the following code:
 //!
-//! ```rust
+//! ```rust,ignore
 //! // `unsafe`, full path and the token parameter are added, everything else stays the same.
 //! unsafe impl ::kernel::init::PinnedDrop for Foo {
 //!     fn drop(self: Pin<&mut Self>, _: ::kernel::init::__internal::OnlyCallFromDrop) {
 //!
 //! ## `pin_init!` on `Foo`
 //!
-//! Since we already took a look at `pin_init!` on `Bar`, this section will only explain the
-//! differences/new things in the expansion of `pin_init!` on `Foo`:
+//! Since we already took a look at `pin_init!` on `Bar`, this section will only show the expansion
+//! of `pin_init!` on `Foo`:
 //!
-//! ```rust
+//! ```rust,ignore
 //! let a = 42;
 //! let initializer = pin_init!(Foo {
 //!     a,
 //!
 //! This expands to the following code:
 //!
-//! ```rust
+//! ```rust,ignore
 //! let a = 42;
 //! let initializer = {
 //!     struct __InitOk;
 //!     >(data, move |slot| {
 //!         {
 //!             struct __InitOk;
-//!             unsafe { ::core::ptr::write(&raw mut (*slot).a, a) };
-//!             let a = &unsafe { ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).a) };
+//!             unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).a), a) };
+//!             let a = &unsafe {
+//!                 ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).a))
+//!             };
 //!             let b = Bar::new(36);
-//!             // Here we use `data` to access the correct field and require that `b` is of type
-//!             // `PinInit<Bar<u32>, Infallible>`.
-//!             unsafe { data.b(&raw mut (*slot).b, b)? };
-//!             let b = &unsafe { ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).b) };
+//!             unsafe { data.b(::core::addr_of_mut!((*slot).b), b)? };
+//!             let b = &unsafe {
+//!                 ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).b))
+//!             };
 //!
 //!             #[allow(unreachable_code, clippy::diverging_sub_expression)]
 //!             if false {
index 676995d..85b2612 100644 (file)
 #![no_std]
 #![feature(allocator_api)]
 #![feature(coerce_unsized)]
-#![feature(core_ffi_c)]
 #![feature(dispatch_from_dyn)]
-#![feature(explicit_generic_args_with_impl_trait)]
-#![feature(generic_associated_types)]
 #![feature(new_uninit)]
-#![feature(pin_macro)]
 #![feature(receiver_trait)]
 #![feature(unsize)]
 
index b3e68b2..388d6a5 100644 (file)
 /// [`std::dbg`]: https://doc.rust-lang.org/std/macro.dbg.html
 /// [`eprintln`]: https://doc.rust-lang.org/std/macro.eprintln.html
 /// [`printk`]: https://www.kernel.org/doc/html/latest/core-api/printk-basics.html
+/// [`pr_info`]: crate::pr_info!
+/// [`pr_debug`]: crate::pr_debug!
 #[macro_export]
 macro_rules! dbg {
     // NOTE: We cannot use `concat!` to make a static string as a format argument
index cd3d2a6..c9dd3bf 100644 (file)
@@ -2,6 +2,7 @@
 
 //! String representations.
 
+use alloc::alloc::AllocError;
 use alloc::vec::Vec;
 use core::fmt::{self, Write};
 use core::ops::{self, Deref, Index};
@@ -199,6 +200,12 @@ impl CStr {
     pub unsafe fn as_str_unchecked(&self) -> &str {
         unsafe { core::str::from_utf8_unchecked(self.as_bytes()) }
     }
+
+    /// Convert this [`CStr`] into a [`CString`] by allocating memory and
+    /// copying over the string data.
+    pub fn to_cstring(&self) -> Result<CString, AllocError> {
+        CString::try_from(self)
+    }
 }
 
 impl fmt::Display for CStr {
@@ -584,6 +591,21 @@ impl Deref for CString {
     }
 }
 
+impl<'a> TryFrom<&'a CStr> for CString {
+    type Error = AllocError;
+
+    fn try_from(cstr: &'a CStr) -> Result<CString, AllocError> {
+        let mut buf = Vec::new();
+
+        buf.try_extend_from_slice(cstr.as_bytes_with_nul())
+            .map_err(|_| AllocError)?;
+
+        // INVARIANT: The `CStr` and `CString` types have the same invariants for
+        // the string data, and we copied it over without changes.
+        Ok(CString { buf })
+    }
+}
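A sketch (not part of the patch) of the new owned-copy conversions, assuming kernel code with `c_str!` and the relevant `kernel::str` and `alloc` items in scope; the string literal is made up:

    let src: &CStr = c_str!("rust_sample");
    // Fallible copy into an owned, heap-allocated CString:
    let owned: Result<CString, AllocError> = src.to_cstring();
    // The same conversion via the new TryFrom impl:
    let owned2: Result<CString, AllocError> = CString::try_from(src);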
+
 /// A convenience alias for [`core::format_args`].
 #[macro_export]
 macro_rules! fmt {
index e6d2062..a89843c 100644 (file)
@@ -146,13 +146,15 @@ impl<T: ?Sized + Unsize<U>, U: ?Sized> core::ops::DispatchFromDyn<Arc<U>> for Ar
 
 // SAFETY: It is safe to send `Arc<T>` to another thread when the underlying `T` is `Sync` because
 // it effectively means sharing `&T` (which is safe because `T` is `Sync`); additionally, it needs
-// `T` to be `Send` because any thread that has an `Arc<T>` may ultimately access `T` directly, for
-// example, when the reference count reaches zero and `T` is dropped.
+// `T` to be `Send` because any thread that has an `Arc<T>` may ultimately access `T` using a
+// mutable reference when the reference count reaches zero and `T` is dropped.
 unsafe impl<T: ?Sized + Sync + Send> Send for Arc<T> {}
 
-// SAFETY: It is safe to send `&Arc<T>` to another thread when the underlying `T` is `Sync` for the
-// same reason as above. `T` needs to be `Send` as well because a thread can clone an `&Arc<T>`
-// into an `Arc<T>`, which may lead to `T` being accessed by the same reasoning as above.
+// SAFETY: It is safe to send `&Arc<T>` to another thread when the underlying `T` is `Sync`
+// because it effectively means sharing `&T` (which is safe because `T` is `Sync`); additionally,
+// it needs `T` to be `Send` because any thread that has a `&Arc<T>` may clone it and get an
+// `Arc<T>` on that thread, so the thread may ultimately access `T` using a mutable reference when
+// the reference count reaches zero and `T` is dropped.
 unsafe impl<T: ?Sized + Sync + Send> Sync for Arc<T> {}
 
 impl<T> Arc<T> {
@@ -185,7 +187,7 @@ impl<T> Arc<T> {
 
     /// Use the given initializer to in-place initialize a `T`.
     ///
-    /// This is equivalent to [`pin_init`], since an [`Arc`] is always pinned.
+    /// This is equivalent to [`Arc<T>::pin_init`], since an [`Arc`] is always pinned.
     #[inline]
     pub fn init<E>(init: impl Init<T, E>) -> error::Result<Self>
     where
@@ -221,6 +223,11 @@ impl<T: ?Sized> Arc<T> {
         // reference can be created.
         unsafe { ArcBorrow::new(self.ptr) }
     }
+
+    /// Compare whether two [`Arc`] pointers reference the same underlying object.
+    pub fn ptr_eq(this: &Self, other: &Self) -> bool {
+        core::ptr::eq(this.ptr.as_ptr(), other.ptr.as_ptr())
+    }
 }
 
 impl<T: 'static> ForeignOwnable for Arc<T> {
@@ -259,6 +266,12 @@ impl<T: ?Sized> Deref for Arc<T> {
     }
 }
 
+impl<T: ?Sized> AsRef<T> for Arc<T> {
+    fn as_ref(&self) -> &T {
+        self.deref()
+    }
+}
+
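A sketch (not part of the patch) of the two `Arc` additions, assuming some refcounted `data: Arc<Foo>` where `Foo` is a made-up type:

    let other = data.clone();
    // Pointer identity: both handles refer to the same refcounted allocation.
    assert!(Arc::ptr_eq(&data, &other));
    // The AsRef impl lets an `Arc<Foo>` be used where a `&Foo` view is wanted:
    let view: &Foo = other.as_ref();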
 impl<T: ?Sized> Clone for Arc<T> {
     fn clone(&self) -> Self {
         // INVARIANT: C `refcount_inc` saturates the refcount, so it cannot overflow to zero.
index 526d29a..7eda15e 100644 (file)
@@ -64,8 +64,14 @@ macro_rules! current {
 #[repr(transparent)]
 pub struct Task(pub(crate) Opaque<bindings::task_struct>);
 
-// SAFETY: It's OK to access `Task` through references from other threads because we're either
-// accessing properties that don't change (e.g., `pid`, `group_leader`) or that are properly
+// SAFETY: By design, the only way to access a `Task` is via the `current` function or via an
+// `ARef<Task>` obtained through the `AlwaysRefCounted` impl. This means that the only situation in
+// which a `Task` can be accessed mutably is when the refcount drops to zero and the destructor
+// runs. It is safe for that to happen on any thread, so it is ok for this type to be `Send`.
+unsafe impl Send for Task {}
+
+// SAFETY: It's OK to access `Task` through shared references from other threads because we're
+// either accessing properties that don't change (e.g., `pid`, `group_leader`) or that are properly
 // synchronised by C code (e.g., `signal_pending`).
 unsafe impl Sync for Task {}
 
index 29db59d..1e5380b 100644 (file)
@@ -321,6 +321,19 @@ pub struct ARef<T: AlwaysRefCounted> {
     _p: PhantomData<T>,
 }
 
+// SAFETY: It is safe to send `ARef<T>` to another thread when the underlying `T` is `Sync` because
+// it effectively means sharing `&T` (which is safe because `T` is `Sync`); additionally, it needs
+// `T` to be `Send` because any thread that has an `ARef<T>` may ultimately access `T` using a
+// mutable reference, for example, when the reference count reaches zero and `T` is dropped.
+unsafe impl<T: AlwaysRefCounted + Sync + Send> Send for ARef<T> {}
+
+// SAFETY: It is safe to send `&ARef<T>` to another thread when the underlying `T` is `Sync`
+// because it effectively means sharing `&T` (which is safe because `T` is `Sync`); additionally,
+// it needs `T` to be `Send` because any thread that has a `&ARef<T>` may clone it and get an
+// `ARef<T>` on that thread, so the thread may ultimately access `T` using a mutable reference, for
+// example, when the reference count reaches zero and `T` is dropped.
+unsafe impl<T: AlwaysRefCounted + Sync + Send> Sync for ARef<T> {}
+
 impl<T: AlwaysRefCounted> ARef<T> {
     /// Creates a new instance of [`ARef`].
     ///
index b2bdd4d..afb0f2e 100644 (file)
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 
-use proc_macro::{token_stream, Group, TokenTree};
+use proc_macro::{token_stream, Group, Punct, Spacing, TokenStream, TokenTree};
 
 pub(crate) fn try_ident(it: &mut token_stream::IntoIter) -> Option<String> {
     if let Some(TokenTree::Ident(ident)) = it.next() {
@@ -69,3 +69,87 @@ pub(crate) fn expect_end(it: &mut token_stream::IntoIter) {
         panic!("Expected end");
     }
 }
+
+pub(crate) struct Generics {
+    pub(crate) impl_generics: Vec<TokenTree>,
+    pub(crate) ty_generics: Vec<TokenTree>,
+}
+
+/// Parses the given `TokenStream` into `Generics` and the rest.
+///
+/// The generics are not present in the rest, but a where clause might remain.
+pub(crate) fn parse_generics(input: TokenStream) -> (Generics, Vec<TokenTree>) {
+    // `impl_generics`, the declared generics with their bounds.
+    let mut impl_generics = vec![];
+    // Only the names of the generics, without any bounds.
+    let mut ty_generics = vec![];
+    // Tokens not related to the generics, e.g. the `where` token and definition.
+    let mut rest = vec![];
+    // The current level of `<`.
+    let mut nesting = 0;
+    let mut toks = input.into_iter();
+    // If we are at the beginning of a generic parameter.
+    let mut at_start = true;
+    for tt in &mut toks {
+        match tt.clone() {
+            TokenTree::Punct(p) if p.as_char() == '<' => {
+                if nesting >= 1 {
+                    // This is inside of the generics and part of some bound.
+                    impl_generics.push(tt);
+                }
+                nesting += 1;
+            }
+            TokenTree::Punct(p) if p.as_char() == '>' => {
+                // This is a parsing error, so we just end it here.
+                if nesting == 0 {
+                    break;
+                } else {
+                    nesting -= 1;
+                    if nesting >= 1 {
+                        // We are still inside of the generics and part of some bound.
+                        impl_generics.push(tt);
+                    }
+                    if nesting == 0 {
+                        break;
+                    }
+                }
+            }
+            tt => {
+                if nesting == 1 {
+                    // Here depending on the token, it might be a generic variable name.
+                    match &tt {
+                        // Ignore const.
+                        TokenTree::Ident(i) if i.to_string() == "const" => {}
+                        TokenTree::Ident(_) if at_start => {
+                            ty_generics.push(tt.clone());
+                            // We also already push the `,` token, this makes it easier to append
+                            // generics.
+                            ty_generics.push(TokenTree::Punct(Punct::new(',', Spacing::Alone)));
+                            at_start = false;
+                        }
+                        TokenTree::Punct(p) if p.as_char() == ',' => at_start = true,
+                        // Lifetimes begin with `'`.
+                        TokenTree::Punct(p) if p.as_char() == '\'' && at_start => {
+                            ty_generics.push(tt.clone());
+                        }
+                        _ => {}
+                    }
+                }
+                if nesting >= 1 {
+                    impl_generics.push(tt);
+                } else if nesting == 0 {
+                    // If we haven't entered the generics yet, we still want to keep these tokens.
+                    rest.push(tt);
+                }
+            }
+        }
+    }
+    rest.extend(toks);
+    (
+        Generics {
+            impl_generics,
+            ty_generics,
+        },
+        rest,
+    )
+}
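An illustrative sketch (not part of the patch) of what `parse_generics` is expected to produce for a typical input; the struct name and bounds are made up:

    // Input token stream:
    //     struct Foo<'a, T: Bar, const N: usize> where T: Baz { field: &'a [T; N] }
    // Expected split (roughly):
    //     impl_generics: 'a, T: Bar, const N: usize     (bounds kept, outer <> stripped)
    //     ty_generics:   'a, T, N,                      (names only, trailing commas added)
    //     rest:          struct Foo where T: Baz { field: &'a [T; N] }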
index 954149d..6d58cfd 100644 (file)
 // SPDX-License-Identifier: Apache-2.0 OR MIT
 
-use proc_macro::{Punct, Spacing, TokenStream, TokenTree};
+use crate::helpers::{parse_generics, Generics};
+use proc_macro::{Group, Punct, Spacing, TokenStream, TokenTree};
 
 pub(crate) fn pin_data(args: TokenStream, input: TokenStream) -> TokenStream {
     // This proc-macro only does some pre-parsing and then delegates the actual parsing to
     // `kernel::__pin_data!`.
-    //
-    // In here we only collect the generics, since parsing them in declarative macros is very
-    // elaborate. We also do not need to analyse their structure, we only need to collect them.
 
-    // `impl_generics`, the declared generics with their bounds.
-    let mut impl_generics = vec![];
-    // Only the names of the generics, without any bounds.
-    let mut ty_generics = vec![];
-    // Tokens not related to the generics e.g. the `impl` token.
-    let mut rest = vec![];
-    // The current level of `<`.
-    let mut nesting = 0;
-    let mut toks = input.into_iter();
-    // If we are at the beginning of a generic parameter.
-    let mut at_start = true;
-    for tt in &mut toks {
-        match tt.clone() {
-            TokenTree::Punct(p) if p.as_char() == '<' => {
-                if nesting >= 1 {
-                    impl_generics.push(tt);
-                }
-                nesting += 1;
-            }
-            TokenTree::Punct(p) if p.as_char() == '>' => {
-                if nesting == 0 {
-                    break;
-                } else {
-                    nesting -= 1;
-                    if nesting >= 1 {
-                        impl_generics.push(tt);
-                    }
-                    if nesting == 0 {
-                        break;
-                    }
+    let (
+        Generics {
+            impl_generics,
+            ty_generics,
+        },
+        rest,
+    ) = parse_generics(input);
+    // The struct definition might contain the `Self` type. Since `__pin_data!` will define a new
+    // type with the same generics and bounds, this poses a problem, since `Self` will refer to the
+    // new type as opposed to this struct definition. Therefore we have to replace `Self` with the
+    // concrete name.
+
+    // Errors that occur when replacing `Self` with `struct_name`.
+    let mut errs = TokenStream::new();
+    // The name of the struct with ty_generics.
+    let struct_name = rest
+        .iter()
+        .skip_while(|tt| !matches!(tt, TokenTree::Ident(i) if i.to_string() == "struct"))
+        .nth(1)
+        .and_then(|tt| match tt {
+            TokenTree::Ident(_) => {
+                let tt = tt.clone();
+                let mut res = vec![tt];
+                if !ty_generics.is_empty() {
+                    // We add this, so it is maximally compatible with e.g. `Self::CONST` which
+                    // will be replaced by `StructName::<$generics>::CONST`.
+                    res.push(TokenTree::Punct(Punct::new(':', Spacing::Joint)));
+                    res.push(TokenTree::Punct(Punct::new(':', Spacing::Alone)));
+                    res.push(TokenTree::Punct(Punct::new('<', Spacing::Alone)));
+                    res.extend(ty_generics.iter().cloned());
+                    res.push(TokenTree::Punct(Punct::new('>', Spacing::Alone)));
                 }
+                Some(res)
             }
-            tt => {
-                if nesting == 1 {
-                    match &tt {
-                        TokenTree::Ident(i) if i.to_string() == "const" => {}
-                        TokenTree::Ident(_) if at_start => {
-                            ty_generics.push(tt.clone());
-                            ty_generics.push(TokenTree::Punct(Punct::new(',', Spacing::Alone)));
-                            at_start = false;
-                        }
-                        TokenTree::Punct(p) if p.as_char() == ',' => at_start = true,
-                        TokenTree::Punct(p) if p.as_char() == '\'' && at_start => {
-                            ty_generics.push(tt.clone());
-                        }
-                        _ => {}
-                    }
-                }
-                if nesting >= 1 {
-                    impl_generics.push(tt);
-                } else if nesting == 0 {
-                    rest.push(tt);
-                }
+            _ => None,
+        })
+        .unwrap_or_else(|| {
+            // If we did not find the name of the struct then we will use `Self` as the replacement
+            // and add a compile error to ensure it does not compile.
+            errs.extend(
+                "::core::compile_error!(\"Could not locate type name.\");"
+                    .parse::<TokenStream>()
+                    .unwrap(),
+            );
+            "Self".parse::<TokenStream>().unwrap().into_iter().collect()
+        });
+    let impl_generics = impl_generics
+        .into_iter()
+        .flat_map(|tt| replace_self_and_deny_type_defs(&struct_name, tt, &mut errs))
+        .collect::<Vec<_>>();
+    let mut rest = rest
+        .into_iter()
+        .flat_map(|tt| {
+            // We ignore top level `struct` tokens, since they would emit a compile error.
+            if matches!(&tt, TokenTree::Ident(i) if i.to_string() == "struct") {
+                vec![tt]
+            } else {
+                replace_self_and_deny_type_defs(&struct_name, tt, &mut errs)
             }
-        }
-    }
-    rest.extend(toks);
+        })
+        .collect::<Vec<_>>();
     // This should be the body of the struct `{...}`.
     let last = rest.pop();
-    quote!(::kernel::__pin_data! {
+    let mut quoted = quote!(::kernel::__pin_data! {
         parse_input:
         @args(#args),
         @sig(#(#rest)*),
         @impl_generics(#(#impl_generics)*),
         @ty_generics(#(#ty_generics)*),
         @body(#last),
-    })
+    });
+    quoted.extend(errs);
+    quoted
+}
+
+/// Replaces `Self` with `struct_name` and errors on `enum`, `trait`, `struct`, `union` and `impl`
+/// keywords.
+///
+/// The error is appended to `errs` to allow normal parsing to continue.
+fn replace_self_and_deny_type_defs(
+    struct_name: &Vec<TokenTree>,
+    tt: TokenTree,
+    errs: &mut TokenStream,
+) -> Vec<TokenTree> {
+    match tt {
+        TokenTree::Ident(ref i)
+            if i.to_string() == "enum"
+                || i.to_string() == "trait"
+                || i.to_string() == "struct"
+                || i.to_string() == "union"
+                || i.to_string() == "impl" =>
+        {
+            errs.extend(
+                format!(
+                    "::core::compile_error!(\"Cannot use `{i}` inside of struct definition with \
+                        `#[pin_data]`.\");"
+                )
+                .parse::<TokenStream>()
+                .unwrap()
+                .into_iter()
+                .map(|mut tok| {
+                    tok.set_span(tt.span());
+                    tok
+                }),
+            );
+            vec![tt]
+        }
+        TokenTree::Ident(i) if i.to_string() == "Self" => struct_name.clone(),
+        TokenTree::Literal(_) | TokenTree::Punct(_) | TokenTree::Ident(_) => vec![tt],
+        TokenTree::Group(g) => vec![TokenTree::Group(Group::new(
+            g.delimiter(),
+            g.stream()
+                .into_iter()
+                .flat_map(|tt| replace_self_and_deny_type_defs(struct_name, tt, errs))
+                .collect(),
+        ))],
+    }
 }
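An illustrative sketch (not part of the patch) of why the `Self` replacement matters; the struct below is made up:

    // Given:
    //     #[pin_data]
    //     struct Buf<T> {
    //         #[pin]
    //         inner: T,
    //         make: fn(&Self) -> usize,
    //     }
    // the macro now rewrites `Self` in the bounds and body to `Buf::<T,>` before
    // handing the tokens to `__pin_data!`, and any stray `enum`/`trait`/`struct`/
    // `union`/`impl` definition inside the body produces a `compile_error!` instead
    // of a confusing expansion failure.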
index c8e08b3..dddbb4e 100644 (file)
@@ -39,12 +39,14 @@ impl ToTokens for TokenStream {
 /// [`quote_spanned!`](https://docs.rs/quote/latest/quote/macro.quote_spanned.html) macro from the
 /// `quote` crate but provides only just enough functionality needed by the current `macros` crate.
 macro_rules! quote_spanned {
-    ($span:expr => $($tt:tt)*) => {
-    #[allow(clippy::vec_init_then_push)]
-    {
-        let mut tokens = ::std::vec::Vec::new();
-        let span = $span;
-        quote_spanned!(@proc tokens span $($tt)*);
+    ($span:expr => $($tt:tt)*) => {{
+        let mut tokens;
+        #[allow(clippy::vec_init_then_push)]
+        {
+            tokens = ::std::vec::Vec::new();
+            let span = $span;
+            quote_spanned!(@proc tokens span $($tt)*);
+        }
         ::proc_macro::TokenStream::from_iter(tokens)
     }};
     (@proc $v:ident $span:ident) => {};
index 29f69f3..0caad90 100644 (file)
@@ -8,7 +8,6 @@
 //! userspace APIs.
 
 #![no_std]
-#![feature(core_ffi_c)]
 // See <https://github.com/rust-lang/rust-bindgen/issues/1651>.
 #![cfg_attr(test, allow(deref_nullptr))]
 #![cfg_attr(test, allow(unaligned_references))]
index 7b476eb..6ced5dd 100644 (file)
@@ -32,7 +32,7 @@ static DEFINE_PER_CPU(void *, kmemleak_test_pointer);
  * Some very simple testing. This function needs to be extended for
  * proper testing.
  */
-static int __init kmemleak_test_init(void)
+static int kmemleak_test_init(void)
 {
        struct test_node *elem;
        int i;
index 9f94fc8..7817523 100644 (file)
@@ -277,7 +277,7 @@ $(obj)/%.lst: $(src)/%.c FORCE
 # Compile Rust sources (.rs)
 # ---------------------------------------------------------------------------
 
-rust_allowed_features := core_ffi_c,explicit_generic_args_with_impl_trait,new_uninit,pin_macro
+rust_allowed_features := new_uninit
 
 rust_common_cmd = \
        RUST_MODFILE=$(modfile) $(RUSTC_OR_CLIPPY) $(rust_flags) \
index 7099c60..4749865 100644 (file)
@@ -2,7 +2,7 @@
 
 # Enable available and selected UBSAN features.
 ubsan-cflags-$(CONFIG_UBSAN_ALIGNMENT)         += -fsanitize=alignment
-ubsan-cflags-$(CONFIG_UBSAN_ONLY_BOUNDS)       += -fsanitize=bounds
+ubsan-cflags-$(CONFIG_UBSAN_BOUNDS_STRICT)     += -fsanitize=bounds-strict
 ubsan-cflags-$(CONFIG_UBSAN_ARRAY_BOUNDS)      += -fsanitize=array-bounds
 ubsan-cflags-$(CONFIG_UBSAN_LOCAL_BOUNDS)      += -fsanitize=local-bounds
 ubsan-cflags-$(CONFIG_UBSAN_SHIFT)             += -fsanitize=shift
index 81d5c32..608ff39 100755 (executable)
@@ -36,9 +36,16 @@ meta_has_relaxed()
        meta_in "$1" "BFIR"
 }
 
-#find_fallback_template(pfx, name, sfx, order)
-find_fallback_template()
+#meta_is_implicitly_relaxed(meta)
+meta_is_implicitly_relaxed()
 {
+       meta_in "$1" "vls"
+}
+
+#find_template(tmpltype, pfx, name, sfx, order)
+find_template()
+{
+       local tmpltype="$1"; shift
        local pfx="$1"; shift
        local name="$1"; shift
        local sfx="$1"; shift
@@ -52,8 +59,8 @@ find_fallback_template()
        #
        # Start at the most specific, and fall back to the most general. Once
        # we find a specific fallback, don't bother looking for more.
-       for base in "${pfx}${name}${sfx}${order}" "${name}"; do
-               file="${ATOMICDIR}/fallbacks/${base}"
+       for base in "${pfx}${name}${sfx}${order}" "${pfx}${name}${sfx}" "${name}"; do
+               file="${ATOMICDIR}/${tmpltype}/${base}"
 
                if [ -f "${file}" ]; then
                        printf "${file}"
@@ -62,6 +69,18 @@ find_fallback_template()
        done
 }
 
+#find_fallback_template(pfx, name, sfx, order)
+find_fallback_template()
+{
+       find_template "fallbacks" "$@"
+}
+
+#find_kerneldoc_template(pfx, name, sfx, order)
+find_kerneldoc_template()
+{
+       find_template "kerneldoc" "$@"
+}
+
 #gen_ret_type(meta, int)
 gen_ret_type() {
        local meta="$1"; shift
@@ -142,6 +161,91 @@ gen_args()
        done
 }
 
+#gen_desc_return(meta)
+gen_desc_return()
+{
+       local meta="$1"; shift
+
+       case "${meta}" in
+       [v])
+               printf "Return: Nothing."
+               ;;
+       [Ff])
+               printf "Return: The original value of @v."
+               ;;
+       [R])
+               printf "Return: The updated value of @v."
+               ;;
+       [l])
+               printf "Return: The value of @v."
+               ;;
+       esac
+}
+
+#gen_template_kerneldoc(template, class, meta, pfx, name, sfx, order, atomic, int, args...)
+gen_template_kerneldoc()
+{
+       local template="$1"; shift
+       local class="$1"; shift
+       local meta="$1"; shift
+       local pfx="$1"; shift
+       local name="$1"; shift
+       local sfx="$1"; shift
+       local order="$1"; shift
+       local atomic="$1"; shift
+       local int="$1"; shift
+
+       local atomicname="${atomic}_${pfx}${name}${sfx}${order}"
+
+       local ret="$(gen_ret_type "${meta}" "${int}")"
+       local retstmt="$(gen_ret_stmt "${meta}")"
+       local params="$(gen_params "${int}" "${atomic}" "$@")"
+       local args="$(gen_args "$@")"
+       local desc_order=""
+       local desc_instrumentation=""
+       local desc_return=""
+
+       if [ ! -z "${order}" ]; then
+               desc_order="${order##_}"
+       elif meta_is_implicitly_relaxed "${meta}"; then
+               desc_order="relaxed"
+       else
+               desc_order="full"
+       fi
+
+       if [ -z "${class}" ]; then
+               desc_noinstr="Unsafe to use in noinstr code; use raw_${atomicname}() there."
+       else
+               desc_noinstr="Safe to use in noinstr code; prefer ${atomicname}() elsewhere."
+       fi
+
+       desc_return="$(gen_desc_return "${meta}")"
+
+       . ${template}
+}
+
+#gen_kerneldoc(class, meta, pfx, name, sfx, order, atomic, int, args...)
+gen_kerneldoc()
+{
+       local class="$1"; shift
+       local meta="$1"; shift
+       local pfx="$1"; shift
+       local name="$1"; shift
+       local sfx="$1"; shift
+       local order="$1"; shift
+
+       local atomicname="${atomic}_${pfx}${name}${sfx}${order}"
+
+       local tmpl="$(find_kerneldoc_template "${pfx}" "${name}" "${sfx}" "${order}")"
+       if [ -z "${tmpl}" ]; then
+               printf "/*\n"
+               printf " * No kerneldoc available for ${class}${atomicname}\n"
+               printf " */\n"
+       else
+       gen_template_kerneldoc "${tmpl}" "${class}" "${meta}" "${pfx}" "${name}" "${sfx}" "${order}" "$@"
+       fi
+}
+
 #gen_proto_order_variants(meta, pfx, name, sfx, ...)
 gen_proto_order_variants()
 {
index 85ca8d9..903946c 100644 (file)
@@ -27,7 +27,7 @@ and                   vF      i       v
 andnot                 vF      i       v
 or                     vF      i       v
 xor                    vF      i       v
-xchg                   I       v       i
+xchg                   I       v       i:new
 cmpxchg                        I       v       i:old   i:new
 try_cmpxchg            B       v       p:old   i:new
 sub_and_test           b       i       v
index ef76408..4da0cab 100755 (executable)
@@ -1,9 +1,5 @@
 cat <<EOF
-static __always_inline ${ret}
-arch_${atomic}_${pfx}${name}${sfx}_acquire(${params})
-{
        ${ret} ret = arch_${atomic}_${pfx}${name}${sfx}_relaxed(${args});
        __atomic_acquire_fence();
        return ret;
-}
 EOF
index e5980ab..1d3d4ab 100755 (executable)
@@ -1,15 +1,3 @@
 cat <<EOF
-/**
- * arch_${atomic}_add_negative${order} - Add and test if negative
- * @i: integer value to add
- * @v: pointer of type ${atomic}_t
- *
- * Atomically adds @i to @v and returns true if the result is negative,
- * or false when the result is greater than or equal to zero.
- */
-static __always_inline bool
-arch_${atomic}_add_negative${order}(${int} i, ${atomic}_t *v)
-{
-       return arch_${atomic}_add_return${order}(i, v) < 0;
-}
+       return raw_${atomic}_add_return${order}(i, v) < 0;
 EOF
index 9e5159c..95ecb2b 100755 (executable)
@@ -1,16 +1,3 @@
 cat << EOF
-/**
- * arch_${atomic}_add_unless - add unless the number is already a given value
- * @v: pointer of type ${atomic}_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
- *
- * Atomically adds @a to @v, if @v was not already @u.
- * Returns true if the addition was done.
- */
-static __always_inline bool
-arch_${atomic}_add_unless(${atomic}_t *v, ${int} a, ${int} u)
-{
-       return arch_${atomic}_fetch_add_unless(v, a, u) != u;
-}
+       return raw_${atomic}_fetch_add_unless(v, a, u) != u;
 EOF
index 5a42f54..6676045 100755 (executable)
@@ -1,7 +1,3 @@
 cat <<EOF
-static __always_inline ${ret}
-arch_${atomic}_${pfx}andnot${sfx}${order}(${int} i, ${atomic}_t *v)
-{
-       ${retstmt}arch_${atomic}_${pfx}and${sfx}${order}(~i, v);
-}
+       ${retstmt}raw_${atomic}_${pfx}and${sfx}${order}(~i, v);
 EOF
diff --git a/scripts/atomic/fallbacks/cmpxchg b/scripts/atomic/fallbacks/cmpxchg
new file mode 100644 (file)
index 0000000..1c8507f
--- /dev/null
@@ -0,0 +1,3 @@
+cat <<EOF
+       return raw_cmpxchg${order}(&v->counter, old, new);
+EOF
index 8c144c8..60d286d 100755 (executable)
@@ -1,7 +1,3 @@
 cat <<EOF
-static __always_inline ${ret}
-arch_${atomic}_${pfx}dec${sfx}${order}(${atomic}_t *v)
-{
-       ${retstmt}arch_${atomic}_${pfx}sub${sfx}${order}(1, v);
-}
+       ${retstmt}raw_${atomic}_${pfx}sub${sfx}${order}(1, v);
 EOF
index 8549f35..3a0278e 100755 (executable)
@@ -1,15 +1,3 @@
 cat <<EOF
-/**
- * arch_${atomic}_dec_and_test - decrement and test
- * @v: pointer of type ${atomic}_t
- *
- * Atomically decrements @v by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
- */
-static __always_inline bool
-arch_${atomic}_dec_and_test(${atomic}_t *v)
-{
-       return arch_${atomic}_dec_return(v) == 0;
-}
+       return raw_${atomic}_dec_return(v) == 0;
 EOF
index 86bdced..f65c11b 100755 (executable)
@@ -1,15 +1,11 @@
 cat <<EOF
-static __always_inline ${ret}
-arch_${atomic}_dec_if_positive(${atomic}_t *v)
-{
-       ${int} dec, c = arch_${atomic}_read(v);
+       ${int} dec, c = raw_${atomic}_read(v);
 
        do {
                dec = c - 1;
                if (unlikely(dec < 0))
                        break;
-       } while (!arch_${atomic}_try_cmpxchg(v, &c, dec));
+       } while (!raw_${atomic}_try_cmpxchg(v, &c, dec));
 
        return dec;
-}
 EOF
index c531d5a..d025361 100755 (executable)
@@ -1,14 +1,10 @@
 cat <<EOF
-static __always_inline bool
-arch_${atomic}_dec_unless_positive(${atomic}_t *v)
-{
-       ${int} c = arch_${atomic}_read(v);
+       ${int} c = raw_${atomic}_read(v);
 
        do {
                if (unlikely(c > 0))
                        return false;
-       } while (!arch_${atomic}_try_cmpxchg(v, &c, c - 1));
+       } while (!raw_${atomic}_try_cmpxchg(v, &c, c - 1));
 
        return true;
-}
 EOF
index 07757d8..40d5b39 100755 (executable)
@@ -1,11 +1,7 @@
 cat <<EOF
-static __always_inline ${ret}
-arch_${atomic}_${pfx}${name}${sfx}(${params})
-{
        ${ret} ret;
        __atomic_pre_full_fence();
        ret = arch_${atomic}_${pfx}${name}${sfx}_relaxed(${args});
        __atomic_post_full_fence();
        return ret;
-}
 EOF
index 68ce13c..8db7e9e 100755 (executable)
@@ -1,23 +1,10 @@
 cat << EOF
-/**
- * arch_${atomic}_fetch_add_unless - add unless the number is already a given value
- * @v: pointer of type ${atomic}_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
- *
- * Atomically adds @a to @v, so long as @v was not already @u.
- * Returns original value of @v
- */
-static __always_inline ${int}
-arch_${atomic}_fetch_add_unless(${atomic}_t *v, ${int} a, ${int} u)
-{
-       ${int} c = arch_${atomic}_read(v);
+       ${int} c = raw_${atomic}_read(v);
 
        do {
                if (unlikely(c == u))
                        break;
-       } while (!arch_${atomic}_try_cmpxchg(v, &c, c + a));
+       } while (!raw_${atomic}_try_cmpxchg(v, &c, c + a));
 
        return c;
-}
 EOF
index 3c2c373..56c770f 100755 (executable)
@@ -1,7 +1,3 @@
 cat <<EOF
-static __always_inline ${ret}
-arch_${atomic}_${pfx}inc${sfx}${order}(${atomic}_t *v)
-{
-       ${retstmt}arch_${atomic}_${pfx}add${sfx}${order}(1, v);
-}
+       ${retstmt}raw_${atomic}_${pfx}add${sfx}${order}(1, v);
 EOF
index 0cf23fe..7d16a10 100755 (executable)
@@ -1,15 +1,3 @@
 cat <<EOF
-/**
- * arch_${atomic}_inc_and_test - increment and test
- * @v: pointer of type ${atomic}_t
- *
- * Atomically increments @v by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-static __always_inline bool
-arch_${atomic}_inc_and_test(${atomic}_t *v)
-{
-       return arch_${atomic}_inc_return(v) == 0;
-}
+       return raw_${atomic}_inc_return(v) == 0;
 EOF
index ed8a1f5..1fcef1e 100755 (executable)
@@ -1,14 +1,3 @@
 cat <<EOF
-/**
- * arch_${atomic}_inc_not_zero - increment unless the number is zero
- * @v: pointer of type ${atomic}_t
- *
- * Atomically increments @v by 1, if @v is non-zero.
- * Returns true if the increment was done.
- */
-static __always_inline bool
-arch_${atomic}_inc_not_zero(${atomic}_t *v)
-{
-       return arch_${atomic}_add_unless(v, 1, 0);
-}
+       return raw_${atomic}_add_unless(v, 1, 0);
 EOF
index 95d8ce4..7b4b098 100755 (executable)
@@ -1,14 +1,10 @@
 cat <<EOF
-static __always_inline bool
-arch_${atomic}_inc_unless_negative(${atomic}_t *v)
-{
-       ${int} c = arch_${atomic}_read(v);
+       ${int} c = raw_${atomic}_read(v);
 
        do {
                if (unlikely(c < 0))
                        return false;
-       } while (!arch_${atomic}_try_cmpxchg(v, &c, c + 1));
+       } while (!raw_${atomic}_try_cmpxchg(v, &c, c + 1));
 
        return true;
-}
 EOF
index a0ea1d2..e319862 100755 (executable)
@@ -1,16 +1,12 @@
 cat <<EOF
-static __always_inline ${ret}
-arch_${atomic}_read_acquire(const ${atomic}_t *v)
-{
        ${int} ret;
 
        if (__native_word(${atomic}_t)) {
                ret = smp_load_acquire(&(v)->counter);
        } else {
-               ret = arch_${atomic}_read(v);
+               ret = raw_${atomic}_read(v);
                __atomic_acquire_fence();
        }
 
        return ret;
-}
 EOF
index b46feb5..1e6daf5 100755 (executable)
@@ -1,8 +1,4 @@
 cat <<EOF
-static __always_inline ${ret}
-arch_${atomic}_${pfx}${name}${sfx}_release(${params})
-{
        __atomic_release_fence();
        ${retstmt}arch_${atomic}_${pfx}${name}${sfx}_relaxed(${args});
-}
 EOF
index 05cdb7f..16a374a 100755 (executable)
@@ -1,12 +1,8 @@
 cat <<EOF
-static __always_inline void
-arch_${atomic}_set_release(${atomic}_t *v, ${int} i)
-{
        if (__native_word(${atomic}_t)) {
                smp_store_release(&(v)->counter, i);
        } else {
                __atomic_release_fence();
-               arch_${atomic}_set(v, i);
+               raw_${atomic}_set(v, i);
        }
-}
 EOF
index 260f373..d1f746f 100755 (executable)
@@ -1,16 +1,3 @@
 cat <<EOF
-/**
- * arch_${atomic}_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @v: pointer of type ${atomic}_t
- *
- * Atomically subtracts @i from @v and returns
- * true if the result is zero, or false for all
- * other cases.
- */
-static __always_inline bool
-arch_${atomic}_sub_and_test(${int} i, ${atomic}_t *v)
-{
-       return arch_${atomic}_sub_return(i, v) == 0;
-}
+       return raw_${atomic}_sub_return(i, v) == 0;
 EOF
index 890f850..d4da820 100755 (executable)
@@ -1,11 +1,7 @@
 cat <<EOF
-static __always_inline bool
-arch_${atomic}_try_cmpxchg${order}(${atomic}_t *v, ${int} *old, ${int} new)
-{
        ${int} r, o = *old;
-       r = arch_${atomic}_cmpxchg${order}(v, o, new);
+       r = raw_${atomic}_cmpxchg${order}(v, o, new);
        if (unlikely(r != o))
                *old = r;
        return likely(r == o);
-}
 EOF
diff --git a/scripts/atomic/fallbacks/xchg b/scripts/atomic/fallbacks/xchg
new file mode 100644 (file)
index 0000000..e4def1e
--- /dev/null
@@ -0,0 +1,3 @@
+cat <<EOF
+       return raw_xchg${order}(&v->counter, new);
+EOF
index 6e853f0..c0c8a85 100755 (executable)
@@ -17,23 +17,16 @@ gen_template_fallback()
        local atomic="$1"; shift
        local int="$1"; shift
 
-       local atomicname="arch_${atomic}_${pfx}${name}${sfx}${order}"
-
        local ret="$(gen_ret_type "${meta}" "${int}")"
        local retstmt="$(gen_ret_stmt "${meta}")"
        local params="$(gen_params "${int}" "${atomic}" "$@")"
        local args="$(gen_args "$@")"
 
-       if [ ! -z "${template}" ]; then
-               printf "#ifndef ${atomicname}\n"
-               . ${template}
-               printf "#define ${atomicname} ${atomicname}\n"
-               printf "#endif\n\n"
-       fi
+       . ${template}
 }
 
-#gen_proto_fallback(meta, pfx, name, sfx, order, atomic, int, args...)
-gen_proto_fallback()
+#gen_order_fallback(meta, pfx, name, sfx, order, atomic, int, args...)
+gen_order_fallback()
 {
        local meta="$1"; shift
        local pfx="$1"; shift
@@ -41,87 +34,124 @@ gen_proto_fallback()
        local sfx="$1"; shift
        local order="$1"; shift
 
-       local tmpl="$(find_fallback_template "${pfx}" "${name}" "${sfx}" "${order}")"
+       local tmpl_order=${order#_}
+       local tmpl="${ATOMICDIR}/fallbacks/${tmpl_order:-fence}"
        gen_template_fallback "${tmpl}" "${meta}" "${pfx}" "${name}" "${sfx}" "${order}" "$@"
 }
 
-#gen_basic_fallbacks(basename)
-gen_basic_fallbacks()
-{
-       local basename="$1"; shift
-cat << EOF
-#define ${basename}_acquire ${basename}
-#define ${basename}_release ${basename}
-#define ${basename}_relaxed ${basename}
-EOF
-}
-
-gen_proto_order_variant()
+#gen_proto_fallback(meta, pfx, name, sfx, order, atomic, int, args...)
+gen_proto_fallback()
 {
        local meta="$1"; shift
        local pfx="$1"; shift
        local name="$1"; shift
        local sfx="$1"; shift
        local order="$1"; shift
-       local atomic="$1"
 
-       local basename="arch_${atomic}_${pfx}${name}${sfx}"
-
-       printf "#define ${basename}${order} ${basename}${order}\n"
+       local tmpl="$(find_fallback_template "${pfx}" "${name}" "${sfx}" "${order}")"
+       gen_template_fallback "${tmpl}" "${meta}" "${pfx}" "${name}" "${sfx}" "${order}" "$@"
 }
 
-#gen_proto_order_variants(meta, pfx, name, sfx, atomic, int, args...)
-gen_proto_order_variants()
+#gen_proto_order_variant(meta, pfx, name, sfx, order, atomic, int, args...)
+gen_proto_order_variant()
 {
        local meta="$1"; shift
        local pfx="$1"; shift
        local name="$1"; shift
        local sfx="$1"; shift
-       local atomic="$1"
+       local order="$1"; shift
+       local atomic="$1"; shift
+       local int="$1"; shift
 
-       local basename="arch_${atomic}_${pfx}${name}${sfx}"
+       local atomicname="${atomic}_${pfx}${name}${sfx}${order}"
+       local basename="${atomic}_${pfx}${name}${sfx}"
 
        local template="$(find_fallback_template "${pfx}" "${name}" "${sfx}" "${order}")"
 
-       # If we don't have relaxed atomics, then we don't bother with ordering fallbacks
-       # read_acquire and set_release need to be templated, though
-       if ! meta_has_relaxed "${meta}"; then
-               gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "" "$@"
+       local ret="$(gen_ret_type "${meta}" "${int}")"
+       local retstmt="$(gen_ret_stmt "${meta}")"
+       local params="$(gen_params "${int}" "${atomic}" "$@")"
+       local args="$(gen_args "$@")"
 
-               if meta_has_acquire "${meta}"; then
-                       gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "_acquire" "$@"
-               fi
+       gen_kerneldoc "raw_" "${meta}" "${pfx}" "${name}" "${sfx}" "${order}" "${atomic}" "${int}" "$@"
+
+       printf "static __always_inline ${ret}\n"
+       printf "raw_${atomicname}(${params})\n"
+       printf "{\n"
+
+       # Where there is no possible fallback, this order variant is mandatory
+       # and must be provided by arch code. Add a comment to the header to
+       # make this obvious.
+       #
+       # Ideally we'd error on a missing definition, but arch code might
+       # define this order variant as a C function without a preprocessor
+       # symbol.
+       if [ -z ${template} ] && [ -z "${order}" ] && ! meta_has_relaxed "${meta}"; then
+               printf "\t${retstmt}arch_${atomicname}(${args});\n"
+               printf "}\n\n"
+               return
+       fi
 
-               if meta_has_release "${meta}"; then
-                       gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "_release" "$@"
-               fi
+       printf "#if defined(arch_${atomicname})\n"
+       printf "\t${retstmt}arch_${atomicname}(${args});\n"
 
-               return
+       # Allow FULL/ACQUIRE/RELEASE ops to be defined in terms of RELAXED ops
+       if [ "${order}" != "_relaxed" ] && meta_has_relaxed "${meta}"; then
+               printf "#elif defined(arch_${basename}_relaxed)\n"
+               gen_order_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "${order}" "${atomic}" "${int}" "$@"
        fi
 
-       printf "#ifndef ${basename}_relaxed\n"
+       # Allow ACQUIRE/RELEASE/RELAXED ops to be defined in terms of FULL ops
+       if [ ! -z "${order}" ]; then
+               printf "#elif defined(arch_${basename})\n"
+               printf "\t${retstmt}arch_${basename}(${args});\n"
+       fi
 
+       printf "#else\n"
        if [ ! -z "${template}" ]; then
-               printf "#ifdef ${basename}\n"
+               gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "${order}" "${atomic}" "${int}" "$@"
+       else
+               printf "#error \"Unable to define raw_${atomicname}\"\n"
        fi
 
-       gen_basic_fallbacks "${basename}"
+       printf "#endif\n"
+       printf "}\n\n"
+}
 
-       if [ ! -z "${template}" ]; then
-               printf "#endif /* ${basename} */\n\n"
-               gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "" "$@"
-               gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "_acquire" "$@"
-               gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "_release" "$@"
-               gen_proto_fallback "${meta}" "${pfx}" "${name}" "${sfx}" "_relaxed" "$@"
+
+#gen_proto_order_variants(meta, pfx, name, sfx, atomic, int, args...)
+gen_proto_order_variants()
+{
+       local meta="$1"; shift
+       local pfx="$1"; shift
+       local name="$1"; shift
+       local sfx="$1"; shift
+       local atomic="$1"
+
+       gen_proto_order_variant "${meta}" "${pfx}" "${name}" "${sfx}" "" "$@"
+
+       if meta_has_acquire "${meta}"; then
+               gen_proto_order_variant "${meta}" "${pfx}" "${name}" "${sfx}" "_acquire" "$@"
        fi
 
-       printf "#else /* ${basename}_relaxed */\n\n"
+       if meta_has_release "${meta}"; then
+               gen_proto_order_variant "${meta}" "${pfx}" "${name}" "${sfx}" "_release" "$@"
+       fi
 
-       gen_template_fallback "${ATOMICDIR}/fallbacks/acquire"  "${meta}" "${pfx}" "${name}" "${sfx}" "_acquire" "$@"
-       gen_template_fallback "${ATOMICDIR}/fallbacks/release"  "${meta}" "${pfx}" "${name}" "${sfx}" "_release" "$@"
-       gen_template_fallback "${ATOMICDIR}/fallbacks/fence"  "${meta}" "${pfx}" "${name}" "${sfx}" "" "$@"
+       if meta_has_relaxed "${meta}"; then
+               gen_proto_order_variant "${meta}" "${pfx}" "${name}" "${sfx}" "_relaxed" "$@"
+       fi
+}
 
-       printf "#endif /* ${basename}_relaxed */\n\n"
+#gen_basic_fallbacks(basename)
+gen_basic_fallbacks()
+{
+       local basename="$1"; shift
+cat << EOF
+#define raw_${basename}_acquire arch_${basename}
+#define raw_${basename}_release arch_${basename}
+#define raw_${basename}_relaxed arch_${basename}
+EOF
 }
 
 gen_order_fallbacks()
@@ -130,36 +160,65 @@ gen_order_fallbacks()
 
 cat <<EOF
 
-#ifndef ${xchg}_acquire
-#define ${xchg}_acquire(...) \\
-       __atomic_op_acquire(${xchg}, __VA_ARGS__)
+#define raw_${xchg}_relaxed arch_${xchg}_relaxed
+
+#ifdef arch_${xchg}_acquire
+#define raw_${xchg}_acquire arch_${xchg}_acquire
+#else
+#define raw_${xchg}_acquire(...) \\
+       __atomic_op_acquire(arch_${xchg}, __VA_ARGS__)
 #endif
 
-#ifndef ${xchg}_release
-#define ${xchg}_release(...) \\
-       __atomic_op_release(${xchg}, __VA_ARGS__)
+#ifdef arch_${xchg}_release
+#define raw_${xchg}_release arch_${xchg}_release
+#else
+#define raw_${xchg}_release(...) \\
+       __atomic_op_release(arch_${xchg}, __VA_ARGS__)
 #endif
 
-#ifndef ${xchg}
-#define ${xchg}(...) \\
-       __atomic_op_fence(${xchg}, __VA_ARGS__)
+#ifdef arch_${xchg}
+#define raw_${xchg} arch_${xchg}
+#else
+#define raw_${xchg}(...) \\
+       __atomic_op_fence(arch_${xchg}, __VA_ARGS__)
 #endif
 
 EOF
 }
 
-gen_xchg_fallbacks()
+gen_xchg_order_fallback()
 {
        local xchg="$1"; shift
-       printf "#ifndef ${xchg}_relaxed\n"
+       local order="$1"; shift
+       local forder="${order:-_fence}"
 
-       gen_basic_fallbacks ${xchg}
+       printf "#if defined(arch_${xchg}${order})\n"
+       printf "#define raw_${xchg}${order} arch_${xchg}${order}\n"
 
-       printf "#else /* ${xchg}_relaxed */\n"
+       if [ "${order}" != "_relaxed" ]; then
+               printf "#elif defined(arch_${xchg}_relaxed)\n"
+               printf "#define raw_${xchg}${order}(...) \\\\\n"
+               printf "        __atomic_op${forder}(arch_${xchg}, __VA_ARGS__)\n"
+       fi
+
+       if [ ! -z "${order}" ]; then
+               printf "#elif defined(arch_${xchg})\n"
+               printf "#define raw_${xchg}${order} arch_${xchg}\n"
+       fi
 
-       gen_order_fallbacks ${xchg}
+       printf "#else\n"
+       printf "extern void raw_${xchg}${order}_not_implemented(void);\n"
+       printf "#define raw_${xchg}${order}(...) raw_${xchg}${order}_not_implemented()\n"
+       printf "#endif\n\n"
+}
+
+gen_xchg_fallbacks()
+{
+       local xchg="$1"; shift
 
-       printf "#endif /* ${xchg}_relaxed */\n\n"
+       for order in "" "_acquire" "_release" "_relaxed"; do
+               gen_xchg_order_fallback "${xchg}" "${order}"
+       done
 }
 
 gen_try_cmpxchg_fallback()
@@ -168,40 +227,61 @@ gen_try_cmpxchg_fallback()
        local order="$1"; shift;
 
 cat <<EOF
-#ifndef arch_try_${cmpxchg}${order}
-#define arch_try_${cmpxchg}${order}(_ptr, _oldp, _new) \\
+#define raw_try_${cmpxchg}${order}(_ptr, _oldp, _new) \\
 ({ \\
        typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \\
-       ___r = arch_${cmpxchg}${order}((_ptr), ___o, (_new)); \\
+       ___r = raw_${cmpxchg}${order}((_ptr), ___o, (_new)); \\
        if (unlikely(___r != ___o)) \\
                *___op = ___r; \\
        likely(___r == ___o); \\
 })
-#endif /* arch_try_${cmpxchg}${order} */
-
 EOF
 }
 
-gen_try_cmpxchg_fallbacks()
+gen_try_cmpxchg_order_fallback()
 {
-       local cmpxchg="$1"; shift;
+       local cmpxchg="$1"; shift
+       local order="$1"; shift
+       local forder="${order:-_fence}"
 
-       printf "#ifndef arch_try_${cmpxchg}_relaxed\n"
-       printf "#ifdef arch_try_${cmpxchg}\n"
+       printf "#if defined(arch_try_${cmpxchg}${order})\n"
+       printf "#define raw_try_${cmpxchg}${order} arch_try_${cmpxchg}${order}\n"
+
+       if [ "${order}" != "_relaxed" ]; then
+               printf "#elif defined(arch_try_${cmpxchg}_relaxed)\n"
+               printf "#define raw_try_${cmpxchg}${order}(...) \\\\\n"
+               printf "        __atomic_op${forder}(arch_try_${cmpxchg}, __VA_ARGS__)\n"
+       fi
 
-       gen_basic_fallbacks "arch_try_${cmpxchg}"
+       if [ ! -z "${order}" ]; then
+               printf "#elif defined(arch_try_${cmpxchg})\n"
+               printf "#define raw_try_${cmpxchg}${order} arch_try_${cmpxchg}\n"
+       fi
 
-       printf "#endif /* arch_try_${cmpxchg} */\n\n"
+       printf "#else\n"
+       gen_try_cmpxchg_fallback "${cmpxchg}" "${order}"
+       printf "#endif\n\n"
+}
+
+gen_try_cmpxchg_fallbacks()
+{
+       local cmpxchg="$1"; shift;
 
        for order in "" "_acquire" "_release" "_relaxed"; do
-               gen_try_cmpxchg_fallback "${cmpxchg}" "${order}"
+               gen_try_cmpxchg_order_fallback "${cmpxchg}" "${order}"
        done
+}
 
-       printf "#else /* arch_try_${cmpxchg}_relaxed */\n"
-
-       gen_order_fallbacks "arch_try_${cmpxchg}"
+gen_cmpxchg_local_fallbacks()
+{
+       local cmpxchg="$1"; shift
 
-       printf "#endif /* arch_try_${cmpxchg}_relaxed */\n\n"
+       printf "#define raw_${cmpxchg} arch_${cmpxchg}\n\n"
+       printf "#ifdef arch_try_${cmpxchg}\n"
+       printf "#define raw_try_${cmpxchg} arch_try_${cmpxchg}\n"
+       printf "#else\n"
+       gen_try_cmpxchg_fallback "${cmpxchg}" ""
+       printf "#endif\n\n"
 }
 
 cat << EOF
@@ -217,16 +297,20 @@ cat << EOF
 
 EOF
 
-for xchg in "arch_xchg" "arch_cmpxchg" "arch_cmpxchg64"; do
+for xchg in "xchg" "cmpxchg" "cmpxchg64" "cmpxchg128"; do
        gen_xchg_fallbacks "${xchg}"
 done
 
-for cmpxchg in "cmpxchg" "cmpxchg64"; do
+for cmpxchg in "cmpxchg" "cmpxchg64" "cmpxchg128"; do
        gen_try_cmpxchg_fallbacks "${cmpxchg}"
 done
 
-for cmpxchg in "cmpxchg_local" "cmpxchg64_local"; do
-       gen_try_cmpxchg_fallback "${cmpxchg}" ""
+for cmpxchg in "cmpxchg_local" "cmpxchg64_local" "cmpxchg128_local"; do
+       gen_cmpxchg_local_fallbacks "${cmpxchg}" ""
+done
+
+for cmpxchg in "sync_cmpxchg"; do
+       printf "#define raw_${cmpxchg} arch_${cmpxchg}\n\n"
 done
 
 grep '^[a-z]' "$1" | while read name meta args; do
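
With this rework, each ordering variant in the generated atomic-arch-fallback.h becomes a raw_*() function that picks its implementation at preprocessor time. A rough sketch of the output for raw_atomic_xchg_acquire(), derived from the printf/template logic above (per-op kerneldoc omitted; exact layout is whatever the script emits into include/linux/atomic/atomic-arch-fallback.h):

	static __always_inline int
	raw_atomic_xchg_acquire(atomic_t *v, int new)
	{
	#if defined(arch_atomic_xchg_acquire)
		return arch_atomic_xchg_acquire(v, new);
	#elif defined(arch_atomic_xchg_relaxed)
		/* order fallback: build acquire from the relaxed arch op */
		int ret = arch_atomic_xchg_relaxed(v, new);
		__atomic_acquire_fence();
		return ret;
	#elif defined(arch_atomic_xchg)
		return arch_atomic_xchg(v, new);
	#else
		/* generic fallback from scripts/atomic/fallbacks/xchg */
		return raw_xchg_acquire(&v->counter, new);
	#endif
	}
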
index d9ffd74..8f8f8e3 100755 (executable)
@@ -68,12 +68,14 @@ gen_proto_order_variant()
        local args="$(gen_args "$@")"
        local retstmt="$(gen_ret_stmt "${meta}")"
 
+       gen_kerneldoc "" "${meta}" "${pfx}" "${name}" "${sfx}" "${order}" "${atomic}" "${int}" "$@"
+
 cat <<EOF
 static __always_inline ${ret}
 ${atomicname}(${params})
 {
 ${checks}
-       ${retstmt}arch_${atomicname}(${args});
+       ${retstmt}raw_${atomicname}(${args});
 }
 EOF
 
@@ -84,7 +86,6 @@ gen_xchg()
 {
        local xchg="$1"; shift
        local order="$1"; shift
-       local mult="$1"; shift
 
        kcsan_barrier=""
        if [ "${xchg%_local}" = "${xchg}" ]; then
@@ -104,9 +105,9 @@ cat <<EOF
 EOF
 [ -n "$kcsan_barrier" ] && printf "\t${kcsan_barrier}; \\\\\n"
 cat <<EOF
-       instrument_atomic_read_write(__ai_ptr, ${mult}sizeof(*__ai_ptr)); \\
-       instrument_read_write(__ai_oldp, ${mult}sizeof(*__ai_oldp)); \\
-       arch_${xchg}${order}(__ai_ptr, __ai_oldp, __VA_ARGS__); \\
+       instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \\
+       instrument_read_write(__ai_oldp, sizeof(*__ai_oldp)); \\
+       raw_${xchg}${order}(__ai_ptr, __ai_oldp, __VA_ARGS__); \\
 })
 EOF
 
@@ -119,8 +120,8 @@ cat <<EOF
 EOF
 [ -n "$kcsan_barrier" ] && printf "\t${kcsan_barrier}; \\\\\n"
 cat <<EOF
-       instrument_atomic_read_write(__ai_ptr, ${mult}sizeof(*__ai_ptr)); \\
-       arch_${xchg}${order}(__ai_ptr, __VA_ARGS__); \\
+       instrument_atomic_read_write(__ai_ptr, sizeof(*__ai_ptr)); \\
+       raw_${xchg}${order}(__ai_ptr, __VA_ARGS__); \\
 })
 EOF
 
@@ -134,15 +135,10 @@ cat << EOF
 // DO NOT MODIFY THIS FILE DIRECTLY
 
 /*
- * This file provides wrappers with KASAN instrumentation for atomic operations.
- * To use this functionality an arch's atomic.h file needs to define all
- * atomic operations with arch_ prefix (e.g. arch_atomic_read()) and include
- * this file at the end. This file provides atomic_read() that forwards to
- * arch_atomic_read() for actual atomic operation.
- * Note: if an arch atomic operation is implemented by means of other atomic
- * operations (e.g. atomic_read()/atomic_cmpxchg() loop), then it needs to use
- * arch_ variants (i.e. arch_atomic_read()/arch_atomic_cmpxchg()) to avoid
- * double instrumentation.
+ * This file provides atomic operations with explicit instrumentation (e.g.
+ * KASAN, KCSAN), which should be used unless it is necessary to avoid
+ * instrumentation. Where it is necessary to avoid instrumentation, the
+ * raw_atomic*() operations should be used.
  */
 #ifndef _LINUX_ATOMIC_INSTRUMENTED_H
 #define _LINUX_ATOMIC_INSTRUMENTED_H
@@ -166,24 +162,18 @@ grep '^[a-z]' "$1" | while read name meta args; do
 done
 
 
-for xchg in "xchg" "cmpxchg" "cmpxchg64" "try_cmpxchg" "try_cmpxchg64"; do
+for xchg in "xchg" "cmpxchg" "cmpxchg64" "cmpxchg128" "try_cmpxchg" "try_cmpxchg64" "try_cmpxchg128"; do
        for order in "" "_acquire" "_release" "_relaxed"; do
-               gen_xchg "${xchg}" "${order}" ""
+               gen_xchg "${xchg}" "${order}"
                printf "\n"
        done
 done
 
-for xchg in "cmpxchg_local" "cmpxchg64_local" "sync_cmpxchg" "try_cmpxchg_local" "try_cmpxchg64_local" ; do
-       gen_xchg "${xchg}" "" ""
+for xchg in "cmpxchg_local" "cmpxchg64_local" "cmpxchg128_local" "sync_cmpxchg" "try_cmpxchg_local" "try_cmpxchg64_local" "try_cmpxchg128_local"; do
+       gen_xchg "${xchg}" ""
        printf "\n"
 done
 
-gen_xchg "cmpxchg_double" "" "2 * "
-
-printf "\n\n"
-
-gen_xchg "cmpxchg_double_local" "" "2 * "
-
 cat <<EOF
 
 #endif /* _LINUX_ATOMIC_INSTRUMENTED_H */
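
With the arch_ indirection gone, the instrumented wrappers in the generated atomic-instrumented.h forward to the raw_*() ops after the instrumentation calls. A sketch of one expansion (the exact instrument_*()/kcsan_*() calls substituted for ${checks} depend on the operation's read/write and ordering semantics, assembled elsewhere in this script):

	static __always_inline int
	atomic_add_return(int i, atomic_t *v)
	{
		kcsan_mb();					/* full-barrier op */
		instrument_atomic_read_write(v, sizeof(*v));	/* KASAN/KCSAN hooks */
		return raw_atomic_add_return(i, v);
	}
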
index eda89ce..9826be3 100755 (executable)
@@ -32,24 +32,34 @@ gen_args_cast()
        done
 }
 
-#gen_proto_order_variant(meta, pfx, name, sfx, order, atomic, int, arg...)
+#gen_proto_order_variant(meta, pfx, name, sfx, order, arg...)
 gen_proto_order_variant()
 {
        local meta="$1"; shift
-       local name="$1$2$3$4"; shift; shift; shift; shift
-       local atomic="$1"; shift
-       local int="$1"; shift
+       local pfx="$1"; shift
+       local name="$1"; shift
+       local sfx="$1"; shift
+       local order="$1"; shift
+
+       local atomicname="${pfx}${name}${sfx}${order}"
 
        local ret="$(gen_ret_type "${meta}" "long")"
        local params="$(gen_params "long" "atomic_long" "$@")"
-       local argscast="$(gen_args_cast "${int}" "${atomic}" "$@")"
+       local argscast_32="$(gen_args_cast "int" "atomic" "$@")"
+       local argscast_64="$(gen_args_cast "s64" "atomic64" "$@")"
        local retstmt="$(gen_ret_stmt "${meta}")"
 
+       gen_kerneldoc "raw_" "${meta}" "${pfx}" "${name}" "${sfx}" "${order}" "atomic_long" "long" "$@"
+
 cat <<EOF
 static __always_inline ${ret}
-arch_atomic_long_${name}(${params})
+raw_atomic_long_${atomicname}(${params})
 {
-       ${retstmt}arch_${atomic}_${name}(${argscast});
+#ifdef CONFIG_64BIT
+       ${retstmt}raw_atomic64_${atomicname}(${argscast_64});
+#else
+       ${retstmt}raw_atomic_${atomicname}(${argscast_32});
+#endif
 }
 
 EOF
@@ -79,24 +89,12 @@ typedef atomic_t atomic_long_t;
 #define atomic_long_cond_read_relaxed  atomic_cond_read_relaxed
 #endif
 
-#ifdef CONFIG_64BIT
-
-EOF
-
-grep '^[a-z]' "$1" | while read name meta args; do
-       gen_proto "${meta}" "${name}" "atomic64" "s64" ${args}
-done
-
-cat <<EOF
-#else /* CONFIG_64BIT */
-
 EOF
 
 grep '^[a-z]' "$1" | while read name meta args; do
-       gen_proto "${meta}" "${name}" "atomic" "int" ${args}
+       gen_proto "${meta}" "${name}" ${args}
 done
 
 cat <<EOF
-#endif /* CONFIG_64BIT */
 #endif /* _LINUX_ATOMIC_LONG_H */
 EOF
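
Instead of emitting two complete copies of the atomic_long API under #ifdef CONFIG_64BIT, each generated raw_atomic_long_*() op now carries the dispatch itself. Roughly, with the argument casts between atomic_long_t and the underlying atomic type elided:

	static __always_inline long
	raw_atomic_long_read(const atomic_long_t *v)
	{
	#ifdef CONFIG_64BIT
		return raw_atomic64_read(v);
	#else
		return raw_atomic_read(v);
	#endif
	}
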
diff --git a/scripts/atomic/kerneldoc/add b/scripts/atomic/kerneldoc/add
new file mode 100644 (file)
index 0000000..991f3da
--- /dev/null
@@ -0,0 +1,13 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic add with ${desc_order} ordering
+ * @i: ${int} value to add
+ * @v: pointer to ${atomic}_t
+ *
+ * Atomically updates @v to (@v + @i) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * ${desc_return}
+ */
+EOF
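
Fed through gen_template_kerneldoc() above, this template expands into a conventional kerneldoc block. For example, for the void raw_atomic_add() (meta 'v', hence implicitly relaxed, class "raw_") the generated comment is roughly:

	/**
	 * raw_atomic_add() - atomic add with relaxed ordering
	 * @i: int value to add
	 * @v: pointer to atomic_t
	 *
	 * Atomically updates @v to (@v + @i) with relaxed ordering.
	 *
	 * Safe to use in noinstr code; prefer atomic_add() elsewhere.
	 *
	 * Return: Nothing.
	 */
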
diff --git a/scripts/atomic/kerneldoc/add_negative b/scripts/atomic/kerneldoc/add_negative
new file mode 100644 (file)
index 0000000..f4ca1f0
--- /dev/null
@@ -0,0 +1,13 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic add and test if negative with ${desc_order} ordering
+ * @i: ${int} value to add
+ * @v: pointer to ${atomic}_t
+ *
+ * Atomically updates @v to (@v + @i) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * Return: @true if the resulting value of @v is negative, @false otherwise.
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/add_unless b/scripts/atomic/kerneldoc/add_unless
new file mode 100644 (file)
index 0000000..f828e5f
--- /dev/null
@@ -0,0 +1,18 @@
+if [ -z "${pfx}" ]; then
+       desc_return="Return: @true if @v was updated, @false otherwise."
+fi
+
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic add unless value with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ * @a: ${int} value to add
+ * @u: ${int} value to compare with
+ *
+ * If (@v != @u), atomically updates @v to (@v + @a) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * ${desc_return}
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/and b/scripts/atomic/kerneldoc/and
new file mode 100644 (file)
index 0000000..a923574
--- /dev/null
@@ -0,0 +1,13 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic bitwise AND with ${desc_order} ordering
+ * @i: ${int} value
+ * @v: pointer to ${atomic}_t
+ *
+ * Atomically updates @v to (@v & @i) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * ${desc_return}
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/andnot b/scripts/atomic/kerneldoc/andnot
new file mode 100644 (file)
index 0000000..64bb509
--- /dev/null
@@ -0,0 +1,13 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic bitwise AND NOT with ${desc_order} ordering
+ * @i: ${int} value
+ * @v: pointer to ${atomic}_t
+ *
+ * Atomically updates @v to (@v & ~@i) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * ${desc_return}
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/cmpxchg b/scripts/atomic/kerneldoc/cmpxchg
new file mode 100644 (file)
index 0000000..3bce328
--- /dev/null
@@ -0,0 +1,14 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic compare and exchange with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ * @old: ${int} value to compare with
+ * @new: ${int} value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * Return: The original value of @v.
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/dec b/scripts/atomic/kerneldoc/dec
new file mode 100644 (file)
index 0000000..bbeecbc
--- /dev/null
@@ -0,0 +1,12 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic decrement with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ *
+ * Atomically updates @v to (@v - 1) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * ${desc_return}
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/dec_and_test b/scripts/atomic/kerneldoc/dec_and_test
new file mode 100644 (file)
index 0000000..71bbd23
--- /dev/null
@@ -0,0 +1,12 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic decrement and test if zero with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ *
+ * Atomically updates @v to (@v - 1) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/dec_if_positive b/scripts/atomic/kerneldoc/dec_if_positive
new file mode 100644 (file)
index 0000000..04f1aed
--- /dev/null
@@ -0,0 +1,12 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic decrement if positive with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ *
+ * If (@v > 0), atomically updates @v to (@v - 1) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * Return: The old value of @v minus one, regardless of whether @v was updated.
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/dec_unless_positive b/scripts/atomic/kerneldoc/dec_unless_positive
new file mode 100644 (file)
index 0000000..ee73612
--- /dev/null
@@ -0,0 +1,12 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic decrement unless positive with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ *
+ * If (@v <= 0), atomically updates @v to (@v - 1) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/inc b/scripts/atomic/kerneldoc/inc
new file mode 100644 (file)
index 0000000..9f14f1b
--- /dev/null
@@ -0,0 +1,12 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic increment with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ *
+ * Atomically updates @v to (@v + 1) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * ${desc_return}
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/inc_and_test b/scripts/atomic/kerneldoc/inc_and_test
new file mode 100644 (file)
index 0000000..971694d
--- /dev/null
@@ -0,0 +1,12 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic increment and test if zero with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ *
+ * Atomically updates @v to (@v + 1) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/inc_not_zero b/scripts/atomic/kerneldoc/inc_not_zero
new file mode 100644 (file)
index 0000000..618be08
--- /dev/null
@@ -0,0 +1,12 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic increment unless zero with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ *
+ * If (@v != 0), atomically updates @v to (@v + 1) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/inc_unless_negative b/scripts/atomic/kerneldoc/inc_unless_negative
new file mode 100644 (file)
index 0000000..597f23d
--- /dev/null
@@ -0,0 +1,12 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic increment unless negative with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ *
+ * If (@v >= 0), atomically updates @v to (@v + 1) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * Return: @true if @v was updated, @false otherwise.
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/or b/scripts/atomic/kerneldoc/or
new file mode 100644 (file)
index 0000000..55b33de
--- /dev/null
@@ -0,0 +1,13 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic bitwise OR with ${desc_order} ordering
+ * @i: ${int} value
+ * @v: pointer to ${atomic}_t
+ *
+ * Atomically updates @v to (@v | @i) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * ${desc_return}
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/read b/scripts/atomic/kerneldoc/read
new file mode 100644 (file)
index 0000000..89fe614
--- /dev/null
@@ -0,0 +1,12 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic load with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ *
+ * Atomically loads the value of @v with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * Return: The value loaded from @v.
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/set b/scripts/atomic/kerneldoc/set
new file mode 100644 (file)
index 0000000..e82cb9e
--- /dev/null
@@ -0,0 +1,13 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic set with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ * @i: ${int} value to assign
+ *
+ * Atomically sets @v to @i with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * Return: Nothing.
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/sub b/scripts/atomic/kerneldoc/sub
new file mode 100644 (file)
index 0000000..3ba642d
--- /dev/null
@@ -0,0 +1,13 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic subtract with ${desc_order} ordering
+ * @i: ${int} value to subtract
+ * @v: pointer to ${atomic}_t
+ *
+ * Atomically updates @v to (@v - @i) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * ${desc_return}
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/sub_and_test b/scripts/atomic/kerneldoc/sub_and_test
new file mode 100644 (file)
index 0000000..d3760f7
--- /dev/null
@@ -0,0 +1,13 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic subtract and test if zero with ${desc_order} ordering
+ * @i: ${int} value to subtract
+ * @v: pointer to ${atomic}_t
+ *
+ * Atomically updates @v to (@v - @i) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * Return: @true if the resulting value of @v is zero, @false otherwise.
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/try_cmpxchg b/scripts/atomic/kerneldoc/try_cmpxchg
new file mode 100644 (file)
index 0000000..2965532
--- /dev/null
@@ -0,0 +1,15 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic compare and exchange with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ * @old: pointer to ${int} value to compare with
+ * @new: ${int} value to assign
+ *
+ * If (@v == @old), atomically updates @v to @new with ${desc_order} ordering.
+ * Otherwise, updates @old to the current value of @v.
+ *
+ * ${desc_noinstr}
+ *
+ * Return: @true if the exchange occurred, @false otherwise.
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/xchg b/scripts/atomic/kerneldoc/xchg
new file mode 100644 (file)
index 0000000..75f04c0
--- /dev/null
@@ -0,0 +1,13 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic exchange with ${desc_order} ordering
+ * @v: pointer to ${atomic}_t
+ * @new: ${int} value to assign
+ *
+ * Atomically updates @v to @new with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * Return: The original value of @v.
+ */
+EOF
diff --git a/scripts/atomic/kerneldoc/xor b/scripts/atomic/kerneldoc/xor
new file mode 100644 (file)
index 0000000..8837270
--- /dev/null
@@ -0,0 +1,13 @@
+cat <<EOF
+/**
+ * ${class}${atomicname}() - atomic bitwise XOR with ${desc_order} ordering
+ * @i: ${int} value
+ * @v: pointer to ${atomic}_t
+ *
+ * Atomically updates @v to (@v ^ @i) with ${desc_order} ordering.
+ *
+ * ${desc_noinstr}
+ *
+ * ${desc_return}
+ */
+EOF
index edc9a62..4f163e0 100755 (executable)
@@ -146,16 +146,6 @@ curtable && /\.procname[\t ]*=[\t ]*".+"/ {
     children[curtable][curentry] = child
 }
 
-/register_sysctl_table\(.*\)/ {
-    match($0, /register_sysctl_table\(([^)]+)\)/, tables)
-    if (debug) print "Registering table " tables[1]
-    if (children[tables[1]][table]) {
-       for (entry in entries[children[tables[1]][table]]) {
-           printentry(entry)
-       }
-    }
-}
-
 END {
     for (entry in documented) {
        if (!seen[entry]) {
index b30114d..7bfa4d3 100755 (executable)
@@ -6997,10 +6997,22 @@ sub process {
 #                      }
 #              }
 
+# strcpy uses that should likely be strscpy
+               if ($line =~ /\bstrcpy\s*\(/) {
+                       WARN("STRCPY",
+                            "Prefer strscpy over strcpy - see: https://github.com/KSPP/linux/issues/88\n" . $herecurr);
+               }
+
 # strlcpy uses that should likely be strscpy
                if ($line =~ /\bstrlcpy\s*\(/) {
                        WARN("STRLCPY",
-                            "Prefer strscpy over strlcpy - see: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw\@mail.gmail.com/\n" . $herecurr);
+                            "Prefer strscpy over strlcpy - see: https://github.com/KSPP/linux/issues/89\n" . $herecurr);
+               }
+
+# strncpy uses that should likely be strscpy or strscpy_pad
+               if ($line =~ /\bstrncpy\s*\(/) {
+                       WARN("STRNCPY",
+                            "Prefer strscpy, strscpy_pad, or __nonstring over strncpy - see: https://github.com/KSPP/linux/issues/90\n" . $herecurr);
                }
 
 # typecasts on min/max could be min_t/max_t
@@ -7418,6 +7430,16 @@ sub process {
                        }
                }
 
+# check for array definition/declarations that should use flexible arrays instead
+               if ($sline =~ /^[\+ ]\s*\}(?:\s*__packed)?\s*;\s*$/ &&
+                   $prevline =~ /^\+\s*(?:\}(?:\s*__packed\s*)?|$Type)\s*$Ident\s*\[\s*(0|1)\s*\]\s*;\s*$/) {
+                       if (ERROR("FLEXIBLE_ARRAY",
+                                 "Use C99 flexible arrays - see https://docs.kernel.org/process/deprecated.html#zero-length-and-one-element-arrays\n" . $hereprev) &&
+                           $1 == '0' && $fix) {
+                               $fixed[$fixlinenr - 1] =~ s/\[\s*0\s*\]/[]/;
+                       }
+               }
+
 # nested likely/unlikely calls
                if ($line =~ /\b(?:(?:un)?likely)\s*\(\s*!?\s*(IS_ERR(?:_OR_NULL|_VALUE)?|WARN)/) {
                        WARN("LIKELY_MISUSE",
index 2486689..eb70c1f 100755 (executable)
@@ -64,7 +64,7 @@ my $type_constant = '\b``([^\`]+)``\b';
 my $type_constant2 = '\%([-_\w]+)';
 my $type_func = '(\w+)\(\)';
 my $type_param = '\@(\w*((\.\w+)|(->\w+))*(\.\.\.)?)';
-my $type_param_ref = '([\!]?)\@(\w*((\.\w+)|(->\w+))*(\.\.\.)?)';
+my $type_param_ref = '([\!~]?)\@(\w*((\.\w+)|(->\w+))*(\.\.\.)?)';
 my $type_fp_param = '\@(\w+)\(\)';  # Special RST handling for func ptr params
 my $type_fp_param2 = '\@(\w+->\S+)\(\)';  # Special RST handling for structs with func ptr params
 my $type_env = '(\$\w+)';
index 20d483e..dfd1863 100755 (executable)
@@ -17,7 +17,11 @@ binutils)
        echo 2.25.0
        ;;
 gcc)
-       echo 5.1.0
+       if [ "$SRCARCH" = parisc ]; then
+               echo 11.0.0
+       else
+               echo 5.1.0
+       fi
        ;;
 llvm)
        if [ "$SRCARCH" = s390 ]; then
@@ -27,7 +31,7 @@ llvm)
        fi
        ;;
 rustc)
-       echo 1.62.0
+       echo 1.68.2
        ;;
 bindgen)
        echo 0.56.0
index d4531d0..c12150f 100644 (file)
@@ -1979,6 +1979,11 @@ static void add_header(struct buffer *b, struct module *mod)
        buf_printf(b, "#include <linux/vermagic.h>\n");
        buf_printf(b, "#include <linux/compiler.h>\n");
        buf_printf(b, "\n");
+       buf_printf(b, "#ifdef CONFIG_UNWINDER_ORC\n");
+       buf_printf(b, "#include <asm/orc_header.h>\n");
+       buf_printf(b, "ORC_HEADER;\n");
+       buf_printf(b, "#endif\n");
+       buf_printf(b, "\n");
        buf_printf(b, "BUILD_SALT;\n");
        buf_printf(b, "BUILD_LTO_INFO;\n");
        buf_printf(b, "\n");
diff --git a/scripts/orc_hash.sh b/scripts/orc_hash.sh
new file mode 100644 (file)
index 0000000..466611a
--- /dev/null
@@ -0,0 +1,16 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+
+set -e
+
+printf '%s' '#define ORC_HASH '
+
+awk '
+/^#define ORC_(REG|TYPE)_/ { print }
+/^struct orc_entry {$/ { p=1 }
+p { print }
+/^}/ { p=0 }' |
+       sha1sum |
+       cut -d " " -f 1 |
+       sed 's/\([0-9a-f]\{2\}\)/0x\1,/g'
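
The script hashes the ORC definitions piped in by the build and emits a single preprocessor line; the bytes below are placeholders, the real value is the SHA-1 of those definitions:

	/* example output only; actual bytes come from sha1sum of the ORC definitions */
	#define ORC_HASH 0x3e,0x47,0x1a,0x9b,0x02,0xc5,0x6d,0xe0,0x51,0x7f,0x8c,0x24,0xab,0x19,0x60,0xd3,0x4e,0x72,0x05,0xfa,

Each generated *.mod.c then embeds this hash via ORC_HEADER when CONFIG_UNWINDER_ORC is enabled (see the modpost hunk above), so a mismatch in the ORC format is caught rather than silently miscompiled.
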
index f8bd617..fc7ba95 100644 (file)
@@ -155,6 +155,7 @@ aquired||acquired
 aquisition||acquisition
 arbitary||arbitrary
 architechture||architecture
+archtecture||architecture
 arguement||argument
 arguements||arguments
 arithmatic||arithmetic
@@ -279,6 +280,7 @@ cant'||can't
 canot||cannot
 cann't||can't
 cannnot||cannot
+capabiity||capability
 capabilites||capabilities
 capabilties||capabilities
 capabilty||capability
@@ -426,6 +428,7 @@ cotrol||control
 cound||could
 couter||counter
 coutner||counter
+creationg||creating
 cryptocraphic||cryptographic
 cummulative||cumulative
 cunter||counter
@@ -492,6 +495,7 @@ destorys||destroys
 destroied||destroyed
 detabase||database
 deteced||detected
+detecion||detection
 detectt||detect
 detroyed||destroyed
 develope||develop
@@ -513,6 +517,7 @@ diferent||different
 differrence||difference
 diffrent||different
 differenciate||differentiate
+diffreential||differential
 diffrentiate||differentiate
 difinition||definition
 digial||digital
@@ -617,6 +622,7 @@ evalute||evaluate
 evalutes||evaluates
 evalution||evaluation
 excecutable||executable
+excceed||exceed
 exceded||exceeded
 exceds||exceeds
 exceeed||exceed
@@ -632,6 +638,7 @@ existant||existent
 exixt||exist
 exsits||exists
 exlcude||exclude
+exlcuding||excluding
 exlcusive||exclusive
 exlusive||exclusive
 exmaple||example
@@ -726,6 +733,8 @@ generiously||generously
 genereate||generate
 genereted||generated
 genric||generic
+gerenal||general
+geting||getting
 globel||global
 grabing||grabbing
 grahical||graphical
@@ -899,6 +908,7 @@ iteraions||iterations
 iternations||iterations
 itertation||iteration
 itslef||itself
+ivalid||invalid
 jave||java
 jeffies||jiffies
 jumpimng||jumping
@@ -977,6 +987,7 @@ microprocesspr||microprocessor
 migrateable||migratable
 millenium||millennium
 milliseonds||milliseconds
+minimim||minimum
 minium||minimum
 minimam||minimum
 minimun||minimum
@@ -1042,6 +1053,7 @@ notifed||notified
 notity||notify
 nubmer||number
 numebr||number
+numer||number
 numner||number
 nunber||number
 obtaion||obtain
@@ -1061,6 +1073,7 @@ offet||offset
 offlaod||offload
 offloded||offloaded
 offseting||offsetting
+oflload||offload
 omited||omitted
 omiting||omitting
 omitt||omit
@@ -1105,6 +1118,7 @@ pakage||package
 paket||packet
 pallette||palette
 paln||plan
+palne||plane
 paramameters||parameters
 paramaters||parameters
 paramater||parameter
@@ -1181,12 +1195,14 @@ previsously||previously
 primative||primitive
 princliple||principle
 priorty||priority
+priting||printing
 privilaged||privileged
 privilage||privilege
 priviledge||privilege
 priviledges||privileges
 privleges||privileges
 probaly||probably
+probabalistic||probabilistic
 procceed||proceed
 proccesors||processors
 procesed||processed
@@ -1460,6 +1476,7 @@ submited||submitted
 submition||submission
 succeded||succeeded
 suceed||succeed
+succesfuly||successfully
 succesfully||successfully
 succesful||successful
 successed||succeeded
@@ -1503,6 +1520,7 @@ symetric||symmetric
 synax||syntax
 synchonized||synchronized
 sychronization||synchronization
+sychronously||synchronously
 synchronuously||synchronously
 syncronize||synchronize
 syncronized||synchronized
@@ -1532,6 +1550,7 @@ threee||three
 threshhold||threshold
 thresold||threshold
 throught||through
+tansition||transition
 trackling||tracking
 troughput||throughput
 trys||tries
@@ -1611,6 +1630,7 @@ unneccessary||unnecessary
 unnecesary||unnecessary
 unneedingly||unnecessarily
 unnsupported||unsupported
+unuspported||unsupported
 unmached||unmatched
 unprecise||imprecise
 unpriviledged||unprivileged
@@ -1657,6 +1677,7 @@ verfication||verification
 veriosn||version
 verisons||versions
 verison||version
+veritical||vertical
 verson||version
 vicefersa||vice-versa
 virtal||virtual
@@ -1677,6 +1698,7 @@ whenver||whenever
 wheter||whether
 whe||when
 wierd||weird
+wihout||without
 wiil||will
 wirte||write
 withing||within
index 0b3fc2f..ab5742a 100644 (file)
@@ -314,7 +314,7 @@ int cap_inode_need_killpriv(struct dentry *dentry)
  * the vfsmount must be passed through @idmap. This function will then
  * take care to map the inode according to @idmap before checking
  * permissions. On non-idmapped mounts or if permission checking is to be
- * performed on the raw inode simply passs @nop_mnt_idmap.
+ * performed on the raw inode simply pass @nop_mnt_idmap.
  *
  * Return: 0 if successful, -ve on error.
  */
@@ -522,7 +522,7 @@ static bool validheader(size_t size, const struct vfs_cap_data *cap)
  * the vfsmount must be passed through @idmap. This function will then
  * take care to map the inode according to @idmap before checking
  * permissions. On non-idmapped mounts or if permission checking is to be
- * performed on the raw inode simply passs @nop_mnt_idmap.
+ * performed on the raw inode simply pass @nop_mnt_idmap.
  *
  * Return: On success, return the new size; on error, return < 0.
  */
@@ -630,7 +630,7 @@ static inline int bprm_caps_from_vfs_caps(struct cpu_vfs_cap_data *caps,
  * the vfsmount must be passed through @idmap. This function will then
  * take care to map the inode according to @idmap before checking
  * permissions. On non-idmapped mounts or if permission checking is to be
- * performed on the raw inode simply passs @nop_mnt_idmap.
+ * performed on the raw inode simply pass @nop_mnt_idmap.
  */
 int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
                           const struct dentry *dentry,
@@ -1133,7 +1133,7 @@ int cap_task_fix_setuid(struct cred *new, const struct cred *old, int flags)
                break;
 
        case LSM_SETID_FS:
-               /* juggle the capabilties to follow FSUID changes, unless
+               /* juggle the capabilities to follow FSUID changes, unless
                 * otherwise suppressed
                 *
                 * FIXME - is fsuser used for all CAP_FS_MASK capabilities?
@@ -1184,10 +1184,10 @@ static int cap_safe_nice(struct task_struct *p)
 }
 
 /**
- * cap_task_setscheduler - Detemine if scheduler policy change is permitted
+ * cap_task_setscheduler - Determine if scheduler policy change is permitted
  * @p: The task to affect
  *
- * Detemine if the requested scheduler policy change is permitted for the
+ * Determine if the requested scheduler policy change is permitted for the
  * specified task.
  *
  * Return: 0 if permission is granted, -ve if denied.
@@ -1198,11 +1198,11 @@ int cap_task_setscheduler(struct task_struct *p)
 }
 
 /**
- * cap_task_setioprio - Detemine if I/O priority change is permitted
+ * cap_task_setioprio - Determine if I/O priority change is permitted
  * @p: The task to affect
  * @ioprio: The I/O priority to set
  *
- * Detemine if the requested I/O priority change is permitted for the specified
+ * Determine if the requested I/O priority change is permitted for the specified
  * task.
  *
  * Return: 0 if permission is granted, -ve if denied.
@@ -1213,11 +1213,11 @@ int cap_task_setioprio(struct task_struct *p, int ioprio)
 }
 
 /**
- * cap_task_setnice - Detemine if task priority change is permitted
+ * cap_task_setnice - Determine if task priority change is permitted
  * @p: The task to affect
  * @nice: The nice value to set
  *
- * Detemine if the requested task priority change is permitted for the
+ * Determine if the requested task priority change is permitted for the
  * specified task.
  *
  * Return: 0 if permission is granted, -ve if denied.
index 7507d14..dc4df74 100644 (file)
@@ -421,7 +421,7 @@ static bool verify_new_ex(struct dev_cgroup *dev_cgroup,
                } else {
                        /*
                         * new exception in the child will add more devices
-                        * that can be acessed, so it can't match any of
+                        * that can be accessed, so it can't match any of
                         * parent's exceptions, even slightly
                         */ 
                        match = match_exception_partial(&dev_cgroup->exceptions,
@@ -822,7 +822,6 @@ struct cgroup_subsys devices_cgrp_subsys = {
 
 /**
  * devcgroup_legacy_check_permission - checks if an inode operation is permitted
- * @dev_cgroup: the dev cgroup to be tested against
  * @type: device type
  * @major: device major number
  * @minor: device minor number
index 033804f..0dae649 100644 (file)
@@ -40,7 +40,7 @@ static const char evm_hmac[] = "hmac(sha1)";
 /**
  * evm_set_key() - set EVM HMAC key from the kernel
  * @key: pointer to a buffer with the key data
- * @size: length of the key data
+ * @keylen: length of the key data
  *
  * This function allows setting the EVM HMAC key from the kernel
  * without using the "encrypted" key subsystem keys. It can be used
index cf24c52..c9b6e2a 100644 (file)
@@ -318,7 +318,6 @@ int evm_protected_xattr_if_enabled(const char *req_xattr_name)
 /**
  * evm_read_protected_xattrs - read EVM protected xattr names, lengths, values
  * @dentry: dentry of the read xattrs
- * @inode: inode of the read xattrs
  * @buffer: buffer xattr names, lengths or values are copied to
  * @buffer_size: size of buffer
  * @type: n: names, l: lengths, v: values
@@ -390,6 +389,7 @@ int evm_read_protected_xattrs(struct dentry *dentry, u8 *buffer,
  * @xattr_name: requested xattr
  * @xattr_value: requested xattr value
  * @xattr_value_len: requested xattr value length
+ * @iint: inode integrity metadata
  *
  * Calculate the HMAC for the given dentry and verify it against the stored
  * security.evm xattr. For performance, use the xattr value and length
@@ -795,7 +795,9 @@ static int evm_attr_change(struct mnt_idmap *idmap,
 
 /**
  * evm_inode_setattr - prevent updating an invalid EVM extended attribute
+ * @idmap: idmap of the mount
  * @dentry: pointer to the affected dentry
+ * @attr: iattr structure containing the new file attributes
  *
  * Permit update of file attributes when files have a valid EVM signature,
  * except in the case of them having an immutable portable signature.
index c73858e..a462df8 100644 (file)
@@ -43,12 +43,10 @@ static struct integrity_iint_cache *__integrity_iint_find(struct inode *inode)
                else if (inode > iint->inode)
                        n = n->rb_right;
                else
-                       break;
+                       return iint;
        }
-       if (!n)
-               return NULL;
 
-       return iint;
+       return NULL;
 }
 
 /*
@@ -113,10 +111,15 @@ struct integrity_iint_cache *integrity_inode_get(struct inode *inode)
                parent = *p;
                test_iint = rb_entry(parent, struct integrity_iint_cache,
                                     rb_node);
-               if (inode < test_iint->inode)
+               if (inode < test_iint->inode) {
                        p = &(*p)->rb_left;
-               else
+               } else if (inode > test_iint->inode) {
                        p = &(*p)->rb_right;
+               } else {
+                       write_unlock(&integrity_iint_lock);
+                       kmem_cache_free(iint_cache, iint);
+                       return test_iint;
+               }
        }
 
        iint->inode = inode;
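
The two hunks above in the integrity iint cache reshape the lookup: __integrity_iint_find() now returns the matching node straight from the walk, and integrity_inode_get() spots an existing entry while descending the tree, drops the lock, frees the iint it allocated up front and hands back the node already in the tree instead of inserting a duplicate. The same find-or-insert shape, reduced to a plain binary search tree in userspace C, looks roughly like the sketch below (illustrative only, with invented names; the kernel uses an rbtree protected by integrity_iint_lock):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct node {
          const void *key;
          struct node *left, *right;
  };

  /* Allocate up front, then either link the new node where the search
   * ended or free it again if the key is already present. */
  static struct node *find_or_insert(struct node **root, const void *key)
  {
          struct node *n = calloc(1, sizeof(*n));
          struct node **p = root;

          if (!n)
                  return NULL;

          while (*p) {
                  if ((uintptr_t)key < (uintptr_t)(*p)->key) {
                          p = &(*p)->left;
                  } else if ((uintptr_t)key > (uintptr_t)(*p)->key) {
                          p = &(*p)->right;
                  } else {
                          free(n);        /* duplicate: reuse existing node */
                          return *p;
                  }
          }
          n->key = key;
          *p = n;
          return n;
  }

  int main(void)
  {
          struct node *root = NULL;
          int a, b;

          struct node *n1 = find_or_insert(&root, &a);
          struct node *n2 = find_or_insert(&root, &b);
          struct node *n3 = find_or_insert(&root, &a);   /* same key again */

          printf("duplicate returned existing node: %s\n",
                 (n1 == n3 && n1 != n2) ? "yes" : "no");
          return 0;
  }
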
index d3662f4..452e80b 100644 (file)
@@ -13,7 +13,6 @@
 #include <linux/fs.h>
 #include <linux/xattr.h>
 #include <linux/evm.h>
-#include <linux/iversion.h>
 #include <linux/fsverity.h>
 
 #include "ima.h"
@@ -202,19 +201,19 @@ int ima_get_action(struct mnt_idmap *idmap, struct inode *inode,
                                allowed_algos);
 }
 
-static int ima_get_verity_digest(struct integrity_iint_cache *iint,
-                                struct ima_max_digest_data *hash)
+static bool ima_get_verity_digest(struct integrity_iint_cache *iint,
+                                 struct ima_max_digest_data *hash)
 {
-       enum hash_algo verity_alg;
-       int ret;
+       enum hash_algo alg;
+       int digest_len;
 
        /*
         * On failure, 'measure' policy rules will result in a file data
         * hash containing 0's.
         */
-       ret = fsverity_get_digest(iint->inode, hash->digest, &verity_alg);
-       if (ret)
-               return ret;
+       digest_len = fsverity_get_digest(iint->inode, hash->digest, NULL, &alg);
+       if (digest_len == 0)
+               return false;
 
        /*
         * Unlike in the case of actually calculating the file hash, in
@@ -223,9 +222,9 @@ static int ima_get_verity_digest(struct integrity_iint_cache *iint,
         * mismatch between the verity algorithm and the xattr signature
         * algorithm, if one exists, will be detected later.
         */
-       hash->hdr.algo = verity_alg;
-       hash->hdr.length = hash_digest_size[verity_alg];
-       return 0;
+       hash->hdr.algo = alg;
+       hash->hdr.length = digest_len;
+       return true;
 }
 
 /*
@@ -246,10 +245,11 @@ int ima_collect_measurement(struct integrity_iint_cache *iint,
        struct inode *inode = file_inode(file);
        const char *filename = file->f_path.dentry->d_name.name;
        struct ima_max_digest_data hash;
+       struct kstat stat;
        int result = 0;
        int length;
        void *tmpbuf;
-       u64 i_version;
+       u64 i_version = 0;
 
        /*
         * Always collect the modsig, because IMA might have already collected
@@ -268,7 +268,10 @@ int ima_collect_measurement(struct integrity_iint_cache *iint,
         * to an initial measurement/appraisal/audit, but was modified to
         * assume the file changed.
         */
-       i_version = inode_query_iversion(inode);
+       result = vfs_getattr_nosec(&file->f_path, &stat, STATX_CHANGE_COOKIE,
+                                  AT_STATX_SYNC_AS_STAT);
+       if (!result && (stat.result_mask & STATX_CHANGE_COOKIE))
+               i_version = stat.change_cookie;
        hash.hdr.algo = algo;
        hash.hdr.length = hash_digest_size[algo];
 
@@ -276,16 +279,9 @@ int ima_collect_measurement(struct integrity_iint_cache *iint,
        memset(&hash.digest, 0, sizeof(hash.digest));
 
        if (iint->flags & IMA_VERITY_REQUIRED) {
-               result = ima_get_verity_digest(iint, &hash);
-               switch (result) {
-               case 0:
-                       break;
-               case -ENODATA:
+               if (!ima_get_verity_digest(iint, &hash)) {
                        audit_cause = "no-verity-digest";
-                       break;
-               default:
-                       audit_cause = "invalid-verity-digest";
-                       break;
+                       result = -ENODATA;
                }
        } else if (buf) {
                result = ima_calc_buffer_hash(buf, size, &hash.hdr);
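
IMA now obtains the file change cookie through vfs_getattr_nosec() with STATX_CHANGE_COOKIE instead of reading i_version directly; when the filesystem cannot report a cookie, i_version stays 0 and the file is simply treated as changed on the next access. A condensed kernel-context sketch of the pattern used here and in the ima_check_last_writer() hunk further below (the helper name is invented and this is not a standalone program):

  /* Hypothetical helper: return the change cookie for @file, or 0 when the
   * filesystem cannot provide one ("unknown", i.e. assume the file changed). */
  static u64 ima_change_cookie(struct file *file)
  {
          struct kstat stat;

          if (vfs_getattr_nosec(&file->f_path, &stat, STATX_CHANGE_COOKIE,
                                AT_STATX_SYNC_AS_STAT))
                  return 0;
          if (!(stat.result_mask & STATX_CHANGE_COOKIE))
                  return 0;
          return stat.change_cookie;
  }
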
index d66a0a3..365db0e 100644 (file)
@@ -24,7 +24,6 @@
 #include <linux/slab.h>
 #include <linux/xattr.h>
 #include <linux/ima.h>
-#include <linux/iversion.h>
 #include <linux/fs.h>
 
 #include "ima.h"
@@ -164,11 +163,16 @@ static void ima_check_last_writer(struct integrity_iint_cache *iint,
 
        mutex_lock(&iint->mutex);
        if (atomic_read(&inode->i_writecount) == 1) {
+               struct kstat stat;
+
                update = test_and_clear_bit(IMA_UPDATE_XATTR,
                                            &iint->atomic_flags);
-               if (!IS_I_VERSION(inode) ||
-                   !inode_eq_iversion(inode, iint->version) ||
-                   (iint->flags & IMA_NEW_FILE)) {
+               if ((iint->flags & IMA_NEW_FILE) ||
+                   vfs_getattr_nosec(&file->f_path, &stat,
+                                     STATX_CHANGE_COOKIE,
+                                     AT_STATX_SYNC_AS_STAT) ||
+                   !(stat.result_mask & STATX_CHANGE_COOKIE) ||
+                   stat.change_cookie != iint->version) {
                        iint->flags &= ~(IMA_DONE_MASK | IMA_NEW_FILE);
                        iint->measured_pcrs = 0;
                        if (update)
index fb25723..3e7bee3 100644 (file)
@@ -89,6 +89,9 @@ int ima_read_modsig(enum ima_hooks func, const void *buf, loff_t buf_len,
 
 /**
  * ima_collect_modsig - Calculate the file hash without the appended signature.
+ * @modsig: parsed module signature
+ * @buf: data to verify the signature on
+ * @size: data size
  *
  * Since the modsig is part of the file contents, the hash used in its signature
  * isn't the same one ordinarily calculated by IMA. Therefore PKCS7 code
index 3ca8b73..c9b3bd8 100644 (file)
@@ -721,6 +721,7 @@ static int get_subaction(struct ima_rule_entry *rule, enum ima_hooks func)
  * @secid: LSM secid of the task to be validated
  * @func: IMA hook identifier
  * @mask: requested action (MAY_READ | MAY_WRITE | MAY_APPEND | MAY_EXEC)
+ * @flags: IMA actions to consider (e.g. IMA_MEASURE | IMA_APPRAISE)
  * @pcr: set the pcr to extend
  * @template_desc: the template that should be used for this rule
  * @func_data: func specific data, may be NULL
@@ -1915,7 +1916,7 @@ static int ima_parse_rule(char *rule, struct ima_rule_entry *entry)
 
 /**
  * ima_parse_add_rule - add a rule to ima_policy_rules
- * @rule - ima measurement policy rule
+ * @rule: ima measurement policy rule
  *
  * Avoid locking by allowing just one writer at a time in ima_write_policy()
  * Returns the length of the rule parsed, an error code on failure
index b46b651..b72b82b 100644 (file)
@@ -68,3 +68,10 @@ struct ctl_table key_sysctls[] = {
 #endif
        { }
 };
+
+static int __init init_security_keys_sysctls(void)
+{
+       register_sysctl_init("kernel/keys", key_sysctls);
+       return 0;
+}
+early_initcall(init_security_keys_sysctls);
index 8e33c4e..c1e862a 100644 (file)
@@ -2,7 +2,7 @@
 
 config SECURITY_LANDLOCK
        bool "Landlock support"
-       depends on SECURITY && !ARCH_EPHEMERAL_INODES
+       depends on SECURITY
        select SECURITY_PATH
        help
          Landlock is a sandboxing mechanism that enables processes to restrict
index 368e77c..849e832 100644 (file)
@@ -200,7 +200,7 @@ static void dump_common_audit_data(struct audit_buffer *ab,
        char comm[sizeof(current->comm)];
 
        /*
-        * To keep stack sizes in check force programers to notice if they
+        * To keep stack sizes in check force programmers to notice if they
         * start making this union too large!  See struct lsm_network_audit
         * as an example of how to deal with large data.
         */
index e806739..5be5894 100644 (file)
@@ -131,7 +131,7 @@ static int safesetid_security_capable(const struct cred *cred,
                 * set*gid() (e.g. setting up userns gid mappings).
                 */
                pr_warn("Operation requires CAP_SETGID, which is not available to GID %u for operations besides approved set*gid transitions\n",
-                       __kuid_val(cred->uid));
+                       __kgid_val(cred->gid));
                return -EPERM;
        default:
                /* Error, the only capabilities were checking for is CAP_SETUID/GID */
index d5ff7ff..b720424 100644 (file)
@@ -2491,7 +2491,7 @@ int security_inode_copy_up_xattr(const char *name)
        /*
         * The implementation can return 0 (accept the xattr), 1 (discard the
         * xattr), -EOPNOTSUPP if it does not know anything about the xattr or
-        * any other error code incase of an error.
+        * any other error code in case of an error.
         */
        hlist_for_each_entry(hp,
                             &security_hook_heads.inode_copy_up_xattr, list) {
@@ -4667,6 +4667,23 @@ int security_sctp_assoc_established(struct sctp_association *asoc,
 }
 EXPORT_SYMBOL(security_sctp_assoc_established);
 
+/**
+ * security_mptcp_add_subflow() - Inherit the LSM label from the MPTCP socket
+ * @sk: the owning MPTCP socket
+ * @ssk: the new subflow
+ *
+ * Update the labeling for the given MPTCP subflow, to match the one of the
+ * owning MPTCP socket. This hook has to be called after the socket creation and
+ * initialization via the security_socket_create() and
+ * security_socket_post_create() LSM hooks.
+ *
+ * Return: Returns 0 on success or a negative error code on failure.
+ */
+int security_mptcp_add_subflow(struct sock *sk, struct sock *ssk)
+{
+       return call_int_hook(mptcp_add_subflow, 0, sk, ssk);
+}
+
 #endif /* CONFIG_SECURITY_NETWORK */
 
 #ifdef CONFIG_SECURITY_INFINIBAND
@@ -4676,7 +4693,7 @@ EXPORT_SYMBOL(security_sctp_assoc_established);
  * @subnet_prefix: subnet prefix of the port
  * @pkey: IB pkey
  *
- * Check permission to access a pkey when modifing a QP.
+ * Check permission to access a pkey when modifying a QP.
  *
  * Return: Returns 0 if permission is granted.
  */
index 8b21520..8363796 100644 (file)
@@ -3,32 +3,38 @@
 # Makefile for building the SELinux module as part of the kernel tree.
 #
 
+# NOTE: There are a number of improvements that can be made to this Makefile
+# once the kernel requires make v4.3 or greater; the most important feature
+# lacking in older versions of make is support for grouped targets.  These
+# improvements are noted inline in the Makefile below ...
+
 obj-$(CONFIG_SECURITY_SELINUX) := selinux.o
 
+ccflags-y := -I$(srctree)/security/selinux -I$(srctree)/security/selinux/include
+
 selinux-y := avc.o hooks.o selinuxfs.o netlink.o nlmsgtab.o netif.o \
             netnode.o netport.o status.o \
             ss/ebitmap.o ss/hashtab.o ss/symtab.o ss/sidtab.o ss/avtab.o \
             ss/policydb.o ss/services.o ss/conditional.o ss/mls.o ss/context.o
 
 selinux-$(CONFIG_SECURITY_NETWORK_XFRM) += xfrm.o
-
 selinux-$(CONFIG_NETLABEL) += netlabel.o
-
 selinux-$(CONFIG_SECURITY_INFINIBAND) += ibpkey.o
-
 selinux-$(CONFIG_IMA) += ima.o
 
-ccflags-y := -I$(srctree)/security/selinux -I$(srctree)/security/selinux/include
+genhdrs := flask.h av_permissions.h
 
+# see the note above, replace the dependency rule with the one below:
+#  $(addprefix $(obj)/,$(selinux-y)): $(addprefix $(obj)/,$(genhdrs))
 $(addprefix $(obj)/,$(selinux-y)): $(obj)/flask.h
 
-quiet_cmd_flask = GEN     $(obj)/flask.h $(obj)/av_permissions.h
-      cmd_flask = $< $(obj)/flask.h $(obj)/av_permissions.h
+quiet_cmd_genhdrs = GEN     $(addprefix $(obj)/,$(genhdrs))
+      cmd_genhdrs = $< $(addprefix $(obj)/,$(genhdrs))
 
-targets += flask.h av_permissions.h
-# once make >= 4.3 is required, we can use grouped targets in the rule below,
-# which basically involves adding both headers and a '&' before the colon, see
-# the example below:
-#   $(obj)/flask.h $(obj)/av_permissions.h &: scripts/selinux/...
+# see the note above, replace the $targets and 'flask.h' rule with the lines
+# below:
+#  targets += $(genhdrs)
+#  $(addprefix $(obj)/,$(genhdrs)) &: scripts/selinux/...
+targets += flask.h
 $(obj)/flask.h: scripts/selinux/genheaders/genheaders FORCE
-       $(call if_changed,flask)
+       $(call if_changed,genhdrs)
index eaed5c2..1074db6 100644 (file)
@@ -642,7 +642,6 @@ static void avc_insert(u32 ssid, u32 tsid, u16 tclass,
        hlist_add_head_rcu(&node->list, head);
 found:
        spin_unlock_irqrestore(lock, flag);
-       return;
 }
 
 /**
@@ -1203,22 +1202,3 @@ u32 avc_policy_seqno(void)
 {
        return selinux_avc.avc_cache.latest_notif;
 }
-
-void avc_disable(void)
-{
-       /*
-        * If you are looking at this because you have realized that we are
-        * not destroying the avc_node_cachep it might be easy to fix, but
-        * I don't know the memory barrier semantics well enough to know.  It's
-        * possible that some other task dereferenced security_ops when
-        * it still pointed to selinux operations.  If that is the case it's
-        * possible that it is about to use the avc and is about to need the
-        * avc_node_cachep.  I know I could wrap the security.c security_ops call
-        * in an rcu_lock, but seriously, it's not worth it.  Instead I just flush
-        * the cache and get that memory back.
-        */
-       if (avc_node_cachep) {
-               avc_flush();
-               /* kmem_cache_destroy(avc_node_cachep); */
-       }
-}
index 79b4890..d06e350 100644 (file)
@@ -357,7 +357,7 @@ enum {
 };
 
 #define A(s, has_arg) {#s, sizeof(#s) - 1, Opt_##s, has_arg}
-static struct {
+static const struct {
        const char *name;
        int len;
        int opt;
@@ -605,6 +605,13 @@ static int selinux_set_mnt_opts(struct super_block *sb,
        u32 defcontext_sid = 0;
        int rc = 0;
 
+       /*
+        * Specifying internal flags without providing a place to
+        * place the results is not allowed
+        */
+       if (kern_flags && !set_kern_flags)
+               return -EINVAL;
+
        mutex_lock(&sbsec->lock);
 
        if (!selinux_initialized()) {
@@ -612,6 +619,10 @@ static int selinux_set_mnt_opts(struct super_block *sb,
                        /* Defer initialization until selinux_complete_init,
                           after the initial policy is loaded and the security
                           server is ready to handle calls. */
+                       if (kern_flags & SECURITY_LSM_NATIVE_LABELS) {
+                               sbsec->flags |= SE_SBNATIVE;
+                               *set_kern_flags |= SECURITY_LSM_NATIVE_LABELS;
+                       }
                        goto out;
                }
                rc = -EINVAL;
@@ -619,12 +630,6 @@ static int selinux_set_mnt_opts(struct super_block *sb,
                        "before the security server is initialized\n");
                goto out;
        }
-       if (kern_flags && !set_kern_flags) {
-               /* Specifying internal flags without providing a place to
-                * place the results is not allowed */
-               rc = -EINVAL;
-               goto out;
-       }
 
        /*
         * Binary mount data FS will come through this function twice.  Once
@@ -757,7 +762,17 @@ static int selinux_set_mnt_opts(struct super_block *sb,
         * sets the label used on all file below the mountpoint, and will set
         * the superblock context if not already set.
         */
-       if (kern_flags & SECURITY_LSM_NATIVE_LABELS && !context_sid) {
+       if (sbsec->flags & SE_SBNATIVE) {
+               /*
+                * This means we are initializing a superblock that has been
+                * mounted before the SELinux was initialized and the
+                * filesystem requested native labeling. We had already
+                * returned SECURITY_LSM_NATIVE_LABELS in *set_kern_flags
+                * in the original mount attempt, so now we just need to set
+                * the SECURITY_FS_USE_NATIVE behavior.
+                */
+               sbsec->behavior = SECURITY_FS_USE_NATIVE;
+       } else if (kern_flags & SECURITY_LSM_NATIVE_LABELS && !context_sid) {
                sbsec->behavior = SECURITY_FS_USE_NATIVE;
                *set_kern_flags |= SECURITY_LSM_NATIVE_LABELS;
        }
@@ -869,31 +884,37 @@ static int selinux_sb_clone_mnt_opts(const struct super_block *oldsb,
        int set_rootcontext =   (oldsbsec->flags & ROOTCONTEXT_MNT);
 
        /*
-        * if the parent was able to be mounted it clearly had no special lsm
-        * mount options.  thus we can safely deal with this superblock later
-        */
-       if (!selinux_initialized())
-               return 0;
-
-       /*
         * Specifying internal flags without providing a place to
         * place the results is not allowed.
         */
        if (kern_flags && !set_kern_flags)
                return -EINVAL;
 
+       mutex_lock(&newsbsec->lock);
+
+       /*
+        * if the parent was able to be mounted it clearly had no special lsm
+        * mount options.  thus we can safely deal with this superblock later
+        */
+       if (!selinux_initialized()) {
+               if (kern_flags & SECURITY_LSM_NATIVE_LABELS) {
+                       newsbsec->flags |= SE_SBNATIVE;
+                       *set_kern_flags |= SECURITY_LSM_NATIVE_LABELS;
+               }
+               goto out;
+       }
+
        /* how can we clone if the old one wasn't set up?? */
        BUG_ON(!(oldsbsec->flags & SE_SBINITIALIZED));
 
        /* if fs is reusing a sb, make sure that the contexts match */
        if (newsbsec->flags & SE_SBINITIALIZED) {
+               mutex_unlock(&newsbsec->lock);
                if ((kern_flags & SECURITY_LSM_NATIVE_LABELS) && !set_context)
                        *set_kern_flags |= SECURITY_LSM_NATIVE_LABELS;
                return selinux_cmp_sb_context(oldsb, newsb);
        }
 
-       mutex_lock(&newsbsec->lock);
-
        newsbsec->flags = oldsbsec->flags;
 
        newsbsec->sid = oldsbsec->sid;
@@ -937,7 +958,7 @@ out:
 }
 
 /*
- * NOTE: the caller is resposible for freeing the memory even if on error.
+ * NOTE: the caller is responsible for freeing the memory even if on error.
  */
 static int selinux_add_opt(int token, const char *s, void **mnt_opts)
 {
@@ -1394,8 +1415,11 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent
        spin_unlock(&isec->lock);
 
        switch (sbsec->behavior) {
+       /*
+        * In case of SECURITY_FS_USE_NATIVE we need to re-fetch the labels
+        * via xattr when called from delayed_superblock_init().
+        */
        case SECURITY_FS_USE_NATIVE:
-               break;
        case SECURITY_FS_USE_XATTR:
                if (!(inode->i_opflags & IOP_XATTR)) {
                        sid = sbsec->def_sid;
@@ -5379,6 +5403,21 @@ static void selinux_sctp_sk_clone(struct sctp_association *asoc, struct sock *sk
        selinux_netlbl_sctp_sk_clone(sk, newsk);
 }
 
+static int selinux_mptcp_add_subflow(struct sock *sk, struct sock *ssk)
+{
+       struct sk_security_struct *ssksec = ssk->sk_security;
+       struct sk_security_struct *sksec = sk->sk_security;
+
+       ssksec->sclass = sksec->sclass;
+       ssksec->sid = sksec->sid;
+
+       /* replace the existing subflow label deleting the existing one
+        * and re-recreating a new label using the updated context
+        */
+       selinux_netlbl_sk_security_free(ssksec);
+       return selinux_netlbl_socket_post_create(ssk, ssk->sk_family);
+}
+
 static int selinux_inet_conn_request(const struct sock *sk, struct sk_buff *skb,
                                     struct request_sock *req)
 {
@@ -7074,6 +7113,7 @@ static struct security_hook_list selinux_hooks[] __ro_after_init = {
        LSM_HOOK_INIT(sctp_sk_clone, selinux_sctp_sk_clone),
        LSM_HOOK_INIT(sctp_bind_connect, selinux_sctp_bind_connect),
        LSM_HOOK_INIT(sctp_assoc_established, selinux_sctp_assoc_established),
+       LSM_HOOK_INIT(mptcp_add_subflow, selinux_mptcp_add_subflow),
        LSM_HOOK_INIT(inet_conn_request, selinux_inet_conn_request),
        LSM_HOOK_INIT(inet_csk_clone, selinux_inet_csk_clone),
        LSM_HOOK_INIT(inet_conn_established, selinux_inet_conn_established),
index 7daf596..aa34da9 100644 (file)
@@ -4,7 +4,7 @@
  *
  * Author: Lakshmi Ramasubramanian (nramas@linux.microsoft.com)
  *
- * Measure critical data structures maintainted by SELinux
+ * Measure critical data structures maintained by SELinux
  * using IMA subsystem.
  */
 #include <linux/vmalloc.h>
index 406bceb..d549513 100644 (file)
@@ -41,7 +41,7 @@ void selinux_audit_rule_free(void *rule);
  *     selinux_audit_rule_match - determine if a context ID matches a rule.
  *     @sid: the context ID to check
  *     @field: the field this rule refers to
- *     @op: the operater the rule uses
+ *     @op: the operator the rule uses
  *     @rule: pointer to the audit rule to check against
  *
  *     Returns 1 if the context id matches the rule, 0 if it does not, and
index 9301222..9e055f7 100644 (file)
@@ -168,9 +168,6 @@ int avc_get_hash_stats(char *page);
 unsigned int avc_get_cache_threshold(void);
 void avc_set_cache_threshold(unsigned int cache_threshold);
 
-/* Attempt to free avc node cache */
-void avc_disable(void);
-
 #ifdef CONFIG_SECURITY_SELINUX_AVC_STATS
 DECLARE_PER_CPU(struct avc_cache_stats, avc_cache_stats);
 #endif
index c992f83..875b055 100644 (file)
@@ -15,6 +15,7 @@
 #define _SELINUX_IB_PKEY_H
 
 #include <linux/types.h>
+#include "flask.h"
 
 #ifdef CONFIG_SECURITY_INFINIBAND
 void sel_ib_pkey_flush(void);
index 05e0417..93c05e9 100644 (file)
@@ -4,7 +4,7 @@
  *
  * Author: Lakshmi Ramasubramanian (nramas@linux.microsoft.com)
  *
- * Measure critical data structures maintainted by SELinux
+ * Measure critical data structures maintained by SELinux
  * using IMA subsystem.
  */
 
index 6082051..ecc6e74 100644 (file)
@@ -1,4 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/stddef.h>
+
 static const char *const initial_sid_to_string[] = {
        NULL,
        "kernel",
index 8746faf..3b605f3 100644 (file)
@@ -65,6 +65,7 @@
 #define SE_SBPROC              0x0200
 #define SE_SBGENFS             0x0400
 #define SE_SBGENFS_XATTR       0x0800
+#define SE_SBNATIVE            0x1000
 
 #define CONTEXT_STR    "context"
 #define FSCONTEXT_STR  "fscontext"
@@ -384,7 +385,6 @@ struct selinux_kernel_status {
 extern void selinux_status_update_setenforce(int enforcing);
 extern void selinux_status_update_policyload(int seqno);
 extern void selinux_complete_init(void);
-extern void exit_sel_fs(void);
 extern struct path selinux_null;
 extern void selnl_notify_setenforce(int val);
 extern void selnl_notify_policyload(u32 seqno);
index 767c670..528f518 100644 (file)
@@ -154,8 +154,12 @@ void selinux_netlbl_err(struct sk_buff *skb, u16 family, int error, int gateway)
  */
 void selinux_netlbl_sk_security_free(struct sk_security_struct *sksec)
 {
-       if (sksec->nlbl_secattr != NULL)
-               netlbl_secattr_free(sksec->nlbl_secattr);
+       if (!sksec->nlbl_secattr)
+               return;
+
+       netlbl_secattr_free(sksec->nlbl_secattr);
+       sksec->nlbl_secattr = NULL;
+       sksec->nlbl_state = NLBL_UNSET;
 }
 
 /**
index 69a583b..bad1f6b 100644 (file)
@@ -951,7 +951,7 @@ static ssize_t sel_write_create(struct file *file, char *buf, size_t size)
                 * either whitespace or multibyte characters, they shall be
                 * encoded based on the percentage-encoding rule.
                 * If not encoded, the sscanf logic picks up only left-half
-                * of the supplied name; splitted by a whitespace unexpectedly.
+                * of the supplied name; split by a whitespace unexpectedly.
                 */
                char   *r, *w;
                int     c1, c2;
@@ -1649,7 +1649,7 @@ static int sel_make_ss_files(struct dentry *dir)
        struct super_block *sb = dir->d_sb;
        struct selinux_fs_info *fsi = sb->s_fs_info;
        int i;
-       static struct tree_descr files[] = {
+       static const struct tree_descr files[] = {
                { "sidtab_hash_stats", &sel_sidtab_hash_stats_ops, S_IRUGO },
        };
 
index 8480ec6..6766edc 100644 (file)
@@ -354,7 +354,7 @@ int avtab_alloc_dup(struct avtab *new, const struct avtab *orig)
        return avtab_alloc_common(new, orig->nslot);
 }
 
-void avtab_hash_eval(struct avtab *h, char *tag)
+void avtab_hash_eval(struct avtab *h, const char *tag)
 {
        int i, chain_len, slots_used, max_chain_len;
        unsigned long long chain2_len_sum;
index d3ebea8..d6742fd 100644 (file)
@@ -92,7 +92,7 @@ int avtab_alloc(struct avtab *, u32);
 int avtab_alloc_dup(struct avtab *new, const struct avtab *orig);
 struct avtab_datum *avtab_search(struct avtab *h, const struct avtab_key *k);
 void avtab_destroy(struct avtab *h);
-void avtab_hash_eval(struct avtab *h, char *tag);
+void avtab_hash_eval(struct avtab *h, const char *tag);
 
 struct policydb;
 int avtab_read_item(struct avtab *a, void *fp, struct policydb *pol,
index e11219f..b156c18 100644 (file)
@@ -38,7 +38,7 @@ static int cond_evaluate_expr(struct policydb *p, struct cond_expr *expr)
                        if (sp == (COND_EXPR_MAXDEPTH - 1))
                                return -1;
                        sp++;
-                       s[sp] = p->bool_val_to_struct[node->bool - 1]->state;
+                       s[sp] = p->bool_val_to_struct[node->boolean - 1]->state;
                        break;
                case COND_NOT:
                        if (sp < 0)
@@ -366,7 +366,7 @@ static int expr_node_isvalid(struct policydb *p, struct cond_expr_node *expr)
                return 0;
        }
 
-       if (expr->bool > p->p_bools.nprim) {
+       if (expr->boolean > p->p_bools.nprim) {
                pr_err("SELinux: conditional expressions uses unknown bool.\n");
                return 0;
        }
@@ -401,7 +401,7 @@ static int cond_read_node(struct policydb *p, struct cond_node *node, void *fp)
                        return rc;
 
                expr->expr_type = le32_to_cpu(buf[0]);
-               expr->bool = le32_to_cpu(buf[1]);
+               expr->boolean = le32_to_cpu(buf[1]);
 
                if (!expr_node_isvalid(p, expr))
                        return -EINVAL;
@@ -518,7 +518,7 @@ static int cond_write_node(struct policydb *p, struct cond_node *node,
 
        for (i = 0; i < node->expr.len; i++) {
                buf[0] = cpu_to_le32(node->expr.nodes[i].expr_type);
-               buf[1] = cpu_to_le32(node->expr.nodes[i].bool);
+               buf[1] = cpu_to_le32(node->expr.nodes[i].boolean);
                rc = put_entry(buf, sizeof(u32), 2, fp);
                if (rc)
                        return rc;
index e47ec6d..5a7b512 100644 (file)
@@ -29,7 +29,7 @@ struct cond_expr_node {
 #define COND_NEQ       7 /* bool != bool */
 #define COND_LAST      COND_NEQ
        u32 expr_type;
-       u32 bool;
+       u32 boolean;
 };
 
 struct cond_expr {
index eda32c3..aed704b 100644 (file)
@@ -167,6 +167,8 @@ static inline int context_cpy(struct context *dst, const struct context *src)
        rc = mls_context_cpy(dst, src);
        if (rc) {
                kfree(dst->str);
+               dst->str = NULL;
+               dst->len = 0;
                return rc;
        }
        return 0;
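
The two added assignments make context_cpy() leave dst in a sane state on failure: once dst->str has been freed it is reset to NULL and dst->len to 0, so a later context_destroy() on the same struct cannot free the string a second time. The hazard, boiled down to a small userspace example (a sketch with invented names, not SELinux code):

  #include <stdlib.h>
  #include <string.h>

  struct ctx {
          char *str;
          size_t len;
  };

  /* Safe to call any number of times thanks to the reset below. */
  static void ctx_destroy(struct ctx *c)
  {
          free(c->str);           /* free(NULL) is a no-op */
          c->str = NULL;
          c->len = 0;
  }

  static int ctx_copy(struct ctx *dst, const struct ctx *src)
  {
          dst->str = strdup(src->str);
          if (!dst->str)
                  return -1;
          dst->len = src->len;
          /* If a later step failed here, the cleanup would have to free
           * dst->str AND reset it to NULL, exactly as the hunk above does,
           * or the caller's ctx_destroy() would double-free it. */
          return 0;
  }

  int main(void)
  {
          struct ctx a = { .str = strdup("system_u:object_r:etc_t"), .len = 23 };
          struct ctx b = { 0 };

          if (a.str && ctx_copy(&b, &a) == 0)
                  ctx_destroy(&b);
          ctx_destroy(&a);
          ctx_destroy(&a);        /* harmless: str is already NULL */
          return 0;
  }
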
index adcfb63..31b08b3 100644 (file)
@@ -42,7 +42,7 @@
 #include "services.h"
 
 #ifdef DEBUG_HASHES
-static const char *symtab_name[SYM_NUM] = {
+static const char *const symtab_name[SYM_NUM] = {
        "common prefixes",
        "classes",
        "roles",
@@ -2257,6 +2257,10 @@ static int ocontext_read(struct policydb *p, const struct policydb_compat_info *
                                if (rc)
                                        goto out;
 
+                               if (i == OCON_FS)
+                                       pr_warn("SELinux:  void and deprecated fs ocon %s\n",
+                                               c->u.name);
+
                                rc = context_read_and_validate(&c->context[0], p, fp);
                                if (rc)
                                        goto out;
index ffc4e7b..74b63ed 100644 (file)
@@ -225,7 +225,7 @@ struct genfs {
 
 /* object context array indices */
 #define OCON_ISID      0 /* initial SIDs */
-#define OCON_FS                1 /* unlabeled file systems */
+#define OCON_FS                1 /* unlabeled file systems (deprecated) */
 #define OCON_PORT      2 /* TCP and UDP port numbers */
 #define OCON_NETIF     3 /* network interfaces */
 #define OCON_NODE      4 /* nodes */
index f14d1ff..78946b7 100644 (file)
@@ -583,7 +583,7 @@ static void type_attribute_bounds_av(struct policydb *policydb,
 
 /*
  * flag which drivers have permissions
- * only looking for ioctl based extended permssions
+ * only looking for ioctl based extended permissions
  */
 void services_compute_xperms_drivers(
                struct extended_perms *xperms,
@@ -3541,38 +3541,38 @@ int selinux_audit_rule_init(u32 field, u32 op, char *rulestr, void **vrule)
        tmprule = kzalloc(sizeof(struct selinux_audit_rule), GFP_KERNEL);
        if (!tmprule)
                return -ENOMEM;
-
        context_init(&tmprule->au_ctxt);
 
        rcu_read_lock();
        policy = rcu_dereference(state->policy);
        policydb = &policy->policydb;
-
        tmprule->au_seqno = policy->latest_granting;
-
        switch (field) {
        case AUDIT_SUBJ_USER:
        case AUDIT_OBJ_USER:
-               rc = -EINVAL;
                userdatum = symtab_search(&policydb->p_users, rulestr);
-               if (!userdatum)
-                       goto out;
+               if (!userdatum) {
+                       rc = -EINVAL;
+                       goto err;
+               }
                tmprule->au_ctxt.user = userdatum->value;
                break;
        case AUDIT_SUBJ_ROLE:
        case AUDIT_OBJ_ROLE:
-               rc = -EINVAL;
                roledatum = symtab_search(&policydb->p_roles, rulestr);
-               if (!roledatum)
-                       goto out;
+               if (!roledatum) {
+                       rc = -EINVAL;
+                       goto err;
+               }
                tmprule->au_ctxt.role = roledatum->value;
                break;
        case AUDIT_SUBJ_TYPE:
        case AUDIT_OBJ_TYPE:
-               rc = -EINVAL;
                typedatum = symtab_search(&policydb->p_types, rulestr);
-               if (!typedatum)
-                       goto out;
+               if (!typedatum) {
+                       rc = -EINVAL;
+                       goto err;
+               }
                tmprule->au_ctxt.type = typedatum->value;
                break;
        case AUDIT_SUBJ_SEN:
@@ -3582,20 +3582,18 @@ int selinux_audit_rule_init(u32 field, u32 op, char *rulestr, void **vrule)
                rc = mls_from_string(policydb, rulestr, &tmprule->au_ctxt,
                                     GFP_ATOMIC);
                if (rc)
-                       goto out;
+                       goto err;
                break;
        }
-       rc = 0;
-out:
        rcu_read_unlock();
 
-       if (rc) {
-               selinux_audit_rule_free(tmprule);
-               tmprule = NULL;
-       }
-
        *rule = tmprule;
+       return 0;
 
+err:
+       rcu_read_unlock();
+       selinux_audit_rule_free(tmprule);
+       *rule = NULL;
        return rc;
 }
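
selinux_audit_rule_init() is reorganized around a dedicated err: label, so the success path falls straight through while only failures unwind via rcu_read_unlock(), selinux_audit_rule_free() and clearing the caller's pointer. The same structure in a minimal userspace sketch (invented names; a mutex stands in for the RCU read-side lock):

  #include <pthread.h>
  #include <stdlib.h>

  static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

  struct rule { int field; };

  /* Allocate, validate under the lock, and hand the rule back; any failure
   * unwinds through the single err: label, which unlocks, frees and NULLs
   * the caller's pointer. */
  static int rule_init(int field, struct rule **out)
  {
          struct rule *r = calloc(1, sizeof(*r));
          int rc;

          if (!r)
                  return -1;

          pthread_mutex_lock(&table_lock);
          if (field < 0) {                /* stands in for the symtab lookups */
                  rc = -1;
                  goto err;
          }
          r->field = field;
          pthread_mutex_unlock(&table_lock);

          *out = r;
          return 0;

  err:
          pthread_mutex_unlock(&table_lock);
          free(r);
          *out = NULL;
          return rc;
  }

  int main(void)
  {
          struct rule *r;

          if (rule_init(3, &r) == 0)      /* valid field: succeeds */
                  free(r);
          return rule_init(-1, &r) == 0;  /* invalid field: expect failure */
  }
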
 
index e2239be..aa15ff5 100644 (file)
@@ -120,6 +120,7 @@ struct inode_smack {
 struct task_smack {
        struct smack_known      *smk_task;      /* label for access control */
        struct smack_known      *smk_forked;    /* label when forked */
+       struct smack_known      *smk_transmuted;/* label when transmuted */
        struct list_head        smk_rules;      /* per task access rules */
        struct mutex            smk_rules_lock; /* lock for the rules */
        struct list_head        smk_relabel;    /* transit allowed labels */
index 7a3e9ab..6e270cf 100644 (file)
@@ -933,8 +933,9 @@ static int smack_inode_init_security(struct inode *inode, struct inode *dir,
                                     const struct qstr *qstr, const char **name,
                                     void **value, size_t *len)
 {
+       struct task_smack *tsp = smack_cred(current_cred());
        struct inode_smack *issp = smack_inode(inode);
-       struct smack_known *skp = smk_of_current();
+       struct smack_known *skp = smk_of_task(tsp);
        struct smack_known *isp = smk_of_inode(inode);
        struct smack_known *dsp = smk_of_inode(dir);
        int may;
@@ -943,20 +944,34 @@ static int smack_inode_init_security(struct inode *inode, struct inode *dir,
                *name = XATTR_SMACK_SUFFIX;
 
        if (value && len) {
-               rcu_read_lock();
-               may = smk_access_entry(skp->smk_known, dsp->smk_known,
-                                      &skp->smk_rules);
-               rcu_read_unlock();
+               /*
+                * If equal, transmuting already occurred in
+                * smack_dentry_create_files_as(). No need to check again.
+                */
+               if (tsp->smk_task != tsp->smk_transmuted) {
+                       rcu_read_lock();
+                       may = smk_access_entry(skp->smk_known, dsp->smk_known,
+                                              &skp->smk_rules);
+                       rcu_read_unlock();
+               }
 
                /*
-                * If the access rule allows transmutation and
-                * the directory requests transmutation then
-                * by all means transmute.
+                * In addition to having smk_task equal to smk_transmuted,
+                * if the access rule allows transmutation and the directory
+                * requests transmutation then by all means transmute.
                 * Mark the inode as changed.
                 */
-               if (may > 0 && ((may & MAY_TRANSMUTE) != 0) &&
-                   smk_inode_transmutable(dir)) {
-                       isp = dsp;
+               if ((tsp->smk_task == tsp->smk_transmuted) ||
+                   (may > 0 && ((may & MAY_TRANSMUTE) != 0) &&
+                    smk_inode_transmutable(dir))) {
+                       /*
+                        * The caller of smack_dentry_create_files_as()
+                        * should have overridden the current cred, so the
+                        * inode label was already set correctly in
+                        * smack_inode_alloc_security().
+                        */
+                       if (tsp->smk_task != tsp->smk_transmuted)
+                               isp = dsp;
                        issp->smk_flags |= SMK_INODE_CHANGED;
                }
 
@@ -1463,10 +1478,19 @@ static int smack_inode_getsecurity(struct mnt_idmap *idmap,
        struct super_block *sbp;
        struct inode *ip = inode;
        struct smack_known *isp;
+       struct inode_smack *ispp;
+       size_t label_len;
+       char *label = NULL;
 
-       if (strcmp(name, XATTR_SMACK_SUFFIX) == 0)
+       if (strcmp(name, XATTR_SMACK_SUFFIX) == 0) {
                isp = smk_of_inode(inode);
-       else {
+       } else if (strcmp(name, XATTR_SMACK_TRANSMUTE) == 0) {
+               ispp = smack_inode(inode);
+               if (ispp->smk_flags & SMK_INODE_TRANSMUTE)
+                       label = TRANS_TRUE;
+               else
+                       label = "";
+       } else {
                /*
                 * The rest of the Smack xattrs are only on sockets.
                 */
@@ -1488,13 +1512,18 @@ static int smack_inode_getsecurity(struct mnt_idmap *idmap,
                        return -EOPNOTSUPP;
        }
 
+       if (!label)
+               label = isp->smk_known;
+
+       label_len = strlen(label);
+
        if (alloc) {
-               *buffer = kstrdup(isp->smk_known, GFP_KERNEL);
+               *buffer = kstrdup(label, GFP_KERNEL);
                if (*buffer == NULL)
                        return -ENOMEM;
        }
 
-       return strlen(isp->smk_known);
+       return label_len;
 }
 
 
@@ -4753,8 +4782,10 @@ static int smack_dentry_create_files_as(struct dentry *dentry, int mode,
                 * providing access is transmuting use the containing
                 * directory label instead of the process label.
                 */
-               if (may > 0 && (may & MAY_TRANSMUTE))
+               if (may > 0 && (may & MAY_TRANSMUTE)) {
                        ntsp->smk_task = isp->smk_inode;
+                       ntsp->smk_transmuted = ntsp->smk_task;
+               }
        }
        return 0;
 }
index 31af29f..ac20c0b 100644 (file)
@@ -916,7 +916,7 @@ bool tomoyo_dump_page(struct linux_binprm *bprm, unsigned long pos,
         */
        mmap_read_lock(bprm->mm);
        ret = get_user_pages_remote(bprm->mm, pos, 1,
-                                   FOLL_FORCE, &page, NULL, NULL);
+                                   FOLL_FORCE, &page, NULL);
        mmap_read_unlock(bprm->mm);
        if (ret <= 0)
                return false;
index 308ec70..dabfdec 100644 (file)
@@ -9527,6 +9527,7 @@ static const struct snd_pci_quirk alc269_fixup_tbl[] = {
        SND_PCI_QUIRK(0x1043, 0x1427, "Asus Zenbook UX31E", ALC269VB_FIXUP_ASUS_ZENBOOK),
        SND_PCI_QUIRK(0x1043, 0x1473, "ASUS GU604V", ALC285_FIXUP_ASUS_HEADSET_MIC),
        SND_PCI_QUIRK(0x1043, 0x1483, "ASUS GU603V", ALC285_FIXUP_ASUS_HEADSET_MIC),
+       SND_PCI_QUIRK(0x1043, 0x1493, "ASUS GV601V", ALC285_FIXUP_ASUS_HEADSET_MIC),
        SND_PCI_QUIRK(0x1043, 0x1517, "Asus Zenbook UX31A", ALC269VB_FIXUP_ASUS_ZENBOOK_UX31A),
        SND_PCI_QUIRK(0x1043, 0x1662, "ASUS GV301QH", ALC294_FIXUP_ASUS_DUAL_SPK),
        SND_PCI_QUIRK(0x1043, 0x1683, "ASUS UM3402YAR", ALC287_FIXUP_CS35L41_I2C_2),
@@ -9552,6 +9553,7 @@ static const struct snd_pci_quirk alc269_fixup_tbl[] = {
        SND_PCI_QUIRK(0x1043, 0x1c23, "Asus X55U", ALC269_FIXUP_LIMIT_INT_MIC_BOOST),
        SND_PCI_QUIRK(0x1043, 0x1c62, "ASUS GU603", ALC289_FIXUP_ASUS_GA401),
        SND_PCI_QUIRK(0x1043, 0x1c92, "ASUS ROG Strix G15", ALC285_FIXUP_ASUS_G533Z_PINS),
+       SND_PCI_QUIRK(0x1043, 0x1caf, "ASUS G634JYR/JZR", ALC285_FIXUP_ASUS_HEADSET_MIC),
        SND_PCI_QUIRK(0x1043, 0x1ccd, "ASUS X555UB", ALC256_FIXUP_ASUS_MIC),
        SND_PCI_QUIRK(0x1043, 0x1d42, "ASUS Zephyrus G14 2022", ALC289_FIXUP_ASUS_GA401),
        SND_PCI_QUIRK(0x1043, 0x1d4e, "ASUS TM420", ALC256_FIXUP_ASUS_HPE),
index 8020097..0c4c5cb 100644 (file)
@@ -1313,7 +1313,7 @@ config SND_SOC_RK3328
 
 config SND_SOC_RK817
        tristate "Rockchip RK817 audio CODEC"
-       depends on MFD_RK808 || COMPILE_TEST
+       depends on MFD_RK8XX || COMPILE_TEST
 
 config SND_SOC_RL6231
        tristate
index 6faf4a4..144f082 100644 (file)
@@ -1347,7 +1347,7 @@ static int sof_card_dai_links_create(struct device *dev,
                                if ((SDW_PART_ID(adr_link->adr_d[i].adr) !=
                                    SDW_PART_ID(adr_link->adr_d[j].adr)) ||
                                    (SDW_MFG_ID(adr_link->adr_d[i].adr) !=
-                                   SDW_MFG_ID(adr_link->adr_d[i].adr))) {
+                                   SDW_MFG_ID(adr_link->adr_d[j].adr))) {
                                        append_codec_type = true;
                                        goto out;
                                }
index c5573ea..1c1b755 100644 (file)
@@ -34,6 +34,8 @@
 #define BYTES_NOP7     0x8d,0xb4,0x26,0x00,0x00,0x00,0x00
 #define BYTES_NOP8     0x3e,BYTES_NOP7
 
+#define ASM_NOP_MAX 8
+
 #else
 
 /*
@@ -47,6 +49,9 @@
  * 6: osp nopl 0x00(%eax,%eax,1)
  * 7: nopl 0x00000000(%eax)
  * 8: nopl 0x00000000(%eax,%eax,1)
+ * 9: cs nopl 0x00000000(%eax,%eax,1)
+ * 10: osp cs nopl 0x00000000(%eax,%eax,1)
+ * 11: osp osp cs nopl 0x00000000(%eax,%eax,1)
  */
 #define BYTES_NOP1     0x90
 #define BYTES_NOP2     0x66,BYTES_NOP1
 #define BYTES_NOP6     0x66,BYTES_NOP5
 #define BYTES_NOP7     0x0f,0x1f,0x80,0x00,0x00,0x00,0x00
 #define BYTES_NOP8     0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00
+#define BYTES_NOP9     0x2e,BYTES_NOP8
+#define BYTES_NOP10    0x66,BYTES_NOP9
+#define BYTES_NOP11    0x66,BYTES_NOP10
+
+#define ASM_NOP9  _ASM_BYTES(BYTES_NOP9)
+#define ASM_NOP10 _ASM_BYTES(BYTES_NOP10)
+#define ASM_NOP11 _ASM_BYTES(BYTES_NOP11)
+
+#define ASM_NOP_MAX 11
 
 #endif /* CONFIG_64BIT */
 
@@ -68,8 +82,6 @@
 #define ASM_NOP7 _ASM_BYTES(BYTES_NOP7)
 #define ASM_NOP8 _ASM_BYTES(BYTES_NOP8)
 
-#define ASM_NOP_MAX 8
-
 #ifndef __ASSEMBLY__
 extern const unsigned char * const x86_nops[];
 #endif
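
With BYTES_NOP9..BYTES_NOP11 now available, ASM_NOP_MAX moves into each branch of the #ifdef (8 for the 32-bit table, 11 for the 64-bit one), so code that pads with NOPs always knows the longest single NOP it may emit for the current build. A hedged sketch of how a padding helper can consume the x86_nops[] table declared above (illustrative only, not the in-kernel implementation verbatim):

  /* Hypothetical helper: fill @len bytes at @buf with NOPs by carving the
   * area into chunks of at most ASM_NOP_MAX bytes and copying the canonical
   * NOP of each chunk's length from the x86_nops[] table. */
  static void fill_nops(void *buf, unsigned int len)
  {
          unsigned char *p = buf;

          while (len) {
                  unsigned int chunk = len < ASM_NOP_MAX ? len : ASM_NOP_MAX;

                  memcpy(p, x86_nops[chunk], chunk);
                  p += chunk;
                  len -= chunk;
          }
  }
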
diff --git a/tools/arch/x86/kcpuid/.gitignore b/tools/arch/x86/kcpuid/.gitignore
new file mode 100644 (file)
index 0000000..1b8541b
--- /dev/null
@@ -0,0 +1 @@
+kcpuid
index 416f5b3..24b7d01 100644 (file)
@@ -517,15 +517,16 @@ static void show_range(struct cpuid_range *range)
 static inline struct cpuid_func *index_to_func(u32 index)
 {
        struct cpuid_range *range;
+       u32 func_idx;
 
        range = (index & 0x80000000) ? leafs_ext : leafs_basic;
-       index &= 0x7FFFFFFF;
+       func_idx = index & 0xffff;
 
-       if (((index & 0xFFFF) + 1) > (u32)range->nr) {
+       if ((func_idx + 1) > (u32)range->nr) {
                printf("ERR: invalid input index (0x%x)\n", index);
                return NULL;
        }
-       return &range->funcs[index];
+       return &range->funcs[func_idx];
 }
 
 static void show_info(void)
index 9839fea..64d67b0 100644 (file)
@@ -25,8 +25,23 @@ endif
 
 nolibc_arch := $(patsubst arm64,aarch64,$(ARCH))
 arch_file := arch-$(nolibc_arch).h
-all_files := ctype.h errno.h nolibc.h signal.h stackprotector.h std.h stdint.h \
-             stdio.h stdlib.h string.h sys.h time.h types.h unistd.h
+all_files := \
+               compiler.h \
+               ctype.h \
+               errno.h \
+               nolibc.h \
+               signal.h \
+               stackprotector.h \
+               std.h \
+               stdint.h \
+               stdlib.h \
+               string.h \
+               sys.h \
+               time.h \
+               types.h \
+               unistd.h \
+               stdio.h \
+
 
 # install all headers needed to support a bare-metal compiler
 all: headers
index 383badd..11f294a 100644 (file)
@@ -7,6 +7,8 @@
 #ifndef _NOLIBC_ARCH_AARCH64_H
 #define _NOLIBC_ARCH_AARCH64_H
 
+#include "compiler.h"
+
 /* The struct returned by the newfstatat() syscall. Differs slightly from the
  * x86_64's stat one by field ordering, so be careful.
  */
@@ -173,27 +175,30 @@ char **environ __attribute__((weak));
 const unsigned long *_auxv __attribute__((weak));
 
 /* startup code */
-void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) _start(void)
+void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) __no_stack_protector _start(void)
 {
        __asm__ volatile (
-               "ldr x0, [sp]\n"     // argc (x0) was in the stack
-               "add x1, sp, 8\n"    // argv (x1) = sp
-               "lsl x2, x0, 3\n"    // envp (x2) = 8*argc ...
-               "add x2, x2, 8\n"    //           + 8 (skip null)
-               "add x2, x2, x1\n"   //           + argv
-               "adrp x3, environ\n"          // x3 = &environ (high bits)
-               "str x2, [x3, #:lo12:environ]\n" // store envp into environ
-               "mov x4, x2\n"       // search for auxv (follows NULL after last env)
+#ifdef _NOLIBC_STACKPROTECTOR
+               "bl __stack_chk_init\n"   /* initialize stack protector                     */
+#endif
+               "ldr x0, [sp]\n"     /* argc (x0) was in the stack                          */
+               "add x1, sp, 8\n"    /* argv (x1) = sp                                      */
+               "lsl x2, x0, 3\n"    /* envp (x2) = 8*argc ...                              */
+               "add x2, x2, 8\n"    /*           + 8 (skip null)                           */
+               "add x2, x2, x1\n"   /*           + argv                                    */
+               "adrp x3, environ\n"          /* x3 = &environ (high bits)                  */
+               "str x2, [x3, #:lo12:environ]\n" /* store envp into environ                 */
+               "mov x4, x2\n"       /* search for auxv (follows NULL after last env)       */
                "0:\n"
-               "ldr x5, [x4], 8\n"  // x5 = *x4; x4 += 8
-               "cbnz x5, 0b\n"      // and stop at NULL after last env
-               "adrp x3, _auxv\n"   // x3 = &_auxv (high bits)
-               "str x4, [x3, #:lo12:_auxv]\n" // store x4 into _auxv
-               "and sp, x1, -16\n"  // sp must be 16-byte aligned in the callee
-               "bl main\n"          // main() returns the status code, we'll exit with it.
-               "mov x8, 93\n"       // NR_exit == 93
+               "ldr x5, [x4], 8\n"  /* x5 = *x4; x4 += 8                                   */
+               "cbnz x5, 0b\n"      /* and stop at NULL after last env                     */
+               "adrp x3, _auxv\n"   /* x3 = &_auxv (high bits)                             */
+               "str x4, [x3, #:lo12:_auxv]\n" /* store x4 into _auxv                       */
+               "and sp, x1, -16\n"  /* sp must be 16-byte aligned in the callee            */
+               "bl main\n"          /* main() returns the status code, we'll exit with it. */
+               "mov x8, 93\n"       /* NR_exit == 93                                       */
                "svc #0\n"
        );
        __builtin_unreachable();
 }
-#endif // _NOLIBC_ARCH_AARCH64_H
+#endif /* _NOLIBC_ARCH_AARCH64_H */
index 42499f2..ca4c669 100644 (file)
@@ -7,6 +7,8 @@
 #ifndef _NOLIBC_ARCH_ARM_H
 #define _NOLIBC_ARCH_ARM_H
 
+#include "compiler.h"
+
 /* The struct returned by the stat() syscall, 32-bit only, the syscall returns
  * exactly 56 bytes (stops before the unused array). In big endian, the format
  * differs as devices are returned as short only.
@@ -196,41 +198,67 @@ struct sys_stat_struct {
        _arg1;                                                                \
 })
 
+#define my_syscall6(num, arg1, arg2, arg3, arg4, arg5, arg6)                  \
+({                                                                            \
+       register long _num  __asm__(_NOLIBC_SYSCALL_REG) = (num);             \
+       register long _arg1 __asm__ ("r0") = (long)(arg1);                    \
+       register long _arg2 __asm__ ("r1") = (long)(arg2);                    \
+       register long _arg3 __asm__ ("r2") = (long)(arg3);                    \
+       register long _arg4 __asm__ ("r3") = (long)(arg4);                    \
+       register long _arg5 __asm__ ("r4") = (long)(arg5);                    \
+       register long _arg6 __asm__ ("r5") = (long)(arg6);                    \
+                                                                             \
+       __asm__  volatile (                                                   \
+               _NOLIBC_THUMB_SET_R7                                          \
+               "svc #0\n"                                                    \
+               _NOLIBC_THUMB_RESTORE_R7                                      \
+               : "=r"(_arg1), "=r" (_num)                                    \
+               : "r"(_arg1), "r"(_arg2), "r"(_arg3), "r"(_arg4), "r"(_arg5), \
+                 "r"(_arg6), "r"(_num)                                       \
+               : "memory", "cc", "lr"                                        \
+       );                                                                    \
+       _arg1;                                                                \
+})
+
+
 char **environ __attribute__((weak));
 const unsigned long *_auxv __attribute__((weak));
 
 /* startup code */
-void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) _start(void)
+void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) __no_stack_protector _start(void)
 {
        __asm__ volatile (
-               "pop {%r0}\n"                 // argc was in the stack
-               "mov %r1, %sp\n"              // argv = sp
+#ifdef _NOLIBC_STACKPROTECTOR
+               "bl __stack_chk_init\n"       /* initialize stack protector                          */
+#endif
+               "pop {%r0}\n"                 /* argc was in the stack                               */
+               "mov %r1, %sp\n"              /* argv = sp                                           */
 
-               "add %r2, %r0, $1\n"          // envp = (argc + 1) ...
-               "lsl %r2, %r2, $2\n"          //        * 4        ...
-               "add %r2, %r2, %r1\n"         //        + argv
-               "ldr %r3, 1f\n"               // r3 = &environ (see below)
-               "str %r2, [r3]\n"             // store envp into environ
+               "add %r2, %r0, $1\n"          /* envp = (argc + 1) ...                               */
+               "lsl %r2, %r2, $2\n"          /*        * 4        ...                               */
+               "add %r2, %r2, %r1\n"         /*        + argv                                       */
+               "ldr %r3, 1f\n"               /* r3 = &environ (see below)                           */
+               "str %r2, [r3]\n"             /* store envp into environ                             */
 
-               "mov r4, r2\n"                // search for auxv (follows NULL after last env)
+               "mov r4, r2\n"                /* search for auxv (follows NULL after last env)       */
                "0:\n"
-               "mov r5, r4\n"                // r5 = r4
-               "add r4, r4, #4\n"            // r4 += 4
-               "ldr r5,[r5]\n"               // r5 = *r5 = *(r4-4)
-               "cmp r5, #0\n"                // and stop at NULL after last env
+               "mov r5, r4\n"                /* r5 = r4                                             */
+               "add r4, r4, #4\n"            /* r4 += 4                                             */
+               "ldr r5,[r5]\n"               /* r5 = *r5 = *(r4-4)                                  */
+               "cmp r5, #0\n"                /* and stop at NULL after last env                     */
                "bne 0b\n"
-               "ldr %r3, 2f\n"               // r3 = &_auxv (low bits)
-               "str r4, [r3]\n"              // store r4 into _auxv
+               "ldr %r3, 2f\n"               /* r3 = &_auxv (low bits)                              */
+               "str r4, [r3]\n"              /* store r4 into _auxv                                 */
 
-               "mov %r3, $8\n"               // AAPCS : sp must be 8-byte aligned in the
-               "neg %r3, %r3\n"              //         callee, and bl doesn't push (lr=pc)
-               "and %r3, %r3, %r1\n"         // so we do sp = r1(=sp) & r3(=-8);
-               "mov %sp, %r3\n"              //
+               "mov %r3, $8\n"               /* AAPCS : sp must be 8-byte aligned in the            */
+               "neg %r3, %r3\n"              /*         callee, and bl doesn't push (lr=pc)         */
+               "and %r3, %r3, %r1\n"         /* so we do sp = r1(=sp) & r3(=-8);                    */
+               "mov %sp, %r3\n"
 
-               "bl main\n"                   // main() returns the status code, we'll exit with it.
-               "movs r7, $1\n"               // NR_exit == 1
+               "bl main\n"                   /* main() returns the status code, we'll exit with it. */
+               "movs r7, $1\n"               /* NR_exit == 1                                        */
                "svc $0x00\n"
-               ".align 2\n"                  // below are the pointers to a few variables
+               ".align 2\n"                  /* below are the pointers to a few variables           */
                "1:\n"
                ".word environ\n"
                "2:\n"
@@ -239,4 +267,4 @@ void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) _start(void)
        __builtin_unreachable();
 }
 
-#endif // _NOLIBC_ARCH_ARM_H
+#endif /* _NOLIBC_ARCH_ARM_H */
index 2d98d78..3d672d9 100644 (file)
@@ -7,6 +7,8 @@
 #ifndef _NOLIBC_ARCH_I386_H
 #define _NOLIBC_ARCH_I386_H
 
+#include "compiler.h"
+
 /* The struct returned by the stat() syscall, 32-bit only, the syscall returns
  * exactly 56 bytes (stops before the unused array).
  */
@@ -181,8 +183,6 @@ struct sys_stat_struct {
 char **environ __attribute__((weak));
 const unsigned long *_auxv __attribute__((weak));
 
-#define __ARCH_SUPPORTS_STACK_PROTECTOR
-
 /* startup code */
 /*
  * i386 System V ABI mandates:
@@ -190,35 +190,35 @@ const unsigned long *_auxv __attribute__((weak));
  * 2) The deepest stack frame should be set to zero
  *
  */
-void __attribute__((weak,noreturn,optimize("omit-frame-pointer"),no_stack_protector)) _start(void)
+void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) __no_stack_protector _start(void)
 {
        __asm__ volatile (
-#ifdef NOLIBC_STACKPROTECTOR
-               "call __stack_chk_init\n"   // initialize stack protector
+#ifdef _NOLIBC_STACKPROTECTOR
+               "call __stack_chk_init\n"   /* initialize stack protector                    */
 #endif
-               "pop %eax\n"                // argc   (first arg, %eax)
-               "mov %esp, %ebx\n"          // argv[] (second arg, %ebx)
-               "lea 4(%ebx,%eax,4),%ecx\n" // then a NULL then envp (third arg, %ecx)
-               "mov %ecx, environ\n"       // save environ
-               "xor %ebp, %ebp\n"          // zero the stack frame
-               "mov %ecx, %edx\n"          // search for auxv (follows NULL after last env)
+               "pop %eax\n"                /* argc   (first arg, %eax)                      */
+               "mov %esp, %ebx\n"          /* argv[] (second arg, %ebx)                     */
+               "lea 4(%ebx,%eax,4),%ecx\n" /* then a NULL then envp (third arg, %ecx)       */
+               "mov %ecx, environ\n"       /* save environ                                  */
+               "xor %ebp, %ebp\n"          /* zero the stack frame                          */
+               "mov %ecx, %edx\n"          /* search for auxv (follows NULL after last env) */
                "0:\n"
-               "add $4, %edx\n"            // search for auxv using edx, it follows the
-               "cmp -4(%edx), %ebp\n"      // ... NULL after last env (ebp is zero here)
+               "add $4, %edx\n"            /* search for auxv using edx, it follows the     */
+               "cmp -4(%edx), %ebp\n"      /* ... NULL after last env (ebp is zero here)    */
                "jnz 0b\n"
-               "mov %edx, _auxv\n"         // save it into _auxv
-               "and $-16, %esp\n"          // x86 ABI : esp must be 16-byte aligned before
-               "sub $4, %esp\n"            // the call instruction (args are aligned)
-               "push %ecx\n"               // push all registers on the stack so that we
-               "push %ebx\n"               // support both regparm and plain stack modes
+               "mov %edx, _auxv\n"         /* save it into _auxv                            */
+               "and $-16, %esp\n"          /* x86 ABI : esp must be 16-byte aligned before  */
+               "sub $4, %esp\n"            /* the call instruction (args are aligned)       */
+               "push %ecx\n"               /* push all registers on the stack so that we    */
+               "push %ebx\n"               /* support both regparm and plain stack modes    */
                "push %eax\n"
-               "call main\n"               // main() returns the status code in %eax
-               "mov %eax, %ebx\n"          // retrieve exit code (32-bit int)
-               "movl $1, %eax\n"           // NR_exit == 1
-               "int $0x80\n"               // exit now
-               "hlt\n"                     // ensure it does not
+               "call main\n"               /* main() returns the status code in %eax        */
+               "mov %eax, %ebx\n"          /* retrieve exit code (32-bit int)               */
+               "movl $1, %eax\n"           /* NR_exit == 1                                  */
+               "int $0x80\n"               /* exit now                                      */
+               "hlt\n"                     /* ensure it does not                            */
        );
        __builtin_unreachable();
 }
 
-#endif // _NOLIBC_ARCH_I386_H
+#endif /* _NOLIBC_ARCH_I386_H */
index 029ee3c..ad3f266 100644
@@ -7,6 +7,8 @@
 #ifndef _NOLIBC_ARCH_LOONGARCH_H
 #define _NOLIBC_ARCH_LOONGARCH_H
 
+#include "compiler.h"
+
 /* Syscalls for LoongArch :
  *   - stack is 16-byte aligned
  *   - syscall number is passed in a7
@@ -158,7 +160,7 @@ const unsigned long *_auxv __attribute__((weak));
 #define LONG_ADDI    "addi.w"
 #define LONG_SLL     "slli.w"
 #define LONG_BSTRINS "bstrins.w"
-#else // __loongarch_grlen == 64
+#else /* __loongarch_grlen == 64 */
 #define LONGLOG      "3"
 #define SZREG        "8"
 #define REG_L        "ld.d"
@@ -170,31 +172,34 @@ const unsigned long *_auxv __attribute__((weak));
 #endif
 
 /* startup code */
-void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) _start(void)
+void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) __no_stack_protector _start(void)
 {
        __asm__ volatile (
-               REG_L        " $a0, $sp, 0\n"         // argc (a0) was in the stack
-               LONG_ADDI    " $a1, $sp, "SZREG"\n"   // argv (a1) = sp + SZREG
-               LONG_SLL     " $a2, $a0, "LONGLOG"\n" // envp (a2) = SZREG*argc ...
-               LONG_ADDI    " $a2, $a2, "SZREG"\n"   //             + SZREG (skip null)
-               LONG_ADD     " $a2, $a2, $a1\n"       //             + argv
-
-               "move          $a3, $a2\n"            // iterate a3 over envp to find auxv (after NULL)
-               "0:\n"                                // do {
-               REG_L        " $a4, $a3, 0\n"         //   a4 = *a3;
-               LONG_ADDI    " $a3, $a3, "SZREG"\n"   //   a3 += sizeof(void*);
-               "bne           $a4, $zero, 0b\n"      // } while (a4);
-               "la.pcrel      $a4, _auxv\n"          // a4 = &_auxv
-               LONG_S       " $a3, $a4, 0\n"         // store a3 into _auxv
-
-               "la.pcrel      $a3, environ\n"        // a3 = &environ
-               LONG_S       " $a2, $a3, 0\n"         // store envp(a2) into environ
-               LONG_BSTRINS " $sp, $zero, 3, 0\n"    // sp must be 16-byte aligned
-               "bl            main\n"                // main() returns the status code, we'll exit with it.
-               "li.w          $a7, 93\n"             // NR_exit == 93
+#ifdef _NOLIBC_STACKPROTECTOR
+               "bl __stack_chk_init\n"               /* initialize stack protector                          */
+#endif
+               REG_L        " $a0, $sp, 0\n"         /* argc (a0) was in the stack                          */
+               LONG_ADDI    " $a1, $sp, "SZREG"\n"   /* argv (a1) = sp + SZREG                              */
+               LONG_SLL     " $a2, $a0, "LONGLOG"\n" /* envp (a2) = SZREG*argc ...                          */
+               LONG_ADDI    " $a2, $a2, "SZREG"\n"   /*             + SZREG (skip null)                     */
+               LONG_ADD     " $a2, $a2, $a1\n"       /*             + argv                                  */
+
+               "move          $a3, $a2\n"            /* iterate a3 over envp to find auxv (after NULL)      */
+               "0:\n"                                /* do {                                                */
+               REG_L        " $a4, $a3, 0\n"         /*   a4 = *a3;                                         */
+               LONG_ADDI    " $a3, $a3, "SZREG"\n"   /*   a3 += sizeof(void*);                              */
+               "bne           $a4, $zero, 0b\n"      /* } while (a4);                                       */
+               "la.pcrel      $a4, _auxv\n"          /* a4 = &_auxv                                         */
+               LONG_S       " $a3, $a4, 0\n"         /* store a3 into _auxv                                 */
+
+               "la.pcrel      $a3, environ\n"        /* a3 = &environ                                       */
+               LONG_S       " $a2, $a3, 0\n"         /* store envp(a2) into environ                         */
+               LONG_BSTRINS " $sp, $zero, 3, 0\n"    /* sp must be 16-byte aligned                          */
+               "bl            main\n"                /* main() returns the status code, we'll exit with it. */
+               "li.w          $a7, 93\n"             /* NR_exit == 93                                       */
                "syscall       0\n"
        );
        __builtin_unreachable();
 }
 
-#endif // _NOLIBC_ARCH_LOONGARCH_H
+#endif /* _NOLIBC_ARCH_LOONGARCH_H */
index bf83432..db24e08 100644
@@ -7,6 +7,8 @@
 #ifndef _NOLIBC_ARCH_MIPS_H
 #define _NOLIBC_ARCH_MIPS_H
 
+#include "compiler.h"
+
 /* The struct returned by the stat() syscall. 88 bytes are returned by the
  * syscall.
  */
@@ -180,45 +182,49 @@ char **environ __attribute__((weak));
 const unsigned long *_auxv __attribute__((weak));
 
 /* startup code, note that it's called __start on MIPS */
-void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) __start(void)
+void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) __no_stack_protector __start(void)
 {
        __asm__ volatile (
-               //".set nomips16\n"
+               /*".set nomips16\n"*/
                ".set push\n"
                ".set    noreorder\n"
                ".option pic0\n"
-               //".ent __start\n"
-               //"__start:\n"
-               "lw $a0,($sp)\n"        // argc was in the stack
-               "addiu  $a1, $sp, 4\n"  // argv = sp + 4
-               "sll $a2, $a0, 2\n"     // a2 = argc * 4
-               "add   $a2, $a2, $a1\n" // envp = argv + 4*argc ...
-               "addiu $a2, $a2, 4\n"   //        ... + 4
-               "lui $a3, %hi(environ)\n"     // load environ into a3 (hi)
-               "addiu $a3, %lo(environ)\n"   // load environ into a3 (lo)
-               "sw $a2,($a3)\n"              // store envp(a2) into environ
-
-               "move $t0, $a2\n"             // iterate t0 over envp, look for NULL
-               "0:"                          // do {
-               "lw $a3, ($t0)\n"             //   a3=*(t0);
-               "bne $a3, $0, 0b\n"           // } while (a3);
-               "addiu $t0, $t0, 4\n"         // delayed slot: t0+=4;
-               "lui $a3, %hi(_auxv)\n"       // load _auxv into a3 (hi)
-               "addiu $a3, %lo(_auxv)\n"     // load _auxv into a3 (lo)
-               "sw $t0, ($a3)\n"             // store t0 into _auxv
+#ifdef _NOLIBC_STACKPROTECTOR
+               "jal __stack_chk_init\n" /* initialize stack protector                         */
+               "nop\n"                  /* delayed slot                                       */
+#endif
+               /*".ent __start\n"*/
+               /*"__start:\n"*/
+               "lw $a0,($sp)\n"        /* argc was in the stack                               */
+               "addiu  $a1, $sp, 4\n"  /* argv = sp + 4                                       */
+               "sll $a2, $a0, 2\n"     /* a2 = argc * 4                                       */
+               "add   $a2, $a2, $a1\n" /* envp = argv + 4*argc ...                            */
+               "addiu $a2, $a2, 4\n"   /*        ... + 4                                      */
+               "lui $a3, %hi(environ)\n"     /* load environ into a3 (hi)                     */
+               "addiu $a3, %lo(environ)\n"   /* load environ into a3 (lo)                     */
+               "sw $a2,($a3)\n"              /* store envp(a2) into environ                   */
+
+               "move $t0, $a2\n"             /* iterate t0 over envp, look for NULL           */
+               "0:"                          /* do {                                          */
+               "lw $a3, ($t0)\n"             /*   a3=*(t0);                                   */
+               "bne $a3, $0, 0b\n"           /* } while (a3);                                 */
+               "addiu $t0, $t0, 4\n"         /* delayed slot: t0+=4;                          */
+               "lui $a3, %hi(_auxv)\n"       /* load _auxv into a3 (hi)                       */
+               "addiu $a3, %lo(_auxv)\n"     /* load _auxv into a3 (lo)                       */
+               "sw $t0, ($a3)\n"             /* store t0 into _auxv                           */
 
                "li $t0, -8\n"
-               "and $sp, $sp, $t0\n"   // sp must be 8-byte aligned
-               "addiu $sp,$sp,-16\n"   // the callee expects to save a0..a3 there!
-               "jal main\n"            // main() returns the status code, we'll exit with it.
-               "nop\n"                 // delayed slot
-               "move $a0, $v0\n"       // retrieve 32-bit exit code from v0
-               "li $v0, 4001\n"        // NR_exit == 4001
+               "and $sp, $sp, $t0\n"   /* sp must be 8-byte aligned                           */
+               "addiu $sp,$sp,-16\n"   /* the callee expects to save a0..a3 there!            */
+               "jal main\n"            /* main() returns the status code, we'll exit with it. */
+               "nop\n"                 /* delayed slot                                        */
+               "move $a0, $v0\n"       /* retrieve 32-bit exit code from v0                   */
+               "li $v0, 4001\n"        /* NR_exit == 4001                                     */
                "syscall\n"
-               //".end __start\n"
+               /*".end __start\n"*/
                ".set pop\n"
        );
        __builtin_unreachable();
 }
 
-#endif // _NOLIBC_ARCH_MIPS_H
+#endif /* _NOLIBC_ARCH_MIPS_H */
index e197fcb..a2e8564 100644
@@ -7,6 +7,8 @@
 #ifndef _NOLIBC_ARCH_RISCV_H
 #define _NOLIBC_ARCH_RISCV_H
 
+#include "compiler.h"
+
 struct sys_stat_struct {
        unsigned long   st_dev;         /* Device.  */
        unsigned long   st_ino;         /* File serial number.  */
@@ -33,9 +35,13 @@ struct sys_stat_struct {
 #if   __riscv_xlen == 64
 #define PTRLOG "3"
 #define SZREG  "8"
+#define REG_L  "ld"
+#define REG_S  "sd"
 #elif __riscv_xlen == 32
 #define PTRLOG "2"
 #define SZREG  "4"
+#define REG_L  "lw"
+#define REG_S  "sw"
 #endif
 
 /* Syscalls for RISCV :
@@ -174,35 +180,38 @@ char **environ __attribute__((weak));
 const unsigned long *_auxv __attribute__((weak));
 
 /* startup code */
-void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) _start(void)
+void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) __no_stack_protector _start(void)
 {
        __asm__ volatile (
                ".option push\n"
                ".option norelax\n"
                "lla   gp, __global_pointer$\n"
                ".option pop\n"
-               "lw    a0, 0(sp)\n"          // argc (a0) was in the stack
-               "add   a1, sp, "SZREG"\n"    // argv (a1) = sp
-               "slli  a2, a0, "PTRLOG"\n"   // envp (a2) = SZREG*argc ...
-               "add   a2, a2, "SZREG"\n"    //             + SZREG (skip null)
-               "add   a2,a2,a1\n"           //             + argv
-
-               "add   a3, a2, zero\n"       // iterate a3 over envp to find auxv (after NULL)
-               "0:\n"                       // do {
-               "ld    a4, 0(a3)\n"          //   a4 = *a3;
-               "add   a3, a3, "SZREG"\n"    //   a3 += sizeof(void*);
-               "bne   a4, zero, 0b\n"       // } while (a4);
-               "lui   a4, %hi(_auxv)\n"     // a4 = &_auxv (high bits)
-               "sd    a3, %lo(_auxv)(a4)\n" // store a3 into _auxv
-
-               "lui a3, %hi(environ)\n"     // a3 = &environ (high bits)
-               "sd a2,%lo(environ)(a3)\n"   // store envp(a2) into environ
-               "andi  sp,a1,-16\n"          // sp must be 16-byte aligned
-               "call  main\n"               // main() returns the status code, we'll exit with it.
-               "li a7, 93\n"                // NR_exit == 93
+#ifdef _NOLIBC_STACKPROTECTOR
+               "call __stack_chk_init\n"    /* initialize stack protector                          */
+#endif
+               REG_L" a0, 0(sp)\n"          /* argc (a0) was in the stack                          */
+               "add   a1, sp, "SZREG"\n"    /* argv (a1) = sp                                      */
+               "slli  a2, a0, "PTRLOG"\n"   /* envp (a2) = SZREG*argc ...                          */
+               "add   a2, a2, "SZREG"\n"    /*             + SZREG (skip null)                     */
+               "add   a2,a2,a1\n"           /*             + argv                                  */
+
+               "add   a3, a2, zero\n"       /* iterate a3 over envp to find auxv (after NULL)      */
+               "0:\n"                       /* do {                                                */
+               REG_L" a4, 0(a3)\n"          /*   a4 = *a3;                                         */
+               "add   a3, a3, "SZREG"\n"    /*   a3 += sizeof(void*);                              */
+               "bne   a4, zero, 0b\n"       /* } while (a4);                                       */
+               "lui   a4, %hi(_auxv)\n"     /* a4 = &_auxv (high bits)                             */
+               REG_S" a3, %lo(_auxv)(a4)\n" /* store a3 into _auxv                                 */
+
+               "lui   a3, %hi(environ)\n"   /* a3 = &environ (high bits)                           */
+               REG_S" a2,%lo(environ)(a3)\n"/* store envp(a2) into environ                         */
+               "andi  sp,a1,-16\n"          /* sp must be 16-byte aligned                          */
+               "call  main\n"               /* main() returns the status code, we'll exit with it. */
+               "li a7, 93\n"                /* NR_exit == 93                                       */
                "ecall\n"
        );
        __builtin_unreachable();
 }
 
-#endif // _NOLIBC_ARCH_RISCV_H
+#endif /* _NOLIBC_ARCH_RISCV_H */
index 6b0e54e..516dff5 100644
@@ -5,8 +5,11 @@
 
 #ifndef _NOLIBC_ARCH_S390_H
 #define _NOLIBC_ARCH_S390_H
+#include <asm/signal.h>
 #include <asm/unistd.h>
 
+#include "compiler.h"
+
 /* The struct returned by the stat() syscall, equivalent to stat64(). The
  * syscall returns 116 bytes and stops in the middle of __unused.
  */
@@ -163,7 +166,7 @@ char **environ __attribute__((weak));
 const unsigned long *_auxv __attribute__((weak));
 
 /* startup code */
-void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) _start(void)
+void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) __no_stack_protector _start(void)
 {
        __asm__ volatile (
                "lg     %r2,0(%r15)\n"          /* argument count */
@@ -223,4 +226,12 @@ void *sys_mmap(void *addr, size_t length, int prot, int flags, int fd,
        return (void *)my_syscall1(__NR_mmap, &args);
 }
 #define sys_mmap sys_mmap
-#endif // _NOLIBC_ARCH_S390_H
+
+static __attribute__((unused))
+pid_t sys_fork(void)
+{
+       return my_syscall5(__NR_clone, 0, SIGCHLD, 0, 0, 0);
+}
+#define sys_fork sys_fork
+
+#endif /* _NOLIBC_ARCH_S390_H */
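
(Illustration, not part of the diff.) s390 gets its own sys_fork() presumably because its clone() takes the new stack pointer as the first argument and the flags second, the reverse of most architectures; the #define lets the generic fallback in sys.h (now guarded by #ifndef sys_fork, see below) step aside. Callers are unchanged, e.g.:

    pid_t pid = fork();     /* fork() in sys.h wraps whichever sys_fork() was selected */
    if (pid == 0) {
            /* child */
    } else if (pid > 0) {
            /* parent, pid is the child's PID */
    } else {
            /* failure: fork() returned -1 and set errno */
    }
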
index f7f2a11..6fc4d83 100644
@@ -7,6 +7,8 @@
 #ifndef _NOLIBC_ARCH_X86_64_H
 #define _NOLIBC_ARCH_X86_64_H
 
+#include "compiler.h"
+
 /* The struct returned by the stat() syscall, equivalent to stat64(). The
  * syscall returns 116 bytes and stops in the middle of __unused.
  */
@@ -181,8 +183,6 @@ struct sys_stat_struct {
 char **environ __attribute__((weak));
 const unsigned long *_auxv __attribute__((weak));
 
-#define __ARCH_SUPPORTS_STACK_PROTECTOR
-
 /* startup code */
 /*
  * x86-64 System V ABI mandates:
@@ -190,31 +190,31 @@ const unsigned long *_auxv __attribute__((weak));
  * 2) The deepest stack frame should be zero (the %rbp).
  *
  */
-void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) _start(void)
+void __attribute__((weak,noreturn,optimize("omit-frame-pointer"))) __no_stack_protector _start(void)
 {
        __asm__ volatile (
-#ifdef NOLIBC_STACKPROTECTOR
-               "call __stack_chk_init\n"   // initialize stack protector
+#ifdef _NOLIBC_STACKPROTECTOR
+               "call __stack_chk_init\n"   /* initialize stack protector                          */
 #endif
-               "pop %rdi\n"                // argc   (first arg, %rdi)
-               "mov %rsp, %rsi\n"          // argv[] (second arg, %rsi)
-               "lea 8(%rsi,%rdi,8),%rdx\n" // then a NULL then envp (third arg, %rdx)
-               "mov %rdx, environ\n"       // save environ
-               "xor %ebp, %ebp\n"          // zero the stack frame
-               "mov %rdx, %rax\n"          // search for auxv (follows NULL after last env)
+               "pop %rdi\n"                /* argc   (first arg, %rdi)                            */
+               "mov %rsp, %rsi\n"          /* argv[] (second arg, %rsi)                           */
+               "lea 8(%rsi,%rdi,8),%rdx\n" /* then a NULL then envp (third arg, %rdx)             */
+               "mov %rdx, environ\n"       /* save environ                                        */
+               "xor %ebp, %ebp\n"          /* zero the stack frame                                */
+               "mov %rdx, %rax\n"          /* search for auxv (follows NULL after last env)       */
                "0:\n"
-               "add $8, %rax\n"            // search for auxv using rax, it follows the
-               "cmp -8(%rax), %rbp\n"      // ... NULL after last env (rbp is zero here)
+               "add $8, %rax\n"            /* search for auxv using rax, it follows the           */
+               "cmp -8(%rax), %rbp\n"      /* ... NULL after last env (rbp is zero here)          */
                "jnz 0b\n"
-               "mov %rax, _auxv\n"         // save it into _auxv
-               "and $-16, %rsp\n"          // x86 ABI : esp must be 16-byte aligned before call
-               "call main\n"               // main() returns the status code, we'll exit with it.
-               "mov %eax, %edi\n"          // retrieve exit code (32 bit)
-               "mov $60, %eax\n"           // NR_exit == 60
-               "syscall\n"                 // really exit
-               "hlt\n"                     // ensure it does not return
+               "mov %rax, _auxv\n"         /* save it into _auxv                                  */
+               "and $-16, %rsp\n"          /* x86 ABI : esp must be 16-byte aligned before call   */
+               "call main\n"               /* main() returns the status code, we'll exit with it. */
+               "mov %eax, %edi\n"          /* retrieve exit code (32 bit)                         */
+               "mov $60, %eax\n"           /* NR_exit == 60                                       */
+               "syscall\n"                 /* really exit                                         */
+               "hlt\n"                     /* ensure it does not return                           */
        );
        __builtin_unreachable();
 }
 
-#endif // _NOLIBC_ARCH_X86_64_H
+#endif /* _NOLIBC_ARCH_X86_64_H */
index 2d5386a..82b4393 100644
@@ -7,7 +7,7 @@
  * the syscall declarations and the _start code definition. This is the only
  * global part. On all architectures the kernel puts everything in the stack
  * before jumping to _start just above us, without any return address (_start
- * is not a function but an entry pint). So at the stack pointer we find argc.
+ * is not a function but an entry point). So at the stack pointer we find argc.
  * Then argv[] begins, and ends at the first NULL. Then we have envp which
  * starts and ends with a NULL as well. So envp=argv+argc+1.
  */
diff --git a/tools/include/nolibc/compiler.h b/tools/include/nolibc/compiler.h
new file mode 100644
index 0000000..beddc36
--- /dev/null
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * NOLIBC compiler support header
+ * Copyright (C) 2023 Thomas Weißschuh <linux@weissschuh.net>
+ */
+#ifndef _NOLIBC_COMPILER_H
+#define _NOLIBC_COMPILER_H
+
+#if defined(__SSP__) || defined(__SSP_STRONG__) || defined(__SSP_ALL__) || defined(__SSP_EXPLICIT__)
+
+#define _NOLIBC_STACKPROTECTOR
+
+#endif /* defined(__SSP__) ... */
+
+#if defined(__has_attribute)
+#  if __has_attribute(no_stack_protector)
+#    define __no_stack_protector __attribute__((no_stack_protector))
+#  else
+#    define __no_stack_protector __attribute__((__optimize__("-fno-stack-protector")))
+#  endif
+#else
+#  define __no_stack_protector __attribute__((__optimize__("-fno-stack-protector")))
+#endif /* defined(__has_attribute) */
+
+#endif /* _NOLIBC_COMPILER_H */
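
(Illustration, not part of the diff.) _NOLIBC_STACKPROTECTOR is now derived from the compiler's own SSP predefines, so building with e.g. -fstack-protector-strong defines __SSP_STRONG__ and enables the feature, while __no_stack_protector exempts individual functions such as _start. A minimal sketch of how the pieces combine (the function name is made up):

    #include "compiler.h"

    __no_stack_protector                    /* no canary checks in this function */
    static void before_guard_is_seeded(void)
    {
    #ifdef _NOLIBC_STACKPROTECTOR
            /* safe even though __stack_chk_init() has not run yet:
             * this function carries no stack-protector instrumentation */
    #endif
    }
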
index 04739a6..05a228a 100644
 #include "sys.h"
 #include "ctype.h"
 #include "signal.h"
+#include "unistd.h"
 #include "stdio.h"
 #include "stdlib.h"
 #include "string.h"
 #include "time.h"
-#include "unistd.h"
 #include "stackprotector.h"
 
 /* Used by programs to avoid std includes */
index d119cbb..88f7b2d 100644
@@ -7,13 +7,9 @@
 #ifndef _NOLIBC_STACKPROTECTOR_H
 #define _NOLIBC_STACKPROTECTOR_H
 
-#include "arch.h"
+#include "compiler.h"
 
-#if defined(NOLIBC_STACKPROTECTOR)
-
-#if !defined(__ARCH_SUPPORTS_STACK_PROTECTOR)
-#error "nolibc does not support stack protectors on this arch"
-#endif
+#if defined(_NOLIBC_STACKPROTECTOR)
 
 #include "sys.h"
 #include "stdlib.h"
@@ -41,13 +37,14 @@ void __stack_chk_fail_local(void)
 __attribute__((weak,section(".data.nolibc_stack_chk")))
 uintptr_t __stack_chk_guard;
 
-__attribute__((weak,no_stack_protector,section(".text.nolibc_stack_chk")))
+__attribute__((weak,section(".text.nolibc_stack_chk"))) __no_stack_protector
 void __stack_chk_init(void)
 {
        my_syscall3(__NR_getrandom, &__stack_chk_guard, sizeof(__stack_chk_guard), 0);
-       /* a bit more randomness in case getrandom() fails */
-       __stack_chk_guard ^= (uintptr_t) &__stack_chk_guard;
+       /* a bit more randomness in case getrandom() fails, ensure the guard is never 0 */
+       if (__stack_chk_guard != (uintptr_t) &__stack_chk_guard)
+               __stack_chk_guard ^= (uintptr_t) &__stack_chk_guard;
 }
-#endif // defined(NOLIBC_STACKPROTECTOR)
+#endif /* defined(_NOLIBC_STACKPROTECTOR) */
 
-#endif // _NOLIBC_STACKPROTECTOR_H
+#endif /* _NOLIBC_STACKPROTECTOR_H */
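
(Illustration, not part of the diff.) A conceptual sketch of what -fstack-protector makes the compiler emit around a protected function, which is why __stack_chk_guard must be seeded by __stack_chk_init() before any instrumented code runs (actual code generation is arch-specific, function name is illustrative):

    #include <stdint.h>

    extern uintptr_t __stack_chk_guard;
    extern void __stack_chk_fail(void);

    void protected_fn(void)
    {
            uintptr_t canary = __stack_chk_guard;   /* copied next to the locals on entry */
            char buf[64];
            /* ... body that might overflow buf ... */
            (void)buf;
            if (canary != __stack_chk_guard)        /* re-checked before returning */
                    __stack_chk_fail();             /* does not return */
    }
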
index c1ce4f5..4b28243 100644
@@ -36,8 +36,8 @@ typedef  ssize_t       int_fast16_t;
 typedef   size_t      uint_fast16_t;
 typedef  ssize_t       int_fast32_t;
 typedef   size_t      uint_fast32_t;
-typedef  ssize_t       int_fast64_t;
-typedef   size_t      uint_fast64_t;
+typedef  int64_t       int_fast64_t;
+typedef uint64_t      uint_fast64_t;
 
 typedef  int64_t           intmax_t;
 typedef uint64_t          uintmax_t;
@@ -84,16 +84,30 @@ typedef uint64_t          uintmax_t;
 #define  INT_FAST8_MIN   INT8_MIN
 #define INT_FAST16_MIN   INTPTR_MIN
 #define INT_FAST32_MIN   INTPTR_MIN
-#define INT_FAST64_MIN   INTPTR_MIN
+#define INT_FAST64_MIN   INT64_MIN
 
 #define  INT_FAST8_MAX   INT8_MAX
 #define INT_FAST16_MAX   INTPTR_MAX
 #define INT_FAST32_MAX   INTPTR_MAX
-#define INT_FAST64_MAX   INTPTR_MAX
+#define INT_FAST64_MAX   INT64_MAX
 
 #define  UINT_FAST8_MAX  UINT8_MAX
 #define UINT_FAST16_MAX  SIZE_MAX
 #define UINT_FAST32_MAX  SIZE_MAX
-#define UINT_FAST64_MAX  SIZE_MAX
+#define UINT_FAST64_MAX  UINT64_MAX
+
+#ifndef INT_MIN
+#define INT_MIN          (-__INT_MAX__ - 1)
+#endif
+#ifndef INT_MAX
+#define INT_MAX          __INT_MAX__
+#endif
+
+#ifndef LONG_MIN
+#define LONG_MIN         (-__LONG_MAX__ - 1)
+#endif
+#ifndef LONG_MAX
+#define LONG_MAX         __LONG_MAX__
+#endif
 
 #endif /* _NOLIBC_STDINT_H */
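
(Illustration, not part of the diff.) On 32-bit targets ssize_t/size_t are only 32 bits wide, so the old typedefs made the *_fast64_t types unable to hold 64-bit values. A compile-time check of the corrected definitions, assuming C11 _Static_assert is available:

    #include "stdint.h"

    _Static_assert(sizeof(int_fast64_t)  == 8,    "int_fast64_t must be 64 bits");
    _Static_assert(sizeof(uint_fast64_t) == 8,    "uint_fast64_t must be 64 bits");
    _Static_assert(INT_FAST64_MAX  == INT64_MAX,  "limit must match the type");
    _Static_assert(UINT_FAST64_MAX == UINT64_MAX, "limit must match the type");
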
index 6cbbb52..0eef91d 100644
 #define EOF (-1)
 #endif
 
-/* just define FILE as a non-empty type */
+/* just define FILE as a non-empty type. The value of the pointer gives
+ * the FD: FILE=~fd for fd>=0 or NULL for fd<0. This way positive FILE
+ * are immediately identified as abnormal entries (i.e. possible copies
+ * of valid pointers to something else).
+ */
 typedef struct FILE {
        char dummy[1];
 } FILE;
 
-/* We define the 3 common stdio files as constant invalid pointers that
- * are easily recognized.
- */
-static __attribute__((unused)) FILE* const stdin  = (FILE*)-3;
-static __attribute__((unused)) FILE* const stdout = (FILE*)-2;
-static __attribute__((unused)) FILE* const stderr = (FILE*)-1;
+static __attribute__((unused)) FILE* const stdin  = (FILE*)(intptr_t)~STDIN_FILENO;
+static __attribute__((unused)) FILE* const stdout = (FILE*)(intptr_t)~STDOUT_FILENO;
+static __attribute__((unused)) FILE* const stderr = (FILE*)(intptr_t)~STDERR_FILENO;
+
+/* provides a FILE* equivalent of fd. The mode is ignored. */
+static __attribute__((unused))
+FILE *fdopen(int fd, const char *mode __attribute__((unused)))
+{
+       if (fd < 0) {
+               SET_ERRNO(EBADF);
+               return NULL;
+       }
+       return (FILE*)(intptr_t)~fd;
+}
+
+/* provides the fd of stream. */
+static __attribute__((unused))
+int fileno(FILE *stream)
+{
+       intptr_t i = (intptr_t)stream;
+
+       if (i >= 0) {
+               SET_ERRNO(EBADF);
+               return -1;
+       }
+       return ~i;
+}
+
+/* flush a stream. */
+static __attribute__((unused))
+int fflush(FILE *stream)
+{
+       intptr_t i = (intptr_t)stream;
+
+       /* NULL is valid here. */
+       if (i > 0) {
+               SET_ERRNO(EBADF);
+               return -1;
+       }
+
+       /* Don't do anything, nolibc does not support buffering. */
+       return 0;
+}
+
+/* close a stream. */
+static __attribute__((unused))
+int fclose(FILE *stream)
+{
+       intptr_t i = (intptr_t)stream;
+
+       if (i >= 0) {
+               SET_ERRNO(EBADF);
+               return -1;
+       }
+
+       if (close(~i))
+               return EOF;
+
+       return 0;
+}
 
 /* getc(), fgetc(), getchar() */
 
@@ -41,14 +99,8 @@ static __attribute__((unused))
 int fgetc(FILE* stream)
 {
        unsigned char ch;
-       int fd;
 
-       if (stream < stdin || stream > stderr)
-               return EOF;
-
-       fd = 3 + (long)stream;
-
-       if (read(fd, &ch, 1) <= 0)
+       if (read(fileno(stream), &ch, 1) <= 0)
                return EOF;
        return ch;
 }
@@ -68,14 +120,8 @@ static __attribute__((unused))
 int fputc(int c, FILE* stream)
 {
        unsigned char ch = c;
-       int fd;
-
-       if (stream < stdin || stream > stderr)
-               return EOF;
-
-       fd = 3 + (long)stream;
 
-       if (write(fd, &ch, 1) <= 0)
+       if (write(fileno(stream), &ch, 1) <= 0)
                return EOF;
        return ch;
 }
@@ -96,12 +142,7 @@ static __attribute__((unused))
 int _fwrite(const void *buf, size_t size, FILE *stream)
 {
        ssize_t ret;
-       int fd;
-
-       if (stream < stdin || stream > stderr)
-               return EOF;
-
-       fd = 3 + (long)stream;
+       int fd = fileno(stream);
 
        while (size) {
                ret = write(fd, buf, size);
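
(Illustration, not part of the diff.) A short round trip under the new encoding: stdout is (FILE *)~1, i.e. a negative pointer value that maps back to fd 1. This is presumably also why unistd.h is now included ahead of stdio.h in nolibc.h above, so that the STD*_FILENO constants are visible here. Function name is made up:

    static void stdio_fd_roundtrip(void)
    {
            FILE *f = fdopen(STDOUT_FILENO, "w");   /* (FILE *)(intptr_t)~1 == stdout     */

            fputc('!', f);                          /* fileno(f) == 1, so write(1, .., 1) */
            fflush(f);                              /* no buffering: always returns 0     */
            fileno(NULL);                           /* (intptr_t)NULL >= 0: -1 and EBADF  */
    }
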
index 894c955..902162f 100644
@@ -102,7 +102,7 @@ char *_getenv(const char *name, char **environ)
        return NULL;
 }
 
-static inline __attribute__((unused,always_inline))
+static __inline__ __attribute__((unused,always_inline))
 char *getenv(const char *name)
 {
        extern char **environ;
@@ -231,7 +231,7 @@ int utoh_r(unsigned long in, char *buffer)
 /* converts unsigned long <in> to an hex string using the static itoa_buffer
  * and returns the pointer to that string.
  */
-static inline __attribute__((unused))
+static __inline__ __attribute__((unused))
 char *utoh(unsigned long in)
 {
        utoh_r(in, itoa_buffer);
@@ -293,7 +293,7 @@ int itoa_r(long in, char *buffer)
 /* for historical compatibility, same as above but returns the pointer to the
  * buffer.
  */
-static inline __attribute__((unused))
+static __inline__ __attribute__((unused))
 char *ltoa_r(long in, char *buffer)
 {
        itoa_r(in, buffer);
@@ -303,7 +303,7 @@ char *ltoa_r(long in, char *buffer)
 /* converts long integer <in> to a string using the static itoa_buffer and
  * returns the pointer to that string.
  */
-static inline __attribute__((unused))
+static __inline__ __attribute__((unused))
 char *itoa(long in)
 {
        itoa_r(in, itoa_buffer);
@@ -313,7 +313,7 @@ char *itoa(long in)
 /* converts long integer <in> to a string using the static itoa_buffer and
  * returns the pointer to that string. Same as above, for compatibility.
  */
-static inline __attribute__((unused))
+static __inline__ __attribute__((unused))
 char *ltoa(long in)
 {
        itoa_r(in, itoa_buffer);
@@ -323,7 +323,7 @@ char *ltoa(long in)
 /* converts unsigned long integer <in> to a string using the static itoa_buffer
  * and returns the pointer to that string.
  */
-static inline __attribute__((unused))
+static __inline__ __attribute__((unused))
 char *utoa(unsigned long in)
 {
        utoa_r(in, itoa_buffer);
@@ -367,7 +367,7 @@ int u64toh_r(uint64_t in, char *buffer)
 /* converts uint64_t <in> to an hex string using the static itoa_buffer and
  * returns the pointer to that string.
  */
-static inline __attribute__((unused))
+static __inline__ __attribute__((unused))
 char *u64toh(uint64_t in)
 {
        u64toh_r(in, itoa_buffer);
@@ -429,7 +429,7 @@ int i64toa_r(int64_t in, char *buffer)
 /* converts int64_t <in> to a string using the static itoa_buffer and returns
  * the pointer to that string.
  */
-static inline __attribute__((unused))
+static __inline__ __attribute__((unused))
 char *i64toa(int64_t in)
 {
        i64toa_r(in, itoa_buffer);
@@ -439,7 +439,7 @@ char *i64toa(int64_t in)
 /* converts uint64_t <in> to a string using the static itoa_buffer and returns
  * the pointer to that string.
  */
-static inline __attribute__((unused))
+static __inline__ __attribute__((unused))
 char *u64toa(uint64_t in)
 {
        u64toa_r(in, itoa_buffer);
index fffdaf6..0c2e06c 100644
@@ -90,7 +90,7 @@ void *memset(void *dst, int b, size_t len)
 
        while (len--) {
                /* prevent gcc from recognizing memset() here */
-               asm volatile("");
+               __asm__ volatile("");
                *(p++) = b;
        }
        return dst;
@@ -139,7 +139,7 @@ size_t strlen(const char *str)
        size_t len;
 
        for (len = 0; str[len]; len++)
-               asm("");
+               __asm__("");
        return len;
 }
 
index 5d624dc..856249a 100644
 
 /* system includes */
 #include <asm/unistd.h>
-#include <asm/signal.h>  // for SIGCHLD
+#include <asm/signal.h>  /* for SIGCHLD */
 #include <asm/ioctls.h>
 #include <asm/mman.h>
 #include <linux/fs.h>
 #include <linux/loop.h>
 #include <linux/time.h>
 #include <linux/auxvec.h>
-#include <linux/fcntl.h> // for O_* and AT_*
-#include <linux/stat.h>  // for statx()
+#include <linux/fcntl.h> /* for O_* and AT_* */
+#include <linux/stat.h>  /* for statx() */
+#include <linux/reboot.h> /* for LINUX_REBOOT_* */
+#include <linux/prctl.h>
 
 #include "arch.h"
 #include "errno.h"
@@ -322,7 +324,7 @@ static __attribute__((noreturn,unused))
 void sys_exit(int status)
 {
        my_syscall1(__NR_exit, status & 255);
-       while(1); // shut the "noreturn" warnings.
+       while(1); /* shut the "noreturn" warnings. */
 }
 
 static __attribute__((noreturn,unused))
@@ -336,6 +338,7 @@ void exit(int status)
  * pid_t fork(void);
  */
 
+#ifndef sys_fork
 static __attribute__((unused))
 pid_t sys_fork(void)
 {
@@ -351,6 +354,7 @@ pid_t sys_fork(void)
 #error Neither __NR_clone nor __NR_fork defined, cannot implement sys_fork()
 #endif
 }
+#endif
 
 static __attribute__((unused))
 pid_t fork(void)
@@ -858,7 +862,7 @@ int open(const char *path, int flags, ...)
                va_list args;
 
                va_start(args, flags);
-               mode = va_arg(args, mode_t);
+               mode = va_arg(args, int);
                va_end(args);
        }
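
(Illustration, not part of the diff.) Variadic arguments undergo default promotion, so when mode_t is narrower than int, va_arg(args, mode_t) is undefined; the portable pattern is to fetch an int and narrow it, as the change above does. The same idea in isolation (hypothetical helper):

    #include <stdarg.h>

    static mode_t mode_after_flags(int flags, ...)
    {
            va_list args;
            mode_t mode;

            va_start(args, flags);
            mode = (mode_t)va_arg(args, int);       /* read as int, then narrow */
            va_end(args);
            return mode;
    }
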
 
@@ -873,6 +877,32 @@ int open(const char *path, int flags, ...)
 
 
 /*
+ * int prctl(int option, unsigned long arg2, unsigned long arg3,
+ *                       unsigned long arg4, unsigned long arg5);
+ */
+
+static __attribute__((unused))
+int sys_prctl(int option, unsigned long arg2, unsigned long arg3,
+                         unsigned long arg4, unsigned long arg5)
+{
+       return my_syscall5(__NR_prctl, option, arg2, arg3, arg4, arg5);
+}
+
+static __attribute__((unused))
+int prctl(int option, unsigned long arg2, unsigned long arg3,
+                     unsigned long arg4, unsigned long arg5)
+{
+       int ret = sys_prctl(option, arg2, arg3, arg4, arg5);
+
+       if (ret < 0) {
+               SET_ERRNO(-ret);
+               ret = -1;
+       }
+       return ret;
+}
+
+
+/*
  * int pivot_root(const char *new, const char *old);
  */
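
(Illustration, not part of the diff.) A one-line usage sketch of the new wrapper; PR_SET_NAME comes from <linux/prctl.h>, which sys.h now includes, and on failure the wrapper returns -1 with errno set from the negative syscall result:

    prctl(PR_SET_NAME, (unsigned long)"worker", 0, 0, 0);  /* rename the calling thread */
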
 
@@ -909,7 +939,7 @@ int sys_poll(struct pollfd *fds, int nfds, int timeout)
                t.tv_sec  = timeout / 1000;
                t.tv_nsec = (timeout % 1000) * 1000000;
        }
-       return my_syscall4(__NR_ppoll, fds, nfds, (timeout >= 0) ? &t : NULL, NULL);
+       return my_syscall5(__NR_ppoll, fds, nfds, (timeout >= 0) ? &t : NULL, NULL, 0);
 #elif defined(__NR_poll)
        return my_syscall3(__NR_poll, fds, nfds, timeout);
 #else
@@ -1131,23 +1161,26 @@ int sys_stat(const char *path, struct stat *buf)
        long ret;
 
        ret = sys_statx(AT_FDCWD, path, AT_NO_AUTOMOUNT, STATX_BASIC_STATS, &statx);
-       buf->st_dev     = ((statx.stx_dev_minor & 0xff)
-                         | (statx.stx_dev_major << 8)
-                         | ((statx.stx_dev_minor & ~0xff) << 12));
-       buf->st_ino     = statx.stx_ino;
-       buf->st_mode    = statx.stx_mode;
-       buf->st_nlink   = statx.stx_nlink;
-       buf->st_uid     = statx.stx_uid;
-       buf->st_gid     = statx.stx_gid;
-       buf->st_rdev    = ((statx.stx_rdev_minor & 0xff)
-                         | (statx.stx_rdev_major << 8)
-                         | ((statx.stx_rdev_minor & ~0xff) << 12));
-       buf->st_size    = statx.stx_size;
-       buf->st_blksize = statx.stx_blksize;
-       buf->st_blocks  = statx.stx_blocks;
-       buf->st_atime   = statx.stx_atime.tv_sec;
-       buf->st_mtime   = statx.stx_mtime.tv_sec;
-       buf->st_ctime   = statx.stx_ctime.tv_sec;
+       buf->st_dev          = ((statx.stx_dev_minor & 0xff)
+                              | (statx.stx_dev_major << 8)
+                              | ((statx.stx_dev_minor & ~0xff) << 12));
+       buf->st_ino          = statx.stx_ino;
+       buf->st_mode         = statx.stx_mode;
+       buf->st_nlink        = statx.stx_nlink;
+       buf->st_uid          = statx.stx_uid;
+       buf->st_gid          = statx.stx_gid;
+       buf->st_rdev         = ((statx.stx_rdev_minor & 0xff)
+                              | (statx.stx_rdev_major << 8)
+                              | ((statx.stx_rdev_minor & ~0xff) << 12));
+       buf->st_size         = statx.stx_size;
+       buf->st_blksize      = statx.stx_blksize;
+       buf->st_blocks       = statx.stx_blocks;
+       buf->st_atim.tv_sec  = statx.stx_atime.tv_sec;
+       buf->st_atim.tv_nsec = statx.stx_atime.tv_nsec;
+       buf->st_mtim.tv_sec  = statx.stx_mtime.tv_sec;
+       buf->st_mtim.tv_nsec = statx.stx_mtime.tv_nsec;
+       buf->st_ctim.tv_sec  = statx.stx_ctime.tv_sec;
+       buf->st_ctim.tv_nsec = statx.stx_ctime.tv_nsec;
        return ret;
 }
 #else
@@ -1165,19 +1198,22 @@ int sys_stat(const char *path, struct stat *buf)
 #else
 #error Neither __NR_newfstatat nor __NR_stat defined, cannot implement sys_stat()
 #endif
-       buf->st_dev     = stat.st_dev;
-       buf->st_ino     = stat.st_ino;
-       buf->st_mode    = stat.st_mode;
-       buf->st_nlink   = stat.st_nlink;
-       buf->st_uid     = stat.st_uid;
-       buf->st_gid     = stat.st_gid;
-       buf->st_rdev    = stat.st_rdev;
-       buf->st_size    = stat.st_size;
-       buf->st_blksize = stat.st_blksize;
-       buf->st_blocks  = stat.st_blocks;
-       buf->st_atime   = stat.st_atime;
-       buf->st_mtime   = stat.st_mtime;
-       buf->st_ctime   = stat.st_ctime;
+       buf->st_dev          = stat.st_dev;
+       buf->st_ino          = stat.st_ino;
+       buf->st_mode         = stat.st_mode;
+       buf->st_nlink        = stat.st_nlink;
+       buf->st_uid          = stat.st_uid;
+       buf->st_gid          = stat.st_gid;
+       buf->st_rdev         = stat.st_rdev;
+       buf->st_size         = stat.st_size;
+       buf->st_blksize      = stat.st_blksize;
+       buf->st_blocks       = stat.st_blocks;
+       buf->st_atim.tv_sec  = stat.st_atime;
+       buf->st_atim.tv_nsec = stat.st_atime_nsec;
+       buf->st_mtim.tv_sec  = stat.st_mtime;
+       buf->st_mtim.tv_nsec = stat.st_mtime_nsec;
+       buf->st_ctim.tv_sec  = stat.st_ctime;
+       buf->st_ctim.tv_nsec = stat.st_ctime_nsec;
        return ret;
 }
 #endif
@@ -1365,6 +1401,29 @@ ssize_t write(int fd, const void *buf, size_t count)
        return ret;
 }
 
+
+/*
+ * int memfd_create(const char *name, unsigned int flags);
+ */
+
+static __attribute__((unused))
+int sys_memfd_create(const char *name, unsigned int flags)
+{
+       return my_syscall2(__NR_memfd_create, name, flags);
+}
+
+static __attribute__((unused))
+int memfd_create(const char *name, unsigned int flags)
+{
+       ssize_t ret = sys_memfd_create(name, flags);
+
+       if (ret < 0) {
+               SET_ERRNO(-ret);
+               ret = -1;
+       }
+       return ret;
+}
+
 /* make sure to include all global symbols */
 #include "nolibc.h"
 
index aedd7d9..f96e28b 100644
 #define SEEK_CUR       1
 #define SEEK_END       2
 
-/* cmd for reboot() */
-#define LINUX_REBOOT_MAGIC1         0xfee1dead
-#define LINUX_REBOOT_MAGIC2         0x28121969
-#define LINUX_REBOOT_CMD_HALT       0xcdef0123
-#define LINUX_REBOOT_CMD_POWER_OFF  0x4321fedc
-#define LINUX_REBOOT_CMD_RESTART    0x01234567
-#define LINUX_REBOOT_CMD_SW_SUSPEND 0xd000fce2
-
 /* Macros used on waitpid()'s return status */
 #define WEXITSTATUS(status) (((status) & 0xff00) >> 8)
 #define WIFEXITED(status)   (((status) & 0x7f) == 0)
@@ -206,9 +198,9 @@ struct stat {
        off_t     st_size;    /* total size, in bytes */
        blksize_t st_blksize; /* blocksize for file system I/O */
        blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
-       time_t    st_atime;   /* time of last access */
-       time_t    st_mtime;   /* time of last modification */
-       time_t    st_ctime;   /* time of last status change */
+       union { time_t st_atime; struct timespec st_atim; }; /* time of last access */
+       union { time_t st_mtime; struct timespec st_mtim; }; /* time of last modification */
+       union { time_t st_ctime; struct timespec st_ctim; }; /* time of last status change */
 };
 
 /* WARNING, it only deals with the 4096 first majors and 256 first minors */
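
(Illustration, not part of the diff.) The anonymous unions keep the traditional st_atime/st_mtime/st_ctime names valid while exposing full struct timespec values, which sys_stat() above now fills in, nanoseconds included:

    struct stat st;

    if (stat("/etc/hostname", &st) == 0) {
            time_t sec  = st.st_mtime;              /* legacy name, still works   */
            long   nsec = st.st_mtim.tv_nsec;       /* new: nanosecond resolution */

            (void)sec; (void)nsec;                  /* sec == st.st_mtim.tv_sec:
                                                     * both alias the same storage */
    }
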
index ac7d53d..0e832e1 100644
@@ -56,6 +56,21 @@ int tcsetpgrp(int fd, pid_t pid)
        return ioctl(fd, TIOCSPGRP, &pid);
 }
 
+#define _syscall(N, ...)                                                      \
+({                                                                            \
+       long _ret = my_syscall##N(__VA_ARGS__);                               \
+       if (_ret < 0) {                                                       \
+               SET_ERRNO(-_ret);                                             \
+               _ret = -1;                                                    \
+       }                                                                     \
+       _ret;                                                                 \
+})
+
+#define _syscall_narg(...) __syscall_narg(__VA_ARGS__, 6, 5, 4, 3, 2, 1, 0)
+#define __syscall_narg(_0, _1, _2, _3, _4, _5, _6, N, ...) N
+#define _syscall_n(N, ...) _syscall(N, __VA_ARGS__)
+#define syscall(...) _syscall_n(_syscall_narg(__VA_ARGS__), ##__VA_ARGS__)
+
 /* make sure to include all global symbols */
 #include "nolibc.h"
 
index 41b9b94..8e91473 100644
@@ -6,10 +6,6 @@
 #include <stdbool.h>
 #include <stdint.h>
 
-#ifndef NORETURN
-#define NORETURN __attribute__((__noreturn__))
-#endif
-
 enum parse_opt_type {
        /* special types */
        OPTION_END,
@@ -183,9 +179,9 @@ extern int parse_options_subcommand(int argc, const char **argv,
                                const char *const subcommands[],
                                const char *usagestr[], int flags);
 
-extern NORETURN void usage_with_options(const char * const *usagestr,
+extern __noreturn void usage_with_options(const char * const *usagestr,
                                         const struct option *options);
-extern NORETURN __attribute__((format(printf,3,4)))
+extern __noreturn __attribute__((format(printf,3,4)))
 void usage_with_options_msg(const char * const *usagestr,
                            const struct option *options,
                            const char *fmt, ...);
index b2aec04..dfac76e 100644
@@ -5,8 +5,7 @@
 #include <stdarg.h>
 #include <stdlib.h>
 #include <stdio.h>
-
-#define NORETURN __attribute__((__noreturn__))
+#include <linux/compiler.h>
 
 static inline void report(const char *prefix, const char *err, va_list params)
 {
@@ -15,7 +14,7 @@ static inline void report(const char *prefix, const char *err, va_list params)
        fprintf(stderr, " %s%s\n", prefix, msg);
 }
 
-static NORETURN inline void die(const char *err, ...)
+static __noreturn inline void die(const char *err, ...)
 {
        va_list params;
 
index 744db42..fe39c2a 100644
@@ -244,6 +244,11 @@ To achieve the validation, objtool enforces the following rules:
 Objtool warnings
 ----------------
 
+NOTE: When requesting help with an objtool warning, please recreate with
+OBJTOOL_VERBOSE=1 (e.g., "make OBJTOOL_VERBOSE=1") and send the full
+output, including any disassembly or backtrace below the warning, to the
+objtool maintainers.
+
 For asm files, if you're getting an error which doesn't make sense,
 first make sure that the affected code follows the above rules.
 
@@ -298,6 +303,11 @@ the objtool maintainers.
    If it's not actually in a callable function (e.g. kernel entry code),
    change ENDPROC to END.
 
+3. file.o: warning: objtool: foo+0x48c: bar() is missing a __noreturn annotation
+
+   The call from foo() to bar() doesn't return, but bar() is missing the
+   __noreturn annotation.  NOTE: In addition to annotating the function
+   with __noreturn, please also add it to tools/objtool/noreturns.h.
 
 4. file.o: warning: objtool: func(): can't find starting instruction
    or
index 73f9ae1..66814fa 100644
@@ -1,10 +1,13 @@
 /* SPDX-License-Identifier: GPL-2.0-or-later */
-
 #ifndef _OBJTOOL_ARCH_ELF
 #define _OBJTOOL_ARCH_ELF
 
-#define R_NONE R_PPC_NONE
-#define R_ABS64 R_PPC64_ADDR64
-#define R_ABS32 R_PPC_ADDR32
+#define R_NONE         R_PPC_NONE
+#define R_ABS64                R_PPC64_ADDR64
+#define R_ABS32                R_PPC_ADDR32
+#define R_DATA32       R_PPC_REL32
+#define R_DATA64       R_PPC64_REL64
+#define R_TEXT32       R_PPC_REL32
+#define R_TEXT64       R_PPC64_REL32
 
 #endif /* _OBJTOOL_ARCH_ELF */
index 9ef024f..2e1caab 100644
@@ -84,7 +84,7 @@ bool arch_pc_relative_reloc(struct reloc *reloc)
         * All relocation types where P (the address of the target)
         * is included in the computation.
         */
-       switch (reloc->type) {
+       switch (reloc_type(reloc)) {
        case R_X86_64_PC8:
        case R_X86_64_PC16:
        case R_X86_64_PC32:
@@ -623,11 +623,11 @@ int arch_decode_instruction(struct objtool_file *file, const struct section *sec
                        if (!immr || strcmp(immr->sym->name, "pv_ops"))
                                break;
 
-                       idx = (immr->addend + 8) / sizeof(void *);
+                       idx = (reloc_addend(immr) + 8) / sizeof(void *);
 
                        func = disp->sym;
                        if (disp->sym->type == STT_SECTION)
-                               func = find_symbol_by_offset(disp->sym->sec, disp->addend);
+                               func = find_symbol_by_offset(disp->sym->sec, reloc_addend(disp));
                        if (!func) {
                                WARN("no func for pv_ops[]");
                                return -1;
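
(Illustration, not part of the diff.) Throughout this series, direct field accesses such as reloc->type and reloc->addend are replaced by accessor helpers; conceptually they have the following shape, although the real definitions live in objtool's elf.h and may read the underlying ELF relocation record rather than cached struct fields:

    unsigned int  reloc_type(struct reloc *reloc);    /* was: reloc->type   */
    s64           reloc_addend(struct reloc *reloc);  /* was: reloc->addend */
    unsigned long reloc_offset(struct reloc *reloc);  /* was: reloc->offset */
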
index ac14987..7131f7f 100644
@@ -1,8 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
 #ifndef _OBJTOOL_ARCH_ELF
 #define _OBJTOOL_ARCH_ELF
 
-#define R_NONE R_X86_64_NONE
-#define R_ABS64 R_X86_64_64
-#define R_ABS32 R_X86_64_32
+#define R_NONE         R_X86_64_NONE
+#define R_ABS32                R_X86_64_32
+#define R_ABS64                R_X86_64_64
+#define R_DATA32       R_X86_64_PC32
+#define R_DATA64       R_X86_64_PC32
+#define R_TEXT32       R_X86_64_PC32
+#define R_TEXT64       R_X86_64_PC32
 
 #endif /* _OBJTOOL_ARCH_ELF */
index 7c97b73..29e9495 100644
@@ -42,13 +42,7 @@ bool arch_support_alt_relocation(struct special_alt *special_alt,
                                 struct instruction *insn,
                                 struct reloc *reloc)
 {
-       /*
-        * The x86 alternatives code adjusts the offsets only when it
-        * encounters a branch instruction at the very beginning of the
-        * replacement group.
-        */
-       return insn->offset == special_alt->new_off &&
-              (insn->type == INSN_CALL || is_jump(insn));
+       return true;
 }
 
 /*
@@ -105,10 +99,10 @@ struct reloc *arch_find_switch_table(struct objtool_file *file,
            !text_reloc->sym->sec->rodata)
                return NULL;
 
-       table_offset = text_reloc->addend;
+       table_offset = reloc_addend(text_reloc);
        table_sec = text_reloc->sym->sec;
 
-       if (text_reloc->type == R_X86_64_PC32)
+       if (reloc_type(text_reloc) == R_X86_64_PC32)
                table_offset += 4;
 
        /*
@@ -138,7 +132,7 @@ struct reloc *arch_find_switch_table(struct objtool_file *file,
         * indicates a rare GCC quirk/bug which can leave dead
         * code behind.
         */
-       if (text_reloc->type == R_X86_64_PC32)
+       if (reloc_type(text_reloc) == R_X86_64_PC32)
                file->ignore_unreachables = true;
 
        return rodata_reloc;
index 7c17519..5e21cfb 100644
@@ -93,6 +93,7 @@ static const struct option check_options[] = {
        OPT_BOOLEAN(0, "no-unreachable", &opts.no_unreachable, "skip 'unreachable instruction' warnings"),
        OPT_BOOLEAN(0, "sec-address", &opts.sec_address, "print section addresses in warnings"),
        OPT_BOOLEAN(0, "stats", &opts.stats, "print statistics"),
+       OPT_BOOLEAN('v', "verbose", &opts.verbose, "verbose warnings"),
 
        OPT_END(),
 };
@@ -118,6 +119,10 @@ int cmd_parse_options(int argc, const char **argv, const char * const usage[])
                parse_options(envc, envv, check_options, env_usage, 0);
        }
 
+       env = getenv("OBJTOOL_VERBOSE");
+       if (env && !strcmp(env, "1"))
+               opts.verbose = true;
+
        argc = parse_options(argc, argv, check_options, usage, 0);
        if (argc != 1)
                usage_with_options(usage, check_options);
index 0fcf99c..8936a05 100644
@@ -8,7 +8,6 @@
 #include <inttypes.h>
 #include <sys/mman.h>
 
-#include <arch/elf.h>
 #include <objtool/builtin.h>
 #include <objtool/cfi.h>
 #include <objtool/arch.h>
@@ -33,6 +32,7 @@ static unsigned long nr_cfi, nr_cfi_reused, nr_cfi_cache;
 static struct cfi_init_state initial_func_cfi;
 static struct cfi_state init_cfi;
 static struct cfi_state func_cfi;
+static struct cfi_state force_undefined_cfi;
 
 struct instruction *find_insn(struct objtool_file *file,
                              struct section *sec, unsigned long offset)
@@ -192,51 +192,11 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
        struct instruction *insn;
        bool empty = true;
 
-       /*
-        * Unfortunately these have to be hard coded because the noreturn
-        * attribute isn't provided in ELF data. Keep 'em sorted.
-        */
+#define NORETURN(func) __stringify(func),
        static const char * const global_noreturns[] = {
-               "__invalid_creds",
-               "__module_put_and_kthread_exit",
-               "__reiserfs_panic",
-               "__stack_chk_fail",
-               "__ubsan_handle_builtin_unreachable",
-               "arch_call_rest_init",
-               "arch_cpu_idle_dead",
-               "btrfs_assertfail",
-               "cpu_bringup_and_idle",
-               "cpu_startup_entry",
-               "do_exit",
-               "do_group_exit",
-               "do_task_dead",
-               "ex_handler_msr_mce",
-               "fortify_panic",
-               "hlt_play_dead",
-               "hv_ghcb_terminate",
-               "kthread_complete_and_exit",
-               "kthread_exit",
-               "kunit_try_catch_throw",
-               "lbug_with_loc",
-               "machine_real_restart",
-               "make_task_dead",
-               "mpt_halt_firmware",
-               "nmi_panic_self_stop",
-               "panic",
-               "panic_smp_self_stop",
-               "rest_init",
-               "resume_play_dead",
-               "rewind_stack_and_make_dead",
-               "sev_es_terminate",
-               "snp_abort",
-               "start_kernel",
-               "stop_this_cpu",
-               "usercopy_abort",
-               "x86_64_start_kernel",
-               "x86_64_start_reservations",
-               "xen_cpu_bringup_again",
-               "xen_start_kernel",
+#include "noreturns.h"
        };
+#undef NORETURN
 
        if (!func)
                return false;
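
(Illustration, not part of the diff.) Each line of the new tools/objtool/noreturns.h is a NORETURN() invocation, one per known noreturn function (e.g. the entries removed above), and the local __stringify() definition turns the list into string literals:

    /* noreturns.h, excerpt */
    NORETURN(do_exit)
    NORETURN(panic)

    /* with "#define NORETURN(func) __stringify(func)," this expands inside
     * the global_noreturns[] initializer to:
     *     "do_exit",
     *     "panic",
     */
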
@@ -533,7 +493,7 @@ static int add_pv_ops(struct objtool_file *file, const char *symname)
 {
        struct symbol *sym, *func;
        unsigned long off, end;
-       struct reloc *rel;
+       struct reloc *reloc;
        int idx;
 
        sym = find_symbol_by_name(file->elf, symname);
@@ -543,19 +503,20 @@ static int add_pv_ops(struct objtool_file *file, const char *symname)
        off = sym->offset;
        end = off + sym->len;
        for (;;) {
-               rel = find_reloc_by_dest_range(file->elf, sym->sec, off, end - off);
-               if (!rel)
+               reloc = find_reloc_by_dest_range(file->elf, sym->sec, off, end - off);
+               if (!reloc)
                        break;
 
-               func = rel->sym;
+               func = reloc->sym;
                if (func->type == STT_SECTION)
-                       func = find_symbol_by_offset(rel->sym->sec, rel->addend);
+                       func = find_symbol_by_offset(reloc->sym->sec,
+                                                    reloc_addend(reloc));
 
-               idx = (rel->offset - sym->offset) / sizeof(unsigned long);
+               idx = (reloc_offset(reloc) - sym->offset) / sizeof(unsigned long);
 
                objtool_pv_add(file, idx, func);
 
-               off = rel->offset + 1;
+               off = reloc_offset(reloc) + 1;
                if (off > end)
                        break;
        }
@@ -620,35 +581,40 @@ static struct instruction *find_last_insn(struct objtool_file *file,
  */
 static int add_dead_ends(struct objtool_file *file)
 {
-       struct section *sec;
+       struct section *rsec;
        struct reloc *reloc;
        struct instruction *insn;
+       s64 addend;
 
        /*
         * Check for manually annotated dead ends.
         */
-       sec = find_section_by_name(file->elf, ".rela.discard.unreachable");
-       if (!sec)
+       rsec = find_section_by_name(file->elf, ".rela.discard.unreachable");
+       if (!rsec)
                goto reachable;
 
-       list_for_each_entry(reloc, &sec->reloc_list, list) {
+       for_each_reloc(rsec, reloc) {
+
                if (reloc->sym->type != STT_SECTION) {
-                       WARN("unexpected relocation symbol type in %s", sec->name);
+                       WARN("unexpected relocation symbol type in %s", rsec->name);
                        return -1;
                }
-               insn = find_insn(file, reloc->sym->sec, reloc->addend);
+
+               addend = reloc_addend(reloc);
+
+               insn = find_insn(file, reloc->sym->sec, addend);
                if (insn)
                        insn = prev_insn_same_sec(file, insn);
-               else if (reloc->addend == reloc->sym->sec->sh.sh_size) {
+               else if (addend == reloc->sym->sec->sh.sh_size) {
                        insn = find_last_insn(file, reloc->sym->sec);
                        if (!insn) {
                                WARN("can't find unreachable insn at %s+0x%" PRIx64,
-                                    reloc->sym->sec->name, reloc->addend);
+                                    reloc->sym->sec->name, addend);
                                return -1;
                        }
                } else {
                        WARN("can't find unreachable insn at %s+0x%" PRIx64,
-                            reloc->sym->sec->name, reloc->addend);
+                            reloc->sym->sec->name, addend);
                        return -1;
                }
 
@@ -662,28 +628,32 @@ reachable:
         * GCC doesn't know the "ud2" is fatal, so it generates code as if it's
         * not a dead end.
         */
-       sec = find_section_by_name(file->elf, ".rela.discard.reachable");
-       if (!sec)
+       rsec = find_section_by_name(file->elf, ".rela.discard.reachable");
+       if (!rsec)
                return 0;
 
-       list_for_each_entry(reloc, &sec->reloc_list, list) {
+       for_each_reloc(rsec, reloc) {
+
                if (reloc->sym->type != STT_SECTION) {
-                       WARN("unexpected relocation symbol type in %s", sec->name);
+                       WARN("unexpected relocation symbol type in %s", rsec->name);
                        return -1;
                }
-               insn = find_insn(file, reloc->sym->sec, reloc->addend);
+
+               addend = reloc_addend(reloc);
+
+               insn = find_insn(file, reloc->sym->sec, addend);
                if (insn)
                        insn = prev_insn_same_sec(file, insn);
-               else if (reloc->addend == reloc->sym->sec->sh.sh_size) {
+               else if (addend == reloc->sym->sec->sh.sh_size) {
                        insn = find_last_insn(file, reloc->sym->sec);
                        if (!insn) {
                                WARN("can't find reachable insn at %s+0x%" PRIx64,
-                                    reloc->sym->sec->name, reloc->addend);
+                                    reloc->sym->sec->name, addend);
                                return -1;
                        }
                } else {
                        WARN("can't find reachable insn at %s+0x%" PRIx64,
-                            reloc->sym->sec->name, reloc->addend);
+                            reloc->sym->sec->name, addend);
                        return -1;
                }
 
@@ -695,8 +665,8 @@ reachable:
 
 static int create_static_call_sections(struct objtool_file *file)
 {
-       struct section *sec;
        struct static_call_site *site;
+       struct section *sec;
        struct instruction *insn;
        struct symbol *key_sym;
        char *key_name, *tmp;
@@ -716,22 +686,21 @@ static int create_static_call_sections(struct objtool_file *file)
        list_for_each_entry(insn, &file->static_call_list, call_node)
                idx++;
 
-       sec = elf_create_section(file->elf, ".static_call_sites", SHF_WRITE,
-                                sizeof(struct static_call_site), idx);
+       sec = elf_create_section_pair(file->elf, ".static_call_sites",
+                                     sizeof(*site), idx, idx * 2);
        if (!sec)
                return -1;
 
+       /* Allow modules to modify the low bits of static_call_site::key */
+       sec->sh.sh_flags |= SHF_WRITE;
+
        idx = 0;
        list_for_each_entry(insn, &file->static_call_list, call_node) {
 
-               site = (struct static_call_site *)sec->data->d_buf + idx;
-               memset(site, 0, sizeof(struct static_call_site));
-
                /* populate reloc for 'addr' */
-               if (elf_add_reloc_to_insn(file->elf, sec,
-                                         idx * sizeof(struct static_call_site),
-                                         R_X86_64_PC32,
-                                         insn->sec, insn->offset))
+               if (!elf_init_reloc_text_sym(file->elf, sec,
+                                            idx * sizeof(*site), idx * 2,
+                                            insn->sec, insn->offset))
                        return -1;
 
                /* find key symbol */
@@ -771,10 +740,10 @@ static int create_static_call_sections(struct objtool_file *file)
                free(key_name);
 
                /* populate reloc for 'key' */
-               if (elf_add_reloc(file->elf, sec,
-                                 idx * sizeof(struct static_call_site) + 4,
-                                 R_X86_64_PC32, key_sym,
-                                 is_sibling_call(insn) * STATIC_CALL_SITE_TAIL))
+               if (!elf_init_reloc_data_sym(file->elf, sec,
+                                            idx * sizeof(*site) + 4,
+                                            (idx * 2) + 1, key_sym,
+                                            is_sibling_call(insn) * STATIC_CALL_SITE_TAIL))
                        return -1;
 
                idx++;
@@ -802,26 +771,18 @@ static int create_retpoline_sites_sections(struct objtool_file *file)
        if (!idx)
                return 0;
 
-       sec = elf_create_section(file->elf, ".retpoline_sites", 0,
-                                sizeof(int), idx);
-       if (!sec) {
-               WARN("elf_create_section: .retpoline_sites");
+       sec = elf_create_section_pair(file->elf, ".retpoline_sites",
+                                     sizeof(int), idx, idx);
+       if (!sec)
                return -1;
-       }
 
        idx = 0;
        list_for_each_entry(insn, &file->retpoline_call_list, call_node) {
 
-               int *site = (int *)sec->data->d_buf + idx;
-               *site = 0;
-
-               if (elf_add_reloc_to_insn(file->elf, sec,
-                                         idx * sizeof(int),
-                                         R_X86_64_PC32,
-                                         insn->sec, insn->offset)) {
-                       WARN("elf_add_reloc_to_insn: .retpoline_sites");
+               if (!elf_init_reloc_text_sym(file->elf, sec,
+                                            idx * sizeof(int), idx,
+                                            insn->sec, insn->offset))
                        return -1;
-               }
 
                idx++;
        }
@@ -848,26 +809,18 @@ static int create_return_sites_sections(struct objtool_file *file)
        if (!idx)
                return 0;
 
-       sec = elf_create_section(file->elf, ".return_sites", 0,
-                                sizeof(int), idx);
-       if (!sec) {
-               WARN("elf_create_section: .return_sites");
+       sec = elf_create_section_pair(file->elf, ".return_sites",
+                                     sizeof(int), idx, idx);
+       if (!sec)
                return -1;
-       }
 
        idx = 0;
        list_for_each_entry(insn, &file->return_thunk_list, call_node) {
 
-               int *site = (int *)sec->data->d_buf + idx;
-               *site = 0;
-
-               if (elf_add_reloc_to_insn(file->elf, sec,
-                                         idx * sizeof(int),
-                                         R_X86_64_PC32,
-                                         insn->sec, insn->offset)) {
-                       WARN("elf_add_reloc_to_insn: .return_sites");
+               if (!elf_init_reloc_text_sym(file->elf, sec,
+                                            idx * sizeof(int), idx,
+                                            insn->sec, insn->offset))
                        return -1;
-               }
 
                idx++;
        }
@@ -900,12 +853,10 @@ static int create_ibt_endbr_seal_sections(struct objtool_file *file)
        if (!idx)
                return 0;
 
-       sec = elf_create_section(file->elf, ".ibt_endbr_seal", 0,
-                                sizeof(int), idx);
-       if (!sec) {
-               WARN("elf_create_section: .ibt_endbr_seal");
+       sec = elf_create_section_pair(file->elf, ".ibt_endbr_seal",
+                                     sizeof(int), idx, idx);
+       if (!sec)
                return -1;
-       }
 
        idx = 0;
        list_for_each_entry(insn, &file->endbr_list, call_node) {
@@ -920,13 +871,10 @@ static int create_ibt_endbr_seal_sections(struct objtool_file *file)
                     !strcmp(sym->name, "cleanup_module")))
                        WARN("%s(): not an indirect call target", sym->name);
 
-               if (elf_add_reloc_to_insn(file->elf, sec,
-                                         idx * sizeof(int),
-                                         R_X86_64_PC32,
-                                         insn->sec, insn->offset)) {
-                       WARN("elf_add_reloc_to_insn: .ibt_endbr_seal");
+               if (!elf_init_reloc_text_sym(file->elf, sec,
+                                            idx * sizeof(int), idx,
+                                            insn->sec, insn->offset))
                        return -1;
-               }
 
                idx++;
        }
@@ -938,7 +886,6 @@ static int create_cfi_sections(struct objtool_file *file)
 {
        struct section *sec;
        struct symbol *sym;
-       unsigned int *loc;
        int idx;
 
        sec = find_section_by_name(file->elf, ".cfi_sites");
@@ -959,7 +906,8 @@ static int create_cfi_sections(struct objtool_file *file)
                idx++;
        }
 
-       sec = elf_create_section(file->elf, ".cfi_sites", 0, sizeof(unsigned int), idx);
+       sec = elf_create_section_pair(file->elf, ".cfi_sites",
+                                     sizeof(unsigned int), idx, idx);
        if (!sec)
                return -1;
 
@@ -971,13 +919,9 @@ static int create_cfi_sections(struct objtool_file *file)
                if (strncmp(sym->name, "__cfi_", 6))
                        continue;
 
-               loc = (unsigned int *)sec->data->d_buf + idx;
-               memset(loc, 0, sizeof(unsigned int));
-
-               if (elf_add_reloc_to_insn(file->elf, sec,
-                                         idx * sizeof(unsigned int),
-                                         R_X86_64_PC32,
-                                         sym->sec, sym->offset))
+               if (!elf_init_reloc_text_sym(file->elf, sec,
+                                            idx * sizeof(unsigned int), idx,
+                                            sym->sec, sym->offset))
                        return -1;
 
                idx++;
@@ -988,7 +932,7 @@ static int create_cfi_sections(struct objtool_file *file)
 
 static int create_mcount_loc_sections(struct objtool_file *file)
 {
-       int addrsize = elf_class_addrsize(file->elf);
+       size_t addr_size = elf_addr_size(file->elf);
        struct instruction *insn;
        struct section *sec;
        int idx;
@@ -1007,25 +951,26 @@ static int create_mcount_loc_sections(struct objtool_file *file)
        list_for_each_entry(insn, &file->mcount_loc_list, call_node)
                idx++;
 
-       sec = elf_create_section(file->elf, "__mcount_loc", 0, addrsize, idx);
+       sec = elf_create_section_pair(file->elf, "__mcount_loc", addr_size,
+                                     idx, idx);
        if (!sec)
                return -1;
 
-       sec->sh.sh_addralign = addrsize;
+       sec->sh.sh_addralign = addr_size;
 
        idx = 0;
        list_for_each_entry(insn, &file->mcount_loc_list, call_node) {
-               void *loc;
 
-               loc = sec->data->d_buf + idx;
-               memset(loc, 0, addrsize);
+               struct reloc *reloc;
 
-               if (elf_add_reloc_to_insn(file->elf, sec, idx,
-                                         addrsize == sizeof(u64) ? R_ABS64 : R_ABS32,
-                                         insn->sec, insn->offset))
+               reloc = elf_init_reloc_text_sym(file->elf, sec, idx * addr_size, idx,
+                                              insn->sec, insn->offset);
+               if (!reloc)
                        return -1;
 
-               idx += addrsize;
+               set_reloc_type(file->elf, reloc, addr_size == 8 ? R_ABS64 : R_ABS32);
+
+               idx++;
        }
 
        return 0;
@@ -1035,7 +980,6 @@ static int create_direct_call_sections(struct objtool_file *file)
 {
        struct instruction *insn;
        struct section *sec;
-       unsigned int *loc;
        int idx;
 
        sec = find_section_by_name(file->elf, ".call_sites");
@@ -1052,20 +996,17 @@ static int create_direct_call_sections(struct objtool_file *file)
        list_for_each_entry(insn, &file->call_list, call_node)
                idx++;
 
-       sec = elf_create_section(file->elf, ".call_sites", 0, sizeof(unsigned int), idx);
+       sec = elf_create_section_pair(file->elf, ".call_sites",
+                                     sizeof(unsigned int), idx, idx);
        if (!sec)
                return -1;
 
        idx = 0;
        list_for_each_entry(insn, &file->call_list, call_node) {
 
-               loc = (unsigned int *)sec->data->d_buf + idx;
-               memset(loc, 0, sizeof(unsigned int));
-
-               if (elf_add_reloc_to_insn(file->elf, sec,
-                                         idx * sizeof(unsigned int),
-                                         R_X86_64_PC32,
-                                         insn->sec, insn->offset))
+               if (!elf_init_reloc_text_sym(file->elf, sec,
+                                            idx * sizeof(unsigned int), idx,
+                                            insn->sec, insn->offset))
                        return -1;
 
                idx++;
@@ -1080,28 +1021,29 @@ static int create_direct_call_sections(struct objtool_file *file)
 static void add_ignores(struct objtool_file *file)
 {
        struct instruction *insn;
-       struct section *sec;
+       struct section *rsec;
        struct symbol *func;
        struct reloc *reloc;
 
-       sec = find_section_by_name(file->elf, ".rela.discard.func_stack_frame_non_standard");
-       if (!sec)
+       rsec = find_section_by_name(file->elf, ".rela.discard.func_stack_frame_non_standard");
+       if (!rsec)
                return;
 
-       list_for_each_entry(reloc, &sec->reloc_list, list) {
+       for_each_reloc(rsec, reloc) {
                switch (reloc->sym->type) {
                case STT_FUNC:
                        func = reloc->sym;
                        break;
 
                case STT_SECTION:
-                       func = find_func_by_offset(reloc->sym->sec, reloc->addend);
+                       func = find_func_by_offset(reloc->sym->sec, reloc_addend(reloc));
                        if (!func)
                                continue;
                        break;
 
                default:
-                       WARN("unexpected relocation symbol type in %s: %d", sec->name, reloc->sym->type);
+                       WARN("unexpected relocation symbol type in %s: %d",
+                            rsec->name, reloc->sym->type);
                        continue;
                }
 
@@ -1320,21 +1262,21 @@ static void add_uaccess_safe(struct objtool_file *file)
  */
 static int add_ignore_alternatives(struct objtool_file *file)
 {
-       struct section *sec;
+       struct section *rsec;
        struct reloc *reloc;
        struct instruction *insn;
 
-       sec = find_section_by_name(file->elf, ".rela.discard.ignore_alts");
-       if (!sec)
+       rsec = find_section_by_name(file->elf, ".rela.discard.ignore_alts");
+       if (!rsec)
                return 0;
 
-       list_for_each_entry(reloc, &sec->reloc_list, list) {
+       for_each_reloc(rsec, reloc) {
                if (reloc->sym->type != STT_SECTION) {
-                       WARN("unexpected relocation symbol type in %s", sec->name);
+                       WARN("unexpected relocation symbol type in %s", rsec->name);
                        return -1;
                }
 
-               insn = find_insn(file, reloc->sym->sec, reloc->addend);
+               insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc));
                if (!insn) {
                        WARN("bad .discard.ignore_alts entry");
                        return -1;
@@ -1421,10 +1363,8 @@ static void annotate_call_site(struct objtool_file *file,
         * noinstr text.
         */
        if (opts.hack_noinstr && insn->sec->noinstr && sym->profiling_func) {
-               if (reloc) {
-                       reloc->type = R_NONE;
-                       elf_write_reloc(file->elf, reloc);
-               }
+               if (reloc)
+                       set_reloc_type(file->elf, reloc, R_NONE);
 
                elf_write_insn(file->elf, insn->sec,
                               insn->offset, insn->len,
@@ -1450,10 +1390,8 @@ static void annotate_call_site(struct objtool_file *file,
                if (sibling)
                        WARN_INSN(insn, "tail call to __fentry__ !?!?");
                if (opts.mnop) {
-                       if (reloc) {
-                               reloc->type = R_NONE;
-                               elf_write_reloc(file->elf, reloc);
-                       }
+                       if (reloc)
+                               set_reloc_type(file->elf, reloc, R_NONE);
 
                        elf_write_insn(file->elf, insn->sec,
                                       insn->offset, insn->len,
@@ -1610,7 +1548,7 @@ static int add_jump_destinations(struct objtool_file *file)
                        dest_off = arch_jump_destination(insn);
                } else if (reloc->sym->type == STT_SECTION) {
                        dest_sec = reloc->sym->sec;
-                       dest_off = arch_dest_reloc_offset(reloc->addend);
+                       dest_off = arch_dest_reloc_offset(reloc_addend(reloc));
                } else if (reloc->sym->retpoline_thunk) {
                        add_retpoline_call(file, insn);
                        continue;
@@ -1627,7 +1565,7 @@ static int add_jump_destinations(struct objtool_file *file)
                } else if (reloc->sym->sec->idx) {
                        dest_sec = reloc->sym->sec;
                        dest_off = reloc->sym->sym.st_value +
-                                  arch_dest_reloc_offset(reloc->addend);
+                                  arch_dest_reloc_offset(reloc_addend(reloc));
                } else {
                        /* non-func asm code jumping to another file */
                        continue;
@@ -1744,7 +1682,7 @@ static int add_call_destinations(struct objtool_file *file)
                        }
 
                } else if (reloc->sym->type == STT_SECTION) {
-                       dest_off = arch_dest_reloc_offset(reloc->addend);
+                       dest_off = arch_dest_reloc_offset(reloc_addend(reloc));
                        dest = find_call_destination(reloc->sym->sec, dest_off);
                        if (!dest) {
                                WARN_INSN(insn, "can't find call dest symbol at %s+0x%lx",
@@ -1932,10 +1870,8 @@ static int handle_jump_alt(struct objtool_file *file,
        if (opts.hack_jump_label && special_alt->key_addend & 2) {
                struct reloc *reloc = insn_reloc(file, orig_insn);
 
-               if (reloc) {
-                       reloc->type = R_NONE;
-                       elf_write_reloc(file->elf, reloc);
-               }
+               if (reloc)
+                       set_reloc_type(file->elf, reloc, R_NONE);
                elf_write_insn(file->elf, orig_insn->sec,
                               orig_insn->offset, orig_insn->len,
                               arch_nop_insn(orig_insn->len));
@@ -2047,34 +1983,35 @@ out:
 }
 
 static int add_jump_table(struct objtool_file *file, struct instruction *insn,
-                           struct reloc *table)
+                         struct reloc *next_table)
 {
-       struct reloc *reloc = table;
-       struct instruction *dest_insn;
-       struct alternative *alt;
        struct symbol *pfunc = insn_func(insn)->pfunc;
+       struct reloc *table = insn_jump_table(insn);
+       struct instruction *dest_insn;
        unsigned int prev_offset = 0;
+       struct reloc *reloc = table;
+       struct alternative *alt;
 
        /*
         * Each @reloc is a switch table relocation which points to the target
         * instruction.
         */
-       list_for_each_entry_from(reloc, &table->sec->reloc_list, list) {
+       for_each_reloc_from(table->sec, reloc) {
 
                /* Check for the end of the table: */
-               if (reloc != table && reloc->jump_table_start)
+               if (reloc != table && reloc == next_table)
                        break;
 
                /* Make sure the table entries are consecutive: */
-               if (prev_offset && reloc->offset != prev_offset + 8)
+               if (prev_offset && reloc_offset(reloc) != prev_offset + 8)
                        break;
 
                /* Detect function pointers from contiguous objects: */
                if (reloc->sym->sec == pfunc->sec &&
-                   reloc->addend == pfunc->offset)
+                   reloc_addend(reloc) == pfunc->offset)
                        break;
 
-               dest_insn = find_insn(file, reloc->sym->sec, reloc->addend);
+               dest_insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc));
                if (!dest_insn)
                        break;
 
@@ -2091,7 +2028,7 @@ static int add_jump_table(struct objtool_file *file, struct instruction *insn,
                alt->insn = dest_insn;
                alt->next = insn->alts;
                insn->alts = alt;
-               prev_offset = reloc->offset;
+               prev_offset = reloc_offset(reloc);
        }
 
        if (!prev_offset) {
@@ -2135,7 +2072,7 @@ static struct reloc *find_jump_table(struct objtool_file *file,
                table_reloc = arch_find_switch_table(file, insn);
                if (!table_reloc)
                        continue;
-               dest_insn = find_insn(file, table_reloc->sym->sec, table_reloc->addend);
+               dest_insn = find_insn(file, table_reloc->sym->sec, reloc_addend(table_reloc));
                if (!dest_insn || !insn_func(dest_insn) || insn_func(dest_insn)->pfunc != func)
                        continue;
 
@@ -2177,29 +2114,39 @@ static void mark_func_jump_tables(struct objtool_file *file,
                        continue;
 
                reloc = find_jump_table(file, func, insn);
-               if (reloc) {
-                       reloc->jump_table_start = true;
+               if (reloc)
                        insn->_jump_table = reloc;
-               }
        }
 }
 
 static int add_func_jump_tables(struct objtool_file *file,
                                  struct symbol *func)
 {
-       struct instruction *insn;
-       int ret;
+       struct instruction *insn, *insn_t1 = NULL, *insn_t2;
+       int ret = 0;
 
        func_for_each_insn(file, func, insn) {
                if (!insn_jump_table(insn))
                        continue;
 
-               ret = add_jump_table(file, insn, insn_jump_table(insn));
+               if (!insn_t1) {
+                       insn_t1 = insn;
+                       continue;
+               }
+
+               insn_t2 = insn;
+
+               ret = add_jump_table(file, insn_t1, insn_jump_table(insn_t2));
                if (ret)
                        return ret;
+
+               insn_t1 = insn_t2;
        }
 
-       return 0;
+       if (insn_t1)
+               ret = add_jump_table(file, insn_t1, NULL);
+
+       return ret;
 }
 
 /*
@@ -2240,7 +2187,7 @@ static void set_func_state(struct cfi_state *state)
 static int read_unwind_hints(struct objtool_file *file)
 {
        struct cfi_state cfi = init_cfi;
-       struct section *sec, *relocsec;
+       struct section *sec;
        struct unwind_hint *hint;
        struct instruction *insn;
        struct reloc *reloc;
@@ -2250,8 +2197,7 @@ static int read_unwind_hints(struct objtool_file *file)
        if (!sec)
                return 0;
 
-       relocsec = sec->reloc;
-       if (!relocsec) {
+       if (!sec->rsec) {
                WARN("missing .rela.discard.unwind_hints section");
                return -1;
        }
@@ -2272,7 +2218,7 @@ static int read_unwind_hints(struct objtool_file *file)
                        return -1;
                }
 
-               insn = find_insn(file, reloc->sym->sec, reloc->addend);
+               insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc));
                if (!insn) {
                        WARN("can't find insn for unwind_hints[%d]", i);
                        return -1;
@@ -2280,6 +2226,11 @@ static int read_unwind_hints(struct objtool_file *file)
 
                insn->hint = true;
 
+               if (hint->type == UNWIND_HINT_TYPE_UNDEFINED) {
+                       insn->cfi = &force_undefined_cfi;
+                       continue;
+               }
+
                if (hint->type == UNWIND_HINT_TYPE_SAVE) {
                        insn->hint = false;
                        insn->save = true;
@@ -2326,16 +2277,17 @@ static int read_unwind_hints(struct objtool_file *file)
 
 static int read_noendbr_hints(struct objtool_file *file)
 {
-       struct section *sec;
        struct instruction *insn;
+       struct section *rsec;
        struct reloc *reloc;
 
-       sec = find_section_by_name(file->elf, ".rela.discard.noendbr");
-       if (!sec)
+       rsec = find_section_by_name(file->elf, ".rela.discard.noendbr");
+       if (!rsec)
                return 0;
 
-       list_for_each_entry(reloc, &sec->reloc_list, list) {
-               insn = find_insn(file, reloc->sym->sec, reloc->sym->offset + reloc->addend);
+       for_each_reloc(rsec, reloc) {
+               insn = find_insn(file, reloc->sym->sec,
+                                reloc->sym->offset + reloc_addend(reloc));
                if (!insn) {
                        WARN("bad .discard.noendbr entry");
                        return -1;
@@ -2349,21 +2301,21 @@ static int read_noendbr_hints(struct objtool_file *file)
 
 static int read_retpoline_hints(struct objtool_file *file)
 {
-       struct section *sec;
+       struct section *rsec;
        struct instruction *insn;
        struct reloc *reloc;
 
-       sec = find_section_by_name(file->elf, ".rela.discard.retpoline_safe");
-       if (!sec)
+       rsec = find_section_by_name(file->elf, ".rela.discard.retpoline_safe");
+       if (!rsec)
                return 0;
 
-       list_for_each_entry(reloc, &sec->reloc_list, list) {
+       for_each_reloc(rsec, reloc) {
                if (reloc->sym->type != STT_SECTION) {
-                       WARN("unexpected relocation symbol type in %s", sec->name);
+                       WARN("unexpected relocation symbol type in %s", rsec->name);
                        return -1;
                }
 
-               insn = find_insn(file, reloc->sym->sec, reloc->addend);
+               insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc));
                if (!insn) {
                        WARN("bad .discard.retpoline_safe entry");
                        return -1;
@@ -2385,21 +2337,21 @@ static int read_retpoline_hints(struct objtool_file *file)
 
 static int read_instr_hints(struct objtool_file *file)
 {
-       struct section *sec;
+       struct section *rsec;
        struct instruction *insn;
        struct reloc *reloc;
 
-       sec = find_section_by_name(file->elf, ".rela.discard.instr_end");
-       if (!sec)
+       rsec = find_section_by_name(file->elf, ".rela.discard.instr_end");
+       if (!rsec)
                return 0;
 
-       list_for_each_entry(reloc, &sec->reloc_list, list) {
+       for_each_reloc(rsec, reloc) {
                if (reloc->sym->type != STT_SECTION) {
-                       WARN("unexpected relocation symbol type in %s", sec->name);
+                       WARN("unexpected relocation symbol type in %s", rsec->name);
                        return -1;
                }
 
-               insn = find_insn(file, reloc->sym->sec, reloc->addend);
+               insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc));
                if (!insn) {
                        WARN("bad .discard.instr_end entry");
                        return -1;
@@ -2408,17 +2360,17 @@ static int read_instr_hints(struct objtool_file *file)
                insn->instr--;
        }
 
-       sec = find_section_by_name(file->elf, ".rela.discard.instr_begin");
-       if (!sec)
+       rsec = find_section_by_name(file->elf, ".rela.discard.instr_begin");
+       if (!rsec)
                return 0;
 
-       list_for_each_entry(reloc, &sec->reloc_list, list) {
+       for_each_reloc(rsec, reloc) {
                if (reloc->sym->type != STT_SECTION) {
-                       WARN("unexpected relocation symbol type in %s", sec->name);
+                       WARN("unexpected relocation symbol type in %s", rsec->name);
                        return -1;
                }
 
-               insn = find_insn(file, reloc->sym->sec, reloc->addend);
+               insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc));
                if (!insn) {
                        WARN("bad .discard.instr_begin entry");
                        return -1;
@@ -2432,21 +2384,21 @@ static int read_instr_hints(struct objtool_file *file)
 
 static int read_validate_unret_hints(struct objtool_file *file)
 {
-       struct section *sec;
+       struct section *rsec;
        struct instruction *insn;
        struct reloc *reloc;
 
-       sec = find_section_by_name(file->elf, ".rela.discard.validate_unret");
-       if (!sec)
+       rsec = find_section_by_name(file->elf, ".rela.discard.validate_unret");
+       if (!rsec)
                return 0;
 
-       list_for_each_entry(reloc, &sec->reloc_list, list) {
+       for_each_reloc(rsec, reloc) {
                if (reloc->sym->type != STT_SECTION) {
-                       WARN("unexpected relocation symbol type in %s", sec->name);
+                       WARN("unexpected relocation symbol type in %s", rsec->name);
                        return -1;
                }
 
-               insn = find_insn(file, reloc->sym->sec, reloc->addend);
+               insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc));
                if (!insn) {
                        WARN("bad .discard.instr_end entry");
                        return -1;
@@ -2461,23 +2413,23 @@ static int read_validate_unret_hints(struct objtool_file *file)
 static int read_intra_function_calls(struct objtool_file *file)
 {
        struct instruction *insn;
-       struct section *sec;
+       struct section *rsec;
        struct reloc *reloc;
 
-       sec = find_section_by_name(file->elf, ".rela.discard.intra_function_calls");
-       if (!sec)
+       rsec = find_section_by_name(file->elf, ".rela.discard.intra_function_calls");
+       if (!rsec)
                return 0;
 
-       list_for_each_entry(reloc, &sec->reloc_list, list) {
+       for_each_reloc(rsec, reloc) {
                unsigned long dest_off;
 
                if (reloc->sym->type != STT_SECTION) {
                        WARN("unexpected relocation symbol type in %s",
-                            sec->name);
+                            rsec->name);
                        return -1;
                }
 
-               insn = find_insn(file, reloc->sym->sec, reloc->addend);
+               insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc));
                if (!insn) {
                        WARN("bad .discard.intra_function_call entry");
                        return -1;
@@ -2833,6 +2785,10 @@ static int update_cfi_state(struct instruction *insn,
        struct cfi_reg *cfa = &cfi->cfa;
        struct cfi_reg *regs = cfi->regs;
 
+       /* ignore UNWIND_HINT_UNDEFINED regions */
+       if (cfi->force_undefined)
+               return 0;
+
        /* stack operations don't make sense with an undefined CFA */
        if (cfa->base == CFI_UNDEFINED) {
                if (insn_func(insn)) {
@@ -3369,15 +3325,15 @@ static inline bool func_uaccess_safe(struct symbol *func)
 static inline const char *call_dest_name(struct instruction *insn)
 {
        static char pvname[19];
-       struct reloc *rel;
+       struct reloc *reloc;
        int idx;
 
        if (insn_call_dest(insn))
                return insn_call_dest(insn)->name;
 
-       rel = insn_reloc(NULL, insn);
-       if (rel && !strcmp(rel->sym->name, "pv_ops")) {
-               idx = (rel->addend / sizeof(void *));
+       reloc = insn_reloc(NULL, insn);
+       if (reloc && !strcmp(reloc->sym->name, "pv_ops")) {
+               idx = (reloc_addend(reloc) / sizeof(void *));
                snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx);
                return pvname;
        }
@@ -3388,14 +3344,14 @@ static inline const char *call_dest_name(struct instruction *insn)
 static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
 {
        struct symbol *target;
-       struct reloc *rel;
+       struct reloc *reloc;
        int idx;
 
-       rel = insn_reloc(file, insn);
-       if (!rel || strcmp(rel->sym->name, "pv_ops"))
+       reloc = insn_reloc(file, insn);
+       if (!reloc || strcmp(reloc->sym->name, "pv_ops"))
                return false;
 
-       idx = (arch_dest_reloc_offset(rel->addend) / sizeof(void *));
+       idx = (arch_dest_reloc_offset(reloc_addend(reloc)) / sizeof(void *));
 
        if (file->pv_ops[idx].clean)
                return true;
@@ -3657,8 +3613,7 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
 
                                ret = validate_branch(file, func, alt->insn, state);
                                if (ret) {
-                                       if (opts.backtrace)
-                                               BT_FUNC("(alt)", insn);
+                                       BT_INSN(insn, "(alt)");
                                        return ret;
                                }
                        }
@@ -3703,8 +3658,7 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
                                ret = validate_branch(file, func,
                                                      insn->jump_dest, state);
                                if (ret) {
-                                       if (opts.backtrace)
-                                               BT_FUNC("(branch)", insn);
+                                       BT_INSN(insn, "(branch)");
                                        return ret;
                                }
                        }
@@ -3802,8 +3756,8 @@ static int validate_unwind_hint(struct objtool_file *file,
 {
        if (insn->hint && !insn->visited && !insn->ignore) {
                int ret = validate_branch(file, insn_func(insn), insn, *state);
-               if (ret && opts.backtrace)
-                       BT_FUNC("<=== (hint)", insn);
+               if (ret)
+                       BT_INSN(insn, "<=== (hint)");
                return ret;
        }
 
@@ -3841,7 +3795,7 @@ static int validate_unwind_hints(struct objtool_file *file, struct section *sec)
 static int validate_unret(struct objtool_file *file, struct instruction *insn)
 {
        struct instruction *next, *dest;
-       int ret, warnings = 0;
+       int ret;
 
        for (;;) {
                next = next_insn_to_validate(file, insn);
@@ -3861,8 +3815,7 @@ static int validate_unret(struct objtool_file *file, struct instruction *insn)
 
                                ret = validate_unret(file, alt->insn);
                                if (ret) {
-                                       if (opts.backtrace)
-                                               BT_FUNC("(alt)", insn);
+                                       BT_INSN(insn, "(alt)");
                                        return ret;
                                }
                        }
@@ -3888,10 +3841,8 @@ static int validate_unret(struct objtool_file *file, struct instruction *insn)
                                }
                                ret = validate_unret(file, insn->jump_dest);
                                if (ret) {
-                                       if (opts.backtrace) {
-                                               BT_FUNC("(branch%s)", insn,
-                                                       insn->type == INSN_JUMP_CONDITIONAL ? "-cond" : "");
-                                       }
+                                       BT_INSN(insn, "(branch%s)",
+                                               insn->type == INSN_JUMP_CONDITIONAL ? "-cond" : "");
                                        return ret;
                                }
 
@@ -3913,8 +3864,7 @@ static int validate_unret(struct objtool_file *file, struct instruction *insn)
 
                        ret = validate_unret(file, dest);
                        if (ret) {
-                               if (opts.backtrace)
-                                       BT_FUNC("(call)", insn);
+                               BT_INSN(insn, "(call)");
                                return ret;
                        }
                        /*
@@ -3943,7 +3893,7 @@ static int validate_unret(struct objtool_file *file, struct instruction *insn)
                insn = next;
        }
 
-       return warnings;
+       return 0;
 }
 
 /*
@@ -4178,7 +4128,6 @@ static int add_prefix_symbols(struct objtool_file *file)
 {
        struct section *sec;
        struct symbol *func;
-       int warnings = 0;
 
        for_each_sec(file, sec) {
                if (!(sec->sh.sh_flags & SHF_EXECINSTR))
@@ -4192,7 +4141,7 @@ static int add_prefix_symbols(struct objtool_file *file)
                }
        }
 
-       return warnings;
+       return 0;
 }
 
 static int validate_symbol(struct objtool_file *file, struct section *sec,
@@ -4216,8 +4165,8 @@ static int validate_symbol(struct objtool_file *file, struct section *sec,
        state->uaccess = sym->uaccess_safe;
 
        ret = validate_branch(file, insn_func(insn), insn, *state);
-       if (ret && opts.backtrace)
-               BT_FUNC("<=== (sym)", insn);
+       if (ret)
+               BT_INSN(insn, "<=== (sym)");
        return ret;
 }
 
@@ -4333,8 +4282,8 @@ static int validate_ibt_insn(struct objtool_file *file, struct instruction *insn
        for (reloc = insn_reloc(file, insn);
             reloc;
             reloc = find_reloc_by_dest_range(file->elf, insn->sec,
-                                             reloc->offset + 1,
-                                             (insn->offset + insn->len) - (reloc->offset + 1))) {
+                                             reloc_offset(reloc) + 1,
+                                             (insn->offset + insn->len) - (reloc_offset(reloc) + 1))) {
 
                /*
                 * static_call_update() references the trampoline, which
@@ -4344,10 +4293,11 @@ static int validate_ibt_insn(struct objtool_file *file, struct instruction *insn
                        continue;
 
                off = reloc->sym->offset;
-               if (reloc->type == R_X86_64_PC32 || reloc->type == R_X86_64_PLT32)
-                       off += arch_dest_reloc_offset(reloc->addend);
+               if (reloc_type(reloc) == R_X86_64_PC32 ||
+                   reloc_type(reloc) == R_X86_64_PLT32)
+                       off += arch_dest_reloc_offset(reloc_addend(reloc));
                else
-                       off += reloc->addend;
+                       off += reloc_addend(reloc);
 
                dest = find_insn(file, reloc->sym->sec, off);
                if (!dest)
@@ -4404,7 +4354,7 @@ static int validate_ibt_data_reloc(struct objtool_file *file,
        struct instruction *dest;
 
        dest = find_insn(file, reloc->sym->sec,
-                        reloc->sym->offset + reloc->addend);
+                        reloc->sym->offset + reloc_addend(reloc));
        if (!dest)
                return 0;
 
@@ -4417,7 +4367,7 @@ static int validate_ibt_data_reloc(struct objtool_file *file,
                return 0;
 
        WARN_FUNC("data relocation to !ENDBR: %s",
-                 reloc->sec->base, reloc->offset,
+                 reloc->sec->base, reloc_offset(reloc),
                  offstr(dest->sec, dest->offset));
 
        return 1;
@@ -4444,7 +4394,7 @@ static int validate_ibt(struct objtool_file *file)
                if (sec->sh.sh_flags & SHF_EXECINSTR)
                        continue;
 
-               if (!sec->reloc)
+               if (!sec->rsec)
                        continue;
 
                /*
@@ -4471,7 +4421,7 @@ static int validate_ibt(struct objtool_file *file)
                    strstr(sec->name, "__patchable_function_entries"))
                        continue;
 
-               list_for_each_entry(reloc, &sec->reloc->reloc_list, list)
+               for_each_reloc(sec->rsec, reloc)
                        warnings += validate_ibt_data_reloc(file, reloc);
        }
 
@@ -4511,9 +4461,40 @@ static int validate_sls(struct objtool_file *file)
        return warnings;
 }
 
+static bool ignore_noreturn_call(struct instruction *insn)
+{
+       struct symbol *call_dest = insn_call_dest(insn);
+
+       /*
+        * FIXME: hack, we need a real noreturn solution
+        *
+        * Problem is, exc_double_fault() may or may not return, depending on
+        * whether CONFIG_X86_ESPFIX64 is set.  But objtool has no visibility
+        * to the kernel config.
+        *
+        * Other potential ways to fix it:
+        *
+        *   - have compiler communicate __noreturn functions somehow
+        *   - remove CONFIG_X86_ESPFIX64
+        *   - read the .config file
+        *   - add a cmdline option
+        *   - create a generic objtool annotation format (vs a bunch of custom
+        *     formats) and annotate it
+        */
+       if (!strcmp(call_dest->name, "exc_double_fault")) {
+               /* prevent further unreachable warnings for the caller */
+               insn->sym->warned = 1;
+               return true;
+       }
+
+       return false;
+}
+
 static int validate_reachable_instructions(struct objtool_file *file)
 {
-       struct instruction *insn;
+       struct instruction *insn, *prev_insn;
+       struct symbol *call_dest;
+       int warnings = 0;
 
        if (file->ignore_unreachables)
                return 0;
@@ -4522,13 +4503,127 @@ static int validate_reachable_instructions(struct objtool_file *file)
                if (insn->visited || ignore_unreachable_insn(file, insn))
                        continue;
 
+               prev_insn = prev_insn_same_sec(file, insn);
+               if (prev_insn && prev_insn->dead_end) {
+                       call_dest = insn_call_dest(prev_insn);
+                       if (call_dest && !ignore_noreturn_call(prev_insn)) {
+                               WARN_INSN(insn, "%s() is missing a __noreturn annotation",
+                                         call_dest->name);
+                               warnings++;
+                               continue;
+                       }
+               }
+
                WARN_INSN(insn, "unreachable instruction");
-               return 1;
+               warnings++;
+       }
+
+       return warnings;
+}
+
+/* 'funcs' is a space-separated list of function names */
+static int disas_funcs(const char *funcs)
+{
+       const char *objdump_str, *cross_compile;
+       int size, ret;
+       char *cmd;
+
+       cross_compile = getenv("CROSS_COMPILE");
+
+       objdump_str = "%sobjdump -wdr %s | gawk -M -v _funcs='%s' '"
+                       "BEGIN { split(_funcs, funcs); }"
+                       "/^$/ { func_match = 0; }"
+                       "/<.*>:/ { "
+                               "f = gensub(/.*<(.*)>:/, \"\\\\1\", 1);"
+                               "for (i in funcs) {"
+                                       "if (funcs[i] == f) {"
+                                               "func_match = 1;"
+                                               "base = strtonum(\"0x\" $1);"
+                                               "break;"
+                                       "}"
+                               "}"
+                       "}"
+                       "{"
+                               "if (func_match) {"
+                                       "addr = strtonum(\"0x\" $1);"
+                                       "printf(\"%%04x \", addr - base);"
+                                       "print;"
+                               "}"
+                       "}' 1>&2";
+
+       /* fake snprintf() to calculate the size */
+       size = snprintf(NULL, 0, objdump_str, cross_compile, objname, funcs) + 1;
+       if (size <= 0) {
+               WARN("objdump string size calculation failed");
+               return -1;
+       }
+
+       cmd = malloc(size);
+
+       /* real snprintf() */
+       snprintf(cmd, size, objdump_str, cross_compile, objname, funcs);
+       ret = system(cmd);
+       if (ret) {
+               WARN("disassembly failed: %d", ret);
+               return -1;
        }
 
        return 0;
 }
 
+static int disas_warned_funcs(struct objtool_file *file)
+{
+       struct symbol *sym;
+       char *funcs = NULL, *tmp;
+
+       for_each_sym(file, sym) {
+               if (sym->warned) {
+                       if (!funcs) {
+                               funcs = malloc(strlen(sym->name) + 1);
+                               strcpy(funcs, sym->name);
+                       } else {
+                               tmp = malloc(strlen(funcs) + strlen(sym->name) + 2);
+                               sprintf(tmp, "%s %s", funcs, sym->name);
+                               free(funcs);
+                               funcs = tmp;
+                       }
+               }
+       }
+
+       if (funcs)
+               disas_funcs(funcs);
+
+       return 0;
+}
+
+struct insn_chunk {
+       void *addr;
+       struct insn_chunk *next;
+};
+
+/*
+ * Reduce peak RSS usage by freeing insns memory before writing the ELF file,
+ * which can trigger more allocations for .debug_* sections whose data hasn't
+ * been read yet.
+ */
+static void free_insns(struct objtool_file *file)
+{
+       struct instruction *insn;
+       struct insn_chunk *chunks = NULL, *chunk;
+
+       for_each_insn(file, insn) {
+               if (!insn->idx) {
+                       chunk = malloc(sizeof(*chunk));
+                       chunk->addr = insn;
+                       chunk->next = chunks;
+                       chunks = chunk;
+               }
+       }
+
+       for (chunk = chunks; chunk; chunk = chunk->next)
+               free(chunk->addr);
+}
+
 int check(struct objtool_file *file)
 {
        int ret, warnings = 0;
@@ -4537,6 +4632,8 @@ int check(struct objtool_file *file)
        init_cfi_state(&init_cfi);
        init_cfi_state(&func_cfi);
        set_func_state(&func_cfi);
+       init_cfi_state(&force_undefined_cfi);
+       force_undefined_cfi.force_undefined = true;
 
        if (!cfi_hash_alloc(1UL << (file->elf->symbol_bits - 3)))
                goto out;
@@ -4673,6 +4770,10 @@ int check(struct objtool_file *file)
                warnings += ret;
        }
 
+       free_insns(file);
+
+       if (opts.verbose)
+               disas_warned_funcs(file);
 
        if (opts.stats) {
                printf("nr_insns_visited: %ld\n", nr_insns_visited);
index 500e929..d420b5d 100644
@@ -32,16 +32,52 @@ static inline u32 str_hash(const char *str)
 #define __elf_table(name)      (elf->name##_hash)
 #define __elf_bits(name)       (elf->name##_bits)
 
-#define elf_hash_add(name, node, key) \
-       hlist_add_head(node, &__elf_table(name)[hash_min(key, __elf_bits(name))])
+#define __elf_table_entry(name, key) \
+       __elf_table(name)[hash_min(key, __elf_bits(name))]
+
+#define elf_hash_add(name, node, key)                                  \
+({                                                                     \
+       struct elf_hash_node *__node = node;                            \
+       __node->next = __elf_table_entry(name, key);                    \
+       __elf_table_entry(name, key) = __node;                          \
+})
+
+static inline void __elf_hash_del(struct elf_hash_node *node,
+                                 struct elf_hash_node **head)
+{
+       struct elf_hash_node *cur, *prev;
+
+       if (node == *head) {
+               *head = node->next;
+               return;
+       }
+
+       for (prev = NULL, cur = *head; cur; prev = cur, cur = cur->next) {
+               if (cur == node) {
+                       prev->next = cur->next;
+                       break;
+               }
+       }
+}
 
-#define elf_hash_for_each_possible(name, obj, member, key) \
-       hlist_for_each_entry(obj, &__elf_table(name)[hash_min(key, __elf_bits(name))], member)
+#define elf_hash_del(name, node, key) \
+       __elf_hash_del(node, &__elf_table_entry(name, key))
+
+#define elf_list_entry(ptr, type, member)                              \
+({                                                                     \
+       typeof(ptr) __ptr = (ptr);                                      \
+       __ptr ? container_of(__ptr, type, member) : NULL;               \
+})
+
+#define elf_hash_for_each_possible(name, obj, member, key)             \
+       for (obj = elf_list_entry(__elf_table_entry(name, key), typeof(*obj), member); \
+            obj;                                                       \
+            obj = elf_list_entry(obj->member.next, typeof(*(obj)), member))
 
 #define elf_alloc_hash(name, size) \
 ({ \
        __elf_bits(name) = max(10, ilog2(size)); \
-       __elf_table(name) = mmap(NULL, sizeof(struct hlist_head) << __elf_bits(name), \
+       __elf_table(name) = mmap(NULL, sizeof(struct elf_hash_node *) << __elf_bits(name), \
                                 PROT_READ|PROT_WRITE, \
                                 MAP_PRIVATE|MAP_ANON, -1, 0); \
        if (__elf_table(name) == (void *)-1L) { \
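
The hunk above replaces the hlist-based buckets with hand-rolled singly linked lists: elf_hash_add() pushes a node onto the bucket head and __elf_hash_del() walks the bucket to unlink it, so each hashed object carries one pointer instead of hlist_node's two, at the cost of a linear walk on removal. A standalone sketch of that bucket technique is below (plain C with hypothetical names; this is not objtool code, just the same add/delete logic in isolation).

#include <stdio.h>

struct hash_node {
	struct hash_node *next;
};

static void bucket_add(struct hash_node **head, struct hash_node *node)
{
	node->next = *head;		/* push onto the bucket list */
	*head = node;
}

static void bucket_del(struct hash_node **head, struct hash_node *node)
{
	struct hash_node *cur, *prev;

	if (node == *head) {		/* deleting the head is O(1) */
		*head = node->next;
		return;
	}

	for (prev = NULL, cur = *head; cur; prev = cur, cur = cur->next) {
		if (cur == node) {	/* otherwise unlink via the predecessor */
			prev->next = cur->next;
			break;
		}
	}
}

int main(void)
{
	struct hash_node a, b, c, *head = NULL;

	bucket_add(&head, &a);
	bucket_add(&head, &b);
	bucket_add(&head, &c);	/* bucket is now c -> b -> a */
	bucket_del(&head, &b);	/* bucket is now c -> a */

	printf("head is &c: %d, head->next is &a: %d\n",
	       head == &c, head->next == &a);
	return 0;
}
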
@@ -233,21 +269,22 @@ struct reloc *find_reloc_by_dest_range(const struct elf *elf, struct section *se
                                     unsigned long offset, unsigned int len)
 {
        struct reloc *reloc, *r = NULL;
+       struct section *rsec;
        unsigned long o;
 
-       if (!sec->reloc)
+       rsec = sec->rsec;
+       if (!rsec)
                return NULL;
 
-       sec = sec->reloc;
-
        for_offset_range(o, offset, offset + len) {
                elf_hash_for_each_possible(reloc, reloc, hash,
-                                          sec_offset_hash(sec, o)) {
-                       if (reloc->sec != sec)
+                                          sec_offset_hash(rsec, o)) {
+                       if (reloc->sec != rsec)
                                continue;
 
-                       if (reloc->offset >= offset && reloc->offset < offset + len) {
-                               if (!r || reloc->offset < r->offset)
+                       if (reloc_offset(reloc) >= offset &&
+                           reloc_offset(reloc) < offset + len) {
+                               if (!r || reloc_offset(reloc) < reloc_offset(r))
                                        r = reloc;
                        }
                }
@@ -263,6 +300,11 @@ struct reloc *find_reloc_by_dest(const struct elf *elf, struct section *sec, uns
        return find_reloc_by_dest_range(elf, sec, offset, 1);
 }
 
+static bool is_dwarf_section(struct section *sec)
+{
+       return !strncmp(sec->name, ".debug_", 7);
+}
+
 static int read_sections(struct elf *elf)
 {
        Elf_Scn *s = NULL;
@@ -293,7 +335,6 @@ static int read_sections(struct elf *elf)
                sec = &elf->section_data[i];
 
                INIT_LIST_HEAD(&sec->symbol_list);
-               INIT_LIST_HEAD(&sec->reloc_list);
 
                s = elf_getscn(elf->elf, i);
                if (!s) {
@@ -314,7 +355,7 @@ static int read_sections(struct elf *elf)
                        return -1;
                }
 
-               if (sec->sh.sh_size != 0) {
+               if (sec->sh.sh_size != 0 && !is_dwarf_section(sec)) {
                        sec->data = elf_getdata(s, NULL);
                        if (!sec->data) {
                                WARN_ELF("elf_getdata");
@@ -328,12 +369,12 @@ static int read_sections(struct elf *elf)
                        }
                }
 
-               if (sec->sh.sh_flags & SHF_EXECINSTR)
-                       elf->text_size += sec->sh.sh_size;
-
                list_add_tail(&sec->list, &elf->sections);
                elf_hash_add(section, &sec->hash, sec->idx);
                elf_hash_add(section_name, &sec->name_hash, str_hash(sec->name));
+
+               if (is_reloc_sec(sec))
+                       elf->num_relocs += sec_num_entries(sec);
        }
 
        if (opts.stats) {
@@ -356,7 +397,6 @@ static void elf_add_symbol(struct elf *elf, struct symbol *sym)
        struct rb_node *pnode;
        struct symbol *iter;
 
-       INIT_LIST_HEAD(&sym->reloc_list);
        INIT_LIST_HEAD(&sym->pv_target);
        sym->alias = sym;
 
@@ -407,7 +447,7 @@ static int read_symbols(struct elf *elf)
                if (symtab_shndx)
                        shndx_data = symtab_shndx->data;
 
-               symbols_nr = symtab->sh.sh_size / symtab->sh.sh_entsize;
+               symbols_nr = sec_num_entries(symtab);
        } else {
                /*
                 * A missing symbol table is actually possible if it's an empty
@@ -533,52 +573,17 @@ err:
        return -1;
 }
 
-static struct section *elf_create_reloc_section(struct elf *elf,
-                                               struct section *base,
-                                               int reltype);
-
-int elf_add_reloc(struct elf *elf, struct section *sec, unsigned long offset,
-                 unsigned int type, struct symbol *sym, s64 addend)
-{
-       struct reloc *reloc;
-
-       if (!sec->reloc && !elf_create_reloc_section(elf, sec, SHT_RELA))
-               return -1;
-
-       reloc = malloc(sizeof(*reloc));
-       if (!reloc) {
-               perror("malloc");
-               return -1;
-       }
-       memset(reloc, 0, sizeof(*reloc));
-
-       reloc->sec = sec->reloc;
-       reloc->offset = offset;
-       reloc->type = type;
-       reloc->sym = sym;
-       reloc->addend = addend;
-
-       list_add_tail(&reloc->sym_reloc_entry, &sym->reloc_list);
-       list_add_tail(&reloc->list, &sec->reloc->reloc_list);
-       elf_hash_add(reloc, &reloc->hash, reloc_hash(reloc));
-
-       sec->reloc->sh.sh_size += sec->reloc->sh.sh_entsize;
-       sec->reloc->changed = true;
-
-       return 0;
-}
-
 /*
- * Ensure that any reloc section containing references to @sym is marked
- * changed such that it will get re-generated in elf_rebuild_reloc_sections()
- * with the new symbol index.
+ * @sym's idx has changed.  Update the relocs which reference it.
  */
-static void elf_dirty_reloc_sym(struct elf *elf, struct symbol *sym)
+static int elf_update_sym_relocs(struct elf *elf, struct symbol *sym)
 {
        struct reloc *reloc;
 
-       list_for_each_entry(reloc, &sym->reloc_list, sym_reloc_entry)
-               reloc->sec->changed = true;
+       for (reloc = sym->relocs; reloc; reloc = reloc->sym_next_reloc)
+               set_reloc_sym(elf, reloc, reloc->sym->idx);
+
+       return 0;
 }
 
 /*
@@ -655,7 +660,7 @@ static int elf_update_symbol(struct elf *elf, struct section *symtab,
                        symtab_data->d_align = 1;
                        symtab_data->d_type = ELF_T_SYM;
 
-                       symtab->changed = true;
+                       mark_sec_changed(elf, symtab, true);
                        symtab->truncate = true;
 
                        if (t) {
@@ -670,7 +675,7 @@ static int elf_update_symbol(struct elf *elf, struct section *symtab,
                                shndx_data->d_align = sizeof(Elf32_Word);
                                shndx_data->d_type = ELF_T_WORD;
 
-                               symtab_shndx->changed = true;
+                               mark_sec_changed(elf, symtab_shndx, true);
                                symtab_shndx->truncate = true;
                        }
 
@@ -734,7 +739,7 @@ __elf_create_symbol(struct elf *elf, struct symbol *sym)
                return NULL;
        }
 
-       new_idx = symtab->sh.sh_size / symtab->sh.sh_entsize;
+       new_idx = sec_num_entries(symtab);
 
        if (GELF_ST_BIND(sym->sym.st_info) != STB_LOCAL)
                goto non_local;
@@ -746,18 +751,19 @@ __elf_create_symbol(struct elf *elf, struct symbol *sym)
        first_non_local = symtab->sh.sh_info;
        old = find_symbol_by_index(elf, first_non_local);
        if (old) {
-               old->idx = new_idx;
 
-               hlist_del(&old->hash);
-               elf_hash_add(symbol, &old->hash, old->idx);
-
-               elf_dirty_reloc_sym(elf, old);
+               elf_hash_del(symbol, &old->hash, old->idx);
+               elf_hash_add(symbol, &old->hash, new_idx);
+               old->idx = new_idx;
 
                if (elf_update_symbol(elf, symtab, symtab_shndx, old)) {
                        WARN("elf_update_symbol move");
                        return NULL;
                }
 
+               if (elf_update_sym_relocs(elf, old))
+                       return NULL;
+
                new_idx = first_non_local;
        }
 
@@ -774,11 +780,11 @@ non_local:
        }
 
        symtab->sh.sh_size += symtab->sh.sh_entsize;
-       symtab->changed = true;
+       mark_sec_changed(elf, symtab, true);
 
        if (symtab_shndx) {
                symtab_shndx->sh.sh_size += sizeof(Elf32_Word);
-               symtab_shndx->changed = true;
+               mark_sec_changed(elf, symtab_shndx, true);
        }
 
        return sym;
@@ -841,13 +847,57 @@ elf_create_prefix_symbol(struct elf *elf, struct symbol *orig, long size)
        return sym;
 }
 
-int elf_add_reloc_to_insn(struct elf *elf, struct section *sec,
-                         unsigned long offset, unsigned int type,
-                         struct section *insn_sec, unsigned long insn_off)
+static struct reloc *elf_init_reloc(struct elf *elf, struct section *rsec,
+                                   unsigned int reloc_idx,
+                                   unsigned long offset, struct symbol *sym,
+                                   s64 addend, unsigned int type)
+{
+       struct reloc *reloc, empty = { 0 };
+
+       if (reloc_idx >= sec_num_entries(rsec)) {
+               WARN("%s: bad reloc_idx %u for %s with %d relocs",
+                    __func__, reloc_idx, rsec->name, sec_num_entries(rsec));
+               return NULL;
+       }
+
+       reloc = &rsec->relocs[reloc_idx];
+
+       if (memcmp(reloc, &empty, sizeof(empty))) {
+               WARN("%s: %s: reloc %d already initialized!",
+                    __func__, rsec->name, reloc_idx);
+               return NULL;
+       }
+
+       reloc->sec = rsec;
+       reloc->sym = sym;
+
+       set_reloc_offset(elf, reloc, offset);
+       set_reloc_sym(elf, reloc, sym->idx);
+       set_reloc_type(elf, reloc, type);
+       set_reloc_addend(elf, reloc, addend);
+
+       elf_hash_add(reloc, &reloc->hash, reloc_hash(reloc));
+       reloc->sym_next_reloc = sym->relocs;
+       sym->relocs = reloc;
+
+       return reloc;
+}
+
+struct reloc *elf_init_reloc_text_sym(struct elf *elf, struct section *sec,
+                                     unsigned long offset,
+                                     unsigned int reloc_idx,
+                                     struct section *insn_sec,
+                                     unsigned long insn_off)
 {
        struct symbol *sym = insn_sec->sym;
        int addend = insn_off;
 
+       if (!(insn_sec->sh.sh_flags & SHF_EXECINSTR)) {
+               WARN("bad call to %s() for data symbol %s",
+                    __func__, sym->name);
+               return NULL;
+       }
+
        if (!sym) {
                /*
                 * Due to how weak functions work, we must use section based
@@ -857,108 +907,86 @@ int elf_add_reloc_to_insn(struct elf *elf, struct section *sec,
                 */
                sym = elf_create_section_symbol(elf, insn_sec);
                if (!sym)
-                       return -1;
+                       return NULL;
 
                insn_sec->sym = sym;
        }
 
-       return elf_add_reloc(elf, sec, offset, type, sym, addend);
+       return elf_init_reloc(elf, sec->rsec, reloc_idx, offset, sym, addend,
+                             elf_text_rela_type(elf));
 }
 
-static int read_rel_reloc(struct section *sec, int i, struct reloc *reloc, unsigned int *symndx)
+struct reloc *elf_init_reloc_data_sym(struct elf *elf, struct section *sec,
+                                     unsigned long offset,
+                                     unsigned int reloc_idx,
+                                     struct symbol *sym,
+                                     s64 addend)
 {
-       if (!gelf_getrel(sec->data, i, &reloc->rel)) {
-               WARN_ELF("gelf_getrel");
-               return -1;
+       if (sym->sec && (sec->sh.sh_flags & SHF_EXECINSTR)) {
+               WARN("bad call to %s() for text symbol %s",
+                    __func__, sym->name);
+               return NULL;
        }
-       reloc->type = GELF_R_TYPE(reloc->rel.r_info);
-       reloc->addend = 0;
-       reloc->offset = reloc->rel.r_offset;
-       *symndx = GELF_R_SYM(reloc->rel.r_info);
-       return 0;
-}
 
-static int read_rela_reloc(struct section *sec, int i, struct reloc *reloc, unsigned int *symndx)
-{
-       if (!gelf_getrela(sec->data, i, &reloc->rela)) {
-               WARN_ELF("gelf_getrela");
-               return -1;
-       }
-       reloc->type = GELF_R_TYPE(reloc->rela.r_info);
-       reloc->addend = reloc->rela.r_addend;
-       reloc->offset = reloc->rela.r_offset;
-       *symndx = GELF_R_SYM(reloc->rela.r_info);
-       return 0;
+       return elf_init_reloc(elf, sec->rsec, reloc_idx, offset, sym, addend,
+                             elf_data_rela_type(elf));
 }
 
 static int read_relocs(struct elf *elf)
 {
-       unsigned long nr_reloc, max_reloc = 0, tot_reloc = 0;
-       struct section *sec;
+       unsigned long nr_reloc, max_reloc = 0;
+       struct section *rsec;
        struct reloc *reloc;
        unsigned int symndx;
        struct symbol *sym;
        int i;
 
-       if (!elf_alloc_hash(reloc, elf->text_size / 16))
+       if (!elf_alloc_hash(reloc, elf->num_relocs))
                return -1;
 
-       list_for_each_entry(sec, &elf->sections, list) {
-               if ((sec->sh.sh_type != SHT_RELA) &&
-                   (sec->sh.sh_type != SHT_REL))
+       list_for_each_entry(rsec, &elf->sections, list) {
+               if (!is_reloc_sec(rsec))
                        continue;
 
-               sec->base = find_section_by_index(elf, sec->sh.sh_info);
-               if (!sec->base) {
+               rsec->base = find_section_by_index(elf, rsec->sh.sh_info);
+               if (!rsec->base) {
                        WARN("can't find base section for reloc section %s",
-                            sec->name);
+                            rsec->name);
                        return -1;
                }
 
-               sec->base->reloc = sec;
+               rsec->base->rsec = rsec;
 
                nr_reloc = 0;
-               sec->reloc_data = calloc(sec->sh.sh_size / sec->sh.sh_entsize, sizeof(*reloc));
-               if (!sec->reloc_data) {
+               rsec->relocs = calloc(sec_num_entries(rsec), sizeof(*reloc));
+               if (!rsec->relocs) {
                        perror("calloc");
                        return -1;
                }
-               for (i = 0; i < sec->sh.sh_size / sec->sh.sh_entsize; i++) {
-                       reloc = &sec->reloc_data[i];
-                       switch (sec->sh.sh_type) {
-                       case SHT_REL:
-                               if (read_rel_reloc(sec, i, reloc, &symndx))
-                                       return -1;
-                               break;
-                       case SHT_RELA:
-                               if (read_rela_reloc(sec, i, reloc, &symndx))
-                                       return -1;
-                               break;
-                       default: return -1;
-                       }
+               for (i = 0; i < sec_num_entries(rsec); i++) {
+                       reloc = &rsec->relocs[i];
 
-                       reloc->sec = sec;
-                       reloc->idx = i;
+                       reloc->sec = rsec;
+                       symndx = reloc_sym(reloc);
                        reloc->sym = sym = find_symbol_by_index(elf, symndx);
                        if (!reloc->sym) {
                                WARN("can't find reloc entry symbol %d for %s",
-                                    symndx, sec->name);
+                                    symndx, rsec->name);
                                return -1;
                        }
 
-                       list_add_tail(&reloc->sym_reloc_entry, &sym->reloc_list);
-                       list_add_tail(&reloc->list, &sec->reloc_list);
                        elf_hash_add(reloc, &reloc->hash, reloc_hash(reloc));
+                       reloc->sym_next_reloc = sym->relocs;
+                       sym->relocs = reloc;
 
                        nr_reloc++;
                }
                max_reloc = max(max_reloc, nr_reloc);
-               tot_reloc += nr_reloc;
        }
 
        if (opts.stats) {
                printf("max_reloc: %lu\n", max_reloc);
-               printf("tot_reloc: %lu\n", tot_reloc);
+               printf("num_relocs: %lu\n", elf->num_relocs);
                printf("reloc_bits: %d\n", elf->reloc_bits);
        }
 
@@ -1053,13 +1081,14 @@ static int elf_add_string(struct elf *elf, struct section *strtab, char *str)
 
        len = strtab->sh.sh_size;
        strtab->sh.sh_size += data->d_size;
-       strtab->changed = true;
+
+       mark_sec_changed(elf, strtab, true);
 
        return len;
 }
 
 struct section *elf_create_section(struct elf *elf, const char *name,
-                                  unsigned int sh_flags, size_t entsize, int nr)
+                                  size_t entsize, unsigned int nr)
 {
        struct section *sec, *shstrtab;
        size_t size = entsize * nr;
@@ -1073,7 +1102,6 @@ struct section *elf_create_section(struct elf *elf, const char *name,
        memset(sec, 0, sizeof(*sec));
 
        INIT_LIST_HEAD(&sec->symbol_list);
-       INIT_LIST_HEAD(&sec->reloc_list);
 
        s = elf_newscn(elf->elf);
        if (!s) {
@@ -1088,7 +1116,6 @@ struct section *elf_create_section(struct elf *elf, const char *name,
        }
 
        sec->idx = elf_ndxscn(s);
-       sec->changed = true;
 
        sec->data = elf_newdata(s);
        if (!sec->data) {
@@ -1117,7 +1144,7 @@ struct section *elf_create_section(struct elf *elf, const char *name,
        sec->sh.sh_entsize = entsize;
        sec->sh.sh_type = SHT_PROGBITS;
        sec->sh.sh_addralign = 1;
-       sec->sh.sh_flags = SHF_ALLOC | sh_flags;
+       sec->sh.sh_flags = SHF_ALLOC;
 
        /* Add section name to .shstrtab (or .strtab for Clang) */
        shstrtab = find_section_by_name(elf, ".shstrtab");
@@ -1135,158 +1162,66 @@ struct section *elf_create_section(struct elf *elf, const char *name,
        elf_hash_add(section, &sec->hash, sec->idx);
        elf_hash_add(section_name, &sec->name_hash, str_hash(sec->name));
 
-       elf->changed = true;
+       mark_sec_changed(elf, sec, true);
 
        return sec;
 }
 
-static struct section *elf_create_rel_reloc_section(struct elf *elf, struct section *base)
+static struct section *elf_create_rela_section(struct elf *elf,
+                                              struct section *sec,
+                                              unsigned int reloc_nr)
 {
-       char *relocname;
-       struct section *sec;
+       struct section *rsec;
+       char *rsec_name;
 
-       relocname = malloc(strlen(base->name) + strlen(".rel") + 1);
-       if (!relocname) {
+       rsec_name = malloc(strlen(sec->name) + strlen(".rela") + 1);
+       if (!rsec_name) {
                perror("malloc");
                return NULL;
        }
-       strcpy(relocname, ".rel");
-       strcat(relocname, base->name);
+       strcpy(rsec_name, ".rela");
+       strcat(rsec_name, sec->name);
 
-       sec = elf_create_section(elf, relocname, 0, sizeof(GElf_Rel), 0);
-       free(relocname);
-       if (!sec)
+       rsec = elf_create_section(elf, rsec_name, elf_rela_size(elf), reloc_nr);
+       free(rsec_name);
+       if (!rsec)
                return NULL;
 
-       base->reloc = sec;
-       sec->base = base;
+       rsec->data->d_type = ELF_T_RELA;
+       rsec->sh.sh_type = SHT_RELA;
+       rsec->sh.sh_addralign = elf_addr_size(elf);
+       rsec->sh.sh_link = find_section_by_name(elf, ".symtab")->idx;
+       rsec->sh.sh_info = sec->idx;
+       rsec->sh.sh_flags = SHF_INFO_LINK;
+
+       rsec->relocs = calloc(sec_num_entries(rsec), sizeof(struct reloc));
+       if (!rsec->relocs) {
+               perror("calloc");
+               return NULL;
+       }
 
-       sec->sh.sh_type = SHT_REL;
-       sec->sh.sh_addralign = 8;
-       sec->sh.sh_link = find_section_by_name(elf, ".symtab")->idx;
-       sec->sh.sh_info = base->idx;
-       sec->sh.sh_flags = SHF_INFO_LINK;
+       sec->rsec = rsec;
+       rsec->base = sec;
 
-       return sec;
+       return rsec;
 }
 
-static struct section *elf_create_rela_reloc_section(struct elf *elf, struct section *base)
+struct section *elf_create_section_pair(struct elf *elf, const char *name,
+                                       size_t entsize, unsigned int nr,
+                                       unsigned int reloc_nr)
 {
-       char *relocname;
        struct section *sec;
-       int addrsize = elf_class_addrsize(elf);
-
-       relocname = malloc(strlen(base->name) + strlen(".rela") + 1);
-       if (!relocname) {
-               perror("malloc");
-               return NULL;
-       }
-       strcpy(relocname, ".rela");
-       strcat(relocname, base->name);
 
-       if (addrsize == sizeof(u32))
-               sec = elf_create_section(elf, relocname, 0, sizeof(Elf32_Rela), 0);
-       else
-               sec = elf_create_section(elf, relocname, 0, sizeof(GElf_Rela), 0);
-       free(relocname);
+       sec = elf_create_section(elf, name, entsize, nr);
        if (!sec)
                return NULL;
 
-       base->reloc = sec;
-       sec->base = base;
-
-       sec->sh.sh_type = SHT_RELA;
-       sec->sh.sh_addralign = addrsize;
-       sec->sh.sh_link = find_section_by_name(elf, ".symtab")->idx;
-       sec->sh.sh_info = base->idx;
-       sec->sh.sh_flags = SHF_INFO_LINK;
+       if (!elf_create_rela_section(elf, sec, reloc_nr))
+               return NULL;
 
        return sec;
 }
 
-static struct section *elf_create_reloc_section(struct elf *elf,
-                                        struct section *base,
-                                        int reltype)
-{
-       switch (reltype) {
-       case SHT_REL:  return elf_create_rel_reloc_section(elf, base);
-       case SHT_RELA: return elf_create_rela_reloc_section(elf, base);
-       default:       return NULL;
-       }
-}
-
-static int elf_rebuild_rel_reloc_section(struct section *sec)
-{
-       struct reloc *reloc;
-       int idx = 0;
-       void *buf;
-
-       /* Allocate a buffer for relocations */
-       buf = malloc(sec->sh.sh_size);
-       if (!buf) {
-               perror("malloc");
-               return -1;
-       }
-
-       sec->data->d_buf = buf;
-       sec->data->d_size = sec->sh.sh_size;
-       sec->data->d_type = ELF_T_REL;
-
-       idx = 0;
-       list_for_each_entry(reloc, &sec->reloc_list, list) {
-               reloc->rel.r_offset = reloc->offset;
-               reloc->rel.r_info = GELF_R_INFO(reloc->sym->idx, reloc->type);
-               if (!gelf_update_rel(sec->data, idx, &reloc->rel)) {
-                       WARN_ELF("gelf_update_rel");
-                       return -1;
-               }
-               idx++;
-       }
-
-       return 0;
-}
-
-static int elf_rebuild_rela_reloc_section(struct section *sec)
-{
-       struct reloc *reloc;
-       int idx = 0;
-       void *buf;
-
-       /* Allocate a buffer for relocations with addends */
-       buf = malloc(sec->sh.sh_size);
-       if (!buf) {
-               perror("malloc");
-               return -1;
-       }
-
-       sec->data->d_buf = buf;
-       sec->data->d_size = sec->sh.sh_size;
-       sec->data->d_type = ELF_T_RELA;
-
-       idx = 0;
-       list_for_each_entry(reloc, &sec->reloc_list, list) {
-               reloc->rela.r_offset = reloc->offset;
-               reloc->rela.r_addend = reloc->addend;
-               reloc->rela.r_info = GELF_R_INFO(reloc->sym->idx, reloc->type);
-               if (!gelf_update_rela(sec->data, idx, &reloc->rela)) {
-                       WARN_ELF("gelf_update_rela");
-                       return -1;
-               }
-               idx++;
-       }
-
-       return 0;
-}
-
-static int elf_rebuild_reloc_section(struct elf *elf, struct section *sec)
-{
-       switch (sec->sh.sh_type) {
-       case SHT_REL:  return elf_rebuild_rel_reloc_section(sec);
-       case SHT_RELA: return elf_rebuild_rela_reloc_section(sec);
-       default:       return -1;
-       }
-}
-
 int elf_write_insn(struct elf *elf, struct section *sec,
                   unsigned long offset, unsigned int len,
                   const char *insn)
@@ -1299,37 +1234,8 @@ int elf_write_insn(struct elf *elf, struct section *sec,
        }
 
        memcpy(data->d_buf + offset, insn, len);
-       elf_flagdata(data, ELF_C_SET, ELF_F_DIRTY);
 
-       elf->changed = true;
-
-       return 0;
-}
-
-int elf_write_reloc(struct elf *elf, struct reloc *reloc)
-{
-       struct section *sec = reloc->sec;
-
-       if (sec->sh.sh_type == SHT_REL) {
-               reloc->rel.r_info = GELF_R_INFO(reloc->sym->idx, reloc->type);
-               reloc->rel.r_offset = reloc->offset;
-
-               if (!gelf_update_rel(sec->data, reloc->idx, &reloc->rel)) {
-                       WARN_ELF("gelf_update_rel");
-                       return -1;
-               }
-       } else {
-               reloc->rela.r_info = GELF_R_INFO(reloc->sym->idx, reloc->type);
-               reloc->rela.r_addend = reloc->addend;
-               reloc->rela.r_offset = reloc->offset;
-
-               if (!gelf_update_rela(sec->data, reloc->idx, &reloc->rela)) {
-                       WARN_ELF("gelf_update_rela");
-                       return -1;
-               }
-       }
-
-       elf->changed = true;
+       mark_sec_changed(elf, sec, true);
 
        return 0;
 }
@@ -1401,25 +1307,20 @@ int elf_write(struct elf *elf)
                if (sec->truncate)
                        elf_truncate_section(elf, sec);
 
-               if (sec->changed) {
+               if (sec_changed(sec)) {
                        s = elf_getscn(elf->elf, sec->idx);
                        if (!s) {
                                WARN_ELF("elf_getscn");
                                return -1;
                        }
+
+                       /* Note this also flags the section dirty */
                        if (!gelf_update_shdr(s, &sec->sh)) {
                                WARN_ELF("gelf_update_shdr");
                                return -1;
                        }
 
-                       if (sec->base &&
-                           elf_rebuild_reloc_section(elf, sec)) {
-                               WARN("elf_rebuild_reloc_section");
-                               return -1;
-                       }
-
-                       sec->changed = false;
-                       elf->changed = true;
+                       mark_sec_changed(elf, sec, false);
                }
        }
 
@@ -1439,30 +1340,14 @@ int elf_write(struct elf *elf)
 
 void elf_close(struct elf *elf)
 {
-       struct section *sec, *tmpsec;
-       struct symbol *sym, *tmpsym;
-       struct reloc *reloc, *tmpreloc;
-
        if (elf->elf)
                elf_end(elf->elf);
 
        if (elf->fd > 0)
                close(elf->fd);
 
-       list_for_each_entry_safe(sec, tmpsec, &elf->sections, list) {
-               list_for_each_entry_safe(sym, tmpsym, &sec->symbol_list, list) {
-                       list_del(&sym->list);
-                       hash_del(&sym->hash);
-               }
-               list_for_each_entry_safe(reloc, tmpreloc, &sec->reloc_list, list) {
-                       list_del(&reloc->list);
-                       hash_del(&reloc->hash);
-               }
-               list_del(&sec->list);
-               free(sec->reloc_data);
-       }
-
-       free(elf->symbol_data);
-       free(elf->section_data);
-       free(elf);
+       /*
+        * NOTE: All remaining allocations are leaked on purpose.  Objtool is
+        * about to exit anyway.
+        */
 }
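
The elf.c rework above drops the old reloc_list/list_head bookkeeping: read_relocs() and elf_init_reloc() now push every reloc onto its owning symbol through the intrusive sym_next_reloc pointer. A minimal standalone sketch of that push-front chain, with invented toy_* types standing in for objtool's structures:

    #include <stdio.h>

    struct toy_reloc {
            int idx;
            struct toy_reloc *sym_next_reloc;
    };

    struct toy_symbol {
            struct toy_reloc *relocs;       /* head of the chain */
    };

    static void attach_reloc(struct toy_symbol *sym, struct toy_reloc *reloc)
    {
            /* push-front, as read_relocs()/elf_init_reloc() do above */
            reloc->sym_next_reloc = sym->relocs;
            sym->relocs = reloc;
    }

    int main(void)
    {
            struct toy_symbol sym = { 0 };
            struct toy_reloc r[3] = { { .idx = 0 }, { .idx = 1 }, { .idx = 2 } };

            for (int i = 0; i < 3; i++)
                    attach_reloc(&sym, &r[i]);

            /* walks in reverse insertion order: prints "2 1 0" */
            for (struct toy_reloc *it = sym.relocs; it; it = it->sym_next_reloc)
                    printf("%d ", it->idx);
            printf("\n");
            return 0;
    }
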
index 2a108e6..fcca666 100644 (file)
@@ -37,6 +37,7 @@ struct opts {
        bool no_unreachable;
        bool sec_address;
        bool stats;
+       bool verbose;
 };
 
 extern struct opts opts;
index b1258e7..c8a6bec 100644 (file)
@@ -36,6 +36,7 @@ struct cfi_state {
        bool drap;
        bool signal;
        bool end;
+       bool force_undefined;
 };
 
 #endif /* _OBJTOOL_CFI_H */
index e1ca588..c532d70 100644 (file)
@@ -12,6 +12,7 @@
 #include <linux/hashtable.h>
 #include <linux/rbtree.h>
 #include <linux/jhash.h>
+#include <arch/elf.h>
 
 #ifdef LIBELF_USE_DEPRECATED
 # define elf_getshdrnum    elf_getshnum
 #define ELF_C_READ_MMAP ELF_C_READ
 #endif
 
+struct elf_hash_node {
+       struct elf_hash_node *next;
+};
+
 struct section {
        struct list_head list;
-       struct hlist_node hash;
-       struct hlist_node name_hash;
+       struct elf_hash_node hash;
+       struct elf_hash_node name_hash;
        GElf_Shdr sh;
        struct rb_root_cached symbol_tree;
        struct list_head symbol_list;
-       struct list_head reloc_list;
-       struct section *base, *reloc;
+       struct section *base, *rsec;
        struct symbol *sym;
        Elf_Data *data;
        char *name;
        int idx;
-       bool changed, text, rodata, noinstr, init, truncate;
-       struct reloc *reloc_data;
+       bool _changed, text, rodata, noinstr, init, truncate;
+       struct reloc *relocs;
 };
 
 struct symbol {
        struct list_head list;
        struct rb_node node;
-       struct hlist_node hash;
-       struct hlist_node name_hash;
+       struct elf_hash_node hash;
+       struct elf_hash_node name_hash;
        GElf_Sym sym;
        struct section *sec;
        char *name;
@@ -61,37 +65,27 @@ struct symbol {
        u8 return_thunk      : 1;
        u8 fentry            : 1;
        u8 profiling_func    : 1;
+       u8 warned            : 1;
        struct list_head pv_target;
-       struct list_head reloc_list;
+       struct reloc *relocs;
 };
 
 struct reloc {
-       struct list_head list;
-       struct hlist_node hash;
-       union {
-               GElf_Rela rela;
-               GElf_Rel  rel;
-       };
+       struct elf_hash_node hash;
        struct section *sec;
        struct symbol *sym;
-       struct list_head sym_reloc_entry;
-       unsigned long offset;
-       unsigned int type;
-       s64 addend;
-       int idx;
-       bool jump_table_start;
+       struct reloc *sym_next_reloc;
 };
 
-#define ELF_HASH_BITS  20
-
 struct elf {
        Elf *elf;
        GElf_Ehdr ehdr;
        int fd;
        bool changed;
        char *name;
-       unsigned int text_size, num_files;
+       unsigned int num_files;
        struct list_head sections;
+       unsigned long num_relocs;
 
        int symbol_bits;
        int symbol_name_bits;
@@ -99,44 +93,54 @@ struct elf {
        int section_name_bits;
        int reloc_bits;
 
-       struct hlist_head *symbol_hash;
-       struct hlist_head *symbol_name_hash;
-       struct hlist_head *section_hash;
-       struct hlist_head *section_name_hash;
-       struct hlist_head *reloc_hash;
+       struct elf_hash_node **symbol_hash;
+       struct elf_hash_node **symbol_name_hash;
+       struct elf_hash_node **section_hash;
+       struct elf_hash_node **section_name_hash;
+       struct elf_hash_node **reloc_hash;
 
        struct section *section_data;
        struct symbol *symbol_data;
 };
 
-#define OFFSET_STRIDE_BITS     4
-#define OFFSET_STRIDE          (1UL << OFFSET_STRIDE_BITS)
-#define OFFSET_STRIDE_MASK     (~(OFFSET_STRIDE - 1))
-
-#define for_offset_range(_offset, _start, _end)                        \
-       for (_offset = ((_start) & OFFSET_STRIDE_MASK);         \
-            _offset >= ((_start) & OFFSET_STRIDE_MASK) &&      \
-            _offset <= ((_end) & OFFSET_STRIDE_MASK);          \
-            _offset += OFFSET_STRIDE)
+struct elf *elf_open_read(const char *name, int flags);
 
-static inline u32 sec_offset_hash(struct section *sec, unsigned long offset)
-{
-       u32 ol, oh, idx = sec->idx;
+struct section *elf_create_section(struct elf *elf, const char *name,
+                                  size_t entsize, unsigned int nr);
+struct section *elf_create_section_pair(struct elf *elf, const char *name,
+                                       size_t entsize, unsigned int nr,
+                                       unsigned int reloc_nr);
 
-       offset &= OFFSET_STRIDE_MASK;
+struct symbol *elf_create_prefix_symbol(struct elf *elf, struct symbol *orig, long size);
 
-       ol = offset;
-       oh = (offset >> 16) >> 16;
+struct reloc *elf_init_reloc_text_sym(struct elf *elf, struct section *sec,
+                                     unsigned long offset,
+                                     unsigned int reloc_idx,
+                                     struct section *insn_sec,
+                                     unsigned long insn_off);
 
-       __jhash_mix(ol, oh, idx);
+struct reloc *elf_init_reloc_data_sym(struct elf *elf, struct section *sec,
+                                     unsigned long offset,
+                                     unsigned int reloc_idx,
+                                     struct symbol *sym,
+                                     s64 addend);
 
-       return ol;
-}
+int elf_write_insn(struct elf *elf, struct section *sec,
+                  unsigned long offset, unsigned int len,
+                  const char *insn);
+int elf_write(struct elf *elf);
+void elf_close(struct elf *elf);
 
-static inline u32 reloc_hash(struct reloc *reloc)
-{
-       return sec_offset_hash(reloc->sec, reloc->offset);
-}
+struct section *find_section_by_name(const struct elf *elf, const char *name);
+struct symbol *find_func_by_offset(struct section *sec, unsigned long offset);
+struct symbol *find_symbol_by_offset(struct section *sec, unsigned long offset);
+struct symbol *find_symbol_by_name(const struct elf *elf, const char *name);
+struct symbol *find_symbol_containing(const struct section *sec, unsigned long offset);
+int find_symbol_hole_containing(const struct section *sec, unsigned long offset);
+struct reloc *find_reloc_by_dest(const struct elf *elf, struct section *sec, unsigned long offset);
+struct reloc *find_reloc_by_dest_range(const struct elf *elf, struct section *sec,
+                                    unsigned long offset, unsigned int len);
+struct symbol *find_func_containing(struct section *sec, unsigned long offset);
 
 /*
  * Try to see if it's a whole archive (vmlinux.o or module).
@@ -148,42 +152,147 @@ static inline bool has_multiple_files(struct elf *elf)
        return elf->num_files > 1;
 }
 
-static inline int elf_class_addrsize(struct elf *elf)
+static inline size_t elf_addr_size(struct elf *elf)
 {
-       if (elf->ehdr.e_ident[EI_CLASS] == ELFCLASS32)
-               return sizeof(u32);
-       else
-               return sizeof(u64);
+       return elf->ehdr.e_ident[EI_CLASS] == ELFCLASS32 ? 4 : 8;
 }
 
-struct elf *elf_open_read(const char *name, int flags);
-struct section *elf_create_section(struct elf *elf, const char *name, unsigned int sh_flags, size_t entsize, int nr);
+static inline size_t elf_rela_size(struct elf *elf)
+{
+       return elf_addr_size(elf) == 4 ? sizeof(Elf32_Rela) : sizeof(Elf64_Rela);
+}
 
-struct symbol *elf_create_prefix_symbol(struct elf *elf, struct symbol *orig, long size);
+static inline unsigned int elf_data_rela_type(struct elf *elf)
+{
+       return elf_addr_size(elf) == 4 ? R_DATA32 : R_DATA64;
+}
 
-int elf_add_reloc(struct elf *elf, struct section *sec, unsigned long offset,
-                 unsigned int type, struct symbol *sym, s64 addend);
-int elf_add_reloc_to_insn(struct elf *elf, struct section *sec,
-                         unsigned long offset, unsigned int type,
-                         struct section *insn_sec, unsigned long insn_off);
+static inline unsigned int elf_text_rela_type(struct elf *elf)
+{
+       return elf_addr_size(elf) == 4 ? R_TEXT32 : R_TEXT64;
+}
 
-int elf_write_insn(struct elf *elf, struct section *sec,
-                  unsigned long offset, unsigned int len,
-                  const char *insn);
-int elf_write_reloc(struct elf *elf, struct reloc *reloc);
-int elf_write(struct elf *elf);
-void elf_close(struct elf *elf);
+static inline bool is_reloc_sec(struct section *sec)
+{
+       return sec->sh.sh_type == SHT_RELA || sec->sh.sh_type == SHT_REL;
+}
 
-struct section *find_section_by_name(const struct elf *elf, const char *name);
-struct symbol *find_func_by_offset(struct section *sec, unsigned long offset);
-struct symbol *find_symbol_by_offset(struct section *sec, unsigned long offset);
-struct symbol *find_symbol_by_name(const struct elf *elf, const char *name);
-struct symbol *find_symbol_containing(const struct section *sec, unsigned long offset);
-int find_symbol_hole_containing(const struct section *sec, unsigned long offset);
-struct reloc *find_reloc_by_dest(const struct elf *elf, struct section *sec, unsigned long offset);
-struct reloc *find_reloc_by_dest_range(const struct elf *elf, struct section *sec,
-                                    unsigned long offset, unsigned int len);
-struct symbol *find_func_containing(struct section *sec, unsigned long offset);
+static inline bool sec_changed(struct section *sec)
+{
+       return sec->_changed;
+}
+
+static inline void mark_sec_changed(struct elf *elf, struct section *sec,
+                                   bool changed)
+{
+       sec->_changed = changed;
+       elf->changed |= changed;
+}
+
+static inline unsigned int sec_num_entries(struct section *sec)
+{
+       return sec->sh.sh_size / sec->sh.sh_entsize;
+}
+
+static inline unsigned int reloc_idx(struct reloc *reloc)
+{
+       return reloc - reloc->sec->relocs;
+}
+
+static inline void *reloc_rel(struct reloc *reloc)
+{
+       struct section *rsec = reloc->sec;
+
+       return rsec->data->d_buf + (reloc_idx(reloc) * rsec->sh.sh_entsize);
+}
+
+static inline bool is_32bit_reloc(struct reloc *reloc)
+{
+       /*
+        * Elf32_Rel:   8 bytes
+        * Elf32_Rela: 12 bytes
+        * Elf64_Rel:  16 bytes
+        * Elf64_Rela: 24 bytes
+        */
+       return reloc->sec->sh.sh_entsize < 16;
+}
+
+#define __get_reloc_field(reloc, field)                                        \
+({                                                                     \
+       is_32bit_reloc(reloc) ?                                         \
+               ((Elf32_Rela *)reloc_rel(reloc))->field :               \
+               ((Elf64_Rela *)reloc_rel(reloc))->field;                \
+})
+
+#define __set_reloc_field(reloc, field, val)                           \
+({                                                                     \
+       if (is_32bit_reloc(reloc))                                      \
+               ((Elf32_Rela *)reloc_rel(reloc))->field = val;          \
+       else                                                            \
+               ((Elf64_Rela *)reloc_rel(reloc))->field = val;          \
+})
+
+static inline u64 reloc_offset(struct reloc *reloc)
+{
+       return __get_reloc_field(reloc, r_offset);
+}
+
+static inline void set_reloc_offset(struct elf *elf, struct reloc *reloc, u64 offset)
+{
+       __set_reloc_field(reloc, r_offset, offset);
+       mark_sec_changed(elf, reloc->sec, true);
+}
+
+static inline s64 reloc_addend(struct reloc *reloc)
+{
+       return __get_reloc_field(reloc, r_addend);
+}
+
+static inline void set_reloc_addend(struct elf *elf, struct reloc *reloc, s64 addend)
+{
+       __set_reloc_field(reloc, r_addend, addend);
+       mark_sec_changed(elf, reloc->sec, true);
+}
+
+
+static inline unsigned int reloc_sym(struct reloc *reloc)
+{
+       u64 info = __get_reloc_field(reloc, r_info);
+
+       return is_32bit_reloc(reloc) ?
+               ELF32_R_SYM(info) :
+               ELF64_R_SYM(info);
+}
+
+static inline unsigned int reloc_type(struct reloc *reloc)
+{
+       u64 info = __get_reloc_field(reloc, r_info);
+
+       return is_32bit_reloc(reloc) ?
+               ELF32_R_TYPE(info) :
+               ELF64_R_TYPE(info);
+}
+
+static inline void set_reloc_sym(struct elf *elf, struct reloc *reloc, unsigned int sym)
+{
+       u64 info = is_32bit_reloc(reloc) ?
+               ELF32_R_INFO(sym, reloc_type(reloc)) :
+               ELF64_R_INFO(sym, reloc_type(reloc));
+
+       __set_reloc_field(reloc, r_info, info);
+
+       mark_sec_changed(elf, reloc->sec, true);
+}
+static inline void set_reloc_type(struct elf *elf, struct reloc *reloc, unsigned int type)
+{
+       u64 info = is_32bit_reloc(reloc) ?
+               ELF32_R_INFO(reloc_sym(reloc), type) :
+               ELF64_R_INFO(reloc_sym(reloc), type);
+
+       __set_reloc_field(reloc, r_info, info);
+
+       mark_sec_changed(elf, reloc->sec, true);
+}
 
 #define for_each_sec(file, sec)                                                \
        list_for_each_entry(sec, &file->elf->sections, list)
@@ -197,4 +306,44 @@ struct symbol *find_func_containing(struct section *sec, unsigned long offset);
                for_each_sec(file, __sec)                               \
                        sec_for_each_sym(__sec, sym)
 
+#define for_each_reloc(rsec, reloc)                                    \
+       for (int __i = 0, __fake = 1; __fake; __fake = 0)               \
+               for (reloc = rsec->relocs;                              \
+                    __i < sec_num_entries(rsec);                       \
+                    __i++, reloc++)
+
+#define for_each_reloc_from(rsec, reloc)                               \
+       for (int __i = reloc_idx(reloc);                                \
+            __i < sec_num_entries(rsec);                               \
+            __i++, reloc++)
+
+#define OFFSET_STRIDE_BITS     4
+#define OFFSET_STRIDE          (1UL << OFFSET_STRIDE_BITS)
+#define OFFSET_STRIDE_MASK     (~(OFFSET_STRIDE - 1))
+
+#define for_offset_range(_offset, _start, _end)                        \
+       for (_offset = ((_start) & OFFSET_STRIDE_MASK);         \
+            _offset >= ((_start) & OFFSET_STRIDE_MASK) &&      \
+            _offset <= ((_end) & OFFSET_STRIDE_MASK);          \
+            _offset += OFFSET_STRIDE)
+
+static inline u32 sec_offset_hash(struct section *sec, unsigned long offset)
+{
+       u32 ol, oh, idx = sec->idx;
+
+       offset &= OFFSET_STRIDE_MASK;
+
+       ol = offset;
+       oh = (offset >> 16) >> 16;
+
+       __jhash_mix(ol, oh, idx);
+
+       return ol;
+}
+
+static inline u32 reloc_hash(struct reloc *reloc)
+{
+       return sec_offset_hash(reloc->sec, reloc_offset(reloc));
+}
+
 #endif /* _OBJTOOL_ELF_H */
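
The helpers above stop caching a GElf_Rela per reloc and instead read and write the raw Elf32_Rela/Elf64_Rela record in the section's data buffer, choosing the layout from sh_entsize. A small self-contained sketch of the same idea, built only on the standard <elf.h> macros; the record contents are made up for illustration:

    #include <elf.h>
    #include <stdio.h>

    /* Elf32_Rel: 8 bytes, Elf32_Rela: 12, Elf64_Rel: 16, Elf64_Rela: 24 */
    static unsigned int rela_sym(const void *rec, size_t entsize)
    {
            if (entsize < 16)
                    return ELF32_R_SYM(((const Elf32_Rela *)rec)->r_info);
            return ELF64_R_SYM(((const Elf64_Rela *)rec)->r_info);
    }

    static unsigned int rela_type(const void *rec, size_t entsize)
    {
            if (entsize < 16)
                    return ELF32_R_TYPE(((const Elf32_Rela *)rec)->r_info);
            return ELF64_R_TYPE(((const Elf64_Rela *)rec)->r_info);
    }

    int main(void)
    {
            Elf64_Rela rela = {
                    .r_offset = 0x40,
                    .r_info   = ELF64_R_INFO(7, R_X86_64_PC32),
                    .r_addend = -4,
            };

            /* prints "sym=7 type=2" (R_X86_64_PC32 == 2) */
            printf("sym=%u type=%u\n",
                   rela_sym(&rela, sizeof(rela)),
                   rela_type(&rela, sizeof(rela)));
            return 0;
    }
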
index b1c920d..ac04d3f 100644 (file)
@@ -55,15 +55,22 @@ static inline char *offstr(struct section *sec, unsigned long offset)
 
 #define WARN_INSN(insn, format, ...)                                   \
 ({                                                                     \
-       WARN_FUNC(format, insn->sec, insn->offset,  ##__VA_ARGS__);     \
+       struct instruction *_insn = (insn);                             \
+       if (!_insn->sym || !_insn->sym->warned)                         \
+               WARN_FUNC(format, _insn->sec, _insn->offset,            \
+                         ##__VA_ARGS__);                               \
+       if (_insn->sym)                                                 \
+               _insn->sym->warned = 1;                                 \
 })
 
-#define BT_FUNC(format, insn, ...)                     \
-({                                                     \
-       struct instruction *_insn = (insn);             \
-       char *_str = offstr(_insn->sec, _insn->offset); \
-       WARN("  %s: " format, _str, ##__VA_ARGS__);     \
-       free(_str);                                     \
+#define BT_INSN(insn, format, ...)                             \
+({                                                             \
+       if (opts.verbose || opts.backtrace) {                   \
+               struct instruction *_insn = (insn);             \
+               char *_str = offstr(_insn->sec, _insn->offset); \
+               WARN("  %s: " format, _str, ##__VA_ARGS__);     \
+               free(_str);                                     \
+       }                                                       \
 })
 
 #define WARN_ELF(format, ...)                          \
diff --git a/tools/objtool/noreturns.h b/tools/objtool/noreturns.h
new file mode 100644 (file)
index 0000000..1514e84
--- /dev/null
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * This is a (sorted!) list of all known __noreturn functions in the kernel.
+ * It's needed for objtool to properly reverse-engineer the control flow graph.
+ *
+ * Yes, this is unfortunate.  A better solution is in the works.
+ */
+NORETURN(__invalid_creds)
+NORETURN(__kunit_abort)
+NORETURN(__module_put_and_kthread_exit)
+NORETURN(__reiserfs_panic)
+NORETURN(__stack_chk_fail)
+NORETURN(__ubsan_handle_builtin_unreachable)
+NORETURN(arch_call_rest_init)
+NORETURN(arch_cpu_idle_dead)
+NORETURN(btrfs_assertfail)
+NORETURN(cpu_bringup_and_idle)
+NORETURN(cpu_startup_entry)
+NORETURN(do_exit)
+NORETURN(do_group_exit)
+NORETURN(do_task_dead)
+NORETURN(ex_handler_msr_mce)
+NORETURN(fortify_panic)
+NORETURN(hlt_play_dead)
+NORETURN(hv_ghcb_terminate)
+NORETURN(kthread_complete_and_exit)
+NORETURN(kthread_exit)
+NORETURN(kunit_try_catch_throw)
+NORETURN(machine_real_restart)
+NORETURN(make_task_dead)
+NORETURN(mpt_halt_firmware)
+NORETURN(nmi_panic_self_stop)
+NORETURN(panic)
+NORETURN(panic_smp_self_stop)
+NORETURN(rest_init)
+NORETURN(rewind_stack_and_make_dead)
+NORETURN(sev_es_terminate)
+NORETURN(snp_abort)
+NORETURN(start_kernel)
+NORETURN(stop_this_cpu)
+NORETURN(usercopy_abort)
+NORETURN(x86_64_start_kernel)
+NORETURN(x86_64_start_reservations)
+NORETURN(xen_cpu_bringup_again)
+NORETURN(xen_start_kernel)
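
The list above is meant to be pulled in through the NORETURN() x-macro; the consuming check.c hunk is not part of this excerpt, so the sketch below only illustrates the general pattern with two hard-coded entries rather than the real include:

    #include <stdio.h>
    #include <string.h>

    /* Illustrative consumer: expand each NORETURN(func) into a string. */
    static const char * const global_noreturns[] = {
    #define NORETURN(func) #func,
            NORETURN(do_exit)
            NORETURN(panic)
            /* objtool would instead do: #include "noreturns.h" */
    #undef NORETURN
    };

    static int is_noreturn(const char *name)
    {
            for (unsigned int i = 0;
                 i < sizeof(global_noreturns) / sizeof(global_noreturns[0]); i++)
                    if (!strcmp(global_noreturns[i], name))
                            return 1;
            return 0;
    }

    int main(void)
    {
            /* prints "panic: 1, printf: 0" */
            printf("panic: %d, printf: %d\n",
                   is_noreturn("panic"), is_noreturn("printf"));
            return 0;
    }
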
index 48efd1e..bae3439 100644 (file)
@@ -118,8 +118,8 @@ static int write_orc_entry(struct elf *elf, struct section *orc_sec,
        orc->bp_offset = bswap_if_needed(elf, orc->bp_offset);
 
        /* populate reloc for ip */
-       if (elf_add_reloc_to_insn(elf, ip_sec, idx * sizeof(int), R_X86_64_PC32,
-                                 insn_sec, insn_off))
+       if (!elf_init_reloc_text_sym(elf, ip_sec, idx * sizeof(int), idx,
+                                    insn_sec, insn_off))
                return -1;
 
        return 0;
@@ -237,12 +237,12 @@ int orc_create(struct objtool_file *file)
                WARN("file already has .orc_unwind section, skipping");
                return -1;
        }
-       orc_sec = elf_create_section(file->elf, ".orc_unwind", 0,
+       orc_sec = elf_create_section(file->elf, ".orc_unwind",
                                     sizeof(struct orc_entry), nr);
        if (!orc_sec)
                return -1;
 
-       sec = elf_create_section(file->elf, ".orc_unwind_ip", 0, sizeof(int), nr);
+       sec = elf_create_section_pair(file->elf, ".orc_unwind_ip", sizeof(int), nr, nr);
        if (!sec)
                return -1;
 
index baa85c3..91b1950 100644 (file)
@@ -62,7 +62,7 @@ static void reloc_to_sec_off(struct reloc *reloc, struct section **sec,
                             unsigned long *off)
 {
        *sec = reloc->sym->sec;
-       *off = reloc->sym->offset + reloc->addend;
+       *off = reloc->sym->offset + reloc_addend(reloc);
 }
 
 static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
@@ -126,7 +126,7 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
                                  sec, offset + entry->key);
                        return -1;
                }
-               alt->key_addend = key_reloc->addend;
+               alt->key_addend = reloc_addend(key_reloc);
        }
 
        return 0;
index 902e9ea..93d3b88 100644 (file)
@@ -11,6 +11,7 @@ int test__intel_pt_pkt_decoder(struct test_suite *test, int subtest);
 int test__intel_pt_hybrid_compat(struct test_suite *test, int subtest);
 int test__bp_modify(struct test_suite *test, int subtest);
 int test__x86_sample_parsing(struct test_suite *test, int subtest);
+int test__amd_ibs_via_core_pmu(struct test_suite *test, int subtest);
 
 extern struct test_suite *arch_tests[];
 
index 6f4e863..fd02d81 100644 (file)
@@ -5,3 +5,4 @@ perf-y += arch-tests.o
 perf-y += sample-parsing.o
 perf-$(CONFIG_AUXTRACE) += insn-x86.o intel-pt-test.o
 perf-$(CONFIG_X86_64) += bp-modify.o
+perf-y += amd-ibs-via-core-pmu.o
diff --git a/tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c b/tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c
new file mode 100644 (file)
index 0000000..2902798
--- /dev/null
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "arch-tests.h"
+#include "linux/perf_event.h"
+#include "tests/tests.h"
+#include "pmu.h"
+#include "pmus.h"
+#include "../perf-sys.h"
+#include "debug.h"
+
+#define NR_SUB_TESTS 5
+
+static struct sub_tests {
+       int type;
+       unsigned long config;
+       bool valid;
+} sub_tests[NR_SUB_TESTS] = {
+       { PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, true },
+       { PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, false },
+       { PERF_TYPE_RAW, 0x076, true },
+       { PERF_TYPE_RAW, 0x0C1, true },
+       { PERF_TYPE_RAW, 0x012, false },
+};
+
+static int event_open(int type, unsigned long config)
+{
+       struct perf_event_attr attr;
+
+       memset(&attr, 0, sizeof(struct perf_event_attr));
+       attr.type = type;
+       attr.size = sizeof(struct perf_event_attr);
+       attr.config = config;
+       attr.disabled = 1;
+       attr.precise_ip = 1;
+       attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
+       attr.sample_period = 100000;
+
+       return sys_perf_event_open(&attr, -1, 0, -1, 0);
+}
+
+int test__amd_ibs_via_core_pmu(struct test_suite *test __maybe_unused,
+                              int subtest __maybe_unused)
+{
+       struct perf_pmu *ibs_pmu;
+       int ret = TEST_OK;
+       int fd, i;
+
+       if (list_empty(&pmus))
+               perf_pmu__scan(NULL);
+
+       ibs_pmu = perf_pmu__find("ibs_op");
+       if (!ibs_pmu)
+               return TEST_SKIP;
+
+       for (i = 0; i < NR_SUB_TESTS; i++) {
+               fd = event_open(sub_tests[i].type, sub_tests[i].config);
+               pr_debug("type: 0x%x, config: 0x%lx, fd: %d  -  ", sub_tests[i].type,
+                        sub_tests[i].config, fd);
+               if ((sub_tests[i].valid && fd == -1) ||
+                   (!sub_tests[i].valid && fd > 0)) {
+                       pr_debug("Fail\n");
+                       ret = TEST_FAIL;
+               } else {
+                       pr_debug("Pass\n");
+               }
+
+               if (fd > 0)
+                       close(fd);
+       }
+
+       return ret;
+}
index aae6ea0..b5c85ab 100644 (file)
@@ -22,6 +22,7 @@ struct test_suite suite__intel_pt = {
 DEFINE_SUITE("x86 bp modify", bp_modify);
 #endif
 DEFINE_SUITE("x86 Sample parsing", x86_sample_parsing);
+DEFINE_SUITE("AMD IBS via core pmu", amd_ibs_via_core_pmu);
 
 struct test_suite *arch_tests[] = {
 #ifdef HAVE_DWARF_UNWIND_SUPPORT
@@ -35,5 +36,6 @@ struct test_suite *arch_tests[] = {
        &suite__bp_modify,
 #endif
        &suite__x86_sample_parsing,
+       &suite__amd_ibs_via_core_pmu,
        NULL,
 };
index f3918f2..ba80707 100644 (file)
@@ -51,7 +51,7 @@ static u64 arm_spe_calc_ip(int index, u64 payload)
                 * (bits [63:56]) is assigned as top-byte tag; so we only can
                 * retrieve address value from bits [55:0].
                 *
-                * According to Documentation/arm64/memory.rst, if detects the
+                * According to Documentation/arch/arm64/memory.rst, if detects the
                 * specific pattern in bits [55:52] of payload which falls in
                 * the kernel space, should fixup the top byte and this allows
                 * perf tool to parse DSO symbol for data address correctly.
index b0ca44c..9179942 100644 (file)
@@ -172,28 +172,37 @@ static void transfer(int fd, uint8_t const *tx, uint8_t const *rx, size_t len)
 
 static void print_usage(const char *prog)
 {
-       printf("Usage: %s [-DsbdlHOLC3vpNR24SI]\n", prog);
-       puts("  -D --device   device to use (default /dev/spidev1.1)\n"
-            "  -s --speed    max speed (Hz)\n"
-            "  -d --delay    delay (usec)\n"
-            "  -b --bpw      bits per word\n"
-            "  -i --input    input data from a file (e.g. \"test.bin\")\n"
-            "  -o --output   output data to a file (e.g. \"results.bin\")\n"
-            "  -l --loop     loopback\n"
-            "  -H --cpha     clock phase\n"
-            "  -O --cpol     clock polarity\n"
-            "  -L --lsb      least significant bit first\n"
-            "  -C --cs-high  chip select active high\n"
-            "  -3 --3wire    SI/SO signals shared\n"
-            "  -v --verbose  Verbose (show tx buffer)\n"
-            "  -p            Send data (e.g. \"1234\\xde\\xad\")\n"
-            "  -N --no-cs    no chip select\n"
-            "  -R --ready    slave pulls low to pause\n"
-            "  -2 --dual     dual transfer\n"
-            "  -4 --quad     quad transfer\n"
-            "  -8 --octal    octal transfer\n"
-            "  -S --size     transfer size\n"
-            "  -I --iter     iterations\n");
+       printf("Usage: %s [-2348CDFHILMNORSZbdilopsv]\n", prog);
+       puts("general device settings:\n"
+                "  -D --device         device to use (default /dev/spidev1.1)\n"
+                "  -s --speed          max speed (Hz)\n"
+                "  -d --delay          delay (usec)\n"
+                "  -l --loop           loopback\n"
+                "spi mode:\n"
+                "  -H --cpha           clock phase\n"
+                "  -O --cpol           clock polarity\n"
+                "  -F --rx-cpha-flip   flip CPHA on Rx only xfer\n"
+                "number of wires for transmission:\n"
+                "  -2 --dual           dual transfer\n"
+                "  -4 --quad           quad transfer\n"
+                "  -8 --octal          octal transfer\n"
+                "  -3 --3wire          SI/SO signals shared\n"
+                "  -Z --3wire-hiz      high impedance turnaround\n"
+                "data:\n"
+                "  -i --input          input data from a file (e.g. \"test.bin\")\n"
+                "  -o --output         output data to a file (e.g. \"results.bin\")\n"
+                "  -p                  Send data (e.g. \"1234\\xde\\xad\")\n"
+                "  -S --size           transfer size\n"
+                "  -I --iter           iterations\n"
+                "additional parameters:\n"
+                "  -b --bpw            bits per word\n"
+                "  -L --lsb            least significant bit first\n"
+                "  -C --cs-high        chip select active high\n"
+                "  -N --no-cs          no chip select\n"
+                "  -R --ready          slave pulls low to pause\n"
+                "  -M --mosi-idle-low  leave mosi line low when idle\n"
+                "misc:\n"
+                "  -v --verbose        Verbose (show tx buffer)\n");
        exit(1);
 }
 
@@ -201,31 +210,34 @@ static void parse_opts(int argc, char *argv[])
 {
        while (1) {
                static const struct option lopts[] = {
-                       { "device",  1, 0, 'D' },
-                       { "speed",   1, 0, 's' },
-                       { "delay",   1, 0, 'd' },
-                       { "bpw",     1, 0, 'b' },
-                       { "input",   1, 0, 'i' },
-                       { "output",  1, 0, 'o' },
-                       { "loop",    0, 0, 'l' },
-                       { "cpha",    0, 0, 'H' },
-                       { "cpol",    0, 0, 'O' },
-                       { "lsb",     0, 0, 'L' },
-                       { "cs-high", 0, 0, 'C' },
-                       { "3wire",   0, 0, '3' },
-                       { "no-cs",   0, 0, 'N' },
-                       { "ready",   0, 0, 'R' },
-                       { "dual",    0, 0, '2' },
-                       { "verbose", 0, 0, 'v' },
-                       { "quad",    0, 0, '4' },
-                       { "octal",   0, 0, '8' },
-                       { "size",    1, 0, 'S' },
-                       { "iter",    1, 0, 'I' },
+                       { "device",        1, 0, 'D' },
+                       { "speed",         1, 0, 's' },
+                       { "delay",         1, 0, 'd' },
+                       { "loop",          0, 0, 'l' },
+                       { "cpha",          0, 0, 'H' },
+                       { "cpol",          0, 0, 'O' },
+                       { "rx-cpha-flip",  0, 0, 'F' },
+                       { "dual",          0, 0, '2' },
+                       { "quad",          0, 0, '4' },
+                       { "octal",         0, 0, '8' },
+                       { "3wire",         0, 0, '3' },
+                       { "3wire-hiz",     0, 0, 'Z' },
+                       { "input",         1, 0, 'i' },
+                       { "output",        1, 0, 'o' },
+                       { "size",          1, 0, 'S' },
+                       { "iter",          1, 0, 'I' },
+                       { "bpw",           1, 0, 'b' },
+                       { "lsb",           0, 0, 'L' },
+                       { "cs-high",       0, 0, 'C' },
+                       { "no-cs",         0, 0, 'N' },
+                       { "ready",         0, 0, 'R' },
+                       { "mosi-idle-low", 0, 0, 'M' },
+                       { "verbose",       0, 0, 'v' },
                        { NULL, 0, 0, 0 },
                };
                int c;
 
-               c = getopt_long(argc, argv, "D:s:d:b:i:o:lHOLC3NR248p:vS:I:",
+               c = getopt_long(argc, argv, "D:s:d:b:i:o:lHOLC3ZFMNR248p:vS:I:",
                                lopts, NULL);
 
                if (c == -1)
@@ -268,6 +280,15 @@ static void parse_opts(int argc, char *argv[])
                case '3':
                        mode |= SPI_3WIRE;
                        break;
+               case 'Z':
+                       mode |= SPI_3WIRE_HIZ;
+                       break;
+               case 'F':
+                       mode |= SPI_RX_CPHA_FLIP;
+                       break;
+               case 'M':
+                       mode |= SPI_MOSI_IDLE_LOW;
+                       break;
                case 'N':
                        mode |= SPI_NO_CS;
                        break;
index f990cbb..0393940 100644 (file)
@@ -9,6 +9,8 @@ CONFIG_KUNIT=y
 CONFIG_KUNIT_EXAMPLE_TEST=y
 CONFIG_KUNIT_ALL_TESTS=y
 
+CONFIG_FORTIFY_SOURCE=y
+
 CONFIG_IIO=y
 
 CONFIG_EXT4_FS=y
index e824ce4..54ad897 100644 (file)
@@ -3,3 +3,6 @@
 # Enable virtio/pci, as a lot of tests require it.
 CONFIG_VIRTIO_UML=y
 CONFIG_UML_PCI_OVER_VIRTIO=y
+
+# Enable FORTIFY_SOURCE for wider checking.
+CONFIG_FORTIFY_SOURCE=y
index f01f941..7f64880 100644 (file)
@@ -92,7 +92,7 @@ class LinuxSourceTreeOperations:
                if stderr:  # likely only due to build warnings
                        print(stderr.decode())
 
-       def start(self, params: List[str], build_dir: str) -> subprocess.Popen[str]:
+       def start(self, params: List[str], build_dir: str) -> subprocess.Popen:
                raise RuntimeError('not implemented!')
 
 
@@ -113,7 +113,7 @@ class LinuxSourceTreeOperationsQemu(LinuxSourceTreeOperations):
                kconfig.merge_in_entries(base_kunitconfig)
                return kconfig
 
-       def start(self, params: List[str], build_dir: str) -> subprocess.Popen[str]:
+       def start(self, params: List[str], build_dir: str) -> subprocess.Popen:
                kernel_path = os.path.join(build_dir, self._kernel_path)
                qemu_command = ['qemu-system-' + self._qemu_arch,
                                '-nodefaults',
@@ -142,7 +142,7 @@ class LinuxSourceTreeOperationsUml(LinuxSourceTreeOperations):
                kconfig.merge_in_entries(base_kunitconfig)
                return kconfig
 
-       def start(self, params: List[str], build_dir: str) -> subprocess.Popen[str]:
+       def start(self, params: List[str], build_dir: str) -> subprocess.Popen:
                """Runs the Linux UML binary. Must be named 'linux'."""
                linux_bin = os.path.join(build_dir, 'linux')
                params.extend(['mem=1G', 'console=tty', 'kunit_shutdown=halt'])
diff --git a/tools/testing/kunit/mypy.ini b/tools/testing/kunit/mypy.ini
new file mode 100644 (file)
index 0000000..ddd2883
--- /dev/null
@@ -0,0 +1,6 @@
+[mypy]
+strict = True
+
+# E.g. we can't write subprocess.Popen[str] until Python 3.9+.
+# But kunit.py tries to support Python 3.7+, so let's disable it.
+disable_error_code = type-arg
index 8208c3b..c6d494e 100755 (executable)
@@ -23,7 +23,7 @@ commands: Dict[str, Sequence[str]] = {
        'kunit_tool_test.py': ['./kunit_tool_test.py'],
        'kunit smoke test': ['./kunit.py', 'run', '--kunitconfig=lib/kunit', '--build_dir=kunit_run_checks'],
        'pytype': ['/bin/sh', '-c', 'pytype *.py'],
-       'mypy': ['mypy', '--strict', '--exclude', '_test.py$', '--exclude', 'qemu_configs/', '.'],
+       'mypy': ['mypy', '--config-file', 'mypy.ini', '--exclude', '_test.py$', '--exclude', 'qemu_configs/', '.'],
 }
 
 # The user might not have mypy or pytype installed, skip them if so.
index 9286d3b..03539d8 100644 (file)
@@ -14,6 +14,7 @@
 #include "test.h"
 #include <stdlib.h>
 #include <time.h>
+#include "linux/init.h"
 
 #define module_init(x)
 #define module_exit(x)
@@ -22,7 +23,6 @@
 #define dump_stack()   assert(0)
 
 #include "../../../lib/maple_tree.c"
-#undef CONFIG_DEBUG_MAPLE_TREE
 #include "../../../lib/test_maple_tree.c"
 
 #define RCU_RANGE_COUNT 1000
@@ -81,7 +81,7 @@ static void check_mas_alloc_node_count(struct ma_state *mas)
  * check_new_node() - Check the creation of new nodes and error path
  * verification.
  */
-static noinline void check_new_node(struct maple_tree *mt)
+static noinline void __init check_new_node(struct maple_tree *mt)
 {
 
        struct maple_node *mn, *mn2, *mn3;
@@ -455,7 +455,7 @@ static noinline void check_new_node(struct maple_tree *mt)
 /*
  * Check erasing including RCU.
  */
-static noinline void check_erase(struct maple_tree *mt, unsigned long index,
+static noinline void __init check_erase(struct maple_tree *mt, unsigned long index,
                void *ptr)
 {
        MT_BUG_ON(mt, mtree_test_erase(mt, index) != ptr);
@@ -465,24 +465,24 @@ static noinline void check_erase(struct maple_tree *mt, unsigned long index,
 #define erase_check_insert(mt, i) check_insert(mt, set[i], entry[i%2])
 #define erase_check_erase(mt, i) check_erase(mt, set[i], entry[i%2])
 
-static noinline void check_erase_testset(struct maple_tree *mt)
+static noinline void __init check_erase_testset(struct maple_tree *mt)
 {
-       unsigned long set[] = { 5015, 5014, 5017, 25, 1000,
-                               1001, 1002, 1003, 1005, 0,
-                               6003, 6002, 6008, 6012, 6015,
-                               7003, 7002, 7008, 7012, 7015,
-                               8003, 8002, 8008, 8012, 8015,
-                               9003, 9002, 9008, 9012, 9015,
-                               10003, 10002, 10008, 10012, 10015,
-                               11003, 11002, 11008, 11012, 11015,
-                               12003, 12002, 12008, 12012, 12015,
-                               13003, 13002, 13008, 13012, 13015,
-                               14003, 14002, 14008, 14012, 14015,
-                               15003, 15002, 15008, 15012, 15015,
-                             };
-
-
-       void *ptr = &set;
+       static const unsigned long set[] = { 5015, 5014, 5017, 25, 1000,
+                                            1001, 1002, 1003, 1005, 0,
+                                            6003, 6002, 6008, 6012, 6015,
+                                            7003, 7002, 7008, 7012, 7015,
+                                            8003, 8002, 8008, 8012, 8015,
+                                            9003, 9002, 9008, 9012, 9015,
+                                            10003, 10002, 10008, 10012, 10015,
+                                            11003, 11002, 11008, 11012, 11015,
+                                            12003, 12002, 12008, 12012, 12015,
+                                            13003, 13002, 13008, 13012, 13015,
+                                            14003, 14002, 14008, 14012, 14015,
+                                            15003, 15002, 15008, 15012, 15015,
+                                          };
+
+
+       void *ptr = &check_erase_testset;
        void *entry[2] = { ptr, mt };
        void *root_node;
 
@@ -739,7 +739,7 @@ static noinline void check_erase_testset(struct maple_tree *mt)
 int mas_ce2_over_count(struct ma_state *mas_start, struct ma_state *mas_end,
                      void *s_entry, unsigned long s_min,
                      void *e_entry, unsigned long e_max,
-                     unsigned long *set, int i, bool null_entry)
+                     const unsigned long *set, int i, bool null_entry)
 {
        int count = 0, span = 0;
        unsigned long retry = 0;
@@ -969,8 +969,8 @@ retry:
 }
 
 #if defined(CONFIG_64BIT)
-static noinline void check_erase2_testset(struct maple_tree *mt,
-               unsigned long *set, unsigned long size)
+static noinline void __init check_erase2_testset(struct maple_tree *mt,
+               const unsigned long *set, unsigned long size)
 {
        int entry_count = 0;
        int check = 0;
@@ -1054,7 +1054,7 @@ static noinline void check_erase2_testset(struct maple_tree *mt,
                if (entry_count)
                        MT_BUG_ON(mt, !mt_height(mt));
 #if check_erase2_debug > 1
-               mt_dump(mt);
+               mt_dump(mt, mt_dump_hex);
 #endif
 #if check_erase2_debug
                pr_err("Done\n");
@@ -1085,7 +1085,7 @@ static noinline void check_erase2_testset(struct maple_tree *mt,
                mas_for_each(&mas, foo, ULONG_MAX) {
                        if (xa_is_zero(foo)) {
                                if (addr == mas.index) {
-                                       mt_dump(mas.tree);
+                                       mt_dump(mas.tree, mt_dump_hex);
                                        pr_err("retry failed %lu - %lu\n",
                                                mas.index, mas.last);
                                        MT_BUG_ON(mt, 1);
@@ -1114,11 +1114,11 @@ static noinline void check_erase2_testset(struct maple_tree *mt,
 
 
 /* These tests were pulled from KVM tree modifications which failed. */
-static noinline void check_erase2_sets(struct maple_tree *mt)
+static noinline void __init check_erase2_sets(struct maple_tree *mt)
 {
        void *entry;
        unsigned long start = 0;
-       unsigned long set[] = {
+       static const unsigned long set[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140721266458624, 140737488351231,
 ERASE, 140721266458624, 140737488351231,
@@ -1136,7 +1136,7 @@ ERASE, 140253902692352, 140253902864383,
 STORE, 140253902692352, 140253902696447,
 STORE, 140253902696448, 140253902864383,
                };
-       unsigned long set2[] = {
+       static const unsigned long set2[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140735933583360, 140737488351231,
 ERASE, 140735933583360, 140737488351231,
@@ -1160,7 +1160,7 @@ STORE, 140277094813696, 140277094821887,
 STORE, 140277094821888, 140277094825983,
 STORE, 140735933906944, 140735933911039,
        };
-       unsigned long set3[] = {
+       static const unsigned long set3[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140735790264320, 140737488351231,
 ERASE, 140735790264320, 140737488351231,
@@ -1203,7 +1203,7 @@ STORE, 47135835840512, 47135835885567,
 STORE, 47135835885568, 47135835893759,
        };
 
-       unsigned long set4[] = {
+       static const unsigned long set4[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140728251703296, 140737488351231,
 ERASE, 140728251703296, 140737488351231,
@@ -1224,7 +1224,7 @@ ERASE, 47646523277312, 47646523445247,
 STORE, 47646523277312, 47646523400191,
        };
 
-       unsigned long set5[] = {
+       static const unsigned long set5[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140726874062848, 140737488351231,
 ERASE, 140726874062848, 140737488351231,
@@ -1357,7 +1357,7 @@ STORE, 47884791619584, 47884791623679,
 STORE, 47884791623680, 47884791627775,
        };
 
-       unsigned long set6[] = {
+       static const unsigned long set6[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140722999021568, 140737488351231,
 ERASE, 140722999021568, 140737488351231,
@@ -1489,7 +1489,7 @@ ERASE, 47430432014336, 47430432022527,
 STORE, 47430432014336, 47430432018431,
 STORE, 47430432018432, 47430432022527,
        };
-       unsigned long set7[] = {
+       static const unsigned long set7[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140729808330752, 140737488351231,
 ERASE, 140729808330752, 140737488351231,
@@ -1621,7 +1621,7 @@ ERASE, 47439987130368, 47439987138559,
 STORE, 47439987130368, 47439987134463,
 STORE, 47439987134464, 47439987138559,
        };
-       unsigned long set8[] = {
+       static const unsigned long set8[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140722482974720, 140737488351231,
 ERASE, 140722482974720, 140737488351231,
@@ -1754,7 +1754,7 @@ STORE, 47708488638464, 47708488642559,
 STORE, 47708488642560, 47708488646655,
        };
 
-       unsigned long set9[] = {
+       static const unsigned long set9[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140736427839488, 140737488351231,
 ERASE, 140736427839488, 140736427839488,
@@ -5620,7 +5620,7 @@ ERASE, 47906195480576, 47906195480576,
 STORE, 94641242615808, 94641242750975,
        };
 
-       unsigned long set10[] = {
+       static const unsigned long set10[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140736427839488, 140737488351231,
 ERASE, 140736427839488, 140736427839488,
@@ -9484,7 +9484,7 @@ STORE, 139726599680000, 139726599684095,
 ERASE, 47906195480576, 47906195480576,
 STORE, 94641242615808, 94641242750975,
        };
-       unsigned long set11[] = {
+       static const unsigned long set11[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140732658499584, 140737488351231,
 ERASE, 140732658499584, 140732658499584,
@@ -9510,7 +9510,7 @@ STORE, 140732658565120, 140732658569215,
 STORE, 140732658552832, 140732658565119,
        };
 
-       unsigned long set12[] = { /* contains 12 values. */
+       static const unsigned long set12[] = { /* contains 12 values. */
 STORE, 140737488347136, 140737488351231,
 STORE, 140732658499584, 140737488351231,
 ERASE, 140732658499584, 140732658499584,
@@ -9537,7 +9537,7 @@ STORE, 140732658552832, 140732658565119,
 STORE, 140014592741375, 140014592741375, /* contrived */
 STORE, 140014592733184, 140014592741376, /* creates first entry retry. */
        };
-       unsigned long set13[] = {
+       static const unsigned long set13[] = {
 STORE, 140373516247040, 140373516251135,/*: ffffa2e7b0e10d80 */
 STORE, 140373516251136, 140373516255231,/*: ffffa2e7b1195d80 */
 STORE, 140373516255232, 140373516443647,/*: ffffa2e7b0e109c0 */
@@ -9550,7 +9550,7 @@ STORE, 140373518684160, 140373518688254,/*: ffffa2e7b05fec00 */
 STORE, 140373518688256, 140373518692351,/*: ffffa2e7bfbdcd80 */
 STORE, 140373518692352, 140373518696447,/*: ffffa2e7b0749e40 */
        };
-       unsigned long set14[] = {
+       static const unsigned long set14[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140731667996672, 140737488351231,
 SNULL, 140731668000767, 140737488351231,
@@ -9834,7 +9834,7 @@ SNULL, 139826136543232, 139826136809471,
 STORE, 139826136809472, 139826136842239,
 STORE, 139826136543232, 139826136809471,
        };
-       unsigned long set15[] = {
+       static const unsigned long set15[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140722061451264, 140737488351231,
 SNULL, 140722061455359, 140737488351231,
@@ -10119,7 +10119,7 @@ STORE, 139906808958976, 139906808991743,
 STORE, 139906808692736, 139906808958975,
        };
 
-       unsigned long set16[] = {
+       static const unsigned long set16[] = {
 STORE, 94174808662016, 94174809321471,
 STORE, 94174811414528, 94174811426815,
 STORE, 94174811426816, 94174811430911,
@@ -10330,7 +10330,7 @@ STORE, 139921865613312, 139921865617407,
 STORE, 139921865547776, 139921865564159,
        };
 
-       unsigned long set17[] = {
+       static const unsigned long set17[] = {
 STORE, 94397057224704, 94397057646591,
 STORE, 94397057650688, 94397057691647,
 STORE, 94397057691648, 94397057695743,
@@ -10392,7 +10392,7 @@ STORE, 140720477511680, 140720477646847,
 STORE, 140720478302208, 140720478314495,
 STORE, 140720478314496, 140720478318591,
        };
-       unsigned long set18[] = {
+       static const unsigned long set18[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140724953673728, 140737488351231,
 SNULL, 140724953677823, 140737488351231,
@@ -10425,7 +10425,7 @@ STORE, 140222970597376, 140222970605567,
 ERASE, 140222970597376, 140222970605567,
 STORE, 140222970597376, 140222970605567,
        };
-       unsigned long set19[] = {
+       static const unsigned long set19[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140725182459904, 140737488351231,
 SNULL, 140725182463999, 140737488351231,
@@ -10694,7 +10694,7 @@ STORE, 140656836775936, 140656836780031,
 STORE, 140656787476480, 140656791920639,
 ERASE, 140656774639616, 140656779083775,
        };
-       unsigned long set20[] = {
+       static const unsigned long set20[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140735952392192, 140737488351231,
 SNULL, 140735952396287, 140737488351231,
@@ -10850,7 +10850,7 @@ STORE, 140590386819072, 140590386823167,
 STORE, 140590386823168, 140590386827263,
 SNULL, 140590376591359, 140590376595455,
        };
-       unsigned long set21[] = {
+       static const unsigned long set21[] = {
 STORE, 93874710941696, 93874711363583,
 STORE, 93874711367680, 93874711408639,
 STORE, 93874711408640, 93874711412735,
@@ -10920,7 +10920,7 @@ ERASE, 140708393312256, 140708393316351,
 ERASE, 140708393308160, 140708393312255,
 ERASE, 140708393291776, 140708393308159,
        };
-       unsigned long set22[] = {
+       static const unsigned long set22[] = {
 STORE, 93951397134336, 93951397183487,
 STORE, 93951397183488, 93951397728255,
 STORE, 93951397728256, 93951397826559,
@@ -11047,7 +11047,7 @@ STORE, 140551361253376, 140551361519615,
 ERASE, 140551361253376, 140551361519615,
        };
 
-       unsigned long set23[] = {
+       static const unsigned long set23[] = {
 STORE, 94014447943680, 94014448156671,
 STORE, 94014450253824, 94014450257919,
 STORE, 94014450257920, 94014450266111,
@@ -14371,7 +14371,7 @@ SNULL, 140175956627455, 140175985139711,
 STORE, 140175927242752, 140175956627455,
 STORE, 140175956627456, 140175985139711,
        };
-       unsigned long set24[] = {
+       static const unsigned long set24[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140735281639424, 140737488351231,
 SNULL, 140735281643519, 140737488351231,
@@ -15533,7 +15533,7 @@ ERASE, 139635393024000, 139635401412607,
 ERASE, 139635384627200, 139635384631295,
 ERASE, 139635384631296, 139635393019903,
        };
-       unsigned long set25[] = {
+       static const unsigned long set25[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140737488343040, 140737488351231,
 STORE, 140722547441664, 140737488351231,
@@ -22321,7 +22321,7 @@ STORE, 140249652703232, 140249682087935,
 STORE, 140249682087936, 140249710600191,
        };
 
-       unsigned long set26[] = {
+       static const unsigned long set26[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140729464770560, 140737488351231,
 SNULL, 140729464774655, 140737488351231,
@@ -22345,7 +22345,7 @@ ERASE, 140109040951296, 140109040959487,
 STORE, 140109040955392, 140109040959487,
 ERASE, 140109040955392, 140109040959487,
        };
-       unsigned long set27[] = {
+       static const unsigned long set27[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140726128070656, 140737488351231,
 SNULL, 140726128074751, 140737488351231,
@@ -22741,7 +22741,7 @@ STORE, 140415509696512, 140415535910911,
 ERASE, 140415537422336, 140415562588159,
 STORE, 140415482433536, 140415509696511,
        };
-       unsigned long set28[] = {
+       static const unsigned long set28[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140722475622400, 140737488351231,
 SNULL, 140722475626495, 140737488351231,
@@ -22809,7 +22809,7 @@ STORE, 139918413348864, 139918413352959,
 ERASE, 139918413316096, 139918413344767,
 STORE, 93865848528896, 93865848664063,
        };
-       unsigned long set29[] = {
+       static const unsigned long set29[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140734467944448, 140737488351231,
 SNULL, 140734467948543, 140737488351231,
@@ -23684,7 +23684,7 @@ ERASE, 140143079972864, 140143088361471,
 ERASE, 140143205793792, 140143205797887,
 ERASE, 140143205797888, 140143214186495,
        };
-       unsigned long set30[] = {
+       static const unsigned long set30[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140733436743680, 140737488351231,
 SNULL, 140733436747775, 140737488351231,
@@ -24566,7 +24566,7 @@ ERASE, 140165225893888, 140165225897983,
 ERASE, 140165225897984, 140165234286591,
 ERASE, 140165058105344, 140165058109439,
        };
-       unsigned long set31[] = {
+       static const unsigned long set31[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140730890784768, 140737488351231,
 SNULL, 140730890788863, 140737488351231,
@@ -25379,7 +25379,7 @@ ERASE, 140623906590720, 140623914979327,
 ERASE, 140622950277120, 140622950281215,
 ERASE, 140622950281216, 140622958669823,
        };
-       unsigned long set32[] = {
+       static const unsigned long set32[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140731244212224, 140737488351231,
 SNULL, 140731244216319, 140737488351231,
@@ -26175,7 +26175,7 @@ ERASE, 140400417288192, 140400425676799,
 ERASE, 140400283066368, 140400283070463,
 ERASE, 140400283070464, 140400291459071,
        };
-       unsigned long set33[] = {
+       static const unsigned long set33[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140734562918400, 140737488351231,
 SNULL, 140734562922495, 140737488351231,
@@ -26317,7 +26317,7 @@ STORE, 140582961786880, 140583003750399,
 ERASE, 140582961786880, 140583003750399,
        };
 
-       unsigned long set34[] = {
+       static const unsigned long set34[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140731327180800, 140737488351231,
 SNULL, 140731327184895, 140737488351231,
@@ -27198,7 +27198,7 @@ ERASE, 140012522094592, 140012530483199,
 ERASE, 140012033142784, 140012033146879,
 ERASE, 140012033146880, 140012041535487,
        };
-       unsigned long set35[] = {
+       static const unsigned long set35[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140730536939520, 140737488351231,
 SNULL, 140730536943615, 140737488351231,
@@ -27955,7 +27955,7 @@ ERASE, 140474471936000, 140474480324607,
 ERASE, 140474396430336, 140474396434431,
 ERASE, 140474396434432, 140474404823039,
        };
-       unsigned long set36[] = {
+       static const unsigned long set36[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140723893125120, 140737488351231,
 SNULL, 140723893129215, 140737488351231,
@@ -28816,7 +28816,7 @@ ERASE, 140121890357248, 140121898745855,
 ERASE, 140121269587968, 140121269592063,
 ERASE, 140121269592064, 140121277980671,
        };
-       unsigned long set37[] = {
+       static const unsigned long set37[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140722404016128, 140737488351231,
 SNULL, 140722404020223, 140737488351231,
@@ -28942,7 +28942,7 @@ STORE, 139759821246464, 139759888355327,
 ERASE, 139759821246464, 139759888355327,
 ERASE, 139759888355328, 139759955464191,
        };
-       unsigned long set38[] = {
+       static const unsigned long set38[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140730666221568, 140737488351231,
 SNULL, 140730666225663, 140737488351231,
@@ -29752,7 +29752,7 @@ ERASE, 140613504712704, 140613504716799,
 ERASE, 140613504716800, 140613513105407,
        };
 
-       unsigned long set39[] = {
+       static const unsigned long set39[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140736271417344, 140737488351231,
 SNULL, 140736271421439, 140737488351231,
@@ -30124,7 +30124,7 @@ STORE, 140325364428800, 140325372821503,
 STORE, 140325356036096, 140325364428799,
 SNULL, 140325364432895, 140325372821503,
        };
-       unsigned long set40[] = {
+       static const unsigned long set40[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140734309167104, 140737488351231,
 SNULL, 140734309171199, 140737488351231,
@@ -30875,7 +30875,7 @@ ERASE, 140320289300480, 140320289304575,
 ERASE, 140320289304576, 140320297693183,
 ERASE, 140320163409920, 140320163414015,
        };
-       unsigned long set41[] = {
+       static const unsigned long set41[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140728157171712, 140737488351231,
 SNULL, 140728157175807, 140737488351231,
@@ -31185,7 +31185,7 @@ STORE, 94376135090176, 94376135094271,
 STORE, 94376135094272, 94376135098367,
 SNULL, 94376135094272, 94377208836095,
        };
-       unsigned long set42[] = {
+       static const unsigned long set42[] = {
 STORE, 314572800, 1388314623,
 STORE, 1462157312, 1462169599,
 STORE, 1462169600, 1462185983,
@@ -33862,7 +33862,7 @@ SNULL, 3798999040, 3799101439,
  */
        };
 
-       unsigned long set43[] = {
+       static const unsigned long set43[] = {
 STORE, 140737488347136, 140737488351231,
 STORE, 140734187720704, 140737488351231,
 SNULL, 140734187724800, 140737488351231,
@@ -34513,7 +34513,7 @@ static void *rcu_reader_rev(void *ptr)
                        if (mas.index != r_start) {
                                alt = xa_mk_value(index + i * 2 + 1 +
                                                  RCU_RANGE_COUNT);
-                               mt_dump(test->mt);
+                               mt_dump(test->mt, mt_dump_dec);
                                printk("Error: %lu-%lu %p != %lu-%lu %p %p line %d i %d\n",
                                       mas.index, mas.last, entry,
                                       r_start, r_end, expected, alt,
@@ -34996,7 +34996,7 @@ void run_check_rcu_slowread(struct maple_tree *mt, struct rcu_test_struct *vals)
        MT_BUG_ON(mt, !vals->seen_entry3);
        MT_BUG_ON(mt, !vals->seen_both);
 }
-static noinline void check_rcu_simulated(struct maple_tree *mt)
+static noinline void __init check_rcu_simulated(struct maple_tree *mt)
 {
        unsigned long i, nr_entries = 1000;
        unsigned long target = 4320;
@@ -35157,7 +35157,7 @@ static noinline void check_rcu_simulated(struct maple_tree *mt)
        rcu_unregister_thread();
 }
 
-static noinline void check_rcu_threaded(struct maple_tree *mt)
+static noinline void __init check_rcu_threaded(struct maple_tree *mt)
 {
        unsigned long i, nr_entries = 1000;
        struct rcu_test_struct vals;
@@ -35259,6 +35259,7 @@ static void mas_dfs_preorder(struct ma_state *mas)
 
        struct maple_enode *prev;
        unsigned char end, slot = 0;
+       unsigned long *pivots;
 
        if (mas->node == MAS_START) {
                mas_start(mas);
@@ -35291,6 +35292,9 @@ walk_up:
                mas_ascend(mas);
                goto walk_up;
        }
+       pivots = ma_pivots(mte_to_node(prev), mte_node_type(prev));
+       mas->max = mas_safe_pivot(mas, pivots, slot, mte_node_type(prev));
+       mas->min = mas_safe_min(mas, pivots, slot);
 
        return;
 done:
@@ -35366,7 +35370,7 @@ static void check_dfs_preorder(struct maple_tree *mt)
 /* End of depth first search tests */
 
 /* Preallocation testing */
-static noinline void check_prealloc(struct maple_tree *mt)
+static noinline void __init check_prealloc(struct maple_tree *mt)
 {
        unsigned long i, max = 100;
        unsigned long allocated;
@@ -35494,7 +35498,7 @@ static noinline void check_prealloc(struct maple_tree *mt)
 /* End of preallocation testing */
 
 /* Spanning writes, writes that span nodes and layers of the tree */
-static noinline void check_spanning_write(struct maple_tree *mt)
+static noinline void __init check_spanning_write(struct maple_tree *mt)
 {
        unsigned long i, max = 5000;
        MA_STATE(mas, mt, 1200, 2380);
@@ -35662,7 +35666,7 @@ static noinline void check_spanning_write(struct maple_tree *mt)
 /* End of spanning write testing */
 
 /* Writes to a NULL area that are adjacent to other NULLs */
-static noinline void check_null_expand(struct maple_tree *mt)
+static noinline void __init check_null_expand(struct maple_tree *mt)
 {
        unsigned long i, max = 100;
        unsigned char data_end;
@@ -35723,7 +35727,7 @@ static noinline void check_null_expand(struct maple_tree *mt)
 /* End of NULL area expansions */
 
 /* Checking for no memory is best done outside the kernel */
-static noinline void check_nomem(struct maple_tree *mt)
+static noinline void __init check_nomem(struct maple_tree *mt)
 {
        MA_STATE(ms, mt, 1, 1);
 
@@ -35758,7 +35762,7 @@ static noinline void check_nomem(struct maple_tree *mt)
        mtree_destroy(mt);
 }
 
-static noinline void check_locky(struct maple_tree *mt)
+static noinline void __init check_locky(struct maple_tree *mt)
 {
        MA_STATE(ms, mt, 2, 2);
        MA_STATE(reader, mt, 2, 2);
@@ -35780,10 +35784,10 @@ void farmer_tests(void)
        struct maple_node *node;
        DEFINE_MTREE(tree);
 
-       mt_dump(&tree);
+       mt_dump(&tree, mt_dump_dec);
 
        tree.ma_root = xa_mk_value(0);
-       mt_dump(&tree);
+       mt_dump(&tree, mt_dump_dec);
 
        node = mt_alloc_one(GFP_KERNEL);
        node->parent = (void *)((unsigned long)(&tree) | 1);
@@ -35793,7 +35797,7 @@ void farmer_tests(void)
        node->mr64.pivot[1] = 1;
        node->mr64.pivot[2] = 0;
        tree.ma_root = mt_mk_node(node, maple_leaf_64);
-       mt_dump(&tree);
+       mt_dump(&tree, mt_dump_dec);
 
        node->parent = ma_parent_ptr(node);
        ma_free_rcu(node);
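
The maple tree hunks above thread a new output-format argument through every
mt_dump() call (mt_dump_dec for the RCU and farmer tests, mt_dump_hex for the
erase2 sets). A minimal sketch of the two-argument form, assuming a
kernel-internal caller with CONFIG_DEBUG_MAPLE_TREE enabled (the helper name
below is made up for illustration), looks like:

	#include <linux/maple_tree.h>

	/* Sketch only: mt_dump() and enum mt_dump_format are available
	 * under CONFIG_DEBUG_MAPLE_TREE. */
	static void dump_tree_both_ways(struct maple_tree *mt)
	{
		mt_dump(mt, mt_dump_dec);	/* ranges printed in decimal */
		mt_dump(mt, mt_dump_hex);	/* same tree, hexadecimal ranges */
	}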
index 90a62cf..6b456c5 100644 (file)
@@ -4,6 +4,7 @@ TARGETS += amd-pstate
 TARGETS += arm64
 TARGETS += bpf
 TARGETS += breakpoints
+TARGETS += cachestat
 TARGETS += capabilities
 TARGETS += cgroup
 TARGETS += clone3
@@ -144,10 +145,12 @@ ifneq ($(KBUILD_OUTPUT),)
   abs_objtree := $(realpath $(abs_objtree))
   BUILD := $(abs_objtree)/kselftest
   KHDR_INCLUDES := -isystem ${abs_objtree}/usr/include
+  KHDR_DIR := ${abs_objtree}/usr/include
 else
   BUILD := $(CURDIR)
   abs_srctree := $(shell cd $(top_srcdir) && pwd)
   KHDR_INCLUDES := -isystem ${abs_srctree}/usr/include
+  KHDR_DIR := ${abs_srctree}/usr/include
   DEFAULT_INSTALL_HDR_PATH := 1
 endif
 
@@ -161,7 +164,7 @@ export KHDR_INCLUDES
 # all isn't the first target in the file.
 .DEFAULT_GOAL := all
 
-all:
+all: kernel_header_files
        @ret=1;                                                 \
        for TARGET in $(TARGETS); do                            \
                BUILD_TARGET=$$BUILD/$$TARGET;                  \
@@ -172,6 +175,23 @@ all:
                ret=$$((ret * $$?));                            \
        done; exit $$ret;
 
+kernel_header_files:
+       @ls $(KHDR_DIR)/linux/*.h >/dev/null 2>/dev/null;                          \
+       if [ $$? -ne 0 ]; then                                                     \
+            RED='\033[1;31m';                                                  \
+            NOCOLOR='\033[0m';                                                 \
+            echo;                                                              \
+            echo -e "$${RED}error$${NOCOLOR}: missing kernel header files.";   \
+            echo "Please run this and try again:";                             \
+            echo;                                                              \
+            echo "    cd $(top_srcdir)";                                       \
+            echo "    make headers";                                           \
+            echo;                                                              \
+           exit 1;                                                                \
+       fi
+
+.PHONY: kernel_header_files
+
 run_tests: all
        @for TARGET in $(TARGETS); do \
                BUILD_TARGET=$$BUILD/$$TARGET;  \
index 93333a9..d4ad813 100644 (file)
@@ -39,6 +39,20 @@ static void cssc_sigill(void)
        asm volatile(".inst 0xdac01c00" : : : "x0");
 }
 
+static void mops_sigill(void)
+{
+       char dst[1], src[1];
+       register char *dstp asm ("x0") = dst;
+       register char *srcp asm ("x1") = src;
+       register long size asm ("x2") = 1;
+
+       /* CPYP [x0]!, [x1]!, x2! */
+       asm volatile(".inst 0x1d010440"
+                    : "+r" (dstp), "+r" (srcp), "+r" (size)
+                    :
+                    : "cc", "memory");
+}
+
 static void rng_sigill(void)
 {
        asm volatile("mrs x0, S3_3_C2_C4_0" : : : "x0");
@@ -210,6 +224,14 @@ static const struct hwcap_data {
                .sigill_fn = cssc_sigill,
        },
        {
+               .name = "MOPS",
+               .at_hwcap = AT_HWCAP2,
+               .hwcap_bit = HWCAP2_MOPS,
+               .cpuinfo = "mops",
+               .sigill_fn = mops_sigill,
+               .sigill_reliable = true,
+       },
+       {
                .name = "RNG",
                .at_hwcap = AT_HWCAP2,
                .hwcap_bit = HWCAP2_RNG,
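
The new MOPS entry above checks that the kernel advertises FEAT_MOPS through
AT_HWCAP2 before the raw CPYP encoding is attempted. A user-space probe for
the same bit, assuming the HWCAP2_MOPS definition from the arm64 uapi hwcap
header (the fallback value below is an assumption for older headers), could
look like:

	#include <stdio.h>
	#include <sys/auxv.h>

	#ifndef HWCAP2_MOPS
	#define HWCAP2_MOPS	(1UL << 43)	/* assumed value; normally from <asm/hwcap.h> */
	#endif

	int main(void)
	{
		if (getauxval(AT_HWCAP2) & HWCAP2_MOPS)
			printf("FEAT_MOPS memory copy/set instructions are usable\n");
		else
			printf("kernel does not report FEAT_MOPS\n");
		return 0;
	}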
index be95251..abe4d58 100644 (file)
@@ -20,7 +20,7 @@
 
 #include "../../kselftest.h"
 
-#define EXPECTED_TESTS 7
+#define EXPECTED_TESTS 11
 
 #define MAX_TPIDRS 2
 
@@ -132,6 +132,34 @@ static void test_tpidr(pid_t child)
        }
 }
 
+static void test_hw_debug(pid_t child, int type, const char *type_name)
+{
+       struct user_hwdebug_state state;
+       struct iovec iov;
+       int slots, arch, ret;
+
+       iov.iov_len = sizeof(state);
+       iov.iov_base = &state;
+
+       /* Should be able to read the values */
+       ret = ptrace(PTRACE_GETREGSET, child, type, &iov);
+       ksft_test_result(ret == 0, "read_%s\n", type_name);
+
+       if (ret == 0) {
+               /* Low 8 bits is the number of slots, next 4 bits the arch */
+               slots = state.dbg_info & 0xff;
+               arch = (state.dbg_info >> 8) & 0xf;
+
+               ksft_print_msg("%s version %d with %d slots\n", type_name,
+                              arch, slots);
+
+               /* Zero is not currently architecturally valid */
+               ksft_test_result(arch, "%s_arch_set\n", type_name);
+       } else {
+               ksft_test_result_skip("%s_arch_set\n", type_name);
+       }
+}
+
 static int do_child(void)
 {
        if (ptrace(PTRACE_TRACEME, -1, NULL, NULL))
@@ -207,6 +235,8 @@ static int do_parent(pid_t child)
        ksft_print_msg("Parent is %d, child is %d\n", getpid(), child);
 
        test_tpidr(child);
+       test_hw_debug(child, NT_ARM_HW_WATCH, "NT_ARM_HW_WATCH");
+       test_hw_debug(child, NT_ARM_HW_BREAK, "NT_ARM_HW_BREAK");
 
        ret = EXIT_SUCCESS;
 
index 8ab4c86..839e3a2 100644 (file)
@@ -4,7 +4,7 @@ fake_sigreturn_*
 sme_*
 ssve_*
 sve_*
-tpidr2_siginfo
+tpidr2_*
 za_*
 zt_*
 !*.[ch]
index 40be844..0dc948d 100644 (file)
@@ -249,7 +249,8 @@ static void default_handler(int signum, siginfo_t *si, void *uc)
                        fprintf(stderr, "-- Timeout !\n");
                } else {
                        fprintf(stderr,
-                               "-- RX UNEXPECTED SIGNAL: %d\n", signum);
+                               "-- RX UNEXPECTED SIGNAL: %d code %d address %p\n",
+                               signum, si->si_code, si->si_addr);
                }
                default_result(current, 1);
        }
diff --git a/tools/testing/selftests/arm64/signal/testcases/tpidr2_restore.c b/tools/testing/selftests/arm64/signal/testcases/tpidr2_restore.c
new file mode 100644 (file)
index 0000000..f9a86c0
--- /dev/null
@@ -0,0 +1,86 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2023 ARM Limited
+ *
+ * Verify that the TPIDR2 register context in signal frames is restored.
+ */
+
+#include <signal.h>
+#include <ucontext.h>
+#include <sys/auxv.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+#include <asm/sigcontext.h>
+
+#include "test_signals_utils.h"
+#include "testcases.h"
+
+#define SYS_TPIDR2 "S3_3_C13_C0_5"
+
+static uint64_t get_tpidr2(void)
+{
+       uint64_t val;
+
+       asm volatile (
+               "mrs    %0, " SYS_TPIDR2 "\n"
+               : "=r"(val)
+               :
+               : "cc");
+
+       return val;
+}
+
+static void set_tpidr2(uint64_t val)
+{
+       asm volatile (
+               "msr    " SYS_TPIDR2 ", %0\n"
+               :
+               : "r"(val)
+               : "cc");
+}
+
+
+static uint64_t initial_tpidr2;
+
+static bool save_tpidr2(struct tdescr *td)
+{
+       initial_tpidr2 = get_tpidr2();
+       fprintf(stderr, "Initial TPIDR2: %lx\n", initial_tpidr2);
+
+       return true;
+}
+
+static int modify_tpidr2(struct tdescr *td, siginfo_t *si, ucontext_t *uc)
+{
+       uint64_t my_tpidr2 = get_tpidr2();
+
+       my_tpidr2++;
+       fprintf(stderr, "Setting TPIDR2 to %lx\n", my_tpidr2);
+       set_tpidr2(my_tpidr2);
+
+       return 0;
+}
+
+static void check_tpidr2(struct tdescr *td)
+{
+       uint64_t tpidr2 = get_tpidr2();
+
+       td->pass = tpidr2 == initial_tpidr2;
+
+       if (td->pass)
+               fprintf(stderr, "TPIDR2 restored\n");
+       else
+               fprintf(stderr, "TPIDR2 was %lx but is now %lx\n",
+                       initial_tpidr2, tpidr2);
+}
+
+struct tdescr tde = {
+       .name = "TPIDR2 restore",
+       .descr = "Validate that TPIDR2 is restored from the sigframe",
+       .feats_required = FEAT_SME,
+       .timeout = 3,
+       .sig_trig = SIGUSR1,
+       .init = save_tpidr2,
+       .run = modify_tpidr2,
+       .check_result = check_tpidr2,
+};
index 5ddcc46..5212678 100644 (file)
@@ -59,9 +59,7 @@ int dump_ksym(struct bpf_iter__ksym *ctx)
        } else {
                BPF_SEQ_PRINTF(seq, "0x%llx %c %s ", value, type, iter->name);
        }
-       if (!iter->pos_arch_end || iter->pos_arch_end > iter->pos)
-               BPF_SEQ_PRINTF(seq, "CORE ");
-       else if (!iter->pos_mod_end || iter->pos_mod_end > iter->pos)
+       if (!iter->pos_mod_end || iter->pos_mod_end > iter->pos)
                BPF_SEQ_PRINTF(seq, "MOD ");
        else if (!iter->pos_ftrace_mod_end || iter->pos_ftrace_mod_end > iter->pos)
                BPF_SEQ_PRINTF(seq, "FTRACE_MOD ");
diff --git a/tools/testing/selftests/cachestat/.gitignore b/tools/testing/selftests/cachestat/.gitignore
new file mode 100644 (file)
index 0000000..d6c30b4
--- /dev/null
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+test_cachestat
diff --git a/tools/testing/selftests/cachestat/Makefile b/tools/testing/selftests/cachestat/Makefile
new file mode 100644 (file)
index 0000000..fca73aa
--- /dev/null
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+TEST_GEN_PROGS := test_cachestat
+
+CFLAGS += $(KHDR_INCLUDES)
+CFLAGS += -Wall
+CFLAGS += -lrt
+
+include ../lib.mk
diff --git a/tools/testing/selftests/cachestat/test_cachestat.c b/tools/testing/selftests/cachestat/test_cachestat.c
new file mode 100644 (file)
index 0000000..54d09b8
--- /dev/null
@@ -0,0 +1,269 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdbool.h>
+#include <linux/kernel.h>
+#include <linux/mman.h>
+#include <sys/mman.h>
+#include <sys/shm.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+#include <string.h>
+#include <fcntl.h>
+#include <errno.h>
+
+#include "../kselftest.h"
+
+static const char * const dev_files[] = {
+       "/dev/zero", "/dev/null", "/dev/urandom",
+       "/proc/version", "/proc"
+};
+static const int cachestat_nr = 451;
+
+void print_cachestat(struct cachestat *cs)
+{
+       ksft_print_msg(
+       "Using cachestat: Cached: %lu, Dirty: %lu, Writeback: %lu, Evicted: %lu, Recently Evicted: %lu\n",
+       cs->nr_cache, cs->nr_dirty, cs->nr_writeback,
+       cs->nr_evicted, cs->nr_recently_evicted);
+}
+
+bool write_exactly(int fd, size_t filesize)
+{
+       int random_fd = open("/dev/urandom", O_RDONLY);
+       char *cursor, *data;
+       int remained;
+       bool ret;
+
+       if (random_fd < 0) {
+               ksft_print_msg("Unable to access urandom.\n");
+               ret = false;
+               goto out;
+       }
+
+       data = malloc(filesize);
+       if (!data) {
+               ksft_print_msg("Unable to allocate data.\n");
+               ret = false;
+               goto close_random_fd;
+       }
+
+       remained = filesize;
+       cursor = data;
+
+       while (remained) {
+               ssize_t read_len = read(random_fd, cursor, remained);
+
+               if (read_len <= 0) {
+                       ksft_print_msg("Unable to read from urandom.\n");
+                       ret = false;
+                       goto out_free_data;
+               }
+
+               remained -= read_len;
+               cursor += read_len;
+       }
+
+       /* write random data to fd */
+       remained = filesize;
+       cursor = data;
+       while (remained) {
+               ssize_t write_len = write(fd, cursor, remained);
+
+               if (write_len <= 0) {
+                       ksft_print_msg("Unable to write random data to file.\n");
+                       ret = false;
+                       goto out_free_data;
+               }
+
+               remained -= write_len;
+               cursor += write_len;
+       }
+
+       ret = true;
+out_free_data:
+       free(data);
+close_random_fd:
+       close(random_fd);
+out:
+       return ret;
+}
+
+/*
+ * Open/create the file at filename, (optionally) write random data to it
+ * (exactly num_pages), then test the cachestat syscall on this file.
+ *
+ * If test_fsync == true, fsync the file, then check the number of dirty
+ * pages.
+ */
+bool test_cachestat(const char *filename, bool write_random, bool create,
+               bool test_fsync, unsigned long num_pages, int open_flags,
+               mode_t open_mode)
+{
+       size_t PS = sysconf(_SC_PAGESIZE);
+       int filesize = num_pages * PS;
+       bool ret = true;
+       long syscall_ret;
+       struct cachestat cs;
+       struct cachestat_range cs_range = { 0, filesize };
+
+       int fd = open(filename, open_flags, open_mode);
+
+       if (fd == -1) {
+               ksft_print_msg("Unable to create/open file.\n");
+               ret = false;
+               goto out;
+       } else {
+               ksft_print_msg("Create/open %s\n", filename);
+       }
+
+       if (write_random) {
+               if (!write_exactly(fd, filesize)) {
+                       ksft_print_msg("Unable to access urandom.\n");
+                       ret = false;
+                       goto out1;
+               }
+       }
+
+       syscall_ret = syscall(cachestat_nr, fd, &cs_range, &cs, 0);
+
+       ksft_print_msg("Cachestat call returned %ld\n", syscall_ret);
+
+       if (syscall_ret) {
+               ksft_print_msg("Cachestat returned non-zero.\n");
+               ret = false;
+               goto out1;
+
+       } else {
+               print_cachestat(&cs);
+
+               if (write_random) {
+                       if (cs.nr_cache + cs.nr_evicted != num_pages) {
+                               ksft_print_msg(
+                                       "Total number of cached and evicted pages is off.\n");
+                               ret = false;
+                       }
+               }
+       }
+
+       if (test_fsync) {
+               if (fsync(fd)) {
+                       ksft_print_msg("fsync fails.\n");
+                       ret = false;
+               } else {
+                       syscall_ret = syscall(cachestat_nr, fd, &cs_range, &cs, 0);
+
+                       ksft_print_msg("Cachestat call (after fsync) returned %ld\n",
+                               syscall_ret);
+
+                       if (!syscall_ret) {
+                               print_cachestat(&cs);
+
+                               if (cs.nr_dirty) {
+                                       ret = false;
+                                       ksft_print_msg(
+                                               "Number of dirty pages should be zero after fsync.\n");
+                               }
+                       } else {
+                               ksft_print_msg("Cachestat (after fsync) returned non-zero.\n");
+                               ret = false;
+                               goto out1;
+                       }
+               }
+       }
+
+out1:
+       close(fd);
+
+       if (create)
+               remove(filename);
+out:
+       return ret;
+}
+
+bool test_cachestat_shmem(void)
+{
+       size_t PS = sysconf(_SC_PAGESIZE);
+       size_t filesize = PS * 512 * 2; /* 2 2MB huge pages */
+       int syscall_ret;
+       size_t compute_len = PS * 512;
+       struct cachestat_range cs_range = { PS, compute_len };
+       char *filename = "tmpshmcstat";
+       struct cachestat cs;
+       bool ret = true;
+       unsigned long num_pages = compute_len / PS;
+       int fd = shm_open(filename, O_CREAT | O_RDWR, 0600);
+
+       if (fd < 0) {
+               ksft_print_msg("Unable to create shmem file.\n");
+               ret = false;
+               goto out;
+       }
+
+       if (ftruncate(fd, filesize)) {
+               ksft_print_msg("Unable to truncate shmem file.\n");
+               ret = false;
+               goto close_fd;
+       }
+
+       if (!write_exactly(fd, filesize)) {
+               ksft_print_msg("Unable to write to shmem file.\n");
+               ret = false;
+               goto close_fd;
+       }
+
+       syscall_ret = syscall(cachestat_nr, fd, &cs_range, &cs, 0);
+
+       if (syscall_ret) {
+               ksft_print_msg("Cachestat returned non-zero.\n");
+               ret = false;
+               goto close_fd;
+       } else {
+               print_cachestat(&cs);
+               if (cs.nr_cache + cs.nr_evicted != num_pages) {
+                       ksft_print_msg(
+                               "Total number of cached and evicted pages is off.\n");
+                       ret = false;
+               }
+       }
+
+close_fd:
+       shm_unlink(filename);
+out:
+       return ret;
+}
+
+int main(void)
+{
+       int ret = 0;
+
+       for (int i = 0; i < 5; i++) {
+               const char *dev_filename = dev_files[i];
+
+               if (test_cachestat(dev_filename, false, false, false,
+                       4, O_RDONLY, 0400))
+                       ksft_test_result_pass("cachestat works with %s\n", dev_filename);
+               else {
+                       ksft_test_result_fail("cachestat fails with %s\n", dev_filename);
+                       ret = 1;
+               }
+       }
+
+       if (test_cachestat("tmpfilecachestat", true, true,
+               true, 4, O_CREAT | O_RDWR, 0400 | 0600))
+               ksft_test_result_pass("cachestat works with a normal file\n");
+       else {
+               ksft_test_result_fail("cachestat fails with normal file\n");
+               ret = 1;
+       }
+
+       if (test_cachestat_shmem())
+               ksft_test_result_pass("cachestat works with a shmem file\n");
+       else {
+               ksft_test_result_fail("cachestat fails with a shmem file\n");
+               ret = 1;
+       }
+
+       return ret;
+}
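
The selftest above exercises the new cachestat() syscall (number 451 in this
tree) through syscall(2) and checks the nr_cache/nr_dirty accounting for
regular, device and shmem files. A minimal standalone caller, assuming struct
cachestat and struct cachestat_range come from the uapi <linux/mman.h> header
as in the test, might look like:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/mman.h>

	int main(int argc, char **argv)
	{
		/* A zero length asks for statistics up to the end of the file. */
		struct cachestat_range range = { 0, 0 };
		struct cachestat cs;
		int fd = open(argc > 1 ? argv[1] : "/proc/version", O_RDONLY);

		if (fd < 0 || syscall(451, fd, &range, &cs, 0))
			return 1;

		printf("cached %llu dirty %llu writeback %llu evicted %llu recently_evicted %llu\n",
		       (unsigned long long)cs.nr_cache,
		       (unsigned long long)cs.nr_dirty,
		       (unsigned long long)cs.nr_writeback,
		       (unsigned long long)cs.nr_evicted,
		       (unsigned long long)cs.nr_recently_evicted);
		close(fd);
		return 0;
	}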
index f4f7c0a..c7c9572 100644 (file)
@@ -292,6 +292,7 @@ static int test_memcg_protection(const char *root, bool min)
        char *children[4] = {NULL};
        const char *attribute = min ? "memory.min" : "memory.low";
        long c[4];
+       long current;
        int i, attempts;
        int fd;
 
@@ -400,7 +401,8 @@ static int test_memcg_protection(const char *root, bool min)
                goto cleanup;
        }
 
-       if (!values_close(cg_read_long(parent[1], "memory.current"), MB(50), 3))
+       current = min ? MB(50) : MB(30);
+       if (!values_close(cg_read_long(parent[1], "memory.current"), current, 3))
                goto cleanup;
 
        if (!reclaim_until(children[0], MB(10)))
@@ -987,7 +989,9 @@ static int tcp_client(const char *cgroup, unsigned short port)
        char servport[6];
        int retries = 0x10; /* nice round number */
        int sk, ret;
+       long allocated;
 
+       allocated = cg_read_long(cgroup, "memory.current");
        snprintf(servport, sizeof(servport), "%hd", port);
        ret = getaddrinfo(server, servport, NULL, &ai);
        if (ret)
@@ -1015,7 +1019,8 @@ static int tcp_client(const char *cgroup, unsigned short port)
                if (current < 0 || sock < 0)
                        goto close_sk;
 
-               if (values_close(current, sock, 10)) {
+               /* exclude the memory not related to socket connection */
+               if (values_close(current - allocated, sock, 10)) {
                        ret = KSFT_PASS;
                        break;
                }
index e495f89..e60cf4d 100644 (file)
@@ -129,7 +129,7 @@ int main(int argc, char *argv[])
        uid_t uid = getuid();
 
        ksft_print_header();
-       ksft_set_plan(18);
+       ksft_set_plan(19);
        test_clone3_supported();
 
        /* Just a simple clone3() should return 0.*/
@@ -198,5 +198,8 @@ int main(int argc, char *argv[])
        /* Do a clone3() in a new time namespace */
        test_clone3(CLONE_NEWTIME, 0, 0, CLONE3_ARGS_NO_TEST);
 
+       /* Do a clone3() with exit signal (SIGCHLD) in flags */
+       test_clone3(SIGCHLD, 0, -EINVAL, CLONE3_ARGS_NO_TEST);
+
        ksft_finished();
 }
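
The added clone3() case verifies that an exit signal passed in the flags word
is rejected with -EINVAL; with clone3() the signal belongs in the exit_signal
field of struct clone_args instead. A rough sketch, assuming headers that
expose __NR_clone3 and the uapi struct clone_args, is:

	#define _GNU_SOURCE
	#include <linux/sched.h>	/* struct clone_args */
	#include <signal.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		struct clone_args args;
		long pid;

		memset(&args, 0, sizeof(args));
		args.exit_signal = SIGCHLD;	/* correct: dedicated field */
		/* args.flags = SIGCHLD would fail with -EINVAL, as the new test expects */

		pid = syscall(__NR_clone3, &args, sizeof(args));
		if (pid == 0)
			_exit(0);		/* child exits immediately */
		if (pid < 0)
			return 1;
		waitpid(pid, NULL, 0);		/* reap; termination is signalled via SIGCHLD */
		return 0;
	}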
index 75e9007..ce5068f 100644 (file)
@@ -5,11 +5,3 @@ CONFIG_CPU_FREQ_GOV_USERSPACE=y
 CONFIG_CPU_FREQ_GOV_ONDEMAND=y
 CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
 CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y
-CONFIG_DEBUG_RT_MUTEXES=y
-CONFIG_DEBUG_PLIST=y
-CONFIG_DEBUG_SPINLOCK=y
-CONFIG_DEBUG_MUTEXES=y
-CONFIG_DEBUG_LOCK_ALLOC=y
-CONFIG_PROVE_LOCKING=y
-CONFIG_LOCKDEP=y
-CONFIG_DEBUG_ATOMIC_SLEEP=y
diff --git a/tools/testing/selftests/damon/config b/tools/testing/selftests/damon/config
new file mode 100644 (file)
index 0000000..0daf389
--- /dev/null
@@ -0,0 +1,7 @@
+CONFIG_DAMON=y
+CONFIG_DAMON_SYSFS=y
+CONFIG_DAMON_DBGFS=y
+CONFIG_DAMON_PADDR=y
+CONFIG_DAMON_VADDR=y
+CONFIG_DAMON_RECLAIM=y
+CONFIG_DAMON_LRU_SORT=y
index 2506621..cb5f18c 100755 (executable)
@@ -301,7 +301,7 @@ ktaptest() { # result comment
     comment="# $comment"
   fi
 
-  echo $CASENO $result $INSTANCE$CASENAME $comment
+  echo $result $CASENO $INSTANCE$CASENAME $comment
 }
 
 eval_result() { # sigval
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_opt_types.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_opt_types.tc
new file mode 100644 (file)
index 0000000..9f5d993
--- /dev/null
@@ -0,0 +1,34 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (C) 2023 Akanksha J N, IBM corporation
+# description: Register/unregister optimized probe
+# requires: kprobe_events
+
+case `uname -m` in
+x86_64)
+;;
+arm*)
+;;
+ppc*)
+;;
+*)
+  echo "Please add support for other architectures here"
+  exit_unsupported
+esac
+
+DEFAULT=$(cat /proc/sys/debug/kprobes-optimization)
+echo 1 > /proc/sys/debug/kprobes-optimization
+for i in `seq 0 255`; do
+        echo  "p:testprobe $FUNCTION_FORK+${i}" > kprobe_events || continue
+        echo 1 > events/kprobes/enable || continue
+        (echo "forked")
+       PROBE=$(grep $FUNCTION_FORK /sys/kernel/debug/kprobes/list)
+        echo 0 > events/kprobes/enable
+        echo > kprobe_events
+       if echo $PROBE | grep -q OPTIMIZED; then
+                echo "$DEFAULT" >  /proc/sys/debug/kprobes-optimization
+                exit_pass
+        fi
+done
+echo "$DEFAULT" >  /proc/sys/debug/kprobes-optimization
+exit_unresolved
index 294619a..1c952d1 100644 (file)
@@ -8,7 +8,8 @@ export logfile=/dev/stdout
 export per_test_logging=
 
 # Defaults for "settings" file fields:
-# "timeout" how many seconds to let each test run before failing.
+# "timeout" how many seconds to let each test run before running
+# over our soft timeout limit.
 export kselftest_default_timeout=45
 
 # There isn't a shell-agnostic way to find the path of a sourced file,
@@ -90,6 +91,14 @@ run_one()
                done < "$settings"
        fi
 
+       # Command line timeout overrides the settings file
+       if [ -n "$kselftest_override_timeout" ]; then
+               kselftest_timeout="$kselftest_override_timeout"
+               echo "# overriding timeout to $kselftest_timeout" >> "$logfile"
+       else
+               echo "# timeout set to $kselftest_timeout" >> "$logfile"
+       fi
+
        TEST_HDR_MSG="selftests: $DIR: $BASENAME_TEST"
        echo "# $TEST_HDR_MSG"
        if [ ! -e "$TEST" ]; then
index d4e1f4a..4f10055 100644 (file)
@@ -48,6 +48,34 @@ struct reg_sublist {
        __u64 rejects_set_n;
 };
 
+struct feature_id_reg {
+       __u64 reg;
+       __u64 id_reg;
+       __u64 feat_shift;
+       __u64 feat_min;
+};
+
+static struct feature_id_reg feat_id_regs[] = {
+       {
+               ARM64_SYS_REG(3, 0, 2, 0, 3),   /* TCR2_EL1 */
+               ARM64_SYS_REG(3, 0, 0, 7, 3),   /* ID_AA64MMFR3_EL1 */
+               0,
+               1
+       },
+       {
+               ARM64_SYS_REG(3, 0, 10, 2, 2),  /* PIRE0_EL1 */
+               ARM64_SYS_REG(3, 0, 0, 7, 3),   /* ID_AA64MMFR3_EL1 */
+               4,
+               1
+       },
+       {
+               ARM64_SYS_REG(3, 0, 10, 2, 3),  /* PIR_EL1 */
+               ARM64_SYS_REG(3, 0, 0, 7, 3),   /* ID_AA64MMFR3_EL1 */
+               4,
+               1
+       }
+};
+
 struct vcpu_config {
        char *name;
        struct reg_sublist sublists[];
@@ -68,7 +96,8 @@ static int vcpu_configs_n;
 
 #define for_each_missing_reg(i)                                                        \
        for ((i) = 0; (i) < blessed_n; ++(i))                                   \
-               if (!find_reg(reg_list->reg, reg_list->n, blessed_reg[i]))
+               if (!find_reg(reg_list->reg, reg_list->n, blessed_reg[i]))      \
+                       if (check_supported_feat_reg(vcpu, blessed_reg[i]))
 
 #define for_each_new_reg(i)                                                    \
        for_each_reg_filtered(i)                                                \
@@ -132,6 +161,25 @@ static bool find_reg(__u64 regs[], __u64 nr_regs, __u64 reg)
        return false;
 }
 
+static bool check_supported_feat_reg(struct kvm_vcpu *vcpu, __u64 reg)
+{
+       int i, ret;
+       __u64 data, feat_val;
+
+       for (i = 0; i < ARRAY_SIZE(feat_id_regs); i++) {
+               if (feat_id_regs[i].reg == reg) {
+                       ret = __vcpu_get_reg(vcpu, feat_id_regs[i].id_reg, &data);
+                       if (ret < 0)
+                               return false;
+
+                       feat_val = ((data >> feat_id_regs[i].feat_shift) & 0xf);
+                       return feat_val >= feat_id_regs[i].feat_min;
+               }
+       }
+
+       return true;
+}
+
 static const char *str_with_index(const char *template, __u64 index)
 {
        char *str, *p;
@@ -843,12 +891,15 @@ static __u64 base_regs[] = {
        ARM64_SYS_REG(3, 0, 2, 0, 0),   /* TTBR0_EL1 */
        ARM64_SYS_REG(3, 0, 2, 0, 1),   /* TTBR1_EL1 */
        ARM64_SYS_REG(3, 0, 2, 0, 2),   /* TCR_EL1 */
+       ARM64_SYS_REG(3, 0, 2, 0, 3),   /* TCR2_EL1 */
        ARM64_SYS_REG(3, 0, 5, 1, 0),   /* AFSR0_EL1 */
        ARM64_SYS_REG(3, 0, 5, 1, 1),   /* AFSR1_EL1 */
        ARM64_SYS_REG(3, 0, 5, 2, 0),   /* ESR_EL1 */
        ARM64_SYS_REG(3, 0, 6, 0, 0),   /* FAR_EL1 */
        ARM64_SYS_REG(3, 0, 7, 4, 0),   /* PAR_EL1 */
        ARM64_SYS_REG(3, 0, 10, 2, 0),  /* MAIR_EL1 */
+       ARM64_SYS_REG(3, 0, 10, 2, 2),  /* PIRE0_EL1 */
+       ARM64_SYS_REG(3, 0, 10, 2, 3),  /* PIR_EL1 */
        ARM64_SYS_REG(3, 0, 10, 3, 0),  /* AMAIR_EL1 */
        ARM64_SYS_REG(3, 0, 12, 0, 0),  /* VBAR_EL1 */
        ARM64_SYS_REG(3, 0, 12, 1, 1),  /* DISR_EL1 */
index 0f0a652..3dc9e43 100644 (file)
@@ -1,7 +1,10 @@
+CONFIG_CGROUPS=y
+CONFIG_CGROUP_SCHED=y
 CONFIG_OVERLAY_FS=y
-CONFIG_SECURITY_LANDLOCK=y
-CONFIG_SECURITY_PATH=y
+CONFIG_PROC_FS=y
 CONFIG_SECURITY=y
+CONFIG_SECURITY_LANDLOCK=y
 CONFIG_SHMEM=y
-CONFIG_TMPFS_XATTR=y
+CONFIG_SYSFS=y
 CONFIG_TMPFS=y
+CONFIG_TMPFS_XATTR=y
diff --git a/tools/testing/selftests/landlock/config.um b/tools/testing/selftests/landlock/config.um
new file mode 100644 (file)
index 0000000..40937c0
--- /dev/null
@@ -0,0 +1 @@
+CONFIG_HOSTFS=y
index b6c4be3..83d5655 100644 (file)
@@ -10,6 +10,7 @@
 #define _GNU_SOURCE
 #include <fcntl.h>
 #include <linux/landlock.h>
+#include <linux/magic.h>
 #include <sched.h>
 #include <stdio.h>
 #include <string.h>
@@ -19,6 +20,7 @@
 #include <sys/sendfile.h>
 #include <sys/stat.h>
 #include <sys/sysmacros.h>
+#include <sys/vfs.h>
 #include <unistd.h>
 
 #include "common.h"
@@ -107,8 +109,10 @@ static bool fgrep(FILE *const inf, const char *const str)
        return false;
 }
 
-static bool supports_overlayfs(void)
+static bool supports_filesystem(const char *const filesystem)
 {
+       char str[32];
+       int len;
        bool res;
        FILE *const inf = fopen("/proc/filesystems", "r");
 
@@ -119,11 +123,33 @@ static bool supports_overlayfs(void)
        if (!inf)
                return true;
 
-       res = fgrep(inf, "nodev\toverlay\n");
+       /* filesystem can be null for bind mounts. */
+       if (!filesystem)
+               return true;
+
+       len = snprintf(str, sizeof(str), "nodev\t%s\n", filesystem);
+       if (len >= sizeof(str))
+               /* Ignores too-long filesystem names. */
+               return true;
+
+       res = fgrep(inf, str);
        fclose(inf);
        return res;
 }
 
+static bool cwd_matches_fs(unsigned int fs_magic)
+{
+       struct statfs statfs_buf;
+
+       if (!fs_magic)
+               return true;
+
+       if (statfs(".", &statfs_buf))
+               return true;
+
+       return statfs_buf.f_type == fs_magic;
+}
+
 static void mkdir_parents(struct __test_metadata *const _metadata,
                          const char *const path)
 {
@@ -206,7 +232,26 @@ out:
        return err;
 }
 
-static void prepare_layout(struct __test_metadata *const _metadata)
+struct mnt_opt {
+       const char *const source;
+       const char *const type;
+       const unsigned long flags;
+       const char *const data;
+};
+
+const struct mnt_opt mnt_tmp = {
+       .type = "tmpfs",
+       .data = "size=4m,mode=700",
+};
+
+static int mount_opt(const struct mnt_opt *const mnt, const char *const target)
+{
+       return mount(mnt->source ?: mnt->type, target, mnt->type, mnt->flags,
+                    mnt->data);
+}
+
+static void prepare_layout_opt(struct __test_metadata *const _metadata,
+                              const struct mnt_opt *const mnt)
 {
        disable_caps(_metadata);
        umask(0077);
@@ -217,12 +262,28 @@ static void prepare_layout(struct __test_metadata *const _metadata)
         * for tests relying on pivot_root(2) and move_mount(2).
         */
        set_cap(_metadata, CAP_SYS_ADMIN);
-       ASSERT_EQ(0, unshare(CLONE_NEWNS));
-       ASSERT_EQ(0, mount("tmp", TMP_DIR, "tmpfs", 0, "size=4m,mode=700"));
+       ASSERT_EQ(0, unshare(CLONE_NEWNS | CLONE_NEWCGROUP));
+       ASSERT_EQ(0, mount_opt(mnt, TMP_DIR))
+       {
+               TH_LOG("Failed to mount the %s filesystem: %s", mnt->type,
+                      strerror(errno));
+               /*
+                * FIXTURE_TEARDOWN() is not called when FIXTURE_SETUP()
+                * failed, so we need to explicitly do a minimal cleanup to
+                * avoid cascading errors with other tests that don't depend on
+                * the same filesystem.
+                */
+               remove_path(TMP_DIR);
+       }
        ASSERT_EQ(0, mount(NULL, TMP_DIR, NULL, MS_PRIVATE | MS_REC, NULL));
        clear_cap(_metadata, CAP_SYS_ADMIN);
 }
 
+static void prepare_layout(struct __test_metadata *const _metadata)
+{
+       prepare_layout_opt(_metadata, &mnt_tmp);
+}
+
 static void cleanup_layout(struct __test_metadata *const _metadata)
 {
        set_cap(_metadata, CAP_SYS_ADMIN);
@@ -231,6 +292,20 @@ static void cleanup_layout(struct __test_metadata *const _metadata)
        EXPECT_EQ(0, remove_path(TMP_DIR));
 }
 
+/* clang-format off */
+FIXTURE(layout0) {};
+/* clang-format on */
+
+FIXTURE_SETUP(layout0)
+{
+       prepare_layout(_metadata);
+}
+
+FIXTURE_TEARDOWN(layout0)
+{
+       cleanup_layout(_metadata);
+}
+
 static void create_layout1(struct __test_metadata *const _metadata)
 {
        create_file(_metadata, file1_s1d1);
@@ -248,7 +323,7 @@ static void create_layout1(struct __test_metadata *const _metadata)
        create_file(_metadata, file1_s3d1);
        create_directory(_metadata, dir_s3d2);
        set_cap(_metadata, CAP_SYS_ADMIN);
-       ASSERT_EQ(0, mount("tmp", dir_s3d2, "tmpfs", 0, "size=4m,mode=700"));
+       ASSERT_EQ(0, mount_opt(&mnt_tmp, dir_s3d2));
        clear_cap(_metadata, CAP_SYS_ADMIN);
 
        ASSERT_EQ(0, mkdir(dir_s3d3, 0700));
@@ -262,11 +337,13 @@ static void remove_layout1(struct __test_metadata *const _metadata)
        EXPECT_EQ(0, remove_path(file1_s1d3));
        EXPECT_EQ(0, remove_path(file1_s1d2));
        EXPECT_EQ(0, remove_path(file1_s1d1));
+       EXPECT_EQ(0, remove_path(dir_s1d3));
 
        EXPECT_EQ(0, remove_path(file2_s2d3));
        EXPECT_EQ(0, remove_path(file1_s2d3));
        EXPECT_EQ(0, remove_path(file1_s2d2));
        EXPECT_EQ(0, remove_path(file1_s2d1));
+       EXPECT_EQ(0, remove_path(dir_s2d2));
 
        EXPECT_EQ(0, remove_path(file1_s3d1));
        EXPECT_EQ(0, remove_path(dir_s3d3));
@@ -510,7 +587,7 @@ TEST_F_FORK(layout1, file_and_dir_access_rights)
        ASSERT_EQ(0, close(ruleset_fd));
 }
 
-TEST_F_FORK(layout1, unknown_access_rights)
+TEST_F_FORK(layout0, unknown_access_rights)
 {
        __u64 access_mask;
 
@@ -608,7 +685,7 @@ static void enforce_ruleset(struct __test_metadata *const _metadata,
        }
 }
 
-TEST_F_FORK(layout1, proc_nsfs)
+TEST_F_FORK(layout0, proc_nsfs)
 {
        const struct rule rules[] = {
                {
@@ -657,11 +734,11 @@ TEST_F_FORK(layout1, proc_nsfs)
        ASSERT_EQ(0, close(path_beneath.parent_fd));
 }
 
-TEST_F_FORK(layout1, unpriv)
+TEST_F_FORK(layout0, unpriv)
 {
        const struct rule rules[] = {
                {
-                       .path = dir_s1d2,
+                       .path = TMP_DIR,
                        .access = ACCESS_RO,
                },
                {},
@@ -1301,12 +1378,12 @@ TEST_F_FORK(layout1, inherit_superset)
        ASSERT_EQ(0, test_open(file1_s1d3, O_RDONLY));
 }
 
-TEST_F_FORK(layout1, max_layers)
+TEST_F_FORK(layout0, max_layers)
 {
        int i, err;
        const struct rule rules[] = {
                {
-                       .path = dir_s1d2,
+                       .path = TMP_DIR,
                        .access = ACCESS_RO,
                },
                {},
@@ -4030,21 +4107,24 @@ static const char (*merge_sub_files[])[] = {
  *         └── work
  */
 
-/* clang-format off */
-FIXTURE(layout2_overlay) {};
-/* clang-format on */
+FIXTURE(layout2_overlay)
+{
+       bool skip_test;
+};
 
 FIXTURE_SETUP(layout2_overlay)
 {
-       if (!supports_overlayfs())
-               SKIP(return, "overlayfs is not supported");
+       if (!supports_filesystem("overlay")) {
+               self->skip_test = true;
+               SKIP(return, "overlayfs is not supported (setup)");
+       }
 
        prepare_layout(_metadata);
 
        create_directory(_metadata, LOWER_BASE);
        set_cap(_metadata, CAP_SYS_ADMIN);
        /* Creates tmpfs mount points to get deterministic overlayfs. */
-       ASSERT_EQ(0, mount("tmp", LOWER_BASE, "tmpfs", 0, "size=4m,mode=700"));
+       ASSERT_EQ(0, mount_opt(&mnt_tmp, LOWER_BASE));
        clear_cap(_metadata, CAP_SYS_ADMIN);
        create_file(_metadata, lower_fl1);
        create_file(_metadata, lower_dl1_fl2);
@@ -4054,7 +4134,7 @@ FIXTURE_SETUP(layout2_overlay)
 
        create_directory(_metadata, UPPER_BASE);
        set_cap(_metadata, CAP_SYS_ADMIN);
-       ASSERT_EQ(0, mount("tmp", UPPER_BASE, "tmpfs", 0, "size=4m,mode=700"));
+       ASSERT_EQ(0, mount_opt(&mnt_tmp, UPPER_BASE));
        clear_cap(_metadata, CAP_SYS_ADMIN);
        create_file(_metadata, upper_fu1);
        create_file(_metadata, upper_du1_fu2);
@@ -4075,8 +4155,8 @@ FIXTURE_SETUP(layout2_overlay)
 
 FIXTURE_TEARDOWN(layout2_overlay)
 {
-       if (!supports_overlayfs())
-               SKIP(return, "overlayfs is not supported");
+       if (self->skip_test)
+               SKIP(return, "overlayfs is not supported (teardown)");
 
        EXPECT_EQ(0, remove_path(lower_do1_fl3));
        EXPECT_EQ(0, remove_path(lower_dl1_fl2));
@@ -4109,8 +4189,8 @@ FIXTURE_TEARDOWN(layout2_overlay)
 
 TEST_F_FORK(layout2_overlay, no_restriction)
 {
-       if (!supports_overlayfs())
-               SKIP(return, "overlayfs is not supported");
+       if (self->skip_test)
+               SKIP(return, "overlayfs is not supported (test)");
 
        ASSERT_EQ(0, test_open(lower_fl1, O_RDONLY));
        ASSERT_EQ(0, test_open(lower_dl1, O_RDONLY));
@@ -4275,8 +4355,8 @@ TEST_F_FORK(layout2_overlay, same_content_different_file)
        size_t i;
        const char *path_entry;
 
-       if (!supports_overlayfs())
-               SKIP(return, "overlayfs is not supported");
+       if (self->skip_test)
+               SKIP(return, "overlayfs is not supported (test)");
 
        /* Sets rules on base directories (i.e. outside overlay scope). */
        ruleset_fd = create_ruleset(_metadata, ACCESS_RW, layer1_base);
@@ -4423,4 +4503,261 @@ TEST_F_FORK(layout2_overlay, same_content_different_file)
        }
 }
 
+FIXTURE(layout3_fs)
+{
+       bool has_created_dir;
+       bool has_created_file;
+       char *dir_path;
+       bool skip_test;
+};
+
+FIXTURE_VARIANT(layout3_fs)
+{
+       const struct mnt_opt mnt;
+       const char *const file_path;
+       unsigned int cwd_fs_magic;
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(layout3_fs, tmpfs) {
+       /* clang-format on */
+       .mnt = mnt_tmp,
+       .file_path = file1_s1d1,
+};
+
+FIXTURE_VARIANT_ADD(layout3_fs, ramfs) {
+       .mnt = {
+               .type = "ramfs",
+               .data = "mode=700",
+       },
+       .file_path = TMP_DIR "/dir/file",
+};
+
+FIXTURE_VARIANT_ADD(layout3_fs, cgroup2) {
+       .mnt = {
+               .type = "cgroup2",
+       },
+       .file_path = TMP_DIR "/test/cgroup.procs",
+};
+
+FIXTURE_VARIANT_ADD(layout3_fs, proc) {
+       .mnt = {
+               .type = "proc",
+       },
+       .file_path = TMP_DIR "/self/status",
+};
+
+FIXTURE_VARIANT_ADD(layout3_fs, sysfs) {
+       .mnt = {
+               .type = "sysfs",
+       },
+       .file_path = TMP_DIR "/kernel/notes",
+};
+
+FIXTURE_VARIANT_ADD(layout3_fs, hostfs) {
+       .mnt = {
+               .source = TMP_DIR,
+               .flags = MS_BIND,
+       },
+       .file_path = TMP_DIR "/dir/file",
+       .cwd_fs_magic = HOSTFS_SUPER_MAGIC,
+};
+
+FIXTURE_SETUP(layout3_fs)
+{
+       struct stat statbuf;
+       const char *slash;
+       size_t dir_len;
+
+       if (!supports_filesystem(variant->mnt.type) ||
+           !cwd_matches_fs(variant->cwd_fs_magic)) {
+               self->skip_test = true;
+               SKIP(return, "this filesystem is not supported (setup)");
+       }
+
+       slash = strrchr(variant->file_path, '/');
+       ASSERT_NE(slash, NULL);
+       dir_len = (size_t)slash - (size_t)variant->file_path;
+       ASSERT_LT(0, dir_len);
+       self->dir_path = malloc(dir_len + 1);
+       self->dir_path[dir_len] = '\0';
+       strncpy(self->dir_path, variant->file_path, dir_len);
+
+       prepare_layout_opt(_metadata, &variant->mnt);
+
+       /* Creates directory when required. */
+       if (stat(self->dir_path, &statbuf)) {
+               set_cap(_metadata, CAP_DAC_OVERRIDE);
+               EXPECT_EQ(0, mkdir(self->dir_path, 0700))
+               {
+                       TH_LOG("Failed to create directory \"%s\": %s",
+                              self->dir_path, strerror(errno));
+                       free(self->dir_path);
+                       self->dir_path = NULL;
+               }
+               self->has_created_dir = true;
+               clear_cap(_metadata, CAP_DAC_OVERRIDE);
+       }
+
+       /* Creates file when required. */
+       if (stat(variant->file_path, &statbuf)) {
+               int fd;
+
+               set_cap(_metadata, CAP_DAC_OVERRIDE);
+               fd = creat(variant->file_path, 0600);
+               EXPECT_LE(0, fd)
+               {
+                       TH_LOG("Failed to create file \"%s\": %s",
+                              variant->file_path, strerror(errno));
+               }
+               EXPECT_EQ(0, close(fd));
+               self->has_created_file = true;
+               clear_cap(_metadata, CAP_DAC_OVERRIDE);
+       }
+}
+
+FIXTURE_TEARDOWN(layout3_fs)
+{
+       if (self->skip_test)
+               SKIP(return, "this filesystem is not supported (teardown)");
+
+       if (self->has_created_file) {
+               set_cap(_metadata, CAP_DAC_OVERRIDE);
+               /*
+                * Don't check for error because the file might already
+                * have been removed (cf. release_inodes test).
+                */
+               unlink(variant->file_path);
+               clear_cap(_metadata, CAP_DAC_OVERRIDE);
+       }
+
+       if (self->has_created_dir) {
+               set_cap(_metadata, CAP_DAC_OVERRIDE);
+               /*
+                * Don't check for error because the directory might already
+                * have been removed (cf. release_inodes test).
+                */
+               rmdir(self->dir_path);
+               clear_cap(_metadata, CAP_DAC_OVERRIDE);
+       }
+       free(self->dir_path);
+       self->dir_path = NULL;
+
+       cleanup_layout(_metadata);
+}
+
+static void layer3_fs_tag_inode(struct __test_metadata *const _metadata,
+                               FIXTURE_DATA(layout3_fs) * self,
+                               const FIXTURE_VARIANT(layout3_fs) * variant,
+                               const char *const rule_path)
+{
+       const struct rule layer1_allow_read_file[] = {
+               {
+                       .path = rule_path,
+                       .access = LANDLOCK_ACCESS_FS_READ_FILE,
+               },
+               {},
+       };
+       const struct landlock_ruleset_attr layer2_deny_everything_attr = {
+               .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE,
+       };
+       const char *const dev_null_path = "/dev/null";
+       int ruleset_fd;
+
+       if (self->skip_test)
+               SKIP(return, "this filesystem is not supported (test)");
+
+       /* Checks without Landlock. */
+       EXPECT_EQ(0, test_open(dev_null_path, O_RDONLY | O_CLOEXEC));
+       EXPECT_EQ(0, test_open(variant->file_path, O_RDONLY | O_CLOEXEC));
+
+       ruleset_fd = create_ruleset(_metadata, LANDLOCK_ACCESS_FS_READ_FILE,
+                                   layer1_allow_read_file);
+       EXPECT_LE(0, ruleset_fd);
+       enforce_ruleset(_metadata, ruleset_fd);
+       EXPECT_EQ(0, close(ruleset_fd));
+
+       EXPECT_EQ(EACCES, test_open(dev_null_path, O_RDONLY | O_CLOEXEC));
+       EXPECT_EQ(0, test_open(variant->file_path, O_RDONLY | O_CLOEXEC));
+
+       /* Forbids file reading. */
+       ruleset_fd =
+               landlock_create_ruleset(&layer2_deny_everything_attr,
+                                       sizeof(layer2_deny_everything_attr), 0);
+       EXPECT_LE(0, ruleset_fd);
+       enforce_ruleset(_metadata, ruleset_fd);
+       EXPECT_EQ(0, close(ruleset_fd));
+
+       /* Checks with Landlock and forbidden access. */
+       EXPECT_EQ(EACCES, test_open(dev_null_path, O_RDONLY | O_CLOEXEC));
+       EXPECT_EQ(EACCES, test_open(variant->file_path, O_RDONLY | O_CLOEXEC));
+}
+
+/* Matrix of tests to check file hierarchy evaluation. */
+
+TEST_F_FORK(layout3_fs, tag_inode_dir_parent)
+{
+       /* The current directory must not be the root for this test. */
+       layer3_fs_tag_inode(_metadata, self, variant, ".");
+}
+
+TEST_F_FORK(layout3_fs, tag_inode_dir_mnt)
+{
+       layer3_fs_tag_inode(_metadata, self, variant, TMP_DIR);
+}
+
+TEST_F_FORK(layout3_fs, tag_inode_dir_child)
+{
+       layer3_fs_tag_inode(_metadata, self, variant, self->dir_path);
+}
+
+TEST_F_FORK(layout3_fs, tag_inode_file)
+{
+       layer3_fs_tag_inode(_metadata, self, variant, variant->file_path);
+}
+
+/* Light version of layout1.release_inodes */
+TEST_F_FORK(layout3_fs, release_inodes)
+{
+       const struct rule layer1[] = {
+               {
+                       .path = TMP_DIR,
+                       .access = LANDLOCK_ACCESS_FS_READ_DIR,
+               },
+               {},
+       };
+       int ruleset_fd;
+
+       if (self->skip_test)
+               SKIP(return, "this filesystem is not supported (test)");
+
+       /* Clean up so that the teardown does not fail. */
+       if (self->has_created_file)
+               EXPECT_EQ(0, remove_path(variant->file_path));
+
+       if (self->has_created_dir)
+               /* Don't check for errors because of cgroup specifics. */
+               remove_path(self->dir_path);
+
+       ruleset_fd =
+               create_ruleset(_metadata, LANDLOCK_ACCESS_FS_READ_DIR, layer1);
+       ASSERT_LE(0, ruleset_fd);
+
+       /* Unmount the filesystem while it is being used by a ruleset. */
+       set_cap(_metadata, CAP_SYS_ADMIN);
+       ASSERT_EQ(0, umount(TMP_DIR));
+       clear_cap(_metadata, CAP_SYS_ADMIN);
+
+       /* Replaces with a new mount point to simplify FIXTURE_TEARDOWN. */
+       set_cap(_metadata, CAP_SYS_ADMIN);
+       ASSERT_EQ(0, mount_opt(&mnt_tmp, TMP_DIR));
+       clear_cap(_metadata, CAP_SYS_ADMIN);
+
+       enforce_ruleset(_metadata, ruleset_fd);
+       ASSERT_EQ(0, close(ruleset_fd));
+
+       /* Checks that access to the new mount point is denied. */
+       ASSERT_EQ(EACCES, test_open(TMP_DIR, O_RDONLY));
+}
+
 TEST_HARNESS_MAIN
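
The layout3_fs fixture above relies on the struct mnt_opt and mount_opt() helpers defined earlier in fs_test.c, outside the hunks shown here. A minimal sketch of what they presumably look like, assuming only the fields the variants actually use (source, type, flags, data); the patch's real definitions may differ in detail:

#include <sys/mount.h>

struct mnt_opt {
	const char *const source;	/* if NULL, the fs type is used as source */
	const char *const type;
	const unsigned long flags;
	const char *const data;
};

/* Wraps mount(2) with the per-variant options; returns 0 on success. */
static int mount_opt(const struct mnt_opt *const mnt, const char *const target)
{
	return mount(mnt->source ?: mnt->type, target, mnt->type, mnt->flags,
		     mnt->data);
}

With mnt_tmp presumably carrying { .type = "tmpfs", .data = "size=4m,mode=700" }, the mount_opt(&mnt_tmp, UPPER_BASE) call matches the open-coded mount() it replaces in the layout2_overlay setup above.
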
index 0540046..d178542 100644 (file)
@@ -44,10 +44,26 @@ endif
 selfdir = $(realpath $(dir $(filter %/lib.mk,$(MAKEFILE_LIST))))
 top_srcdir = $(selfdir)/../../..
 
-ifeq ($(KHDR_INCLUDES),)
-KHDR_INCLUDES := -isystem $(top_srcdir)/usr/include
+ifeq ("$(origin O)", "command line")
+  KBUILD_OUTPUT := $(O)
 endif
 
+ifneq ($(KBUILD_OUTPUT),)
+  # Make's built-in functions such as $(abspath ...) and $(realpath ...) cannot
+  # expand the shell special character '~'. We therefore take a somewhat tedious route here.
+  abs_objtree := $(shell cd $(top_srcdir) && mkdir -p $(KBUILD_OUTPUT) && cd $(KBUILD_OUTPUT) && pwd)
+  $(if $(abs_objtree),, \
+    $(error failed to create output directory "$(KBUILD_OUTPUT)"))
+  # $(realpath ...) resolves symlinks
+  abs_objtree := $(realpath $(abs_objtree))
+  KHDR_DIR := ${abs_objtree}/usr/include
+else
+  abs_srctree := $(shell cd $(top_srcdir) && pwd)
+  KHDR_DIR := ${abs_srctree}/usr/include
+endif
+
+KHDR_INCLUDES := -isystem $(KHDR_DIR)
+
 # The following are built by lib.mk common compile rules.
 # TEST_CUSTOM_PROGS should be used by tests that require
 # custom build rule and prevent common build rule use.
@@ -58,7 +74,25 @@ TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS))
 TEST_GEN_PROGS_EXTENDED := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS_EXTENDED))
 TEST_GEN_FILES := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_FILES))
 
-all: $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES)
+all: kernel_header_files $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) \
+     $(TEST_GEN_FILES)
+
+kernel_header_files:
+       @ls $(KHDR_DIR)/linux/*.h >/dev/null 2>/dev/null;                      \
+       if [ $$? -ne 0 ]; then                                                 \
+            RED='\033[1;31m';                                                  \
+            NOCOLOR='\033[0m';                                                 \
+            echo;                                                              \
+            echo -e "$${RED}error$${NOCOLOR}: missing kernel header files.";   \
+            echo "Please run this and try again:";                             \
+            echo;                                                              \
+            echo "    cd $(top_srcdir)";                                       \
+            echo "    make headers";                                           \
+            echo;                                                              \
+           exit 1; \
+       fi
+
+.PHONY: kernel_header_files
 
 define RUN_TESTS
        BASE_DIR="$(selfdir)";                  \
index 0f6aef2..2c44e11 100644 (file)
 #include <time.h>
 #include <linux/videodev2.h>
 
-int main(int argc, char **argv)
+#define PRIORITY_MAX 4
+
+int priority_test(int fd)
 {
-       int opt;
-       char video_dev[256];
-       int count;
-       struct v4l2_tuner vtuner;
-       struct v4l2_capability vcap;
+       /* This test will try to update the priority associated with a file descriptor */
+
+       enum v4l2_priority old_priority, new_priority, priority_to_compare;
        int ret;
-       int fd;
+       int result = 0;
 
-       if (argc < 2) {
-               printf("Usage: %s [-d </dev/videoX>]\n", argv[0]);
-               exit(-1);
+       ret = ioctl(fd, VIDIOC_G_PRIORITY, &old_priority);
+       if (ret < 0) {
+               printf("Failed to get priority: %s\n", strerror(errno));
+               return -1;
+       }
+       new_priority = (old_priority + 1) % PRIORITY_MAX;
+       ret = ioctl(fd, VIDIOC_S_PRIORITY, &new_priority);
+       if (ret < 0) {
+               printf("Failed to set priority: %s\n", strerror(errno));
+               return -1;
+       }
+       ret = ioctl(fd, VIDIOC_G_PRIORITY, &priority_to_compare);
+       if (ret < 0) {
+               printf("Failed to get new priority: %s\n", strerror(errno));
+               result = -1;
+               goto cleanup;
+       }
+       if (priority_to_compare != new_priority) {
+               printf("Priority wasn't set - test failed\n");
+               result = -1;
        }
 
-       /* Process arguments */
-       while ((opt = getopt(argc, argv, "d:")) != -1) {
-               switch (opt) {
-               case 'd':
-                       strncpy(video_dev, optarg, sizeof(video_dev) - 1);
-                       video_dev[sizeof(video_dev)-1] = '\0';
-                       break;
-               default:
-                       printf("Usage: %s [-d </dev/videoX>]\n", argv[0]);
-                       exit(-1);
-               }
+cleanup:
+       ret = ioctl(fd, VIDIOC_S_PRIORITY, &old_priority);
+       if (ret < 0) {
+               printf("Failed to restore priority: %s\n", strerror(errno));
+               return -1;
        }
+       return result;
+}
+
+int loop_test(int fd)
+{
+       int count;
+       struct v4l2_tuner vtuner;
+       struct v4l2_capability vcap;
+       int ret;
 
        /* Generate a random number of iterations */
        srand((unsigned int) time(NULL));
        count = rand();
 
-       /* Open Video device and keep it open */
-       fd = open(video_dev, O_RDWR);
-       if (fd == -1) {
-               printf("Video Device open errno %s\n", strerror(errno));
-               exit(-1);
-       }
-
        printf("\nNote:\n"
               "While test is running, remove the device or unbind\n"
               "driver and ensure there are no use after free errors\n"
@@ -98,4 +111,46 @@ int main(int argc, char **argv)
                sleep(10);
                count--;
        }
+       return 0;
+}
+
+int main(int argc, char **argv)
+{
+       int opt;
+       char video_dev[256];
+       int fd;
+       int test_result;
+
+       if (argc < 2) {
+               printf("Usage: %s [-d </dev/videoX>]\n", argv[0]);
+               exit(-1);
+       }
+
+       /* Process arguments */
+       while ((opt = getopt(argc, argv, "d:")) != -1) {
+               switch (opt) {
+               case 'd':
+                       strncpy(video_dev, optarg, sizeof(video_dev) - 1);
+                       video_dev[sizeof(video_dev)-1] = '\0';
+                       break;
+               default:
+                       printf("Usage: %s [-d </dev/videoX>]\n", argv[0]);
+                       exit(-1);
+               }
+       }
+
+       /* Open Video device and keep it open */
+       fd = open(video_dev, O_RDWR);
+       if (fd == -1) {
+               printf("Video Device open errno %s\n", strerror(errno));
+               exit(-1);
+       }
+
+       test_result = priority_test(fd);
+       if (!test_result)
+               printf("Priority test - PASSED\n");
+       else
+               printf("Priority test - FAILED\n");
+
+       loop_test(fd);
 }
index 8917455..7e2a982 100644 (file)
@@ -39,3 +39,6 @@ local_config.h
 local_config.mk
 ksm_functional_tests
 mdwe_test
+gup_longterm
+mkdirty
+va_high_addr_switch
index 4f0c50c..66d7c07 100644 (file)
@@ -32,11 +32,12 @@ endif
 # LDLIBS.
 MAKEFLAGS += --no-builtin-rules
 
-CFLAGS = -Wall -I $(top_srcdir) -I $(top_srcdir)/tools/include/uapi $(EXTRA_CFLAGS) $(KHDR_INCLUDES)
+CFLAGS = -Wall -I $(top_srcdir) $(EXTRA_CFLAGS) $(KHDR_INCLUDES)
 LDLIBS = -lrt -lpthread
 
 TEST_GEN_PROGS = cow
 TEST_GEN_PROGS += compaction_test
+TEST_GEN_PROGS += gup_longterm
 TEST_GEN_PROGS += gup_test
 TEST_GEN_PROGS += hmm-tests
 TEST_GEN_PROGS += hugetlb-madvise
@@ -167,6 +168,8 @@ endif
 # IOURING_EXTRA_LIBS may get set in local_config.mk, or it may be left empty.
 $(OUTPUT)/cow: LDLIBS += $(IOURING_EXTRA_LIBS)
 
+$(OUTPUT)/gup_longterm: LDLIBS += $(IOURING_EXTRA_LIBS)
+
 $(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap
 
 $(OUTPUT)/ksm_tests: LDLIBS += -lnuma
index dc9d6fe..7324ce5 100644 (file)
@@ -14,8 +14,8 @@
 #include <unistd.h>
 #include <errno.h>
 #include <fcntl.h>
-#include <dirent.h>
 #include <assert.h>
+#include <linux/mman.h>
 #include <sys/mman.h>
 #include <sys/ioctl.h>
 #include <sys/wait.h>
 #include "../kselftest.h"
 #include "vm_util.h"
 
-#ifndef MADV_PAGEOUT
-#define MADV_PAGEOUT 21
-#endif
-#ifndef MADV_COLLAPSE
-#define MADV_COLLAPSE 25
-#endif
-
 static size_t pagesize;
 static int pagemap_fd;
 static size_t thpsize;
@@ -70,31 +63,6 @@ static void detect_huge_zeropage(void)
        close(fd);
 }
 
-static void detect_hugetlbsizes(void)
-{
-       DIR *dir = opendir("/sys/kernel/mm/hugepages/");
-
-       if (!dir)
-               return;
-
-       while (nr_hugetlbsizes < ARRAY_SIZE(hugetlbsizes)) {
-               struct dirent *entry = readdir(dir);
-               size_t kb;
-
-               if (!entry)
-                       break;
-               if (entry->d_type != DT_DIR)
-                       continue;
-               if (sscanf(entry->d_name, "hugepages-%zukB", &kb) != 1)
-                       continue;
-               hugetlbsizes[nr_hugetlbsizes] = kb * 1024;
-               nr_hugetlbsizes++;
-               ksft_print_msg("[INFO] detected hugetlb size: %zu KiB\n",
-                              kb);
-       }
-       closedir(dir);
-}
-
 static bool range_is_swapped(void *addr, size_t size)
 {
        for (; size; addr += pagesize, size -= pagesize)
@@ -1717,7 +1685,8 @@ int main(int argc, char **argv)
        if (thpsize)
                ksft_print_msg("[INFO] detected THP size: %zu KiB\n",
                               thpsize / 1024);
-       detect_hugetlbsizes();
+       nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes,
+                                                   ARRAY_SIZE(hugetlbsizes));
        detect_huge_zeropage();
 
        ksft_print_header();
diff --git a/tools/testing/selftests/mm/gup_longterm.c b/tools/testing/selftests/mm/gup_longterm.c
new file mode 100644 (file)
index 0000000..d33d3e6
--- /dev/null
@@ -0,0 +1,459 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * GUP long-term page pinning tests.
+ *
+ * Copyright 2023, Red Hat, Inc.
+ *
+ * Author(s): David Hildenbrand <david@redhat.com>
+ */
+#define _GNU_SOURCE
+#include <stdlib.h>
+#include <string.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <assert.h>
+#include <sys/mman.h>
+#include <sys/ioctl.h>
+#include <sys/vfs.h>
+#include <linux/magic.h>
+#include <linux/memfd.h>
+
+#include "local_config.h"
+#ifdef LOCAL_CONFIG_HAVE_LIBURING
+#include <liburing.h>
+#endif /* LOCAL_CONFIG_HAVE_LIBURING */
+
+#include "../../../../mm/gup_test.h"
+#include "../kselftest.h"
+#include "vm_util.h"
+
+static size_t pagesize;
+static int nr_hugetlbsizes;
+static size_t hugetlbsizes[10];
+static int gup_fd;
+
+static __fsword_t get_fs_type(int fd)
+{
+       struct statfs fs;
+       int ret;
+
+       do {
+               ret = fstatfs(fd, &fs);
+       } while (ret && errno == EINTR);
+
+       return ret ? 0 : fs.f_type;
+}
+
+static bool fs_is_unknown(__fsword_t fs_type)
+{
+       /*
+        * We only support some filesystems in our tests when dealing with
+        * R/W long-term pinning. For these filesystems, we can be fairly sure
+        * whether they support it or not.
+        */
+       switch (fs_type) {
+       case TMPFS_MAGIC:
+       case HUGETLBFS_MAGIC:
+       case BTRFS_SUPER_MAGIC:
+       case EXT4_SUPER_MAGIC:
+       case XFS_SUPER_MAGIC:
+               return false;
+       default:
+               return true;
+       }
+}
+
+static bool fs_supports_writable_longterm_pinning(__fsword_t fs_type)
+{
+       assert(!fs_is_unknown(fs_type));
+       switch (fs_type) {
+       case TMPFS_MAGIC:
+       case HUGETLBFS_MAGIC:
+               return true;
+       default:
+               return false;
+       }
+}
+
+enum test_type {
+       TEST_TYPE_RO,
+       TEST_TYPE_RO_FAST,
+       TEST_TYPE_RW,
+       TEST_TYPE_RW_FAST,
+#ifdef LOCAL_CONFIG_HAVE_LIBURING
+       TEST_TYPE_IOURING,
+#endif /* LOCAL_CONFIG_HAVE_LIBURING */
+};
+
+static void do_test(int fd, size_t size, enum test_type type, bool shared)
+{
+       __fsword_t fs_type = get_fs_type(fd);
+       bool should_work;
+       char *mem;
+       int ret;
+
+       if (ftruncate(fd, size)) {
+               ksft_test_result_fail("ftruncate() failed\n");
+               return;
+       }
+
+       if (fallocate(fd, 0, 0, size)) {
+               if (size == pagesize)
+                       ksft_test_result_fail("fallocate() failed\n");
+               else
+                       ksft_test_result_skip("need more free huge pages\n");
+               return;
+       }
+
+       mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
+                  shared ? MAP_SHARED : MAP_PRIVATE, fd, 0);
+       if (mem == MAP_FAILED) {
+               if (size == pagesize || shared)
+                       ksft_test_result_fail("mmap() failed\n");
+               else
+                       ksft_test_result_skip("need more free huge pages\n");
+               return;
+       }
+
+       /*
+        * Fault in the page writable such that GUP-fast can eventually pin
+        * it immediately.
+        */
+       memset(mem, 0, size);
+
+       switch (type) {
+       case TEST_TYPE_RO:
+       case TEST_TYPE_RO_FAST:
+       case TEST_TYPE_RW:
+       case TEST_TYPE_RW_FAST: {
+               struct pin_longterm_test args;
+               const bool fast = type == TEST_TYPE_RO_FAST ||
+                                 type == TEST_TYPE_RW_FAST;
+               const bool rw = type == TEST_TYPE_RW ||
+                               type == TEST_TYPE_RW_FAST;
+
+               if (gup_fd < 0) {
+                       ksft_test_result_skip("gup_test not available\n");
+                       break;
+               }
+
+               if (rw && shared && fs_is_unknown(fs_type)) {
+                       ksft_test_result_skip("Unknown filesystem\n");
+                       return;
+               }
+               /*
+                * R/O pinning or pinning in a private mapping is always
+                * expected to work. Otherwise, we expect long-term R/W pinning
+                * to succeed only for special filesystems.
+                */
+               should_work = !shared || !rw ||
+                             fs_supports_writable_longterm_pinning(fs_type);
+
+               args.addr = (__u64)(uintptr_t)mem;
+               args.size = size;
+               args.flags = fast ? PIN_LONGTERM_TEST_FLAG_USE_FAST : 0;
+               args.flags |= rw ? PIN_LONGTERM_TEST_FLAG_USE_WRITE : 0;
+               ret = ioctl(gup_fd, PIN_LONGTERM_TEST_START, &args);
+               if (ret && errno == EINVAL) {
+                       ksft_test_result_skip("PIN_LONGTERM_TEST_START failed\n");
+                       break;
+               } else if (ret && errno == EFAULT) {
+                       ksft_test_result(!should_work, "Should have failed\n");
+                       break;
+               } else if (ret) {
+                       ksft_test_result_fail("PIN_LONGTERM_TEST_START failed\n");
+                       break;
+               }
+
+               if (ioctl(gup_fd, PIN_LONGTERM_TEST_STOP))
+                       ksft_print_msg("[INFO] PIN_LONGTERM_TEST_STOP failed\n");
+
+               /*
+                * TODO: if the kernel ever supports long-term R/W pinning on
+                * some previously unsupported filesystems, we might want to
+                * perform some additional tests for possible data corruptions.
+                */
+               ksft_test_result(should_work, "Should have worked\n");
+               break;
+       }
+#ifdef LOCAL_CONFIG_HAVE_LIBURING
+       case TEST_TYPE_IOURING: {
+               struct io_uring ring;
+               struct iovec iov;
+
+               /* io_uring always pins pages writable. */
+               if (shared && fs_is_unknown(fs_type)) {
+                       ksft_test_result_skip("Unknown filesystem\n");
+                       return;
+               }
+               should_work = !shared ||
+                             fs_supports_writable_longterm_pinning(fs_type);
+
+               /* Skip on errors, as we might just lack kernel support. */
+               ret = io_uring_queue_init(1, &ring, 0);
+               if (ret < 0) {
+                       ksft_test_result_skip("io_uring_queue_init() failed\n");
+                       break;
+               }
+               /*
+                * Register the range as a fixed buffer. This will FOLL_WRITE |
+                * FOLL_PIN | FOLL_LONGTERM the range.
+                */
+               iov.iov_base = mem;
+               iov.iov_len = size;
+               ret = io_uring_register_buffers(&ring, &iov, 1);
+               /* Only new kernels return EFAULT. */
+               if (ret && (errno == ENOSPC || errno == EOPNOTSUPP ||
+                           errno == EFAULT)) {
+                       ksft_test_result(!should_work, "Should have failed\n");
+               } else if (ret) {
+                       /*
+                        * We might just lack support or have insufficient
+                        * MEMLOCK limits.
+                        */
+                       ksft_test_result_skip("io_uring_register_buffers() failed\n");
+               } else {
+                       ksft_test_result(should_work, "Should have worked\n");
+                       io_uring_unregister_buffers(&ring);
+               }
+
+               io_uring_queue_exit(&ring);
+               break;
+       }
+#endif /* LOCAL_CONFIG_HAVE_LIBURING */
+       default:
+               assert(false);
+       }
+
+       munmap(mem, size);
+}
+
+typedef void (*test_fn)(int fd, size_t size);
+
+static void run_with_memfd(test_fn fn, const char *desc)
+{
+       int fd;
+
+       ksft_print_msg("[RUN] %s ... with memfd\n", desc);
+
+       fd = memfd_create("test", 0);
+       if (fd < 0) {
+               ksft_test_result_fail("memfd_create() failed\n");
+               return;
+       }
+
+       fn(fd, pagesize);
+       close(fd);
+}
+
+static void run_with_tmpfile(test_fn fn, const char *desc)
+{
+       FILE *file;
+       int fd;
+
+       ksft_print_msg("[RUN] %s ... with tmpfile\n", desc);
+
+       file = tmpfile();
+       if (!file) {
+               ksft_test_result_fail("tmpfile() failed\n");
+               return;
+       }
+
+       fd = fileno(file);
+       if (fd < 0) {
+               ksft_test_result_fail("fileno() failed\n");
+               return;
+       }
+
+       fn(fd, pagesize);
+       fclose(file);
+}
+
+static void run_with_local_tmpfile(test_fn fn, const char *desc)
+{
+       char filename[] = __FILE__"_tmpfile_XXXXXX";
+       int fd;
+
+       ksft_print_msg("[RUN] %s ... with local tmpfile\n", desc);
+
+       fd = mkstemp(filename);
+       if (fd < 0) {
+               ksft_test_result_fail("mkstemp() failed\n");
+               return;
+       }
+
+       if (unlink(filename)) {
+               ksft_test_result_fail("unlink() failed\n");
+               goto close;
+       }
+
+       fn(fd, pagesize);
+close:
+       close(fd);
+}
+
+static void run_with_memfd_hugetlb(test_fn fn, const char *desc,
+                                  size_t hugetlbsize)
+{
+       int flags = MFD_HUGETLB;
+       int fd;
+
+       ksft_print_msg("[RUN] %s ... with memfd hugetlb (%zu kB)\n", desc,
+                      hugetlbsize / 1024);
+
+       flags |= __builtin_ctzll(hugetlbsize) << MFD_HUGE_SHIFT;
+
+       fd = memfd_create("test", flags);
+       if (fd < 0) {
+               ksft_test_result_skip("memfd_create() failed\n");
+               return;
+       }
+
+       fn(fd, hugetlbsize);
+       close(fd);
+}
+
+struct test_case {
+       const char *desc;
+       test_fn fn;
+};
+
+static void test_shared_rw_pin(int fd, size_t size)
+{
+       do_test(fd, size, TEST_TYPE_RW, true);
+}
+
+static void test_shared_rw_fast_pin(int fd, size_t size)
+{
+       do_test(fd, size, TEST_TYPE_RW_FAST, true);
+}
+
+static void test_shared_ro_pin(int fd, size_t size)
+{
+       do_test(fd, size, TEST_TYPE_RO, true);
+}
+
+static void test_shared_ro_fast_pin(int fd, size_t size)
+{
+       do_test(fd, size, TEST_TYPE_RO_FAST, true);
+}
+
+static void test_private_rw_pin(int fd, size_t size)
+{
+       do_test(fd, size, TEST_TYPE_RW, false);
+}
+
+static void test_private_rw_fast_pin(int fd, size_t size)
+{
+       do_test(fd, size, TEST_TYPE_RW_FAST, false);
+}
+
+static void test_private_ro_pin(int fd, size_t size)
+{
+       do_test(fd, size, TEST_TYPE_RO, false);
+}
+
+static void test_private_ro_fast_pin(int fd, size_t size)
+{
+       do_test(fd, size, TEST_TYPE_RO_FAST, false);
+}
+
+#ifdef LOCAL_CONFIG_HAVE_LIBURING
+static void test_shared_iouring(int fd, size_t size)
+{
+       do_test(fd, size, TEST_TYPE_IOURING, true);
+}
+
+static void test_private_iouring(int fd, size_t size)
+{
+       do_test(fd, size, TEST_TYPE_IOURING, false);
+}
+#endif /* LOCAL_CONFIG_HAVE_LIBURING */
+
+static const struct test_case test_cases[] = {
+       {
+               "R/W longterm GUP pin in MAP_SHARED file mapping",
+               test_shared_rw_pin,
+       },
+       {
+               "R/W longterm GUP-fast pin in MAP_SHARED file mapping",
+               test_shared_rw_fast_pin,
+       },
+       {
+               "R/O longterm GUP pin in MAP_SHARED file mapping",
+               test_shared_ro_pin,
+       },
+       {
+               "R/O longterm GUP-fast pin in MAP_SHARED file mapping",
+               test_shared_ro_fast_pin,
+       },
+       {
+               "R/W longterm GUP pin in MAP_PRIVATE file mapping",
+               test_private_rw_pin,
+       },
+       {
+               "R/W longterm GUP-fast pin in MAP_PRIVATE file mapping",
+               test_private_rw_fast_pin,
+       },
+       {
+               "R/O longterm GUP pin in MAP_PRIVATE file mapping",
+               test_private_ro_pin,
+       },
+       {
+               "R/O longterm GUP-fast pin in MAP_PRIVATE file mapping",
+               test_private_ro_fast_pin,
+       },
+#ifdef LOCAL_CONFIG_HAVE_LIBURING
+       {
+               "io_uring fixed buffer with MAP_SHARED file mapping",
+               test_shared_iouring,
+       },
+       {
+               "io_uring fixed buffer with MAP_PRIVATE file mapping",
+               test_private_iouring,
+       },
+#endif /* LOCAL_CONFIG_HAVE_LIBURING */
+};
+
+static void run_test_case(struct test_case const *test_case)
+{
+       int i;
+
+       run_with_memfd(test_case->fn, test_case->desc);
+       run_with_tmpfile(test_case->fn, test_case->desc);
+       run_with_local_tmpfile(test_case->fn, test_case->desc);
+       for (i = 0; i < nr_hugetlbsizes; i++)
+               run_with_memfd_hugetlb(test_case->fn, test_case->desc,
+                                      hugetlbsizes[i]);
+}
+
+static int tests_per_test_case(void)
+{
+       return 3 + nr_hugetlbsizes;
+}
+
+int main(int argc, char **argv)
+{
+       int i, err;
+
+       pagesize = getpagesize();
+       nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes,
+                                                   ARRAY_SIZE(hugetlbsizes));
+
+       ksft_print_header();
+       ksft_set_plan(ARRAY_SIZE(test_cases) * tests_per_test_case());
+
+       gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
+
+       for (i = 0; i < ARRAY_SIZE(test_cases); i++)
+               run_test_case(&test_cases[i]);
+
+       err = ksft_get_fail_cnt();
+       if (err)
+               ksft_exit_fail_msg("%d out of %d tests failed\n",
+                                  err, ksft_test_num());
+       return ksft_exit_pass();
+}
index e2527f3..478bb1e 100644 (file)
 #include <sys/shm.h>
 #include <sys/mman.h>
 
-#ifndef SHM_HUGETLB
-#define SHM_HUGETLB 04000
-#endif
-
 #define LENGTH (256UL*1024*1024)
 
 #define dprintf(x)  printf(x)
index 557bdbd..5b354c2 100644 (file)
 
 #define MAP_LENGTH             (2UL * 1024 * 1024)
 
-#ifndef MAP_HUGETLB
-#define MAP_HUGETLB            0x40000 /* arch specific */
-#endif
-
 #define PAGE_SIZE              4096
 
 #define PAGE_COMPOUND_HEAD     (1UL << 15)
index 28426e3..d55322d 100644 (file)
@@ -65,11 +65,15 @@ void write_fault_pages(void *addr, unsigned long nr_pages)
 
 void read_fault_pages(void *addr, unsigned long nr_pages)
 {
-       unsigned long dummy = 0;
+       volatile unsigned long dummy = 0;
        unsigned long i;
 
-       for (i = 0; i < nr_pages; i++)
+       for (i = 0; i < nr_pages; i++) {
                dummy += *((unsigned long *)(addr + (i * huge_page_size)));
+
+               /* Prevent the compiler from optimizing out the entire loop: */
+               asm volatile("" : "+r" (dummy));
+       }
 }
 
 int main(int argc, char **argv)
index 97adc0f..030667c 100644 (file)
@@ -11,6 +11,7 @@
 #include <string.h>
 #include <unistd.h>
 
+#include <linux/mman.h>
 #include <sys/mman.h>
 #include <sys/wait.h>
 #include <sys/types.h>
 
 #include "vm_util.h"
 
-#ifndef MADV_PAGEOUT
-#define MADV_PAGEOUT 21
-#endif
-#ifndef MADV_POPULATE_READ
-#define MADV_POPULATE_READ 22
-#endif
-#ifndef MADV_COLLAPSE
-#define MADV_COLLAPSE 25
-#endif
-
 #define BASE_ADDR ((void *)(1UL << 30))
 static unsigned long hpage_pmd_size;
 static unsigned long page_size;
index 262eae6..6054724 100644 (file)
 #include "../kselftest.h"
 #include "vm_util.h"
 
-#ifndef MADV_POPULATE_READ
-#define MADV_POPULATE_READ     22
-#endif /* MADV_POPULATE_READ */
-#ifndef MADV_POPULATE_WRITE
-#define MADV_POPULATE_WRITE    23
-#endif /* MADV_POPULATE_WRITE */
-
 /*
  * For now, we're using 2 MiB of private anonymous memory for all tests.
  */
index eed4432..598159f 100644 (file)
 #include <stdlib.h>
 #include <unistd.h>
 
-#ifndef MAP_FIXED_NOREPLACE
-#define MAP_FIXED_NOREPLACE 0x100000
-#endif
-
 static void dump_maps(void)
 {
        char cmd[32];
index 312889e..1932815 100644 (file)
 #define LENGTH (256UL*1024*1024)
 #define PROTECTION (PROT_READ | PROT_WRITE)
 
-#ifndef MAP_HUGETLB
-#define MAP_HUGETLB 0x40000 /* arch specific */
-#endif
-
-#ifndef MAP_HUGE_SHIFT
-#define MAP_HUGE_SHIFT 26
-#endif
-
-#ifndef MAP_HUGE_MASK
-#define MAP_HUGE_MASK 0x3f
-#endif
-
 /* Only ia64 requires this */
 #ifdef __ia64__
 #define ADDR (void *)(0x8000000000000000UL)
index 6b8aeaa..240f2d9 100644 (file)
@@ -17,9 +17,7 @@
 #include <string.h>
 #include <unistd.h>
 
-#ifndef MMAP_SZ
 #define MMAP_SZ                4096
-#endif
 
 #define BUG_ON(condition, description)                                 \
        do {                                                            \
index 1cec842..3795815 100644 (file)
@@ -95,12 +95,15 @@ int migrate(uint64_t *ptr, int n1, int n2)
 
 void *access_mem(void *ptr)
 {
-       uint64_t y = 0;
+       volatile uint64_t y = 0;
        volatile uint64_t *x = ptr;
 
        while (1) {
                pthread_testcancel();
                y += *x;
+
+               /* Prevent the compiler from optimizing out the writes to y: */
+               asm volatile("" : "+r" (y));
        }
 
        return NULL;
index 782ea94..1fba77d 100644 (file)
@@ -7,6 +7,7 @@
 #include <sys/resource.h>
 #include <sys/capability.h>
 #include <sys/mman.h>
+#include <linux/mman.h>
 #include <fcntl.h>
 #include <string.h>
 #include <sys/ipc.h>
index 11b2301..80cddc0 100644 (file)
@@ -50,7 +50,6 @@ static int get_vm_area(unsigned long addr, struct vm_boundaries *area)
                        printf("cannot parse /proc/self/maps\n");
                        goto out;
                }
-               stop = '\0';
 
                sscanf(line, "%lx", &start);
                sscanf(end_addr, "%lx", &end);
index 2a6e76c..8e02991 100644 (file)
@@ -4,14 +4,6 @@
 #include <stdio.h>
 #include <stdlib.h>
 
-#ifndef MLOCK_ONFAULT
-#define MLOCK_ONFAULT 1
-#endif
-
-#ifndef MCL_ONFAULT
-#define MCL_ONFAULT (MCL_FUTURE << 1)
-#endif
-
 static int mlock2_(void *start, size_t len, int flags)
 {
 #ifdef __NR_mlock2
index 37b6d33..dca2104 100644 (file)
@@ -9,18 +9,10 @@
 #include <stdlib.h>
 #include <sys/wait.h>
 #include <unistd.h>
+#include <asm-generic/unistd.h>
 #include "vm_util.h"
-
 #include "../kselftest.h"
 
-#ifndef __NR_pidfd_open
-#define __NR_pidfd_open -1
-#endif
-
-#ifndef __NR_process_mrelease
-#define __NR_process_mrelease -1
-#endif
-
 #define MB(x) (x << 20)
 #define MAX_SIZE_MB 1024
 
index f01dc4a..ca23598 100644 (file)
 
 #include "../kselftest.h"
 
-#ifndef MREMAP_DONTUNMAP
-#define MREMAP_DONTUNMAP 4
-#endif
-
 unsigned long page_size;
 char *page_buffer;
 
index 634d87d..b5888d6 100644 (file)
@@ -6,10 +6,6 @@
 #include <sys/time.h>
 #include <sys/resource.h>
 
-#ifndef MCL_ONFAULT
-#define MCL_ONFAULT (MCL_FUTURE << 1)
-#endif
-
 static int test_limit(void)
 {
        int ret = 1;
index 1ebb586..ae5df26 100644 (file)
@@ -3,9 +3,6 @@
 #ifndef _PKEYS_POWERPC_H
 #define _PKEYS_POWERPC_H
 
-#ifndef SYS_mprotect_key
-# define SYS_mprotect_key      386
-#endif
 #ifndef SYS_pkey_alloc
 # define SYS_pkey_alloc                384
 # define SYS_pkey_free         385
index 72c14cd..814758e 100644 (file)
@@ -5,29 +5,11 @@
 
 #ifdef __i386__
 
-#ifndef SYS_mprotect_key
-# define SYS_mprotect_key      380
-#endif
-
-#ifndef SYS_pkey_alloc
-# define SYS_pkey_alloc                381
-# define SYS_pkey_free         382
-#endif
-
 #define REG_IP_IDX             REG_EIP
 #define si_pkey_offset         0x14
 
 #else
 
-#ifndef SYS_mprotect_key
-# define SYS_mprotect_key      329
-#endif
-
-#ifndef SYS_pkey_alloc
-# define SYS_pkey_alloc                330
-# define SYS_pkey_free         331
-#endif
-
 #define REG_IP_IDX             REG_RIP
 #define si_pkey_offset         0x20
 
@@ -132,7 +114,7 @@ int pkey_reg_xstate_offset(void)
        unsigned int ecx;
        unsigned int edx;
        int xstate_offset;
-       int xstate_size;
+       int xstate_size = 0;
        unsigned long XSTATE_CPUID = 0xd;
        int leaf;
 
index 0381c34..48dc151 100644 (file)
@@ -294,15 +294,6 @@ void pkey_access_deny(int pkey)
        pkey_disable_set(pkey, PKEY_DISABLE_ACCESS);
 }
 
-/* Failed address bound checks: */
-#ifndef SEGV_BNDERR
-# define SEGV_BNDERR           3
-#endif
-
-#ifndef SEGV_PKUERR
-# define SEGV_PKUERR           4
-#endif
-
 static char *si_code_str(int si_code)
 {
        if (si_code == SEGV_MAPERR)
@@ -476,7 +467,7 @@ int sys_mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
                        ptr, size, orig_prot, pkey);
 
        errno = 0;
-       sret = syscall(SYS_mprotect_key, ptr, size, orig_prot, pkey);
+       sret = syscall(__NR_pkey_mprotect, ptr, size, orig_prot, pkey);
        if (errno) {
                dprintf2("SYS_mprotect_key sret: %d\n", sret);
                dprintf2("SYS_mprotect_key prot: 0x%lx\n", orig_prot);
@@ -1684,7 +1675,7 @@ void test_mprotect_pkey_on_unsupported_cpu(int *ptr, u16 pkey)
                return;
        }
 
-       sret = syscall(SYS_mprotect_key, ptr, size, PROT_READ, pkey);
+       sret = syscall(__NR_pkey_mprotect, ptr, size, PROT_READ, pkey);
        pkey_assert(sret < 0);
 }
 
index 4893eb6..3f26f6e 100644 (file)
@@ -24,7 +24,7 @@ separated by spaces:
 - mmap
        tests for mmap(2)
 - gup_test
-       tests for gup using gup_test interface
+       tests for gup
 - userfaultfd
        tests for  userfaultfd(2)
 - compaction
@@ -196,6 +196,8 @@ CATEGORY="gup_test" run_test ./gup_test -a
 # Dump pages 0, 19, and 4096, using pin_user_pages:
 CATEGORY="gup_test" run_test ./gup_test -ct -F 0x1 0 19 0x1000
 
+CATEGORY="gup_test" run_test ./gup_longterm
+
 CATEGORY="userfaultfd" run_test ./uffd-unit-tests
 uffd_stress_bin=./uffd-stress
 CATEGORY="userfaultfd" run_test ${uffd_stress_bin} anon 20 16
@@ -242,18 +244,18 @@ if [ $VADDR64 -ne 0 ]; then
        if [ "$ARCH" == "$ARCH_ARM64" ]; then
                echo 6 > /proc/sys/vm/nr_hugepages
        fi
-       CATEGORY="hugevm" run_test ./va_high_addr_switch.sh
+       CATEGORY="hugevm" run_test bash ./va_high_addr_switch.sh
        if [ "$ARCH" == "$ARCH_ARM64" ]; then
                echo $prev_nr_hugepages > /proc/sys/vm/nr_hugepages
        fi
 fi # VADDR64
 
 # vmalloc stability smoke test
-CATEGORY="vmalloc" run_test ./test_vmalloc.sh smoke
+CATEGORY="vmalloc" run_test bash ./test_vmalloc.sh smoke
 
 CATEGORY="mremap" run_test ./mremap_dontunmap
 
-CATEGORY="hmm" run_test ./test_hmm.sh smoke
+CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
 
 # MADV_POPULATE_READ and MADV_POPULATE_WRITE tests
 CATEGORY="madv_populate" run_test ./madv_populate
index 61c6250..ba20d75 100644 (file)
@@ -616,3 +616,62 @@ int copy_page(int ufd, unsigned long offset, bool wp)
 {
        return __copy_page(ufd, offset, false, wp);
 }
+
+int uffd_open_dev(unsigned int flags)
+{
+       int fd, uffd;
+
+       fd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
+       if (fd < 0)
+               return fd;
+       uffd = ioctl(fd, USERFAULTFD_IOC_NEW, flags);
+       close(fd);
+
+       return uffd;
+}
+
+int uffd_open_sys(unsigned int flags)
+{
+#ifdef __NR_userfaultfd
+       return syscall(__NR_userfaultfd, flags);
+#else
+       return -1;
+#endif
+}
+
+int uffd_open(unsigned int flags)
+{
+       int uffd = uffd_open_sys(flags);
+
+       if (uffd < 0)
+               uffd = uffd_open_dev(flags);
+
+       return uffd;
+}
+
+int uffd_get_features(uint64_t *features)
+{
+       struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
+       /*
+        * This should by default work in most kernels; the feature list
+        * will be the same no matter what we pass in here.
+        */
+       int fd = uffd_open(UFFD_USER_MODE_ONLY);
+
+       if (fd < 0)
+               /* Maybe the kernel is older than user-only mode? */
+               fd = uffd_open(0);
+
+       if (fd < 0)
+               return fd;
+
+       if (ioctl(fd, UFFDIO_API, &uffdio_api)) {
+               close(fd);
+               return -errno;
+       }
+
+       *features = uffdio_api.features;
+       close(fd);
+
+       return 0;
+}
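
The uffd_open_dev(), uffd_open_sys(), uffd_open() and uffd_get_features() helpers now live in uffd-common.c so both userfaultfd test programs can share them. A small usage sketch (illustrative only, assuming the uffd-common.h declarations added below): probing for a feature before deciding whether a test case can run.

#include <stdint.h>
#include <linux/userfaultfd.h>
#include "uffd-common.h"

/* Returns 1 if minor faults on shmem are supported, 0 otherwise. */
static int have_minor_shmem(void)
{
	uint64_t features;

	if (uffd_get_features(&features))
		return 0;	/* no usable userfaultfd (syscall or /dev/userfaultfd) */
	return !!(features & UFFD_FEATURE_MINOR_SHMEM);
}
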
index 6068f23..197f526 100644 (file)
@@ -110,6 +110,11 @@ int __copy_page(int ufd, unsigned long offset, bool retry, bool wp);
 int copy_page(int ufd, unsigned long offset, bool wp);
 void *uffd_poll_thread(void *arg);
 
+int uffd_open_dev(unsigned int flags);
+int uffd_open_sys(unsigned int flags);
+int uffd_open(unsigned int flags);
+int uffd_get_features(uint64_t *features);
+
 #define TEST_ANON      1
 #define TEST_HUGETLB   2
 #define TEST_SHMEM     3
index f1ad9ee..995ff13 100644 (file)
@@ -88,16 +88,6 @@ static void uffd_stats_reset(struct uffd_args *args, unsigned long n_cpus)
        }
 }
 
-static inline uint64_t uffd_minor_feature(void)
-{
-       if (test_type == TEST_HUGETLB && map_shared)
-               return UFFD_FEATURE_MINOR_HUGETLBFS;
-       else if (test_type == TEST_SHMEM)
-               return UFFD_FEATURE_MINOR_SHMEM;
-       else
-               return 0;
-}
-
 static void *locking_thread(void *arg)
 {
        unsigned long cpu = (unsigned long) arg;
index 269c867..04d91f1 100644 (file)
@@ -109,12 +109,11 @@ static void uffd_test_pass(void)
                ksft_inc_fail_cnt();            \
        } while (0)
 
-#define  uffd_test_skip(...)  do {             \
-               printf("skipped [reason: ");    \
-               printf(__VA_ARGS__);            \
-               printf("]\n");                  \
-               ksft_inc_xskip_cnt();           \
-       } while (0)
+static void uffd_test_skip(const char *message)
+{
+       printf("skipped [reason: %s]\n", message);
+       ksft_inc_xskip_cnt();
+}
 
 /*
  * Returns 1 if specific userfaultfd supported, 0 otherwise.  Note, we'll
@@ -1149,7 +1148,6 @@ int main(int argc, char *argv[])
        uffd_test_case_t *test;
        mem_type_t *mem_type;
        uffd_test_args_t args;
-       char test_name[128];
        const char *errmsg;
        int has_uffd, opt;
        int i, j;
@@ -1192,10 +1190,8 @@ int main(int argc, char *argv[])
                        mem_type = &mem_types[j];
                        if (!(test->mem_targets & mem_type->mem_flag))
                                continue;
-                       snprintf(test_name, sizeof(test_name),
-                                "%s on %s", test->name, mem_type->name);
 
-                       uffd_test_start(test_name);
+                       uffd_test_start("%s on %s", test->name, mem_type->name);
                        if (!uffd_feature_supported(test)) {
                                uffd_test_skip("feature missing");
                                continue;
index 9b06a50..558c9cd 100644 (file)
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <string.h>
 #include <fcntl.h>
+#include <dirent.h>
 #include <sys/ioctl.h>
 #include <linux/userfaultfd.h>
 #include <sys/syscall.h>
@@ -198,6 +199,32 @@ unsigned long default_huge_page_size(void)
        return hps;
 }
 
+int detect_hugetlb_page_sizes(size_t sizes[], int max)
+{
+       DIR *dir = opendir("/sys/kernel/mm/hugepages/");
+       int count = 0;
+
+       if (!dir)
+               return 0;
+
+       while (count < max) {
+               struct dirent *entry = readdir(dir);
+               size_t kb;
+
+               if (!entry)
+                       break;
+               if (entry->d_type != DT_DIR)
+                       continue;
+               if (sscanf(entry->d_name, "hugepages-%zukB", &kb) != 1)
+                       continue;
+               sizes[count++] = kb * 1024;
+               ksft_print_msg("[INFO] detected hugetlb page size: %zu KiB\n",
+                              kb);
+       }
+       closedir(dir);
+       return count;
+}
+
 /* If `ioctls' non-NULL, the allowed ioctls will be returned into the var */
 int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len,
                              bool miss, bool wp, bool minor, uint64_t *ioctls)
@@ -242,62 +269,3 @@ int uffd_unregister(int uffd, void *addr, uint64_t len)
 
        return ret;
 }
-
-int uffd_open_dev(unsigned int flags)
-{
-       int fd, uffd;
-
-       fd = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
-       if (fd < 0)
-               return fd;
-       uffd = ioctl(fd, USERFAULTFD_IOC_NEW, flags);
-       close(fd);
-
-       return uffd;
-}
-
-int uffd_open_sys(unsigned int flags)
-{
-#ifdef __NR_userfaultfd
-       return syscall(__NR_userfaultfd, flags);
-#else
-       return -1;
-#endif
-}
-
-int uffd_open(unsigned int flags)
-{
-       int uffd = uffd_open_sys(flags);
-
-       if (uffd < 0)
-               uffd = uffd_open_dev(flags);
-
-       return uffd;
-}
-
-int uffd_get_features(uint64_t *features)
-{
-       struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
-       /*
-        * This should by default work in most kernels; the feature list
-        * will be the same no matter what we pass in here.
-        */
-       int fd = uffd_open(UFFD_USER_MODE_ONLY);
-
-       if (fd < 0)
-               /* Maybe the kernel is older than user-only mode? */
-               fd = uffd_open(0);
-
-       if (fd < 0)
-               return fd;
-
-       if (ioctl(fd, UFFDIO_API, &uffdio_api)) {
-               close(fd);
-               return -errno;
-       }
-
-       *features = uffdio_api.features;
-       close(fd);
-
-       return 0;
-}
index b950bd1..c7fa61f 100644 (file)
@@ -44,14 +44,11 @@ bool check_huge_file(void *addr, int nr_hpages, uint64_t hpage_size);
 bool check_huge_shmem(void *addr, int nr_hpages, uint64_t hpage_size);
 int64_t allocate_transhuge(void *ptr, int pagemap_fd);
 unsigned long default_huge_page_size(void);
+int detect_hugetlb_page_sizes(size_t sizes[], int max);
 
 int uffd_register(int uffd, void *addr, uint64_t len,
                  bool miss, bool wp, bool minor);
 int uffd_unregister(int uffd, void *addr, uint64_t len);
-int uffd_open_dev(unsigned int flags);
-int uffd_open_sys(unsigned int flags);
-int uffd_open(unsigned int flags);
-int uffd_get_features(uint64_t *features);
 int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len,
                              bool miss, bool wp, bool minor, uint64_t *ioctls);
 
index bbce574..1b7b3c8 100644 (file)
@@ -64,7 +64,7 @@ QEMU_ARGS_mips       = -M malta -append "panic=-1 $(TEST:%=NOLIBC_TEST=%)"
 QEMU_ARGS_riscv      = -M virt -append "console=ttyS0 panic=-1 $(TEST:%=NOLIBC_TEST=%)"
 QEMU_ARGS_s390       = -M s390-ccw-virtio -m 1G -append "console=ttyS0 panic=-1 $(TEST:%=NOLIBC_TEST=%)"
 QEMU_ARGS_loongarch  = -M virt -append "console=ttyS0,115200 panic=-1 $(TEST:%=NOLIBC_TEST=%)"
-QEMU_ARGS            = $(QEMU_ARGS_$(ARCH))
+QEMU_ARGS            = $(QEMU_ARGS_$(ARCH)) $(QEMU_ARGS_EXTRA)
 
 # OUTPUT is only set when run from the main makefile, otherwise
 # it defaults to this nolibc directory.
@@ -76,16 +76,12 @@ else
 Q=@
 endif
 
-CFLAGS_STACKPROTECTOR = -DNOLIBC_STACKPROTECTOR \
-                       $(call cc-option,-mstack-protector-guard=global) \
-                       $(call cc-option,-fstack-protector-all)
-CFLAGS_STKP_i386 = $(CFLAGS_STACKPROTECTOR)
-CFLAGS_STKP_x86_64 = $(CFLAGS_STACKPROTECTOR)
-CFLAGS_STKP_x86 = $(CFLAGS_STACKPROTECTOR)
 CFLAGS_s390 = -m64
-CFLAGS  ?= -Os -fno-ident -fno-asynchronous-unwind-tables \
+CFLAGS_mips = -EL
+CFLAGS_STACKPROTECTOR ?= $(call cc-option,-mstack-protector-guard=global $(call cc-option,-fstack-protector-all))
+CFLAGS  ?= -Os -fno-ident -fno-asynchronous-unwind-tables -std=c89 \
                $(call cc-option,-fno-stack-protector) \
-               $(CFLAGS_STKP_$(ARCH)) $(CFLAGS_$(ARCH))
+               $(CFLAGS_$(ARCH)) $(CFLAGS_STACKPROTECTOR)
 LDFLAGS := -s
 
 help:
@@ -94,6 +90,7 @@ help:
        @echo "  help         this help"
        @echo "  sysroot      create the nolibc sysroot here (uses \$$ARCH)"
        @echo "  nolibc-test  build the executable (uses \$$CC and \$$CROSS_COMPILE)"
+       @echo "  libc-test    build an executable using the compiler's default libc instead"
        @echo "  run-user     runs the executable under QEMU (uses \$$ARCH, \$$TEST)"
        @echo "  initramfs    prepare the initramfs with nolibc-test"
        @echo "  defconfig    create a fresh new default config (uses \$$ARCH)"
@@ -128,10 +125,16 @@ nolibc-test: nolibc-test.c sysroot/$(ARCH)/include
        $(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o $@ \
          -nostdlib -static -Isysroot/$(ARCH)/include $< -lgcc
 
+libc-test: nolibc-test.c
+       $(QUIET_CC)$(CC) -o $@ $<
+
 # qemu user-land test
 run-user: nolibc-test
        $(Q)qemu-$(QEMU_ARCH) ./nolibc-test > "$(CURDIR)/run.out" || :
-       $(Q)grep -w FAIL "$(CURDIR)/run.out" && echo "See all results in $(CURDIR)/run.out" || echo "$$(grep -c ^[0-9].*OK $(CURDIR)/run.out) test(s) passed."
+       $(Q)awk '/\[OK\][\r]*$$/{p++} /\[FAIL\][\r]*$$/{f++} /\[SKIPPED\][\r]*$$/{s++} \
+                END{ printf("%d test(s) passed, %d skipped, %d failed.", p, s, f); \
+                if (s+f > 0) printf(" See all results in %s\n", ARGV[1]); else print; }' \
+                $(CURDIR)/run.out
 
 initramfs: nolibc-test
        $(QUIET_MKDIR)mkdir -p initramfs
@@ -147,18 +150,26 @@ kernel: initramfs
 # run the tests after building the kernel
 run: kernel
        $(Q)qemu-system-$(QEMU_ARCH) -display none -no-reboot -kernel "$(srctree)/$(IMAGE)" -serial stdio $(QEMU_ARGS) > "$(CURDIR)/run.out"
-       $(Q)grep -w FAIL "$(CURDIR)/run.out" && echo "See all results in $(CURDIR)/run.out" || echo "$$(grep -c ^[0-9].*OK $(CURDIR)/run.out) test(s) passed."
+       $(Q)awk '/\[OK\][\r]*$$/{p++} /\[FAIL\][\r]*$$/{f++} /\[SKIPPED\][\r]*$$/{s++} \
+                END{ printf("%d test(s) passed, %d skipped, %d failed.", p, s, f); \
+                if (s+f > 0) printf(" See all results in %s\n", ARGV[1]); else print; }' \
+                $(CURDIR)/run.out
 
 # re-run the tests from an existing kernel
 rerun:
        $(Q)qemu-system-$(QEMU_ARCH) -display none -no-reboot -kernel "$(srctree)/$(IMAGE)" -serial stdio $(QEMU_ARGS) > "$(CURDIR)/run.out"
-       $(Q)grep -w FAIL "$(CURDIR)/run.out" && echo "See all results in $(CURDIR)/run.out" || echo "$$(grep -c ^[0-9].*OK $(CURDIR)/run.out) test(s) passed."
+       $(Q)awk '/\[OK\][\r]*$$/{p++} /\[FAIL\][\r]*$$/{f++} /\[SKIPPED\][\r]*$$/{s++} \
+                END{ printf("%d test(s) passed, %d skipped, %d failed.", p, s, f); \
+                if (s+f > 0) printf(" See all results in %s\n", ARGV[1]); else print; }' \
+                $(CURDIR)/run.out
 
 clean:
        $(call QUIET_CLEAN, sysroot)
        $(Q)rm -rf sysroot
        $(call QUIET_CLEAN, nolibc-test)
        $(Q)rm -f nolibc-test
+       $(call QUIET_CLEAN, libc-test)
+       $(Q)rm -f libc-test
        $(call QUIET_CLEAN, initramfs)
        $(Q)rm -rf initramfs
        $(call QUIET_CLEAN, run.out)
index 21bacc9..4863349 100644 (file)
@@ -1,10 +1,7 @@
-// SPDX-License-Identifier: GPL-2.0
+/* SPDX-License-Identifier: GPL-2.0 */
 
 #define _GNU_SOURCE
 
-/* platform-specific include files coming from the compiler */
-#include <limits.h>
-
 /* libc-specific include files
  * The program may be built in 3 ways:
  *   $(CC) -nostdlib -include /path/to/nolibc.h => NOLIBC already defined
@@ -20,7 +17,9 @@
 #include <linux/reboot.h>
 #include <sys/io.h>
 #include <sys/ioctl.h>
+#include <sys/mman.h>
 #include <sys/mount.h>
+#include <sys/prctl.h>
 #include <sys/reboot.h>
 #include <sys/stat.h>
 #include <sys/syscall.h>
 #include <sched.h>
 #include <signal.h>
 #include <stdarg.h>
+#include <stddef.h>
+#include <stdint.h>
 #include <unistd.h>
+#include <limits.h>
 #endif
 #endif
 
@@ -43,8 +45,8 @@ char **environ;
 
 /* definition of a series of tests */
 struct test {
-       const char *name;              // test name
-       int (*func)(int min, int max); // handler
+       const char *name;              /* test name */
+       int (*func)(int min, int max); /* handler */
 };
 
 #ifndef _NOLIBC_STDLIB_H
@@ -103,24 +105,32 @@ const char *errorname(int err)
        CASE_ERR(EDOM);
        CASE_ERR(ERANGE);
        CASE_ERR(ENOSYS);
+       CASE_ERR(EOVERFLOW);
        default:
                return itoa(err);
        }
 }
 
+static void putcharn(char c, size_t n)
+{
+       char buf[64];
+
+       memset(buf, c, n);
+       buf[n] = '\0';
+       fputs(buf, stdout);
+}
+
 static int pad_spc(int llen, int cnt, const char *fmt, ...)
 {
        va_list args;
-       int len;
        int ret;
 
-       for (len = 0; len < cnt - llen; len++)
-               putchar(' ');
+       putcharn(' ', cnt - llen);
 
        va_start(args, fmt);
        ret = vfprintf(stdout, fmt, args);
        va_end(args);
-       return ret < 0 ? ret : ret + len;
+       return ret < 0 ? ret : ret + cnt - llen;
 }
 
/* The tests below are intended to be used by the macros, which evaluate
@@ -162,7 +172,7 @@ static int expect_eq(uint64_t expr, int llen, uint64_t val)
 {
        int ret = !(expr == val);
 
-       llen += printf(" = %lld ", expr);
+       llen += printf(" = %lld ", (long long)expr);
        pad_spc(llen, 64, ret ? "[FAIL]\n" : " [OK]\n");
        return ret;
 }
@@ -290,18 +300,24 @@ static int expect_sysne(int expr, int llen, int val)
 }
 
 
+#define EXPECT_SYSER2(cond, expr, expret, experr1, experr2)            \
+       do { if (!cond) pad_spc(llen, 64, "[SKIPPED]\n"); else ret += expect_syserr2(expr, expret, experr1, experr2, llen); } while (0)
+
 #define EXPECT_SYSER(cond, expr, expret, experr)                       \
-       do { if (!cond) pad_spc(llen, 64, "[SKIPPED]\n"); else ret += expect_syserr(expr, expret, experr, llen); } while (0)
+       EXPECT_SYSER2(cond, expr, expret, experr, 0)
 
-static int expect_syserr(int expr, int expret, int experr, int llen)
+static int expect_syserr2(int expr, int expret, int experr1, int experr2, int llen)
 {
        int ret = 0;
        int _errno = errno;
 
        llen += printf(" = %d %s ", expr, errorname(_errno));
-       if (expr != expret || _errno != experr) {
+       if (expr != expret || (_errno != experr1 && _errno != experr2)) {
                ret = 1;
-               llen += printf(" != (%d %s) ", expret, errorname(experr));
+               if (experr2 == 0)
+                       llen += printf(" != (%d %s) ", expret, errorname(experr1));
+               else
+                       llen += printf(" != (%d %s %s) ", expret, errorname(experr1), errorname(experr2));
                llen += pad_spc(llen, 64, "[FAIL]\n");
        } else {
                llen += pad_spc(llen, 64, " [OK]\n");
@@ -471,11 +487,60 @@ static int test_getpagesize(void)
        return !c;
 }
 
+static int test_fork(void)
+{
+       int status;
+       pid_t pid;
+
+       /* flush the stdio buffers so the child does not flush them again */
+       fflush(stdout);
+       fflush(stderr);
+
+       pid = fork();
+
+       switch (pid) {
+       case -1:
+               return 1;
+
+       case 0:
+               exit(123);
+
+       default:
+               pid = waitpid(pid, &status, 0);
+
+               return pid == -1 || !WIFEXITED(status) || WEXITSTATUS(status) != 123;
+       }
+}
+
+static int test_stat_timestamps(void)
+{
+       struct stat st;
+
+       if (sizeof(st.st_atim.tv_sec) != sizeof(st.st_atime))
+               return 1;
+
+       if (stat("/proc/self/", &st))
+               return 1;
+
+       if (st.st_atim.tv_sec != st.st_atime || st.st_atim.tv_nsec > 1000000000)
+               return 1;
+
+       if (st.st_mtim.tv_sec != st.st_mtime || st.st_mtim.tv_nsec > 1000000000)
+               return 1;
+
+       if (st.st_ctim.tv_sec != st.st_ctime || st.st_ctim.tv_nsec > 1000000000)
+               return 1;
+
+       return 0;
+}
+
 /* Run syscall tests between IDs <min> and <max>.
  * Return 0 on success, non-zero on failure.
  */
 int run_syscall(int min, int max)
 {
+       struct timeval tv;
+       struct timezone tz;
        struct stat stat_buf;
        int euid0;
        int proc;
@@ -491,7 +556,7 @@ int run_syscall(int min, int max)
        euid0 = geteuid() == 0;
 
        for (test = min; test >= 0 && test <= max; test++) {
-               int llen = 0; // line length
+               int llen = 0; /* line length */
 
                /* avoid leaving empty lines below, this will insert holes into
                 * test numbers.
@@ -527,14 +592,11 @@ int run_syscall(int min, int max)
                CASE_TEST(dup3_0);            tmp = dup3(0, 100, 0);  EXPECT_SYSNE(1, tmp, -1); close(tmp); break;
                CASE_TEST(dup3_m1);           tmp = dup3(-1, 100, 0); EXPECT_SYSER(1, tmp, -1, EBADF); if (tmp != -1) close(tmp); break;
                CASE_TEST(execve_root);       EXPECT_SYSER(1, execve("/", (char*[]){ [0] = "/", [1] = NULL }, NULL), -1, EACCES); break;
+               CASE_TEST(fork);              EXPECT_SYSZR(1, test_fork()); break;
                CASE_TEST(getdents64_root);   EXPECT_SYSNE(1, test_getdents64("/"), -1); break;
                CASE_TEST(getdents64_null);   EXPECT_SYSER(1, test_getdents64("/dev/null"), -1, ENOTDIR); break;
-               CASE_TEST(gettimeofday_null); EXPECT_SYSZR(1, gettimeofday(NULL, NULL)); break;
-#ifdef NOLIBC
-               CASE_TEST(gettimeofday_bad1); EXPECT_SYSER(1, gettimeofday((void *)1, NULL), -1, EFAULT); break;
-               CASE_TEST(gettimeofday_bad2); EXPECT_SYSER(1, gettimeofday(NULL, (void *)1), -1, EFAULT); break;
-               CASE_TEST(gettimeofday_bad2); EXPECT_SYSER(1, gettimeofday(NULL, (void *)1), -1, EFAULT); break;
-#endif
+               CASE_TEST(gettimeofday_tv);   EXPECT_SYSZR(1, gettimeofday(&tv, NULL)); break;
+               CASE_TEST(gettimeofday_tv_tz); EXPECT_SYSZR(1, gettimeofday(&tv, &tz)); break;
                CASE_TEST(getpagesize);       EXPECT_SYSZR(1, test_getpagesize()); break;
                CASE_TEST(ioctl_tiocinq);     EXPECT_SYSZR(1, ioctl(0, TIOCINQ, &tmp)); break;
                CASE_TEST(ioctl_tiocinq);     EXPECT_SYSZR(1, ioctl(0, TIOCINQ, &tmp)); break;
@@ -550,6 +612,7 @@ int run_syscall(int min, int max)
                CASE_TEST(poll_null);         EXPECT_SYSZR(1, poll(NULL, 0, 0)); break;
                CASE_TEST(poll_stdout);       EXPECT_SYSNE(1, ({ struct pollfd fds = { 1, POLLOUT, 0}; poll(&fds, 1, 0); }), -1); break;
                CASE_TEST(poll_fault);        EXPECT_SYSER(1, poll((void *)1, 1, 0), -1, EFAULT); break;
+               CASE_TEST(prctl);             EXPECT_SYSER(1, prctl(PR_SET_NAME, (unsigned long)NULL, 0, 0, 0), -1, EFAULT); break;
                CASE_TEST(read_badf);         EXPECT_SYSER(1, read(-1, &tmp, 1), -1, EBADF); break;
                CASE_TEST(sched_yield);       EXPECT_SYSZR(1, sched_yield()); break;
                CASE_TEST(select_null);       EXPECT_SYSZR(1, ({ struct timeval tv = { 0 }; select(0, NULL, NULL, NULL, &tv); })); break;
@@ -557,6 +620,7 @@ int run_syscall(int min, int max)
                CASE_TEST(select_fault);      EXPECT_SYSER(1, select(1, (void *)1, NULL, NULL, 0), -1, EFAULT); break;
                CASE_TEST(stat_blah);         EXPECT_SYSER(1, stat("/proc/self/blah", &stat_buf), -1, ENOENT); break;
                CASE_TEST(stat_fault);        EXPECT_SYSER(1, stat(NULL, &stat_buf), -1, EFAULT); break;
+               CASE_TEST(stat_timestamps);   EXPECT_SYSZR(1, test_stat_timestamps()); break;
                CASE_TEST(symlink_root);      EXPECT_SYSER(1, symlink("/", "/"), -1, EEXIST); break;
                CASE_TEST(unlink_root);       EXPECT_SYSER(1, unlink("/"), -1, EISDIR); break;
                CASE_TEST(unlink_blah);       EXPECT_SYSER(1, unlink("/proc/self/blah"), -1, ENOENT); break;
@@ -565,6 +629,8 @@ int run_syscall(int min, int max)
                CASE_TEST(waitpid_child);     EXPECT_SYSER(1, waitpid(getpid(), &tmp, WNOHANG), -1, ECHILD); break;
                CASE_TEST(write_badf);        EXPECT_SYSER(1, write(-1, &tmp, 1), -1, EBADF); break;
                CASE_TEST(write_zero);        EXPECT_SYSZR(1, write(1, &tmp, 0)); break;
+               CASE_TEST(syscall_noargs);    EXPECT_SYSEQ(1, syscall(__NR_getpid), getpid()); break;
+               CASE_TEST(syscall_args);      EXPECT_SYSER(1, syscall(__NR_statx, 0, NULL, 0, 0, NULL), -1, EFAULT); break;
                case __LINE__:
                        return ret; /* must be last */
                /* note: do not set any defaults so as to permit holes above */
@@ -581,7 +647,7 @@ int run_stdlib(int min, int max)
        void *p1, *p2;
 
        for (test = min; test >= 0 && test <= max; test++) {
-               int llen = 0; // line length
+               int llen = 0; /* line length */
 
                /* avoid leaving empty lines below, this will insert holes into
                 * test numbers.
@@ -639,9 +705,9 @@ int run_stdlib(int min, int max)
                CASE_TEST(limit_int_fast32_min);    EXPECT_EQ(1, INT_FAST32_MIN,   (int_fast32_t)    INTPTR_MIN); break;
                CASE_TEST(limit_int_fast32_max);    EXPECT_EQ(1, INT_FAST32_MAX,   (int_fast32_t)    INTPTR_MAX); break;
                CASE_TEST(limit_uint_fast32_max);   EXPECT_EQ(1, UINT_FAST32_MAX,  (uint_fast32_t)   UINTPTR_MAX); break;
-               CASE_TEST(limit_int_fast64_min);    EXPECT_EQ(1, INT_FAST64_MIN,   (int_fast64_t)    INTPTR_MIN); break;
-               CASE_TEST(limit_int_fast64_max);    EXPECT_EQ(1, INT_FAST64_MAX,   (int_fast64_t)    INTPTR_MAX); break;
-               CASE_TEST(limit_uint_fast64_max);   EXPECT_EQ(1, UINT_FAST64_MAX,  (uint_fast64_t)   UINTPTR_MAX); break;
+               CASE_TEST(limit_int_fast64_min);    EXPECT_EQ(1, INT_FAST64_MIN,   (int_fast64_t)    INT64_MIN); break;
+               CASE_TEST(limit_int_fast64_max);    EXPECT_EQ(1, INT_FAST64_MAX,   (int_fast64_t)    INT64_MAX); break;
+               CASE_TEST(limit_uint_fast64_max);   EXPECT_EQ(1, UINT_FAST64_MAX,  (uint_fast64_t)   UINT64_MAX); break;
 #if __SIZEOF_LONG__ == 8
                CASE_TEST(limit_intptr_min);        EXPECT_EQ(1, INTPTR_MIN,       (intptr_t)        0x8000000000000000LL); break;
                CASE_TEST(limit_intptr_max);        EXPECT_EQ(1, INTPTR_MAX,       (intptr_t)        0x7fffffffffffffffLL); break;
@@ -667,17 +733,98 @@ int run_stdlib(int min, int max)
        return ret;
 }
 
-#if defined(__clang__)
-__attribute__((optnone))
-#elif defined(__GNUC__)
-__attribute__((optimize("O0")))
-#endif
+#define EXPECT_VFPRINTF(c, expected, fmt, ...)                         \
+       ret += expect_vfprintf(llen, c, expected, fmt, ##__VA_ARGS__)
+
+static int expect_vfprintf(int llen, size_t c, const char *expected, const char *fmt, ...)
+{
+       int ret, fd, w, r;
+       char buf[100];
+       FILE *memfile;
+       va_list args;
+
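+       /* render the output into an in-memory file so it can be read back and compared */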
+       fd = memfd_create("vfprintf", 0);
+       if (fd == -1) {
+               pad_spc(llen, 64, "[FAIL]\n");
+               return 1;
+       }
+
+       memfile = fdopen(fd, "w+");
+       if (!memfile) {
+               pad_spc(llen, 64, "[FAIL]\n");
+               return 1;
+       }
+
+       va_start(args, fmt);
+       w = vfprintf(memfile, fmt, args);
+       va_end(args);
+
+       if (w != c) {
+               llen += printf(" written(%d) != %d", w, (int) c);
+               pad_spc(llen, 64, "[FAIL]\n");
+               return 1;
+       }
+
+       fflush(memfile);
+       lseek(fd, 0, SEEK_SET);
+
+       r = read(fd, buf, sizeof(buf) - 1);
+       buf[r] = '\0';
+
+       fclose(memfile);
+
+       if (r != w) {
+               llen += printf(" written(%d) != read(%d)", w, r);
+               pad_spc(llen, 64, "[FAIL]\n");
+               return 1;
+       }
+
+       llen += printf(" \"%s\" = \"%s\"", expected, buf);
+       ret = strncmp(expected, buf, c);
+
+       pad_spc(llen, 64, ret ? "[FAIL]\n" : " [OK]\n");
+       return ret;
+}
+
+static int run_vfprintf(int min, int max)
+{
+       int test;
+       int tmp;
+       int ret = 0;
+       void *p1, *p2;
+
+       for (test = min; test >= 0 && test <= max; test++) {
+               int llen = 0; /* line length */
+
+               /* avoid leaving empty lines below, this will insert holes into
+                * test numbers.
+                */
+               switch (test + __LINE__ + 1) {
+               CASE_TEST(empty);        EXPECT_VFPRINTF(0, "", ""); break;
+               CASE_TEST(simple);       EXPECT_VFPRINTF(3, "foo", "foo"); break;
+               CASE_TEST(string);       EXPECT_VFPRINTF(3, "foo", "%s", "foo"); break;
+               CASE_TEST(number);       EXPECT_VFPRINTF(4, "1234", "%d", 1234); break;
+               CASE_TEST(negnumber);    EXPECT_VFPRINTF(5, "-1234", "%d", -1234); break;
+               CASE_TEST(unsigned);     EXPECT_VFPRINTF(5, "12345", "%u", 12345); break;
+               CASE_TEST(char);         EXPECT_VFPRINTF(1, "c", "%c", 'c'); break;
+               CASE_TEST(hex);          EXPECT_VFPRINTF(1, "f", "%x", 0xf); break;
+               CASE_TEST(pointer);      EXPECT_VFPRINTF(3, "0x1", "%p", (void *) 0x1); break;
+               case __LINE__:
+                       return ret; /* must be last */
+               /* note: do not set any defaults so as to permit holes above */
+               }
+       }
+       return ret;
+}
+
 static int smash_stack(void)
 {
        char buf[100];
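+       /* write through a volatile pointer so the compiler cannot optimize the overflow away */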
+       volatile char *ptr = buf;
+       size_t i;
 
-       for (size_t i = 0; i < 200; i++)
-               buf[i] = 'P';
+       for (i = 0; i < 200; i++)
+               ptr[i] = 'P';
 
        return 1;
 }
@@ -689,12 +836,20 @@ static int run_protection(int min, int max)
 
        llen += printf("0 -fstackprotector ");
 
-#if !defined(NOLIBC_STACKPROTECTOR)
+#if !defined(_NOLIBC_STACKPROTECTOR)
        llen += printf("not supported");
        pad_spc(llen, 64, "[SKIPPED]\n");
        return 0;
 #endif
 
+#if defined(_NOLIBC_STACKPROTECTOR)
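+       /* the canary must have been set to a non-zero value during startup */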
+       if (!__stack_chk_guard) {
+               llen += printf("__stack_chk_guard not initialized");
+               pad_spc(llen, 64, "[FAIL]\n");
+               return 1;
+       }
+#endif
+
        pid = -1;
        pid = fork();
 
@@ -708,6 +863,7 @@ static int run_protection(int min, int max)
                close(STDOUT_FILENO);
                close(STDERR_FILENO);
 
+               prctl(PR_SET_DUMPABLE, 0, 0, 0, 0);
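+               /* the overflow below is deliberate; clearing dumpable above avoids a core dump */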
                smash_stack();
                return 1;
 
@@ -778,6 +934,7 @@ static const struct test test_names[] = {
        /* add new tests here */
        { .name = "syscall",    .func = run_syscall    },
        { .name = "stdlib",     .func = run_stdlib     },
+       { .name = "vfprintf",   .func = run_vfprintf   },
        { .name = "protection", .func = run_protection },
        { 0 }
 };
@@ -785,7 +942,7 @@ static const struct test test_names[] = {
 int main(int argc, char **argv, char **envp)
 {
        int min = 0;
-       int max = __INT_MAX__;
+       int max = INT_MAX;
        int ret = 0;
        int err;
        int idx;
@@ -833,7 +990,7 @@ int main(int argc, char **argv, char **envp)
                                 * here, which defaults to the full range.
                                 */
                                do {
-                                       min = 0; max = __INT_MAX__;
+                                       min = 0; max = INT_MAX;
                                        value = colon;
                                        if (value && *value) {
                                                colon = strchr(value, ':');
@@ -899,7 +1056,7 @@ int main(int argc, char **argv, char **envp)
 #else
                else if (ioperm(0x501, 1, 1) == 0)
 #endif
-                       asm volatile ("outb %%al, %%dx" :: "d"(0x501), "a"(0));
+                       __asm__ volatile ("outb %%al, %%dx" :: "d"(0x501), "a"(0));
                /* if it does nothing, fall back to the regular panic */
 #endif
        }
index 6922d64..88d6830 100644 (file)
@@ -90,7 +90,6 @@ again:
        }
 
        ret = WEXITSTATUS(status);
-       ksft_print_msg("waitpid WEXITSTATUS=%d\n", ret);
        return ret;
 }
 
index 3fd8e90..4e86f92 100644 (file)
@@ -143,6 +143,7 @@ static inline int child_join(struct child *child, struct error *err)
                r = -1;
        }
 
+       ksft_print_msg("waitpid WEXITSTATUS=%d\n", r);
        return r;
 }
 
index e2dd4ed..00a07e7 100644 (file)
@@ -115,7 +115,8 @@ static int test_pidfd_send_signal_exited_fail(void)
 
        pidfd = open(buf, O_DIRECTORY | O_CLOEXEC);
 
-       (void)wait_for_pid(pid);
+       ret = wait_for_pid(pid);
+       ksft_print_msg("waitpid WEXITSTATUS=%d\n", ret);
 
        if (pidfd < 0)
                ksft_exit_fail_msg(
index 26d853c..4275cb2 100644 (file)
@@ -97,7 +97,7 @@ TEST_F(vma, renaming) {
        TH_LOG("Try to pass invalid name (with non-printable character \\1) to rename the VMA");
        EXPECT_EQ(rename_vma((unsigned long)self->ptr_anon, AREA_SIZE, BAD_NAME), -EINVAL);
 
-       TH_LOG("Try to rename non-anonynous VMA");
+       TH_LOG("Try to rename non-anonymous VMA");
        EXPECT_EQ(rename_vma((unsigned long) self->ptr_not_anon, AREA_SIZE, GOOD_NAME), -EINVAL);
 }
 
index b52d506..48b9147 100644 (file)
@@ -250,7 +250,7 @@ identify_qemu_args () {
                echo -machine virt,gic-version=host -cpu host
                ;;
        qemu-system-ppc64)
-               echo -enable-kvm -M pseries -nodefaults
+               echo -M pseries -nodefaults
                echo -device spapr-vscsi
                if test -n "$TORTURE_QEMU_INTERACTIVE" -a -n "$TORTURE_QEMU_MAC"
                then
index f57720c..84f6bb9 100644 (file)
@@ -5,4 +5,4 @@ rcutree.gp_init_delay=3
 rcutree.gp_cleanup_delay=3
 rcutree.kthread_prio=2
 threadirqs
-tree.use_softirq=0
+rcutree.use_softirq=0
index 64f864f..8e50bfd 100644 (file)
@@ -4,4 +4,4 @@ rcutree.gp_init_delay=3
 rcutree.gp_cleanup_delay=3
 rcutree.kthread_prio=2
 threadirqs
-tree.use_softirq=0
+rcutree.use_softirq=0
index 97165a8..9274398 100755 (executable)
@@ -26,6 +26,7 @@ Usage: $0 [OPTIONS]
   -l | --list                  List the available collection:test entries
   -d | --dry-run               Don't actually run any tests
   -h | --help                  Show this usage info
+  -o | --override-timeout      Number of seconds after which we time out
 EOF
        exit $1
 }
@@ -33,6 +34,7 @@ EOF
 COLLECTIONS=""
 TESTS=""
 dryrun=""
+kselftest_override_timeout=""
 while true; do
        case "$1" in
                -s | --summary)
@@ -51,6 +53,9 @@ while true; do
                -d | --dry-run)
                        dryrun="echo"
                        shift ;;
+               -o | --override-timeout)
+                       kselftest_override_timeout="$2"
+                       shift 2 ;;
                -h | --help)
                        usage 0 ;;
                "")
@@ -85,7 +90,7 @@ if [ -n "$TESTS" ]; then
        available="$(echo "$valid" | sed -e 's/ /\n/g')"
 fi
 
-collections=$(echo "$available" | cut -d: -f1 | uniq)
+collections=$(echo "$available" | cut -d: -f1 | sort | uniq)
 for collection in $collections ; do
        [ -w /dev/kmsg ] && echo "kselftest: Running tests in $collection" >> /dev/kmsg
        tests=$(echo "$available" | grep "^$collection:" | cut -d: -f2)
index bfc54b4..444b2be 100755 (executable)
@@ -14,23 +14,27 @@ TEST_FILE=$(mktemp)
 
 # This represents
 #
-# TEST_ID:TEST_COUNT:ENABLED:TARGET
+# TEST_ID:TEST_COUNT:ENABLED:TARGET:SKIP_NO_TARGET
 #
 # TEST_ID: is the test id number
 # TEST_COUNT: number of times we should run the test
 # ENABLED: 1 if enabled, 0 otherwise
 # TARGET: test target file required on the test_sysctl module
+# SKIP_NO_TARGET: 1 skip the test if TARGET is not there
+#                 0 run the test even though TARGET is not there
 #
 # Once these are enabled please leave them as-is. Write your own test,
 # we have tons of space.
-ALL_TESTS="0001:1:1:int_0001"
-ALL_TESTS="$ALL_TESTS 0002:1:1:string_0001"
-ALL_TESTS="$ALL_TESTS 0003:1:1:int_0002"
-ALL_TESTS="$ALL_TESTS 0004:1:1:uint_0001"
-ALL_TESTS="$ALL_TESTS 0005:3:1:int_0003"
-ALL_TESTS="$ALL_TESTS 0006:50:1:bitmap_0001"
-ALL_TESTS="$ALL_TESTS 0007:1:1:boot_int"
-ALL_TESTS="$ALL_TESTS 0008:1:1:match_int"
+ALL_TESTS="0001:1:1:int_0001:1"
+ALL_TESTS="$ALL_TESTS 0002:1:1:string_0001:1"
+ALL_TESTS="$ALL_TESTS 0003:1:1:int_0002:1"
+ALL_TESTS="$ALL_TESTS 0004:1:1:uint_0001:1"
+ALL_TESTS="$ALL_TESTS 0005:3:1:int_0003:1"
+ALL_TESTS="$ALL_TESTS 0006:50:1:bitmap_0001:1"
+ALL_TESTS="$ALL_TESTS 0007:1:1:boot_int:1"
+ALL_TESTS="$ALL_TESTS 0008:1:1:match_int:1"
+ALL_TESTS="$ALL_TESTS 0009:1:1:unregister_error:0"
+ALL_TESTS="$ALL_TESTS 0010:1:1:mnt/mnt_error:0"
 
 function allow_user_defaults()
 {
@@ -613,7 +617,6 @@ target_exists()
        TEST_ID="$2"
 
        if [ ! -f ${TARGET} ] ; then
-               echo "Target for test $TEST_ID: $TARGET not exist, skipping test ..."
                return 0
        fi
        return 1
@@ -730,7 +733,7 @@ sysctl_test_0005()
 
 sysctl_test_0006()
 {
-       TARGET="${SYSCTL}/bitmap_0001"
+       TARGET="${SYSCTL}/$(get_test_target 0006)"
        reset_vals
        ORIG=""
        run_bitmaptest
@@ -738,7 +741,7 @@ sysctl_test_0006()
 
 sysctl_test_0007()
 {
-       TARGET="${SYSCTL}/boot_int"
+       TARGET="${SYSCTL}/$(get_test_target 0007)"
        if [ ! -f $TARGET ]; then
                echo "Skipping test for $TARGET as it is not present ..."
                return $ksft_skip
@@ -778,7 +781,7 @@ sysctl_test_0007()
 
 sysctl_test_0008()
 {
-       TARGET="${SYSCTL}/match_int"
+       TARGET="${SYSCTL}/$(get_test_target 0008)"
        if [ ! -f $TARGET ]; then
                echo "Skipping test for $TARGET as it is not present ..."
                return $ksft_skip
@@ -797,6 +800,34 @@ sysctl_test_0008()
        return 0
 }
 
+sysctl_test_0009()
+{
+       TARGET="${SYSCTL}/$(get_test_target 0009)"
+       echo -n "Testing if $TARGET unregistered correctly ..."
+       if [ -d $TARGET ]; then
+               echo "TEST FAILED"
+               rc=1
+               test_rc
+       fi
+
+       echo "ok"
+       return 0
+}
+
+sysctl_test_0010()
+{
+       TARGET="${SYSCTL}/$(get_test_target 0010)"
+       echo -n "Testing that $TARGET was not created  ..."
+       if [ -d $TARGET ]; then
+               echo "TEST FAILED"
+               rc=1
+               test_rc
+       fi
+
+       echo "ok"
+       return 0
+}
+
 list_tests()
 {
        echo "Test ID list:"
@@ -813,6 +844,8 @@ list_tests()
        echo "0006 x $(get_test_count 0006) - tests proc_do_large_bitmap()"
        echo "0007 x $(get_test_count 0007) - tests setting sysctl from kernel boot param"
        echo "0008 x $(get_test_count 0008) - tests sysctl macro values match"
+       echo "0009 x $(get_test_count 0009) - tests sysct unregister"
+       echo "0010 x $(get_test_count 0010) - tests sysct mount point"
 }
 
 usage()
@@ -857,38 +890,65 @@ function test_num()
                usage
        fi
 }
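+# strip leading zeros (e.g. "0009" -> "9") so the test ID can be used as an awk field index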
+function remove_leading_zeros()
+{
+       echo $1 | sed 's/^0*//'
+}
 
 function get_test_count()
 {
        test_num $1
-       TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$1'}')
+       awk_field=$(remove_leading_zeros $1)
+       TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$awk_field'}')
        echo ${TEST_DATA} | awk -F":" '{print $2}'
 }
 
 function get_test_enabled()
 {
        test_num $1
-       TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$1'}')
+       awk_field=$(remove_leading_zeros $1)
+       TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$awk_field'}')
        echo ${TEST_DATA} | awk -F":" '{print $3}'
 }
 
 function get_test_target()
 {
        test_num $1
-       TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$1'}')
+       awk_field=$(remove_leading_zeros $1)
+       TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$awk_field'}')
        echo ${TEST_DATA} | awk -F":" '{print $4}'
 }
 
+function get_test_skip_no_target()
+{
+       test_num $1
+       awk_field=$(remove_leading_zeros $1)
+       TEST_DATA=$(echo $ALL_TESTS | awk '{print $'$awk_field'}')
+       echo ${TEST_DATA} | awk -F":" '{print $5}'
+}
+
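+# return 0 (skip) when TARGET does not exist and the test's SKIP_NO_TARGET flag is 1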
+function skip_test()
+{
+       TEST_ID=$1
+       TEST_TARGET=$2
+       if target_exists $TEST_TARGET $TEST_ID; then
+               TEST_SKIP=$(get_test_skip_no_target $TEST_ID)
+               if [[ $TEST_SKIP -eq "1" ]]; then
+                       echo "Target for test $TEST_ID: $TEST_TARGET not exist, skipping test ..."
+                       return 0
+               fi
+       fi
+       return 1
+}
+
 function run_all_tests()
 {
        for i in $ALL_TESTS ; do
-               TEST_ID=${i%:*:*:*}
+               TEST_ID=${i%:*:*:*:*}
                ENABLED=$(get_test_enabled $TEST_ID)
                TEST_COUNT=$(get_test_count $TEST_ID)
                TEST_TARGET=$(get_test_target $TEST_ID)
-               if target_exists $TEST_TARGET $TEST_ID; then
-                       continue
-               fi
+
                if [[ $ENABLED -eq "1" ]]; then
                        test_case $TEST_ID $TEST_COUNT $TEST_TARGET
                fi
@@ -923,18 +983,19 @@ function watch_case()
 
 function test_case()
 {
+       TEST_ID=$1
        NUM_TESTS=$2
+       TARGET=$3
 
-       i=0
-
-       if target_exists $3 $1; then
-               continue
+       if skip_test $TEST_ID $TARGET; then
+               return
        fi
 
+       i=0
        while [ $i -lt $NUM_TESTS ]; do
-               test_num $1
-               watch_log $i ${TEST_NAME}_test_$1 noclear
-               RUN_TEST=${TEST_NAME}_test_$1
+               test_num $TEST_ID
+               watch_log $i ${TEST_NAME}_test_${TEST_ID} noclear
+               RUN_TEST=${TEST_NAME}_test_${TEST_ID}
                $RUN_TEST
                let i=$i+1
        done
index 15dcee1..38d46a8 100644 (file)
@@ -84,12 +84,12 @@ static inline int vdso_test_clock(unsigned int clock_id)
 
 int main(int argc, char **argv)
 {
-       int ret;
+       int ret = 0;
 
 #if _POSIX_TIMERS > 0
 
 #ifdef CLOCK_REALTIME
-       ret = vdso_test_clock(CLOCK_REALTIME);
+       ret += vdso_test_clock(CLOCK_REALTIME);
 #endif
 
 #ifdef CLOCK_BOOTTIME
diff --git a/tools/workqueue/wq_monitor.py b/tools/workqueue/wq_monitor.py
new file mode 100644 (file)
index 0000000..6e258d1
--- /dev/null
@@ -0,0 +1,168 @@
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2023 Tejun Heo <tj@kernel.org>
+# Copyright (C) 2023 Meta Platforms, Inc. and affiliates.
+
+desc = """
+This is a drgn script to monitor workqueues. For more info on drgn, visit
+https://github.com/osandov/drgn.
+
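+A typical invocation looks like this (a sketch; the columns printed come from
+WqStats.table_header_str() below and carry the fields described here):
+
+  $ drgn wq_monitor.py -i 5 'kworker|events'
+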
+  total    Total number of work items executed by the workqueue.
+
+  infl     The number of currently in-flight work items.
+
+  CPUtime  Total CPU time consumed by the workqueue in seconds. This is
+           sampled from scheduler ticks and only provides a ballpark
+           measurement. "nohz_full=" CPUs are excluded from measurement.
+
+  CPUitsv  The number of times a concurrency-managed work item hogged CPU
+           longer than the threshold (workqueue.cpu_intensive_thresh_us)
+           and got excluded from concurrency management to avoid stalling
+           other work items.
+
+  CMwake   The number of concurrency-management wake-ups while executing a
+           work item of the workqueue.
+
+  mayday   The number of times the rescuer was requested while waiting for
+           new worker creation.
+
+  rescued  The number of work items executed by the rescuer.
+"""
+
+import sys
+import signal
+import os
+import re
+import time
+import json
+
+import drgn
+from drgn.helpers.linux.list import list_for_each_entry, list_empty
+from drgn.helpers.linux.cpumask import for_each_possible_cpu
+
+import argparse
+parser = argparse.ArgumentParser(description=desc,
+                                 formatter_class=argparse.RawTextHelpFormatter)
+parser.add_argument('workqueue', metavar='REGEX', nargs='*',
+                    help='Target workqueue name patterns (all if empty)')
+parser.add_argument('-i', '--interval', metavar='SECS', type=float, default=1,
+                    help='Monitoring interval (0 to print once and exit)')
+parser.add_argument('-j', '--json', action='store_true',
+                    help='Output in json')
+args = parser.parse_args()
+
+def err(s):
+    print(s, file=sys.stderr, flush=True)
+    sys.exit(1)
+
+workqueues              = prog['workqueues']
+
+WQ_UNBOUND              = prog['WQ_UNBOUND']
+WQ_MEM_RECLAIM          = prog['WQ_MEM_RECLAIM']
+
+PWQ_STAT_STARTED        = prog['PWQ_STAT_STARTED']        # work items started execution
+PWQ_STAT_COMPLETED      = prog['PWQ_STAT_COMPLETED']      # work items completed execution
+PWQ_STAT_CPU_TIME       = prog['PWQ_STAT_CPU_TIME']       # total CPU time consumed
+PWQ_STAT_CPU_INTENSIVE  = prog['PWQ_STAT_CPU_INTENSIVE']  # wq_cpu_intensive_thresh_us violations
+PWQ_STAT_CM_WAKEUP      = prog['PWQ_STAT_CM_WAKEUP']      # concurrency-management worker wakeups
+PWQ_STAT_MAYDAY         = prog['PWQ_STAT_MAYDAY']         # maydays to rescuer
+PWQ_STAT_RESCUED        = prog['PWQ_STAT_RESCUED']        # linked work items executed by rescuer
+PWQ_NR_STATS            = prog['PWQ_NR_STATS']
+
+class WqStats:
+    def __init__(self, wq):
+        self.name = wq.name.string_().decode()
+        self.unbound = wq.flags & WQ_UNBOUND != 0
+        self.mem_reclaim = wq.flags & WQ_MEM_RECLAIM != 0
+        self.stats = [0] * PWQ_NR_STATS
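+        # sum the counters of every pool_workqueue attached to this workqueue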
+        for pwq in list_for_each_entry('struct pool_workqueue', wq.pwqs.address_of_(), 'pwqs_node'):
+            for i in range(PWQ_NR_STATS):
+                self.stats[i] += int(pwq.stats[i])
+
+    def dict(self, now):
+        return { 'timestamp'            : now,
+                 'name'                 : self.name,
+                 'unbound'              : self.unbound,
+                 'mem_reclaim'          : self.mem_reclaim,
+                 'started'              : self.stats[PWQ_STAT_STARTED],
+                 'completed'            : self.stats[PWQ_STAT_COMPLETED],
+                 'cpu_time'             : self.stats[PWQ_STAT_CPU_TIME],
+                 'cpu_intensive'        : self.stats[PWQ_STAT_CPU_INTENSIVE],
+                 'cm_wakeup'            : self.stats[PWQ_STAT_CM_WAKEUP],
+                 'mayday'               : self.stats[PWQ_STAT_MAYDAY],
+                 'rescued'              : self.stats[PWQ_STAT_RESCUED], }
+
+    def table_header_str():
+        return f'{"":>24} {"total":>8} {"infl":>5} {"CPUtime":>8} '\
+            f'{"CPUitsv":>7} {"CMwake":>7} {"mayday":>7} {"rescued":>7}'
+
+    def table_row_str(self):
+        cpu_intensive = '-'
+        cm_wakeup = '-'
+        mayday = '-'
+        rescued = '-'
+
+        if not self.unbound:
+            cpu_intensive = str(self.stats[PWQ_STAT_CPU_INTENSIVE])
+            cm_wakeup = str(self.stats[PWQ_STAT_CM_WAKEUP])
+
+        if self.mem_reclaim:
+            mayday = str(self.stats[PWQ_STAT_MAYDAY])
+            rescued = str(self.stats[PWQ_STAT_RESCUED])
+
+        out = f'{self.name[-24:]:24} ' \
+              f'{self.stats[PWQ_STAT_STARTED]:8} ' \
+              f'{max(self.stats[PWQ_STAT_STARTED] - self.stats[PWQ_STAT_COMPLETED], 0):5} ' \
+              f'{self.stats[PWQ_STAT_CPU_TIME] / 1000000:8.1f} ' \
+              f'{cpu_intensive:>7} ' \
+              f'{cm_wakeup:>7} ' \
+              f'{mayday:>7} ' \
+              f'{rescued:>7} '
+        return out.rstrip(':')
+
+exit_req = False
+
+def sigint_handler(signr, frame):
+    global exit_req
+    exit_req = True
+
+def main():
+    # handle args
+    table_fmt = not args.json
+    interval = args.interval
+
+    re_str = None
+    if args.workqueue:
+        for r in args.workqueue:
+            if re_str is None:
+                re_str = r
+            else:
+                re_str += '|' + r
+
+    filter_re = re.compile(re_str) if re_str else None
+
+    # monitoring loop
+    signal.signal(signal.SIGINT, sigint_handler)
+
+    while not exit_req:
+        now = time.time()
+
+        if table_fmt:
+            print()
+            print(WqStats.table_header_str())
+
+        for wq in list_for_each_entry('struct workqueue_struct', workqueues.address_of_(), 'list'):
+            stats = WqStats(wq)
+            if filter_re and not filter_re.search(stats.name):
+                continue
+            if table_fmt:
+                print(stats.table_row_str())
+            else:
+                print(stats.dict(now))
+
+        if interval == 0:
+            break
+        time.sleep(interval)
+
+if __name__ == "__main__":
+    main()
index 9bfe1d6..e033c79 100644 (file)
@@ -61,8 +61,7 @@ static void async_pf_execute(struct work_struct *work)
         * access remotely.
         */
        mmap_read_lock(mm);
-       get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, NULL,
-                       &locked);
+       get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
        if (locked)
                mmap_read_unlock(mm);
 
index 65f94f5..19f301e 100644 (file)
@@ -2495,7 +2495,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 {
        int rc, flags = FOLL_HWPOISON | FOLL_WRITE;
 
-       rc = get_user_pages(addr, 1, flags, NULL, NULL);
+       rc = get_user_pages(addr, 1, flags, NULL);
        return rc == -EHWPOISON;
 }
 
@@ -2596,6 +2596,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 {
        kvm_pfn_t pfn;
        pte_t *ptep;
+       pte_t pte;
        spinlock_t *ptl;
        int r;
 
@@ -2619,14 +2620,16 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
                        return r;
        }
 
-       if (write_fault && !pte_write(*ptep)) {
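+       /* read the PTE once so the write-permission and pfn checks use one consistent value */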
+       pte = ptep_get(ptep);
+
+       if (write_fault && !pte_write(pte)) {
                pfn = KVM_PFN_ERR_RO_FAULT;
                goto out;
        }
 
        if (writable)
-               *writable = pte_write(*ptep);
-       pfn = pte_pfn(*ptep);
+               *writable = pte_write(pte);
+       pfn = pte_pfn(pte);
 
        /*
         * Get a reference here because callers of *hva_to_pfn* and
@@ -2644,7 +2647,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
         * tail pages of non-compound higher order allocations, which
         * would then underflow the refcount when the caller does the
         * required put_page. Don't allow those pages here.
-        */ 
+        */
        if (!kvm_try_get_pfn(pfn))
                r = -EFAULT;